https://github.com/rumbledb/rumble

⛈️ RumbleDB 2.0.0 "Lemon Ironwood" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
✓ Committers with academic emails: 12 of 27 committers (44.4%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.1%) to scientific vocabulary

Keywords

avro azure csv data-science dataframes hdfs json jsoniq machine-learning nested parquet query query-engine s3 scale schemaless spark svm text yaml

Keywords from Contributors

distribution projection interactive serializer measurement cycles packaging deep-neural-networks charts network-simulation

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: RumbleDB
License: other
Language: Java
Default Branch: master
Homepage: http://rumbledb.org/
Size: 378 MB

Statistics

Stars: 227
Watchers: 24
Forks: 84
Open Issues: 125
Releases: 41

Topics

avro azure csv data-science dataframes hdfs json jsoniq machine-learning nested parquet query query-engine s3 scale schemaless spark svm text yaml

Created almost 9 years ago · Last pushed 11 months ago

Metadata Files

Readme License Codeowners

README.md

RumbleDB

With RumbleDB, you can easily query many nested, heterogeneous data formats such as JSON, CSV, Parquet, Avro, LibSVM, text, and more.

RumbleDB exposes a query language rather than a DataFrame API, for more flexibility and productivity, but also because a lot of data simply will not fit in DataFrames.

You can query your data in place from any local file system or data lake (Azure Blob Storage, Amazon S3, HDFS, etc.).

You can prepare, clean up, and validate your data and feed it directly into your machine learning pipelines with RumbleDB ML.

Getting started: you will find a Jupyter notebook that introduces the JSONiq language on top of RumbleDB here. You can also run it locally if you prefer.

The documentation also contains an introduction specific to RumbleDB and how you can read input datasets, but we have not converted it to Jupyter notebooks yet (this will follow).

The documentation of the latest official release is available here.

Contributors (Ghislain Fourny's students at ETH): Stefan Irimescu, Renato Marroquin, Rodrigo Bruno, Falko Noë, Ioana Stefan, Andrea Rinaldi, Stevan Mihajlovic, Mario Arduini, Can Berker Çıkış, Elwin Stephan, David Dao, Zirun Wang, Ingo Müller, Dan-Ovidiu Graur, Thomas Zhou, Olivier Goerens, Alexandru Meterez, Pierre Motard, Remo Röthlisberger, Dominik Bruggisser, David Loughlin, David Buzatu, Marco Schöb, Maciej Byczko, Abishek Ramdas, Matteo Agnoletto, Dwij Dixit.

Owner

Name: RumbleDB
Login: RumbleDB
Kind: organization
Location: Zurich, Switzerland

Website: http://rumbledb.org/
Twitter: db_rumble
Repositories: 13
Profile: https://github.com/RumbleDB

Query your large messy datasets, no matter where they are.

GitHub Events

Total

Create event: 46
Release event: 1
Issues event: 48
Watch event: 16
Delete event: 45
Issue comment event: 47
Member event: 4
Push event: 219
Pull request review comment event: 3
Pull request review event: 9
Pull request event: 112
Fork event: 3

Last Year

Create event: 46
Release event: 1
Issues event: 48
Watch event: 16
Delete event: 45
Issue comment event: 47
Member event: 4
Push event: 219
Pull request review comment event: 3
Pull request review event: 9
Pull request event: 112
Fork event: 3

Committers

Last synced: 11 months ago

All Time

Total Commits: 5,734
Total Committers: 27
Avg Commits per committer: 212.37
Development Distribution Score (DDS): 0.539

Past Year

Commits: 548
Committers: 9
Avg Commits per committer: 60.889
Development Distribution Score (DDS): 0.6

Top Committers

Name	Email	Commits
Ghislain Fourny	g**y@i**h	2,643
canberker	c**k@g**m	830
Ghislain Fourny	g**y@i**h	561
dloughlin	d**n@s**h	227
Marco Schöb	m**b@g**m	219
Falko Noe	f**e@g**m	215
Pierre Motard	p**d@o**m	182
davidbuzatu-marian	d**u@g**m	141
EPMatt	3****t	140
AndreaRinaldi1	a**1@o**m	139
Dominik Bruggisser	d**r@s**h	120
mbyczko	m**o@s**h	86
istefan	i**n@s**h	80
Ioana Stefan	i**n@s**h	51
wscsprint3r	s**u@y**m	27
mschoeb	m**b@i**h	20
mstevan	m**n@s**h	15
Pierre Motard	p**d@s**h	12
David-C-L	d**8@g**m	10
Can Cikis	c**n@a**h	6
Ghislain Fourny	g**n@s**h	3
dependabot[bot]	4****]	2
lulunac27a	n**s@g**m	1
dwddao	c**o@g**m	1
Thad Guidry	t**y@g**m	1
Ingo Mueller	i**r@i**h	1
Dario Ackermann	d**k@g**m	1
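
The all-time and past-year averages reported under Committers, and the all-time Development Distribution Score, can be reproduced from the figures listed on this page. The sketch below assumes that DDS is defined as one minus the share of commits made by the single most active committer identity; that definition is not stated here, but it matches the reported all-time value when the top entry (2,643 commits) is used. The function names are illustrative only. The past-year DDS is not recomputed because the past-year commit count of the top committer is not listed.

```python
def avg_commits_per_committer(total_commits: int, committers: int) -> float:
    """Mean number of commits per committer identity."""
    return total_commits / committers


def development_distribution_score(top_committer_commits: int, total_commits: int) -> float:
    """Assumed definition: 1 - (commits by the top committer / total commits)."""
    return 1 - top_committer_commits / total_commits


# Figures taken from the Committers section and the Top Committers table.
all_time_avg = avg_commits_per_committer(5734, 27)
past_year_avg = avg_commits_per_committer(548, 9)
all_time_dds = development_distribution_score(2643, 5734)

print(f"All-time average commits per committer: {all_time_avg:.2f}")
print(f"Past-year average commits per committer: {past_year_avg:.3f}")
print(f"All-time DDS (assumed definition): {all_time_dds:.3f}")
```

Under this assumed definition, a score near 0 would mean that one identity authored almost every commit, while a higher score indicates that work is spread across more contributors. Note that the same person can appear under several email identities in the table, which raises the score relative to a per-person count.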

Committer Domains (Top 20 + Academic)

student.ethz.ch: 7
inf.ethz.ch: 4
staff-net-cx-0603.intern.ethz.ch: 1
atfinity.ch: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 66
Total pull requests: 121
Average time to close issues: almost 2 years
Average time to close pull requests: 3 months
Total issue authors: 20
Total pull request authors: 9
Average comments per issue: 1.36
Average comments per pull request: 0.08
Merged pull requests: 75
Bot issues: 0
Bot pull requests: 2

Past Year

Issues: 20
Pull requests: 52
Average time to close issues: 7 days
Average time to close pull requests: 15 days
Issue authors: 3
Pull request authors: 3
Average comments per issue: 0.0
Average comments per pull request: 0.02
Merged pull requests: 26
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

ghislainfourny (15)
mschoeb (14)
mstevan (10)
ingomueller-net (9)
mario-arduini (8)
CanBerker (4)
wzrain (2)
CVEDetect (1)
bhornerETHZ (1)
salv0 (1)
lthiet (1)
zpgeng (1)
losmi83 (1)
satyanvm (1)
fxgst (1)

Pull Request Authors

ghislainfourny (117)
dependabot[bot] (6)
Byczax (5)
mschoeb (4)
David-C-L (3)
thomastzhou (2)
DavidBuzatu-Marian (2)
MSthe00 (2)
thadguidry (1)
CVEDetect (1)
darioackermann (1)
CanBerker (1)
HelenParr (1)
lulunac27a (1)
alexandrumeterez (1)

Top Labels

Issue Labels

Bug (32) Fix released (27) Enhancement (9) Development (4) Fix committed (3) Refactoring (3) Question (2) On hold (2) In Progress (2) Under discussion (1) Approved (1) Language feature (1) RumbleML (1) Function library (1)

Pull Request Labels

dependencies (6) Release flavor (5) Enhancement (5) Approved (3) On hold (2) Under discussion (1) !Do not merge! (1)

Packages

Total packages: 3
Total downloads: unknown

Total dependent packages: 0
(may contain duplicates)
Total dependent repositories: 0
(may contain duplicates)
Total versions: 7

repo1.maven.org: com.github.rumbledb:rumble

A JSONiq engine to query large-scale JSON datasets stored on HDFS. Spark under the hood.

Homepage: http://rumbledb.org/
Documentation: https://appdoc.app/artifact/com.github.rumbledb/rumble/
License: Apache License 2.0
Latest release: 1.1
published almost 7 years ago

Versions: 1
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Forks count: 14.2%

Stargazers count: 16.5%

Average: 27.9%

Dependent repos count: 32.0%

Dependent packages count: 48.9%

Last synced: 10 months ago

repo1.maven.org: com.github.rumbledb:spark-rumble

A JSONiq engine to query large-scale JSON datasets stored on HDFS. Spark under the hood.

Homepage: http://rumbledb.org/
Documentation: https://appdoc.app/artifact/com.github.rumbledb/spark-rumble/
License: Apache License 2.0
Latest release: 1.9.1
published over 5 years ago

Versions: 5
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Forks count: 14.2%

Stargazers count: 16.5%

Average: 27.9%

Dependent repos count: 32.0%

Dependent packages count: 48.9%

Last synced: 11 months ago

repo1.maven.org: com.github.rumbledb:rumbledb

A JSONiq engine to query large-scale JSON datasets stored on HDFS. Spark under the hood.

Homepage: http://rumbledb.org/
Documentation: https://appdoc.app/artifact/com.github.rumbledb/rumbledb/
License: Apache License 2.0
Latest release: 2.0.0
published 10 months ago

Versions: 1
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent repos count: 32.3%

Average: 39.3%

Dependent packages count: 46.2%

Last synced: 10 months ago

Dependencies

pom.xml maven

org.apache.hadoop:hadoop-aws 3.2.1 provided
org.apache.spark:spark-core_2.12 3.1.3 provided
org.apache.spark:spark-mllib_2.12 3.1.3 provided
org.apache.spark:spark-sql_2.12 3.1.3 provided
com.esotericsoftware:kryo 4.0.2
commons-io:commons-io 2.11.0
joda-time:joda-time 2.10.6
org.antlr:antlr4-runtime 4.8
org.apache.commons:commons-lang3 3.9
org.apache.commons:commons-text 1.6
org.apache.httpcomponents:httpclient 4.5.13
org.apache.spark:spark-avro_2.12 3.1.3
org.jgrapht:jgrapht-core 1.4.0
org.jline:jline 3.11.0
junit:junit 4.13.1 test

.github/workflows/maven.yml actions

actions/cache v3 composite
actions/checkout v3 composite
actions/setup-java v3 composite
actions/upload-artifact v3 composite