https://github.com/rumbledb/rumble
RumbleDB 2.0.0 "Lemon Ironwood" for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- CITATION.cff file
- codemeta.json file: Found codemeta.json file
- .zenodo.json file: Found .zenodo.json file
- DOI references
- Academic publication links
- Committers with academic emails: 12 of 27 committers (44.4%) from academic institutions
- Institutional organization owner
- JOSS paper metadata
- Scientific vocabulary similarity: Low similarity (10.1%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: RumbleDB
- License: other
- Language: Java
- Default Branch: master
- Homepage: http://rumbledb.org/
- Size: 378 MB
Statistics
- Stars: 227
- Watchers: 24
- Forks: 84
- Open Issues: 125
- Releases: 41
Metadata Files
README.md
RumbleDB
With RumbleDB, you can easily query many nested, heterogeneous data formats such as JSON, CSV, Parquet, Avro, LibSVM, and text.
RumbleDB exposes a query language rather than a DataFrame API, for more flexibility and productivity, and also because a lot of data simply does not fit in DataFrames.
You can query the data in place from any local file system or data lake (Azure Blob Storage, Amazon S3, HDFS, etc.).
You can prepare, clean up, and validate your data and feed it directly into your machine learning pipelines with RumbleDB ML.
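To make "heterogeneous" concrete, here is a minimal sketch of the kind of data the description refers to. It is plain Python using only the standard library, not RumbleDB or JSONiq, and the sample records and the helper names `as_list` and `names_with_tag` are hypothetical, invented for illustration. It assumes JSON Lines input in which the same field changes shape from record to record.

```python
import json

# Hypothetical sample data: three JSON Lines records whose shapes disagree.
# "tags" is an array, a bare string, or missing; "address" is nested or absent.
SAMPLE_LINES = """\
{"name": "a", "tags": ["x", "y"], "address": {"city": "Zurich"}}
{"name": "b", "tags": "x"}
{"name": "c", "address": {"city": "Basel", "zip": 4001}}
"""


def as_list(value):
    """Normalize a missing value, a scalar, or a list to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def names_with_tag(lines, tag):
    """Return the names of records that carry the given tag, in input order."""
    records = (json.loads(line) for line in lines.splitlines() if line.strip())
    return [r["name"] for r in records if tag in as_list(r.get("tags"))]


print(names_with_tag(SAMPLE_LINES, "x"))
```

A fixed-schema table would typically have to settle on a single type for `tags` before loading these records, whereas a query over nested items can take each record as it comes.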
Getting started: you will find a Jupyter notebook that introduces the JSONiq language on top of RumbleDB here. You can also run it locally if you prefer.
The documentation also contains an introduction specific to RumbleDB and how you can read input datasets, but we have not converted it to Jupyter notebooks yet (this will follow).
The documentation of the latest official release is available here.
Contributors (Ghislain Fourny's students at ETH): Stefan Irimescu, Renato Marroquin, Rodrigo Bruno, Falko Noé, Ioana Stefan, Andrea Rinaldi, Stevan Mihajlovic, Mario Arduini, Can Berker Çıkış, Elwin Stephan, David Dao, Zirun Wang, Ingo Müller, Dan-Ovidiu Graur, Thomas Zhou, Olivier Goerens, Alexandru Meterez, Pierre Motard, Remo Röthlisberger, Dominik Bruggisser, David Loughlin, David Buzatu, Marco Schöb, Maciej Byczko, Abishek Ramdas, Matteo Agnoletto, Dwij Dixit.
Owner
- Name: RumbleDB
- Login: RumbleDB
- Kind: organization
- Location: Zurich, Switzerland
- Website: http://rumbledb.org/
- Twitter: db_rumble
- Repositories: 13
- Profile: https://github.com/RumbleDB
Query your large messy datasets, no matter where they are.
GitHub Events
Total
- Create event: 46
- Release event: 1
- Issues event: 48
- Watch event: 16
- Delete event: 45
- Issue comment event: 47
- Member event: 4
- Push event: 219
- Pull request review comment event: 3
- Pull request review event: 9
- Pull request event: 112
- Fork event: 3
Last Year
- Create event: 46
- Release event: 1
- Issues event: 48
- Watch event: 16
- Delete event: 45
- Issue comment event: 47
- Member event: 4
- Push event: 219
- Pull request review comment event: 3
- Pull request review event: 9
- Pull request event: 112
- Fork event: 3
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Ghislain Fourny | g****y@i****h | 2,643 |
| canberker | c****k@g****m | 830 |
| Ghislain Fourny | g****y@i****h | 561 |
| dloughlin | d****n@s****h | 227 |
| Marco Schöb | m****b@g****m | 219 |
| Falko Noe | f****e@g****m | 215 |
| Pierre Motard | p****d@o****m | 182 |
| davidbuzatu-marian | d****u@g****m | 141 |
| EPMatt | 3****t | 140 |
| AndreaRinaldi1 | a****1@o****m | 139 |
| Dominik Bruggisser | d****r@s****h | 120 |
| mbyczko | m****o@s****h | 86 |
| istefan | i****n@s****h | 80 |
| Ioana Stefan | i****n@s****h | 51 |
| wscsprint3r | s****u@y****m | 27 |
| mschoeb | m****b@i****h | 20 |
| mstevan | m****n@s****h | 15 |
| Pierre Motard | p****d@s****h | 12 |
| David-C-L | d****8@g****m | 10 |
| Can Cikis | c****n@a****h | 6 |
| Ghislain Fourny | g****n@s****h | 3 |
| dependabot[bot] | 4****] | 2 |
| lulunac27a | n****s@g****m | 1 |
| dwddao | c****o@g****m | 1 |
| Thad Guidry | t****y@g****m | 1 |
| Ingo Mueller | i****r@i****h | 1 |
| Dario Ackermann | d****k@g****m | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 66
- Total pull requests: 121
- Average time to close issues: almost 2 years
- Average time to close pull requests: 3 months
- Total issue authors: 20
- Total pull request authors: 9
- Average comments per issue: 1.36
- Average comments per pull request: 0.08
- Merged pull requests: 75
- Bot issues: 0
- Bot pull requests: 2
Past Year
- Issues: 20
- Pull requests: 52
- Average time to close issues: 7 days
- Average time to close pull requests: 15 days
- Issue authors: 3
- Pull request authors: 3
- Average comments per issue: 0.0
- Average comments per pull request: 0.02
- Merged pull requests: 26
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- ghislainfourny (15)
- mschoeb (14)
- mstevan (10)
- ingomueller-net (9)
- mario-arduini (8)
- CanBerker (4)
- wzrain (2)
- CVEDetect (1)
- bhornerETHZ (1)
- salv0 (1)
- lthiet (1)
- zpgeng (1)
- losmi83 (1)
- satyanvm (1)
- fxgst (1)
Pull Request Authors
- ghislainfourny (117)
- dependabot[bot] (6)
- Byczax (5)
- mschoeb (4)
- David-C-L (3)
- thomastzhou (2)
- DavidBuzatu-Marian (2)
- MSthe00 (2)
- thadguidry (1)
- CVEDetect (1)
- darioackermann (1)
- CanBerker (1)
- HelenParr (1)
- lulunac27a (1)
- alexandrumeterez (1)
Packages
- Total packages: 3
- Total downloads: unknown
- Total dependent packages: 0 (may contain duplicates)
- Total dependent repositories: 0 (may contain duplicates)
- Total versions: 7
repo1.maven.org: com.github.rumbledb:rumble
A JSONiq engine to query large-scale JSON datasets stored on HDFS. Spark under the hood.
- Homepage: http://rumbledb.org/
- Documentation: https://appdoc.app/artifact/com.github.rumbledb/rumble/
- License: Apache License 2.0
- Latest release: 1.1 (published over 6 years ago)
repo1.maven.org: com.github.rumbledb:spark-rumble
A JSONiq engine to query large-scale JSON datasets stored on HDFS. Spark under the hood.
- Homepage: http://rumbledb.org/
- Documentation: https://appdoc.app/artifact/com.github.rumbledb/spark-rumble/
- License: Apache License 2.0
- Latest release: 1.9.1 (published about 5 years ago)
repo1.maven.org: com.github.rumbledb:rumbledb
A JSONiq engine to query large-scale JSON datasets stored on HDFS. Spark under the hood.
- Homepage: http://rumbledb.org/
- Documentation: https://appdoc.app/artifact/com.github.rumbledb/rumbledb/
- License: Apache License 2.0
- Latest release: 2.0.0 (published 6 months ago)
Dependencies
- org.apache.hadoop:hadoop-aws 3.2.1 provided
- org.apache.spark:spark-core_2.12 3.1.3 provided
- org.apache.spark:spark-mllib_2.12 3.1.3 provided
- org.apache.spark:spark-sql_2.12 3.1.3 provided
- com.esotericsoftware:kryo 4.0.2
- commons-io:commons-io 2.11.0
- joda-time:joda-time 2.10.6
- org.antlr:antlr4-runtime 4.8
- org.apache.commons:commons-lang3 3.9
- org.apache.commons:commons-text 1.6
- org.apache.httpcomponents:httpclient 4.5.13
- org.apache.spark:spark-avro_2.12 3.1.3
- org.jgrapht:jgrapht-core 1.4.0
- org.jline:jline 3.11.0
- junit:junit 4.13.1 test
- actions/cache v3 composite
- actions/checkout v3 composite
- actions/setup-java v3 composite
- actions/upload-artifact v3 composite