https://github.com/rumbledb/rumble

⛈️ RumbleDB 2.0.0 "Lemon Ironwood" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • ○ CITATION.cff file
  • ✓ codemeta.json file
    Found codemeta.json file
  • ✓ .zenodo.json file
    Found .zenodo.json file
  • ○ DOI references
  • ○ Academic publication links
  • ✓ Committers with academic emails
    12 of 27 committers (44.4%) from academic institutions
  • ○ Institutional organization owner
  • ○ JOSS paper metadata
  • ○ Scientific vocabulary similarity
    Low similarity (10.1%) to scientific vocabulary

Keywords

avro azure csv data-science dataframes hdfs json jsoniq machine-learning nested parquet query query-engine s3 scale schemaless spark svm text yaml

Keywords from Contributors

distribution projection interactive serializer measurement cycles packaging deep-neural-networks charts network-simulation
Last synced: 6 months ago

Repository

⛈️ RumbleDB 2.0.0 "Lemon Ironwood" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more

Basic Info
  • Host: GitHub
  • Owner: RumbleDB
  • License: other
  • Language: Java
  • Default Branch: master
  • Homepage: http://rumbledb.org/
  • Size: 378 MB
Statistics
  • Stars: 227
  • Watchers: 24
  • Forks: 84
  • Open Issues: 125
  • Releases: 41
Topics
avro azure csv data-science dataframes hdfs json jsoniq machine-learning nested parquet query query-engine s3 scale schemaless spark svm text yaml
Created over 8 years ago · Last pushed 6 months ago
Metadata Files
Readme License Codeowners

README.md

RumbleDB

With RumbleDB, you can easily query many nested, heterogeneous data formats such as JSON, CSV, Parquet, Avro, LibSVM, and text.

RumbleDB exposes a query language rather than a DataFrame API. This gives more flexibility and productivity, and it also matters because a lot of data simply does not fit into DataFrames.

You can query your data in place on any local file system or data lake (Azure Blob Storage, Amazon S3, HDFS, etc.).

You can prepare, clean up, and validate your data, then feed it directly into your machine learning pipelines with RumbleDB ML.

Getting started: you will find a Jupyter notebook that introduces the JSONiq language on top of RumbleDB here. You can also run it locally if you prefer.

The documentation also contains an introduction specific to RumbleDB and how you can read input datasets, but we have not converted it to Jupyter notebooks yet (this will follow).

The documentation of the latest official release is available here.

Contributors (Ghislain Fourny's students at ETH): Stefan Irimescu, Renato Marroquin, Rodrigo Bruno, Falko Noé, Ioana Stefan, Andrea Rinaldi, Stevan Mihajlovic, Mario Arduini, Can Berker Çıkış, Elwin Stephan, David Dao, Zirun Wang, Ingo Müller, Dan-Ovidiu Graur, Thomas Zhou, Olivier Goerens, Alexandru Meterez, Pierre Motard, Remo Röthlisberger, Dominik Bruggisser, David Loughlin, David Buzatu, Marco Schöb, Maciej Byczko, Abishek Ramdas, Matteo Agnoletto, Dwij Dixit.

Owner

  • Name: RumbleDB
  • Login: RumbleDB
  • Kind: organization
  • Location: Zurich, Switzerland

Query your large messy datasets, no matter where they are.

GitHub Events

Total
  • Create event: 46
  • Release event: 1
  • Issues event: 48
  • Watch event: 16
  • Delete event: 45
  • Issue comment event: 47
  • Member event: 4
  • Push event: 219
  • Pull request review comment event: 3
  • Pull request review event: 9
  • Pull request event: 112
  • Fork event: 3
Last Year
  • Create event: 46
  • Release event: 1
  • Issues event: 48
  • Watch event: 16
  • Delete event: 45
  • Issue comment event: 47
  • Member event: 4
  • Push event: 219
  • Pull request review comment event: 3
  • Pull request review event: 9
  • Pull request event: 112
  • Fork event: 3

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 5,734
  • Total Committers: 27
  • Avg Commits per committer: 212.37
  • Development Distribution Score (DDS): 0.539
Past Year
  • Commits: 548
  • Committers: 9
  • Avg Commits per committer: 60.889
  • Development Distribution Score (DDS): 0.6
Top Committers
Name Email Commits
Ghislain Fourny g****y@i****h 2,643
canberker c****k@g****m 830
Ghislain Fourny g****y@i****h 561
dloughlin d****n@s****h 227
Marco Schöb m****b@g****m 219
Falko Noe f****e@g****m 215
Pierre Motard p****d@o****m 182
davidbuzatu-marian d****u@g****m 141
EPMatt 3****t 140
AndreaRinaldi1 a****1@o****m 139
Dominik Bruggisser d****r@s****h 120
mbyczko m****o@s****h 86
istefan i****n@s****h 80
Ioana Stefan i****n@s****h 51
wscsprint3r s****u@y****m 27
mschoeb m****b@i****h 20
mstevan m****n@s****h 15
Pierre Motard p****d@s****h 12
David-C-L d****8@g****m 10
Can Cikis c****n@a****h 6
Ghislain Fourny g****n@s****h 3
dependabot[bot] 4****] 2
lulunac27a n****s@g****m 1
dwddao c****o@g****m 1
Thad Guidry t****y@g****m 1
Ingo Mueller i****r@i****h 1
Dario Ackermann d****k@g****m 1
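The summary figures above can be reproduced from the commit counts in the table. The sketch below assumes that the Development Distribution Score is one minus the top committer's share of all commits; the page does not state the formula, but this assumption reproduces the all-time value of 0.539. The function names are illustrative, not part of any tool.

```python
def average_commits(total_commits: int, committers: int) -> float:
    """Mean number of commits per committer."""
    return total_commits / committers


def development_distribution_score(top_committer_commits: int, total_commits: int) -> float:
    """Assumed definition: 1 minus the top committer's share of all commits."""
    return 1 - top_committer_commits / total_commits


# All-time figures from the page: 5,734 commits by 27 committers,
# with the top entry in the table holding 2,643 commits.
all_time_avg = average_commits(5734, 27)
all_time_dds = development_distribution_score(2643, 5734)

# Past-year figures from the page: 548 commits by 9 committers.
past_year_avg = average_commits(548, 9)

print(f"{all_time_avg:.2f}")   # 212.37
print(f"{all_time_dds:.3f}")   # 0.539
print(f"{past_year_avg:.3f}")  # 60.889
```

The past-year score of 0.6 cannot be checked the same way, because the page does not list per-committer counts for the past year.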

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 66
  • Total pull requests: 121
  • Average time to close issues: almost 2 years
  • Average time to close pull requests: 3 months
  • Total issue authors: 20
  • Total pull request authors: 9
  • Average comments per issue: 1.36
  • Average comments per pull request: 0.08
  • Merged pull requests: 75
  • Bot issues: 0
  • Bot pull requests: 2
Past Year
  • Issues: 20
  • Pull requests: 52
  • Average time to close issues: 7 days
  • Average time to close pull requests: 15 days
  • Issue authors: 3
  • Pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.02
  • Merged pull requests: 26
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ghislainfourny (15)
  • mschoeb (14)
  • mstevan (10)
  • ingomueller-net (9)
  • mario-arduini (8)
  • CanBerker (4)
  • wzrain (2)
  • CVEDetect (1)
  • bhornerETHZ (1)
  • salv0 (1)
  • lthiet (1)
  • zpgeng (1)
  • losmi83 (1)
  • satyanvm (1)
  • fxgst (1)
Pull Request Authors
  • ghislainfourny (117)
  • dependabot[bot] (6)
  • Byczax (5)
  • mschoeb (4)
  • David-C-L (3)
  • thomastzhou (2)
  • DavidBuzatu-Marian (2)
  • MSthe00 (2)
  • thadguidry (1)
  • CVEDetect (1)
  • darioackermann (1)
  • CanBerker (1)
  • HelenParr (1)
  • lulunac27a (1)
  • alexandrumeterez (1)
Top Labels
Issue Labels
Bug (32), Fix released (27), Enhancement (9), Development (4), Fix committed (3), Refactoring (3), Question (2), On hold (2), In Progress (2), Under discussion (1), Approved (1), Language feature (1), RumbleML (1), Function library (1)
Pull Request Labels
dependencies (6), Release flavor (5), Enhancement (5), Approved (3), On hold (2), Under discussion (1), !Do not merge! (1)

Packages

  • Total packages: 3
  • Total downloads: unknown
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 0
    (may contain duplicates)
  • Total versions: 7
repo1.maven.org: com.github.rumbledb:rumble

A JSONiq engine to query large-scale JSON datasets stored on HDFS. Spark under the hood.

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Forks count: 14.2%
Stargazers count: 16.5%
Average: 27.9%
Dependent repos count: 32.0%
Dependent packages count: 48.9%
Last synced: 6 months ago
repo1.maven.org: com.github.rumbledb:spark-rumble

A JSONiq engine to query large-scale JSON datasets stored on HDFS. Spark under the hood.

  • Versions: 5
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Forks count: 14.2%
Stargazers count: 16.5%
Average: 27.9%
Dependent repos count: 32.0%
Dependent packages count: 48.9%
Last synced: 7 months ago
repo1.maven.org: com.github.rumbledb:rumbledb

A JSONiq engine to query large-scale JSON datasets stored on HDFS. Spark under the hood.

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 32.3%
Average: 39.3%
Dependent packages count: 46.2%
Last synced: 6 months ago
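The "Average" ranking of each package appears to be the arithmetic mean of the individual ranking percentages listed with it. This is an assumption inferred from the numbers, not something the page states. A minimal check, with a hypothetical helper name:

```python
def mean_ranking(percentages: list[float]) -> float:
    """Arithmetic mean of a package's ranking percentages (assumed definition)."""
    return sum(percentages) / len(percentages)


# rumble and spark-rumble: forks, stargazers, dependent repos, dependent packages.
four_metric_avg = mean_ranking([14.2, 16.5, 32.0, 48.9])

# rumbledb: only dependent repos and dependent packages are listed.
two_metric_avg = mean_ranking([32.3, 46.2])

print(f"{four_metric_avg:.1f}")  # 27.9, matching the page
print(two_metric_avg)            # about 39.25, which the page shows as 39.3
```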

Dependencies

pom.xml maven
  • org.apache.hadoop:hadoop-aws 3.2.1 provided
  • org.apache.spark:spark-core_2.12 3.1.3 provided
  • org.apache.spark:spark-mllib_2.12 3.1.3 provided
  • org.apache.spark:spark-sql_2.12 3.1.3 provided
  • com.esotericsoftware:kryo 4.0.2
  • commons-io:commons-io 2.11.0
  • joda-time:joda-time 2.10.6
  • org.antlr:antlr4-runtime 4.8
  • org.apache.commons:commons-lang3 3.9
  • org.apache.commons:commons-text 1.6
  • org.apache.httpcomponents:httpclient 4.5.13
  • org.apache.spark:spark-avro_2.12 3.1.3
  • org.jgrapht:jgrapht-core 1.4.0
  • org.jline:jline 3.11.0
  • junit:junit 4.13.1 test
.github/workflows/maven.yml actions
  • actions/cache v3 composite
  • actions/checkout v3 composite
  • actions/setup-java v3 composite
  • actions/upload-artifact v3 composite