Recent Releases of https://github.com/rumbledb/rumble

https://github.com/rumbledb/rumble - RumbleDB 1.23.0 "Mountain ash" beta

Use RumbleDB to query data with JSONiq, even data that does not fit in DataFrames.

Try-it-out sandbox: https://colab.research.google.com/github/RumbleDB/rumble/blob/master/RumbleSandbox.ipynb

Instructions to get started: https://rumble.readthedocs.io/en/latest/Getting%20started/

Supported versions

The jars are compatible with Java 11 and 17. As we are increasingly focusing our efforts on the upcoming Spark 4 release and on stability and conformance improvements, and as Spark 4 will be based on Scala 2.13, RumbleDB 1.23 drops support for Spark 3.4 as well as Scala 2.12. Please use RumbleDB 1.22, which is stable, if you use Spark 3.4 or Spark 3.5 with Scala 2.12. Only Spark 3.5 with Scala 2.13 is supported by RumbleDB 1.23. Spark 4 is currently in preview and not yet supported by RumbleDB, because we are waiting for Delta Lake to release a binary compatible with Spark 4 preview 2, but we are currently trying it out in order to support it in future releases.

The standalone jar already contains Spark 3.5 with Scala 2.13 and will thus just work.

General

  • Dropped support for Scala 2.12.
  • Dropped support for Spark 3.4
  • Renamed json-file() to json-lines(), old name can still be used for now but is marked deprecated
  • Added support for single quotes '. Strings with single quotes may contain double quotes ", but single quotes inside need to be escaped using \'. Analogously, strings with double quotes may contain single quotes, but double quotes inside need to be escaped using \".
  • Add support for some popular features of pandas/numpy libraries
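The quoting rule for string literals can be illustrated with a small sketch. The Python function below is not RumbleDB's parser: read_string_literal is a hypothetical helper that only models the rule stated above (either quote character may delimit a string, and the delimiter itself must be backslash-escaped inside it). It deliberately ignores all other escape sequences.

```python
def read_string_literal(source: str) -> str:
    r"""Return the value of a quoted literal, following the rule above.

    Hypothetical helper for illustration only: it understands \' and \"
    (and \\) and leaves every other escape sequence untouched.
    """
    if len(source) < 2 or source[0] not in ("'", '"'):
        raise ValueError("literal must start with a single or double quote")
    quote = source[0]
    chars = []
    i = 1
    while i < len(source):
        ch = source[i]
        if ch == "\\" and i + 1 < len(source) and source[i + 1] in ("'", '"', "\\"):
            chars.append(source[i + 1])  # escaped quote or backslash
            i += 2
        elif ch == quote:
            if i != len(source) - 1:
                raise ValueError("unexpected text after the closing quote")
            return "".join(chars)
        else:
            chars.append(ch)  # the other quote character needs no escaping
            i += 1
    raise ValueError("unterminated string literal")


print(read_string_literal("'say \"hi\"'"))  # say "hi"
print(read_string_literal("'it\\'s'"))      # it's
```

An unescaped single quote inside a single-quoted literal ends the string early, which is why the sketch rejects it.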

JSONiq 3.1

Added option to use JSONiq 3.1 which brings changes to the JSONiq 1.0 spec to align it closer with XQuery 3.1. Enabling the option results in the following changes:

  • Objects and Arrays now have no effective boolean value and throw an error when checked
  • Keys for objects must be quoted
  • atomic is replaced by anyAtomicType
  • Remove JNDY0003 and replace it with XQDY0137
  • Both the JSONiq and XQuery parsers are available. The parser to use can be selected on the command line or with a language declaration in the query file.

Basic XML/XQuery support for both parsers

  • Add doc() function for reading an XML document
  • Add a new xml-files() function that allows for reading and processing of multiple .xml files in parallel
  • Add XPath steps for navigating XML documents. We are able to navigate through 32+ GB of XML data spread over many documents in just a few minutes on an Amazon EMR cluster.
  • Add data() function for atomization of nodes

Experimental XQuery Parser

Updated option to use XQuery parser instead of JSONiq. To use it, just prefix your query with xquery version "3.1";. Note: this is in a very early state and many features are still missing.

  • Context item is "." as opposed to "$$" from JSONiq
  • No JSONiq ObjectLookups with "."
  • No JSONiq ArrayLookup and ArrayUnboxing
  • Support for XQuery Map constructor and curly Array constructor
  • Support for String Lookup on Maps and Integer lookup on arrays with the ? operator

Minor Improvements and Bug fixes

  • subsequence and sequence lookups now use Spark pagination for large positions
  • Rumble shell now keeps history of previous sessions
  • Implements compare() with arities 2 and 3
  • Implements trace() arity 2
  • Implements xs:numeric
  • Adds support for setting base-uri in query and as CLI option
  • Implement FOAR0002, FOAY0001, FOTY0013, FODT0001, FODT0002, XPTY0018, XPTY0019, XQST0032
  • Increase decimal multiplication precision to 18 digits
  • Fixes index lookup with an index >= 1'000'000 throwing an error and incorrect behaviour with non-integer
  • Fixes calling parallelize on an already parallelized structure throwing an error
  • Fixes index lookup with decimal not adhering to spec
  • Fixes unnecessary warning shown when
  • Fixes effective boolean value of NaN and decimals equal to 0
  • Fixes stringToCodepoints() on multibyte ranges
  • Fixes indexof() finding NaN, which it shouldn't
  • Fixes some base64 errors
  • Fixes some edge cases in pow, log10, exp10, atan
  • Fixes resolveUri with empty baseUri
  • Fixes some incorrect exceptions of matches()
  • Fixes sum() with zeroElement not behaving correctly if sequence is non-empty
  • Fixes idiv and imult handling of inf and NaN
  • Fixes inner focus sometimes missing in simpleMap
  • Fixes bug allowing missing commas between function arguments
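The effective boolean value fix for NaN and zero decimals follows the usual XPath rule for a single numeric item: the value is false when the number is zero or NaN, and true otherwise. Below is a minimal Python sketch of that rule only. It is not RumbleDB code, and numeric_ebv is a hypothetical name.

```python
import math
from decimal import Decimal


def numeric_ebv(value) -> bool:
    """Effective boolean value of ONE numeric item (hypothetical helper).

    False for zero (of any numeric type) and for NaN, true otherwise.
    """
    if isinstance(value, bool) or not isinstance(value, (int, float, Decimal)):
        raise TypeError("expected a single numeric item")
    if isinstance(value, float) and math.isnan(value):
        return False
    return value != 0


print(numeric_ebv(float("nan")))    # False
print(numeric_ebv(Decimal("0.00"))) # False
print(numeric_ebv(Decimal("0.1")))  # True
```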

- Java
Published by mschoeb about 1 year ago

https://github.com/rumbledb/rumble - RumbleDB 1.22.0 "Pyrenean oak" beta

Use RumbleDB to query data with JSONiq, even data that does not fit in DataFrames.

Try-it-out sandbox: https://colab.research.google.com/github/RumbleDB/rumble/blob/master/RumbleSandbox.ipynb

Instructions to get started: https://rumble.readthedocs.io/en/latest/Getting%20started/

Spark 3.2 and 3.3 are no longer supported as of RumbleDB 1.22, as they are no longer officially supported by the Spark team. Spark 3.4 and 3.5 are supported (the 3.5 jar will be uploaded shortly; we are still working on making it compatible with Scala 2.13). Spark 4 is currently in preview and not yet supported by RumbleDB, but we are currently trying it out in order to support it in future releases.

RumbleDB comes in 3 jars that you can pick from depending on your needs:

  • rumbledb-1.22.0-standalone.jar contains Spark already and can simply be run "out of the box" with java -jar rumbledb-1.22.0-standalone.jar with Java 8 or 11.
  • rumbledb-1.22.0-for-spark-3.X.jar (3.4, 3.5) is smaller in size, does not contain Spark, and can be run in a corresponding, existing Spark environment either locally (so you need to download and install Spark) or on a cluster (EMR with just a few clicks, etc) with spark-submit rumbledb-1.22.0-for-spark-3.X.jar

Improvements

  • Support for the W3C-standardized copy-modify-return expression as a more convenient way to transform JSON objects and arrays with the update syntax (insertion, deletion, replacement, renaming)
  • Support for the persistence of updates on objects and arrays read from the DeltaLake (with the same update syntax)
  • Support for scripting: variable assignments, while loops, applying updates in the middle of the execution with visible side effects (under snapshot semantics), statements, block statements, continue, break, exit returning.
  • Many performance improvements
  • Many bugfixes

Published by ghislainfourny over 1 year ago

https://github.com/rumbledb/rumble - RumbleDB 1.21.0 "Hawthorn blossom" beta

NEW! The jar for Spark 3.5 was added and is available for download.

Use RumbleDB to query data with JSONiq, even data that does not fit in DataFrames.

Try-it-out sandbox: https://colab.research.google.com/github/RumbleDB/rumble/blob/master/RumbleSandbox.ipynb

Instructions to get started: https://rumble.readthedocs.io/en/latest/Getting%20started/

Spark 3.0 and 3.1 are no longer supported as of RumbleDB 1.21, as they are no longer supported officially by the Spark team. Spark 3.4 is newly supported.

RumbleDB comes in 4 jars that you can pick from depending on your needs:

  • rumbledb-1.21.0-standalone.jar contains Spark already and can simply be run "out of the box" with java -jar rumbledb-1.21.0-standalone.jar with Java 8 or 11.
  • rumbledb-1.21.0-for-spark-3.X.jar (3.2, 3.3, 3.4) is smaller in size, does not contain Spark, and can be run in a corresponding, existing Spark environment either local (so you need to download and install Spark) or on a cluster (EMR with just a few clicks, etc) with spark-submit rumbledb-1.21.0-for-spark-3.X.jar

Improvements

  • Automatically parallelizes range expressions with more than a million items with no need to call parallelize() any more.
  • some simple map expressions on homogeneous input are now faster (native SQL behind the scene).
  • general comparisons on equality are now considerably faster
  • reverse() is now more efficient and faster on homogeneous sequences
  • Fixed bug on equijoin involving homogeneous sequences
  • Add two functions jn:cosh and jn:sinh
  • Automatic optimization of general comparisons to value comparisons when it is detected that the sequences have at most one item (can be deactivated with --optimize-general-comparison-to-value-comparison on)
  • Better static type detection
  • It is now possible to force a sequential execution (without Spark) with --parallel-execution no. This also works with queries containing calls to parallelize() (which will be ineffective), json-doc(), and json-file() (which will simply stream-read from the disk). Other I/O functions (such as csv-file(), etc) will still involve Spark for reading, but immediately materialize for the rest of the execution.
  • It is now possible to deactivate Native Spark SQL execution (forcing a fallback to the use of UDFs by RumbleDB) with --native-execution no.
  • annotate expression (similar syntax to validate expression) allows directly annotating an item without checking for validity.
  • More static types are detected
  • Non-recursive functions are now automatically inlined for faster execution. This can be deactivated with --function-inlining no (reverting to behavior in previous versions)
  • TypeSwitch expressions now support DataFrame execution
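The rewriting of general comparisons into value comparisons relies on the two giving the same answer when both operands hold at most one item. The Python sketch below models sequences as lists to show why. It is an illustration under simplifying assumptions, not RumbleDB's implementation: the function names are hypothetical, type promotion and casting rules are ignored, and the length check happens at run time here, whereas the release note describes a detection made by the engine.

```python
from typing import Optional, Sequence


def general_eq(left: Sequence, right: Sequence) -> bool:
    """General comparison (=): true if ANY pair of items is equal."""
    return any(l == r for l in left for r in right)


def value_eq(left: Sequence, right: Sequence) -> Optional[bool]:
    """Value comparison (eq): each operand holds at most one item."""
    if len(left) > 1 or len(right) > 1:
        raise TypeError("value comparison needs at most one item per operand")
    if not left or not right:
        return None  # models the empty sequence
    return left[0] == right[0]


def optimized_eq(left: Sequence, right: Sequence) -> bool:
    """Use the cheaper value comparison whenever it is guaranteed to agree."""
    if len(left) <= 1 and len(right) <= 1:
        return bool(value_eq(left, right))  # an empty result counts as false
    return general_eq(left, right)


print(general_eq([1, 2, 3], [3, 4]))  # True
print(optimized_eq([5], [5]))         # True
print(optimized_eq([], [5]))          # False
```

The general comparison is existential and therefore quadratic in the worst case, while the value comparison inspects a single pair, which is where the speed-up comes from.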

Bugfixes

  • Fixed bug when reading longs from DataFrames
  • Fixed an issue with projection pushdowns in join queries
  • Fixed a few bugs with queries that navigate JSON in for clauses; they are compiled to native SQL whenever possible, but some chains were throwing errors (e.g., an array unboxing followed by object lookup)
  • Fixed a bug in which calling count() on a grouping variable did not return 1 when native SQL execution is activated
  • hexBinary and base64Binary values can now be used in order by clauses with parallel execution

Published by ghislainfourny about 3 years ago

https://github.com/rumbledb/rumble - RumbleDB 1.20.0 "Honeylocust"

Use RumbleDB to query data with JSONiq, even data that does not fit in DataFrames.

Try-it-out sandbox: https://colab.research.google.com/github/RumbleDB/rumble/blob/master/RumbleSandbox.ipynb

Instructions to get started: https://rumble.readthedocs.io/en/latest/Getting%20started/

Spark 3.0 and 3.1 are no longer supported as of RumbleDB 1.20, as they are no longer supported officially by the Spark team.

RumbleDB comes in 4 jars that you can pick from depending on your needs:

  • rumbledb-1.20.0-standalone.jar contains Spark already and can simply be run "out of the box" with java -jar rumbledb-1.20.0-standalone.jar with Java 8 or 11.
  • rumbledb-1.20.0-for-spark-3.X.jar (3.2, 3.3) is smaller in size, does not contain Spark, and can be run in a corresponding, existing Spark environment either local (so you need to download and install Spark) or on a cluster (EMR with just a few clicks, etc) with spark-submit rumbledb-1.20.0-for-spark-3.X.jar

New features:

  • Open and query YAML files (also with multiple documents) with yaml-doc()
  • Serialize the output of your queries to YAML with --output-format yaml
  • General comparisons (existential quantification on large sequences) now work with very big sequences and are automatically pushed down to Spark.

Bugfixes:

  • Fixed an issue preventing reading Decimal types from Parquet with some precisions and ranges
  • Fixed a few bugs in static typing
  • Fixed a bug that didn't throw an error when using the concatenation operator || on sequences with more than one item

Published by ghislainfourny over 3 years ago

https://github.com/rumbledb/rumble - RumbleDB 1.19.0 "Tipuana Tipu"

RumbleDB allows you to query data that does not fit in DataFrames with JSONiq.

Try-it-out sandbox: https://colab.research.google.com/github/RumbleDB/rumble/blob/master/RumbleSandbox.ipynb

Instructions to get started: https://rumble.readthedocs.io/en/latest/Getting%20started/

RumbleDB comes in 4 jars that you can pick from depending on your needs:

  • rumbledb-1.19.0-standalone.jar contains Spark already and can simply be run "out of the box" with java -jar rumbledb-1.19.0-standalone.jar with Java 8 or 11.
  • rumbledb-1.19.0-for-spark-3.X.jar (3.0, 3.1, 3.2, 3.3) is smaller in size, does not contain Spark, and can be run in a corresponding, existing Spark environment either local (so you need to download and install Spark) or on a cluster (EMR with just a few clicks, etc) with spark-submit rumbledb-1.19.0-for-spark-3.X.jar

Release notes:

  • Fixed the bug with spaces in paths
  • Various fixes and enhancements
  • New functions repartition#2 to change the number of physical partitions, and binary-classification-metrics#3, binary-classification-metrics#4 for preparing ROC curves and PR curves to evaluate the output of ML pipelines.

Published by ghislainfourny about 4 years ago

https://github.com/rumbledb/rumble - RumbleDB 1.18.0 "Scarlet Ixora" beta

Instructions to get started: https://rumble.readthedocs.io/en/latest/Getting%20started/

RumbleDB comes in 4 jars that you can pick from depending on your needs:

  • rumbledb-1.18.0-standalone.jar contains Spark already and can simply be run "out of the box" with java -jar rumbledb-1.18.0-standalone.jar with Java 8 or 11.
  • rumbledb-1.18.0-for-spark-3.X.jar (3.0, 3.1, 3.2) is smaller in size, does not contain Spark, and can be run in a corresponding, existing Spark environment either local (so you need to download and install Spark) or on a cluster (EMR with just a few clicks, etc) with spark-submit rumbledb-1.18.0-for-spark-3.X.jar

Release notes:

  • FLWOR expressions starting with a series of let are now better optimized and faster.
  • A warning with advice is issued in the command window if a group by is used in a FLWOR expression that starts with a let clause.
  • The shell will no longer exit when an error is thrown.
  • When a query cannot be executed in parallel, a more informative error message is output inviting the user to rewrite their query, instead of the raw Spark error.
  • When launching in shell or server mode, instructions are printed on the screen for next steps
  • Fixed crash in the execution of some where clauses when a join was not successfully detected and it falls back to linear execution
  • Support for context item declarations and passing an external context item value on the command line
  • By default, the date type no longer supports timezones (which are rarely used for this type, although supported by ISO 8601). This enables more optimizations (e.g., internal conversion to DataFrame DateType columns and export of datasets with dates to Parquet). Timezones on dates can be activated for those users who need them with a simple CLI argument (--dates-with-timezone yes).
  • Ctrl+C now elegantly exits the shell.

Published by ghislainfourny about 4 years ago

https://github.com/rumbledb/rumble - RumbleDB 1.17.0 "Cacao tree" beta

Instructions to get started: https://rumble.readthedocs.io/en/latest/Getting%20started/

  • The CLI was extended with verbs (run, serve, repl) and single-dash shortcuts (-f for --output-format, etc). This is backward compatible.
  • Automatic internal conversion to DataFrames for FLWOR expressions executed in parallel when the statically inferred type is DataFrame-compatible.
  • Fixed bug that prevented calling a variable $type or looking up a field called "type" without quotes.
  • Fixed bug for projecting a sequence internally stored as a DataFrame to dynamically defined keys.
  • Fix some bugs with post-grouping count optimizations on let variables
  • Support for Spark 2.4, which is no longer maintained by the Spark team, is now dropped, but available on request. RumbleDB 1.17 supports Spark 3.0, 3.1 and 3.2.
  • plenty of smaller bug fixes
  • [Experimental] we also provide a jar that embeds Spark and does not require its installation (rumbledb-1.17.0-standalone.jar). It is for use on a local machine only (not a cluster) and works with java -jar rumbledb-1.17.0-standalone.jar run -q '1+1' rather than with spark-submit. Feedback is welcome! This is just experimental at this point and we will take it from there.

Published by ghislainfourny over 4 years ago

https://github.com/rumbledb/rumble - RumbleDB 1.16.2 "Shagbark Hickory" beta

Interim release.

  • Fix recursive view "input" issue.
  • Nicer message for out of memory errors and hint to use CLI parameters.
  • Reverted to Kryo 4 for Spark 3.2, which depends on Twitter Chill 0.10.0 using this version of Kryo in a way incompatible with Kryo5

Published by ghislainfourny over 4 years ago

https://github.com/rumbledb/rumble - Rumble 1.16.1 "Shagbark Hickory" beta

Interim release.

  • Fixed race condition issue with min() and max() called multiple times that led to possibly incorrect output.
  • The sum() and count() functions are now able to stream locally on very large (non parallelized) sequences.
  • Range expressions now support 64 bit integers as well (before this, an overflow happened)
  • The arrow syntax works for dynamic function calls, too, so in Rumble ML pipelines can also be called with a pipelining syntax: $training-set=>$my-transformer($params)=>my-estimator($params)
  • substring() was fixed to follow standard behavior even with exotic parameters (mostly returning an empty string in these cases)

Published by ghislainfourny over 4 years ago

https://github.com/rumbledb/rumble - RumbleDB 1.16.0 "Shagbark Hickory" beta

  • new --query parameter for directly passing a query rather than a query path.
  • fixed a bug occurring with group by clauses on native DataFrames with complex aggregations
  • new --shell-filter parameters for modifying the way the output is shown in shell mode (e.g. --shell-filter 'jq . -S -C' for pretty printing)
  • new output formats: json (top-level strings will be quoted), tyson and possibility to indent with --output-format-options:indent yes
  • new JSound validator page at localhost:/jsound-validator.html
  • support for user-defined atomic types with JSound verbose syntax
  • fn:concat is now correctly in the fn namespace
  • When the materialization cap is reached and the count is unknown, it is no longer shown as the max long value.

Published by ghislainfourny over 4 years ago

https://github.com/rumbledb/rumble - RumbleDB 1.15.0 "Ivory Palm"

  • Fixed jn:intersect#1 to always be run locally
  • General performance improvements for many expressions and iterators that return at most one item
  • New builtin functions supported: fn:min#2, fn:max#2, fn:unordered#1, fn:distinct-values#2, fn:index-of#3, fn:deep-equal#3, fn:string#0, fn:string#1, fn:substring-before#3, fn:substring-after#3, fn:string-length#0, fn:resolve-uri#1, fn:resolve-uri#2, fn:ends-with#3, fn:starts-with#3, fn:contains#3, fn:normalize-space#0, fn:default-collation#0, fn:number#0, fn:implicit-timezone#0, fn:not#0, fn:static-base-uri#1, fn:dateTime#2, fn:false#0, fn:true#0
  • all JSONiq builtin types are now supported: newly supported are byte, dateTimeStamp, gDay, gMonth, gYear, gYearMonth, gMonthDay, int, long, negativeInteger, nonNegativeInteger, positiveInteger, nonPositiveInteger, unsignedInt, unsignedLong, unsignedByte, unsignedShort, short
  • ceiling, floor, round, abs, round-half-to-even are now correctly in the fn namespace (not math) and all accept numeric values (instead of converting everything to doubles) and a few bugs have been fixed
  • support for open object types via the JSound verbose syntax (they are, of course, not implemented as DataFrames, but this makes no difference at the syntactic level except they cannot be used with ML estimators and transformers)
  • support for user-defined array types via the JSound verbose syntax, including subtypes
  • validation of atomic values is now correctly done by casting the lexical value (not the typed value) to the expected type.
  • Fixed serialization of NaN, double/float infinity, dates, etc (the quotes are now correctly included to make them JSON strings)
  • positive and negative zero (for double, float) now compare as equals in value/general comparison

Note that Spark 2.4.x is no longer maintained. We provide rumbledb-1.15.0-for-spark-2.jar only for legacy purposes for a smooth transition, and recommend instead using Spark 3.0.x or 3.1.x with the rumbledb-1.15.0.jar package.

Published by ghislainfourny almost 5 years ago

https://github.com/rumbledb/rumble - RumbleDB 1.14.0 "Acacia" beta

  • Rumble now outputs error messages displaying the faulty line of code and pointing to the place of error.
  • Machine Learning estimators and models can now run at scale (in parallel) on very large amounts of data. This is automatically detected.
  • Many stability improvements in the Machine Learning library
  • Machine Learning Pipelines are now supported with stages given as function items
  • Static typing is now always done and used to optimize even more
  • Initial (experimental) support for user-defined types with the JSound Compact syntax. Types can be used everywhere builtin types can be used (instance of, treat as, type annotations for variables...).
  • New validate type expression to validate against user-defined types and (if the type is DF-compatible) to create object* instances as optimized dataframes.
  • Features must be assembled with the VectorAssembler transformer prior to being used with an estimator or transformer (for example, at the start of a pipeline). featuresCol and InputCol must specify the name (as a string) of the assembled feature vector field. This is now fully consistent with the Spark ML framework.

Note that Spark 2.4.x is no longer maintained. We provide rumbledb-1.14.0-for-spark-2.jar only for legacy purposes for a smooth transition, and recommend instead using Spark 3.0.x or 3.1.x with the rumbledb-1.14.0.jar package.

Published by ghislainfourny almost 5 years ago

https://github.com/rumbledb/rumble - Rumble 1.12.0 "Ashoka Tree" beta

  • Fixed performance issue when a big for clause follows other small clauses
  • Fixed grouping and ordering of floats
  • Fixed a bug that prevented grouping with keys of incompatible types when hashcodes collided.
  • Experimental (and incomplete) support for XQuery 3.1 syntax (prefix queries with xquery version "3.1"; to activate)
  • project() calls are pushed down if the argument is structured (e.g., coming from parquet-file(), etc).
  • Performance improvements for round() and abs()
  • Variable references ($x) are resolved quicker
  • Support for general function types (including their signature) and type checking (including statically)
  • When iterating on schema-based data (Parquet, Avro, structured-json-file()...) in a FLWOR expression, some let, for, where, group-by and order-by clauses will be automatically faster if they only involve literals, variable references, object/array lookups, and value comparison (native mapping to Spark SQL)
  • Fixed several bugs in switch expressions
  • Switch expressions and conditional expressions can handle/forward structured data faster (underlying DataFrames)

Published by ghislainfourny about 5 years ago

https://github.com/rumbledb/rumble - Rumble 1.11.0 "Banyan Tree" beta

  • experimental support for static typing (--static-typing yes) following the W3C standard.
  • performance improvements in arithmetics, logics, comparison
  • spaces are now supported in paths to json-file()
  • HTTP URLs are now supported by unparsed-text() and unparsed-text-lines()
  • yearMonthDuration, dayTimeDurations, hexBinary, base64Binary can now be compared for inequality in addition to equality
  • performance improvements for comparison
  • the effective boolean value is now correctly taken in quantified expressions
  • quantified expressions now work in parallel as well (they leverage the FLWOR iterators)
  • support for floats
  • sum(), avg() are now pushed down and work on large homogeneous as well as heterogeneous sequences
  • stability improvements and improved conformance for comparison, arithmetics and casts
  • dayTimeDuration and yearMonthDuration can now be compared
  • all constructors are now available (semantics identical to cast as)
  • switch and index-of no longer throw an error for incompatible types, which now follows the standard
  • empty function bodies are now allowed (in which case it is considered to return the empty sequence)
  • variable names $null, $array, $object are now allowed
  • annotate() can now automatically cast whenever it makes sense, and is thus more flexible
  • the Item hierarchy is now flat, with a public Item interface available in the Rumble Java API, and individual classes providing the implementation, which should lead to a small performance boost with lighter method calls.
  • fixed an issue (null pointer exception) when an ordering key is always the empty sequence
  • constant predicate lookups with small numbers (<= materialization cap) are pushed down, e.g., json-file("...")[1]
  • general support at the parser level of any type QName. prefixes like xs: and js: are now accepted but remain optional (e.g., xs:integer, js:null).
  • an error is appropriately thrown if an order by expression evaluates to more than an item or a non-atomic item
  • builtin functions can now be called with fn:, jn: and math: prefixes as well (depending on their namespace). It is still, however, possible to refer to them without prefix, i.e., this is backward compatible.

The main jar is for Spark 3, but there is another jar for Spark 2.

Published by ghislainfourny over 5 years ago

https://github.com/rumbledb/rumble - Rumble 1.10.0 "Buttonwood" beta

  • Fixed navigation issue with structured datasets when objects are nested in arrays.
  • Fixed a bug that prevented calling a user-defined function repeatedly in a FLWOR expression in some cases
  • Any verbose messages are now printed to stderr, no longer stdout for those who want to pipeline the output in bash
  • Bugfixes in unary expressions (an error is now thrown for more than one item, and multiple unary signs, which are allowed by the spec, are handled correctly)
  • Big integers can now be cast from strings
  • string() now returns serialized numbers consistent with JSON output
  • typeswitch now correctly matches the empty sequence type
  • improved stability for user-defined function calls consuming DataFrame parameters. Seamless materialization for ? and 1 arities.
  • max() and min() are now pushed down to Spark and work on big sequences
  • +INF and INF (doubles) are now serialized to strings correctly
  • Fixed the division by 0 on doubles, to correctly produce +INF and -INF, and mod by 0 to produce NaN. idiv raises an error as per the spec.
  • It is now possible to build INF, -INF, and NaN doubles by casting from a string literal.
  • Fixed bug in the object lookup expression leading to a crash when the field to lookup depends on a variable, and the sequence of objects being looked up is partitioned on Spark. Same fix for array lookup expressions.
  • Fixed a crash happening in a FLWOR expression in a group-by clause executed in parallel, when none of the variables before and including this group clause is used anywhere in the remainder of the FLWOR expression.
  • Performance improvements in the processing of items.
  • Performance improvement for distinct-values call on heterogeneous sequences.
  • support for W3C-standard functions unparsed-text, unparsed-text-lines (in parallel) and parse-json (all with arity 1 for now)
  • Fixed a bug occasionally happening with JsonIter streaming by switching to another JSON parser (gson).
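The division-by-zero behavior listed above (doubles produce +INF, -INF or NaN, while idiv raises an error) matches IEEE 754 arithmetic on doubles and the XPath rules for integer division. Python raises an exception for float division by zero, so the sketch below spells the rules out by hand. The function names are hypothetical and this is only a model of the semantics, not RumbleDB code.

```python
import math


def double_div(a: float, b: float) -> float:
    """Division on doubles: a zero divisor yields +INF, -INF or NaN."""
    if b != 0.0:
        return a / b
    if a == 0.0 or math.isnan(a):
        return math.nan  # 0 div 0 (and NaN div 0) is NaN
    # The sign of the result combines the signs of a and of the (signed) zero.
    sign = math.copysign(1.0, a) * math.copysign(1.0, b)
    return math.copysign(math.inf, sign)


def double_mod(a: float, b: float) -> float:
    """Modulo on doubles: a zero divisor (or infinite dividend) yields NaN."""
    if b == 0.0 or math.isinf(a):
        return math.nan
    return math.fmod(a, b)  # result takes the sign of the dividend


def integer_div(a: int, b: int) -> int:
    """Integer division (idiv): a zero divisor is an error, not a value."""
    if b == 0:
        raise ZeroDivisionError("integer division by zero")
    quotient = abs(a) // abs(b)  # truncate toward zero
    return quotient if (a < 0) == (b < 0) else -quotient


print(double_div(1.0, 0.0))   # inf
print(double_div(-1.0, 0.0))  # -inf
print(double_mod(5.0, 0.0))   # nan
```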

Published by ghislainfourny over 5 years ago

https://github.com/rumbledb/rumble - Rumble 1.9.1 "Ficus Bonsai" beta

Interim release with the following fixes and improvements:

  • There is a new CLI parameter --deactivate-jsoniter-streaming to set to yes if there is any error regarding the JsonIter dependency, the library we use to parse JSON (the error in question being "com.jsoniter.spi.JsonException: javassist.CannotCompileException: by java.lang.ClassFormatError: class com.jsoniter.IterImpl cannot access its superclass com.jsoniter.IterImplForStreaming"). This flag deactivates streaming (i.e., avoids dynamic code generation by JsonIter) and avoids the error. This is a known issue with the Rumble docker but it never happened on our own machines. We are actively investigating why the Rumble docker has this issue. If you deactivate JsonIter streaming, though, this makes json-doc() unavailable after using json-file() in the same Rumble application (which is why we activate JsonIter streaming by default).

  • The public Rumble API (also accessible via the Rumble Maven dependency) now allows passing any lists of items as an external variable. You can thus gather the results of a query as a list of items, and put it back as the input of another query in Java as a host language.

Published by ghislainfourny over 5 years ago

https://github.com/rumbledb/rumble - Rumble 1.9.0 "Ficus Bonsai" beta

  • Left-outer equi-joins with let clauses: if you have two large tabular datasets, Rumble can nest one into the other with just a few lines of code, and fast.
  • Inner equi-joins and generic joins with where clauses are detected.
  • Renamed --result-size to --materialization-size to avoid confusion, and added more hints about --output-path for getting the complete output from a parallel query.
  • New CLI options --output-format and output-format-option:* for outputting structured output to other formats than JSON (Parquet, CSV...).
  • New CLI option --number-of-output-partitions to repartition the output as desired
  • New function local-text-file() to read a file as a sequence of string items, but without Spark parallelism (streaming instead). This makes Rumble faster for smaller files
  • Performance improvements for FLWOR queries on structured data (Avro, Parquet, structured JSON, CSV)...
  • Performance improvement for when parallelism is not used at all
  • Stability improvement for json-doc(), which will now also work after json-file() has been used.
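The left-outer equi-join with a let clause nests, for each item of one dataset, the sequence of matching items of the other, which may be empty. The Python sketch below shows that result shape on small in-memory lists. It assumes a hash index on the join key and uses hypothetical names and sample data. RumbleDB itself executes such joins on Spark, so this is only a model of the semantics.

```python
from collections import defaultdict


def nest_left_outer(left, right, left_key, right_key, into):
    """Attach to each left object the list of right objects with the same key."""
    index = defaultdict(list)
    for r in right:  # one pass over the right side builds the index
        index[r[right_key]].append(r)
    return [
        {**l, into: index.get(l[left_key], [])}  # empty list when nothing matches
        for l in left
    ]


# Hypothetical sample data
customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}]
orders = [
    {"cid": 1, "item": "pen"},
    {"cid": 1, "item": "ink"},
    {"cid": 3, "item": "cap"},
]

nested = nest_left_outer(customers, orders, "id", "cid", "orders")
for row in nested:
    print(row["name"], len(row["orders"]))
```

Every customer appears exactly once in the output, including those without orders, which is what distinguishes the left-outer variant from an inner join.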

Published by ghislainfourny over 5 years ago

https://github.com/rumbledb/rumble - Rumble 1.8.1 "Scots Pine" beta

Interim release with small fixes

  • Improve performance of joins whenever possible (quadratic -> linear)
  • fixed a bug with non-exact averages with avg()

Note that Rumble is in beta. Use at your own risk.

Published by ghislainfourny over 5 years ago

https://github.com/rumbledb/rumble - Rumble 1.8.0 "Scots pine"

New features

  • Support for joining two large datasets; automatic detection of joins if a for expression is a predicate expression, and the left-hand side can be evaluated independently of the former clauses. The right-hand-side is the joining criterion. Left outer joins are also supported in parallel (allowing empty).
  • outer joins ("allowing empty" in a for clause) are now supported both locally and in parallel.
  • support for empty sequence order least/greatest prolog setter (for order by clauses)
  • positional variables in for clauses are now supported both locally and in parallel (except for large-scale joins).
  • arbitrarily large integer literals are now supported (an error was thrown before beyond 32 bits)
  • json-file() and json-doc() can both read over HTTP
  • you can store your JSONiq modules on the Web and import them with an HTTP URL
  • you can store your queries on the Web and execute them via the Rumble command line with their URL
  • an error with the appropriate code is now thrown if a collation is specified that is not supported (the W3C standard requires support for at least the Unicode codepoint collation, which Rumble recognizes and supports).
  • It is now possible to specify a hostname in the server mode (--host), and to filter for specific URI prefixes for security reasons (--allowed-uri-prefixes)

Bugfixes

  • big integers are now seamlessly supported: no more overflows, and arbitrarily large integer literals are accepted in JSONiq code
  • fixed display bugs in debug mode (--print-iterator-tree yes)
  • fixed an error with local group-by queries nested inside local FLWORs
  • fixed an error when counting items in a variable that was not a post-grouping variable, in parallelized FLWORs.
  • fixed a bug encountered when a local iteration followed by a parallel for clause produced, and unioned, several Spark jobs internally.

Important: The jar for Spark 3.0.0 does not have Laurelin (ROOT parser) support. We are waiting for a 3.0.0-compatible Laurelin release. If you need to query ROOT files, please use Spark 2.4.6.

- Java
Published by ghislainfourny almost 6 years ago

https://github.com/rumbledb/rumble - Rumble 1.7.0 "Phoenix Atlantica"

New milestone in our feature coverage with the following changes prioritized based on user requests.

New features

  • Rumble is available for Spark 2.4.x as well as for Spark 3.0.0 (pick the right jar). The version for Spark 3.0.0 cannot read ROOT files yet, as we are waiting for the corresponding Laurelin release.
  • library modules are now supported, in order to share and import functions and global variables. Like main modules, library modules can be stored on any file system including S3 or HDFS, which also enables sharing code within the institution (local HDFS system) or even worldwide (S3 or even HTTP).
  • support for the W3C-standard trace function, for outputting intermediate values to the log.
  • support for try-catch expressions to catch and handle dynamic errors
  • support (read-only) for HTTP scheme for reading query files, data, importing modules, etc.

Bugfixes

  • fixed a bug in position semantics in predicate expressions, so that it also works if the position is not a constant.
  • query files are now tested for EOF, and an error is thrown if there are extra characters after the complete JSONiq query.
  • it is now possible to define functions and variables in the local namespace, following the W3C standard
  • [BREAKING CHANGE] relative paths passed to input functions are now resolved correctly in a query if it is read from a file, i.e., according to the absolute query file location. In previous releases, relative paths were resolved against the working directory. If you pass paths via external variables on the command line and (rightfully) expect them to be resolved against the working directory, declare the external variable with an "as anyURI" type annotation so Rumble knows your intent.
  • improvements in error messages when reading from and writing to file systems. Path resolution was also consolidated to provide the same experience everywhere.
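
The breaking change in path resolution can be illustrated with a small Python sketch. It only mimics the described behavior and is not Rumble's implementation; the helper `resolve_input_path` and all paths are hypothetical and chosen for illustration.

```python
from pathlib import PurePosixPath


def resolve_input_path(relative_path: str, base_dir: str) -> str:
    """Resolve a relative input path against a base directory (hypothetical helper)."""
    return str(PurePosixPath(base_dir) / relative_path)


# Hypothetical locations, chosen only for illustration.
query_file = PurePosixPath("/home/alice/queries/report.jq")
working_dir = "/home/alice"

# Since 1.7.0: a path written in the query resolves against the query file's directory.
new_behavior = resolve_input_path("data/input.json", str(query_file.parent))

# Before 1.7.0: the same path resolved against the working directory.
old_behavior = resolve_input_path("data/input.json", working_dir)

print(new_behavior)  # /home/alice/queries/data/input.json
print(old_behavior)  # /home/alice/data/input.json
```

The same query therefore reads a different file after upgrading whenever it is launched from a directory other than the one containing the query file.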

- Java
Published by ghislainfourny almost 6 years ago

https://github.com/rumbledb/rumble - Rumble 1.6.4 "Yucca"

Interim release with bugfixes.

  • Support for DivisionByZero error code (div, mod).
  • Fixed a bug that sometimes led the Rumble shell to keep throwing the same error for subsequent queries
  • More informative error message when a range expression is not supplied with integers
  • Fixed a bug that prevented conditional expressions from being executed in parallel
  • New functions normalize-unicode and encode-for-uri
  • Support for running typeswitch in parallel

- Java
Published by ghislainfourny about 6 years ago

https://github.com/rumbledb/rumble - Rumble 1.6.3 "Yucca"

Interim release to address user requests.

  • More informative error message when the wrong version of Java is used.
  • More informative error messages in Jupyter notebooks for unexpected errors.
  • User-defined functions can now work on parallel input. Rumble automatically detects it.
  • Fixed a bug with local execution of nested order by clauses.

- Java
Published by ghislainfourny about 6 years ago

https://github.com/rumbledb/rumble - Rumble 1.6.2 "Yucca"

Interim release based on user feedback.

Adds a warning message in Jupyter notebooks, Python and other host languages if materialization hits the cap for the final result. Also, in the shell the warning message is now displayed after the results, making it less easy to overlook.

- Java
Published by ghislainfourny about 6 years ago

https://github.com/rumbledb/rumble - Rumble 1.6.1 "Yucca"

Interim release that fixes multiline queries with the Rumble magic in Jupyter notebooks and explicitly lists the Joda time dependency, which some users reported was not included in their environment.

- Java
Published by ghislainfourny about 6 years ago

https://github.com/rumbledb/rumble - Rumble 1.6.0 Yucca

  • the materialization of too many items now throws an error rather than just a warning, to avoid incorrect results
  • a bug was fixed in the closure of inline functions. Now, the variable values in scope where the function is created are correctly taken.
  • new functions: format-date, format-dateTime, format-time, current-date, current-dateTime, current-time, serialize
  • parallelization of existing functions: flatten, intersect, descendant-objects, descendant-arrays, remove-keys, project, insert-before
  • new input format: AVRO, with avro-file() functions
  • global variables are now supported and dependency cycles are identified
  • the shell is more colorful
  • a local FLWOR with a return clause returning big sequences (aka underlying RDD/DF) is no longer materialized.
  • support for => operator to pass the left-hand-side as the first parameter to a function (similar to OO-programming)
  • support for simple map expressions (! operator) also in parallel
  • conditional expressions, switch expressions and comma expressions can now also run in parallel
  • Rumble can run as an HTTP server for integration with any host language
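
The first item above (an error instead of a warning when too many items are materialized) can be pictured with the following Python sketch. It is an illustration of the idea only: the function `materialize`, the exception class and the cap values are hypothetical and do not reflect Rumble's actual classes or error codes.

```python
from itertools import islice
from typing import Iterable, List


class MaterializationCapExceeded(Exception):
    """Hypothetical error type; Rumble's actual error class and code may differ."""


def materialize(items: Iterable[object], cap: int) -> List[object]:
    """Collect at most `cap` items and fail loudly instead of silently truncating."""
    iterator = iter(items)
    collected = list(islice(iterator, cap))
    # If anything is left over, returning `collected` would be an incomplete result.
    sentinel = object()
    if next(iterator, sentinel) is not sentinel:
        raise MaterializationCapExceeded(f"more than {cap} items to materialize")
    return collected


print(materialize(range(3), cap=5))  # [0, 1, 2]

try:
    materialize(range(10), cap=5)
except MaterializationCapExceeded as error:
    print("error:", error)  # error: more than 5 items to materialize
```

Raising an error makes a truncated sequence impossible to mistake for the full result, which is the motivation given in the release note.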

- Java
Published by ghislainfourny about 6 years ago

https://github.com/rumbledb/rumble - Rumble 1.5.1 Southern Live Oak

This release unifies and stabilizes access to all file systems (S3, HDFS, local) for --query-path, --output-path, --log-path as well as the path passed to input functions.

- Java
Published by ghislainfourny about 6 years ago

https://github.com/rumbledb/rumble - Rumble 1.5.0 Southern Live Oak

Various bugfixes and stability improvements.

Support for more inputs (JSON, Parquet, ROOT, CSV, text) and sources (S3, Azure, local, HDFS).

More built-in functions.

More expressions and functions are parallelized.

- Java
Published by ghislainfourny about 6 years ago

https://github.com/rumbledb/rumble - Rumble 1.4.2 Willow Oak

Various bugfixes

  • variable bindings are now available and visible in all nested FLWOR clauses
  • when only a count aggregation is made on a non-grouping variable, use of the count now works on all following clauses (was: some error with DataFrames on Long to [B conversion).
  • errors are now thrown early if an RDD evaluation is to be made within a big FLWOR expression (improvement over uninformative null pointer exception)

- Java
Published by ghislainfourny over 6 years ago

https://github.com/rumbledb/rumble - Rumble 1.4.1 Willow Oak

This is an interim release with bugfixes:

  • invoking a count on a grouping key after a group-by now works in a let clause (was: "java.lang.Long cannot be cast to [B") in addition to return clauses. This will also be fixed in other clauses later on.
  • the count clause is now more stable on large amounts of data (was: null pointer exception)
  • variables passed from outside the FLWOR clause are now visible in where and return clauses (was: variable does not exist). This will also be fixed in other clauses later on.

- Java
Published by ghislainfourny over 6 years ago

https://github.com/rumbledb/rumble - Rumble 1.4 "Willow Oak"

  • more type support: grouping and ordering on durations, dates, times, datetimes and binaries
  • fixed bug in which more than one grouping key value was bound to the grouping key variable when there were equivalent, but not equal grouping keys (like 1 and 1.0)
  • user-defined functions are supported (no type checking just yet)
  • function items are supported (i.e., functions can be manipulated like any other items)
  • support for position() and last() in predicates, which also works in parallel
  • fixed a bug that caused the effective boolean value to be ignored in the parallel execution of where clauses and filters
  • new structured-json-file() function to increase performance on structured JSON Lines files (i.e., using DataFrames under the hood). This is a bootstrap, and actual optimizations will follow.
  • filtering on a specific position by passing (or computing) a number as a predicate now works in parallel
  • in where clauses and predicates executed in parallel, the effective boolean value is correctly taken
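
The grouping-key fix listed above (equivalent keys such as 1 and 1.0 ending up in one group with a single key value) can be sketched in Python. This is an analogy under stated assumptions, not Rumble's implementation: Python's own numeric equality stands in for JSONiq's key comparison, and `group_by_key` is a hypothetical helper.

```python
from typing import Dict, Iterable, List, Tuple, Union

Number = Union[int, float]


def group_by_key(pairs: Iterable[Tuple[Number, str]]) -> Dict[Number, List[str]]:
    """Group values by key, binding exactly one key value per group."""
    groups: Dict[Number, List[str]] = {}
    for key, value in pairs:
        # 1 and 1.0 compare equal (and hash equally) in Python, so they land in
        # the same group; the first key seen stays as the group's single key.
        groups.setdefault(key, []).append(value)
    return groups


result = group_by_key([(1, "a"), (1.0, "b"), (2, "c")])
print(result)  # {1: ['a', 'b'], 2: ['c']}
```

The point of the fix is the shape of the output: one group for 1 and 1.0, with one key value bound to it rather than both.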

The jar is based on Java 8 and is compatible with all more recent Java versions.

- Java
Published by ghislainfourny over 6 years ago

https://github.com/rumbledb/rumble - Rumble 1.3 White Oak

Various bug fixes.

New features:

  • json-doc builtin function to open single JSON files locally even when the object is spread over multiple lines (returns a single item, will not automatically parallelize anything)
  • parquet-file builtin function to open parquet files locally, on HDFS or on S3
  • starting to introduce type support (date, dateTime, time, duration, dayTimeDuration, yearMonthDuration, base64Binary, hexBinary) as well as cast as, castable as, treat as and instance of. Should any bug be found, please let us know.

- Java
Published by ghislainfourny over 6 years ago

https://github.com/rumbledb/rumble - Rumble 1.2 Chestnut Oak

Mostly optimizations to group by aggregations. A few more functions implemented.

- Java
Published by ghislainfourny over 6 years ago

https://github.com/rumbledb/rumble - Rumble 1.1 Arbutus Oak

This is the second beta release of Rumble, a JSONiq engine to query large-scale JSON datasets stored on HDFS. Spark under the hood.

New:

  • bugfixes.
  • more functions
  • FLWOR expressions are now internally mapped to DataFrames and Spark SQL, which brings a 2x performance improvement for grouping and sorting queries.

The jar file was built with ANTLR 4.7 and is compatible with all tested distributions of Spark 2.3+. It is meant to be used with the spark-submit script either as an interactive shell, or to execute a single query from a JSONiq file (local or HDFS) and output the result either on stdout or back to the disk (local or HDFS). This works both locally and with a deployed cluster.

The jar file was compiled with Java 8 and is forward compatible with later Java versions (e.g., Java 11).

The jar file for older versions of Spark (2.0+) with ANTLR 4.5.3 is available on request (if you receive a warning on the command line).

Documentation: http://rumble.readthedocs.io/en/latest/

- Java
Published by ghislainfourny almost 7 years ago

https://github.com/rumbledb/rumble - Rumble 1.0.0 "Linden Oak"

This is the first beta release of Rumble, a JSONiq engine to query large-scale JSON datasets stored on HDFS. Spark under the hood.

New:

  • Bugfixes.
  • Jar auto-displays CLI examples when invoked with no parameters, also with java.
  • distinct-values() is pushed down to Spark
  • Fixes NullPointerException in some cases when exceptions are raised in closures

The jar file was built with ANTLR 4.7 and is compatible with all tested distributions of Spark 2.3+. It is meant to be used with the spark-submit script either as an interactive shell, or to execute a single query from a JSONiq file (local or HDFS) and output the result either on stdout or back to the disk (local or HDFS). This works both locally and with a deployed cluster.

The jar file for older versions of Spark (2.0+) with ANTLR 4.5.3 is available on request (if you receive a warning on the command line).

Documentation: http://rumble.readthedocs.io/en/latest/

- Java
Published by ghislainfourny about 7 years ago

https://github.com/rumbledb/rumble - Sparksoniq 0.9.7 Mahogany

New alpha release for Sparksoniq, a JSONiq engine to query large-scale JSON datasets stored on HDFS. Spark under the hood.

New:

  • Bugfixes.
  • It is now possible to read a query locally (--query-path), and output the results on stdout rather than to the local filesystem.
  • Fix error on non-existing JSONObject keySet() method due to a backward incompatibility of org.json in some environments.

The jar file was built with ANTLR 4.7 and is compatible with all tested distributions of Spark 2.3+. It is meant to be used with the spark-submit script either as an interactive shell, or to execute a single query from a JSONiq file (local or HDFS) and output the result either on stdout or back to the disk (local or HDFS). This works both locally and with a deployed cluster.

The jar file for older versions of Spark (2.0+) with ANTLR 4.5.3 is available on request (if you receive a warning on the command line).

Documentation: http://sparksoniq.readthedocs.io/en/latest/

- Java
Published by ghislainfourny about 7 years ago

https://github.com/rumbledb/rumble - Sparksoniq 0.9.6 "Olive Tree"

New alpha release for Sparksoniq, a JSONiq engine to query large-scale JSON datasets stored on HDFS. Spark under the hood.

New:

  • New functions text-file#1, text-file#2, tokenize#1, tokenize#2 to open text files as input. Now billions of lines can be manipulated as sequences of strings with FLWORs, in the same way billions of objects could until now.
  • Fixing serialization bugs (escaping)
  • Fixing bug in string literal escaping in the shell
  • Fix bug with local count clause execution
  • Fix bug in the shell leading to a crash when a parallelized FLWOR execution was outputting the empty sequence
  • Fix bug leading to a crash when the where clause expression was not returning a boolean in local execution. Now the effective boolean value is taken.

The jar file with ANTLR 4.7 is to be used with Spark 2.3+. Older versions (2.0+) use ANTLR 4.5.3.

Documentation: http://sparksoniq.readthedocs.io/en/latest/

- Java
Published by ghislainfourny about 7 years ago

https://github.com/rumbledb/rumble - Sparksoniq 0.9.5 "Larch"

New alpha release for Sparksoniq, a JSONiq engine to query large-scale JSON datasets stored on HDFS. Spark under the hood.

New:

  • Many bugfixes
  • All FLWOR clauses are now supported locally (that is, when parallelize() or json-file() is not used). Locally means: without invoking Spark transformations. Local FLWOR expressions can execute on the client but also within a transformation triggered by a non-local FLWOR.
  • Local FLWOR expressions can fully nest. All queries of the tutorial now work and you can use and abuse let clauses.
  • Pushdowns: json-file("file.json").foo[].bar[[2]].foobar works on Spark
  • Significant improvements in memory footprint: some queries are no longer materialized in memory (e.g., filtering query with a where clause or count).
  • Significant improvements in performance: a file of 16,000,000 objects was successfully tested for count, filtering, grouping and ordering with a local Spark execution on a single laptop. Performance also improved on bigger datasets on clusters.
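
To make the pushed-down navigation .foo[].bar[[2]].foobar concrete, here is a Python sketch of what each step selects, under the usual JSONiq reading: object lookup, array unboxing, and 1-based array lookup, with non-matching items contributing nothing. The helper functions and the sample objects are made up for illustration; this mimics the semantics only and says nothing about how Sparksoniq executes the steps on Spark.

```python
from typing import Any, Iterable, Iterator


def lookup(items: Iterable[Any], key: str) -> Iterator[Any]:
    """Object lookup (.key): non-objects and missing keys contribute nothing."""
    for item in items:
        if isinstance(item, dict) and key in item:
            yield item[key]


def unbox(items: Iterable[Any]) -> Iterator[Any]:
    """Array unboxing ([]): yield every member of every array."""
    for item in items:
        if isinstance(item, list):
            yield from item


def array_lookup(items: Iterable[Any], position: int) -> Iterator[Any]:
    """Array lookup ([[n]]): 1-based; out-of-range positions contribute nothing."""
    for item in items:
        if isinstance(item, list) and 1 <= position <= len(item):
            yield item[position - 1]


# Made-up sample objects standing in for the lines of file.json.
objects = [
    {"foo": [{"bar": [{"foobar": "skipped"}, {"foobar": "kept"}]}, {"bar": ["only one"]}]},
    {"foo": "not an array"},
    {"unrelated": True},
]

# Equivalent of .foo[].bar[[2]].foobar applied to the sample objects.
result = list(lookup(array_lookup(lookup(unbox(lookup(objects, "foo")), "bar"), 2), "foobar"))
print(result)  # ['kept']
```

Each step maps or filters items one at a time, which is the kind of per-item operation that lends itself to being pushed down to a distributed engine.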

The jar file with ANTLR 4.7 is to be used with Spark 2.3+. Older versions (2.0+) use ANTLR 4.5.3.

Documentation: http://sparksoniq.readthedocs.io/en/latest/

- Java
Published by ghislainfourny over 7 years ago

https://github.com/rumbledb/rumble - Sparksoniq 0.9.4 "Birch"

New alpha release for Sparksoniq, a JSONiq engine to query large-scale JSON datasets stored on HDFS. Spark under the hood.

New: various bugfixes:

  • count clauses are supported and pushed down to Spark
  • simple keys no longer need to be quoted when constructing objects (in particular: null pointer exception is fixed)
  • the error message when a function name+arity is not found is more helpful
  • it is no longer necessary to supply the --master option twice on the CLI: supplying it once to spark-submit is enough.

The jar files no longer contain the Spark libraries, as they are provided by the local environment or the cluster.

The jar file with ANTLR 4.7 is to be used with Spark 2.3+. Older versions (2.0+) use ANTLR 4.5.3.

Documentation: http://sparksoniq.readthedocs.io/en/latest/

- Java
Published by ghislainfourny over 7 years ago

https://github.com/rumbledb/rumble - Sparksoniq 0.9.3 "Cedar"

Third alpha release for Sparksoniq, a JSONiq engine to query large-scale JSON datasets stored on HDFS. Spark under the hood.

New: various bugfixes:

  • Ctrl+D now exits nicely from the shell
  • count() calls are pushed down to Spark if the nested expression uses underlying RDDs.
  • various exceptions are now caught and displayed with a nice error message.
  • Strings can be concatenated with atomic types (they get serialized to a string)
  • Lookup can be done on a sequence of objects

The jar files no longer contain the Spark libraries, as they are provided by the local environment or the cluster.

The jar file with ANTLR 4.7 is to be used with Spark 2.3+. Older versions (2.0+) use ANTLR 4.5.3.

Documentation: http://sparksoniq.readthedocs.io/en/latest/

- Java
Published by ghislainfourny over 7 years ago

https://github.com/rumbledb/rumble - Sparksoniq Alpha 0.9.2 "Cypress"

Second release for Sparksoniq, a JSONiq engine to query large-scale JSON datasets stored on HDFS. Spark under the hood.

New: various bugfixes (e.g., empty sequence handling), richer function library, general comparison operators.

The jar files no longer contain the Spark libraries, as they are provided by the local environment or the cluster.

The jar file with ANTLR 4.7 is to be used with Spark 2.3+. Older versions (2.0+) use ANTLR 4.5.3.

Documentation: http://sparksoniq.readthedocs.io/en/latest/

- Java
Published by ghislainfourny over 7 years ago

https://github.com/rumbledb/rumble - Sparksoniq Alpha 0.9.1 "Spruce"

First release for Sparksoniq, a JSONiq engine to query large-scale JSON datasets stored on HDFS. Spark under the hood.

Documentation: http://sparksoniq.readthedocs.io/en/latest/

- Java
Published by wscsprint3r over 8 years ago