https://github.com/cwida/realnest

RealNest - A Collection of Nested Data from Real-World Datasets

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.4%) to scientific vocabulary

Keywords

dataset nested-data research
Last synced: 9 months ago

Repository

RealNest - A Collection of Nested Data from Real-World Datasets

Basic Info
Statistics
  • Stars: 4
  • Watchers: 4
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme License

README.md

RealNest - Nested Data from Real-World Datasets

This repository contains the details of the RealNest dataset, a collection of nested data derived from real-world datasets. The dataset is designed to help computer science researchers benchmark and evaluate data systems and data formats supporting nested data types.

RealNest is provided as a script that downloads and generates the data. For convenience, and to facilitate standardized comparisons, we also host two static datasets on the CWI website (https://event.cwi.nl/da/RealNest), outside of this repository. They contain data in .jsonl.gz format, with 64 * 1024 (65,536) and 10 * 64 * 1024 (655,360) rows, respectively. These static datasets were downloaded and generated by our script in mid-May 2024.

Furthermore, the sample-data directory inside this repository contains a small sample of the datasets mentioned above (the first 1024 rows and 100 MiB of each table) as a preview.

Because we provide the script that downloads the original datasets and processes them into a common format, one can regenerate the dataset from newer versions of the underlying data. One can also generate datasets larger than the static ones, since even the larger of the two static datasets contains only a small part of each original data source. Please note that the availability of the original datasets is outside our control, and some of them may become unavailable over time. The download script attempts to download the data from each source and skips the sources that are not available.

Please refer to the README in the scripts directory for more details.

All materials in this GitHub repository, except the files under the sample-data folder, are released under the CC BY-NC-SA 4.0 license (https://creativecommons.org/licenses/by-nc-sa/4.0/). Hence, this repository is open-source, requires attribution to this page (including the Attribution section below), and does not allow commercial exploitation.

Note that the sample datasets inside this repository and the two static datasets hosted at CWI (linked above) remain under the same licenses and terms of use as the original datasets they are generated from. If you are the owner of an original dataset and object to the inclusion of your data in the RealNest static datasets hosted at CWI or in the samples hosted in this repository, please contact Peter Boncz (boncz@cwi.nl), and we will take action.

Please note that below we attempt to properly attribute the individual datasets as required by their various open-source licenses and terms of usage.

Dataset Structure

The dataset contains a directory for each table with the following files:

  • schema.json: The schema of the table. The schema is a JSON object with a single key, columns, containing a list of columns. Each column is a JSON object with 2 or 3 keys:
    • name - The name of the column as a string.
    • type - The type of the column as a string.
    • children - Optional, only exists for nested types (list, struct, map). Describes the child types of the nested type as a list of column objects. The list type always has a single child column with the name child. The map type always has two child columns with the names key and value.
  • data.jsonl or data.jsonl.gz: The data of the table in JSON Lines format (optionally Gzip compressed).
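
The schema layout described above can be walked recursively. The following is a minimal sketch, not part of the RealNest scripts: the sample schema, its column names, and the primitive type name BIGINT are hypothetical, and the helper names render_type and describe are made up for illustration. It assumes the nested type names are spelled list, struct, and map as in the description, and compares them case-insensitively in case the real files spell them differently.

```python
import json

# Hypothetical schema.json content that follows the structure described above.
# Column names and the primitive type BIGINT are illustrative assumptions.
SCHEMA_TEXT = """
{
  "columns": [
    {"name": "id", "type": "BIGINT"},
    {"name": "tags", "type": "list", "children": [
      {"name": "child", "type": "VARCHAR"}
    ]},
    {"name": "attrs", "type": "map", "children": [
      {"name": "key", "type": "VARCHAR"},
      {"name": "value", "type": "struct", "children": [
        {"name": "score", "type": "BIGINT"}
      ]}
    ]}
  ]
}
"""


def render_type(column):
    """Return a compact type string for one column object (recursive)."""
    kind = column["type"]
    children = column.get("children", [])
    lowered = kind.lower()
    if lowered == "list":
        # A list has exactly one child column, named "child".
        return "list<" + render_type(children[0]) + ">"
    if lowered == "map":
        # A map has two child columns, named "key" and "value".
        by_name = {child["name"]: child for child in children}
        return ("map<" + render_type(by_name["key"]) + ", "
                + render_type(by_name["value"]) + ">")
    if lowered == "struct":
        fields = ", ".join(
            child["name"] + ": " + render_type(child) for child in children
        )
        return "struct<" + fields + ">"
    # Non-nested types have no children; return the type name unchanged.
    return kind


def describe(schema):
    """Map each top-level column name to its rendered type string."""
    return {col["name"]: render_type(col) for col in schema["columns"]}


schema = json.loads(SCHEMA_TEXT)
for name, type_string in describe(schema).items():
    print(name, "->", type_string)
```

The same recursion can be adapted to emit a schema for a specific system, for example by mapping each node to that system's nested type constructors instead of building a string.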

The schema may contain columns of type JSON. This happens for empty JSON objects in the data ({}) or when DuckDB's schema inference detects incompatible types. Such columns can be ignored, since they are not typical for structured data, or they can be handled as VARCHAR columns whose value is the JSON string.
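
As an illustration of the two options for JSON-typed columns, the sketch below reads a table directory and either keeps such columns as JSON strings or drops them. It is a sketch under stated assumptions, not the project's own loader: the function names (open_table_data, json_typed_columns, read_rows, demo) are hypothetical, the type is assumed to be spelled JSON (compared case-insensitively), and only top-level columns are handled, not JSON-typed children nested inside lists, structs, or maps. The demo writes a tiny made-up table to a temporary directory so the example is self-contained.

```python
import gzip
import json
import os
import tempfile


def open_table_data(table_dir):
    """Open data.jsonl.gz if it exists, otherwise data.jsonl, as text."""
    gz_path = os.path.join(table_dir, "data.jsonl.gz")
    if os.path.exists(gz_path):
        return gzip.open(gz_path, "rt", encoding="utf-8")
    return open(os.path.join(table_dir, "data.jsonl"), "r", encoding="utf-8")


def json_typed_columns(schema):
    """Names of top-level columns whose type is JSON (assumed spelling)."""
    return {c["name"] for c in schema["columns"] if c["type"].upper() == "JSON"}


def read_rows(table_dir, keep_json_as_text=True):
    """Read all rows; JSON-typed columns become strings or are dropped."""
    with open(os.path.join(table_dir, "schema.json"), encoding="utf-8") as f:
        schema = json.load(f)
    json_cols = json_typed_columns(schema)
    rows = []
    with open_table_data(table_dir) as lines:
        for line in lines:
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            for name in json_cols:
                if name not in row:
                    continue
                if keep_json_as_text:
                    # VARCHAR-style handling: store the JSON string itself.
                    row[name] = json.dumps(row[name])
                else:
                    del row[name]
            rows.append(row)
    return rows


def demo():
    """Build a tiny hypothetical table on disk and read it both ways."""
    with tempfile.TemporaryDirectory() as table_dir:
        schema = {"columns": [
            {"name": "id", "type": "BIGINT"},
            {"name": "extra", "type": "JSON"},
        ]}
        with open(os.path.join(table_dir, "schema.json"), "w",
                  encoding="utf-8") as f:
            json.dump(schema, f)
        with gzip.open(os.path.join(table_dir, "data.jsonl.gz"), "wt",
                       encoding="utf-8") as f:
            f.write('{"id": 1, "extra": {}}\n')
            f.write('{"id": 2, "extra": {"a": 1}}\n')
        return read_rows(table_dir), read_rows(table_dir, keep_json_as_text=False)


as_text, dropped = demo()
print(as_text)
print(dropped)
```

Reading everything into a list is only reasonable for the small previews; for the larger static datasets, a generator that yields one row at a time would avoid holding the whole table in memory.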

Attribution

The data has been downloaded from various public sources and converted to a common format. We note that the real-world datasets from which RealNest is derived are released under varying open-source licenses and terms of usage.

The sources of the original datasets are:

  1. Amazon Berkeley Objects (LICENSE)
    • J. Collins, S. Goel, K. Deng, A. Luthra, L. Xu, E. Gundogdu, X. Zhang, T. F. Yago Vicente, T. Dideriksen, H. Arora, M. Guillaumin, and J. Malik, "ABO: Dataset and benchmarks for real-world 3D object understanding," CVPR, 2022.
  2. AWS Public Blockchain Data (LICENSE)
  3. Data Lake as Code (ATTRIBUTIONS)
  4. CORD-19 (LICENSE)
    • L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Eide, K. Funk, R. M. Kinney, Z. Liu, W. Merrill, P. Mooney, D. A. Murdick, D. Rishi, J. Sheehan, Z. Shen, B. Stilson, A. D. Wade, K. Wang, C. Wilhelm, B. Xie, D. A. Raymond, D. S. Weld, O. Etzioni, and S. Kohlmeier, "CORD-19: The COVID-19 open research dataset," ArXiv, 2020.
  5. Daylight Map Distribution of OpenStreetMap (Open Database License (ODbL))
  6. GitHub Archive
  7. CERN Open Data
    • CMS collaboration (2017). SingleMu primary dataset in AOD format from Run of 2012 (/SingleMu/Run2012B-22Jan2013-v1/AOD). CERN Open Data Portal. DOI:10.7483/OPENDATA.CMS.IYVQ.1J0W
  8. Overture Maps Foundation Open Map Data
    • Overture data is licensed under the Community Database License Agreement Permissive v2 (CDLA) unless derived from a source that requires publishing under a different license, such as data derived from OpenStreetMap, that constitutes a 'Derivative Database' (as defined under ODbL v1.0), which will be licensed under ODbL v1.0.
  9. Twitter Stream Archive

Owner

  • Name: CWI Database Architectures Group
  • Login: cwida
  • Kind: organization
  • Location: Amsterdam, The Netherlands

GitHub Events

Total
  • Delete event: 1
  • Member event: 1
  • Push event: 2
  • Pull request review event: 1
  • Pull request event: 4
  • Create event: 2
Last Year
  • Delete event: 1
  • Member event: 1
  • Push event: 2
  • Pull request review event: 1
  • Pull request event: 4
  • Create event: 2

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 day
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 day
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • ZiyaZa (5)
  • peterboncz (2)