dataset

dataset is a command line tool, Go package, shared library and Python package for working with JSON objects as collections

https://github.com/caltechlibrary/dataset

Science Score: 62.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
✓ Committers with academic emails: 1 of 5 committers (20.0%) from academic institutions
✓ Institutional organization owner: Organization caltechlibrary has institutional domain (www.library.caltech.edu)
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (16.7%) to scientific vocabulary

Keywords

datasets json

Keywords from Contributors

archival github-template optim interactive legged-robotics genomics hacking yolov5 network-simulation markup

Last synced: 6 months ago

Repository


Basic Info

Host: GitHub
Owner: caltechlibrary
License: other
Language: Go
Default Branch: main
Homepage: https://caltechlibrary.github.io/dataset
Size: 15.6 MB

Statistics

Stars: 24
Watchers: 9
Forks: 4
Open Issues: 6
Releases: 131

Topics

datasets json

Created about 9 years ago · Last pushed 7 months ago

Metadata Files

Readme Changelog Contributing License Code of conduct Citation Codemeta

Dataset Project

The Dataset Project provides tools for working with collections of JSON documents. It uses a simple key and object pair to organize JSON documents into a collection. It supports SQL querying of the objects stored in a collection.

It is suitable for temporary storage of JSON objects in data processing pipelines as well as a persistent storage mechanism for collections of JSON objects.

The Dataset Project provides a command line program and a web service for working with JSON objects as a collection or individual objects. As such it is well suited for data science projects as well as building web applications that work with metadata.

dataset, a command line tool

dataset is a command line tool for working with collections of JSON documents. Collections can be stored on the file system in a pairtree or in a SQL database that supports JSON columns, such as SQLite3, PostgreSQL or MySQL.

The dataset command line tool supports common data management operations such as

initializing a collection
dumping and loading JSON lines files into a collection
CRUD operations on a collection
querying a collection using SQL

See Getting started with dataset for a tour and tutorial.

datasetd is dataset implemented as a web service

datasetd is a JSON REST web service and static file host. It provides a JSON API supporting the main operations found in the dataset command line program. This allows dataset collections to be integrated safely into web applications or be used concurrently by multiple processes.

The Dataset Web Service can host multiple collections each with their own custom query API defined in a simple YAML configuration file.

Design choices

dataset and datasetd are intended to be simple tools for managing collections of JSON object documents in a predictable, structured way. The dataset web service allows multi-process or multi-user access to a dataset collection via HTTP.

dataset is guided by the idea that you should be able to work with JSON documents as easily as you can any plain text document on the Unix command line. dataset is intended to be simple to use with minimal setup (e.g. dataset init mycollection.ds creates a new collection called 'mycollection.ds').

dataset and datasetd store JSON object documents in collections
- Storage of the JSON documents may be either in a pairtree on disk or in a SQL database using JSON columns (e.g. SQLite3 or MySQL 8)
- dataset collections are made up of a directory containing collection.json and codemeta.json files
- collection.json is a metadata file describing the collection, e.g. storage type, name, description, whether versioning is enabled
- codemeta.json is a codemeta file describing the nature of the collection, e.g. authors, description, funding
- collection objects are accessed by their key, a unique identifier made up of lower case alphanumeric characters
- collection names are usually lower case and usually have a .ds extension for easy identification
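
The key rule in the list above can be sketched in Go. This takes the description literally (lower case letters and digits only). `normalizeKey` is a hypothetical helper, not part of the dataset package, and the real tool may accept a wider set of characters.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeKey lower-cases a candidate key and rejects anything other than
// ASCII letters and digits. Hypothetical helper for illustration only.
func normalizeKey(raw string) (string, error) {
	key := strings.ToLower(strings.TrimSpace(raw))
	if key == "" {
		return "", fmt.Errorf("empty key")
	}
	for _, r := range key {
		isLower := r >= 'a' && r <= 'z'
		isDigit := r >= '0' && r <= '9'
		if !isLower && !isDigit {
			return "", fmt.Errorf("key %q contains unsupported character %q", key, r)
		}
	}
	return key, nil
}

func main() {
	for _, raw := range []string{"Record42", "bad key!"} {
		key, err := normalizeKey(raw)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", key)
	}
}
```

Normalizing before every lookup means "Record42" and "record42" resolve to the same object, which matters on file systems that are not case sensitive.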

dataset collection storage options
- SQL store stores JSON documents in a JSON column
  - SQLite3 (default), PostgreSQL >= 12 and MySQL 8 are the currently supported SQL databases
  - A "DSN URI" is used to identify and gain access to the SQL database
  - The DSN URI may be passed through the environment
- pairtree (deprecated, will be removed in v3)
  - the pairtree path is always lowercase
- non-JSON attachments can be associated with a JSON document and are found in directories organized by semver (semantic version number)
- versioned JSON documents are created alongside the current JSON document but are named using both their key and semver
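
For readers unfamiliar with pairtrees, the general convention maps an identifier to a directory path by splitting it into two-character segments. The Go sketch below shows that idea only. `pairPath` is a hypothetical helper, it assumes ASCII keys, it omits the character escaping that the full pairtree convention defines, and the exact on-disk layout dataset uses may differ.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// pairPath lower-cases a key and splits it into two-character segments,
// with a shorter final segment when the key length is odd.
// Byte slicing is safe here only because the key is assumed to be ASCII.
func pairPath(key string) string {
	key = strings.ToLower(key)
	var segments []string
	for len(key) > 2 {
		segments = append(segments, key[:2])
		key = key[2:]
	}
	if key != "" {
		segments = append(segments, key)
	}
	return path.Join(segments...)
}

func main() {
	for _, key := range []string{"abcd", "ABCDE"} {
		fmt.Printf("%s -> %s\n", key, pairPath(key))
	}
}
```

Short segments keep any single directory from accumulating a very large number of entries, which is the usual motivation for this layout.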

datasetd is a web service
- it is intended as a back end web service run on localhost
- it runs on localhost and a designated port (port 8485 is the default)
- it supports multiple collections, each of which can have its own configuration for global object permissions and supported SQL queries

The choice of plain UTF-8 is intended to help future-proof reading dataset collections. Care has been taken to keep dataset simple enough and lightweight enough that it will run on a machine as small as a Raspberry Pi Zero while being equally comfortable on a more resource-rich server or desktop environment. dataset can be re-implemented in any programming language supporting file input and output, common string operations, and JSON encoding and decoding functions. The current implementation is in the Go language.

Features

dataset supports

Collection level
- Initialize a new dataset collection
- Codemeta file support for describing the collection contents
- Dump a collection to a JSON lines document
- Load a collection from a JSON lines document
- Listing Keys in a collection
Object level actions
- create
- read
- update
- delete
- keys
- has-key
- Documents as attachments
  - attachments (list)
  - attach (create/update)
  - retrieve (read)
  - prune (delete)

datasetd supports

List collections available from the web service
List a collection's metadata
List a collection's Keys
Object level actions
- create
- read
- update
- delete
- Documents as attachments
  - attach
  - retrieve
  - prune

Both dataset and datasetd may be useful for general data science applications needing JSON object management or for implementing repository systems in research libraries and archives.

Limitations of dataset and datasetd

dataset has many limitations; some are listed below

the pairtree implementation is not a multi-process, multi-user data store
it is not a general purpose database system
it stores all keys in lower case to deal with file systems that are not case sensitive
it stores collection names in lower case to deal with file systems that are not case sensitive
it should NOT be used for sensitive, confidential or secret information because it lacks access controls and data encryption

datasetd is a simple web service intended to run on "localhost:8485".

it does not include support for authentication
it does not support access control for users or roles
it does not encrypt the data it stores
it does not support HTTPS
it does not provide auto key generation
it limits the size of stored JSON documents to the size supported by the host SQL database's JSON columns
it limits the size of attached files to less than 250 MiB
it does not support partial JSON record updates or retrieval
it does not provide an interactive Web UI for working with dataset collections
it should NOT be used for sensitive, confidential or secret information because it lacks access controls and data encryption

Authors and history

R. S. Doiel
Tommy Morrell

Releases

Compiled versions are provided for Linux (x86, aarch64), Mac OS X (x86 and M1), Windows 11 (x86, aarch64) and Raspberry Pi OS.

github.com/caltechlibrary/dataset/releases

Related projects

You can use dataset from Python via the py_dataset package.

You can use dataset from Deno+TypeScript by running datasetd and accessing it with ts_dataset.

Owner

Name: Caltech Library
Login: caltechlibrary
Kind: organization
Email: helpdesk@library.caltech.edu
Location: Pasadena, CA 91125

Website: https://www.library.caltech.edu/
Repositories: 84
Profile: https://github.com/caltechlibrary

We manage the physical and digital holdings of the California Institute of Technology, provide services and training, and develop open-source software.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
type: software
title: dataset
abstract: "The Dataset Project provides tools for working with collections of JSON documents easily. It uses a simple key and object pair to organize JSON documents into a collection. It supports SQL querying of the objects stored in a collection.

It is suitable for temporary storage of JSON objects in data processing pipelines as well as a persistent storage mechanism for collections of JSON objects.

The Dataset Project provides command line programs and a web service for working with JSON objects as a collection or individual objects. As such it is well suited for data science projects as well as building web applications that work with metadata."
authors:
  - family-names: Doiel
    given-names: R. S.
    orcid: https://orcid.org/0000-0003-0900-6903
    email: rsdoiel@caltech.edu
  - family-names: Morrell
    given-names: Thomas E
    orcid: https://orcid.org/0000-0001-9266-5146
    email: tmorrell@caltech.edu

contacts:
  - family-names: Doiel
    given-names: R. S.
    orcid: https://orcid.org/0000-0003-0900-6903
    email: rsdoiel@caltech.edu
  - family-names: Morrell
    given-names: Thomas E
    orcid: https://orcid.org/0000-0001-9266-5146
    email: tmorrell@caltech.edu

repository-code: "https://github.com/caltechlibrary/dataset"
version: 2.3.2
date-released: 2025-07-11

license-url: "https://caltechlibrary.github.io/dataset/LICENSE"
keywords:
  - metadata
  - data
  - software
  - json

CodeMeta (codemeta.json)

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "type": "SoftwareSourceCode",
  "codeRepository": "https://github.com/caltechlibrary/dataset",
  "author": [
    {
      "id": "https://orcid.org/0000-0003-0900-6903",
      "type": "Person",
      "givenName": "R. S.",
      "familyName": "Doiel",
      "affiliation": {
        "@type": "Organization",
        "name": "Caltech Library"
      },
      "email": "rsdoiel@caltech.edu"
    },
    {
      "id": "https://orcid.org/0000-0001-9266-5146",
      "type": "Person",
      "givenName": "Thomas E",
      "familyName": "Morrell",
      "affiliation": {
        "@type": "Organization",
        "name": "Caltech Library"
      },
      "email": "tmorrell@caltech.edu"
    }
  ],
  "maintainer": [
    {
      "id": "https://orcid.org/0000-0003-0900-6903",
      "type": "Person",
      "givenName": "R. S.",
      "familyName": "Doiel",
      "affiliation": {
        "@type": "Organization",
        "name": "Caltech Library"
      },
      "email": "rsdoiel@caltech.edu"
    },
    {
      "id": "https://orcid.org/0000-0001-9266-5146",
      "type": "Person",
      "givenName": "Thomas E",
      "familyName": "Morrell",
      "affiliation": {
        "@type": "Organization",
        "name": "Caltech Library"
      },
      "email": "tmorrell@caltech.edu"
    }
  ],
  "dateCreated": "2016-09-12",
  "dateModified": "2025-07-11",
  "datePublished": "2025-07-11",
  "description": "The Dataset Project provides tools for working with collections of JSON documents easily. It uses a simple key and object pair to organize JSON documents into a collection. It supports SQL querying of the objects stored in a collection.\n\nIt is suitable for temporary storage of JSON objects in data processing pipelines as well as a persistent storage mechanism for collections of JSON objects.\n\nThe Dataset Project provides command line programs and a web service for working with JSON objects as a collection or individual objects. As such it is well suited for data science projects as well as building web applications that work with metadata.",
  "funder": [
    {
      "@id": "https://doi.org/10.13039/100006961",
      "@type": "Organization",
      "name": "Caltech Library"
    }
  ],
  "keywords": [
    "metadata",
    "data",
    "software",
    "json"
  ],
  "name": "dataset",
  "license": "https://caltechlibrary.github.io/dataset/LICENSE",
  "programmingLanguage": [
    "Go",
    "SQL"
  ],
  "softwareRequirements": [
    "Golang >= 1.24.5",
    "CMTools >= 0.0.35"
  ],
  "softwareSuggestions": [
    "Pandoc >= 3.1",
    "GNU Make >= 3.8"
  ],
  "version": "2.3.2",
  "developmentStatus": "active",
  "issueTracker": "https://github.com/caltechlibrary/dataset/issues",
  "downloadUrl": "https://github.com/caltechlibrary/dataset/archives/main.zip",
  "releaseNotes": "Issue #161 fix for handling GET with query were data is passed via URL parameters.\n\nRemoved support for frame, clone, sample, sync and join support removed. The dsimporter cli removed (use jsonl dump and load instead).",
  "copyrightYear": 2025,
  "copyrightHolder": "California Institute of Technology"
}

GitHub Events

Total

Create event: 11
Release event: 10
Issues event: 34
Watch event: 2
Issue comment event: 26
Push event: 97
Pull request event: 4

Last Year

Create event: 11
Release event: 10
Issues event: 34
Watch event: 2
Issue comment event: 26
Push event: 97
Pull request event: 4

Committers

Last synced: over 1 year ago

All Time

Total Commits: 1,340
Total Committers: 5
Avg Commits per committer: 268.0
Development Distribution Score (DDS): 0.08

Past Year

Commits: 52
Committers: 1
Avg Commits per committer: 52.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
R. S. Doiel	r**l@g**m	1,233
R. S. Doiel	=	85
Tom Morrell	t**l@c**u	19
Thomas Morrell	t****l	2
dependabot[bot]	4****]	1

Committer Domains (Top 20 + Academic)

caltech.edu: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 113
Total pull requests: 2
Average time to close issues: 4 months
Average time to close pull requests: 3 days
Total issue authors: 3
Total pull request authors: 2
Average comments per issue: 1.42
Average comments per pull request: 0.0
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 1

Past Year

Issues: 7
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 0.29
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

rsdoiel (99)
tmorrell (28)
atomotic (1)

Pull Request Authors

rsdoiel (2)
dependabot[bot] (1)

Top Labels

Issue Labels

enhancement (43) bug (41) Correction (15) Doc Bug (14) critical (10) Someday Maybe (9) declined (4) wontfix (2) question (2) help wanted (1) review in branch (1)

Pull Request Labels

dependencies (1)

Dependencies

go.mod go

github.com/glebarez/go-sqlite v1.17.3
github.com/go-sql-driver/mysql v1.6.0
github.com/google/uuid v1.3.0
github.com/mattn/go-isatty v0.0.14
github.com/remyoudompheng/bigfft v0.0.0-20200410134404-eec4a21b6bb0
golang.org/x/sys v0.0.0-20220405052023-b1e9470b6e64
modernc.org/libc v1.16.8
modernc.org/mathutil v1.4.1
modernc.org/memory v1.1.1
modernc.org/sqlite v1.17.3

go.sum go

github.com/dustin/go-humanize v1.0.0
github.com/glebarez/go-sqlite v1.17.3
github.com/go-sql-driver/mysql v1.6.0
github.com/google/go-cmp v0.5.3
github.com/google/uuid v1.3.0
github.com/kballard/go-shellquote v0.0.0-20180428030007-95032a82bc51
github.com/mattn/go-isatty v0.0.12
github.com/mattn/go-isatty v0.0.14
github.com/mattn/go-sqlite3 v1.14.12
github.com/pmezard/go-difflib v1.0.0
github.com/remyoudompheng/bigfft v0.0.0-20200410134404-eec4a21b6bb0
github.com/yuin/goldmark v1.2.1
golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2
golang.org/x/crypto v0.0.0-20191011191535-87dc89f01550
golang.org/x/crypto v0.0.0-20200622213623-75b288015ac9
golang.org/x/mod v0.3.0
golang.org/x/net v0.0.0-20190404232315-eb5bcb51f2a3
golang.org/x/net v0.0.0-20190620200207-3b0461eec859
golang.org/x/net v0.0.0-20201021035429-f5854403a974
golang.org/x/sync v0.0.0-20190423024810-112230192c58
golang.org/x/sync v0.0.0-20201020160332-67f06af15bc9
golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a
golang.org/x/sys v0.0.0-20190412213103-97732733099d
golang.org/x/sys v0.0.0-20200116001909-b77594299b42
golang.org/x/sys v0.0.0-20200930185726-fdedc70b468f
golang.org/x/sys v0.0.0-20210630005230-0f9fa26af87c
golang.org/x/sys v0.0.0-20211007075335-d3039528d8ac
golang.org/x/sys v0.0.0-20220405052023-b1e9470b6e64
golang.org/x/text v0.3.0
golang.org/x/text v0.3.3
golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e
golang.org/x/tools v0.0.0-20191119224855-298f0cb1881e
golang.org/x/tools v0.0.0-20201124115921-2c860bdd6e78
golang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7
golang.org/x/xerrors v0.0.0-20191011141410-1b5146add898
golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543
golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1
lukechampine.com/uint128 v1.1.1
modernc.org/cc/v3 v3.36.0
modernc.org/ccgo/v3 v3.0.0-20220428102840-41399a37e894
modernc.org/ccgo/v3 v3.0.0-20220430103911-bc99d88307be
modernc.org/ccgo/v3 v3.16.4
modernc.org/ccgo/v3 v3.16.6
modernc.org/ccorpus v1.11.6
modernc.org/httpfs v1.0.6
modernc.org/libc v0.0.0-20220428101251-2d5f3daf273b
modernc.org/libc v1.16.0
modernc.org/libc v1.16.1
modernc.org/libc v1.16.7
modernc.org/libc v1.16.8
modernc.org/mathutil v1.2.2
modernc.org/mathutil v1.4.1
modernc.org/memory v1.1.1
modernc.org/opt v0.1.1
modernc.org/sqlite v1.17.3
modernc.org/strutil v1.1.1
modernc.org/tcl v1.13.1
modernc.org/token v1.0.0
modernc.org/z v1.5.1
