py-dataset

Python package of dataset (https://github.com/caltechlibrary/dataset) for working with JSON objects as collections on disc

https://github.com/caltechlibrary/py_dataset

Science Score: 62.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 3 committers (33.3%) from academic institutions
  • Institutional organization owner
    Organization caltechlibrary has institutional domain (www.library.caltech.edu)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.3%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 2
  • Watchers: 5
  • Forks: 1
  • Open Issues: 3
  • Releases: 13
Created almost 7 years ago · Last pushed 11 months ago
Metadata Files
Readme License Citation Codemeta

README.md

py_dataset

py_dataset is a Python wrapper for the dataset command line tools. Starting with the dataset 2.2.x release, it replaces the deprecated libdataset, a C shared library.

This package wraps all dataset operations such as initialization of collections, creation, reading, updating and deleting JSON objects in the collection. Some of its enhanced features include the ability to generate data frames as well as the ability to import and export JSON objects to and from CSV files.
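To make the CSV round trip concrete, here is a minimal sketch using only the Python standard library. It does not call py_dataset; the helper names `objects_to_csv` and `csv_to_objects` and the in-memory `objects` dict are hypothetical stand-ins for a collection, chosen only to show how flat JSON objects map onto CSV rows.

```python
import csv
import io


def objects_to_csv(objects: dict, columns: list) -> str:
    """Write one CSV row per key; columns missing from an object are left empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["key"] + columns,
        extrasaction="ignore",   # skip fields that are not listed in columns
        lineterminator="\n",
    )
    writer.writeheader()
    for key, obj in objects.items():
        writer.writerow({**obj, "key": key})
    return buffer.getvalue()


def csv_to_objects(text: str) -> dict:
    """Read rows back into a key -> object mapping (all values come back as strings)."""
    result = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = row.pop("key")
        result[key] = row
    return result


# Hypothetical in-memory stand-in for a collection: key -> JSON object.
objects = {"one": {"title": "A", "year": 2020}, "two": {"title": "B"}}
text = objects_to_csv(objects, ["title", "year"])
print(text)
print(csv_to_objects(text))
```

Note that CSV carries no type information, so numbers come back as strings and absent fields come back as empty strings in this sketch.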

py_dataset is released under a BSD-style license.

Features

dataset supports

  • Basic storage actions (create, read, update and delete)
  • Listing of collection keys (including filtering and sorting)
  • Import/export of CSV files
  • The ability to reshape data by performing simple object joins
  • The ability to create data frames from collections based on key lists and dot paths into the stored JSON objects

See docs for details.
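As an illustration of the last feature, the sketch below shows what "a frame built from a key list and dot paths" can look like in plain Python. It is not the py_dataset API: `get_dot_path`, `build_frame`, and the in-memory `objects` dict are hypothetical names used only to show the idea of selecting nested fields by path.

```python
from typing import Any


def get_dot_path(obj: dict, dot_path: str) -> Any:
    """Return the value at a dot path such as '.name.family', or None if absent."""
    current: Any = obj
    for part in dot_path.strip(".").split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def build_frame(objects: dict, keys: list, dot_paths: list, labels: list) -> list:
    """Build one row per key, with one labelled column per dot path."""
    rows = []
    for key in keys:
        obj = objects[key]
        rows.append(
            {label: get_dot_path(obj, path) for label, path in zip(labels, dot_paths)}
        )
    return rows


# Hypothetical in-memory stand-in for a collection: key -> JSON object.
objects = {
    "one": {"name": {"family": "Doe", "given": "Jane"}, "year": 2020},
    "two": {"name": {"family": "Roe"}, "year": 2021},
}
frame = build_frame(
    objects,
    ["one", "two"],
    [".name.family", ".name.given", ".year"],
    ["family", "given", "year"],
)
print(frame)
```

Each row is a flat dict, so the result can be handed to tabular tooling; missing paths become `None` rather than raising an error.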

Limitations of dataset

dataset has many limitations; some are listed below:

  • it is not a multi-process, multi-user data store (it's files on "disc" without locking)
  • it is not a replacement for a repository management system
  • it is not a general purpose database system
  • it does not supply version control on collections or objects
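Because there is no locking, scripts that share a collection need to coordinate among themselves. One possible approach, sketched below, is an advisory lock file created with an exclusive-create open. This is an assumption about how a caller might cope, not something dataset or py_dataset provides; `collection_lock` and the `.lock` suffix are hypothetical, and the scheme only helps if every writer uses it.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def collection_lock(c_name: str):
    """Hold an advisory lock file next to the collection while the block runs.

    Raises FileExistsError if another holder already created the lock file.
    """
    lock_path = c_name + ".lock"
    # O_EXCL makes creation fail if the file already exists.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield lock_path
    finally:
        os.close(fd)
        os.remove(lock_path)


# Demonstration in a temporary directory; no collection is actually created.
with tempfile.TemporaryDirectory() as tmp:
    c_name = os.path.join(tmp, "demo.ds")
    with collection_lock(c_name):
        try:
            with collection_lock(c_name):
                second_acquired = True
        except FileExistsError:
            second_acquired = False
    lock_released = not os.path.exists(c_name + ".lock")

print(second_acquired, lock_released)
```

A crashed process would leave a stale lock file behind, so a real deployment would need a cleanup policy on top of this.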

Install

Available via pip with `pip install py_dataset`, or by downloading this repo and running `python setup.py install`. This repo includes dataset shared C libraries compiled for Windows, Mac, and Linux, and the appropriate library will be used automatically.
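A quick way to confirm that the install worked is to ask Python whether the module can be found, without importing it. The helper name `is_installed` is hypothetical; the module name `py_dataset` is taken from the package name above.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(module_name) is not None


print("py_dataset importable:", is_installed("py_dataset"))
```

If this prints `False`, the package was likely installed into a different Python environment than the one running the check.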

Quick Tutorial

This module provides the functionality of the dataset command line tool as a Python 3.10 module. Once installed, try out the following commands to see if everything is in order (or to get familiar with dataset).

The "#" comments don't have to be typed in; they are there to explain the commands as you type them. Start the tour by launching Python3 in interactive mode.

```shell
python3
```

Then run the following Python commands.

```python
from py_dataset import dataset

# Almost all the commands require the collection name as the first parameter,
# so we store that name in c_name for convenience.
c_name = "a_tour_of_dataset.ds"

# Let's create our dataset collection. We use the method called
# 'init'; it returns True on success or False otherwise.
dataset.init(c_name)

# Let's check whether our collection exists: True if it exists,
# False if it doesn't.
dataset.status(c_name)

# Let's count the records in our collection (should be zero)
cnt = dataset.count(c_name)
print(cnt)

# Let's read all the keys in the collection (should be an empty list)
keys = dataset.keys(c_name)
print(keys)

# Now let's add a record to our collection. To create a record we need to know
# the collection name (e.g. c_name), the key (must be a string) and have a
# record (i.e. a dict literal or variable)
key = "one"
record = {"one": 1}
# If create returns False, we can check the last error message 
# with the 'error_message' method
if not dataset.create(c_name, key, record):
    print(dataset.error_message())

# Let's count and list the keys in our collection, we should see a count of '1' and a key of 'one'
dataset.count(c_name)
keys = dataset.keys(c_name)
print(keys)

# We can read the record we stored using the 'read' method.
new_record, err = dataset.read(c_name, key)
if err != '':
    print(err)
else:
    print(new_record)

# Let's modify new_record and update the record in our collection
new_record["two"] = 2
if not dataset.update(c_name, key, new_record):
    print(dataset.error_message())

# Let's print out the record we stored using the read method.
# read returns a tuple, so we print its first element.
print(dataset.read(c_name, key)[0])

# Now let's query the collection.
sql_stmt = f'''select src from {c_name} order by created desc'''
print(dataset.query(c_name, sql_stmt))

# Finally we can remove (delete) a record from our collection
if not dataset.delete(c_name, key):
    print(dataset.error_message())

# We should now have a count of zero records
cnt = dataset.count(c_name)
print(cnt)

```
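The tutorial's create and update calls both report failure by returning False and leaving a message available through `error_message`. A small helper can fold that pattern into one call. The sketch below is an assumption layered on the tutorial, not part of py_dataset: `save_record` is a hypothetical name, and `FakeDataset` is an in-memory stand-in that mimics only the four methods used here so the example runs without a collection on disc.

```python
def save_record(ds, c_name, key, record):
    """Create the record if the key is new, otherwise update it.

    `ds` is expected to behave like the `dataset` object used in the tutorial.
    Returns an empty string on success, or the reported error message.
    """
    if key in ds.keys(c_name):
        ok = ds.update(c_name, key, record)
    else:
        ok = ds.create(c_name, key, record)
    return "" if ok else ds.error_message()


class FakeDataset:
    """Hypothetical in-memory stand-in, used only to exercise save_record."""

    def __init__(self):
        self._collections = {}
        self._error = ""

    def keys(self, c_name):
        return list(self._collections.get(c_name, {}))

    def create(self, c_name, key, record):
        objects = self._collections.setdefault(c_name, {})
        if key in objects:
            self._error = f"{key} already exists"
            return False
        objects[key] = dict(record)
        return True

    def update(self, c_name, key, record):
        objects = self._collections.get(c_name, {})
        if key not in objects:
            self._error = f"{key} not found"
            return False
        objects[key] = dict(record)
        return True

    def error_message(self):
        return self._error


fake = FakeDataset()
first = save_record(fake, "demo.ds", "one", {"one": 1})
second = save_record(fake, "demo.ds", "one", {"one": 1, "two": 2})
print(first == "", second == "", fake.keys("demo.ds"))
```

With the real module, the same helper would presumably be called as `save_record(dataset, c_name, key, record)`. Listing all keys on every call is simple but may be slow for large collections.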

Owner

  • Name: Caltech Library
  • Login: caltechlibrary
  • Kind: organization
  • Email: helpdesk@library.caltech.edu
  • Location: Pasadena, CA 91125

We manage the physical and digital holdings of the California Institute of Technology, provide services and training, and develop open-source software.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
type: software
title: py_dataset
abstract: "A command line tool for working with JSON documents on local disc"
authors:
  - family-names: Doiel
    given-names: Robert
    orcid: https://orcid.org/0000-0003-0900-6903
    email: rsdoiel@caltech.edu
  - family-names: Morrell
    given-names: Thomas E
    orcid: https://orcid.org/0000-0001-9266-5146
    email: tmorrell@caltech.edu

contacts:
  - family-names: Doiel
    given-names: R. S.
    orcid: https://orcid.org/0000-0003-0900-6903
    email: rsdoiel@caltech.edu
  - family-names: Morrell
    given-names: Thomas E
    orcid: https://orcid.org/0000-0001-9266-5146
    email: tmorrell@caltech.edu

repository-code: "https://github.com/caltechlibrary/py_dataset"
version: 2.2.3.1
date-released: 2025-04-17

license-url: "https://github.com/caltechlibrary/py_dataset/blob/main/LICENSE"
keywords:
  - GitHub
  - metadata
  - data
  - software
  - json

CodeMeta (codemeta.json)

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "type": "SoftwareSourceCode",
  "codeRepository": "https://github.com/caltechlibrary/py_dataset",
  "author": [
    {
      "id": "https://orcid.org/0000-0003-0900-6903",
      "type": "Person",
      "givenName": "Robert",
      "familyName": "Doiel",
      "affiliation": {
        "@type": "Organization",
        "name": "Caltech Library"
      },
      "email": "rsdoiel@caltech.edu"
    },
    {
      "id": "https://orcid.org/0000-0001-9266-5146",
      "type": "Person",
      "givenName": "Thomas E",
      "familyName": "Morrell",
      "affiliation": {
        "@type": "Organization",
        "name": "Caltech Library"
      },
      "email": "tmorrell@caltech.edu"
    }
  ],
  "maintainer": [
    {
      "id": "https://orcid.org/0000-0003-0900-6903",
      "type": "Person",
      "givenName": "R. S.",
      "familyName": "Doiel",
      "affiliation": {
        "@type": "Organization",
        "name": "Caltech Library"
      },
      "email": "rsdoiel@caltech.edu"
    },
    {
      "id": "https://orcid.org/0000-0001-9266-5146",
      "type": "Person",
      "givenName": "Thomas E",
      "familyName": "Morrell",
      "affiliation": {
        "@type": "Organization",
        "name": "Caltech Library"
      },
      "email": "tmorrell@caltech.edu"
    }
  ],
  "dateCreated": "2017-06-18",
  "dateModified": "2025-04-17",
  "datePublished": "2025-04-17",
  "description": "A command line tool for working with JSON documents on local disc",
  "funder": [
    "Caltech Library"
  ],
  "keywords": [
    "GitHub",
    "metadata",
    "data",
    "software",
    "json"
  ],
  "name": "py_dataset",
  "license": "https://github.com/caltechlibrary/py_dataset/blob/main/LICENSE",
  "programmingLanguage": [
    "Python3"
  ],
  "softwareRequirements": [
    "dataset >= 2.2.3"
  ],
  "version": "2.2.3.1",
  "developmentStatus": "active",
  "issueTracker": "https://github.com/caltechlibrary/py_dataset/issues",
  "downloadUrl": "https://github.com/caltechlibrary/py_dataset/releases",
  "releaseNotes": "This patch adds missing dsquery support.",
  "copyrightYear": 2025,
  "copyrightHolder": "California Institute of Technology"
}

GitHub Events

Total
  • Release event: 3
  • Watch event: 1
  • Delete event: 1
  • Push event: 16
  • Create event: 4
Last Year
  • Release event: 3
  • Watch event: 1
  • Delete event: 1
  • Push event: 16
  • Create event: 4

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 93
  • Total Committers: 3
  • Avg Commits per committer: 31.0
  • Development Distribution Score (DDS): 0.226
Top Committers
| Name | Email | Commits |
|---|---|---|
| R. S. Doiel | r****l@g****m | 72 |
| Tom Morrell | t****l@c****u | 20 |
| tmorrell | t****l@u****m | 1 |

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 14
  • Total pull requests: 2
  • Average time to close issues: 26 days
  • Average time to close pull requests: 8 days
  • Total issue authors: 2
  • Total pull request authors: 1
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.5
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • rsdoiel (9)
  • tmorrell (4)
Pull Request Authors
  • rsdoiel (2)
Top Labels
Issue Labels
  • enhancement (4)
  • bug (3)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 458 last-month
  • Total docker downloads: 81
  • Total dependent packages: 1
  • Total dependent repositories: 1
  • Total versions: 7
  • Total maintainers: 1
pypi.org: py-dataset

A command line tool for working with JSON documents on local disc

  • Versions: 7
  • Dependent Packages: 1
  • Dependent Repositories: 1
  • Downloads: 458 Last month
  • Docker Downloads: 81
Rankings
Docker downloads count: 2.8%
Dependent packages count: 4.7%
Average: 16.1%
Downloads: 16.6%
Forks count: 19.1%
Dependent repos count: 21.6%
Stargazers count: 31.9%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/codemeta2cff.yml actions
  • EndBug/add-and-commit v7 composite
  • actions/checkout v2 composite
  • caltechlibrary/codemeta2cff main composite