py-dataset

Python package of dataset (https://github.com/caltechlibrary/dataset) for working with JSON objects as collections on disc

https://github.com/caltechlibrary/py_dataset

Science Score: 62.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
✓ Committers with academic emails: 1 of 3 committers (33.3%) from academic institutions
✓ Institutional organization owner: Organization caltechlibrary has institutional domain (www.library.caltech.edu)
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (16.3%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: caltechlibrary
License: other
Language: Python
Default Branch: main
Homepage: https://caltechlibrary.github.io/py_dataset
Size: 408 MB

Statistics

Stars: 2
Watchers: 5
Forks: 1
Open Issues: 3
Releases: 13

Created about 7 years ago · Last pushed about 1 year ago

Metadata Files

Readme License Citation Codemeta

py_dataset

py_dataset is a Python wrapper for the dataset command line tools. Starting with the dataset 2.2.x release, it replaces the deprecated libdataset, a C shared library.

This package wraps all dataset operations such as initialization of collections, creation, reading, updating and deleting JSON objects in the collection. Some of its enhanced features include the ability to generate data frames as well as the ability to import and export JSON objects to and from CSV files.
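To make the CSV export idea concrete, here is a minimal sketch that uses only the Python standard library. It does not call py_dataset's own export function, whose signature is not shown in this README; `objects_to_csv` and the sample records are hypothetical names chosen for illustration.

```python
import csv
import io


def objects_to_csv(objects, columns):
    """Render a dict of {key: JSON object} as CSV text.

    Hypothetical helper for illustration; not part of py_dataset.
    Fields missing from an object become empty cells.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # Header row: the object key followed by the requested columns.
    writer.writerow(["key"] + list(columns))
    for key in sorted(objects):
        obj = objects[key]
        writer.writerow([key] + [obj.get(col, "") for col in columns])
    return buf.getvalue()


# Sample records invented for this example.
records = {
    "one": {"title": "First", "year": 2017},
    "two": {"title": "Second"},
}
csv_text = objects_to_csv(records, ["title", "year"])
print(csv_text)
```

Each JSON object becomes one row, so only top-level fields map cleanly onto columns in this sketch.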

py_dataset is released under a BSD-style license.

Features

dataset supports:

Basic storage actions (create, read, update and delete)
Listing of collection keys (including filtering and sorting)
Import/export of CSV files
The ability to reshape data by performing simple object joins
The ability to create data frames from collections based on key lists and dot paths into the stored JSON objects

See docs for details.
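The data frame feature can be pictured with a small pure-Python sketch. It assumes a simplified reading of "dot path" (a dot-separated chain of object fields, with no list indexing) and does not use py_dataset's frame functions, which are not shown in this README; `resolve_dot_path`, `build_frame` and the sample collection are hypothetical.

```python
def resolve_dot_path(obj, dot_path):
    """Follow a dot path such as '.author.name' into nested dicts.

    Returns None when any step is missing. Simplified, hypothetical
    interpretation of a dot path; list indexes are not handled.
    """
    current = obj
    for part in dot_path.strip(".").split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def build_frame(objects, keys, columns):
    """Build flat rows for the selected keys.

    `columns` maps an output label to a dot path into each object.
    """
    rows = []
    for key in keys:
        obj = objects[key]
        rows.append(
            {label: resolve_dot_path(obj, path) for label, path in columns.items()}
        )
    return rows


# Sample collection invented for this example.
collection = {
    "one": {"title": "First", "author": {"name": "Doe"}},
    "two": {"title": "Second"},
}
frame = build_frame(
    collection,
    ["one", "two"],
    {"title": ".title", "author": ".author.name"},
)
print(frame)
```

The key list selects which objects contribute rows, and the dot paths select which nested values become columns.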

Limitations of dataset

dataset has many limitations; some are listed below:

it is not a multi-process, multi-user data store (it's files on "disc" without locking)
it is not a replacement for a repository management system
it is not a general purpose database system
it does not supply version control on collections or objects

Install

Available via pip (`pip install py_dataset`) or by downloading this repo and running `python setup.py install`. This repo includes dataset shared C libraries compiled for Windows, Mac, and Linux, and the appropriate library will be used automatically.
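One way to confirm the install succeeded is to ask Python whether it can locate the module. This is a minimal sketch using the standard library; it assumes the import name is `py_dataset`, matching the package name, and `is_installed` is a hypothetical helper.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: the import name matches the package name.
print("py_dataset installed:", is_installed("py_dataset"))
```

The check only locates the module; it does not verify that the underlying dataset tools work.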

Quick Tutorial

This module provides the functionality of the dataset command line tool as a Python 3.10 module. Once installed, try out the following commands to see if everything is in order (or to get familiar with dataset).

The "#" comments don't have to be typed in; they are there to explain the commands as you type them. Start the tour by launching Python 3 in interactive mode.

```shell
python3
```

Then run the following Python commands.

```python
from py_dataset import dataset

# Almost all the commands require the collection name as the first parameter,
# so we store that name in c_name for convenience.
c_name = "a_tour_of_dataset.ds"

# Let's create our dataset collection. We use the method called
# 'init'; it returns True on success or False otherwise.
dataset.init(c_name)

# Let's check whether our collection exists: True if it exists,
# False if it doesn't.
dataset.status(c_name)

# Let's count the records in our collection (should be zero)
cnt = dataset.count(c_name)
print(cnt)

# Let's read all the keys in the collection (should be an empty list)
keys = dataset.keys(c_name)
print(keys)

# Now let's add a record to our collection. To create a record we need to know
# the collection name (e.g. c_name), the key (must be a string) and have a
# record (i.e. a dict literal or variable)
key = "one"
record = {"one": 1}
# If create returns False, we can check the last error message 
# with the 'error_message' method
if not dataset.create(c_name, key, record):
    print(dataset.error_message())

# Let's count and list the keys in our collection, we should see a count of '1' and a key of 'one'
dataset.count(c_name)
keys = dataset.keys(c_name)
print(keys)

# We can read the record we stored using the 'read' method.
new_record, err = dataset.read(c_name, key)
if err != '':
    print(err)
else:
    print(new_record)

# Let's modify new_record and update the record in our collection
new_record["two"] = 2
if not dataset.update(c_name, key, new_record):
    print(dataset.error_message())

# Let's print out the record we stored using the read method.
# read returns a tuple, so we print its first element.
print(dataset.read(c_name, key)[0])

# Now let's query the collection.
sql_stmt = f'''select src from {c_name} order by created desc'''
print(dataset.query(c_name, sql_stmt))

# Finally we can remove (delete) a record from our collection
if not dataset.delete(c_name, key):
    print(dataset.error_message())

# We should now have a count of zero records
cnt = dataset.count(c_name)
print(cnt)

```
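The create and update steps in the tour can be folded into one "save" helper. The sketch below is written against a small in-memory stand-in so it runs without the dataset tools installed. `InMemoryDataset` and `save_record` are hypothetical; the stand-in mirrors only the calls and return shapes used in the tour, and its rule that `create` rejects an existing key is an assumption, not confirmed behaviour of py_dataset.

```python
class InMemoryDataset:
    """Hypothetical stand-in mimicking only the calls used in the tour.

    Not py_dataset's implementation; it exists so the sketch is runnable.
    """

    def __init__(self):
        self._collections = {}
        self._last_error = ""

    def init(self, c_name):
        self._collections.setdefault(c_name, {})
        return True

    def keys(self, c_name):
        return list(self._collections.get(c_name, {}))

    def create(self, c_name, key, record):
        # Assumption: creating an existing key fails.
        if key in self._collections[c_name]:
            self._last_error = f"{key} already exists"
            return False
        self._collections[c_name][key] = dict(record)
        return True

    def update(self, c_name, key, record):
        if key not in self._collections[c_name]:
            self._last_error = f"{key} not found"
            return False
        self._collections[c_name][key] = dict(record)
        return True

    def read(self, c_name, key):
        # Returns (record, error string), like the tuple shown in the tour.
        if key not in self._collections[c_name]:
            return {}, f"{key} not found"
        return dict(self._collections[c_name][key]), ""

    def error_message(self):
        return self._last_error


def save_record(ds, c_name, key, record):
    """Create the record if the key is new, otherwise update it.

    Raises RuntimeError carrying the last error message on failure.
    """
    if key in ds.keys(c_name):
        ok = ds.update(c_name, key, record)
    else:
        ok = ds.create(c_name, key, record)
    if not ok:
        raise RuntimeError(ds.error_message())


ds = InMemoryDataset()
c_name = "example.ds"
ds.init(c_name)
save_record(ds, c_name, "one", {"one": 1})              # key is new: create
save_record(ds, c_name, "one", {"one": 1, "two": 2})    # key exists: update
record, err = ds.read(c_name, "one")
print(record, err)
```

Because `save_record` takes the dataset object as a parameter, the same function could in principle be passed the imported `dataset` module, provided its calls behave as the tour describes.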

Owner

Name: Caltech Library
Login: caltechlibrary
Kind: organization
Email: helpdesk@library.caltech.edu
Location: Pasadena, CA 91125

Website: https://www.library.caltech.edu/
Repositories: 84
Profile: https://github.com/caltechlibrary

We manage the physical and digital holdings of the California Institute of Technology, provide services and training, and develop open-source software.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
type: software
title: py_dataset
abstract: "A command line tool for working with JSON documents on local disc"
authors:
  - family-names: Doiel
    given-names: Robert
    orcid: https://orcid.org/0000-0003-0900-6903
    email: rsdoiel@caltech.edu
  - family-names: Morrell
    given-names: Thomas E
    orcid: https://orcid.org/0000-0001-9266-5146
    email: tmorrell@caltech.edu

contacts:
  - family-names: Doiel
    given-names: R. S.
    orcid: https://orcid.org/0000-0003-0900-6903
    email: rsdoiel@caltech.edu
  - family-names: Morrell
    given-names: Thomas E
    orcid: https://orcid.org/0000-0001-9266-5146
    email: tmorrell@caltech.edu

repository-code: "https://github.com/caltechlibrary/py_dataset"
version: 2.2.3.1
date-released: 2025-04-17

license-url: "https://github.com/caltechlibrary/py_dataset/blob/main/LICENSE"
keywords:
  - GitHub
  - metadata
  - data
  - software
  - json

CodeMeta (codemeta.json)

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "type": "SoftwareSourceCode",
  "codeRepository": "https://github.com/caltechlibrary/py_dataset",
  "author": [
    {
      "id": "https://orcid.org/0000-0003-0900-6903",
      "type": "Person",
      "givenName": "Robert",
      "familyName": "Doiel",
      "affiliation": {
        "@type": "Organization",
        "name": "Caltech Library"
      },
      "email": "rsdoiel@caltech.edu"
    },
    {
      "id": "https://orcid.org/0000-0001-9266-5146",
      "type": "Person",
      "givenName": "Thomas E",
      "familyName": "Morrell",
      "affiliation": {
        "@type": "Organization",
        "name": "Caltech Library"
      },
      "email": "tmorrell@caltech.edu"
    }
  ],
  "maintainer": [
    {
      "id": "https://orcid.org/0000-0003-0900-6903",
      "type": "Person",
      "givenName": "R. S.",
      "familyName": "Doiel",
      "affiliation": {
        "@type": "Organization",
        "name": "Caltech Library"
      },
      "email": "rsdoiel@caltech.edu"
    },
    {
      "id": "https://orcid.org/0000-0001-9266-5146",
      "type": "Person",
      "givenName": "Thomas E",
      "familyName": "Morrell",
      "affiliation": {
        "@type": "Organization",
        "name": "Caltech Library"
      },
      "email": "tmorrell@caltech.edu"
    }
  ],
  "dateCreated": "2017-06-18",
  "dateModified": "2025-04-17",
  "datePublished": "2025-04-17",
  "description": "A command line tool for working with JSON documents on local disc",
  "funder": [
    "Caltech Library"
  ],
  "keywords": [
    "GitHub",
    "metadata",
    "data",
    "software",
    "json"
  ],
  "name": "py_dataset",
  "license": "https://github.com/caltechlibrary/py_dataset/blob/main/LICENSE",
  "programmingLanguage": [
    "Python3"
  ],
  "softwareRequirements": [
    "dataset >= 2.2.3"
  ],
  "version": "2.2.3.1",
  "developmentStatus": "active",
  "issueTracker": "https://github.com/caltechlibrary/py_dataset/issues",
  "downloadUrl": "https://github.com/caltechlibrary/py_dataset/releases",
  "releaseNotes": "This patch adds missing dsquery support.",
  "copyrightYear": 2025,
  "copyrightHolder": "California Institute of Technology"
}

GitHub Events

Total

Release event: 3
Watch event: 1
Delete event: 1
Push event: 16
Create event: 4

Last Year

Release event: 3
Watch event: 1
Delete event: 1
Push event: 16
Create event: 4

Committers

Last synced: about 3 years ago

All Time

Total Commits: 93
Total Committers: 3
Avg Commits per committer: 31.0
Development Distribution Score (DDS): 0.226

Top Committers

Name	Email	Commits
R. S. Doiel	r**l@g**m	72
Tom Morrell	t**l@c**u	20
tmorrell	t**l@u**m	1

Committer Domains (Top 20 + Academic)

caltech.edu: 1

Issues and Pull Requests

Last synced: 9 months ago

All Time

Total issues: 14
Total pull requests: 2
Average time to close issues: 26 days
Average time to close pull requests: 8 days
Total issue authors: 2
Total pull request authors: 1
Average comments per issue: 1.0
Average comments per pull request: 0.5
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

rsdoiel (9)
tmorrell (4)

Pull Request Authors

rsdoiel (2)

Top Labels

Issue Labels

enhancement (4) bug (3)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 458 last-month
Total docker downloads: 81

Total dependent packages: 1
Total dependent repositories: 1
Total versions: 7
Total maintainers: 1

pypi.org: py-dataset

A command line tool for working with JSON documents on local disc

Homepage: https://github.com/caltechlibrary/py_dataset
Documentation: https://py-dataset.readthedocs.io/
License: https://data.caltech.edu/license
Latest release: 1.0.1
published almost 5 years ago

Versions: 7
Dependent Packages: 1
Dependent Repositories: 1
Downloads: 458 Last month
Docker Downloads: 81

Rankings

Docker downloads count: 2.8%

Dependent packages count: 4.7%

Average: 16.1%

Downloads: 16.6%

Forks count: 19.1%

Dependent repos count: 21.6%

Stargazers count: 31.9%

Maintainers (1)

rsdoiel

Last synced: 9 months ago

Dependencies

.github/workflows/codemeta2cff.yml actions

EndBug/add-and-commit v7 composite
actions/checkout v2 composite
caltechlibrary/codemeta2cff main composite
