pytacite

A flexible and lightweight Python interface to the DataCite database (datacite.org)

https://github.com/j535d165/pytacite

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.5%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: J535D165
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 35.2 KB
Statistics
  • Stars: 3
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Created over 2 years ago · Last pushed over 2 years ago

README.md

pytacite

Pytacite is a Python library for DataCite. DataCite is a non-profit organisation that provides persistent identifiers (DOIs) for research data and other research outputs. It holds a large index of metadata of outputs. DataCite offers an open and free REST API to query metadata. Pytacite is a lightweight and thin Python interface to this API. Pytacite aims to stay as close as possible to the design of the original service.

The following features of DataCite are currently supported by pytacite:

  • [x] Get single entities
  • [x] Filter and query entities
  • [x] Sort entities
  • [x] Sample entities
  • [x] Pagination
  • [ ] Usage reports
  • [ ] Authentication
  • [ ] Side-load associations with include

We aim to cover the entire API and are looking for help. Pull requests are welcome.

Key features

  • Pipe operations - pytacite can handle multiple operations in a sequence. This allows the developer to write understandable queries. For examples, see code snippets.
  • Permissive license - DataCite data is CC0 licensed 🙌. pytacite is published under the MIT license.

Installation

pytacite requires Python 3.8 or later.

```sh
pip install pytacite
```

Getting started

Pytacite offers support for: DOIs, Clients, ClientPrefixes, Events, Prefixes, Providers, and ProviderPrefixes.

```python
from pytacite import DOIs, Clients, Events, Prefixes, ClientPrefixes, Providers, ProviderPrefixes
```

Get single entity

Get a single DOI, Event, Prefix, or ProviderPrefix from DataCite by its id.

```python
DOIs()["10.14454/FXWS-0523"]
```

The result is a DOI object, which is very similar to a dictionary. Find the available fields with .keys(). Most interesting attributes are stored in the "attributes" field.

For example, get the titles:

```python
DOIs()["10.14454/FXWS-0523"]["attributes"]["titles"]
```

```python
[{'title': 'DataCite Metadata Schema for the Publication and Citation of Research Data and Other Research Outputs v4.4'}]
```
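Because the record behaves like a dictionary, plain dictionary and list operations are enough to pull out individual values. The sketch below works on a hand-written stand-in for the record (the `record` dict and the helper `title_strings` are illustrative assumptions that mirror only the fields shown above), so it runs without network access.

```python
# Stand-in for DOIs()["10.14454/FXWS-0523"]; only the fields shown above are mirrored.
record = {
    "attributes": {
        "titles": [
            {
                "title": "DataCite Metadata Schema for the Publication and "
                "Citation of Research Data and Other Research Outputs v4.4"
            }
        ]
    }
}


def title_strings(rec):
    """Return the plain title strings of a DOI-like record (empty list if absent)."""
    return [t["title"] for t in rec.get("attributes", {}).get("titles", [])]


print(title_strings(record))
```

Using `.get()` with defaults keeps the helper from raising when a record has no titles.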

It works similarly for other resource collections.

```python
Prefixes()["10.12682"]
Events()["9a34e232-5b30-453b-a393-ea10a6ce565d"]
```

Get lists of entities

```python
results = DOIs().get()
```

For lists of entities, you can also count the number of records found instead of returning the results. This also works for search queries and filters.

```python
DOIs().count()
# 50869984
```

For lists of entities, you can return the result as well as the metadata. By default, only the results are returned.

```python
results, meta = DOIs().get(return_meta=True)
```

```python
print(meta)
# {'total': 50869984,
#  'totalPages': 400,
#  'page': 1,
#  'states': [{'id': 'findable', 'title': 'Findable', 'count': 50869984}],
#  'resourceTypes': [{'id': 'dataset', 'title': 'Dataset', 'count': 15426144}, <...>]
#  <...>
#  'subjects': [{'id': 'FOS: Biological sciences', 'title': 'Fos: Biological Sciences', 'count': 3304486}, <...>],
#  'citations': [],
#  'views': [],
#  'downloads': []}
```
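The facet lists in the metadata (such as resourceTypes and subjects) are lists of small dictionaries, so they are easy to turn into lookup tables. The following is a minimal sketch on an abridged, hand-copied version of the output above; the helper name `facet_counts` is hypothetical and not part of pytacite.

```python
# Abridged stand-in for the `meta` dict printed above (numbers copied from that output).
meta = {
    "total": 50869984,
    "totalPages": 400,
    "page": 1,
    "resourceTypes": [{"id": "dataset", "title": "Dataset", "count": 15426144}],
}


def facet_counts(meta, facet):
    """Map each facet id to its count, e.g. for 'resourceTypes' or 'subjects'."""
    return {item["id"]: item["count"] for item in meta.get(facet, [])}


counts = facet_counts(meta, "resourceTypes")
share = counts["dataset"] / meta["total"]  # fraction of all records that are datasets
print(f"datasets: {counts['dataset']} ({share:.1%} of all records)")
```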

Filters and queries

DataCite makes use of filters and queries. Filters narrow down the results of a request, and queries search within fields. See:

  • Filtering: https://support.datacite.org/docs/api-queries#filtering-list-responses
  • Making Queries: https://support.datacite.org/docs/api-queries#making-queries

The following example returns records created in the year 2020 on Dryad.

```python
DOIs().filter(created=2020, client_id="dryad.dryad").get()
```

which is identical to:

```python
DOIs().filter(created=2020).filter(client_id="dryad.dryad").get()
```

Queries can work in a similar fashion and can be applied to all fields. For example, search for records with climate change in the title.

```python
DOIs().query("climate change").get()
```

Note that this returns all DOI records that contain both the words climate and change somewhere in their metadata, not only records with the exact phrase in the title (this may be a mistake in the DataCite documentation).
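One way to make the difference between "both words" and "exact phrase" visible is to look at the query string that would be sent to the API. The sketch below uses only the standard library, not pytacite. The endpoint is DataCite's public REST API; the helper name `dois_url` is hypothetical, and the idea that wrapping the text in double quotes requests an exact phrase is an assumption based on DataCite's query documentation (see Making Queries above), not something verified here.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.datacite.org/dois"


def dois_url(query):
    """Build a DOI search URL for a free-text query (illustrative helper)."""
    return f"{BASE_URL}?{urlencode({'query': query})}"


# Two separate words: matches records containing both terms anywhere.
words_url = dois_url("climate change")

# Assumed exact-phrase form: the double quotes are part of the query text.
phrase_url = dois_url('"climate change"')

print(words_url)
print(phrase_url)
```

With pytacite, the quoted form would presumably be passed in the same way, as the string given to `query()`.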

Nested attribute filters

Some attribute filters are nested and separated with dots by DataCite. For example, filter on creators.nameIdentifiers.nameIdentifierScheme.

In case of nested attribute filters, use a dict to build the query.

```python
DOIs() \
    .query(creators={"nameIdentifiers": {"nameIdentifierScheme": "ORCID"}}) \
    .query(publicationYear=2016) \
    .query(language="es") \
    .count()
# 562
```
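To see how a nested dict corresponds to DataCite's dotted attribute names, the following sketch flattens a dict into dotted keys. It illustrates the mapping only; it is not taken from pytacite's source, and the function name `flatten` is an assumption.

```python
def flatten(nested, prefix=""):
    """Flatten nested dicts into a flat dict with dot-separated keys."""
    flat = {}
    for key, value in nested.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested dicts, extending the dotted path.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


fields = flatten(
    {
        "creators": {"nameIdentifiers": {"nameIdentifierScheme": "ORCID"}},
        "publicationYear": 2016,
    }
)
for name, value in fields.items():
    print(f"{name}:{value}")
```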

Sort entity lists

```python
Clients().sort("created", ascending=True).get()
```

Logical expressions

See DataCite on logical operators like AND, OR, and NOT.
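As a rough illustration, logical expressions can be assembled as plain strings before they are handed to a query. The helpers below (`all_of`, `any_of`, `negate`) are hypothetical and not part of pytacite, and the `field:value` term syntax is an assumption based on DataCite's query documentation.

```python
def all_of(*terms):
    """Combine terms so that every one of them must match."""
    return "(" + " AND ".join(terms) + ")"


def any_of(*terms):
    """Combine terms so that at least one of them must match."""
    return "(" + " OR ".join(terms) + ")"


def negate(term):
    """Exclude records matching the term."""
    return f"NOT {term}"


expr = all_of(
    "climate",
    any_of("language:es", "language:pt"),
    negate("publicationYear:2016"),
)
print(expr)
```

The resulting string could then be tried as the argument of `DOIs().query(...)`, in the same way as the free-text example earlier.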

Paging

DataCite offers two methods for paging: basic paging and cursor paging. Both methods are supported by pytacite.

Basic (offset) paging

Only the first 10,000 records can be retrieved with basic (offset) paging.

```python
pager = DOIs().filter(prefix="10.5438").paginate(method="number", per_page=100)

for page in pager:
    print(len(page))
```
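The 10,000-record limit translates directly into a maximum number of reachable pages. The arithmetic can be sketched as follows; the function names are illustrative only. Assuming a page size of 25, the result for the full index is 400, which agrees with the totalPages value in the metadata output shown earlier.

```python
import math

MAX_OFFSET_RECORDS = 10_000  # limit for basic (offset) paging stated above


def reachable_pages(total, per_page):
    """Number of pages that offset paging can reach for a result set."""
    reachable = min(total, MAX_OFFSET_RECORDS)
    return math.ceil(reachable / per_page)


def needs_cursor_paging(total):
    """True if the result set is too large to be read fully with offset paging."""
    return total > MAX_OFFSET_RECORDS


print(reachable_pages(50869984, 100))
print(needs_cursor_paging(50869984))
```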

Cursor paging

Use paginate() for paging results. By default, the n_max argument of paginate() is set to 10000. Use n_max=None to retrieve all results.

```python
pager = DOIs().filter(prefix="10.5438").paginate(per_page=100)

for page in pager:
    print(len(page))
```
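The general shape of cursor paging with an upper bound on the number of records can be sketched with a fake data source. This is not pytacite's implementation: `paginate` and `fake_fetch` below are stand-ins, the cursor is a simple list index rather than the opaque token a real API returns, and truncating the last page at exactly n_max records is an assumption.

```python
# Fake data source standing in for the API: 250 records.
DATA = [{"id": f"rec-{i}"} for i in range(250)]


def fake_fetch(cursor, per_page):
    """Return (records, next_cursor); next_cursor is None on the last page."""
    chunk = DATA[cursor : cursor + per_page]
    nxt = cursor + per_page
    return chunk, (nxt if nxt < len(DATA) else None)


def paginate(fetch_page, per_page=100, n_max=10_000, start=0):
    """Yield pages until the source is exhausted or n_max records were yielded.

    n_max=None removes the upper bound.
    """
    cursor, seen = start, 0
    while cursor is not None and (n_max is None or seen < n_max):
        records, cursor = fetch_page(cursor, per_page)
        if n_max is not None:
            records = records[: n_max - seen]  # trim the final page to the limit
        if not records:
            break
        seen += len(records)
        yield records


for page in paginate(fake_fetch, per_page=100, n_max=None):
    print(len(page))
```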

Looking for an easy method to iterate the records of a pager?

```python
from itertools import chain
from pytacite import DOIs

query = DOIs().filter(prefix="10.5438")

for record in chain(*query.paginate(per_page=100)):
    print(record["id"])
```
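One detail of the standard library is worth knowing here: `chain(*pages)` unpacks the whole iterable before the loop starts, so every page is requested up front, whereas `chain.from_iterable(pages)` pulls pages one at a time. The sketch below demonstrates the lazy variant with a fake pager (`fake_pager` is a stand-in for `query.paginate()`, not pytacite code).

```python
from itertools import chain

fetched = []  # page numbers requested so far, to make the laziness visible


def fake_pager(n_pages=3, per_page=2):
    """Stand-in for query.paginate(): yields lists of records, page by page."""
    for page_no in range(n_pages):
        fetched.append(page_no)
        yield [{"id": f"rec-{page_no}-{i}"} for i in range(per_page)]


records = chain.from_iterable(fake_pager())

first = next(records)  # only the first page has been requested at this point
print(first["id"], fetched)
```

For large result sets, the lazy form avoids holding all pages in memory at once.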

Get random DOIs

Get random DOIs. Note that this request has very slow response times, which is caused by DataCite.

```python
DOIs().random().get(per_page=10)
```

Code snippets

A list of awesome use cases of the DataCite dataset.

Creators of a dataset

```python
from pytacite import DOIs

w = DOIs()["10.34894/HE6NAQ"]

w["attributes"]["creators"]
```
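The creators field is a list of dictionaries, each with a name and an optional list of name identifiers. A minimal sketch for pairing each creator with an ORCID is shown below. The sample data is made up for illustration (it is not the content of the record above), and the field names follow the DataCite metadata schema as used elsewhere in this README.

```python
# Hypothetical creators list; names and identifiers are placeholders.
creators = [
    {
        "name": "Doe, Jane",
        "nameIdentifiers": [
            {
                "nameIdentifier": "https://orcid.org/0000-0000-0000-0000",
                "nameIdentifierScheme": "ORCID",
            }
        ],
    },
    {"name": "Example Lab", "nameIdentifiers": []},
]


def creator_orcids(creators):
    """Return (name, ORCID or None) pairs for a list of creator dicts."""
    pairs = []
    for creator in creators:
        orcid = next(
            (
                ni["nameIdentifier"]
                for ni in creator.get("nameIdentifiers", [])
                if ni.get("nameIdentifierScheme") == "ORCID"
            ),
            None,  # creator has no ORCID identifier
        )
        pairs.append((creator.get("name"), orcid))
    return pairs


for name, orcid in creator_orcids(creators):
    print(name, orcid)
```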

Get the works of a single creator

Work in progress: the identifier currently has to be wrapped in escaped double quotes; the aim is to remove this requirement.

```python
DOIs() \
    .query(creators={"nameIdentifiers": {"nameIdentifier": "\"https://orcid.org/0000-0001-7736-2091\""}}) \
    .get()
```
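Until the quoting is handled by the library, a small helper can keep the escaped quotes out of the query itself. The helper name `quoted` is hypothetical; the sketch only builds the keyword arguments and does not call the API.

```python
def quoted(value):
    """Wrap a value in double quotes so it is sent as one exact term."""
    return f'"{value}"'


orcid = "https://orcid.org/0000-0001-7736-2091"

# Same nested structure as in the query above, without hand-written escapes.
query_kwargs = {"creators": {"nameIdentifiers": {"nameIdentifier": quoted(orcid)}}}
print(query_kwargs)
```

The dict can then be passed on as `DOIs().query(**query_kwargs).get()`.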

Software published on Zenodo in 2016

Resources:

  • https://support.datacite.org/reference/getclients
  • https://support.datacite.org/reference/getdois

Get the DataCite identifier of the client first:

```python
from pytacite import Clients

c = Clients().query("Zenodo").get()
print(c[0]["id"])
# cern.zenodo
```

Filter the DOIs on the client identifier. It can be a bit confusing when to use filter and query here.

```python
DOIs() \
    .filter(client_id=c[0]["id"]) \
    .filter(resource_type_id="software") \
    .query(publicationYear=2016) \
    .count()
# 9720
```

Number of repositories running on Dataverse software

```python
from pytacite import Clients

Clients() \
    .filter(software="dataverse") \
    .count()
# 31
```

Alternatives

datacite is a nice Python wrapper for the Metadata Store API, which is not covered by pytacite.

R users can use the RDataCite library.

License

MIT

Contact

This library is a community contribution. The authors of this Python library aren't affiliated with DataCite.

Feel free to reach out with questions, remarks, and suggestions. The issue tracker is a good starting point. You can also email me at jonathandebruinos@gmail.com.

Owner

  • Name: Jonathan de Bruin
  • Login: J535D165
  • Kind: user
  • Location: Netherlands
  • Company: Utrecht University

Research engineer working on software, datasets, and tools to advance open science 👐 @UtrechtUniversity @asreview

Citation (CITATION.cff)

cff-version: 1.2.0
title: >-
  pytacite - A flexible and lightweight Python interface to
  the DataCite database
message: 'We appreciate, but do not require, attribution.'
type: software
authors:
  - family-names: De Bruin
    given-names: Jonathan
    orcid: 'https://orcid.org/0000-0002-4297-0502'
repository-code: 'https://github.com/J535D165/pytacite'
url: 'https://github.com/J535D165/pytacite'
repository-artifact: 'https://pypi.org/project/pytacite/'
keywords:
  - data-repository
  - data-management
  - research
  - dataset
  - research-software
  - metadata
license: MIT

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 16
  • Total Committers: 1
  • Avg Commits per committer: 16.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 16
  • Committers: 1
  • Avg Commits per committer: 16.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Jonathan de Bruin j****s@g****m 16

Issues and Pull Requests

Last synced: 12 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0