pytacite
A flexible and lightweight Python interface to the DataCite database (datacite.org)
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 4 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.5%) to scientific vocabulary
Repository
Statistics
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 2
Metadata Files
README.md
pytacite
Pytacite is a Python library for DataCite. DataCite is a non-profit organisation that provides persistent identifiers (DOIs) for research data and other research outputs. It holds a large index of metadata of outputs. DataCite offers an open and free REST API to query metadata. Pytacite is a lightweight and thin Python interface to this API. Pytacite aims to stay as close as possible to the design of the original service.
The following features of DataCite are currently supported by pytacite:
- [x] Get single entities
- [x] Filter and query entities
- [x] Sort entities
- [x] Sample entities
- [x] Pagination
- [ ] Usage reports
- [ ] Authentication
- [ ] Side-load associations with include
We aim to cover the entire API, and we are looking for help. Pull requests are welcome.
Key features
- Pipe operations - pytacite can handle multiple operations in a sequence. This allows the developer to write understandable queries. For examples, see code snippets.
- Permissive license - DataCite data is CC0 licensed :raised_hands:. pytacite is published under the MIT license.
Installation
pytacite requires Python 3.8 or later.
```sh
pip install pytacite
```
Getting started
Pytacite offers support for: DOIs, Clients, ClientPrefixes, Events, Prefixes, Providers, and ProviderPrefixes.
```python
from pytacite import DOIs, Clients, Events, Prefixes, ClientPrefixes, Providers, ProviderPrefixes
```
Get single entity
Get a single DOI, Event, Prefix, or ProviderPrefix from DataCite by its id.
```python
DOIs()["10.14454/FXWS-0523"]
```
The result is a DOI object, which is very similar to a dictionary. Find the
available fields with .keys(). Most interesting attributes are stored in
the "attributes" field.
For example, get the titles:
```python
DOIs()["10.14454/FXWS-0523"]["attributes"]["titles"]
```
```python
[{'title': 'DataCite Metadata Schema for the Publication and Citation of Research Data and Other Research Outputs v4.4'}]
```
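Because a DOI object behaves much like a dictionary, the titles can be extracted with plain dictionary access. The following is a minimal sketch, assuming a record shaped like the output above. `get_titles` is a hypothetical helper, not part of pytacite, and the sample record is trimmed to the one field it needs so that the example runs offline.

```python
def get_titles(record):
    """Return the title strings of a DOI record, or an empty list if none exist."""
    titles = record.get("attributes", {}).get("titles") or []
    return [entry["title"] for entry in titles if "title" in entry]


# Trimmed stand-in for DOIs()["10.14454/FXWS-0523"]; only the titles are kept.
record = {
    "attributes": {
        "titles": [
            {
                "title": "DataCite Metadata Schema for the Publication and "
                "Citation of Research Data and Other Research Outputs v4.4"
            }
        ]
    }
}

print(get_titles(record))
```

Using `.get()` with defaults keeps the helper from raising a `KeyError` on records without titles.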
It works similarly for other resource collections.
```python
Prefixes()["10.12682"]
Events()["9a34e232-5b30-453b-a393-ea10a6ce565d"]
```
Get lists of entities
```python
results = DOIs().get()
```
For lists of entities, you can also count the number of records found
instead of returning the results. This also works for search queries and
filters.
```python
DOIs().count()
```
```python
50869984
```
For lists of entities, you can return the result as well as the metadata. By default, only the results are returned.
```python
results, meta = DOIs().get(return_meta=True)
```
```python
print(meta)
{'total': 50869984,
 'totalPages': 400,
 'page': 1,
 'states': [{'id': 'findable', 'title': 'Findable', 'count': 50869984}],
 'resourceTypes': [{'id': 'dataset', 'title': 'Dataset', 'count': 15426144}, <...>]
 <...>
 'subjects': [{'id': 'FOS: Biological sciences',
               'title': 'Fos: Biological Sciences',
               'count': 3304486}, <...>],
 'citations': [],
 'views': [],
 'downloads': []}
```
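The facet lists in the metadata (such as `resourceTypes`) are lists of dictionaries, which are easier to work with once converted to a lookup table. A minimal sketch, assuming the metadata has the shape printed above. `facet_counts` is a hypothetical helper, not part of pytacite, and the `meta` dictionary below is a trimmed copy of that output.

```python
def facet_counts(meta, facet):
    """Map each facet id to its count, e.g. 'dataset' -> 15426144."""
    return {item["id"]: item["count"] for item in meta.get(facet, [])}


# Trimmed copy of the metadata shown above, so the sketch runs offline.
meta = {
    "total": 50869984,
    "resourceTypes": [{"id": "dataset", "title": "Dataset", "count": 15426144}],
}

counts = facet_counts(meta, "resourceTypes")
share = counts["dataset"] / meta["total"]
print(f"Datasets make up {share:.1%} of all records")
```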
Filters and queries
DataCite makes use of filters and queries. Filters narrow down the set of
records that is returned, and queries search within fields. See:
- Filtering: https://support.datacite.org/docs/api-queries#filtering-list-responses
- Making Queries: https://support.datacite.org/docs/api-queries#making-queries
The following example returns records created in the year 2020 on Dryad.
```python
DOIs().filter(created=2020, client_id="dryad.dryad").get()
```
which is identical to:
```python
DOIs().filter(created=2020).filter(client_id="dryad.dryad").get()
```
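One way to see why both forms give the same result is to think of every `filter()` call as merging its keyword arguments into the filters collected so far. The sketch below illustrates that idea only. `FilterChain` is a hypothetical class written for this example and is not pytacite's implementation.

```python
class FilterChain:
    """Hypothetical stand-in for a query object; NOT pytacite's implementation."""

    def __init__(self, filters=None):
        self.filters = dict(filters or {})

    def filter(self, **kwargs):
        # Return a new object so each call extends, rather than replaces,
        # the filters collected so far.
        return FilterChain({**self.filters, **kwargs})


combined = FilterChain().filter(created=2020, client_id="dryad.dryad")
chained = FilterChain().filter(created=2020).filter(client_id="dryad.dryad")

print(combined.filters == chained.filters)
```

Returning a new object from each call means a partially built query can be reused as a starting point for several variants.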
Queries work in a similar fashion and can be applied to all fields. For example, search for records that mention climate change.
```python
DOIs().query("climate change").get()
```
Note that this returns all DOI records that contain the words climate and change anywhere in their metadata, not only in the title (potential mistake in DataCite documentation).
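To make the difference between filters and queries more concrete, the sketch below builds the kind of request URL that such a call might translate to. It rests on several assumptions that are not confirmed by this README: the endpoint `https://api.datacite.org/dois`, the `query` parameter name, and the mapping of snake_case keyword names to kebab-case parameters such as `client-id`. `build_url` is a hypothetical helper and not part of pytacite.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.datacite.org/dois"  # assumed public endpoint


def build_url(query=None, **filters):
    """Build a request URL from a free-text query and keyword filters."""
    params = {}
    if query is not None:
        params["query"] = query
    for key, value in filters.items():
        # Assumption: snake_case keyword names map to kebab-case API parameters.
        params[key.replace("_", "-")] = value
    return f"{BASE_URL}?{urlencode(params)}"


print(build_url("climate change", created=2020, client_id="dryad.dryad"))
```

The query ends up in a single free-text parameter, while each filter becomes its own parameter.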
Nested attribute filters
Some attribute filters are nested and separated with dots by DataCite. For
example, filter on creators.nameIdentifiers.nameIdentifierScheme.
In case of nested attribute filters, use a dict to build the query.
```python
DOIs() \
    .query(creators={"nameIdentifiers": {"nameIdentifierScheme": "ORCID"}}) \
    .query(publicationYear=2016) \
    .query(language="es") \
    .count()
```
```python
562
```
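The relation between the nested dict and the dotted attribute name can be shown with a small recursive function. This is a sketch of the idea only. `flatten` is a hypothetical helper, and how pytacite serializes nested dicts internally is an assumption here.

```python
def flatten(nested, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf of a nested dict."""
    for key, value in nested.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend one level and extend the dotted path.
            yield from flatten(value, path)
        else:
            yield path, value


nested = {"creators": {"nameIdentifiers": {"nameIdentifierScheme": "ORCID"}}}
print(dict(flatten(nested)))
```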
Sort entity lists
```python
Clients().sort("created", ascending=True).get()
```
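The sketch below shows how a field name and the `ascending` flag could be combined into a single sort value. It assumes that the DataCite API marks descending order with a leading minus sign (for example `-created`), which is not stated in this README. `sort_param` is a hypothetical helper and not part of pytacite.

```python
def sort_param(field, ascending=True):
    """Return the field name, prefixed with '-' for descending order."""
    # Assumption: a leading minus sign requests descending order.
    return field if ascending else f"-{field}"


print(sort_param("created", ascending=True))
print(sort_param("created", ascending=False))
```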
Logical expressions
See the DataCite documentation on logical operators such as AND, OR, and NOT.
Paging
DataCite offers two methods for paging: basic paging and cursor paging. Both methods are supported by pytacite.
Basic (offset) paging
Only the first 10,000 records can be retrieved with basic (offset) paging.
```python
pager = DOIs().filter(prefix="10.5438").paginate(method="number", per_page=100)

for page in pager:
    print(len(page))
```
Cursor paging
Use paginate() to page through results. By default, the n_max argument of
paginate() is set to 10000. Use n_max=None to retrieve all results.
```python
pager = DOIs().filter(prefix="10.5438").paginate(per_page=100)

for page in pager:
    print(len(page))
```
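The interplay between `per_page` and `n_max` can be sketched with a generator that requests pages until the source is exhausted or the limit is reached. This is an illustration under stated assumptions and not pytacite's implementation: `paginate_sketch`, `fetch_page`, and `fake_fetch` are hypothetical names, and the data source is a local list instead of the DataCite API.

```python
def paginate_sketch(fetch_page, per_page=25, n_max=10000):
    """Yield pages until the source is exhausted or n_max records were returned.

    fetch_page(cursor, size) is a hypothetical callable returning
    (records, next_cursor); next_cursor is None after the last page.
    """
    cursor, seen = None, 0
    while n_max is None or seen < n_max:
        # Shrink the last request so that no more than n_max records are returned.
        size = per_page if n_max is None else min(per_page, n_max - seen)
        records, cursor = fetch_page(cursor, size)
        if not records:
            break
        seen += len(records)
        yield records
        if cursor is None:
            break


# Fake data source with 250 records, standing in for the remote API.
DATA = list(range(250))


def fake_fetch(cursor, size):
    start = cursor or 0
    end = start + size
    next_cursor = end if end < len(DATA) else None
    return DATA[start:end], next_cursor


print([len(p) for p in paginate_sketch(fake_fetch, per_page=100, n_max=None)])
print([len(p) for p in paginate_sketch(fake_fetch, per_page=100, n_max=150)])
```

With `n_max=None` the generator keeps requesting pages until the source reports no further cursor. With a finite `n_max` it stops early, even if more records are available.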
Looking for an easy method to iterate the records of a pager?
```python
from itertools import chain
from pytacite import DOIs

query = DOIs().filter(prefix="10.5438")

for record in chain(*query.paginate(per_page=100)):
    print(record["id"])
```
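A variation on the snippet above is `chain.from_iterable`, which reads pages one at a time, whereas `chain(*pager)` unpacks every page before the loop starts. The sketch below combines it with `islice` to stop after a fixed number of records. `iter_records` and `fake_pager` are hypothetical names, and the fake pager with made-up ids stands in for `query.paginate(...)` so that the example runs offline.

```python
from itertools import chain, islice


def iter_records(pager, limit=None):
    """Lazily yield single records from an iterable of pages."""
    return islice(chain.from_iterable(pager), limit)


def fake_pager():
    # Stand-in for query.paginate(per_page=2); the ids are made up.
    yield [{"id": "a"}, {"id": "b"}]
    yield [{"id": "c"}, {"id": "d"}]


print([record["id"] for record in iter_records(fake_pager(), limit=3)])
```

Because pages are only requested when needed, stopping early avoids fetching the remaining pages.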
Get random DOIs
Get random DOIs. Note that response times for this request are very slow, which is caused by DataCite.
```python
DOIs().random().get(per_page=10)
```
Code snippets
A list of awesome use cases of the DataCite dataset.
Creators of a dataset
```python
from pytacite import DOIs

w = DOIs()["10.34894/HE6NAQ"]

w["attributes"]["creators"]
```
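The creators field is a list with one entry per creator. A minimal sketch for turning it into a list of names, assuming that each entry carries a `name` key as in the DataCite JSON representation. `creator_names` is a hypothetical helper, and the record below contains made-up placeholder values, not the actual metadata of the DOI above.

```python
def creator_names(record):
    """Return the display name of every creator in a DOI record."""
    creators = record.get("attributes", {}).get("creators") or []
    # Assumption: each creator entry has a "name" key; skip entries without one.
    return [creator["name"] for creator in creators if "name" in creator]


# Made-up placeholder record so the sketch runs offline.
record = {
    "attributes": {
        "creators": [
            {"name": "Doe, Jane"},
            {"name": "Example Research Group"},
        ]
    }
}

print(creator_names(record))
```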
Get the works of a single creator
Work in progress: get rid of quotes.
```python
DOIs() \
    .query(creators={"nameIdentifiers": {"nameIdentifier": "\"https://orcid.org/0000-0001-7736-2091\""}}) \
    .get()
```
Software published on Zenodo in 2016
Resources:
- https://support.datacite.org/reference/getclients
- https://support.datacite.org/reference/getdois

Get the DataCite identifier of the client first:
```python
from pytacite import Clients

c = Clients().query("Zenodo").get()
print(c[0]["id"])
```
```python
cern.zenodo
```
Filter the DOIs on the client identifier. It can be a bit confusing when to use filter and query here.
```python
DOIs() \
    .filter(client_id=c[0]["id"]) \
    .filter(resource_type_id="software") \
    .query(publicationYear=2016) \
    .count()
```
```python
9720
```
Number of repositories running on Dataverse software
```python
from pytacite import Clients

Clients() \
    .filter(software="dataverse") \
    .count()
```
```python
31
```
Alternatives
datacite is a nice Python wrapper for the Metadata Store API, which is not covered by pytacite.
R users can use the RDataCite library.
License
MIT
Contact
This library is a community contribution. The authors of this Python library aren't affiliated with DataCite.
Feel free to reach out with questions, remarks, and suggestions. The issue tracker is a good starting point. You can also email me at jonathandebruinos@gmail.com.
Owner
- Name: Jonathan de Bruin
- Login: J535D165
- Kind: user
- Location: Netherlands
- Company: Utrecht University
- Repositories: 45
- Profile: https://github.com/J535D165
Research engineer working on software, datasets, and tools to advance open science 👐 @UtrechtUniversity @asreview
Citation (CITATION.cff)
cff-version: 1.2.0
title: >-
pytacite - A flexible and lightweight Python interface to
the DataCite database
message: 'We appreciate, but do not require, attribution.'
type: software
authors:
- family-names: De Bruin
given-names: Jonathan
orcid: 'https://orcid.org/0000-0002-4297-0502'
repository-code: 'https://github.com/J535D165/pytacite'
url: 'https://github.com/J535D165/pytacite'
repository-artifact: 'https://pypi.org/project/pytacite/'
keywords:
- data-repository
- data-management
- research
- dataset
- research-software
- metadata
license: MIT
Committers
Last synced: over 1 year ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Jonathan de Bruin | j****s@g****m | 16 |
Issues and Pull Requests
Last synced: 12 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0