isimip-publisher
A command line tool to publish climate impact data from the ISIMIP project.
Science Score: 44.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (13.0%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 4
Metadata Files
README.md
ISIMIP publisher
A command line tool to publish climate impact data from the ISIMIP project. This tool is used for the ISIMIP repository.
Setup
First, create a virtual environment in the directory `env` using:
```
python3 -m venv env
```
Next, install isimip-publisher directly from GitHub using:
```
pip install git+https://github.com/ISI-MIP/isimip-publisher
```
If you want to make changes to the source code, clone the repository and use `pip install -e` instead:
```
git clone git@github.com:ISI-MIP/isimip-publisher
pip install -e isimip-publisher
```
PostgreSQL must be available. A database user and a database have to be created, and the `pg_trgm` extension needs to be activated:
```pgsql
CREATE USER "isimip_metadata" WITH PASSWORD 'supersecretpassword';
CREATE DATABASE "isimip_metadata" WITH OWNER "isimip_metadata";
\c isimip_metadata
CREATE EXTENSION pg_trgm;
```
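The `--database` option expects a connection string of the form `postgresql+psycopg2://username:password@host:port/dbname`. The following is a minimal sketch of assembling such a string for the user and database created above. The helper name `build_database_url` and the host and port defaults are illustrative assumptions and not part of isimip-publisher.

```python
from urllib.parse import quote_plus


def build_database_url(user, password, dbname, host='localhost', port=5432):
    """Assemble a connection string in the form shown in the --database help text.

    Hypothetical helper; host and port defaults are assumptions.
    """
    # Percent-encode user and password so that characters such as '@' or '/'
    # do not break the structure of the URL.
    return (
        f'postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}'
        f'@{host}:{port}/{dbname}'
    )


if __name__ == '__main__':
    print(build_database_url('isimip_metadata', 'supersecretpassword', 'isimip_metadata'))
```

The resulting string can be passed via `--database`, the `database` entry of the config file, or the `DATABASE` environment variable.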
Usage
The publisher has several options which can be inspected using the help option -h, --help:
```
usage: isimip-publisher [-h] [--config-file CONFIG_FILE] [-i INCLUDE_FILE]
                        [-e EXCLUDE_FILE] [-v VERSION]
                        [--remote-dest REMOTE_DEST] [--remote-dir REMOTE_DIR]
                        [--local-dir LOCAL_DIR] [--public-dir PUBLIC_DIR]
                        [--archive-dir ARCHIVE_DIR] [--database DATABASE]
                        [--mock MOCK] [--restricted RESTRICTED]
                        [--protocol-location PROTOCOL_LOCATIONS]
                        [--datacite-username DATACITE_USERNAME]
                        [--datacite-password DATACITE_PASSWORD]
                        [--datacite-prefix DATACITE_PREFIX]
                        [--datacite-test-mode DATACITE_TEST_MODE]
                        [--isimip-data-url ISIMIP_DATA_URL]
                        [--rights {None,CC0,BY,BY-SA,BY-NC,BY-NC-SA}]
                        [--log-level LOG_LEVEL] [--log-file LOG_FILE]
                        {list_remote,list_remote_links,list_local,list_public,list_public_links,match_remote,match_remote_links,match_local,match_public,match_public_links,fetch_files,write_local_jsons,write_public_jsons,write_link_jsons,insert_datasets,update_datasets,publish_datasets,archive_datasets,check,clean,update_search,update_tree,run,insert_doi,update_doi,register_doi,link_links,link_files,link_datasets,link,init,update_views}
                        ...

optional arguments:
  -h, --help            show this help message and exit
  --config-file CONFIG_FILE
                        File path of the config file
  -i INCLUDE_FILE, --include INCLUDE_FILE
                        Path to a file containing a list of files to include
  -e EXCLUDE_FILE, --exclude EXCLUDE_FILE
                        Path to a file containing a list of files to exclude
  -v VERSION, --version VERSION
                        Version date override [default: today]
  --remote-dest REMOTE_DEST
                        Remote destination to fetch files from, e.g.
                        user@example.com
  --remote-dir REMOTE_DIR
                        Remote directory to fetch files from
  --local-dir LOCAL_DIR
                        Local work directory
  --public-dir PUBLIC_DIR
                        Public directory
  --archive-dir ARCHIVE_DIR
                        Archive directory
  --database DATABASE   Database connection string, e.g.
                        postgresql+psycopg2://username:password@host:port/dbname
  --mock MOCK           If set to True, no files are actually copied. Empty
                        mock files are used instead
  --restricted RESTRICTED
                        If set to True, the files are flaged as restricted in
                        the database.
  --protocol-location PROTOCOL_LOCATIONS
                        URL or file path to the protocol
  --datacite-username DATACITE_USERNAME
                        Username for DataCite
  --datacite-password DATACITE_PASSWORD
                        Password for DataCite
  --datacite-prefix DATACITE_PREFIX
                        Prefix for DataCite
  --datacite-test-mode DATACITE_TEST_MODE
                        If set to True, the test version of DataCite is used
  --isimip-data-url ISIMIP_DATA_URL
                        URL of the ISIMIP repository [default:
                        https://data.isimip.org/]
  --rights {None,CC0,BY,BY-SA,BY-NC,BY-NC-SA}
                        Rights/license for the files [default: None]
  --log-level LOG_LEVEL
                        Log level (ERROR, WARN, INFO, or DEBUG)
  --log-file LOG_FILE   Path to the log file

subcommands:
  valid subcommands

  {list_remote,list_remote_links,list_local,list_public,list_public_links,match_remote,match_remote_links,match_local,match_public,match_public_links,fetch_files,write_local_jsons,write_public_jsons,write_link_jsons,insert_datasets,update_datasets,publish_datasets,archive_datasets,check,clean,update_search,update_tree,run,insert_doi,update_doi,register_doi,link_links,link_files,link_datasets,link,init,update_views}
```
The different steps of the publication process are covered by subcommands, which can be invoked separately.
```bash
# list remote files
isimip-publisher list_remote

# match remote datasets
isimip-publisher match_remote

# copy remote files to LOCAL_DIR
isimip-publisher fetch_files

# create a JSON file with metadata for each dataset and file
isimip-publisher write_local_jsons

# find datasets and files and insert their metadata into the database
isimip-publisher insert_datasets

# copy files from LOCAL_DIR to PUBLIC_DIR
isimip-publisher publish_datasets

# copy files from PUBLIC_DIR to ARCHIVE_DIR
isimip-publisher archive_datasets

# insert a new DOI resource
isimip-publisher insert_doi

# register a DOI resource with DataCite
isimip-publisher register_doi
```
<path> starts from REMOTE_DIR, LOCAL_DIR, etc., and must start with <simulation_round>/<product>/<sector>. After that more levels can follow to restrict the files to be processed further.
<resource-path> is the path to a JSON file containing metadata on the local disk.
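The structure required of `<path>` can be made concrete with a small sketch that splits a path into its three mandatory levels and any further levels. The names `ParsedPath` and `parse_path` are hypothetical and not part of isimip-publisher, and the check that the path is relative is an assumption drawn from the statement that it starts from `REMOTE_DIR`, `LOCAL_DIR`, etc.

```python
from pathlib import PurePosixPath
from typing import NamedTuple, Tuple


class ParsedPath(NamedTuple):
    simulation_round: str
    product: str
    sector: str
    rest: Tuple[str, ...]  # optional further levels that narrow the selection


def parse_path(path):
    """Split a path into <simulation_round>/<product>/<sector> and the remaining levels.

    Hypothetical helper; raises ValueError if fewer than three levels are given
    or if the path is absolute (assumed to be invalid here).
    """
    posix_path = PurePosixPath(path)
    parts = posix_path.parts
    if posix_path.is_absolute() or len(parts) < 3:
        raise ValueError(
            f'{path!r} must be relative and start with '
            '<simulation_round>/<product>/<sector>'
        )
    return ParsedPath(parts[0], parts[1], parts[2], tuple(parts[3:]))


if __name__ == '__main__':
    print(parse_path('ISIMIP3b/OutputData/marine-fishery_global'))
```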
`match_remote`, `fetch_files`, `write_local_jsons`, `insert_datasets`, and `publish_datasets` can be combined using `run`:
```bash
isimip-publisher run <path>
```
For all commands, a list of files with absolute paths (as a line-separated text file) can be provided to restrict the files processed, e.g.:
```bash
isimip-publisher -e exclude.txt -i include.txt run <path>
```
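An include or exclude file of this kind can be generated with a few lines of Python. The sketch below writes the absolute paths of all matching files below a directory, one per line. The function name `write_file_list` and the `*.nc` default pattern are assumptions chosen for illustration.

```python
import tempfile
from pathlib import Path


def write_file_list(directory, list_path, pattern='*.nc'):
    """Write the absolute paths of all files matching `pattern` below `directory`.

    Hypothetical helper: one path per line, sorted, as a plain text file.
    Returns the list of paths that were written.
    """
    files = sorted(p.resolve() for p in Path(directory).rglob(pattern) if p.is_file())
    Path(list_path).write_text(''.join(f'{p}\n' for p in files))
    return files


if __name__ == '__main__':
    # Demonstrate with a throwaway directory containing two empty files.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / 'model').mkdir()
        (Path(tmp) / 'model' / 'example.nc').touch()
        (Path(tmp) / 'notes.txt').touch()
        write_file_list(tmp, Path(tmp) / 'include.txt')
        print((Path(tmp) / 'include.txt').read_text(), end='')
```

The generated file can then be passed with `-i` or `-e` as in the command above.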
Default values for the optional arguments are set in the code, but can also be provided via:
- a config file given by `--config-file`, or located at `isimip.conf`, `~/.isimip.conf`, or `/etc/isimip.conf`. The config file needs to have a section `isimip-publisher` and uses lower case variables and underscores, e.g.:
  ```
  [isimip-publisher]
  log_level = ERROR
  mock = false

  remote_dest = localhost
  remote_dir = /path/to/remote/
  local_dir = /path/to/local/
  public_dir = /path/to/public/
  archive_dir = /path/to/public/
  database = postgresql+psycopg2://USER:PASSWORD@host:port/DBNAME

  protocol_locations = '/path/to/isimip-protocol-3/output/ /path/to/isimip-protocol-3/output/'
  ```
- environment variables (in caps and with underscores, e.g. `MOCK`).
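How such layered settings might be resolved can be sketched with the standard library alone. This is not the tool's implementation: the function name `load_settings`, the default values, and the precedence (environment variables over the config file over defaults) are all assumptions made for illustration.

```python
import configparser
import os

# Illustrative defaults only; the tool's real defaults are set in its code.
DEFAULTS = {'log_level': 'WARN', 'mock': 'false'}


def load_settings(config_text, environ=None):
    """Merge defaults, an [isimip-publisher] config section, and environment variables.

    Hypothetical sketch. Assumed precedence: environment > config file > defaults.
    """
    environ = os.environ if environ is None else environ

    # Disable interpolation so that '%' in values (e.g. in a database URL) is kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)

    settings = dict(DEFAULTS)
    if parser.has_section('isimip-publisher'):
        settings.update(parser['isimip-publisher'])

    # Only keys already known from the defaults or the config file are looked up,
    # using the upper-case form of the lower-case config key (mock -> MOCK).
    for key in list(settings):
        if key.upper() in environ:
            settings[key] = environ[key.upper()]
    return settings


if __name__ == '__main__':
    example = '[isimip-publisher]\nlog_level = ERROR\nlocal_dir = /path/to/local/\n'
    print(load_settings(example, environ={'MOCK': 'true'}))
```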
Scripts/Notebooks
The different functions of the tool can also be used in Python scripts or Jupyter Notebooks. Before any functions are called, the global settings object needs to be initialized, e.g.:
```python
from isimip_publisher.main import init_settings
from isimip_publisher.utils.database import (init_database_session, retrieve_datasets)

path = 'ISIMIP3b/OutputData/marine-fishery_global'

settings = init_settings(config_file='~/data/isimip/isimip.conf')
session = init_database_session(settings.DATABASE)

datasets = retrieve_datasets(session, path)
...
```
Test
Install test dependencies:
```
pip install -r requirements/pytest.txt
```
Copy .env.pytest to .env. This sets the environment variables to the directories in testing.
Run:
```bash
pytest
```
Run a specific test, e.g.:
```bash
pytest isimip_publisher/tests/test_commands.py::test_empty
```
Run tests with coverage:
```bash
pytest --cov=isimip_publisher
```
Database schema
The database schema is automatically created when `insert_datasets` or `init` is used for the first time. The tool creates three main tables: one for the datasets, one for the files (in each dataset), and one for the resources, for which DOIs are created:
Table "public.datasets"
Column | Type | Collation | Nullable | Default
-------------+-----------------------------+-----------+----------+---------
id | uuid | | not null |
target_id | uuid | | |
name | text | | not null |
path | text | | not null |
version | character varying(8) | | not null |
size | bigint | | not null |
specifiers | jsonb | | not null |
identifiers | text[] | | not null |
public | boolean | | not null |
tree_path | text | | |
rights | text | | |
created | timestamp without time zone | | |
updated | timestamp without time zone | | |
published | timestamp without time zone | | |
archived | timestamp without time zone | | |
Table "public.files"
Column | Type | Collation | Nullable | Default
---------------+-----------------------------+-----------+----------+---------
id | uuid | | not null |
dataset_id | uuid | | |
target_id | uuid | | |
name | text | | not null |
path | text | | not null |
version | character varying(8) | | not null |
size | bigint | | not null |
checksum | text | | not null |
checksum_type | text | | not null |
netcdf_header | jsonb | | |
specifiers | jsonb | | not null |
identifiers | text[] | | not null |
created | timestamp without time zone | | |
updated | timestamp without time zone | | |
Table "public.resources"
Column | Type | Collation | Nullable | Default
----------+-----------------------------+-----------+----------+---------
id | uuid | | not null |
doi | text | | not null |
title | text | | not null |
version | text | | |
paths | text[] | | not null |
datacite | jsonb | | not null |
created | timestamp without time zone | | |
updated | timestamp without time zone | | |
The many-to-many relation between datasets and resources is implemented using a separate table:
Table "public.resources_datasets"
Column | Type | Collation | Nullable | Default
-------------+------+-----------+----------+---------
resource_id | uuid | | |
dataset_id | uuid | | |
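The way such an association table resolves the many-to-many relation can be illustrated with a self-contained sketch. It uses an in-memory SQLite database in place of PostgreSQL, stores the UUIDs as text, and reduces the tables to a few columns. The DOIs and dataset names are made up, and `build_example` and `datasets_for_doi` are hypothetical helpers.

```python
import sqlite3
import uuid


def build_example():
    """Create reduced datasets/resources tables and link them via resources_datasets."""
    conn = sqlite3.connect(':memory:')
    conn.executescript('''
        CREATE TABLE datasets (id TEXT PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE resources (id TEXT PRIMARY KEY, doi TEXT NOT NULL);
        CREATE TABLE resources_datasets (resource_id TEXT, dataset_id TEXT);
    ''')
    d1, d2, r1, r2 = (str(uuid.uuid4()) for _ in range(4))
    conn.executemany('INSERT INTO datasets VALUES (?, ?)',
                     [(d1, 'dataset-1'), (d2, 'dataset-2')])
    conn.executemany('INSERT INTO resources VALUES (?, ?)',
                     [(r1, '10.1234/a'), (r2, '10.1234/b')])
    # dataset-2 belongs to both resources; the first resource covers both datasets.
    conn.executemany('INSERT INTO resources_datasets VALUES (?, ?)',
                     [(r1, d1), (r1, d2), (r2, d2)])
    return conn


def datasets_for_doi(conn, doi):
    """Return the names of all datasets linked to the resource with the given DOI."""
    rows = conn.execute('''
        SELECT d.name
        FROM datasets d
        JOIN resources_datasets rd ON rd.dataset_id = d.id
        JOIN resources r ON r.id = rd.resource_id
        WHERE r.doi = ?
        ORDER BY d.name
    ''', (doi,))
    return [name for (name,) in rows]


if __name__ == '__main__':
    print(datasets_for_doi(build_example(), '10.1234/a'))
```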
Additional tables are created for the search and tree functionality of the repository.
Table "public.search"
Column | Type | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+---------
dataset_id | uuid | | not null |
vector | tsvector | | not null |
created | timestamp without time zone | | |
updated | timestamp without time zone | | |
Table "public.trees"
Column | Type | Collation | Nullable | Default
-----------+-----------------------------+-----------+----------+---------
id | uuid | | not null |
tree_dict | jsonb | | not null |
created | timestamp without time zone | | |
updated | timestamp without time zone | | |
Two materialized views allow fast lookup of all identifiers (with the list of corresponding specifiers) and of all words (the list of tokens for the search):
Materialized view "public.identifiers"
Column | Type | Collation | Nullable | Default
------------+------+-----------+----------+---------
identifier | text | | |
specifiers | json | | |
Materialized view "public.words"
Column | Type | Collation | Nullable | Default
--------+------+-----------+----------+---------
word | text | | |
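The README does not state how the `pg_trgm` extension required during setup is used. The extension provides trigram-based similarity, which is commonly applied to fuzzy matching against a word list such as the `words` view. The following pure-Python sketch approximates that similarity measure for illustration. It is an assumption about the general technique, not a description of the tool's queries.

```python
import re


def trigrams(text):
    """Return the set of trigrams of `text`, approximating pg_trgm's extraction.

    Each alphanumeric word is lower-cased and padded with two leading spaces
    and one trailing space before three-character windows are taken.
    """
    result = set()
    for word in re.findall(r'[^\W_]+', text.lower()):
        padded = f'  {word} '
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def similarity(a, b):
    """Share of trigrams common to both strings, relative to all distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == '__main__':
    print(sorted(trigrams('cat')))
    print(similarity('word', 'two words'))
```

A misspelled search term still shares most of its trigrams with the intended word, so ranking candidate words by this value yields suggestions that tolerate typos.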
Owner
- Name: Inter-Sectoral Impact Model Intercomparison Project
- Login: ISI-MIP
- Kind: organization
- Email: isimip@pik-potsdam.de
- Location: Potsdam, Germany
- Website: https://www.isimip.org
- Repositories: 29
- Profile: https://github.com/ISI-MIP
Citation (CITATION.cff)
cff-version: 1.2.0
title: ISIMIP publisher
message: >-
If you use this software, please cite it using the
metadata from this file.
type: software
authors:
- family-names: Klar
given-names: Jochen
orcid: 'https://orcid.org/0000-0002-5883-4273'
repository-code: 'https://github.com/ISI-MIP/isimip-publisher'
license: MIT
GitHub Events
Total
- Release event: 1
- Push event: 6
- Create event: 2
Last Year
- Release event: 1
- Push event: 6
- Create event: 2
Dependencies
- actions/checkout v3 composite
- actions/setup-python v4 composite
- postgres latest docker
- pytest * test
- pytest-console-scripts * test
- pytest-cov * test