isimip-publisher

A command line tool to publish climate impact data from the ISIMIP project.

https://github.com/isi-mip/isimip-publisher

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.0%) to scientific vocabulary

Repository


Basic Info
  • Host: GitHub
  • Owner: ISI-MIP
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 496 KB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 4
Created almost 7 years ago · Last pushed 10 months ago
Metadata Files

README.md

ISIMIP publisher


A command line tool to publish climate impact data from the ISIMIP project. This tool is used for the ISIMIP repository.

Setup

First create a virtual environment in the directory env using:

python3 -m venv env

Next, install isimip-publisher directly from GitHub using

pip install git+https://github.com/ISI-MIP/isimip-publisher

If you want to make changes to the source code, clone the repository and use pip install -e instead:

git clone git@github.com:ISI-MIP/isimip-publisher
pip install -e isimip-publisher

PostgreSQL has to be available. A database user and a database have to be created, and the pg_trgm extension needs to be activated:

CREATE USER "isimip_metadata" WITH PASSWORD 'supersecretpassword';
CREATE DATABASE "isimip_metadata" WITH OWNER "isimip_metadata";
\c isimip_metadata
CREATE EXTENSION pg_trgm;
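The database created here is later handed to the tool as a connection string through the `--database` option. The following is a minimal Python sketch of assembling such a string, assuming the `postgresql+psycopg2://username:password@host:port/dbname` form given in the help text. The function name `build_database_url` and the default host and port are illustrative assumptions and not part of isimip-publisher.

```python
from urllib.parse import quote_plus


def build_database_url(user, password, dbname, host='localhost', port=5432):
    """Assemble a connection string of the form shown in the help text.

    User and password are percent-encoded so that characters such as '@'
    or '/' do not break the URL. Hypothetical helper, not part of the tool.
    """
    return (
        f'postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}'
        f'@{host}:{port}/{dbname}'
    )


# Credentials taken from the SQL statements in the setup section
url = build_database_url('isimip_metadata', 'supersecretpassword', 'isimip_metadata')
print(url)
```

Percent-encoding only matters when the password contains reserved characters; for the example credentials the string is unchanged.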

Usage

The publisher has several options which can be inspected using the help option -h, --help:

```
usage: isimip-publisher [-h] [--config-file CONFIG_FILE] [-i INCLUDE_FILE] [-e EXCLUDE_FILE]
                        [-v VERSION] [--remote-dest REMOTE_DEST] [--remote-dir REMOTE_DIR]
                        [--local-dir LOCAL_DIR] [--public-dir PUBLIC_DIR]
                        [--archive-dir ARCHIVE_DIR] [--database DATABASE] [--mock MOCK]
                        [--restricted RESTRICTED] [--protocol-location PROTOCOL_LOCATIONS]
                        [--datacite-username DATACITE_USERNAME]
                        [--datacite-password DATACITE_PASSWORD]
                        [--datacite-prefix DATACITE_PREFIX]
                        [--datacite-test-mode DATACITE_TEST_MODE]
                        [--isimip-data-url ISIMIP_DATA_URL]
                        [--rights {None,CC0,BY,BY-SA,BY-NC,BY-NC-SA}]
                        [--log-level LOG_LEVEL] [--log-file LOG_FILE]
                        {list_remote,list_remote_links,list_local,list_public,list_public_links,match_remote,match_remote_links,match_local,match_public,match_public_links,fetch_files,write_local_jsons,write_public_jsons,write_link_jsons,insert_datasets,update_datasets,publish_datasets,archive_datasets,check,clean,update_search,update_tree,run,insert_doi,update_doi,register_doi,link_links,link_files,link_datasets,link,init,update_views}
                        ...

optional arguments:
  -h, --help            show this help message and exit
  --config-file CONFIG_FILE
                        File path of the config file
  -i INCLUDE_FILE, --include INCLUDE_FILE
                        Path to a file containing a list of files to include
  -e EXCLUDE_FILE, --exclude EXCLUDE_FILE
                        Path to a file containing a list of files to exclude
  -v VERSION, --version VERSION
                        Version date override [default: today]
  --remote-dest REMOTE_DEST
                        Remote destination to fetch files from, e.g. user@example.com
  --remote-dir REMOTE_DIR
                        Remote directory to fetch files from
  --local-dir LOCAL_DIR
                        Local work directory
  --public-dir PUBLIC_DIR
                        Public directory
  --archive-dir ARCHIVE_DIR
                        Archive directory
  --database DATABASE   Database connection string, e.g.
                        postgresql+psycopg2://username:password@host:port/dbname
  --mock MOCK           If set to True, no files are actually copied. Empty mock files are used instead
  --restricted RESTRICTED
                        If set to True, the files are flaged as restricted in the database.
  --protocol-location PROTOCOL_LOCATIONS
                        URL or file path to the protocol
  --datacite-username DATACITE_USERNAME
                        Username for DataCite
  --datacite-password DATACITE_PASSWORD
                        Password for DataCite
  --datacite-prefix DATACITE_PREFIX
                        Prefix for DataCite
  --datacite-test-mode DATACITE_TEST_MODE
                        If set to True, the test version of DataCite is used
  --isimip-data-url ISIMIP_DATA_URL
                        URL of the ISIMIP repository [default: https://data.isimip.org/]
  --rights {None,CC0,BY,BY-SA,BY-NC,BY-NC-SA}
                        Rights/license for the files [default: None]
  --log-level LOG_LEVEL
                        Log level (ERROR, WARN, INFO, or DEBUG)
  --log-file LOG_FILE   Path to the log file

subcommands:
  valid subcommands

  {list_remote,list_remote_links,list_local,list_public,list_public_links,match_remote,match_remote_links,match_local,match_public,match_public_links,fetch_files,write_local_jsons,write_public_jsons,write_link_jsons,insert_datasets,update_datasets,publish_datasets,archive_datasets,check,clean,update_search,update_tree,run,insert_doi,update_doi,register_doi,link_links,link_files,link_datasets,link,init,update_views}
```

The different steps of the publication process are covered by subcommands, which can be invoked separately.

```bash
# list remote files
isimip-publisher list_remote

# match remote datasets
isimip-publisher match_remote

# copy remote files to LOCAL_DIR
isimip-publisher fetch_files

# create a JSON file with metadata for each dataset and file
isimip-publisher write_local_jsons

# find datasets and files and insert their metadata into the database
isimip-publisher insert_datasets

# copy files from LOCAL_DIR to PUBLIC_DIR
isimip-publisher publish_datasets

# copy files from PUBLIC_DIR to ARCHIVE_DIR
isimip-publisher archive_datasets

# insert a new DOI resource
isimip-publisher insert_doi

# register a DOI resource with DataCite
isimip-publisher register_doi
```

<path> starts from REMOTE_DIR, LOCAL_DIR, etc., and must start with <simulation_round>/<product>/<sector>. After that more levels can follow to restrict the files to be processed further.
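To make the required layout of `<path>` concrete, here is a small Python sketch that splits a path into the three mandatory levels and any further levels. The helper `split_publisher_path` is hypothetical and only illustrates the rule stated above; it is not how isimip-publisher itself validates paths. The trailing level `some-model` in the example is made up.

```python
from pathlib import PurePosixPath


def split_publisher_path(path):
    """Split a <path> into its three mandatory leading levels and the rest.

    Raises ValueError if the path is absolute or has fewer than three
    levels (<simulation_round>/<product>/<sector>).
    """
    pure_path = PurePosixPath(path)
    parts = pure_path.parts
    if pure_path.is_absolute() or len(parts) < 3:
        raise ValueError(f'{path!r} must start with <simulation_round>/<product>/<sector>')
    simulation_round, product, sector = parts[:3]
    return {
        'simulation_round': simulation_round,
        'product': product,
        'sector': sector,
        'rest': list(parts[3:]),  # optional further levels that narrow the selection
    }


info = split_publisher_path('ISIMIP3b/OutputData/marine-fishery_global/some-model')
print(info)
```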

<resource-path> is the path to a JSON file containing metadata on the local disk.

match_remote, fetch_files, write_local_jsons, insert_datasets, and publish_datasets can be combined using run:

isimip-publisher run <path>
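When individual steps need to be scripted rather than combined with run, the subcommands can be chained from Python. The sketch below only builds the argument lists and prints them; it assumes that each subcommand takes `<path>` as its positional argument and that global options precede the subcommand, as the usage line suggests. The helper `build_commands` is hypothetical, and the three steps are a subset chosen for illustration.

```python
import shlex


def build_commands(steps, path, config_file=None):
    """Return one argument list per subcommand, in the given order.

    Global options such as --config-file are placed before the
    subcommand, following the usage line of the help output.
    """
    base = ['isimip-publisher']
    if config_file is not None:
        base += ['--config-file', config_file]
    return [base + [step, path] for step in steps]


commands = build_commands(
    ['match_remote', 'fetch_files', 'publish_datasets'],
    'ISIMIP3b/OutputData/marine-fishery_global',
)
for command in commands:
    print(shlex.join(command))
    # To actually execute a step (requires an installed and configured tool):
    # subprocess.run(command, check=True)
```

Passing `check=True` to `subprocess.run` would stop the chain at the first failing step, which is usually what a publication pipeline wants.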

For all commands, a list of files with absolute paths (as a line-separated text file) can be provided to restrict the files processed, e.g.:

isimip-publisher -e exclude.txt -i include.txt run <path>
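The effect of the include and exclude lists can be pictured with a short Python sketch. It assumes that a file is kept when it appears on the include list (if one is given) and does not appear on the exclude list, using exact path comparison; the matching rules actually used by isimip-publisher may differ. The helper names and the file paths are made up for illustration.

```python
def read_path_list(text):
    """Parse a line-separated list of paths, skipping blank lines."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def filter_files(files, include=None, exclude=None):
    """Keep files on the include list (if given) and not on the exclude list."""
    kept = []
    for file_path in files:
        if include is not None and file_path not in include:
            continue
        if exclude is not None and file_path in exclude:
            continue
        kept.append(file_path)
    return kept


# Hypothetical contents of include.txt and exclude.txt
include = read_path_list('/data/a.nc\n/data/b.nc\n\n/data/c.nc\n')
exclude = read_path_list('/data/b.nc\n')

selected = filter_files(
    ['/data/a.nc', '/data/b.nc', '/data/c.nc', '/data/d.nc'], include, exclude
)
print(selected)
```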

Default values for the optional arguments are set in the code, but can also be provided via:

  • a config file given by --config-file, or located at isimip.conf, ~/.isimip.conf, or /etc/isimip.conf. The config file needs to have a section isimip-publisher and uses lower case variables and underscores, e.g.:

    ```
    [isimip-publisher]
    log_level = ERROR
    mock = false

    remote_dest = localhost
    remote_dir = /path/to/remote/
    local_dir = /path/to/local/
    public_dir = /path/to/public/
    archive_dir = /path/to/archive/
    database = postgresql+psycopg2://USER:PASSWORD@host:port/DBNAME

    protocol_locations = '/path/to/isimip-protocol-3/output/ /path/to/isimip-protocol-3/output/'
    ```

  • environment variables (in caps and with underscores, e.g. MOCK).
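The two sources listed above can be pictured with the following Python sketch, which reads an `[isimip-publisher]` section with the standard library `configparser` and lets an upper-case environment variable take precedence. The precedence order and the helper `read_setting` are assumptions made for this sketch; the source does not state which source wins when both are set.

```python
import configparser
import os

CONFIG_TEXT = """
[isimip-publisher]
log_level = ERROR
mock = false
"""


def read_setting(name, parser, environ=os.environ, default=None):
    """Look up a setting: environment variable first, then the config file.

    The precedence (environment over config file) is an assumption of
    this sketch, not documented behaviour of the tool.
    """
    env_value = environ.get(name.upper())
    if env_value is not None:
        return env_value
    return parser.get('isimip-publisher', name.lower(), fallback=default)


parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# Value from the config file, since no environment variable is set
print(read_setting('log_level', parser, environ={}))
# Value from the (simulated) environment, overriding the config file
print(read_setting('mock', parser, environ={'MOCK': 'true'}))
```

Values are returned as strings in both cases, so a setting such as `mock` still has to be converted to a boolean by the caller.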

Scripts/Notebooks

The different functions of the tool can also be used in Python scripts or Jupyter Notebooks. Before any functions are called, the global settings object needs to be initialized, e.g.:

```python
from isimip_publisher.main import init_settings
from isimip_publisher.utils.database import (init_database_session,
                                             retrieve_datasets)

path = 'ISIMIP3b/OutputData/marine-fishery_global'

settings = init_settings(config_file='~/data/isimip/isimip.conf')

session = init_database_session(settings.DATABASE)

datasets = retrieve_datasets(session, path)

...
```

Test

Install test dependencies:

pip install -r requirements/pytest.txt

Copy .env.pytest to .env. This sets the environment variables to the directories in testing.

Run:

pytest

Run a specific test, e.g.:

pytest isimip_publisher/tests/test_commands.py::test_empty

Run tests with coverage:

pytest --cov=isimip_publisher

Database schema

The database schema is automatically created when insert_datasets or init is used for the first time. The tool creates three main tables: one for the datasets, one for the files (in each dataset), and one for the resources, for which DOIs are created:

                        Table "public.datasets"
   Column    |            Type             | Collation | Nullable | Default
-------------+-----------------------------+-----------+----------+---------
 id          | uuid                        |           | not null |
 target_id   | uuid                        |           |          |
 name        | text                        |           | not null |
 path        | text                        |           | not null |
 version     | character varying(8)        |           | not null |
 size        | bigint                      |           | not null |
 specifiers  | jsonb                       |           | not null |
 identifiers | text[]                      |           | not null |
 public      | boolean                     |           | not null |
 tree_path   | text                        |           |          |
 rights      | text                        |           |          |
 created     | timestamp without time zone |           |          |
 updated     | timestamp without time zone |           |          |
 published   | timestamp without time zone |           |          |
 archived    | timestamp without time zone |           |          |

                          Table "public.files"
    Column     |            Type             | Collation | Nullable | Default
---------------+-----------------------------+-----------+----------+---------
 id            | uuid                        |           | not null |
 dataset_id    | uuid                        |           |          |
 target_id     | uuid                        |           |          |
 name          | text                        |           | not null |
 path          | text                        |           | not null |
 version       | character varying(8)        |           | not null |
 size          | bigint                      |           | not null |
 checksum      | text                        |           | not null |
 checksum_type | text                        |           | not null |
 netcdf_header | jsonb                       |           |          |
 specifiers    | jsonb                       |           | not null |
 identifiers   | text[]                      |           | not null |
 created       | timestamp without time zone |           |          |
 updated       | timestamp without time zone |           |          |

                      Table "public.resources"
  Column  |            Type             | Collation | Nullable | Default
----------+-----------------------------+-----------+----------+---------
 id       | uuid                        |           | not null |
 doi      | text                        |           | not null |
 title    | text                        |           | not null |
 version  | text                        |           |          |
 paths    | text[]                      |           | not null |
 datacite | jsonb                       |           | not null |
 created  | timestamp without time zone |           |          |
 updated  | timestamp without time zone |           |          |

The many-to-many relation between datasets and resources is implemented using a separate table:

         Table "public.resources_datasets"
   Column    | Type | Collation | Nullable | Default
-------------+------+-----------+----------+---------
 resource_id | uuid |           |          |
 dataset_id  | uuid |           |          |
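How such a link table resolves the datasets behind a DOI can be sketched in Python with the standard library `sqlite3` module. This is a stand-in for illustration only: the tool uses PostgreSQL, the tables below carry just a few of the columns, UUIDs are stored as text, and the DOI, titles and dataset names are made up.

```python
import sqlite3
import uuid

connection = sqlite3.connect(':memory:')
connection.executescript("""
    CREATE TABLE datasets (id TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE resources (id TEXT PRIMARY KEY, doi TEXT NOT NULL, title TEXT NOT NULL);
    CREATE TABLE resources_datasets (resource_id TEXT, dataset_id TEXT);
""")

# Made-up rows for illustration only
dataset_ids = [str(uuid.uuid4()) for _ in range(2)]
resource_id = str(uuid.uuid4())
connection.executemany(
    'INSERT INTO datasets VALUES (?, ?)',
    [(dataset_ids[0], 'dataset-a'), (dataset_ids[1], 'dataset-b')],
)
connection.execute(
    'INSERT INTO resources VALUES (?, ?, ?)',
    (resource_id, '10.0000/example', 'Example resource'),
)
connection.executemany(
    'INSERT INTO resources_datasets VALUES (?, ?)',
    [(resource_id, dataset_id) for dataset_id in dataset_ids],
)

# All datasets that belong to a given DOI, resolved through the link table
rows = connection.execute("""
    SELECT datasets.name
    FROM resources
    JOIN resources_datasets ON resources_datasets.resource_id = resources.id
    JOIN datasets ON datasets.id = resources_datasets.dataset_id
    WHERE resources.doi = ?
    ORDER BY datasets.name
""", ('10.0000/example',)).fetchall()

names = [name for (name,) in rows]
print(names)
```

Because the link table holds only the two foreign keys, the same dataset can be attached to several resources without duplicating any dataset row.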

Additional tables are created for the search and tree functionality of the repository.

                        Table "public.search"
   Column   |            Type             | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+---------
 dataset_id | uuid                        |           | not null |
 vector     | tsvector                    |           | not null |
 created    | timestamp without time zone |           |          |
 updated    | timestamp without time zone |           |          |

                        Table "public.trees"
  Column   |            Type             | Collation | Nullable | Default
-----------+-----------------------------+-----------+----------+---------
 id        | uuid                        |           | not null |
 tree_dict | jsonb                       |           | not null |
 created   | timestamp without time zone |           |          |
 updated   | timestamp without time zone |           |          |

Two materialized views are used to allow a fast lookup to all identifiers (with the list of corresponding specifiers), as well as all words (the list of tokens for the search):

      Materialized view "public.identifiers"
   Column   | Type | Collation | Nullable | Default
------------+------+-----------+----------+---------
 identifier | text |           |          |
 specifiers | json |           |          |

        Materialized view "public.words"
 Column | Type | Collation | Nullable | Default
--------+------+-----------+----------+---------
 word   | text |           |          |
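The idea behind the identifiers view can be illustrated in plain Python. The sketch assumes that each dataset's `specifiers` column maps an identifier to one specifier value and that the view collects the distinct values per identifier; the actual view definition is not shown in this document, so this is an interpretation, and the example values are made up.

```python
from collections import defaultdict


def collect_identifiers(datasets_specifiers):
    """Map each identifier to the sorted list of distinct specifier values."""
    collected = defaultdict(set)
    for specifiers in datasets_specifiers:
        for identifier, specifier in specifiers.items():
            collected[identifier].add(specifier)
    return {identifier: sorted(values) for identifier, values in collected.items()}


# Made-up specifiers of three datasets
datasets_specifiers = [
    {'model': 'model-a', 'variable': 'tas'},
    {'model': 'model-b', 'variable': 'tas'},
    {'model': 'model-a', 'variable': 'pr'},
]

identifiers = collect_identifiers(datasets_specifiers)
print(identifiers)
```

Precomputing this aggregation once, as a materialized view does, avoids scanning every dataset each time the list of available values is needed.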

Owner

  • Name: Inter-Sectoral Impact Model Intercomparison Project
  • Login: ISI-MIP
  • Kind: organization
  • Email: isimip@pik-potsdam.de
  • Location: Potsdam, Germany

Citation (CITATION.cff)

cff-version: 1.2.0
title: ISIMIP publisher
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - family-names: Klar
    given-names: Jochen
    orcid: 'https://orcid.org/0000-0002-5883-4273'
repository-code: 'https://github.com/ISI-MIP/isimip-publisher'
license: MIT

GitHub Events

Total
  • Release event: 1
  • Push event: 6
  • Create event: 2
Last Year
  • Release event: 1
  • Push event: 6
  • Create event: 2

Dependencies

.github/workflows/pytest.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • postgres latest docker
requirements/pytest.txt pypi
  • pytest * test
  • pytest-console-scripts * test
  • pytest-cov * test