cessda.cdc.aggregator.oai-pmh-repo-handler

Component that provides an OAI-PMH endpoint so that metadata can be harvested by other aggregators

https://github.com/cessda/cessda.cdc.aggregator.oai-pmh-repo-handler

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (3.0%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: cessda
  • License: eupl-1.2
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 155 KB
Statistics
  • Stars: 0
  • Watchers: 3
  • Forks: 0
  • Open Issues: 11
  • Releases: 0
Created over 3 years ago · Last pushed 10 months ago
Metadata Files
Readme Changelog Contributing License Code of conduct Citation

README.md

CESSDA Metadata Aggregator - OAI-PMH Repo Handler

HTTP server providing an OAI-PMH aggregator endpoint serving DocStore records. This program is part of CESSDA Metadata Aggregator.

Source code is hosted on GitHub at https://github.com/cessda/cessda.cdc.aggregator.oai-pmh-repo-handler.

Features

The OAI-PMH Repo Handler implements an OAI-PMH Aggregator service. The aggregator provides an OAI-PMH endpoint which enables tracing of record origin using OAI-PMH provenance containers. The OAI-PMH specification is publicly available at http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm and provenance containers are described at http://www.openarchives.org/OAI/2.0/guidelines-provenance.htm. The aggregator adheres to the implementation Guidelines for Aggregators, Caches and Proxies, which is available at http://www.openarchives.org/OAI/2.0/guidelines-aggregator.htm.
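
To illustrate how a harvester might trace a record's origin, the following minimal sketch reads the originDescription element of a provenance container, following the element layout given in the provenance guidelines linked above. The sample XML, its base URL and its identifier are made up for illustration and are not actual output of this aggregator.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH provenance guidelines.
PROV_NS = {"prov": "http://www.openarchives.org/OAI/2.0/provenance"}

# Hypothetical provenance container; all values are illustrative only.
SAMPLE_ABOUT = """
<about>
  <provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance">
    <originDescription harvestDate="2023-01-01T00:00:00Z" altered="true">
      <baseURL>https://archive.example.org/oai</baseURL>
      <identifier>oai:archive.example.org:study-1</identifier>
      <datestamp>2022-12-24</datestamp>
      <metadataNamespace>ddi:codebook:2_5</metadataNamespace>
    </originDescription>
  </provenance>
</about>
"""


def record_origin(about_xml):
    """Return origin details from the outermost originDescription element."""
    root = ET.fromstring(about_xml)
    origin = root.find("prov:provenance/prov:originDescription", PROV_NS)
    if origin is None:
        return None
    return {
        "base_url": origin.findtext("prov:baseURL", namespaces=PROV_NS),
        "identifier": origin.findtext("prov:identifier", namespaces=PROV_NS),
        "harvest_date": origin.get("harvestDate"),
    }


origin = record_origin(SAMPLE_ABOUT)
print(origin["base_url"])
```

The baseURL element is the piece that identifies the repository from which the record was originally harvested.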

The aggregator implements the following OAI-PMH features:

  • All six OAI-PMH verbs of OAI-PMH protocol 2.0.
  • ResumptionTokens to partition large list responses.
  • Selective harvesting via OAI-sets and datestamps.
  • Configurable support for deleted records.
  • Configurable support for OAI-identifiers.
  • Configurable support for arbitrary OAI-sets.
  • Built-in OAI set for grouping by study language.
  • Built-in OAI set for grouping by OpenAIRE data.
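
As a rough sketch of how a client could follow resumption tokens, the function below extracts the token from a list response. In OAI-PMH 2.0 the last page of a partitioned list carries an empty resumptionToken element, which the sketch treats as "no more pages". The sample response and the token value are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical, heavily shortened ListIdentifiers response.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>id_1</identifier>
      <datestamp>2023-01-01T00:00:00Z</datestamp>
    </header>
    <resumptionToken>hypothetical-token</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>
"""


def next_resumption_token(response_xml):
    """Return the token for the next page, or None when the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    # A missing or empty element both mean there is nothing left to fetch.
    if token is None or not token.strip():
        return None
    return token.strip()


print(next_resumption_token(SAMPLE_RESPONSE))
```

A harvester would repeat the request with the returned value in the resumptionToken argument until the function returns None.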

The following metadata formats are supported:

  • OAI-DC using metadataPrefix oai_dc.
  • DDI 2.5 using metadataPrefix oai_ddi25.
  • OpenAIRE Datacite using metadataPrefix oai_datacite.
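
The sketch below shows one way a harvester could build request URLs for these formats. The base URL is a placeholder and the helper function is hypothetical. The argument names (verb, metadataPrefix, from) come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode

# Placeholder endpoint; replace with the base URL of a real instance.
BASE_URL = "https://aggregator.example.org/oai"


def oai_request_url(base_url, verb, arguments=None):
    """Build an OAI-PMH GET request URL from a verb and optional arguments."""
    query = {"verb": verb}
    # Arguments are passed as a dict because "from" is a Python keyword.
    query.update(arguments or {})
    return base_url + "?" + urlencode(query)


url = oai_request_url(
    BASE_URL,
    "ListRecords",
    {"metadataPrefix": "oai_ddi25", "from": "2023-01-01"},
)
print(url)
```

Swapping oai_ddi25 for oai_dc or oai_datacite requests the same records in another supported format.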

The application exposes a /metrics endpoint, which provides certain statistics about the running instance of the application. This endpoint is provided by prometheus-client. The following metrics are exposed:

| Metric | Type | Explanation |
| --- | --- | --- |
| requests_total | Counter | Total number of requests received |
| requests_per_user_agent_total | Counter | Number of requests received per user-agent |
| requests_succeeded_total | Counter | Number of successful requests |
| requests_failed_total | Counter | Number of failed requests |
| requests_duration | Summary | Response time in milliseconds |
| records_total | Gauge | Total number of OAI-PMH records (includes records marked as deleted) |
| records_total_without_deleted | Gauge | Total number of OAI-PMH records (excludes records marked as deleted) |
| publishers_total | Gauge | Total number of distinct publishers (defined by the repository's declared OAI-PMH base URL) |
| publishers_counts | Gauge | Number of OAI-PMH records per publisher (includes records marked as deleted) |
| publishers_counts_without_deleted | Gauge | Number of OAI-PMH records per publisher (excludes records marked as deleted) |
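
Since records_total includes deleted records and records_total_without_deleted excludes them, their difference gives the number of records marked as deleted. The following sketch derives that number from text in the Prometheus exposition format. The sample text and its values are made up, and the parser is a simplification that only handles samples without labels.

```python
# Made-up sample in the Prometheus text exposition format.
SAMPLE_METRICS = """\
# HELP records_total Total number of OAI-PMH records
# TYPE records_total gauge
records_total 120.0
# TYPE records_total_without_deleted gauge
records_total_without_deleted 100.0
"""


def parse_unlabeled_samples(exposition_text):
    """Map metric names to values, skipping comments and labeled samples."""
    samples = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        name, value = line.split()[:2]
        samples[name] = float(value)
    return samples


samples = parse_unlabeled_samples(SAMPLE_METRICS)
deleted = samples["records_total"] - samples["records_total_without_deleted"]
print(deleted)
```

For production use, a full parser such as the one shipped with prometheus-client is likely a better fit than this hand-rolled one.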

Requirements

  • Python 3.8 or newer.
  • A running CESSDA Metadata Aggregator DocStore instance.

Installation

On Ubuntu 20.04

Get Package

Clone the repository using Git.

    git clone https://github.com/cessda/cessda.cdc.aggregator.oai-pmh-repo-handler.git

Alternatively, fetch a specific release using its tag. For example, to get the 0.2.0 release:

    git clone --branch 0.2.0 https://github.com/cessda/cessda.cdc.aggregator.oai-pmh-repo-handler.git

Install OAI-PMH Repo Handler

It is recommended to install the packages inside a Python virtual environment to isolate the install. This package also provides a Dockerfile to help set up a containerized environment.

Create the Python virtual environment and activate it.

    python3 -m venv cdcagg-env
    source cdcagg-env/bin/activate

Install Python packages.

    cd cessda.cdc.aggregator.oai-pmh-repo-handler
    pip install -r requirements.txt
    pip install .

To upgrade an existing install, add the --upgrade flag to the pip commands. Pip has used the only-if-needed upgrade strategy by default since version 10.0.0, but the option is included in the example for backwards compatibility.

    pip install --upgrade -r requirements.txt --upgrade-strategy=only-if-needed
    pip install . --upgrade --upgrade-strategy=only-if-needed

Run

Replace <docstore-url> with a URL pointing to a DocStore server. Replace <base-url> with the OAI-PMH base URL of your endpoint. Replace <admin-email> with the administrator's email address.

    python -m cdcagg_oai --document-store-url <docstore-url> --oai-pmh-base-url <base-url> --oai-pmh-admin-email <admin-email>

Configuration reference

To list all available configuration options, use --help.

    python -m cdcagg_oai --help

Note that most configuration options can be specified via command line arguments, configuration file options and environment variables.
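
The general idea of one option being settable from several places can be sketched with the standard library alone. This is not how the program itself is implemented: the environment variable name EXAMPLE_DOCUMENT_STORE_URL is hypothetical, and the rule that the command line wins over the environment is an assumption of this sketch. The names and precedence actually used should be checked against the --help output.

```python
import argparse


def parse_options(argv, environ):
    """Resolve one option from the command line, falling back to the environment."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--document-store-url",
        # Hypothetical variable name, used only in this sketch.
        default=environ.get("EXAMPLE_DOCUMENT_STORE_URL"),
    )
    return parser.parse_args(argv)


env = {"EXAMPLE_DOCUMENT_STORE_URL": "http://docstore.example.org"}
from_env = parse_options([], env)
from_cli = parse_options(["--document-store-url", "http://cli.example.org"], env)
print(from_env.document_store_url)
print(from_cli.document_store_url)
```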

Prometheus client provides additional configuration options that can be set using environment variables:

  • PROMETHEUS_DISABLE_CREATED_SERIES for disabling series suffixed by _created.
  • PROMETHEUS_MULTIPROC_DIR for storing metrics when running in multiprocess mode.

Refer to Prometheus client documentation for more information.

Build OAI sets based on source endpoint

The aggregator provides a way to define OAI sets which group records by the source OAI-PMH endpoint. This functionality relies on a mapping file which maps the base URL of the source OAI-PMH endpoint to an OAI-PMH setspec value. To use the mapping file, its file path must be given to the program via configuration, and the program must be able to read the file.

See the example of a mapping file for a syntax reference. The mapping file is expected to be valid YAML.

The value that corresponds with the url in the mapping file is used to query the Document Store. Results are grouped using a setspec value source:<source-key-value>, where <source-key-value> corresponds to the value of source in the mapping file.

For example, if the mapping file has the following definition

-
  url: 'archive.org'
  source: 'archive'
  setname: 'Some archive'
  description: 'Describe some archive'

then all records that are harvested from archive.org are grouped in setspec source:archive.

Values for setname and description are used in ListSets-response to describe the set contents.

When the mapping file is defined, the OAI-PMH Repo Handler must be configured using configuration option --oai-set-source-path <mapping-file-path>.
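
The mapping from a source base URL to a setspec can be sketched as follows. The list mirrors the YAML example above as plain Python data, so that no YAML library is needed here. Matching the url value by exact string comparison is an assumption of this sketch, and the function name is hypothetical.

```python
# Python equivalent of the example mapping file shown above.
SOURCE_MAPPING = [
    {
        "url": "archive.org",
        "source": "archive",
        "setname": "Some archive",
        "description": "Describe some archive",
    },
]


def setspec_for_source(base_url, mapping):
    """Return the source:<source-key-value> setspec for a source base URL."""
    for entry in mapping:
        # Exact comparison is an assumption made for this sketch.
        if entry["url"] == base_url:
            return "source:" + entry["source"]
    return None


print(setspec_for_source("archive.org", SOURCE_MAPPING))
```

Records from a base URL that is missing from the mapping get no source set in this sketch, which is why the function returns None for them.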

Build arbitrary OAI sets

Arbitrary sets can be built using the configurable sets functionality.

Records can be grouped into arbitrary sets by creating a mapping file which defines the OAI set properties and the identifiers of the records belonging to each set. The record identifiers correspond to the _aggregator_identifier values of Study records, which are the same values that are used as default OAI-identifiers.

The set builder supports a single top-level spec value with multiple second-level spec values. Second-level spec values are always prefixed with the top-level spec value, as in <setSpec>top-level:second-level</setSpec>. The top-level setspec contains the records matching all identifiers defined in the second-level set definitions.

See the example of a mapping file for a syntax reference. The mapping file is expected to be valid YAML.

A single spec must be present at the top level. Its value is used as the top-level OAI setspec value and signals that this setspec value is interpreted as a configurable OAI set. The nodes key contains a list of second-level set definitions. The second-level spec values must be unique, and each list item must contain a list of the identifiers that belong to that particular OAI set.

For example, if the mapping file has the following definition

spec: 'thematic'
name: 'Thematic'
description: 'Thematic grouping of records'
nodes:
  - spec: 'social_sciences'
    name: 'Social sciences'
    description: 'Studies in social sciences'
    identifiers:
    - id_1
    - id_2
  - spec: 'humanities'
    name: 'Humanities'
    description: 'Studies in humanities'
    identifiers:
    - id_2
    - id_3
    - id_4

then thematic is the top-level setspec node. It contains two child nodes: social_sciences and humanities. A ListRecords request with set=thematic returns records from all of its second-level nodes. A ListRecords request with set=thematic:social_sciences returns the records whose _aggregator_identifier values are id_1 and id_2. The record with identifier id_2 belongs to both second-level setspec nodes.
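
The resolution rules for this example can be sketched in a few lines. The dictionary below is the YAML definition above written as Python data, with the name and description keys left out for brevity. The function name and its handling of unknown setspecs are assumptions of this sketch, not the program's actual behavior.

```python
# Python equivalent of the example definition above (name/description omitted).
SET_CONFIG = {
    "spec": "thematic",
    "nodes": [
        {"spec": "social_sciences", "identifiers": ["id_1", "id_2"]},
        {"spec": "humanities", "identifiers": ["id_2", "id_3", "id_4"]},
    ],
}


def identifiers_for_setspec(setspec, config):
    """Return record identifiers for a top-level or second-level setspec."""
    top, _, second = setspec.partition(":")
    if top != config["spec"]:
        return None  # not a configurable set
    identifiers = []
    for node in config["nodes"]:
        # An empty second level means the top-level set: include every node.
        if second and node["spec"] != second:
            continue
        for identifier in node["identifiers"]:
            if identifier not in identifiers:  # id_2 appears in both nodes
                identifiers.append(identifier)
    return identifiers


print(identifiers_for_setspec("thematic", SET_CONFIG))
print(identifiers_for_setspec("thematic:social_sciences", SET_CONFIG))
```

The top-level lookup deduplicates identifiers, so a record that belongs to several second-level sets is returned only once.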

Only a single top-level node is supported. It must contain at least one second-level child node.

Instead of specifying set definitions directly, a second-level node may specify a path key whose value is the absolute path of an external mapping file containing second-level set definitions.

The external configuration must specify spec, name and identifiers keys and may have an optional description key. The external configuration file can specify a single node or multiple nodes in a list.

Main configuration file with path

spec: 'thematic'
name: 'Thematic'
description: 'Thematic grouping of records'
nodes:
  - path: '/absolute/path/to/ext/conf.yaml'

External configuration file with a single node

spec: 'history'
name: 'History'
description: 'Studies in history'
identifiers:
- id_5
- id_6

External configuration file with a list of nodes

- spec: 'history'
  name: 'History'
  description: 'Studies in history'
  identifiers:
  - id_5
  - id_6
- spec: 'literature'
  name: 'Literature'
  description: 'Literature Studies'
  identifiers:
  - id_7
  - id_8

The external configuration cannot further refer to an external configuration file.
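
The rules for external configuration files can be summarized in a small validation sketch. It operates on already-parsed data (a dict for a single node, a list for several nodes). The function name and the error messages are hypothetical and do not reflect the program's actual validation code.

```python
REQUIRED_KEYS = {"spec", "name", "identifiers"}


def normalize_external_nodes(loaded):
    """Return external set definitions as a list, validating each node."""
    # A single node and a list of nodes are both accepted.
    nodes = loaded if isinstance(loaded, list) else [loaded]
    for node in nodes:
        if "path" in node:
            raise ValueError("external configuration cannot refer to another file")
        missing = REQUIRED_KEYS - node.keys()
        if missing:
            raise ValueError("missing keys: " + ", ".join(sorted(missing)))
    return nodes


single = {"spec": "history", "name": "History", "identifiers": ["id_5", "id_6"]}
print(len(normalize_external_nodes(single)))
```

The description key is optional, so it is deliberately absent from REQUIRED_KEYS.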

The mapping file syntax is validated on server startup. The file is not kept in memory but is read on demand every time it is needed. Exceptions may therefore occur after server startup if the file is changed after the initial syntax check.

When the mapping file is defined, the OAI-PMH Repo Handler must be configured using configuration option --oai-set-configurable-path <mapping-file-path>.

License

See the LICENSE file.

Owner

  • Name: CESSDA
  • Login: cessda
  • Kind: organization
  • Location: Norway

Citation (CITATION.cff)

#
# Copyright © 2017-2023 CESSDA ERIC (support@cessda.eu)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: CDC Aggregator OAI-PMH handler
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Toni
    family-names: Sissala
    affiliation: >-
      FSD - Finnish Social Science Data Archive
    email: toni.sissala@tuni.fi
  - given-names: Matthew
    family-names: Morris
    email: matthew.morris@cessda.eu
    affiliation: CESSDA ERIC

GitHub Events

Total
  • Issues event: 3
  • Delete event: 3
  • Issue comment event: 2
  • Push event: 6
  • Pull request review event: 3
  • Pull request event: 3
  • Create event: 4
Last Year
  • Issues event: 3
  • Delete event: 3
  • Issue comment event: 2
  • Push event: 6
  • Pull request review event: 3
  • Pull request event: 3
  • Create event: 4

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 2
  • Total pull requests: 2
  • Average time to close issues: about 1 month
  • Average time to close pull requests: about 1 month
  • Total issue authors: 1
  • Total pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 1.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 2
  • Average time to close issues: about 1 month
  • Average time to close pull requests: about 1 month
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 1.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • toni-sissala (4)
  • matthew-morris-cessda (1)
Pull Request Authors
  • toni-sissala (5)
  • Joshocan (1)
Top Labels
Issue Labels
bug (1)
Pull Request Labels

Dependencies

Dockerfile docker
  • python 3.11-slim build
requirements.txt pypi
  • ConfigArgParse ==1.5.3
  • Genshi ==0.7.7
  • PyYAML ==6.0.0
  • cdcagg_common 0.5.0
  • kuha_common 2.0.1
  • kuha_oai_pmh_repo_handler 1.3.0
  • prometheus-client ==0.17.0
  • py12flogging ==0.5.0
  • six ==1.16.0
  • tornado ==6.3.3
setup.py pypi