cessda.cdc.osmh-indexer.cmm

Parses DDI XML and converts it into CMM Metadata. Part of the OSMH harvester

https://github.com/cessda/cessda.cdc.osmh-indexer.cmm

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.0%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: cessda
  • License: apache-2.0
  • Language: Java
  • Default Branch: main
  • Homepage:
  • Size: 2.5 MB
Statistics
  • Stars: 2
  • Watchers: 4
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created almost 3 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog Contributing License Code of conduct Citation

README.md

OSMH Consumer Indexer (PaSC-OCI)

CESSDA CDC Consumer Indexer (an OSMH Consumer) for metadata harvesting and ingestion into Elasticsearch. See the OSMH System Architecture Document for more information about the Open Source Metadata Harvester (OSMH).

Quality - Software Maturity Level

The overall Software Maturity Level for this product, and the individual scores for each attribute can be found in the SML file.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

Prerequisites

The following tools must be installed before compiling:

  • Java JDK 17

Sonar it

To perform SonarQube analysis locally, run SonarQube and then execute

./mvnw sonar:sonar

Build and test it

./mvnw verify

Run it

./mvnw spring-boot:run

Run it — with a specified profile

To run the OSMH consumer with a custom profile, use the following command line:

java -jar target/pasc-oci*.jar --spring.profiles.active=${profile_name}

If no profile is specified, the default profile will be used. This profile is configured to use a local Elasticsearch instance hosted at http://localhost:9200.

Notes

  • Makes use of TDD
  • When running integration tests, a standalone Elasticsearch server is launched

Deployment

At startup

The application loads configuration from the following sources, as defined by the Spring Boot framework. Sources higher in the list take precedence over those below them.

  • Command line parameters
    • e.g. --logging.level.ROOT=DEBUG sets logging level for all classes
  • Environment Variables e.g. SECURITY_USER_NAME
    • Spring uses relaxed binding to convert environment variables into Java properties
    • e.g. SPRING_BOOT_ADMIN_USERNAME converts to spring.boot.admin.username
  • application-[dev,local,prod].yml
  • application.yml

See https://docs.spring.io/spring-boot/docs/3.1.x/reference/html/features.html#features.external-config for detailed documentation.
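The naming rule behind the environment variable example above can be sketched in a few lines of Java. This is only an illustration of the canonical form documented by Spring Boot (replace dots with underscores, remove dashes, convert to upper case); Spring performs the binding itself, and the class and method names here are hypothetical.

```java
import java.util.Locale;

/** Illustrative sketch only; not part of the indexer or of Spring Boot. */
final class RelaxedBindingSketch {

    /** Derives the environment variable name that corresponds to a property name. */
    static String toEnvironmentVariableName(String propertyName) {
        return propertyName
                .replace('.', '_')          // dots become underscores
                .replace("-", "")           // dashes are removed
                .toUpperCase(Locale.ROOT);  // the result is upper case
    }

    public static void main(String[] args) {
        System.out.println(toEnvironmentVariableName("spring.boot.admin.username"));
        System.out.println(toEnvironmentVariableName("logging.level.ROOT"));
    }
}
```

With this rule, `spring.boot.admin.username` corresponds to `SPRING_BOOT_ADMIN_USERNAME`, matching the example in the list above.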

Configuring the Indexer

The following properties change the behaviour of the indexing process.

| Property                               | Type      | Description                                                                                                  |
|----------------------------------------|-----------|--------------------------------------------------------------------------------------------------------------|
| baseDirectory                          | Path      | Directory to look for pipeline.json repository definitions.                                                  |
| languages                              | Languages | Configure which languages Elasticsearch indices will be created for.                                         |
| repos                                  | List      | Manually configured repository definitions.                                                                  |
| oaiPmh.concatSeparator                 | String    | The string to use to concatenate repeated elements, concatenation is disabled if null.                       |
| oaiPmh.metadataParsingDefaultLang.lang | String    | The language to fall back to if @xml:lang is not present. Individual repositories can override this setting. |
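The effect of `oaiPmh.concatSeparator` can be pictured with a small Java sketch. It assumes that, when the separator is null, repeated values are simply kept as separate entries; the README only states that concatenation is disabled in that case, so this behaviour and all names below are assumptions rather than the indexer's actual code.

```java
import java.util.List;

/** Illustrative sketch only; names and null behaviour are assumptions. */
final class ConcatSeparatorSketch {

    /** Joins repeated element values with the separator, or leaves them untouched if it is null. */
    static List<String> applySeparator(List<String> repeatedValues, String concatSeparator) {
        if (concatSeparator == null || repeatedValues.isEmpty()) {
            // Concatenation disabled (or nothing to join): keep the values as they are
            return repeatedValues;
        }
        return List.of(String.join(concatSeparator, repeatedValues));
    }

    public static void main(String[] args) {
        List<String> values = List.of("First abstract", "Second abstract");
        System.out.println(applySeparator(values, "; "));
        System.out.println(applySeparator(values, null));
    }
}
```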

Elasticsearch Properties

Elasticsearch properties are configured under the elasticsearch key.

```yaml
elasticsearch:
  host: localhost # The Elasticsearch host
  username: elastic # The username to use when connecting to a secured Elasticsearch cluster
  password: examplePassword # The password to use when connecting to a secured Elasticsearch cluster
  numberOfShards: 2 # The number of primary shards the created indices will have
  numberOfReplicas: 0 # The number of replicas each primary shard has
```

Language Settings

The languages that the OSMH indexer will attempt to harvest are specified under languages. These languages will be parsed and indexed into Elasticsearch. The default languages are specified below.

```yaml
languages: ['cs', 'da', 'de', 'el', 'en', 'et', 'fi', 'fr', 'hu', 'it', 'nl', 'no', 'pt', 'sk', 'sl', 'sr', 'sv']
```

Custom mappings and settings can be defined in src/main/resources/elasticsearch. Mappings are global for all defined languages, whereas settings are selected per language. If the required mappings and settings can't be loaded, the index will not be created and an error will be logged.

Indexing a Repository

In most cases, repositories to index are detected using instances of pipeline.json. These are generated by the CESSDA Metadata Harvester and contain all the information needed to index the XMLs present alongside them.

Repositories are discovered by searching for instances of pipeline.json in the baseDirectory. The baseDirectory can be specified using the --baseDirectory command line parameter, or by specifying baseDirectory in application.yml.
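As a rough illustration of this discovery step, the following Java sketch walks a base directory and collects every file named pipeline.json. It is not the indexer's actual implementation; the class and method names are hypothetical, and the contents of each pipeline.json are not parsed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

/** Illustrative sketch only; not the indexer's discovery code. */
final class PipelineDiscoverySketch {

    /** Recursively finds all files named pipeline.json below the base directory. */
    static List<Path> findPipelineDefinitions(Path baseDirectory) throws IOException {
        try (Stream<Path> paths = Files.walk(baseDirectory)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().equals("pipeline.json"))
                    .sorted()
                    .toList();
        }
    }

    public static void main(String[] args) throws IOException {
        Path baseDirectory = Path.of(args.length > 0 ? args[0] : ".");
        for (Path definition : findPipelineDefinitions(baseDirectory)) {
            // The XML files to index sit alongside each pipeline.json
            System.out.println(definition.getParent());
        }
    }
}
```

Each directory printed by `main` is one that contains a pipeline.json alongside the harvested XML files.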

Explicitly Declaring a Repository

Repositories are declared in application.yml and are specified under the key endpoints.repos.

```yaml
endpoints:
  repos:
    - url: http://194.117.18.18:6003/v0/oai
      path: path/to/directory
      code: APIS
      name: 'Portuguese Archive of Social Information (APIS)'
      preferredMetadataParam: oai_ddi25
      defaultLanguage: pt
```

| Property                 | Type   | Description                                                                                                                                                                                                                                    |
|--------------------------|--------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| url                      | URI    | Location of the OAI-PMH endpoint.                                                                                                                                                                                                              |
| code                     | String | Short name of the repository, acts as a unique identifier. This is a mandatory field.                                                                                                                                                          |
| name                     | String | The friendly name of the repository, displayed in the user interface. This falls back to using code if null.                                                                                                                                   |
| `path`                   | Path   | Location of the XML source files to be indexed. This is a mandatory parameter.                                                                                                                                                                 |
| `preferredMetadataParam` | String | The metadata prefix used when harvesting from the OAI-PMH repository.                                                                                                                                                                          |
| `defaultLanguage`        | String | Used to set a language on an element that doesn't have `@xml:lang` defined. Defaults to `oaiPmh.metadataParsingDefaultLang.lang` if not set. This setting is only considered if `oaiPmh.metadataParsingDefaultLang.active` is set to `true`. |
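The mandatory fields and fallback rules in the table can be summarised in a short Java sketch. The record below is hypothetical and is not the indexer's configuration class; in particular, returning an empty result when `oaiPmh.metadataParsingDefaultLang.active` is false is an assumption about how "only considered if" should be read.

```java
import java.util.Objects;
import java.util.Optional;

/** Illustrative sketch only; a hypothetical model of one entry under endpoints.repos. */
record RepositorySketch(String code, String name, String path, String defaultLanguage) {

    RepositorySketch {
        // code and path are described as mandatory in the table above
        Objects.requireNonNull(code, "code is mandatory");
        Objects.requireNonNull(path, "path is mandatory");
    }

    /** The friendly name, falling back to the code if no name is set. */
    String displayName() {
        return name != null ? name : code;
    }

    /** The language applied to elements without @xml:lang, if the default language feature is active. */
    Optional<String> fallbackLanguage(boolean defaultLangActive, String globalDefaultLang) {
        if (!defaultLangActive) {
            return Optional.empty(); // assumption: no fallback is applied when inactive
        }
        return Optional.ofNullable(defaultLanguage != null ? defaultLanguage : globalDefaultLang);
    }

    public static void main(String[] args) {
        var apis = new RepositorySketch("APIS", null, "path/to/directory", "pt");
        System.out.println(apis.displayName());
        System.out.println(apis.fallbackLanguage(true, "en"));
    }
}
```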

Data Access Mappings

Data Access is primarily read from the DDI-C 2.5 element /codeBook/stdyDscr/dataAccs/useStmt/conditions, whose value is checked against the values in the Access Rights CV. Free-text values are also supported through a mappings JSON file. Mappings for each repository are specified in dataaccessmappings.json, which states which XPath from XPaths.java to use and which free-text values map to Open or Restricted. Any new XPath that is not already used for Data Access by some repository must also be added to parseDataAccess in CMMStudyMapper.java.

Repository names in the mappings JSON must match the code set in the harvesting configuration (which follows the configuration from cessda.cdc.aggregator.deploy).
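The free-text lookup can be pictured with the following Java sketch. The actual structure of dataaccessmappings.json is not shown in this README, so the in-memory shape used here (repository code to free text to category) is an assumption, the XPath selection is left out, and all names are hypothetical.

```java
import java.util.Map;
import java.util.Optional;

/** Illustrative sketch only; the real mapping format and lookup logic may differ. */
final class DataAccessMappingSketch {

    enum DataAccess { OPEN, RESTRICTED }

    /** Looks up the category for a free-text value using the mappings of one repository. */
    static Optional<DataAccess> categorise(
            Map<String, Map<String, DataAccess>> mappings,
            String repositoryCode,
            String freeText) {
        // Repositories are keyed by the same code used in the harvesting configuration
        Map<String, DataAccess> repositoryMappings = mappings.getOrDefault(repositoryCode, Map.of());
        return Optional.ofNullable(repositoryMappings.get(freeText.strip()));
    }

    public static void main(String[] args) {
        // Hypothetical mapping data for demonstration purposes
        var mappings = Map.of("APIS", Map.of("Freely available", DataAccess.OPEN));
        System.out.println(categorise(mappings, "APIS", " Freely available "));
        System.out.println(categorise(mappings, "APIS", "Unknown text"));
    }
}
```

An empty result means that no mapping exists for the text, leaving the caller to decide how to treat uncategorised values.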

Built With

  • Maven - Dependency Management

Contributing

Please read Contributing to CESSDA Open Source Software for information on contribution to CESSDA software.

Versioning

Authors

You can find the list of all contributors here.

License

This project is licensed under the Apache 2 Licence - see the LICENSE file for details.

Acknowledgments

Owner

  • Name: CESSDA
  • Login: cessda
  • Kind: organization
  • Location: Norway

Citation (CITATION.cff)

#
# Copyright © 2017-2023 CESSDA ERIC (support@cessda.eu)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: CDC Consumer Indexer 
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Matthew
    family-names: Morris
    email: matthew.morris@cessda.eu
    affiliation: CESSDA ERIC
  - given-names: Joshua Tetteh
    family-names: Ocansey
    affiliation: CESSDA ERIC
    email: joshua.ocansey@cessda.eu

GitHub Events

Total
  • Release event: 2
  • Watch event: 1
  • Delete event: 45
  • Issue comment event: 16
  • Push event: 65
  • Pull request review comment event: 4
  • Pull request event: 81
  • Pull request review event: 42
  • Create event: 52
Last Year
  • Release event: 2
  • Watch event: 1
  • Delete event: 45
  • Issue comment event: 16
  • Push event: 65
  • Pull request review comment event: 4
  • Pull request event: 81
  • Pull request review event: 42
  • Create event: 52

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 1
  • Total pull requests: 57
  • Average time to close issues: about 12 hours
  • Average time to close pull requests: 6 days
  • Total issue authors: 1
  • Total pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.16
  • Merged pull requests: 39
  • Bot issues: 1
  • Bot pull requests: 46
Past Year
  • Issues: 1
  • Pull requests: 34
  • Average time to close issues: about 12 hours
  • Average time to close pull requests: 8 days
  • Issue authors: 1
  • Pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.21
  • Merged pull requests: 20
  • Bot issues: 1
  • Bot pull requests: 26
Top Authors
Issue Authors
Pull Request Authors
  • dependabot[bot] (100)
  • matthew-morris-cessda (14)
  • markusjt (6)
  • Joshocan (1)
Top Labels
Issue Labels
Pull Request Labels
dependencies (101) java (98) enhancement (5) github_actions (3) bug (2)

Dependencies

Dockerfile docker
  • openjdk 17 build
pom.xml maven
  • org.projectlombok:lombok 1.18.26 provided
  • ch.qos.logback:logback-classic
  • co.elastic.clients:elasticsearch-java 8.7.1
  • com.fasterxml.jackson.core:jackson-databind
  • com.github.java-json-tools:json-schema-validator 2.2.14
  • com.github.mizosoft.methanol:methanol 1.7.0
  • com.neovisionaries:nv-i18n 1.29
  • commons-codec:commons-codec
  • commons-lang:commons-lang 2.6
  • jaxen:jaxen
  • net.logstash.logback:logstash-logback-encoder 7.3
  • org.apache.commons:commons-lang3
  • org.jdom:jdom2
  • org.springframework.boot:spring-boot-starter
  • com.pgs-soft:HttpClientMock 1.0.0 test
  • org.junit.vintage:junit-vintage-engine test
  • org.springframework.boot:spring-boot-starter-test test