cessda.cdc.osmh-indexer.cmm
Parses DDI XML and converts it into CMM Metadata. Part of the OSMH harvester
OSMH Consumer Indexer (PaSC-OCI)
CESSDA CDC Consumer Indexer (an OSMH Consumer) for metadata harvesting and ingestion into Elasticsearch. See the OSMH System Architecture Document for more information about the Open Source Metadata Harvester (OSMH).
Quality - Software Maturity Level
The overall Software Maturity Level for this product, and the individual scores for each attribute can be found in the SML file.
Getting Started
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
Prerequisites
The following tools must be installed before compiling:
- Java JDK 17
Sonar it
To perform SonarQube analysis locally, run SonarQube and then execute
./mvnw sonar:sonar
Build and test it
./mvnw verify
Run it
./mvnw spring-boot:run
Run it — with a specified profile
To run the OSMH consumer with a custom profile, use the following command line:
java -jar target/pasc-oci*.jar --spring.profiles.active=${profile_name}
If no profile is specified, the default profile will be used. This profile is configured to use a local Elasticsearch instance hosted at http://localhost:9200.
Notes
- Makes use of TDD
- When running integration tests, a standalone Elasticsearch server is launched
Deployment
At startup
The application loads configuration from the following sources, listed from highest to lowest precedence, as defined by the Spring Boot framework:
- Command line parameters
  - e.g. `--logging.level.ROOT=DEBUG` sets the logging level for all classes
- Environment variables
  - e.g. `SECURITY_USER_NAME`
  - Spring can use relaxed binding to convert environment variables into Java properties
  - e.g. `SPRING_BOOT_ADMIN_USERNAME` converts to `spring.boot.admin.username`
- application-[dev,local,prod].yml
  - dev, local and prod refer to Spring profiles
  - A Spring profile can be specified with the command line parameter `--spring.profiles.active` or the environment variable `SPRING_PROFILES_ACTIVE`
  - See https://docs.spring.io/spring-boot/docs/3.1.x/reference/html/features.html#features.profiles for more details
- application.yml

See https://docs.spring.io/spring-boot/docs/3.1.x/reference/html/features.html#features.external-config for detailed documentation.
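The environment-variable naming convention mentioned above can be illustrated with a small sketch. It is a deliberately simplified approximation (lower-case the name, turn underscores into dots) and not Spring Boot's actual binder, which handles further cases such as hyphens and list indices. The class and method names are hypothetical.

```java
import java.util.Locale;

/** Hypothetical helper illustrating the environment-variable naming convention. */
final class EnvNameSketch {

    /**
     * Simplified conversion: SPRING_BOOT_ADMIN_USERNAME becomes
     * spring.boot.admin.username. Assumption: no hyphens or list indices.
     */
    static String toPropertyName(String envName) {
        return envName.toLowerCase(Locale.ROOT).replace('_', '.');
    }

    public static void main(String[] args) {
        System.out.println(toPropertyName("SPRING_BOOT_ADMIN_USERNAME"));
        System.out.println(toPropertyName("SPRING_PROFILES_ACTIVE"));
    }
}
```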
Configuring the Indexer
The OSMH indexer has many settings that change the behaviour of the indexing process.
| Property | Type | Description |
|----------|------|-------------|
| baseDirectory | Path | Directory to search for pipeline.json repository definitions. |
| languages | Languages | The languages for which Elasticsearch indices will be created. |
| repos | List | Explicitly declared repositories (see Explicitly Declaring a Repository). |
| oaiPmh.concatSeparator | String | The string used to concatenate repeated elements. Concatenation is disabled if null. |
| oaiPmh.metadataParsingDefaultLang.lang | String | The language to fall back to if @xml:lang is not present. Individual repositories can override this setting. |
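As a rough illustration of the `oaiPmh.concatSeparator` setting, the sketch below joins repeated element values with the separator and leaves them untouched when the separator is null. The class and method names are hypothetical, and keeping the values as separate entries when concatenation is disabled is an assumption. The indexer's real behaviour may differ.

```java
import java.util.List;

/** Hypothetical sketch of concatenating repeated element values. */
final class ConcatSketch {

    /**
     * Joins the values into a single entry when a separator is configured.
     * Assumption: a null separator means the values are returned unchanged.
     */
    static List<String> concatRepeated(List<String> values, String separator) {
        if (separator == null || values.size() < 2) {
            return values;
        }
        return List.of(String.join(separator, values));
    }

    public static void main(String[] args) {
        System.out.println(concatRepeated(List.of("First abstract", "Second abstract"), "; "));
        System.out.println(concatRepeated(List.of("First abstract", "Second abstract"), null));
    }
}
```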
Elasticsearch Properties
Elasticsearch properties are configured under the elasticsearch key.
```yaml
elasticsearch:
  host: localhost # The Elasticsearch host
  username: elastic # The username to use when connecting to a secured Elasticsearch cluster
  password: examplePassword # The password to use when connecting to a secured Elasticsearch cluster
  numberOfShards: 2 # The number of primary shards the created indices will have
  numberOfReplicas: 0 # The number of replicas each primary shard has
```
Language Settings
The languages that the OSMH indexer will attempt to harvest are specified under languages. These languages will be parsed and indexed into Elasticsearch. The default languages are specified below.
```yaml
languages: ['cs', 'da', 'de', 'el', 'en', 'et', 'fi', 'fr', 'hu', 'it', 'nl', 'no', 'pt', 'sk', 'sl', 'sr', 'sv']
```
Custom mappings and settings can be defined in src/main/resources/elasticsearch. Mappings are global for all defined languages, whereas settings are selected per language. If the required mappings and settings can't be loaded, the index will not be created and an error will be logged.
Indexing a Repository
In most cases, repositories to index are detected using instances of pipeline.json. These are generated by the CESSDA Metadata Harvester and contain all the information needed to index the XMLs present alongside them.
Repositories are discovered by searching for instances of pipeline.json in the baseDirectory. The baseDirectory can be specified using the --baseDirectory command line parameter, or by specifying baseDirectory in application.yml.
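The discovery step can be pictured as a recursive search of the base directory. The following is a minimal sketch using only the JDK, assuming that every regular file named `pipeline.json` below the directory counts as a repository definition. The class and method names are hypothetical and do not mirror the indexer's own code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical sketch of discovering pipeline.json files under a base directory. */
final class PipelineDiscoverySketch {

    /** Walks the directory tree and returns every regular file named pipeline.json, sorted by path. */
    static List<Path> findPipelineFiles(Path baseDirectory) throws IOException {
        try (Stream<Path> paths = Files.walk(baseDirectory)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().equals("pipeline.json"))
                .sorted()
                .toList();
        }
    }

    public static void main(String[] args) throws IOException {
        // The base directory is taken from the first argument, defaulting to the working directory.
        Path base = Path.of(args.length > 0 ? args[0] : ".");
        findPipelineFiles(base).forEach(System.out::println);
    }
}
```

The XML files to index would then be looked up relative to each discovered file's parent directory.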
Explicitly Declaring a Repository
Repositories are declared in application.yml and are specified under the key endpoints.repos.
```yaml
endpoints:
  repos:
    - url: http://194.117.18.18:6003/v0/oai
      path: path/to/directory
      code: APIS
      name: 'Portuguese Archive of Social Information (APIS)'
      preferredMetadataParam: oai_ddi25
      defaultLanguage: pt
```
| Property | Type | Description |
|----------|------|-------------|
| url | URI | Location of the OAI-PMH endpoint. |
| code | String | Short name of the repository, acts as a unique identifier. This is a mandatory field. |
| name | String | The friendly name of the repository, displayed in the user interface. This falls back to using code if null. |
| path | Path | Location of the XML source files to be indexed. This is a mandatory parameter. |
| preferredMetadataParam | String | The metadata prefix used when harvesting from the OAI-PMH repository. |
| defaultLanguage | String | Used to set a language on an element that doesn't have `@xml:lang` defined. Defaults to `oaiPmh.metadataParsingDefaultLang.lang` if not set. This setting is only considered if `oaiPmh.metadataParsingDefaultLang.active` is set to `true`. |
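The fallback rules for `defaultLanguage` can be summarised in a short sketch. It assumes that an explicit `@xml:lang` always wins, and that no language is assigned when `oaiPmh.metadataParsingDefaultLang.active` is false. The class, method, and parameter names are hypothetical.

```java
import java.util.Optional;

/** Hypothetical sketch of resolving the language of an element. */
final class DefaultLanguageSketch {

    /**
     * @param xmlLang       value of @xml:lang on the element, or null if absent
     * @param repoDefault   the repository's defaultLanguage, or null if not set
     * @param globalDefault oaiPmh.metadataParsingDefaultLang.lang
     * @param active        oaiPmh.metadataParsingDefaultLang.active
     */
    static Optional<String> resolveLanguage(String xmlLang, String repoDefault,
                                            String globalDefault, boolean active) {
        if (xmlLang != null && !xmlLang.isBlank()) {
            return Optional.of(xmlLang);
        }
        if (!active) {
            // Assumption: with the fallback disabled, the element gets no language.
            return Optional.empty();
        }
        return Optional.ofNullable(repoDefault != null ? repoDefault : globalDefault);
    }

    public static void main(String[] args) {
        System.out.println(resolveLanguage(null, "pt", "en", true));
    }
}
```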
Data Access Mappings
Data access is primarily read from DDI-C 2.5 at /codeBook/stdyDscr/dataAccs/useStmt/conditions by checking for values from the Access Rights CV. Free text values are also supported through a mappings JSON file. Mappings for each repository are specified in dataaccessmappings.json by naming the XPath to use from XPaths.java and then listing the free texts that map to Open or Restricted. Any new XPath that is not already used for data access by some repository must also be added to parseDataAccess in CMMStudyMapper.java.
Repository names in the mappings JSON must match the code set in the harvesting configuration (which follows the configuration from cessda.cdc.aggregator.deploy).
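To make the free-text mapping idea concrete, here is a minimal sketch that classifies a free-text value with an in-memory map standing in for one repository's entry in dataaccessmappings.json. The structure of that file is not described here, so the map layout, the trimming of whitespace, and all names below are assumptions for illustration only.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of mapping free-text access conditions to Open / Restricted. */
final class DataAccessMappingSketch {

    enum DataAccess { OPEN, RESTRICTED }

    /**
     * Looks up the trimmed free text in the repository's mappings.
     * Returns empty when the text is absent or has no mapping.
     */
    static Optional<DataAccess> classify(String freeText, Map<String, DataAccess> mappings) {
        if (freeText == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(mappings.get(freeText.trim()));
    }

    public static void main(String[] args) {
        // Example mappings invented for this sketch.
        Map<String, DataAccess> mappings = Map.of(
            "Data is freely available", DataAccess.OPEN,
            "Access requires registration", DataAccess.RESTRICTED);
        System.out.println(classify(" Data is freely available ", mappings));
    }
}
```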
Built With
- Maven - Dependency Management
Contributing
Please read Contributing to CESSDA Open Source Software for information on contributing to CESSDA software.
Versioning
Authors
You can find the list of all contributors here.
License
This project is licensed under the Apache 2 Licence - see the LICENSE file for details.
Acknowledgments
Owner
- Name: CESSDA
- Login: cessda
- Kind: organization
- Location: Norway
- Website: https://cessda.eu
- Twitter: cessda_data
- Repositories: 33
- Profile: https://github.com/cessda
Citation (CITATION.cff)
#
# Copyright © 2017-2023 CESSDA ERIC (support@cessda.eu)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: CDC Consumer Indexer
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Matthew
    family-names: Morris
    email: matthew.morris@cessda.eu
    affiliation: CESSDA ERIC
  - given-names: Joshua Tetteh
    family-names: Ocansey
    affiliation: CESSDA ERIC
    email: joshua.ocansey@cessda.eu
Dependencies
- openjdk 17 build
- org.projectlombok:lombok 1.18.26 provided
- ch.qos.logback:logback-classic
- co.elastic.clients:elasticsearch-java 8.7.1
- com.fasterxml.jackson.core:jackson-databind
- com.github.java-json-tools:json-schema-validator 2.2.14
- com.github.mizosoft.methanol:methanol 1.7.0
- com.neovisionaries:nv-i18n 1.29
- commons-codec:commons-codec
- commons-lang:commons-lang 2.6
- jaxen:jaxen
- net.logstash.logback:logstash-logback-encoder 7.3
- org.apache.commons:commons-lang3
- org.jdom:jdom2
- org.springframework.boot:spring-boot-starter
- com.pgs-soft:HttpClientMock 1.0.0 test
- org.junit.vintage:junit-vintage-engine test
- org.springframework.boot:spring-boot-starter-test test