harvester-curator
Collect metadata from a data and/or software directory and create the corresponding dataset in the data repository, DaRUS
https://github.com/simtech-research-data-management/harvester-curator
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.0%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 3
- Releases: 0
Metadata Files
README.md
harvester-curator
harvester-curator is a Python-based automation tool designed to streamline metadata collection and management in research data management. It automates the extraction of metadata from source repositories or directories, and then seamlessly maps and adapts this metadata to comply with the designated repository's metadata schemas, preparing it for integration into datasets.
In essence, harvester-curator synergizes file crawling and crosswalking capabilities to automate the complex and labor-intensive processes of metadata collection and repository population. Tailored for efficiency and accuracy in Dataverse environments, it equips researchers with a streamlined method to accelerate data management workflows, ensuring that their research data aligns with the FAIR principles of Findability, Accessibility, Interoperability, and Reusability.
Tool Workflow
harvester-curator simplifies metadata collection and integration into research data repositories through two primary phases: the Harvester phase, focusing on the automated extraction of metadata, and the Curator phase, dedicated to mapping and adapting this metadata for integrating into datasets within a target repository.
Detailed Tool Workflow
Let's delve deeper into the operational details of `harvester-curator`'s workflow. `harvester-curator` optimizes metadata collection and integration in two main phases:
* **Harvester Phase:** Automatically extracts metadata from files in user code and/or data directories using file-type-specific parsers.
* **Curator Phase:** Seamlessly maps and adapts the harvested metadata to ensure its integration into the target repository.
We currently support a variety of parsers, including VTK, HDF5, CFF, BibTeX, YAML and JSON:
* **VTK parser:** Supports file types such as `vtk`, `vti`, `vtr`, `vtp`, `vts`, `vtu`, `pvti`, `pvtr`, `pvtp`, `pvts` and `pvtu`.
* **HDF5 parser:** Handles formats including `hdf5`, `h5` and `he5`.
* **JSON parser:** Processes the `json` and `jsonld` types.
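The extension-to-parser dispatch described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the dispatch table and function name are hypothetical, and the extensions assumed for the CFF, BibTeX and YAML parsers are not spelled out in the README.

```python
# Hypothetical dispatch table; parser names mirror those listed above.
# The cff/bib/yaml extensions are assumptions, not taken from the README.
PARSER_BY_EXTENSION = {
    **dict.fromkeys(["vtk", "vti", "vtr", "vtp", "vts", "vtu",
                     "pvti", "pvtr", "pvtp", "pvts", "pvtu"], "vtk"),
    **dict.fromkeys(["hdf5", "h5", "he5"], "hdf5"),
    "cff": "cff",
    "bib": "bibtex",
    **dict.fromkeys(["yaml", "yml"], "yaml"),
    **dict.fromkeys(["json", "jsonld"], "json"),
}

def choose_parser(filename: str):
    """Pick a parser name from the file extension, or None if unsupported."""
    ext = filename.rsplit(".", 1)[-1].lower()
    return PARSER_BY_EXTENSION.get(ext)
```

A harvester built this way simply walks the directory, calls `choose_parser` per file, and skips files that return `None`.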
### Curator Phase
In the subsequent `Curator` phase, `harvester-curator` aligns the harvested metadata with the metadata schemas of the target repository, such as DaRUS. It matches the harvested metadata attributes with those defined in the metadata schemas and integrates the values into the appropriate locations. Additionally, it supports direct upload of curated metadata to the destination repository.
The `Curator` algorithm employs mappings to reconcile discrepancies between the naming conventions of harvested metadata and the metadata schemas of the target repository. Given that harvested metadata typically features a flat structure (attributes, values, and paths are at the same level, unlike the hierarchical organization common in repository schemas), the algorithm adapts harvested metadata to ensure compatibility:
1. **Mapping and Matching:** It begins by updating attribute values and paths of harvested metadata based on predefined mappings, taking into account the hierarchical structure of repository schemas.
2. **Attribute Matching:** The algorithm searches for matching attributes within the target repository's schema. If no direct match is found, it combines parent and attribute information in search of a suitable match. Attributes that remain unmatched are noted for subsequent matching attempts with an alternative schema.
3. **Parent Matching:** Upon finding a match, the algorithm designates the corresponding parent from the schema as the "matching parent." If a direct parent match does not exist, or if multiple matches are found, it examines common elements between the schema and harvested metadata to determine the most appropriate matching parent.
4. **Dictionary Preparation:** Attributes that successfully match are compiled into a dictionary that includes the mapped attribute, value, parent, and schema name, ensuring the metadata is compatible with the target repository.
5. **Similarity Matching:** When exact matches are not found across all schemas, the algorithm employs similarity matching with an 85% threshold to accommodate differences in metadata schema integration.
This systematic approach ensures compatibility with the requirements of the target repository and enhances the precision of metadata integration by utilizing direct mapping, exact matching and similarity matching to overcome schema alignment challenges.
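Step 5's similarity matching can be illustrated with a minimal sketch using Python's standard `difflib`. The function name and the choice of `SequenceMatcher` are assumptions for illustration; the README specifies only the 85% threshold, not the similarity measure actually used.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.85  # threshold stated in the Curator algorithm

def best_match(attribute: str, schema_fields: list[str]):
    """Return the schema field most similar to `attribute`,
    or None if no field clears the threshold."""
    if not schema_fields:
        return None
    scored = [(SequenceMatcher(None, attribute.lower(), f.lower()).ratio(), f)
              for f in schema_fields]
    score, field = max(scored)
    return field if score >= SIMILARITY_THRESHOLD else None
```

With a measure like this, a harvested attribute such as `author_name` would still land on a schema field named `authorName`, while an attribute with no close counterpart falls through to `None` and is noted as unmatched.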
Project Structure
The harvester-curator project is organized as follows:
- `src/harvester_curator/`: The main app package directory containing all the source code.
- `tests/`: Contains all tests for the `harvester-curator` application.
- `images/`: Contains images used in the documentation, such as the workflow diagram.
How to Install harvester-curator:
harvester-curator can be easily installed via pip, the recommended tool for installing Python packages.
0. Install pip (if not already installed):
If you don’t have pip installed, you can install it with the following command:
```bash
python3 -m ensurepip --upgrade
```
For more detailed instructions on installing pip, please visit the official pip installation guide.
1. Install harvester-curator:
To install harvester-curator from PyPI, simply run:
```bash
python3 -m pip install harvester-curator
```
This will automatically download and install harvester-curator and its dependencies.
2. Verify Installation:
After the installation, you can verify it by running:
```bash
harvester-curator --help
```
Usage
The harvester-curator app is designed to facilitate the efficient collection, curation and uploading of metadata. Follow these instructions to utilize the app and its available subcommands effectively.
General Help
For an overview of all commands and their options:
```bash
harvester-curator --help
```
Harvesting Metadata
To collect metadata from files in a specified local directory:
```bash
harvester-curator harvest --dir_path "/path/to/directory" --output_filepath "/path/to/harvested_output.json"
```
Or, using short options:
```bash
harvester-curator harvest -d "/path/to/directory" -o "/path/to/harvested_output.json"
```
If the files are located in a GitHub or GitLab repository, one can specify the repository link instead of a local directory path to harvest metadata. For example:
```bash
harvester-curator harvest -d https://gitlab.com/"your repo" --output_filepath "/path/to/harvested_output.json"
```
or,
```bash
harvester-curator harvest -d git@github.com:"your repo".git -o output/harvested_output.json
```
For GitHub, it is recommended to use the SSH link when cloning the repository.
Important Note: Without --dir_path, the default is the example folder within the harvester_curator package. Without --output_filepath, harvested metadata is saved to output/harvested_output.json by default.
Curating Metadata
To process and align harvested metadata with specified schema metadata blocks:
```bash
harvester-curator curate --harvested_metadata_filepath "/path/to/harvested_output.json" --output_filepath "/path/to/curated_output.json" --api_endpoints_filepath "/path/to/schema_api_endpoints.json"
```
Or, using short options:
```bash
harvester-curator curate -h "/path/to/harvested_output.json" -o "/path/to/curated_output.json" -a "/path/to/schema_api_endpoints.json"
```
Important Note: Default file paths are used if options are not specified:
* --harvested_metadata_filepath defaults to output/harvested_output.json.
* --output_filepath defaults to output/curated_output.json.
* --api_endpoints_filepath defaults to curator/api_end_points/darus_md_schema_api_endpoints.json.
Uploading Metadata
To upload curated metadata to a Dataverse repository as dataset metadata:
```bash
harvester-curator upload --server_url "https://xxx.xxx.xxx" --api_token "abc0_def123_gkg456__hijk789" --dataverse_id "mydataverse_alias" --curated_metadata_filepath "/path/to/curated_output.json"
```
Or, using short options:
```bash
harvester-curator upload -s "https://xxx.xxx.xxx" -a "abc0_def123_gkg456__hijk789" -d "mydataverse_alias" -c "/path/to/curated_output.json"
```
Important Note: The default for --curated_metadata_filepath is output/curated_output.json.
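For context, uploading curated metadata as a dataset corresponds to the Dataverse native API's create-dataset endpoint (`POST /api/dataverses/{alias}/datasets`, authenticated via the `X-Dataverse-key` header). The sketch below only assembles the request pieces; the helper name and example values are hypothetical, and the tool itself builds on the easyDataverse library per its dependency list rather than raw HTTP calls.

```python
def build_upload_request(server_url: str, api_token: str, dataverse_id: str):
    """Assemble URL and headers for creating a dataset in a Dataverse
    collection via the native API (hypothetical helper for illustration)."""
    url = f"{server_url.rstrip('/')}/api/dataverses/{dataverse_id}/datasets"
    headers = {
        "X-Dataverse-key": api_token,      # API token authentication
        "Content-Type": "application/json",
    }
    return url, headers

# Usage sketch (not executed here; requires network access and a real token):
# import json, requests
# with open("output/curated_output.json") as fh:
#     payload = json.load(fh)
# url, headers = build_upload_request(
#     "https://demo.dataverse.org", "YOUR_TOKEN", "mydataverse_alias")
# resp = requests.post(url, headers=headers, json=payload)
```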
Owner
- Name: SimTech-RDM
- Login: SimTech-Research-Data-Management
- Kind: organization
- Repositories: 1
- Profile: https://github.com/SimTech-Research-Data-Management
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: harvester-curator
message: >-
If you use this software, please cite it using the
metadata from this file.
type: software
authors:
- given-names: Sarbani
family-names: Roy
email: sarbani.roy@simtech.uni-stuttgart.de
- given-names: FangFang
family-names: Wang
email: fangfang.wang@simtech.uni-stuttgart.de
identifiers:
- type: url
value: >-
https://github.com/SimTech-Research-Data-Management/harvester-curator
repository-code: >-
https://github.com/SimTech-Research-Data-Management/harvester-curator
abstract: >-
Harvester-Curator is a tool, designed to elevate metadata
provision in data repositories. In the first phase,
Harvester-Curator acts as a scanner, navigating through
user code and/or data repositories to identify suitable
parsers for different file types. It collects metadata
from each of the files by applying corresponding parsers
and then compiles this information into a structured JSON
file, providing researchers with a seamless and automated
solution for metadata collection. Moving to the second
phase, Harvester-Curator transforms into a curator,
leveraging the harvested metadata to populate metadata
fields in a target repository. By automating this process,
it not only relieves researchers of the manual burden but
also ensures the accuracy and comprehensiveness of the
metadata. Beyond its role in streamlining the intricate
task of metadata collection, this tool contributes to the
broader objective of elevating data accessibility and
interoperability within repositories.
keywords:
- Automation
- Metadata
- Harvester
- Curator
license: MIT
CodeMeta (codemeta.json)
{
"name": "harvester-curator",
"version": "0.0.3",
"description": "Harvester-Curator is a tool, designed to elevate metadata provision in data repositories. In the first phase, Harvester-Curator acts as a scanner, navigating through user code and/or data repositories to identify suitable parsers for different file types. It collects metadata from each of the files by applying corresponding parsers and then compiles this information into a structured JSON file, providing researchers with a seamless and automated solution for metadata collection. Moving to the second phase, Harvester-Curator transforms into a curator, leveraging the harvested metadata to populate metadata fields in a target repository. By automating this process, it not only relieves researchers of the manual burden but also ensures the accuracy and comprehensiveness of the metadata. Beyond its role in streamlining the intricate task of metadata collection, this tool contributes to the broader objective of elevating data accessibility and interoperability within repositories.",
"keywords": [
"Automation",
"Metadata",
"Harvester",
"Curator"
],
"license": "MIT",
"contributors": [
{
"name": "Sarbani Roy",
"affiliation": "University of Stuttgart",
"email": "sarbani.roy@simtech.uni-stuttgart.de"
},
{
"name": "FangFang Wang",
"affiliation": "University of Stuttgart",
"email": "fangfang.wang@simtech.uni-stuttgart.de"
}
],
"dateCreated": "2023-04-21",
"codeRepository": "https://github.com/SimTech-Research-Data-Management/harvester-curator",
"programmingLanguage": [
"Python"
],
"dependencies": [
{
"name": "cffconvert",
"version": "2.0.0"
},
{
"name": "PyYAML",
"version": "6.0.1"
},
{
"name": "h5py",
"version": "3.11.0"
},
{
"name": "pyvista",
"version": "0.43.3"
},
{
"name": "vtk",
"version": "9.3.0"
},
{
"name": "python-magic",
"version": "0.4.27"
},
{
"name": "easyDataverse",
"version": "0.4.0"
}
]
}
GitHub Events
Total
- Push event: 6
Last Year
- Push event: 6