h5rdmtoolbox
Supporting a FAIR Research Data lifecycle using Python and HDF5.
Science Score: 59.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in README
- ✓ Academic publication links: Links to nature.com, zenodo.org
- ✓ Committers with academic emails: 2 of 4 committers (50.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.8%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: matthiasprobst
- License: mit
- Language: Python
- Default Branch: main
- Homepage: https://h5rdmtoolbox.readthedocs.io/en/latest/
- Size: 63.1 MB
Statistics
- Stars: 17
- Watchers: 1
- Forks: 1
- Open Issues: 1
- Releases: 27
Metadata Files
README.md
HDF5 Research Data Management Toolbox
Note that the project is still under development!
The "HDF5 Research Data Management Toolbox" (h5RDMtoolbox) is a Python package that supports everybody working with HDF5 in achieving a sustainable data lifecycle which follows the FAIR (Findable, Accessible, Interoperable, Reusable) principles. It specifically supports the five main steps of planning, collecting, analyzing, sharing and reusing data. Please visit the documentation for detailed information or try the quickstart using Colab.
Highlights
- Combining HDF5 and xarray to allow easy access to metadata and data during analysis and processing.
- Assigning metadata with "globally unique and persistent identifiers" as required by F1 of the FAIR principles. This can be achieved by using RDF triples, which removes "ambiguity in the meaning of your published data".
- Defining standard attributes through conventions and requiring users to provide certain attributes in their HDF5 files, such as units and a description.
- Uploading HDF5 files directly to repositories like Zenodo or using them with NoSQL databases like mongoDB.
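The idea of required standard attributes can be illustrated with a minimal, library-independent sketch. It is not the toolbox's API: the function name `missing_standard_attributes` and the plain dictionaries standing in for HDF5 attributes are assumptions made only for illustration.

```python
# Hypothetical illustration of "standard attributes"; this is NOT the h5RDMtoolbox API.
# Plain dictionaries stand in for the attributes of HDF5 datasets.
REQUIRED_ATTRIBUTES = ("units", "description")


def missing_standard_attributes(attrs, required=REQUIRED_ATTRIBUTES):
    """Return the required attribute names that are not present in attrs."""
    return [name for name in required if name not in attrs]


velocity_attrs = {"units": "m/s", "description": "Streamwise velocity"}
pressure_attrs = {"units": "Pa"}

print(missing_standard_attributes(velocity_attrs))  # []
print(missing_standard_attributes(pressure_attrs))  # ['description']
```

A convention in this sense is just a named set of such requirements that is checked whenever a dataset is created.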
Who is the package for?
For everybody who ...
- ... is looking for a management approach for their data.
- ... works in a community that has not yet established a stable convention.
- ... works with small or big data that fits into HDF5 files.
- ... is looking for an easy way to work with HDF5, especially through Jupyter Notebooks.
- ... is trying to integrate HDF5 with repositories and databases.
- ... wishes to enrich data semantically with the RDF standard.
- ... is looking for a way to do all of the above without needing to learn a new syntax.
- ... is new to HDF5 and wants to learn about it, especially with respect to the FAIR principles and data management.
Who is it not for?
For everybody who ...
- ... is looking for a management approach which at the same time allows high-performance and/or parallel work with HDF5.
- ... already has well-established conventions and management approaches in their community.
Package Architecture/structure
The toolbox implements six modules, which are shown below. The numbers refer to their main usage in the stages of
the data lifecycle above. The wrapper module implements the main interface between the user and the HDF5 file. It
extends the features of the underlying h5py library. Some of the features are implemented in other modules, hence the
wrapper module depends on the convention, database and linked data (ld) modules.
Current implementation highlights in the modules:
- The wrapper module adds functionality on top of the h5py package. It allows including so-called standard names, which are defined in conventions. It also implements interfaces, such as to the package xarray, which allows carrying metadata from HDF5 to the user. Other high-level interfaces like .rdf allow assigning semantic information to the HDF5 file.
- For the database module, hdfDB and mongoDB are implemented. The hdfDB module allows using HDF5 files as a database. The mongoDB module allows using mongoDB as a database by mapping the metadata of HDF5 files to the database.
- For the repository module, a Zenodo interface is implemented. Zenodo is a repository that allows uploading and downloading data with a persistent identifier.
- For the convention module, the standard attributes are implemented.
- The layout module allows defining expectations on the internal layout (object names, location, attributes, properties) of HDF5 files.
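To make the idea of mapping HDF5 metadata to a document database more concrete, here is a small sketch under stated assumptions. It does not use the toolbox or pymongo: the nested dictionary `example_tree` stands in for the groups and datasets of an HDF5 file, and `flatten_metadata` is a hypothetical helper that produces one flat document per object, which is roughly the shape a document store expects.

```python
# Hypothetical sketch; neither the tree format nor the function is part of h5RDMtoolbox.
def flatten_metadata(tree, prefix=""):
    """Turn a nested description of an HDF5 file into a list of flat documents.

    Each group or dataset becomes one dictionary holding its path and attributes.
    """
    documents = []
    for name, node in tree.items():
        path = f"{prefix}/{name}"
        documents.append({"path": path, "attrs": dict(node.get("attrs", {}))})
        # Recurse into child objects, if any.
        documents.extend(flatten_metadata(node.get("children", {}), path))
    return documents


example_tree = {
    "flow": {
        "attrs": {"description": "Flow field"},
        "children": {
            "u": {"attrs": {"units": "m/s"}},
        },
    },
}

docs = flatten_metadata(example_tree)
for doc in docs:
    print(doc)
```

Documents of this form could then be queried by attribute (for example, all objects with units "m/s") without opening the original file.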
Quickstart
A quickstart notebook can be tested by clicking on the following badge:
Documentation
Please find comprehensive documentation with many examples here, or click on the image, which shows the research data lifecycle in the center and the respective toolbox features on the outside:
A paper is published in the journal ing.grid.
Installation
Use Python 3.8 or higher (automatic testing is performed up to 3.12). If you are a regular user, you can install the package via pip:
pip install h5RDMtoolbox
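One way to confirm that the installation succeeded is to ask Python for the installed distribution version. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_version` is an assumption for illustration.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("h5RDMtoolbox"))
```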
Install from source:
Developers may clone the repository and install the package from source. Clone the repository first:
git clone https://github.com/matthiasprobst/h5RDMtoolbox.git
Then, run
pip install h5RDMtoolbox/
Add --user if you do not have root access.
For development installation run
pip install -e h5RDMtoolbox/
Dependencies
The core functionality depends on the following packages. Some of them are for general management, others are very specific to the features of the package:
General dependencies are ...
- numpy>=1.20: Scientific computing, handling of arrays
- matplotlib>=3.5.2: Plotting
- appdirs>=1.4.4: Managing user and application directories
- packaging: Version handling
- IPython>=8.4.0: Pretty display of data in notebooks
- regex>=2020.7.9: Working with regular expressions
Specific to the package are ...
- h5py=3.7.0: HDF5 file interface
- xarray>=2022.3.0: Working with scientific arrays in combination with attributes. Allows carrying metadata from HDF5 to user
- pint>=0.19.2: Allows working with units
- pint_xarray>=0.2.1: Working with units for usage with xarray
- python-forge==18.6.0: Used to update function signatures when using the standard attributes
- pydantic: Used to validate standard attributes
- pyyaml>6.0.0: Reading and writing of yaml files, e.g. metadata definitions (conventions). Note, lower versions collide with python 3.11
- requests: Used to download files from the internet or validate URLs, e.g. metadata definitions (conventions)
- rdflib: Used to enable working with RDF
- ontolutils: Required to work with RDF and derive semantic description of HDF5 file content
Optional dependencies
To run unit tests or to enable certain features, additional dependencies must be installed.
Install optional dependencies by specifying them in square brackets after the package name, e.g.:
pip install h5RDMtoolbox[mongodb]
[mongodb]
pymongo>=4.2.0: Database solution for HDF5 files
[csv]
pandas>=1.4.3: Mainly used for reading csv and pretty printing
[snt]
- xmltodict: Reading of xml files
- tabulate>=0.8.10: Pretty printing of tables
- python-gitlab: Access to gitlab repositories
- pypandoc>=2.3: Conversion of markdown files to html
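To see which of the optional extras are usable in the current environment, one can check whether their modules are importable. The following is a sketch using only the standard library; the mapping `OPTIONAL_MODULES` and the helper `available_extras` are assumptions for illustration, and the import names (for example `gitlab` for python-gitlab) are the module names those packages expose.

```python
from importlib.util import find_spec

# Import names of the packages listed under each extra (assumed mapping for illustration).
OPTIONAL_MODULES = {
    "mongodb": ["pymongo"],
    "csv": ["pandas"],
    "snt": ["xmltodict", "tabulate", "gitlab", "pypandoc"],
}


def available_extras(extras=OPTIONAL_MODULES):
    """Map each extra to True if all of its modules can be found, else False."""
    return {
        extra: all(find_spec(module) is not None for module in modules)
        for extra, modules in extras.items()
    }


for extra, ok in available_extras().items():
    print(f"[{extra}] {'available' if ok else 'missing'}")
```

The check only looks up the modules without importing them, so it is cheap to run before enabling a feature.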
Citing the package
If you intend to use the package in your work, you may cite the software itself as published in the Zenodo repository. A related paper is published in the journal ing.grid. Thank you!
Here is the BibTeX entry for the paper:
@article{probst2024h5rdmtoolbox,
author = {Matthias Probst and Balazs Pritz},
title = {h5RDMtoolbox - A Python Toolbox for FAIR Data Management around HDF5},
volume = {2},
year = {2024},
url = {https://www.inggrid.org/article/id/4028/},
issue = {1},
doi = {10.48694/inggrid.4028},
month = {8},
keywords = {Data management,HDF5,metadata,data lifecycle,Python,database},
issn = {2941-1300},
publisher = {Universitäts- und Landesbibliothek Darmstadt},
journal = {ing.grid}
}
Contribution
Feel free to contribute. Make sure to write docstrings for your methods and classes, and follow PEP 8 (https://peps.python.org/pep-0008/).
Please write tests for your code and put them into the test/ folder. Visit the README file in the
test/ folder for more information.
Please also add a Jupyter notebook in the docs/ folder in order to document your code. Visit
the README file in the docs/ folder for more information on how to compile the documentation.
Please use the numpy style for the docstrings: https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html#example-numpy
Owner
- Name: MatthiasProbst
- Login: matthiasprobst
- Kind: user
- Location: Karlsruhe
- Company: Karlsruhe Institute of Technology
- Repositories: 3
- Profile: https://github.com/matthiasprobst
I have fun programming with Python, whether it is for scientific or private projects. Most of my repos are related to fluid mechanics or data management.
CodeMeta (codemeta.json)
{
"@context": "https://doi.org/10.5063/schema/codemeta-2.0",
"@type": "SoftwareSourceCode",
"license": "https://spdx.org/licenses/MIT",
"codeRepository": "git+https://github.com/matthiasprobst/h5RDMtoolbox.git",
"name": "h5RDMtoolbox",
"version": "2.5.1",
"description": "Supporting a FAIR Research Data lifecycle using Python and HDF5.",
"applicationCategory": "Engineering",
"programmingLanguage": [
"Python 3",
"Python 3.9",
"Python 3.10",
"Python 3.11",
"Python 3.12"
],
"operatingSystem": [
"Linux",
"Windows",
"macOS"
],
"author": [
{
"@type": "Person",
"@id": "https://orcid.org/0000-0001-8729-0482",
"givenName": "Matthias",
"familyName": "Probst",
"email": "matth.probst@gmail.com",
"affiliation": {
"@type": "Organization",
"@id": "https://ror.org/04t3en479",
"name": "Karlsruhe Institute of Technology, Institute of Thermal Turbomachinery"
}
},
{
"@type": "Person",
"@id": "https://orcid.org/0000-0002-4116-0065",
"givenName": "Lucas",
"familyName": "Büttner",
"affiliation": {
"@type": "Organization",
"@id": "https://ror.org/04t3en479",
"name": "Karlsruhe Institute of Technology, Institute of Thermal Turbomachinery"
}
},
{
"@type": "Person",
"@id": "https://orcid.org/0000-0001-9560-500X",
"givenName": "Balazs",
"familyName": "Pritz",
"affiliation": {
"@type": "Organization",
"@id": "https://ror.org/04t3en479",
"name": "Karlsruhe Institute of Technology, Institute of Thermal Turbomachinery"
}
}
]
}
GitHub Events
Total
- Create event: 20
- Commit comment event: 3
- Issues event: 2
- Release event: 11
- Watch event: 4
- Delete event: 3
- Issue comment event: 4
- Push event: 112
- Pull request event: 16
- Fork event: 1
Last Year
- Create event: 20
- Commit comment event: 3
- Issues event: 2
- Release event: 11
- Watch event: 4
- Delete event: 3
- Issue comment event: 4
- Push event: 112
- Pull request event: 16
- Fork event: 1
Committers
Last synced: almost 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Matthias Probst | m****t@g****m | 756 |
| Probst | m****t@k****u | 143 |
| lucasbuettner | 1****r | 4 |
| uyecw | u****w@s****u | 1 |
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 9
- Total pull requests: 17
- Average time to close issues: 25 days
- Average time to close pull requests: 1 day
- Total issue authors: 2
- Total pull request authors: 1
- Average comments per issue: 1.33
- Average comments per pull request: 0.06
- Merged pull requests: 15
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 16
- Average time to close issues: 4 days
- Average time to close pull requests: 1 day
- Issue authors: 1
- Pull request authors: 1
- Average comments per issue: 4.0
- Average comments per pull request: 0.0
- Merged pull requests: 14
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- matthiasprobst (7)
- siferati (1)
Pull Request Authors
- matthiasprobst (18)
Dependencies
- actions/checkout v2 composite
- actions/setup-python v2 composite
- pytest >=7.1.2 development
- pytest-cov * development
- jupyterlab *
- myst-nb *
- scikit-image *
- scikit-learn *
- sphinx >=4.5.0
- sphinx-copybutton *
- sphinx-design *
- sphinx_book_theme *
- sphinxcontrib-bibtex *
- IPython >=8.4.0
- appdirs >=1.4.4
- h5py >3.7.0
- matplotlib >=3.5.2
- numpy >=1.20,<1.23.0
- opencv-python >=4.5.3.56
- packaging *
- pandas >=1.4.3
- pco_tools >=1.0.0
- pint ==0.21.1
- pint_xarray >=0.2.1
- pydantic >=2.3.0
- pymongo >=4.2.0
- pypandoc >=1.11
- python-forge ==18.6.0
- python-gitlab *
- pyyaml *
- regex >=2020.7.9
- requests *
- tabulate >=0.8.10
- tqdm >=4.64.0
- xarray >=2022.3.0
- xmltodict *
- zenodo_search ==0.1.0