nds-lucid-ingestion

Watch a folder to automatically chunk and validate incoming RDF data bundles

https://github.com/sdsc-ordes/nds-lucid-ingestion

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 10 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (17.9%) to scientific vocabulary

Keywords

pipeline, rdf, validation
Last synced: 6 months ago

Repository

Watch a folder to automatically chunk and validate incoming RDF data bundles

Basic Info
  • Host: GitHub
  • Owner: sdsc-ordes
  • License: MIT
  • Language: Nextflow
  • Default Branch: main
  • Homepage:
  • Size: 34.8 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Topics
pipeline, rdf, validation
Created over 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme · License · Citation

README.md

ingestion


Copyright © 2023-2025 SDSC - Swiss Data Science Center.
Licensed under the MIT License - see the LICENSE file for details.
Funded by SPHN and PHRT.

Context

This repository includes "digital infrastructure" code from a Swiss National Data Stream (NDS): LUCID. The general goal of the NDS initiative is to collect clinical data across five Swiss University Hospitals and share it with researchers. In the case of LUCID, research focuses on low-value care: services that provide little or no benefit to patients. If you're interested, check the official project page.

Digital infrastructure for the LUCID project is also available in these repositories.

Code overview

This repository provides an automated pipeline for RDF data validation with two flows:

  • Success: upon successful validation, data is provided in the output folder
  • Failure: upon unsuccessful validation, a report is generated in the notification folder

The code was first built around the BioMedIT environment, but to allow reuse, most software and tools rely on public containers, so the pipeline can be tested on any machine with few requirements (see the Requirements section).

The sections below provide more technical details about the pipeline, its implementation and use.
For any questions, feel free to open an issue or contact us directly.

Workflow framework

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Podman containers, making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process, which makes it much easier to maintain and update software dependencies.
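
For illustration, the per-process container setup described above is typically declared in nextflow.config along these lines; the image names and the batch-size parameter below are placeholders rather than the pipeline's actual values:

podman.enabled = true

process {
    // one container per process keeps software dependencies isolated and easy to update
    withName: 'validation' {
        container = 'docker.io/example/shacl-validator:latest'   // placeholder image
    }
    withName: 'nt_converter' {
        container = 'docker.io/example/rdf-tools:latest'         // placeholder image
    }
}

params {
    batch_size = 10   // hypothetical name for the batch size configured in nextflow.config
}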

Requirements

By default, the workflow assumes that:
  • Nextflow is installed (>=22.10.1)
  • Podman is installed for full pipeline reproducibility (tested on version 4.9.4)
  • Basic UNIX utilities are installed: gzip, cat, unzip, md5sum and fdfind

With the biomedit profile, in addition to the points above, the workflow assumes that:
  • sett-rs is installed with the command-line interface available (sett-cli, tested on version 5.3.0)
  • A Nextflow secret SETT_OPENPGP_KEY_PWD is set to provide the secret OpenPGP key used to decrypt data
  • jq is installed and available (>=1.6)
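
As an illustration of the SETT_OPENPGP_KEY_PWD requirement, Nextflow's built-in secrets store can hold the passphrase; the value below is only a placeholder:

nextflow secrets set SETT_OPENPGP_KEY_PWD 'your-openpgp-key-passphrase'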

See usage instructions for more information.

Pipeline summary

  1. Check for new zip files or rerun the workflow on current zip files in the source directory
  2. Decrypt and decompress files to extract datasets
    • With the biomedit profile, metadata is extracted and used to rename the datasets' directory
  3. Create batches of data with sizes defined in nextflow.config
  4. One by one, each data batch is bundled with external terminologies and validated using SPHN SHACL rules:
    a. For invalid or empty batches: create a file with datasets and their error in a notification folder
    b. For valid batches: continue to steps 5-7
  5. Convert valid datasets to N-Triples (.nt) format
  6. Compress all .nt files with gzip
  7. Move the gzipped files to the output directory

Mermaid flowchart of the pipeline:

flowchart TD
    input_dir(input_dir)
    sett_unpack[sett_unpack]
    patient_data(patient_data)
    bundled_batches(bundled_batches)
    val[validation]
    report(report.ttl)
    exit(exitStatus)
    copy[copy_to_output]
    compress[compress]
    output_dir[output_dir]
    notification[send_notification]
    input_dir -->|ch_sett_pkg| check_integrity
    check_integrity --> get_sett_metadata
    check_integrity --> unpack
    unpack --> patient_data
    subgraph biomedit
        get_sett_metadata --> sett_unpack
    end
    subgraph config
        input_dir
        output_dir
        SPHN_SHACL_shapes
        SPHN_schema
        terminologies
    end
    output_dir --> sett_unpack
    output_dir --> unpack
    sett_unpack --> patient_data
    patient_data --> |batching| bundled_batches
    SPHN_SHACL_shapes --> val
    terminologies --> |nt converter| enriched_terms
    SPHN_schema --> |nt_converter| enriched_terms
    enriched_terms --> bundled_batches
    val --> report
    val --> exit
    exit --> |if !=0| notification
    exit --> |else| nt_converter
    nt_converter --> compress
    compress --> copy
    bundled_batches --> |if empty| notification
    bundled_batches --> val
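
A minimal DSL2 sketch of the branching above, assuming a validation step that records its exit status rather than failing the task; process names, commands, and channel contents are illustrative and do not mirror main.nf exactly:

nextflow.enable.dsl = 2

// Placeholder validation step: runs a (hypothetical) SHACL check and records
// its exit code so the workflow can branch on it instead of failing the task.
process validation {
    input:
    path batch

    output:
    tuple path(batch), env(STATUS)

    script:
    """
    STATUS=0
    run_shacl_check ${batch} || STATUS=\$?
    """
}

// Placeholder notification for invalid or empty batches.
process send_notification {
    input:
    tuple path(batch), val(status)

    output:
    path 'notification.txt'

    script:
    """
    echo "batch ${batch} failed validation (exit ${status})" > notification.txt
    """
}

// Placeholder conversion of a valid batch to N-Triples.
process nt_converter {
    input:
    tuple path(batch), val(status)

    output:
    path 'converted.nt'

    script:
    """
    touch converted.nt
    """
}

workflow {
    ch_batches = Channel.fromPath("${params.input_dir}/*.zip")

    validation(ch_batches)
        .branch {
            valid:   it[1] == '0'   // env outputs are strings
            invalid: true
        }
        .set { routed }

    send_notification(routed.invalid)
    nt_converter(routed.valid)
}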

Quick Start

See usage docs for all of the available options when running the pipeline.

  1. Download the pipeline and test it on a minimal dataset with a single command:

nextflow run main.nf -profile standard,test

Note that some form of configuration will be needed so that Nextflow knows how to fetch the required software. This is usually done in the form of a config profile (test in the example command above). You can chain multiple config profiles in a comma-separated string.
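
For illustration, profiles such as standard, test and biomedit are typically declared in nextflow.config with a block of this shape; the settings shown are placeholders, and the real values live in the pipeline's configuration files:

profiles {
    standard {
        podman.enabled = true
    }
    test {
        params.input_dir = 'tests/data'   // placeholder path to bundled test data
    }
    biomedit {
        params.rerun = false              // placeholder default, overridden with --rerun=true
    }
}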

  2. Start running your own analysis!

nextflow run main.nf -profile standard,test --input_dir /data/source --output_dir /data/target --notification_dir /data/notification --shapes /data/shapes

Production use

To use ingestion within the BioMedIT system, we advise pointing the Nextflow working directory to a folder on a separate partition with sufficient space and appropriate permissions. To have the pipeline constantly monitor for incoming data:

nextflow run main.nf -profile biomedit -w /data/work/

To re-run the pipeline on already landed data:

nextflow run main.nf -profile biomedit -w /data/work/ --rerun=true
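
If preferred, the Nextflow working directory can also be pinned in configuration instead of being passed with -w on every run; a one-line example, with an illustrative path:

workDir = '/data/work'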

Credits

nds-lucid/ingestion was originally written by Stefan Milosavljevic and Cyril Matthey-Doret.

To cite this work, use the citation information from the GitHub sidebar or the Zenodo DOI record, or the APA-style citation below:

Milosavljevic, S., Matthey-Doret, C., & Riba Grognuz, O. (2025). LUCID BioMedIT Ingestion Pipeline. Zenodo. https://doi.org/10.5281/zenodo.14726408

References

This pipeline uses code developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Owner

  • Name: Swiss Data Science Center - ORD
  • Login: sdsc-ordes
  • Kind: organization
  • Location: Switzerland

Open Research Data team at the Swiss Data Science Center.

Citation (CITATION.cff)

cff-version: 1.2.0
title: LUCID BioMedIT Ingestion Pipeline
message: 'If you use this software, please cite it using the metadata below.'
type: software
authors:
  - given-names: Stefan
    family-names: Milosavljevic
    affiliation: 'Swiss Data Science Center'
    orcid: 'https://orcid.org/0000-0002-9135-1353'
  - given-names: 'Cyril'
    family-names: Matthey-Doret
    affiliation: 'Swiss Data Science Center'
    orcid: 'https://orcid.org/0000-0002-1126-1535'
  - family-names: Riba Grognuz
    given-names: Oksana
    affiliation: 'Swiss Data Science Center'
    orcid: 'https://orcid.org/0000-0002-2961-2655'
identifiers:
  - type: doi
    value: 10.5281/zenodo.14726408
    description: 'This DOI represents all versions, and will always resolve to the latest one.'
repository-code: 'https://github.com/sdsc-ordes/nds-lucid-ingestion'
license: MIT

GitHub Events

Total
  • Release event: 1
  • Member event: 1
  • Push event: 5
  • Public event: 1
  • Create event: 1
Last Year
  • Release event: 1
  • Member event: 1
  • Push event: 5
  • Public event: 1
  • Create event: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 7 days
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 2.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 7 days
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 2.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • supermaxiste (1)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

environment.yml (pypi)