https://github.com/bencardoen/datacurator.jl

A scalable Julia package to transparently validate and transform large biomedical datasets using human readable recipes that are translated to machine verifiable templates.

Keywords

julia julia-package portable postprocessing preprocessing reproducible-research scalability

Repository

Basic Info

Host: GitHub
Owner: bencardoen
License: agpl-3.0
Language: Julia
Default Branch: main
Homepage:
Size: 52.6 MB

Statistics

Stars: 7
Watchers: 2
Forks: 1
Open Issues: 6
Releases: 0

Created almost 4 years ago · Last pushed over 1 year ago

DataCurator

A multithreaded package to validate, curate, and transform large heterogeneous datasets using reproducible recipes, which can be written either in human-readable TOML or in Julia.

A key aim of this package is that any researcher can read and write recipes without needing to write code, making data sharing and validation faster, more accurate, and reproducible.

DataCurator is a Swiss army knife that ensures:
- pipelines can focus on the algorithm/problem solving
- human readable "recipes" for future reproducibility
- validation of huge datasets at high speed
- out-of-the-box operation without the need for code or dependencies

DataCurator requires a command-line interface and is supported on Linux, Windows Subsystem for Linux (WSL2), and MacOS. See Quickstart and Installation for details.

Quickstart via Singularity

The recommended way to use DataCurator is via the Singularity container. Note that this is only supported on Linux, Windows Subsystem for Linux (WSL2), and MacOS (x86). For ARM-based Macs (e.g. from early 2021 onward), use the Docker container or the source code; see Installation for details.

1. Install Singularity

Linux or Windows + WSL

    wget https://github.com/apptainer/singularity/releases/download/v3.8.7/singularity-container_3.8.7_amd64.deb
    sudo apt-get install ./singularity-container_3.8.7_amd64.deb

MacOS (x86)

Please refer to the Singularity docs.

1.1 Test Singularity works

After installation, test it by running the following in a terminal:

    singularity --version

This should print singularity version 3.8.7.
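The check above can also be scripted so that a missing install produces a clear message instead of a "command not found" error. This is a minimal sketch; `have_tool` is a hypothetical helper name and not part of Singularity or DataCurator.

```shell
# have_tool is a hypothetical helper: succeeds if the named command is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool singularity; then
    # Prints the installed version, e.g. the 3.8.7 release installed in step 1.
    singularity --version
else
    echo "singularity not found on PATH; revisit step 1" >&2
fi
```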

2. Download the DataCurator container

    singularity pull datacurator.sif library://bcvcsert/datacurator/datacurator:latest

The container image can also be found at Sylabs.

3. Set executable

    chmod u+x ./datacurator.sif

Depending on the directory you're in, you may need to grant Singularity read/write access. By default, Singularity has read/write access to $HOME and to no other directory.

    export SINGULARITY_BINDPATH=${PWD}
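Setting SINGULARITY_BINDPATH with a plain export replaces any value that was already there. If you work from several directories, a sketch like the following appends instead of overwriting. It assumes bind paths are given as a comma-separated list, and `add_bind_path` is a hypothetical helper, not something Singularity provides.

```shell
# add_bind_path is a hypothetical helper: append a directory to
# SINGULARITY_BINDPATH (comma-separated) unless it is already listed.
add_bind_path() {
    dir="$1"
    case ",${SINGULARITY_BINDPATH:-}," in
        *",${dir},"*) ;;  # already present, nothing to do
        *) SINGULARITY_BINDPATH="${SINGULARITY_BINDPATH:+${SINGULARITY_BINDPATH},}${dir}" ;;
    esac
    export SINGULARITY_BINDPATH
}

# Grant access to the current working directory.
add_bind_path "${PWD}"
echo "SINGULARITY_BINDPATH=${SINGULARITY_BINDPATH}"
```

Calling the helper twice with the same directory leaves the variable unchanged, so it is safe to put in a shell startup file.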

4. Test DataCurator with a minimal example

Copy the example recipe

    wget https://raw.githubusercontent.com/bencardoen/DataCurator.jl/main/example_recipes/count.toml

Create test data

    mkdir testdir
    touch testdir/text.txt

Run

    ./datacurator.sif -r count.toml

That should show output similar to the following:

[Image: Results]

The recipe used can be found here.
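The commands in step 4 can be collected into one re-runnable script. The sketch below uses only the commands shown above; the function names `make_testdata` and `run_count_recipe` are hypothetical, and the run is skipped with a message if the container or recipe is not in the current directory.

```shell
# Create the test data from step 4 (mkdir -p makes this safe to re-run).
make_testdata() {
    mkdir -p testdir && touch testdir/text.txt
}

# Run the recipe only when both the container and the recipe are present.
run_count_recipe() {
    if [ -x ./datacurator.sif ] && [ -f count.toml ]; then
        ./datacurator.sif -r count.toml
    else
        echo "missing ./datacurator.sif or count.toml; see steps 2 to 4" >&2
        return 1
    fi
}

make_testdata
run_count_recipe || echo "run skipped" >&2
```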

What next? Check out two simple examples of use cases and TOML recipes, then move on to the large collection of well-commented example recipes or the complete walkthrough of DataCurator. Please see the documentation.

Status

The outcome of automated tests (including building on Mac OS & Debian docker image): [build status badge]

Code coverage (which parts of the source code are tested): [coverage badge]

Documentation

The full documentation includes more detailed installation docs, two simple examples of use cases and TOML recipes, well-commented example recipes, a complete walkthrough of DataCurator, and more.

What to find where

    repository
    ├── example_recipes     ## Start here for easy to copy example recipes
    ├── docs
    │   ├── src             ## Documentation in markdown format (viewable online as well)
    │   │   ├── make.jl     ## `cd docs && julia --project=.. make.jl` to rebuild docs
    ├── singularity         ## Singularity image instructions
    ├── src                 ## source code of the package itself
    ├── scripts             ## Utility scripts to run DC, generate test data, ...
    ├── test                ## test suite and related files
    ├── runjulia.sh         ## Required for Singularity image
    └── buildimage.sh       ## Rebuilds singularity image for you (Needs root !!)

Publication

    @article{10.1093/bioadv/vbad068,
        author = {Cardoen, Ben and Ben Yedder, Hanene and Lee, Sieun and Nabi, Ivan Robert and Hamarneh, Ghassan},
        title = "{DataCurator.jl: efficient, portable and reproducible validation, curation and transformation of large heterogeneous datasets using human-readable recipes compiled into machine-verifiable templates}",
        journal = {Bioinformatics Advances},
        volume = {3},
        number = {1},
        pages = {vbad068},
        year = {2023},
        month = {06},
        abstract = "{Large-scale processing of heterogeneous datasets in interdisciplinary research often requires time-consuming manual data curation. Ambiguity in the data layout and preprocessing conventions can easily compromise reproducibility and scientific discovery, and even when detected, it requires time and effort to be corrected by domain experts. Poor data curation can also interrupt processing jobs on large computing clusters, causing frustration and delays. We introduce DataCurator, a portable software package that verifies arbitrarily complex datasets of mixed formats, working equally well on clusters as on local systems. Human-readable TOML recipes are converted into executable, machine-verifiable templates, enabling users to easily verify datasets using custom rules without writing code. Recipes can be used to transform and validate data, for pre- or post-processing, selection of data subsets, sampling and aggregation, such as summary statistics. Processing pipelines no longer need to be burdened by laborious data validation, with data curation and validation replaced by human and machine-verifiable recipes specifying rules and actions. Multithreaded execution ensures scalability on clusters, and existing Julia, R and Python libraries can be reused. DataCurator enables efficient remote workflows, offering integration with Slack and the ability to transfer curated data to clusters using OwnCloud and SCP. Code available at: https://github.com/bencardoen/DataCurator.jl.}",
        issn = {2635-0041},
        doi = {10.1093/bioadv/vbad068},
        url = {https://doi.org/10.1093/bioadv/vbad068},
        eprint = {https://academic.oup.com/bioinformaticsadvances/article-pdf/3/1/vbad068/50693195/vbad068.pdf},
    }

Troubleshooting

If you run into a problem, please search the issues to see whether it has been encountered before. If not, please create a new issue and follow the templates for bugs and/or feature requests.

If DataCurator does not currently support your workflow, or not in the way you'd like, you can mention this too. In that case, please share a minimal example of your data so we can add a new test case once the feature is complete.

Dependencies

DataCurator relies heavily on existing Julia packages for specialized functionality:
- Images.jl
- DataFrames.jl
- CSV.jl
- RCall.jl
- PyCall.jl

Related software

Open Microscopy OMERO

Owner

Name: Ben Cardoen
Login: bencardoen
Kind: user
Location: Vancouver
Company: https://github.com/sfu-mial

Twitter: BenCardoen
Repositories: 29
Profile: https://github.com/bencardoen

PhD Student Computing Science @sfu-mial Simon Fraser University

GitHub Events

Total

Issues event: 1
Push event: 1

Last Year

Issues event: 1
Push event: 1

Committers

Last synced: 8 months ago

All Time

Total Commits: 783
Total Committers: 5
Avg Commits per committer: 156.6
Development Distribution Score (DDS): 0.057

Past Year

Commits: 22
Committers: 1
Avg Commits per committer: 22.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Ben Cardoen	2****n	738
slee04	4****4	26
haneneby	h**y@g**m	12
Pietro Monticone	3****e	6
Ghassan Hamarneh	h**h@g**m	1
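The committer figures can be cross-checked against this table. The sketch below assumes the Development Distribution Score is defined as 1 minus the top committer's share of all commits; that definition is an assumption here, not stated on this page, though it reproduces the reported values. `commit_stats` is a hypothetical helper name.

```shell
# commit_stats is a hypothetical helper: takes a space-separated list of
# per-committer commit counts and prints total, average, and DDS.
# DDS is ASSUMED to be 1 - (top committer's commits / total commits).
commit_stats() {
    echo "$1" | awk '{
        total = 0; top = 0
        for (i = 1; i <= NF; i++) {
            total += $i
            if ($i > top) top = $i
        }
        printf "total=%d avg=%.1f dds=%.3f\n", total, total / NF, 1 - top / total
    }'
}

# Commit counts copied from the Top Committers table.
commit_stats "738 26 12 6 1"   # prints: total=783 avg=156.6 dds=0.057
```

With a single committer, as in the Past Year figures, the same formula gives a DDS of 0.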

Issues and Pull Requests

Last synced: 8 months ago

All Time

Total issues: 25
Total pull requests: 1
Average time to close issues: about 1 month
Average time to close pull requests: about 1 hour
Total issue authors: 1
Total pull request authors: 1
Average comments per issue: 0.8
Average comments per pull request: 1.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 3
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Science Score: 26.0%
