scidataflow

Command line scientific data management tool

https://github.com/vsbuffalo/scidataflow

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.9%) to scientific vocabulary
Last synced: 6 months ago

Repository

Command line scientific data management tool

Basic Info
  • Host: GitHub
  • Owner: vsbuffalo
  • License: mit
  • Language: Rust
  • Default Branch: main
  • Homepage:
  • Size: 2.4 MB
Statistics
  • Stars: 226
  • Watchers: 5
  • Forks: 9
  • Open Issues: 5
  • Releases: 0
Created over 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme · License · Citation

README.md


SciDataFlow logo

SciDataFlow — Facilitating the Flow of Data in Science

SciDataFlow demo screencast

Problem 1: Have you ever wanted to reuse and build upon a research project's output or supplementary data, but can't find it?

SciDataFlow solves this issue by making it easy to unite a research project's data with its code. Often, code for open computational projects is managed with Git and stored on a site like GitHub. However, a lot of scientific data is too large to be stored on these sites, and instead is hosted by sites like Zenodo or FigShare.

Problem 2: Does your computational project have dozens or even hundreds of intermediate data files you'd like to keep track of? Do you want to see whether these files are changed by updates to your computational pipelines?

SciDataFlow also solves this issue by keeping a record of the necessary information to track when data is changed. This is stored alongside the information needed to retrieve data from and push data to remote data repositories. All of this is kept in a simple YAML "Data Manifest" (data_manifest.yml) file that SciDataFlow manages. This file is stored in the main project directory and meant to be checked into Git, so that Git commit history can be used to see changes to data. The Data Manifest is a simple, minimal, human and machine readable specification. But you don't need to know the specifics — the simple sdf command line tool handles it all for you.
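To give a sense of what this looks like, here is a hypothetical sketch of a Data Manifest. The field names below are illustrative, not the actual SciDataFlow specification; see the documentation for the real format.

```yaml
# Hypothetical sketch of a data_manifest.yml; field names are
# illustrative, not the actual SciDataFlow specification.
files:
  - path: data/recmap_hg38.tsv
    md5: 2b00042f7481c7b056c4b410d28f33cf
    size: 104857600
remotes:
  data/:
    service: zenodo
    deposition_id: 1234567
```

Because a manifest like this is small and plain-text, checking it into Git means any change to a tracked data file shows up as an ordinary diff in the commit history.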

The SciDataFlow manuscript has been published in Bioinformatics. If you use SciDataFlow, please consider citing it:

V. Buffalo, SciDataFlow: A Tool for Improving the Flow of Data through Science. 
Bioinformatics (2024), doi:10.1093/bioinformatics/btad754.

The BibTeX entry can be accessed by clicking "Cite this repository" on the right side of the main GitHub repository page.

Documentation

SciDataFlow has extensive documentation full of examples of how to use the various subcommands.

SciDataFlow's Vision

The larger vision of SciDataFlow is to change how data flows through scientific projects. The way scientific data is currently shared is fundamentally broken, which prevents the reuse of data that is the output of some smaller step in the scientific process. We call these scientific assets.

Scientific Assets are the output of some computational pipeline or analysis which has the following important characteristic: Scientific Assets should be reusable by everyone, and be reused by everyone. Being reusable means all other researchers should be able to quickly reuse a scientific asset (without having to spend hours trying to find and download data). Being reused by everyone means that using a scientific asset should be the best way to do something.

For example, if I lift over a recombination map to a new reference genome, that pipeline and output data should be a scientific asset. It should be reusable by everyone — we should not each be rewriting the same bioinformatics pipelines for routine tasks. Doing so creates three problems: (1) each reimplementation has an independent chance of errors, (2) it's a waste of time, and (3) there is no cumulative improvement of the output data. It's not an asset; the result of each reimplementation is a liability!

Lowering the barrier to reusing computational steps is one of SciDataFlow's main motivations. Each scientific asset should have a record of what computational steps produced output data, and with one command (sdf pull) it should be possible to retrieve all data outputs from that repository. If the user only wants to reuse the data, they can stop there — they have the data locally and can proceed with their research. If the user wants to investigate how the input data was generated, the code is right there too. If they want to try rerunning the computational steps that produced that analysis, they can do that too. Note that SciDataFlow is agnostic to this — by design, it does not tackle the hard problem of managing software versions, computational environments, etc. It can work alongside software (e.g. Docker or Singularity) that tries to solve that problem.

By lowering the barrier to sharing and retrieving scientific data, SciDataFlow hopes to improve the reuse of data.

Future Plans

In the long run, the SciDataFlow YAML specification would allow for recipe-like reuse of data. I would like to see, for example, a set of human genomics scientific assets on GitHub that are continuously updated and reused. Then, rather than a researcher beginning a project by navigating many websites for human genome annotation or data, they might do something like:

```console
$ mkdir -p new_adna_analysis/data/annotation
$ cd new_adna_analysis/data/annotation
$ git clone git@github.com:human_genome_assets/decode_recmap_hg38
$ (cd decode_recmap_hg38/ && sdf pull)
$ git clone git@github.com:human_genome_assets/annotation_hg38
$ (cd annotation_hg38 && sdf pull)
```

and so forth. Then, they may look at the annotation_hg38/ asset, find a problem, fix it, and issue a GitHub pull request. If the fix is merged, the maintainer would then just run sdf push --overwrite to push the updated data file to the data repository. The Scientific Asset is then updated for everyone to use and benefit from: all other researchers can instantly retrieve the new version with a mere sdf pull --overwrite.
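The update cycle described above can be sketched as follows. The flags are those mentioned in the text; the directory name is carried over from the hypothetical example above:

```console
# maintainer, after merging the pull request that fixes the data:
$ cd annotation_hg38/
$ sdf push --overwrite      # overwrite the file on the remote data repository

# any other researcher, to pick up the updated asset:
$ cd annotation_hg38/
$ sdf pull --overwrite      # overwrite local copies with the updated data
```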

Installing SciDataFlow

If you'd like to install the Rust Programming Language manually, see this page, which instructs you to run:

```console
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

Then, to install SciDataFlow, just run:

```console
$ cargo install scidataflow
```

To test, just try running sdf --help.

Reporting Bugs

If you are a user of SciDataFlow and encounter an issue, please submit an issue to https://github.com/vsbuffalo/scidataflow/issues!

Contributing to SciDataFlow

If you are a Rust developer, please contribute! Here are some great ways to get started (also check the TODO list below, or for TODOs in code!):

  • Write some API tests. See some of the tests in src/lib/api/zenodo.rs as an example.

  • Write some integration tests. See tests/test_project.rs for examples.

  • A cleaner error framework. Currently SciDataFlow uses anyhow, which works well, but it would be nice to have more specific error enums.

  • Improve the documentation!
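As a sketch of what the "more specific error enums" bullet might look like, here is a minimal hand-rolled example using only the standard library. The type and variant names are hypothetical, not part of the SciDataFlow codebase; a real implementation might derive these impls with the thiserror crate instead:

```rust
use std::fmt;

// Hypothetical error enum sketch; variant names are illustrative only.
#[derive(Debug)]
enum SdfError {
    ManifestNotFound(String),
    RemoteAuth { service: String },
}

impl fmt::Display for SdfError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SdfError::ManifestNotFound(path) => {
                write!(f, "no data manifest found at {}", path)
            }
            SdfError::RemoteAuth { service } => {
                write!(f, "authentication failed for remote service {}", service)
            }
        }
    }
}

// Implementing std::error::Error lets callers keep using `?` and
// anyhow::Result while matching on specific variants when needed.
impl std::error::Error for SdfError {}

fn main() {
    let err = SdfError::ManifestNotFound("data_manifest.yml".to_string());
    println!("{}", err);
}
```

Compared to a bare anyhow error, an enum like this lets the CLI match on the failure kind, for example to suggest re-running authentication only for `RemoteAuth` errors.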

Todo

  • [ ] sdf mv tests, within different directories.

Owner

  • Name: Vince Buffalo
  • Login: vsbuffalo
  • Kind: user
  • Location: Berkeley, CA
  • Company: UC Berkeley

Evolutionary geneticist at UC Berkeley, former bioinformatician. ♥s probability, statistics. Author of book Bioinformatics Data Skills.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Buffalo"
  given-names: "Vince"
  orcid: "https://orcid.org/0000-0003-4510-1609"
title: "SciDataFlow: A Tool for Improving the Flow of Data through Science"
version: 0.8.12
doi: http://dx.doi.org/10.1093/bioinformatics/btad754
date-released: 2024-01-05
url: "https://github.com/vsbuffalo/scidataflow/"

GitHub Events

Total
  • Issues event: 3
  • Watch event: 15
  • Delete event: 1
  • Issue comment event: 6
  • Push event: 11
  • Pull request review event: 1
  • Pull request event: 6
  • Create event: 2
Last Year
  • Issues event: 3
  • Watch event: 15
  • Delete event: 1
  • Issue comment event: 6
  • Push event: 11
  • Pull request review event: 1
  • Pull request event: 6
  • Create event: 2

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 18
  • Total pull requests: 12
  • Average time to close issues: about 1 month
  • Average time to close pull requests: about 1 month
  • Total issue authors: 11
  • Total pull request authors: 2
  • Average comments per issue: 2.5
  • Average comments per pull request: 0.17
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 3
  • Average time to close issues: N/A
  • Average time to close pull requests: 10 days
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.67
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • apraga (6)
  • mrvollger (2)
  • davidmasp (2)
  • Wesady (1)
  • Tomas-Pe (1)
  • nrminor (1)
  • jsalignon (1)
  • shdam (1)
  • andrewkennard (1)
  • BioGeek (1)
  • vsbuffalo (1)
Pull Request Authors
  • vsbuffalo (9)
  • apraga (6)
Top Labels
Issue Labels
enhancement (3) · upstream API bug (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cargo 11,380 total
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 8
  • Total maintainers: 1
crates.io: scidataflow

A command-line tool to manage scientific research project data.

  • Versions: 8
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 11,380 Total
Rankings
Dependent repos count: 30.2%
Dependent packages count: 31.5%
Average: 53.3%
Downloads: 98.3%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/rust.yml actions
  • actions/checkout v3 composite
Cargo.lock cargo
  • 300 dependencies
Cargo.toml cargo