https://github.com/bigbio/sdrf-pipelines

`sdrf-pipelines` is the official SDRF file validator and converts SDRF to pipeline configuration files


Science Score: 59.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: nature.com, acs.org
  • Committers with academic emails
    4 of 11 committers (36.4%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.7%) to scientific vocabulary

Keywords

mass-spectrometry maxquant msstats multiomics openms proteomics proteomics-data-analysis sdrf

Keywords from Contributors

nf-core nextflow sdrf-proteomics proteomics-experiments proteomics-datasets proteomics-community proteomexchange pride-metadata msrun-metadata mage-tab
Last synced: 5 months ago

Repository

`sdrf-pipelines` is the official SDRF file validator and converts SDRF to pipeline configuration files

Basic Info
  • Host: GitHub
  • Owner: bigbio
  • License: apache-2.0
  • Language: Python
  • Default Branch: dev
  • Homepage:
  • Size: 91.9 MB
Statistics
  • Stars: 20
  • Watchers: 9
  • Forks: 26
  • Open Issues: 33
  • Releases: 30
Topics
mass-spectrometry maxquant msstats multiomics openms proteomics proteomics-data-analysis sdrf
Created almost 6 years ago · Last pushed 5 months ago
Metadata Files
Readme Changelog Contributing License

README.md

sdrf-pipelines | SDRF Validator | SDRF Converter


Validate and convert SDRF files with sdrf-pipelines and its parse_sdrf CLI.

This is the official SDRF file validation tool and it can convert SDRF files to different workflow configuration files such as MSstats, OpenMS and MaxQuant.

Installation

```bash
pip install sdrf-pipelines
```
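
To try unreleased changes, you can also install directly from the repository. This is a minimal sketch, assuming a standard pip VCS install; note that `dev` is the repository's default branch:

```bash
# Install the development version straight from GitHub (dev is the default branch).
pip install "git+https://github.com/bigbio/sdrf-pipelines@dev"
```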

Validate SDRF files

You can validate an SDRF file by executing the following command:

```bash
parse_sdrf validate-sdrf --sdrf_file {here_the_path_to_sdrf_file}
```
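
For example, assuming you run the command from a clone of the repository, you can validate one of the bundled test files (the same file is referenced again in the converter examples below):

```bash
# Validate one of the SDRF files shipped in the repository's testdata folder.
parse_sdrf validate-sdrf --sdrf_file ./testdata/PXD000288.sdrf.tsv
```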

New JSON Schema-Based Validation

The SDRF validator now uses a YAML schema-based validation system that makes it easier to define and maintain validation rules. The new system offers several advantages:

Key Features

  1. YAML-Defined Schemas: All validation templates are defined in YAML files:

    • default.yaml - Common fields for all SDRF files (includes mass spectrometry fields)
    • human.yaml - Human-specific fields
    • vertebrates.yaml - Vertebrate-specific fields
    • nonvertebrates.yaml - Non-vertebrate-specific fields
    • plants.yaml - Plant-specific fields
    • cell_lines.yaml - Cell line-specific fields
    • disease_example.yaml - Example schema for disease terms with multiple ontologies
  2. Enhanced Ontology Validation:

    • Support for multiple ontologies per field
    • Rich error messages with descriptions and examples
    • Special value handling for "not available" and "not applicable"
  3. Schema Inheritance: Templates can extend other templates, making it easy to create specialized validation rules.

Example JSON Schema

json { "name": "characteristics_cell_type", "sdrf_name": "characteristics[cell type]", "description": "Cell type", "required": true, "validators": [ { "type": "whitespace", "params": {} }, { "type": "ontology", "params": { "ontologies": ["cl", "bto", "clo"], "allow_not_applicable": true, "allow_not_available": true, "description": "The cell type should be a valid Cell Ontology term", "examples": ["hepatocyte", "neuron", "fibroblast"] } } ] }

Simplified Validation Command

A simplified validation command is also available:

```bash
parse_sdrf validate-sdrf-simple {here_the_path_to_sdrf_file} --template {template_name}
```

This command provides a more straightforward interface for validating SDRF files, without the additional options for skipping specific validations.
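
For example, a minimal sketch assuming template names match the schema file names listed above (e.g. `human` for `human.yaml`):

```bash
# Validate against the human template; the template name is assumed to match
# the schema file name without the .yaml extension.
parse_sdrf validate-sdrf-simple ./testdata/PXD000288.sdrf.tsv --template human
```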

Creating Custom Validation Templates

You can create your own validation templates by defining JSON schema files. Here's how:

  1. Create a JSON file with your validation rules:

```json
{
  "name": "my_template",
  "description": "My custom template",
  "extends": "default",
  "min_columns": 7,
  "fields": [
    {
      "name": "characteristics_my_field",
      "sdrf_name": "characteristics[my field]",
      "description": "My custom field",
      "required": true,
      "validators": [
        { "type": "whitespace", "params": {} },
        {
          "type": "ontology",
          "params": {
            "ontology_name": "my_ontology",
            "allow_not_applicable": true,
            "description": "My field description",
            "examples": ["example1", "example2"]
          }
        }
      ]
    }
  ]
}
```

  2. Place the file in the sdrf_pipelines/sdrf/schemas/ directory.

  3. Use your template with the validation command:

```bash
parse_sdrf validate-sdrf --sdrf_file {path_to_sdrf_file} --template my_template
```

The template system supports inheritance, so you can extend existing templates to add or override fields.
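
Putting the three steps together, a minimal end-to-end sketch might look like this. The schema keys simply mirror the JSON example in step 1, and locating the installed package with a small Python one-liner is a convenience assumption, not part of the tool itself:

```bash
# Locate the installed sdrf_pipelines package so the template can be dropped
# into its schemas directory (step 2 above).
PKG_DIR="$(python -c 'import os, sdrf_pipelines; print(os.path.dirname(sdrf_pipelines.__file__))')"

# Write a minimal custom template that extends the default one (step 1 above).
cat > "${PKG_DIR}/sdrf/schemas/my_template.json" <<'EOF'
{
  "name": "my_template",
  "description": "My custom template",
  "extends": "default",
  "fields": [
    {
      "name": "characteristics_my_field",
      "sdrf_name": "characteristics[my field]",
      "description": "My custom field",
      "required": true,
      "validators": [
        { "type": "whitespace", "params": {} }
      ]
    }
  ]
}
EOF

# Validate an SDRF file against the new template (step 3 above).
parse_sdrf validate-sdrf --sdrf_file ./testdata/PXD000288.sdrf.tsv --template my_template
```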

Convert SDRF files

sdrf-pipelines provides a multitude of converters which take an SDRF file and other inputs to create configuration files consumed by other software.

Convert to OpenMS

```bash
parse_sdrf convert-openms -s sdrf.tsv
```

Description:

The converter generates two files:

  • an experiment settings file (search engine settings etc.)
  • an experimental design file

The experimental settings file contains one row for every raw file. Columns contain relevant parameters such as precursor mass tolerance, modifications, etc. These settings can usually be derived from the SDRF file.

| URI | Filename | FixedModifications | VariableModifications | Label | PrecursorMassTolerance | PrecursorMassToleranceUnit | FragmentMassTolerance | FragmentMassToleranceUnit | DissociationMethod | Enzyme |
|-----|----------|--------------------|-----------------------|-------|------------------------|----------------------------|-----------------------|---------------------------|--------------------|--------|
| ftp://ftp.pride.ebi.ac.uk/pride/data/archive/XX/PXD324343/A02181ARFR01.raw | A02181ARFR01.raw | Acetyl (Protein N-term) | Gln->pyro-glu (Q),Oxidation (M) | label free sample | 10 | ppm | 10 | ppm | HCD | Trypsin |
| ftp://ftp.pride.ebi.ac.uk/pride/data/archive/XX/PXD324343/A02181ARFR02.raw | A02181ARFR02.raw | Acetyl (Protein N-term) | Gln->pyro-glu (Q),Oxidation (M) | label free sample | 10 | ppm | 10 | ppm | HCD | Trypsin |

The experimental design file contains information on how to unambiguously map a single quantitative value. Most entries can be derived from the SDRF file; however, the definition of conditions might need manual changes.

  • The Fraction_Group identifier indicates which fractions belong together. In the case of label-free data, the fraction group identifier has the same cardinality as the sample identifier.
  • The Fraction identifier indicates which fraction was measured in this file. In the case of unfractionated data, the fraction identifier is 1 for all samples.
  • The Label identifier: 1 for label-free, 1 and 2 for SILAC light/heavy, and e.g. 1-10 for TMT10plex.
  • The Spectra_Filepath (e.g., path = "/data/SILAC_file.mzML").
  • MSstats_Condition: the condition identifier as used by MSstats.
  • MSstats_BioReplicate: an identifier to indicate replication. MSstats requires that there are no duplicate entries; e.g., if MSstats_Condition, Fraction_Group and Fraction number are the same (as in the case of biological or technical replication), one uses MSstats_BioReplicate to make the entries unique.

| Fraction_Group | Fraction | Spectra_Filepath | Label | Sample | MSstats_Condition | MSstats_BioReplicate |
|----------------|----------|------------------|-------|--------|-------------------|----------------------|
| 1 | 1 | A02181ARFR01.raw | 1 | 1 | 1 | 1 |
| 1 | 2 | A02181ARFR02.raw | 1 | 1 | 1 | 1 |
| . | . | ... | . | . | . | . |
| 1 | 15 | A02182AFR15.raw | 1 | 1 | 1 | 1 |
| 2 | 1 | A02182AFR01.raw | 1 | 2 | 2 | 1 |
| . | . | ... | . | . | . | . |
| . | . | ... | . | . | . | . |
| 10 | 15 | A021810AFR15.raw | 1 | 10 | 10 | 1 |
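
Since condition definitions may need manual curation, it can help to inspect the distinct MSstats_Condition values in the generated design file before running MSstats. A minimal sketch, assuming a tab-separated design file; the file name experimental_design.tsv is a placeholder for whatever the converter writes in your setup:

```bash
# List the distinct values of the MSstats_Condition column in the generated
# experimental design file; the column is located by its header name.
awk -F'\t' 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "MSstats_Condition") c = i; next }
            { print $c }' experimental_design.tsv | sort -u
```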

For details, please see the MSstats documentation

Convert to MaxQuant: Usage

```bash
parse_sdrf convert-maxquant -s sdrf.tsv -f {here_the_path_to_protein_database_file} -m {True or False} \
  -pef {default 0.01} -prf {default 0.01} -t {temporary folder} -r {raw_data_folder} \
  -n {number of threads:default 1} -o1 {parameters(.xml) output file path} \
  -o2 {maxquant experimental design(.txt) output file path}
```

e.g.

```bash
parse_sdrf convert-maxquant -s /root/ChengXin/Desktop/sdrf.tsv \
  -f /root/ChengXin/MyProgram/search_spectra/AT/TAIR10_pep_20101214.fasta \
  -r /root/ChengXin/MyProgram/virtuabox/share/raw_data/ \
  -o1 /root/ChengXin/test.xml -o2 /root/ChengXin/test_exp.xml \
  -t /root/ChengXin/MyProgram/virtuabox/share/raw_data/ \
  -pef 0.01 -prf 0.01 -n 4
```

  • -s : SDRF file
  • -f : FASTA file (protein database)
  • -r : spectra raw file folder
  • -mcf : MaxQuant default configuration path (if given, new modifications can be added)
  • -m : use match between runs to boost the number of identifications
  • -pef : posterior error probability calculation based on target-decoy search (default 0.01)
  • -prf : protein score = product of peptide PEPs (one for each sequence) (default 0.01)
  • -t : temporary folder; place it on an SSD (if possible) for a faster search. It is recommended not to use the same directory as the raw files.
  • -n : number of threads; each thread needs at least 2 GB of RAM, and the number of threads should be ≤ the number of logical cores available (otherwise, MaxQuant can crash)
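
A more compact sketch of the same conversion, using relative placeholder paths and the documented defaults for -pef, -prf and -n (all file and folder names below are hypothetical):

```bash
# Minimal MaxQuant conversion: SDRF + FASTA + raw-data folder in,
# mqpar.xml and experimental design file out. Paths are placeholders.
parse_sdrf convert-maxquant \
  -s sdrf.tsv \
  -f proteome.fasta \
  -r ./raw_data/ \
  -m False \
  -t ./mq_tmp/ \
  -o1 mqpar.xml \
  -o2 experimental_design.txt
```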

Description

  • maxquant parameters file (mqpar.xml)
  • maxquant experimental design file (.txt)

The MaxQuant parameters file mqpar.xml contains the parameters required to run MaxQuant. Some settings can usually be derived from the SDRF file, such as enzyme, fixed modification, variable modification, instrument, fraction and label; other parameters are set to their defaults. The current version of MaxQuant supported by the script is 1.6.10.43.

Some of the parameters are listed below:

```xml
<fastaFilePath>TAIR10_pep_20101214.fasta</fastaFilePath>
<matchBetweenRuns>True</matchBetweenRuns>
<maxQuantVersion>1.6.10.43</maxQuantVersion>
<tempFolder>C:/Users/test</tempFolder>
<numThreads>2</numThreads>
<filePaths>
  <string>C:\Users\search_spectra\AT\13040208.raw</string>
  <string>C:\Users\search_spectra\AT\13041208.raw</string>
</filePaths>
<experiments>
  <string>sample 1Tr1</string>
  <string>sample 2Tr1</string>
</experiments>
<fractions>
  <short>32767</short>
  <short>32767</short>
</fractions>
<paramGroupIndices>
  <int>0</int>
  <int>1</int>
</paramGroupIndices>
<msInstrument>0</msInstrument>
<fixedModifications>
  <string>Carbamidomethyl (C)</string>
</fixedModifications>
<enzymes>
  <string>Trypsin</string>
</enzymes>
<variableModifications>
  <string>Oxidation (M)</string>
  <string>Phospho (Y)</string>
  <string>Acetyl (Protein N-term)</string>
  <string>Phospho (T)</string>
  <string>Phospho (S)</string>
</variableModifications>
```

For details, please see the MaxQuant documentation

The MaxQuant experimental design file contains the Name, Fraction, Experiment and PTM columns. Most entries can be derived from the SDRF file.

  • Name: raw data file name.
  • Fraction: in the Fraction column you assign whether the corresponding file belongs to a fraction of a gel fraction. If your data was not obtained through gel-based pre-fractionation, assign the same number (default 1) to all files in the Fraction column.
  • Experiment: if you want to combine all experimental replicates into a single dataset to be analyzed by MaxQuant, enter the same identifier for the files that should be concatenated. However, if you want each individual file to be treated as a separate experiment for further comparison, assign a different identifier to each file, as shown below.

| Name | Fraction | Experiment | PTM |
| :----: | :----: | :----: | :----: |
| 13040208.raw | 1 | sample 1Tr1 | |
| 13041208.raw | 1 | sample 2Tr1 | |

Convert to MSstats annotation file: Usage

```bash
parse_sdrf convert-msstats -s ./testdata/PXD000288.sdrf.tsv -o ./test1.csv
```

  • -s : SDRF file
  • -c : Create conditions from provided (e.g., factor) columns as used by MSstats
  • -o : annotation output file path
  • -swath : annotation for the OpenSWATH-to-MSstats workflow (default: False)
  • -mq : annotation for the MaxQuant-to-MSstats workflow (default: False)
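
A sketch that also sets the condition columns explicitly; the bracketed column-list syntax for -c is assumed to match the NormalyzerDE example below and should be checked against `parse_sdrf convert-msstats --help`:

```bash
# Hypothetical: pass the factor column(s) used to build MSstats conditions.
# The bracketed list syntax mirrors the NormalyzerDE example in this README
# and is an assumption, not a documented guarantee.
parse_sdrf convert-msstats \
  -s ./testdata/PXD000288.sdrf.tsv \
  -c '["characteristics[spiked compound]"]' \
  -o ./test1.csv
```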

Convert to NormalyzerDE design file: Usage

```bash
parse_sdrf convert-normalyzerde -s ./testdata/PXD000288.sdrf.tsv -o ./testPXD000288_design.tsv
```

  • -s : SDRF file
  • -c : Create groups from provided (e.g., factor) columns as used by NormalyzerDE, for example -c ["characteristics[spiked compound]"] (optional)
  • -o : NormalyzerDE design out file path
  • -oc : Out file path for comparisons towards first group (optional)
  • -mq : Path to MaxQuant experimental design file for mapping MQ sample names. (optional)
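
For example, to also create groups from a factor column and write comparisons against the first group (a minimal sketch; the comparison output file name is a placeholder):

```bash
# Build the NormalyzerDE design with an explicit group column and write the
# comparisons file; the -c syntax follows the example given above.
parse_sdrf convert-normalyzerde \
  -s ./testdata/PXD000288.sdrf.tsv \
  -c '["characteristics[spiked compound]"]' \
  -o ./testPXD000288_design.tsv \
  -oc ./testPXD000288_comparisons.tsv
```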

Help

```
$ parse_sdrf --help
Usage: parse_sdrf [OPTIONS] COMMAND [ARGS]...

  This is the main tool that gives access to all commands to convert SDRF
  files into pipelines specific configuration files.

Options:
  --version   Show the version and exit.
  -h, --help  Show this message and exit.

Commands:
  build-index-ontology  Convert an ontology file to an index file
  convert-maxquant      convert sdrf to maxquant parameters file and
                        generate an experimental design file
  convert-msstats       convert sdrf to msstats annotation file
  convert-normalyzerde  convert sdrf to NormalyzerDE design file
  convert-openms        convert sdrf to openms file output
  split-sdrf            Command to split the sdrf file
  validate-sdrf         Command to validate the sdrf file
  validate-sdrf-simple  Simple command to validate the sdrf file
```

Citations

  • Dai C, Füllgrabe A, Pfeuffer J, Solovyeva EM, Deng J, Moreno P, Kamatchinathan S, Kundu DJ, George N, Fexova S, Grüning B, Föll MC, Griss J, Vaudel M, Audain E, Locard-Paulet M, Turewicz M, Eisenacher M, Uszkoreit J, Van Den Bossche T, Schwämmle V, Webel H, Schulze S, Bouyssié D, Jayaram S, Duggineni VK, Samaras P, Wilhelm M, Choi M, Wang M, Kohlbacher O, Brazma A, Papatheodorou I, Bandeira N, Deutsch EW, Vizcaíno JA, Bai M, Sachsenberg T, Levitsky LI, Perez-Riverol Y. A proteomics sample metadata representation for multiomics integration and big data analysis. Nat Commun. 2021 Oct 6;12(1):5854. doi: 10.1038/s41467-021-26111-3. PMID: 34615866; PMCID: PMC8494749. Manuscript

  • Perez-Riverol, Yasset, and European Bioinformatics Community for Mass Spectrometry. "Toward a Sample Metadata Standard in Public Proteomics Repositories." Journal of Proteome Research 19.10 (2020): 3906-3909. Manuscript

Owner

  • Name: BigBio Stack
  • Login: bigbio
  • Kind: organization
  • Email: proteomicsstack@gmail.com
  • Location: Cambridge, UK

Provide big data solutions Bioinformatics

GitHub Events

Total
  • Create event: 3
  • Release event: 1
  • Issues event: 24
  • Watch event: 3
  • Delete event: 2
  • Member event: 1
  • Issue comment event: 138
  • Push event: 53
  • Pull request review comment event: 91
  • Pull request review event: 137
  • Pull request event: 61
  • Fork event: 4
Last Year
  • Create event: 3
  • Release event: 1
  • Issues event: 24
  • Watch event: 3
  • Delete event: 2
  • Member event: 1
  • Issue comment event: 138
  • Push event: 53
  • Pull request review comment event: 91
  • Pull request review event: 137
  • Pull request event: 61
  • Fork event: 4

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 293
  • Total Committers: 11
  • Avg Commits per committer: 26.636
  • Development Distribution Score (DDS): 0.686
Top Committers
Name Email Commits
yperez y****l@g****m 92
chengxin Dai 3****n@u****m 63
veitveit v****t@g****m 53
Timo Sachsenberg s****b@i****e 33
fabianegli f****i@p****h 24
Lev Levitsky l****y@p****u 15
jpfeuffer p****r@i****e 4
Fredrik Levander f****r@b****e 3
Julianus Pfeuffer j****r@f****e 3
Fabian Egli f****i@u****m 2
Björn Grüning b****n@g****u 1

Issues and Pull Requests

Last synced: 5 months ago

All Time
  • Total issues: 88
  • Total pull requests: 144
  • Average time to close issues: 9 months
  • Average time to close pull requests: 6 days
  • Total issue authors: 16
  • Total pull request authors: 12
  • Average comments per issue: 1.6
  • Average comments per pull request: 1.67
  • Merged pull requests: 114
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 23
  • Pull requests: 75
  • Average time to close issues: 5 days
  • Average time to close pull requests: 2 days
  • Issue authors: 7
  • Pull request authors: 6
  • Average comments per issue: 1.39
  • Average comments per pull request: 2.21
  • Merged pull requests: 52
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • fabianegli (31)
  • ypriverol (27)
  • jpfeuffer (6)
  • enryH (4)
  • levitsky (3)
  • veitveit (3)
  • jgriss (2)
  • MetteBoge (2)
  • noatgnu (2)
  • foellmelanie (2)
  • di-hardt (1)
  • kai-lawsonmcdowall (1)
  • timosachsenberg (1)
  • nicni16 (1)
  • TineClaeys (1)
Pull Request Authors
  • fabianegli (52)
  • ypriverol (41)
  • daichengxin (27)
  • levitsky (9)
  • jpfeuffer (3)
  • enryH (3)
  • veitveit (3)
  • lazear (2)
  • johnne (1)
  • timosachsenberg (1)
  • jgriss (1)
  • WangHong007 (1)
Top Labels
Issue Labels
enhancement (17) bug (11) high-priority (9) help wanted (8) question (4) good first issue (1) low-priority (1)
Pull Request Labels
Review effort [1-5]: 2 (7) Bug fix (6) Review effort [1-5]: 1 (6) Review effort 1/5 (6) Review effort 2/5 (3) WIP (3) configuration changes (2) dependencies (2) wontfix (2) bug (1) enhancement (1) help wanted (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 2,357 last-month
  • Total dependent packages: 1
  • Total dependent repositories: 3
  • Total versions: 30
  • Total maintainers: 1
pypi.org: sdrf-pipelines

Translate, convert SDRF to configuration pipelines

  • Versions: 30
  • Dependent Packages: 1
  • Dependent Repositories: 3
  • Downloads: 2,357 Last month
  • Docker Downloads: 0
Rankings
Docker downloads count: 1.6%
Dependent packages count: 4.7%
Downloads: 7.0%
Average: 7.8%
Forks count: 8.7%
Dependent repos count: 9.0%
Stargazers count: 16.0%
Maintainers (1)
Last synced: 5 months ago

Dependencies

package-lock.json npm
  • 194 dependencies
package.json npm
  • remark-cli ^11.0.0
  • remark-preset-lint-consistent ^5.1.1
  • remark-preset-lint-recommended ^6.1.2
requirements-dev.txt pypi
  • black * development
  • isort * development
  • pre-commit * development
requirements.txt pypi
  • click *
  • pandas *
  • pandas_schema *
  • pytest *
  • pyyaml *
  • requests *
setup.py pypi
  • click *
.github/workflows/ci.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
  • actions/setup-python v2 composite
  • isort/isort-action v0.1.0 composite
  • mshick/add-pr-comment v1 composite
  • psf/black stable composite
.github/workflows/pythonapp.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
.github/workflows/pythonpackage.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
.github/workflows/pythonpublish.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
  • pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite
pyproject.toml pypi