https://github.com/bigbio/spectrafuse
Incremental clustering pipeline from quantms data.
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.4%) to scientific vocabulary
Repository
Incremental clustering pipeline from quantms data.
Basic Info
- Host: GitHub
- Owner: bigbio
- License: mit
- Language: Python
- Default Branch: main
- Size: 376 KB
Statistics
- Stars: 1
- Watchers: 5
- Forks: 0
- Open Issues: 4
- Releases: 0
Metadata Files
README.md
spectrafuse
Incremental clustering pipeline from quantms data. quantms is a workflow for the reanalysis of public proteomics data. The quantms project not only releases the workflow to the public but also systematically reanalyzes public proteomics data for TMT, LFQ, iTRAQ and other DDA methods.
quantms has reanalyzed a large number of datasets, covering almost 1 billion MS/MS (tandem mass spectrometry, MS2) spectra and nearly 100 million PSMs (Peptide-Spectrum Matches) derived from various tissues, cell lines, and diseases. In light of this wealth of data, spectrafuse aims to apply spectral clustering techniques to organize the data and construct spectral libraries.
spectrafuse is a Nextflow workflow that performs incremental clustering of quantms data and is based on the tool MaRaCluster.
The workflow in a nutshell:
Reference: https://github.com/bigbio/spectrafuse/blob/main/docs/algorithm.png
The workflow is designed to be run in a high-performance computing environment, and it is built using Nextflow. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.
Workflow steps
The workflow starts from the SDRF file of each project reanalyzed in quantms and the corresponding PSM parquet files generated by the quantms workflow. Note: the PSM parquet file MUST contain the corresponding spectra for each identified peptide.
This workflow mainly consists of the following processes:
- mgf-converter: a tool for converting each project file analyzed by quantms into an MGF file.
- Incremental MaRaCluster algorithm: uses the incremental clustering method of MaRaCluster to cluster MGF files that share the same species, instrument, and charge within each project.
- Library converter: after all the clustering is done, there should be a folder with the corresponding structure.
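The first two steps can be illustrated with a minimal Python sketch. It is not the pipeline's own code: the record fields (`usi`, `species`, `instrument`, `charge`, `precursor_mz`, `peaks`) and both function names are hypothetical, chosen only to show how spectra might be bucketed by species, instrument, and charge and then written as standard MGF blocks.

```python
from collections import defaultdict


def format_mgf_entry(title, precursor_mz, charge, peaks):
    """Render one spectrum as an MGF block; peaks is a list of (mz, intensity)."""
    lines = [
        "BEGIN IONS",
        f"TITLE={title}",
        f"PEPMASS={precursor_mz}",
        f"CHARGE={charge}+",
    ]
    lines.extend(f"{mz} {intensity}" for mz, intensity in peaks)
    lines.append("END IONS")
    return "\n".join(lines)


def group_spectra(spectra):
    """Bucket spectra by the (species, instrument, charge) key used for clustering."""
    groups = defaultdict(list)
    for spectrum in spectra:
        key = (spectrum["species"], spectrum["instrument"], spectrum["charge"])
        groups[key].append(spectrum)
    return dict(groups)


# Hypothetical records standing in for rows of a PSM parquet file.
spectra = [
    {"usi": "spec1", "species": "Homo sapiens", "instrument": "Q Exactive",
     "charge": 2, "precursor_mz": 500.25, "peaks": [(100.1, 10.0), (200.2, 20.0)]},
    {"usi": "spec2", "species": "Homo sapiens", "instrument": "Q Exactive",
     "charge": 2, "precursor_mz": 610.5, "peaks": [(150.0, 5.0)]},
    {"usi": "spec3", "species": "Homo sapiens", "instrument": "Q Exactive",
     "charge": 3, "precursor_mz": 420.75, "peaks": [(300.3, 7.5)]},
]

groups = group_spectra(spectra)

# One MGF text per group; each group would be clustered separately.
mgf_by_group = {
    key: "\n\n".join(
        format_mgf_entry(s["usi"], s["precursor_mz"], s["charge"], s["peaks"])
        for s in members
    )
    for key, members in groups.items()
}
```

Keeping the grouping key explicit makes it straightforward to write one MGF file per species, instrument, and charge combination before handing the files to the clustering step.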
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner.
Usage:
First, generate a directory of flat text files listing the absolute or relative paths to each MS2 spectrum file in all projects.
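One way to produce such a directory is sketched below in Python. The layout is an assumption, not something the README specifies: one subdirectory per project under a common root, spectrum files ending in `.mgf`, and one list file per project with one path per line. The function name `write_file_lists` is hypothetical.

```python
from pathlib import Path


def write_file_lists(projects_root, list_dir, pattern="*.mgf"):
    """Write one text file per project listing its spectrum files, one path per line.

    Assumes each immediate subdirectory of projects_root is one project.
    Returns a dict mapping project name to the number of paths written.
    """
    projects_root = Path(projects_root)
    list_dir = Path(list_dir)
    list_dir.mkdir(parents=True, exist_ok=True)

    written = {}
    for project in sorted(p for p in projects_root.iterdir() if p.is_dir()):
        # Absolute paths keep the lists valid regardless of the working directory.
        paths = sorted(str(f.resolve()) for f in project.rglob(pattern))
        if not paths:
            continue  # skip projects without any matching spectrum file
        (list_dir / f"{project.name}.txt").write_text("\n".join(paths) + "\n")
        written[project.name] = len(paths)
    return written
```

Calling `write_file_lists("projects", "file_lists")` would then yield a folder that can be passed as the file-list directory, provided the pipeline accepts this layout.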
Now, you can run the pipeline using:
shell
nextflow run main.nf \
--files_list_folder <FILE_LIST_FOLDER> \
--maracluster_output <OUTDIR>
Owner
- Name: BigBio Stack
- Login: bigbio
- Kind: organization
- Email: proteomicsstack@gmail.com
- Location: Cambridge, UK
- Website: http://bigbio.xyz
- Repositories: 24
- Profile: https://github.com/bigbio
Provide big data solutions for bioinformatics
GitHub Events
Total
- Push event: 2
Last Year
- Push event: 2