secat

Size-Exclusion Chromatography Algorithmic Toolkit

https://github.com/grosenberger/secat

Science Score: 77.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
    3 of 7 committers (42.9%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (17.3%) to scientific vocabulary

Keywords

data-independent-acquisition machine-learning mass-spectrometry proteomics signal-processing size-exclusion-chromatography swath-ms
Last synced: 6 months ago

Repository

Size-Exclusion Chromatography Algorithmic Toolkit

Basic Info
  • Host: GitHub
  • Owner: grosenberger
  • License: other
  • Language: Python
  • Default Branch: master
  • Size: 249 KB
Statistics
  • Stars: 6
  • Watchers: 1
  • Forks: 3
  • Open Issues: 1
  • Releases: 15
Topics
data-independent-acquisition machine-learning mass-spectrometry proteomics signal-processing size-exclusion-chromatography swath-ms
Created almost 8 years ago · Last pushed over 2 years ago
Metadata Files
Readme License Citation

README.md

SECAT: Size-Exclusion Chromatography Algorithmic Toolkit

SECAT is an algorithm for the network-centric data analysis of SEC-SWATH-MS data. The tool is implemented as a multi-step command line application.

Dependencies

SECAT depends on several Python packages (listed in setup.py). SECAT has been tested on Linux (CentOS 7) and macOS (10.14) operating systems and might run on other versions too.

Installation

We strongly advise installing SECAT in a Python virtualenv. SECAT is compatible with Python 3.7 and higher; installation should take a few minutes in a correctly set up Python environment.
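A minimal virtualenv setup could look like the following sketch (the environment name secat-env is an arbitrary choice):

```shell
# Create an isolated environment so SECAT's dependencies
# do not interfere with other Python projects
python3 -m venv secat-env
. secat-env/bin/activate
# Inside the environment, install SECAT as described below, e.g.:
#   pip install secat
python --version
```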

Install the development version of SECAT from GitHub:

pip install git+https://github.com/grosenberger/secat.git@master

Install the stable version of SECAT from the Python Package Index (PyPI):

pip install secat

You can alternatively create a conda environment for SECAT. First, create a new conda environment with python, numpy and pip:

conda create -n secat python=3.10.8 numpy pip -y

Activate the secat environment:

conda activate secat

Install SECAT and its dependencies:

pip install secat

Docker

SECAT is also available from Dockerhub:

docker pull grosenberger/secat:latest # "latest" can be replaced by the version number, e.g. "1.0.4"

You can also build the Docker image on your machine with the command below; make sure you run it from the root of this repository. Feel free to replace the part after -t with anything you find convenient: it is simply a tag for identifying the image on your machine. Here it is tagged as grosenberger/secat:latest to remain consistent with the other commands in this README.

docker build . -t grosenberger/secat:latest

Print the installed Python versions:

docker run --name secat --rm -v $PWD:/data -i -t grosenberger/secat:latest pip list

Run SECAT:

docker run --name secat --rm -v $PWD:/data -i -t grosenberger/secat:latest secat --help

Running SECAT

On a typical desktop computer with 4 CPU cores and 16 GB RAM, SECAT requires 1-4 h of running time for a SEC-SWATH-MS data set with two conditions and three replicates each, covering about 5,000 proteins and 80,000 peptides.

The example input data (HeLa-CC.tgz and Common.tgz are required) can be found on Zenodo.

The data set includes the expected output as SQLite files. Note: because the PyProphet semi-supervised learning step is initialized with a random seed, the output may vary slightly from run to run due to numeric deviations. To reproduce the results exactly, the pretrained PyProphet classifier can be applied as described in the secat learn step. The Zenodo repository contains all parameters and instructions needed to reproduce the SECAT analysis results for the other data sets.

SECAT consists of the following steps:

1. Data preprocessing

The primary input for SECAT is quantitative, proteotypic/unique peptide-level profiles, e.g. acquired by SEC-SWATH-MS. The input can be supplied either as a matrix (protein, peptide and run-wise peptide-intensity columns) or as a transposed long list. Protein identifiers need to be provided in UniProtKB/Swiss-Prot format. The column names can be freely specified (secat preprocess --columns; see help for a complete description).
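In long-list form, such a profile file might look like the sketch below; the column names and values are illustrative only, since the actual names are user-configurable via secat preprocess --columns:

```
protein_id	peptide_id	run_id	peptide_intensity
P00533	AGEEFDDVK	sec_fraction_01	10234.5
P00533	AGEEFDDVK	sec_fraction_02	8817.2
P04637	LNEALELK	sec_fraction_01	5521.9
```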

The second required input file describes the experimental design and molecular weight calibration of the experiment. The primary column covers the run identifiers (matching the quantitative profiles above), with additional columns for SEC fraction identifiers (integer value), SEC molecular weight (float value), a group condition identifier (free-text value) and a replicate identifier (free-text value). The column names can be freely specified (secat preprocess --columns; see help for a complete description).
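An experimental design file could therefore be sketched as follows; again, the header names here are assumptions for illustration, not fixed requirements:

```
run_id,sec_fraction,sec_mw,condition,replicate
sec_fraction_01,1,5000.0,control,1
sec_fraction_02,2,4500.0,control,1
sec_fraction_01,1,5000.0,perturbed,1
```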

The third required file covers UniProtKB/Swiss-Prot meta data in XML format, matching the proteome, and can be obtained from UniProt.

Optionally, reference PPI networks can be specified to support semi-supervised learning and to restrict the peptide query space. SECAT can accept three files: a positive reference network and a negative reference network for the learning steps, and a separate reference network to restrict the query space. SECAT natively supports HUPO-PSI MITAB (2.5-2.7), STRING-DB, BioPlex and PrePPI formats and provides filtering options to optionally exclude lower-confidence PPIs. The inverted CORUM reference PPI network was generated by taking the inverted set of PPIs (i.e. all possible PPIs that are not covered by CORUM) and removing from this set all PPIs covered by STRING, IID, PrePPI or BioPlex.
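For orientation, a STRING-DB protein links file is a space-separated table of protein pairs with a combined confidence score; the rows below are illustrative examples of the format, not actual data used in the analysis:

```
protein1 protein2 combined_score
9606.ENSP00000000233 9606.ENSP00000272298 490
9606.ENSP00000000233 9606.ENSP00000253401 198
```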

The Zenodo archives linked above contain example files and parameter sets for all described analyses and can be used to test the algorithm and reproduce the results.

First, the input quantitative proteomics matrix and parameters are preprocessed to a single file:

secat preprocess \
--out=hela_string.secat \ # Output filename
--sec=input/hela_sec_mw.csv \ # SEC annotation file
--net=common/9606.protein.links.v11.0.txt.gz \ # Reference PPI network
--posnet=common/corum_targets.txt.gz \ # Reference positive interaction network for learning
--negnet=common/corum_decoys.txt.gz \ # Reference negative interaction network for learning
--uniprot=common/uniprot_9606_20190402.xml.gz \ # UniProt reference XML file
--min_interaction_confidence=0 \ # Minimum interaction confidence
input/pep*.tsv # Input data files

2. Signal processing

Next, the signal processing is conducted in a parallelized fashion:

secat score --in=hela_string.secat --threads=8

3. PPI detection

The statistical confidence of the PPI is evaluated by machine learning:

secat learn --in=hela_string.secat --threads=5

4. PPI quantification

Quantitative features are generated for all PPIs and proteins:

secat quantify --in=hela_string.secat --control_condition=inter

5. Export of results

CSV tables can be exported for import in downstream tools, e.g. Cytoscape:

secat export --in=hela_string.secat

6. Plotting of chromatograms

PDF reports can be generated for the top (or selected) results:

secat plot --in=hela_string.secat

7. Report of statistics

Statistics reports can be generated for the top (or selected) results:

secat statistics --in=hela_string.secat

Further options and default parameters

All options and the default parameters can be displayed by:

secat --help
secat preprocess --help
secat score --help
secat learn --help
secat quantify --help
secat export --help
secat plot --help
secat statistics --help

Owner

  • Name: George Rosenberger
  • Login: grosenberger
  • Kind: user

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: SECAT
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: George
    family-names: Rosenberger
    email: gr2578@cumc.columbia.edu
    affiliation: Columbia University
    orcid: 'https://orcid.org/0000-0002-1655-6789'
identifiers:
  - type: doi
    value: 10.1016/j.cels.2020.11.006
    description: Original SECAT publication
keywords:
  - SEC-SWATH-MS
  - SEC
  - Proteomics
  - CF-MS
  - DIA
license: BSD-3-Clause

GitHub Events

Total
  • Member event: 1
Last Year
  • Member event: 1

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 223
  • Total Committers: 7
  • Avg Commits per committer: 31.857
  • Development Distribution Score (DDS): 0.274
Top Committers
Name Email Commits
George Rosenberger g****8@c****u 162
Darvesh Sanjeev Gorhe d****7@b****r 36
Darvesh Gorhe d****3@g****m 13
benbokor b****1@c****u 9
dependabot[bot] 4****]@u****m 1
George Rosenberger g****r@u****h 1
George Rosenberger g****e@r****o 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 1
  • Total pull requests: 7
  • Average time to close issues: N/A
  • Average time to close pull requests: 9 days
  • Total issue authors: 1
  • Total pull request authors: 3
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 1.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ScarlettQGG (1)
Pull Request Authors
  • dgorhe (4)
  • benbokor (2)
  • dependabot[bot] (1)
Top Labels
Issue Labels
Pull Request Labels
dependencies (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 17 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 13
  • Total maintainers: 1
pypi.org: secat

Size-Exclusion Chromatography Algorithmic Toolkit

  • Versions: 13
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 17 Last month
Rankings
Dependent packages count: 10.0%
Forks count: 16.9%
Average: 19.7%
Dependent repos count: 21.7%
Downloads: 24.9%
Stargazers count: 25.0%
Maintainers (1)
Last synced: 6 months ago

Dependencies

setup.py pypi
  • Click *
  • ggplot *
  • lxml *
  • matplotlib *
  • minepy *
  • numpy *
  • pandas *
  • pyprophet *
  • rpy2 *
  • scipy *
  • sklearn *
  • statsmodels *
  • tqdm *
  • tzlocal *
.github/workflows/dockerpublish.yml actions
  • actions/checkout v3 composite
  • docker/build-push-action ad44023a93711e3deb337508980b4b5e9bcdc5dc composite
  • docker/login-action f054a8b539a109f9f41c372932f1ae047eff08c9 composite
  • docker/metadata-action 98669ae865ea3cffbcbaa878cf57c20bbf1c6c38 composite
.github/workflows/pythonpublish.yml actions
  • actions/checkout v1 composite
  • actions/setup-python v1 composite
Dockerfile docker
  • python 3.10.9 build
environment.yml pypi
  • brewer2mpl ==1.4.1
  • click ==8.1.3
  • cloudpickle ==2.2.0
  • contourpy ==1.0.6
  • cycler ==0.11.0
  • cython ==0.29.32
  • fonttools ==4.38.0
  • future ==0.18.2
  • ggplot ==0.11.5
  • hyperopt ==0.2.7
  • joblib ==1.2.0
  • kiwisolver ==1.4.4
  • lxml ==4.9.1
  • matplotlib ==3.6.2
  • minepy ==1.2.6
  • networkx ==2.8.8
  • numexpr ==2.8.4
  • numpy ==1.23.5
  • packaging ==22.0
  • pandas ==1.5.2
  • patsy ==0.5.3
  • pillow ==9.3.0
  • py4j ==0.10.9.7
  • pyparsing ==3.0.9
  • pyprophet ==2.1.12
  • python-dateutil ==2.8.2
  • pytz ==2022.6
  • pytz-deprecation-shim ==0.1.0.post0
  • scikit-learn ==1.2.0
  • scipy ==1.9.3
  • six ==1.16.0
  • sklearn ==0.0.post1
  • statsmodels ==0.13.5
  • tabulate ==0.9.0
  • threadpoolctl ==3.1.0
  • tqdm ==4.64.1
  • tzdata ==2022.7
  • tzlocal ==4.2
  • xgboost ==1.7.2
requirements.txt pypi
  • Cython ==0.29.32
  • Pillow ==9.3.0
  • Pygments ==2.13.0
  • anndata ==0.8.0
  • brewer2mpl ==1.4.1
  • click ==8.1.3
  • cloudpickle ==2.2.0
  • commonmark ==0.9.1
  • contourpy ==1.0.6
  • cramjam ==2.6.1
  • cycler ==0.11.0
  • decoupler ==1.2.0
  • fastparquet ==0.8.3
  • fonttools ==4.38.0
  • fsspec ==2022.10.0
  • future ==0.18.2
  • ggplot ==0.11.5
  • h5py ==3.7.0
  • hyperopt ==0.2.7
  • joblib ==1.2.0
  • kiwisolver ==1.4.4
  • llvmlite ==0.39.1
  • lxml ==4.9.1
  • matplotlib ==3.6.2
  • minepy ==1.2.6
  • natsort ==8.2.0
  • networkx ==2.8.8
  • numba ==0.56.3
  • numexpr ==2.8.4
  • numpy ==1.23.5
  • packaging ==22.0
  • pandas ==1.5.2
  • patsy ==0.5.3
  • pip ==22.3.1
  • psutil ==5.9.3
  • py4j ==0.10.9.7
  • pynvml ==11.4.1
  • pyparsing ==3.0.9
  • pyprophet ==2.1.12
  • python-dateutil ==2.8.2
  • pytz ==2022.6
  • pytz-deprecation-shim ==0.1.0.post0
  • rich ==12.6.0
  • scalene ==1.5.13
  • scikit-learn ==1.2.0
  • scipy ==1.9.3
  • setuptools ==65.5.1
  • six ==1.16.0
  • sklearn ==0.0.post1
  • statsmodels ==0.13.5
  • tabulate ==0.9.0
  • threadpoolctl ==3.1.0
  • tqdm ==4.64.1
  • tzdata ==2022.7
  • tzlocal ==4.2
  • wheel ==0.38.4
  • xgboost ==1.7.2