proteomegenerator3

Generate proteogenomics search databases from long-read RNAseq

https://github.com/shahcompbio/proteomegenerator3

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 13 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.1%) to scientific vocabulary

Repository


Basic Info
  • Host: GitHub
  • Owner: shahcompbio
  • License: mit
  • Language: Nextflow
  • Default Branch: main
  • Size: 188 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 8 months ago · Last pushed 8 months ago
Metadata Files
Readme Changelog Contributing License Citation

README.md

kentsislab/proteomegenerator3


Introduction

kentsislab/proteomegenerator3 is a bioinformatics pipeline that creates sample-specific proteogenomics search databases from long-read RNAseq data. It takes a samplesheet and aligned long-read RNAseq data as input, performs guided de novo transcript assembly and ORF prediction, and then produces a protein FASTA file suitable for use with computational proteomics search platforms (e.g., FragPipe, DIA-NN).

  1. Pre-processing of aligned reads to create transcript read classes with bambu which can be re-used in future analyses. Optional filtering:
    1. Filtering on MAPQ and read length with samtools
  2. Transcript assembly, quantification, and filtering with bambu. Option to merge multiple samples into a unified transcriptome.
  3. ORF prediction with TransDecoder. Option to provide fusion contigs from JAFFAL.
  4. Formatting of ORFs into a FASTA file that can be used for computational proteomics searches with FragPipe, DIA-NN, or Spectronaut.
  5. Collation of the package versions used with MultiQC.

Usage

Note: If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data. This profile uses a minimal test dataset that runs in 5-10 minutes on most modern laptops.

First, prepare a samplesheet with your input data that looks as follows:

samplesheet.csv:

```csv
sample,bam,bai,rcFile,jaffal_fasta,jaffal_table
CONTROL_REP1,AEG588A1_S1_L002_R1_001.bam,AEG588A1_S1_L002_R1_001.bam.bai,,jaffal_results.fasta,jaffal_results.csv
```

Each row represents a long-read RNAseq sample. The columns are as follows:

  1. sample: name of the sample
  2. bam: aligned, sorted long-read RNAseq bam
  3. bai: index file for bam
  4. rcFile: read class file from Bambu if you've already done some pre-processing; you can provide this and then use the --skip_preprocessing flag to speed up run time and re-analyze previous samples
  5. jaffal_fasta: Fusion contigs which are output from JAFFAL (see description here).
  6. jaffal_table: Fusion table which is output from JAFFAL (see description here)

To produce the necessary files, we recommend using the nf-core/nanoseq pipeline, which will run both alignment and call fusions with JAFFAL.
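Before launching, it can help to sanity-check the samplesheet against the column layout described above. The following is a minimal sketch, not part of the pipeline: `check_samplesheet` is a hypothetical helper, and it assumes that only the `sample` field must be non-empty (the example row leaves `rcFile` blank, so empty fields are tolerated).

```shell
# Hypothetical helper (not shipped with the pipeline): check the samplesheet layout.
check_samplesheet() {
    local sheet="$1"
    local expected="sample,bam,bai,rcFile,jaffal_fasta,jaffal_table"
    local header

    # Compare the first line with the expected header, ignoring a trailing CR.
    header="$(head -n 1 "$sheet" | tr -d '\r')"
    if [ "$header" != "$expected" ]; then
        echo "unexpected header: $header" >&2
        return 1
    fi

    # Every non-empty data row needs six comma-separated fields and a sample name.
    # Empty fields (e.g. a blank rcFile) still count as fields.
    awk -F',' '
        BEGIN { bad = 0 }
        NR > 1 && NF > 0 {
            if (NF != 6) {
                printf "line %d: expected 6 fields, found %d\n", NR, NF
                bad = 1
            } else if ($1 == "") {
                printf "line %d: empty sample name\n", NR
                bad = 1
            }
        }
        END { exit bad }
    ' "$sheet" >&2
}
```

Calling `check_samplesheet samplesheet.csv` returns a non-zero status and prints the offending lines to standard error if the layout does not match.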

Now, you can run the pipeline using:

```bash
nextflow run kentsislab/proteomegenerator3 -r 1.0.0dev \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --fasta <REF_GENOME> \
   --gtf <REF_GTF> \
   --outdir <OUTDIR>
```

Here, REF_GENOME and REF_GTF are the reference genome (FASTA) and the reference transcriptome annotation (GTF), respectively. These can be from GENCODE or Ensembl, but should match the reference used to align the data.
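A common symptom of mismatched references is differing sequence names (for example `chr1` in one file and `1` in the other). The sketch below is one way to spot this before a run; `missing_contigs` is a hypothetical helper, not a pipeline feature, and it only compares names in the FASTA and GTF, not the BAM header.

```shell
# Hypothetical helper: print GTF sequence names that have no FASTA record.
missing_contigs() {
    local fasta="$1" gtf="$2"
    awk '
        # First file (FASTA): remember each record name, i.e. the text
        # between ">" and the first whitespace on a header line.
        FILENAME == ARGV[1] {
            if (/^>/) { seen[substr($1, 2)] = 1 }
            next
        }
        # Second file (GTF): skip comments and blank lines, then report
        # each sequence name (column 1) that was not seen, once.
        /^#/ || NF == 0 { next }
        !($1 in seen) && !($1 in reported) {
            print $1
            reported[$1] = 1
        }
    ' "$fasta" "$gtf"
}
```

An empty result from `missing_contigs genome.fa annotation.gtf` suggests the naming schemes agree; any printed names are worth investigating before running the pipeline.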

Warning: Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
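As an illustration of the -params-file route, the sketch below writes a small YAML file and shows how it could be passed to Nextflow. The keys mirror the CLI flags used in this README; the file name `params.yaml`, the paths, and the `docker` profile choice are placeholders, so adjust them to your setup.

```shell
# Write a params file. Keys correspond to the --input, --fasta, --gtf,
# --outdir and --filter_reads flags; all paths here are placeholders.
cat > params.yaml <<'EOF'
input: "samplesheet.csv"
fasta: "/path/to/reference_genome.fa"
gtf: "/path/to/reference_annotation.gtf"
outdir: "results"
filter_reads: true
EOF

# Then launch with the file instead of individual parameter flags:
#   nextflow run kentsislab/proteomegenerator3 -r 1.0.0dev \
#      -profile docker \
#      -params-file params.yaml
```

Keeping parameters in a file like this makes a run easier to repeat and to record alongside the results.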

Additional parameters

To see all optional parameters that could be used with the pipeline and their explanations, use the help menu:

```bash
nextflow run kentsislab/proteomegenerator3 -r 1.0.0dev --help
```

These options are set using flags. For example:

```bash
nextflow run kentsislab/proteomegenerator3 -r 1.0.0dev \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --fasta <REF_GENOME> \
   --gtf <REF_GTF> \
   --outdir <OUTDIR> \
   --filter_reads
```

This pre-filters the BAM file on MAPQ and read length before transcript assembly is performed.
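To make the two filtering criteria concrete, here is a minimal sketch that applies them to SAM text with awk. It is an illustration only: the pipeline itself filters with samtools, and the exact samtools invocation it uses is not shown in this README. `filter_sam` is a hypothetical name, and its defaults copy the documented defaults for mapq (20) and read_len (500).

```shell
# Illustration only: keep SAM records with MAPQ >= min_mapq and a read
# sequence of at least min_len bases. Reads SAM text on standard input.
filter_sam() {
    local min_mapq="${1:-20}" min_len="${2:-500}"
    awk -F'\t' -v q="$min_mapq" -v len="$min_len" '
        /^@/ { print; next }                      # pass header lines through
        $5 >= q && length($10) >= len { print }   # column 5 = MAPQ, column 10 = SEQ
    '
}

# Possible usage with samtools for BAM decoding and encoding:
#   samtools view -h input.bam | filter_sam 20 500 | samtools view -b -o filtered.bam -
```

Records whose sequence is stored as `*` (as can happen for secondary alignments) have length 1 here and are therefore dropped by any realistic length threshold.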

As another example, you can use the following flag to perform ORF calling on fusion contigs:

```bash
nextflow run kentsislab/proteomegenerator3 -r 1.0.0dev \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --fasta <REF_GENOME> \
   --gtf <REF_GTF> \
   --outdir <OUTDIR> \
   --fusions
```

To run the latest version, which may not be stable, use the -r main -latest flags:

```bash
nextflow run kentsislab/proteomegenerator3 -r main -latest --help
```

The following options are worth highlighting:

  1. filter_reads: use this flag to pre-filter reads using mapq and read length
  2. mapq: min mapq for read filtering [default: 20]
  3. read_len: min read length for read filtering [default: 500]
  4. filter_acc_reads: filter reads on accessory chromosomes; sometimes causes issues for bambu
  5. skip_preprocessing: use previously generated bambu read classes
  6. NDR: modulate bambu's novel discovery rate [default: 0.1]
  7. recommended_NDR: run bambu with recommended NDR (as determined by bambu's algorithm)
  8. single_sample: run bambu on samples individually, and skip merging of transcriptomes; if you provide a single sample or fusions, this will be automatically run.
  9. skip_multisample: skip multisample transcript assembly (see option 8).
  10. fusions: perform ORF predictions on fusions from JAFFAL [default: false]
  11. multiple_orfs: allow for multiple ORFs per transcript (this is in beta-testing)
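Several of these options can be combined in one run. The sketch below prints an assembled command for review rather than executing it; `pg3_dry_run` is a hypothetical wrapper, the paths and the `singularity` profile are placeholders, and the values for mapq, read_len, and NDR are arbitrary examples. It assumes valued options are passed in the usual Nextflow `--name value` form; confirm the exact names and types with --help.

```shell
# Hypothetical wrapper: print the command line instead of running it,
# so the chosen options can be checked first.
pg3_dry_run() {
    echo nextflow run kentsislab/proteomegenerator3 -r 1.0.0dev "$@"
}

# Stricter read filtering combined with a lower novel discovery rate.
pg3_dry_run \
    -profile singularity \
    --input samplesheet.csv \
    --fasta /path/to/reference_genome.fa \
    --gtf /path/to/reference_annotation.gtf \
    --outdir results \
    --filter_reads --mapq 30 --read_len 1000 \
    --NDR 0.05
```

Once the printed command looks right, it can be copied and run directly.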

Credits

kentsislab/proteomegenerator3 was originally written by Asher Preska Steinberg.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

If you use kentsislab/proteomegenerator3 for your analysis, please cite our manuscript:

End-to-end proteogenomics for discovery of cryptic and non-canonical cancer proteoforms using long-read transcriptomics and multi-dimensional proteomics

Katarzyna Kulej, Asher Preska Steinberg, Jinxin Zhang, Gabriella Casalena, Eli Havasov, Sohrab P. Shah, Andrew McPherson, Alex Kentsis.

bioRxiv. 2025 Aug 28. doi: 10.1101/2025.08.23.671943.

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Owner

  • Name: ShahCompBio
  • Login: shahcompbio
  • Kind: organization

Computational biology tools from the Shah Lab

Citation (CITATIONS.md)

# kentsislab/proteomegenerator3: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

> Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total
  • Delete event: 2
  • Member event: 1
  • Push event: 31
  • Fork event: 1
  • Create event: 1
Last Year
  • Delete event: 2
  • Member event: 1
  • Push event: 31
  • Fork event: 1
  • Create event: 1

Dependencies

.github/workflows/branch.yml actions
  • mshick/add-pr-comment b8f338c590a895d50bcbfa6c5859251edc8952fc composite
.github/workflows/ci.yml actions
  • actions/checkout 11bd71901bbe5b1630ceea73d27597364c9af683 composite
  • conda-incubator/setup-miniconda a4260408e20b96e80095f42ff7f1a15b27dd94ca composite
  • eWaterCycle/setup-apptainer main composite
  • jlumbroso/free-disk-space 54081f138730dfa15788a46383842cd2f914a1be composite
  • nf-core/setup-nextflow v2 composite
.github/workflows/clean-up.yml actions
  • actions/stale 28ca1036281a5e5922ead5184a1bbf96e5fc984e composite
.github/workflows/download_pipeline.yml actions
  • actions/setup-python 0b93645e9fea7318ecaed2b359559ac225c90a2b composite
  • eWaterCycle/setup-apptainer 4bb22c52d4f63406c49e94c804632975787312b3 composite
  • jlumbroso/free-disk-space 54081f138730dfa15788a46383842cd2f914a1be composite
  • nf-core/setup-nextflow v2 composite
.github/workflows/fix-linting.yml actions
  • actions/checkout 11bd71901bbe5b1630ceea73d27597364c9af683 composite
  • actions/setup-python 0b93645e9fea7318ecaed2b359559ac225c90a2b composite
  • peter-evans/create-or-update-comment 71345be0265236311c031f5c7866368bd1eff043 composite
.github/workflows/linting.yml actions
  • actions/checkout 11bd71901bbe5b1630ceea73d27597364c9af683 composite
  • actions/setup-python a26af69be951a213d495a4c3e4e4022e16d87065 composite
  • actions/upload-artifact ea165f8d65b6e75b540449e92b4886f43607fa02 composite
  • nf-core/setup-nextflow v2 composite
  • pietrobolcato/action-read-yaml 9f13718d61111b69f30ab4ac683e67a56d254e1d composite
.github/workflows/linting_comment.yml actions
  • dawidd6/action-download-artifact 4c1e823582f43b179e2cbb49c3eade4e41f992e2 composite
  • marocchino/sticky-pull-request-comment 52423e01640425a022ef5fd42c6fb5f633a02728 composite
.github/workflows/template_version_comment.yml actions
  • actions/checkout 11bd71901bbe5b1630ceea73d27597364c9af683 composite
  • mshick/add-pr-comment b8f338c590a895d50bcbfa6c5859251edc8952fc composite
  • nichmor/minimal-read-yaml v0.0.2 composite
modules/local/bambu/assembly/meta.yml cpan
modules/local/bambu/filter/meta.yml cpan
modules/local/bambu/readclasses/meta.yml cpan
modules/local/semerge/meta.yml cpan
modules/nf-core/multiqc/meta.yml cpan
modules/nf-core/samtools/view/meta.yml cpan
subworkflows/local/assembly_quant/meta.yml cpan
subworkflows/local/preprocess_reads/meta.yml cpan
subworkflows/local/single_transcript_quant/meta.yml cpan
subworkflows/local/transcript_quant/meta.yml cpan
subworkflows/nf-core/utils_nextflow_pipeline/meta.yml cpan
subworkflows/nf-core/utils_nfcore_pipeline/meta.yml cpan
subworkflows/nf-core/utils_nfschema_plugin/meta.yml cpan
docker/bambu_3.10/Dockerfile docker
  • quay.io/shahlab_singularity/bambu 3.10.0 build
docker/bambu_3.5.1/Dockerfile docker
  • quay.io/preskaa/bambu 3.5.1 build
docker/bambu_hongyhongfix/Dockerfile docker
  • quay.io/preskaa/bambu HongYhong_fix build
modules/local/bambu/assembly/environment.yml pypi
modules/local/bambu/filter/environment.yml pypi
modules/local/bambu/readclasses/environment.yml pypi
modules/local/semerge/environment.yml pypi
modules/nf-core/multiqc/environment.yml pypi
modules/nf-core/samtools/view/environment.yml pypi