https://github.com/bzhanglab/neoflow

NeoFlow: a proteogenomics pipeline for neoantigen discovery

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.8%) to scientific vocabulary

Keywords

neoantigen-discovery neoantigen-prediction nextflow-pipeline novel-peptide-identifications proteogenomics
Last synced: 6 months ago · JSON representation

Repository

NeoFlow: a proteogenomics pipeline for neoantigen discovery

Basic Info
  • Host: GitHub
  • Owner: bzhanglab
  • Language: Nextflow
  • Default Branch: master
  • Homepage:
  • Size: 4.01 MB
Statistics
  • Stars: 22
  • Watchers: 5
  • Forks: 12
  • Open Issues: 9
  • Releases: 0
Topics
neoantigen-discovery neoantigen-prediction nextflow-pipeline novel-peptide-identifications proteogenomics
Created over 7 years ago · Last pushed over 1 year ago
Metadata Files
Readme

README.md

NeoFlow

Overview

NeoFlow: a proteogenomics pipeline for neoantigen discovery

NeoFlow includes four modules:

  1. Variant annotation and customized database construction: neoflow_db.nf;
  2. Variant peptide identification: neoflow_msms.nf;
     • MS/MS searching. Three search engines are available: MS-GF+, X!Tandem and Comet;
     • FDR estimation: global FDR estimation;
     • Novel peptide validation by PepQuery;
     • RT based validation for novel peptide identifications using AutoRT: optional (GPU required).
  3. HLA typing: neoflow_hlatyping.nf;
  4. Neoantigen prediction: neoflow_neoantigen.nf.

NeoFlow supports both label free and iTRAQ/TMT data.

Installation

  1. Download neoflow:

```sh
git clone https://github.com/bzhanglab/neoflow
```

  2. Install Docker (>=19.03).

  3. Install Nextflow. More information can be found in the Nextflow get started page.

  4. Install ANNOVAR by following the instructions at http://annovar.openbioinformatics.org/en/latest/.

  5. Install netMHCpan 4.0 by following the instructions at http://www.cbs.dtu.dk/services/doc/netMHCpan-4.0.readme. Please set TMPDIR to /tmp in the file netMHCpan-4.0/netMHCpan, as shown below:

```sh
# determine where to store temporary files (must be writable to all users)

if ( ${?TMPDIR} == 0 ) then
	setenv  TMPDIR  /tmp
endif
```

  6. Install nvidia-docker (>=2.2.2) for AutoRT by following the instructions at https://github.com/NVIDIA/nvidia-docker. This step is optional; it is only required for RT based validation of novel peptide identifications using AutoRT.

All other tools used by NeoFlow have been dockerized and will be installed automatically the first time NeoFlow is run on a computer.
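Before running the pipeline, it can help to confirm that the prerequisites listed above are in place. The following is a minimal pre-flight sketch, not part of NeoFlow: the helper names `version_ge` and `check_tool` are hypothetical, and the version comparison assumes GNU `sort -V` is available.

```sh
# Pre-flight check sketch (illustrative only; not shipped with NeoFlow).

# Succeed if version $1 is greater than or equal to version $2.
# Assumes GNU coreutils `sort -V` (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Report whether a command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
    else
        echo "missing: $1"
        return 1
    fi
}

check_tool docker   || true
check_tool nextflow || true

# Compare the Docker server version with the documented minimum (19.03).
if command -v docker >/dev/null 2>&1; then
    docker_version=$(docker version --format '{{.Server.Version}}' 2>/dev/null || true)
    if [ -n "$docker_version" ] && version_ge "$docker_version" "19.03"; then
        echo "Docker $docker_version satisfies >=19.03"
    else
        echo "Docker version could not be confirmed as >=19.03"
    fi
fi
```

ANNOVAR and netMHCpan are installed into user-chosen folders, so they are not checked here; their locations are passed to the pipeline as parameters.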

Usage

1. Variant annotation and customized database construction

```sh
$ nextflow run neoflow_db.nf --help
N E X T F L O W  ~  version 19.10.0

Launching neoflow_db.nf [irreverent_faggin] - revision: 741bf1a931

neoflow => variant annotation and customized database construction

Usage:
nextflow run neoflow_db.nf
Arguments:
  --vcf_file      A txt file contains VCF file(s)
  --annovar_dir   ANNOVAR folder
  --protocol      The parameter of "protocol" for ANNOVAR, default is "refGene"
  --ref_dir       ANNOVAR annotation data folder
  --ref_ver       The genome version, hg19 or hg38, default is "hg19"
  --out_dir       Output folder, default is "./output"
  --cpu           The number of CPUs
  --help          Print help message
```

The input for parameter --vcf_file is a tab-delimited text file that lists the path of each variant file. A variant file can be in VCF format or in the simple text-based ANNOVAR input format. The layout of the --vcf_file input is shown below:

| experiment | sample | file | file_type |
|---|---|---|---|
| TMT01 | T1 | T1_somatic.vcf;T1_rna.vcf | somatic;rna |
| TMT01 | T2 | T2_somatic.vcf;T2_rna.vcf | somatic;rna |
| TMT02 | T3 | T3_somatic.vcf;T3_rna.vcf | somatic;rna |
| TMT02 | T4 | T4_somatic.vcf;T4_rna.vcf | somatic;rna |

The experiment column holds the name of the label-free, TMT or iTRAQ experiment, and the sample column holds the sample name. For iTRAQ or TMT data, samples from the same iTRAQ or TMT experiment should share the same experiment name. For label-free data, each sample should have its own experiment name. All variant files for the same sample (for example, a somatic variant VCF file and a VCF file of variants called from RNA-Seq data) should be listed in the same row in the file column, separated by ";". The file_type column gives the variant type of each file listed in the file column, in the same order. Please note that all variant files should be located under the folder where you run neoflow. We recommend providing the absolute path of each variant file in the input txt file for --vcf_file.
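The table described above can be written and sanity-checked with a short script. This is a sketch under stated assumptions: the helpers `vcf_row`, `make_vcf_table` and `check_vcf_table` are hypothetical names, the sample names and VCF file names are placeholders mirroring the table, and the file is assumed to start with the header row shown there.

```sh
# Sketch: build and check the tab-delimited table passed to --vcf_file.
# All helper names, sample names and file names are placeholders.

# Print one row: experiment, sample, somatic VCF path, RNA VCF path.
vcf_row() {
    printf '%s\t%s\t%s;%s\t%s\n' "$1" "$2" "$3" "$4" "somatic;rna"
}

# Print the whole table for VCF files stored under directory $1.
make_vcf_table() {
    vcf_dir=$1
    printf 'experiment\tsample\tfile\tfile_type\n'
    vcf_row TMT01 T1 "$vcf_dir/T1_somatic.vcf" "$vcf_dir/T1_rna.vcf"
    vcf_row TMT01 T2 "$vcf_dir/T2_somatic.vcf" "$vcf_dir/T2_rna.vcf"
    vcf_row TMT02 T3 "$vcf_dir/T3_somatic.vcf" "$vcf_dir/T3_rna.vcf"
    vcf_row TMT02 T4 "$vcf_dir/T4_somatic.vcf" "$vcf_dir/T4_rna.vcf"
}

# Check that every data row has 4 columns and that the number of files
# in column 3 matches the number of types in column 4.
check_vcf_table() {
    awk -F '\t' '
        NR == 1 { next }   # skip the header row
        NF != 4 {
            printf "line %d: expected 4 columns, got %d\n", NR, NF
            bad = 1
            next
        }
        split($3, files, ";") != split($4, types, ";") {
            printf "line %d: file and file_type counts differ\n", NR
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Absolute paths are used, as recommended above.
make_vcf_table "$PWD/vcf" | check_vcf_table - && echo "table looks consistent"
```

To keep the table, redirect the output of `make_vcf_table` into a file and pass that file to --vcf_file.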

The ANNOVAR annotation data (--ref_dir) can be downloaded by following the instructions at http://annovar.openbioinformatics.org/en/latest/user-guide/download/.

The output files of neoflow_db.nf include customized protein databases in FASTA format for each experiment and variant annotation result files for each sample.

Example

```sh
nextflow run neoflow_db.nf --ref_dir /data/tools/annovar/humandb_hg19/ \
  --vcf_file example_data/test_vcf_files.tsv \
  --annovar_dir /data/tools/annovar/ \
  --ref_ver hg19 \
  --out_dir output
```

Please update the inputs for parameters --ref_dir and --annovar_dir before running the above example. The input file for --vcf_file can be downloaded from the example data (Right click and Select "Save link as…" or "Download Linked File") prepared for testing. After downloading the example data, unzip it; all the testing data are then available in the example_data folder.

The running time of the above example is less than 5 minutes on a Linux server with 40 cores.

2. Variant peptide identification

Please note that the customized database generated in the first step will be used in this step.

```sh
$ ./nextflow run neoflow_msms.nf --help
N E X T F L O W  ~  version 19.10.0

Launching neoflow_msms.nf [drunk_nobel] - revision: 6d58fb19bd

neoflow => Variant peptide identification

Usage:
nextflow run neoflow-msms.nf
MS/MS searching arguments:
  --db              The customized protein database (target + decoy sequences) in FASTA format which is generated by neoflow_db.nf
  --ms              MS/MS data in MGF format
  --msms_para_file  Parameter file for MS/MS searching
  --out_dir         Output folder, default is "./"
  --prefix          The prefix of output files
  --search_engine   The search engine used for MS/MS searching, comet=Comet, msgf=MS-GF+ or xtandem=X!Tandem

PepQuery arguments:
  --pv_enzyme       Enzyme used for protein digestion. 0:Non enzyme, 1:Trypsin (default), 2:Trypsin (no P rule), 3:Arg-C, 4:Arg-C (no P rule), 5:Arg-N, 6:Glu-C, 7:Lys-C
  --pv_c            The max missed cleavages, default is 2
  --pv_tol          Precursor ion m/z tolerance, default is 10
  --pv_tolu         The unit of --tol, ppm or Da. Default is ppm
  --pv_itol         The error window for fragment ion, default is 0.5
  --pv_fixmod       Fixed modification. The format is like : 1,2,3. Different modification is represented by different number
  --pv_varmod       Variable modification. The format is the same with --fixMod;
  --pv_refdb        Reference protein database

AutoRT parameters:
  --rt_validation   Perform RT based validation

  --help            Print help message
```

The output files of neoflow_msms.nf include the raw MS/MS search identification files, FDR estimation result files at both PSM and peptide levels, and PepQuery validation result files.

Example

```sh
nextflow run neoflow_msms.nf --ms example_data/mgf/ \
  --msms_para_file example_data/comet_parameter.txt \
  --search_engine comet \
  --db output/customized_database/neoflow_crc_target_decoy.fasta \
  --out_dir output \
  --pv_refdb output/customized_database/ref.fasta \
  --pv_tol 20 \
  --pv_itol 0.05
```

The input files for --ms and --msms_para_file can be downloaded from the example data (Right click and Select "Save link as…" or "Download Linked File") prepared for testing.

The variant peptide identification result is in this file output/novel_peptide_identification/novel_peptides_psm_pepquery.tsv.

The running time of the above example is less than 15 minutes on a Linux server with 40 cores.

3. HLA typing

```sh
$ ./nextflow run neoflow_hlatyping.nf --help
N E X T F L O W  ~  version 19.10.0

Launching neoflow_hlatyping.nf [spontaneous_hawking] - revision: 5fd970e701

neoflow => HLA typing

Usage:
nextflow run neoflow_hlatyping.nf
Arguments:
  --reads         Reads data in fastq.gz or fastq format. For example, "*_{1,2}.fq.gz"
  --hla_ref_dir   HLA reference folder
  --seqtype       Read type, dna or rna. Default is dna.
  --singleEnd     Single end or not, default is false (pair end reads)
  --cpu           The number of CPUs, default is 6.
  --out_dir       Output folder, default is "./"
  --help          Print help message
```

The output of neoflow_hlatyping.nf is a txt format file containing HLA alleles for a sample. This file is generated by OptiType.

Example

```sh
nextflow run neoflow_hlatyping.nf --hla_ref_dir example_data/hla_reference \
  --reads "example_data/dna/*_{1,2}.fastq.gz" \
  --out_dir output/ \
  --cpu 40
```

The input files for --hla_ref_dir and --reads can be downloaded from the example data (Right click and Select "Save link as…" or "Download Linked File") prepared for testing.

The HLA typing result is in this file output/hla_type/sample1/sample1_result.tsv.

The running time of the above example is less than 10 minutes on a Linux server with 40 cores.

4. Neoantigen prediction

Please note that the results generated in steps 1-3 will be used in this step.

```sh
$ ./nextflow run neoflow_neoantigen.nf --help
N E X T F L O W  ~  version 19.10.0

Launching neoflow_neoantigen.nf [mighty_roentgen] - revision: e4261baca3

neoflow => Neoantigen prediction

Usage:
nextflow run neoflow_neoantigen.nf
Arguments:
  --var_db          Variant (somatic) database in fasta format generated by neoflow_db.nf
  --var_info_file   Variant (somatic) information in txt format generated by neoflow_db.nf
  --ref_db          Reference (known) protein database
  --hla_type        HLA typing result in txt format generated by Optitype
  --netmhcpan_dir   NetMHCpan 4.0 folder
  --var_pep_file    Variant peptide identification result generated by neoflow_msms.nf, optional.
  --var_pep_info    Variant information in txt format for customized database used for variant peptide identification
  --prefix          The prefix of output files
  --out_dir         Output directory
  --cpu             The number of CPUs
  --help            Print help message
```

The output of neoflow_neoantigen.nf is a tsv format file containing neoantigen prediction result as shown below:

Variant_ID|Chr|Start|End|Ref|Alt|Variant_Type|Variant_Function|Gene|mRNA|Neoepitope|Variant_Start|Variant_End|AA_before|AA_after|HLA_type|netMHCpan_binding_affinity_nM|netMHCpan_precentail_rank|protein_var_evidence_pep
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
VAR\|NM_002536\|10054|chrX|48418659|48418659|G|A|nonsynonymous SNV|protein-altering|TBC1D25|NM_002536|TGFGGHRG|1|1|A|T|HLA-A*01:01|44216.6|88.5537|-
VAR\|NM_002536\|10054|chrX|48418659|48418659|G|A|nonsynonymous SNV|protein-altering|TBC1D25|NM_002536|TGFGGHRG|1|1|A|T|HLA-C*07:01|43330|73.7774|-
VAR\|NM_002536\|10054|chrX|48418659|48418659|G|A|nonsynonymous SNV|protein-altering|TBC1D25|NM_002536|TGFGGHRG|1|1|A|T|HLA-B*08:01|35925.8|70.8561|-
VAR\|NM_001348265\|10055|chrX|48418659|48418659|G|A|nonsynonymous SNV|protein-altering|TBC1D25|NM_001348265|TGFGGHRG|1|1|A|T|HLA-A*01:01|44216.6|88.5537|-
VAR\|NM_001348265\|10055|chrX|48418659|48418659|G|A|nonsynonymous SNV|protein-altering|TBC1D25|NM_001348265|TGFGGHRG|1|1|A|T|HLA-C*07:01|43330|73.7774|-

Column description for the above table:

  • Variant_ID: variant ID defined by neoflow
  • Chr: variant chromosome
  • Start: start position on genome
  • End: end position on genome
  • Ref: reference base
  • Alt: alternative base
  • Variant_Type: variant type annotated by ANNOVAR
  • Variant_Function: variant function annotated by ANNOVAR
  • Gene: gene ID
  • mRNA: mRNA ID
  • Neoepitope: neoepitope peptide
  • Variant_Start: variant start position on neoepitope peptide
  • Variant_End: variant end position on neoepitope peptide
  • AA_before: reference amino acid
  • AA_after: alternative amino acid
  • HLA_type: HLA type
  • netMHCpan_binding_affinity_nM: MHC-peptide binding affinity from NetMHCpan 4.0. The lower the value, the higher the binding affinity between MHC and neoepitope peptide.
  • netMHCpan_precentail_rank: MHC-peptide binding affinity rank from NetMHCpan 4.0
  • protein_var_evidence_pep: variant peptide. "-" means that no identified variant peptide covers the mutation site.
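Because a lower netMHCpan_binding_affinity_nM value means stronger binding, a common follow-up is to keep only rows below an affinity cut-off. The sketch below is illustrative rather than part of NeoFlow: `filter_neoepitopes` is a hypothetical helper, the result file is assumed to be tab-delimited with the header names listed above, and the 500 nM cut-off is an assumption chosen for the example, not a value taken from NeoFlow.

```sh
# Sketch: keep neoepitopes whose predicted binding affinity is below a cut-off.
# Usage: filter_neoepitopes RESULT_TSV MAX_AFFINITY_NM
# The affinity column is located by its header name rather than by position.
filter_neoepitopes() {
    awk -F '\t' -v max_nm="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            aff = col["netMHCpan_binding_affinity_nM"]
            if (!aff) exit 2   # affinity column not found in the header
            print              # keep the header row
            next
        }
        ($aff + 0) < (max_nm + 0)   # print rows below the cut-off
    ' "$1"
}

# Example call on the result file named in this README, if it exists.
# The 500 nM threshold is an assumed value for illustration.
result=output/neoantigen_prediction/sample1_neoepitope_filtered_by_reference_add_variant_protein_evidence.tsv
if [ -f "$result" ]; then
    filter_neoepitopes "$result" 500
fi
```

Looking the column up by name keeps the filter working even if the column order differs from the table above.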

Example

```sh
nextflow run neoflow_neoantigen.nf --prefix sample1 \
  --hla_type output/hla_type/sample1/sample1_result.tsv \
  --var_db output/customized_database/sample1-somatic-var.fasta \
  --var_info_file output/customized_database/sample1-somatic-varInfo.txt \
  --out_dir output/ \
  --netmhcpan_dir /data/tools/netMHCpan-4.0/ \
  --cpu 40 \
  --ref_db output/customized_database/ref.fasta \
  --var_pep_file output/novel_peptide_identification/novel_peptides_psm_pepquery.tsv \
  --var_pep_info output/customized_database/neoflow_crc_anno-varInfo.txt
```

Please update the input for parameter --netmhcpan_dir before running the above example.

The neoantigen prediction result is in this file output/neoantigen_prediction/sample1_neoepitope_filtered_by_reference_add_variant_protein_evidence.tsv.

The running time of the above example is less than 30 minutes on a Linux server with 40 cores.

Example data

The test data used for the above examples can be downloaded by clicking test data (Right click and Select "Save link as…" or "Download Linked File").

How to cite:

Wen, B., Li, K., Zhang, Y. et al. Cancer neoantigen prioritization through sensitive and reliable proteogenomics analysis. Nature Communications 11, 1759 (2020). https://doi.org/10.1038/s41467-020-15456-w

Owner

  • Name: Zhang Lab
  • Login: bzhanglab
  • Kind: organization
  • Location: Houston, TX

Translating omics data into biological insights.

GitHub Events

Total
  • Watch event: 3
  • Push event: 3
Last Year
  • Watch event: 3
  • Push event: 3