https://github.com/bioinfo-pf-curie/nf-neoant
Detection of neoantigens from WES and RNA sequencing data
Repository
Basic Info
- Host: GitHub
- Owner: bioinfo-pf-curie
- License: other
- Language: Python
- Default Branch: main
- Size: 2.89 MB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
nf-neoant
nf-neoAnt pipeline
Introduction
The pipeline is built using Nextflow, a workflow manager to run tasks across multiple compute infrastructures in a very portable manner. It supports conda package manager and singularity / Docker containers making installation easier and results highly reproducible.
Pipeline summary
The objective of the pipeline is to predict tumor-specific neoantigens from both DNA and RNA next-generation sequencing data from patients.
- HLA typing is performed by seq2HLA (v2.2) for both MHCI and MHCII, based on the paired RNA FASTQ files.
- Neoantigen detection is performed by the pVACtools suite (v4.1.1). This part of the pipeline is divided into two branches: a DNA-based analysis (pVACseq) and an analysis of fusion events derived from RNA-seq data (pVACfuse).
- MiXCR (v4.5.0) provides a fast analysis of raw T- or B-cell receptor repertoires.
pVACseq
Paired RNA-seq reads are aligned with STAR (v2.7.6a) against the STAR index, using the --quantMode TranscriptomeSAM option to obtain a transcriptome-based alignment BAM file. Per-gene and per-transcript TPM (transcripts per million) values are then estimated with Salmon (v1.10.2), using the matching Gencode GFF3 and transcript FASTA files.
Small somatic variants (SNVs, indels) are first called with GATK Mutect2 (v4.1.8.0).
- Variants are annotated with VEP (ENSEMBL v110.1).
- Both gene (GX) and transcript (TX) expression values are then added with vatools (v5.1.0), using the previously computed expression files.
- RNA depth (RDP) and RNA allelic ratio (RAF) are then added using a combination of bcftools (v1.15.1), GATK SelectVariants (v4.1.9.0) and bam-readcount (v0.8).
pVACseq is then run on the resulting variant file, using the HLA typing files (for MHCI and MHCII).
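To make the TPM unit used above concrete, here is a minimal sketch of its textbook definition: each transcript's read count is divided by its length, and the resulting rates are rescaled so that they sum to one million. This is only an illustration of the unit. Salmon estimates TPM with its own statistical model, and the function name and toy numbers below are hypothetical.

```python
def tpm(counts, lengths):
    """Convert per-transcript read counts to TPM (transcripts per million).

    Illustrative only: the pipeline takes TPM values directly from Salmon.
    """
    if len(counts) != len(lengths):
        raise ValueError("counts and lengths must have the same size")
    # Length-normalised rate for each transcript.
    rates = [count / length for count, length in zip(counts, lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Rescale so that all values sum to one million.
    return [rate / total * 1_000_000 for rate in rates]


# Hypothetical toy data: three transcripts.
values = tpm(counts=[100, 200, 300], lengths=[1000, 1000, 3000])
```

Because of the rescaling, TPM values are comparable between samples as relative abundances, which is why they are convenient for annotating variants with expression.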
pVACfuse
- Arriba (v2.4.0) is run on a subset of the original STAR-aligned file that contains only reads of putative relevance to fusion detection, such as unmapped and clipped reads.
- pVACfuse is then run on the list of filtered fusions of interest, using both HLA typing files.
Workflow

Run the pipeline from a sample plan
Arguments & Parameters
sample_plan: CSV file containing one sample per row
assembly: genome assembly used for the analysis (example: hg38)
genomePath: path containing the files described in "conf/genomes.config"
singularityImagePath: path to the Singularity images
vep_dir_cache: path to the VEP cache, downloaded following the VEP cache instructions (here: species="homo_sapiens" and version="110_GRCh38")
vep_plugin_repo: path to the VEP_plugins repository into which Frameshift.pm was downloaded
blacklist_tsv: file from the arriba archive (in the /database folder) named "blacklist_${assembly}_*.tsv.gz"
proteinGff: file from the arriba archive (in the /database folder) named "protein_domains_${assembly}_*.gff3"
mi_license: path to the "mi.license" file needed by MiXCR (free for academic use)
tmpdir: path to a temporary folder
```bash
nextflow run main.nf --samplePlan ${sample_plan} \
  --genome ${assembly} \
  --genomeAnnotationPath ${genomePath} \
  --outDir ${outputDir} \
  --singularityImagePath ${sif} \
  --vepDirCache ${vep_dir_cache} \
  --vepPluginRepo ${vep_plugin_repo} \
  --miLicense ${mi_license} \
  --tmpdir ${tmpdir} \
  -profile singularity,cluster \
  -w ${tmp_dir} \
  -resume
```
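When the pipeline is launched from a script rather than typed by hand, the same command can be assembled programmatically. The sketch below only reuses the flags shown in the command above. The helper name `build_command` and every path in the usage example are hypothetical placeholders.

```python
import shlex


def build_command(sample_plan, genome, genome_annotation_path, out_dir,
                  singularity_image_path, vep_dir_cache, vep_plugin_repo,
                  mi_license, tmpdir, work_dir, profile="singularity,cluster"):
    """Return the nextflow command as an argument list (hypothetical helper)."""
    options = {
        "--samplePlan": sample_plan,
        "--genome": genome,
        "--genomeAnnotationPath": genome_annotation_path,
        "--outDir": out_dir,
        "--singularityImagePath": singularity_image_path,
        "--vepDirCache": vep_dir_cache,
        "--vepPluginRepo": vep_plugin_repo,
        "--miLicense": mi_license,
        "--tmpdir": tmpdir,
    }
    cmd = ["nextflow", "run", "main.nf"]
    for flag, value in options.items():
        cmd += [flag, str(value)]
    # Nextflow's own options use a single dash.
    cmd += ["-profile", profile, "-w", str(work_dir), "-resume"]
    return cmd


# All paths below are made-up placeholders.
cmd = build_command(
    sample_plan="/data/sample_plan.csv",
    genome="hg38",
    genome_annotation_path="/ref/annotations",
    out_dir="/results/neoant",
    singularity_image_path="/images",
    vep_dir_cache="/ref/vep_cache",
    vep_plugin_repo="/ref/VEP_plugins",
    mi_license="/licenses/mi.license",
    tmpdir="/scratch/tmp",
    work_dir="/scratch/work",
)
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to launch
```

Building an argument list instead of a single string avoids shell quoting problems when paths contain spaces.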
Sample plan
A sample plan is a CSV (comma-separated) file that lists all the samples with their biological IDs. The sample plan is expected to contain the following fields, in this order and without a header row:
sampleID, sampleName, normalName, path_to_fastqDnaR1, path_to_fastqDnaR2, path_to_sampleDnaBam, path_to_sampleDnaBamIndex, path_to_vcf, path_to_fastqRnaR1, path_to_fastqRnaR2, path_to_sampleRnaBam, path_to_sampleRnaBamIndex
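As an illustration, a sample plan row can be generated with Python's `csv` module. This is a sketch, not part of the pipeline: the helper `sample_plan_row`, the sample values and the paths are hypothetical, and leaving unused columns empty is an assumption about what the pipeline accepts.

```python
import csv
import io

# Column order taken from the field list above (the file itself has no header).
SAMPLE_PLAN_FIELDS = [
    "sampleID", "sampleName", "normalName",
    "path_to_fastqDnaR1", "path_to_fastqDnaR2",
    "path_to_sampleDnaBam", "path_to_sampleDnaBamIndex",
    "path_to_vcf",
    "path_to_fastqRnaR1", "path_to_fastqRnaR2",
    "path_to_sampleRnaBam", "path_to_sampleRnaBamIndex",
]


def sample_plan_row(**values):
    """Return one row in column order; fields not given are left empty."""
    unknown = set(values) - set(SAMPLE_PLAN_FIELDS)
    if unknown:
        raise ValueError(f"unknown sample plan fields: {sorted(unknown)}")
    return [values.get(field, "") for field in SAMPLE_PLAN_FIELDS]


# Hypothetical sample: a VCF plus paired RNA FASTQ files.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(sample_plan_row(
    sampleID="S1",
    sampleName="tumor1",
    normalName="normal1",
    path_to_vcf="/data/S1.vcf.gz",
    path_to_fastqRnaR1="/data/S1_rna_R1.fastq.gz",
    path_to_fastqRnaR2="/data/S1_rna_R2.fastq.gz",
))
plan_text = buffer.getvalue()
```

Writing rows through a fixed field list keeps the column positions stable, which matters because the file has no header to identify them.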
Steps
The basic steps are: HLAtyping, RNAquant, pVacseq, pVacfuse and mixcr. Using the --step option, they can be run separately (e.g. --step HLAtyping, --step RNAquant or --step mixcr), partially combined (e.g. --step HLAtyping,RNAquant,pVacseq or --step HLAtyping,pVacfuse), or all together (default mode, equivalent to --step HLAtyping,RNAquant,pVacseq,pVacfuse,mixcr).
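A small helper can guard against misspelled step names before the pipeline is launched. The sketch below is hypothetical: it assumes that step names are case-sensitive and that the order in which they are listed does not matter, neither of which is stated above.

```python
# Step names as listed above.
KNOWN_STEPS = ("HLAtyping", "RNAquant", "pVacseq", "pVacfuse", "mixcr")


def step_option(steps=None):
    """Return the ['--step', 'a,b,...'] arguments for the requested steps.

    With no argument, all steps are listed, mirroring the default mode.
    """
    if steps is None:
        steps = KNOWN_STEPS
    if not steps:
        raise ValueError("at least one step is required")
    unknown = [step for step in steps if step not in KNOWN_STEPS]
    if unknown:
        raise ValueError(f"unknown step(s): {unknown}")
    # Emit the steps in the order used above, comma-separated without spaces.
    ordered = [step for step in KNOWN_STEPS if step in steps]
    return ["--step", ",".join(ordered)]


args = step_option(["pVacfuse", "HLAtyping"])
```

The returned list can be appended to the rest of the `nextflow run` arguments.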
HLA typing
If you only want to get the HLA alleles (MHCI and MHCII), add "--step HLAtyping" to your command line. If you already have the two HLA allele files (MHCI and MHCII), add their full paths as two extra columns at the end of each sample plan row, as follows:
sampleID, sampleName, normalName, path_to_fastqDnaR1, path_to_fastqDnaR2, path_to_sampleDnaBam, path_to_sampleDnaBamIndex, path_to_vcf, path_to_fastqRnaR1, path_to_fastqRnaR2, path_to_sampleRnaBam, path_to_sampleRnaBamIndex,path_to_HLAI_file,path_to_HLAII_file
RNA expression
If you only want to get the transcript-based and gene-based expression files (TPM), add "--step RNAquant" to your command line. If you already have the gene-based and transcript-based expression files, add their full paths as two further columns at the end of each sample plan row, as follows:
sampleID, sampleName, normalName, path_to_fastqDnaR1, path_to_fastqDnaR2, path_to_sampleDnaBam, path_to_sampleDnaBamIndex, path_to_vcf, path_to_fastqRnaR1, path_to_fastqRnaR2, path_to_sampleRnaBam, path_to_sampleRnaBamIndex,path_to_HLAI_file,path_to_HLAII_file,path_to_gene_tpm_file,path_to_transcript_tpm_file
Or, if you still want the pipeline to run the HLAtyping step (--step HLAtyping,RNAquant,pVacseq), leave the two HLA columns empty:
sampleID, sampleName, normalName, path_to_fastqDnaR1, path_to_fastqDnaR2, path_to_sampleDnaBam, path_to_sampleDnaBamIndex, path_to_vcf, path_to_fastqRnaR1, path_to_fastqRnaR2, path_to_sampleRnaBam, path_to_sampleRnaBamIndex,,,path_to_gene_tpm_file,path_to_transcript_tpm_file
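The layouts above differ only in their trailing columns, so a row can be parsed by padding it to the full 16 columns and checking which optional files are present. This is a sketch of that idea, not the pipeline's own parser: the function names and the example row are hypothetical.

```python
import csv
import io

BASE_FIELDS = [
    "sampleID", "sampleName", "normalName",
    "path_to_fastqDnaR1", "path_to_fastqDnaR2",
    "path_to_sampleDnaBam", "path_to_sampleDnaBamIndex",
    "path_to_vcf",
    "path_to_fastqRnaR1", "path_to_fastqRnaR2",
    "path_to_sampleRnaBam", "path_to_sampleRnaBamIndex",
]
OPTIONAL_FIELDS = [
    "path_to_HLAI_file", "path_to_HLAII_file",
    "path_to_gene_tpm_file", "path_to_transcript_tpm_file",
]
ALL_FIELDS = BASE_FIELDS + OPTIONAL_FIELDS


def parse_sample_plan(text):
    """Parse a header-less sample plan into one dict per sample."""
    samples = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        if not len(BASE_FIELDS) <= len(row) <= len(ALL_FIELDS):
            raise ValueError(f"expected 12 to 16 columns, got {len(row)}")
        # Trim stray spaces and pad the missing optional columns.
        cells = [cell.strip() for cell in row]
        cells += [""] * (len(ALL_FIELDS) - len(cells))
        samples.append(dict(zip(ALL_FIELDS, cells)))
    return samples


def precomputed_inputs(sample):
    """Report which optional inputs a sample already provides."""
    return {
        "HLAtyping": bool(sample["path_to_HLAI_file"]
                          and sample["path_to_HLAII_file"]),
        "RNAquant": bool(sample["path_to_gene_tpm_file"]
                         and sample["path_to_transcript_tpm_file"]),
    }


# Hypothetical row: empty HLA columns, expression files already available.
example_row = ["S1", "tumor1", "normal1", "", "", "", "", "/data/S1.vcf.gz",
               "", "", "", "", "", "", "/data/S1_gene.tsv", "/data/S1_tx.tsv"]
example_text = ",".join(example_row) + "\n"
samples = parse_sample_plan(example_text)
status = precomputed_inputs(samples[0])
```

Here `status` indicates that the expression files are already provided while the HLA alleles are not, which corresponds to the last layout above.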
Test
Run the pipeline on the test dataset, which launches the HLAtyping step:
```bash
nextflow run main.nf -profile test,singularity --outDir ${outputDir} --singularityImagePath ${sif} -w ${work_dir}
```
Credits
This pipeline was written by the Institut Curie bioinformatics platform CUBIC (E. Girard, N. Servant). The project was funded by IMMUcan, the integrated European immuno-oncology profiling platform.
Contacts
For any question, bug or suggestion, please use the issue system or contact the bioinformatics core facility.
Owner
- Name: Institut Curie, Bioinformatics Core Facility
- Login: bioinfo-pf-curie
- Kind: organization
- Location: Paris, France
- Website: https://bioinfo-pf-curie.github.io/
- Repositories: 11
- Profile: https://github.com/bioinfo-pf-curie
bioinformatics platform of the Institut Curie
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Dependencies
- sphinx-rtd-theme ==1.0.0
- GitPython ==3.1.20
- cmake ==3.21.3
- colorlog ==6.5.0
- dotty-dict ==1.3.0
- importlib-metadata *
- sphinx ==3.5.4
- sphinx-rtd-theme ==1.0.0
- validators ==0.18.2
- GitPython ==3.1.20
- colorlog ==6.5.0
- dotty-dict ==1.3.0
- geniac *
- pre-commit ==2.15.0
- pytest ==6.2.5
- pytest-cov ==3.0.0
- pytest-datadir ==1.3.1
- pytest-datafiles ==2.0
- pytest-icdiff ==0.5
- pytest-sugar ==0.9.4
- setuptools-scm ==6.3.2
- tox-conda ==0.8.3
- validators ==0.18.2
- wheel ==0.37.0