vegan
Variant calling pipeline for whole Exome and whole Genome sequencing cANcer data
Repository
Basic Info
- Host: GitHub
- Owner: bioinfo-pf-curie
- License: other
- Language: Nextflow
- Default Branch: master
- Size: 15.8 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 3
Metadata Files
README.md
VEGAN
Variant calling pipeline for whole Exome and whole Genome sequencing cANcer data
Introduction
This pipeline was built for Whole Exome Sequencing and Whole Genome Sequencing analysis. It provides detailed quality controls for both frozen and FFPE samples, as well as a first downstream analysis including mutation calling, structural variants and copy number analysis. Most of the pipeline steps work for both tumor/normal paired samples and tumor-only samples. VEGAN can run from raw fastq files or from intermediate results such as BAM/CRAM aligned files or VCF files.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner.
It comes with conda / singularity containers making installation easier and results highly reproducible.
The first version of VEGAN was inspired by the nf-core Sarek pipeline, with several common processes, additional modifications and new analysis steps.
Pipeline summary
- Run quality control of raw sequencing reads (fastqc)
- Align reads on reference genome (bwa-mem, bwa-mem2, dragmap)
- Filtering and quality controls of aligned reads
- GATK preprocessing (GATK)
- Germline Variants calling (haplotypecaller/bcftools)
  - HaplotypeCaller
- Somatic Variants calling (mutect2/bcftools)
  - Mutect2 (including learnReadOrientationModel, GetPileupSummaries, CalculateContamination)
  - FilterMutectCalls
- Technical filters for somatic variants (DP, VAF, MAF) (SnpSift, bcftools)
- Variants annotation (SnpEff/SnpSift)
- Copy-number analysis (ASCAT, FACETS)
- Structural variants analysis (MANTA)
- Biomarkers analysis
  - Microsatellite instability analysis (MSIsensor-pro)
  - Tumor Mutational Burden (pyTMB)
- Gather all QC results in a final report (MultiQC)
Quick help
```bash
nextflow run main.nf --help
N E X T F L O W  ~  version 21.10.6
Launching main.nf [lethal_torricelli] - revision: 4d570988d2

VEGAN v2.3.0

Usage:

The typical command for running the pipeline is as follows:

nextflow run main.nf --profile STRING --samplePlan PATH --design PATH --step STRING --genome STRING --genomeAnnotationPath PATH

MANDATORY ARGUMENTS:
  --design PATH                   Path to design file specifying the metadata associated with the samples
  --genome STRING                 [hg19, hg19base, hg38, hg38base, mm10, mm39,...] Name of the reference genome.
  --genomeAnnotationPath PATH     PATH to the reference genome folder.
  --profile STRING                [test, multiconda, singularity, cluster, docker, conda, path, multipath] Configuration profile to use. Can use multiple (comma separated).
  --step STRING                   [mapping, markduplicates, filtering, calling, annotate] Specify starting step
  --outDir PATH                   The output directory where the results will be saved
  --tools STRING                  [haplotypecaller, mutect2, manta, snpeff, facets, ascat, tmb, msisensor] Specify tools to use for variant calling

INPUTS:
  --reads PATH                    Path to input data (must be surrounded with quotes)
  --samplePlan PATH               Path to sample plan (csv format) raw reads (if --reads is not specified), or intermediate files according to the --step parameter
  --singleEnd                     For single-end input data
  --splitFastq                    Split fastq files in chunks
  --fastqChunksSize INTEGER       Reads chunks size

ALIGNMENT:
  --aligner STRING                [bwa-mem, bwa-mem2, dragmap] Specify tools to use for mapping
  --cram                          Generate CRAM alignment files
  --mapQual INTEGER               Minimum mapping quality to consider for an alignment
  --saveAlignedIntermediates      Save intermediates alignment files
  --splitFastq                    Split fastq files in chunks

FILTERING:
  --keepDups                      Specify to keep duplicate reads when filtering the alignment
  --keepMultiHits                 Specify to keep multi hit reads when filtering the alignment
  --keepSingleton                 Specify to keep singleton reads when filtering the alignment
  --targetBed PATH                Target Bed file for targeted or whole exome sequencing

VARIANT CALLING:
  --baseQual INTEGER              Minimum base quality used by Facets for CNV calling
  --saveVcfIntermediates          Save intermediate vcf files
  --saveVcfMetrics                Save complementary vcf metrics files
  --skipMutectContamination       Do not apply the Contamination step for Mutect2 calls filtering
  --skipMutectOrientationModel    Do not apply the LearnOrientationModel step for Mutect2 calls filtering

TUMOR ONLY:
  --msiBaselineConfig PATH        PATH to Msisensor-pro baseline config file for tumor-only mode
  --pon PATH                      PATH to panels of normals (.vcf.gz)
  --ponIndex PATH                 PATH to panels of normals index file (.tbi)

VCF FILTERS:
  --filterSomaticDP INTEGER       Minimum sequencing depth to consider a somatic variant
  --filterSomaticMAF INTEGER      Maximum variant frequency in the general population to consider a somatic variant
  --filterSomaticVAF INTEGER      Minimum variant allele frequency to consider a somatic variant

ANNOTATION:
  --annotDb STRING                [cosmic, icgc, cancerhotspots, gnomad, dbnsfp] Annotation databases to use with SnpEff and SnpSift
  --ffpe                          Specify to use the ffpe parameters and filters for TMB computation

SKIP OPTIONS:
  --skipBQSR                      Disable BQSR
  --skipBamQC                     Disable QCs on BAM files
  --skipFastqc                    Disable Fastqc
  --skipIdentito                  Disable Identito
  --skipMultiqc                   Disable MultiQC
  --skipSaturation                Disable Preseq

OTHER OPTIONS:
  --disableAutoClean              Disable cleaning of work directory
  --multiqcConfig PATH            Specify a custom config file for MultiQC
  --name STRING                   Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic
  --sequencingCenter STRING       Name of sequencing center to be displayed in BAM file

=======================================================
Available Profiles
  -profile test                   Run the test dataset
  -profile conda                  Build a new conda environment before running the pipeline. Use --condaCacheDir to define the conda cache path
  -profile multiconda             Build a new conda environment per process before running the pipeline. Use --condaCacheDir to define the conda cache path
  -profile path                   Use the installation path defined for all tools. Use --globalPath to define the installation path
  -profile multipath              Use the installation paths defined for each tool. Use --globalPath to define the installation path
  -profile docker                 Use the Docker images for each process
  -profile singularity            Use the Singularity images for each process. Use --singularityPath to define the installation path
  -profile cluster                Run the workflow on the cluster, instead of locally
```
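The three VCF FILTERS options listed in the help message describe thresholds on sequencing depth (DP), variant allele frequency (VAF) and population frequency (MAF). The Python sketch below only illustrates how such thresholds combine. It is not the pipeline's implementation (the summary above names SnpSift and bcftools for this step); the `Variant` class, the `passes_filters` function, the use of fractions for VAF and MAF, the inclusive comparisons and the threshold values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical, simplified somatic call (not a VEGAN data structure)."""
    depth: int        # total read depth at the site (DP)
    alt_reads: int    # reads supporting the alternate allele
    pop_freq: float   # allele frequency in the general population (MAF)


def passes_filters(v: Variant, min_dp: int, min_vaf: float, max_maf: float) -> bool:
    """Return True when the variant satisfies all three thresholds."""
    if v.depth <= 0 or v.depth < min_dp:
        return False
    vaf = v.alt_reads / v.depth  # variant allele frequency as a fraction
    return vaf >= min_vaf and v.pop_freq <= max_maf


calls = [
    Variant(depth=100, alt_reads=20, pop_freq=0.0),   # kept
    Variant(depth=10, alt_reads=5, pop_freq=0.0),     # depth too low
    Variant(depth=100, alt_reads=2, pop_freq=0.0),    # VAF too low
    Variant(depth=100, alt_reads=50, pop_freq=0.05),  # too common in the population
]
# Illustrative thresholds only, not the pipeline defaults.
kept = [v for v in calls if passes_filters(v, min_dp=20, min_vaf=0.05, max_maf=0.001)]
print(len(kept))
```

A variant is kept only when it passes every threshold, which is why each rejected call in the example fails on exactly one criterion.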
Quick run
The pipeline can be run on any infrastructure, either from a list of input files or from a sample plan, as follows.
Run the pipeline on the test dataset
The test dataset is a downsampled Whole Exome Sequencing dataset. It can be launched with the following command (--step can also be set to filtering, calling or annotate).
nextflow run main.nf -profile test,multiconda \
--step mapping \
--condaCacheDir /bioinfo/local/curie/ngs-data-analysis/centos/tools/containers/conda/vegan-2.0.0/ \
--genomeAnnotationPath /data/annotations/pipelines/
Run the pipeline for WES analysis from a sample plan with specified tools and genome on the cluster, using singularity containers
nextflow run main.nf -profile singularity,cluster \
--samplePlan samples-WES.csv \
--design samples.design.csv \
--step mapping \
--singularityImagePath /bioinfo/local/curie/ngs-data-analysis/centos/tools/containers/singularity/vegan-2.0.0/images/ \
--targetBed capture.bed \
--tools manta,mutect2,snpeff,facets,tmb,haplotypecaller,msisensor \
--genome hg38 --genomeAnnotationPath /data/annotations/pipelines/ \
-resume
Run the pipeline on the cluster, using existing conda environments
```
nextflow run main.nf -profile multiconda,cluster \
  --samplePlan samples-WES.csv \
  --design samples.design.csv \
  --step mapping \
  --targetBed capture.bed \
  --tools manta,mutect2,snpeff,facets,tmb,haplotypecaller,msisensor,ascat \
  --genome hg38 \
  --genomeAnnotationPath /data/annotations/pipelines/ \
  --condaCacheDir /bioinfo/local/curie/ngs-data-analysis/centos/tools/containers/conda/vegan-1.2.0/
```
To build new conda environments, point the --condaCacheDir parameter to an empty folder.
Defining the '-profile'
By default (without any profile), Nextflow will execute the pipeline locally, expecting that all tools are available from your PATH variable.
In addition, we set up a few profiles that should allow you:
- 1) to use containers instead of a local installation,
- 2) to run the pipeline on a cluster instead of on a local architecture.
The description of each profile is available in the help message (see above).
Here are a few examples of how to set the profile option. See the full documentation for details.
```
# Run the pipeline locally, using a global environment where all tools are installed (built by conda for instance)
-profile path --globalPath INSTALLATION_PATH

# Run the pipeline on the cluster, using the Singularity containers
-profile cluster,singularity --singularityImagePath SINGULARITY_PATH

# Run the pipeline on the cluster, building new conda environments
-profile cluster,multiconda --condaCacheDir CONDA_CACHE
```
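When runs are launched from a script, the profile and parameter combinations above can be assembled programmatically. The following is a minimal Python sketch, assuming a hypothetical helper named `build_command` that is not part of VEGAN. It only joins the profiles with commas and appends the parameters shown in this README; it does not check that the values are valid.

```python
import shlex


def build_command(profiles, params, resume=False):
    """Assemble a `nextflow run main.nf` command line as a list of arguments."""
    cmd = ["nextflow", "run", "main.nf", "-profile", ",".join(profiles)]
    for name, value in params.items():
        cmd.append(f"--{name}")
        if value is not True:  # boolean flags such as --cram take no value
            cmd.append(str(value))
    if resume:
        cmd.append("-resume")
    return cmd


cmd = build_command(
    ["cluster", "singularity"],
    {"singularityImagePath": "SINGULARITY_PATH", "genome": "hg38"},
)
# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without extra quoting.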
Sample Plan
A sample plan is a csv file (comma separated), with no header, that lists all samples with their biological IDs.
The sample plan is expected to be structured as follows:
SAMPLE_ID,SAMPLE_NAME,PATH_TO_R1_FASTQ,[PATH_TO_R2_FASTQ]
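As an illustration of this layout, the Python sketch below parses a header-less sample plan and rejects rows that do not have three or four fields. It is a hypothetical pre-flight check, not the validation performed by the pipeline; the function name `read_sample_plan`, the sample IDs and the fastq paths are made up for the example.

```python
import csv
import io


def read_sample_plan(handle):
    """Parse a header-less sample plan into a list of records."""
    records = []
    for line_no, row in enumerate(csv.reader(handle), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) not in (3, 4):
            raise ValueError(f"line {line_no}: expected 3 or 4 fields, got {len(row)}")
        fields = [field.strip() for field in row]
        records.append({
            "sample_id": fields[0],
            "sample_name": fields[1],
            "r1": fields[2],
            # The second read file is optional (single-end data).
            "r2": fields[3] if len(fields) == 4 and fields[3] else None,
        })
    return records


# Hypothetical content: one paired-end sample and one single-end sample.
example = io.StringIO(
    "S1,tumor_A,/path/to/S1_R1.fastq.gz,/path/to/S1_R2.fastq.gz\n"
    "S2,normal_A,/path/to/S2_R1.fastq.gz\n"
)
plan = read_sample_plan(example)
for record in plan:
    print(record["sample_id"], "paired-end" if record["r2"] else "single-end")
```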
Design
A design file is a csv file that lists all experimental samples, their IDs, the associated germline sample, the sex of the patient and the status (tumor / normal). The design file is expected to have the following header:
GERMLINE_ID,TUMOR_ID,PAIR_ID,SEX
Both files are checked by the pipeline and must be rigorously defined for the pipeline to work. Note that the germline control is optional, but it is highly recommended when available. If the design file is not specified, the pipeline will stop after the alignment step, and variant calling and annotation will be skipped.
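To make the four-column design layout concrete, the Python sketch below reads a design file and separates tumor/normal pairs from tumor-only samples. Several points are assumptions made for the example and are not confirmed by this README: the header spelling with underscores, the convention that an empty germline field marks a tumor-only sample, the placeholder sex values, and the function name `read_design`.

```python
import csv
import io


def read_design(handle):
    """Split design rows into tumor/normal pairs and tumor-only samples."""
    rows = csv.reader(handle)
    next(rows, None)  # skip the header line
    paired, tumor_only = [], []
    for row in rows:
        if not row:
            continue  # skip blank lines
        # Exactly four fields are expected; anything else raises ValueError.
        germline_id, tumor_id, pair_id, sex = (field.strip() for field in row)
        record = {"germline": germline_id, "tumor": tumor_id, "pair": pair_id, "sex": sex}
        # Assumption: an empty germline field means no matched normal is available.
        (paired if germline_id else tumor_only).append(record)
    return paired, tumor_only


# Hypothetical content; IDs and sex values are placeholders.
example = io.StringIO(
    "GERMLINE_ID,TUMOR_ID,PAIR_ID,SEX\n"
    "S2,S1,pair1,XX\n"
    ",S3,pair2,XY\n"
)
paired, tumor_only = read_design(example)
print(len(paired), "paired,", len(tumor_only), "tumor-only")
```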
Full Documentation
- Installation
- Geniac
- Reference genomes
- Running the pipeline
- Profiles
- Output and how to interpret the results
- Troubleshooting
Funding
This pipeline was written by the Institut Curie bioinformatics platform (PA. Nicolas, T. Gutman, F. Jarlier, F. Allain, P. La Rosa, P. Hupe, N. Servant). The project was funded by the European Union’s Horizon 2020 research and innovation programme and the Canadian Institutes of Health Research under the grant agreement No 825835 in the framework of the European-Canadian Cancer Network, as well as by the Canceropole Ile de France (GENOPROFILE - RIC2021) project.
Contacts
For any question, bug or suggestion, please open an issue or contact the bioinformatics core facility.
Owner
- Name: Institut Curie, Bioinformatics Core Facility
- Login: bioinfo-pf-curie
- Kind: organization
- Location: Paris, France
- Website: https://bioinfo-pf-curie.github.io/
- Repositories: 11
- Profile: https://github.com/bioinfo-pf-curie
bioinformatics platform of the Institut Curie
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Nicolas
    given-names: Paul-Antoine
  - family-names: Gutman
    given-names: Tom
  - family-names: Jarlier
    given-names: Frederic
  - family-names: Allain
    given-names: Fabrice
  - family-names: La Rosa
    given-names: Philippe
  - family-names: Hupe
    given-names: Philippe
  - family-names: Servant
    given-names: Nicolas
title: "VEGAN - Variant calling pipeline for whole Exome and whole Genome sequencing cANcer data"
version: 2.3.0
date-released: 2024-01-01