vegan

Variant calling pipeline for whole Exome and whole Genome sequencing cANcer data

https://github.com/bioinfo-pf-curie/vegan

Repository

Basic Info

Host: GitHub
Owner: bioinfo-pf-curie
License: other
Language: Nextflow
Default Branch: master
Size: 15.8 MB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 3

Created over 4 years ago · Last pushed over 1 year ago

Metadata Files

Readme Changelog License Citation

VEGAN

Variant calling pipeline for whole Exome and whole Genome sequencing cANcer data

Introduction

This pipeline was built for Whole Exome Sequencing (WES) and Whole Genome Sequencing (WGS) analysis. It provides detailed quality controls for both frozen and FFPE samples, as well as a first downstream analysis including mutation calling, structural variant analysis and copy number analysis. Most of the pipeline steps work for both tumor/normal paired samples and tumor-only samples. VEGAN can run from raw fastq files or from intermediate results such as aligned BAM/CRAM files or VCF files.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with conda / singularity containers, making installation easier and results highly reproducible. The first version of VEGAN was inspired by the nf-core Sarek pipeline, with several common processes, additional modifications and new analysis steps.

Pipeline summary

1. Run quality control of raw sequencing reads (fastqc)
2. Align reads on reference genome (bwa-mem, bwa-mem2, dragmap)
3. Filtering and quality controls of aligned reads
   - Report mapping metrics (picard)
   - Mark and remove duplicates (markdup)
   - Library complexity analysis (Preseq)
   - Filtering aligned BAM files (SAMTools)
   - Insert size distribution (picard)
   - Identity monitoring (bcftools / R)
4. GATK preprocessing (GATK)
5. Germline Variants calling (haplotypecaller / bcftools)
   - HaplotypeCaller
6. Somatic Variants calling (mutect2 / bcftools)
   - Mutect2 (including learnReadOrientationModel, GetPileupSummaries, CalculateContamination)
   - FilterMutectCalls
7. Technical filters for somatic variants (DP, VAF, MAF) (SnpSift, bcftools)
8. Variants annotation (SnpEff / SnpSift)
9. Copy-number analysis (ASCAT, FACETS)
10. Structural variants analysis (MANTA)
11. Biomarkers analysis
    - Microsatellite instability analysis (MSIsensor-pro)
    - Tumor Mutational Burden (pyTMB)
12. Gather all QC results in a final report (MultiQC)

Quick help

```bash
nextflow run main.nf --help
N E X T F L O W  ~  version 21.10.6
Launching `main.nf` [lethal_torricelli] - revision: 4d570988d2

VEGAN v2.3.0

Usage:

The typical command for running the pipeline is as follows:

nextflow run main.nf --profile STRING --samplePlan PATH --design PATH --step STRING --genome STRING --genomeAnnotationPath PATH

MANDATORY ARGUMENTS:
    --design                      PATH      Path to design file specifying the metadata associated with the samples
    --genome                      STRING    [hg19, hg19base, hg38, hg38base, mm10, mm39,...] Name of the reference genome.
    --genomeAnnotationPath        PATH      PATH to the reference genome folder.
    --profile                     STRING    [test, multiconda, singularity, cluster, docker, conda, path, multipath] Configuration profile to use. Can use multiple (comma separated).
    --step                        STRING    [mapping, markduplicates, filtering, calling, annotate] Specify starting step
    --outDir                      PATH      The output directory where the results will be saved
    --tools                       STRING    [haplotypecaller, mutect2, manta, snpeff, facets, ascat, tmb, msisensor] Specify tools to use for variant calling

INPUTS:
    --reads                       PATH      Path to input data (must be surrounded with quotes)
    --samplePlan                  PATH      Path to sample plan (csv format) with raw reads (if --reads is not specified), or intermediate files according to the --step parameter
    --singleEnd                             For single-end input data
    --splitFastq                            Split fastq files in chunks
    --fastqChunksSize             INTEGER   Reads chunks size

ALIGNMENT:
    --aligner                     STRING    [bwa-mem, bwa-mem2, dragmap] Specify tools to use for mapping
    --cram                                  Generate CRAM alignment files
    --mapQual                     INTEGER   Minimum mapping quality to consider for an alignment
    --saveAlignedIntermediates              Save intermediates alignment files
    --splitFastq                            Split fastq files in chunks

FILTERING:
    --keepDups                              Specify to keep duplicate reads when filtering the alignment
    --keepMultiHits                         Specify to keep multi hit reads when filtering the alignment
    --keepSingleton                         Specify to keep singleton reads when filtering the alignment
    --targetBed                   PATH      Target Bed file for targeted or whole exome sequencing

VARIANT CALLING:
    --baseQual                    INTEGER   Minimum base quality used by Facets for CNV calling
    --saveVcfIntermediates                  Save intermediate vcf files
    --saveVcfMetrics                        Save complementary vcf metrics files
    --skipMutectContamination               Do not apply the Contamination step for Mutect2 calls filtering
    --skipMutectOrientationModel            Do not apply the LearnOrientationModel step for Mutect2 calls filtering

TUMOR ONLY:
    --msiBaselineConfig           PATH      PATH to Msisensor-pro baseline config file for tumor-only mode
    --pon                         PATH      PATH to panels of normals (.vcf.gz)
    --ponIndex                    PATH      PATH to panels of normals index file (.tbi)

VCF FILTERS:
    --filterSomaticDP             INTEGER   Minimum sequencing depth to consider a somatic variant
    --filterSomaticMAF            INTEGER   Maximum variant frequency in the general population to consider a somatic variant
    --filterSomaticVAF            INTEGER   Minimum variant allele frequency to consider a somatic variant

ANNOTATION:
    --annotDb                     STRING    [cosmic, icgc, cancerhotspots, gnomad, dbnsfp] Annotation databases to use with SnpEff and SnpSift
    --ffpe                                  Specify to use the ffpe parameters and filters for TMB computation

SKIP OPTIONS:
    --skipBQSR                              Disable BQSR
    --skipBamQC                             Disable QCs on BAM files
    --skipFastqc                            Disable Fastqc
    --skipIdentito                          Disable Identito
    --skipMultiqc                           Disable MultiQC
    --skipSaturation                        Disable Preseq

OTHER OPTIONS:
    --disableAutoClean                      Disable cleaning of work directory
    --multiqcConfig               PATH      Specify a custom config file for MultiQC
    --name                        STRING    Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic
    --sequencingCenter            STRING    Name of sequencing center to be displayed in BAM file

=======================================================
Available Profiles
    -profile test          Run the test dataset
    -profile conda         Build a new conda environment before running the pipeline. Use --condaCacheDir to define the conda cache path
    -profile multiconda    Build a new conda environment per process before running the pipeline. Use --condaCacheDir to define the conda cache path
    -profile path          Use the installation path defined for all tools. Use --globalPath to define the installation path
    -profile multipath     Use the installation paths defined for each tool. Use --globalPath to define the installation path
    -profile docker        Use the Docker images for each process
    -profile singularity   Use the Singularity images for each process. Use --singularityPath to define the installation path
    -profile cluster       Run the workflow on the cluster, instead of locally
```
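To make the three VCF FILTERS thresholds more concrete, here is a rough sketch of the kind of test they imply, applied to a toy tab-separated table rather than to a real VCF. It is not VEGAN's implementation (which, per the pipeline summary, relies on SnpSift and bcftools): the function name `filter_somatic`, the table layout, the variants, the threshold values, the use of fractions for VAF and MAF, and the choice of inclusive comparisons are all assumptions made for illustration.

```bash
# Hypothetical helper (not part of VEGAN).
# Usage: filter_somatic MIN_DP MIN_VAF MAX_MAF < table.tsv
# Assumed input columns (tab-separated): variant, DP, VAF, population MAF.
filter_somatic() {
  awk -F'\t' -v min_dp="$1" -v min_vaf="$2" -v max_maf="$3" '
    # Keep a variant only if it passes all three thresholds (assumed inclusive).
    $2 + 0 >= min_dp + 0 && $3 + 0 >= min_vaf + 0 && $4 + 0 <= max_maf + 0 { print $1 }
  '
}

# Toy input with made-up variants: only the first row passes every threshold.
{
  printf 'chr1:100A>T\t50\t0.20\t0.0001\n'   # passes
  printf 'chr2:200G>C\t8\t0.30\t0.0001\n'    # depth too low
  printf 'chr3:300C>G\t60\t0.02\t0.0001\n'   # allele frequency too low
  printf 'chr4:400T>A\t70\t0.25\t0.05\n'     # too frequent in the population
} | filter_somatic 20 0.05 0.001
```

With these made-up thresholds, the call prints `chr1:100A>T` only. Check the full documentation for the units and default values the pipeline actually uses.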

Quick run

The pipeline can be run on any infrastructure, from a list of input files or from a sample plan, as follows.

Run the pipeline on the test dataset

The test dataset is a downsampled Whole Exome Sequencing dataset. It can be launched with the following command:

nextflow run main.nf -profile test,multiconda \
  --condaCacheDir /bioinfo/local/curie/ngs-data-analysis/centos/tools/containers/conda/vegan-2.0.0/ \
  --genomeAnnotationPath /data/annotations/pipelines/ \
  --step mapping  # or filtering, calling, annotate

Run the pipeline for WES analysis from a sample plan with specified tools and genome on the cluster, using singularity containers

nextflow run main.nf -profile singularity,cluster \
  --samplePlan samples-WES.csv \
  --design samples.design.csv \
  --step mapping \
  --singularityImagePath /bioinfo/local/curie/ngs-data-analysis/centos/tools/containers/singularity/vegan-2.0.0/images/ \
  --targetBed capture.bed \
  --tools manta,mutect2,snpeff,facets,tmb,haplotypecaller,msisensor \
  --genome hg38 --genomeAnnotationPath /data/annotations/pipelines/ \
  -resume

Run the pipeline on the cluster, using existing conda

```
nextflow run main.nf -profile multiconda,cluster \
  --samplePlan samples-WES.csv \
  --design samples.design.csv \
  --step mapping \
  --targetBed capture.bed \
  --tools manta,mutect2,snpeff,facets,tmb,haplotypecaller,msisensor,ascat \
  --genome hg38 \
  --genomeAnnotationPath /data/annotations/pipelines/ \
  --condaCacheDir /bioinfo/local/curie/ngs-data-analysis/centos/tools/containers/conda/vegan-1.2.0/
```

To build new conda environments, point the --condaCacheDir parameter to an empty folder.
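Before launching, it can be worth confirming that the folder really is empty. The snippet below is a minimal sketch under stated assumptions: `is_empty_dir` is a hypothetical helper, not something shipped with VEGAN, and the temporary directory stands in for whatever location you choose as the conda cache.

```bash
# Hypothetical helper (not part of VEGAN): succeed only if the directory
# exists and contains no entries, hidden files included.
is_empty_dir() {
  [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Made-up location for the new conda cache; adapt to your system.
conda_cache=$(mktemp -d)

if is_empty_dir "$conda_cache"; then
  echo "OK: $conda_cache is empty, conda environments would be built from scratch"
else
  echo "Warning: $conda_cache is not empty, existing environments may be reused"
fi

# Clean up the demonstration directory.
rmdir "$conda_cache"
```

The checked path would then be passed to the pipeline through --condaCacheDir.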

Defining the '-profile'

By default (without any profile), Nextflow executes the pipeline locally, expecting all tools to be available from your PATH variable. In addition, we set up a few profiles that allow you 1) to use containers instead of a local installation, and 2) to run the pipeline on a cluster instead of on a local machine. Each profile is described in the help message (see above).

Here are a few examples of how to set the profile option. See the full documentation for details.

```
# Run the pipeline locally, using a global environment where all tools are installed (built by conda for instance)
-profile path --globalPath INSTALLATION_PATH

# Run the pipeline on the cluster, using the Singularity containers
-profile cluster,singularity --singularityImagePath SINGULARITY_PATH

# Run the pipeline on the cluster, building new conda environments
-profile cluster,multiconda --condaCacheDir CONDA_CACHE
```

Sample Plan

A sample plan is a comma-separated csv file that lists all samples with their biological IDs, with no header.
The sample plan is expected to be formatted as follows:

SAMPLE_ID,SAMPLE_NAME,PATH_TO_R1_FASTQ,[PATH_TO_R2_FASTQ]
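As a quick sanity check before launching, the layout above can be verified with a few lines of shell. This is a minimal sketch, not part of VEGAN: the function name `check_sample_plan`, the example IDs and the fastq paths are hypothetical, and the check only confirms that each row has three fields (single-end) or four fields (paired-end) with the first three filled in. The pipeline performs its own validation, which may be stricter.

```bash
# Hypothetical helper (not part of VEGAN): check that every row of a sample
# plan has 3 or 4 comma-separated fields and that the sample ID, sample name
# and R1 path are not empty.
check_sample_plan() {
  awk -F',' '
    NF != 3 && NF != 4 {
      printf "line %d: expected 3 or 4 fields, got %d\n", NR, NF
      bad = 1
      next
    }
    $1 == "" || $2 == "" || $3 == "" {
      printf "line %d: empty mandatory field\n", NR
      bad = 1
    }
    END { exit bad + 0 }
  ' "$1"
}

# Example sample plan with made-up IDs and paths (no header line).
plan=$(mktemp)
cat > "$plan" <<'EOF'
S1,tumor_A,/data/fastq/S1_R1.fastq.gz,/data/fastq/S1_R2.fastq.gz
S2,normal_A,/data/fastq/S2_R1.fastq.gz,/data/fastq/S2_R2.fastq.gz
EOF

if check_sample_plan "$plan"; then
  echo "sample plan looks well-formed"
fi
rm -f "$plan"
```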

Design

A design file is a csv file that lists all experimental samples, their IDs, the associated germline sample, the sex of the patient and the status (tumor / normal). The design file is expected to have the following header:

GERMLINE_ID,TUMOR_ID,PAIR_ID,SEX

Both files are checked by the pipeline and must be rigorously defined for the pipeline to work. Note that the control sample is optional but highly recommended. If the design file is not specified, the pipeline runs up to the alignment step; variant calling and annotation are skipped.
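Since the two files have to agree with each other, a small cross-check can catch mismatches early. The sketch below is illustrative only and is not VEGAN's own validation: `check_design` is a hypothetical helper, the expected header is passed as an argument so it can be adapted to the header documented above, and it assumes that the first two design columns hold sample IDs taken from the first column of the sample plan, with an empty germline ID standing for a tumor-only sample. The sex value in the example is also made up.

```bash
# Hypothetical helper (not part of VEGAN).
# Usage: check_design DESIGN.csv SAMPLE_PLAN.csv EXPECTED_HEADER
check_design() {
  awk -F',' -v header="$3" '
    # First file read (the sample plan): remember every sample ID (column 1).
    NR == FNR { known[$1] = 1; next }
    # Design file, first line: must match the expected header exactly.
    FNR == 1 {
      if ($0 != header) { print "unexpected header: " $0; bad = 1 }
      next
    }
    # Assumption: the tumor ID (column 2) must exist in the sample plan.
    !($2 in known) { print "line " FNR ": unknown tumor ID \"" $2 "\""; bad = 1 }
    # Assumption: the germline ID (column 1) may be empty, else it must exist.
    $1 != "" && !($1 in known) { print "line " FNR ": unknown germline ID \"" $1 "\""; bad = 1 }
    END { exit bad + 0 }
  ' "$2" "$1"
}

# Made-up sample plan and design for one tumor/normal pair.
plan=$(mktemp)
design=$(mktemp)
cat > "$plan" <<'EOF'
S1,tumor_A,/data/fastq/S1_R1.fastq.gz,/data/fastq/S1_R2.fastq.gz
S2,normal_A,/data/fastq/S2_R1.fastq.gz,/data/fastq/S2_R2.fastq.gz
EOF
cat > "$design" <<'EOF'
GERMLINE_ID,TUMOR_ID,PAIR_ID,SEX
S2,S1,pair1,XX
EOF

if check_design "$design" "$plan" "GERMLINE_ID,TUMOR_ID,PAIR_ID,SEX"; then
  echo "design is consistent with the sample plan"
fi
rm -f "$plan" "$design"
```

Note that the helper expects a non-empty sample plan, because it tells the two files apart by their line counters.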

Full Documentation

Fundings

This pipeline was written by the Institut Curie bioinformatics platform (PA. Nicolas, T. Gutman, F. Jarlier, F. Allain, P. La Rosa, P. Hupe, N. Servant). The project was funded by the European Union’s Horizon 2020 research and innovation programme and the Canadian Institutes of Health Research under the grant agreement No 825835 in the framework of the European-Canadian Cancer Network, as well as by the Canceropole Ile de France (GENOPROFILE - RIC2021) project.

Contacts

For any question, bug report or suggestion, please open an issue or contact the bioinformatics core facility.

Owner

Name: Institut Curie, Bioinformatics Core Facility
Login: bioinfo-pf-curie
Kind: organization
Location: Paris, France

Website: https://bioinfo-pf-curie.github.io/
Repositories: 11
Profile: https://github.com/bioinfo-pf-curie

bioinformatics platform of the Institut Curie

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Nicolas
    given-names: Paul-Antoine
  - family-names: Gutman
    given-names: Tom
  - family-names: Jarlier
    given-names: Frederic
  - family-names: Allain
    given-names: Fabrice
  - family-names: La Rosa
    given-names: Philippe
  - family-names: Hupe
    given-names: Philippe
  - family-names: Servant
    given-names: Nicolas
title: "VEGAN - Variant calling pipeline for whole Exome and whole Genome sequencing cANcer data"
version: 2.3.0
date-released: 2024-01-01
