vegan

Variant calling pipeline for whole Exome and whole Genome sequencing cANcer data

https://github.com/bioinfo-pf-curie/vegan


Repository

Basic Info
  • Host: GitHub
  • Owner: bioinfo-pf-curie
  • License: other
  • Language: Nextflow
  • Default Branch: master
  • Size: 15.8 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 3
Created over 4 years ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog License Citation

README.md

VEGAN

Variant calling pipeline for whole Exome and whole Genome sequencing cANcer data

Introduction

This pipeline was built for Whole Exome Sequencing and Whole Genome Sequencing analysis. It provides detailed quality control of both frozen and FFPE samples, as well as a first downstream analysis including mutation calling, structural variant detection and copy number analysis. Most of the pipeline steps work for both tumor/normal paired samples and tumor-only samples. VEGAN can run from raw fastq files or from intermediate results such as aligned BAM/CRAM files or VCF files.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with conda / singularity containers, making installation easier and results highly reproducible. The first version of VEGAN was inspired by the nf-core Sarek pipeline, with several common processes, additional modifications and new analysis steps.

Pipeline summary

  1. Run quality control of raw sequencing reads (fastqc)
  2. Align reads on reference genome (bwa-mem, bwa-mem2, dragmap)
  3. Filtering and quality controls of aligned reads
    • Report mapping metrics (picard)
    • Mark and remove duplicates (markdup)
    • Library complexity analysis (Preseq)
    • Filtering aligned BAM files (SAMTools)
    • Insert size distribution (picard)
    • Identity monitoring (bcftools / R)
  4. GATK preprocessing (GATK)
  5. Germline Variants calling (haplotypecaller / bcftools)
    • HaplotypeCaller
  6. Somatic Variants calling (mutect2 / bcftools)
    • Mutect2 (including learnReadOrientationModel, GetPileupSummaries, CalculateContamination)
    • FilterMutectCalls
  7. Technical filters for somatic variants (DP, VAF, MAF) (SnpSift, bcftools)
  8. Variants annotation (SnpEff / SnpSift)
  9. Copy-number analysis (ASCAT, FACETS)
  10. Structural variants analysis (MANTA)
  11. Biomarkers analysis
  12. Gather all QC results in a final report (MultiQC)

Quick help

```bash
nextflow run main.nf --help
N E X T F L O W  ~  version 21.10.6
Launching main.nf [lethal_torricelli] - revision: 4d570988d2

               VEGAN v2.3.0

Usage:

The typical command for running the pipeline is as follows:

nextflow run main.nf --profile STRING --samplePlan PATH --design PATH --step STRING --genome STRING --genomeAnnotationPath PATH

MANDATORY ARGUMENTS:
    --design                     PATH      Path to design file specifying the metadata associated with the samples
    --genome                     STRING    [hg19, hg19base, hg38, hg38base, mm10, mm39,...] Name of the reference genome.
    --genomeAnnotationPath       PATH      PATH to the reference genome folder.
    --profile                    STRING    [test, multiconda, singularity, cluster, docker, conda, path, multipath] Configuration profile to use. Can use multiple (comma separated).
    --step                       STRING    [mapping, markduplicates, filtering, calling, annotate] Specify starting step
    --outDir                     PATH      The output directory where the results will be saved
    --tools                      STRING    [haplotypecaller, mutect2, manta, snpeff, facets, ascat, tmb, msisensor] Specify tools to use for variant calling

INPUTS:
    --reads                      PATH      Path to input data (must be surrounded with quotes)
    --samplePlan                 PATH      Path to sample plan (csv format) raw reads (if --reads is not specified), or intermediate files according to the --step parameter
    --singleEnd                            For single-end input data
    --splitFastq                           Split fastq files in chunks
    --fastqChunksSize            INTEGER   Reads chunks size

ALIGNMENT:
    --aligner                    STRING    [bwa-mem, bwa-mem2, dragmap] Specify tools to use for mapping
    --cram                                 Generate CRAM alignment files
    --mapQual                    INTEGER   Minimum mapping quality to consider for an alignment
    --saveAlignedIntermediates             Save intermediates alignment files
    --splitFastq                           Split fastq files in chunks

FILTERING:
    --keepDups                             Specify to keep duplicate reads when filtering the alignment
    --keepMultiHits                        Specify to keep multi hit reads when filtering the alignment
    --keepSingleton                        Specify to keep singleton reads when filtering the alignment
    --targetBed                  PATH      Target Bed file for targeted or whole exome sequencing

VARIANT CALLING:
    --baseQual                   INTEGER   Minimum base quality used by Facets for CNV calling
    --saveVcfIntermediates                 Save intermediate vcf files
    --saveVcfMetrics                       Save complementary vcf metrics files
    --skipMutectContamination              Do not apply the Contamination step for Mutect2 calls filtering
    --skipMutectOrientationModel           Do not apply the LearnOrientationModel step for Mutect2 calls filtering

TUMOR ONLY:
    --msiBaselineConfig          PATH      PATH to Msisensor-pro baseline config file for tumor-only mode
    --pon                        PATH      PATH to panels of normals (.vcf.gz)
    --ponIndex                   PATH      PATH to panels of normals index file (.tbi)

VCF FILTERS:
    --filterSomaticDP            INTEGER   Minimum sequencing depth to consider a somatic variant
    --filterSomaticMAF           INTEGER   Maximum variant frequency in the general population to consider a somatic variant
    --filterSomaticVAF           INTEGER   Minimum variant allele frequency to consider a somatic variant

ANNOTATION:
    --annotDb                    STRING    [cosmic, icgc, cancerhotspots, gnomad, dbnsfp] Annotation databases to use with SnpEff and SnpSift
    --ffpe                                 Specify to use the ffpe parameters and filters for TMB computation

SKIP OPTIONS:
    --skipBQSR                             Disable BQSR
    --skipBamQC                            Disable QCs on BAM files
    --skipFastqc                           Disable Fastqc
    --skipIdentito                         Disable Identito
    --skipMultiqc                          Disable MultiQC
    --skipSaturation                       Disable Preseq

OTHER OPTIONS:
    --disableAutoClean                     Disable cleaning of work directory
    --multiqcConfig              PATH      Specify a custom config file for MultiQC
    --name                       STRING    Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic
    --sequencingCenter           STRING    Name of sequencing center to be displayed in BAM file

=======================================================
Available Profiles
    -profile test          Run the test dataset
    -profile conda         Build a new conda environment before running the pipeline. Use --condaCacheDir to define the conda cache path
    -profile multiconda    Build a new conda environment per process before running the pipeline. Use --condaCacheDir to define the conda cache path
    -profile path          Use the installation path defined for all tools. Use --globalPath to define the installation path
    -profile multipath     Use the installation paths defined for each tool. Use --globalPath to define the installation path
    -profile docker        Use the Docker images for each process
    -profile singularity   Use the Singularity images for each process. Use --singularityPath to define the installation path
    -profile cluster       Run the workflow on the cluster, instead of locally
```
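The three VCF FILTERS options combine a minimum depth, a minimum variant allele frequency and a maximum population frequency. The sketch below illustrates that logic on a toy table with `awk`. It is not how VEGAN implements the filters (the pipeline summary lists SnpSift and bcftools for this step). The threshold values, the column layout, the use of fractions as units and the inclusive comparisons are all assumptions made for illustration.

```shell
set -eu

# Hypothetical thresholds, expressed as fractions (the real units may differ).
min_dp=20
min_vaf=0.05
max_maf=0.001

# Toy table, one variant per line: id, depth (DP), variant allele
# frequency (VAF), frequency in the general population (MAF).
variants=$(printf '%s\n' \
  'var1 50 0.30 0.0001' \
  'var2 8 0.40 0.0001' \
  'var3 60 0.02 0.0001' \
  'var4 70 0.25 0.2000')

# Keep a variant only if it passes all three thresholds.
kept=$(printf '%s\n' "$variants" | awk -v dp="$min_dp" -v vaf="$min_vaf" -v maf="$max_maf" \
  '$2 >= dp && $3 >= vaf && $4 <= maf { print $1 }')

echo "kept: $kept"
```

In this toy table only `var1` passes: `var2` fails on depth, `var3` on allele frequency and `var4` on population frequency.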

Quick run

The pipeline can be run on any infrastructure, either from a list of input files or from a sample plan, as follows.

Run the pipeline on the test dataset

The test dataset is a downsampled Whole Exome Sequencing dataset. It can be launched with the following command.

    # --step can also be set to filtering, calling or annotate
    nextflow run main.nf -profile test,multiconda \
      --step mapping \
      --condaCacheDir /bioinfo/local/curie/ngs-data-analysis/centos/tools/containers/conda/vegan-2.0.0/ \
      --genomeAnnotationPath /data/annotations/pipelines/

Run the pipeline for WES analysis from a sample plan with specified tools and genome on the cluster, using singularity containers

    nextflow run main.nf -profile singularity,cluster \
      --samplePlan samples-WES.csv \
      --design samples.design.csv \
      --step mapping \
      --singularityImagePath /bioinfo/local/curie/ngs-data-analysis/centos/tools/containers/singularity/vegan-2.0.0/images/ \
      --targetBed capture.bed \
      --tools manta,mutect2,snpeff,facets,tmb,haplotypecaller,msisensor \
      --genome hg38 \
      --genomeAnnotationPath /data/annotations/pipelines/ \
      -resume

Run the pipeline on the cluster, using existing conda environments

```
nextflow run main.nf -profile multiconda,cluster \
  --samplePlan samples-WES.csv \
  --design samples.design.csv \
  --step mapping \
  --targetBed capture.bed \
  --tools manta,mutect2,snpeff,facets,tmb,haplotypecaller,msisensor,ascat \
  --genome hg38 \
  --genomeAnnotationPath /data/annotations/pipelines/ \
  --condaCacheDir /bioinfo/local/curie/ngs-data-analysis/centos/tools/containers/conda/vegan-1.2.0/
```

To build new conda environments, point the --condaCacheDir parameter to an empty folder.

Defining the '-profile'

By default (without any profile), Nextflow executes the pipeline locally and expects all tools to be available from your PATH variable. In addition, we set up a few profiles that allow you to:

  1. use containers instead of a local installation,
  2. run the pipeline on a cluster instead of on a local machine.

The description of each profile is available in the help message (see above).

Here are a few examples of how to set the profile option. See the full documentation for details.

```
# Run the pipeline locally, using a global environment where all tools are installed (built with conda for instance)
-profile path --globalPath INSTALLATION_PATH

# Run the pipeline on the cluster, using the Singularity containers
-profile cluster,singularity --singularityImagePath SINGULARITY_PATH

# Run the pipeline on the cluster, building new conda environments
-profile cluster,multiconda --condaCacheDir CONDA_CACHE
```
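As a minimal sketch, a profile fragment like those above can be combined with the other options into one command. The script below only assembles and prints the command (a dry run). The two paths are hypothetical placeholders, and the option names are copied from the examples in this README.

```shell
set -eu

# Hypothetical local paths; replace them with your own.
profile="cluster,singularity"
singularity_images="/path/to/singularity/images"
annotations="/path/to/annotations"

# Assemble the command from the profile and the remaining options.
cmd="nextflow run main.nf -profile $profile"
cmd="$cmd --singularityImagePath $singularity_images"
cmd="$cmd --samplePlan samples-WES.csv --design samples.design.csv"
cmd="$cmd --step mapping --genome hg38 --genomeAnnotationPath $annotations"

# Dry run: print the command instead of executing it.
echo "$cmd"
```

Printing the command first makes it easy to check the profile combination before submitting a long run.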

Sample Plan

A sample plan is a comma-separated csv file, with no header, that lists all samples with their biological IDs.
The sample plan is expected to be formatted as below:

SAMPLE_ID,SAMPLE_NAME,PATH_TO_R1_FASTQ,[PATH_TO_R2_FASTQ]
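As an illustration, the snippet below writes a small paired-end sample plan and checks that every row has three or four comma-separated fields, reading the bracketed last column as optional. The sample IDs, names and fastq paths are hypothetical, and the check is only a quick sanity test, not the validation performed by the pipeline.

```shell
set -eu

# Hypothetical sample IDs, names and fastq paths, for illustration only.
workdir=$(mktemp -d)
sample_plan="$workdir/samples-WES.csv"

# No header line: one sample per row.
cat > "$sample_plan" <<'EOF'
S1,patient1_tumor,/path/to/S1_R1.fastq.gz,/path/to/S1_R2.fastq.gz
S2,patient1_normal,/path/to/S2_R1.fastq.gz,/path/to/S2_R2.fastq.gz
EOF

# Count rows that have neither 3 (R1 only) nor 4 (R1 and R2) fields.
bad_rows=$(awk -F',' 'NF != 3 && NF != 4 { n++ } END { print n + 0 }' "$sample_plan")
n_rows=$(awk 'END { print NR }' "$sample_plan")

echo "rows: $n_rows, rows with an unexpected number of fields: $bad_rows"
```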

Design

A design file is a csv file that lists all experimental samples, their IDs, the associated germline sample, the sex of the patient and the status (tumor / normal). The design file is expected to have the following header:

GERMLINE_ID,TUMOR_ID,PAIR_ID,SEX

Both files are checked by the pipeline and must be rigorously defined for the pipeline to work. Note that the control sample is optional, but providing one is highly recommended. If the design file is not specified, the pipeline stops after the alignment step: variant calling and annotation are skipped.
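A minimal sketch of a design file and a quick header check is shown below. It assumes that the header fields are written with underscores (`GERMLINE_ID,TUMOR_ID,PAIR_ID,SEX`). The sample IDs, the pair name and the `XX` value in the SEX column are hypothetical placeholders, since the accepted values are not listed here.

```shell
set -eu

workdir=$(mktemp -d)
design="$workdir/samples.design.csv"

# Assumed header spelling; adjust it if your version of the pipeline differs.
expected_header="GERMLINE_ID,TUMOR_ID,PAIR_ID,SEX"

# Hypothetical content: one tumor/normal pair.
cat > "$design" <<'EOF'
GERMLINE_ID,TUMOR_ID,PAIR_ID,SEX
S2,S1,pair1,XX
EOF

# Compare the first line with the expected header.
header=$(head -n 1 "$design")
if [ "$header" = "$expected_header" ]; then
  header_ok=yes
else
  header_ok=no
fi

# Count data rows that do not have exactly 4 fields.
bad_rows=$(awk -F',' 'NR > 1 && NF != 4 { n++ } END { print n + 0 }' "$design")

echo "header ok: $header_ok, malformed rows: $bad_rows"
```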

Full Documentation

  1. Installation
  2. Geniac
  3. Reference genomes
  4. Running the pipeline
  5. Profiles
  6. Output and how to interpret the results
  7. Troubleshooting

Funding

This pipeline has been written by the Institut Curie bioinformatics platform (PA. Nicolas, T. Gutman, F. Jarlier, F. Allain, P. La Rosa, P. Hupe, N. Servant). The project was funded by the European Union’s Horizon 2020 research and innovation programme and the Canadian Institutes of Health Research under the grant agreement No 825835 in the framework of the European-Canadian Cancer Network, as well as the Canceropole Ile de France (GENOPROFILE - RIC2021) project.

Contacts

For any question, bug report or suggestion, please open an issue or contact the bioinformatics core facility.

Owner

  • Name: Institut Curie, Bioinformatics Core Facility
  • Login: bioinfo-pf-curie
  • Kind: organization
  • Location: Paris, France

bioinformatics platform of the Institut Curie

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Nicolas
    given-names: Paul-Antoine
  - family-names: Gutman
    given-names: Tom
  - family-names: Jarlier
    given-names: Frederic
  - family-names: Allain
    given-names: Fabrice
  - family-names: La Rosa
    given-names: Philippe
  - family-names: Hupe
    given-names: Philippe
  - family-names: Servant
    given-names: Nicolas
title: "VEGAN - Variant calling pipeline for whole Exome and whole Genome sequencing cANcer data"
version: 2.3.0
date-released: 2024-01-01

Dependencies

recipes/conda/environment.yml pypi