Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: science.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.6%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: zhanyinx
  • License: other
  • Language: Python
  • Default Branch: main
  • Size: 81.6 MB
Statistics
  • Stars: 2
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 6
Created about 3 years ago · Last pushed about 2 years ago
Metadata Files
Readme License Citation

README.md

Variant annotation and prioritization pipeline

Overview

Variant annotation in cancer genomics involves identifying and characterizing the genetic changes (variants) that contribute to cancer development and progression. The challenge is that there are many different types of variants that can occur in the genome, and not all of them are relevant to cancer. Therefore, accurate annotation is critical for identifying the key driver mutations and designing targeted therapies. However, this process is complicated by the large number of potential variants, the need to integrate data from multiple sources, and the ongoing discovery of new cancer-associated variants.

We have developed a Nextflow pipeline called variantalker that annotates variants from VCF files. The pipeline supports VCF files generated by DRAGEN, nf-core/sarek, and Ion Torrent platforms.

BETA version: variantalker can also extract biomarkers such as TMB, mutational signatures (APOBEC, UV and tobacco), clonal TMB (if bam/cram files and sex are provided), expression of specific genes (if RNA-seq data are provided), gene CNV, etc. For more information, see here

Installation

Clone the repo

    git clone https://github.com/zhanyinx/variantalker.git

variantalker relies on Annovar software and Funcotator databases.

Download the updated databases. Separate repositories for hg19 and hg38 are available.

    wget -r -N --no-parent -nH --cut-dirs=3 -P public_databases/hg38 https://bioserver.ieo.it/repo/dima/hg38
    wget -r -N --no-parent -nH --cut-dirs=3 -P public_databases/hg19 https://bioserver.ieo.it/repo/dima/hg19

Documentation

The pipeline employs several tools to annotate and prioritize variants:

  • Funcotator for variant annotation
  • CancerVar for somatic variants prioritization
  • InterVar for germline variants annotation
  • Annovar: CancerVar and InterVar rely on Annovar.
  • CIViC: somatic variant classification using CIViC evidence level.
  • AlphaMissense: somatic and germline variant prioritization.

To ensure the accuracy of the pipeline, the databases for Funcotator and Annovar must be regularly updated using the provided tools found here: update utilities.

Usage

If you are using variantalker for the first time, please consider updating the databases by following the instructions.

Modify the configuration file (nextflow.config) by setting the following parameters:

  • funcotator_germline_db: e.g. path2/public_databases/funcotator_dataSources.v1.7.20200521g

  • funcotator_somatic_db: e.g. path2/public_databases/funcotator_dataSources.v1.7.20200521s

  • annovar_db: e.g. path2/public_databases/humandb

  • annovar_software_folder: e.g. path2/annovar

  • alpha_mis_genome_basedir: e.g. path2/public_databases

  • fasta: path to fasta file used to generate the vcf

  • target: path to the target bed file

The main command line for the annotation is the following:

    nextflow run path_to/main.nf -c yourconfig -profile singularity --input samplesheet.csv --outdir outdir

To list the available parameters, including hidden ones:

    nextflow run path_to/main.nf --help --show_hidden_params

Input

variantalker takes as input a csv samplesheet with 4 columns

IMPORTANT: HEADER is required

| patient  | tumor_tissue | sample_file       | sample_type |
| -------- | ------------ | ----------------- | ----------- |
| patient1 | Lung         | path/tumor.vcf.gz | somatic     |
| .....    | .....        | .....             | .....       |

sample_file must be given as a full path, not a relative path.

Available sample_type are: somatic, germline, cnv.

  • somatic: a tumor-only (single-sample) or tumor-normal (multi-sample) vcf.gz file. Requires tumor_tissue to be specified.

  • germline: a single-sample vcf.gz file. It does not require tumor_tissue.

  • cnv: for nf-core/sarek, CNVkit output is supported (cnr file). For DRAGEN, a vcf.gz file is required. It does not require tumor_tissue.

Available tumor_tissue values are: Adrenal_Gland Bile_Duct Bladder Blood Bone Bone_Marrow Brain Breast Cancer_all Cervix Colorectal Esophagus Eye Head_and_Neck Inflammatory Intrahepatic Kidney Liver Lung Lymph_Nodes Nervous_System Other Ovary Pancreas Pleura Prostate Skin Soft_Tissue Stomach Testis Thymus Thyroid Uterus
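The samplesheet rules above (required header, full paths, allowed sample types, tissue required for somatic samples) can be checked before launching a run. The following is a minimal sketch, not part of variantalker itself: the function name `validate_samplesheet` is hypothetical, and the underscore spellings of the column names (`tumor_tissue`, `sample_file`) and tissue values are assumptions about what the pipeline expects.

```python
import csv
import io
import posixpath

# Column names and allowed values; the underscore spellings are assumptions.
REQUIRED_COLUMNS = ["patient", "tumor_tissue", "sample_file", "sample_type"]
SAMPLE_TYPES = {"somatic", "germline", "cnv"}
TUMOR_TISSUES = set(
    """Adrenal_Gland Bile_Duct Bladder Blood Bone Bone_Marrow Brain Breast
    Cancer_all Cervix Colorectal Esophagus Eye Head_and_Neck Inflammatory
    Intrahepatic Kidney Liver Lung Lymph_Nodes Nervous_System Other Ovary
    Pancreas Pleura Prostate Skin Soft_Tissue Stomach Testis Thymus Thyroid
    Uterus""".split()
)


def validate_samplesheet(handle):
    """Return a list of human-readable problems found in a CSV samplesheet."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        # Without the header the rows cannot be interpreted, so stop here.
        return ["missing header column(s): " + ", ".join(missing)]

    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        sample_type = (row["sample_type"] or "").strip()
        sample_file = (row["sample_file"] or "").strip()
        tissue = (row["tumor_tissue"] or "").strip()

        if sample_type not in SAMPLE_TYPES:
            problems.append(f"line {line_no}: unknown sample_type '{sample_type}'")
        # POSIX-style check, assuming the pipeline runs on Linux.
        if not posixpath.isabs(sample_file):
            problems.append(f"line {line_no}: sample_file is not a full path")
        # Only somatic samples need a tumor tissue.
        if sample_type == "somatic" and tissue not in TUMOR_TISSUES:
            problems.append(f"line {line_no}: invalid tumor_tissue '{tissue}'")
    return problems


if __name__ == "__main__":
    sheet = (
        "patient,tumor_tissue,sample_file,sample_type\n"
        "patient1,Lung,/data/tumor.vcf.gz,somatic\n"
    )
    for problem in validate_samplesheet(io.StringIO(sheet)):
        print(problem)
```

The function reads from any file-like object, so it can be used as `validate_samplesheet(open("samplesheet.csv", newline=""))`. It only checks the form of the sheet and does not verify that the files exist.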

Output

Output structure:

    params.outdir
    |-- date
    |   `-- annotation
    |       |-- germline
    |       |   `-- patient
    |       |       |-- filtered.patient.maf.pass.tsv
    |       |       |-- filtered.patient.maf.nopass.tsv
    |       |       |-- patient.vcf
    |       |       `-- patient.maf
    |       |-- somatic
    |       |   `-- patient
    |       |       |-- filtered.patient.maf.pass.tsv
    |       |       |-- filtered.patient.maf.nopass.tsv
    |       |       |-- patient.vcf
    |       |       `-- patient.maf
    |       `-- cnv
    |           `-- patient
    |               `-- patient.cnv.annotated.tsv

For each sample, variantalker outputs multiple files:

1) maf file with all the annotations
2) vcf file with the PASS variants
3) filtered pass file with variants passing the filters (see below)
4) filtered nopass file with variants not passing the filters (see below)
5) cnv annotated file (if cnv samples are provided)

Default filters applied:

  • "Silent", "IGR", "RNA" variant types are filtered out (unless it's pathogenic or likely pathogenic for clinvar/cancervar/intervar)

  • minimum coverage 50 (unless it's pathogenic or likely pathogenic for clinvar/cancervar/intervar)

  • minimum somatic VAF: 0.01

  • minimum germline VAF: 0.2

  • InterVar classes to be kept: Pathogenic,Likely pathogenic (logic OR)

  • CancerVar classes to be kept: Tier_II_potential,Tier_I_strong (logic OR)

  • ReNOVo class to be kept: LP Pathogenic,IP Pathogenic,HP Pathogenic (logic OR)

  • CIViC evidence levels to be kept: A,B,C (logic OR)

  • no filters on genes (somatic or germline)

Logic OR filters: a variant is kept if at least one of the OR filters is true
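The "logic OR" rule can be sketched in a few lines. This is an illustration only, not the pipeline's own code: the dictionary keys and the record layout are hypothetical, the class labels are copied from the filter list above (with the CancerVar tiers spelled with underscores as an assumption), and how the OR group combines with the coverage, VAF and variant-type filters is not specified here, so it is left out.

```python
# Classes kept by each annotation source (labels taken from the defaults above).
# The source keys ("intervar", "cancervar", ...) are hypothetical field names.
KEEP = {
    "intervar": {"Pathogenic", "Likely pathogenic"},
    "cancervar": {"Tier_II_potential", "Tier_I_strong"},
    "renovo": {"LP Pathogenic", "IP Pathogenic", "HP Pathogenic"},
    "civic": {"A", "B", "C"},
}


def passes_or_filters(variant, keep=KEEP):
    """Return True if at least one source assigns the variant a kept class."""
    # A source that is absent from the record simply does not match.
    return any(variant.get(source) in allowed for source, allowed in keep.items())


if __name__ == "__main__":
    variant = {"intervar": "Benign", "civic": "B"}
    # Kept: InterVar does not match, but the CIViC evidence level does.
    print(passes_or_filters(variant))
```

Because the filters are combined with `any`, one matching source is enough to keep a variant, and a variant with no annotation from any of the four sources is not kept by this group.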

Liability

Variantalker assumes no responsibility for any injury to persons or damage to property arising out of, or related to, any use of Variantalker, or for any errors or omissions. The user recognizes they are using Variantalker at their own risk.

Owner

  • Name: Yinxiu Zhan
  • Login: zhanyinx
  • Kind: user
  • Location: Milan
  • Company: IEO

Head of data science unit at European Institute for Oncology (IEO)

GitHub Events

Total
  • Release event: 1
  • Watch event: 1
  • Member event: 3
  • Push event: 11
  • Pull request event: 2
  • Fork event: 2
  • Create event: 3
Last Year
  • Release event: 1
  • Watch event: 1
  • Member event: 3
  • Push event: 11
  • Pull request event: 2
  • Fork event: 2
  • Create event: 3

Dependencies

Dockerfile docker
  • r-base latest build
streamlit_app/requirements.txt pypi
  • numpy *
  • pandas *
  • streamlit *
  • streamlit-aggrid *