varca

Use an ensemble of variant callers to call variants from ATAC-seq data

https://github.com/mcvickerlab/varca

Science Score: 75.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: biorxiv.org
  • Academic email domains
  • Institutional organization owner
    Organization mcvickerlab has institutional domain (mcvicker.salk.edu)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.6%) to scientific vocabulary

Keywords

atac-seq-data machine-learning random-forest snakemake variant-calling
Last synced: 6 months ago

Repository

Use an ensemble of variant callers to call variants from ATAC-seq data

Basic Info
  • Host: GitHub
  • Owner: mcvickerlab
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 357 KB
Statistics
  • Stars: 23
  • Watchers: 2
  • Forks: 7
  • Open Issues: 21
  • Releases: 8
Topics
atac-seq-data machine-learning random-forest snakemake variant-calling
Created over 6 years ago · Last pushed 9 months ago
Metadata Files
Readme Contributing License Citation

README.md

varCA

A pipeline for running an ensemble of variant callers to predict variants from ATAC-seq reads.

The entire pipeline is made up of two smaller subworkflows. The prepare subworkflow calls each variant caller and prepares the resulting data for use by the classify subworkflow, which uses an ensemble classifier to predict the existence of variants at each site.

[!NOTE]
VarCA does not output genotypes (GT fields) because genotype calls can be inaccurate in the presence of allele-specific open chromatin. For details, see https://github.com/mcvickerlab/varCA/issues/43#issuecomment-1088028758

Code Ocean

Using our Code Ocean compute capsule, you can execute VarCA v0.2.1 on example data without downloading or setting up the project. To interpret the output of VarCA, see the output sections of the prepare subworkflow and the classify subworkflow in the rules README.

download

Execute the following command or download the latest release manually.

  git clone https://github.com/mcvickerlab/varCA.git

Also consider downloading the example data.

  cd varCA
  wget -O- -q https://github.com/mcvickerlab/varCA/releases/latest/download/data.tar.gz | tar xvzf -

setup

The pipeline is written as a Snakefile which can be executed via Snakemake. We recommend installing version 5.18.0:

  conda create -n snakemake -c bioconda -c conda-forge --no-channel-priority 'snakemake==5.18.0'

We highly recommend installing Snakemake via conda in this way so that you can use the --use-conda flag when calling snakemake, letting it automatically handle all of the pipeline's dependencies. Otherwise, you must manually install the dependencies listed in the env files.
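A minimal sketch of that conda-based setup, assuming you run it from the repository root (where the Snakefile lives):

  conda activate snakemake   # activate the environment created above
  snakemake --use-conda -n   # dry run; --use-conda lets Snakemake manage each rule's dependencies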

execution

  1. Activate snakemake via conda: conda activate snakemake
  2. Execute the pipeline on the example data

    Locally:

      ./run.bash &

    or on an SGE cluster:

      ./run.bash --sge-cluster &

    Output

    VarCA will place all of its output in a new directory (out/, by default). Log files describing the progress of the pipeline will also be created there: the log file contains a basic description of the progress of each step, while the qlog file is more detailed and will contain any errors or warnings. You can read more about the pipeline's output in the rules README.
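    One way to keep an eye on a running pipeline (a sketch that assumes the default out/ directory; see the rules README for the exact log paths):

      tail -f out/log    # high-level progress of each step
      less out/qlog      # detailed output, including any warnings or errors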

Executing the pipeline on your own data

You must modify the config.yaml file to specify paths to your data. The config file is currently configured to run the pipeline on the example data provided.

Executing each portion of the pipeline separately

The pipeline is made up of two subworkflows. These are usually executed together automatically by the master pipeline, but they can also be executed on their own for more advanced usage. See the rules README for execution instructions and a description of the outputs. You will need to execute the subworkflows separately if you ever want to create your own trained models.

Reproducing our results

We provide the example data so that you may quickly (in ~1 hr, excluding dependency installation) verify that the pipeline can be executed on your machine. This process does not reproduce our results. Those with more time can follow these steps to create all of the plots and tables in our paper.

If this is your first time using Snakemake

We recommend running snakemake --help to learn about Snakemake's options. For example, to check that the pipeline will be executed correctly before you run it, you can call Snakemake with the -n -p -r flags. This is also a good way to familiarize yourself with the steps of the pipeline and their inputs and outputs (the latter of which are the inputs to the first rule in each workflow, i.e. the all rule).
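For instance, a dry run of the full pipeline might look like this (a sketch that assumes run.bash forwards extra arguments to snakemake, as described under run.bash below):

  # -n: dry run, -p: print the shell commands, -r: print the reason each job is scheduled
  ./run.bash -n -p -r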

Note that Snakemake will not recreate output that it has already generated unless you request it. If a job fails or is interrupted, subsequent executions of Snakemake will simply pick up where the previous run left off. This also applies to files that you create yourself and provide in place of the files Snakemake would have generated.

By default, the pipeline will automatically delete some files it deems unnecessary (e.g., unsorted copies of a BAM file). You can opt to keep these files instead by providing the --notemp flag to Snakemake when executing the pipeline.
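For example (again assuming run.bash passes its arguments through to snakemake):

  # keep intermediate files that the pipeline would otherwise delete
  ./run.bash --notemp &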

files and directories

Snakefile

A Snakemake pipeline for calling variants from a set of ATAC-seq reads. This pipeline automatically executes two subworkflows:

  1. the prepare subworkflow, which prepares the reads for classification and
  2. the classify subworkflow, which creates a VCF containing predicted variants

rules/

Snakemake rules for the prepare and classify subworkflows. You can either execute these subworkflows from the master Snakefile or individually as their own Snakefiles. See the rules README for more information.

configs/

Config files that define options and input for the pipeline and the prepare and classify subworkflows. If you want to predict variants from your own ATAC-seq data, you should start by filling out the config file for the pipeline.

callers/

Scripts for executing each of the variant callers which are used by the prepare subworkflow. Small pipelines can be written for each caller by using a special naming convention. See the caller README for more information.

breakCA/

Scripts for calculating posterior probabilities for the existence of an insertion or deletion, which can be used as features for the classifier. These scripts are adapted from @Arkosen's BreakCA code.

scripts/

Various scripts used by the pipeline. See the script README for more information.

run.bash

An example bash script for executing the pipeline using snakemake and conda. Any arguments to this script are passed directly to snakemake.
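As an illustration of that pass-through behavior (the flag below is a standard Snakemake option, not something specific to VarCA):

  # unlock the working directory after a previous run was interrupted
  ./run.bash --unlock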

citation

There is an option to "Cite this repository" on the right sidebar of the repository homepage.

Massarat, A. R., Sen, A., Jaureguy, J., Tyndale, S. T., Fu, Y., Erikson, G., & McVicker, G. (2021). Discovering single nucleotide variants and indels from bulk and single-cell ATAC-seq. Nucleic Acids Research, gkab621. https://doi.org/10.1093/nar/gkab621

Owner

  • Name: McVicker Lab
  • Login: mcvickerlab
  • Kind: organization
  • Location: Salk Institute for Biological Studies

Citation (CITATION.cff)

# YAML 1.2
---
abstract: "Genetic variants and de novo mutations in regulatory regions of the genome are typically discovered by whole-genome sequencing (WGS), however WGS is expensive and most WGS reads come from non-regulatory regions. The Assay for Transposase-Accessible Chromatin (ATAC-seq) generates reads from regulatory sequences and could potentially be used as a low-cost ‘capture’ method for regulatory variant discovery, but its use for this purpose has not been systematically evaluated. Here we apply seven variant callers to bulk and single-cell ATAC-seq data and evaluate their ability to identify single nucleotide variants (SNVs) and insertions/deletions (indels). In addition, we develop an ensemble classifier, VarCA, which combines features from individual variant callers to predict variants. The Genome Analysis Toolkit (GATK) is the best-performing individual caller with precision/recall on a bulk ATAC test dataset of 0.92/0.97 for SNVs and 0.87/0.82 for indels within ATAC-seq peak regions with at least 10 reads. On bulk ATAC-seq reads, VarCA achieves superior performance with precision/recall of 0.99/0.95 for SNVs and 0.93/0.80 for indels. On single-cell ATAC-seq reads, VarCA attains precision/recall of 0.98/0.94 for SNVs and 0.82/0.82 for indels. In summary, ATAC-seq reads can be used to accurately discover non-coding regulatory variants in the absence of whole-genome sequencing data and our ensemble method, VarCA, has the best overall performance."
authors: 
  -
    affiliation: "Bioinformatics and Systems Biology Graduate Program, University of California San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA"
    family-names: Massarat
    given-names: Arya
    orcid: "https://orcid.org/0000-0002-3679-0345"
  -
    affiliation: "Integrative Biology Laboratory, Salk Institute for Biological Studies, 10010 N. Torrey Pines Road, La Jolla, CA 92037, USA"
    family-names: Sen
    given-names: Arko
    orcid: "https://orcid.org/0000-0001-9876-281X"
  -
    affiliation: "Bioinformatics and Systems Biology Graduate Program, University of California San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA"
    family-names: Jaureguy
    given-names: Jeff
    orcid: "https://orcid.org/0000-0002-6303-422X"
  -
    affiliation: "Integrative Biology Laboratory, Salk Institute for Biological Studies, 10010 N. Torrey Pines Road, La Jolla, CA 92037, USA"
    family-names: Tyndale
    given-names: "Sélène"
    orcid: "https://orcid.org/0000-0001-9805-1049"
  -
    affiliation: "Razavi Newman Integrative Genomics and Bioinformatics Core, Salk Institute for Biological Studies, 10010 N. Torrey Pines Road, La Jolla, CA 92037, USA"
    family-names: Fu
    given-names: Yi
  -
    affiliation: "Razavi Newman Integrative Genomics and Bioinformatics Core, Salk Institute for Biological Studies, 10010 N. Torrey Pines Road, La Jolla, CA 92037, USA"
    family-names: Erikson
    given-names: Galina
  -
    affiliation: "Integrative Biology Laboratory, Salk Institute for Biological Studies, 10010 N. Torrey Pines Road, La Jolla, CA 92037, USA"
    family-names: McVicker
    given-names: Graham
    orcid: "https://orcid.org/0000-0003-0991-0951"
cff-version: "1.1.0"
date-released: 2021-07-21
doi: "10.1093/nar/gkab621"
identifiers: 
  - 
    type: doi
    value: "10.1093/nar/gkab621"
  - 
    type: url
    value: "https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab621/6329114"
license: MIT
message: "If you use this software, please cite it using these metadata."
repository-code: "https://github.com/mcvickerlab/varCA"
title: "Discovering single nucleotide variants and indels from bulk and single-cell ATAC-seq"
version: "v0.3.1"
...

GitHub Events

Total
  • Push event: 3
Last Year
  • Push event: 3