atacseq_pipeline

Ultimate ATAC-seq Data Processing, Quantification and Annotation Snakemake Workflow and MrBiomics Module.

https://github.com/epigen/atacseq_pipeline

Keywords

atac-seq bash bioinformatics biomedical-data-science ngs pipeline python snakemake workflow

Last synced: 10 months ago · JSON representation ·

Repository

Ultimate ATAC-seq Data Processing, Quantification and Annotation Snakemake Workflow and MrBiomics Module.

Basic Info

Host: GitHub
Owner: epigen
License: mit
Language: Python
Default Branch: main
Homepage: https://epigen.github.io/atacseq_pipeline/
Size: 46.4 MB

Statistics

Stars: 59
Watchers: 3
Forks: 2
Open Issues: 2
Releases: 10

Topics

atac-seq bash bioinformatics biomedical-data-science ngs pipeline python snakemake workflow

Created over 5 years ago · Last pushed 11 months ago

Metadata Files

Readme License Citation

Ultimate ATAC-seq Data Processing, Quantification & Annotation Pipeline

A Snakemake 8 workflow implementation of the BSF's ATAC-seq Data Processing Pipeline extended by downstream quantification and annotation steps using Bash and Python.

[!NOTE]
This workflow adheres to the module specifications of MrBiomics, an effort to augment research by modularizing (biomedical) data science. For more details, instructions, and modules check out the project's repository.

⭐️ Star and share modules you find valuable 📤 - help others discover them, and guide our future work!

[!IMPORTANT]
If you use this workflow in a publication, please don't forget to give credit to the authors by citing it using this DOI 10.5281/zenodo.6323634.

Workflow Rulegraph

🖋️ Authors

💿 Software

This project wouldn't be possible without the following software and their dependencies:

🔬 Methods

This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) have to be read out from the respective conda environment specifications (workflow/envs/*.yaml file) or post-execution in the result directory (atacseq_pipeline/envs/*.yaml). Parameters that have to be adapted depending on the data or workflow configurations are denoted in squared brackets e.g., [X].

Processing. Sequencing adapters were removed using the software fastp (ver) [ref]. Bowtie2 (ver) [ref] was used for the alignment of the short reads (representing locations of transposition events) to the [GRCh38 (hg38)/GRCm38 (mm10)] assembly of the [human/mouse] genome using the “--very-sensitive” parameter. PCR duplicates were marked using samblaster (ver) [ref]. Aligned BAM files were then sorted, filtered using ENCODE blacklisted regions [ref], samtools view flags [SAMflag], and indexed using samtools (ver) [ref]. To detect the open chromatin regions, peak calling was performed using MACS2 (ver) [ref] using the “--nomodel”, “--keep-dup [macs2keep_dup]” and “--extsize 147” options on each sample. HOMER (ver) [ref] function findMotifs was used for motif enrichment analysis of the detected open chromatin regions. Quality control metrics were aggregated and reported using MultiQC (ver) [ref], [X] sample(s) needed to be removed.

Quantification. A consensus region set, comprising of [X] genomic regions, was generated, by merging the identified peak summits, extended by [slopextension]bp on both sides using the slop function from bedtools (ver) [ref] and pybedtools (ver) [ref], across all samples while again discarding peaks overlapping blacklisted features as defined by the ENCODE project [ref]. The consensus region set was used to quantify the chromatin accessibility in each sample by summing the number of reads overlapping each consensus region. The consensus region set, and sample-wise quantification of accessibility was performed using bedtools (ver) [ref] and pybedtools (ver) [ref]. Furthermore, the consensus region set was used to quantify the peak support per sample and each region was mapped to its closest TSS according to the HOMER annotation within proximal TSS up and downstream distances [proximalsize_up/down] yielding a gene/TSS-based quantification. Complementary, all promoter regions, defined by the same parameters, were quantified for each sample and aggregated to yield a gene/promoter-based quantification. Finally, all sample-wise enriched known motifs according to HOMER were aggregated.

Annotation. Consensus regions were annotated using annotatePeaks function from HOMER (ver) [ref]. Additionally, we annotated all consensus regions using UROPA (ver) [ref] and genomic features from the [GENCODE vX] basic gene annotation as: “TSS proximal” if the region’s midpoint was within [X] bp upstream or [X] bp downstream from a TSS, or if the region overlapped with a TSS; “gene body” if the region overlapped with a gene; “distal” if the region’s midpoint was within [X] bp of a TSS; and “intergenic” otherwise. For each region, only the closest feature was considered, and the annotations took precedence in the following order: TSS proximal, gene body, distal, and intergenic. Finally, bedtools was employed to quantify nucleotide counts and proportional content per consensus region.

The processing and quantification described here was performed using a publicly available Snakemake ver workflow [10.5281/zenodo.6323634].

🚀 Features

Processing (results/)
- Alignment of both single-end and paired-end reads in raw/unaligned/unmapped uBAM format with Bowtie2.
  - Filtering using samtools view can be configured using SAM Flags (SAM_flag).
- Peak calling with MACS2.
  - Duplicate handling can be configured using macs2_keep_dup.
  - Even though the peak support of a region in a certain sample is 0, does not mean that there are no reads counted in the count matrix, it just states that there was no peak called.
  - The peak support can be >1 for certain samples in case of a consensus region spanning more than one peak within the respective sample.
- Peak annotation and motif analysis HOMER.
- Quantification of TSS coverage.
Reporting (report/)
- MultiQC report generation using MultiQC, extended with an in-house developed plugin atacseq_report.
- Sample annotation is visualized as a hierarchically-clustered QC heatmap with matching metadata annotation, exported both as a PNG and an interactive HTML with metadata as tooltips (sample_annotation.{png|html}).
Quantification (counts/)
- Consensus region set generation across all called peaks (consensus_regions.bed).
- Read count quantification of the consensus regions across samples, yielding a count matrix with dimensions consensus regions X samples (consensus_counts.csv).
- Peak support quantification of the consensus regions across samples, yielding a count matrix with dimensions consensus regions X samples (support_counts.csv).
- Consensus regions mapped to closest gene TSS according to HOMER (Distance to TSS) within proximal TSS up and downstream distances (TSS_regions.bed, TSS_counts.csv, TSS_annotation.csv).
- Read count quantification of promoter regions based on provided proximal TSS up and downstream distances (promoter_regions.bed, promoter_counts.csv, promoter_annotation.csv).
  - Pseudoautosomal regions in human chromosome Y are skipped.
- Aggregation of all sample-wise HOMER known motif enrichment results into one CSV in long-format (HOMER_knownMotifs.csv).
Annotation (counts/)
- Sample annotation file based on MultiQC general stats and provided annotations for downstream analysis (sample_annotation.csv).
- Consensus region set annotation using (consensus_annotation.csv)
  - UROPA with regulatory build and gencode as references, configurable here: workflow/resources/UROPA/*.txt.
  - HOMER with annotatePeaks.pl. NB: We have empirically found, that some human sex genes, e.g., the well established protein coding genes UTY and STS, are not annotated.
  - bedtools for nucleotide counts/content (e.g., % of GC).

[!IMPORTANT] Duplciate reads can be filtered during the alignment step by samtools and/or ignored during peak calling by MACS2. The inclusion of duplicates should be intentional, and may lead to a large number of consensus regions. The removal of duplicates should be intentional, might remove real biological signal. The decision depends on your downstream analysis steps e.g., rigorous filtering (e.g., using edgeR::filterByExpr) and/or accounting for PCR bias by normalization conditional on genomic region length and GC content (e.g., CQN) and goals (e.g., differential accessibility analysis). We recommend reading this ChIP-seq tutorial's section on "Removing redundancy".

🛠️ Usage

These steps are the recommended usage for this workflow:

Configure the workflow by pointing to the relevant resources, e.g., downloaded from Zenodo for hg38 or mm10 (see instructions below).
Perform only the processing, by setting the pass_qc annotation for all samples to 0.
Use the generated MultiQC report (resultpath/ataceqpipeline/report/multiqc_report.html) to judge the quality of each sample (see tips in the next section on Quality Control).
Fill out the mandatory quality control column (pass_qc) in the annotation file accordingly (everything >0 will be included in the downstream steps).
Finally, execute the remaining downstream quantification and annotation steps by running the workflow. Thereby only the samples that passed quality control will be included in the consensus region set generation (i.e., the feature space) and all downstream steps.

[!NOTE] Although inputs and parameters may be identical, MACS2 peak calling can yield slightly varying results (± a few peaks) due to stochastic elements in its algorithm (e.g., duplicate handling).This minor variability in peak calls sohuld have no impact on downstream analyses or the overall robustness of results.

This workflow is written with Snakemake and its usage is described in the Snakemake Workflow Catalog.

⚙️ Configuration

Detailed specifications can be found here ./config/README.md

📖 Examples

Explore a detailed example showcasing module usage and downstream analysis in our comprehensive end-to-end MrBiomics Recipe for ATAC-seq Analysis, including data, configuration, annotation and results.

🔍 Quality Control

Below are some guidelines for the manual quality control of each sample using the generated MultiQC report and visualized (interactive) sample annotation, but keep in mind that every experiment/dataset is different. Thresholds are general suggestions and may vary based on experiment type, organism, and library prep.

Reads Mapped ~ $30\cdot 10^{6}$ ($>20\cdot 10^{6}$ at least)
% Aligned >90%
% Mitochondrial <10%
Peaks (depend on reads)
- FriP (Fraction of reads in Peaks) ~ >20% (can be misleading as 80-90% are also not good)
- Regulatory regions >10% (as it is roughly 10% of the genome)
- TSS (Transcription Start Site) normalized coverage ideally > 4 (at least >2)
- % Duplications “not excessive”
Inspect Genome Browser Tracks using UCSC Genome Browser (online) or IGV (local)
- Compare all samples to the best, based on above's QC metrics.
- Check cell type / experiment-specific markers or sex chromosome (X/Y) for accessibility as positive controls.
- Check e.g., developmental regions for accessibility as negative controls.
Unsupervised Analysis (e.g., PCA or UMAP)
- Identify outliers/drivers of variation, especially in the control samples and within replicates.

[!IMPORTANT]
Sometimes reads map to Y in females, because X and Y chromosomes both have pseudoautosomal regions (PARs) that are common between the two chromosomes.

My personal QC value scheme to inform downstream analyses (e.g., unsupervised analysis) - 0 = did not pass - 2 options - for every metric that is not ok subtract 0.25 from 1, which means it requires 4 “strikes” for a sample to be removed due to low quality. - alternative - 0.5 = passed with reservations (e.g., metrics and genome browser tracks were not optimal, but still good enough) - 0.75 = not ideal (e.g., at least metrics or IGV tracks were not optimal) - 1 = passed (perfect)

Finally, a previous PhD student in our lab, André Rendeiro, wrote about "ATAC-seq sample quality, wet lab troubleshooting and advice".

🔗 Links

📚 Resources

Data Resources: To ensure the reproducibility of results and to make the workflow accessible we provide all required reference data for the analysis of ATAC-seq samples for human GRCh38 (hg38) and mouse GRCm38 (mm10) genomes on Zendodo.

Command line ```console # download Zenodo records using zenodo_get

# install zenodoget v1.3.4 conda install -c conda-forge zenodoget=1.3.4

# human GRCh38 (hg38) zenodoget --record 6344173 --output-dir=resources/atacseqpipeline/hg38/ cd resources/atacseqpipeline/hg38 unzip indicesforBowtie2.zip && rm indicesfor_Bowtie2.zip

# mouse GRCm38 (mm10) zenodoget --record 6344322 --output-dir=resources/atacseqpipeline/mm10/ cd resources/atacseqpipeline/mm10 unzip indicesforBowtie2.zip && rm indicesforBowtie2.zip **Snakemake rule** for workflowspython #### Get resources from Zenodo as custom Snakemake rule #### # Downloads Bowtie2 indices for hg38 from Zenodo record 6344173 and unpacks them. rule MyATACgetresources: output: "resources/MyATAC/atacseqpipeline/hg38/gencode.v38.basic.annotation.gtf", resourcedir = directory("resources/MyATAC/atacseqpipeline/hg38/"), params: zenodorecord = "6344173", zipfilename = "indicesforBowtie2.zip" conda: "../envs/zenodoget.yaml" shell: """ # Download the specific record to the target directory zenodoget --record {params.zenodorecord} --output-dir={output.resourcedir}

        # Change directory, unzip the specific file, and remove the zip archive
        # Using && ensures commands run sequentially and stop if one fails
        cd {output.resource_dir} && \
        unzip {params.zip_filename} && \
        rm {params.zip_filename}
        """

``` - How to convert feature lists to BED files - Recommended compatible MrBiomics modules for - upstream analysis: - Fetch Public Sequencing Data and Metadata Using iSeq to retrieve and prepare public ATAC-ses data for downstream processing. - downstream analysis (in that order): - Genome Browser Track Visualization for quality control and visual inspection/analysis of genomic regions/genes of interest or top hits. - Split, Filter, Normalize and Integrate Sequencing Data after count quantification. - Unsupervised Analysis to understand and visualize similarities and variations between cells/samples, including dimensionality reduction and cluster analysis. Useful for all tabular data including single-cell and bulk sequencing data. - Differential Analysis with limma to identify and visualize statistically significantly different features (e.g., genes or genomic regions) between sample groups. - Enrichment Analysis for biomedical interpretation of (differential) analysis results using prior knowledge. - Introduction to ChIP-seq using high performance computing

📑 Publications

The following publications successfully used this module for their analyses. - Haladik et al. (2025) Cell Reports Medicine - Image-based drug screening combined with molecular profiling identifies signatures and drivers of therapy resistance in pediatric AML - Traxler, Reichl et al. (2025) Cell Systems - Integrated time-series analysis and high-content CRISPR screening delineate the dynamics of macrophage immune regulation - Casteels et al. (2022) Cell Reports - SMNDC1 links chromatin remodeling and splicing to regulate pancreatic hormone expression - ...

⭐ Star History

Owner

Name: Computational Epigenetics
Login: epigen
Kind: organization
Location: Vienna, Austria

Website: http://www.medical-epigenomics.org/
Repositories: 17
Profile: https://github.com/epigen

Computational Epigenetics Research and Software

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Ultimate ATAC-seq Data Processing & Quantification Pipeline
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - affiliation: CeMM Research Center for Molecular Medicine
    given-names: Stephan
    family-names: Reichl
    orcid: 'https://orcid.org/0000-0001-8555-7198'
  - given-names: Bekir
    family-names: Ergüner
    orcid: 'https://orcid.org/0000-0001-5475-0892'
    affiliation: CeMM Research Center for Molecular Medicine
  - given-names: Daniele
    family-names: Barreca
    orcid: 'https://orcid.org/0000-0003-3373-9939'
    affiliation: CeMM Research Center for Molecular Medicine
  - given-names: Lukas
    family-names: Folkman
    orcid: 'https://orcid.org/0000-0002-5811-8875'
    affiliation: CeMM Research Center for Molecular Medicine
  - given-names: Fangwen
    family-names: Zhao
    affiliation: CeMM Research Center for Molecular Medicine
  - given-names: Rob
    family-names: ter Horst
    orcid: 'https://orcid.org/0000-0003-0576-5873'
    affiliation: CeMM Research Center for Molecular Medicine
  - given-names: Lina
    family-names: Dobnikar
    affiliation: CeMM Research Center for Molecular Medicine
  - given-names: Christoph
    family-names: Bock
    orcid: 'https://orcid.org/0000-0001-6091-3088'
    affiliation: CeMM Research Center for Molecular Medicine
identifiers:
  - type: doi
    value: 10.5281/zenodo.6323634
    description: >-
      This DOI represents all versions, and will always
      resolve to the latest one.
repository-code: 'https://github.com/epigen/atacseq_pipeline'
url: 'https://epigen.github.io/atacseq_pipeline/'
abstract: >-
  A Snakemake implementation of the BSF's ATAC-seq Data
  Processing Pipeline extended by downstream quantification 
  and annotation steps using Bash and Python.
keywords:
  - Bioinformatics
  - ATAC-seq
  - Epigenetics
  - Workflow
  - Snakemake
license: MIT

GitHub Events

Total

Create event: 2
Issues event: 5
Release event: 3
Watch event: 12
Issue comment event: 5
Push event: 8

Last Year

Create event: 2
Issues event: 5
Release event: 3
Watch event: 12
Issue comment event: 5
Push event: 8

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 47
Total pull requests: 1
Average time to close issues: 2 months
Average time to close pull requests: 3 minutes
Total issue authors: 5
Total pull request authors: 1
Average comments per issue: 0.64
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 7
Pull requests: 1
Average time to close issues: 3 months
Average time to close pull requests: 3 minutes
Issue authors: 2
Pull request authors: 1
Average comments per issue: 0.29
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

sreichl (41)
BiotechPedro (2)
dariarom94 (2)
zhangzhen (1)
descostesn (1)