scatacpipe

Nextflow pipeline for processing scATAC-seq

https://github.com/marykthompson/scatacpipe


Repository


Basic Info

Host: GitHub
Owner: marykthompson
License: mit
Language: R
Default Branch: main
Size: 0 Bytes

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created about 1 year ago · Last pushed about 1 year ago


README.md

Table of Contents

- Introduction
- Pipeline summary
  - PREPROCESS_DEFAULT
  - PREPROCESS_10XGENOMICS
  - PREPROCESS_CHROMAP
  - DOWNSTREAM_ARCHR
- Quick Start
  - Web GUI
- An example using human genome with matched scRNA-seq data
  - Commands and config
  - Pipeline info: time and resource usage
- An example using plant genome without matched scRNA-seq data
- Documentation
- Credits
- Bug report/Support
- Citations
- Release notes

Introduction

scATACpipe is a bioinformatic pipeline for single-cell ATAC-seq (scATAC-seq) data analysis.

The pipeline is built with Nextflow, a workflow tool that runs tasks portably across multiple compute infrastructures. It uses Docker or Singularity containers, which makes installation straightforward and results highly reproducible.

The development of the pipeline follows the nf-core template.

Pipeline Summary

The pipeline consists of two parts: preprocessing (from FASTQ files to a fragment file) and downstream analysis. If fragment files are already available, you can skip preprocessing and run the downstream analysis only.

For preprocessing, three alternative strategies are available, each implemented as its own sub-workflow: PREPROCESS_DEFAULT, PREPROCESS_10XGENOMICS, and PREPROCESS_CHROMAP. Each supports several input types, which are described in more detail below (see also usage).

For downstream analysis, the DOWNSTREAM_ARCHR sub-workflow integrates ArchR with other tools (e.g. AMULET for doublet detection).

Below is a simplified diagram to illustrate the design logic and functionalities of scATACpipe.

The main functionalities of each sub-workflow are summarized below. You can also refer to the output - Result folders for more details.

PREPROCESS_DEFAULT:

Add barcodes to reads
Correct barcodes (optional)
- if false: skip barcode correction
- if pheniqs or naive: also filter out non-cells using "inflection point" method
Trim off adapters
Mapping
- download genome/annotation or use custom genome
- build genome index if not supplied
Filter BAM
Remove PCR duplicates
Quality control
Generate fragment file, etc.

PREPROCESS_10XGENOMICS:

Build 10XGENOMICS index if not supplied
- download genome/annotation or use custom genome
Execute cellranger_atac count command
Extract fragments from valid cells according to filtered_peak_bc_matrix/barcodes.tsv
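
The barcode-based extraction step above can be illustrated with a small shell sketch. It is not the pipeline's own implementation: the function name is hypothetical, and it assumes an uncompressed, tab-separated fragment file in the usual 10x layout (chromosome, start, end, barcode, count) and a barcodes.tsv with one barcode per line.

```shell
# Keep only fragments whose barcode is listed in a barcodes file.
# Assumption: column 4 of the fragment file holds the cell barcode.
filter_fragments_by_barcode() {
    barcodes="$1"    # one valid barcode per line, e.g. barcodes.tsv
    fragments="$2"   # uncompressed fragment file (decompress .gz files first)
    awk -F'\t' '
        NR == FNR { keep[$1] = 1; next }   # first file: remember valid barcodes
        /^#/      { next }                 # skip header comments in the fragment file
        ($4 in keep)                       # second file: print fragments from valid cells
    ' "$barcodes" "$fragments"
}

# Example (paths are placeholders):
#   filter_fragments_by_barcode barcodes.tsv fragments.tsv > fragments_valid_cells.tsv
```

Reading the barcodes into an in-memory lookup first keeps the filter to a single pass over the (much larger) fragment file.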

PREPROCESS_CHROMAP:

Build Chromap index if not supplied
- download genome/annotation or use custom genome
Execute chromap --preset atac command
Filter out non-cells

Note that no BAM file is generated with the PREPROCESS_CHROMAP option.

DOWNSTREAM_ARCHR:

Build ArchR-compatible genome/annotation files if not natively supported (ArchR supports hg19, hg38, mm9, and mm10 as of 02/2022)
- download genome/annotation if not supplied
- build ArchR genome/gene annotation files if needed
Perform downstream analysis with ArchR and generate various analytical plots
- filter doublets (with ArchR built-in method or AMULET)
- dimension reduction
- batch effect correction
- clustering
- embedding
- pseudo-bulk clustering
- scRNAseq integration if supplied
- marker gene detection
- call peaks
- marker peak detection
- pairwise testing
- motif enrichment
- footprinting
- coaccessibility, etc.

The pipeline also splits BED and/or BAM files according to ArchR clusterings and summarizes all results into a single MultiQC report for easy viewing.

Quick Start

Install Nextflow (>=21.10.0).
For full reproducibility, install either Docker or Singularity (Apptainer) and pass -profile docker or -profile singularity accordingly when running the pipeline so that all dependencies are satisfied. Otherwise, every dependency must already be available locally on your PATH, which is unlikely.
Download the pipeline:
    git clone https://github.com/hukai916/scATACpipe.git
Download a minimal test dataset:
- The test_data1 is prepared by downsampling (5% and 10%) a dataset named "500 Peripheral blood mononuclear cells (PBMCs) from a healthy donor (Next GEM v1.1)" provided by 10xgenomics. Note that, in test_data1, I1 refers to index1, which is for sample demultiplexing and not relevant in our case; R1 refers to Read1; R2 refers to index2, which represents the cell barcode fastq; R3 refers to Read2.

    cd scATACpipe
    wget https://www.dropbox.com/s/uyiq18zk7dts9fx/test_data1.zip
    unzip test_data1.zip

Edit assets/sample_sheet_test_data1.csv, replacing replace_with_full_path with the actual full path.
Test the pipeline with this minimal test_data1:
- At least 8GB memory is recommended for test_data1.
- By default, the local executor is used (-profile local), meaning that all jobs are executed on your local computer. Nextflow supports many other executors, including SLURM and LSF. You can create a profile file to configure which executor to use. Multiple profiles can be supplied separated by commas, e.g. -profile docker,lsf.
- Please check nf-core/configs to see what other custom config files can be supplied.

Example run with Docker using the local executor:
    nextflow run main.nf -profile docker --preprocess default --outdir res_test_data1 --input_fastq assets/sample_sheet_test_data1.csv --ref_fasta_ensembl homo_sapiens --species_latin_name 'homo sapiens'
By executing the above command:
- The local executor will be used.
- PREPROCESS_DEFAULT will be used.
- Output will be saved into res_test_data1.
- Ensembl genome homo_sapiens will be downloaded and used as reference.

Example run with Singularity using the LSF executor:
    nextflow run main.nf -profile singularity,lsf --preprocess default --outdir res_test_data1 --input_fastq assets/sample_sheet_test_data1.csv --ref_fasta_ensembl homo_sapiens --species_latin_name 'homo sapiens'
- By specifying -profile lsf, the LSF executor will be used for job submission.
- By specifying -profile singularity, Singularity images will be downloaded and saved to the work/singularity directory. It is recommended to configure NXF_SINGULARITY_CACHEDIR or the singularity.cacheDir setting so that the images are stored in a central location.

Run your own analysis:
- A typical command:
    nextflow run main.nf -profile <singularity/docker/lsf> --preprocess <default/10xgenomics/chromap> --outdir <path_to_result_dir> --input_fastq <path_to_samplesheet> --ref_fasta_ensembl <ENSEMBL_genome_name> --species_latin_name <e.g. 'homo sapiens'>
- For help:
    nextflow run main.nf --help

See documentation usage for all of the available options.
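
The sample sheet edit in the steps above (substituting replace_with_full_path) can also be scripted. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and which directory the placeholder should point to depends on where you unpacked test_data1.

```shell
# Replace the literal placeholder "replace_with_full_path" in a sample sheet
# with an absolute directory path, writing the result to a new file.
fill_samplesheet_paths() {
    sheet="$1"      # e.g. assets/sample_sheet_test_data1.csv
    data_dir="$2"   # absolute path to substitute for the placeholder
    out="$3"        # where to write the edited copy
    # "|" is used as the sed delimiter because paths contain "/".
    # Paths that themselves contain "|" or "&" would need escaping first.
    sed "s|replace_with_full_path|${data_dir}|g" "$sheet" > "$out"
}

# Example (assumes the data were unzipped into the current directory):
#   fill_samplesheet_paths assets/sample_sheet_test_data1.csv "$(pwd)" my_samplesheet.csv
```

Writing to a separate file leaves the shipped sample sheet untouched; pass the new file to --input_fastq.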

Web GUI

To make it easy to generate a configuration file, we have implemented an interactive config generator.

It is written in plain HTML/JavaScript/CSS so that it can be used locally. To do so, simply click here to download the web_gui.

Then open it in your web browser by double-clicking the web_gui/index.html file, or run the following command in your terminal:
```
open web_gui/index.html
```
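
The open command is specific to macOS. As a hedged sketch, the helper below picks a launcher by operating system; the function names are hypothetical, and it assumes xdg-open is installed on Linux desktops, which is common but not guaranteed.

```shell
# Map an operating system name (as printed by `uname -s`) to a file-opening command.
opener_for_os() {
    case "$1" in
        Darwin) echo "open" ;;       # macOS
        Linux)  echo "xdg-open" ;;   # typical Linux desktop launcher
        *)      echo "" ;;           # unknown system: no suggestion
    esac
}

# Open the config generator with whichever launcher matches this machine.
open_web_gui() {
    opener=$(opener_for_os "$(uname -s)")
    if [ -n "$opener" ]; then
        "$opener" web_gui/index.html
    else
        echo "Open web_gui/index.html manually in your browser." >&2
    fi
}
```

Calling open_web_gui from the directory that contains web_gui should then behave like the command above on either system.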

An example using human genome with matched scRNA-seq data

Commands and config

This section describes how the plots in the manuscript (to be added) were generated using scATACpipe. For comparison, the manuscript conducted 3 separate analyses, each using a different preprocessing strategy (default, 10xgenomics, chromap). Since the commands and preprocessed results are quite similar across the three methods, only the chromap option will be demonstrated here.

The initial execution:
    nextflow run main.nf -profile singularity,lsf --preprocess chromap --outdir ./results_chromap_initial --input_fastq ./assets/10X_human_scatac_fastq.csv --ref_fasta_ensembl homo_sapiens --species_latin_name 'homo sapiens' --archr_scrnaseq '/path/scRNA-Hematopoiesis-Granja-2019.rds' --archr_blacklist /home/hl84w/lucio_castilla/scATAC-seq/docs/hg38-blacklist.v2.bed.gz

Break down:

-profile singularity,lsf:

This option instructs scATACpipe to use Singularity containers and LSF as the executor. Multiple profiles are separated by commas. Because -profile is an option of Nextflow itself, it is prefixed with a single dash (-). Pipeline parameters are prefixed with a double dash (--).

--preprocess chromap:

This instructs scATACpipe to use the Chromap preprocessing strategy.

--outdir ./results_chromap_initial:

Output will be saved into the ./results_chromap_initial folder.

--input_fastq ./assets/10X_human_scatac_fastq.csv:

Please replace the /path/ in the 10X_human_scatac_fastq.csv with absolute paths. Details regarding the 6 samples can be found in the supplementary section of the paper. If you detect any outlier samples, you can remove them from the downstream analyses using the --filter_sample = 'PBMC_10K_C, PBMC_10K_X' flag.

--ref_fasta_ensembl homo_sapiens:

This specifies that the Homo sapiens genome from Ensembl will be used as reference. To view all supported genomes, run nextflow run main.nf --support_genome.

--species_latin_name 'homo sapiens':

Simply the Latin name of the reference genome.

--archr_scrnaseq '/path/scRNA-Hematopoiesis-Granja-2019.rds':

Matching scRNA-seq data. Omit this flag if no matched data are available. The example file can be downloaded here.

--archr_blacklist ./assets/hg38-blacklist.v2.bed.gz:

Blacklist regions to exclude from downstream analysis. Click here for other species.

Instead of passing each option on the command line, you can put them all in a configuration file and supply it with the -c option. The following is equivalent to the command above:
    nextflow run main.nf -profile singularity,lsf -c ./conf/test_chromap_initial.config

Contents of test_chromap_initial.config: https://github.com/hukai916/scATACpipe/blob/216f1460577926a04a2d2918f36588afb6217f8a/conf/testchromapinitial.config#L1-L10

Again, you must replace /path/ with full absolute paths.
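
To make the flag-to-config correspondence concrete, here is a minimal sketch that writes such a file. It relies on the standard Nextflow convention that a command-line option --name value sets params.name. The function name and output file name are hypothetical, and the actual contents of test_chromap_initial.config may differ from what is shown.

```shell
# Write a Nextflow config whose params mirror the command-line flags discussed above.
write_chromap_config() {
    cat > "$1" <<'EOF'
// Hypothetical config: each entry mirrors one "--flag value" pair from the command line.
params {
    preprocess         = 'chromap'
    outdir             = './results_chromap_initial'
    input_fastq        = './assets/10X_human_scatac_fastq.csv'
    ref_fasta_ensembl  = 'homo_sapiens'
    species_latin_name = 'homo sapiens'
    archr_scrnaseq     = '/path/scRNA-Hematopoiesis-Granja-2019.rds'
    archr_blacklist    = './assets/hg38-blacklist.v2.bed.gz'
}
EOF
}

# Example:
#   write_chromap_config my_chromap.config
#   nextflow run main.nf -profile singularity,lsf -c my_chromap.config
```

Note that -profile stays on the command line because it is a Nextflow option rather than a pipeline parameter.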

The final execution:

After examining the results of the initial execution, we decided to remove the outlier clusters (C1, C6) from downstream analyses. These two clusters are considered problematic according to the following two plots:
* The clustering heatmap plot from the ./results_chromap_initial/archr_clustering/ folder: the cell proportions from the PBMC_5K_N and PBMC_5K_V samples are unbalanced for C1 and C6.
* The marker gene heatmap plot from the ./results_chromap_initial/archr_marker_gene_clusters/ folder: no distinct marker gene pattern is detected in clusters C1 and C6.

We used the following line to remove C1 and C6: https://github.com/hukai916/scATACpipe/blob/20e0c820dd685833438a5d24e3eeb0fc5c174a87/conf/testchromapfinal.config#L18

Also, we would like to perform constrained integration of scRNA-seq data in addition to the unconstrained integration. The following line was used to supply the grouping information: https://github.com/hukai916/scATACpipe/blob/20e0c820dd685833438a5d24e3eeb0fc5c174a87/conf/testchromapfinal.config#L12

To specify marker genes to plot, edit the following lines:
https://github.com/hukai916/scATACpipe/blob/20e0c820dd685833438a5d24e3eeb0fc5c174a87/conf/testchromapfinal.config#L30
https://github.com/hukai916/scATACpipe/blob/20e0c820dd685833438a5d24e3eeb0fc5c174a87/conf/testchromapfinal.config#L43
https://github.com/hukai916/scATACpipe/blob/20e0c820dd685833438a5d24e3eeb0fc5c174a87/conf/testchromapfinal.config#L53
https://github.com/hukai916/scATACpipe/blob/20e0c820dd685833438a5d24e3eeb0fc5c174a87/conf/testchromapfinal.config#L56

To specify a set of motifs for downstream analyses, edit the following lines:
https://github.com/hukai916/scATACpipe/blob/b0bed3f63c7044fd6ab98c39c9d81166fe476edc/conf/modules.config#L289
https://github.com/hukai916/scATACpipe/blob/b0bed3f63c7044fd6ab98c39c9d81166fe476edc/conf/modules.config#L298

To specify a set of motifs for footprinting analyses, edit the following lines:
https://github.com/hukai916/scATACpipe/blob/b0bed3f63c7044fd6ab98c39c9d81166fe476edc/conf/modules.config#L305
https://github.com/hukai916/scATACpipe/blob/b0bed3f63c7044fd6ab98c39c9d81166fe476edc/conf/modules.config#L314

We ended up with the final test_chromap_final.config: https://github.com/hukai916/scATACpipe/blob/20e0c820dd685833438a5d24e3eeb0fc5c174a87/conf/testchromapfinal.config#L1-L66

The final execution command is:
    nextflow run main.nf -profile singularity,lsf -c ./conf/test_chromap_final.config -resume session_id
To skip analyses that have already been performed, you must provide the -resume session_id option. The corresponding session ID can be found with the nextflow log command. Note that the --outdir directory is set to results_chromap_final.
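
Looking up the session ID can be scripted. The sketch below reads nextflow log output from standard input and prints the session ID of the most recent run. The function name is hypothetical, and it assumes that runs are listed one per line, oldest first, with session IDs printed as 36-character UUIDs; check this against the output of your Nextflow version.

```shell
# Print the session ID of the last run listed in `nextflow log` output (read from stdin).
latest_session_id() {
    uuid='[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}'
    # Keep lines that contain a UUID, take the last one (the most recent run),
    # then print the first UUID on it. The session ID column comes before the
    # command column, so a UUID inside the recorded command is not picked up.
    grep -E "$uuid" | tail -n 1 | grep -Eo "$uuid" | head -n 1
}

# Example:
#   nextflow run main.nf -profile singularity,lsf -c ./conf/test_chromap_final.config \
#       -resume "$(nextflow log | latest_session_id)"
```

Nextflow also accepts -resume without an ID, in which case it resumes the most recent run; passing the ID explicitly is useful when other runs have been launched from the same directory in between.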

When the final execution is complete, the clustering heatmaps and marker gene heatmaps both look good:
* The marker gene heatmap plot from the ./results_chromap_final/archr_marker_gene_clusters/ folder

Pipeline info

When the pipeline finishes, Nextflow produces time and resource usage reports that are stored under pipeline_info:
- Using chromap option: results_chromap_final/pipeline_info
- Using default option: results_default_final/pipeline_info
- Using 10xgenomics option: results_10xgenomics_final/pipeline_info

An example using plant genome without matched scRNA-seq data

For this example (GSE155304), integrated analysis cannot be performed because no matched scRNA-seq data are available. Also note that when motifSet is set to 'cisbp', only human and mouse are currently supported by ArchR. Therefore, for this dataset, we need to replace all occurrences of motifSet = "cisbp" with motifSet = "encode" in the ./conf/motif.config file.
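
The motifSet substitution described above can be done with sed. This is a minimal sketch with a hypothetical function name; it assumes the setting appears in the config exactly as motifSet = "cisbp", with that spacing and double quotes.

```shell
# Write a copy of a config file with every motifSet = "cisbp" switched to "encode".
switch_motif_set() {
    config="$1"   # e.g. ./conf/motif.config
    out="$2"      # edited copy; the original is left unchanged
    sed 's/motifSet = "cisbp"/motifSet = "encode"/g' "$config" > "$out"
}

# Example:
#   switch_motif_set ./conf/motif.config ./conf/motif_encode.config
```

After running it, a quick grep for cisbp in the new file confirms that no occurrence with different spacing or quoting was missed.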

Command used for default option:
    nextflow run main.nf -c conf/test_default_plant.config -profile singularity,lsf
Results are stored here.
Command used for chromap option:
    nextflow run main.nf -c conf/test_chromap_plant.config -profile singularity,lsf
Results are stored here.
Command used for 10xgenomics option:
    nextflow run main.nf -c conf/test_10xgenomics_plant.config -profile singularity,lsf
Results are stored here.

Documentation

The scATACpipe workflow comes with documentation about the pipeline: usage and output.

Credits

scATACpipe was originally designed and written by Kai Hu, Haibo Liu, and Lihua Julie Zhu.

We thank the following people for their extensive assistance in the development of this pipeline: Nathan Lawson.

Bug report/Support

For help, bug reports, or feature requests, the developers would appreciate it if you created a GitHub issue by clicking here. If you would like to extend scATACpipe for your own needs, feel free to fork the repo.

Citations

Please cite scATACpipe [to be added] if you use it for your research.

A Template of Method can be found here.

A complete list of references for the tools used by scATACpipe can be found here.

Release notes

v0.1.0

* initial release

v0.1.1

* add web_gui for easy generation of config file
* add one more example for Arabidopsis thaliana
* minor improvements

Owner

Login: marykthompson
Kind: user

Repositories: 2
Profile: https://github.com/marykthompson

Citation (CITATIONS.md)

# nf-core/scatacpipe: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

* [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

* [MultiQC](https://www.ncbi.nlm.nih.gov/pubmed/27312411/)
    > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

## Software packaging/containerisation tools

* [Anaconda](https://anaconda.com)
    > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

* [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)
    > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

* [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)
    > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

* [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

* [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)
    > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total

Push event: 1

Last Year

Push event: 1

Dependencies

modules/nf-core/modules/multiqc/meta.yml cpan

assets/ArchR/DESCRIPTION cran

BiocGenerics * depends
GenomicRanges * depends
Matrix * depends
S4Vectors >= 0.9.25 depends
SummarizedExperiment * depends
data.table * depends
ggplot2 * depends
magrittr * depends
rhdf5 * depends
Biostrings * imports
ComplexHeatmap * imports
Rcpp >= 0.12.16 imports
Rsamtools * imports
chromVAR * imports
ggrepel * imports
grid * imports
gridExtra * imports
gtable * imports
gtools * imports
matrixStats * imports
motifmatchr * imports
nabor * imports
plyr * imports
stringr * imports
uwot * imports
