atac_chip_preprocess
Preprocessing workflow for ATAC-seq and ChIP-seq data
Statistics
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 2
- Releases: 3
README.md
atac_chip_preprocess
- Overview
- Usage
- Options
- Output
- Resources
- Schedulers
- Software
Overview
atac_chip_preprocess is a containerized Nextflow pipeline for preprocessing ATAC-seq and ChIP-seq data.
The pipeline starts from a samplesheet that lists the fastq files and consists of these steps:
- Validation of the samplesheet to ensure that fastq files are not duplicated and that all file paths exist. If any validation fails, the process returns an informative error for debugging.
- Initial fastq QC with `fastqc`.
- Merging of lane/technical replicates per sample; see the samplesheet section below on how to indicate technical replicates (= multiple fastq files) per sample.
- Adapter and quality trimming with `fastp`.
- Mapping of reads with `bowtie2`, by default with the `--very-sensitive` flag (and `-X 2000` for paired-end data). The pipeline does not include a dedicated indexing step for the reference genome since this is a one-liner via `bowtie2-build`. The user needs to build the index upfront and then provide the path to the folder with the `*.bt2` index files via `--index`. The files are then found automatically in that folder. Only a single set of index files is expected in that folder.
- Duplicate marking with `samblaster`.
- Removal of alignments with MAPQ below 20, of non-primary or supplementary alignments, of reads not mapped to primary chromosomes (the regex to keep alignments is `chr[1-9,X,Y]`), of mitochondrial alignments and of duplicate reads, using `samtools` and combinations of GNU tools.
- For paired-end data, collection of insert size metrics with `picard CollectInsertSizeMetrics`.
- For ATAC-seq data, extraction of transposome insertion events (cutsites, that is the 5' end of alignments) using GNU tool combinations, output in gzipped BED format.
- Peak calling with `macs2`. For ATAC-seq the options `--keep-dup=all --nomodel --extsize 100 --shift -50 --min-length 250` are used by default; otherwise only `--keep-dup=all` is used, since the deduplicated BAM files serve as input. Peaks are then filtered against NGS blacklists (ENCODE + mitochondrial homologs in the nuclear genome, the latter for ATAC-seq only) using `bedtools`. This is species-specific. Currently, human and mouse are supported via the flag `--species` with either `hs` or `mm`.
- Calculation of Fractions Of Reads in Peaks (FRiPs) as a signal-to-noise QC metric with `featureCounts`.
- Creation of bigwig tracks (raw counts) for quick visual inspection of data quality with `bedtools genomecov`.
- Summary report collecting fastqc, trimming, alignment metrics and FRiP/featureCounts metrics with `MultiQC`.
- Output of all used software versions and the exact command lines per process and sample using custom scripts.
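The cutsite extraction step can be pictured with a small sketch. It assumes the alignments are already available as BED6 intervals (for example from `bedtools bamtobed`) and uses `awk` to reduce each interval to its 5' end. The file names are hypothetical, and the commands the pipeline actually runs may differ.

```bash
# Toy BED6 input: chrom, start, end, name, score, strand (0-based, half-open).
printf 'chr1\t100\t150\tread1\t42\t+\nchr1\t300\t350\tread2\t42\t-\n' > alignments.bed

# The 5' end is the start for plus-strand and end-1 for minus-strand alignments.
awk 'BEGIN { OFS = "\t" } {
  if ($6 == "+") { s = $2 } else { s = $3 - 1 }
  print $1, s, s + 1, $4, $5, $6
}' alignments.bed | gzip -c > cutsites.bed.gz

# Inspect the resulting 1-bp cutsite intervals.
gzip -dc cutsites.bed.gz
```

Each alignment becomes a 1-bp interval, which is the form later used for peak calling in ATAC-seq mode.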
Execute this command to run the pipeline on a tiny test dataset with minimal resources to explore outputs:
```bash
NXF_VER=23.04.0 nextflow run atpoint/atac_chip_preprocess -r main -profile docker,test --keep_merge --keep_trim
```
An overview of current software versions and exact command lines when using default settings of the pipeline can be found here.
Usage
The minimal parameters the user has to provide are the following ones:
- `--samplesheet`: path to a samplesheet csv file with three columns: `sample` (the sample name), `r1` (path to R1) and `r2` (path to R2), where `r2` can be empty. If it is empty, the sample is considered single-end.
- `--index`: path to a folder containing a `bowtie2` index with the typical `*.bt2` files. Note that this is the path to the folder, not the path to the index basename, as the pipeline finds the bt2 files automatically.
- `--species`: either `mm` or `hs` to let the peak caller know whether mouse or human data are provided, so that it uses the correct effective genome size.
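As an illustration of the samplesheet layout, here is a minimal sketch. It assumes that the csv carries a header line with the three column names; the sample names and paths are placeholders.

```bash
# Hypothetical samplesheet: a header line is assumed, paths are placeholders.
cat > samplesheet.csv <<'EOF'
sample,r1,r2
sampleA,/path/to/sampleA_R1.fastq.gz,/path/to/sampleA_R2.fastq.gz
sampleB,/path/to/sampleB_R1.fastq.gz,
EOF

# Report the layout per sample: an empty r2 column means single-end.
awk -F',' 'NR > 1 { print $1, ($3 == "" ? "single-end" : "paired-end") }' samplesheet.csv
```

Here `sampleA` would be treated as paired-end and `sampleB` as single-end.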
Note that the bowtie2 index must be produced upfront. We did not include this step in the pipeline because it is a single command: `bowtie2-build genome.fa idx`.
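A minimal sketch of preparing such an index folder is shown below. The names `genome.fa`, `index_folder` and `idx` are placeholders, and the build only runs if `bowtie2-build` and the FASTA file are actually present.

```bash
# Hypothetical paths: genome.fa is the reference FASTA, index_folder the target folder.
mkdir -p index_folder

# Build the index only if bowtie2-build and the FASTA are available.
if command -v bowtie2-build >/dev/null 2>&1 && [ -f genome.fa ]; then
  bowtie2-build genome.fa index_folder/idx
else
  echo "bowtie2-build or genome.fa not found, skipping index build" >&2
fi

# Pass the folder itself (not index_folder/idx) to the pipeline via --index.
echo "--index $(pwd)/index_folder"
```

Keeping one index per folder matches the expectation that only a single set of index files is found there.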
On our HPC we typically use the command below, with Apptainer as container engine (currently the profiles docker, singularity and apptainer are supported) and SLURM as executor. To use any other executor, the user needs to add it to the scheduler config file.
```bash
# Example for mouse ATAC-seq data
NXF_VER=23.04.0 nextflow run atpoint/atac_chip_preprocess -r main -profile apptainer,slurm --samplesheet path/to/samplesheet.csv --index path/to/indexfolder --species mm

# Example for mouse ChIP-seq data
NXF_VER=23.04.0 nextflow run atpoint/atac_chip_preprocess -r main -profile apptainer,slurm --samplesheet path/to/samplesheet.csv --index path/to/indexfolder --species mm --atacseq false
```
Use one of `-profile docker`, `-profile singularity` or `-profile apptainer` to select the container engine.
Options
The pipeline uses (in our opinion) reasonable defaults for all processing steps. Still, the following options exist for customization:
General options
- `--outdir`: path to the desired output folder collecting all results. By default, it is `./atac_chip_preprocess_results/` in the directory from which the pipeline is launched.
- `--atacseq`: a logical, set to `false` if processing something like ChIP-seq data; the default is `true` for ATAC-seq data.
Filtering options
- `--blacklist`: path to a BED file to filter peaks against. By default, the provided mm10 blacklist is used when `--species` is `mm`, and the hg38 one when it is `hs`.
- `--filter_blacklist`: logical, set to `false` to turn off any blacklist filtering, default `true`. If a species other than human or mouse is used, this can be set to `false` so that no filtering takes place; `--species` and `--blacklist` then have no effect and can be left at their defaults.
- `--flag_remove`: a numeric flag to be used with `samtools view -F`, indicating which alignments to remove. Default is 3332, which discards unmapped, non-primary and supplementary alignments as well as duplicates. See here for details.
- `--chr_regex`: a groovy-compatible regex to indicate which chromosomes to keep in the BAM alignments. Default is `chr[1-9,X,Y]`, which means keep everything starting with `chr` followed by a number or X/Y. That in turn removes typical decoys (`chrEBV`) and unplaced/random contigs such as `chrU...`. As a result, only the primary autosomes and sex chromosomes are kept.
- `--min_mapq`: an integer passed to `samtools view -q`, so that only alignments with a MAPQ of at least this value are kept; default is 20.
- `--fragment_length`: for single-end data, an average expected fragment length used to extend reads to fragments for bigwig creation and FRiP calculation, default is 250. It is only used if `--atacseq false`, as for ATAC-seq data everything is based on the transposome cutsites (that is the 5' ends of the alignments).
- `--keep_merge`: logical, whether to keep the merged fastq files; otherwise they are not published to the output directory.
- `--keep_trim`: logical, whether to keep the trimmed fastq files; otherwise they are not published to the output directory.
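To make the filtering defaults more concrete, here is a small sketch. It decomposes the default `--flag_remove` value into its SAM flag bits and shows which toy contig names the default `--chr_regex` would retain. It assumes the regex is matched at the start of the contig name, which is an assumption of this sketch, and the `samtools` line in the comment is only an approximation of the actual filtering call.

```bash
# Default --flag_remove as a sum of SAM flag bits:
# 4 = unmapped, 256 = not primary, 1024 = duplicate, 2048 = supplementary.
flag_remove=$((4 + 256 + 1024 + 2048))
echo "$flag_remove"

# Toy contig names kept by the default --chr_regex, assuming it is matched
# at the start of the name (the anchoring is an assumption of this sketch).
kept=$(printf '%s\n' chr1 chr10 chrX chrY chrM chrEBV chrUn_GL456239 | grep -E '^chr[1-9,X,Y]')
echo "$kept"

# The two numeric filters would then correspond to something like:
#   samtools view -b -q 20 -F 3332 in.bam > filtered.bam
```

In this toy list the mitochondrial, decoy and unplaced contigs do not match, while the autosomes and sex chromosomes are retained.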
Process options
- `--do_not_trim`: logical, whether to skip adapter and quality trimming.
- `--trim_additional`: additional arguments for the `fastp` trimming process beyond what is coded in the module definition, default is `--dont_eval_duplication -z 6` to skip duplicate level assessment and to compress outputs.
- `--align_additional`: additional arguments for the `bowtie2` alignment process beyond what is coded in the module definition, default is `-X 2000 --very-sensitive`, see the `bowtie2` manual.
- `--sort_additional`: additional arguments for the `samtools sort` process beyond what is coded in the module definition, default is `-l 6` to compress the resulting BAM file to that level. Do not add `-m` or `-@` here, as resources are hardcoded in the `nextflow.config` file.
- `--filter_additional`: additional arguments for the `samtools view` filtering process beyond what is described above and given with the `-q` and `-F` flags.
- `--macs_additional`: additional arguments for `macs2 callpeak`. The default always includes `--keep-dup=all`, since already deduplicated data are provided to that process. If ATAC-seq data are processed (default), `--nomodel --extsize 100 --shift -50 --min-length 250` is added to provide some smoothing to the pileup when using the cutsites for peak calling.
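The effect of the ATAC-seq defaults `--shift -50 --extsize 100` can be illustrated with simple arithmetic: the 5' end is first shifted by 50 bp against the read direction and then extended by 100 bp, which yields a window centered on the cutsite. The position below is a made-up plus-strand example.

```bash
# Toy cutsite position on the plus strand (hypothetical value).
cutsite=1000
shift_bp=-50
extsize=100

# MACS2 first shifts the 5' end, then extends it by extsize in 5'->3' direction.
start=$((cutsite + shift_bp))
end=$((start + extsize))
echo "window: ${start}-${end}"
```

The resulting 100 bp window spans 50 bp on either side of the cutsite, which is the smoothing referred to above.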
Output
By default, all outputs will be collected in ./atac_chip_preprocess_results/ relative to the directory from which the pipeline is launched. Use --outdir to change this. Output folders are:
- alignments_filtered: Sorted bam, bam index and flagstats for the filtered alignments.
- alignments_unfiltered: Sorted bam, bam index and flagstats for the unfiltered alignments.
- bed_files: In ATAC-seq mode gzipped BED files with the cutsites (5' end of filtered alignments) per sample. These are used for peak calling in the pipeline.
- bigwig: Bigwig files of filtered alignments, without any normalization, for a quick visual inspection of data quality.
- fastq_merged: The merged fastq files in case there were technical replicates per sample and `--keep_merge` was used.
- fastq_trimmed: The trimmed fastq files and trimming stats from `fastp` in case trimming was activated (default yes) and `--keep_trim` was used.
- fastqc: The fastqc per-sample outputs.
- frips: A file `frips_all.txt` with the FRiP score per sample. Samples with a FRiP of zero are currently not listed in this file.
- misc: Contains the chromsizes extracted from the BAM files and the insert sizes per sample in case of paired-end data.
- multiqc: The MultiQC summary report summarizing all fastqc, trim, alignment and FRiP stats.
- peaks: The narrowPeak and summit BED files per sample from `macs2`.
- pipeline_info: A file `command_lines.txt` summarizing all used command lines per process and sample, and a file `software_versions.txt` summarizing all software versions that were used in the pipeline.
Resources
The `nextflow.config` file contains hardcoded resource defaults for the individual processes, suitable for use on HPC or workstation environments. The most demanding process is the alignment step, which requires 16 threads and 16GB of RAM per sample to finish in a reasonable amount of time.
Schedulers
The `schedulers.config` file currently contains a single scheduler profile for SLURM as used on our HPC. With `-profile slurm`, jobs are submitted to a queue called `normal` with a maximum walltime of 8h. Custom profiles for other schedulers should be added to this config file.
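As a rough sketch of what an additional profile could look like, the snippet below writes a hypothetical LSF profile to a separate file. The profile name, the queue name and the walltime are placeholders, and the layout of the existing `schedulers.config` is assumed rather than known, so the block would need to be adapted before being merged into it.

```bash
# Hypothetical profile for an LSF cluster, modeled on the SLURM description above;
# the profile name, queue name and walltime are placeholders.
cat > lsf_profile.config <<'EOF'
profiles {
    lsf {
        process.executor = 'lsf'
        process.queue    = 'normal'
        process.time     = '8h'
    }
}
EOF

# Show the generated config fragment.
cat lsf_profile.config
```

Once such a profile is part of the scheduler config, it would be selected in the same way as the SLURM one, for example with `-profile apptainer,lsf`.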
Software
For reproducibility we recommend using one of the container profiles (`-profile docker`, `-profile singularity` or `-profile apptainer`), which take care of all software via a provided Docker-based image. By default, the pipeline does not support conda/mamba. However, the user can create such a software environment from the `environment.yml` file and then run the pipeline without specifying any of the above container engines, so that the software available in the current environment is used.
Owner
- Name: Alexander Bender (né Toenges)
- Login: ATpoint
- Kind: user
- Location: Germany
- Website: https://www.biostars.org/u/25721/
- Repositories: 6
- Profile: https://github.com/ATpoint
Postdoc, working in the context of inflammation and cardiovascular disease. Wannabe cyclist and salsa dancer. Dad. Not in that order.
Citation (CITATIONS.md)
# Citations

## Nextflow
- [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)
- [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

## Pipeline tools
- [bedtools](https://pubmed.ncbi.nlm.nih.gov/20110278/)
- [bowtie2](https://pubmed.ncbi.nlm.nih.gov/22388286/)
- [fastp](https://academic.oup.com/bioinformatics/article/34/17/i884/5093234)
- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
- [featureCounts](https://pubmed.ncbi.nlm.nih.gov/24227677/pi)
- [macs2](https://pubmed.ncbi.nlm.nih.gov/18798982/)
- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
- [picard](http://broadinstitute.github.io/picard/)
- [samblaster](https://pubmed.ncbi.nlm.nih.gov/24812344/)
- [SAMtools](https://doi.org/10.1093/gigascience/giab008)

## R packages
- [R](https://www.R-project.org/)

## Software packaging/container tools
- [Anaconda](https://anaconda.com)
- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)
- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)
- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)
- [Mambaforge](https://mamba.readthedocs.io/en/latest/installation.html)
GitHub Events
Total
- Watch event: 3
- Push event: 1
Last Year
- Watch event: 3
- Push event: 1
Dependencies
- actions/checkout v2 composite
- condaforge/mambaforge 4.14.0-0 build