Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
✓DOI references
Found 5 DOI reference(s) in README -
✓Academic publication links
Links to: biorxiv.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (21.5%) to scientific vocabulary
Repository
Quick ATAC-seq QC
Basic Info
Statistics
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
quaqc: QUick ATAC-seq Quality Control
Also works with any unspliced DNA-seq experiment! Compatible with plant genomes!!
Installation
Requires gcc/clang and GNU Make, tested with macOS and Linux. Basic install:
sh
git clone https://github.com/bjmt/quaqc # Or download latest release
cd quaqc
make release-full
make test # Make sure quaqc produces the expected outputs (optional)
make install # Copy quaqc + manual to /usr/local (optional, may require sudo)
See INSTALL for configuration options.
Alternatively for conda users, grab it from Bioconda:
```sh
Example environment setup:
conda create -y --name quaqc -c conda-forge -c bioconda quaqc ```
Quick start
To see a basic description of all available parameters, run ./quaqc -h. For
more detailed help, see the manual page.
quaqc comes built-in with sensible defaults, meaning few parameters must be specified for regular usage. (The hardcoded defaults can be changed before compilation by editing quaqc.h.) To run quaqc using the test data:
sh
./quaqc -v --output-dir . --peaks test/peak.bed --tss test/tss.bed \
--target-list test/target.bed test/reads.bam
cat ./reads.quaqc.txt
This command changes the location of the output QC report to the current directory,
and uses the peak and TSS files to generate additional stats. Since this example
BAM is just a subset of reads contained within a small region of the Arabidopsis
genome, the target.bed
file restricts quaqc to only consider that region of the genome (and thus adjust
how it calculates the stats). See a copy of the output
report here.
Note that quaqc can also save the final reads passing all filters to a new BAM
file via the -S/--keep flag. (This will increase the runtime.)
quaqcr
For those who prefer to analyze their data with R, a companion package
quaqcr is available. This package contains
functions for both executing quaqc from within R as well as analyzing all of
its outputs made available when outputting the results in JSON format. See
the README for a detailed tutorial.
Help
Please open an issue on GitHub.
Citation
Please cite the following journal article if you find this software useful in your research:
Tremblay, B.J.M. and Qüesta, J.I. (2024). quaqc: efficient and quick ATAC-seq quality control and filtering. Bioinformatics 40, btae649. https://doi.org/10.1093/bioinformatics/btae649
Additional use cases
Creating insertion BED files for use by MACS2/3
A common method for calling ATAC-seq peaks is to provide the reads as a
BED file in order to trick MACS2/3 into using individual reads during pileup
instead of the fragment coordinates in paired-end data. Using the --bed
option, this can be done using quaqc, generating identical output as
bedtools bamtobed. For example:
```sh quaqc -0 --bed Sample.bam
macs3 callpeak [...] \ --treatment Sample.bed.gz --format BED \ --call-summits --keep-dup all \ --shift -75 --extsize 150 \ --nomodel --nolambda ```
However since we only care about the 5-prime ends of each read (i.e. the
insertion sites) when calling peaks this way, we can also add the --bed-ins
flag to tell quaqc to only output the 5-prime coordinates of the reads with
no additional BED fields, saving a decent amount of disk space in a way
which will not affect MACS2/3.
Creating bedGraph files centered around the 5-prime ends of reads
Typically read pileups of ATAC-seq data (such as bedGraph files) are made in such a way as to visualize the sites of transposition, which are the 5-prime ends of the reads. This means needing to resize each read before creating a bedGraph file. Both of these steps can be performed simultaneously using quaqc:
sh
quaqc --bedGraph --bedGraph-qlen 150 Sample.bam
In this example, a bedGraph file is created where each read is resized
to 150 bp, centered around the 5-prime end of each read. (To use the original read sizes without resizing, simply set --bedGraph-qlen to 0.) This functionality can also be used to visualize the exact insertions, adjusting for the +4/-5 transposition shift:
sh
quaqc --bedGraph --bedGraph-tn5 --bedGraph-qlen 1 Sample.bam
The +4/-5 shift can also be adjusted to use custom values via --tn5-fwd and --tn5-rev.
Filtering a single-cell ATAC-seq BAM by cell type
Using a set of barcodes, quaqc can subset a BAM by making use of the --rg-list and
--keep options. For example, using the common CB SAM flag typically used for storing
barcodes:
sh
quaqc -0 -A --rg-list barcodes.txt --rg-tag CB --keep --keep-ext .MyCellType.bam Sample.bam
The -0 flag prevents quaqc from creating a QC report, thus making the output BAM
the only file created by quaqc. The -A flag can be used if this BAM has already been
filtered, and no additional filtering is desired.
Fast iteration of the effects of different read quality thresholds
quaqc is fairly performant, but it can still require some patience for decently
sized BAMs (e.g., about one minute for 80 million reads from Arabidopsis ATAC-seq).
To substantially speed up quaqc to iterate through different filtering
thresholds, make use of the -n/--target-names to restrict quaqc to only
look at single chromosomes. This is often enough to get an idea of the quality of
the data while getting a several fold speed up (e.g., restricting quaqc to the
first chromosome of Arabidopsis takes just over 10 seconds for the previously
mentioned example).
For example, to try out several MAPQ thresholds for an Arabidopsis ATAC-seq sample by only scanning chromosome 1 (which is just '1' for this version of TAIR10):
sh
for i in 5 10 15 20 25 30 ; do
quaqc -n 1 -q $i -i "MAPQ >= ${i}" -O .quaqc.mapq${i}.txt Sample.bam
done
Now, six different MAPQ thresholds have been tested in the time it would take
to process the entire BAM file once across all chromosomes. (The -i/--title
flag is used to add a unique title to each run.)
Quick motif footprints
The TSS pileup functionality of quaqc can be used to instead generate motif
footprints. Note that these are only useful for testing, as no correction
of base composition is peformed to reduce Tn5 transposase insertion biases. The output
must be saved in JSON format to recover the footprint data, which can then
be plotted with an external program (such as quaqcr).
In this example code, the --footprint preset will adjust the TSS pileup
to produce single base resolution data of Tn5 transposase insertion
frequency. The --nfr preset is also used to only consider nucleosome-free
reads.
sh
quaqc --nfr --footprint --tss motif.bed --json sample_motif.json Sample.bam
The metrics
See this annotated report for a description of the metrics.
A note on additional metrics
The set of metrics output by quaqc are those that I personally have found useful. If you can think of additional important ones, please create an issue on GitHub and I would be happy to implement them if it seems feasible to do so.
Extra
A note on the default Tn5 shift values (+4/-5)
It is standard practice to adjust the start coordinates of ATAC-seq
fragments to account for the ~9 bp footprint of the Tn5 transposase by
adding 4 bp to the start and substracting 5 bp from the end (+4/-5).
However, some articles such as this one
instead suggest +4/-4 might better capture the true Tn5 offset and allow
for better estimation of the sequence composition bias of transposition.
To use this alternative offset in quaqc, simply add --tn5-rev 4.
Information on input BAMs
quaqc is meant to work with unfiltered BAMs produced by sequence
alignment tools such as bowtie2. However, some processing
steps are recommended and/or required. These include:
samtools fixmateso that quaqc can compute accurate mate and fragment size metrics for PE experiments (optional, though not having proper read mate data may require using the--use-nomateflag to stop quaqc from filtering all PE reads)samtools sortif the reads are not already coordinate sorted (required)samtools markdup(or an equivalent command) so that quaqc can compute read duplication metrics (optional)samtools indexto generate an accompanying index file (optional; quaqc will create one if missing)
My personal recommendation is to run all of these commands simultaneously during sequence alignment. For example:
sh
[bowtie2/etc ... ] \
| samtools fixmate -m -u - - \
| samtools sort -u - \
| samtools markdup --write-index - file.bam
This requires a sequence aligner which can output its results to stdout,
as well as being grouped by read name. For further information check out
this article.
quaqc only works with indexed BAMs, though actually indexing them before using quaqc is optional. If no index can be found then quaqc will attempt to create the index.
CRAM
As far as I can tell CRAM files work fine as well, as long as the reference is available locally. See the notes in the installation instructions about compiling quaqc with libcurl if you need to work with CRAMs depending on remote references. The Bioconda version is dynamically linked to the full version of HTSlib, making it fully compatible with any type of CRAM.
Long reads
Some quick testing shows that quaqc seems to work with DNA-seq long read BAMs too. However if you try it out and see obvious errors in the output, please open an issue and I will do my best to fix the problem. Overall this is probably the wrong tool to use to run QC on long reads. Keep in mind quaqc will not properly handle supplementary alignments.
Spliced reads
I have not tested quaqc with BAMs containing spliced reads (e.g. RNA-seq), but I expect most of the statistics (such as alignment and fragment lengths) would be incorrectly calculated. Even the number of reads may be wrongly counted, since quaqc is not set up to properly handle supplementary alignments.
Owner
- Name: Benjamin Jean-Marie Tremblay
- Login: bjmt
- Kind: user
- Repositories: 1
- Profile: https://github.com/bjmt
Citation (CITATION.cff)
cff-version: 1.2.0
message: 'If you use this software, please cite it as below.'
authors:
- given-names: Benjamin J.M.
family-names: Tremblay
orcid: 'https://orcid.org/0000-0002-7441-2951'
- given-names: Julia I.
family-names: Qüesta
orcid: 'https://orcid.org/0000-0002-0462-8423'
title: 'quaqc: quick ATAC-seq quality control'
version: 1.1
date-released: '2024-10-31'
url: 'https://github.com/bjmt/quaqc'
preferred-citation:
type: article
authors:
- given-names: Benjamin J.M.
family-names: Tremblay
orcid: 'https://orcid.org/0000-0002-7441-2951'
- given-names: Julia I.
family-names: Qüesta
orcid: 'https://orcid.org/0000-0002-0462-8423'
doi: 10.1093/bioinformatics/btae649
journal: 'Bioinformatics'
title: >-
quaqc: efficient and quick ATAC-seq quality control and
filtering
volume: 40
issue: 11
start: btae649
year: 2024
GitHub Events
Total
- Issues event: 2
- Watch event: 5
- Issue comment event: 4
- Push event: 21
- Create event: 8
Last Year
- Issues event: 2
- Watch event: 5
- Issue comment event: 4
- Push event: 21
- Create event: 8
Dependencies
- actions/checkout v4 composite