Bam-readcount - rapid generation of basepair-resolution sequence metrics
Bam-readcount - rapid generation of basepair-resolution sequence metrics - Published in JOSS (2022)
Science Score: 98.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 8 DOI reference(s) in README and JOSS metadata
- ✓ Academic publication links: links to joss.theoj.org
- ✓ Committers with academic emails: 4 of 15 committers (26.7%) from academic institutions
- ✓ Institutional organization owner: organization genome has institutional domain (genome.wustl.edu)
- ✓ JOSS paper metadata: published in Journal of Open Source Software
Repository
Count bases in BAM/CRAM files
Statistics
- Stars: 318
- Watchers: 55
- Forks: 96
- Open Issues: 48
- Releases: 7
Metadata Files
README.md
bam-readcount
bam-readcount is a utility that runs on a BAM or CRAM file and generates low-level information about
sequencing data at specific nucleotide positions. Its outputs include observed bases,
readcounts, summarized mapping and base qualities, strandedness information,
mismatch counts, and position within the reads (see the "Output" section below).
Originally designed to help filter genomic mutation calls, the metrics bam-readcount outputs
are also useful as input for variant detection tools and for resolving ambiguity between
variant callers.
If you find bam-readcount useful in your work, please cite our paper:
Khanna et al., (2022). Bam-readcount - rapid generation of basepair-resolution sequence metrics. Journal of Open Source Software, 7(69), 3722. https://doi.org/10.21105/joss.03722
Installation
Docker
The latest release version of bam-readcount is available as a Docker image on DockerHub:
docker pull mgibio/bam-readcount
For details, see the docker-bam-readcount repository.
Build
Requires a C++ toolchain and cmake. For details see
BUILD.md.
git clone https://github.com/genome/bam-readcount
cd bam-readcount
mkdir build
cd build
cmake ..
make
# Executable is
bin/bam-readcount
Usage
Run with no arguments for command-line help:
$ bam-readcount
Usage: bam-readcount [OPTIONS] <bam_file> [region]
Generate metrics for bam_file at single nucleotide positions.
Example: bam-readcount -f ref.fa some.bam
Available options:
-h [ --help ] produce this message
-v [ --version ] output the version number
-q [ --min-mapping-quality ] arg (=0) minimum mapping quality of reads used
for counting.
-b [ --min-base-quality ] arg (=0) minimum base quality at a position to
use the read for counting.
-d [ --max-count ] arg (=10000000) max depth to avoid excessive memory
usage.
-l [ --site-list ] arg file containing a list of regions to
report readcounts within.
-f [ --reference-fasta ] arg reference sequence in the fasta format.
-D [ --print-individual-mapq ] arg report the mapping qualities as a comma
separated list.
-p [ --per-library ] report results by library.
-w [ --max-warnings ] arg maximum number of warnings of each type
to emit. -1 gives an unlimited number.
-i [ --insertion-centric ] generate indel centric readcounts.
Reads containing insertions will not be
included in per-base counts
The optional [region] should be in the same format as samtools:
chromosome:start-stop
The optional -l (--site-list) file should be tab-separated, no
header, one region per line:
chromosome start end
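As a rough sketch of how the site-list format described above might be produced and handed to the tool from a script, the following Python writes a headerless, tab-separated file and assembles an argument list. The helper names `write_site_list` and `build_command`, and any file names passed to them, are hypothetical; only the `-q`, `-b`, `-f`, and `-l` options are taken from the help text.

```python
import csv


def write_site_list(path, regions):
    """Write (chromosome, start, end) tuples as a headerless, tab-separated file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for chrom, start, end in regions:
            writer.writerow([chrom, int(start), int(end)])


def build_command(bam, reference=None, site_list=None, region=None,
                  min_mapq=0, min_baseq=0):
    """Assemble an argv list using only options shown in the help text.

    Hypothetical helper: it does not run anything, it only builds the list.
    """
    cmd = ["bam-readcount", "-q", str(min_mapq), "-b", str(min_baseq)]
    if reference is not None:
        cmd += ["-f", str(reference)]
    if site_list is not None:
        cmd += ["-l", str(site_list)]
    cmd.append(str(bam))
    if region is not None:
        cmd.append(region)  # samtools-style chromosome:start-stop
    return cmd
```

The resulting list could be passed to `subprocess.run(cmd, check=True, capture_output=True, text=True)`, assuming the `bam-readcount` executable is on the `PATH`.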
CRAM support
When using CRAM files as input, if a reference is specified with -f, it will override whatever is in
the CRAM header. Otherwise, the reference(s) encoded in the CRAM header or a lookup by
MD5 at ENA will be used.
Wrappers/Parsers
- Add bam-readcount counts to VCF: VAtools allows you to add readcounts to a VCF from modern variant callers.
- Create CSV file: brc-parser converts bam-readcount output to a comma-separated, long-format file.
Output
Output is tab-separated with no header to STDOUT, one line per
position:
chr position reference_base depth base:count:avg_mapping_quality:avg_basequality:avg_se_mapping_quality:num_plus_strand:num_minus_strand:avg_pos_as_fraction:avg_num_mismatches_as_fraction:avg_sum_mismatch_qualities:num_q2_containing_reads:avg_distance_to_q2_start_in_q2_reads:avg_clipped_length:avg_distance_to_effective_3p_end ...
There is one set of :-separated fields for each reported base with
statistics on the set of reads containing that base:
Field | Description
----- | -----------
base | The base, e.g. C
count | Number of reads
avg_mapping_quality | Mean mapping quality
avg_basequality | Mean base quality
avg_se_mapping_quality | Mean single-ended mapping quality
num_plus_strand | Number of reads on the plus/forward strand
num_minus_strand | Number of reads on the minus/reverse strand
avg_pos_as_fraction | Average position on the read as a fraction, calculated with respect to the length after clipping. This value is normalized to the center of the read: bases occurring strictly at the center of the read have a value of 1, and those occurring strictly at the ends should approach a value of 0
avg_num_mismatches_as_fraction | Average number of mismatches on these reads per base
avg_sum_mismatch_qualities | Average sum of the base qualities of mismatches in the reads
num_q2_containing_reads | Number of reads with q2 runs at the 3' end
avg_distance_to_q2_start_in_q2_reads | Average distance of position (as fraction of unclipped read length) to the start of the q2 run
avg_clipped_length | Average clipped read length
avg_distance_to_effective_3p_end | Average distance to the 3' end of the read (as fraction of unclipped read length)
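To make the layout concrete, here is a minimal Python sketch of a parser for one default-mode output line. It assumes the first four columns and the `:`-separated field order are exactly as listed in the format line above. The function names and the choice of which fields to treat as integers are illustrative assumptions, not part of bam-readcount.

```python
FIELD_NAMES = [
    "base", "count", "avg_mapping_quality", "avg_basequality",
    "avg_se_mapping_quality", "num_plus_strand", "num_minus_strand",
    "avg_pos_as_fraction", "avg_num_mismatches_as_fraction",
    "avg_sum_mismatch_qualities", "num_q2_containing_reads",
    "avg_distance_to_q2_start_in_q2_reads", "avg_clipped_length",
    "avg_distance_to_effective_3p_end",
]
# Assumption: these fields are read counts, so they are parsed as integers.
INT_FIELDS = {"count", "num_plus_strand", "num_minus_strand",
              "num_q2_containing_reads"}


def parse_base_stats(token):
    """Split one ':'-separated block into a dict keyed by field name."""
    values = token.split(":")
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(values)}")
    stats = {}
    for name, value in zip(FIELD_NAMES, values):
        if name == "base":
            stats[name] = value
        elif name in INT_FIELDS:
            stats[name] = int(value)
        else:
            stats[name] = float(value)
    return stats


def parse_line(line):
    """Parse one tab-separated output line into a nested dict."""
    columns = line.rstrip("\n").split("\t")
    chrom, position, ref_base, depth = columns[:4]
    bases = {}
    for token in columns[4:]:
        if token:  # tolerate a trailing tab
            stats = parse_base_stats(token)
            bases[stats["base"]] = stats
    return {
        "chr": chrom,
        "position": int(position),
        "reference_base": ref_base,
        "depth": int(depth),
        "bases": bases,
    }
```

With a parsed record `rec`, a per-base read fraction can be computed as `rec["bases"][b]["count"] / rec["depth"]` when `depth` is non-zero; whether that ratio is the appropriate allele-frequency estimate depends on the filters used for the run.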
Per-library output
With the -p option, each output line will have a set of {}-delimited
results, one for each library:
chr position reference_base depth library_1_name { base:count:avg_mapping_quality:avg_basequality:avg_se_mapping_quality:num_plus_strand:num_minus_strand:avg_pos_as_fraction:avg_num_mismatches_as_fraction:avg_sum_mismatch_qualities:num_q2_containing_reads:avg_distance_to_q2_start_in_q2_reads:avg_clipped_length:avg_distance_to_effective_3p_end } ... library_N_name { base:count:avg_mapping_quality:avg_basequality:avg_se_mapping_quality:num_plus_strand:num_minus_strand:avg_pos_as_fraction:avg_num_mismatches_as_fraction:avg_sum_mismatch_qualities:num_q2_containing_reads:avg_distance_to_q2_start_in_q2_reads:avg_clipped_length:avg_distance_to_effective_3p_end }
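The per-library layout can be handled in a similar way. The sketch below is a hedged illustration only: it assumes that library names contain no whitespace or braces, that each library block is wrapped in `{` and `}` as in the format line above, and that any whitespace may separate tokens. The function name is hypothetical, and field values are left as strings to keep the example short.

```python
import re

FIELD_NAMES = (
    "base", "count", "avg_mapping_quality", "avg_basequality",
    "avg_se_mapping_quality", "num_plus_strand", "num_minus_strand",
    "avg_pos_as_fraction", "avg_num_mismatches_as_fraction",
    "avg_sum_mismatch_qualities", "num_q2_containing_reads",
    "avg_distance_to_q2_start_in_q2_reads", "avg_clipped_length",
    "avg_distance_to_effective_3p_end",
)
# A library name followed by a brace-delimited block of per-base fields.
LIBRARY_BLOCK = re.compile(r"(\S+)\s*\{([^{}]*)\}")


def parse_per_library_line(line):
    """Parse one -p output line into {library: {base: {field: value}}}."""
    head = line.rstrip("\n").split(None, 4)
    if len(head) < 5:
        raise ValueError("line has no per-library blocks")
    chrom, position, ref_base, depth, rest = head
    libraries = {}
    for name, body in LIBRARY_BLOCK.findall(rest):
        per_base = {}
        for token in body.split():
            values = token.split(":")
            if len(values) != len(FIELD_NAMES):
                raise ValueError(f"unexpected field count in {token!r}")
            per_base[values[0]] = dict(zip(FIELD_NAMES, values))
        libraries[name] = per_base
    return {
        "chr": chrom,
        "position": int(position),
        "reference_base": ref_base,
        "depth": int(depth),
        "libraries": libraries,
    }
```

Keeping the values as strings avoids guessing at numeric types here; they can be converted afterwards in the same way as for the default output.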
Tutorial
For those who learn best by example, a brief tutorial is available that uses bam-readcount to identify the Omicron SARS-CoV-2 variant of concern from raw sequence data.
Support
For support, please search for bam-readcount on Biostars, as many of the most
frequently asked questions about bam-readcount have been answered there.
For problems not addressed there, please open a GitHub issue or make a Biostars post.
Contributing
We welcome contributions! See Contributing for more details.
Owner
- Name: The McDonnell Genome Institute
- Login: genome
- Kind: organization
- Location: St. Louis, MO
- Website: http://genome.wustl.edu/
- Repositories: 167
- Profile: https://github.com/genome
JOSS Publication
Bam-readcount - rapid generation of basepair-resolution sequence metrics
Authors
Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO
McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, Current Affiliation: Benson Hill, Inc. St. Louis, MO
Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO
Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, Current Affiliation: Moffitt Cancer Center, Tampa, FL
McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, Current Affiliation: Google, Inc. Mountain View, CA
McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO
Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, Siteman Cancer Center, Washington University School of Medicine, St. Louis, MO
Department of Pathology, Washington University School of Medicine, St. Louis, MO
Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, Siteman Cancer Center, Washington University School of Medicine, St. Louis, MO
McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO
Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, Siteman Cancer Center, Washington University School of Medicine, St. Louis, MO, Department of Genetics, Washington University School of Medicine, St. Louis, MO
Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, Siteman Cancer Center, Washington University School of Medicine, St. Louis, MO, Department of Genetics, Washington University School of Medicine, St. Louis, MO
Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, Siteman Cancer Center, Washington University School of Medicine, St. Louis, MO
Tags
genomics cpp sequencing
GitHub Events
Total
- Issues event: 6
- Watch event: 13
- Issue comment event: 13
- Fork event: 2
Last Year
- Issues event: 6
- Watch event: 13
- Issue comment event: 13
- Fork event: 2
Committers
Last synced: 5 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Dave Larson | d****n@g****u | 105 |
| Travis Abbott | t****t@g****u | 59 |
| Ajay Khanna | a****s@g****m | 41 |
| Chris Miller | c****r@g****u | 13 |
| sridhar0605 | s****5@g****m | 10 |
| dlarson | d****n@1****d | 8 |
| Travis Abbott | t****t@g****m | 6 |
| Ben Ainscough | b****h@g****m | 4 |
| Scott Smith | s****t@c****g | 1 |
| Sam Brightman | s****n@g****m | 1 |
| Obi Griffith | o****h@g****m | 1 |
| Nathan Nutter | i****m@n****m | 1 |
| Morgan Taschuk | m****k@o****a | 1 |
| Indraniel Das | i****s@g****u | 1 |
| abbcdfinv | a****v@z****e | 1 |
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 90
- Total pull requests: 17
- Average time to close issues: 5 months
- Average time to close pull requests: 3 months
- Total issue authors: 71
- Total pull request authors: 10
- Average comments per issue: 3.01
- Average comments per pull request: 0.88
- Merged pull requests: 14
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 5
- Pull requests: 0
- Average time to close issues: 3 months
- Average time to close pull requests: N/A
- Issue authors: 5
- Pull request authors: 0
- Average comments per issue: 3.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- ernfrid (5)
- sjackman (4)
- friedue (4)
- iranmdl (3)
- lordzappo (2)
- bebatut (2)
- smacarthur (2)
- stroke1989 (2)
- zeronot (2)
- crazyhottommy (2)
- Souzavgp (2)
- matnguyen (1)
- mcfog1 (1)
- SethosII (1)
- rpauly (1)
Pull Request Authors
- sridhar0605 (4)
- apldx (3)
- tabbott (3)
- ernfrid (1)
- colindaven (1)
- bainscou (1)
- morgantaschuk (1)
- seqfu (1)
- sjackman (1)
- sambrightman (1)
Packages
- Total packages: 1
- Total downloads: unknown
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 2
spack.io: bam-readcount
Bam-readcount generates metrics at single nucleotide positions.
- Homepage: https://github.com/genome/bam-readcount
- License: []
- Latest release: 1.0.1 (published over 2 years ago)
