Bam-readcount - rapid generation of basepair-resolution sequence metrics

Bam-readcount - rapid generation of basepair-resolution sequence metrics - Published in JOSS (2022)

https://github.com/genome/bam-readcount

Scientific Fields

Biology Life Sciences - 40% confidence

Last synced: 6 months ago

Repository

Count bases in BAM/CRAM files

Basic Info

Host: GitHub
Owner: genome
License: mit
Language: CMake
Default Branch: master
Homepage:
Size: 20 MB

Statistics

Stars: 318
Watchers: 55
Forks: 96
Open Issues: 48
Releases: 7

Created about 14 years ago · Last pushed about 4 years ago

Metadata Files

Readme Contributing License Zenodo

bam-readcount


bam-readcount is a utility that runs on a BAM or CRAM file and generates low-level information about sequencing data at specific nucleotide positions. Its outputs include observed bases, read counts, summarized mapping and base qualities, strandedness information, mismatch counts, and position within the reads (see the "Output" section below).

Originally designed to help filter genomic mutation calls, the metrics bam-readcount outputs are also useful as input for variant detection tools and for resolving ambiguity between variant callers.

If you find bam-readcount useful in your work, please cite our paper:

Khanna et al., (2022). Bam-readcount - rapid generation of basepair-resolution sequence metrics. Journal of Open Source Software, 7(69), 3722. https://doi.org/10.21105/joss.03722

Installation

Docker

The latest release version of bam-readcount is available as a Docker image on DockerHub:

docker pull mgibio/bam-readcount

For details see the docker-bam-readcount repository.

Build

Requires a C++ toolchain and cmake. For details see BUILD.md.

git clone https://github.com/genome/bam-readcount 
cd bam-readcount
mkdir build
cd build
cmake ..
make
# Executable is
bin/bam-readcount

Usage

Run with no arguments for command-line help:

$ bam-readcount

Usage: bam-readcount [OPTIONS] <bam_file> [region]
Generate metrics for bam_file at single nucleotide positions.
Example: bam-readcount -f ref.fa some.bam

Available options:
  -h [ --help ]                         produce this message
  -v [ --version ]                      output the version number
  -q [ --min-mapping-quality ] arg (=0) minimum mapping quality of reads used
                                        for counting.
  -b [ --min-base-quality ] arg (=0)    minimum base quality at a position to
                                        use the read for counting.
  -d [ --max-count ] arg (=10000000)    max depth to avoid excessive memory
                                        usage.
  -l [ --site-list ] arg                file containing a list of regions to
                                        report readcounts within.
  -f [ --reference-fasta ] arg          reference sequence in the fasta format.
  -D [ --print-individual-mapq ] arg    report the mapping qualities as a comma
                                        separated list.
  -p [ --per-library ]                  report results by library.
  -w [ --max-warnings ] arg             maximum number of warnings of each type
                                        to emit. -1 gives an unlimited number.
  -i [ --insertion-centric ]            generate indel centric readcounts.
                                        Reads containing insertions will not be
                                        included in per-base counts

The optional [region] should be in the same format as samtools:

chromosome:start-stop
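
As a minimal sketch of how the usage line above maps to a scripted call, the following Python helper assembles an argument list from the documented options (-f, -q, -b) plus the optional region. The function name build_command and the file names are hypothetical, chosen only for illustration; the helper builds the list and leaves execution to the caller.

```python
def build_command(bam, reference=None, region=None, min_mapq=0, min_baseq=0):
    """Assemble a bam-readcount argument list (hypothetical helper).

    region, if given, is a (chromosome, start, stop) tuple that is
    rendered in the samtools-style chromosome:start-stop form.
    """
    cmd = ["bam-readcount", "-q", str(min_mapq), "-b", str(min_baseq)]
    if reference is not None:
        cmd += ["-f", reference]
    cmd.append(bam)
    if region is not None:
        chrom, start, stop = region
        cmd.append(f"{chrom}:{start}-{stop}")
    return cmd


if __name__ == "__main__":
    cmd = build_command("some.bam", reference="ref.fa", region=("chr1", 1000, 2000))
    print(" ".join(cmd))
    # To execute it, pass cmd to subprocess.run(cmd, capture_output=True, text=True);
    # this assumes bam-readcount is installed and on PATH.
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with unusual file names.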

The optional -l (--site-list) file should be tab-separated, no header, one region per line:

chromosome  start   end
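
A site list in this layout can be generated from a list of regions in a few lines. The sketch below is illustrative only: format_site_list is a hypothetical helper name, and the start-before-end check is an assumption about what makes a sensible region rather than documented behaviour of bam-readcount.

```python
def format_site_list(regions):
    """Render (chromosome, start, end) tuples as headerless tab-separated text."""
    lines = []
    for chrom, start, end in regions:
        start, end = int(start), int(end)
        if start > end:
            # Assumed sanity check; not taken from the bam-readcount docs.
            raise ValueError(f"start > end for {chrom}:{start}-{end}")
        lines.append(f"{chrom}\t{start}\t{end}\n")
    return "".join(lines)


if __name__ == "__main__":
    text = format_site_list([("chr1", 100, 100), ("chr2", 5, 9)])
    # Write the result to a file and pass that file with -l.
    with open("sites.tsv", "w") as handle:
        handle.write(text)
```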

CRAM support

When using CRAM files as input, if a reference is specified with -f, it will override whatever is in the CRAM header. Otherwise, the reference(s) encoded in the CRAM header or a lookup by MD5 at ENA will be used.

Wrappers/Parsers

Add bam-readcount counts to VCF - VAtools allows you to add read counts to VCFs from modern variant callers. Additional details are available.

Create CSV file - brc-parser converts bam-readcount output to a comma-separated long-format file.

Output

Output is tab-separated with no header to STDOUT, one line per position:

chr position    reference_base  depth   base:count:avg_mapping_quality:avg_basequality:avg_se_mapping_quality:num_plus_strand:num_minus_strand:avg_pos_as_fraction:avg_num_mismatches_as_fraction:avg_sum_mismatch_qualities:num_q2_containing_reads:avg_distance_to_q2_start_in_q2_reads:avg_clipped_length:avg_distance_to_effective_3p_end   ...

There is one set of :-separated fields for each reported base with statistics on the set of reads containing that base:

Field | Description
----- | -----------
base | The base, e.g. C
count | Number of reads
avg_mapping_quality | Mean mapping quality
avg_basequality | Mean base quality
avg_se_mapping_quality | Mean single-ended mapping quality
num_plus_strand | Number of reads on the plus/forward strand
num_minus_strand | Number of reads on the minus/reverse strand
avg_pos_as_fraction | Average position on the read as a fraction, calculated with respect to the length after clipping. This value is normalized to the center of the read: bases occurring strictly at the center of the read have a value of 1, those occurring strictly at the ends should approach a value of 0
avg_num_mismatches_as_fraction | Average number of mismatches on these reads per base
avg_sum_mismatch_qualities | Average sum of the base qualities of mismatches in the reads
num_q2_containing_reads | Number of reads with q2 runs at the 3’ end
avg_distance_to_q2_start_in_q2_reads | Average distance of position (as fraction of unclipped read length) to the start of the q2 run
avg_clipped_length | Average clipped read length
avg_distance_to_effective_3p_end | Average distance to the 3’ end of the read (as fraction of unclipped read length)
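
To make the layout concrete, here is a minimal Python sketch that splits one output line into its four leading columns and a per-base dictionary keyed by the field names listed above. The helper name parse_line and the sample line are invented for illustration (the numbers are not real bam-readcount output), and the sketch assumes every per-base entry carries all 14 colon-separated fields.

```python
FIELD_NAMES = (
    "base", "count", "avg_mapping_quality", "avg_basequality",
    "avg_se_mapping_quality", "num_plus_strand", "num_minus_strand",
    "avg_pos_as_fraction", "avg_num_mismatches_as_fraction",
    "avg_sum_mismatch_qualities", "num_q2_containing_reads",
    "avg_distance_to_q2_start_in_q2_reads", "avg_clipped_length",
    "avg_distance_to_effective_3p_end",
)
# Fields that are counts; everything else after "base" is parsed as a float.
INT_FIELDS = {"count", "num_plus_strand", "num_minus_strand", "num_q2_containing_reads"}


def parse_line(line):
    """Parse one tab-separated output line (hypothetical helper)."""
    cols = line.rstrip("\n").split("\t")
    chrom, position, ref_base, depth = cols[:4]
    bases = {}
    for entry in cols[4:]:
        if not entry:
            continue  # tolerate a trailing tab
        values = entry.split(":")
        if len(values) != len(FIELD_NAMES):
            raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(values)}: {entry}")
        record = {}
        for name, value in zip(FIELD_NAMES[1:], values[1:]):
            record[name] = int(value) if name in INT_FIELDS else float(value)
        bases[values[0]] = record
    return {
        "chr": chrom,
        "position": int(position),
        "reference_base": ref_base,
        "depth": int(depth),
        "bases": bases,
    }


if __name__ == "__main__":
    # Made-up line for illustration only.
    sample = (
        "chr1\t100\tA\t3\t"
        "A:2:60.00:35.00:0.00:1:1:0.50:0.01:20.00:0:0.00:100.00:0.45\t"
        "G:1:60.00:30.00:0.00:1:0:0.40:0.02:30.00:0:0.00:100.00:0.50"
    )
    parsed = parse_line(sample)
    print(parsed["depth"], parsed["bases"]["G"]["count"])
```

From the parsed record, a per-base fraction such as count divided by depth is a one-line computation, though whether depth is the appropriate denominator depends on the quality filters in use.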

Per-library output

With the -p option, each output line will have a set of {}-delimited results, one for each library:

chr position    reference_base  depth   library_1_name  {   base:count:avg_mapping_quality:avg_basequality:avg_se_mapping_quality:num_plus_strand:num_minus_strand:avg_pos_as_fraction:avg_num_mismatches_as_fraction:avg_sum_mismatch_qualities:num_q2_containing_reads:avg_distance_to_q2_start_in_q2_reads:avg_clipped_length:avg_distance_to_effective_3p_end   }   ...   library_N_name    {   base:count:avg_mapping_quality:avg_basequality:avg_se_mapping_quality:num_plus_strand:num_minus_strand:avg_pos_as_fraction:avg_num_mismatches_as_fraction:avg_sum_mismatch_qualities:num_q2_containing_reads:avg_distance_to_q2_start_in_q2_reads:avg_clipped_length:avg_distance_to_effective_3p_end   }
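
The per-library layout can be unpacked by walking the tab-separated tokens. The Python sketch below assumes, as the template above suggests, that each library name, each opening brace, and each closing brace is its own tab-delimited token; split_libraries is a hypothetical helper name and the sample line is invented for illustration.

```python
def split_libraries(line):
    """Group per-base entries by library name (hypothetical helper).

    Returns (leading_columns, libraries), where libraries maps each
    library name to a list of colon-split per-base entries.
    """
    cols = line.rstrip("\n").split("\t")
    leading = cols[:4]
    tokens = [tok for tok in cols[4:] if tok]
    libraries = {}
    i = 0
    while i < len(tokens):
        name = tokens[i]
        if i + 1 >= len(tokens) or tokens[i + 1] != "{":
            raise ValueError(f"expected '{{' after library name {name!r}")
        try:
            close = tokens.index("}", i + 2)
        except ValueError:
            raise ValueError(f"missing '}}' for library {name!r}") from None
        libraries[name] = [entry.split(":") for entry in tokens[i + 2:close]]
        i = close + 1  # continue with the next library name
    return leading, libraries


if __name__ == "__main__":
    # Made-up line for illustration only.
    sample = (
        "chr1\t100\tA\t3\t"
        "libA\t{\tA:2:60.00:35.00:0.00:1:1:0.50:0.01:20.00:0:0.00:100.00:0.45\t}\t"
        "libB\t{\tG:1:60.00:30.00:0.00:1:0:0.40:0.02:30.00:0:0.00:100.00:0.50\t}"
    )
    leading, libs = split_libraries(sample)
    for name, entries in libs.items():
        print(name, [entry[0] for entry in entries])
```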

Tutorial

For those who learn best by example, a brief tutorial is available here that uses bam-readcount to identify the Omicron SARS-CoV-2 variant of concern from raw sequence data.

Support

For support, please search for bam-readcount on Biostars, as many of the most frequently asked questions about bam-readcount have been answered there. For problems not addressed there, please open a GitHub issue or make a Biostars post.

Contributing

We welcome contributions! See Contributing for more details.

Owner

Name: The McDonnell Genome Institute
Login: genome
Kind: organization
Location: St. Louis, MO

Website: http://genome.wustl.edu/
Repositories: 167
Profile: https://github.com/genome

JOSS Publication

Bam-readcount - rapid generation of basepair-resolution sequence metrics

Published

January 29, 2022

DOI

10.21105/joss.03722

Volume 7, Issue 69, Page 3722

Authors

Ajay Khanna
Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO

David E. Larson
McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, Current Affiliation: Benson Hill, Inc. St. Louis, MO

Sridhar Nonavinkere Srivatsan
Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO

Matthew Mosior
Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, Current Affiliation: Moffitt Cancer Center, Tampa, FL

Travis E. Abbott
McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, Current Affiliation: Google, Inc. Mountain View, CA

Susanna Kiwala
McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO

Timothy J. Ley
Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, Siteman Cancer Center, Washington University School of Medicine, St. Louis, MO

Eric J. Duncavage
Department of Pathology, Washington University School of Medicine, St. Louis, MO

Matthew J. Walter
Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, Siteman Cancer Center, Washington University School of Medicine, St. Louis, MO

Jason R. Walker
McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO

Obi L. Griffith
Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, Siteman Cancer Center, Washington University School of Medicine, St. Louis, MO, Department of Genetics, Washington University School of Medicine, St. Louis, MO

Malachi Griffith
Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, Siteman Cancer Center, Washington University School of Medicine, St. Louis, MO, Department of Genetics, Washington University School of Medicine, St. Louis, MO

Christopher A. Miller
Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, Siteman Cancer Center, Washington University School of Medicine, St. Louis, MO

Editor

Lorena Pantano

GitHub Events

Total

Issues event: 6
Watch event: 13
Issue comment event: 13
Fork event: 2

Last Year

Issues event: 6
Watch event: 13
Issue comment event: 13
Fork event: 2

Committers

Last synced: 7 months ago

All Time

Total Commits: 253
Total Committers: 15
Avg Commits per committer: 16.867
Development Distribution Score (DDS): 0.585

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Dave Larson	d**n@g**u	105
Travis Abbott	t**t@g**u	59
Ajay Khanna	a**s@g**m	41
Chris Miller	c**r@g**u	13
sridhar0605	s**5@g**m	10
dlarson	d**n@1**d	8
Travis Abbott	t**t@g**m	6
Ben Ainscough	b**h@g**m	4
Scott Smith	s**t@c**g	1
Sam Brightman	s**n@g**m	1
Obi Griffith	o**h@g**m	1
Nathan Nutter	i**m@n**m	1
Morgan Taschuk	m**k@o**a	1
Indraniel Das	i**s@g**u	1
abbcdfinv	a**v@z**e	1

Committer Domains (Top 20 + Academic)

genome.wustl.edu: 4 oicr.on.ca: 1 nnutter.com: 1 cpan.org: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 90
Total pull requests: 17
Average time to close issues: 5 months
Average time to close pull requests: 3 months
Total issue authors: 71
Total pull request authors: 10
Average comments per issue: 3.01
Average comments per pull request: 0.88
Merged pull requests: 14
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 5
Pull requests: 0
Average time to close issues: 3 months
Average time to close pull requests: N/A
Issue authors: 5
Pull request authors: 0
Average comments per issue: 3.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

ernfrid (5)
sjackman (4)
friedue (4)
iranmdl (3)
lordzappo (2)
bebatut (2)
smacarthur (2)
stroke1989 (2)
zeronot (2)
crazyhottommy (2)
Souzavgp (2)
matnguyen (1)
mcfog1 (1)
SethosII (1)
rpauly (1)

Pull Request Authors

sridhar0605 (4)
apldx (3)
tabbott (3)
ernfrid (1)
colindaven (1)
bainscou (1)
morgantaschuk (1)
seqfu (1)
sjackman (1)
sambrightman (1)

Top Labels

Issue Labels

enhancement (6) joss-review-finished (4) bug (3)

Pull Request Labels

Packages

Total packages: 1
Total downloads: unknown

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 2

spack.io: bam-readcount

Bam-readcount generates metrics at single nucleotide positions.

Homepage: https://github.com/genome/bam-readcount
License: []
Latest release: 1.0.1
published over 2 years ago

Versions: 2
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent repos count: 0.0%

Forks count: 11.3%

Stargazers count: 12.9%

Average: 20.4%

Dependent packages count: 57.3%

Last synced: 6 months ago

Bam-readcount - rapid generation of basepair-resolution sequence metrics

Science Score: 98.0%
