perbase

Per-base per-nucleotide depth analysis

https://github.com/sstadick/perbase

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 8 committers (12.5%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.5%) to scientific vocabulary

Keywords

bioinformatics cli-app rust

Keywords from Contributors

genomics
Last synced: 6 months ago · JSON representation ·

Repository

Per-base per-nucleotide depth analysis

Basic Info
  • Host: GitHub
  • Owner: sstadick
  • License: mit
  • Language: Rust
  • Default Branch: master
  • Homepage:
  • Size: 829 KB
Statistics
  • Stars: 138
  • Watchers: 5
  • Forks: 17
  • Open Issues: 18
  • Releases: 35
Topics
bioinformatics cli-app rust
Created over 5 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog License Citation

README.md

Publish Rust API docs Crates.io Conda

A highly parallelized utility for analyzing metrics at a per-base level.

If a metric is missing, or performance is lacking. Please file a bug/feature ticket in issues.

Why?

Why perbase when so many other tools are out there? perbase leverages Rust's concurrency system to automagically parallelize over your input regions. This leads to orders of magnitude faster runtimes that scale with the compute resources that you have available. Additionally, perbase aims to be more accurate than other tools. E.g.: perbase counts DELs toward depth, bam-readcount does not, perbase does not count REF_SKIPs toward depth, sambamba does.

Installation

```bash conda install -c bioconda perbase

OR

cargo install perbase

OR

brew install perbase ```

You can also download a binary from the releases page.

Tools

base-depth

The base-depth tool walks over every position in the BAM/CRAM file and calculates the depth, as well as the number of each nucleotide at the given position. Additionally, it counts the numbers of Ins/Dels at each position.

The output columns are as follows:

| Column | Description | | -------------- | -------------------------------------------------------------------------------------------------- | | REF | The reference sequence name | | POS | The position on the reference sequence | | REFBASE | The reference base at the position, column excluded if no reference was supplied | | DEPTH | The total depth at the position SUM(A, C, T, G, N, R, Y, S, W, K, M, DEL) | | A | Total A nucleotides seen at this position | | C | Total C nucleotides seen at this position | | G | Total G nucleotides seen at this position | | T | Total T nucleotides seen at this position | | N | Total N nucleotides seen at this position | | R | Total R nucleotides seen at this position | | Y | Total Y nucleotides seen at this position | | S | Total S nucleotides seen at this position | | W | Total W nucleotides seen at this position | | K | Total K nucleotides seen at this position | | M | Total M nucleotides seen at this position | | INS | Total insertions that start at the base to the right of this position | | DEL | Total deletions covering this position | | REFSKIP | Total reference skip operations covering this position | | FAIL | Total reads failing filters that covered this position (their bases were not counted toward depth) | | COUNTOFMATERESOLUTIONS | Total number times that mate resolution needed to be done | | NEARMAX_DEPTH | Flag to indicate if this position came within 1% of the max depth specified |

bash perbase base-depth ./test/test.bam

Example output

text REF POS DEPTH A C G T N R Y S W K M INS DEL REF_SKIP FAIL COUNT_OF_MATE_RESOUTIONS NEAR_MAX_DEPTH chr1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 false chr1 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 false chr1 3 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 false chr1 4 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 false chr1 5 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 false chr1 6 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 false chr1 7 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 false chr1 8 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 false chr1 9 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 false

If the --mate-fix flag is passed, each position will first check if there are any mate overlaps and choose the mate with the hightest MAPQ, breaking ties by choosing the first mate that passes filters. Mates that are discarded are not counted toward FAIL or DEPTH.

If the --reference-fasta is supplied, the REF_BASE field will be filled in. The reference must be indexed and match the BAM/CRAM header of the input.

The output can be compressed and indexed as follows:

```bash perbase base-depth -Z ./test/test.bam -o output.tsv.gz tabix -S 1 -s 1 -b 2 -e 2 ./output.tsv.gz

Query all positions overlapping region

tabix output.tsv.gz chr1:5-10 ```

If the --mate-fix flag is passed, each position will first check if there are any mate overlaps and resolve based on the mate-resolution-strategy, mates that are discarded are not counted toward FAIL or DEPTH.

All strategies first check user-based read filters. If one mate fails filters, the other is chosen. If both fail, the first mate is chosen by default. For reads that are deletions / ref skips or lack a base call, all strategies fall back to the Original strategy (MAPQ → first in pair).

| Strategy | Priority 1 | Priority 2 | Priority 3 (Tie-breaker) | Notes | |----------|------------|------------|--------------------------|-------| | BaseQualMapQualFirstInPair | Higher base quality | Higher MAPQ | First mate in pair | Standard quality-first approach | | BaseQualMapQualIUPAC | Higher base quality | Higher MAPQ | IUPAC code (e.g., A+G→R) | Returns ambiguity codes for ties | | BaseQualMapQualN | Higher base quality | Higher MAPQ | N (unknown base) | Conservative, marks ambiguous as N | | MapQualBaseQualFirstInPair | Higher MAPQ | Higher base quality | First mate in pair | Prioritizes mapping confidence | | MapQualBaseQualIUPAC | Higher MAPQ | Higher base quality | IUPAC code (e.g., A+G→R) | Mapping-first with ambiguity codes | | MapQualBaseQualN | Higher MAPQ | Higher base quality | N (unknown base) | Mapping-first, conservative, marks ambiguous as N | | IUPAC | — | — | IUPAC code | Always returns IUPAC code for different bases, same bases return themselves (A+A→A) | | N | — | — | N or base | Returns N for different bases, same bases return themselves (A+A→A) | | Original | Higher MAPQ | First mate in pair | First mate (default) | Simple MAPQ-based strategy |

IUPAC Ambiguity Codes

When IUPAC strategies are used, the following codes are returned for base combinations:

| Base 1 | Base 2 | IUPAC Code | Meaning | |--------|--------|------------|---------| | A | G | R | puRine (A or G) | | C | T | Y | pYrimidine (C or T) | | G | C | S | Strong (G or C) | | A | T | W | Weak (A or T) | | G | T | K | Keto (G or T) | | A | C | M | aMino (A or C) | | Any | Same | Original | Identical bases return themselves | | Any | Other | N | Any combination not listed above |

Strategy Selection Guide

  • Use BaseQual strategies when base quality is the most reliable indicator of accuracy
  • Use MapQual strategies when mapping quality is more trustworthy (e.g., repetitive regions)
  • Use IUPAC variants when you want to preserve ambiguity information for downstream analysis
  • Use N variants when you prefer conservative base calling
  • Use FirstInPair variants when you want deterministic results without ambiguity codes
  • Use IUPAC/N strategies when you don't trust quality scores and want base-only decisions
  • Use Original for backwards compatibility or simple MAPQ-based selection

Usage:

```text perbase-base-depth 0.10.3 Seth Stadick sstadick@gmail.com Calculate the depth at each base, per-nucleotide

USAGE: perbase base-depth [FLAGS] [OPTIONS]

FLAGS: -Z, --bgzip
Optionally bgzip the output

-h, --help                      
        Prints help information

-k, --keep-zeros                
        Keep positions even if they have 0 depth

-m, --mate-fix                  
        Fix overlapping mates counts, see docs for full details

-M, --skip-merging-intervals    
        Skip merging together regions specified in the optional BED or BCF/VCF files.

        **NOTE** If this is set it could result in duplicate output entries for regions that overlap. **NOTE** This
        may cause issues with downstream tooling.
-V, --version                   
        Prints version information

-z, --zero-base                 
        Output positions as 0-based instead of 1-based

OPTIONS: -B, --bcf-file A BCF/VCF file containing positions of interest. If specified, only bases from the given positions will be reported on -b, --bed-file A BED file containing regions of interest. If specified, only bases from the given regions will be reported on -C, --channel-size-modifier The fraction of a gigabyte to allocate per thread for message passing, can be greater than 1.0 [default: 0.15] -c, --chunksize The ideal number of basepairs each worker receives. Total bp in memory at one time is (threads - 2) * chunksize [default: 1000000] -L, --compression-level The level to use for compressing output (specified by --bgzip) [default: 2]

-T, --compression-threads <compression-threads>
        The number of threads to use for compressing output (specified by --bgzip) [default: 4]

-F, --exclude-flags <exclude-flags>                          
        SAM flags to exclude, recommended 3848 [default: 0]

-f, --include-flags <include-flags>                          
        SAM flags to include [default: 0]

-M, --mate-resolution-strategy <mate-resolution-strategy>
        If `mate_fix` is true, select the method to use for mate fixing [default: original]

-D, --max-depth <max-depth>
        Set the max depth for a pileup. If a positions depth is within 1% of max-depth the `NEAR_MAX_DEPTH` output
        field will be set to true and that position should be viewed as suspect [default: 100000]
-Q, --min-base-quality-score <min-base-quality-score>
        Minium base quality for a base to be counted toward [A, C, T, G]. If the base is less than the specified
        quality score it will instead be counted as an `N`. If nothing is set for this no cutoff will be applied
-q, --min-mapq <min-mapq>
        Minimum MAPQ for a read to count toward depth [default: 0]

-o, --output <output>                                        
        Output path, defaults to stdout

    --ref-cache-size <ref-cache-size>
        Number of Reference Sequences to hold in memory at one time. Smaller will decrease mem usage [default: 10]

-r, --ref-fasta <ref-fasta>                                  
        Indexed reference fasta, set if using CRAM

-t, --threads <threads>                                      
        The number of threads to use [default: 10]

ARGS:
Input indexed BAM/CRAM to analyze

```

only-depth

The only-depth tool walks over the input BAM/CRAM file and calculates the depth over all positions specified by either a BED file or in the BAM/CRAM header. Adjacent positions that have the same depth will be merged together to form a non-inclusive range (see example output).

There are two distinct modes that only-depth can run in, gated by the --fast-mode flag. When running in fast-mode, only depth over the area a read covers is only determined by the reads start and end postions, and no cigar related info is taken into account. --mate-fix may still be used in this mode, and areas where mates overlap will not be counted twice.

Without the --fast-mode flag, the depth at each position is determined in a manner similar to base-depth where DEL will count toward depth, but REF_SKIP will not. Additionally, any reads that fail the --exclude-flags will not be counted toward depth. Lastly, --mate-fix can be applied to avoid counting regions twice where mates may overlap.

Regarding mate fixes, perbase will make "fixes" based only on the counted regions in a read. For example, if you have a read that goes from "chr1:0-1000" with a CIGAR of "25M974N1M", and the mate aligns nicely at "chr1:45-70" with CIGAR "25M", the mate will count toward the depth over "chr1:45-74". This is in contrast to other tools that will reject the mate even though it overlaps a region of R1 that is not counted toward depth.

For the fastest possible output, use only-depth --fast-mode.

Note that it is possible that two adjacent positions may not merge if they fall at a --chunksize boundary. If this is an issue you can set the --chunksize to the size of the largest contig in question. At a future date this may be fixed or a post processing tool may be provided to fix it. For most use cases this should not be a problem. Additionally, you can pipe into merge-adjacent which will fix it as well. EX: perbase only-depth -m file.bam | perbase merge-adjacent > out.tsv.

Example output of perbase only-depth --mate-fix --zero-base ./test/test.bam:

text REF POS END DEPTH chr2 0 4 1 chr2 4 9 2 chr2 9 12 3 chr2 12 14 2 chr2 14 17 3 chr2 17 19 4 chr2 19 23 5 chr2 23 34 4 chr2 34 39 3 chr2 39 49 1 chr2 49 54 2 chr2 54 64 3 chr2 64 74 4 chr2 74 79 3 chr2 79 84 2 chr2 84 89 1

If a BED-like output is needed, --bed-format -z flags can be set, which will write a 0-based, no-header TSV output with an empty 4th column and the depth in the 5th column.

Usage:

```text Calculate the only the depth at each base

USAGE: perbase only-depth [FLAGS] [OPTIONS]

FLAGS: --bed-format
Output BED-like output format with the depth in the 5th column. Note, -z can be used with this to change coordinates to 0-based to be more BED-like -Z, --bgzip
Optionally bgzip the output

-x, --fast-mode                 
        Calculate depth based only on read starts/stops, see docs for full details

-h, --help                      
        Prints help information

-k, --keep-zeros                
        Keep positions even if they have 0 depth

-m, --mate-fix                  
        Fix overlapping mates counts, see docs for full details

-n, --no-merge                  
        Skip merging adjacent bases that have the same depth

-M, --skip-merging-intervals    
        Skip mergeing togther regions specified in the optional BED or BCF/VCF files.

        **NOTE** If this is set it could result in duplicate output entries for regions that overlap. **NOTE** This
        may cause issues with downstream tooling.
-V, --version                   
        Prints version information

-z, --zero-base                 
        Output positions as 0-based instead of 1-based

OPTIONS: -B, --bcf-file A BCF/VCF file containing positions of interest. If specified, only bases from the given positions will be reported on. Note that it may be more efficient to calculate depth over regions if your positions are clustered tightly together -b, --bed-file A BED file containing regions of interest. If specified, only bases from the given regions will be reported on -C, --channel-size-modifier The fraction of a gigabyte to allocate per thread for message passing, can be greater than 1.0 [default: 0.001] -c, --chunksize The ideal number of basepairs each worker receives. Total bp in memory at one time is (threads - 2) * chunksize [default: 1000000] -L, --compression-level The level to use for compressing output (specified by --bgzip) [default: 2]

-T, --compression-threads <compression-threads>
        The number of threads to use for compressing output (specified by --bgzip) [default: 4]

-F, --exclude-flags <exclude-flags>                    
        SAM flags to exclude, recommended 3848 [default: 0]

-f, --include-flags <include-flags>                    
        SAM flags to include [default: 0]

-q, --min-mapq <min-mapq>                              
        Minimum MAPQ for a read to count toward depth [default: 0]

-o, --output <output>                                  
        Output path, defaults to stdout

-r, --ref-fasta <ref-fasta>                            
        Indexed reference fasta, set if using CRAM

-t, --threads <threads>                                
        The number of threads to use [default: 32]

ARGS:
Input indexed BAM/CRAM to analyze ```

merge-adjacent

merge-adjacent is a utility to merge overlapping regions in a BED-like file.

It will take a file with four columns and no header as long as the columns are like:

text <contig>\t<start>\t<stop>\t<depth>\n

Or it can take files with three columns with headers that are like

text <REF|chrom>\t<POS|chromStart>\t<END|chromEnd>\t<DEPTH|COV>

The END|chromEnd column is optional.

``text perbase-merge-adjacent 0.7.5-alpha.0 Seth Stadick <sstadick@gmail.com> Merge adjacent intervals that have the same depth. Input must be sorted like:sort -k1,1 -k2,2n in.bed > in.sorted.bed`

Generally accepts any file with no header tha is \t\t\t. The is optional. See documentation for explaination of headers that are accepted.

USAGE: perbase merge-adjacent [FLAGS] [OPTIONS] [in-file]

FLAGS: -Z, --bgzip
Optionally bgzip the output

-h, --help         
        Prints help information

-n, --no-header    
        Indicate if the input file does not have a header

-V, --version      
        Prints version information

OPTIONS: -T, --compression-level The level to use for compressing output (specified by --bgzip) [default: 2]

-T, --compression-threads <compression-threads>
        The number of threads to use for compressing output (specified by --bgzip) [default: 32]

-o, --output <output>                              
        The output location, defaults to STDOUT

ARGS:
Input bed-like file, defaults to STDIN ```

EX:

bash perbase only-depth indexed.bam | perbase merge-adjacent > out.tsv

Similar Projects

Owner

  • Name: Seth
  • Login: sstadick
  • Kind: user
  • Location: Maryland
  • Company: Bio-Rad Laboratories

Rust, Mojo, Bioinformatics.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Stadick
    given-names: Seth
title: "Perbase"
version: 0.9.0
date-released: 2023-07-20

GitHub Events

Total
  • Create event: 6
  • Release event: 3
  • Issues event: 7
  • Watch event: 14
  • Delete event: 1
  • Issue comment event: 13
  • Push event: 26
  • Pull request review event: 2
  • Pull request review comment event: 2
  • Pull request event: 6
  • Fork event: 2
Last Year
  • Create event: 6
  • Release event: 3
  • Issues event: 7
  • Watch event: 14
  • Delete event: 1
  • Issue comment event: 13
  • Push event: 26
  • Pull request review event: 2
  • Pull request review comment event: 2
  • Pull request event: 6
  • Fork event: 2

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 159
  • Total Committers: 8
  • Avg Commits per committer: 19.875
  • Development Distribution Score (DDS): 0.05
Past Year
  • Commits: 9
  • Committers: 4
  • Avg Commits per committer: 2.25
  • Development Distribution Score (DDS): 0.333
Top Committers
Name Email Commits
Seth Stadick s****k@g****m 151
Nils Homer n****3 2
Rob 1****r 1
Hunter DeMeyer 3****y 1
Chang Y y****0@g****m 1
Brent Pedersen b****e@g****m 1
davidecarlson d****n@s****u 1
Nikolai Karpov n****v@n****m 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 50
  • Total pull requests: 36
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 9 days
  • Total issue authors: 23
  • Total pull request authors: 10
  • Average comments per issue: 1.62
  • Average comments per pull request: 0.61
  • Merged pull requests: 28
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 7
  • Pull requests: 8
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 18 days
  • Issue authors: 6
  • Pull request authors: 3
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.25
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • sstadick (21)
  • brentp (3)
  • patbohn (3)
  • Y-Pei (2)
  • nh13 (2)
  • Surar (2)
  • brainstorm (1)
  • colindaven (1)
  • oschwengers (1)
  • da-i (1)
  • Akazhiel (1)
  • JMNGLS (1)
  • jelber2 (1)
  • ankane (1)
  • ghuls (1)
Pull Request Authors
  • sstadick (22)
  • nkkarpov (2)
  • nh13 (2)
  • biermanr (2)
  • davidecarlson (2)
  • 1ndy (1)
  • Schaudge (1)
  • brentp (1)
  • dependabot[bot] (1)
  • y9c (1)
Top Labels
Issue Labels
enhancement (14) bug (7) investigation (3) question (1) tests (1) documentation (1)
Pull Request Labels
hacktoberfest (2) hacktoberfest-accepted (2) dependencies (1)

Packages

  • Total packages: 2
  • Total downloads:
    • homebrew 25 last-month
    • cargo 29,031 total
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 0
    (may contain duplicates)
  • Total versions: 31
  • Total maintainers: 1
crates.io: perbase

Fast and correct perbase BAM/CRAM analysis.

  • Versions: 27
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 29,031 Total
Rankings
Forks count: 14.4%
Stargazers count: 15.3%
Average: 23.6%
Downloads: 25.4%
Dependent repos count: 29.3%
Dependent packages count: 33.8%
Maintainers (1)
Last synced: 6 months ago
formulae.brew.sh: perbase

Fast and correct perbase BAM/CRAM analysis

  • Versions: 4
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 25 Last month
Rankings
Dependent packages count: 16.4%
Average: 38.8%
Dependent repos count: 47.8%
Downloads: 52.2%
Last synced: 6 months ago

Dependencies

Cargo.lock cargo
  • 185 dependencies
Cargo.toml cargo
  • proptest 1.0.0 development
  • rstest 0.11.0 development
  • tempfile 3.1.0 development
  • anyhow 1.0.41
  • bio 0.41
  • crossbeam 0.8
  • crossbeam-channel 0.5
  • csv 1.1.6
  • env_logger 0.9.0
  • grep-cli 0.1.5
  • gzp 0.10
  • itertools 0.10.1
  • lazy_static 1.4.0
  • log 0.4.11
  • lru_time_cache 0.11.1
  • num_cpus 1.13.0
  • rayon 1.4.0
  • rust-htslib 0.38
  • rust-lapper 1.0
  • serde 1.0.116
  • smartstring 0.2.4
  • structopt 0.3.18
  • termcolor 1.1.0
.github/workflows/main.yml actions
  • actions/checkout v2 composite
  • svenstaro/upload-release-action v2 composite
.github/workflows/rust.yml actions
  • actions/checkout v2 composite