BAMnostic

BAMnostic: an OS-agnostic toolkit for genomic sequence analysis - Published in JOSS (2018)

https://github.com/betteridiot/bamnostic

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: ncbi.nlm.nih.gov, joss.theoj.org, zenodo.org
  • Committers with academic emails
    1 of 9 committers (11.1%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Scientific Fields

Biology Life Sciences - 40% confidence
Last synced: 4 months ago · JSON representation

Repository

a pure Python multi-version tolerant, runtime and OS-agnostic Binary Alignment Map (BAM) file parser and random access tool

Basic Info
Statistics
  • Stars: 98
  • Watchers: 2
  • Forks: 19
  • Open Issues: 3
  • Releases: 27
Created almost 8 years ago · Last pushed 6 months ago
Metadata Files
Readme Contributing License Code of conduct Codemeta

README.md

Documentation Status Conda Version PyPI version Maintainability

status DOI License

|Platform | Build Status |
|:--------|:------------:|
|Windows | Build status Appveyor|
|conda | noarch|

|Host | Downloads |
|:----|:---------:|
|PyPI | Downloads|
|conda|Conda Downloads|

BAMnostic

a pure Python, OS-agnostic Binary Alignment Map (BAM) file parser and random access tool.

Note:

Documentation is hosted on Read the Docs at http://bamnostic.readthedocs.io.


Installation

There are 4 methods of installation available (choose one):

Through the conda package manager (Anaconda Cloud)

```bash
# first, add the conda-forge channel to your conda build
conda config --add channels conda-forge

# now bamnostic is available for install
conda install --solver=libmamba bamnostic
```

Through the Python Package Index (PyPI)

```bash
pip install bamnostic

# or, if you don't have superuser access
pip install --user bamnostic
```

Through pip+Github

```bash
pip install -e git+https://github.com/betteridiot/bamnostic.git#egg=bamnostic

# or, if you don't have superuser access
pip install --user -e git+https://github.com/betteridiot/bamnostic.git#egg=bamnostic
```

Traditional GitHub clone

```bash
git clone https://github.com/betteridiot/bamnostic.git
cd bamnostic
pip install -e .

# or, if you don't have superuser access
pip install --user -e .
```


Quickstart

Bamnostic is meant to be a reduced drop-in replacement for pysam. As such it has much the same API as pysam with regard to BAM-related operations.
Note: the pileup() method is not supported at this time.

Importing

```python
>>> import bamnostic as bs
```

Loading your BAM file (Note: the CRAM format is not supported at this time)

Bamnostic comes with an example BAM (and respective BAI) file just to play around with the output. Note, however, that the example BAM file does not contain many reference contigs, so random access is limited. The example file is made available through bamnostic.example_bam, which is just a string path to the BAM file within the package.

```python
>>> bam = bs.AlignmentFile(bs.example_bam, 'rb')
```

Get the header

Note: this will print out the SAM header. If the SAM header is not in the BAM file, it will print out the dictionary representation of the BAM header. It is a dictionary of refID keys with contig names and length tuple values.

```python
>>> bam.header
{0: ('chr1', 1575), 1: ('chr2', 1584)}
```
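When only the dictionary fallback is available, it can be handy to look contigs up by name instead of by refID. A minimal sketch, assuming the header has exactly the shape shown above; the `header` literal is copied from that output rather than read from a file, and `contig_lengths` is a name introduced here for illustration:

```python
# Dictionary form of the header, copied from the example output above.
header = {0: ('chr1', 1575), 1: ('chr2', 1584)}

# Re-key by contig name so lengths can be looked up directly.
contig_lengths = {name: length for name, length in header.values()}

print(contig_lengths['chr2'])
```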

Data validation through head()

```python
>>> bam.head(n=2)
[EAS56_57:6:190:289:82 69 chr1 100 0 * = 100 0 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; MF:C:192,
 EAS56_57:6:190:289:82 137 chr1 100 73 35M = 100 0 AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC <<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2; MF:C:64 Aq:C:0 NM:C:0 UQ:C:0 H0:C:1 H1:C:0]
```

Getting the first read

```python
>>> first_read = next(bam)
>>> print(first_read)
EAS56_57:6:190:289:82 69 chr1 100 0 * = 100 0 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; MF:C:192
```

Exploring the read

```python
# read name
>>> print(first_read.read_name)
EAS56_57:6:190:289:82

# 0-based position
>>> print(first_read.pos)
99

# nucleotide sequence
>>> print(first_read.seq)
CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA

# Read FLAG
>>> print(first_read.flag)
69

# decoded FLAG
>>> bs.utils.flag_decode(first_read.flag)
[(1, 'read paired'), (4, 'read unmapped'), (64, 'first in pair')]
```
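The decoded FLAG above is a bitwise breakdown of the integer 69. The sketch below shows one way such a decoding can be done by hand; it is not bamnostic's own implementation. The bit values follow the SAM specification, while `SAM_FLAGS`, `decode_flag`, and the wording of the labels are assumptions made for this illustration.

```python
# Bit values from the SAM specification's FLAG field; labels are paraphrased.
SAM_FLAGS = [
    (0x1, 'read paired'),
    (0x2, 'read mapped in proper pair'),
    (0x4, 'read unmapped'),
    (0x8, 'mate unmapped'),
    (0x10, 'read reverse strand'),
    (0x20, 'mate reverse strand'),
    (0x40, 'first in pair'),
    (0x80, 'second in pair'),
    (0x100, 'not primary alignment'),
    (0x200, 'read fails platform/vendor quality checks'),
    (0x400, 'read is PCR or optical duplicate'),
    (0x800, 'supplementary alignment'),
]


def decode_flag(flag):
    """Return (bit value, description) pairs for every bit set in flag."""
    return [(bit, label) for bit, label in SAM_FLAGS if flag & bit]


print(decode_flag(69))
```

Since 69 = 64 + 4 + 1, exactly three bits are set, which matches the three entries in the output shown above.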

Random Access

```python
>>> for i, read in enumerate(bam.fetch('chr2', 1, 100)):
...     if i >= 3:
...         break
...     print(read)

B7_591:8:4:841:340 73 chr2 1 99 36M * 0 0 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA <<<<<<<<;<<<<<<<<;<<<<<;<;:<<<<<<<;; MF:C:18 Aq:C:77 NM:C:0 UQ:C:0 H0:C:1 H1:C:0
EAS54_67:4:142:943:582 73 chr2 1 99 35M * 0 0 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA <<<<<<;<<<<<<:<<;<<<<;<<<;<<<:;<<<5 MF:C:18 Aq:C:41 NM:C:0 UQ:C:0 H0:C:1 H1:C:0
EAS54_67:6:43:859:229 153 chr2 1 66 35M * 0 0 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA +37<=<.;<<7.;77<5<<0<<<;<<<27<<<<<< MF:C:32 Aq:C:0 NM:C:0 UQ:C:0 H0:C:1 H1:C:0
```


Introduction

Next-Generation Sequencing

The field of genomics relies on sequencing data produced by Next-Generation Sequencing (NGS) platforms (such as Illumina). These data take the form of millions of short strings that represent the nucleotide sequences (A, T, C, or G) of the sample fragments processed by the NGS platform. More information regarding the NGS workflow can be found here.

An example of a single entry (known as FASTQ) can be seen below (FASTQ Format):

    @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
    GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
    +SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

Each entry details the read name, length, string representation, and quality of each base along the read.
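To make the four-line layout concrete, the sketch below splits the example record into its parts and converts the quality string into numeric scores. It assumes Phred+33 encoding (score = ASCII code minus 33), which the text above does not state, and the variable names are chosen only for illustration.

```python
# The four lines of the FASTQ record shown above.
record = """@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC"""

header_line, sequence, separator, quality = record.splitlines()

# The read name is the first whitespace-delimited token after '@'.
read_name = header_line[1:].split()[0]

# Assumption: Phred+33 encoding, so score = ASCII code - 33.
scores = [ord(char) - 33 for char in quality]

print(read_name, len(sequence), min(scores), max(scores))
```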

SAM/BAM Format

The data from the NGS platforms are often aligned to a reference genome. That is, each entry goes through an alignment algorithm that finds the position along a known reference sequence that the entry matches best. The alignment step extends the original entry with sundry additional attributes. A few of the included attributes are contig, position, and Compact Idiosyncratic Gapped Alignment Report (CIGAR) string. The modified entry is called a Sequence Alignment Map (SAM) entry. An example can be seen below (SAM format):

    @HD VN:1.5 SO:coordinate
    @SQ SN:ref LN:45
    r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
    r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
    r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
    r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
    r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
    r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1

There are many benefits to the SAM format: it is human-readable, each entry is contained on a single line (supporting simple stream analysis), it concisely describes the read's quality and position, and its file header carries metadata that supports integrity and reproducibility.
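Because each alignment sits on one line, a record can be taken apart with ordinary string handling. The sketch below parses the first `r001` entry from the SAM example and uses its CIGAR string to work out how much of the reference the read covers. It is an illustration only: the set of reference-consuming operations (`M`, `D`, `N`, `=`, `X`) comes from the SAM specification, and the tab-separated `line` literal is rebuilt by hand from the example.

```python
import re

# One alignment line from the SAM example (fields are tab-separated in a real file).
line = "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"

fields = line.split("\t")
qname, flag, rname, pos, mapq, cigar = fields[:6]

# Split the CIGAR string into (length, operation) pairs.
cigar_ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

# Sum the operations that consume reference bases.
ref_span = sum(n for n, op in cigar_ops if op in "MDN=X")

# POS is 1-based in SAM, so the alignment covers start .. start + ref_span - 1.
start = int(pos)
end = start + ref_span - 1

print(cigar_ops, start, end)
```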

Additionally, a compressed form of the SAM format was designed in parallel. It is called the Binary Alignment Map (BAM). Using a series of clever byte encodings of each SAM entry, the data are compressed into specialized, concatenated GZIP blocks called Blocked GNU Zip Format (BGZF) blocks. Each BGZF block contains a finite amount of data (at most 64 KiB, i.e. ≈65 kB). While the whole file is GZIP compatible, each individual block is also independently GZIP compatible. This data structure ultimately makes the file larger than a normal GZIP file, but it also allows for random access within the file through the use of a BAM Index file (BAI).
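The "concatenated but independently decompressible" property can be demonstrated with the standard library alone. The sketch below is not a BGZF writer: real BGZF blocks also carry an extra header subfield recording the block size, and the file ends with an empty marker block, both of which are omitted here. The toy data and the `block_size` value are assumptions chosen for illustration.

```python
import gzip

data = b"ACGT" * 50000   # 200,000 bytes of toy sequence data
block_size = 60000       # hypothetical block size, kept below 64 KiB

# Compress each slice as its own gzip member, then concatenate the members.
blocks = [gzip.compress(data[i:i + block_size])
          for i in range(0, len(data), block_size)]
blocked = b"".join(blocks)

# The concatenation is still a valid gzip stream...
whole = gzip.decompress(blocked)
# ...and any single block can be decompressed without the others.
third = gzip.decompress(blocks[2])

# Compare the blocked size against compressing everything in one go.
print(len(blocked), len(gzip.compress(data)))
```

Each member repeats the gzip header and trailer and cannot reuse compression history from earlier blocks, which is where the size overhead mentioned above comes from.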

BAI

The BAI file, often produced via samtools, requires the BAM file to be sorted prior to indexing. Using a modified R-tree binning strategy, each reference contig is divided into sequential, non-overlapping bins. That is, a parent bin may contain numerous children, but none of the child bins overlaps another's assigned interval. Each BAM entry is then assigned to the smallest bin that fully contains it. A visual description of the binning strategy can be found here. Each bin comprises chunks, and each chunk stores its respective start and stop byte positions within the BAM file.
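The bin assignment described above can be sketched in a few lines. The function below is a Python transcription of the `reg2bin` routine given in the SAM specification, not bamnostic's own code; it assumes the specification's default scheme of six levels, with the smallest bins spanning 16 kbp, and 0-based, half-open coordinates.

```python
def reg2bin(beg, end):
    """Return the smallest bin that fully contains the interval [beg, end)."""
    end -= 1  # make the end coordinate inclusive
    if beg >> 14 == end >> 14:   # fits in a 16 kbp bin
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:   # fits in a 128 kbp bin
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:   # fits in a 1 Mbp bin
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:   # fits in an 8 Mbp bin
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:   # fits in a 64 Mbp bin
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0                     # only the root bin contains it


print(reg2bin(99, 134))        # a short read near the start of a contig
print(reg2bin(16383, 16385))   # a read straddling a 16 kbp boundary
```

A read that crosses a boundary between two small bins is pushed up to their parent, which is why the second call lands in a larger bin than the first.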
In addition to the bin index, a linear index is produced as well. Here, the reference contig is divided into equally sized windows (each covering ≈16 kbp). For each window, the start offset of the first read that overlaps that window is stored. Now, given a region of interest, the first bin that overlaps the region is looked up. The chunks in the bin are stored as virtual offsets.
A virtual offset is a 64-bit unsigned integer composed of the compressed offset coffset (the byte position of the start of the containing BGZF block) and the uncompressed offset uoffset (the byte position within the uncompressed data of the BGZF block at which the data starts). A virtual offset is calculated by:

    virtual_offset = coffset << 16 | uoffset

Similarly, the inverse of the above is as follows:

    coffset = virtual_offset >> 16
    uoffset = virtual_offset ^ (coffset << 16)

A simple seek call against the BAM file will put the head at the start of your region of interest.
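The packing and unpacking of virtual offsets can be wrapped in two small helpers. This is a minimal sketch: `make_virtual_offset` and `split_virtual_offset` are hypothetical names, not bamnostic functions, and the example numbers are arbitrary. Masking with `0xFFFF` gives the same result as the XOR expression above, because uoffset occupies only the low 16 bits.

```python
def make_virtual_offset(coffset, uoffset):
    """Pack a BGZF block start and a within-block offset into one integer."""
    if not 0 <= uoffset < (1 << 16):
        raise ValueError("uoffset must fit in 16 bits")
    return (coffset << 16) | uoffset


def split_virtual_offset(virtual_offset):
    """Recover (coffset, uoffset) from a virtual offset."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


voffset = make_virtual_offset(123456, 789)
print(voffset, split_virtual_offset(voffset))
```

In practice the file is sought to coffset, the BGZF block found there is decompressed, and the first uoffset bytes of the decompressed data are skipped.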


Motivation

The common practice in genomics and genetics when analyzing BAM files is to use samtools. The maintainers of samtools have done a tremendous job of providing distributions that work on a multitude of operating systems. While samtools is powerful, as a command line interface it is limited in that it does not readily support real-time, dynamic processing of reads (without many system calls to samtools). Because of Python's general nature and inherent readability, a Python package called pysam was written. It gives users a comfortable means of doing such dynamic processing. However, these tools are built on a C API called htslib, and htslib cannot be compiled in a Windows environment. By extension, neither can pysam.

In building a tool for genomic visualization, I wanted it to be platform agnostic. This is precisely when I found out that the tools I had planned to use as a backend did not work on Windows, the most prevalent operating system in the end-user world. So, I wrote bamnostic. As of this writing, bamnostic is OS-agnostic and written completely in pure Python, requiring only the standard library (and pytest for the test suite). Special care was taken to ensure that it would run on all versions of CPython 2.7 or greater. Additionally, it runs in both stable versions of PyPy. While it may perform slower than its C counterparts, bamnostic opens up the science to a much greater end-user group. Lastly, it is lightweight enough to fit into any simple web server (e.g. Flask), further expanding the science of genetics/genomics.


Citation

If you use bamnostic in your analyses, please consider citing Li et al. (2009) as well. Regarding the citation for bamnostic, please use the JOSS journal article (click on the JOSS badge above) or use the following:

Sherman MD and Mills RE (2018). BAMnostic: an OS-agnostic toolkit for genomic sequence analysis. Journal of Open Source Software, 3(28), 826, https://doi.org/10.21105/joss.00826


Community Guidelines:

We eagerly accept PRs for improvements, optimizations, or features. For any questions or issues, please feel free to post to bamnostic's issue tracker on GitHub or read over our CONTRIBUTING documentation.


Community Contributors:

Below is a list of contributors, offered as a small token of my gratitude to the community that has helped support this project.

1. @GeekLogan
2. @giesselmann
3. @olgabot
4. @OliverVoogd
5. @gmat
6. @JMencius

Owner

  • Name: Marcus D Sherman
  • Login: betteridiot
  • Kind: user
  • Location: Portland, Maine

PhD candidate in DCMB at the UMich. Member of @mills-lab. @PyDataAnnArbor volunteer, @PyCon staff

JOSS Publication

BAMnostic: an OS-agnostic toolkit for genomic sequence analysis
Published
August 09, 2018
Volume 3, Issue 28, Page 826
Authors
Marcus D. Sherman ORCID
Department of Computational Medicine and Bioinformatics, University of Michigan
Ryan E. Mills ORCID
Department of Computational Medicine and Bioinformatics, University of Michigan, Department of Human Genetics, University of Michigan
Editor
Pjotr Prins ORCID
Tags
genomics bam genetics Next-Generation Sequencing

CodeMeta (codemeta.json)

{
  "@context": [
    "http://schema.org",
    {
      "author": {
        "@id": "schema:author",
        "@container": "@list"
      }
    }
  ],
  "@type": "Code",
  "title": "bamnostic",
  "name": "BAMnostic: a Pure Python OS, version, and runtime agnostic BAM file parser",
  "author": [
    {
      "@id": "https://orcid.org/0000-0002-0243-4609",
      "@type": "Person",
      "email": "mdsherm@umich.edu",
      "name": "Marcus D Sherman",
      "affiliation": "Department of Computational Medicine and Bioinformatics, University of Michigan"
    },
    {
      "@id": "https://orcid.org/0000-0003-3425-6998",
      "@type": "Person",
      "email": "remills@med.umich.edu",
      "name": "Ryan E Mills",
      "affiliation": [
        "Department of Computational Medicine and Bioinformatics, University of Michigan",
        "Department of Human Genetics, University of Michigan"
      ]
    }
  ],
  "copyrightHolder": {
    "@type": "Organization",
    "email": "copyright@umich.edu",
    "name": "University of Michigan"
  },
  "copyrightYear": 2018,
  "creator": {
    "@id": "https://orcid.org/0000-0002-0243-4609"
  },
  "maintainer": "https://orcid.org/0000-0002-0243-4609",
  "codeRepository": "https://github.com/betteridiot/bamnostic",
  "dateModified": "2025-06-27",
  "description": "BAMnostic is a Pure Python OS, version, and runtime agnostic BAM file parser",
  "keywords": "BAM, pysam, genomics, genetics, htslib, samtools",
  "license": "https://github.com/betteridiot/bamnostic/blob/master/LICENSE",
  "softwareVersion": "v1.2",
  "version": "v1.2",
  "readme": "https://github.com/betteridiot/bamnostic/blob/master/README.md",
  "buildInstructions": "https://github.com/betteridiot/bamnostic/blob/master/README.md",
  "issueTracker": "https://github.com/betteridiot/bamnostic/issues",
  "funder": "National Institutes of Health [R01HG007068]",
  "programmingLanguage": {
    "name": "Python",
    "URL": "https://www.python.org/"
  },
  "downloadUrl": [
    "https://github.com/codemeta/codemetar/releases/",
    "https://pypi.org/project/bamnostic/",
    "https://anaconda.org/conda-forge/bamnostic"
  ]
}

GitHub Events

Total
  • Watch event: 5
  • Issue comment event: 8
  • Push event: 6
  • Pull request event: 3
  • Fork event: 1
  • Create event: 1
Last Year
  • Watch event: 5
  • Issue comment event: 8
  • Push event: 6
  • Pull request event: 3
  • Fork event: 1
  • Create event: 1

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 351
  • Total Committers: 9
  • Avg Commits per committer: 39.0
  • Development Distribution Score (DDS): 0.048
Past Year
  • Commits: 3
  • Committers: 2
  • Avg Commits per committer: 1.5
  • Development Distribution Score (DDS): 0.333
Top Committers
Name Email Commits
betteridiot m****n@b****h 334
stickler-ci s****t@s****m 7
Pay Giesselmann p****n@w****e 3
Logan Walker l****w@u****u 2
gmat g****k@y****r 1
OliverVoogd 5****d 1
Olga Botvinnik o****k@g****m 1
Mario Fasold f****d@g****m 1
Jun Mencius 6****s 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 32
  • Total pull requests: 26
  • Average time to close issues: 2 months
  • Average time to close pull requests: 2 days
  • Total issue authors: 20
  • Total pull request authors: 9
  • Average comments per issue: 2.63
  • Average comments per pull request: 0.77
  • Merged pull requests: 24
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 2
  • Pull requests: 2
  • Average time to close issues: 9 months
  • Average time to close pull requests: 22 days
  • Issue authors: 2
  • Pull request authors: 2
  • Average comments per issue: 1.5
  • Average comments per pull request: 3.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • olgabot (5)
  • peterjc (5)
  • betteridiot (2)
  • envanloo (2)
  • ghost (2)
  • ThomasH-RGB (2)
  • Akanksha2511 (1)
  • castrocp (1)
  • akikuno (1)
  • Frank1219 (1)
  • najoshi (1)
  • Eugene108 (1)
  • Jeshuwin (1)
  • ChristopheH (1)
  • delocalizer (1)
Pull Request Authors
  • betteridiot (17)
  • olgabot (3)
  • JMencius (2)
  • stickler-ci[bot] (1)
  • OliverVoogd (1)
  • giesselmann (1)
  • mfasold (1)
  • GeekLogan (1)
  • gmat (1)
Top Labels
Issue Labels
enhancement (3) in progress (1) help wanted (1)
Pull Request Labels

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 789 last-month
  • Total dependent packages: 3
    (may contain duplicates)
  • Total dependent repositories: 2
    (may contain duplicates)
  • Total versions: 81
  • Total maintainers: 1
pypi.org: bamnostic

Pure Python, OS-agnostic Binary Alignment Map (BAM) random access and parsing tool

  • Versions: 55
  • Dependent Packages: 3
  • Dependent Repositories: 1
  • Downloads: 789 Last month
Rankings
Dependent packages count: 3.1%
Stargazers count: 7.5%
Forks count: 8.7%
Average: 10.4%
Downloads: 11.0%
Dependent repos count: 21.7%
Maintainers (1)
Last synced: 4 months ago
conda-forge.org: bamnostic
  • Versions: 26
  • Dependent Packages: 0
  • Dependent Repositories: 1
Rankings
Dependent repos count: 24.3%
Stargazers count: 35.0%
Average: 37.2%
Forks count: 38.0%
Dependent packages count: 51.6%
Last synced: 4 months ago

Dependencies

docs/requirements.txt pypi
  • docutils <0.18
  • sphinx *
  • sphinx_rtd_theme *
pyproject.toml pypi
requirements.txt pypi
setup.py pypi
docs/environment.yaml pypi
  • sphinx_rtd_theme *