biocommons.seqrepo

non-redundant, compressed, journalled, file-based storage for biological sequences

https://github.com/biocommons/biocommons.seqrepo

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
    1 of 17 committers (5.9%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.4%) to scientific vocabulary

Keywords

bioinformatics genome-analysis genomics sequencing variant-analysis variation

Repository

Basic Info
  • Host: GitHub
  • Owner: biocommons
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 613 KB
Statistics
  • Stars: 42
  • Watchers: 9
  • Forks: 36
  • Open Issues: 30
  • Releases: 5
Created over 9 years ago · Last pushed 5 months ago
Metadata Files
Readme License Citation Codeowners

README.md

biocommons.seqrepo

SeqRepo is a Python package for storing and reading a local collection of biological sequences. The repository is non-redundant, compressed, and journalled, making it efficient to store and transfer multiple snapshots.

Introduction

Specific, named biological sequences provide the reference and coordinate system for communicating variation and consequential phenotypic changes. Several databases of sequences exist, with significant overlap, all using distinct names. Furthermore, these systems are often difficult to install locally.

SeqRepo provides an efficient, non-redundant and indexed storage system for biological sequences. Clients refer to sequences and metadata using familiar identifiers, such as NM_000551.3 or GRCh38:1, or any of several hash-based identifiers. The interface supports fast slicing of arbitrary regions of large sequences.

A "fully-qualified" identifier includes a namespace to disambiguate accessions from different origins or sequence sets (e.g., "1" in GRCh37 and GRCh38). If the namespace is provided, seqrepo uses it as-is. If the namespace is not provided and the unqualified identifier refers to a unique sequence, that sequence is returned. Otherwise, the identifier is ambiguous and its use raises an error.
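The resolution rule above can be sketched in a few lines of Python. This is a toy model, not seqrepo's implementation: the `ALIASES` table, the `seq-a`/`seq-b`/`seq-c` ids, and the `resolve` function are all hypothetical, and the sketch assumes a `namespace:alias` spelling for qualified identifiers, as in `GRCh38:1`.

```python
from typing import Dict, Optional, Set, Tuple

# Toy alias table: (namespace, alias) -> internal sequence id.
# Every entry and id here is made up for illustration.
ALIASES: Dict[Tuple[str, str], str] = {
    ("GRCh37", "1"): "seq-a",
    ("GRCh38", "1"): "seq-b",
    ("refseq", "NM_000551.3"): "seq-c",
    ("NCBI", "NM_000551.3"): "seq-c",
}


def resolve(identifier: str) -> str:
    """Return the sequence id for a qualified or unqualified identifier."""
    namespace: Optional[str]
    if ":" in identifier:
        # Fully-qualified: use the namespace as given.
        namespace, alias = identifier.split(":", 1)
    else:
        namespace, alias = None, identifier

    matches: Set[str] = {
        seq_id
        for (ns, al), seq_id in ALIASES.items()
        if al == alias and (namespace is None or ns == namespace)
    }
    if not matches:
        raise KeyError(f"no sequence found for {identifier!r}")
    if len(matches) > 1:
        # Unqualified identifier that maps to more than one sequence.
        raise KeyError(f"{identifier!r} is ambiguous; add a namespace")
    return next(iter(matches))
```

In this sketch, `resolve("GRCh38:1")` and `resolve("NM_000551.3")` each find a single sequence, while `resolve("1")` raises because the alias exists in two assemblies with different sequences.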

SeqRepo favors namespaces from identifiers.org whenever available. Examples include refseq and ensembl.

seqrepo-rest-service provides a REST interface and docker image.

Released under the Apache License, 2.0.


Citation

Hart RK, Prlić A (2020). SeqRepo: A system for managing local collections of biological sequences. PLoS ONE 15(12): e0239883. https://doi.org/10.1371/journal.pone.0239883

Features

  • Timestamped, read-only snapshots.
  • Space-efficient storage of sequences within a single snapshot and across snapshots.
  • Bandwidth-efficient transfer of incremental updates.
  • Fast fetching of sequence slices on chromosome-scale sequences.
  • Precomputed digests that may be used as sequence aliases.
  • Mappings of external aliases (i.e., accessions or identifiers like NM_013305.4) to sequences.

Deployment Scenarios

Technical Quick Peek

Within a single snapshot, sequences are stored non-redundantly and compressed in an add-only journalled filesystem structure. A truncated SHA-512 hash is used to assess uniqueness and as an internal id. (The digest is truncated for space efficiency.)
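As an illustration of a truncated SHA-512 digest, the sketch below follows the GA4GH `sha512t24u` convention, whose name also appears among the aliases in the export output in the Quick Start: take the first 24 bytes of the SHA-512 digest and base64url-encode them. Whether seqrepo's internal id is computed in exactly this way, and how sequences are normalized before hashing, are assumptions here; see docs/design.rst for the authoritative description.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """Return the base64url encoding of the first 24 bytes of SHA-512(data)."""
    digest = hashlib.sha512(data).digest()
    # 24 bytes encode to exactly 32 base64 characters, so no "=" padding appears.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


# Identical sequences always produce the same id, which is what makes
# the digest usable for detecting duplicates.
seq_id = sha512t24u(b"ACGT")
```

Truncating to 24 bytes keeps the identifier at 32 characters while leaving 192 bits of digest, which is the space trade-off the parenthetical above refers to.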

Sequences are compressed using the Blocked GNU Zip Format (BGZF), which enables pysam to provide fast random access to compressed sequences. (Variable compression typically makes random access impossible.)
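To see why block-wise compression permits random access, consider this simplified sketch. It is not BGZF and does not use pysam. It compresses fixed-size chunks independently with `zlib`, so a slice can be read by decompressing only the blocks that overlap it. The block size and function names are illustrative choices.

```python
import zlib
from typing import List

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block (illustrative choice)


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Compress each fixed-size chunk of data independently."""
    return [
        zlib.compress(data[i : i + block_size])
        for i in range(0, len(data), block_size)
    ]


def read_slice(
    blocks: List[bytes], start: int, end: int, block_size: int = BLOCK_SIZE
) -> bytes:
    """Return data[start:end], decompressing only the blocks that overlap it."""
    if start < 0 or start >= end:
        return b""
    first = start // block_size
    last = (end - 1) // block_size
    chunk = b"".join(zlib.decompress(b) for b in blocks[first : last + 1])
    offset = first * block_size  # position of chunk[0] in the original data
    return chunk[start - offset : end - offset]


data = b"ACGT" * 100_000
blocks = compress_blocks(data)
fragment = read_slice(blocks, 70_000, 70_020)  # touches a single block
```

A real BGZF file stores its blocks back to back in one file and relies on an index of block offsets to seek. The list of blocks here stands in for that index.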

Sequence files are immutable, thereby enabling the use of hardlinks across snapshots and eliminating redundant transfers (e.g., with rsync).
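The effect of hardlinking an immutable file into a second snapshot can be demonstrated with the standard library. The directory and file names below are hypothetical and do not reflect seqrepo's actual layout. The sketch assumes a filesystem that supports hard links.

```python
import os
import tempfile
from typing import Tuple


def demo_hardlink() -> Tuple[bool, int]:
    """Link one file into two snapshot directories.

    Returns (whether both paths are the same file, the file's link count).
    """
    with tempfile.TemporaryDirectory() as root:
        old_snap = os.path.join(root, "snapshot-old")
        new_snap = os.path.join(root, "snapshot-new")
        os.makedirs(old_snap)
        os.makedirs(new_snap)

        original = os.path.join(old_snap, "sequences.dat")
        with open(original, "wb") as fh:
            fh.write(b"placeholder bytes standing in for a sequence file")

        # A hard link adds a second directory entry for the same on-disk data,
        # so the new snapshot uses no additional space for this file.
        linked = os.path.join(new_snap, "sequences.dat")
        os.link(original, linked)

        return os.path.samefile(original, linked), os.stat(original).st_nlink


same_file, link_count = demo_hardlink()
```

Because both paths refer to the same data, this only works safely when files are never modified in place, which is why immutability matters.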

Each sequence id is associated with namespaced aliases in a sqlite database, such as <seguid,rvvuhY0FxFLNwf10FXFIrSQ7AvQ>, <NCBI,NP_004009.1>, <gi,5032303>, <ensembl-75,ENSP00000354464>, and <ensembl-85,ENSP00000354464.4>. The sqlite database is mutable across releases.
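A minimal sketch of such an alias table, using Python's built-in `sqlite3` module, might look like the following. The table name, column names, and the `demo-seq-1` id are hypothetical and are not seqrepo's actual schema. The namespace and alias values are taken from the examples above.

```python
import sqlite3
from typing import List, Optional, Tuple

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE alias_demo (
        seq_id    TEXT NOT NULL,
        namespace TEXT NOT NULL,
        alias     TEXT NOT NULL,
        UNIQUE (namespace, alias)
    )
    """
)
conn.executemany(
    "INSERT INTO alias_demo (seq_id, namespace, alias) VALUES (?, ?, ?)",
    [
        ("demo-seq-1", "seguid", "rvvuhY0FxFLNwf10FXFIrSQ7AvQ"),
        ("demo-seq-1", "NCBI", "NP_004009.1"),
        ("demo-seq-1", "gi", "5032303"),
        ("demo-seq-1", "ensembl-85", "ENSP00000354464.4"),
    ],
)


def aliases_for(seq_id: str) -> List[Tuple[str, str]]:
    """Return all (namespace, alias) pairs recorded for a sequence id."""
    cur = conn.execute(
        "SELECT namespace, alias FROM alias_demo WHERE seq_id = ? "
        "ORDER BY namespace, alias",
        (seq_id,),
    )
    return cur.fetchall()


def seq_ids_for(alias: str, namespace: Optional[str] = None) -> List[str]:
    """Return the distinct sequence ids that an alias refers to."""
    sql = "SELECT DISTINCT seq_id FROM alias_demo WHERE alias = ?"
    params: Tuple[str, ...] = (alias,)
    if namespace is not None:
        sql += " AND namespace = ?"
        params += (namespace,)
    return [row[0] for row in conn.execute(sql, params)]
```

Keeping aliases in a separate, mutable database lets new names be attached to existing sequences without touching the immutable sequence files.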

For calibration, recent releases that include 3 human genome assemblies (including patches) and full RefSeq sets (NM, NR, NP, NT, XM, and XP) consume approximately 8GB. The minimum marginal size for additional snapshots is approximately 2GB (for the sqlite database, which is not hardlinked).

For more information, see docs/design.rst.

Requirements

Reading a sequence repository requires several Python packages, all of which are available from PyPI. Installation should be as simple as pip install biocommons.seqrepo.

Acquiring SeqRepo snapshots using the CLI requires an rsync binary. Note that openrsync, which now ships with new macOS installs, does not support all required functions. Mac users should install rsync from Homebrew and use the --rsync-exe option to declare its exact location.

Writing sequence files also requires bgzip, which is provided in the htslib repo. Ubuntu users should install the tabix package with sudo apt install tabix.

Development and deployments are on Ubuntu. Other systems may work but are not tested. Patches to get other systems working would be welcomed.

Quick Start

OS X

$ brew install python htslib

Ubuntu

$ sudo apt install -y python3-dev gcc zlib1g-dev tabix

All platforms

$ python -m venv venv
$ source venv/bin/activate
$ pip install seqrepo
$ sudo mkdir -p /usr/local/share/seqrepo
$ sudo chown $USER /usr/local/share/seqrepo
$ seqrepo pull -i 2024-12-20
$ seqrepo show-status -i 2024-12-20
seqrepo 0.6.12.dev7+gd311e3e.d20250730
instance directory: /usr/local/share/seqrepo/2024-12-20, 13.5 GB
backends: fastadir (schema 1), seqaliasdb (schema 1)
sequences: 1144093 sequences, 128795051613 residues, 435 files
aliases: 5865328 aliases, 5865328 current, 35 namespaces, 1144093 sequences

# Simple Pythonic interface to sequences
>>> from biocommons.seqrepo import SeqRepo
>>> sr = SeqRepo("/usr/local/share/seqrepo/2024-12-20")
>>> sr["NC_000001.11"][780000:780020]
'TGGTGGCACGCGCTTGTAGT'

# Or, use the seqrepo shell for even easier access
$ seqrepo start-shell -i 2024-12-20
In [1]: sr["NC_000001.11"][780000:780020]
Out[1]: 'TGGTGGCACGCGCTTGTAGT'

# N.B. The following output is edited for simplicity
$ seqrepo export -i 2024-12-20 | head -n100
>MD5:611ff0945aa9eeaaf9ef908d0e744cd0 SEGUID:mirLo912A/MppLuS1cUyFMduLUQ SHA1:9a2acba3dd7603f329a4bb92d5c53214c76e2d44 VMC:GS_---7nAwbv5Fs2Ml2-k3X6Zvj-6ZcjeD3 ga4gh:SQ.---7nAwbv5Fs2Ml2-k3X6Zvj-6ZcjeD3 sha512t24u:---7nAwbv5Fs2Ml2-k3X6Zvj-6ZcjeD3
MDSPLREDDSQTCARLWEAEVKRHSLEGLTVFGTAVQIHNVQRRAIRAKGTQEAQAELLCRGPRLLDRFLEDACILKEGRGTDTGQHCRGDARISSHLEA
SGTHIQLLALFLVSSSDTPPSLLRFCHALEHDIRYNSSFDSYYPLSPHSRHNDDLQTPSSHLGYIITVPDPTLPLTFASLYLGMAPCTSMGSSSMGIFQS
QRIHAFMKGKNKWDEYEGRKESWKIRSNSQTGEPTF
>MD5:8ca1247fe64b17b9d40c2112a8bfc3a2 NCBI:XM_017008743.2 SEGUID:8cijSTmR/FOL+Gtq9gf6JlbRcvY SHA1:f1c8a3493991fc538bf86b6af607fa2656d172f6 VMC:GS_---BUlBwgZN_r5wSII-WCNDd9nn1Owj4 ga4gh:SQ.---BUlBwgZN_r5wSII-WCNDd9nn1Owj4 refseq:XM_017008743.2 sha512t24u:---BUlBwgZN_r5wSII-WCNDd9nn1Owj4
ACTTATGGAAAACAGTGTGGCATATTCTGCTGAGCTTCGCCCTGGAAGAAGCCTCTTTTATACATCTCTTCAGGGAAGAGAGAAGCAATGGGCATGTTAG
TATACAATGATCACAGCCACGCAGGCCTGCAAGCTGCCTTTTGGACAGGCTGTTGACTGCCGTTCCAATTAGCTGATTGGAGAATGTGGAATGCAGAGTG
ATAATGCTGCATATCTGCTATCAGGCAGCAGCAAAGGTTTTTGTCTTGGGAAGGCAAGCTTTCCCTGCAATATTATCTCAGCAGCTCCCTAGCTGCTTAC

See Installation and Mirroring for more information.
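The header lines in the export output above list each alias as a `namespace:alias` token separated by whitespace. A small helper for grouping those tokens by namespace might look like this. The function name is hypothetical, and because the example output is noted as edited for simplicity, the real header format may differ in detail.

```python
from typing import Dict, List


def parse_export_header(line: str) -> Dict[str, List[str]]:
    """Group the namespace:alias tokens of a FASTA header by namespace."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    aliases: Dict[str, List[str]] = {}
    for token in line[1:].split():
        # Split on the first colon only, in case an alias itself contains one.
        namespace, _, alias = token.partition(":")
        aliases.setdefault(namespace, []).append(alias)
    return aliases


# Header shortened from the second record in the export example.
header = (
    ">MD5:8ca1247fe64b17b9d40c2112a8bfc3a2 "
    "NCBI:XM_017008743.2 refseq:XM_017008743.2"
)
parsed = parse_export_header(header)
```

Here `parsed` maps each of the namespaces `MD5`, `NCBI`, and `refseq` to a one-element list of aliases.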

Environment Variables

SEQREPO_LRU_CACHE_MAXSIZE sets the lru_cache maxsize for the sqlite query response caching. It defaults to 1 million but can also be set to "none" to be unlimited.

SEQREPO_FD_CACHE_MAXSIZE sets the lru_cache size for file handler caching during FASTA sequence retrievals. It defaults to 0 to disable any caching, but can be set to a specific value or "none" to be unlimited. Using a moderate value (>10) will greatly increase performance of sequence retrieval.
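The sketch below shows one way such settings can map onto Python's `functools.lru_cache`, where `maxsize=None` means unbounded and `maxsize=0` disables caching. It assumes the variables are spelled with underscores (`SEQREPO_LRU_CACHE_MAXSIZE`, `SEQREPO_FD_CACHE_MAXSIZE`). The helper functions are hypothetical and are not seqrepo's own code.

```python
import os
from functools import lru_cache
from typing import Callable, Optional


def cache_maxsize(name: str, default: Optional[int]) -> Optional[int]:
    """Read a cache size from the environment; "none" means unlimited."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    if raw.strip().lower() == "none":
        return None  # lru_cache treats None as "no size limit"
    return int(raw)


def cached(func: Callable, env_name: str, default: Optional[int]) -> Callable:
    """Wrap func in an lru_cache sized from the named environment variable."""
    return lru_cache(maxsize=cache_maxsize(env_name, default))(func)


# Example: request a moderate file-handle cache for this process.
os.environ["SEQREPO_FD_CACHE_MAXSIZE"] = "16"
fd_cache_size = cache_maxsize("SEQREPO_FD_CACHE_MAXSIZE", default=0)
```

In a shell, the equivalent is to export the variable before starting Python or the seqrepo CLI.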

Developing

Developing on OS X

brew install python htslib

If you get "xcrun: error: invalid active developer path", you need to install Xcode. See this StackOverflow answer.

Developing on Ubuntu

sudo apt install -y python3-dev gcc zlib1g-dev tabix

Here's how to get started developing:

make devready
source venv/bin/activate
seqrepo --version

Code reformatting:

make reformat

Install pre-commit hook:

# included in `make devready`, not necessary for new installations
pre-commit install

Building a docker image

Docker images are available at https://hub.docker.com/r/biocommons/seqrepo. Tags correspond to the version of data, not the version of seqrepo, because the intent is to make it easy to depend on a local version of seqrepo files. Each docker image is an installation of seqrepo that downloads the corresponding version of seqrepo data. When used in conjunction with docker volumes for persistence, this provides an easy way to incorporate seqrepo data into a docker stack.

Building

cd misc/docker
make 2021-01-29.log  # builds and pushes to hub.docker.com (i.e., you need creds)

Owner

  • Name: biocommons
  • Login: biocommons
  • Kind: organization

a collection of open source bioinformatics tools

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: SeqRepo
type: software
authors:
  - given-names: Reece K.
    family-names: Hart
  - given-names: Andreas
    family-names: Prlić
repository-code: 'https://github.com/biocommons/biocommons.seqrepo'
license: Apache-2.0

preferred-citation:
  type: article
  title: "SeqRepo: A system for managing local collections of biological sequences"
  authors:
  - family-names: "Hart"
    given-names: "Reece K."
  - family-names: "Prlić"
    given-names: "Andreas"
  doi: "10.1371/journal.pone.0239883"
  journal: "PLoS One"
  year: 2020
  month: 12
  volume: 15
  issue: 12
  start: "e0239883"

GitHub Events

Total
  • Create event: 16
  • Release event: 2
  • Issues event: 17
  • Watch event: 3
  • Delete event: 13
  • Issue comment event: 34
  • Push event: 39
  • Pull request event: 18
  • Pull request review event: 12
  • Pull request review comment event: 3
  • Fork event: 1
Last Year
  • Create event: 16
  • Release event: 2
  • Issues event: 17
  • Watch event: 3
  • Delete event: 13
  • Issue comment event: 34
  • Push event: 39
  • Pull request event: 18
  • Pull request review event: 12
  • Pull request review comment event: 3
  • Fork event: 1

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 374
  • Total Committers: 17
  • Avg Commits per committer: 22.0
  • Development Distribution Score (DDS): 0.088
Past Year
  • Commits: 22
  • Committers: 4
  • Avg Commits per committer: 5.5
  • Development Distribution Score (DDS): 0.273
Top Committers
Name Email Commits
Reece Hart r****t@g****m 341
Andreas Prlic a****c@i****m 6
Alan Rubin a****n@w****u 4
Liang Chen c****c 3
Andreas Prlic a****c@g****m 3
Ben Robinson b****n@i****m 3
Sam Pearlman s****l 2
Teemu Vesala t****a@b****m 2
Manuel Holtgrewe m****e@b****e 2
Deena Blumenkrantz d****7@g****m 1
Iskandar Pashayev i****v@g****m 1
Jake Peacock j****k@g****m 1
Alex Henrie a****4@g****m 1
Dylan Reinhardt d****t@m****m 1
Joseph Solomon j****9@g****m 1
Marcelo Gobelli a****o@c****t 1
Lawrence Lee 3****e 1

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 116
  • Total pull requests: 70
  • Average time to close issues: 11 months
  • Average time to close pull requests: about 1 month
  • Total issue authors: 28
  • Total pull request authors: 22
  • Average comments per issue: 1.84
  • Average comments per pull request: 1.19
  • Merged pull requests: 54
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 14
  • Pull requests: 18
  • Average time to close issues: 2 days
  • Average time to close pull requests: 3 days
  • Issue authors: 6
  • Pull request authors: 3
  • Average comments per issue: 1.36
  • Average comments per pull request: 0.22
  • Merged pull requests: 13
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • reece (62)
  • jsstevenson (12)
  • theferrit32 (6)
  • holtgrewe (4)
  • wlymanambry (3)
  • ahwagner (2)
  • andreasprlic (2)
  • jfreidin (2)
  • chenliangomc (1)
  • ok-gitr (1)
  • dylan-myome (1)
  • gromdimon (1)
  • quinnwai (1)
  • mbutterfield-GENOME (1)
  • git4waki (1)
Pull Request Authors
  • jsstevenson (42)
  • reece (15)
  • holtgrewe (6)
  • theferrit32 (3)
  • andreasprlic (3)
  • Deena-B (3)
  • decareano (2)
  • kazmiekr (2)
  • afrubin (2)
  • chenliangomc (2)
  • andreas-invitae (1)
  • davmlaw (1)
  • teemuvesala (1)
  • sampearl (1)
  • korikuzma (1)
Top Labels
Issue Labels
bug (16) enhancement (10) keep alive (7) stale (5) wontfix (4) closed-by-stale (4) question (2) documentation (2) abandoned (1) breaking change (1) help wanted (1) project proposal (1)
Pull Request Labels
stale (2) documentation (1) bug (1)

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 57,756 last-month
  • Total docker downloads: 15
  • Total dependent packages: 10
    (may contain duplicates)
  • Total dependent repositories: 22
    (may contain duplicates)
  • Total versions: 49
  • Total maintainers: 3
pypi.org: biocommons.seqrepo

Non-redundant, compressed, journalled, file-based storage for biological sequences

  • Versions: 48
  • Dependent Packages: 10
  • Dependent Repositories: 21
  • Downloads: 56,737 Last month
  • Docker Downloads: 15
Rankings
Dependent packages count: 1.1%
Downloads: 1.5%
Dependent repos count: 3.2%
Docker downloads count: 3.8%
Average: 4.6%
Forks count: 6.8%
Stargazers count: 11.0%
Maintainers (3)
Last synced: 5 months ago
pypi.org: seqrepo

alias for the biocommons.seqrepo package

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 1,019 Last month
Rankings
Downloads: 6.1%
Forks count: 6.8%
Dependent packages count: 10.0%
Stargazers count: 10.9%
Average: 11.1%
Dependent repos count: 21.7%
Maintainers (1)
Last synced: 5 months ago

Dependencies

.github/workflows/labels.yml actions
  • actions/checkout v3 composite
  • crazy-max/ghaction-github-labeler v4 composite
.github/workflows/python-package.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • pypa/gh-action-pypi-publish release/v1 composite
.github/workflows/stale.yml actions
  • actions/stale v8 composite
pyproject.toml pypi
setup.py pypi