biocommons.seqrepo

non-redundant, compressed, journalled, file-based storage for biological sequences

https://github.com/biocommons/biocommons.seqrepo

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 2 DOI reference(s) in README
○ Academic publication links
✓ Committers with academic emails: 1 of 17 committers (5.9%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.4%) to scientific vocabulary

Keywords

bioinformatics genome-analysis genomics sequencing variant-analysis variation

Last synced: 6 months ago

Repository

non-redundant, compressed, journalled, file-based storage for biological sequences

Basic Info

Host: GitHub
Owner: biocommons
License: apache-2.0
Language: Python
Default Branch: main
Homepage:
Size: 613 KB

Statistics

Stars: 42
Watchers: 9
Forks: 36
Open Issues: 30
Releases: 5

Topics

bioinformatics genome-analysis genomics sequencing variant-analysis variation

Created over 9 years ago · Last pushed 6 months ago

Metadata Files

Readme License Citation Codeowners

biocommons.seqrepo

SeqRepo is a Python package for storing and reading a local collection of biological sequences. The repository is non-redundant, compressed, and journalled, making it efficient to store and transfer multiple snapshots.

Introduction

Specific, named biological sequences provide the reference and coordinate system for communicating variation and consequential phenotypic changes. Several databases of sequences exist, with significant overlap, all using distinct names. Furthermore, these systems are often difficult to install locally.

SeqRepo provides an efficient, non-redundant and indexed storage system for biological sequences. Clients refer to sequences and metadata using familiar identifiers, such as NM_000551.3 or GRCh38:1, or any of several hash-based identifiers. The interface supports fast slicing of arbitrary regions of large sequences.

A "fully-qualified" identifier includes a namespace to disambiguate accessions from different origins or sequence sets (e.g., "1" in GRCh37 and GRCh38). If the namespace is provided, seqrepo uses it as-is; if the namespace is not provided and the unqualified identifier refers to a unique sequence, that sequence is returned; otherwise, the use of an ambiguous identifier raises an error.
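The resolution rule above can be sketched as follows. This is a minimal illustration over an in-memory dict, not SeqRepo's actual implementation; `resolve`, `ALIASES`, and the `seq-*` ids are hypothetical names introduced only for this example.

```python
# Hypothetical alias table: (namespace, alias) -> internal sequence id.
# The ids are made-up placeholders, not real digests.
ALIASES = {
    ("GRCh37", "1"): "seq-A",
    ("GRCh38", "1"): "seq-B",
    ("refseq", "NM_000551.3"): "seq-C",
    ("NCBI", "NM_000551.3"): "seq-C",
}


def resolve(identifier: str) -> str:
    """Return the sequence id for a qualified or unqualified identifier."""
    namespace, sep, alias = identifier.partition(":")
    if sep:
        # Fully qualified: use the namespace as-is.
        return ALIASES[(namespace, alias)]
    # Unqualified: collect every distinct sequence that carries this alias.
    matches = {seq_id for (_, a), seq_id in ALIASES.items() if a == identifier}
    if not matches:
        raise KeyError(identifier)
    if len(matches) > 1:
        raise KeyError(f"{identifier!r} is ambiguous; add a namespace")
    return matches.pop()
```

In this sketch, `resolve("GRCh38:1")` and `resolve("NM_000551.3")` both succeed, because the second alias maps to a single sequence even though it appears in two namespaces, while `resolve("1")` raises because the alias points at two different sequences.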

SeqRepo favors namespaces from identifiers.org whenever available. Examples include refseq and ensembl.

seqrepo-rest-service provides a REST interface and docker image.

Released under the Apache License, 2.0.


Citation

Hart RK, Prlić A (2020). SeqRepo: A system for managing local collections of biological sequences. PLoS ONE 15(12): e0239883. https://doi.org/10.1371/journal.pone.0239883

Features

Timestamped, read-only snapshots.
Space-efficient storage of sequences within a single snapshot and across snapshots.
Bandwidth-efficient transfer of incremental updates.
Fast fetching of sequence slices on chromosome-scale sequences.
Precomputed digests that may be used as sequence aliases.
Mappings of external aliases (i.e., accessions or identifiers like NM_013305.4) to sequences.

Deployment Scenarios

Local read-only archive, mirrored from public site, accessed via Python API (see Mirroring documentation)
Local read-write archive, maintained with command line utility and/or API (see Command Line Interface documentation).
Docker data-only container that may be linked to application container.
SeqRepo and refget REST API for local or remote access (see seqrepo-rest-service)

Technical Quick Peek

Within a single snapshot, sequences are stored non-redundantly and compressed in an add-only journalled filesystem structure. A truncated SHA-512 hash is used to assess uniqueness and as an internal id. (The digest is truncated for space efficiency.)
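As an illustration of a truncated SHA-512 digest, the sketch below computes the form that the `sha512t24u` label in the Quick Start export output suggests: the first 24 bytes of the SHA-512 digest, base64url-encoded. It assumes the input is hashed as raw bytes; whether SeqRepo normalizes sequences before hashing is not stated here, so treat this as an approximation rather than the package's exact code path.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """Base64url-encode the first 24 bytes of the SHA-512 digest of data."""
    digest = hashlib.sha512(data).digest()
    # 24 bytes encode to exactly 32 base64 characters, so no '=' padding appears.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


print(sha512t24u(b""))  # → z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXc
```

Identical sequences always produce the same digest, which is what allows it to serve both as a uniqueness check and as an id.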

Sequences are compressed using the Block GZipped Format (BGZF), which enables pysam to provide fast random access to compressed sequences. (Variable compression typically makes random access impossible.)

Sequence files are immutable, thereby enabling the use of hardlinks across snapshots and eliminating redundant transfers (e.g., with rsync).
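The hardlink idea can be demonstrated with the standard library alone. The snapshot directory names and the file name below are hypothetical, and the sketch assumes a filesystem that supports hardlinks; it is not how SeqRepo itself creates snapshots.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Two hypothetical snapshot directories.
    old_snapshot = os.path.join(root, "snapshot-old")
    new_snapshot = os.path.join(root, "snapshot-new")
    os.makedirs(old_snapshot)
    os.makedirs(new_snapshot)

    # An immutable sequence file written once in the older snapshot.
    src = os.path.join(old_snapshot, "example.fa.bgz")
    with open(src, "wb") as fh:
        fh.write(b"ACGT" * 1000)

    # The newer snapshot links to the same file instead of copying it.
    dst = os.path.join(new_snapshot, "example.fa.bgz")
    os.link(src, dst)

    same_file = os.path.samefile(src, dst)  # both names refer to one inode
    link_count = os.stat(src).st_nlink      # number of names for that inode

print(same_file, link_count)
```

Because both paths name the same on-disk data, the second snapshot adds no storage for that file, and hardlink-aware transfer tools have nothing new to send for it.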

Each sequence id is associated with namespaced aliases in a sqlite database, such as <seguid,rvvuhY0FxFLNwf10FXFIrSQ7AvQ>, <NCBI,NP_004009.1>, <gi,5032303>, <ensembl-75,ENSP00000354464>, and <ensembl-85,ENSP00000354464.4>. The sqlite database is mutable across releases.
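A namespaced alias table of this kind might look like the following in-memory sketch. The table name, column names, and the `seq-1` id are hypothetical and do not reflect SeqRepo's real schema; mapping all three aliases to one id is also an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE alias_example (
        namespace TEXT NOT NULL,
        alias     TEXT NOT NULL,
        seq_id    TEXT NOT NULL,
        UNIQUE (namespace, alias)
    )
    """
)
rows = [
    ("NCBI", "NP_004009.1", "seq-1"),
    ("gi", "5032303", "seq-1"),
    ("ensembl-85", "ENSP00000354464.4", "seq-1"),
]
conn.executemany("INSERT INTO alias_example VALUES (?, ?, ?)", rows)


def lookup(namespace: str, alias: str):
    """Return the sequence id for a namespaced alias, or None if unknown."""
    row = conn.execute(
        "SELECT seq_id FROM alias_example WHERE namespace = ? AND alias = ?",
        (namespace, alias),
    ).fetchone()
    return row[0] if row else None


def aliases_for(seq_id: str):
    """Return every (namespace, alias) pair recorded for a sequence id."""
    return conn.execute(
        "SELECT namespace, alias FROM alias_example "
        "WHERE seq_id = ? ORDER BY namespace, alias",
        (seq_id,),
    ).fetchall()
```

Keeping aliases in a separate, mutable table means new names can be attached to an existing sequence without touching the immutable sequence files.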

For calibration, recent releases that include 3 human genome assemblies (including patches) and full RefSeq sets (NM, NR, NP, NT, XM, and XP) consume approximately 8GB. The minimum marginal size for additional snapshots is approximately 2GB (for the sqlite database, which is not hardlinked).

For more information, see docs/design.rst.

Requirements

Reading a sequence repository requires several Python packages, all of which are available from pypi. Installation should be as simple as pip install biocommons.seqrepo.

Acquiring SeqRepo snapshots using the CLI requires an rsync binary. Note that openrsync, which now ships with new macOS installs, does not support all required functions. Mac users should install rsync from Homebrew and use the --rsync-exe option to declare its exact location.

Writing sequence files also requires bgzip, which is provided in the htslib repo. Ubuntu users should install the tabix package with sudo apt install tabix.

Development and deployments are on Ubuntu. Other systems may work but are not tested. Patches to get other systems working would be welcomed.

Quick Start

OS X

$ brew install python htslib

Ubuntu

$ sudo apt install -y python3-dev gcc zlib1g-dev tabix

All platforms

$ python -m venv venv
$ source venv/bin/activate
$ pip install seqrepo
$ sudo mkdir -p /usr/local/share/seqrepo
$ sudo chown $USER /usr/local/share/seqrepo
$ seqrepo pull -i 2024-12-20
$ seqrepo show-status -i 2024-12-20
seqrepo 0.6.12.dev7+gd311e3e.d20250730
instance directory: /usr/local/share/seqrepo/2024-12-20, 13.5 GB
backends: fastadir (schema 1), seqaliasdb (schema 1)
sequences: 1144093 sequences, 128795051613 residues, 435 files
aliases: 5865328 aliases, 5865328 current, 35 namespaces, 1144093 sequences

# Simple Pythonic interface to sequences
>>> from biocommons.seqrepo import SeqRepo
>>> sr = SeqRepo("/usr/local/share/seqrepo/2024-12-20")
>>> sr["NC_000001.11"][780000:780020]
'TGGTGGCACGCGCTTGTAGT'

# Or, use the seqrepo shell for even easier access
$ seqrepo start-shell -i 2024-12-20
In [1]: sr["NC_000001.11"][780000:780020]
Out[1]: 'TGGTGGCACGCGCTTGTAGT'

# N.B. The following output is edited for simplicity
$ seqrepo export -i 2024-12-20 | head -n100
>MD5:611ff0945aa9eeaaf9ef908d0e744cd0 SEGUID:mirLo912A/MppLuS1cUyFMduLUQ SHA1:9a2acba3dd7603f329a4bb92d5c53214c76e2d44 VMC:GS_---7nAwbv5Fs2Ml2-k3X6Zvj-6ZcjeD3 ga4gh:SQ.---7nAwbv5Fs2Ml2-k3X6Zvj-6ZcjeD3 sha512t24u:---7nAwbv5Fs2Ml2-k3X6Zvj-6ZcjeD3
MDSPLREDDSQTCARLWEAEVKRHSLEGLTVFGTAVQIHNVQRRAIRAKGTQEAQAELLCRGPRLLDRFLEDACILKEGRGTDTGQHCRGDARISSHLEA
SGTHIQLLALFLVSSSDTPPSLLRFCHALEHDIRYNSSFDSYYPLSPHSRHNDDLQTPSSHLGYIITVPDPTLPLTFASLYLGMAPCTSMGSSSMGIFQS
QRIHAFMKGKNKWDEYEGRKESWKIRSNSQTGEPTF
>MD5:8ca1247fe64b17b9d40c2112a8bfc3a2 NCBI:XM_017008743.2 SEGUID:8cijSTmR/FOL+Gtq9gf6JlbRcvY SHA1:f1c8a3493991fc538bf86b6af607fa2656d172f6 VMC:GS_---BUlBwgZN_r5wSII-WCNDd9nn1Owj4 ga4gh:SQ.---BUlBwgZN_r5wSII-WCNDd9nn1Owj4 refseq:XM_017008743.2 sha512t24u:---BUlBwgZN_r5wSII-WCNDd9nn1Owj4
ACTTATGGAAAACAGTGTGGCATATTCTGCTGAGCTTCGCCCTGGAAGAAGCCTCTTTTATACATCTCTTCAGGGAAGAGAGAAGCAATGGGCATGTTAG
TATACAATGATCACAGCCACGCAGGCCTGCAAGCTGCCTTTTGGACAGGCTGTTGACTGCCGTTCCAATTAGCTGATTGGAGAATGTGGAATGCAGAGTG
ATAATGCTGCATATCTGCTATCAGGCAGCAGCAAAGGTTTTTGTCTTGGGAAGGCAAGCTTTCCCTGCAATATTATCTCAGCAGCTCCCTAGCTGCTTAC

See Installation and Mirroring for more information.

Environment Variables

SEQREPO_LRU_CACHE_MAXSIZE sets the lru_cache maxsize for sqlite query response caching. It defaults to 1 million but can also be set to "none" to be unlimited.

SEQREPO_FD_CACHE_MAXSIZE sets the lru_cache size for file handle caching during FASTA sequence retrievals. It defaults to 0, which disables caching, but can be set to a specific value or to "none" to be unlimited. Using a moderate value (>10) will greatly increase performance of sequence retrieval.
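One way such a setting could be turned into an lru_cache size is sketched below. The helper `cache_maxsize`, the variable name `EXAMPLE_CACHE_MAXSIZE`, and the `fetch` function are hypothetical and stand in for the two cache-size variables above; SeqRepo's own parsing may differ.

```python
import os
from functools import lru_cache


def cache_maxsize(env_name: str, default):
    """Read a cache size from the environment; "none" means unlimited."""
    raw = os.environ.get(env_name)
    if raw is None:
        return default
    if raw.strip().lower() == "none":
        return None  # functools.lru_cache treats None as unbounded
    return int(raw)


# Hypothetical variable name, set here only so the example is self-contained.
os.environ["EXAMPLE_CACHE_MAXSIZE"] = "none"
maxsize = cache_maxsize("EXAMPLE_CACHE_MAXSIZE", default=1_000_000)


@lru_cache(maxsize=maxsize)
def fetch(key: str) -> str:
    """Stand-in for an expensive lookup whose results are worth caching."""
    return key.upper()
```

With `functools.lru_cache`, a maxsize of 0 stores nothing and `None` removes the bound entirely, which matches the two extremes described above.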

Developing

Developing on OS X

brew install python htslib

If you get "xcrun: error: invalid active developer path", you need to install Xcode. See this StackOverflow answer.

Developing on Ubuntu

sudo apt install -y python3-dev gcc zlib1g-dev tabix

Here's how to get started developing:

make devready
source venv/bin/activate
seqrepo --version

Code reformatting:

make reformat

Install pre-commit hook:

# included in `make devready`, not necessary for new installations
pre-commit install

Building a docker image

Docker images are available at https://hub.docker.com/r/biocommons/seqrepo. Tags correspond to the version of data, not the version of seqrepo, because the intent is to make it easy to depend on a local version of seqrepo files. Each docker image is an installation of seqrepo that downloads the corresponding version of seqrepo data. When used in conjunction with docker volumes for persistence, this provides an easy way to incorporate seqrepo data into a docker stack.

Building

cd misc/docker
make 2021-01-29.log  # builds and pushes to hub.docker.com (i.e., you need creds)

Owner

Name: biocommons
Login: biocommons
Kind: organization

Website: https://github.com/biocommons/biocommons/wiki/Welcome
Repositories: 19
Profile: https://github.com/biocommons

a collection of open source bioinformatics tools

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: SeqRepo
type: software
authors:
  - given-names: Reece K.
    family-names: Hart
  - given-names: Andreas
    family-names: Prlić
repository-code: 'https://github.com/biocommons/biocommons.seqrepo'
license: Apache-2.0

preferred-citation:
  type: article
  title: "SeqRepo: A system for managing local collections of biological sequences"
  authors:
  - family-names: "Hart"
    given-names: "Reece K."
  - family-names: "Prlić"
    given-names: "Andreas"
  doi: "10.1371/journal.pone.0239883"
  journal: "PLoS One"
  year: 2020
  month: 12
  volume: 15
  issue: 12
  start: "e0239883"

GitHub Events

Total

Create event: 16
Release event: 2
Issues event: 17
Watch event: 3
Delete event: 13
Issue comment event: 34
Push event: 39
Pull request event: 18
Pull request review event: 12
Pull request review comment event: 3
Fork event: 1

Last Year

Create event: 16
Release event: 2
Issues event: 17
Watch event: 3
Delete event: 13
Issue comment event: 34
Push event: 39
Pull request event: 18
Pull request review event: 12
Pull request review comment event: 3
Fork event: 1

Committers

Last synced: over 2 years ago

All Time

Total Commits: 374
Total Committers: 17
Avg Commits per committer: 22.0
Development Distribution Score (DDS): 0.088

Past Year

Commits: 22
Committers: 4
Avg Commits per committer: 5.5
Development Distribution Score (DDS): 0.273

Top Committers

Name	Email	Commits
Reece Hart	r**t@g**m	341
Andreas Prlic	a**c@i**m	6
Alan Rubin	a**n@w**u	4
Liang Chen	c****c	3
Andreas Prlic	a**c@g**m	3
Ben Robinson	b**n@i**m	3
Sam Pearlman	s****l	2
Teemu Vesala	t**a@b**m	2
Manuel Holtgrewe	m**e@b**e	2
Deena Blumenkrantz	d**7@g**m	1
Iskandar Pashayev	i**v@g**m	1
Jake Peacock	j**k@g**m	1
Alex Henrie	a**4@g**m	1
Dylan Reinhardt	d**t@m**m	1
Joseph Solomon	j**9@g**m	1
Marcelo Gobelli	a**o@c**t	1
Lawrence Lee	3****e	1

Committer Domains (Top 20 + Academic)

invitae.com: 2 comcast.net: 1 myome.com: 1 bih-charite.de: 1 badboysofquality.com: 1 wehi.edu.au: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 116
Total pull requests: 70
Average time to close issues: 11 months
Average time to close pull requests: about 1 month
Total issue authors: 28
Total pull request authors: 22
Average comments per issue: 1.84
Average comments per pull request: 1.19
Merged pull requests: 54
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 14
Pull requests: 18
Average time to close issues: 2 days
Average time to close pull requests: 3 days
Issue authors: 6
Pull request authors: 3
Average comments per issue: 1.36
Average comments per pull request: 0.22
Merged pull requests: 13
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

reece (62)
jsstevenson (12)
theferrit32 (6)
holtgrewe (4)
wlymanambry (3)
ahwagner (2)
andreasprlic (2)
jfreidin (2)
chenliangomc (1)
ok-gitr (1)
dylan-myome (1)
gromdimon (1)
quinnwai (1)
mbutterfield-GENOME (1)
git4waki (1)

Pull Request Authors

jsstevenson (42)
reece (15)
holtgrewe (6)
theferrit32 (3)
andreasprlic (3)
Deena-B (3)
decareano (2)
kazmiekr (2)
afrubin (2)
chenliangomc (2)
andreas-invitae (1)
davmlaw (1)
teemuvesala (1)
sampearl (1)
korikuzma (1)

Top Labels

Issue Labels

bug (16) enhancement (10) keep alive (7) stale (5) wontfix (4) closed-by-stale (4) question (2) documentation (2) abandoned (1) breaking change (1) help wanted (1) project proposal (1)

Pull Request Labels

stale (2) documentation (1) bug (1)

Packages

Total packages: 2
Total downloads:
- pypi 57,756 last-month
Total docker downloads: 15

Total dependent packages: 10
(may contain duplicates)
Total dependent repositories: 22
(may contain duplicates)
Total versions: 49
Total maintainers: 3

pypi.org: biocommons.seqrepo

Non-redundant, compressed, journalled, file-based storage for biological sequences

Homepage: https://github.com/biocommons/biocommons.seqrepo
Documentation: https://biocommons.seqrepo.readthedocs.io/
License: Apache Software License
Latest release: 0.6.11
published 11 months ago

Versions: 48
Dependent Packages: 10
Dependent Repositories: 21
Downloads: 56,737 Last month
Docker Downloads: 15

Rankings

Dependent packages count: 1.1%

Downloads: 1.5%

Dependent repos count: 3.2%

Docker downloads count: 3.8%

Average: 4.6%

Forks count: 6.8%

Stargazers count: 11.0%

Maintainers (3)

jsstevenson korikuzma reece

Last synced: 6 months ago

pypi.org: seqrepo

alias for the biocommons.seqrepo package

Homepage: https://github.com/biocommons/biocommons.seqrepo
Documentation: https://seqrepo.readthedocs.io/
License: apache-2.0
Latest release: 0.0.0
published over 7 years ago

Versions: 1
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 1,019 Last month

Rankings

Downloads: 6.1%

Forks count: 6.8%

Dependent packages count: 10.0%

Stargazers count: 10.9%

Average: 11.1%

Dependent repos count: 21.7%

Maintainers (1)

reece

Last synced: 6 months ago

Dependencies

.github/workflows/labels.yml actions

actions/checkout v3 composite
crazy-max/ghaction-github-labeler v4 composite

.github/workflows/python-package.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite
pypa/gh-action-pypi-publish release/v1 composite

.github/workflows/stale.yml actions

actions/stale v8 composite

pyproject.toml pypi

setup.py pypi
