https://github.com/cogent3/cogent3-h5seqs

h5py driver for cogent3 unaligned and aligned seqs data

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.9%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: cogent3
  • License: bsd-3-clause
  • Language: Python
  • Default Branch: develop
  • Size: 193 KB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 1
  • Open Issues: 0
  • Releases: 8
Created about 1 year ago · Last pushed 10 months ago
Metadata Files
  • Readme
  • License

README.md

cogent3-h5seqs: an HDF5 storage driver for cogent3 sequence collections

cogent3-h5seqs is a sequence storage plug-in for cogent3. It uses HDF5 as the storage format for biological sequences, supporting both unaligned sequence collections and alignments. Storage can be in memory (the default) or on disk, and sequences are compressed using the lzf compression engine.

The advantage of HDF5 is that once primary sequence formats have been converted from text into numpy arrays, loading and manipulating sequence data is fast and very memory efficient.

Sequences are stored under the hexdigest of their xxhash.hash64(). Duplicated sequences are therefore stored only once, and a separate mapping of sequence names to hexdigests is stored alongside them.
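
The deduplication idea can be illustrated with a small, self-contained sketch. This is not the package's implementation: the `DedupStore` class and its methods are hypothetical, plain Python dicts stand in for the HDF5 file, and `hashlib.blake2b` with an 8-byte digest stands in for the 64-bit xxhash so that the example needs only the standard library.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        self._data = {}   # hexdigest -> sequence bytes, one entry per distinct sequence
        self._names = {}  # sequence name -> hexdigest

    @staticmethod
    def _digest(seq: bytes) -> str:
        # Assumption: blake2b (8-byte digest) is a stand-in for a 64-bit xxhash.
        return hashlib.blake2b(seq, digest_size=8).hexdigest()

    def add(self, name: str, seq: str) -> None:
        raw = seq.encode("ascii")
        key = self._digest(raw)
        # Only store the data if this digest has not been seen before.
        self._data.setdefault(key, raw)
        self._names[name] = key

    def get(self, name: str) -> str:
        return self._data[self._names[name]].decode("ascii")

    def num_stored(self) -> int:
        return len(self._data)


store = DedupStore()
store.add("human", "ACGGT")
store.add("chimp", "ACGGT")  # identical to "human", so no new data is stored
store.add("mouse", "ACGTT")
```

Three names are registered, but only two distinct sequences are held, which is the saving the name-to-hexdigest mapping makes possible.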

Installation

pip install cogent3-h5seqs

Usage

Making cogent3-h5seqs the default storage

Using cogent3.set_storage_defaults(), you can set cogent3-h5seqs as the default storage. This means whenever a sequence collection is loaded from disk or created in memory, it will use the storage within this package.

The following statement makes cogent3-h5seqs the default for both unaligned and aligned sequence collections.

```python
import cogent3

cogent3.set_storage_defaults(unaligned_seqs="h5seqs_unaligned", aligned_seqs="h5seqs_aligned")
```

You can undo this setting with:

```python
cogent3.set_storage_defaults(unaligned_seqs=None, aligned_seqs=None)
```

Using cogent3-h5seqs as storage per object

You don't have to make this storage the default for all instances; you can instead select it on a per-object basis.

```python
coll = cogent3.load_unaligned_seqs(some_path, moltype="dna", storage_backend="h5seqs_unaligned")
```

Or, for alignments:

```python
aln = cogent3.load_aligned_seqs(some_path, moltype="dna", storage_backend="h5seqs_aligned")
```

The same values can also be provided to the make_unaligned_seqs() and make_aligned_seqs() functions in cogent3.

Note: You can turn off compression with compression=False. This will speed up operations.

Saving storage to disk

cogent3-h5seqs supports writing to disk, using the filename suffix .c3h5u for unaligned sequences and .c3h5a for aligned sequences. This works whether or not your current object is using cogent3-h5seqs for storage. For example:

```python
import cogent3

sample_aln = cogent3.get_dataset("brca1")  # using the cogent3 builtin storage
outpath = "~/Desktop/alignment_output.c3h5a"
sample_aln.write(outpath)  # writes out as cogent3-h5seqs HDF5 storage
```

For a sequence collection, do the following.

```python
sample_coll = cogent3.get_dataset("brca1").degap()

# Note the different suffix
outpath = "~/Desktop/alignment_output.c3h5u"
sample_coll.write(outpath)  # writes out as cogent3-h5seqs HDF5 storage
```

Loading storage from disk

cogent3 selects cogent3-h5seqs for loading based on the filename suffix.

```python
inpath = "~/Desktop/alignment_output.c3h5u"
sample_coll = cogent3.load_unaligned_seqs(inpath, moltype="dna")
```

Note: You cannot write an alignment instance to an unaligned storage type, or vice versa. Nor can you load a file of one type as the other.
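
The suffix rule and the restriction on mixing types can be sketched as follows. This is only an illustration of the behaviour described here, not how cogent3 resolves storage internally: the `backend_for()` helper and its `expect_aligned` parameter are hypothetical, while the suffixes and backend names are the ones given earlier in this README.

```python
from pathlib import Path

# Suffixes and backend names as documented in this README.
SUFFIX_TO_BACKEND = {
    ".c3h5u": "h5seqs_unaligned",
    ".c3h5a": "h5seqs_aligned",
}


def backend_for(path, expect_aligned):
    """Return the backend name implied by the suffix of path (hypothetical helper).

    Raises ValueError for an unrecognised suffix and TypeError when the
    suffix does not match the kind of object being read or written.
    """
    suffix = Path(path).suffix
    backend = SUFFIX_TO_BACKEND.get(suffix)
    if backend is None:
        raise ValueError(f"unrecognised suffix {suffix!r}")
    if (backend == "h5seqs_aligned") != expect_aligned:
        kind = "aligned" if expect_aligned else "unaligned"
        raise TypeError(f"{suffix!r} cannot be used for {kind} data")
    return backend


# An unaligned collection pairs with the .c3h5u suffix.
chosen = backend_for("~/Desktop/alignment_output.c3h5u", expect_aligned=False)
```

Calling `backend_for()` with a `.c3h5u` path and `expect_aligned=True` raises `TypeError`, mirroring the rule that the two storage types are not interchangeable.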

Owner

  • Name: cogent3
  • Login: cogent3
  • Kind: organization

GitHub Events

Total
  • Release event: 10
  • Delete event: 5
  • Push event: 22
  • Pull request event: 50
  • Fork event: 1
  • Create event: 13
Last Year
  • Release event: 10
  • Delete event: 5
  • Push event: 22
  • Pull request event: 50
  • Fork event: 1
  • Create event: 13

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 0
  • Total pull requests: 9
  • Average time to close issues: N/A
  • Average time to close pull requests: 2 days
  • Total issue authors: 0
  • Total pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 2
Past Year
  • Issues: 0
  • Pull requests: 9
  • Average time to close issues: N/A
  • Average time to close pull requests: 2 days
  • Issue authors: 0
  • Pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 2
Top Authors
Issue Authors
Pull Request Authors
  • GavinHuttley (25)
  • dependabot[bot] (2)
Top Labels
Issue Labels
Pull Request Labels
  • dependencies (2)
  • github_actions (2)