https://github.com/cogent3/cogent3-h5seqs
h5py driver for cogent3 unaligned and aligned seqs data
Repository
Basic Info
- Host: GitHub
- Owner: cogent3
- License: bsd-3-clause
- Language: Python
- Default Branch: develop
- Size: 193 KB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
- Releases: 8
Metadata Files
README.md
cogent3-h5seqs: a HDF5 storage driver for cogent3 sequence collections
cogent3-h5seqs is a sequence storage plug-in for cogent3. It uses HDF5 as the storage format for biological sequences, supporting both unaligned sequence collections and alignments. Storage can be in memory (the default) or on disk, and sequences are compressed using the lzf compression engine.
The advantage of HDF5 is that once primary sequence formats have been converted from text into numpy arrays, loading and manipulating sequence data is fast and very memory efficient.
Sequences are stored under the hexdigest of their xxhash.hash64(). This means duplicated sequences are stored only once. The mapping of sequence names to hexdigests is stored as well.
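The deduplication idea can be illustrated with a small, self-contained sketch. This is not the package's implementation: `DedupStore` and `seq_digest` are hypothetical names, and the standard library's `hashlib.blake2b` with an 8-byte digest is used as a stand-in for the xxhash-based 64-bit digest so that the example has no third-party dependencies.

```python
import hashlib


def seq_digest(seq: bytes) -> str:
    """Return a 64-bit hex digest (a stand-in for the xxhash-based digest)."""
    return hashlib.blake2b(seq, digest_size=8).hexdigest()


class DedupStore:
    """Toy in-memory store (hypothetical): data keyed by digest, names mapped to digests."""

    def __init__(self) -> None:
        self.data: dict[str, bytes] = {}
        self.name_to_digest: dict[str, str] = {}

    def add(self, name: str, seq: str) -> None:
        raw = seq.encode("ascii")
        digest = seq_digest(raw)
        # Identical sequences share a digest, so the data is written only once.
        self.data.setdefault(digest, raw)
        self.name_to_digest[name] = digest

    def get(self, name: str) -> str:
        return self.data[self.name_to_digest[name]].decode("ascii")


store = DedupStore()
store.add("human", "ACGTTGCA")
store.add("chimp", "ACGTTGCA")  # duplicate of "human", no new data entry
store.add("mouse", "ACGTAGCA")
```

After these three calls the store holds three names but only two sequence records, since "human" and "chimp" resolve to the same digest.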
Installation
pip install cogent3-h5seqs
Usage
Making cogent3-h5seqs the default storage
Using cogent3.set_storage_defaults(), you can set cogent3-h5seqs as the default storage. This means whenever a sequence collection is loaded from disk or created in memory, it will use the storage within this package.
The following statement makes cogent3-h5seqs the default for both unaligned and aligned sequence collections.
```python
import cogent3

cogent3.set_storage_defaults(unaligned_seqs="h5seqs_unaligned", aligned_seqs="h5seqs_aligned")
```
You can undo this setting with

```python
cogent3.set_storage_defaults(unaligned_seqs=None, aligned_seqs=None)
```
Using cogent3-h5seqs as storage per object
You don't have to make cogent3-h5seqs the default for all instances; you can select it on a per-object basis.
```python
coll = cogent3.load_unaligned_seqs(some_path,
                                   moltype="dna",
                                   storage_backend="h5seqs_unaligned")
```
or, for alignments.
```python
aln = cogent3.load_aligned_seqs(some_path,
                                moltype="dna",
                                storage_backend="h5seqs_aligned")
```
The same values can also be provided to the make_unaligned_seqs() and make_aligned_seqs() functions in cogent3.
Note: You can turn off compression with compression=False. This will speed up operations.
Saving storage to disk
cogent3-h5seqs supports writing to disk, and employs the filename suffix .c3h5u for unaligned sequences and .c3h5a for aligned sequences. This will work whether your current object is using cogent3-h5seqs for storage or not. For example
```python
import cogent3

sample_aln = cogent3.get_dataset("brca1")  # using the cogent3 builtin storage
outpath = "~/Desktop/alignment_output.c3h5a"
sample_aln.write(outpath)  # writes out as cogent3-h5seqs HDF5 storage
```
For a sequence collection, do the following.
```python
sample_coll = cogent3.get_dataset("brca1").degap()
# Note the different suffix
outpath = "~/Desktop/alignment_output.c3h5u"
sample_coll.write(outpath)  # writes out as cogent3-h5seqs HDF5 storage
```
Loading storage from disk
cogent3 correctly directs to cogent3-h5seqs for loading based on the filename suffix.
```python
inpath = "~/Desktop/alignment_output.c3h5u"
sample_coll = cogent3.load_unaligned_seqs(inpath, moltype="dna")
```
Note: You cannot write an alignment instance to an unaligned storage type, or vice versa. Nor can you read a file of one type into an object of the other type.
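The suffix-based dispatch and the aligned/unaligned restriction can be sketched as follows. This is only an illustration of the rule, not how cogent3 resolves storage internally: the `backend_for` helper and the `SUFFIX_TO_BACKEND` mapping are hypothetical, while the suffixes and backend names are the ones given above.

```python
from pathlib import Path

# Suffixes and backend names come from this README; the mapping itself is hypothetical.
SUFFIX_TO_BACKEND = {
    ".c3h5u": "h5seqs_unaligned",
    ".c3h5a": "h5seqs_aligned",
}


def backend_for(path: str, aligned: bool) -> str:
    """Pick a backend name from the filename suffix, rejecting mismatched types."""
    suffix = Path(path).suffix
    backend = SUFFIX_TO_BACKEND.get(suffix)
    if backend is None:
        raise ValueError(f"unrecognised suffix {suffix!r}")
    expected = "h5seqs_aligned" if aligned else "h5seqs_unaligned"
    if backend != expected:
        # e.g. an alignment pointed at a .c3h5u path
        raise TypeError(f"{suffix!r} storage cannot hold this type of data")
    return backend


unaligned_backend = backend_for("~/Desktop/alignment_output.c3h5u", aligned=False)
aligned_backend = backend_for("~/Desktop/alignment_output.c3h5a", aligned=True)
```

Calling `backend_for("~/Desktop/alignment_output.c3h5u", aligned=True)` raises `TypeError` in this sketch, mirroring the restriction described in the note.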
Owner
- Name: cogent3
- Login: cogent3
- Kind: organization
- Repositories: 5
- Profile: https://github.com/cogent3
GitHub Events
Total
- Release event: 10
- Delete event: 5
- Push event: 22
- Pull request event: 50
- Fork event: 1
- Create event: 13
Last Year
- Release event: 10
- Delete event: 5
- Push event: 22
- Pull request event: 50
- Fork event: 1
- Create event: 13
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 0
- Total pull requests: 9
- Average time to close issues: N/A
- Average time to close pull requests: 2 days
- Total issue authors: 0
- Total pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 8
- Bot issues: 0
- Bot pull requests: 2
Past Year
- Issues: 0
- Pull requests: 9
- Average time to close issues: N/A
- Average time to close pull requests: 2 days
- Issue authors: 0
- Pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 8
- Bot issues: 0
- Bot pull requests: 2
Top Authors
Issue Authors
Pull Request Authors
- GavinHuttley (25)
- dependabot[bot] (2)