nthash-avx512
Various test improvements upon mirounga's AVX2 and AVX512 ports of ntHash2 in 2018
Repository
Basic Info
Statistics
- Stars: 5
- Watchers: 4
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
ntHash2 AVX512
This is a bug-fixed version of the AVX2 and AVX512 ports of ntHash2, with expanded tests. It also adds a 32-bit scalar ntHash.
In terms of correctness, the 64-bit hash versions (64-bit is the default hash type in ntHash 1 and 2) of the scalar, AVX2 and AVX512 implementations all agree with one another. The 32-bit AVX2 and AVX512 ports agree with each other, as they implement an unusual 31-bit ntHash2, but they do not agree with the 32-bit scalar version, which implements ntHash1.
This has not been merged into the original ntHash repository because it is based on an old ntHash codebase.
See https://github.com/bcgsc/ntHash/pull/9 for initial version.
ntHash
ntHash is a recursive hash function for hashing all possible k-mers in a DNA/RNA sequence.
Build the test suite
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
To install nttest in a specified directory:
$ ./autogen.sh
$ ./configure --prefix=/opt/ntHash/
$ make
$ make install
The nttest suite provides options for runtime and uniformity tests.
Runtime test
For the runtime test the program has the following options:
nttest [OPTIONS] ... [FILE]
Parameters:
* -k, --kmer=SIZE: the length of k-mer used for runtime test hashing [50]
* -h, --hash=SIZE: the number of generated hashes for each k-mer [1]
* FILE: the input FASTA or FASTQ file
For example, to evaluate the runtime of different hash methods on the test file reads.fa in the DATA/ folder for k-mer length 50, run:
$ nttest -k50 reads.fa
Uniformity test
For the uniformity test using the Bloom filter data structure the program has the following options:
nttest --uniformity [OPTIONS] ... [REF_FILE] [QUERY_FILE]
Parameters:
* -q, --qnum=SIZE: number of queries in query file
* -l, --qlen=SIZE: length of reads in query file
* -t, --tnum=SIZE: number of sequences in reference file
* -g, --tlen=SIZE: length of reference sequence
* -i, --input: generate random query and reference files
* -j, --threads=SIZE: number of threads to run uniformity test [1]
* REF_FILE: the reference file name
* QUERY_FILE: the query file name
For example, to evaluate the uniformity of different hash methods using the Bloom filter data structure on randomly generated data sets with the following options:
* 100 genes of length 5,000,000bp as reference in file genes.fa
* 4,000,000 reads of length 250bp as query in file reads.fa
* 12 threads
run:
$ nttest --uniformity --input -q4000000 -l250 -t100 -g5000000 -j12 genes.fa reads.fa
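The idea behind a Bloom-filter-based uniformity test can be sketched in a few lines: if the hash values are close to uniform, the observed false-positive rate of the filter should be close to the theoretical rate (1 - e^(-hn/m))^h for n inserted keys, m bits and h hashes. The sketch below is not the nttest implementation. `BloomFilter`, `mix64`, `expectedFpr` and `observedFpr` are hypothetical names, and `mix64` is a generic stand-in mixer rather than ntHash.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in 64-bit mixer (splitmix64-style finalizer). This is NOT ntHash;
// it only plays the role of "the hash function under test".
inline uint64_t mix64(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Minimal Bloom filter over 64-bit keys (hypothetical helper class).
class BloomFilter {
public:
    BloomFilter(size_t bits, unsigned hashes) : bits_(bits, false), hashes_(hashes) {
        assert(bits > 0 && hashes > 0);
    }
    void insert(uint64_t key) {
        for (unsigned i = 0; i < hashes_; ++i) bits_[index(key, i)] = true;
    }
    bool contains(uint64_t key) const {
        for (unsigned i = 0; i < hashes_; ++i)
            if (!bits_[index(key, i)]) return false;
        return true;
    }
private:
    // i-th bit position for a key: re-mix the key with the hash index folded in.
    size_t index(uint64_t key, unsigned i) const {
        return static_cast<size_t>(mix64(key ^ mix64(i + 1)) % bits_.size());
    }
    std::vector<bool> bits_;
    unsigned hashes_;
};

// Theoretical false-positive rate for n keys, m bits and h hash functions.
inline double expectedFpr(size_t n, size_t m, unsigned h) {
    return std::pow(1.0 - std::exp(-double(h) * double(n) / double(m)), double(h));
}

// Insert keys 0..n-1, query keys n..n+q-1 (none of which were inserted),
// and return the observed fraction of false positives.
inline double observedFpr(size_t n, size_t q, size_t m, unsigned h) {
    BloomFilter bf(m, h);
    for (uint64_t key = 0; key < n; ++key) bf.insert(key);
    size_t fp = 0;
    for (uint64_t key = n; key < n + q; ++key) fp += bf.contains(key) ? 1 : 0;
    return double(fp) / double(q);
}
```

Comparing `observedFpr(n, q, m, h)` with `expectedFpr(n, m, h)` gives a rough uniformity signal: a hash with poor distribution tends to produce a noticeably higher observed rate.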
Code samples
To hash all k-mers of length k in a given sequence seq:
string kmer = seq.substr(0, k);
uint64_t hVal = 0;
hVal = NT64(kmer.c_str(), k); // initial hash value
...
for (size_t i = 0; i < seq.length() - k; i++)
{
    hVal = NT64(hVal, seq[i], seq[i+k], k); // consecutive hash values
    ...
}
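To illustrate what "recursive" means here, the following is a minimal, self-contained sketch of a cyclic-polynomial rolling hash in the style described in the ntHash paper: the first k-mer is hashed from scratch, and every later k-mer is derived from the previous hash in constant time. It is not the code in this repository. The function names (`baseSeed`, `rol`, `hashDirect`, `hashRoll`, `hashAllKmers`) are hypothetical, and the per-base constants are fixed 64-bit values chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One fixed 64-bit seed per nucleotide (illustrative constants; assumption).
inline uint64_t baseSeed(char c) {
    switch (c) {
        case 'A': return 0x3c8bfbb395c60474ULL;
        case 'C': return 0x3193c18562a02b4cULL;
        case 'G': return 0x20323ed082572324ULL;
        case 'T': return 0x295549f54be24456ULL;
        default:  return 0; // non-ACGT characters contribute nothing in this sketch
    }
}

// Rotate a 64-bit word left by r bits; r is reduced modulo 64.
inline uint64_t rol(uint64_t x, unsigned r) {
    r &= 63;
    return r == 0 ? x : (x << r) | (x >> (64 - r));
}

// Hash the k-mer starting at pos from scratch: each base's seed ends up
// rotated by its distance from the end of the k-mer, and all are XORed.
inline uint64_t hashDirect(const std::string& s, size_t pos, unsigned k) {
    assert(pos + k <= s.size());
    uint64_t h = 0;
    for (unsigned i = 0; i < k; ++i)
        h = rol(h, 1) ^ baseSeed(s[pos + i]);
    return h;
}

// Slide the window one base to the right in O(1): rotate the old hash,
// cancel the outgoing base (now rotated k times), add the incoming base.
inline uint64_t hashRoll(uint64_t h, char out, char in, unsigned k) {
    return rol(h, 1) ^ rol(baseSeed(out), k) ^ baseSeed(in);
}

// Hash every k-mer of seq; returns an empty vector if seq is shorter than k.
inline std::vector<uint64_t> hashAllKmers(const std::string& seq, unsigned k) {
    std::vector<uint64_t> hashes;
    if (k == 0 || seq.size() < k) return hashes;
    uint64_t h = hashDirect(seq, 0, k);
    hashes.push_back(h);
    for (size_t i = 0; i + k < seq.size(); ++i) {
        h = hashRoll(h, seq[i], seq[i + k], k);
        hashes.push_back(h);
    }
    return hashes;
}
```

The loop condition `i + k < seq.size()` is equivalent to the `i < seq.length() - k` form used with NT64, but it avoids unsigned underflow when the sequence is shorter than k.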
To compute canonical hash values for all k-mers of length k in a given sequence seq:
string kmer = seq.substr(0, k);
uint64_t hVal, fhVal=0, rhVal=0; // canonical, forward, and reverse-strand hash values
hVal = NTC64(kmer.c_str(), k, fhVal, rhVal); // initial hash value
...
for (size_t i = 0; i < seq.length() - k; i++)
{
    hVal = NTC64(seq[i], seq[i+k], k, fhVal, rhVal); // consecutive hash values
    ...
}
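As a sketch of what a canonical hash provides, the example below hashes a k-mer and its reverse complement and keeps the smaller value, so both strands of the same k-mer map to the same hash. This assumes the canonical value is defined as the minimum of the forward and reverse-strand hashes. The helpers (`forwardHash`, `reverseComplement`, `canonicalHash`) and the seed constants are hypothetical, and the hash is computed from scratch rather than rolled, to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

// One fixed 64-bit seed per nucleotide (illustrative constants; assumption).
inline uint64_t baseSeed(char c) {
    switch (c) {
        case 'A': return 0x3c8bfbb395c60474ULL;
        case 'C': return 0x3193c18562a02b4cULL;
        case 'G': return 0x20323ed082572324ULL;
        case 'T': return 0x295549f54be24456ULL;
        default:  return 0;
    }
}

// Watson-Crick complement; other characters are returned unchanged.
inline char complement(char c) {
    switch (c) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return c;
    }
}

// Rotate a 64-bit word left by r bits; r is reduced modulo 64.
inline uint64_t rol(uint64_t x, unsigned r) {
    r &= 63;
    return r == 0 ? x : (x << r) | (x >> (64 - r));
}

// Forward-strand hash of a whole k-mer, computed from scratch.
inline uint64_t forwardHash(const std::string& kmer) {
    uint64_t h = 0;
    for (char c : kmer) h = rol(h, 1) ^ baseSeed(c);
    return h;
}

// Reverse complement of a DNA string.
inline std::string reverseComplement(const std::string& s) {
    std::string rc(s.rbegin(), s.rend());
    for (char& c : rc) c = complement(c);
    return rc;
}

// Canonical hash: the smaller of the two strand hashes, so a k-mer and its
// reverse complement always receive the same value.
inline uint64_t canonicalHash(const std::string& kmer) {
    return std::min(forwardHash(kmer), forwardHash(reverseComplement(kmer)));
}
```

With this definition, `canonicalHash("GATTACA")` and `canonicalHash("TGTAATC")` are equal, which is the property that makes canonical hashing useful when the strand of a read is unknown.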
To multi-hash with h hash values all k-mers of length k in a given sequence seq:
string kmer = seq.substr(0, k);
uint64_t hVec[h];
NTM64(kmer.c_str(), k, h, hVec); // initial hash vector
...
for (size_t i = 0; i < seq.length() - k; i++)
{
    NTM64(seq[i], seq[i+k], k, h, hVec); // consecutive hash vectors
    ...
}
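One inexpensive way to obtain h hash values per k-mer is to compute a single base hash and derive the others from it with a multiply and xor-shift step, instead of rolling h independent hashes. The sketch below shows that general technique only. `multiHash`, the multiplier and the shift amount are assumptions made for illustration and are not taken from this repository.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Derive h hash values from one base hash of a k-mer of length k.
// out[0] is the base hash itself; out[i] for i >= 1 is a scrambled copy.
// The multiplier and shift are illustrative constants (assumption).
inline std::vector<uint64_t> multiHash(uint64_t base, unsigned k, unsigned h) {
    const uint64_t multiSeed = 0x90b45d39fb6da1faULL;
    const unsigned multiShift = 27;
    std::vector<uint64_t> out(h);
    if (h == 0) return out;
    out[0] = base;
    for (unsigned i = 1; i < h; ++i) {
        // Multiply by a value that depends on both the hash index and k,
        // then fold the high bits back into the low bits.
        uint64_t t = base * (i ^ (k * multiSeed));
        t ^= t >> multiShift;
        out[i] = t;
    }
    return out;
}
```

Returning a `std::vector` also sidesteps the variable-length array `uint64_t hVec[h]`, which is a compiler extension rather than standard C++.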
ntHashIterator
Enables ntHash on sequences
To hash all k-mers of length k in a given sequence seq with h hash values using ntHashIterator:
ntHashIterator itr(seq, h, k);
while (itr != itr.end())
{
    ... use *itr ...
    ++itr;
}
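The iterator pattern above can be approximated with a small self-contained class that owns the sequence, keeps the current rolling hash, and advances one k-mer per increment. This is only a sketch of the pattern, not the ntHashIterator implementation: `KmerHashCursor` and its `done()` method are hypothetical, it produces a single hash per k-mer, and the seed constants are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// One fixed 64-bit seed per nucleotide (illustrative constants; assumption).
inline uint64_t baseSeed(char c) {
    switch (c) {
        case 'A': return 0x3c8bfbb395c60474ULL;
        case 'C': return 0x3193c18562a02b4cULL;
        case 'G': return 0x20323ed082572324ULL;
        case 'T': return 0x295549f54be24456ULL;
        default:  return 0;
    }
}

// Rotate a 64-bit word left by r bits; r is reduced modulo 64.
inline uint64_t rol(uint64_t x, unsigned r) {
    r &= 63;
    return r == 0 ? x : (x << r) | (x >> (64 - r));
}

// Hypothetical cursor over the hashes of all k-mers of a sequence.
class KmerHashCursor {
public:
    KmerHashCursor(const std::string& seq, unsigned k)
        : seq_(seq), k_(k), pos_(0), hash_(0), done_(k == 0 || seq.size() < k) {
        if (done_) return;
        // Hash the first k-mer from scratch.
        for (unsigned i = 0; i < k_; ++i) hash_ = rol(hash_, 1) ^ baseSeed(seq_[i]);
    }
    bool done() const { return done_; }          // true once all k-mers are consumed
    uint64_t operator*() const { return hash_; } // hash of the current k-mer
    size_t pos() const { return pos_; }          // start index of the current k-mer
    KmerHashCursor& operator++() {
        assert(!done_);
        if (pos_ + k_ >= seq_.size()) { done_ = true; return *this; }
        // Roll: drop seq_[pos_], append seq_[pos_ + k_].
        hash_ = rol(hash_, 1) ^ rol(baseSeed(seq_[pos_]), k_) ^ baseSeed(seq_[pos_ + k_]);
        ++pos_;
        return *this;
    }
private:
    std::string seq_; // owned copy, so temporaries can be passed safely
    unsigned k_;
    size_t pos_;
    uint64_t hash_;
    bool done_;
};
```

A typical loop would be `for (KmerHashCursor it(seq, k); !it.done(); ++it) { /* use *it */ }`, which visits `seq.size() - k + 1` k-mers when the sequence is at least k bases long.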
Usage example (C++)
Outputting hash values of all k-mers in a sequence
```C++
#include <iostream>
#include <string>
#include "ntHashIterator.hpp"

int main(int argc, const char* argv[])
{
    /* test sequence */
    std::string seq = "GAGTGTCAAACATTCAGACAACAGCAGGGGTGCTCTGGAATCCTATGTGAGGAACAAACATTCAGGCCACAGTAG";

    /* k is the k-mer length */
    unsigned k = 70;

    /* h is the number of hashes for each k-mer */
    unsigned h = 1;

    /* init ntHash state and compute hash values for first k-mer */
    ntHashIterator itr(seq, h, k);
    while (itr != itr.end()) {
        std::cout << (*itr)[0] << std::endl;
        ++itr;
    }

    return 0;
}
```
Publications
ntHash
Hamid Mohamadi, Justin Chu, Benjamin P Vandervalk, and Inanc Birol. ntHash: recursive nucleotide hashing. Bioinformatics (2016) 32 (22): 3492-3494. doi:10.1093/bioinformatics/btw397
Owner
- Name: Rayan Chikhi
- Login: rchikhi
- Kind: user
- Location: France
- Company: CNRS
- Website: http://rayan.chikhi.name
- Repositories: 15
- Profile: https://github.com/rchikhi
Citation (CITATION.bib)
@article{doi:10.1093/bioinformatics/btw397,
author = {Mohamadi, Hamid and Chu, Justin and Vandervalk, Benjamin P. and Birol, Inanc},
title = {ntHash: recursive nucleotide hashing},
journal = {Bioinformatics},
volume = {32},
number = {22},
pages = {3492--3494},
year = {2016},
doi = {10.1093/bioinformatics/btw397},
URL = {http://dx.doi.org/10.1093/bioinformatics/btw397},
eprint = {/oup/backfile/Content_public/Journal/bioinformatics/32/22/10.1093_bioinformatics_btw397/3/btw397.pdf}
}