nthash-avx512

Various test improvements upon mirounga's AVX2 and AVX512 ports of ntHash2 in 2018

https://github.com/rchikhi/nthash-avx512

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 4 DOI reference(s) in README
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (9.0%) to scientific vocabulary

Last synced: 10 months ago

Repository


Basic Info

Host: GitHub
Owner: rchikhi
License: mit
Language: C++
Default Branch: rs_avx
Homepage:
Size: 11.5 MB

Statistics

Stars: 5
Watchers: 4
Forks: 1
Open Issues: 0
Releases: 0

Created over 3 years ago · Last pushed about 1 year ago

Metadata Files

Readme Changelog License Citation

ntHash2 AVX512

This is a bug-fixed version of the AVX2 and AVX512 ports of ntHash2, with expanded tests. It also adds a 32-bit scalar ntHash.

In terms of correctness, the 64-bit hash versions (64-bit is the default hash type in ntHash 1 and 2) of the scalar, AVX2, and AVX512 implementations all agree with one another. The 32-bit AVX2 and AVX512 ports agree with each other, as they implement an unusual 31-bit ntHash2, but they do not agree with the 32-bit scalar version, which implements ntHash1.

This has not been merged into the original ntHash repository because it is based on an old ntHash codebase.

See https://github.com/bcgsc/ntHash/pull/9 for the initial version.

ntHash

ntHash is a recursive hash function for hashing all possible k-mers in a DNA/RNA sequence.
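To make "recursive" concrete, here is a minimal sketch of the rolling update described in the ntHash paper: the hash of the next k-mer is obtained from the previous one by a 1-bit rotation, XORing out the departing base, and XORing in the arriving base. The function names (`rotl64`, `base_seed`, `direct_hash`, `roll_hash`, `hash_all_kmers`) and the seed constants are placeholders chosen for illustration; they are not the identifiers or constants used in this repository.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Rotate a 64-bit word left by r bits (r is taken modulo 64).
inline uint64_t rotl64(uint64_t x, unsigned r) {
    r &= 63u;
    return r == 0 ? x : (x << r) | (x >> (64 - r));
}

// Illustrative per-base seeds (placeholders, NOT the library's constants).
inline uint64_t base_seed(char c) {
    switch (c) {
        case 'A': return 0x9E3779B97F4A7C15ULL;
        case 'C': return 0xC2B2AE3D27D4EB4FULL;
        case 'G': return 0x165667B19E3779F9ULL;
        case 'T': return 0x27D4EB2F165667C5ULL;
        default:  return 0;
    }
}

// Hash the k-mer starting at pos from scratch:
// XOR of each base seed rotated by its distance from the k-mer's end.
inline uint64_t direct_hash(const std::string& s, std::size_t pos, unsigned k) {
    uint64_t h = 0;
    for (unsigned j = 0; j < k; ++j)
        h ^= rotl64(base_seed(s[pos + j]), k - 1 - j);
    return h;
}

// Recursive step: derive the next k-mer's hash from the previous one.
inline uint64_t roll_hash(uint64_t h, char out, char in, unsigned k) {
    return rotl64(h, 1) ^ rotl64(base_seed(out), k) ^ base_seed(in);
}

// Hash every k-mer of s; only the first one is computed from scratch.
inline std::vector<uint64_t> hash_all_kmers(const std::string& s, unsigned k) {
    std::vector<uint64_t> hashes;
    if (k == 0 || s.size() < k) return hashes;
    uint64_t h = direct_hash(s, 0, k);
    hashes.push_back(h);
    for (std::size_t i = 0; i + k < s.size(); ++i) {
        h = roll_hash(h, s[i], s[i + k], k);
        hashes.push_back(h);
    }
    return hashes;
}
```

Each rolled value should equal the value computed from scratch for the same window, which is a convenient property to check when comparing scalar and vectorized implementations.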

Build the test suite

    $ ./autogen.sh
    $ ./configure
    $ make
    $ sudo make install

To install nttest in a specified directory:

    $ ./autogen.sh
    $ ./configure --prefix=/opt/ntHash/
    $ make
    $ make install

The nttest suite has options for runtime and uniformity tests.

Runtime test

For the runtime test, the program has the following options:

    nttest [OPTIONS] ... [FILE]

Parameters:

* -k, --kmer=SIZE: the length of k-mer used for runtime test hashing [50]
* -h, --hash=SIZE: the number of generated hashes for each k-mer [1]
* FILE: the input FASTA or FASTQ file

For example, to evaluate the runtime of different hash methods on the test file reads.fa in the DATA/ folder for k-mer length 50, run:

    $ nttest -k50 reads.fa

Uniformity test

For the uniformity test using the Bloom filter data structure, the program has the following options:

    nttest --uniformity [OPTIONS] ... [REF_FILE] [QUERY_FILE]

Parameters:

* -q, --qnum=SIZE: number of queries in query file
* -l, --qlen=SIZE: length of reads in query file
* -t, --tnum=SIZE: number of sequences in reference file
* -g, --tlen=SIZE: length of reference sequence
* -i, --input: generate random query and reference files
* -j, --threads=SIZE: number of threads to run uniformity test [1]
* REF_FILE: the reference file name
* QUERY_FILE: the query file name

For example, to evaluate the uniformity of different hash methods using the Bloom filter data structure on randomly generated data sets with the following options:

* 100 genes of length 5,000,000 bp as reference in file genes.fa
* 4,000,000 reads of length 250 bp as query in file reads.fa
* 12 threads

run:

    $ nttest --uniformity --input -q4000000 -l250 -t100 -g5000000 -j12 genes.fa reads.fa
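The README does not spell out how the Bloom filter measures uniformity, so the following is only a sketch of one plausible approach: insert the hash values of reference k-mers into a bit array, query with other k-mers, and compare the observed false-positive rate with the rate expected for uniformly distributed hashes. `BloomSketch` and `expected_fpr` are hypothetical names introduced here and are not part of nttest. The filter size is assumed to be greater than zero.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal Bloom filter addressed by externally supplied hash values
// (for example, the h values produced per k-mer by a multi-hash function).
class BloomSketch {
public:
    // bits must be greater than zero.
    explicit BloomSketch(std::size_t bits) : bits_(bits, false) {}

    // Set one bit per hash value.
    void insert(const std::vector<uint64_t>& hashes) {
        for (uint64_t h : hashes) bits_[h % bits_.size()] = true;
    }

    // Report membership only if every addressed bit is set.
    bool contains(const std::vector<uint64_t>& hashes) const {
        for (uint64_t h : hashes)
            if (!bits_[h % bits_.size()]) return false;
        return true;
    }

private:
    std::vector<bool> bits_;
};

// Expected false-positive rate for m bits, n inserted items, and h hashes
// per item, under the assumption that hash values are uniformly distributed.
inline double expected_fpr(double m, double n, double h) {
    return std::pow(1.0 - std::exp(-h * n / m), h);
}
```

A hash function whose observed false-positive rate stays close to `expected_fpr` behaves approximately uniformly for this purpose; a markedly higher observed rate suggests clustering of hash values.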

Code samples

To hash all k-mers of length k in a given sequence seq:

    string kmer = seq.substr(0, k);
    uint64_t hVal = 0;
    hVal = NT64(kmer.c_str(), k); // initial hash value
    ...
    for (size_t i = 0; i < seq.length() - k; i++) {
        hVal = NT64(hVal, seq[i], seq[i+k], k); // consecutive hash values
        ...
    }

To canonical hash all k-mers of length k in a given sequence seq:

    string kmer = seq.substr(0, k);
    uint64_t hVal, fhVal = 0, rhVal = 0; // canonical, forward, and reverse-strand hash values
    hVal = NTC64(kmer.c_str(), k, fhVal, rhVal); // initial hash value
    ...
    for (size_t i = 0; i < seq.length() - k; i++) {
        hVal = NTC64(seq[i], seq[i+k], k, fhVal, rhVal); // consecutive hash values
        ...
    }

To multi-hash with h hash values all k-mers of length k in a given sequence seq:

    string kmer = seq.substr(0, k);
    uint64_t hVec[h];
    NTM64(kmer.c_str(), k, h, hVec); // initial hash vector
    ...
    for (size_t i = 0; i < seq.length() - k; i++) {
        NTM64(seq[i], seq[i+k], k, h, hVec); // consecutive hash vectors
        ...
    }
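A canonical hash gives a k-mer and its reverse complement the same value, so both strands of the same locus hash identically. The sketch below illustrates the idea by hashing both strands from scratch and taking the smaller value, which is the rule described in the ntHash paper. The README does not state which combination rule NTC64 uses in this codebase, so treat the use of the minimum as an assumption. All names and seed constants here are placeholders, not the library's own.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>

// Rotate a 64-bit word left by r bits (r is taken modulo 64).
inline uint64_t rotl64(uint64_t x, unsigned r) {
    r &= 63u;
    return r == 0 ? x : (x << r) | (x >> (64 - r));
}

// Illustrative per-base seeds (placeholders, NOT the library's constants).
inline uint64_t base_seed(char c) {
    switch (c) {
        case 'A': return 0x9E3779B97F4A7C15ULL;
        case 'C': return 0xC2B2AE3D27D4EB4FULL;
        case 'G': return 0x165667B19E3779F9ULL;
        case 'T': return 0x27D4EB2F165667C5ULL;
        default:  return 0;
    }
}

// Watson-Crick complement of a single base; unknown bases map to 'N'.
inline char complement(char c) {
    switch (c) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return 'N';
    }
}

// Reverse the sequence, then complement each base.
inline std::string reverse_complement(const std::string& s) {
    std::string rc(s.rbegin(), s.rend());
    for (char& c : rc) c = complement(c);
    return rc;
}

// Forward-strand hash of a whole k-mer, computed from scratch.
inline uint64_t forward_hash(const std::string& kmer) {
    uint64_t h = 0;
    const std::size_t k = kmer.size();
    for (std::size_t j = 0; j < k; ++j)
        h ^= rotl64(base_seed(kmer[j]), static_cast<unsigned>(k - 1 - j));
    return h;
}

// Canonical hash: the smaller of the forward and reverse-complement hashes
// (assumed combination rule, see the note above).
inline uint64_t canonical_hash(const std::string& kmer) {
    return std::min(forward_hash(kmer), forward_hash(reverse_complement(kmer)));
}
```

In a rolling implementation the reverse-strand value is updated incrementally alongside the forward value instead of being recomputed, which is why the NTC64 calls above carry both fhVal and rhVal.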

ntHashIterator

Enables ntHash on sequences

To hash all k-mers of length k in a given sequence seq with h hash values using ntHashIterator:

    ntHashIterator itr(seq, h, k);
    while (itr != itr.end()) {
        ... use *itr ...
        ++itr;
    }

Usage example (C++)

Outputting hash values of all k-mers in a sequence

```C++
#include <iostream>
#include <string>
#include "ntHashIterator.hpp"

int main(int argc, const char* argv[]) {
    /* test sequence */
    std::string seq = "GAGTGTCAAACATTCAGACAACAGCAGGGGTGCTCTGGAATCCTATGTGAGGAACAAACATTCAGGCCACAGTAG";

    /* k is the k-mer length */
    unsigned k = 70;

    /* h is the number of hashes for each k-mer */
    unsigned h = 1;

    /* init ntHash state and compute hash values for first k-mer */
    ntHashIterator itr(seq, h, k);
    while (itr != itr.end()) {
        /* print the first (and here only) hash value of the current k-mer */
        std::cout << (*itr)[0] << std::endl;
        ++itr;
    }

    return 0;
}
```

Publications

ntHash

Hamid Mohamadi, Justin Chu, Benjamin P Vandervalk, and Inanc Birol. ntHash: recursive nucleotide hashing. Bioinformatics (2016) 32 (22): 3492-3494. doi:10.1093/bioinformatics/btw397

Owner

Name: Rayan Chikhi
Login: rchikhi
Kind: user
Location: France
Company: CNRS

Website: http://rayan.chikhi.name
Repositories: 15
Profile: https://github.com/rchikhi

Citation (CITATION.bib)

@article{doi:10.1093/bioinformatics/btw397,
author = {Mohamadi, Hamid and Chu, Justin and Vandervalk, Benjamin P. and Birol, Inanc},
title = {ntHash: recursive nucleotide hashing},
journal = {Bioinformatics},
volume = {32},
number = {22},
pages = {3492--3494},
year = {2016},
doi = {10.1093/bioinformatics/btw397},
URL = {http://dx.doi.org/10.1093/bioinformatics/btw397},
eprint = {/oup/backfile/Content_public/Journal/bioinformatics/32/22/10.1093_bioinformatics_btw397/3/btw397.pdf}
}

GitHub Events

Total

Push event: 1
Fork event: 1
Create event: 1

Last Year

Push event: 1
Fork event: 1
Create event: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
