nthash-avx512

Various test improvements upon mirounga's AVX2 and AVX512 ports of ntHash2 in 2018

https://github.com/rchikhi/nthash-avx512



Basic Info
  • Host: GitHub
  • Owner: rchikhi
  • License: mit
  • Language: C++
  • Default Branch: rs_avx
  • Homepage:
  • Size: 11.5 MB
Statistics
  • Stars: 5
  • Watchers: 4
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created over 3 years ago · Last pushed 11 months ago

README.md

ntHash2 AVX512

This is a bug-fixed version of the ntHash2 AVX2 and AVX512 ports with expanded tests. It also adds a 32-bit scalar ntHash.

In terms of correctness, the 64-bit hash versions (64-bit is the default hash type in ntHash 1 and 2) of the scalar, AVX2 and AVX512 implementations all agree with one another. The 32-bit AVX2 and AVX512 ports agree with each other, as they implement an unusual 31-bit ntHash2, but they do not agree with the 32-bit scalar version, which implements ntHash1.

These changes have not been merged into the original ntHash repository because they are based on an old ntHash codebase.

See https://github.com/bcgsc/ntHash/pull/9 for initial version.

ntHash

ntHash is a recursive hash function for hashing all possible k-mers in a DNA/RNA sequence.
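To make "recursive" concrete, the following is a minimal sketch of the rolling recurrence described in the ntHash paper: the hash of the next k-mer is obtained from the previous one by a 1-bit rotation and two XORs, instead of rehashing all k bases. The names `seed_of`, `rol`, `hash_kmer`, `roll` and `hash_all_kmers` are hypothetical, the seed constants are arbitrary placeholders rather than the values used by ntHash, and the sketch does not reproduce the ntHash2 variant or the SIMD ports in this repository.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Placeholder 64-bit seeds, one per nucleotide. These are arbitrary
// illustrative constants, NOT the seed values used by ntHash.
inline uint64_t seed_of(char base) {
    switch (base) {
        case 'A': return 0x9e3779b97f4a7c15ULL;
        case 'C': return 0xc2b2ae3d27d4eb4fULL;
        case 'G': return 0x165667b19e3779f9ULL;
        case 'T': return 0x27d4eb2f165667c5ULL;
        default:  return 0;  // non-ACGT characters contribute nothing in this sketch
    }
}

// Rotate a 64-bit word left by n bits (n is reduced modulo 64).
inline uint64_t rol(uint64_t x, unsigned n) {
    n &= 63u;
    return n == 0 ? x : (x << n) | (x >> (64u - n));
}

// Hash one k-mer from scratch: XOR of every base's seed, each rotated by
// its distance from the last base of the k-mer.
uint64_t hash_kmer(const std::string& s, std::size_t pos, unsigned k) {
    uint64_t h = 0;
    for (unsigned i = 0; i < k; ++i)
        h ^= rol(seed_of(s[pos + i]), k - 1 - i);
    return h;
}

// Slide the window by one base in O(1): rotate the previous hash,
// cancel the outgoing base, and mix in the incoming base.
uint64_t roll(uint64_t prev, char out, char in, unsigned k) {
    return rol(prev, 1) ^ rol(seed_of(out), k) ^ seed_of(in);
}

// Hash every k-mer of seq: one from-scratch hash, then rolling updates.
std::vector<uint64_t> hash_all_kmers(const std::string& seq, unsigned k) {
    assert(k > 0 && seq.size() >= k);
    std::vector<uint64_t> hashes;
    uint64_t h = hash_kmer(seq, 0, k);
    hashes.push_back(h);
    for (std::size_t i = 0; i + k < seq.size(); ++i) {
        h = roll(h, seq[i], seq[i + k], k);
        hashes.push_back(h);
    }
    return hashes;
}
```

Calling `hash_all_kmers(seq, k)` returns one value per k-mer, and each value equals what `hash_kmer` computes from scratch at the same position, which is the property that makes the rolling update safe to use.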

Build the test suite

    $ ./autogen.sh
    $ ./configure
    $ make
    $ sudo make install

To install nttest in a specified directory:

    $ ./autogen.sh
    $ ./configure --prefix=/opt/ntHash/
    $ make
    $ make install

The nttest suite provides options for runtime and uniformity tests.

Runtime test

For the runtime test, the program has the following options:

    nttest [OPTIONS] ... [FILE]

Parameters:
* -k, --kmer=SIZE: the length of k-mer used for runtime test hashing [50]
* -h, --hash=SIZE: the number of generated hashes for each k-mer [1]
* FILE: the input FASTA or FASTQ file

For example, to evaluate the runtime of different hash methods on the test file reads.fa in the DATA/ folder for k-mer length 50, run:

    $ nttest -k50 reads.fa

Uniformity test

For the uniformity test using the Bloom filter data structure, the program has the following options:

    nttest --uniformity [OPTIONS] ... [REF_FILE] [QUERY_FILE]

Parameters:
* -q, --qnum=SIZE: number of queries in query file
* -l, --qlen=SIZE: length of reads in query file
* -t, --tnum=SIZE: number of sequences in reference file
* -g, --tlen=SIZE: length of reference sequence
* -i, --input: generate random query and reference files
* -j, threads=SIZE: number of threads to run uniformity test [1]
* REF_FILE: the reference file name
* QUERY_FILE: the query file name

For example, to evaluate the uniformity of different hash methods using the Bloom filter data structure on randomly generated data sets with the following options:
* 100 genes of length 5,000,000bp as reference in file genes.fa
* 4,000,000 reads of length 250bp as query in file reads.fa
* 12 threads

run:

    $ nttest --uniformity --input -q4000000 -l250 -t100 -g5000000 -j12 genes.fa reads.fa
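As a rough illustration of why a Bloom filter is a reasonable probe for uniformity: if hash values are well distributed, the observed false-positive rate of the filter should sit close to the textbook estimate (1 - e^(-hn/m))^h for n inserted keys, m bits and h hash values per key. The sketch below is not the nttest implementation. The class `BloomFilter`, the helpers `mix`, `expected_fpr` and `observed_fpr`, and the use of a SplitMix64-style mixer as the hash under test are all assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in hash under test (SplitMix64-style finalizer). In a real
// uniformity test this would be replaced by the hash being evaluated.
inline uint64_t mix(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Minimal Bloom filter over 64-bit keys with h derived hash values per key.
class BloomFilter {
public:
    BloomFilter(std::size_t bits, unsigned num_hashes)
        : bits_(bits, false), h_(num_hashes) {
        assert(bits > 0 && num_hashes > 0);
    }
    void insert(uint64_t key) {
        for (unsigned i = 0; i < h_; ++i) bits_[index(key, i)] = true;
    }
    bool contains(uint64_t key) const {
        for (unsigned i = 0; i < h_; ++i)
            if (!bits_[index(key, i)]) return false;
        return true;
    }
private:
    // Derive the i-th bit position for a key (hypothetical scheme).
    std::size_t index(uint64_t key, unsigned i) const {
        return mix(key ^ (0x9e3779b97f4a7c15ULL * (i + 1))) % bits_.size();
    }
    std::vector<bool> bits_;
    unsigned h_;
};

// Theoretical false-positive rate for m bits, n inserted keys, h hashes.
double expected_fpr(std::size_t m, std::size_t n, unsigned h) {
    return std::pow(1.0 - std::exp(-static_cast<double>(h) * n / m), h);
}

// Insert keys 0..n-1, then query q keys that were never inserted and
// report the fraction the filter wrongly claims to contain.
double observed_fpr(std::size_t m, std::size_t n, unsigned h, std::size_t q) {
    BloomFilter bf(m, h);
    for (uint64_t key = 0; key < n; ++key) bf.insert(key);
    std::size_t false_pos = 0;
    for (uint64_t key = n; key < n + q; ++key)
        if (bf.contains(key)) ++false_pos;
    return static_cast<double>(false_pos) / q;
}
```

Comparing `observed_fpr(m, n, h, q)` with `expected_fpr(m, n, h)` for several hash functions gives a simple relative measure: a hash whose observed rate is far above the estimate is distributing keys poorly over the bit array.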

Code samples

To hash all k-mers of length k in a given sequence seq:

    string kmer = seq.substr(0, k);
    uint64_t hVal=0;
    hVal = NT64(kmer.c_str(), k); // initial hash value
    ...
    for (size_t i = 0; i < seq.length() - k; i++) {
        hVal = NT64(hVal, seq[i], seq[i+k], k); // consecutive hash values
        ...
    }

To canonical hash all k-mers of length k in a given sequence seq:

    string kmer = seq.substr(0, k);
    uint64_t hVal, fhVal=0, rhVal=0; // canonical, forward, and reverse-strand hash values
    hVal = NTC64(kmer.c_str(), k, fhVal, rhVal); // initial hash value
    ...
    for (size_t i = 0; i < seq.length() - k; i++) {
        hVal = NTC64(seq[i], seq[i+k], k, fhVal, rhVal); // consecutive hash values
        ...
    }

To multi-hash with h hash values all k-mers of length k in a given sequence seq:

    string kmer = seq.substr(0, k);
    uint64_t hVec[h];
    NTM64(kmer.c_str(), k, h, hVec); // initial hash vector
    ...
    for (size_t i = 0; i < seq.length() - k; i++) {
        NTM64(seq[i], seq[i+k], k, h, hVec); // consecutive hash vectors
        ...
    }
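The idea behind a canonical hash can be sketched independently of the library: hash the k-mer and its reverse complement, then combine the two so that both strands map to the same value. The sketch below assumes the combination is "take the smaller of the two", which is one common choice; the codebase may combine the forward and reverse values differently. All names (`seed_of`, `rol`, `forward_hash`, `reverse_complement`, `canonical_hash`) and the seed constants are hypothetical and are not part of the ntHash API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Arbitrary placeholder seeds, NOT the values used by ntHash.
inline uint64_t seed_of(char base) {
    switch (base) {
        case 'A': return 0x9e3779b97f4a7c15ULL;
        case 'C': return 0xc2b2ae3d27d4eb4fULL;
        case 'G': return 0x165667b19e3779f9ULL;
        case 'T': return 0x27d4eb2f165667c5ULL;
        default:  return 0;
    }
}

// Rotate a 64-bit word left by n bits (n is reduced modulo 64).
inline uint64_t rol(uint64_t x, unsigned n) {
    n &= 63u;
    return n == 0 ? x : (x << n) | (x >> (64u - n));
}

// Watson-Crick complement of a single base; other characters pass through.
inline char complement(char b) {
    switch (b) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return b;
    }
}

// Reverse complement of a DNA string.
std::string reverse_complement(const std::string& s) {
    std::string rc(s.rbegin(), s.rend());
    std::transform(rc.begin(), rc.end(), rc.begin(), complement);
    return rc;
}

// Strand-specific hash of a whole k-mer, computed from scratch.
uint64_t forward_hash(const std::string& kmer) {
    const unsigned k = static_cast<unsigned>(kmer.size());
    uint64_t h = 0;
    for (unsigned i = 0; i < k; ++i)
        h ^= rol(seed_of(kmer[i]), k - 1 - i);
    return h;
}

// Strand-independent hash: the smaller of the two strand hashes
// (assumed combination rule for this sketch).
uint64_t canonical_hash(const std::string& kmer) {
    assert(!kmer.empty());
    return std::min(forward_hash(kmer),
                    forward_hash(reverse_complement(kmer)));
}
```

With this definition, `canonical_hash("GATTACA")` and `canonical_hash("TGTAATC")` return the same value, because each string is the reverse complement of the other. A rolling implementation would update the forward and reverse hashes incrementally instead of recomputing them per k-mer.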

ntHashIterator

Enables ntHash on sequences

To hash all k-mers of length k in a given sequence seq with h hash values using ntHashIterator:

    ntHashIterator itr(seq, h, k);
    while (itr != itr.end()) {
        ... use *itr ...
        ++itr;
    }

Usage example (C++)

Outputting hash values of all k-mers in a sequence

```C++
#include <iostream>
#include <string>
#include "ntHashIterator.hpp"

int main(int argc, const char* argv[])
{
    /* test sequence */
    std::string seq = "GAGTGTCAAACATTCAGACAACAGCAGGGGTGCTCTGGAATCCTATGTGAGGAACAAACATTCAGGCCACAGTAG";

    /* k is the k-mer length */
    unsigned k = 70;

    /* h is the number of hashes for each k-mer */
    unsigned h = 1;

    /* init ntHash state and compute hash values for first k-mer */
    ntHashIterator itr(seq, h, k);

    /* print the first hash value of every k-mer, one per line */
    while (itr != itr.end()) {
        std::cout << (*itr)[0] << std::endl;
        ++itr;
    }

    return 0;
}
```

Publications

ntHash

Hamid Mohamadi, Justin Chu, Benjamin P Vandervalk, and Inanc Birol. ntHash: recursive nucleotide hashing. Bioinformatics (2016) 32 (22): 3492-3494. doi:10.1093/bioinformatics/btw397

Owner

  • Name: Rayan Chikhi
  • Login: rchikhi
  • Kind: user
  • Location: France
  • Company: CNRS

Citation (CITATION.bib)

@article{doi:10.1093/bioinformatics/btw397,
author = {Mohamadi, Hamid and Chu, Justin and Vandervalk, Benjamin P. and Birol, Inanc},
title = {ntHash: recursive nucleotide hashing},
journal = {Bioinformatics},
volume = {32},
number = {22},
pages = {3492--3494},
year = {2016},
doi = {10.1093/bioinformatics/btw397},
URL = {http://dx.doi.org/10.1093/bioinformatics/btw397},
eprint = {/oup/backfile/Content_public/Journal/bioinformatics/32/22/10.1093_bioinformatics_btw397/3/btw397.pdf}
}
