seqhasher

SeqHasher - A tool for hashing individual sequences in FASTA files

https://github.com/vmikk/seqhasher

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
✓
DOI references
Found 2 DOI reference(s) in README
✓
Academic publication links
Links to: zenodo.org
○
Committers with academic emails
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (11.2%) to scientific vocabulary

Keywords

dna-sequences fasta fastq hashing

Last synced: 9 months ago

Repository

SeqHasher - A tool for hashing individual sequences in FASTA files

Basic Info

Host: GitHub
Owner: vmikk
License: gpl-3.0
Language: Go
Default Branch: main
Homepage:
Size: 108 KB

Statistics

Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 8

Topics

dna-sequences fasta fastq hashing

Created about 2 years ago · Last pushed about 1 year ago

Metadata Files

Readme License Citation

SeqHasher

Overview

seqhasher is a high-performance command-line tool designed to calculate a hash (digest or fingerprint) for each sequence in a FASTA or FASTQ file and add it to the sequence header. It supports multiple hashing algorithms and offers various output options.

Features

Fast processing of FASTA/FASTQ files (thanks to shenwei356/bio package)
Support for multiple hash algorithms: SHA-1, SHA-3, MD5, xxHash, CityHash, MurmurHash3, ntHash, and BLAKE3
Automatic support for compressed input files (gzip, zstd, xz, and bzip2)
Supports reading from STDIN and writing to STDOUT
Option to output only headers or full sequences
Case-sensitive hashing option
Customizable output format (e.g., include filename or a custom text string in the header)

Quick start

Input data (e.g., input.fasta):

```
>seq1
AAAA
>seq2
ACTG
>seq3
aaaa
```

Basic usage (default SHA1 hash):

```bash
seqhasher input.fasta -
```

```
>input.fasta;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq1
AAAA
>input.fasta;65c89f59d38cdbf90dfaf0b0a6884829df8396b0;seq2
ACTG
>input.fasta;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq3
AAAA
```

Custom name instead of input filename (e.g., useful when processing stdin):

```bash
seqhasher --name "test_file" input.fasta -
```

```
>test_file;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq1
AAAA
>test_file;65c89f59d38cdbf90dfaf0b0a6884829df8396b0;seq2
ACTG
>test_file;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq3
AAAA
```

Output only headers:

```bash
seqhasher --headersonly input.fasta -
```

```
input.fasta;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq1
input.fasta;65c89f59d38cdbf90dfaf0b0a6884829df8396b0;seq2
input.fasta;e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq3
```

Omit filename from output:

```bash
seqhasher --headersonly --nofilename input.fasta -
```

```
e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq1
65c89f59d38cdbf90dfaf0b0a6884829df8396b0;seq2
e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;seq3
```

Use a different hash function (xxHash) and case-sensitive mode:

```bash
seqhasher --headersonly --nofilename --hash xxhash --casesensitive input.fasta -
```

```
cf40b5b72bc43e77;seq1
704b34bf20faedf2;seq2
42a70d1abf84bf32;seq3
```

Multiple hashes (useful to ensure absence of collisions):

```bash
seqhasher --headersonly --nofilename --hash sha1,xxhash --casesensitive input.fasta -
```

```
e2512172abf8cc9f67fdd49eb6cacf2df71bbad3;cf40b5b72bc43e77;seq1
65c89f59d38cdbf90dfaf0b0a6884829df8396b0;704b34bf20faedf2;seq2
70c881d4a26984ddce795f6f71817c9cf4480e79;42a70d1abf84bf32;seq3
```

Usage

```plaintext
seqhasher [options] <input_file> [output_file]

Options:
  -o, --headersonly     Output only sequence headers, excluding the sequences themselves
  -H, --hash            Hash algorithm(s): sha1 (default), sha3, md5, xxhash, cityhash, murmur3, nthash, blake3
  -c, --casesensitive   Take into account sequence case. By default, sequences are converted to uppercase
  -n, --nofilename      Omit the file name from the sequence header
  -f, --name <text>     Replace the input file's name in the header with <text>
  -v, --version         Print the version of the program and exit
  -h, --help            Show this help message and exit

Arguments:
  <input_file>          Path to the input FASTA/FASTQ file (supports gzip, zstd, xz, or bzip2 compression)
                        or '-' for standard input (stdin)
  [output_file]         Path to the output file or '-' for standard output (stdout)
                        If omitted, output is sent to stdout.
```

Description

The tool can either read the input from a specified file or from standard input (stdin), and similarly, it can write the output to a specified file or standard output (stdout).

The --name option customizes the output header by replacing the input file name with the specified text.

The --hash option specifies which hash function to use (multiple comma-separated values are allowed, e.g., --hash sha1,nthash). Currently, the following hash functions are supported:
- sha1: SHA-1 (default), 160-bit hash value
- sha3: SHA-3, Keccak-based secure cryptographic hash standard, 512-bit hash value
- md5: MD5, 128-bit hash value
- xxhash: xxHash, extremely fast algorithm, 64-bit hash value
- cityhash: CityHash (e.g., used in VSEARCH), 128-bit hash value
- murmur3: Murmur3 (e.g., used in Sourmash, but 64-bit), 128-bit hash value
- nthash: ntHash (designed for DNA sequences), 64-bit hash value. This implementation uses the full length of the sequence as the k-mer size, effectively hashing the entire sequence at once using the non-canonical (forward) hash of the sequence
- blake3: BLAKE3 (fast cryptographic hash function), 256-bit hash value

[!NOTE] The probability that two different DNA sequences end up with the same hash (a collision) is roughly 1 in 2^nbits for a well-distributed hash, where nbits is the length of the hash in bits. Across a dataset of n sequences, the chance of at least one collision grows roughly as n^2 / 2^(nbits+1). This means that functions with shorter hashes (e.g., the 64-bit xxHash and ntHash and, to a lesser extent, the 128-bit Murmur3 and CityHash) are more likely to produce collisions as the dataset grows, while SHA-3 has a much lower chance of collisions because of its larger bit length. However, shorter hashes are generally faster to compute and take up less space when saved to a file, making them more efficient for some tasks despite the higher collision risk.
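The effect of hash length on collision risk can be estimated with the standard birthday approximation. The sketch below assumes an ideal hash with uniformly distributed outputs, which real hash functions only approximate; `collisionProb` is a name chosen for this example.

```go
package main

import (
	"fmt"
	"math"
)

// collisionProb approximates the probability that at least two of n distinct
// sequences share a digest, assuming an ideal hash with uniformly distributed
// outputs of the given bit length (birthday approximation).
func collisionProb(n float64, bits int) float64 {
	pairs := n * (n - 1) / 2
	// 1 - exp(-pairs / 2^bits), computed with Expm1 to keep precision
	// when the probability is tiny.
	return -math.Expm1(-pairs / math.Ldexp(1, bits))
}

func main() {
	// 500,000 sequences, the size of the benchmark dataset in this README.
	for _, bits := range []int{64, 128, 160, 256, 512} {
		fmt.Printf("%3d bits: %.3g\n", bits, collisionProb(5e5, bits))
	}
}
```

For 500,000 sequences and a 64-bit hash the estimate is roughly 7e-9, and it drops by many orders of magnitude with each longer hash, which is why combining two hashes is mainly a safeguard for very large datasets.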

Examples

To process a FASTA file and output to another file:

```bash
seqhasher input.fasta output.fasta
```

To process a FASTA file from standard input and output to standard output, while replacing the file name in the header with 'Sample':

```bash
cat input.fasta | seqhasher --name 'Sample' - - > output.fasta

# OR

seqhasher --name 'Sample' - - < input.fasta > output.fasta
```

Benchmark

To compare the performance of the supported hash functions, we used hyperfine.

Benchmarks were performed on a system with the following specifications:
- CPU: Intel Core i7-10510U (Comet Lake)
- Storage: NVMe SSD

Test data

First, let's create the test data - a FASTA file containing 500,000 sequences, each 30 to 3000 nucleotides long (this should take a couple of minutes):

```bash
awk -v numSeq=500000 'BEGIN{
  srand();
  for(i=1; i<=numSeq; i++){
    seqLen=int(rand()*(2971))+30;
    printf(">seq_%d\n", i);
    for(j=1; j<=seqLen; j++){
      r=rand();
      if(r < 0.25) nucleotide="A";
      else if(r < 0.5) nucleotide="C";
      else if(r < 0.75) nucleotide="G";
      else nucleotide="T";
      printf("%s", nucleotide);
    }
    printf("\n");
  }
}' > big.fasta
```

The size of the file is ~760MB.

Hashing functions performance

```bash
hyperfine \
  --runs 10 --warmup 3 \
  --export-markdown hashing_benchmark.md \
  'seqhasher --headersonly --casesensitive --hash md5 big.fasta - > /dev/null' \
  'seqhasher --headersonly --casesensitive --hash sha1 big.fasta - > /dev/null' \
  'seqhasher --headersonly --casesensitive --hash sha3 big.fasta - > /dev/null' \
  'seqhasher --headersonly --casesensitive --hash xxhash big.fasta - > /dev/null' \
  'seqhasher --headersonly --casesensitive --hash murmur3 big.fasta - > /dev/null' \
  'seqhasher --headersonly --casesensitive --hash cityhash big.fasta - > /dev/null' \
  'seqhasher --headersonly --casesensitive --hash nthash big.fasta - > /dev/null' \
  'seqhasher --headersonly --casesensitive --hash blake3 big.fasta - > /dev/null' \
  'seqhasher --headersonly --hash sha1,blake3 big.fasta - > /dev/null' \
  'seqhasher --headersonly --hash xxhash,murmur3 big.fasta - > /dev/null'
```

| Command          |      Mean [s] | Min [s] | Max [s] |    Relative |
|:-----------------|--------------:|--------:|--------:|------------:|
| md5              | 1.712 ± 0.069 |   1.651 |   1.847 | 1.75 ± 0.10 |
| sha1             | 1.614 ± 0.021 |   1.586 |   1.645 | 1.65 ± 0.08 |
| sha3             | 4.823 ± 0.135 |   4.707 |   5.090 | 4.93 ± 0.26 |
| xxhash           | 0.977 ± 0.043 |   0.941 |   1.079 |        1.00 |
| murmur3          | 1.106 ± 0.058 |   1.058 |   1.233 | 1.13 ± 0.08 |
| cityhash         | 1.078 ± 0.019 |   1.048 |   1.111 | 1.10 ± 0.05 |
| nthash           | 2.138 ± 0.022 |   2.112 |   2.170 | 2.19 ± 0.10 |
| blake3           | 1.718 ± 0.066 |   1.645 |   1.864 | 1.76 ± 0.10 |
| sha1,blake3      | 3.384 ± 0.096 |   3.290 |   3.640 | 3.46 ± 0.18 |
| xxhash,murmur3   | 2.234 ± 0.073 |   2.193 |   2.422 | 2.29 ± 0.13 |

Values are in seconds per 500,000 sequences (756,622,201 bp)

As shown, xxHash provides the best performance, followed by CityHash and MurmurHash3. These hash functions produce relatively short hash fingerprints (64 bits for xxHash, 128 bits for CityHash and MurmurHash3). In contrast, SHA-3 is the slowest hash function in this benchmark and generates the longest hash (512 bits).
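The mean run times can also be read as end-to-end throughput by dividing the total number of bases in the test file by the elapsed time. The following Go sketch does that arithmetic for a few rows of the table; `throughput` is a helper name chosen for this example, and the figures include file reading and parsing, not only hashing.

```go
package main

import "fmt"

// throughput converts a mean run time in seconds into megabases per second
// for the benchmark file (756,622,201 bp, as reported with the table).
func throughput(meanSeconds float64) float64 {
	const totalBP = 756622201
	return totalBP / meanSeconds / 1e6
}

func main() {
	// Mean times copied from the benchmark table.
	means := []struct {
		name string
		sec  float64
	}{{"xxhash", 0.977}, {"cityhash", 1.078}, {"sha1", 1.614}, {"sha3", 4.823}}
	for _, m := range means {
		fmt.Printf("%-8s %6.0f Mbp/s\n", m.name, throughput(m.sec))
	}
}
```

On the benchmark machine this works out to roughly 774 Mbp/s for xxHash and roughly 157 Mbp/s for SHA-3.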

[!NOTE] These values may depend on the instruction set of the CPU being used, as some processors optimize specific algorithms differently (e.g., via SIMD or other hardware acceleration). For example, modern CPUs may use SHA Extensions to accelerate SHA-family algorithms. Additionally, the performance reported here is tied to the particular implementations of the hash algorithms used in seqhasher. Other implementations may yield different results, and these values should not be interpreted as a definitive ranking of the algorithms themselves.

Installation

Pre-built binaries

Download the latest release for your platform from the Releases page.

Building from source

Ensure you have Go 1.23 or later installed.
Then, to install seqhasher v.1.1.1 run:

```bash
git clone --depth 1 --branch 1.1.1 https://github.com/vmikk/seqhasher
cd seqhasher
go build -ldflags="-w -s" seqhasher.go
```

Known issues and limitations

Seqhasher ignores line wrapping in FASTA files (whitespace characters are stripped from the sequence before processing);
The tool may not work correctly with sequences containing non-ASCII characters;
IUPAC ambiguity codes (R,Y,S,W,K,M,B,D,H,V,N), characters denoting gaps ('-' or '.'), and any other non-DNA characters are handled "as is" (the hash will depend on them);
Empty sequences return an empty hash.
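The first and third points can be illustrated with a short Go sketch. It assumes the behaviour described above (whitespace removed, sequence uppercased, every other character hashed unchanged) and uses SHA-1 from the standard library; `normalize` and `digest` are hypothetical helpers, not seqhasher functions.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"strings"
	"unicode"
)

// normalize removes all whitespace (including line breaks) and uppercases
// the sequence; everything else, such as N or '-', is kept as is.
func normalize(raw string) string {
	stripped := strings.Map(func(r rune) rune {
		if unicode.IsSpace(r) {
			return -1 // drop the character
		}
		return r
	}, raw)
	return strings.ToUpper(stripped)
}

// digest returns the hex SHA-1 of the normalized sequence.
func digest(raw string) string {
	sum := sha1.Sum([]byte(normalize(raw)))
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(digest("ACTGACTG") == digest("ACTG\nACTG\n")) // true: wrapping is ignored
	fmt.Println(digest("ACTGACTG") == digest("ACTG-ACTG"))    // false: the gap is hashed
	fmt.Println(digest("ACTGACTG") == digest("ACTNACTG"))     // false: N is hashed
}
```

In practice this means that aligned sequences or sequences with ambiguity codes should be cleaned before hashing if they are meant to match their unaligned or unambiguous counterparts.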

Owner

Name: Vladimir Mikryukov
Login: vmikk
Kind: user
Location: Tartu, Estonia
Company: The University of Tartu

Twitter: vmikkkk
Repositories: 54
Profile: https://github.com/vmikk

Citation (CITATION.cff)

cff-version: 1.2.0
title: "SeqHasher: DNA Sequence Hashing Tool"
type: software
authors:
- family-names: "Mikryukov"
  given-names: "Vladimir"
  orcid: "https://orcid.org/0000-0003-2786-2690"
version: 1.1.1
doi: 10.5281/zenodo.14311356
date-released: 2024-12-08
url: "https://github.com/vmikk/seqhasher"
license: GPL-3.0
repository-code: "https://github.com/vmikk/seqhasher"

GitHub Events

Total

Release event: 3
Delete event: 1
Push event: 27
Create event: 3

Last Year

Release event: 3
Delete event: 1
Push event: 27
Create event: 3

Committers

Last synced: 10 months ago

All Time

Total Commits: 211
Total Committers: 1
Avg Commits per committer: 211.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 179
Committers: 1
Avg Commits per committer: 179.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Vladimir Mikryukov	v**v@g**m	211

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Packages

Total packages: 1
Total downloads: unknown

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 0

proxy.golang.org: github.com/vmikk/seqhasher

Homepage: https://github.com/vmikk/seqhasher
Documentation: https://pkg.go.dev/github.com/vmikk/seqhasher#section-documentation
License: GPL-3.0

Versions: 0
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 8.4%

Average: 9.0%

Dependent repos count: 9.5%

Last synced: 10 months ago

Dependencies

go.mod go

github.com/cespare/xxhash/v2 v2.2.0
github.com/dsnet/compress v0.0.1
github.com/elliotwutingfeng/asciiset v0.0.0-20230602022725-51bbb787efab
github.com/go-faster/city v1.0.1
github.com/klauspost/compress v1.16.3
github.com/klauspost/pgzip v1.2.5
github.com/shenwei356/bio v0.13.3
github.com/shenwei356/util v0.5.0
github.com/shenwei356/xopen v0.3.2
github.com/spaolacci/murmur3 v1.1.0
github.com/ulikunitz/xz v0.5.11

go.sum go

github.com/cespare/xxhash/v2 v2.2.0
github.com/dsnet/compress v0.0.1
github.com/dsnet/golib v0.0.0-20171103203638-1ea166775780
github.com/elliotwutingfeng/asciiset v0.0.0-20230602022725-51bbb787efab
github.com/go-faster/city v1.0.1
github.com/klauspost/compress v1.4.1
github.com/klauspost/compress v1.16.3
github.com/klauspost/cpuid v1.2.0
github.com/klauspost/pgzip v1.2.5
github.com/shenwei356/bio v0.13.3
github.com/shenwei356/util v0.5.0
github.com/shenwei356/xopen v0.3.2
github.com/spaolacci/murmur3 v1.1.0
github.com/ulikunitz/xz v0.5.6
github.com/ulikunitz/xz v0.5.11
