https://github.com/alexpreynolds/kmer-boolean
Test if a kmer is or is not in a set of sequences
Repository
Basic Info
- Host: GitHub
- Owner: alexpreynolds
- License: mit
- Language: C++
- Default Branch: master
- Size: 251 KB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
kmer-boolean
This utility tests whether a specified kmer is present in a set of FASTA sequences provided on standard input, for a given k, and returns the corresponding "true" or "false" result. Alternatively, the binary can report the kmers that are found (--present), not found (--absent), or both (--all).
Memory usage
Internally, this test keeps an array of bits to minimize the memory overhead of storing per-kmer presence or absence state. There are 4^k possible kmers over the four-base alphabet, so at one bit per kmer the bit array needs at least 2^(2k-3) bytes. Querying 16mers, for example, requires 2^29 bytes, or about 537 MB of memory.
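The arithmetic can be checked with a short sketch. The function name `bitarray_bytes` is hypothetical and not part of kmer-boolean; it assumes only one bit per possible kmer over a four-letter alphabet.

```python
def bitarray_bytes(k: int) -> int:
    """Bytes needed to hold one presence bit for each of the 4**k possible kmers."""
    bits = 4 ** k              # one bit per kmer; 4**k == 2**(2*k)
    return max(1, bits // 8)   # 2**(2*k - 3) bytes for k >= 2, never less than one byte


if __name__ == "__main__":
    for k in (2, 8, 12, 16):
        print(k, bitarray_bytes(k))
```

Each increment of k multiplies the requirement by four, which is why memory grows quickly past k=16.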
For the C++ binary, an additional 8 MB buffer is reserved for storing intermediate sequence data that streams in from the input FASTA file. If the --read-in-all-at-once option is used, the sequence data is read into memory all at once. It is strongly recommended to instead use the default streaming option to minimize memory usage and runtime overhead required to read all sequences into memory.
Runtime
To explore runtime characteristics, we used hg38 assembly data, looking for kmers that are absent in this genome build from k=2 up to 12, to get a general sense of how runtimes trend.
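This test is naturally parallelized by chromosome: a kmer is present in the genome if it is present in any chromosome, and absent from the genome only if it is absent from every chromosome, so per-chromosome results can be computed independently and then combined. We include scripts compatible with a Slurm job scheduler for local testing.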
While we include scripts to parallelize the generation of runtime data for all k-by-chromosome combinations, plotting uses the maximum runtime among all chromosomes as the natural bound for querying kmers for the given value of k.
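As a rough sketch of how per-chromosome results might be combined: a kmer counts as absent from the whole genome only when every chromosome reports it absent, and the per-k runtime bound is the slowest chromosome. The function names and the toy values are made up for illustration; they are not taken from the repository's Slurm scripts.

```python
def combine_absent(per_chrom_absent):
    """Return kmers absent genome-wide: the intersection of per-chromosome absent sets."""
    sets = [set(kmers) for kmers in per_chrom_absent.values()]
    return set.intersection(*sets) if sets else set()


def runtime_bound(per_chrom_seconds):
    """Return the slowest per-chromosome runtime, the bound when jobs run in parallel."""
    return max(per_chrom_seconds.values())


if __name__ == "__main__":
    # Toy values, for illustration only.
    absent = {"chr1": {"AA", "CG"}, "chr2": {"CG", "TT"}}
    seconds = {"chr1": 12.5, "chr2": 30.0}
    print(sorted(combine_absent(absent)), runtime_bound(seconds))
```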

Some work is still left to do to optimize data structures used for bitsetting kmers.
Notes regarding FASTA input
Note that searches are not performed on "canonical" DNA kmers, but on all unique kmers. For example, a search for the AG 2mer will be treated separately from the reverse complement CT 2mer.
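To make the distinction concrete, the sketch below computes a reverse complement and a "canonical" form, taken here as the lexicographically smaller of a kmer and its reverse complement. That is a common convention, not something kmer-boolean does, and the helper names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(kmer: str) -> str:
    """Complement each base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Pick the lexicographically smaller of a kmer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


if __name__ == "__main__":
    print(reverse_complement("AG"))          # CT
    print(canonical("AG"), canonical("CT"))  # AG AG
```

A canonical counter would merge AG and CT into one entry; kmer-boolean keeps them as two separate kmers.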
The input FASTA file may contain one or more records (so-called "multi-FASTA"). Each FASTA record in a multi-FASTA file is scanned separately, but each record contributes to the overall kmer query report. Split a multi-FASTA file if you want to query kmer distributions for individual records.
Kmers containing hard-masked bases (N) are ignored. Soft-masked bases (lowercase bases) are included in queries.
An 8 MB buffer is kept of sequence data streaming in from the input FASTA. It is also possible to use the --read-in-all-at-once option to read the entire FASTA records into memory. Using this option would not be recommended for genome-scale input files.
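A minimal sketch of the streaming idea follows. It walks a FASTA stream line by line, carries the last k-1 bases forward so that kmers spanning a line break are not lost, resets at each record header, skips kmers containing N, and folds soft-masked bases to uppercase. The function name is hypothetical, the case folding is an assumption, and the C++ binary's actual 8 MB buffering logic is not shown in this README.

```python
import io


def stream_kmers(handle, k):
    """Yield kmers from a FASTA text stream, holding one line plus k-1 bases at a time."""
    tail = ""
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tail = ""  # new record: kmers never span two records
            continue
        window = tail + line.upper()  # assumption: soft-masked bases are folded to uppercase
        for i in range(len(window) - k + 1):
            kmer = window[i:i + k]
            if "N" not in kmer:  # hard-masked kmers are ignored
                yield kmer
        tail = window[-(k - 1):] if k > 1 else ""


# The FASTA text used in the README examples.
fasta = ">foo\nCATTCTC\nGGGAC\n>bar\nTTATAT\n>baz\nTTTATTAG\nACCTCT\n"
found = set(stream_kmers(io.StringIO(fasta), 2))
```

On the example input, `found` holds the same twelve 2mers that the `--present` example lists. CG appears only because the two sequence lines of the `foo` record are joined.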
C++
This C++ implementation includes a custom bitset container which can be sized at runtime. The standard library's std::bitset takes its size as a template parameter, so it can only be sized at compile time and is not used here.
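The idea of a bitset sized at runtime can be sketched in a few lines, written here in Python for brevity. The class and method names are hypothetical; the README does not show the interface of the C++ container.

```python
class RuntimeBitset:
    """A fixed-length bit array whose length is chosen when the object is created."""

    def __init__(self, n_bits: int):
        self.n_bits = n_bits
        self._bytes = bytearray((n_bits + 7) // 8)  # eight bits per byte, rounded up

    def _check(self, i: int) -> None:
        if not 0 <= i < self.n_bits:
            raise IndexError("bit index out of range")

    def set(self, i: int) -> None:
        """Turn bit i on."""
        self._check(i)
        self._bytes[i >> 3] |= 1 << (i & 7)

    def test(self, i: int) -> bool:
        """Report whether bit i is on."""
        self._check(i)
        return bool(self._bytes[i >> 3] & (1 << (i & 7)))


if __name__ == "__main__":
    k = 3                        # chosen at runtime, e.g. from --k
    bits = RuntimeBitset(4 ** k)
    bits.set(5)
    print(bits.test(5), bits.test(6))
```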
Compilation
Run make to build the kmer-boolean binary. Run make clean to clean up temporary files.
Usage
Options
Run kmer-boolean --help to get a listing of options:
```
kmer-boolean
  version: 1.1
  author: Alex Reynolds

Usage:

  $ kmer-boolean [arguments] < input

  Test if specified kmer is or is not in a set of FASTA sequences provided on standard input, for a given k, returning the according 'true' or 'false' result.

General Options:

  --k=n                           K-value for kmer length (integer)
  --query-kmer=s                  Test or query kmer (string)
  [--present | --absent | --all]  If query kmer is omitted, print all present or absent kmers, or all kmers
  --read-in-all-at-once           Read all sequence data into memory before processing (not recommended for very large FASTA inputs; default is to stream through input in chunks)

Process Flags:

  --help                          Show this usage message
  --version                       Show binary version
```
Examples
You can return a list of all kmers found in the input via --present:
$ echo -e ">foo\nCATTCTC\nGGGAC\n>bar\nTTATAT\n>baz\nTTTATTAG\nACCTCT" | ./kmer-boolean --k=2 --present
AC found
AT found
AG found
CA found
CC found
CT found
CG found
TA found
TC found
TT found
GA found
GG found
Conversely, you can return a list of kmers not found in the input via --absent:
$ echo -e ">foo\nCATTCTC\nGGGAC\n>bar\nTTATAT\n>baz\nTTTATTAG\nACCTCT" | ./kmer-boolean --k=2 --absent
AA not found
TG not found
GC not found
GT not found
You can also use --all to return a list of all kmers, whether found or not found in the input FASTA:
$ echo -e ">foo\nCATTCTC\nGGGAC\n>bar\nTTATAT\n>baz\nTTTATTAG\nACCTCT" | ./kmer-boolean --k=3 --all
AAA not found
AAC not found
AAT not found
AAG not found
ACA not found
ACC found
ACT not found
...
GTT not found
GTG not found
GGA found
GGC not found
GGT not found
GGG found
If a filter option (--present, etc.) is not specified, kmer-boolean will return a list of all present kmers.
You may also query for a specific kmer with the --query-kmer=<kmer> option:
$ echo -e ">foo\nCATTCTC\nGGGAC\n>bar\nTTATAT\n>baz\nTTTATTAG\nACCTCT" | ./kmer-boolean --k=5 --query-kmer="ACGGC"
ACGGC not found
If the queried kmer is found, the exit code for the application is zero ("success"). If the kmer is not found, the exit code is non-zero ("failure").
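Because the result is carried by the exit code, a calling script can branch on it without parsing the output. The wrapper below is a hedged sketch: `kmer_is_present` is a hypothetical helper, and it assumes the binary has been built at `./kmer-boolean` and accepts the `--k` and `--query-kmer` options listed above.

```python
import subprocess


def kmer_is_present(fasta_text, k, kmer, binary="./kmer-boolean"):
    """Run a single-kmer query and map exit code zero to True, non-zero to False."""
    result = subprocess.run(
        [binary, "--k={}".format(k), "--query-kmer={}".format(kmer)],
        input=fasta_text,           # FASTA is passed on standard input
        universal_newlines=True,    # send and receive text rather than bytes
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.returncode == 0
```

With the FASTA text from the examples above, `kmer_is_present(fasta, 5, "ACGGC")` would be expected to return `False`, since the binary reports ACGGC as not found.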
Python
The kmer-boolean.py script uses the Python bitarray library to set up the required number of booleans for the specified k, and tests the boolean value for the specified kmer, or else returns those kmers which are observed in the input.
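The general approach can be sketched as follows: map each kmer to an integer index with two bits per base, flag the indices that are observed, and read the flags back. The base order A, C, T, G is inferred from the ordering of the example output above. The actual encoding inside kmer-boolean.py is an assumption, and this sketch uses a plain `bytearray` (one byte per kmer) rather than a packed bit array so that it needs no third-party library.

```python
BASES = "ACTG"  # order inferred from the example output, not confirmed by the source
CODE = {base: i for i, base in enumerate(BASES)}


def kmer_to_index(kmer: str) -> int:
    """Read a kmer as a base-4 number, most significant base first."""
    index = 0
    for base in kmer:
        index = index * 4 + CODE[base]
    return index


def index_to_kmer(index: int, k: int) -> str:
    """Invert kmer_to_index for a kmer of length k."""
    bases = []
    for _ in range(k):
        index, digit = divmod(index, 4)
        bases.append(BASES[digit])
    return "".join(reversed(bases))


def absent_kmers(observed, k):
    """List, in index order, every kmer of length k that was not observed."""
    present = bytearray(4 ** k)  # one flag per possible kmer
    for kmer in observed:
        present[kmer_to_index(kmer)] = 1
    return [index_to_kmer(i, k) for i in range(4 ** k) if not present[i]]
```

Given the twelve 2mers reported by the `--present` example, `absent_kmers` returns AA, TG, GC and GT, in the same order as the `--absent` example.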
Note that searches are performed not on "canonical" kmers, but on all unique kmers. For example, a search for the AG 2mer will be treated separately from the reverse complement CT 2mer.
Testing
This has been tested under Python 3.6.3 on OS X. Use of conda or pip may be required for installation of the bitstring library.
The makefile offers a python_test target to run expected "true" and "false" tests of searching for a pair of 2mers from some ad-hoc sequences piped in via standard input to the kmer-boolean.py script.
Owner
- Name: Alex Reynolds
- Login: alexpreynolds
- Kind: user
- Location: Seattle, WA USA
- Company: Altius Institute for Biomedical Sciences
- Website: bitsumma.com
- Repositories: 92
- Profile: https://github.com/alexpreynolds
Pug caregiver, curler, cyclist, gardener, beginning French scholar
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Alex Reynolds | a****s@g****m | 16 |
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 0
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: less than a minute
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- alexpreynolds (1)