Recent Releases of kmerdb
kmerdb - v0.9.5
Modest release with some bugfixes and a mostly stable interface for the 0.9 minor version. The goal of this release was to get the project into a condition where the de Bruijn graph structure is essentially workable for networkx graph utilities.
What's in this release?
Bugfixes for the graph and kmer submodules, which I've been using in the REPL to debug trivial use cases with error-free reads. A blog post (as of 7.31.25) is available with a brief description of the idea: use the networkx.has_eulerian_path helper to prototype components of an assembly process, with potential for further development via an equivalent of the helper implemented, perhaps, in Cython.
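As a rough illustration of the Eulerian-path idea, here is a minimal pure-Python sketch. It is not kmerdb's code: `de_bruijn_edges` and this stand-in `has_eulerian_path` are hypothetical names, and the function only mirrors the standard degree and connectivity conditions for a directed multigraph, which is the kind of check the networkx helper performs on a graph object.

```python
from collections import defaultdict


def de_bruijn_edges(seq, k):
    """One directed edge (prefix (k-1)-mer -> suffix (k-1)-mer) per k-mer occurrence."""
    return [(seq[i:i + k - 1], seq[i + 1:i + k]) for i in range(len(seq) - k + 1)]


def has_eulerian_path(edges):
    """Degree and connectivity test for an Eulerian path in a directed multigraph."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    undirected = defaultdict(set)
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
        undirected[u].add(v)
        undirected[v].add(u)
    nodes = set(undirected)
    if not nodes:
        return False  # assumption: an empty graph is treated as having no path
    # Weak connectivity: every node is reachable when edge direction is ignored.
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nbr in undirected[stack.pop()]:
            if nbr not in seen:
                seen.add(nbr)
                stack.append(nbr)
    if seen != nodes:
        return False
    # At most one start node (out - in == 1) and one end node (in - out == 1).
    diffs = [out_deg[n] - in_deg[n] for n in nodes]
    if any(abs(d) > 1 for d in diffs):
        return False
    return (diffs.count(1), diffs.count(-1)) in {(0, 0), (1, 1)}


# A single error-free "read" spells a path through its own de Bruijn graph.
connected = has_eulerian_path(de_bruijn_edges("ATGGCGTGCA", 4))
# Two reads that share no (k-1)-mer give a disconnected graph.
disconnected = has_eulerian_path(de_bruijn_edges("AAAC", 4) + de_bruijn_edges("GGGT", 4))
print(connected, disconnected)
```

With error-free reads that overlap, each k-mer occurrence is one edge, so the original sequence is itself an Eulerian path. Sequencing errors or coverage gaps break the degree balance or the connectivity, which is where a real assembler needs more than this check.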
The shred and is_sequence_na and similar functions are compatible with strings if no SeqRecord is provided.
In prior draft versions out on PyPI, the kmer module and other related functionality used by the parse and graph submodules were reworked for a better development experience. The logger submodule was recently fixed. Even appmap saw some minor action.
Although the total number of lines changed is small, the UX and development experiences have improved slightly.
Next goals
Better implementation of a networkx feature. Improved ideas for an aligner. I'm tabling the codon usage bias idea for the moment. I would like to focus some efforts on making money and finding the right fit for my skill level with programming as well as my current situation. I'd also like to focus some efforts on web development and even some leetcode in the future.
But more so than that, I need to develop my portfolio beyond just this one project.
- Python
Published by MatthewRalston 7 months ago
kmerdb - v0.8.21
Adds a working implementation of a codon count table and a chi-square test for codon-usage bias*
Please see kmerdb codons --help and kmerdb CUB -h for more details. Uses the scipy.stats chisquare implementation to run a chi-square test (with Bonferroni correction) for each sequence in the 'test' group against the aggregate from the 'reference' group created using kmerdb codons.
* There are many ways of ascertaining the bias in a group of sequences, typically full CDSes, and comparing the usage pattern. A codon count/frequency table is typically the launching point for a chi-square analysis or an ECN, RSCU, CAI, or other metrics.
- 0.8.21: Fixed logging issues; housekeeping on minimizer.py and the minimizer index functionality. The index file can get pretty big if you're including everything. For now, the setting that dumps all k-mers to disk stays True. A better file would have the empty, useless k-mer rows removed.
- Python
Published by MatthewRalston 8 months ago
kmerdb - v0.8.20
Adds a working implementation of a codon count table and a chi-square test for codon-usage bias*
Please see kmerdb codons --help and kmerdb CUB -h for more details. Uses the scipy.stats chisquare implementation to run a chi-square test (with Bonferroni correction) for each sequence in the 'test' group against the aggregate from the 'reference' group created using kmerdb codons.
* There are many ways of ascertaining the bias in a group of sequences, typically full CDSes, and comparing the usage pattern. A codon count/frequency table is typically the launching point for a chi-square analysis or an ECN, RSCU, CAI, or other metrics.
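To make the test-versus-reference comparison concrete, here is a small sketch that builds codon count tables and computes the chi-square statistic by hand. It is an illustration under assumptions, not the kmerdb implementation: `codon_counts` and `chi_square_statistic` are hypothetical helpers, the toy sequences are made up, and the p-value step (which the release delegates to scipy.stats) is left out so the block needs only the standard library.

```python
from collections import Counter


def codon_counts(cds):
    """Count in-frame codons of a coding sequence (a trailing partial codon is ignored)."""
    cds = cds.upper()
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))


def chi_square_statistic(observed, reference):
    """Chi-square statistic of observed codon counts against reference usage.

    Expected counts are the reference frequencies scaled to the observed total.
    Codons absent from the reference are skipped (expected count of zero).
    """
    ref_total = sum(reference.values())
    obs_total = sum(observed.values())
    stat = 0.0
    for codon, ref_count in reference.items():
        expected = obs_total * ref_count / ref_total
        stat += (observed.get(codon, 0) - expected) ** 2 / expected
    return stat


# Toy data, purely illustrative.
reference_group = ["ATGGCTGCTGCATAA"]
test_group = ["ATGGCAGCAGCTTAA", "ATGGCTGCTGCATAA"]

# Aggregate the reference group into one count table.
reference = sum((codon_counts(s) for s in reference_group), Counter())
stats = [chi_square_statistic(codon_counts(s), reference) for s in test_group]

# Bonferroni correction: divide the significance level by the number of tests.
bonferroni_alpha = 0.05 / len(test_group)
print(stats, bonferroni_alpha)
```

Each per-sequence p-value would then be compared against `bonferroni_alpha` rather than the uncorrected 0.05.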
- Python
Published by MatthewRalston 8 months ago
kmerdb - v0.8.17
Patches for read length, the kmer module, and the codons module (non-functional). New features include amino-acid to id interconversion. Amino acid k-mer size will be a pretty significant issue, as the base 5 binary encoding isn't really an intelligent solution for identifying an amino acid k-mer. I'm not going to develop that feature further.
The codons module, however, has hashmaps for converting 3-mer ids into the appropriate amino acid, and has synonymous usage hashmaps too, also targeting 3-mer ids. This feature will be developed further.
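A sketch of what such hashmaps could look like follows. The names (`LETTER_TO_BITS`, `CODON_ID_TO_AA`, `SYNONYMOUS_IDS`) are hypothetical, and the 2-bit encoding A=0, C=1, G=2, T=3 is an assumption for illustration rather than something these notes confirm. The amino acid string is the standard genetic code with codons enumerated in T, C, A, G order.

```python
from collections import defaultdict
from itertools import product

# Assumed 2-bit nucleotide encoding (not confirmed by the release notes).
LETTER_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_id(kmer):
    """Pack a nucleotide k-mer into an integer id, 2 bits per base."""
    kmer_id = 0
    for base in kmer:
        kmer_id = (kmer_id << 2) | LETTER_TO_BITS[base]
    return kmer_id


# Standard genetic code; '*' marks stop codons. Codon order: TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# 3-mer id -> amino acid.
CODON_ID_TO_AA = {
    kmer_to_id("".join(codon)): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}

# Amino acid -> sorted list of synonymous 3-mer ids.
SYNONYMOUS_IDS = defaultdict(list)
for codon_id, aa in sorted(CODON_ID_TO_AA.items()):
    SYNONYMOUS_IDS[aa].append(codon_id)

print(kmer_to_id("ATG"), CODON_ID_TO_AA[kmer_to_id("ATG")], SYNONYMOUS_IDS["L"])
```

Keying both maps on integer ids rather than strings means a codon count table indexed by 3-mer id can be grouped by amino acid without decoding any sequence.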
- Python
Published by MatthewRalston 9 months ago
kmerdb - v0.8.13
This is a very minor release cycle at this point on the 0.8 version. The project is still pre-alignment, pre-assembly, and short of other milestones. The 0.8.13 release focuses on making a minimizer index, to begin a discussion on minimizer exact matching or approximate matching between two sequences. Currently, only a stub for a minimizer-seeded alignment routine is implemented, along with a utility for creating single minimizer subsets of a k-mer array.
On a personal note, I'm working on some personal things, considering deepseek R1, and trying to get a new job. My goal is to earn some change, sadly.
Otherwise, I believe this is the direction the project needs to take to achieve its eventual goal of minimizer distances and interoperable minimizer and match index files, including how they can be used to seed alignments and how to store those intermediary forms as well.
That's the direction I'm going in, and I'm excited for more time to work on this.
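For readers new to the term, the sketch below shows the classic (w, k)-minimizer selection: in every window of w consecutive k-mers, keep the lexicographically smallest one. This is a generic textbook scheme under assumed parameters, and the function name `minimizers` is hypothetical. The notes above do not say which selection rule kmerdb's index uses.

```python
def minimizers(seq, k, w):
    """Return sorted unique (position, k-mer) minimizers over windows of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Lexicographically smallest k-mer; ties are broken by the leftmost position.
        best = min(window, key=lambda i: (kmers[i], i))
        chosen.add((best, kmers[best]))
    return sorted(chosen)


selected = minimizers("GATTACA", k=3, w=2)
print(selected)  # [(1, 'ATT'), (3, 'TAC'), (4, 'ACA')]
```

Two sequences that share a long enough exact substring select the same minimizers inside it, which is why a minimizer index can seed exact or approximate matches while storing only a subset of the k-mers.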
- Python
Published by MatthewRalston about 1 year ago
kmerdb - v0.8.5
Hotfix to the 0.8.4 release, which was said to be 'acceptance tested'. Then came the profile UI rework during the actual 0.8.4-0.8.5 interlude; the 0.8.4 commit that was chosen needed to be yanked, and the UI fix is replaced here with a hotfix. This release writes correct counts (not zeros, see diff) to the file. I'm glad I noticed just a few days after the 0.8.4 patch.
This release includes the recent features:
- graph subcommand
- usage|help subcommands
- --debug for default error handling
- -o|--output-name revised usage patterns
- samplesheet pattern
- --quiet throughout verbose commands
- Python
Published by MatthewRalston over 1 year ago
kmerdb - v0.8.2
Acceptance tested version v0.8. New features (vs 0.7.6+) include
- exit_summary
- graph subcommand
- usage/help subcommand
- Deprecations
- Improved logging
- Logfile
- --num-log-lines to display last X lines of log in the exit_summary error datastructure, upon raising an exception
- Command banner
- Fixed citation subcommand.
as well as a number of bugfixes
Notes
Acceptance tested 'stable' v0.8 release. Some regressions may remain silent in usage without the --debug flag, which skips the exit-summary feature. That feature condenses metadata and helps the developer collect and assess bugs, with relevant feature/step metadata, traceback, feature description, and other error information available at program exit. The structure is YAML/JSON.
The data structure schema is located at config.exit_summary_schema.
- Python
Published by MatthewRalston almost 2 years ago
kmerdb - v0.8.0
--debug flag introduced to skip the error/exit handling module. In brief, errors are caught and processed differently under --debug mode. Convenience features have been added to the standard invocation to provide an "exit summary", which (clearly) describes the 'last loggable line', abbreviates the logging to stderr, and captures traceback objects, errors, the program "steps" and "features" (described in config.py) where the program failed, and other relevant metadata.
Use --debug if the program exits early or with no logged information.
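The general pattern can be sketched as a wrapper that either lets exceptions propagate (debug mode) or condenses them into a serializable summary. Everything here is hypothetical: `run_with_exit_summary`, its parameters, and the field names are stand-ins for illustration and are not taken from kmerdb's config.py or its exit summary schema.

```python
import json
import traceback
from collections import deque


def run_with_exit_summary(step, feature, func, log_lines, num_log_lines=2, debug=False):
    """Run func; return (result, None) on success or (None, summary) on failure.

    With debug=True the wrapper is bypassed and exceptions propagate unchanged.
    """
    if debug:
        return func(), None
    try:
        return func(), None
    except Exception as exc:
        summary = {
            "step": step,              # hypothetical field names throughout
            "feature": feature,
            "error_type": type(exc).__name__,
            "error_message": str(exc),
            "traceback": traceback.format_exc(),
            # Keep only the last N log lines.
            "last_log_lines": list(deque(log_lines, maxlen=num_log_lines)),
        }
        return None, summary


def failing_step():
    raise ValueError("k must be positive")


log = ["parsing args", "opening input", "counting k-mers"]
result, summary = run_with_exit_summary(1, "toy k-mer counting", failing_step, log)
print(json.dumps(summary, indent=2))
```

The trade-off matches the advice above: the summary is convenient for bug reports, but when it hides too much, bypassing it shows the raw exception.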
kmerdb usage -m method introduced
Revisions tested across the board.
profile, graph, usage, matrix, distance, kmeans, and hierarchical have been acceptance tested.
- Python
Published by MatthewRalston almost 2 years ago
kmerdb - v0.7.8
kmerdb graph introduced, producing a new file format, .kdbg, an edge list. New metadata schema for the new format as well. kmerdb view and kmerdb header are compatible with the new format.
The goal is to create a weighted graph. Support for assembly and graph visualizations will come in the future.
After 0.7.6 the .kdb spec will be loosely deprecated. While the .kdb format may remain unchanged (don't know yet), the goal is to produce an adjacency list structure from only the k-mer counts and the 'neighbor' k-mer ids. After the format revision (mostly to the --all-metadata option), a new command kmerdb graph will be applied to generate an on-disk representation of an adjacency list.
- What does this mean?
At this point, the new feature is in the planning stage, and it is not known if backwards compatibility (< 0.7.7) will be supported. One goal is to create an adjacency list structure on disk from the --all-metadata augmented .kdb format. It is not clear yet if cycles will be permitted in the graph structure, or if a distinct "offset" flag will be used. An example follows.
- 0.7.6
.kdb format: col1 is row number, col2 is sort order, col3 is k-mer id, col4 is k-mer count, and col5 (--all-metadata) featured a loosely specified 'neighbor' JSON field, consisting of a dictionary with "A", "C", "T", "G", etc. keys. It was poorly implemented. Basically, the neighboring (left side and right side) k-mer ids were provided.
1 1 1 123
- 0.7.7+
.kdbg format: col1 is unique row number, col2 is k-mer id (may be repeated), and col3 is a .csv field of possible adjacent row-ids, corresponding to the k-mer id's (col2) neighbors in k-mer space. col4 represents a possible solution for the graph traversal that produces a Hamiltonian (whatever) walk through the graph, recapitulating either the exact (.fasta) assembly solution OR a potential solution to the assembly from available data. A feasible solution would use either networkx or a custom graph traversal algorithm that minimizes the penalty of omitting rows/k-mers, based on the suggestion of the shortest path to visit each k-mer once, but that also (?) maximizes the number of rows visited. I'm not sure yet how this will be specifically implemented, as the .kdbg format is the first step.
1 1234 2345,3456,... 3
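The left-side and right-side neighbor ids mentioned for both formats can be computed from a k-mer id with a few bit operations. The sketch below assumes a 2-bit encoding (A=0, C=1, G=2, T=3) and uses the hypothetical helpers `kmer_to_id` and `neighbor_ids`. It illustrates the idea only and does not reproduce the actual .kdb or .kdbg column contents.

```python
# Assumed 2-bit nucleotide encoding (an illustration, not confirmed by these notes).
LETTER_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_id(kmer):
    """Pack a nucleotide k-mer into an integer id, 2 bits per base."""
    kmer_id = 0
    for base in kmer:
        kmer_id = (kmer_id << 2) | LETTER_TO_BITS[base]
    return kmer_id


def neighbor_ids(kmer_id, k):
    """Return (left, right) lists of the four possible neighbor ids on each side."""
    mask = (1 << (2 * k)) - 1
    # Right neighbors: drop the first base, append one of A/C/G/T.
    right = [((kmer_id << 2) & mask) | b for b in range(4)]
    # Left neighbors: drop the last base, prepend one of A/C/G/T.
    left = [(kmer_id >> 2) | (b << (2 * (k - 1))) for b in range(4)]
    return left, right


left, right = neighbor_ids(kmer_to_id("ACG"), k=3)
print(left)   # ids of AAC, CAC, GAC, TAC
print(right)  # ids of CGA, CGC, CGG, CGT
```

An adjacency list built this way lists every possible neighbor. Restricting it to neighbors actually observed in the data is what turns it into a usable graph.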
- Python
Published by MatthewRalston almost 2 years ago
kmerdb - v0.7.1alpha
Migrating to a new wheel generation process using python -m build. The pyproject.toml generates a working installed module, kmerdb.
This can be confirmed by running
```bash
python -c "import kmerdb"
```
Similarly, direct module invocation works fine. This minor release is bugged because of the pyproject.toml.
```bash
python -m kmerdb -h
```
- Python
Published by MatthewRalston about 3 years ago
kmerdb - v0.6.3
This release contains several improvements to the primary file interfaces in fileutil.py, __init__.py, parse.py, kmer.py and others. This is a backwards incompatible release, skipping ahead a major version. You'll need to update kmerdb to use the new uint64 features, sort order, and more.
- Python
Published by MatthewRalston about 4 years ago
kmerdb - v0.5.1
This release cleans up some functionality on kmerdb view. It also modifies the code to use unsigned 64-bit integers (uint64) throughout. Let's see how pervasive the specification becomes as we begin to explore the Pearson and Spearman space for the different depths. We'll really be needing another exploratory study regarding high fidelity coverage (100x-200x).
The kdb files can also be sorted, but reading them has become a little bit of a challenge. Will try in the next pull request. This had to be done on master today. The suite deserved a bit of TLC.
- Python
Published by MatthewRalston about 4 years ago
kmerdb - v0.1.0
This is the first release of the backwards incompatible change to the PostgreSQL database (technically 0.0.11 or 0.0.12 were the first releases that contained the changes). Now the program is streamlined to use in-memory k-mer counting and aggregation.
- Python
Published by MatthewRalston about 4 years ago
kmerdb - v0.0.10
The true alpha release.
Bugfix for #46: makes everything fully pipeable by standardizing on nargs="*" instead of nargs="+" to permit implicit STDIN reading:
```bash
# Generate count matrix | Normalize NB counts | nxn distance matrix | k-means clustering
kmerdb matrix Unnormalized *.kdb | kmerdb matrix Normalized | kmerdb distance correlation | kmerdb kmeans -k 10 sklearn
```
Again, the 4 main steps in this data processing workflow are:
1. Profile generation
```bash
kmerdb profile genome.fa genome1.kdb
kmerdb profile input1.fa input2.fa compound1.kdb
```
2. Data matrix (X) generation
```bash
# 4 options for matrix generation/processing:
# Normalization vs PCA vs tSNE
kmerdb matrix Unnormalized *.kdb > X.tsv
kmerdb matrix Normalized *.kdb # Uses DESeq2 normalization suitable for NB distributed count matrices via rpy2
kmerdb matrix PCA -n 20 *.kdb
kmerdb matrix tSNE -n 20 *.kdb
```
3. Distance matrix generation
- 22 distance metrics available
```bash
kmerdb distance spearman X.tsv
kmerdb matrix Normalized *.kdb | kmerdb distance euclidean
```
4. Hierarchical/kmeans clustering
```bash
kmerdb kmeans -k 10 -i input.tsv sklearn # Biopython k-means implementation also available
kmerdb hierarchical -i input.tsv -m ward
```
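The nargs change behind the pipeable workflow can be shown with a toy argparse parser. With nargs="+" an empty argument list is a usage error, while nargs="*" yields an empty list that the program can treat as "read STDIN". The parser, the `read_rows` helper, and the program name below are hypothetical and only mirror the pattern, not kmerdb's actual argument handling.

```python
import argparse
import io
import sys


def build_parser():
    parser = argparse.ArgumentParser(prog="toy-matrix")
    # nargs="*" accepts zero or more paths; zero means "fall back to STDIN".
    parser.add_argument("inputs", nargs="*", help="input files (default: read STDIN)")
    return parser


def read_rows(argv, stdin=sys.stdin):
    """Return input lines from the given files, or from stdin if no files are named."""
    args = build_parser().parse_args(argv)
    if not args.inputs:
        return [line.rstrip("\n") for line in stdin]
    rows = []
    for path in args.inputs:
        with open(path) as handle:
            rows.extend(line.rstrip("\n") for line in handle)
    return rows


# Simulate an upstream command piping two tab-separated rows into this one.
rows = read_rows([], stdin=io.StringIO("a\tb\n1\t2\n"))
print(rows)
```

Because the positional list may be empty, every stage in a pipeline can omit its input argument and consume the previous stage's output instead.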
- Python
Published by MatthewRalston over 4 years ago
kmerdb - alpha
This has been used in a test sweep over 30 genomes, and has been deemed to be suitable for academic use.
The new release features IUPAC support, a full graph database (in bgzf format only) with the --all-metadata flag on the kmerdb profile command.
- Python
Published by MatthewRalston about 5 years ago
kmerdb - Pre-alpha
This is the first release that can be described as pre-alpha. The naming convention is that the module and CLI code is now referred to as 'kmerdb' to avoid namespace issues with pip install kdb, a name occupied by another package. We will continue to refer to the database files as .kdb files.
The command contains the following subcommands:
- profile
- distance
- matrix
- kmeans
- hierarchical
- index
- view
- header
- Python
Published by MatthewRalston about 5 years ago