minsh
A minimal Python re-implementation of the A* with seed heuristic for exact global alignment (edit distance) in near-linear time
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
✓DOI references
Found 3 DOI reference(s) in README -
✓Academic publication links
Links to: biorxiv.org, springer.com -
○Committers with academic emails
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (8.4%) to scientific vocabulary
Repository
A minimal Python re-implementation of the A* with seed heuristic for exact global alignment (edit distance) in near-linear time
Basic Info
Statistics
- Stars: 22
- Watchers: 2
- Forks: 7
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Minimalistic Seed Heuristic for A*
minSH is a short working implementation that aligns sequences $A$ and $B$ end to end using a minimal number of edit operations (substitutions, insertions and deletions). As a side effect, it computes the exact edit distance $ed(A,B)$ with near-linear scaling, given limited divergence $d=ed(A,B)/max(|A|, |B|)$. astar.py implements A* with seed heuristic $h{seed}(i,j) = \Big| \big\{ s \in Seeds{\geq i} \mid s \notin B \big\} \Big|$ in a short and simple way:
```python def buildseedheuristic(A, B, k): """Builds the admissible seed heuristic for A and B with k-mers."""
kmers = { B[j:j+k] for j in range(len(B)-k+1) } # O(nk): Index all kmers from B. O(n) with rolling hash
seeds = [ A[i:i+k] for i in range(0, len(A)-k+1, k) ] # O(n): Split A into non-overlapping seeds of length k.
is_seed_missing = [ s not in kmers for s in seeds ] # O(n): Is seed unseen in B, so >=1 edit must be made to align it.
suffix_sum = np.cumsum(is_seed_missing[::-1])[::-1] # O(n): How many of the remaining seeds have to be edited
return lambda ij, k=k: suffix_sum[ ceildiv(ij[0], k) ] # O(1): How many seeds starting after the current position i have to be edited?
```
Next, we just use the seed heuristic for a starndard A* on the alignment graph A x B:
python
h_seed = build_seed_heuristic(A, B, k=log(len(A)))
astar(A, B, h_seed)
Background
Sequence alignment as a shortest path problem
We can look at aligning sequences A and B as a process of sequentially aligning longer and longer prefixes of $A$ ($A[\dots i]$) to prefixes of $B$ ($B[\dots j]$) by matching, substituting, inserting or deleting single letters (minimizing edit distance). Thus, finding an alignment with a minimal number of edit operations is equivalent to finding a shortest path starting from node $s=(0,0)$ and ending at node $t=(|A|, |B|)$ in a graph of prefixes (also called edit graph or alignment graph), where each edge corresponds to one operation (with cost $0$ for a match or $1$ otherwise). This graph representation enables us to apply general shortest path algorithms.
Dijkstra's shortest path algorithm
The simplest algorithm we can use is Dijkstra's algorithm which finds a shortest path of length $d$ by sequentially exploring nodes $u$ by increasing distance $dist(s,u)$ from the start node $s$ and until expanding the end node $t$. The problem is that the search circles around $s$ regardless of where the target $t$ is, so in our 2D lattice graph the number of explored nodes with $dist(s,u) < d$ grows quadratically in $d$. For most data (e.g. genetic) the edit distance $d$ grows proportionally to $|s|$ and $|t|$, so the whole algorithm becomes quadratic which is practically infeasible for long sequences.
A* algorithm
The A* algirthm is a generalization of Dijkstra's algorithm that explores the nodes $u$ not just by their distance from the start $dist(s,u)$ but also adding an estimation of the remaining distance to the target $dist(s,u) + h(u,t)$. This heuristic function $h(u,t)$ allows for a potentially very direct search towards the target but it has to be designed depending on specific knowledge of the graph/task to be: 1. admissible (i.e. to never exceed the remaining distance $d(u,t)$), or otherwise the found path may not be shortest. 2. accurate in estimating $dist(s,u)$, or otherwise the search will not be directly going to $t$ 3. fast to be computed for each explored node, or otherwise, the A* algorithm will be slow in practice
Match pruning
Usage
To use in your projects, simply add this repository as a dependency:
bash
pip install git+https://github.com/pesho-ivanov/minSH.git
Then call the align function with two strings A and B:
```python import math from minsh.astar import hdijkstra, align, buildseedh, print_stats
A = "ACGT" * 100 B = "AGCT" * 100
alphabetsize = 4 k = math.ceil(math.log(len(A), alphabetsize)) hseed = buildseedh(A, B, k)
gseed = align(A, B, hseed) printstats(A, B, k, gseed) ```
The library can also be used as a standalone command-line script:
```bash $ astar data/smallA.fa data/smallB.fa
Aligning sequences with len(A)=18, k=3: Edit distance: 3 Error rate: 16.67% Explored band: 3.39 ```
To explore additional tooling - clone the repository and install the dependencies:
bash
git clone https://github.com/pesho-ivanov/minSH && cd minSH
pip install -r requirements.txt
python scripts/test.py # tests
python scripts/generate.py # synthetic FASTA files
TODO
Optimizations:
- rolling hash: for linear time precomputation
- greedy matching (aka sliding)
- pruning using index trees
Presentation:
- visualization of the alignment (png, gif)
- interactivity for adding patches
- report stats
- benchmark
Related work
minSH is inspired by minGPT to be small, clean, interpretable and educational re-implementation of the recent aligning approach based on the A* shortest path algorithm.
Detailed Publications on A* for alignment
AStarix semi-global seq-to-graph aligner:
- Ivanov et al., (RECOMB 2020) — Introduces A* with a trie for semi-global.
- Ivanov et al., (RECOMB 2022) — Introduces SH for read alignment on general graph references using trie.
A*PA global seq-to-seq aligner:
- Groot Koerkamp and Ivanov (preprint 2023) — Applies SH to global alignment (edit distance). Generalizes SH with chaining, inexact matches, gap costs (for higher error rates). Optimizes SH with pruning (for near-linear scaling with length), and A* with diagonal transition (for faster quadratic mode). Licensed under the Mozilla Public License, Version 2.0. In short, you are free to use and abuse, but give it back to the community.
Owner
- Name: Pesho Ivanov
- Login: pesho-ivanov
- Kind: user
- Location: State College
- Company: Penn State
- Website: pesho-ivanov.github.io
- Twitter: peshotrie
- Repositories: 31
- Profile: https://github.com/pesho-ivanov
Comp.Bio postdoc @ Penn State; PhD from ETH Zurich: Optimal sequence alignment using A*
Citation (CITATION.cff)
cff-version: 1.2.0 message: "If you use this software, please cite it as below." authors: - family-names: "Ivanov" given-names: "Pesho" orcid: 'https://orcid.org/0000-0002-8119-3849' email: ivanov@pesho.info title: "minSH: Minimal re-implementation of the A* with seed heuristic" version: 1.0.0 date-released: 2023-06-05 url: "https://github.com/pesho-ivanov/minSH"
GitHub Events
Total
- Watch event: 4
- Push event: 1
- Fork event: 1
Last Year
- Watch event: 4
- Push event: 1
- Fork event: 1
Committers
Last synced: over 1 year ago
Top Committers
| Name | Commits | |
|---|---|---|
| Pesho Ivanov | p****v | 20 |
| Pesho Ivanov | i****v@p****o | 18 |
| Ash Vardanian | 1****n | 18 |
| Leo Alekseyev | d****k@g****m | 1 |
| Tyler Neylon | t****n@g****m | 1 |
| Robert Ghilduta | 4****a | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 0
- Total pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: about 10 hours
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- ashvardanian (3)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- editdistance *
- fenwick *
- numpy *
- rolling *