ish

Alignment-based filtering CLI tool

https://github.com/bioradopensource/ish

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: biorxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.2%) to scientific vocabulary
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: BioRadOpenSource
  • License: apache-2.0
  • Language: Mojo
  • Default Branch: main
  • Homepage:
  • Size: 1.31 MB
Statistics
  • Stars: 53
  • Watchers: 6
  • Forks: 1
  • Open Issues: 5
  • Releases: 5
Created 12 months ago · Last pushed 8 months ago
Metadata Files
Readme Changelog Contributing License Citation Codeowners

README.md



Fast record filtering by alignment-scores 🔥


ish is a CLI tool for searching for matches against records using different alignment methods.

Build

  1. Install pixi

  2. pixi run build

  3. ./ish --help

Pixi / Conda install

```
pixi global install -c conda-forge -c https://repo.prefix.dev/modular-community -c https://conda.modular.com/max ish

# Or
conda install -c conda-forge -c https://repo.prefix.dev/modular-community -c https://conda.modular.com/max ish
```

For best performance it's recommended to build from source.

Usage

```sh
❯ ./ish --help
ish
Search for inexact patterns in files.

ARGS:
<ARGS (>=1)>...
    Pattern to search for, then any number of files or directories to search.

FLAGS:
--help [Default: False]
    Show help message

--verbose <Bool> [Default: False]
    Verbose logging output.

OPTIONS:
--scoring-matrix [Default: ascii]
    The scoring matrix to use.
    ascii: does no encoding of input bytes, matches are 2, mismatch is -2.
    blosum62: encodes searched inputs as amino acids and uses the classic Blosum62 scoring matrix.
    actgn: encodes searched inputs as nucleotides, matches are 2, mismatch is -2, Ns match anything.
    actgn0: encodes searched inputs as nucleotides, matches are 2, mismatch is -2, Ns don't count toward score.

--score <Float> [Default: 0.8]
    The min score needed to return a match. Results >= this value will be returned. The score is the found alignment score / the optimal score for the given scoring matrix and gap-open / gap-extend penalty.

--gap-open <Int> [Default: 3]
    Score penalty for opening a gap.

--gap-extend <Int> [Default: 1]
    Score penalty for extending a gap.

--match-algo <String> [Default: striped-semi-global]
    The algorithm to use for matching: [striped-local, striped-semi-global]

--record-type <String> [Default: line]
    The input record type: [line, fastx]

--threads <Int> [Default: 10]
    The number of threads to use. Defaults to the number of physical cores.

--batch-size <Int> [Default: 268435456]
    The number of bytes in a parallel processing batch. Note that this may use 2-3x this amount to account for intermediate transfer buffers.

--max-gpus <Int> [Default: 0]
    The max number of GPUs to try to use. If set to 0 this will ignore any found GPUs. In general, if you have only one query then there won't be much using more than 1 GPU. GPUs won't always be faster than CPU parallelization depending on the profile of data you are working with.

--output-file <String> [Default: /dev/stdout]
    The file to write the output to, defaults to stdout.

--sg-ends-free <String> [Default: FFTT]
    The ends-free for semi-global alignment, if used. The free ends are: (query_start, query_end, target_start, target_end). These must be specified with a T or F, all four must be specified. By default this target ends are free.

```
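To make the `--score` threshold concrete, here is a minimal Python sketch of the match/mismatch rules and the normalized score described in the help text. It is not ish's implementation. The function names are hypothetical, `blosum62` is omitted, and two readings are assumptions: "Ns match anything" is taken to mean an N scores as a match, and the "optimal score" is taken to be the query aligned against itself with no gaps.

```python
MATCH, MISMATCH = 2, -2  # values stated in the help text


def score_pair(a: str, b: str, scheme: str = "ascii") -> int:
    """Score one pair of characters (hypothetical helper, not ish's API)."""
    if scheme == "ascii":
        return MATCH if a == b else MISMATCH
    if scheme in ("actgn", "actgn0"):
        if "N" in (a, b):
            # Assumption: actgn treats N as a match, actgn0 scores it as 0.
            return MATCH if scheme == "actgn" else 0
        return MATCH if a == b else MISMATCH
    raise ValueError(f"unsupported scheme in this sketch: {scheme}")


def optimal_score(query: str, scheme: str = "ascii") -> int:
    """Assumption: the best possible score is the query matched to itself."""
    return sum(score_pair(c, c, scheme) for c in query)


def normalized_score(raw: int, query: str, scheme: str = "ascii") -> float:
    """Found alignment score divided by the optimal score."""
    best = optimal_score(query, scheme)
    return raw / best if best > 0 else 0.0


# A 4-character query has an optimal score of 8 under these rules, so a raw
# alignment score of 6 normalizes to 0.75 and falls below the default 0.8.
print(normalized_score(6, "ACGT") >= 0.8)
```

Under this reading, raising `--score` toward 1.0 demands near-exact matches, while lowering it admits more mismatches and gaps.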

```sh
# Some actual usage.
❯ ./ish blosum62 ./ish_bench_aligner.mojo
./ish_bench_aligner.mojo:94 default_value=String("Blosum50"),
./ish_bench_aligner.mojo:96 "Scoring matrix to use. Currently supports: [Blosum50,"
./ish_bench_aligner.mojo:97 " Blosum62, ACTGN]"
./ish_bench_aligner.mojo:379 if matrix_name == "Blosum50":
./ish_bench_aligner.mojo:380 matrix = ScoringMatrix.blosum50()
./ish_bench_aligner.mojo:381 elif matrix_name == "Blosum62":
./ish_bench_aligner.mojo:382 matrix = ScoringMatrix.blosum62()
./ish_bench_aligner.mojo:390 ## Assuming we are using Blosum50 AA matrix for everything below this for now.
```

🔥 Note

The filepath:linenumber prefix on each match lets you cmd-click the match and have VS Code open the file at that location.
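If you want to post-process matches in a script, the prefix can be split off with a small parser. The sketch below is only an illustration: it assumes each output line has the shape `filepath:linenumber` followed by whitespace and the matched text, and both the function name and the sample line are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed shape: <path>:<line number><whitespace><matched text>
MATCH_LINE = re.compile(r"^(?P<path>.+?):(?P<line>\d+)\s*(?P<text>.*)$")


def parse_match(output_line: str) -> Optional[Tuple[str, int, str]]:
    """Split one output line into (path, line number, text), or None."""
    m = MATCH_LINE.match(output_line.rstrip("\n"))
    if m is None:
        return None
    return m.group("path"), int(m.group("line")), m.group("text")


# Hypothetical sample line, not real ish output.
print(parse_match("./example.txt:12 some matching text"))
```

Paths that themselves contain a colon followed by digits would confuse this pattern, so treat it as a starting point rather than a robust parser.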

Match Methods

  • striped-semi-global: Striped Semi-global, SIMD accelerated, GPU accelerated when available, supports affine gaps and scoring matrices. Specify ends-free with the --sg-ends-free option.
  • striped-local: Striped Smith-Waterman, SIMD accelerated, supports affine gaps and scoring matrices.
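For readers unfamiliar with ends-free alignment, the following Python sketch shows what a semi-global score with affine gaps computes. It is a plain dynamic-programming version, not the striped SIMD or GPU kernels that ish uses, and the function name is hypothetical. Two assumptions are baked in: a gap of length k costs `gap_open + (k - 1) * gap_extend`, and scoring is the simple 2 / -2 match/mismatch scheme. The `ends_free` string follows the order given for `--sg-ends-free`: (query_start, query_end, target_start, target_end).

```python
NEG_INF = float("-inf")


def gap_cost(length, gap_open, gap_extend):
    # Assumed convention: the first gap position costs gap_open, the rest gap_extend.
    return 0 if length == 0 else gap_open + (length - 1) * gap_extend


def semi_global_score(query, target, ends_free="FFTT",
                      match=2, mismatch=-2, gap_open=3, gap_extend=1):
    """Best ends-free alignment score (illustrative sketch, not ish's code)."""
    if len(ends_free) != 4 or set(ends_free) - {"T", "F"}:
        raise ValueError("ends_free must be four characters, each T or F")
    q_start, q_end, t_start, t_end = (c == "T" for c in ends_free)
    n = len(target)

    # Row 0: skipping a target prefix is free only if target_start is free.
    prev_h = [0 if t_start else -gap_cost(j, gap_open, gap_extend)
              for j in range(n + 1)]
    prev_f = [NEG_INF] * (n + 1)
    best_last_col = prev_h[n]  # best score with the whole target consumed

    for i in range(1, len(query) + 1):
        # Column 0: skipping a query prefix is free only if query_start is free.
        cur_h = [0 if q_start else -gap_cost(i, gap_open, gap_extend)] + [0] * n
        cur_f = [NEG_INF] * (n + 1)
        e = NEG_INF  # best score ending in a gap that consumes target characters
        for j in range(1, n + 1):
            e = max(e - gap_extend, cur_h[j - 1] - gap_open)
            # cur_f: best score ending in a gap that consumes query characters
            cur_f[j] = max(prev_f[j] - gap_extend, prev_h[j] - gap_open)
            s = match if query[i - 1] == target[j - 1] else mismatch
            cur_h[j] = max(prev_h[j - 1] + s, e, cur_f[j])
        best_last_col = max(best_last_col, cur_h[n])
        prev_h, prev_f = cur_h, cur_f

    candidates = [prev_h[n]]          # both sequences fully consumed
    if t_end:
        candidates.append(max(prev_h))    # unaligned target suffix is free
    if q_end:
        candidates.append(best_last_col)  # unaligned query suffix is free
    return max(candidates)


# With the default FFTT, the query must align end to end, but it may sit
# anywhere inside the target without penalty.
print(semi_global_score("ACGT", "TTACGTTT"))
```

Setting all four flags to F turns this into a global alignment. A local (Smith-Waterman) score differs in that every cell is additionally clamped at zero and the best cell anywhere in the matrix is reported.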

Record Types

  • line: match against one line at a time, à la grep
  • fastx: match against the sequence portion of FASTA or FASTQ records.
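The difference between the two record types can be sketched in Python as follows. These helpers are hypothetical and are not ish's parser: the `fastx` reader assumes FASTA records start with `>` and may span several sequence lines, and that FASTQ records are exactly four lines each starting with `@`.

```python
from typing import Iterable, Iterator, Tuple


def iter_line_records(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """line mode: every line is its own record, numbered from 1."""
    for number, line in enumerate(lines, start=1):
        yield number, line.rstrip("\r\n")


def iter_fastx_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """fastx mode: yield (name, sequence); only the sequence is searched."""
    rows = [line.rstrip("\r\n") for line in lines if line.strip()]
    if not rows:
        return
    if rows[0].startswith(">"):
        # FASTA: concatenate sequence lines until the next header.
        name, parts = None, []
        for row in rows:
            if row.startswith(">"):
                if name is not None:
                    yield name, "".join(parts)
                name, parts = row[1:], []
            else:
                parts.append(row)
        yield name, "".join(parts)
    elif rows[0].startswith("@"):
        # FASTQ (assumed strict four-line records): header, sequence, '+', qualities.
        for k in range(0, len(rows) - 3, 4):
            yield rows[k][1:], rows[k + 1]
    else:
        raise ValueError("input does not look like FASTA or FASTQ")


fasta = [">r1\n", "AC\n", "GT\n", ">r2\n", "TT\n"]
print(list(iter_fastx_records(fasta)))
```

In `line` mode the header and quality lines of a FASTQ file would be searched like any other line, which is why `fastx` exists as a separate record type.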

ish-aligner

This is a benchmarking tool based on parasail_aligner.

⚠️ Warning

ish-aligner and all variations of it are for development purposes only.

Further Reading

The associated paper is available on bioRxiv: https://doi.org/10.1101/2025.06.04.657890

Future Work

  • Support multiple queries
  • Choose a better default between CPU and GPU (needs more thought). The GPU is much faster on big files, long-running jobs, and many files, while the CPU is faster for small jobs.
  • Add ability to not skip dotfiles

Rattler Build

For testing the build process for modular-community

    pixi global install rattler-build
    rattler-build build -c https://repo.prefix.dev/modular-community -c https://conda.modular.com/max -c conda-forge --skip-existing=all -r ./recipe.yaml

Owner

  • Name: BioRadOpenSource
  • Login: BioRadOpenSource
  • Kind: organization
  • Location: San Francisco Bay Area

Public facing repos

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Stadick"
  given-names: "Seth"
  orcid: "https://orcid.org/0009-0002-0915-9459"
title: "ish"
version: 1.1.1
doi: https://doi.org/10.1101/2025.06.04.657890
date-released: 2025-06-09
url: "https://github.com/BioRadOpenSource/ish"

GitHub Events

Total
  • Create event: 14
  • Issues event: 3
  • Release event: 2
  • Watch event: 30
  • Delete event: 9
  • Issue comment event: 10
  • Public event: 1
  • Push event: 34
  • Pull request review event: 10
  • Pull request review comment event: 6
  • Pull request event: 18
Last Year
  • Create event: 14
  • Issues event: 3
  • Release event: 2
  • Watch event: 30
  • Delete event: 9
  • Issue comment event: 10
  • Public event: 1
  • Push event: 34
  • Pull request review event: 10
  • Pull request review comment event: 6
  • Pull request event: 18