ska

Split k-mer analysis – version 2

https://github.com/bacpop/ska.rust

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.1%) to scientific vocabulary

Keywords

alignment bioinformatics k-mer rust sequence ska
Last synced: 6 months ago

Repository

Statistics
  • Stars: 88
  • Watchers: 6
  • Forks: 6
  • Open Issues: 9
  • Releases: 21
Created over 3 years ago · Last pushed 7 months ago
Metadata Files
Readme License Citation

README.md

Split K-mer Analysis (version 2)

Description

This is a reimplementation of the SKA package in the Rust language, by Johanna von Wachsmann, Simon Harris and John Lees. We are also grateful to have received user contributions from:

  • Romain Derelle
  • Tommi Mäklin
  • Joel Hellewell
  • Timothy Russell
  • Nicholas Croucher
  • Dan Lu

Split k-mer analysis (version 2) uses exact matching of split k-mer sequences to align closely related sequences, typically small haploid genomes such as bacteria and viruses.

SKA can only align SNPs that are further apart than the k-mer length, and it does not use a gap penalty approach or give alignment scores. Its advantages are speed and flexibility, particularly the ability to run in a reference-free manner (i.e. including accessory genome variation) on both assemblies and reads.
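To make the idea concrete, here is a minimal sketch of split k-mer matching. It is not the ska.rust implementation: the function names `split_kmers` and `snps` are hypothetical, only the forward strand is considered, and repeated split k-mers simply overwrite each other. Each window of odd length k is split into its two flanks (the key) and its middle base (the value); a SNP is reported wherever two sequences share the same flanks but differ in the middle base.

```rust
use std::collections::HashMap;

/// Map each split k-mer (left flank + right flank) to its middle base.
/// Illustrative only: forward strand, last copy wins for repeated flanks.
fn split_kmers(seq: &[u8], k: usize) -> HashMap<Vec<u8>, u8> {
    assert!(k % 2 == 1 && k >= 3, "k must be odd and at least 3");
    let half = k / 2;
    let mut map = HashMap::new();
    for window in seq.windows(k) {
        // Skip windows containing anything other than A, C, G or T (e.g. N).
        if !window.iter().all(|&b| matches!(b, b'A' | b'C' | b'G' | b'T')) {
            continue;
        }
        let mut flanks = Vec::with_capacity(k - 1);
        flanks.extend_from_slice(&window[..half]);
        flanks.extend_from_slice(&window[half + 1..]);
        map.insert(flanks, window[half]);
    }
    map
}

/// Report (flanks, base in a, base in b) for every split k-mer whose flanks
/// match exactly in both sequences but whose middle bases differ.
fn snps(a: &[u8], b: &[u8], k: usize) -> Vec<(Vec<u8>, u8, u8)> {
    let map_a = split_kmers(a, k);
    let map_b = split_kmers(b, k);
    let mut out: Vec<_> = map_a
        .iter()
        .filter_map(|(flanks, &base_a)| match map_b.get(flanks) {
            Some(&base_b) if base_b != base_a => Some((flanks.clone(), base_a, base_b)),
            _ => None,
        })
        .collect();
    out.sort();
    out
}
```

With k = 5, calling `snps(b"ACGTTGCAAGT", b"ACGTTACAAGT", 5)` finds the single G/A difference between the flanks TT and CA. If a second difference is placed right next to the first, no flanks match across the two sequences at that site and neither SNP is reported, which is the distance limitation described above.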

Citation

Romain Derelle, Johanna von Wachsmann, Tommi Mäklin, Joel Hellewell, Timothy Russell, Ajit Lalvani, Leonid Chindelevitch, Nicholas J. Croucher, Simon R. Harris, John A. Lees (2024). Seamless, rapid, and accurate analyses of outbreak genomic data using split k-mer analysis. Genome Research, 34(10), 1661–1673.

https://genome.cshlp.org/content/34/10/1661.abstract

Documentation

Documentation can be found at https://docs.rs/ska. We also have some tutorials available.

Installation

Choose from:

  1. Download a binary from the releases.
  2. Use cargo install ska or cargo add ska.
  3. Use conda install -c bioconda ska2 (note the two!).
  4. Build from source

For 2) or 4) you must have the Rust toolchain installed.

OS X users

If you have an M1/M2 (arm64) Mac, we do not currently build binaries automatically, so we recommend option 2) or 4) for best performance.

If you get a message saying the binary isn't signed by Apple and can't be run, use the following command to bypass this: xattr -d "com.apple.quarantine" ./ska

Build from source

  1. Clone the repository with git clone.
  2. Run cargo install --path . or RUSTFLAGS="-C target-cpu=native" cargo install --path . to optimise for your machine.

Differences from SKA1

Optimisations include:

  • Integer DNA encoding, optimised parsing from FASTA/FASTQ.
  • Faster dictionaries.
  • Full parallelisation of build phase.
  • Smaller, standardised input/output files. Faster to save/load.
  • Reduced memory footprint and increased speed with read filtering.
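As an illustration of what integer DNA encoding means, the sketch below packs bases into a `u64` at two bits per base. The bit assignment (A=0, C=1, G=2, T=3) and the function names are assumptions made for this example, and the actual encoding used inside ska.rust may differ. The point is that once k-mers are integers, exact matching becomes a single integer comparison or hash lookup.

```rust
/// Encode one base in two bits. The ordering A=0, C=1, G=2, T=3 is an
/// assumption for this example, not necessarily what ska.rust uses.
fn encode_base(b: u8) -> Option<u64> {
    match b {
        b'A' | b'a' => Some(0),
        b'C' | b'c' => Some(1),
        b'G' | b'g' => Some(2),
        b'T' | b't' => Some(3),
        _ => None, // N and other ambiguity codes cannot be packed
    }
}

/// Pack up to 32 bases into a u64, first base in the most significant bits.
fn encode_kmer(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut packed = 0u64;
    for &b in seq {
        packed = (packed << 2) | encode_base(b)?;
    }
    Some(packed)
}

/// Unpack `len` bases from a packed integer back into text.
fn decode_kmer(mut packed: u64, len: usize) -> String {
    let mut bases = vec![b'A'; len];
    for slot in bases.iter_mut().rev() {
        *slot = b"ACGT"[(packed & 3) as usize];
        packed >>= 2;
    }
    String::from_utf8(bases).expect("ACGT bytes are valid UTF-8")
}
```

For example, `encode_kmer(b"ACGT")` gives `Some(27)` (binary 00 01 10 11), and `decode_kmer(27, 4)` recovers "ACGT".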

And other improvements:

  • IUPAC uncertainty codes for multiple copy split k-mers.
  • Uncertainty with self-reverse-complement split k-mers (palindromes).
  • Fully dynamic files (merge, delete samples).
  • Native VCF output for map.
  • Support for known strand sequence (e.g. RNA viruses).
  • Stream to STDOUT, or file with -o.
  • Simpler command line combining ska fasta, ska fastq, ska alleles and ska merge into the new ska build.
  • Option for single commands to run ska align or ska map.
  • New coverage model for filtering FASTQ files with ska cov.
  • Logging.
  • CI testing.

Together, these changes make ska.rust run faster, with smaller files and a lower memory footprint than the original.
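The two uncertainty bullets above can be sketched as follows. This is one reading of those bullets rather than the ska.rust code, and all function names are hypothetical. When the same flanks are seen with several different middle bases, the bases are merged into a single IUPAC ambiguity code. When the flanks are their own reverse complement, the strand of the middle base cannot be determined, so in this sketch it is merged with its complement (A or T becomes W, C or G becomes S).

```rust
/// One bit per base so that sets of bases can be combined with bitwise OR.
fn base_mask(b: u8) -> u8 {
    match b {
        b'A' => 1,
        b'C' => 2,
        b'G' => 4,
        b'T' => 8,
        _ => 0,
    }
}

/// Merge observed middle bases into one IUPAC code.
/// Returns b'-' if no A, C, G or T was supplied.
fn merge_iupac(bases: &[u8]) -> u8 {
    let mask = bases.iter().fold(0u8, |m, &b| m | base_mask(b));
    // Index = bitmask: 3 = A|C = M, 5 = A|G = R, ..., 15 = any base = N.
    b"-ACMGRSVTWYHKDBN"[mask as usize]
}

fn complement(b: u8) -> u8 {
    match b {
        b'A' => b'T',
        b'C' => b'G',
        b'G' => b'C',
        b'T' => b'A',
        other => other,
    }
}

/// True when the flanks read the same on both strands, i.e. the left flank
/// equals the reverse complement of the right flank.
fn is_self_rc(left: &[u8], right: &[u8]) -> bool {
    left.len() == right.len()
        && left
            .iter()
            .zip(right.iter().rev())
            .all(|(&l, &r)| l == complement(r))
}

/// Middle base of a palindromic split k-mer: the base or its complement.
fn palindrome_middle(middle: u8) -> u8 {
    merge_iupac(&[middle, complement(middle)])
}
```

Here `merge_iupac(b"AG")` returns `b'R'`, `is_self_rc(b"AC", b"GT")` is true, and `palindrome_middle(b'A')` returns `b'W'`.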

Planned features

  • Sparse data structure which will reduce space and make parallelisation more efficient. Issue #47.
  • 'fastcall' mode. Issue #52.

Feature ideas (not definitely planned)

  • Add support for ambiguity in VCF output (ska map). Issue #5.
  • Non-serial loading of .skf files (for when they are very large). Issue #22.
  • Alternative mixture models for read error correction. Issue #50.

Things you can no longer do

  • Use k > 63 (shouldn't be necessary? Let us know if you need this and why).
  • ska annotate (use bedtools).
  • ska compare, ska humanise, ska info or ska summary (replaced by ska nk --full-info).
  • ska unique (you can parse the output of ska nk --full-info if you want this functionality, but we don't think it is used much).
  • ska type (use PopPUNK instead of MLST 🙂)
  • Ns are always skipped, and will not be found in any split k-mers.
  • .skf files are not backwards compatible with version 1.

Owner

  • Name: Bacterial population genetics
  • Login: bacpop
  • Kind: organization
  • Email: contact@bacpop.org
  • Location: United Kingdom

Pathogen Informatics and Modelling @ EMBL-EBI / Bacterial Evolutionary Epidemiology Group @ Imperial College London

Citation (CITATION.cff)

cff-version: 1.2.0
message: If you use this software, please cite both the article from preferred-citation and the software itself.
authors:
  - family-names: Derelle
    given-names: Romain
  - family-names: von Wachsmann
    given-names: Johanna
  - family-names: Mäklin
    given-names: Tommi
  - family-names: Croucher
    given-names: Nicholas J.
  - family-names: Harris
    given-names: Simon R.
  - family-names: Lees
    given-names: John A.
title: Split K-mer Analysis (version 2)
version: 0.3.11
url: https://github.com/bacpop/ska.rust
date-released: '2024-09-25'
preferred-citation:
  authors:
    - family-names: Derelle
      given-names: Romain
    - family-names: von Wachsmann
      given-names: Johanna
    - family-names: Mäklin
      given-names: Tommi
    - family-names: Hellewell
      given-names: Joel
    - family-names: Russell
      given-names: Timothy
    - family-names: Lalvani
      given-names: Ajit
    - family-names: Chindelevitch
      given-names: Leonid
    - family-names: Croucher
      given-names: Nicholas J.
    - family-names: Harris
      given-names: Simon R.
    - family-names: Lees
      given-names: John A.
  title: Seamless, rapid, and accurate analyses of outbreak genomic data using split k-mer analysis
  doi: 10.1101/gr.279449.124
  url: http://genome.cshlp.org/content/34/10/1661.abstract
  type: article
  pages: 1661-1673
  year: '2024'
  conference: {}
  publisher: {}

GitHub Events

Total
  • Create event: 6
  • Release event: 1
  • Issues event: 10
  • Watch event: 22
  • Delete event: 2
  • Member event: 2
  • Issue comment event: 65
  • Push event: 62
  • Pull request review comment event: 67
  • Pull request review event: 15
  • Pull request event: 13
  • Fork event: 2
Last Year
  • Create event: 6
  • Release event: 1
  • Issues event: 10
  • Watch event: 22
  • Delete event: 2
  • Member event: 2
  • Issue comment event: 65
  • Push event: 62
  • Pull request review comment event: 67
  • Pull request review event: 15
  • Pull request event: 13
  • Fork event: 2

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 234
  • Total Committers: 2
  • Avg Commits per committer: 117.0
  • Development Distribution Score (DDS): 0.013
Past Year
  • Commits: 216
  • Committers: 2
  • Avg Commits per committer: 108.0
  • Development Distribution Score (DDS): 0.014
Top Committers
Name          Email          Commits
John Lees     l****6@g****m  231
Tommi Mäklin  t****i@m****i  3

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 47
  • Total pull requests: 56
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 1 day
  • Total issue authors: 14
  • Total pull request authors: 6
  • Average comments per issue: 3.38
  • Average comments per pull request: 1.55
  • Merged pull requests: 47
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 10
  • Pull requests: 13
  • Average time to close issues: 20 days
  • Average time to close pull requests: 4 days
  • Issue authors: 8
  • Pull request authors: 4
  • Average comments per issue: 3.7
  • Average comments per pull request: 3.85
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • johnlees (16)
  • rderelle (10)
  • danrlu (5)
  • PWSmit (3)
  • jvfe (1)
  • jdaeth274 (1)
  • vesa00 (1)
  • rrwick (1)
  • tmaklin (1)
  • kristyhoran (1)
  • cammo0p (1)
  • kebern (1)
  • maxlcummins (1)
  • taffners (1)
Pull Request Authors
  • johnlees (42)
  • apollis44 (9)
  • jhellewell14 (6)
  • rderelle (4)
  • vrbouza (2)
  • tmaklin (1)
Top Labels
Issue Labels
enhancement (17) bug (3) not planned (2) documentation (2)
Pull Request Labels
bug (2) documentation (2)

Packages

  • Total packages: 1
  • Total downloads:
    • cargo 24,266 total
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 22
  • Total maintainers: 1
crates.io: ska

Split k-mer analysis

  • Versions: 22
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 24,266 Total
Rankings
Dependent repos count: 16.5%
Stargazers count: 19.1%
Forks count: 23.2%
Average: 28.1%
Dependent packages count: 36.1%
Downloads: 45.7%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/ci.yml actions
  • actions/checkout v3 composite
.github/workflows/clippy.yml actions
  • actions-rs/clippy-check v1 composite
  • actions-rs/toolchain v1 composite
  • actions/checkout v1 composite
.github/workflows/codecov.yml actions
  • actions-rs/grcov v0.1.5 composite
  • actions-rs/toolchain v1 composite
  • actions/checkout v3 composite
  • codecov/codecov-action v3.1.0 composite
.github/workflows/release.yml actions
  • actions-rs/toolchain v1 composite
  • actions/checkout v3 composite
  • actions/checkout v2 composite
  • actions/download-artifact v2 composite
  • actions/upload-artifact v3 composite
  • katyo/publish-crates v1 composite
  • softprops/action-gh-release v1 composite
.github/workflows/version.yml actions
  • actions/checkout v3 composite
Cargo.toml cargo
  • assert_fs 1.0.10 development
  • predicates 2.1.5 development
  • pretty_assertions 1.3.0 development
  • snapbox 0.4.3 development
  • ahash 0.8.2
  • ciborium 0.2.0
  • clap 4.0.27
  • hashbrown 0.12
  • indicatif 0.17.2
  • log 0.4.17
  • ndarray 0.15.6
  • needletail 0.4.1
  • noodles-vcf 0.22.0
  • num-traits 0.2.15
  • num_cpus 1.0
  • rayon 1.5.3
  • regex 1.7.0
  • serde 1.0.147
  • simple_logger 4.0.0
  • snap 1.1.0