diverse-seq

diverse-seq: an application for alignment-free selecting and clustering biological sequences - Published in JOSS (2025)

https://github.com/huttleylab/diverseseq

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: biorxiv.org, joss.theoj.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

bioinformatics phylogenetics
Last synced: 4 months ago

Repository

Tools for analysis of sequence divergence

Basic Info
Statistics
  • Stars: 6
  • Watchers: 3
  • Forks: 4
  • Open Issues: 4
  • Releases: 19
Topics
bioinformatics phylogenetics
Created over 3 years ago · Last pushed 4 months ago
Metadata Files
Readme License

README.md

diverse-seq provides alignment-free algorithms to facilitate phylogenetic workflows

diverse-seq implements computationally efficient alignment-free algorithms that enable rapid prototyping of phylogenetic workflows. It can accelerate parameter selection searches for sequence alignment and phylogeny estimation by identifying a subset of sequences that is representative of the diversity in a collection. We show that selecting representative sequences with an entropy measure of k-mer frequencies corresponds well to sampling via conventional genetic distances. Computation time scales linearly with the number of sequences, and the computation can be run in parallel. Applied to a collection of 10.5k whole microbial genomes on a laptop, diverse-seq took ~8 minutes to prepare the data and 4 minutes to select 100 representatives. diverse-seq can also boost the performance of phylogenetic estimation by providing a seed phylogeny that can then be refined by a more sophisticated algorithm. For ~1k whole microbial genomes on a laptop, it takes ~1.8 minutes to estimate a bifurcating tree from mash distances.
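To make the "entropy measure of k-mer frequencies" concrete, here is a minimal sketch of the Jensen-Shannon divergence (JSD) computed over the k-mer frequency distributions of a few sequences. It is an illustration only and is not the diverse-seq implementation: the function names, the default `k`, the equal weighting of sequences and the use of base-2 logarithms are all assumptions made for this example.

```python
from collections import Counter
from math import log2


def kmer_freqs(seq: str, k: int) -> dict[str, float]:
    """Relative frequencies of all overlapping k-mers in seq (hypothetical helper)."""
    counts = Counter(seq[i : i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than k")
    return {kmer: n / total for kmer, n in counts.items()}


def entropy(freqs: dict[str, float]) -> float:
    """Shannon entropy in bits of a frequency distribution."""
    return -sum(p * log2(p) for p in freqs.values() if p > 0)


def jsd(seqs: list[str], k: int = 2) -> float:
    """Equal-weight JSD: entropy of the mean distribution minus the mean entropy."""
    dists = [kmer_freqs(s, k) for s in seqs]
    mean: Counter = Counter()
    for dist in dists:
        for kmer, p in dist.items():
            mean[kmer] += p / len(dists)
    return entropy(mean) - sum(entropy(d) for d in dists) / len(dists)


if __name__ == "__main__":
    # Identical sequences share one k-mer distribution, so JSD is ~0.
    print(jsd(["ACGTACGT", "ACGTACGT"]))
    # Sequences with no shared k-mers reach the two-sequence maximum of 1 bit.
    print(jsd(["AAAA", "CCCC"]))
```

In this sketch a higher JSD means the set of sequences covers more distinct k-mer content, which is the intuition behind choosing a subset that maximises it.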

You can read more about the methods implemented in diverse-seq in the preprint here.

The user documentation is here.

Installation

We recommend installing diverse-seq from PyPI as follows

pip install "diverse-seq[extra]"

for the full jupyter experience.

For command line only usage, install as follows

pip install diverse-seq

NOTE If you experience any errors during installation, we recommend using uv pip. This command provides much better error messages than the standard pip command. If you cannot resolve the installation problem, please open an issue on the GitHub repository.

Using uv

uv also provides a simple way to install dvs as a command-line-only tool:

uv tool install diverse-seq

Usage in this case is then

uvx --from diverse-seq dvs
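Whichever install route is used, one way to confirm from Python that the `dvs` executable is reachable is to look it up on `PATH` and ask for its version. This is a small sketch using only the standard library; `dvs_version` is a hypothetical helper name, and the only `dvs` option it relies on is `--version`, which appears in the command's help output.

```python
import shutil
import subprocess
from typing import Optional


def dvs_version() -> Optional[str]:
    """Return the output of `dvs --version`, or None if dvs is not on PATH."""
    exe = shutil.which("dvs")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--version"],
        capture_output=True,
        text=True,
        check=False,  # do not raise on a non-zero exit; just report what we got
    )
    return result.stdout.strip() or None


if __name__ == "__main__":
    version = dvs_version()
    print(version if version else "dvs was not found on PATH")
```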

Dependencies

For a full listing of dependencies, see the pyproject.toml file.

The command line interface

dvs is the command line interface for diverse-seq.

The `dvs` subcommands

```
Usage: dvs [OPTIONS] COMMAND [ARGS]...

  dvs -- alignment free detection of the most diverse sequences using JSD

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  demo-data  Export a demo sequence file
  prep       Writes processed sequences to a .dvseqs.
  max        Identify the seqs that maximise average delta JSD
  nmost      Identify n seqs that maximise average delta JSD
  ctree      Quickly compute a cluster tree based on kmers for a collection...
```
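The `ctree` command builds a cluster tree from k-mer based comparisons, and the summary earlier in this README mentions mash distances. As a rough illustration of that kind of distance, the sketch below applies the published Mash formula, D = -(1/k) ln(2j / (1 + j)), to the Jaccard index j of two k-mer sets. It is not the dvs implementation: it uses the full k-mer sets instead of MinHash sketches, and the function names, the default `k` and the capping of the result at 1 are assumptions made for this example.

```python
from math import log


def kmer_set(seq: str, k: int) -> set[str]:
    """All distinct overlapping k-mers in seq (hypothetical helper)."""
    return {seq[i : i + k] for i in range(len(seq) - k + 1)}


def mash_distance(a: str, b: str, k: int = 4) -> float:
    """Mash-style distance from the exact Jaccard index of two k-mer sets."""
    kmers_a, kmers_b = kmer_set(a, k), kmer_set(b, k)
    union = kmers_a | kmers_b
    if not union:
        raise ValueError("both sequences are shorter than k")
    jaccard = len(kmers_a & kmers_b) / len(union)
    if jaccard == 0:
        # No shared k-mers: treat the sequences as maximally distant.
        return 1.0
    distance = -log(2 * jaccard / (1 + jaccard)) / k
    # Clamp to [0, 1]; this bound is a choice made for the sketch.
    return min(1.0, max(0.0, distance))


if __name__ == "__main__":
    print(mash_distance("ACGTACGTTT", "ACGTACGTAA"))
```

A matrix of such pairwise distances could then be passed to any distance-based tree-building method to obtain a quick starting tree.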

The Python API

We make comparable capabilities available as cogent3 apps. The main difference is that the app instances operate directly on, and return, cogent3 sequence collections. See the docs for demonstrations of how to use the apps.

Project Information

diverse-seq is released under the BSD-3 license. If you want to contribute to the diverse-seq project (and we hope you do! 😇), the code of conduct and other useful developer information are available on the wiki.

Owner

  • Name: HuttleyLab
  • Login: HuttleyLab
  • Kind: organization

JOSS Publication

diverse-seq: an application for alignment-free selecting and clustering biological sequences
Published
June 07, 2025
Volume 10, Issue 110, Page 7765
Authors
Gavin Huttley ORCID
Research School of Biology, Australian National University, Australia
Katherine Caley ORCID
Research School of Biology, Australian National University, Australia
Robert McArthur ORCID
Research School of Biology, Australian National University, Australia
Editor
Frederick Boehm ORCID
Tags
genomics statistics machine learning bioinformatics molecular evolution phylogenetics

GitHub Events

Total
  • Create event: 51
  • Issues event: 18
  • Release event: 14
  • Watch event: 3
  • Delete event: 40
  • Issue comment event: 199
  • Push event: 101
  • Gollum event: 23
  • Pull request review comment event: 43
  • Pull request review event: 97
  • Pull request event: 190
  • Fork event: 2
Last Year
  • Create event: 51
  • Issues event: 18
  • Release event: 14
  • Watch event: 3
  • Delete event: 40
  • Issue comment event: 199
  • Push event: 101
  • Gollum event: 23
  • Pull request review comment event: 43
  • Pull request review event: 97
  • Pull request event: 190
  • Fork event: 2

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 12
  • Total pull requests: 129
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 1 day
  • Total issue authors: 3
  • Total pull request authors: 4
  • Average comments per issue: 1.0
  • Average comments per pull request: 1.23
  • Merged pull requests: 106
  • Bot issues: 0
  • Bot pull requests: 48
Past Year
  • Issues: 10
  • Pull requests: 116
  • Average time to close issues: 6 days
  • Average time to close pull requests: 1 day
  • Issue authors: 3
  • Pull request authors: 4
  • Average comments per issue: 1.0
  • Average comments per pull request: 1.35
  • Merged pull requests: 96
  • Bot issues: 0
  • Bot pull requests: 47
Top Authors
Issue Authors
  • iimog (6)
  • GavinHuttley (4)
  • xin-huang (2)
Pull Request Authors
  • GavinHuttley (73)
  • dependabot[bot] (48)
  • rmcar17 (7)
  • iimog (1)
Top Labels
Issue Labels
in-progress (3) bug (2)
Pull Request Labels
dependencies (48) python (42) github_actions (6)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 178 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 16
  • Total maintainers: 1
pypi.org: diverse-seq

diverse_seq: a tool for sampling diverse biological sequences

  • Versions: 16
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 178 Last month
Rankings
Dependent packages count: 10.4%
Average: 34.6%
Dependent repos count: 58.7%
Maintainers (1)
Last synced: 4 months ago

Dependencies

.github/workflows/ci.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • coverallsapp/github-action v2 composite
.github/workflows/codeql.yml actions
  • actions/checkout v4 composite
  • github/codeql-action/analyze v3 composite
  • github/codeql-action/autobuild v3 composite
  • github/codeql-action/init v3 composite
.github/workflows/linters.yml actions
  • EndBug/add-and-commit v9 composite
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
.github/workflows/release.yml actions
  • actions/checkout v4 composite
  • actions/download-artifact v4 composite
  • actions/setup-python v5 composite
  • actions/upload-artifact v4 composite
  • pypa/gh-action-pypi-publish release/v1 composite
pyproject.toml pypi
  • attrs *
  • click *
  • cogent3 *
  • h5py *
  • hdf5plugin *
  • numpy >=2.0
  • rich *
  • scitrack *