galah

More scalable dereplication for metagenome assembled genomes

https://github.com/wwood/galah

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
    2 of 5 committers (40.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.1%) to scientific vocabulary

Keywords from Contributors

bioinformatics metagenomics microbiome
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: wwood
  • License: gpl-3.0
  • Language: Rust
  • Default Branch: main
  • Homepage:
  • Size: 13 MB
Statistics
  • Stars: 65
  • Watchers: 3
  • Forks: 13
  • Open Issues: 16
  • Releases: 7
Created about 6 years ago · Last pushed 6 months ago
Metadata Files
Readme Citation

README.md

Galah

Galah aims to be a more scalable metagenome assembled genome (MAG) dereplication method. That is, it clusters microbial genomes together based on their average nucleotide identity (ANI), and chooses a single member of each cluster as the representative.

Galah uses a greedy clustering approach to speed up genome dereplication relative to tools such as dRep, particularly when there are many closely related genomes (i.e. >95% ANI). The resulting cluster representatives have two properties. For example, if the ANI threshold is set to 99%, then:

  1. Each representative is <99% ANI to each other representative.
  2. All members are >=99% ANI to the representative.

If CheckM genome qualities were specified, then the clusters have an additional property:

  1. Each representative genome has a better quality score than the other members of its cluster. Each genome is assigned a quality score using the formula completeness - 5*contamination - 5*num_contigs/100 - 5*num_ambiguous_bases/100000, which is reduced from a quality formula described in Parks et al. 2020 https://doi.org/10.1038/s41587-020-0501-8.
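
The quality formula above can be written out as a small function. This is a minimal sketch, not Galah's own code: the `GenomeStats` struct and its field names are hypothetical, and it assumes completeness and contamination are given in percent, as CheckM reports them.

```rust
/// Inputs to the quality formula. The struct and field names are
/// illustrative only and are not taken from Galah's source.
#[derive(Debug, Clone, PartialEq)]
pub struct GenomeStats {
    pub completeness: f64,        // assumed to be in percent (0-100)
    pub contamination: f64,       // assumed to be in percent
    pub num_contigs: u64,
    pub num_ambiguous_bases: u64,
}

/// completeness - 5*contamination - 5*num_contigs/100
///              - 5*num_ambiguous_bases/100000
pub fn quality_score(g: &GenomeStats) -> f64 {
    g.completeness
        - 5.0 * g.contamination
        - 5.0 * g.num_contigs as f64 / 100.0
        - 5.0 * g.num_ambiguous_bases as f64 / 100_000.0
}
```

Under these assumptions, a genome with 95% completeness, 1% contamination, 50 contigs and no ambiguous bases scores 95 - 5 - 2.5 - 0 = 87.5.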

If CheckM qualities were not provided, then the following holds instead:

  1. Each representative genome was specified to galah before other members of the cluster.

The overall greedy clustering approach was largely inspired by the work of Donovan Parks, as described in Parks et al. 2020. It operates in two steps. In the first step, genomes are assigned as representatives if no genomes of higher quality are >99% ANI to them. In the second step, each non-representative genome is assigned to the representative genome with which it has the highest ANI.
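
The two steps described above can be sketched as follows. This is an illustration under stated assumptions rather than Galah's implementation: genomes are assumed to be indexed in best-first order (by quality score, or by input order when no qualities are given), `ani` is a hypothetical caller-supplied function returning percent ANI, and each genome is compared only against the representatives chosen so far, which is one reading of the description that keeps every member within the threshold of its representative.

```rust
use std::cmp::Ordering;

/// Greedy clustering sketch. Genomes 0..num_genomes are assumed to be
/// sorted best-first. Returns, for each genome, the index of its
/// representative (representatives map to themselves).
pub fn greedy_cluster<F>(num_genomes: usize, threshold: f64, ani: F) -> Vec<usize>
where
    F: Fn(usize, usize) -> f64, // hypothetical percent-ANI lookup
{
    // Step 1: a genome becomes a representative unless an earlier
    // (higher-ranked) representative is within the ANI threshold.
    let mut reps: Vec<usize> = Vec::new();
    for g in 0..num_genomes {
        if !reps.iter().any(|&r| ani(r, g) >= threshold) {
            reps.push(g);
        }
    }

    // Step 2: each remaining genome joins the representative with
    // which it has the highest ANI.
    (0..num_genomes)
        .map(|g| {
            if reps.contains(&g) {
                return g;
            }
            reps.iter()
                .copied()
                .max_by(|&a, &b| {
                    ani(a, g)
                        .partial_cmp(&ani(b, g))
                        .unwrap_or(Ordering::Equal)
                })
                .expect("genome 0 is always a representative")
        })
        .collect()
}
```

The sketch calls `ani` for every genome and representative pair; it leaves out preclustering and any caching of ANI values.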

Installation

Install through the bioconda package

Galah can be installed through the bioconda conda channel. After initial setup of conda and the bioconda channel, it can be installed with mamba (or conda) with:

mamba install galah

One can see details of the galah recipe.

Galah can also be used indirectly through CoverM via its cluster subcommand, which is also available on bioconda.

Pre-compiled binary

Galah can be installed by downloading statically compiled binaries, available on the releases page.

Third party dependencies listed below are required for this method.

Compiling from source

Galah can also be installed from source, using the cargo build system after installing Rust.

cargo install galah

Third party dependencies listed below are required for this method.

Development

To run an unreleased version of Galah, after installing Rust:

git clone https://github.com/wwood/galah
cd galah
pixi run cargo run -- cluster ...etc...

Third party dependencies listed below are required for this method.

Dependencies

For some advanced usage of Galah, third party tools are required, which must be installed separately:

  • skani v0.2.2 https://github.com/bluenote-1577/skani
  • FastANI v1.34 https://github.com/ParBLiSS/FastANI

Usage

For clustering a set of genomes at 99% ANI:

galah cluster --genome-fasta-files /path/to/genome1.fna /path/to/genome2.fna --output-cluster-definition clusters.tsv

There are several other options for specifying genomes, ANI cutoffs, etc.

For clustering a set of contigs at 99% ANI:

galah cluster --cluster-contigs --small-genomes --genome-fasta-files /path/to/contigs.fna --output-cluster-definition clusters.tsv

The full usage is described on the manual page, which can be accessed on the command line by running galah cluster --full-help.

Precluster ANI

Similar to dRep, Galah operates in two stages. In the first, a fast pre-clustering distance (finch or skani) is calculated between each pair of genomes. In the second, a genome pair is only considered as potentially belonging to the same cluster, using skani or FastANI, if its precluster ANI is greater than the specified value. By default, the precluster ANI threshold is set at 95% and the final ANI threshold at 99%.
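
The filtering role of the precluster stage can be sketched as below. The function name, the constants and the tuple layout are hypothetical and chosen for illustration; the strict `>` comparison follows the wording above, and whether Galah treats the boundary as inclusive is not stated here.

```rust
/// Default thresholds as described above (percent ANI).
pub const DEFAULT_PRECLUSTER_ANI: f64 = 95.0;
pub const DEFAULT_ANI: f64 = 99.0;

/// Given fast precluster estimates as (genome_a, genome_b, estimated_ani),
/// keep only the pairs worth passing to the slower, more accurate
/// ANI calculation in the second stage.
pub fn candidate_pairs(
    estimates: &[(usize, usize, f64)],
    precluster_ani: f64,
) -> Vec<(usize, usize)> {
    estimates
        .iter()
        .filter(|&&(_, _, est)| est > precluster_ani)
        .map(|&(a, b, _)| (a, b))
        .collect()
}
```

Only the pairs that survive this filter incur the cost of the second-stage ANI calculation, which is what keeps the number of expensive comparisons down when most genome pairs are distantly related.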

License

Galah is made available under GPL3+. See LICENSE.txt for details. Copyright Ben Woodcroft.

Developed by Ben Woodcroft at the Centre for Microbiome Research, Queensland University of Technology.

Owner

  • Name: Ben J Woodcroft
  • Login: wwood
  • Kind: user
  • Location: Brisbane, University of Queensland
  • Company: Queensland University of Technology

Informatics team leader at the Centre for Microbiome Research (CMR)

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Aroney
    given-names: Samuel T. N.
    orcid: https://orcid.org/0000-0001-9806-5846
  - family-names: Camargo
    given-names: Antonio P.
    orcid: https://orcid.org/0000-0003-3913-2484
  - family-names: Tyson
    given-names: Gene W.
    orcid: https://orcid.org/0000-0001-8559-9427
  - family-names: Woodcroft
    given-names: Ben J.
    orcid: https://orcid.org/0000-0003-0670-7480
title: "Galah: More scalable dereplication for metagenome assembled genomes"
version: 0.4.2
doi: 10.5281/zenodo.10526086
date-released: 2024-09-03

GitHub Events

Total
  • Issues event: 6
  • Watch event: 11
  • Delete event: 10
  • Issue comment event: 17
  • Push event: 23
  • Pull request event: 16
  • Fork event: 1
  • Create event: 10
Last Year
  • Issues event: 6
  • Watch event: 11
  • Delete event: 10
  • Issue comment event: 17
  • Push event: 23
  • Pull request event: 16
  • Fork event: 1
  • Create event: 10

Committers

Last synced: 12 months ago

All Time
  • Total Commits: 164
  • Total Committers: 5
  • Avg Commits per committer: 32.8
  • Development Distribution Score (DDS): 0.354
Past Year
  • Commits: 8
  • Committers: 2
  • Avg Commits per committer: 4.0
  • Development Distribution Score (DDS): 0.375
Top Committers
Name Email Commits
Ben Woodcroft d****n@g****m 106
AroneyS s****y@q****u 35
Ben J. Woodcroft d****m 20
Antônio Pedro Camargo a****o 2
Rhys Newell r****l@h****u 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 27
  • Total pull requests: 34
  • Average time to close issues: 18 days
  • Average time to close pull requests: 3 months
  • Total issue authors: 14
  • Total pull request authors: 6
  • Average comments per issue: 1.85
  • Average comments per pull request: 1.41
  • Merged pull requests: 18
  • Bot issues: 0
  • Bot pull requests: 9
Past Year
  • Issues: 4
  • Pull requests: 12
  • Average time to close issues: 5 days
  • Average time to close pull requests: about 2 months
  • Issue authors: 3
  • Pull request authors: 3
  • Average comments per issue: 1.5
  • Average comments per pull request: 0.5
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 1
Top Authors
Issue Authors
  • jianshu93 (5)
  • apcamargo (5)
  • SilasK (4)
  • aistBMRG (2)
  • Rridley7 (1)
  • MrOlm (1)
  • xvazquezc (1)
  • AroneyS (1)
  • wwood (1)
  • Sanrrone (1)
  • prototaxites (1)
  • ShailNair (1)
  • alienzj (1)
Pull Request Authors
  • AroneyS (23)
  • dependabot[bot] (11)
  • apcamargo (3)
  • solc42 (2)
  • wwood (1)
  • rhysnewell (1)
Top Labels
Issue Labels
Pull Request Labels
dependencies (11)

Packages

  • Total packages: 1
  • Total downloads:
    • cargo 13,565 total
  • Total dependent packages: 1
  • Total dependent repositories: 2
  • Total versions: 7
  • Total maintainers: 2
crates.io: galah

Microbial genome dereplicator

  • Versions: 7
  • Dependent Packages: 1
  • Dependent Repositories: 2
  • Downloads: 13,565 Total
Rankings
Dependent repos count: 13.2%
Forks count: 15.6%
Average: 18.1%
Dependent packages count: 18.2%
Stargazers count: 18.9%
Downloads: 24.7%
Maintainers (2)
Last synced: 6 months ago

Dependencies

Cargo.toml cargo
  • assert_cli 0.6.* development
  • ansi_term 0.12
  • bird_tool_utils-man 0.4.0
  • checkm 0.1.*
  • clap 3.*
  • csv 1.1
  • env_logger 0.9.*
  • finch 0.3.*
  • lazy_static 1.4.0
  • log 0.4.*
  • needletail 0.4.*
  • partitions 0.2.*
  • rayon 1.5
  • tempfile 3.*
.github/workflows/test-galah.yml actions
  • actions/checkout v2 composite
  • conda-incubator/setup-miniconda v2 composite