pyfastani

Cython bindings and Python interface to FastANI, a method for fast whole-genome similarity estimation.

https://github.com/althonos/pyfastani

Science Score: 77.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: found CITATION.cff file
✓ codemeta.json file: found codemeta.json file
✓ .zenodo.json file: found .zenodo.json file
✓ DOI references: found 6 DOI reference(s) in README
✓ Academic publication links: links to biorxiv.org, nature.com
✓ Committers with academic emails: 2 of 2 committers (100.0%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (10.5%) to scientific vocabulary

Keywords

ani average-nucleotide-identity bioinformatics cython-library metagenomes python-bindings python-library taxonomy

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: althonos
License: mit
Language: Cython
Default Branch: main
Homepage:
Size: 430 KB

Statistics

Stars: 23
Watchers: 3
Forks: 2
Open Issues: 0
Releases: 13

Created over 4 years ago · Last pushed 6 months ago

Metadata Files

Readme Changelog Contributing License Citation

🐍⏩🧬 PyFastANI

Cython bindings and Python interface to FastANI, a method for fast whole-genome similarity estimation. **Now with multithreading!**

🗺️ Overview

FastANI is a method published in 2018 by Chirag Jain et al.[1] for high-throughput computation of whole-genome Average Nucleotide Identity (ANI). It uses MashMap to compute orthologous mappings without the need for expensive alignments.

pyfastani is a Python module, implemented using the Cython language, that provides bindings to FastANI. It directly interacts with the FastANI internals, which has the following advantages over CLI wrappers:

- simpler compilation: FastANI requires several additional libraries, which makes compiling the original binary non-trivial. In PyFastANI, the libraries needed for threading or I/O are provided as stubs, and the Boost::math headers are vendored, so you can build the package without hassle. Or even better, just install from one of the provided wheels!
- single dependency: If your software or your analysis pipeline is distributed as a Python package, you can add pyfastani as a dependency of your project and stop worrying about the FastANI binary being present on the end-user machine.
- sans I/O: Everything happens in memory, in Python objects you control, making it easier to pass your sequences to FastANI without writing them to a temporary file.
- multi-threading: Genome queries run the fragment mapping step in parallel, leading to shorter query times even with a single genome.
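
To illustrate the sans-I/O point, here is a minimal sketch of turning FASTA text that is already in memory into `bytes` contigs, without any temporary file. The `iter_fasta` helper and the toy sequences are hypothetical and not part of PyFastANI; the resulting contigs are the kind of objects the examples further down pass to the sketch.

```python
import io


def iter_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA text handle.

    Hypothetical helper, not part of PyFastANI. Sequences are returned
    as bytes, multi-line records are joined, blank lines are skipped.
    """
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks).encode("ascii")
            # keep only the identifier (first word of the header line)
            header = line[1:].split(None, 1)
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks).encode("ascii")


# toy data that never touches the filesystem
fasta_text = ">contig1 example\nACGTACGT\nTTGA\n>contig2\nGGCC\n"
records = list(iter_fasta(io.StringIO(fasta_text)))
contigs = [sequence for _, sequence in records]
```

Any source of bytes-like sequences works the same way, for instance records downloaded over the network or produced by another Python library.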

This library is still a work-in-progress, and in an experimental stage, but it should already pack enough features to be used in a standard pipeline.

🔧 Installing

PyFastANI can be installed directly from PyPI, which hosts pre-built CPython wheels for x86-64 Unix platforms as well as the code required to compile from source with Cython:

`$ pip install pyfastani`

In the event you have to compile the package from source, all the required libraries are vendored in the source distribution, so you'll only need a C/C++ compiler.

Otherwise, PyFastANI is also available as a Bioconda package:

`$ conda install -c bioconda pyfastani`

💡 Example

The following snippets show how to compute the ANI between two genomes, with the reference being a draft genome. For one-to-many or many-to-many searches, simply add additional references with sketch.add_draft before indexing. Note that any name can be given to the reference sequences; it only affects the name attribute of the hits returned for a query.

🔬 Biopython

Biopython does not give us direct access to the sequence, so we need to convert it to bytes first with the bytes builtin function. For older versions of Biopython (earlier than 1.79), use record.seq.encode() instead of bytes(record.seq).

```python
import pyfastani
import Bio.SeqIO

sketch = pyfastani.Sketch()

# add a single draft genome to the sketch
ref = list(Bio.SeqIO.parse("vendor/FastANI/data/Shigella_flexneri_2a_01.fna", "fasta"))
sketch.add_draft("S. flexneri", (bytes(record.seq) for record in ref))

# index the sketch and get a mapper
mapper = sketch.index()

# read the query and query the mapper
query = Bio.SeqIO.read("vendor/FastANI/data/Escherichia_coli_str_K12_MG1655.fna", "fasta")
hits = mapper.query_genome(bytes(query.seq))

for hit in hits:
    print("E. coli K12 MG1655", hit.name, hit.identity, hit.matches, hit.fragments)
```
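
The attributes printed for each hit can be condensed into a small summary. The sketch below assumes that `identity` is a percentage, that `matches` is the number of mapped fragments and that `fragments` is the total number of query fragments, mirroring the last three columns of the FastANI output. The `summarize` helper, the made-up numbers and the 95% default cutoff (the species boundary reported in the FastANI paper) are illustrative choices, not part of PyFastANI.

```python
from collections import namedtuple

# Hypothetical container for the derived values
Summary = namedtuple("Summary", ["name", "identity", "aligned_fraction", "same_species"])


def summarize(name, identity, matches, fragments, cutoff=95.0):
    """Derive the aligned fraction and a species call from hit attributes.

    Assumes `identity` is expressed in percent; `cutoff` is an
    illustrative threshold that should be tuned to the analysis.
    """
    aligned = matches / fragments if fragments else 0.0
    return Summary(name, identity, aligned, identity >= cutoff)


# made-up numbers, only to show the shape of the result
example = summarize("S. flexneri", 97.5, 1300, 1500)
```

With real hits, this would be called as `summarize(hit.name, hit.identity, hit.matches, hit.fragments)` inside the loop over `hits`.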

🧪 Scikit-bio

Scikit-bio gives us direct access to the sequence as a numpy array, but shows the values as byte strings by default. To make them readable as char (for compatibility with the C code), they must be cast with seq.values.view('B').

```python
import pyfastani
import skbio.io

sketch = pyfastani.Sketch()

# add a single draft genome to the sketch
ref = list(skbio.io.read("vendor/FastANI/data/Shigella_flexneri_2a_01.fna", "fasta"))
sketch.add_draft("Shigella_flexneri_2a_01", (seq.values.view('B') for seq in ref))

# index the sketch and get a mapper
mapper = sketch.index()

# read the query and query the mapper
query = next(skbio.io.read("vendor/FastANI/data/Escherichia_coli_str_K12_MG1655.fna", "fasta"))
hits = mapper.query_genome(query.values.view('B'))

for hit in hits:
    print("E. coli K12 MG1655", hit.name, hit.identity, hit.matches, hit.fragments)
```
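
For a one-to-many search, the same pattern extends to several references. The sketch below is written against any object exposing `add_draft(name, contigs)` and `index()`, which is how the sketch object is used in the examples above, and against hits exposing an `identity` attribute. The `index_references` and `best_hit` helpers and the `references` mapping are hypothetical names introduced for illustration.

```python
def index_references(sketch, references):
    """Add every reference to the sketch, then index it.

    `references` maps a reference name to an iterable of contigs
    (bytes-like sequences). Returns whatever `sketch.index()` returns,
    which for a PyFastANI sketch is the mapper used for querying.
    """
    for name, contigs in references.items():
        sketch.add_draft(name, contigs)
    return sketch.index()


def best_hit(hits):
    """Return the hit with the highest identity, or None if there is none."""
    return max(hits, key=lambda hit: hit.identity, default=None)
```

With PyFastANI installed, this could be used as `mapper = index_references(pyfastani.Sketch(), references)`, followed by `best_hit(mapper.query_draft(query_contigs))` for a draft query.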

⏱️ Benchmarks

In the original FastANI tool, multi-threading was only used to improve the performance of many-to-many searches: each thread would have a chunk of the reference genomes, and querying would be done in parallel for each reference. However, with a small set of reference genomes, there may not be enough work for all the threads, so this approach cannot scale to a large number of threads. In addition, it causes the same query genome to be hashed several times, which is not optimal. In pyfastani, multi-threading is used to compute the hashes and mapping of query genome fragments. This allows parallelism to be useful even when only a few reference genomes are available.

The benchmarks below show the time for querying a single genome (with Mapper.query_draft) using a variable number of threads. Benchmarks were run on an i7-8550U CPU running at 1.80 GHz with 4 physical / 8 logical cores, using 50 bacterial genomes from the proGenomes database. For clarity, only 5 randomly selected genomes are shown on the second graph. Each run was repeated 3 times.

[Figure: Benchmarks]
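
A comparable measurement can be sketched with a small timing helper. It assumes that the query method accepts a `threads` keyword argument (check the API reference of your installed version before relying on this). `time_query` is a hypothetical helper and not the script used to produce the benchmarks above.

```python
import time


def time_query(query, contigs, thread_counts, repeats=3):
    """Return {threads: best wall-clock time in seconds}.

    `query` is any callable accepting `(contigs, threads=...)`.
    `contigs` should be a list rather than a generator, because it is
    consumed once per run.
    """
    timings = {}
    for threads in thread_counts:
        runs = []
        for _ in range(repeats):
            start = time.perf_counter()
            query(contigs, threads=threads)
            runs.append(time.perf_counter() - start)
        # keep the fastest run to reduce noise from other processes
        timings[threads] = min(runs)
    return timings
```

Under that assumption, a call such as `time_query(mapper.query_draft, contigs, [1, 2, 4, 8])` would give one timing per thread count.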

🔖 Citation

If you found PyFastANI useful, please cite our paper, as well as the original FastANI paper.

To cite PyFastANI:

Martin Larralde, Georg Zeller, Laura M. Carroll. 2025. PyOrthoANI, PyFastANI, and Pyskani: a suite of Python libraries for computation of average nucleotide identity. NAR Genomics and Bioinformatics 7(3):lqaf095. doi:10.1093/nargab/lqaf095.

To cite FastANI:

Chirag Jain, Luis M Rodriguez-R, Adam M Phillippy, Konstantinos T Konstantinidis, Srinivas Aluru. 2018. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nature Communications 9(1):5114. doi:10.1038/s41467-018-07641-9.

🔎 See Also

Computing ANI for metagenomic sequences? You may be interested in pyskani, a Python package for computing ANI using the skani method developed by Jim Shaw and Yun William Yu.

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

🏗️ Contributing

Contributions are more than welcome! See CONTRIBUTING.md for more details.

⚖️ License

This library is provided under the MIT License.

The FastANI code was written by Chirag Jain and is distributed under the terms of the Apache License 2.0, unless otherwise specified in vendored sources. See vendor/FastANI/LICENSE for more information. The cpu_features code was written by Guillaume Chatelet and is distributed under the terms of the Apache License 2.0. See vendor/cpu_features/LICENSE for more information. The Boost::math headers were written by Boost Libraries contributors and are distributed under the terms of the Boost Software License. See vendor/boost-math/LICENSE for more information.

This project is in no way affiliated with, sponsored, or otherwise endorsed by the original FastANI authors. It was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.

📚 References

[1] Jain, C., Rodriguez-R, L.M., Phillippy, A.M. et al. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun 9, 5114 (2018). doi:10.1038/s41467-018-07641-9.

Owner

Name: Martin Larralde
Login: althonos
Kind: user
Location: Heidelberg, Germany
Company: EMBL / LUMC, @zellerlab

Twitter: althonos
Repositories: 91
Profile: https://github.com/althonos

PhD candidate in Bioinformatics, passionate about programming, SIMD-enthusiast, Pythonista, Rustacean. I write poems, and sometimes they are executable.

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: PyFastANI
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Martin
    family-names: Larralde
    email: martin.larralde@embl.de
    affiliation: Leiden University Medical Center
    orcid: 'https://orcid.org/0000-0002-3947-4444'
  - given-names: Georg
    family-names: Zeller
    affiliation: Leiden University Medical Center
    orcid: 'https://orcid.org/0000-0003-1429-7485'
  - given-names: Laura
    name-particle: M.
    family-names: Carroll
    affiliation: Umeå University
    orcid: 'https://orcid.org/0000-0002-3677-0192'
identifiers:
  - type: doi
    value: 10.1101/2025.02.13.638148
    description: bioRxiv preprint
  - type: doi
    value: 10.1093/nargab/lqaf095
    description: NAR Genomics & Bioinformatics paper
repository-code: 'https://github.com/althonos/pyfastani'
abstract: >-
  The average nucleotide identity (ANI) metric has become
  the gold standard for prokaryotic species delineation in
  the genomics era. The most popular ANI algorithms are
  available as command-line tools and/or web applications,
  making it inconvenient or impossible to incorporate them
  into bioinformatic workflows, which utilize the popular
  Python programming language. Here, we present PyOrthoANI,
  PyFastANI, and Pyskani, Python libraries for three popular
  ANI computation methods. ANI values produced by
  PyOrthoANI, PyFastANI, and Pyskani are virtually identical
  to those produced by OrthoANI, FastANI, and skani,
  respectively. All three libraries integrate seamlessly
  with BioPython, making it easy and convenient to use,
  compare, and benchmark popular ANI algorithms within
  Python-based workflows.
keywords:
  - python
  - library
  - average nucleotide identity
  - ANI
license: MIT
preferred-citation:
  type: article
  authors:
  - given-names: Martin
    family-names: Larralde
    email: martin.larralde@embl.de
    affiliation: Leiden University Medical Center
    orcid: 'https://orcid.org/0000-0002-3947-4444'
  - given-names: Georg
    family-names: Zeller
    affiliation: Leiden University Medical Center
    orcid: 'https://orcid.org/0000-0003-1429-7485'
  - given-names: Laura
    name-particle: M.
    family-names: Carroll
    affiliation: Umeå University
    orcid: 'https://orcid.org/0000-0002-3677-0192'
  doi: "10.1093/nargab/lqaf095"
  journal: "NAR Genomics and Bioinformatics"
  volume: 7
  issue: 3
  title: "PyOrthoANI, PyFastANI, and Pyskani: a suite of Python libraries for computation of average nucleotide identity"
  year: 2025
  month: 9

GitHub Events

Total

Release event: 2
Watch event: 4
Delete event: 1
Push event: 35
Pull request event: 1
Fork event: 1
Create event: 3

Last Year

Release event: 2
Watch event: 4
Delete event: 1
Push event: 35
Pull request event: 1
Fork event: 1
Create event: 3

Committers

Last synced: 7 months ago

All Time

Total Commits: 235
Total Committers: 2
Avg Commits per committer: 117.5
Development Distribution Score (DDS): 0.004

Past Year

Commits: 43
Committers: 2
Avg Commits per committer: 21.5
Development Distribution Score (DDS): 0.023

Top Committers

Name	Email	Commits
Martin Larralde	m**e@e**e	234
Laura Carroll	l**7@c**u	1

Committer Domains (Top 20 + Academic)

cornell.edu: 1
embl.de: 1

Issues and Pull Requests

Last synced: 7 months ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Packages

Total packages: 1
Total downloads:
- pypi 646 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 12
Total maintainers: 2

pypi.org: pyfastani

Documentation: https://pyfastani.readthedocs.io/en/stable/
License: MIT License Copyright (c) 2021-2025 Martin Larralde <martin.larralde@embl.de> Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Latest release: 0.6.1
published about 1 year ago

Versions: 12
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 646 Last month

Rankings

Dependent packages count: 7.3%

Downloads: 10.9%

Stargazers count: 14.9%

Average: 17.0%

Dependent repos count: 22.1%

Forks count: 30.0%

Maintainers (2)

althonos lmc297

Last synced: 6 months ago

Dependencies

.github/workflows/requirements.txt pypi

auditwheel *
codecov *
coverage *
cython *
setuptools >=46.4.0
wheel *

benches/mapping/requirements.txt pypi

biopython *
matplotlib *
numpy *
palettable *
pandas *
rich *
scipy *

docs/requirements.txt pypi

cython *
ipykernel *
ipython *
nbsphinx *
pygments *
pygments-style-monokailight *
recommonmark *
semantic_version *
setuptools >=46.4
sphinx *

.github/workflows/package.yml actions

KSXGitHub/github-actions-deploy-aur v2.2.5 composite
actions/checkout v2 composite
actions/checkout v1 composite
actions/download-artifact v2 composite
actions/setup-python v2 composite
actions/upload-artifact v2 composite
addnab/docker-run-action v2 composite
docker/setup-qemu-action v1 composite
pypa/gh-action-pypi-publish master composite
rasmus-saks/release-a-changelog-action v1.0.1 composite

.github/workflows/test.yml actions

actions/cache v2 composite
actions/checkout v2 composite
actions/setup-python v2 composite
codecov/codecov-action v1 composite

pyproject.toml pypi

src/pyfastani/tests/requirements.txt pypi
