taxotagger

DNA taxonomy identification, powered by deep learning and semantic search

https://github.com/mycoai/taxotagger

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.3%) to scientific vocabulary
Last synced: 6 months ago

Repository

Statistics
  • Stars: 2
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 7
Created over 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme License Citation

README.md

TaxoTagger

TaxoTagger is an open-source Python library for DNA taxonomy identification, which involves categorizing DNA sequences into their respective taxonomic groups. It is powered by deep learning and semantic search to provide efficient and accurate results.

Key Features:

  • 🚀 Build vector databases from DNA sequences with ease
  • ⚡ Conduct efficient semantic searches for precise results
  • 🛠 Extend support for custom embedding models effortlessly
  • 🌐 Interact seamlessly through a user-friendly web app

Installation

TaxoTagger requires Python 3.10 or later.

```bash
# create a virtual environment
conda create -n venv-3.10 python=3.10
conda activate venv-3.10

# install the taxotagger package
pip install --pre taxotagger
```

Usage

Build a vector database from a FASTA file

```python
from taxotagger import ProjectConfig
from taxotagger import TaxoTagger

config = ProjectConfig()
tt = TaxoTagger(config)

# creating the database will take ~30s
tt.create_db('data/database.fasta')
```

By default, the ~/.cache/mycoai folder is used to store the vector database and the embedding model. The MycoAI-CNN.pt model is automatically downloaded to this folder if it is not there, and the vector database is created and named after the model.
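To check what has been stored there, a minimal sketch using only the standard library is shown below. It assumes the default `~/.cache/mycoai` location described above; the helper name `list_cache_entries` is hypothetical and not part of TaxoTagger, and the exact name of the database entry is not assumed.

```python
from pathlib import Path


def list_cache_entries(cache_dir: Path = Path.home() / ".cache" / "mycoai") -> list[str]:
    """Return the sorted names of everything in the cache folder.

    Hypothetical helper for illustration; returns an empty list if the
    folder does not exist yet.
    """
    if not cache_dir.is_dir():
        return []
    return sorted(entry.name for entry in cache_dir.iterdir())


if __name__ == "__main__":
    # After create_db() has run, the model file (MycoAI-CNN.pt) and the
    # vector database named after the model are expected to be listed here.
    for name in list_cache_entries():
        print(name)
```

Passing a different `cache_dir` lets the same helper inspect a non-default location.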

Conduct a semantic search with a FASTA file

```python
from taxotagger import ProjectConfig
from taxotagger import TaxoTagger

config = ProjectConfig()
tt = TaxoTagger(config)

# semantic search and return the top 1 result for each query sequence
res = tt.search('data/query.fasta', limit=1)
```

The data/query.fasta file contains two query sequences: KY106088 and KY106087.

The search results res will be a dictionary with taxonomic level names as keys and matched results as values for each of the two query sequences. For example, res['phylum'] will look like:

```python
[
    [{"id": "KY106088", "distance": 1.0, "entity": {"phylum": "Ascomycota"}}],
    [{"id": "KY106087", "distance": 0.9999998807907104, "entity": {"phylum": "Ascomycota"}}]
]
```

The first inner list contains the top results for the first query sequence, and the second inner list contains the top results for the second query sequence.

The id field is the sequence ID of the matched sequence. The distance field is the cosine similarity between the query sequence and the matched sequence, a value between 0 and 1; the closer it is to 1, the more similar the two sequences are. The entity field is the taxonomic information of the matched sequence.
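As an illustration of this structure, the sketch below picks the best hit per query for one taxonomic level. It uses the example output above as literal data rather than calling TaxoTagger, and the helper name `top_hits` is hypothetical.

```python
# Example value of res['phylum'], copied from the output shown above.
phylum_results = [
    [{"id": "KY106088", "distance": 1.0, "entity": {"phylum": "Ascomycota"}}],
    [{"id": "KY106087", "distance": 0.9999998807907104, "entity": {"phylum": "Ascomycota"}}],
]


def top_hits(level: str, level_results: list) -> list[tuple]:
    """Return (query_index, matched_id, similarity, taxon) for the best hit of each query.

    Hypothetical helper: "best" means the highest value in the distance
    field, since values closer to 1 indicate more similar sequences.
    """
    hits = []
    for query_index, matches in enumerate(level_results):
        if not matches:  # skip queries that returned no match
            continue
        best = max(matches, key=lambda match: match["distance"])
        hits.append((query_index, best["id"], best["distance"], best["entity"][level]))
    return hits


for hit in top_hits("phylum", phylum_results):
    print(hit)
```

The same helper can be applied to any other key of `res` by passing that level name together with the corresponding list.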

The top result for each query sequence is the sequence itself, because the query sequences are also in the database. Try different query sequences to see other search results.
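A quick way to spot such self-matches is sketched below. It again uses the example output as literal data; the query IDs are listed in the order given for data/query.fasta, and the helper name `is_self_match` is hypothetical.

```python
# Query IDs in the order they appear in data/query.fasta.
query_ids = ["KY106088", "KY106087"]

# Example value of res['phylum'], copied from the example output.
phylum_results = [
    [{"id": "KY106088", "distance": 1.0, "entity": {"phylum": "Ascomycota"}}],
    [{"id": "KY106087", "distance": 0.9999998807907104, "entity": {"phylum": "Ascomycota"}}],
]


def is_self_match(query_id: str, matches: list) -> bool:
    """Return True if any returned match has the same ID as the query (hypothetical helper)."""
    return any(match["id"] == query_id for match in matches)


# The i-th inner list belongs to the i-th query sequence, so zip pairs them up.
for query_id, matches in zip(query_ids, phylum_results):
    label = "self-match" if is_self_match(query_id, matches) else "no self-match"
    print(query_id, label)
```

When evaluating on sequences that are already in the database, filtering out self-matches like this gives a more realistic picture of search quality.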

Docs

Please visit the official documentation for more details.

Questions and feedback

Please submit an issue if you have any questions or feedback.

Citation

If you use TaxoTagger in your work, please cite it by clicking "Cite this repository" at the top right of the repository page.

Owner

  • Name: MycoAI
  • Login: MycoAI
  • Kind: organization

Citation (CITATION.cff)

# YAML 1.2
---
cff-version: "1.1.0"
title: "TaxoTagger"
authors:
  -
    given-names: Cunliang
    family-names: Geng
    affiliation: "Netherlands eScience Center"
    orcid: "https://orcid.org/0000-0002-1409-8358"
version: "0.0.1-alpha.7"
repository-code: "https://github.com/MycoAI/taxotagger"
keywords:
  - Machine Learning
  - Vector database
  - Semantic search
  - Fungi
  - Taxonomy
message: "If you use this software, please cite it using these metadata."
license: Apache-2.0

GitHub Events

Total
  • Release event: 3
  • Watch event: 3
  • Push event: 11
  • Create event: 3
Last Year
  • Release event: 3
  • Watch event: 3
  • Push event: 11
  • Create event: 3

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 29 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 6
  • Total maintainers: 1
pypi.org: taxotagger

Fungi DNA barcoder based on semantic searching

  • Versions: 6
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 29 Last month
Rankings
Dependent packages count: 10.5%
Average: 34.7%
Dependent repos count: 58.9%
Maintainers (1)
Last synced: 6 months ago

Dependencies

requirement.txt pypi
  • httpx *
  • mycoai-its *
  • rich *
  • torch *