taxonopy

A Python package for efficiently aligning organismal taxonomic hierarchies using the Global Names Verifier

https://github.com/imageomics/taxonopy

Science Score: 75.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
    Organization imageomics has institutional domain (imageomics.osu.edu)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.2%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: Imageomics
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 314 KB
Statistics
  • Stars: 3
  • Watchers: 5
  • Forks: 0
  • Open Issues: 8
  • Releases: 1
Created almost 2 years ago · Last pushed 7 months ago
Metadata Files
Readme License Citation

README.md

TaxonoPy

TaxonoPy (taxon-o-py) is a command-line tool for creating an internally consistent taxonomic hierarchy using the Global Names Verifier (gnverifier). See below for the structure of inputs and outputs.

Purpose

This package was built to create an internally consistent, standardized classification for organisms in a large biodiversity dataset assembled from multiple data providers, whose taxonomic hierarchies may be similar and overlapping but are not identical.

Its development has been driven by its application in the TreeOfLife-200M (TOL) dataset. This dataset contains over 200 million samples of organisms from four core data providers.

The names (and classification) of taxa may be (and often are) inconsistent across these resources. This package addresses this problem by creating an internally consistent classification set for such taxa.

Input

A directory containing Parquet partitions of the seven-rank Linnaean taxonomic metadata for organisms in the dataset. Labels should include:

  • uuid: a unique identifier for each sample (required).
  • kingdom, phylum, class, order, family, genus, species: the taxonomic ranks of the organism (required, may have sparsity).
  • scientific_name: the scientific name of the organism, to the most specific rank available (optional).
  • common_name: the common (i.e. vernacular) name of the organism (optional).

See the example data in:

  • examples/input/sample.parquet
  • examples/resolved/sample.resolved.parquet (generated with taxonopy resolve)
  • examples/resolved_with_common_names/sample.resolved.parquet (generated with taxonopy common-names)
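As a rough illustration of the input layout described above, the following sketch checks a list of column names against the required and optional labels. The column names come from this README; the helper `check_columns` is hypothetical and is not part of TaxonoPy, and whether TaxonoPy tolerates additional columns is not stated here, so extra columns are only reported.

```python
# Column names are taken from the Input section; the helper is a hypothetical
# illustration and not part of the TaxonoPy API.
REQUIRED_COLUMNS = (
    "uuid", "kingdom", "phylum", "class", "order", "family", "genus", "species",
)
OPTIONAL_COLUMNS = ("scientific_name", "common_name")


def check_columns(columns):
    """Return (missing_required, unexpected) for an iterable of column names."""
    present = set(columns)
    # Preserve the documented order when listing what is missing.
    missing = [name for name in REQUIRED_COLUMNS if name not in present]
    known = set(REQUIRED_COLUMNS) | set(OPTIONAL_COLUMNS)
    unexpected = sorted(present - known)
    return missing, unexpected


if __name__ == "__main__":
    header = ["uuid", "kingdom", "phylum", "class", "order", "family", "genus",
              "scientific_name"]
    print(check_columns(header))
```

A check like this could be run on the column names of a Parquet partition before passing the directory to taxonopy resolve.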

Challenges

This taxonomy information is provided by each data provider and the original sources, but the classification can be...

  • Inconsistent: both between and within sources (e.g. kingdom Metazoa vs. Animalia).
  • Incomplete: many samples are missing one or more ranks. Some have 'holes' where higher and lower ranks are present, but intermediate ranks are missing.
  • Incorrect: some samples have incorrect classifications. This can come in the form of spelling errors, nonstandard idiosyncratic terms, or outdated classifications.
  • Ambiguous: homonyms, synonyms, and other terms that can be interpreted in multiple ways unless handled systematically.
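The 'holes' mentioned under Incomplete can be made concrete with a small sketch. It assumes a record is a plain dict keyed by the seven rank names, with missing ranks stored as None or an empty string; the function `find_holes` is hypothetical and does not reflect how TaxonoPy itself detects gaps.

```python
# Rank order follows the seven-rank Linnaean hierarchy named in the Input section.
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")


def find_holes(record):
    """Return ranks that are empty although a higher and a lower rank are both filled."""
    filled = [bool(record.get(rank)) for rank in RANKS]
    if not any(filled):
        return []
    first = filled.index(True)                         # highest filled rank
    last = len(filled) - 1 - filled[::-1].index(True)  # lowest filled rank
    return [RANKS[i] for i in range(first, last + 1) if not filled[i]]


if __name__ == "__main__":
    sample = {
        "kingdom": "Animalia", "phylum": "Chordata", "class": None, "order": None,
        "family": "Felidae", "genus": "Panthera", "species": "Panthera leo",
    }
    print(find_holes(sample))
```

Ranks missing only at the bottom of the hierarchy (for example, no genus or species) are not reported as holes under this definition; they make the record incomplete but leave no gap between filled ranks.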

Taxonomic authorities exist to standardize classification, but ...

  • There are many authorities.
  • They may disagree.
  • A given organism may be missing from some.

Solution

TaxonoPy uses the taxonomic hierarchies provided by the TOL core data providers to query GNVerifier and create a standardized classification for each sample in the TOL dataset. It prioritizes the GBIF Backbone Taxonomy, since this represents the largest part of the TOL dataset. Where GBIF misses, backup sources such as the Catalogue of Life and Open Tree of Life (OTOL) Reference Taxonomy are used.
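The prioritization described above can be sketched as an ordered fallback over sources. This is only an illustration: the match dictionaries are a made-up structure that does not mirror GNVerifier's response format, the relative order of the two backup sources is an assumption, and `pick_match` is a hypothetical name rather than a TaxonoPy function.

```python
# Source names follow the README. Placing Catalogue of Life ahead of the Open Tree
# of Life Reference Taxonomy is an assumption made for this sketch.
SOURCE_PRIORITY = (
    "GBIF Backbone Taxonomy",
    "Catalogue of Life",
    "Open Tree of Life Reference Taxonomy",
)


def pick_match(matches):
    """Return the match from the highest-priority source, or None if none qualifies."""
    by_source = {}
    for match in matches:
        # Keep only the first match seen for each source.
        by_source.setdefault(match["source"], match)
    for source in SOURCE_PRIORITY:
        if source in by_source:
            return by_source[source]
    return None


if __name__ == "__main__":
    candidates = [
        {"source": "Catalogue of Life", "name": "Panthera leo"},
        {"source": "GBIF Backbone Taxonomy", "name": "Panthera leo"},
    ]
    print(pick_match(candidates)["source"])
```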

Installation

TaxonoPy can be installed with pip after setting up a virtual environment.

User Installation with pip

To install the latest version of TaxonoPy, run:

```console
pip install taxonopy
```

Usage

You may view the help for the command line interface by running:

```console
taxonopy --help
```

This will show you the available commands and options:

```console
usage: taxonopy [-h] [--cache-dir CACHE_DIR] [--show-cache-path] [--cache-stats] [--clear-cache] [--show-config] [--version] {resolve,trace,common-names} ...

TaxonoPy: Resolve taxonomic names using GNVerifier and trace data provenance.

positional arguments:
  {resolve,trace,common-names}
    resolve             Run the taxonomic resolution workflow
    trace               Trace data provenance of TaxonoPy objects
    common-names        Merge vernacular names (post-process) into resolved outputs

options:
  -h, --help            show this help message and exit
  --cache-dir CACHE_DIR
                        Directory for TaxonoPy cache (can also be set with TAXONOPY_CACHE_DIR environment variable) (default: None)
  --show-cache-path     Display the current cache directory path and exit (default: False)
  --cache-stats         Display statistics about the cache and exit (default: False)
  --clear-cache         Clear the TaxonoPy object cache. May be used in isolation. (default: False)
  --show-config         Show current configuration and exit (default: False)
  --version             Show version number and exit
```

Command: resolve

The resolve command is used to perform taxonomic resolution on a dataset. It takes a directory of Parquet partitions as input and outputs a directory of resolved Parquet partitions.

```
usage: taxonopy resolve [-h] -i INPUT -o OUTPUT_DIR [--output-format {csv,parquet}] [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--log-file LOG_FILE] [--force-input] [--batch-size BATCH_SIZE] [--all-matches] [--capitalize] [--fuzzy-uninomial] [--fuzzy-relaxed] [--species-group] [--refresh-cache]

options:
  -h, --help            show this help message and exit
  -i, --input INPUT     Path to input Parquet or CSV file/directory
  -o, --output-dir OUTPUT_DIR
                        Directory to save resolved and unsolved output files
  --output-format {csv,parquet}
                        Output file format
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set logging level
  --log-file LOG_FILE   Optional file to write logs to
  --force-input         Force use of input metadata without resolution

GNVerifier Settings:
  --batch-size BATCH_SIZE
                        Max number of name queries per GNVerifier API/subprocess call
  --all-matches         Return all matches instead of just the best one
  --capitalize          Capitalize the first letter of each name
  --fuzzy-uninomial     Enable fuzzy matching for uninomial names
  --fuzzy-relaxed       Relax fuzzy matching criteria
  --species-group       Enable group species matching

Cache Management:
  --refresh-cache       Force refresh of cached objects (input parsing, grouping) before running.
```

It is recommended to keep GNVerifier settings at their defaults.

Command: trace

The trace command is used to trace the provenance of a taxonomic entry. It takes a UUID and an input path as arguments and outputs the full path of the entry through TaxonoPy.

```console
usage: taxonopy trace [-h] {entry} ...

positional arguments:
  {entry}
    entry     Trace an individual taxonomic entry by UUID

options:
  -h, --help  show this help message and exit

usage: taxonopy trace entry [-h] --uuid UUID --from-input FROM_INPUT [--format {json,text}] [--verbose]

options:
  -h, --help            show this help message and exit
  --uuid UUID           UUID of the taxonomic entry
  --from-input FROM_INPUT
                        Path to the original input dataset
  --format {json,text}  Output format
  --verbose             Show full details including all UUIDs in group
```

Command: common-names

The common-names command is used to merge vernacular names into the resolved output. It takes a directory of resolved Parquet partitions as input and outputs a directory of resolved Parquet partitions with common names.

```console
usage: taxonopy common-names [-h] --resolved-dir ANNOTATION_DIR --output-dir OUTPUT_DIR

options:
  -h, --help            show this help message and exit
  --resolved-dir ANNOTATION_DIR
                        Directory containing your *.resolved.parquet files
  --output-dir OUTPUT_DIR
                        Directory to write annotated .parquet files
```

Note that the common-names command is a post-processing step and should be run after the resolve command.

Example Usage

To perform taxonomic resolution on a dataset with subsequent common name annotation, run:

```console
taxonopy resolve \
    --input /path/to/formatted/input \
    --output-dir /path/to/resolved/output
```

```console
taxonopy common-names \
    --resolved-dir /path/to/resolved/output \
    --output-dir /path/to/resolved_with_common-names/output
```
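For scripted workflows, the two commands above could be chained from Python. The sketch below is a minimal wrapper, not part of TaxonoPy: it uses only the flags shown in this README, the function names are hypothetical, and it assumes the taxonopy executable is on the PATH.

```python
import subprocess


def resolve_cmd(input_path, output_dir):
    """Build the argument list for `taxonopy resolve`."""
    return ["taxonopy", "resolve",
            "--input", str(input_path),
            "--output-dir", str(output_dir)]


def common_names_cmd(resolved_dir, output_dir):
    """Build the argument list for `taxonopy common-names`."""
    return ["taxonopy", "common-names",
            "--resolved-dir", str(resolved_dir),
            "--output-dir", str(output_dir)]


def run_pipeline(input_path, resolved_dir, annotated_dir, runner=subprocess.run):
    """Run resolve, then common-names; check=True stops on a non-zero exit code."""
    commands = (
        resolve_cmd(input_path, resolved_dir),
        common_names_cmd(resolved_dir, annotated_dir),
    )
    for cmd in commands:
        runner(cmd, check=True)
```

Passing the commands as argument lists avoids shell quoting issues with paths, and the `runner` parameter makes it possible to substitute a stub when testing without the CLI installed.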

TaxonoPy creates a cache of the objects associated with input entries for use with the trace command. By default, this cache is stored in the ~/.cache/taxonopy directory.
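The cache location can be pictured as a simple lookup chain. The sketch below assumes that an explicit --cache-dir value takes precedence over the TAXONOPY_CACHE_DIR environment variable, which in turn overrides the default; this ordering is an assumption, and `cache_dir` is a hypothetical helper rather than TaxonoPy's own implementation.

```python
import os
from pathlib import Path


def cache_dir(cli_value=None, environ=None):
    """Resolve a cache directory: explicit value, then TAXONOPY_CACHE_DIR, then default.

    The precedence order is assumed for illustration; consult
    `taxonopy --show-cache-path` for the path actually in use.
    """
    environ = os.environ if environ is None else environ
    chosen = cli_value or environ.get("TAXONOPY_CACHE_DIR")
    if chosen:
        return Path(chosen).expanduser()
    # Default location stated in the README.
    return Path.home() / ".cache" / "taxonopy"


if __name__ == "__main__":
    print(cache_dir())
```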

Development

See the Wiki Development Page for development instructions.

Owner

  • Name: Imageomics Institute
  • Login: Imageomics
  • Kind: organization

Citation (CITATION.cff)

abstract: "A Python package for efficiently aligning organismal taxonomic hierarchies using the Global Names Verifier."
authors:
- family-names: "Thompson"
  given-names: "Matthew J."
  orcid: "https://orcid.org/0000-0003-0583-8585"
- family-names: "Campolongo"
  given-names: "Elizabeth G."
  orcid: "https://orcid.org/0000-0003-0846-2413"
cff-version: 1.2.0
date-released: "2025-05-23"
identifiers:
  - description: "The GitHub release URL of tag v0.1.0-beta."
    type: url
    value: "https://github.com/Imageomics/TaxonoPy/releases/tag/v0.1.0-beta"
  - description: "The GitHub URL of the commit tagged with v0.1.0-beta"
    type: url
    value: "https://github.com/Imageomics/TaxonoPy/tree/b3ddeb8eb05d09c15c417ce2d2a4354a2a6fa49d"
keywords:
  - imageomics
  - taxonomy
  - "taxonomic resolution"
  - "tree of life"
  - alignment
  - hierarchy
references:
  - type: software
    title: "GNverifier -- a reconciler and resolver of scientific names against more than 100 data sources."
    version: "v1.2.2"
    authors:
      - family-names: "Mozzherin"
        given-names: "Dmitry"
        orcid: "https://orcid.org/0000-0003-1593-1417"
    repository-code: "https://github.com/gnames/gnverifier"
    date-released: "2024-11-04"
    doi: 10.5281/zenodo.10070488
    license: MIT
license: MIT
message: "If you use this software, please cite it using the metadata from this file."
repository-code: "https://github.com/Imageomics/TaxonoPy"
title: "TaxonoPy"
version: "0.1.0-beta"
doi: "10.5281/zenodo.15499454"
type: software

GitHub Events

Total
  • Create event: 2
  • Issues event: 2
  • Watch event: 3
  • Delete event: 2
  • Issue comment event: 8
  • Public event: 1
  • Push event: 6
  • Gollum event: 1
  • Pull request review event: 4
  • Pull request review comment event: 5
  • Pull request event: 2
Last Year
  • Create event: 2
  • Issues event: 2
  • Watch event: 3
  • Delete event: 2
  • Issue comment event: 8
  • Public event: 1
  • Push event: 6
  • Gollum event: 1
  • Pull request review event: 4
  • Pull request review comment event: 5
  • Pull request event: 2

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 18 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 1
pypi.org: taxonopy

A Python package for resolving taxonomic hierarchies using the Global Names Verifier API.

  • Documentation: https://github.com/Imageomics/TaxonoPy
  • License: MIT License Copyright (c) 2025 Imageomics Institute Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
  • Latest release: 0.1.0b0
    published 9 months ago
  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 18 Last month
Rankings
Dependent packages count: 9.1%
Average: 30.2%
Dependent repos count: 51.2%
Last synced: 6 months ago

Dependencies

.github/workflows/publish-to-pypi.yaml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • pypa/gh-action-pypi-publish release/v1 composite
pyproject.toml pypi
  • pandas *
  • polars *
  • pyarrow *
  • pydantic *
  • requests *
  • tqdm *