Fastsubtrees

Fastsubtrees: simple and efficient subtrees extractions in Python with applications to NCBI taxonomy - Published in JOSS (2022)

https://github.com/ggonnella/fastsubtrees

Science Score: 100.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
✓
DOI references
Found 4 DOI reference(s) in README and JOSS metadata
✓
Academic publication links
Links to: joss.theoj.org
✓
Committers with academic emails
3 of 4 committers (75.0%) from academic institutions
○
Institutional organization owner
✓
JOSS paper metadata
Published in Journal of Open Source Software

Keywords

bioinformatics ncbi-taxonomy python subtree subtree-extraction subtree-query taxonomy tree

Scientific Fields

Computer Science Computer Science - 37% confidence

Last synced: 6 months ago · JSON representation ·

Repository

Python library and command line script , for fast extraction of subtrees of fairly large trees, consisting of millions of nodes, such as the NCBI taxonomy tree.

Basic Info

Host: GitHub
Owner: ggonnella
License: isc
Language: Python
Default Branch: main
Homepage:
Size: 834 KB

Statistics

Stars: 4
Watchers: 4
Forks: 0
Open Issues: 0
Releases: 0

Topics

bioinformatics ncbi-taxonomy python subtree subtree-extraction subtree-query taxonomy tree

Created about 4 years ago · Last pushed about 1 year ago

Metadata Files

Readme Changelog License Citation Authors

Fastsubtrees

Fastsubtrees is a Python library and a command line script, for handling fairly large trees (in the order of magnitude of millions nodes), in particular allowing the fast extraction of any subtree. The main application domain of fastsubtrees is working with the NCBI taxonomy tree, however the code is implemented in a generic way, so that other applications are possible.

The library functionality can be accessed both from inside Python code and from the provided command line tool fastsubtrees.

Introduction

For the use of fastsubtrees, nodes must be uniquely identified by non-negative IDs. Furthermore, the space of the IDs must be compact (i.e. the maximum ID should not be much larger than the number of IDs).

The first step when using fastsubtrees is to construct a tree representation. The operation requires a source of IDs of elements and their parents, which can be a tabular file, or any Python function yielding the IDs.

This operation just takes a few seconds, for a tree with million nodes, such as the NCBI taxonomy tree. It must be done only once, if a tree does not change, since the resulting data is stored to file.

The IDs of the NCBI taxonomy tree fullfill the conditions stated above. However, the library can be used for any tree. A way to use the library with IDs which do not fullfill the conditions, it to map them to an ID space which does, and store the original IDs as an attribute.

Besides the IDs, a tree can contain further information, e.g. integers, floats or other data, here called attributes, associated to the nodes. Each node can contain zero, one or more values for an attribute. To add values for an attribute, a tabular file or another data source (a Python function) is selected.

The data for any subtree can then be easily and efficently queried; thereby the node IDs and/other selected attributes can be retrieved.

The tree representation is dynamic, i.e. both the tree topology and the attribute values can be edited and changed.

Working with the library

Installation

The package can be installed using pip install fastsubtrees.

Command line interface

The command line tool fastsubtrees allows constructing and modifying a tree (subcommand tree), adding and editing attributes (subcommand attribute) and performing a subtree query (subcommand query).

The command line interface is further described in the CLI manual.

CLI example: working with the NCBI taxonomy tree

The example below uses the fastsubtrees command, as well as the ntdownload library (installed as a dependency, by pip) for obtaining the NCBI taxonomy data.

``` ntdownload ntdumps # download NCBI taxonomy data fastsubtrees tree nt.tree --ncbi ntdumps/nodes.dmp -f # create the tree fastsubtrees query nt.tree 562 # query node 562

attributes

ATTRTAB=data/accessiontaxidattribute.tsv.gz # data file TAXID=2; GENOMESIZE=3; GCCONTENT=4 # column numbers, 1-based

fastsubtrees attribute nt.tree genomesize $ATTRTAB -e $TAXID -v $GENOMESIZE -t int fastsubtrees attribute nt.tree GCcontent $ATTRTAB -e $TAXID -v $GCCONTENT -t float

fastsubtrees query nt.tree 562 genomesize GCcontent # query including attributes

taxonomy names

ntnames ntdumps >| names.tsv # prepare data from names dump fastsubtrees attribute nt.tree taxname names.tsv # add names as attribute fastsubtrees query nt.tree 562 taxname genome_size # query including taxa names ```

Using NtSubtree

The package ntsubtree (installable by pip) simplifies working with the NCBI taxonomy even more. Tree and the taxonomic names tables are automatically created and stored in a central location.

```

first run after installing automatically downloads and constructs the tree

ntsubtree query 562 # taxonomic names displayed alongside the IDs ntsubtree query -n "Escherichia" # Query by taxonomic name

attributes

ATTRTAB=data/accessiontaxidattribute.tsv.gz # data file TAXID=2; GENOMESIZE=3; GCCONTENT=4 # column numbers

ntsubtree attribute genomesize $ATTRTAB -e $TAXID -v $GENOMESIZE ntsubtree attribute GCcontent $ATTRTAB -e $TAXID -v $GCCONTENT ntsubtree query -n "Escherichia" genomesize GCcontent

check if a newer version of the taxonomy data is available

and update the tree if necessary, keeping the attribute values:

ntsubtree update ```

API

The library functionality can be also directly accessed in Python code using the API, which is documented in the API manual.

API example: working with the NCBI taxonomy tree

The example below uses the fastsubtrees command, as well as the ntdownload library (installed as a dependency, by pip) for obtaining the NCBI taxonomy data.

```python

download the NCBI taxonomy data

from ntdownload import Downloader d = Downloader("ntdumpsdir") has_downloaded = d.run()

from fastsubtrees import Tree infile = "ntdumpsdir/nodes.dmp" tree = Tree.constructfromncbidump(infile) # create the tree results = tree.subtreeids(562) # retrieve subtree IDs

attrtab="data/accessiontaxidattribute.tsv.gz" # data file taxidcol=1; genomesizecol=2; gccontent_col=3 # column numbers, 0-based

tree.tofile("nt.tree") tree.createattributefromtabular("genomesize", attrtab, elemfieldnum=taxidcol, attrfieldnum=genomesizecol, castingfn=int) tree.createattributefromtabular("GCcontent", attrtab, elemfieldnum=taxidcol, attrfieldnum=gccontentcol, castingfn=float) results = tree.subtreeinfo(562, ["genomesize", "GCcontent"])

taxonomy names

from ntdownload import yieldscientificnamesfromdump as generator tree.createattribute("taxname", generator("ntdumpsdir")) results = tree.subtreeinfo(562, ["taxname", "genome_size"]) ```

Using NtSubtree

The package ntsubtree (installable by pip) simplifies working with the NCBI taxonomy even more. Tree and the taxonomic names tables are automatically created and stored in a central location. The first time the library is included these operations are done automatically.

```python import ntsubtree

tree = ntsubtree.gettree() results = tree.subtreeids(562)

taxid = ntsubtree.searchname("Escherichia") results = tree.subtreeinfo(taxid, ["taxname"])

attrtab="data/accessiontaxidattribute.tsv.gz" # data file taxidcol=1; genomesizecol=2; gccontent_col=3 # column numbers, 0-based

tree.createattributefromtabular("genomesize", attrtab, elemfieldnum=taxidcol, attrfieldnum=genomesizecol, castingfn=int) tree.createattributefromtabular("GCcontent", attrtab, elemfieldnum=taxidcol, attrfieldnum=gccontentcol, castingfn=float) results = tree.subtreeinfo(562, ["genomesize", "GC_content"])

check if a newer version of the taxonomy data is available

and update the tree if necessary, keeping the attribute values:

ntsubtree.update() ```

Docker

To try or test the package, it is possible to use fastsubtrees by employing the Docker image defined in Dockerfile. This does not require any external database installation and configuration.

Example of the Docker command line:

``` # create a Docker image docker build --tag "fastsubtrees" . # create a container and run it docker run -p 8050:8050 --detach --name fastsubtreesC fastsubtrees # or, if it was already created and stopped, restart it using: # docker start fastsubtreesC # run the tests docker exec fastsubtreesC tests # run benchmarks docker exec fastsubtreesC benchmarks # run the example application docker exec fastsubtreesC start-example-app # now open it in the browser at https://0.0.0.0:8050 ```

Tests

To run the test suite, you can use pytest (or make tests). The tests include tests of fastsubtrees and of the sub-package ntmirror. The latter are partly dependent on a database installation and configuration which must be given in ntmirror/tests/config.yaml; database-dependent tests are skipped if this configuration file is not provided.

The entire test suite can be also run from the Docker container, without further configuration, see above the Docker section.

Benchmarks

Benchmarks can be run using the shell scripts provided under benchmarks. These require data, which is downloaded from NCBI taxonomy and some pre-computed example data which is provided in the data subdirectory (genome sizes and GC content).

The benchmarks can be convienently run from the Docker container, without requiring a database installation and setup, see above the Docker section.

Example application: Genome attributes viewer

An interactive web application based on fastsubtrees was developed using dash. It allows to graphically display the distribution of values of attributes in subtrees of the NCBI taxonomic tree. It is a separate Python package, which can be installed using pip, and depends on fastsubtrees.

It can also be installed using the Docker image of fastsubtrees (see above in the Docker section).

For more information see also the genomes-attributes-viewer/README.md file.

Local installation and startup

To application can be installed using pip install genomes_attributes_viewer or from the genomes_attributes_viewer directory of the fastsubtrees repository.

To start the application, use the genomes-attributes-viewer. The first time this command is run, the application data are downloaded and prepared, taking a few seconds. Startup on subsequent starts does not require these operations and is thus faster.

Other subpackages

NtSubtree

NtSubtree is a library which automatically downloads the NCBI taxonomy dump and constructs the fastsubtrees data for it. It allows to easily keep the data up-to-date. It is a separate Python package, which can be installed using pip, and depends on fastsubtrees.

The query command of the NtSubtree CLI tool automatically display also taxonomic names, alongside the IDs in query and allow to perform queries by taxonomic name.

For more information see also the ntsubtree/README.md file.

ntdownload

When working with the NCBI taxonomy database, a local copy of the NCBI taxonomy dump can be obtained and kept up-to-date using the ntdownload package, which is located in the directory ntdownload. It is a separate Python package, which can be installed using pip, independently from fastsubtrees.

Please refer to the user manual of ntdownload located under ntdownload/README.md for more information.

ntmirror

A downloaded NCBI taxonomy database dump can be loaded to a local SQL database, using the package ntmirror, which is located in the directory ntmirror. It is a separate Python package, which can be installed using pip, independently from fastsubtrees.

It contains also a script to extract subtrees from the local database mirror using hierarchical SQL queries.

Please refer to the user manual of ntmirror located under ntmirror/README.md for more information.

Internals

For achieving an efficient running time and memory use, the nodes of the tree are represented compactly in deep-first traversal order. Subtrees are then extracted in O(s) time, where s is the size of the extracted subtree (i.e. not depending on the size of the whole tree).

The IDs must not necessarily be all consecutive (i.e. some "holes" may be present), but the largest node ID (idmax) should not be much larger than the total number of nodes, because the memory consumption is in O(idmax).

For each attribute defined in a tree, a file is created, where the attribute values are stored. The attributes are also stored in the same deep-first traversal order as the tree IDs.

Tree construction algorithm

The tree construction algorithm used here is the following, where the input data consists of 2-tuples (element_id, parent_id) and the maximum node ID m is not much larger than the number of IDs n.

iteration over the input data to construct a table P of parents by ID, i.e. P[element_id]=parent_id if element_id is in the tree, and P[element_id]=UNDEF if not, where UNDEF is a special value. This requires O(n) steps for reading the IDs and O(m) steps for writing either the ID or the UNDEF value to P; since m>=n, the total time is in O(m).
iteration over table P to construct a table S of subtree sizes by ID; for each element the tree is climbed to the root, to add the element to the counts of each ancestor. This operation requires O(n*d) time, where d is the height of the node, which is in average case much lower than m and d=m is the worst case.
iteration over each node ID to construct the list D, consisting of the depth first order of the nodes, and the table C of the coordinates of all nodes in the tree data, by id. For this operation, first the root is added to D and C, then for each other node x in P, the tree is climbed and nodes added to a stack until the next not-yet-added ancestor is found. The position where to add it this node is computed by the next free position in the subtree of its parent (which must have been already added, by definition, thus the next free position in its subtree is known). After this, the next stack element is added, until x is added. Although this operation also requires climbing the tree, it takes in total O(n) time, since each node is added only once.

Parallelizing the tree construction

Currently the slowest step of the construction, detailed in the previous section, is the second, i.e. the computation of S. Since each node must be count in the subtree size of all its ancestors, there is no easy way to reduce the time from O(n*h).

To parallelize this step, one divides the parents table into t slices, and assign each to a different sub-process (not thread, because of the GIL). Each sub-process would then count the subtree sizes in the slice only. A version implemented with a shared table and a lock was too slow, since access to the table was concurrent among the sub-processes. In the current version, instead, each sub-process makes a own subtree sizes S' table. The sub-processes S' tables are summed up after completion for obtaining the S table.

This option can be activated in the CLI using the --processes P option, or in the API setting the nprocesses argument of Tree.construct and related methods. Benchmarks show that the parallel version did not significantly improve the performance on constructing the NCBI taxonomy tree, likely because of the overhead of process starting, array S' initialization and summing up of all S' to S after completion.

Community guidelines

Contributions to the software are welcome. Please clone this repository and send a pull request on Github, to let the changes be integrated in the original repository.

In case of bugs and issues, please report them through the Github Issues page of the repository.

Documentation

The complete documentation of Fastsubtrees is available on ReadTheDocs at https://fastsubtrees.readthedocs.io/ in website and PDF format.

Licence

All code of Fastsubtrees is released under the ISC license. (see LICENSE file). It is functionally equivalent to a two-term BSD copyright with language removed that is made unnecessary by the Berne convention. See http://openbsd.org/policy.html for more information on copyrights.

Acknowledgements

This software has been originally created in context of the DFG project GO 3192/1-1 “Automated characterization of microbial genomes and metagenomes by collection and verification of association rules”. The funders had no role in study design, data collection and analysis.

Owner

Name: Giorgio Gonnella
Login: ggonnella
Kind: user
Location: Goettingen, Germany
Company: Bioinformatics, University of Goettingen

Repositories: 36
Profile: https://github.com/ggonnella

JOSS Publication

Fastsubtrees: simple and efficient subtrees extractions in Python with applications to NCBI taxonomy

Published

November 13, 2022

DOI

10.21105/joss.04755

Volume 7, Issue 79, Page 4755

Authors

Aman Modi

Department of Bioinformatics (IMG), University of Göttingen, Göttingen, Germany

Giorgio Gonnella

Department of Bioinformatics (IMG), University of Göttingen, Göttingen, Germany, Center for Bioinformatics (ZBH), University of Hamburg, Hamburg, Germany

Editor

Kelly Rowland

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Gonnella"
  given-names: "Giorgio"
  orcid: "https://orcid.org/0000-0003-3900-5397"
title: 'Fastsubtrees: simple and efficient subtrees extractions in Python with applications to NCBI taxonomy'
version: 2.2
date-released: 2025-02-17
url: "htts://github.com/ggonnella/fastsubtrees/"

GitHub Events

Total

Delete event: 3
Push event: 2
Create event: 1

Last Year

Delete event: 3
Push event: 2
Create event: 1

Committers

Last synced: 7 months ago

All Time

Total Commits: 482
Total Committers: 4
Avg Commits per committer: 120.5
Development Distribution Score (DDS): 0.156

Past Year

Commits: 6
Committers: 1
Avg Commits per committer: 6.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Giorgio Gonnella	g**a@z**e	407
aman.modi	a**i@s**e	73
Giorgio Gonnella	g**a@u**e	1
Aman Modi	a**o@t**e	1

Committer Domains (Top 20 + Academic)

thekla.gobics.de: 1 uni-goettingen.de: 1 stud.uni-goettingen.de: 1 zbh.uni-hamburg.de: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 25
Total pull requests: 2
Average time to close issues: 5 days
Average time to close pull requests: 3 months
Total issue authors: 2
Total pull request authors: 2
Average comments per issue: 3.24
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

ggonnella (17)
KonradHoeffner (8)

Pull Request Authors

ggonnella (1)
iamaman23 (1)

Top Labels

Issue Labels

documentation (7) enhancement (4) bug (2) question (1)

Pull Request Labels

Packages

Total packages: 5
Total downloads:
- pypi 215 last-month

Total dependent packages: 6
(may contain duplicates)
Total dependent repositories: 2
(may contain duplicates)
Total versions: 27
Total maintainers: 1

pypi.org: ntmirror

Easily updatable local NCBI taxonomy database copy

Homepage: https://github.com/ggonnella/fastsubtrees/tree/main/ntmirror
Documentation: https://ntmirror.readthedocs.io/
License: ISC
Latest release: 1.4.1
published over 3 years ago

Versions: 6
Dependent Packages: 1
Dependent Repositories: 1
Downloads: 43 Last month

Rankings

Dependent packages count: 3.2%

Average: 20.3%

Dependent repos count: 22.1%

Forks count: 22.8%

Stargazers count: 23.2%

Downloads: 30.4%

Maintainers (1)

gonnella

Last synced: 6 months ago

pypi.org: fastsubtrees

Tree representation for fast subtree queries

Homepage: https://github.com/ggonnella/fastsubtrees
Documentation: https://fastsubtrees.readthedocs.io/
License: ISC
Latest release: 1.6.1
published over 3 years ago

Versions: 11
Dependent Packages: 2
Dependent Repositories: 1
Downloads: 96 Last month

Rankings

Dependent packages count: 2.1%

Average: 20.4%

Dependent repos count: 22.1%

Forks count: 22.8%

Stargazers count: 23.2%

Downloads: 32.0%

Maintainers (1)

gonnella

Last synced: 6 months ago

pypi.org: ntdownload

Easily updatable local NCBI taxonomy dumps file copy

Homepage: https://github.com/ggonnella/fastsubtrees/tree/main/ntdowload
Documentation: https://ntdownload.readthedocs.io/
License: ISC
Latest release: 1.7.1
published over 3 years ago

Versions: 5
Dependent Packages: 3
Dependent Repositories: 0
Downloads: 41 Last month

Rankings

Dependent packages count: 1.4%

Average: 20.4%

Forks count: 23.2%

Stargazers count: 23.3%

Downloads: 23.6%

Dependent repos count: 30.6%

Maintainers (1)

gonnella

Last synced: 6 months ago

pypi.org: genomes-attributes-viewer

Example application of fastsubtrees

Homepage: https://github.com/ggonnella/fastsubtrees/"+ "tree/main/genomes_attributes_viewer
Documentation: https://genomes-attributes-viewer.readthedocs.io/
License: ISC
Latest release: 1.3
published over 3 years ago

Versions: 3
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 24 Last month

Rankings

Dependent packages count: 6.6%

Average: 22.9%

Forks count: 23.2%

Stargazers count: 23.3%

Downloads: 30.5%

Dependent repos count: 30.6%

Maintainers (1)

gonnella

Last synced: 6 months ago

pypi.org: ntsubtree

Tree representation for fast queries of the subtree of a taxon in the NCBI taxonomy tree

Homepage: https://github.com/ggonnella/fastsubtrees/tree/main/ntsubtree
Documentation: https://ntsubtree.readthedocs.io/
License: ISC
Latest release: 1.1
published over 3 years ago

Versions: 2
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 11 Last month

Rankings

Dependent packages count: 6.6%

Forks count: 23.2%

Stargazers count: 23.3%

Average: 23.6%

Dependent repos count: 30.6%

Downloads: 34.3%

Maintainers (1)

gonnella

Last synced: 6 months ago

Dependencies

ntmirror/setup.py pypi

loguru *
sqlalchemy *

setup.py pypi

docopt >=0.6.2
loguru >=0.5.1
schema >=0.7.4
tqdm >=4.57.0

.github/workflows/draft-pdf.yml actions

actions/checkout v2 composite
actions/upload-artifact v1 composite
openjournals/openjournals-draft-action master composite

.github/workflows/python-package.yml actions

actions/checkout v3 composite
actions/setup-python v3 composite

Dockerfile docker

ubuntu 22.04 build

data/requirements.txt pypi

docopts *
sh *

docs/requirements.txt pypi

myst_parser *

genomes_attributes_viewer/setup.py pypi

Flask ==2.1.2
dash-bootstrap-components ==1.0.2
dash-html-components ==2.0.0
fastsubtrees >=2.0

ntdownload/requirements.txt pypi

loguru *
schema *
snacli *

ntdownload/setup.py pypi

loguru *

ntmirror/requirements.txt pypi

PyYAML *
loguru *
mariadb *
mysql *
ntdownload *
schema *
snacli *
sqlalchemy *
sqlalchemy_repr *

ntsubtree/setup.py pypi

docopt >=0.6.2
loguru >=0.5.1
ntdownload *
schema >=0.7.4
tqdm >=4.57.0

requirements.txt pypi

docopt >=0.6.2
loguru >=0.5.1
ntdownload >=1.6
schema >=0.7.4
sh >=1.14.2
tqdm >=4.57.0

Fastsubtrees

Science Score: 100.0%

Keywords

Scientific Fields

Repository

Basic Info

Statistics

Topics

Metadata Files

README.md

Fastsubtrees

Introduction

Working with the library

Installation

Command line interface

CLI example: working with the NCBI taxonomy tree

attributes

taxonomy names

Using NtSubtree

first run after installing automatically downloads and constructs the tree

attributes

check if a newer version of the taxonomy data is available

and update the tree if necessary, keeping the attribute values:

API

API example: working with the NCBI taxonomy tree

download the NCBI taxonomy data

taxonomy names

Using NtSubtree

check if a newer version of the taxonomy data is available

and update the tree if necessary, keeping the attribute values:

Docker

Tests

Benchmarks

Example application: Genome attributes viewer

Local installation and startup

Other subpackages

NtSubtree

ntdownload

ntmirror

Internals

Tree construction algorithm

Parallelizing the tree construction

Community guidelines

Documentation

Licence

Acknowledgements

Owner

JOSS Publication

Fastsubtrees: simple and efficient subtrees extractions in Python with applications to NCBI taxonomy

Authors

Editor

Tags

Citation (CITATION.cff)

GitHub Events

Total

Last Year

Committers

All Time

Past Year

Top Committers

Committer Domains (Top 20 + Academic)

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Packages

pypi.org: ntmirror

Rankings

Maintainers (1)

pypi.org: fastsubtrees

Rankings

Maintainers (1)

pypi.org: ntdownload

Rankings

Maintainers (1)