redvals
Tools for obtaining RED values from GTDB phylogenetic trees
Repository
Basic Info
- Host: GitHub
- Owner: HaigBishop
- License: mit
- Language: Python
- Default Branch: main
- Size: 28 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
redvals
Tool for obtaining and accessing Relative Evolutionary Divergence (RED) values from GTDB phylogenetic trees.
GTDB r226 Warning
There appears to be an issue with the GTDB release 226 data, which only affects the archaeal tree. The corresponding RED values file (gtdbtk_r226_ar53.tsv) seems to use a different root to the tree file, which leads to the lack of RED values for two internal nodes. When you run decorate_from_tsv() with the r226 files, redvals will detect this and warn you. You will be given the option to ignore the issue and proceed. If you choose to ignore it, the undecorated nodes will be assigned an arbitrary RED value of 0.8.
Because of this, the pre-computed decorated archaeal tree for r226 provided in this repository (decorated_trees/ar53_r226_decorated.pkl) also contains this arbitrary value and should be used with caution.
Set Up
- Clone repository
git clone https://github.com/HaigBishop/redvals.git
cd redvals
- Install dependencies (biopython, pandas and tqdm)
conda create -n redvals_env python=3.12 biopython=1.85 pandas=2.3.2 tqdm=4.67.1
conda activate redvals_env
- Use package
- See usage below
- Or see in-depth examples in example_1.py and example_2.py
Usage
Import redvals.RedTree
The RedTree object represents the bacterial and archaeal phylogenetic trees.
python
from redvals import RedTree
Load Undecorated Trees (Init Method 1)
This loads the original Newick trees, which carry no RED values, and assigns 'redvals IDs' to all nodes.
python
red_trees = RedTree("trees/bac120_r220.tree", "trees/ar53_r220.tree")
Decorate Trees from TSVs (takes 30-90 minutes)
This adds RED values and RED distances to every node in both trees.
python
red_trees.decorate_from_tsv("red_values/gtdbtk_r220_bac120.tsv", "red_values/gtdbtk_r220_ar53.tsv")
Write Decorated Trees
This saves the trees with the RED values and distances as .pkl files, avoiding repetition of the time-consuming decoration process.
python
red_trees.write_decorated_trees("decorated_trees/bac120_r220_decorated.pkl", "decorated_trees/ar53_r220_decorated.pkl")
Load Decorated Trees (Init Method 2)
Load the already decorated trees produced by write_decorated_trees.
python
red_trees = RedTree("decorated_trees/bac120_r220_decorated.pkl", "decorated_trees/ar53_r220_decorated.pkl")
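Because decoration is slow and loading a pickle is fast, a script can check for a saved result before redoing the work. The sketch below shows that general pattern with the standard library only. `load_or_build` and `slow_build` are hypothetical names used for illustration, not part of redvals, and the dictionary is a placeholder rather than real RED data.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_build(cache_path: Path, build: Callable[[], Any]) -> Any:
    """Return the pickled object at cache_path, building and saving it first if absent."""
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    obj = build()  # the slow step, e.g. decorating a tree
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(obj, f)
    return obj


# Demo with a stand-in for the expensive step.
calls = []


def slow_build() -> dict:
    calls.append(1)  # record each time the slow path runs
    return {"bac00000001": 0.0}  # placeholder payload, not real RED data


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "decorated_trees" / "demo.pkl"
    first = load_or_build(cache, slow_build)   # builds and writes the cache
    second = load_or_build(cache, slow_build)  # reads the cache, no rebuild
```

With redvals, the same idea would amount to calling decorate_from_tsv and write_decorated_trees only when the .pkl files do not exist yet.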
Convert Node IDs
The IDs found in the original GTDB .tree files are kept (called gtdb_ids). However, most internal nodes have no ID in those files, so every node is also assigned a new ID, called a redvals_id. You can use either type at any time; to convert between them, use the methods below.
python
bacterial_redvals_id = red_trees.get_redvals_id("GB_GCA_002687935.1")
gtdb_id = red_trees.get_gtdb_id("bac00000001")
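When a script handles both kinds of ID, it can help to tell them apart by shape. The sketch below assumes that a redvals ID is three lowercase letters followed by eight digits; this pattern is inferred only from examples such as "bac00000001" and is not confirmed by the library. `looks_like_redvals_id` is a hypothetical helper, not a redvals function.

```python
import re

# Assumed pattern, inferred from examples such as "bac00000001".
REDVALS_ID_PATTERN = re.compile(r"[a-z]{3}\d{8}")


def looks_like_redvals_id(node_id: str) -> bool:
    """Return True if node_id matches the assumed redvals ID shape."""
    return REDVALS_ID_PATTERN.fullmatch(node_id) is not None


for candidate in ("bac00000001", "GB_GCA_002687935.1", "RS_GCF_001186155.3"):
    print(candidate, looks_like_redvals_id(candidate))
```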
Get All Node IDs
You can retrieve all node IDs with optional filtering.
python
all_gtdb_ids = red_trees.get_node_ids(domain='both', node_type='both', id_type='gtdb') # (many internal nodes do not have GTDB IDs)
internal_archaeal_redvals_ids = red_trees.get_node_ids(domain='arc', node_type='internal', id_type='redvals')
Get All Nodes
You can retrieve all nodes with optional filtering.
python
all_leaf_nodes = red_trees.get_nodes(domain='both', node_type='leaf')
all_bacterial_nodes = red_trees.get_nodes(domain='bac', node_type='both')
all_bacterial_leaf_nodes = red_trees.get_nodes(domain='bac', node_type='leaf')
Access Node Info
Many attributes of nodes can easily be accessed using get_node_info.
python
node_info = red_trees.get_node_info("bac00000001")
print(f"GTDB ID: {node_info.gtdb_id}, redvals ID: {node_info.redvals_id}, Domain: {node_info.domain}")
print(f"RED value: {node_info.red_value}, RED distance: {node_info.red_distance}, Is terminal?: {node_info.is_terminal}")
Compute RED Distances
For any two nodes, dist_between_nodes gives the RED distance between them and the redvals_id of their MRCA (most recent common ancestor). This works for two leaf nodes, two internal nodes, or an internal node and a leaf node.
python
red_distance, mrca_node_id = red_trees.dist_between_nodes("bac00000001", "RS_GCF_001186155.3")
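To make the idea concrete, here is a toy version on a five-node tree, written without redvals. It assumes that the RED distance between two nodes is the sum of the RED differences from each node up to their MRCA; the library's actual definition may differ. The tree, the RED values, and the function names are all made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Toy tree stored as child -> parent (the root has no parent).
parent: Dict[str, Optional[str]] = {
    "root": None,
    "n1": "root",
    "leaf_a": "n1",
    "leaf_b": "n1",
    "leaf_c": "root",
}
# Made-up RED values: 0.0 at the root, 1.0 at the leaves.
red: Dict[str, float] = {"root": 0.0, "n1": 0.5, "leaf_a": 1.0, "leaf_b": 1.0, "leaf_c": 1.0}


def path_to_root(node: str) -> List[str]:
    """List the node and all of its ancestors, ending at the root."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = parent[current]
    return path


def toy_dist_between_nodes(a: str, b: str) -> Tuple[float, str]:
    """Return (assumed RED distance, MRCA) for two nodes of the toy tree."""
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first node that is also on a's path is the MRCA.
    mrca = next(n for n in path_to_root(b) if n in ancestors_of_a)
    distance = (red[a] - red[mrca]) + (red[b] - red[mrca])
    return distance, mrca


print(toy_dist_between_nodes("leaf_a", "leaf_b"))
print(toy_dist_between_nodes("leaf_a", "leaf_c"))
print(toy_dist_between_nodes("n1", "leaf_a"))
```

The third call shows the internal-plus-leaf case: when one node is an ancestor of the other, it is itself the MRCA.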
Map Taxon Names to Nodes
Given a taxon name (e.g. g__Escherichia), get_distance_in_taxon(taxon_name) returns the RED distance between any pair of leaf nodes whose MRCA is the node representing that taxon. Before using get_distance_in_taxon(taxon_name), however, map_taxa_to_nodes(seqs_fasta) must be run, where seqs_fasta is a GTDB database FASTA file such as ssu_all_r220.fna.
```python
# Running map_taxa_to_nodes may take 20-40 minutes
red_trees.map_taxa_to_nodes("D:/16Sdatabases/ssu_all_r220.fna", save_result_path="./taxon_mappings/taxon_to_node_mapping_220.pkl")

# Then these are fast
red_distance = red_trees.get_distance_in_taxon("p__Nitrospirota")
s_terrae_node = red_trees.get_node_from_taxon_name("s__Spirillospora terrae")
```
Load Pre-Computed Taxon Name Mappings
If you previously ran map_taxa_to_nodes and saved the result, you can use load_taxa_to_node_mapping to quickly load the taxon -> node mappings.
```python
red_trees.load_taxa_to_node_mapping("./taxon_mappings/taxon_to_node_mapping_220.pkl")
red_distance = red_trees.get_distance_in_taxon("p__Nitrospirota")
s_terrae_node = red_trees.get_node_from_taxon_name("s__Spirillospora terrae")
```
Files
Original Newick Tree Files
- ./trees/bac120_r220.tree
- ./trees/ar53_r220.tree
These are the GTDB phylogenetic trees (release 220) in Newick format. Obtained from: https://gtdb.ecogenomic.org/downloads
Example Usage:
```python
from Bio import Phylo

# Load the Newick tree file of the bacterial GTDB tree
bac120_tree = Phylo.read("./trees/bac120_r220.tree", "newick")

# Print some info about the tree
terminal_nodes = bac120_tree.get_terminals()
non_terminal_nodes = bac120_tree.get_nonterminals()
print("5 leaf node names:", [node.name for node in terminal_nodes[:5]])
print("5 internal node names:", [node.name for node in non_terminal_nodes[:5]])
print("Number of leaf nodes:", len(terminal_nodes))
print("Number of internal nodes:", len(non_terminal_nodes))
```
RED Values TSV Files
- ./red_values/gtdbtk_r220_bac120.tsv
- ./red_values/gtdbtk_r220_ar53.tsv
These TSV files contain the RED values for all nodes in each tree. They originate from the release 220 gtdbtk_package directory from: https://gtdb.ecogenomic.org/downloads i.e. https://data.ace.uq.edu.au/public/gtdb/data/releases/release220/220.0/auxillary_files/gtdbtk_package/full_package/
There is one row for each node (terminal and nonterminal). The first column holds one or two leaf IDs (AKA genome IDs), and the second column holds the RED value assigned to that node. For example, in the first column we might find "GB_GCA_020721905.1" or "GB_GCA_026414805.1|GB_GCA_020721905.1". A single leaf ID means the row is a leaf node, so the RED value in column two is 1.0. Two leaf IDs separated by a "|" symbol mean the row represents an internal node: specifically, the node that is the most recent common ancestor (MRCA) of the two leaf IDs in column one.
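The row layout described above can be read with the standard csv module. The following is a minimal sketch, assuming the file has no header row and that only the first two columns matter; `parse_red_rows` is a hypothetical helper, and the 0.9 in the sample data is a made-up RED value.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple

# Two sample rows in the format described above (0.9 is a made-up value).
sample_tsv = (
    "GB_GCA_020721905.1\t1.0\n"
    "GB_GCA_026414805.1|GB_GCA_020721905.1\t0.9\n"
)


def parse_red_rows(handle: TextIO) -> Tuple[Dict[str, float], List[Tuple[str, str, float]]]:
    """Split rows into leaf RED values and (leaf_1, leaf_2, RED) internal-node entries."""
    leaves: Dict[str, float] = {}
    internal: List[Tuple[str, str, float]] = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        ids = row[0].split("|")
        red_value = float(row[1])
        if len(ids) == 1:
            leaves[ids[0]] = red_value  # single ID: leaf node
        else:
            internal.append((ids[0], ids[1], red_value))  # two IDs: MRCA of the pair
    return leaves, internal


leaves, internal = parse_red_rows(io.StringIO(sample_tsv))
print(leaves)
print(internal)
```

For a real file, pass an open file handle (for example `open(path, newline="")`) instead of the `io.StringIO` wrapper.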
Decorated Tree Files
- ./decorated_trees/bac120_r220_decorated.pkl
- ./decorated_trees/ar53_r220_decorated.pkl
These files are Python pickle files, each containing a Bio.Phylo.Newick.Tree object. Loading them (see below) yields the same kind of object as loading the .tree files (as above), except that the nodes are decorated with RED values and RED distances.
Example Usage:
```python
from Bio import Phylo
import pickle

# Load the pickle file of the decorated bacterial GTDB tree
with open("./decorated_trees/bac120_r220_decorated.pkl", "rb") as f:
    decorated_bac120_tree = pickle.load(f)

# Access a RED value using the tree
node_520 = decorated_bac120_tree.get_node("bac00000520")
print("RED value of 'bac00000520':", node_520.red_value)
```
Taxon to Node Mapping
./taxon_mappings/taxon_to_node_mapping_220.pkl
This file is a Python pickle file containing a dictionary which simply maps taxon_name to redvals_id, where redvals_id is the ID of the node representing that clade. For example, "g__Escherichia" -> "bac00002631" or "p__Nitrospirota" -> "bac00079003". This is used for get_node_from_taxon_name() and get_distance_in_taxon().
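Since the mapping is a plain dictionary, it can also be inspected directly with pickle. The sketch below writes a two-entry stand-in file (using the example pairs above) to a temporary directory and reads it back, so it runs without the real mapping file. The file name and the absent taxon are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the real mapping, using the two example pairs from the text.
example_mapping = {"g__Escherichia": "bac00002631", "p__Nitrospirota": "bac00079003"}

with tempfile.TemporaryDirectory() as tmp:
    mapping_path = Path(tmp) / "taxon_to_node_mapping_demo.pkl"  # hypothetical file name
    with mapping_path.open("wb") as f:
        pickle.dump(example_mapping, f)
    with mapping_path.open("rb") as f:
        taxon_to_node = pickle.load(f)

node_id = taxon_to_node.get("g__Escherichia")    # redvals ID of the clade's node
missing = taxon_to_node.get("g__NotARealGenus")  # None when the taxon is absent
print(node_id, missing)
```

As with any pickle, only load mapping files from a source you trust, because unpickling can execute arbitrary code.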
How to Contribute
Contributions are very welcome.
- Fork the repository and clone it locally
- Create a new branch for your feature or bug fix
- Make your changes and write tests if applicable
- Run existing tests to ensure nothing was broken
- Submit a pull request with a clear description of your changes
To Do
- add good docstrings to RedTree
- add comprehensive tests to all_tests.py
- add to PyPI
License
This project is licensed under the MIT License - see the LICENSE file for details.
Data Licensing
This tool processes data from the Genome Taxonomy Database (GTDB). The GTDB data is subject to its own licensing terms:
- GTDB data (including phylogenetic trees and taxonomy) is available under a CC BY-SA 4.0 license
Other Info
Version: 0.2.0 (2025-09-04)
Author: Haig Bishop
Email: haig.bishop@pg.canterbury.ac.nz
GitHub: https://github.com/HaigBishop/redvals
Owner
- Name: Haig Bishop
- Login: HaigBishop
- Kind: user
- Location: Christchurch, New Zealand
- Repositories: 1
- Profile: https://github.com/HaigBishop
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Bishop"
    given-names: "Haig"
    email: "haigvbishop@gmail.com"
title: "redvals"
version: 0.1.1
date-released: 2025-03-28
url: "https://github.com/HaigBishop/redvals"
repository-code: "https://github.com/HaigBishop/redvals"
license: MIT
GitHub Events
Total
- Watch event: 1
- Push event: 23
- Create event: 2
Last Year
- Watch event: 1
- Push event: 23
- Create event: 2
Dependencies
- biopython *
- pandas *
- tqdm *