wgtda
Weighted Gene Topological Data Analysis (WGTDA) is a topology-based framework for identifying biomarkers in gene expression data.
Repository
Basic Info
- Host: GitHub
- Owner: IBM
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 67.1 MB
Statistics
- Stars: 5
- Watchers: 3
- Forks: 3
- Open Issues: 2
- Releases: 0
Metadata Files
README.md
WGTDA (Weighted Gene Topological Data Analysis)
Weighted Gene Topological Data Analysis (WGTDA) is a topology-based framework for identifying biomarkers in gene expression data.

- WGTDA uses a set of computational topology techniques to uncover the local and global topological features of gene expression data.
- The technique converts gene expression data into a gene-gene correlation-based simplicial complex and employs persistent homology to identify topological interactions at different topological scales.
- The topological features that are the most persistent are identified as biomarkers.
- WGTDA uses maTILDA, a TDA library from IBM, to construct the simplicial complex and to perform persistent homology.
- The paper documenting WGTDA is available on arXiv (arXiv:2402.08807).
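The pipeline described above can be illustrated with a minimal sketch that uses only the standard library. It is not WGTDA's or maTILDA's implementation: it assumes Pearson correlation, converts it to a distance with 1 - |r| (an illustrative choice), and tracks only dimension-0 features (connected components), whose lifespans in a Vietoris-Rips filtration are the distances at which components merge. The function names and the toy expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def h0_persistence(expr):
    """Return (death_distance, gene_a, gene_b) for every component merge.

    expr maps a gene name to its expression values across samples.
    Every component is born at distance 0, so the death distance is
    also the lifespan of the component that disappears.
    """
    genes = sorted(expr)
    # All gene pairs, ordered by the assumed distance 1 - |r|
    edges = sorted(
        (1 - abs(pearson(expr[a], expr[b])), a, b)
        for i, a in enumerate(genes)
        for b in genes[i + 1:]
    )
    parent = {g: g for g in genes}

    def find(g):
        # Union-find root lookup with path halving
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    merges = []
    for dist, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:  # this edge joins two separate components
            parent[ra] = rb
            merges.append((dist, a, b))
    return merges


# Hypothetical toy data: four genes measured in five samples
toy = {
    "G1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "G2": [2.1, 3.9, 6.2, 8.0, 9.9],  # nearly proportional to G1
    "G3": [5.0, 3.0, 4.0, 1.0, 2.0],
    "G4": [1.0, 5.0, 2.0, 4.0, 3.0],
}
merges = h0_persistence(toy)

# The most persistent components are the ones that merge last
for dist, a, b in sorted(merges, reverse=True):
    print(f"{a}-{b}: lifespan {dist:.3f}")
```

Loops and voids (dimensions 1 and 2) require a full persistent homology computation, which is the part WGTDA delegates to maTILDA.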
Getting Started
To install WGTDA:
Make sure the Python version you use is in line with our setup file. Using a fresh environment is always a good idea:
```commandline
conda create -n wgtda python=3.9 -y
conda activate wgtda
```
Install the `main` branch to keep up to date with the latest supported features:
```commandline
pip install git+https://github.com/IBM/WGTDA
```
Contribution
Contributions to the WGTDA codebase are welcome!
Usage
Code
Have a look at the tutorial for more detailed usage of WGTDA.
```python
# Import packages and WGTDA modules
import pandas as pd
import numpy as np
from wgtda.correlation import compute_distance_correlation_matrix, compute_wto_matrix
from wgtda.filters import extract_persistent_holes, remove_infinite_values
from wgtda import (
    construct_vr_complex_rna_matrix,
    interactions_dataframe,
    filter_genes,
    convert_gene_exp_to_array_and_dict,
)

# Load the gene expression data from a .pkl file
gene_expression_df = pd.read_pickle('../data/TCGA/BRCA.pkl')

# Filter the gene expression data using the gene list from cancer_genes.txt
gene_expression_filtered = filter_genes(gene_expression_df, '../data/preselection/cancer_genes.txt')
gene_expression_filtered_array, gene_dict = convert_gene_exp_to_array_and_dict(gene_expression_filtered)

# Compute the Distance Correlation matrix
distance_corr_matrix = compute_distance_correlation_matrix(gene_exp_arr=gene_expression_filtered_array)

# Or compute the Signed Weighted Topological Overlap (wTO) matrix
wto_matrix = compute_wto_matrix(gene_exp_arr=gene_expression_filtered_array)

# Build the Vietoris-Rips complex, extract the interactions, and save them
persistence, rips_complex = construct_vr_complex_rna_matrix(distance_corr_matrix)
interactions = interactions_dataframe(persistence, rips_complex, gene_dict)
interactions = remove_infinite_values(interactions)
interactions.to_csv("interactions.csv", index=True)
```
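The distance correlation step in the usage example relies on a dependence measure that, unlike Pearson correlation, also picks up non-linear relationships. Below is a minimal sketch of the standard sample estimator (double-centered pairwise distances) for two 1-D vectors. It illustrates the idea only and is not the code used by WGTDA or by the `dcor` package listed in the dependencies; the function names are hypothetical.

```python
import math


def centered_distances(v):
    """Pairwise |v_i - v_j| distances with row, column and grand means removed."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row_mean = [sum(r) / n for r in d]  # equals the column means, d is symmetric
    grand_mean = sum(row_mean) / n
    return [
        [d[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def mean_product(p, q):
    """Mean of the element-wise product of two square matrices."""
    n = len(p)
    return sum(p[i][j] * q[i][j] for i in range(n) for j in range(n)) / n ** 2


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length sequences, in [0, 1]."""
    a, b = centered_distances(x), centered_distances(y)
    dcov2 = mean_product(a, b)
    denom = math.sqrt(mean_product(a, a) * mean_product(b, b))
    if denom == 0:
        return 0.0  # at least one input is constant
    return math.sqrt(max(dcov2, 0.0) / denom)


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = [2 * v + 1 for v in x]   # perfectly linear in x
quadratic = [v * v for v in x]    # Pearson correlation with x is zero

dcor_linear = distance_correlation(x, linear)
dcor_quadratic = distance_correlation(x, quadratic)
print(dcor_linear, dcor_quadratic)
```

The quadratic case is the reason to prefer this measure for gene-gene relationships: the dependence is clearly non-zero even though the linear correlation vanishes.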
Command-Line Interface
To use the tool via the command line, run the main.py script with the required arguments. Below are the command-line arguments supported by the tool:
--file_path or -p: The path to the gene expression data file. This argument is required.
--filter_genes_path or -fg: The path to a CSV or txt file containing preselected genes. The tool will filter the dataset to include only these genes.
--output_path or -o: The path where the processed interactions CSV will be saved.
Filtering of topological features
--remove_infinite_values or -inf: Boolean. True (recommended) removes topological structures whose lifespan tends to infinity; False keeps them.
--filter_persistence or -fp: Optional. An integer from 0 to 100 specifying the top percentage of values to extract from the 'lifespan' column. If provided, the tool filters the dataset to include only these top values.
```commandline
python main.py --file_path data/TCGA/BRCA.pkl --filter_genes_path data/preselection/cancer_genes.csv --output_path output/interactions.csv --remove_infinite_values True
```
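The two filtering options can be pictured with a short sketch. It assumes the output CSV has a numeric 'lifespan' column (the name comes from the option description above) and interprets "top percentage" as keeping that share of rows ranked by lifespan; the `genes` column, the sample rows and the function name are hypothetical, and the real tool may behave differently.

```python
import csv
import io
import math


def filter_interactions(rows, remove_infinite=True, top_percent=None):
    """Filter interaction rows by their 'lifespan' value.

    rows: list of dicts with a numeric 'lifespan' entry.
    remove_infinite: drop rows whose lifespan is infinite.
    top_percent: if given (0-100), keep only that top share of rows by lifespan.
    """
    kept = [r for r in rows if not (remove_infinite and math.isinf(r["lifespan"]))]
    if top_percent is not None:
        kept.sort(key=lambda r: r["lifespan"], reverse=True)
        n_keep = math.ceil(len(kept) * top_percent / 100)
        kept = kept[:n_keep]
    return kept


# Hypothetical CSV content; the real column layout may differ
sample_csv = """genes,lifespan
A|B,0.10
C|D,0.45
E|F,inf
G|H,0.30
I|J,0.05
"""
rows = [
    {"genes": r["genes"], "lifespan": float(r["lifespan"])}
    for r in csv.DictReader(io.StringIO(sample_csv))
]

top = filter_interactions(rows, remove_infinite=True, top_percent=50)
print([r["genes"] for r in top])  # prints ['C|D', 'G|H']
```

Removing infinite lifespans first matters here: otherwise features that never die would always occupy the top of the ranking.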
Outputs (Topological Gene Interactions)
The output file contains the proposed biomarkers identified through the WGTDA analysis. Each row in this file represents a topological interaction between genes in $Betti_0$, $Betti_1$, or $Betti_2$ space. Betti numbers distinguish topological spaces by the connectivity of their $n$-dimensional simplicial complexes. For example, $Betti_0$ counts the connected components (clusters), $Betti_1$ counts the non-contractible loops (cycles), and $Betti_2$ counts the voids (enclosed regions) in the data space.
Higher Betti numbers indicate a more complex topological structure with more independent cycles or voids, which may be associated with more intricate and robust biological processes. By focusing on the most persistent interactions, researchers can prioritize genes for further experimental validation and study, ultimately contributing to the understanding and manipulation of lifespan-associated pathways.
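To make the Betti numbers concrete, the sketch below computes them for tiny hand-built simplicial complexes using ranks of boundary matrices over GF(2). It is a textbook calculation, not part of WGTDA or maTILDA, and the function names are hypothetical. Each complex is given by its maximal simplices as tuples of vertex labels.

```python
from itertools import combinations


def rank_gf2(rows):
    """Rank over GF(2) of a matrix whose rows are integer bitmasks."""
    rows = list(rows)
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        low = pivot & -pivot  # lowest set bit of the pivot row
        rows = [r ^ pivot if r & low else r for r in rows]
    return rank


def betti_numbers(simplices, max_dim=2):
    """Betti numbers b_0..b_max_dim of the complex spanned by the given simplices."""
    # Close the input under taking faces
    complex_ = set()
    for s in simplices:
        s = tuple(sorted(s))
        for k in range(1, len(s) + 1):
            complex_.update(combinations(s, k))
    by_dim = {
        d: sorted(s for s in complex_ if len(s) == d + 1)
        for d in range(max_dim + 2)
    }

    def boundary_rank(d):
        """Rank of the boundary map from d-simplices to (d-1)-simplices."""
        if d == 0 or not by_dim[d]:
            return 0
        index = {face: i for i, face in enumerate(by_dim[d - 1])}
        rows = []
        for s in by_dim[d]:
            mask = 0
            for face in combinations(s, d):  # the faces with one vertex removed
                mask |= 1 << index[face]
            rows.append(mask)
        return rank_gf2(rows)

    # b_d = (number of d-simplices) - rank(boundary_d) - rank(boundary_{d+1})
    return [
        len(by_dim[d]) - boundary_rank(d) - boundary_rank(d + 1)
        for d in range(max_dim + 1)
    ]


hollow_triangle = [(0, 1), (1, 2), (0, 2)]
filled_triangle = [(0, 1, 2)]
hollow_tetrahedron = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]

print(betti_numbers(hollow_triangle))     # prints [1, 1, 0]: one component, one loop
print(betti_numbers(filled_triangle))     # prints [1, 0, 0]: the loop is filled in
print(betti_numbers(hollow_tetrahedron))  # prints [1, 0, 1]: one enclosed void
```

In a gene-gene complex the vertices would be genes, so a loop or void corresponds to a group of genes connected pairwise along the cycle but not filled in by higher-order interactions.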
Citation
If you use WGTDA for research, please consider citing the reference paper:
```bibtex
@article{nyase2024wgtda,
  title={WGTDA: A Topological Perspective to Biomarker Discovery in Gene Expression Data},
  author={Nyase, Ndivhuwo and Mashatola, Lebohang and Kohlakala, Aviwe and Rhrissorrakrai, Kahn and Muller, Stephanie},
  journal={arXiv preprint arXiv:2402.08807},
  year={2024}
}
```
Owner
- Name: International Business Machines
- Login: IBM
- Kind: organization
- Email: awesome@ibm.com
- Location: United States of America
- Website: https://www.ibm.com/opensource/
- Twitter: ibmdeveloper
- Repositories: 3,152
- Profile: https://github.com/IBM
GitHub Events
Total
- Watch event: 4
- Push event: 3
- Pull request event: 1
- Fork event: 1
- Create event: 1
Last Year
- Watch event: 4
- Push event: 3
- Pull request event: 1
- Fork event: 1
- Create event: 1
Dependencies
- bokeh ==3.4.1
- dcor ==0.6
- matilda ==0.0.1
- matplotlib ==3.8.3
- networkx ==3.2.1
- numpy ==1.26.4
- pandas ==2.2.2
- pybind11 ==2.12.0
- rectangle_packer ==2.0.2
- scikit_learn ==1.4.2
- scikit_learn ==1.5.0
- scipy ==1.13.0
- scipy ==1.13.1
- seaborn ==0.13.2
- setuptools ==68.2.2
- sympy ==1.12
- tqdm ==4.66.2
- xgboost ==2.0.3
- dcor >=0.6
- matilda *
- numpy >=1.21.0
- pandas >=1.3.0
- scipy >=1.7.0
- dcor >=0.5.4
- numpy >=1.21.0
- pandas >=1.3.0
- scipy >=1.7.0
- dcor >=0.6
- numpy >=1.21.0
- pandas >=1.3.0
- scipy >=1.7.0