wgtda

Weighted Gene Topological Data Analysis (WGTDA) is a topology-based framework for identifying biomarkers in gene expression data.

https://github.com/ibm/wgtda

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.1%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: IBM
License: apache-2.0
Language: Python
Default Branch: main
Size: 67.1 MB

Statistics

Stars: 5
Watchers: 3
Forks: 3
Open Issues: 2
Releases: 0

Created almost 2 years ago · Last pushed 11 months ago

Metadata Files

Readme, License, Code of conduct, Citation

WGTDA (Weighted Gene Topological Data Analysis)

Weighted Gene Topological Data Analysis (WGTDA) is a topology-based framework for identifying biomarkers in gene expression data.

WGTDA Framework

WGTDA uses a set of computational topology techniques to uncover the local and global topological features of gene expression data.
The technique converts gene expression data into a simplicial complex built from gene-gene correlations and employs persistent homology to identify topological interactions at different scales.
The most persistent topological features are identified as biomarkers.
WGTDA uses maTILDA, a TDA library from IBM, to construct the simplicial complex and to perform persistent homology.
The paper documenting WGTDA is available on arXiv (arXiv:2402.08807).
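
The pipeline described above (correlations, then distances, then a Vietoris-Rips filtration, then persistent homology) can be illustrated for the simplest case, dimension 0. The sketch below is neither maTILDA's API nor the WGTDA implementation. It assumes the distance between two genes is 1 − |correlation|, uses a made-up 4-gene correlation matrix, and tracks only connected components with a union-find structure; the function name `h0_persistence` is hypothetical.

```python
from itertools import combinations


def h0_persistence(dist):
    """Return (birth, death) pairs for the connected components of a
    Vietoris-Rips filtration built on a symmetric distance matrix."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the component representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Edges enter the filtration in order of increasing distance.
    edges = sorted((dist[i][j], i, j) for i, j in combinations(range(n), 2))
    pairs = []
    for d, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:           # this edge merges two components
            parent[root_i] = root_j
            pairs.append((0.0, d))     # one component dies at scale d
    pairs.append((0.0, float("inf")))  # the last component never dies
    return pairs


# Hypothetical correlations between four genes (illustrative values only).
corr = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
# Assumed conversion: strongly correlated genes are close together.
dist = [[1.0 - abs(c) for c in row] for row in corr]

pairs = h0_persistence(dist)
lifespans = [death - birth for birth, death in pairs]
```

In this toy example the two tightly correlated gene pairs merge early, and the component that survives longest before merging reflects the weak link between the two pairs. Loops and voids (dimensions 1 and 2) require a full boundary-matrix reduction, which is the part a TDA library takes care of.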

Getting Started

To install WGTDA:

1. Make sure that the Python version you use is in line with our setup file. Using a fresh environment is always a good idea:

        conda create -n wgtda python=3.9 -y
        conda activate wgtda

2. Install the main branch to keep up to date with the latest supported features:

        pip install git+https://github.com/IBM/WGTDA

Contribution

Contributions to the WGTDA codebase are welcome!

Usage

Code

Have a look at the tutorial for more detailed usage of WGTDA.

```python
# Import packages and WGTDA modules
import pandas as pd
import numpy as np
from wgtda.correlation import compute_distance_correlation_matrix, compute_wto_matrix
from wgtda.filters import extract_persistent_holes, remove_infinite_values
from wgtda import (
    construct_vr_complex_rna_matrix,
    interactions_dataframe,
    filter_genes,
    convert_gene_exp_to_array_and_dict,
)

# Load the gene expression data from a .pkl file
gene_expression_df = pd.read_pickle('../data/TCGA/BRCA.pkl')

# Filter the gene expression data using the gene list from cancer_genes.txt
gene_expression_filtered = filter_genes(gene_expression_df, '../data/preselection/cancer_genes.txt')

# Convert the filtered data to an array and a gene dictionary
gene_expression_filtered_array, gene_dict = convert_gene_exp_to_array_and_dict(gene_expression_filtered)

# Compute the Distance Correlation matrix
distance_corr_matrix = compute_distance_correlation_matrix(gene_exp_arr=gene_expression_filtered_array)

# Or compute the Signed Weighted Topological Overlap (wTO) matrix
wto_matrix = compute_wto_matrix(gene_exp_arr=gene_expression_filtered_array)

# Build the Vietoris-Rips complex, extract the interactions, and save them
persistence, rips_complex = construct_vr_complex_rna_matrix(distance_corr_matrix)
interactions = interactions_dataframe(persistence, rips_complex, gene_dict)
interactions = remove_infinite_values(interactions)
interactions.to_csv("interactions.csv", index=True)
```
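
For readers unfamiliar with the distance correlation step used above, the following is a minimal pure-Python sketch of the sample distance correlation between two expression vectors, computed from double-centered pairwise distance matrices. It is an illustration only: the function names and sample values are assumptions, and it is not how WGTDA's `correlation` module is implemented.

```python
from math import sqrt


def _centered_distances(values):
    """Pairwise |v_i - v_j| matrix with row, column and grand means removed."""
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row_mean = [sum(row) / n for row in d]
    grand_mean = sum(row_mean) / n
    # The matrix is symmetric, so column means equal row means.
    return [[d[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def _mean_product(p, q):
    """Mean of the element-wise product of two square matrices."""
    n = len(p)
    return sum(p[i][j] * q[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length 1-D samples."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    a, b = _centered_distances(x), _centered_distances(y)
    dcov2 = _mean_product(a, b)
    dvar_x, dvar_y = _mean_product(a, a), _mean_product(b, b)
    if dvar_x == 0 or dvar_y == 0:
        return 0.0  # a constant sample carries no dependence information
    return sqrt(max(dcov2, 0.0) / sqrt(dvar_x * dvar_y))


# Illustrative values only.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.0 * v + 1.0 for v in gene_a]     # exact linear function of gene_a
gene_c = [(v - 3.0) ** 2 for v in gene_a]    # nonlinear function of gene_a

linear_dcor = distance_correlation(gene_a, gene_b)
nonlinear_dcor = distance_correlation(gene_a, gene_c)
```

Unlike the Pearson coefficient, which is zero for `gene_a` and `gene_c`, the distance correlation of that pair is strictly positive, which is why it is a reasonable choice for capturing nonlinear gene-gene relationships.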

Command-Line Interface

To use the tool via the command line, run the main.py script with the required arguments. Below are the command-line arguments supported by the tool:

--file_path or -p: The path to the gene expression data file. This argument is required.

--filter_genes_path or -fg: The path to a CSV or TXT file containing preselected genes. The tool will filter the dataset to include only these genes.

--output_path or -o: The path where the processed interactions CSV will be saved.

Filtering of topological features

--remove_infinite_values or -inf: Boolean. True (recommended) removes topological structures that tend to infinity; False keeps them.

--filter_persistence or -fp: Optional. An integer from 0 to 100 specifying the top percentage of values to extract from the 'lifespan' column. If provided, the tool will filter the dataset to include only these top values.
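
The two filters described above can be sketched on a toy table. This is an interpretation, not the tool's code: it assumes each interaction row has a numeric `lifespan`, that an infinite lifespan marks a feature that never dies, and that "top N percent" means the N% of rows with the largest lifespan, rounded up. The helper names and row contents are hypothetical.

```python
import math


def drop_infinite(rows):
    """Keep only the rows whose lifespan is finite."""
    return [row for row in rows if math.isfinite(row["lifespan"])]


def top_percent(rows, percent):
    """Keep the `percent`% of rows with the largest lifespan (rounded up)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    keep = math.ceil(len(rows) * percent / 100)
    ranked = sorted(rows, key=lambda row: row["lifespan"], reverse=True)
    return ranked[:keep]


# Made-up interactions; gene labels are placeholders.
rows = [
    {"genes": "G1-G2", "lifespan": 0.05},
    {"genes": "G3-G4", "lifespan": 0.40},
    {"genes": "G1-G3", "lifespan": math.inf},
    {"genes": "G2-G4", "lifespan": 0.25},
    {"genes": "G5-G6", "lifespan": 0.10},
]

finite_rows = drop_infinite(rows)
most_persistent = top_percent(finite_rows, 50)
```

Applying the infinite-value filter first matters in this sketch: otherwise the never-dying features would always occupy the top of the ranking.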

    python main.py --file_path data/TCGA/BRCA.pkl --filter_genes_path data/preselection/cancer_genes.csv --output_path output/interactions.csv --remove_infinite_values True
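
The arguments listed above could be wired up with `argparse` roughly as follows. This is a hypothetical reconstruction, not the contents of `main.py`: the flag names come from the argument list and the example command, while the defaults, the `str_to_bool` helper, and the 0-100 range check are assumptions.

```python
import argparse


def str_to_bool(text):
    """Parse 'True'/'False' style strings passed on the command line."""
    lowered = text.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {text!r}")


def build_parser():
    parser = argparse.ArgumentParser(description="WGTDA command-line sketch")
    parser.add_argument("--file_path", "-p", required=True,
                        help="path to the gene expression data file")
    parser.add_argument("--filter_genes_path", "-fg", default=None,
                        help="CSV or TXT file with preselected genes")
    parser.add_argument("--output_path", "-o", default="interactions.csv",
                        help="where to save the interactions CSV (assumed default)")
    parser.add_argument("--remove_infinite_values", "-inf", type=str_to_bool,
                        default=True, help="drop infinite features (assumed default)")
    parser.add_argument("--filter_persistence", "-fp", type=int,
                        choices=range(0, 101), metavar="[0-100]", default=None,
                        help="top percentage of 'lifespan' values to keep")
    return parser


# Parse the example command from this section, passed as an explicit list.
args = build_parser().parse_args([
    "--file_path", "data/TCGA/BRCA.pkl",
    "--filter_genes_path", "data/preselection/cancer_genes.csv",
    "--output_path", "output/interactions.csv",
    "--remove_infinite_values", "True",
])
```

Parsing the Boolean through a converter rather than `type=bool` avoids the common pitfall where any non-empty string, including "False", is treated as true.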

Outputs (Topological Gene Interactions)

The output file contains the proposed biomarkers identified through the WGTDA analysis. Each row in this file represents a topological interaction between genes in $Betti_0$, $Betti_1$, or $Betti_2$ space. Betti numbers distinguish topological spaces based on the connectivity of $n$-dimensional simplicial complexes. For example, $Betti_0$ is the number of connected components or clusters, $Betti_1$ is the number of non-contractible loops or cycles, and $Betti_2$ is the number of voids or enclosed regions in the data space.
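
To make the Betti numbers above concrete, the sketch below computes them for a few tiny simplicial complexes using the rank formula $Betti_k = \dim C_k - \mathrm{rank}\,\partial_k - \mathrm{rank}\,\partial_{k+1}$ over the two-element field. It is a self-contained illustration with hypothetical function names, unrelated to how maTILDA or WGTDA compute homology, and it ignores the filtration (persistence) aspect entirely.

```python
from itertools import combinations


def closure(top_simplices):
    """All faces of the given simplices, grouped by dimension and sorted."""
    by_dim = {}
    for simplex in top_simplices:
        verts = tuple(sorted(simplex))
        for size in range(1, len(verts) + 1):
            for face in combinations(verts, size):
                by_dim.setdefault(size - 1, set()).add(face)
    return {dim: sorted(faces) for dim, faces in by_dim.items()}


def rank_gf2(rows):
    """Rank over GF(2) of a matrix whose rows are integer bitmasks."""
    basis = {}  # leading bit position -> reduced row
    for row in rows:
        while row:
            lead = row.bit_length() - 1
            if lead not in basis:
                basis[lead] = row
                break
            row ^= basis[lead]  # eliminate the leading bit
    return len(basis)


def betti_numbers(top_simplices):
    """Betti numbers [b0, b1, ...] of the complex generated by the simplices."""
    faces = closure(top_simplices)
    max_dim = max(faces)
    ranks = {0: 0, max_dim + 1: 0}  # boundary maps outside the complex are zero
    for k in range(1, max_dim + 1):
        index = {face: i for i, face in enumerate(faces[k - 1])}
        rows = []
        for simplex in faces[k]:
            mask = 0
            # The boundary of a k-simplex is the sum of its (k-1)-dimensional faces.
            for facet in combinations(simplex, k):
                mask |= 1 << index[facet]
            rows.append(mask)
        ranks[k] = rank_gf2(rows)
    return [len(faces[k]) - ranks[k] - ranks[k + 1] for k in range(max_dim + 1)]


hollow_triangle = [(0, 1), (1, 2), (0, 2)]   # one loop
filled_triangle = [(0, 1, 2)]                # loop filled in
hollow_tetrahedron = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]  # one void

print(betti_numbers(hollow_triangle))     # [1, 1]
print(betti_numbers(filled_triangle))     # [1, 0, 0]
print(betti_numbers(hollow_tetrahedron))  # [1, 0, 1]
```

The hollow triangle has one component and one loop, filling it removes the loop, and the hollow tetrahedron encloses a single void, matching the descriptions of $Betti_0$, $Betti_1$, and $Betti_2$ given above.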

Higher Betti numbers indicate a more complex topological structure with more independent cycles or voids, which may be associated with more intricate and robust biological processes. By focusing on the most persistent interactions, researchers can prioritize genes for further experimental validation and study, ultimately contributing to the understanding and manipulation of lifespan-associated pathways.

Citation

If you use WGTDA for research, please consider citing the reference paper:

    @article{nyase2024wgtda,
      title={WGTDA: A Topological Perspective to Biomarker Discovery in Gene Expression Data},
      author={Nyase, Ndivhuwo and Mashatola, Lebohang and Kohlakala, Aviwe and Rhrissorrakrai, Kahn and Muller, Stephanie},
      journal={arXiv preprint arXiv:2402.08807},
      year={2024}
    }

Owner

Name: International Business Machines
Login: IBM
Kind: organization
Email: awesome@ibm.com
Location: United States of America

Website: https://www.ibm.com/opensource/
Twitter: ibmdeveloper
Repositories: 3,152
Profile: https://github.com/IBM

GitHub Events

Total

Watch event: 4
Push event: 3
Pull request event: 1
Fork event: 1
Create event: 1

Last Year

Watch event: 4
Push event: 3
Pull request event: 1
Fork event: 1
Create event: 1

Dependencies

requirements.txt pypi

bokeh ==3.4.1
dcor ==0.6
matilda ==0.0.1
matplotlib ==3.8.3
networkx ==3.2.1
numpy ==1.26.4
pandas ==2.2.2
pybind11 ==2.12.0
rectangle_packer ==2.0.2
scikit_learn ==1.4.2
scikit_learn ==1.5.0
scipy ==1.13.0
scipy ==1.13.1
seaborn ==0.13.2
setuptools ==68.2.2
sympy ==1.12
tqdm ==4.66.2
xgboost ==2.0.3

setup.py pypi

dcor >=0.6
matilda *
numpy >=1.21.0
pandas >=1.3.0
scipy >=1.7.0

src/wgtda/wgtda.egg-info/requires.txt pypi

dcor >=0.5.4
numpy >=1.21.0
pandas >=1.3.0
scipy >=1.7.0

src/wgtda.egg-info/requires.txt pypi

dcor >=0.6
numpy >=1.21.0
pandas >=1.3.0
scipy >=1.7.0
