wgtda

Weighted Gene Topological Data Analysis (WGTDA) is a topology-based framework for identifying biomarkers in gene expression data.

https://github.com/ibm/wgtda

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.1%) to scientific vocabulary
Last synced: 10 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: IBM
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 67.1 MB
Statistics
  • Stars: 5
  • Watchers: 3
  • Forks: 3
  • Open Issues: 2
  • Releases: 0
Created almost 2 years ago · Last pushed 11 months ago
Metadata Files
Readme License Code of conduct Citation

README.md

Code style: black

WGTDA (Weighted Gene Topological Data Analysis)

Weighted Gene Topological Data Analysis (WGTDA) is a topology-based framework for identifying biomarkers in gene expression data.

WGTDA Framework

  • WGTDA uses a set of computational topology techniques to uncover the local and global topological features of gene expression data.
  • The technique converts gene expression data into a gene-gene correlation-based simplicial complex and employs persistent homology to identify topological interactions at different topological scales.
  • The most persistent topological features are identified as biomarkers.
  • WGTDA uses maTILDA, a TDA library from IBM, to construct the simplicial complex and to perform persistent homology.
  • The paper documenting WGTDA is available on arXiv (arXiv:2402.08807).
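
The steps above can be illustrated with a minimal, self-contained sketch. It is not WGTDA's implementation and does not call maTILDA. It assumes Pearson correlation as a stand-in for the correlation measure, uses `1 - |r|` as the gene-gene distance, and tracks only 0-dimensional features (connected components) of the resulting Vietoris-Rips filtration with a union-find structure. The function names `pearson`, `correlation_edges`, and `h0_persistence`, and the toy gene names, are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_edges(profiles):
    """Return (distance, gene_a, gene_b) for every gene pair, sorted by distance.

    Assumption: distance = 1 - |r|, so strongly (anti)correlated genes are close.
    """
    genes = sorted(profiles)
    return sorted(
        (1.0 - abs(pearson(profiles[a], profiles[b])), a, b)
        for i, a in enumerate(genes)
        for b in genes[i + 1:]
    )


def h0_persistence(genes, edges):
    """0-dimensional persistence: each merge of two components ends one feature.

    Every component is born at scale 0 and dies at the distance of the edge
    that merges it into another component. The single component that never
    dies (infinite lifespan) is not reported.
    """
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    pairs = []
    for dist, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            pairs.append({"genes": (a, b), "birth": 0.0, "death": dist, "lifespan": dist})
    return pairs


# Toy expression profiles (hypothetical values, five samples per gene).
profiles = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 11],
    "C": [5, 1, 4, 2, 3],
}
pairs = h0_persistence(profiles.keys(), correlation_edges(profiles))
```

Sorting `pairs` by `lifespan` in descending order gives the most persistent 0-dimensional features of this toy example. Loops and voids require a full persistent homology computation, which the sketch does not attempt.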

Getting Started

To install WGTDA:

  1. Make sure the Python version you use is in line with our setup file. Using a fresh environment is always a good idea:

         conda create -n wgtda python=3.9 -y
         conda activate wgtda

  2. Install the main branch to keep up to date with the latest supported features:

         pip install git+https://github.com/IBM/WGTDA

    Contribution

    Contributions to the WGTDA codebase are welcome!

Usage

Code

Have a look at the tutorial for more detailed usage of WGTDA.

```python
# Import packages and WGTDA modules
import pandas as pd
import numpy as np
from wgtda.correlation import compute_distance_correlation_matrix, compute_wto_matrix
from wgtda.filters import extract_persistent_holes, remove_infinite_values
from wgtda import (
    construct_vr_complex_rna_matrix,
    interactions_dataframe,
    filter_genes,
    convert_gene_exp_to_array_and_dict,
)

# Load the gene expression data from a .pkl file
gene_expression_df = pd.read_pickle('../data/TCGA/BRCA.pkl')

# Filter the gene expression data using the gene list from cancer_genes.txt
gene_expression_filtered = filter_genes(gene_expression_df, '../data/preselection/cancer_genes.txt')

gene_expression_filtered_array, gene_dict = convert_gene_exp_to_array_and_dict(gene_expression_filtered)

# Compute the Distance Correlation matrix
distance_corr_matrix = compute_distance_correlation_matrix(gene_exp_arr=gene_expression_filtered_array)

# Or compute the Signed Weighted Topological Overlap (wTO) matrix
wto_matrix = compute_wto_matrix(gene_exp_arr=gene_expression_filtered_array)

# Build the complex from the distance matrix, then tabulate and save the interactions
persistence, rips_complex = construct_vr_complex_rna_matrix(distance_corr_matrix)
interactions = interactions_dataframe(persistence, rips_complex, gene_dict)
interactions = remove_infinite_values(interactions)
interactions.to_csv("interactions.csv", index=True)
```
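
To make the distance correlation option above more concrete, here is a rough pure-Python sketch of the sample distance correlation between two expression profiles, the quantity a distance correlation matrix collects for every gene pair. It follows the standard double-centering definition and is not taken from WGTDA's code, which lists the `dcor` package among its dependencies. The function names are hypothetical.

```python
import math


def _double_centered_distances(values):
    """Pairwise |a - b| distances with row, column, and grand means removed."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in dist]  # equals column means (symmetric)
    grand_mean = sum(row_means) / n
    return [
        [dist[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def _mean_product(a, b):
    """Mean of the element-wise product of two square matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length 1-D profiles, in [0, 1]."""
    a = _double_centered_distances(x)
    b = _double_centered_distances(y)
    dcov2 = _mean_product(a, b)
    denom = math.sqrt(_mean_product(a, a) * _mean_product(b, b))
    if denom == 0:
        return 0.0  # at least one profile is constant
    return math.sqrt(max(dcov2, 0.0) / denom)


# A symmetric, nonlinear relationship: Pearson correlation is zero here,
# while distance correlation stays clearly above zero.
x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]
nonlinear_dcor = distance_correlation(x, y)
linear_dcor = distance_correlation(x, [2 * v + 1 for v in x])
```

Unlike Pearson correlation, this measure also responds to nonlinear dependence, which is one plausible reason to prefer it for gene-gene relationships. How WGTDA turns the resulting matrix into distances is not shown in this README.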

Command-Line Interface

To use the tool via the command line, run the main.py script with the required arguments. Below are the command-line arguments supported by the tool:

--file_path or -p: The path to the gene expression data file. This argument is required.

--filter_genes_path or -fg: The path to a CSV file or txt file containing preselected genes. The tool will filter the dataset to include only these genes.

--output_path or -o: The path where the processed interactions CSV will be saved.

Filtering of topological features

--remove_infinite_values or -inf: Boolean. True (recommended) removes topological structures that tend to infinity; False keeps them.

--filter_persistence or -fp: Optional. An integer from 0 to 100 specifying the top percentage of values to extract from the 'lifespan' column. If provided, the tool will filter the dataset to include only these top values.

    python main.py \
        --file_path data/TCGA/BRCA.pkl \
        --filter_genes_path data/preselection/cancer_genes.csv \
        --output_path output/interactions.csv \
        --remove_infinite_values True
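
For readers who want to wrap the library in their own script, the argument list above could be mirrored with `argparse` roughly as follows. This is a sketch under stated assumptions, not the contents of the repository's main.py: the helper names `str_to_bool`, `percentage`, and `build_parser` are hypothetical, and the default of True for `--remove_infinite_values` is assumed from the "recommended" note rather than documented.

```python
import argparse


def str_to_bool(value):
    """Parse True/False style strings, since the example passes a literal True."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def percentage(value):
    """Parse an integer percentage and reject values outside 0-100."""
    number = int(value)
    if not 0 <= number <= 100:
        raise argparse.ArgumentTypeError("expected an integer from 0 to 100")
    return number


def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical WGTDA-style CLI sketch")
    parser.add_argument("--file_path", "-p", required=True,
                        help="path to the gene expression data file")
    parser.add_argument("--filter_genes_path", "-fg", default=None,
                        help="CSV or txt file with preselected genes")
    parser.add_argument("--output_path", "-o", default=None,
                        help="where to save the interactions CSV")
    # Assumption: default True, following the 'recommended' note above.
    parser.add_argument("--remove_infinite_values", "-inf", type=str_to_bool,
                        default=True, help="drop features with infinite lifespan")
    parser.add_argument("--filter_persistence", "-fp", type=percentage, default=None,
                        help="top percentage (0-100) of 'lifespan' values to keep")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args([
    "--file_path", "data/TCGA/BRCA.pkl",
    "--remove_infinite_values", "False",
    "--filter_persistence", "10",
])
```

Parsing the boolean through a small converter avoids the common pitfall of `type=bool`, which treats any non-empty string, including "False", as true.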

Outputs (Topological Gene Interactions)

The output file contains the proposed biomarkers identified through the WGTDA analysis. Each row in this file represents a topological interaction between genes in $Betti_0$, $Betti_1$, or $Betti_2$ space. Betti numbers distinguish topological spaces by the connectivity of $n$-dimensional simplicial complexes. For example, $Betti_0$ is the number of connected components or clusters, $Betti_1$ is the number of non-contractible loops or cycles, and $Betti_2$ is the number of voids (enclosed cavities) in the data space.
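
As a small worked illustration of these definitions, the sketch below computes the first three Betti numbers of a tiny simplicial complex from the ranks of its boundary matrices, using coefficients in GF(2). It is independent of WGTDA and maTILDA, and the helper names `closure`, `rank_gf2`, and `betti_numbers` are hypothetical.

```python
from itertools import combinations


def closure(top_simplices):
    """All non-empty faces of the given simplices (each a tuple of vertex ids)."""
    faces = set()
    for simplex in top_simplices:
        simplex = tuple(sorted(simplex))
        for size in range(1, len(simplex) + 1):
            faces.update(combinations(simplex, size))
    return faces


def rank_gf2(rows):
    """Rank over GF(2) of a matrix whose rows are integer bitmasks."""
    pivots = {}  # highest set bit -> reduced row that owns it
    rank = 0
    for row in rows:
        while row:
            high = row.bit_length() - 1
            if high not in pivots:
                pivots[high] = row
                rank += 1
                break
            row ^= pivots[high]  # eliminate the leading bit
    return rank


def betti_numbers(simplices, max_dim=2):
    """Betti numbers 0..max_dim of a complex that is closed under taking faces."""
    by_dim = {
        d: sorted({tuple(sorted(s)) for s in simplices if len(s) == d + 1})
        for d in range(max_dim + 2)
    }
    index = {d: {s: i for i, s in enumerate(by_dim[d])} for d in by_dim}

    def boundary_rank(d):
        # Rank of the boundary map from d-simplices to (d-1)-simplices.
        if d == 0 or not by_dim[d]:
            return 0
        rows = []
        for simplex in by_dim[d]:
            mask = 0
            for face in combinations(simplex, d):
                mask |= 1 << index[d - 1][face]
            rows.append(mask)
        return rank_gf2(rows)

    # betti_d = (number of d-simplices) - rank(boundary_d) - rank(boundary_{d+1})
    return [
        len(by_dim[d]) - boundary_rank(d) - boundary_rank(d + 1)
        for d in range(max_dim + 1)
    ]


hollow_triangle = betti_numbers(closure([(0, 1), (1, 2), (0, 2)]))
filled_triangle = betti_numbers(closure([(0, 1, 2)]))
hollow_tetrahedron = betti_numbers(closure(combinations(range(4), 3)))
```

The hollow triangle has one component and one loop, filling it in removes the loop, and the hollow tetrahedron encloses a single void, which matches the three Betti numbers described above.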

In the context of topological features, higher Betti numbers indicate a more complex topological structure with more independent cycles or voids, which may be associated with more intricate and robust biological processes. Furthermore, by focusing on the most persistent interactions, researchers can prioritize genes for further experimental validation and study, ultimately contributing to the understanding and manipulation of lifespan-associated pathways.
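
The prioritization described above can be sketched in a few lines of plain Python. The example assumes each interaction carries `birth` and `death` values and that lifespan is their difference; these field names, the helper names, and the rounding rule for the top percentage are assumptions for illustration, not WGTDA's actual filtering code.

```python
import math


def remove_infinite(rows):
    """Drop interactions whose death value is infinite (they never disappear)."""
    return [row for row in rows if math.isfinite(row["death"])]


def top_percent_by_lifespan(rows, percent):
    """Keep the top `percent` of interactions ranked by lifespan (death - birth)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    ranked = sorted(rows, key=lambda row: row["death"] - row["birth"], reverse=True)
    keep = math.ceil(len(ranked) * percent / 100)  # assumption: round up
    return ranked[:keep]


# Hypothetical interactions; gene names and values are made up.
interactions = [
    {"genes": ("A", "B"), "birth": 0.1, "death": 0.9},
    {"genes": ("C", "D"), "birth": 0.2, "death": 0.3},
    {"genes": ("E", "F"), "birth": 0.0, "death": math.inf},
    {"genes": ("G", "H"), "birth": 0.3, "death": 0.8},
    {"genes": ("I", "J"), "birth": 0.4, "death": 0.6},
]
finite = remove_infinite(interactions)
shortlist = top_percent_by_lifespan(finite, 50)
```

Removing infinite values first keeps a never-dying feature from dominating the ranking, which is consistent with the recommendation in the command-line options.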

Citation

If you use WGTDA for research, please consider citing the reference paper:

    @article{nyase2024wgtda,
      title={WGTDA: A Topological Perspective to Biomarker Discovery in Gene Expression Data},
      author={Nyase, Ndivhuwo and Mashatola, Lebohang and Kohlakala, Aviwe and Rhrissorrakrai, Kahn and Muller, Stephanie},
      journal={arXiv preprint arXiv:2402.08807},
      year={2024}
    }

Owner

  • Name: International Business Machines
  • Login: IBM
  • Kind: organization
  • Email: awesome@ibm.com
  • Location: United States of America

GitHub Events

Total
  • Watch event: 4
  • Push event: 3
  • Pull request event: 1
  • Fork event: 1
  • Create event: 1
Last Year
  • Watch event: 4
  • Push event: 3
  • Pull request event: 1
  • Fork event: 1
  • Create event: 1

Dependencies

requirements.txt pypi
  • bokeh ==3.4.1
  • dcor ==0.6
  • matilda ==0.0.1
  • matplotlib ==3.8.3
  • networkx ==3.2.1
  • numpy ==1.26.4
  • pandas ==2.2.2
  • pybind11 ==2.12.0
  • rectangle_packer ==2.0.2
  • scikit_learn ==1.4.2
  • scikit_learn ==1.5.0
  • scipy ==1.13.0
  • scipy ==1.13.1
  • seaborn ==0.13.2
  • setuptools ==68.2.2
  • sympy ==1.12
  • tqdm ==4.66.2
  • xgboost ==2.0.3
setup.py pypi
  • dcor >=0.6
  • matilda *
  • numpy >=1.21.0
  • pandas >=1.3.0
  • scipy >=1.7.0
src/wgtda/wgtda.egg-info/requires.txt pypi
  • dcor >=0.5.4
  • numpy >=1.21.0
  • pandas >=1.3.0
  • scipy >=1.7.0
src/wgtda.egg-info/requires.txt pypi
  • dcor >=0.6
  • numpy >=1.21.0
  • pandas >=1.3.0
  • scipy >=1.7.0