mapper-pipeline

Identifying homogeneous subgroups of patients and important features: a topological machine learning approach

https://github.com/kcl-bhi/mapper-pipeline

Science Score: 65.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 2 DOI reference(s) in README
○ Academic publication links
○ Academic email domains
✓ Institutional organization owner: Organization kcl-bhi has institutional domain (www.kcl.ac.uk)
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.3%) to scientific vocabulary

Keywords

clustering pipeline python tda topology

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: kcl-bhi
License: gpl-3.0
Language: Python
Default Branch: main
Homepage:
Size: 299 KB

Statistics

Stars: 5
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0

Topics

clustering pipeline python tda topology

Created about 5 years ago · Last pushed over 3 years ago

Metadata Files

Readme License Citation

Identifying homogeneous subgroups of patients and important features: a topological machine learning approach

Ewan Carr¹
Mathieu Carrière²
Bertrand Michel³
Fred Chazal⁴
Raquel Iniesta¹

¹ Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology & Neuroscience, King's College London, United Kingdom.
² Inria Sophia-Antipolis, DataShape team, Biot, France.
³ Ecole Centrale de Nantes, LMJL – UMR CNRS 6629, Nantes, France.
⁴ Inria Saclay, Ile-de-France, France.

For more information, please see the 🔓 paper in BMC Bioinformatics:

Carr, E., Carrière, M., Michel, B. et al. Identifying homogeneous subgroups of patients and important features: a topological machine learning approach. BMC Bioinformatics 22, 449 (2021). https://doi.org/10.1186/s12859-021-04360-9

Please contact raquel.iniesta@kcl.ac.uk for queries.

About

This repository provides a pipeline for clustering based on topological data analysis:

Background: This paper exploits recent developments in topological data analysis to present a pipeline for clustering based on Mapper, an algorithm that reduces complex data into a one-dimensional graph.

Results: We present a pipeline to identify and summarise clusters based on statistically significant topological features from a point cloud using Mapper.

Conclusions: Key strengths of this pipeline include the integration of prior knowledge to inform the clustering process and the selection of optimal clusters; the use of the bootstrap to restrict the search to robust topological features; the use of machine learning to inspect clusters; and the ability to incorporate mixed data types. Our pipeline can be downloaded under the GNU GPLv3 license at https://github.com/kcl-bhi/mapper-pipeline.

Software

Our pipeline is written in Python 3 and builds on several open source packages, including sklearn-tda, GUDHI, xgboost, and pygraphviz.

To get started, clone this repository and create a new virtual environment:

    git clone https://github.com/kcl-bhi/mapper-pipeline.git
    cd mapper-pipeline
    python3 -m venv env
    source env/bin/activate
    pip install -r requirements.txt

Using the pipeline

The pipeline expects an input dataset in CSV format. In our application, the dataset contained information on ≈140 variables for ≈430 participants. If input.csv is not found in the working directory, sample data will be simulated.

The scripts should be used as follows:

Generate input files

    python3 prepare_inputs.py

1. Load the input dataset.
2. Construct the Gower distance matrix. Note that this requires categorical variables to be specified using `categorical_items.csv`.
3. Define sets of parameters to explore via grid search.
4. Create a dictionary containing all combinations of input parameters.

   ```python
   # `fil` (a dict of filters) and `resolutions` are defined earlier in the script.
   # Build one parameter set per combination of filter, resolution, and gain.
   params = [{'fil': f,
              'res': r,
              'gain': gain}
             for f in fil.items()
             for r in resolutions
             for gain in [0.1, 0.2, 0.3, 0.4]]
   ```

5. Store each set of inputs, and other required data, in the `inputs`
   directory.
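
Step 2 relies on the Gower distance, which handles mixed data by averaging one dissimilarity per variable. The pipeline's requirements list the `gower` package; the pure-Python sketch below only illustrates the definition (range-normalised absolute difference for numeric variables, simple mismatch for categorical ones) and is not the package's code. The function names and toy records are hypothetical, and missing values are not handled.

```python
def gower_distance(a, b, categorical, ranges):
    """Gower distance between two records of equal length.

    categorical: set of column indices treated as categorical.
    ranges: per-column (max - min), used to scale numeric columns.
    """
    total = 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if j in categorical:
            # Categorical: 0 if the categories match, 1 otherwise.
            total += 0.0 if x == y else 1.0
        elif ranges[j] > 0:
            # Numeric: absolute difference scaled by the column range.
            total += abs(x - y) / ranges[j]
        # A constant numeric column contributes 0.
    return total / len(a)


def gower_matrix(rows, categorical):
    """Pairwise Gower distances for a list of records."""
    n_cols = len(rows[0])
    ranges = [
        0.0 if j in categorical
        else max(r[j] for r in rows) - min(r[j] for r in rows)
        for j in range(n_cols)
    ]
    return [[gower_distance(a, b, categorical, ranges) for b in rows]
            for a in rows]


# Toy records (hypothetical): age, symptom score, sex.
rows = [
    (30, 1.0, "female"),
    (50, 3.0, "male"),
    (40, 1.0, "female"),
]
D = gower_matrix(rows, categorical={2})
```

Every entry of `D` lies in [0, 1], so variables measured on different scales contribute equally to the distance.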

Run Mapper for each set of input parameters

The script test_single_graph.py runs Mapper for a single set of parameters. It requires three arguments:

    python3 test_single_graph.py '0333' 'inputs' 'outputs'

`0333` identifies the set of parameters to test; `inputs` and `outputs` name the folders from which inputs are loaded and to which outputs are saved. This script:

1. Runs Mapper for the specified parameters (using `MapperComplex`).
2. Identifies statistically significant, representative topological features.
3. Extracts the required summaries and stores them in the `outputs` subfolder.
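
As background for the `res` and `gain` parameters: in Mapper, the range of a filter function is covered by overlapping intervals, where the resolution is the number of intervals and the gain controls how much neighbouring intervals overlap. The following is a minimal sketch, assuming equal-length intervals whose neighbours overlap by a fraction `gain` of their length. `MapperComplex` may construct its cover differently, and the helper names here are hypothetical.

```python
def interval_cover(lo, hi, resolution, gain):
    """Cover [lo, hi] with `resolution` equal-length intervals in which
    consecutive intervals overlap by a fraction `gain` of their length."""
    if resolution < 1 or not 0 <= gain < 1:
        raise ValueError("need resolution >= 1 and 0 <= gain < 1")
    length = (hi - lo) / (resolution - (resolution - 1) * gain)
    step = length * (1 - gain)
    cover = [(lo + i * step, lo + i * step + length)
             for i in range(resolution)]
    # Pin the final endpoint to `hi` so rounding error cannot exclude the maximum.
    cover[-1] = (cover[-1][0], hi)
    return cover


def assign_to_intervals(values, intervals):
    """For each interval, list the indices of points whose filter value
    falls inside it."""
    return [[i for i, v in enumerate(values) if a <= v <= b]
            for a, b in intervals]


# Toy filter values (hypothetical) for seven observations.
filter_values = [0.0, 0.2, 0.45, 0.5, 0.55, 0.9, 1.0]
cover = interval_cover(0.0, 1.0, resolution=2, gain=0.2)
members = assign_to_intervals(filter_values, cover)
```

Points that fall in the overlap (indices 2 to 4 here) belong to both intervals, which is what lets Mapper link clusters from neighbouring intervals with an edge.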

Process all outputs and produce summaries

    python3 process_outputs.py

This script:

1. Loads all outputs (from `outputs`).
2. Excludes graphs with no significant features, as well as duplicate graphs.
3. Splits each graph into separate topological features and removes features containing <5% or >95% of the sample.
4. Derives the required summaries for each feature, including homogeneity among feature members with respect to a pre-specified outcome.
5. Ranks all features by homogeneity and selects the top N features.
6. Visualises each top-ranked feature and outputs summaries to a spreadsheet.
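
The size filter and ranking described above can be sketched as follows. This is an illustration only, not the code in `process_outputs.py`: the feature records, the `homogeneity` field (assumed here to be a score where higher means more homogeneous), and the function name are all hypothetical.

```python
def select_top_features(features, n_total, top_n,
                        min_frac=0.05, max_frac=0.95):
    """Drop features covering <5% or >95% of the sample, then return the
    `top_n` remaining features ranked by homogeneity (highest first)."""
    kept = [f for f in features
            if min_frac <= len(f["members"]) / n_total <= max_frac]
    ranked = sorted(kept, key=lambda f: f["homogeneity"], reverse=True)
    return ranked[:top_n]


# Toy features (hypothetical) for a sample of 100 participants.
features = [
    {"id": "A", "members": set(range(3)), "homogeneity": 0.95},   # too small
    {"id": "B", "members": set(range(40)), "homogeneity": 0.70},
    {"id": "C", "members": set(range(97)), "homogeneity": 0.99},  # too large
    {"id": "D", "members": set(range(20)), "homogeneity": 0.90},
    {"id": "E", "members": set(range(10)), "homogeneity": 0.60},
]
top = select_top_features(features, n_total=100, top_n=2)
top_ids = [f["id"] for f in top]
```

Filtering before ranking matters: features A and C have the highest scores in this toy example, but they are excluded because they contain almost none or almost all of the sample.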

Parallel computing

The grid search can be time-consuming, especially as the number of parameter settings increases. Fortunately, the process is straightforward to parallelise, either across multiple cores on a local machine or on a computing cluster.

On a single machine, using `parallel`:

    python3 prepare_inputs.py
    parallel --progress -j4 < jobs
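
The `jobs` file read by `parallel` presumably holds one command per line, one for each parameter set. As a rough sketch of how such a file could be produced: the filter names, resolution values, and four-digit zero-padded identifiers below are assumptions modelled on the `'0333'` example, and the actual naming is determined by `prepare_inputs.py`.

```python
from itertools import product


def build_jobs(filters, resolutions, gains,
               in_dir="inputs", out_dir="outputs"):
    """Enumerate every parameter combination and build one shell command
    per combination, identified by a zero-padded counter."""
    combos = list(product(filters, resolutions, gains))
    jobs = []
    for i, _ in enumerate(combos, start=1):
        job_id = f"{i:04d}"  # assumed ID format, e.g. '0001'
        jobs.append(
            f"python3 test_single_graph.py '{job_id}' '{in_dir}' '{out_dir}'"
        )
    return combos, jobs


# Hypothetical grid: 2 filters x 3 resolutions x 4 gains = 24 jobs.
combos, jobs = build_jobs(
    filters=["filter_a", "filter_b"],
    resolutions=[5, 10, 15],
    gains=[0.1, 0.2, 0.3, 0.4],
)
jobs_text = "\n".join(jobs) + "\n"
```

Writing `jobs_text` to a file named `jobs` would give `parallel` one independent command per line, so the `-j4` option can run four parameter sets at a time.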

On a cluster:

```{bash}
#!/bin/bash
#SBATCH --tasks=8
#SBATCH --mem=4000
#SBATCH --job-name=array
#SBATCH --array=1-2000
#SBATCH --output=logs/%a.out
#SBATCH --time=0-72:00

count=$(printf "%03d" $SLURM_ARRAY_TASK_ID)
python3 test_single_graph.py $count "inputs" "outputs"
```

Owner

Name: Department of Biostatistics & Health Informatics, King's College London
Login: kcl-bhi
Kind: organization
Email: phidl-sysadmin@kcl.ac.uk

Website: https://www.kcl.ac.uk/mental-health-and-psychological-sciences/about/departments/biostatistics-and-health-informatics
Repositories: 1
Profile: https://github.com/kcl-bhi

Citation (CITATION.cff)

# YAML 1.2
---
abstract: |
    "Background
    
    This paper exploits recent developments in topological data analysis to present a pipeline for clustering based on Mapper, an algorithm that reduces complex data into a one-dimensional graph.
    
    Results
    
    We present a pipeline to identify and summarise clusters based on statistically significant topological features from a point cloud using Mapper.
    
    Conclusions
    
    Key strengths of this pipeline include the integration of prior knowledge to inform the clustering process and the selection of optimal clusters; the use of the bootstrap to restrict the search to robust topological features; the use of machine learning to inspect clusters; and the ability to incorporate mixed data types. Our pipeline can be downloaded under the GNU GPLv3 license at https://github.com/kcl-bhi/mapper-pipeline."
authors: 
  -
    affiliation: "Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK"
    family-names: Carr
    given-names: Ewan
  -
    affiliation: "Inria Sophia-Antipolis, DataShape Team, Biot, France"
    family-names: "Carrière"
    given-names: Mathieu
  -
    affiliation: "Ecole Centrale de Nantes, LMJL – UMR CNRS 6629, Nantes, France"
    family-names: Michel
    given-names: Bertrand
  -
    affiliation: "Inria Saclay, Ile-de-France, Alan Turing Building, Palaiseau, France"
    family-names: Chazal
    given-names: "Frédéric"
  -
    affiliation: "Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK"
    family-names: Iniesta
    given-names: Raquel
cff-version: "1.1.0"
date-released: 2021-09-20
doi: "10.1186/s12859-021-04360-9"
keywords: 
  - "Topological data analysis"
  - Clustering
  - "Machine learning"
message: "If you use this software, please cite it using these metadata."
repository-code: "https://github.com/kcl-bhi/mapper-pipeline"
title: "Identifying homogeneous subgroups of patients and important features: a topological machine learning approach"
...

Dependencies

requirements.txt pypi

gower ==0.0.5
gudhi ==3.4.0
ipython ==7.31.1
matplotlib ==3.3.3
networkx ==2.5
numpy ==1.21.0
pandas ==1.2.0
pygraphviz ==1.6
scikit-learn ==0.24.0
statsmodels ==0.12.1
tqdm ==4.56.0
xgboost ==1.3.1
xlsxwriter ==1.3.7
