mapper-pipeline

Identifying homogeneous subgroups of patients and important features: a topological machine learning approach

https://github.com/kcl-bhi/mapper-pipeline

Science Score: 65.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
    Organization kcl-bhi has institutional domain (www.kcl.ac.uk)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.3%) to scientific vocabulary

Keywords

clustering pipeline python tda topology

Repository

Identifying homogeneous subgroups of patients and important features: a topological machine learning approach

Basic Info
  • Host: GitHub
  • Owner: kcl-bhi
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 299 KB
Statistics
  • Stars: 5
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
clustering pipeline python tda topology
Created about 5 years ago · Last pushed over 3 years ago
Metadata Files
Readme License Citation

README.md

Identifying homogeneous subgroups of patients and important features: a topological machine learning approach

Ewan Carr¹
Mathieu Carrière²
Bertrand Michel³
Fred Chazal⁴
Raquel Iniesta¹

¹ Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology & Neuroscience, King's College London, United Kingdom.
² Inria Sophia-Antipolis, DataShape team, Biot, France.
³ Ecole Centrale de Nantes, LMJL – UMR CNRS 6629, Nantes, France.
⁴ Inria Saclay, Ile-de-France, France.

For more information, please see the 🔓 paper in BMC Bioinformatics:

Carr, E., Carrière, M., Michel, B. et al. Identifying homogeneous subgroups of patients and important features: a topological machine learning approach. BMC Bioinformatics 22, 449 (2021). https://doi.org/10.1186/s12859-021-04360-9

Please contact raquel.iniesta@kcl.ac.uk for queries.

About

This repository provides a pipeline for clustering based on topological data analysis:

Background: This paper exploits recent developments in topological data analysis to present a pipeline for clustering based on Mapper, an algorithm that reduces complex data into a one-dimensional graph.

Results: We present a pipeline to identify and summarise clusters based on statistically significant topological features from a point cloud using Mapper.

Conclusions: Key strengths of this pipeline include the integration of prior knowledge to inform the clustering process and the selection of optimal clusters; the use of the bootstrap to restrict the search to robust topological features; the use of machine learning to inspect clusters; and the ability to incorporate mixed data types. Our pipeline can be downloaded under the GNU GPLv3 license at https://github.com/kcl-bhi/mapper-pipeline.

Software

Our pipeline is written in Python 3 and builds on several open source packages, including sklearn-tda, GUDHI, xgboost, and pygraphviz.

To get started, clone this repository and create a new virtual environment:

    git clone https://github.com/kcl-bhi/mapper-pipeline.git
    cd mapper-pipeline
    python3 -m venv env
    source env/bin/activate
    pip install -r requirements.txt

Using the pipeline

The pipeline expects an input dataset in CSV format. In our application, the dataset contained information on ≈140 variables for ≈430 participants. If input.csv is not found in the working directory, sample data will be simulated.
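The fallback described above can be pictured with a minimal sketch. It is an illustration only: the function name `load_or_simulate`, the default dimensions, and the use of standard-normal values are assumptions, and the repository's own simulation may work differently.

```python
import csv
import os
import random


def load_or_simulate(path="input.csv", n_rows=430, n_cols=140, seed=0):
    """Return (header, rows) read from `path`, or simulated data if it is missing."""
    if os.path.exists(path):
        with open(path, newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader)
            rows = [row for row in reader]  # values are kept as strings here
        return header, rows

    # Hypothetical fallback: standard-normal values with generic column names.
    rng = random.Random(seed)
    header = [f"x{j}" for j in range(n_cols)]
    rows = [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]
    return header, rows


# Small simulated dataset, since this file name is not expected to exist.
header, rows = load_or_simulate("definitely_missing_file_xyz.csv", n_rows=5, n_cols=3)
```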

The scripts should be used as follows:

  1. Generate input files

    python3 prepare_inputs.py

1. Load the input dataset.
2. Construct the Gower distance matrix. Note that this requires categorical variables to be specified using `categorical_items.csv`.
3. Define sets of parameters to explore via grid search.
4. Create a dictionary containing all combinations of input parameters.

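```python
# One dictionary per combination of filter, resolution, and gain.
# `fil` (a dict, iterated as (name, value) pairs) and `resolutions`
# must be defined before this point.
params = [{'fil': f,
           'res': r,
           'gain': gain}
          for f in fil.items()
          for r in resolutions
          for gain in [0.1, 0.2, 0.3, 0.4]]
```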

5. Store each set of inputs, and other required data, in the `inputs`
   directory.
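To illustrate why the Gower step needs the categorical variables to be identified, here is a plain-Python sketch of the Gower distance for mixed data. The repository lists the `gower` package among its dependencies. The code below is not that package's implementation, and the function names and toy records are assumptions made for illustration.

```python
def gower_distance(a, b, categorical, ranges):
    """Gower distance between two records of equal length.

    categorical: set of column indices treated as categorical.
    ranges: per-column range (max - min), used for numeric columns.
    """
    total = 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if j in categorical:
            total += 0.0 if x == y else 1.0   # simple mismatch
        elif ranges[j] > 0:
            total += abs(x - y) / ranges[j]   # range-normalised difference
        # a constant numeric column contributes 0
    return total / len(a)


def gower_matrix(rows, categorical):
    """Pairwise Gower distances for a list of records."""
    n_cols = len(rows[0])
    ranges = [0.0] * n_cols
    for j in range(n_cols):
        if j not in categorical:
            column = [r[j] for r in rows]
            ranges[j] = max(column) - min(column)
    return [[gower_distance(a, b, categorical, ranges) for b in rows]
            for a in rows]


# Toy records: age, height, and one categorical item (column index 2).
rows = [
    [25, 1.70, "a"],
    [35, 1.80, "b"],
    [45, 1.60, "a"],
]
D = gower_matrix(rows, categorical={2})
```

Each variable contributes a value between 0 and 1, so numeric and categorical columns are weighted equally in the average.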
  2. Run Mapper for each set of input parameters

The script test_single_graph.py runs Mapper for a single set of parameters. It requires three arguments:

    python3 test_single_graph.py '0333' 'inputs' 'outputs'

`0333` identifies the set of parameters to test; `inputs` and `outputs` specify the folders from which inputs are loaded and to which outputs are saved. This script:

1. Runs Mapper for the specified parameters (using `MapperComplex`).
2. Identifies statistically significant, representative, topological features.
3. Extracts the required summaries and stores them in the `outputs` subfolder.
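For readers new to Mapper, the first step above can be sketched in plain Python. This is a simplified illustration, not the `MapperComplex` implementation: it assumes a single one-dimensional filter, treats `res` as the number of cover intervals and `gain` as the fraction by which consecutive intervals overlap, and replaces the clustering step with distance-threshold connected components (`eps` is a hypothetical parameter).

```python
import math
from itertools import combinations


def interval_cover(lo, hi, res, gain):
    """Cover [lo, hi] with `res` equal intervals overlapping by a fraction `gain`."""
    length = (hi - lo) / (1 + (res - 1) * (1 - gain))
    step = length * (1 - gain)
    cover = [(lo + i * step, lo + i * step + length) for i in range(res)]
    cover[-1] = (cover[-1][0], hi)  # guard against rounding at the upper end
    return cover


def components(indices, points, eps):
    """Connected components when points within distance eps are linked."""
    remaining = set(indices)
    clusters = []
    while remaining:
        stack = [remaining.pop()]
        cluster = set(stack)
        while stack:
            i = stack.pop()
            near = {j for j in remaining
                    if math.dist(points[i], points[j]) <= eps}
            remaining -= near
            cluster |= near
            stack.extend(near)
        clusters.append(cluster)
    return clusters


def mapper_graph(points, filter_values, res, gain, eps):
    """Nodes are clusters within each interval; edges join clusters sharing points."""
    cover = interval_cover(min(filter_values), max(filter_values), res, gain)
    nodes = []
    for a, b in cover:
        inside = [i for i, f in enumerate(filter_values) if a <= f <= b]
        nodes.extend(components(inside, points, eps))
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]]
    return nodes, edges


# Ten points on a line, filtered by their x-coordinate.
points = [(float(x), 0.0) for x in range(10)]
filter_values = [p[0] for p in points]
nodes, edges = mapper_graph(points, filter_values, res=3, gain=0.5, eps=1.0)
```

In this toy case the three overlapping intervals each yield one cluster, and the shared points link them into a path, which is the one-dimensional summary Mapper is designed to produce.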
  3. Process all outputs and produce summaries

    python3 process_outputs.py

This file:

  1. Loads all outputs (from `outputs`).
  2. Excludes duplicate graphs and graphs with no significant features.
  3. Splits each graph into separate topological features and removes features containing <5% or >95% of the sample.
  4. Derives the required summaries for each feature. This includes homogeneity among feature members with respect to a pre-specified outcome.
  5. Ranks all features by homogeneity and selects the top N features.
  6. Visualises each top-ranked feature and outputs summaries to a spreadsheet.
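The size filter and ranking in steps 3 to 5 can be sketched as follows. The homogeneity measure used here (the majority share of a binary outcome among feature members) is a hypothetical stand-in chosen for illustration, and the function and feature names are assumptions, not the pipeline's own.

```python
def rank_features(features, outcome, n_total, top_n=3, lower=0.05, upper=0.95):
    """Filter features by size, then rank them by outcome homogeneity.

    features: dict mapping a feature name to the set of member indices.
    outcome: list of 0/1 values, one per participant.
    """
    ranked = []
    for name, members in features.items():
        share = len(members) / n_total
        if share < lower or share > upper:
            continue  # drop features covering <5% or >95% of the sample
        p = sum(outcome[i] for i in members) / len(members)
        homogeneity = max(p, 1 - p)  # hypothetical measure: majority share
        ranked.append((homogeneity, name))
    ranked.sort(reverse=True)
    return [(name, h) for h, name in ranked[:top_n]]


# Toy sample of 20 participants: the first ten have outcome 1.
outcome = [1] * 10 + [0] * 10
features = {
    "A": {0, 1, 2, 3},      # all members share the outcome
    "B": {8, 9, 10, 11},    # evenly split
    "C": set(range(20)),    # whole sample, removed by the size filter
    "D": {5, 6, 7, 12},     # three of four share the outcome
}
top = rank_features(features, outcome, n_total=20, top_n=2)
```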

Parallel computing

The grid search can be time-consuming, especially as the number of parameter settings increases. Fortunately, this process is straightforward to parallelise, either across multiple cores on a local machine or on a computing cluster.

On a single machine, using GNU `parallel`:

    python3 prepare_inputs.py
    parallel --progress -j4 < jobs
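The contents of `jobs` are not shown here. As a hedged sketch, it could hold one command per line, each calling `test_single_graph.py` with a zero-padded parameter identifier. The helper `make_jobs`, the four-digit padding (chosen to match the `'0333'` example), and the numbering from 1 are all assumptions.

```python
def make_jobs(n_sets, inputs="inputs", outputs="outputs", width=4):
    """Build one shell command per parameter set (hypothetical job format)."""
    return [
        f"python3 test_single_graph.py '{i:0{width}d}' '{inputs}' '{outputs}'"
        for i in range(1, n_sets + 1)
    ]


# Three example commands; writing them to a file named `jobs` would
# give `parallel` one command per line to execute.
jobs = make_jobs(3)
```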

On a cluster:

```{bash}
#!/bin/bash
#SBATCH --tasks=8
#SBATCH --mem=4000
#SBATCH --job-name=array
#SBATCH --array=1-2000
#SBATCH --output=logs/%a.out
#SBATCH --time=0-72:00

count=$(printf "%03d" $SLURM_ARRAY_TASK_ID)
python3 test_single_graph.py $count "inputs" "outputs"
```

Owner

  • Name: Department of Biostatistics & Health Informatics, King's College London
  • Login: kcl-bhi
  • Kind: organization
  • Email: phidl-sysadmin@kcl.ac.uk

Citation (CITATION.cff)

# YAML 1.2
---
abstract: |
    "Background
    
    This paper exploits recent developments in topological data analysis to present a pipeline for clustering based on Mapper, an algorithm that reduces complex data into a one-dimensional graph.
    
    Results
    
    We present a pipeline to identify and summarise clusters based on statistically significant topological features from a point cloud using Mapper.
    
    Conclusions
    
    Key strengths of this pipeline include the integration of prior knowledge to inform the clustering process and the selection of optimal clusters; the use of the bootstrap to restrict the search to robust topological features; the use of machine learning to inspect clusters; and the ability to incorporate mixed data types. Our pipeline can be downloaded under the GNU GPLv3 license at https://github.com/kcl-bhi/mapper-pipeline."
authors: 
  -
    affiliation: "Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK"
    family-names: Carr
    given-names: Ewan
  -
    affiliation: "Inria Sophia-Antipolis, DataShape Team, Biot, France"
    family-names: "Carrière"
    given-names: Mathieu
  -
    affiliation: "Ecole Centrale de Nantes, LMJL – UMR CNRS 6629, Nantes, France"
    family-names: Michel
    given-names: Bertrand
  -
    affiliation: "Inria Saclay, Ile-de-France, Alan Turing Building, Palaiseau, France"
    family-names: Chazal
    given-names: "Frédéric"
  -
    affiliation: "Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK"
    family-names: Iniesta
    given-names: Raquel
cff-version: "1.1.0"
date-released: 2021-09-20
doi: "10.1186/s12859-021-04360-9"
keywords: 
  - "Topological data analysis"
  - Clustering
  - "Machine learning"
message: "If you use this software, please cite it using these metadata."
repository-code: "https://github.com/kcl-bhi/mapper-pipeline"
title: "Identifying homogeneous subgroups of patients and important features: a topological machine learning approach"
...

Dependencies

requirements.txt pypi
  • gower ==0.0.5
  • gudhi ==3.4.0
  • ipython ==7.31.1
  • matplotlib ==3.3.3
  • networkx ==2.5
  • numpy ==1.21.0
  • pandas ==1.2.0
  • pygraphviz ==1.6
  • scikit-learn ==0.24.0
  • statsmodels ==0.12.1
  • tqdm ==4.56.0
  • xgboost ==1.3.1
  • xlsxwriter ==1.3.7