mapper-pipeline
Identifying homogeneous subgroups of patients and important features: a topological machine learning approach
Ewan Carr¹
Mathieu Carrière²
Bertrand Michel³
Fred Chazal⁴
Raquel Iniesta¹
¹ Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology & Neuroscience, King's College London, United Kingdom.
² Inria Sophia-Antipolis, DataShape team, Biot, France.
³ Ecole Centrale de Nantes, LMJL – UMR CNRS 6629, Nantes, France.
⁴ Inria Saclay, Ile-de-France, France.
For more information, please see the 🔓 paper in BMC Bioinformatics:
Carr, E., Carrière, M., Michel, B. et al. Identifying homogeneous subgroups of patients and important features: a topological machine learning approach. BMC Bioinformatics 22, 449 (2021). https://doi.org/10.1186/s12859-021-04360-9
Please contact raquel.iniesta@kcl.ac.uk for queries.
About
This repository provides a pipeline for clustering based on topological data analysis:
Background: This paper exploits recent developments in topological data analysis to present a pipeline for clustering based on Mapper, an algorithm that reduces complex data into a one-dimensional graph.
Results: We present a pipeline to identify and summarise clusters based on statistically significant topological features from a point cloud using Mapper.
Conclusions: Key strengths of this pipeline include the integration of prior knowledge to inform the clustering process and the selection of optimal clusters; the use of the bootstrap to restrict the search to robust topological features; the use of machine learning to inspect clusters; and the ability to incorporate mixed data types. Our pipeline can be downloaded under the GNU GPLv3 license at https://github.com/kcl-bhi/mapper-pipeline.
Software
Our pipeline is written in Python 3 and builds on several open source packages, including sklearn-tda, GUDHI, xgboost, and pygraphviz.
To get started, clone this repository and create a new virtual environment:
```{bash}
git clone https://github.com/kcl-bhi/mapper-pipeline.git
cd mapper-pipeline
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```
Using the pipeline
The pipeline expects an input dataset in CSV format. In our application, the
dataset contained information on ≈140 variables for ≈430 participants. If
`input.csv` is not found in the working directory, sample data will be
simulated.
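The fallback described above can be sketched as follows. This is a minimal illustration rather than the code in `prepare_inputs.py`: the function name `load_or_simulate`, the column names `x0`, `x1`, ... and the choice of standard-normal values are all assumptions made for the example.

```python
import csv
import os
import random


def load_or_simulate(path="input.csv", n_rows=430, n_cols=140, seed=0):
    """Return rows as a list of dicts: read `path` if present, else simulate."""
    if os.path.exists(path):
        with open(path, newline="") as handle:
            # Note: csv.DictReader yields every value as a string.
            return list(csv.DictReader(handle))
    # Hypothetical fallback: standard-normal values in columns x0, x1, ...
    rng = random.Random(seed)
    columns = [f"x{j}" for j in range(n_cols)]
    return [{c: rng.gauss(0.0, 1.0) for c in columns} for _ in range(n_rows)]


# The file below does not exist, so a small simulated dataset is returned.
rows = load_or_simulate("no_such_file.csv", n_rows=5, n_cols=3)
print(len(rows), sorted(rows[0]))
```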
The scripts should be used as follows:
- Generate input files
```{bash}
python3 prepare_inputs.py
```
1. Load the input dataset.
2. Construct the Gower distance matrix. Note that this requires
categorical variables to be specified using `categorical_items.csv`.
3. Define sets of parameters to explore via grid search.
4. Create a dictionary containing all combinations of input parameters.
```python
# `fil` (a dict) and `resolutions` (an iterable) are defined earlier in the script.
params = [{'fil': f,
           'res': r,
           'gain': gain}
          for f in fil.items()
          for r in resolutions
          for gain in [0.1, 0.2, 0.3, 0.4]]
```
5. Store each set of inputs, and other required data, in the `inputs`
directory.
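To make step 2 concrete, the sketch below computes a Gower distance matrix for a toy dataset with one numeric and one categorical variable. It uses the standard definition (range-normalised absolute difference for numeric variables, simple mismatch for categorical ones, averaged over variables) in plain Python. The pipeline itself lists the `gower` package as a dependency; the function names and the toy records here are illustrative assumptions, not the pipeline's code.

```python
def gower_distance(a, b, is_categorical, ranges):
    """Gower distance between two records with mixed variable types."""
    total = 0.0
    for x, y, cat, rng in zip(a, b, is_categorical, ranges):
        if cat:
            total += 0.0 if x == y else 1.0  # simple mismatch
        elif rng > 0:
            total += abs(x - y) / rng        # range-normalised difference
        # A constant numeric variable (rng == 0) contributes nothing.
    return total / len(a)


def toy_gower_matrix(records, is_categorical):
    """Pairwise Gower distances as a list of lists."""
    ranges = []
    for j, cat in enumerate(is_categorical):
        if cat:
            ranges.append(0.0)  # unused for categorical variables
        else:
            column = [r[j] for r in records]
            ranges.append(max(column) - min(column))
    return [[gower_distance(a, b, is_categorical, ranges) for b in records]
            for a in records]


# Toy data: (age, group); the second variable is categorical.
records = [(20.0, "a"), (30.0, "a"), (40.0, "b")]
D = toy_gower_matrix(records, is_categorical=[False, True])
for row in D:
    print(row)
```

Each entry lies between 0 and 1, which is what allows numeric and categorical variables to be combined in a single distance.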
- Run Mapper for each set of input parameters
The script `test_single_graph.py` runs Mapper for a single set of
parameters. It requires three arguments:
```{bash}
python3 test_single_graph.py '0333' 'inputs' 'outputs'
```
`0333` refers to the set of parameters to test; `inputs` and `outputs`
specify the folders to load inputs and save outputs. This script:
1. Runs Mapper for the specified parameters (using `MapperComplex`).
2. Identifies statistically significant, representative, topological features.
3. Extracts required summaries and stores in the `outputs` subfolder.
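The `res` and `gain` values in the parameter grid control how Mapper covers the range of a filter function. The sketch below shows one common construction, assuming that resolution is the number of intervals and gain is the fraction by which consecutive intervals overlap. It is a generic illustration; `interval_cover` is a hypothetical helper, and the exact construction used by `MapperComplex` may differ in detail.

```python
def interval_cover(lo, hi, resolution, gain):
    """Cover [lo, hi] with `resolution` equal-length intervals.

    Consecutive intervals overlap by a fraction `gain` of their length.
    """
    if resolution < 1 or not 0 <= gain < 1:
        raise ValueError("need resolution >= 1 and 0 <= gain < 1")
    # Solve: length + (resolution - 1) * length * (1 - gain) = hi - lo
    length = (hi - lo) / (1 + (resolution - 1) * (1 - gain))
    step = length * (1 - gain)
    return [(lo + i * step, lo + i * step + length) for i in range(resolution)]


cover = interval_cover(0.0, 1.0, resolution=4, gain=0.5)
for start, end in cover:
    print(f"[{start:.2f}, {end:.2f}]")
```

Points whose filter value falls in an overlap belong to two intervals, which is what produces edges between nodes of the Mapper graph.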
- Process all outputs and produce summaries
```{bash}
python3 process_outputs.py
```
This file:
- Loads all outputs (from `outputs`).
- Excludes graphs with no significant features or duplicate graphs.
- Splits each graph into separate topological features and removes features with <5% or >95% of the sample.
- Derives required summaries for each feature. This includes homogeneity among feature members with respect to a pre-specified outcome.
- Ranks all features by homogeneity and selects the top N features.
- Visualises each top-ranked feature and outputs summaries to a spreadsheet.
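The size filter and ranking steps above can be sketched as follows. The data structure (a list of dicts with `members` and a precomputed `homogeneity` score) and the function name `select_top_features` are assumptions made for illustration, as is the convention that a higher score means a more homogeneous feature; `process_outputs.py` may represent features differently.

```python
def select_top_features(features, n_total, top_n=3,
                        min_frac=0.05, max_frac=0.95):
    """Drop features covering <5% or >95% of the sample, then rank."""
    kept = [f for f in features
            if min_frac <= len(f["members"]) / n_total <= max_frac]
    # Assumption: a higher score means a more homogeneous feature.
    ranked = sorted(kept, key=lambda f: f["homogeneity"], reverse=True)
    return ranked[:top_n]


# Hypothetical features from a sample of 100 participants.
features = [
    {"id": "A", "members": range(2), "homogeneity": 0.99},   # 2%: dropped
    {"id": "B", "members": range(30), "homogeneity": 0.70},
    {"id": "C", "members": range(50), "homogeneity": 0.85},
    {"id": "D", "members": range(98), "homogeneity": 0.90},  # 98%: dropped
]
top = select_top_features(features, n_total=100, top_n=2)
print([f["id"] for f in top])
```

Filtering before ranking matters here: features A and D have the highest scores but are excluded because they contain almost none or almost all of the sample.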
Parallel computing
The grid search can be time-consuming, especially as the number of parameter settings increases. Fortunately, this process can be straightforwardly parallelised, either using multiple cores on a local machine or using cluster computing.
On a single machine, using `parallel`:
```{bash}
python3 prepare_inputs.py
parallel --progress -j4 < jobs
```
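When `parallel` is given no command and reads from standard input, it runs each input line as a separate command. The format of the `jobs` file is not shown in this README, so the sketch below assumes one `test_single_graph.py` call per line, with four-digit, zero-padded identifiers (as in the `'0333'` example) numbered from 1. The function name `build_jobs` and the numbering scheme are assumptions for illustration.

```python
def build_jobs(n_param_sets, inputs="inputs", outputs="outputs", width=4):
    """Return one shell command per parameter set, with zero-padded IDs."""
    return [
        f"python3 test_single_graph.py '{i:0{width}d}' '{inputs}' '{outputs}'"
        for i in range(1, n_param_sets + 1)
    ]


jobs = build_jobs(3)
print("\n".join(jobs))
```

Writing these lines to a file named `jobs` would give an input suitable for the `parallel` call above.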
On a cluster:
```{bash}
#!/bin/bash
#SBATCH --tasks=8
#SBATCH --mem=4000
#SBATCH --job-name=array
#SBATCH --array=1-2000
#SBATCH --output=logs/%a.out
#SBATCH --time=0-72:00

# Zero-pad the array task ID to identify the parameter set.
count=$(printf "%03d" $SLURM_ARRAY_TASK_ID)
python3 test_single_graph.py $count "inputs" "outputs"
```
Owner
- Name: Department of Biostatistics & Health Informatics, King's College London
- Login: kcl-bhi
- Kind: organization
- Email: phidl-sysadmin@kcl.ac.uk
Citation (CITATION.cff)
# YAML 1.2
---
abstract: |
  "Background
  This paper exploits recent developments in topological data analysis to present a pipeline for clustering based on Mapper, an algorithm that reduces complex data into a one-dimensional graph.
  Results
  We present a pipeline to identify and summarise clusters based on statistically significant topological features from a point cloud using Mapper.
  Conclusions
  Key strengths of this pipeline include the integration of prior knowledge to inform the clustering process and the selection of optimal clusters; the use of the bootstrap to restrict the search to robust topological features; the use of machine learning to inspect clusters; and the ability to incorporate mixed data types. Our pipeline can be downloaded under the GNU GPLv3 license at https://github.com/kcl-bhi/mapper-pipeline."
authors:
  -
    affiliation: "Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK"
    family-names: Carr
    given-names: Ewan
  -
    affiliation: "Inria Sophia-Antipolis, DataShape Team, Biot, France"
    family-names: "Carrière"
    given-names: Mathieu
  -
    affiliation: "Ecole Centrale de Nantes, LMJL – UMR CNRS 6629, Nantes, France"
    family-names: Michel
    given-names: Bertrand
  -
    affiliation: "Inria Saclay, Ile-de-France, Alan Turing Building, Palaiseau, France"
    family-names: Chazal
    given-names: "Frédéric"
  -
    affiliation: "Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK"
    family-names: Iniesta
    given-names: Raquel
cff-version: "1.1.0"
date-released: 2021-09-20
doi: "10.1186/s12859-021-04360-9"
keywords:
  - "Topological data analysis"
  - Clustering
  - "Machine learning"
message: "If you use this software, please cite it using these metadata."
repository-code: "https://github.com/kcl-bhi/mapper-pipeline"
title: "Identifying homogeneous subgroups of patients and important features: a topological machine learning approach"
...
Dependencies
- gower ==0.0.5
- gudhi ==3.4.0
- ipython ==7.31.1
- matplotlib ==3.3.3
- networkx ==2.5
- numpy ==1.21.0
- pandas ==1.2.0
- pygraphviz ==1.6
- scikit-learn ==0.24.0
- statsmodels ==0.12.1
- tqdm ==4.56.0
- xgboost ==1.3.1
- xlsxwriter ==1.3.7