bhanu
Machine learning based classification of Parkinson's disease using photoplethysmography data
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
✓DOI references
Found 1 DOI reference(s) in README -
✓Academic publication links
Links to: arxiv.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (9.6%) to scientific vocabulary
Repository
Machine learning based classification of Parkinson's disease using photoplethysmography data
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Bhanu - Non-invasive Parkinson's Disease Detection Using Transformer-based Analysis of Photoplethysmography Signals
Abstract
This independent research project investigates the potential of photoplethysmography (PPG) signals as non-invasive biomarkers for Parkinson's Disease (PD) detection. By leveraging the MIMIC-III waveform and clinical databases and adapting and finetuning the HeartGPT architecture HeartGPT GitHub repository, I demonstrate the feasibility of using transformer-based deep learning models for analyzing physiological time series data in neurological disease detection.
Research Objectives
- Evaluate the efficacy of PPG signals as biomarkers for Parkinson's Disease
- Develop and validate a transformer-based deep learning approach for medical time series classification by adapting and fine-tuning HeartGPT
- Examine the effectiveness of selective fine-tuning by training the final five transformer blocks while keeping earlier layers frozen, testing whether HeartGPT's learned signal representations transfer to PD detection
- Implement a systematic approach for processing MIMIC-III waveform data, including patient matching, data loading and signal preprocessing pipelines suitable for deep learning applications
Methodology
Data Collection and Preprocessing
The study utilizes the MIMIC-III (Medical Information Mart for Intensive Care III) database. MIMIC-III comprises deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. Of particular interest to this study are the high-resolution physiological waveforms, specifically the PPG signals, recorded during patient stays. My preprocessing pipeline includes the following:
Patient Cohort Selection:
- Identification of PD patients using ICD-9 codes
- Demographic matching according to patient age and gender with control subjects
Signal Preprocessing:
- Extraction of 4-second PPG segments from waveform data
- Bandpass filtering (0.7 Hz - 10 Hz)
- Removal of segments that contain missing or NaN values
Data Transformation:
- Signal normalization (zero mean, unit variance)
- Min-max scaling to [0,1] range
- Tokenization into discrete values (0-100 -> 101 total tokens)
- Train/validation/test split with stratification
```bibtex @article{johnson2016mimic, title={MIMIC-III, a freely accessible critical care database}, author={Johnson, Alistair EW and Pollard, Tom J and Shen, Lu and Li-Wei, H Lehman and Feng, Mengling and Ghassemi, Mohammad and Moody, Benjamin and Szolovits, Peter and Celi, Leo Anthony and Mark, Roger G}, journal={Scientific data}, volume={3}, number={1}, pages={1--9}, year={2016}, publisher={Nature Publishing Group} }
@article{moody2020mimic, title={MIMIC-III Waveform Database (version 1.0)}, author={Moody, Benjamin and Moody, George and Villarroel, Mauricio and Clifford, Gari D and Silva, Ikaro}, journal={PhysioNet}, year={2020}, doi={10.13026/c2607m} } ```
Model Architecture
My approach builds upon the HeartGPT model, with some modifications for PD detection:
- Input Layer: Processes tokenized PPG sequences
- Transformer Backbone: 8 layers with 8 attention heads
- Custom Classification Head: For classifying PD from PPG signals
- Embedding Dimension: 64
- Sequence Length: 500 tokens (4-second PPG window sampled at 125 Hz)
The model employs a fine-tuning strategy where:
- Initial layers remain frozen, preserving learned physiological features
- Final five transformer blocks are fine-tuned
- New classification head is trained from scratch
bibtex
@misc{davies2024interpretablepretrainedtransformersheart,
title={Interpretable Pre-Trained Transformers for Heart Time-Series Data},
author={Harry J. Davies and James Monsen and Danilo P. Mandic},
year={2024},
eprint={2407.20775},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2407.20775},
}
Hardware recommendations
- RAM: 64 GB
- Storage: 128 GB SSD
- GPU: NVIDIA A100 or similar
- High-speed internet connection
Reproduction of results
Prerequisites
- Valid PhysioNet credentials
- MIMIC-III data use agreement
- Completed CITI training
- Python 3.10+
- For dependencies see requirements.txt
Installation
- Clone the repository:
bash
git clone https://github.com/degenfabian/Bhanu.git
cd Bhanu
- Install dependencies:
```bash pip install -r requirements.txt
```
(Optional) Configure hyperparameters in Config class in trainandeval.py
Download data:
```bash python download_data.py
```
- Preprocess data (takes ~1-2 hours):
```bash python preprocess_data.py
```
- Create model_weights directory inside root project folder
```bash mkdir model_weights
```
Download model weights from HeartGPT GitHub repository (PPGPT500kiters.pth) and put them in model_weights directory
Train and evaluate model:
```bash python trainandeval.py
```
Project Structure
bash
Bhanu/
├── data/
│ ├── waveform_data/
│ │ ├── PD/
│ │ └── non_PD/
│ ├── train_dataset.pt
│ ├── val_dataset.pt
│ └── test_dataset.pt
├── preprocess_data.py
├── model.py
├── train_and_eval.py
├── metrics.py
└── model_weights/
└── PPGPT_500k_iters.pth
Training and Evaluation
This model was trained in Google Colab using an A100 GPU. The split for the dataset was 70% for training, 10% for validation and the remaining 20% for testing. Around 5147.8 hours of PPG data is from patients with Parkinson's disease, amounting to 18.57 GB of data. For patients without Parkinson's disease there are around 4583.1 hours of PPG data, amounting to 16.53 GB of data. The total dataset size is therefore 35.1 GB.
Results and Discussion
Note: The model is currently undergoing training. This section will be updated with final results.
Performance Metrics
The model will be evaluated using the following metrics:
- Accuracy
- Sensitivity
- Specificity
- F1 Score
Limitations and Biases
Dataset-Specific Biases
Selection Bias
- MIMIC-III data comes exclusively from ICU/hospital settings, meaning all subjects (both PD and control) were ill enough to require hospitalization
- PD patients in the dataset may represent more severe or complicated cases than the general PD population
- Control subjects are not healthy individuals but other hospitalized patients, potentially confounding the analysis
Demographic Biases
- MIMIC-III data comes from a single medical center (Beth Israel Deaconess Medical Center)
- Geographic limitation to one region may not represent global population variations
- Potential socioeconomic biases based on hospital location and accessibility
Methodological Biases
Signal Processing Biases
- 4-second PPG segment selection may miss longer-term patterns
- Bandpass filtering could eliminate potentially relevant signal components
- Tokenization process may introduce quantization artifacts
Model Architecture Biases
- Transfer learning from HeartGPT may carry over biases from cardiac domain
- Frozen initial layers may retain inappropriate feature representations
Future work
- External validation on independent datasets
- Prospective validation studies in clinical settings
- Comparison with traditional PD diagnostic methods
- Assessment of model performance across different PD stages
Contact
Maintainer: [Fabian Degen] - [fabidegen@gmail.com]
For bugs and feature requests, please open an issue in this GitHub repository.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Research Disclaimer: This work is intended for research purposes only. The methods and findings presented here should not be used for clinical diagnosis without proper validation and regulatory approval.
Acknowledgments: I thank the PhysioNet team for providing access to the MIMIC-III database and the original HeartGPT authors for their spectacular work.
Owner
- Name: Fabian Degen
- Login: degenfabian
- Kind: user
- Repositories: 1
- Profile: https://github.com/degenfabian
Citation (CITATION.cff)
cff-version: 1.2.0 message: "If you use this software, please cite it as below." authors: - family-names: "Degen" given-names: "Fabian" orcid: "https://orcid.org/0009-0009-4824-1843" title: "Bhanu - Non-invasive Parkinson's Disease Detection Using Deep Learning Based Analysis of Photoplethysmography Signals" version: 1.0.0 date-released: 2025-01-09 url: "https://github.com/degenfabian/Bhanu"
GitHub Events
Total
- Push event: 27
- Create event: 4
Last Year
- Push event: 27
- Create event: 4
Dependencies
- numpy ==2.1.3
- pandas ==2.2.3
- scikit_learn ==1.5.2
- scipy ==1.14.1
- torch ==2.5.1
- tqdm ==4.66.6
- wfdb ==4.1.2