zimmermann_ramanmachinelearning
Tool for comparing machine-learning-based classification methods for Raman spectra, developed during the master's thesis "Classification of Cell Systems using Raman Spectroscopy and Machine Learning"
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.0%) to scientific vocabulary
Repository
Statistics
- Stars: 3
- Watchers: 1
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
RamanMachineLearning
This repository contains the machine-learning-based data analysis workflow for Raman spectra developed during the master's thesis "Classification of Cell Systems using Raman Spectroscopy and Machine Learning" by Daniel Zimmermann.
Setup
After downloading this tool, first adjust the contents of raman_ml.conf according to your needs. The following tables list the parameters that can be adjusted.
General Parameters
| Parameter | Description |
| --------- | ----------- |
| FILE_PREFIX | Filename prefix that will be used for all output files |
| N_TRIALS | Number of randomized repetitions for cross-validation |
| N_FOLDS | Number of folds for (nested) k-fold cross-validation |
| N_CORES | Number of CPU cores that will be used (-1 for all available) |
| SCORING | Performance metrics to be calculated during cross-validation |
| CONDA_DIR | Directory of your conda installation |
| ENV_NAME | Name that will be used for the conda environment |
| DIR1/DIR2 | Directories where the individual spectra are stored |
| LAB1/LAB2 | Class labels corresponding to DIR1/DIR2 |
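As a rough illustration of how N_TRIALS, N_FOLDS, N_CORES and SCORING could interact, the following is a minimal sketch of repeated nested k-fold cross-validation with scikit-learn. It is not the code used by this tool: the synthetic data, the logistic regression pipeline, the parameter grid and the seeding scheme are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical values mirroring the names used in raman_ml.conf.
N_TRIALS = 3
N_FOLDS = 5
N_CORES = 1
SCORING = ["accuracy", "f1"]

# Synthetic two-class data standing in for preprocessed spectra (assumption).
X, y = make_classification(n_samples=120, n_features=30, n_informative=5,
                           random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": np.logspace(-2, 2, 5)}

results = {name: [] for name in SCORING}
for trial in range(N_TRIALS):
    # A new random split per trial gives the "randomized repetitions".
    inner = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=trial)
    outer = StratifiedKFold(n_splits=N_FOLDS, shuffle=True,
                            random_state=trial + 1000)
    # Inner loop: hyperparameter search. Outer loop: performance estimate.
    search = GridSearchCV(model, param_grid, cv=inner, scoring="accuracy",
                          n_jobs=N_CORES)
    scores = cross_validate(search, X, y, cv=outer, scoring=SCORING)
    for name in SCORING:
        results[name].append(scores[f"test_{name}"].mean())

for name, values in results.items():
    print(f"{name}: {np.mean(values):.3f} +/- {np.std(values):.3f}")
```

Keeping the hyperparameter search inside the outer loop means that the reported scores come from data the search never saw, which is the point of nesting.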
QC/Preprocessing Parameters
| Parameter | Description |
| --------- | ----------- |
| QC_LIM_LOW/HIGH | Wavenumber range that should be considered for quality control |
| QC_WINDOW | Window size for the Savitzky-Golay filter during quality control |
| QC_THRESHOLD | Minimum value of the derivative spectrum for a peak to be detected |
| QC_MIN_HEIGHT | Minimum height for a peak to be detected |
| QC_SCORE | How the intensity of the spectrum influences the quality score. 0 - None, 1 - Median peak height, 2 - Mean peak height, 3 - Mean area, 4 - Total area |
| QC_PEAKS | How the number of peaks influences the quality score. 0 - None, 1 - Linear, 2 - Quadratic |
| QC_NUM | How many spectra to keep from each class |
| PREP_LIM_LOW/HIGH | Wavenumber range that will be retained during preprocessing |
| PREP_WINDOW | Window size for the Savitzky-Golay filter during preprocessing |
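To give a feeling for what QC_WINDOW, QC_THRESHOLD and QC_MIN_HEIGHT control, here is a minimal sketch of Savitzky-Golay smoothing followed by a simple peak check using SciPy. It is only one possible reading of the parameters, not the tool's actual quality-control routine: the synthetic spectrum, the polynomial order of 3 and the rule "the derivative must exceed the threshold on the rising flank" are assumptions.

```python
import numpy as np
from scipy.signal import find_peaks, savgol_filter

# Hypothetical settings named after the entries in raman_ml.conf.
QC_WINDOW = 21        # filter window in data points (must be odd here)
QC_THRESHOLD = 0.01   # minimum first-derivative value near a peak
QC_MIN_HEIGHT = 0.3   # minimum peak height

rng = np.random.default_rng(0)
wavenumbers = np.linspace(400, 1800, 701)  # 2 cm^-1 spacing


def gaussian(x, center, width, height):
    """Gaussian band used to build a synthetic spectrum."""
    return height * np.exp(-0.5 * ((x - center) / width) ** 2)


# Synthetic spectrum with two bands plus noise (assumption, not real data).
spectrum = (gaussian(wavenumbers, 800, 10, 1.0)
            + gaussian(wavenumbers, 1450, 12, 0.6)
            + rng.normal(0, 0.005, wavenumbers.size))

# Smoothed spectrum and smoothed first derivative (per data point).
smoothed = savgol_filter(spectrum, window_length=QC_WINDOW, polyorder=3)
derivative = savgol_filter(spectrum, window_length=QC_WINDOW, polyorder=3,
                           deriv=1)

# Candidate peaks must be high enough in the smoothed spectrum ...
candidates, _ = find_peaks(smoothed, height=QC_MIN_HEIGHT)

# ... and show a sufficiently steep rising flank in the derivative.
half = QC_WINDOW // 2
peaks = [int(p) for p in candidates
         if derivative[max(p - half, 0):p + 1].max() >= QC_THRESHOLD]

print("Peak positions (cm^-1):", wavenumbers[peaks])
```

A larger window suppresses more noise but also flattens narrow bands, so the window size and the two thresholds are best tuned together.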
Classification Model Parameters
Enter the value ranges here that will be optimized during cross-validation. Some value ranges are log-scaled. For these, the entered value is the base-10 exponent of the actual parameter value (see also numpy.logspace). For a more detailed description of each parameter, refer to the scikit-learn documentation.
| Parameter | Description |
| --------- | ----------- |
| PCA_COMP | Range of PCA components to test. Format: (min max+1 stepsize) |
| NMF_COMP | Range of NMF components to test. Format: (min max+1 stepsize) |
| FA_CLUST | Range of the number of clusters to test in Feature Agglomeration-LDA. Format: (min max+1 stepsize) |
| PEAK_DIST | Range of the minimum peak distance to test in Peak-LDA. Format: (min max+1 stepsize) |
| LR1_C | Range of values for the regularization parameter C to test in logistic regression (l1). Logarithmic. Format: (min max n_steps) |
| LR2_C | Range of values for C to test in logistic regression (l2). Logarithmic. Format: (min max n_steps) |
| SVM1_C | Range of values for C to test in SVM (l1). Logarithmic. Format: (min max n_steps) |
| SVM2_C | Range of values for C to test in SVM (l2). Logarithmic. Format: (min max n_steps) |
| DT_ALPHA | Range of values for the cost-complexity pruning parameter α to test in the decision tree model. Logarithmic. Format: (min max n_steps) |
| RF_FEATURE_SAMPLE | Range of the feature subsample parameter in the random forest model. Format: (min max n_steps) |
| GBDT_LEARNING_RATE | Range of the learning rate parameter in the gradient-boosting model. Format: (min max n_steps) |
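The two range formats can be illustrated with a short Python sketch. It assumes that the "(min max+1 stepsize)" format behaves like a half-open integer range and that the logarithmic format is passed to numpy.logspace as exponents. The helper names and example values are hypothetical and are not taken from the tool itself.

```python
import numpy as np


def linear_range(spec):
    """Expand a (min, max+1, stepsize) triple into integer candidates."""
    start, stop, step = spec
    return list(range(start, stop, step))  # stop is exclusive, hence max+1


def log_range(spec):
    """Expand a (min, max, n_steps) triple of base-10 exponents."""
    low, high, n_steps = spec
    return np.logspace(low, high, n_steps)


# Hypothetical example values, not the defaults shipped with the tool.
pca_components = linear_range((1, 11, 1))  # tests 1 to 10 components
lr_c_values = log_range((-2, 2, 5))        # tests C from 10^-2 to 10^2

print(pca_components)
print(lr_c_values)
```

Under these assumptions, entering -2 and 2 for a logarithmic parameter spans four orders of magnitude, so small changes to the exponents widen or narrow the search considerably.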
To run the workflow, execute the file run.sh from the terminal and follow the instructions on screen.
Examining Results
To examine the result of each ML method, IPython notebooks are provided in the notebooks directory.
inspect_data.ipynb can be used to take a look at the raw spectra.
inspect_results_full.ipynb shows the following results for each model:
- Table of scoring metrics
- Validation curve (if applicable)
- Distribution of optimal parameter values
- Confidence scores or probabilities per class
- Model coefficients (if applicable)
- Confusion matrix
- ROC curve
- Shapley values for interpretation (only random forest & gradient-boosting)
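For readers who want to reproduce two of these outputs outside the notebooks, the following minimal sketch computes a confusion matrix and ROC data from cross-validated class probabilities with scikit-learn. It is not the notebook code: the synthetic data, the logistic regression model and the 0.5 decision threshold are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic two-class data standing in for preprocessed spectra (assumption).
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Out-of-fold probability of the positive class for every sample.
proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
pred = (proba >= 0.5).astype(int)

cm = confusion_matrix(y, pred)           # rows: true class, columns: predicted
fpr, tpr, thresholds = roc_curve(y, proba)
auc = roc_auc_score(y, proba)

print(cm)
print(f"ROC AUC: {auc:.3f}")
```

Because every prediction comes from a fold in which the sample was held out, the matrix and the curve reflect performance on unseen data rather than training fit.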
Owner
- Name: Daniel
- Login: DerZ115
- Kind: user
- Repositories: 1
- Profile: https://github.com/DerZ115
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: RamanMachineLearning
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Daniel
    family-names: Zimmermann
identifiers:
  - type: url
    value: >-
      https://github.com/FHWNTulln/Zimmermann_RamanMachineLearning
    description: Github Repository
license: GPL-3.0
version: '1.0'
GitHub Events
Total
- Push event: 1
Last Year
- Push event: 1
Dependencies
- jupyter 1.0.0.*
- lightgbm 3.3.2.*
- matplotlib 3.5.3.*
- mlxtend 0.20.0.*
- natsort 8.1.0.*
- numpy 1.22.4.*
- pandas 1.4.3.*
- pybaselines 0.8.0.*
- python 3.10.6.*
- python-graphviz 0.20.*
- scikit-learn 1.1.2.*
- scipy 1.9.0.*
- seaborn 0.11.2.*
- shap 0.41.0.*
- tqdm 4.64.0.*