automated-robust-and-reproducible-voice-pathology-detection
https://github.com/aailab-uct/automated-robust-and-reproducible-voice-pathology-detection
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
✓DOI references
Found 6 DOI reference(s) in README -
✓Academic publication links
Links to: zenodo.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (14.8%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: aailab-uct
- License: gpl-3.0
- Language: Python
- Default Branch: main
- Size: 71.5 MB
Statistics
- Stars: 1
- Watchers: 0
- Forks: 1
- Open Issues: 0
- Releases: 3
Metadata Files
README.md
Automated Robust and Reproducible Voice Pathology Detection
This repository contains the code for the paper "Reproducible Machine Learning-based Voice Pathology Detection: Introducing the Methodology and the Pitch Difference Feature" by Jan Vrba, Jakub Steinbach, Tomáš Jirsa, Laura Verde, Roberta De Fazio, Zuzana Urbániová, Martin Chovanec, Yuwen Zeng, Kei Ichiji, Lukáš Hájek, Zuzana Sedláková, Jan Mareš, Noriyasu Homma. DOI: 10.1016/j.jvoice.2025.03.028
Requirements
For running experiments - prepared dataset (see below) - reproducing the results is computationally demanding, thus we suggest to use as many CPUs as possible
Used libraries and software
- Python 3.12.3
- see requirements.txt for all dependencies
- we recommend using virtual environment and using pip install -r requirements.txt to install all requirements
- if you need to speed up the computation, using pip install -r requirements.txt --no-binary numpy,scipy could help
Used setup for experiments - 2x AMD EPYC 9374F - 64 GB RAM - Ubuntu 24.04 LTS
Notes on reproducibility
We tried to make the results as reproducible as possible. Althoug we employed many approaches to ensure the reproducibility, it is still possible that the results will not be exactly the same, mainle due to the floating point arithmetic. We provide the checksums for all files that are generated during the experiments. The checksums are stored in the misc folder. The checksums are computed using the code in src/checksum.py, however, they can be computed using 7-zip or sha256sum utility.
When running the main.py script, the script checks the checksums of the downloaded data, the extracted features, the generated datasets, the results of the classifiers, the average results, and the results of the repeated stratified cross-validation.
Beware that git on Windows can change the line endings from LF to CRLF. This can cause the checksums to fail. Most probably you only need to change EOL of misc\svd_information.csv. Additionaly, check this link on how to configure git to handle this problem. The script should save every csv file with LF as EOL.
Additionaly we provide our_results_tables and our_results_xvalidation folders that contain the results of our experiments. We do not include the results folder as it is too large to be stored in the repository, even when compressed.
Dataset preparation
The dataset is not included in this repository due to the license reason, but it can be downloaded from publicly
available website. Firstly download the Saarbruecken Voice Database (SVD)
available here. You need to download all recordings of /a/ vowel
produced at normal pitch that are encoded as wav files. Due to the limitation of SVD interface, download male recordings
and female recordings separately. Then create the svd_db folder in the root of this project and put all recordings
there.
At this step we assume following folder structure:
svm_voices_research
└───misc
└───src
└───svd_db
│ 1-a_n.wav
│ 2-a_n.wav
│ ...
│ 2610-a_n.wav
We provide the svd_information.csv file that contains the information about the SVD database (age, sex, pathologies, etc.). The file is stored in the misc folder and contains data scraped from the SVD website.
When running the main.py script, the script checks the checksums of the downloaded data to ensure that the dataset is complete and consistent.
Reproducing the results
After successfully downloading the whole SVD our results can be reproduced by running the main.py script, executing
the following command
python main.py
Additionally, you can specify the number of threads used for dataset generation by setting the MAX_WORKERS variable in the main.py script. The default value is 24. Moreover, you can run individual steps of the pipeline by running the corresponding script. These scripts usually allow to specify sex and classifier type you want to run. However, the main.py script is the most convenient way to reproduce the results, provides checking of the checksums of the generated files, and was tested to work as expected.
This script automatically performs following tasks from the correspongind scripts:
- checking downloaded data consistency
During this step, it is checked if your downloaded files from SVD corresponds with our dataset.
This is achieved via comparison of the checksum of your data with the checksum in the misc/data_usd.sha256.
data_preprocessing.py- removing the corrupted files and duplicities and triming silence
We remove corrupted files, undesired pathologies as well as singers and duplicities from svd_db and create a datasets folder that contains SVD database wav files with trimmed silences. The resulting datasets is checked using the checksum stored in a misc/after_I.sha256.
feature_extraction.py- extracting features from preprocessed data
We employ many feature extraction methods to extract features from the preprocessed data as described in the article.
The resulting features.csv file should have the same checksum as the checksum stored in the misc/after_II.sha256 file.
dataset_generator.py- generating all datasets
The resulting training_data folder should contain two subfolders training_data/men, training_data/women and training_data/both. Each of those subfolders should contain 20480 subfolders with pickled dataset dataset.pk and its configuration stored in configuration.json. The checksum of resulting dataset folders is checked and should be same as the provided checksum in misc/after_III.sha256 file. The dataset generating can be sped up by setting the number of threads in MAX_WORKERS variable (see line 27 in main.py).
classifier_pipeline.py- running the machine learning pipeline.
This step is computationally extremely demanding and it took more than several weeks to finish on the recommended setup. The number of cores (or CPUs) can
significantly reduce the computation time!. The result is the results folder containing subfolders for each classifier with men, women and both that contains the subfolders with results for all datasets generated in previous step. The results are saved into the results.csv files. All files with results are checking across the checksums provided in misc/after_IV.sha256 file.
average_results.py- computing average results
In this step the computation of average results across all parameters for each classifier and each dataset is done.
The resulting folder results_tables should contain csv files for each classifier and sex. The csv file contains
the average metrics for each dataset (average across all hyperparameters evaluated during the grid search performed
in the previous step 5.). The resulting files are checked across the checksums provided in misc/after_V.sha256 file.
repeated_cross_validation_best.py- repeated stratified cross-validation of the best classifiers
This step is computationally demanding. For each type of classifier, we select 1000 the best classifiers
(configuration of hyperparameters and dataset) and perform repeated (100 times) stratified 10-fold cross-validation.
The results are saved in the results_xvalidation folder, that contains subfolders named according to classifiers names.
I each of those subfolders is a xvalidation_results.csv file with the performance of the classifiers during the
above mentioned process. The results of this step are checked across the checksums provided
in misc/after_VI.sha256 file.
Description of files in this repository:
- main.py - script to reproduce all our results
- requirements.txt - list of used packages
- analyze_results.py - script to check the best results produced by
classifier_pipeline.py - average_results.py - script to compute the average metrics for single dataset (run after *_pipeline.py) is finished
- classifier_configs.py - configuration of classifiers pipelines with corresponding hyperparameters specification for the grid search
- classifier_pipeline.py - grid search for classifiers specified in
classifier_configs.pyon all generated datasets - data_preprocessing.py - script that removes corrupted files and duplicities
- dataset_generator.py - script that generates datasets from preprocessed data
- feature_extraction.py - script that extracts features from preprocessed data
- README.md - this file
- repeatedcrossvalidation_best.py - script that performs repeated stratified cross-validation of the best classifiers
- src - folder with additional code
- checksum.py - script that computes checksums of the files
- custom_smote.py - script with custom SMOTE implementation
- misc - folder with additional files
- after_I.sha256 - checksum of the
datasetsfolder after thedata_preprocessing.pyscript - after_II.sha256 - checksum of the
features.csvfile after thefeature_extraction.pyscript - after_III.sha256 - checksum of the
training_datafolder after thedataset_generator.pyscript - after_IV.sha256 - checksum of the
resultsfolder after theclassifier_pipeline.pyscript - after_V.sha256 - checksum of the
results_tablesfolder after theaverage_results.pyscript - after_VI.sha256 - checksum of the
results_xvalidationfolder after therepeated_cross_validation_best.pyscript - data_used.sha256 - checksum of the downloaded data
- listofexcluded_files.csv - list of excluded files from the SVD database with the reason of exclusion
- svd_information.csv - information about the SVD database
- after_I.sha256 - checksum of the
- article_tables - folder with code to generate tables used in the article
- generatebestdatasetresulttable.py - script that handles generation of the latex table for the best performing datasets for each classifier and sex
- generatetablesfortopresults.py - script for generating a latex table with the best results and their configurations
- tablebestdatasets_template.txt - latex template
- ourresultstables - our results after running the
average_results.pyscript - ourresultsxvalidation - our results after running the
repeated_cross_validation_best.pyscript
Owner
- Name: aailab-uct
- Login: aailab-uct
- Kind: organization
- Repositories: 1
- Profile: https://github.com/aailab-uct
Citation (CITATION.cff)
cff-version: 1.2.0
title: >-
Automated Robust and Reproducible Voice Pathology
Detection
message: >-
If you use this code, we would appreciate a citation to
our paper.
type: software
authors:
- given-names: Jan
family-names: Vrba
email: jan.vrba@vscht.cz
affiliation: >-
Department of Mathematics, Informatics, and
Cybernetics, University of Chemistry and Technology,
Prague, Prague, Czech Republic and Department of
Radiological Imaging and Informatics, Tohoku
University Graduate School of Medicine, Sendai, Japan
orcid: 'https://orcid.org/0000-0002-3538-7124'
- given-names: Jakub
family-names: Steinbach
email: jakub.steinbach@vscht.cz
affiliation: >-
Department of Mathematics, Informatics, and
Cybernetics, University of Chemistry and Technology,
Prague, Prague, Czech Republic
orcid: 'https://orcid.org/0000-0002-4152-2177'
- given-names: Tomáš
family-names: Jirsa
email: tomas.jirsa@vscht.cz
affiliation: >-
Department of Mathematics, Informatics, and
Cybernetics, University of Chemistry and Technology,
Prague, Prague, Czech Republic
orcid: 'https://orcid.org/0000-0001-5228-9596'
- orcid: 'https://orcid.org/0000-0003-2422-1732'
affiliation: >-
Department of Mathematics and Physics, University of
Campania ”Luigi Vanvitelli”, Caserta, Italy
email: laura.verde@unicampania.it
given-names: Laura
family-names: Verde
- given-names: Roberta
family-names: De Fazio
affiliation: >-
Department of Mathematics and Physics, University of
Campania ”Luigi Vanvitelli”, Caserta, Italy
email: roberta.defazio@unicampania.it
orcid: 'https://orcid.org/0000-0002-0271-132X'
license: GPL-3.0
GitHub Events
Total
- Release event: 1
- Watch event: 5
- Member event: 3
- Push event: 1
- Fork event: 1
- Create event: 1
Last Year
- Release event: 1
- Watch event: 5
- Member event: 3
- Push event: 1
- Fork event: 1
- Create event: 1