https://github.com/aailab-uct/feature-scaling-leakge-in-vpd

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 3 DOI reference(s) in README
✓ Academic publication links: Links to zenodo.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.0%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: aailab-uct
License: agpl-3.0
Language: Jupyter Notebook
Default Branch: main
Size: 153 KB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 1

Created about 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme License

Feature Scaling Induced Data Leakage Quantification in Machine-learning Based Voice Pathology Detection

This repository contains the code for the paper "Feature Scaling Induced Data Leakage Quantification in Machine-learning Based Voice Pathology Detection" by Jan Vrba, Jakub Steinbach, Tomáš Jirsa and Noriyasu Homma. DOI: XYZ
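
As a rough illustration of the kind of leakage the title refers to, the sketch below standardizes a test split twice: once with statistics computed on the whole dataset (leaky) and once with statistics computed on the training split only. It is a minimal standard-library example on synthetic numbers, not the authors' pipeline; the helper names `standardize` and `scaling_stats` and the 150/50 split are assumptions made for illustration.

```python
import random
import statistics


def standardize(values, mean, std):
    """Apply z-score scaling with the given statistics."""
    return [(v - mean) / std for v in values]


def scaling_stats(values):
    """Return the mean and population standard deviation of a sample."""
    return statistics.fmean(values), statistics.pstdev(values)


rng = random.Random(0)
# Synthetic one-dimensional "feature"; not taken from the SVD dataset.
feature = [rng.gauss(10.0, 3.0) for _ in range(200)]
train, test = feature[:150], feature[150:]

# Leaky: the scaling statistics are computed on train and test together.
leaky_mean, leaky_std = scaling_stats(feature)
# Leakage-free: the scaling statistics come from the training split only.
clean_mean, clean_std = scaling_stats(train)

test_leaky = standardize(test, leaky_mean, leaky_std)
test_clean = standardize(test, clean_mean, clean_std)

# The two scaled test sets differ because the leaky statistics
# were computed with the test samples included.
max_shift = max(abs(a - b) for a, b in zip(test_leaky, test_clean))
print(f"largest difference between scaled test values: {max_shift:.4f}")
```

Whether such a shift changes a classifier's reported performance depends on the data and the model.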

Requirements

Used libraries and software:
- Python 3.13.2
- See requirements.txt for all dependencies.
- We recommend using a virtual environment and running pip install -r requirements.txt to install all requirements.

Setup used for the experiments:
- AMD Ryzen 9 5900X
- 112 GB RAM
- 1 TB SSD
- Ubuntu 24.04.2 LTS

Dataset preparation

The SVD dataset is not included in this repository for licensing reasons, but it can be downloaded from a publicly available website. Please follow the instructions in our repository available here.

Once the features.csv file is generated, place it in the data folder and run flatten_features.py to generate the flattened_features.csv file.

For all subsequent steps, we assume the following directory structure:

vpd_scaling_leakage_study
└───data
    │   features.csv
    │   flattened_features.csv
    │   voiced_features_8000_fft.csv
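
Before running the experiments, it may help to confirm that the layout above is in place. The following is a small sketch using only the standard library; the function name `missing_data_files` is hypothetical and not part of the repository, and the file names are the ones listed in the directory structure.

```python
from pathlib import Path

# File names are the ones listed in the README's directory layout.
EXPECTED_FILES = (
    "features.csv",
    "flattened_features.csv",
    "voiced_features_8000_fft.csv",
)


def missing_data_files(project_root):
    """Return the expected data files that are absent under <root>/data."""
    data_dir = Path(project_root) / "data"
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]
```

Calling `missing_data_files(".")` from the project root returns an empty list when all three files are present.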

Reproducing the results

After data preparation, run main.py to run the calculations. The results will be saved as JSON files in four folders, named XXX_results for the randomly split data and XXX_results_stratified for the stratified split. Note that XXX represents the database name.
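
To inspect the output outside the provided notebooks, the result folders can be read with a few lines of Python. This is only a sketch: the helper names `find_result_dirs` and `load_results` are hypothetical, the folders are assumed to sit directly under the project root, and nothing is assumed about the structure of the JSON files beyond their being valid JSON.

```python
import json
from pathlib import Path


def find_result_dirs(project_root):
    """List folders named like XXX_results or XXX_results_stratified."""
    root = Path(project_root)
    suffixes = ("_results", "_results_stratified")
    return sorted(
        p for p in root.iterdir()
        if p.is_dir() and p.name.endswith(suffixes)
    )


def load_results(results_dir):
    """Load every *.json file in a results folder, keyed by file stem."""
    results = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            results[path.stem] = json.load(handle)
    return results
```

For example, `{d.name: load_results(d) for d in find_result_dirs(".")}` collects everything into one nested dictionary.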

You can utilize the result_tables.ipynb notebook to generate results in the form of Tables 3 to 7. Similarly, you can utilize the permutation_test_bias.ipynb notebook to conduct the permutation test of statistical significance of bias for each dataset-transformer-model-split combination.
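
The exact procedure is defined in permutation_test_bias.ipynb and is not reproduced here. For readers unfamiliar with the idea, the sketch below shows one common variant, a paired sign-flip permutation test. It assumes that "bias" is measured as the per-run difference between a score obtained with leaky scaling and the score obtained with leakage-free scaling; that definition, the function name, and the default parameters are assumptions for illustration only.

```python
import random


def sign_flip_permutation_test(differences, n_permutations=10_000, seed=0):
    """Two-sided p-value for H0: the mean paired difference is zero."""
    rng = random.Random(seed)
    n = len(differences)
    observed = abs(sum(differences) / n)
    hits = 0
    for _ in range(n_permutations):
        # Under H0 each difference is equally likely to have either sign.
        permuted = sum(d if rng.random() < 0.5 else -d for d in differences) / n
        if abs(permuted) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)
```

A small p-value suggests that the observed mean difference is unlikely under the null hypothesis of no systematic bias.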

Owner

Name: aailab-uct
Login: aailab-uct
Kind: organization

Repositories: 1
Profile: https://github.com/aailab-uct

GitHub Events

Total

Release event: 1
Push event: 1
Create event: 2

Last Year

Release event: 1
Push event: 1
Create event: 2

Dependencies

requirements.txt pypi

joblib ==1.5.0
numpy ==2.2.5
pandas ==2.2.3
python-dateutil ==2.9.0.post0
pytz ==2025.2
scikit-learn ==1.6.1
scipy ==1.15.3
six ==1.17.0
threadpoolctl ==3.6.0
tqdm ==4.67.1
tzdata ==2025.2
