multicolumncorruption
Repository
Basic Info
- Host: GitHub
- Owner: ptsialis
- License: apache-2.0
- Language: Jupyter Notebook
- Default Branch: main
- Size: 22.8 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Source Code for Master's Thesis: Assessing and Predicting the Optimal Imputation Method Regarding the Predictive Performance of Machine Learning Models
Target
An optimization approach for decision making in data science based on benchmarking the impact of different data imputation methods on the predictive performance of machine learning models.
Disclaimer
This is a research project and is not intended for production use.
This master's thesis builds on the work of Sebastian Jäger, Arndt Allhorn and Felix Bießmann, as described in their paper: "A Benchmark for Data Imputation Methods" https://www.frontiersin.org/article/10.3389/fdata.2021.693674
Installation
Steps to set up the required conda environment:
- create an environment `Data-Imputation-Thesis` with conda

```bash
conda env create -f environment.yaml
```

- activate the new environment

```bash
conda activate Data-Imputation-Thesis
```

- install `jenga`

```bash
cd src/jenga
python setup.py develop
```

- install `data-imputation-paper`

```bash
cd ../..
python setup.py develop # or `install`
```

It might be necessary to install the required GPU drivers manually (versions might change based on the hardware used):

```bash
conda install -c conda-forge cudatoolkit=11.7.0
pip install nvidia-cudnn-cu11==8.6.0.163
```

Activate the packages every time you activate the environment:

```bash
CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib
```

See https://www.tensorflow.org/install/pip
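After the GPU setup above, it can help to confirm that TensorFlow actually sees a GPU before starting long experiments. The following is a minimal sketch, not part of this repository: the helper name `visible_gpus` is hypothetical, and it relies only on TensorFlow's `tf.config.list_physical_devices`.

```python
def visible_gpus():
    """Return the names of the GPUs TensorFlow can see.

    Returns None when TensorFlow cannot be imported, e.g. because the
    conda environment has not been activated. (Hypothetical helper,
    not part of this repository.)
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # An empty list means TensorFlow is installed but found no usable GPU.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not importable in this environment.")
    elif not gpus:
        print("No GPU visible; check LD_LIBRARY_PATH and the CUDA/cuDNN packages.")
    else:
        print("Visible GPUs:", ", ".join(gpus))
```

An empty result usually points to the `LD_LIBRARY_PATH` export not having been run in the current shell.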
Usage
- Imputation Experiments: execute `run-experiment.py` with the required settings (explained below). The experiment name must contain `corrupted`.
- Baseline Experiments: execute `run-experiment.py` with the required settings (explained below). The experiment name is not allowed to contain `corrupted`.
- Imputation Experiments: execute `run-experiment-subset.py` with the required settings (explained below). The experiment name must contain `corrupted`.
- Baseline Experiments: execute `run-experiment-subset.py` with the required settings (explained below). The experiment name is not allowed to contain `corrupted`.

Example of starting an experiment

Starting an experiment requires the ID of the dataset (`737`), the imputation method (`mode`), the experiment name (`test_experiment`), the missing fractions (`0.3`, `0.5`), the missing patterns (`MAR`, `MCAR`), the strategies (`single_single`), the number of repetitions (`3`), and a path to the storage folder for results (`../results`).

```bash
python run-experiment.py 737 mode test_experiment --missing-fractions 0.3,0.5 --missing-types MAR,MCAR --strategies single_single --num-repetitions 3 --base-path ../results
```
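When launching many runs, the command line above can be assembled programmatically. The sketch below is illustrative only: `build_command` and `is_imputation_experiment` are hypothetical helpers, not functions from this repository, and the naming check assumes the `corrupted` rule is a plain substring test. The flags are the ones shown in the example command.

```python
import shlex


def is_imputation_experiment(experiment_name):
    """Naming rule from the README: imputation experiments contain
    'corrupted' in their name, baseline experiments must not.
    (Assumption: the rule is a plain substring check.)"""
    return "corrupted" in experiment_name


def build_command(dataset_id, imputer, experiment_name, missing_fractions,
                  missing_types, strategies, num_repetitions, base_path,
                  script="run-experiment.py"):
    """Assemble the argument list for one experiment run."""
    return [
        "python", script, str(dataset_id), imputer, experiment_name,
        # List-valued options are passed as comma-separated strings.
        "--missing-fractions", ",".join(str(f) for f in missing_fractions),
        "--missing-types", ",".join(missing_types),
        "--strategies", ",".join(strategies),
        "--num-repetitions", str(num_repetitions),
        "--base-path", base_path,
    ]


command = build_command(
    737, "mode", "test_experiment",
    missing_fractions=[0.3, 0.5],
    missing_types=["MAR", "MCAR"],
    strategies=["single_single"],
    num_repetitions=3,
    base_path="../results",
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the directory that contains the script. Under the substring assumption, the name `test_experiment` would count as a baseline experiment, because it does not contain `corrupted`.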
Note from Sebastian Jäger
This project has been set up using PyScaffold 3.2.2 and the dsproject extension 0.4. For details and usage information on PyScaffold see https://pyscaffold.org/.
Owner
- Login: ptsialis
- Kind: user
- Repositories: 1
- Profile: https://github.com/ptsialis
Citation (CITATION.bib)
@ARTICLE{optimal_imputation_dittrich_2023,
AUTHOR={Dittrich, Pascal},
TITLE={Assessing and Predicting the Optimal Imputation Method Regarding the Predictive Performance of Machine Learning Models},
YEAR={2023},
MONTH={May},
URL={-},
ABSTRACT={-}
}
GitHub Events
Total
- Push event: 1
Last Year
- Push event: 1
Dependencies
- python 3.8.8 build
- cookiecutter *
- django *
- flake8 *
- pyscaffoldext-custom-extension *
- pyscaffoldext-dsproject *
- pyscaffoldext-markdown *
- pyscaffoldext-pyproject *
- pytest *
- pytest-cov *
- pytest-fixture-config *
- pytest-shutil *
- pytest-virtualenv *
- pytest-xdist *
- setuptools >=38.3
- sphinx *
- autokeras *
- flake8 *
- flake8-mypy *
- jedi ==0.17.2
- openml *
- pydocstyle *
- tensorflow *
- typer *
- autokeras *
- flake8 *
- flake8-mypy *
- jedi ==0.17.2
- openml *
- plotly ==5.14.1
- pydocstyle *
- tensorflow ==2.10
- tensorflow-text ==2.10
- typer *
- absl-py ==2.1.0
- accelerate ==0.21.0
- aiohttp ==3.9.5
- aiohttp-cors ==0.7.0
- aiosignal ==1.3.1
- annotated-types ==0.7.0
- antlr4-python3-runtime ==4.9.3
- astunparse ==1.6.3
- async-timeout ==4.0.3
- autogluon ==1.1.1
- autogluon-common ==1.1.1
- autogluon-core ==1.1.1
- autogluon-features ==1.1.1
- autogluon-multimodal ==1.1.1
- autogluon-tabular ==1.1.1
- autogluon-timeseries ==1.1.1
- autokeras ==1.1.0
- blis ==0.7.11
- boto3 ==1.34.144
- botocore ==1.34.144
- catalogue ==2.0.10
- catboost ==1.2.5
- category-encoders ==2.6.3
- cloudpathlib ==0.18.1
- cloudpickle ==3.0.0
- coloredlogs ==15.0.1
- colorful ==0.5.6
- confection ==0.1.5
- cymem ==2.0.8
- dask ==2023.5.0
- datasets ==2.20.0
- dill ==0.3.8
- distributed ==2023.5.0
- dm-tree ==0.1.8
- evaluate ==0.4.2
- fastai ==2.7.15
- fastcore ==1.5.54
- fastdownload ==0.0.7
- fastprogress ==1.0.3
- flake8 ==7.1.0
- flake8-mypy ==17.8.0
- flatbuffers ==24.3.25
- frozenlist ==1.4.1
- fsspec ==2024.6.1
- gast ==0.4.0
- gdown ==5.2.0
- gluonts ==0.15.1
- google-api-core ==2.19.1
- google-auth ==2.30.0
- google-auth-oauthlib ==0.4.6
- google-pasta ==0.2.0
- googleapis-common-protos ==1.63.2
- grpcio ==1.64.1
- h5py ==3.11.0
- huggingface-hub ==0.23.4
- humanfriendly ==10.0
- hyperopt ==0.2.7
- imagecorruptions ==1.1.2
- imageio ==2.34.1
- imgaug ==0.4.0
- jedi ==0.17.2
- jmespath ==1.0.1
- keras ==2.10.0
- keras-core ==0.1.5
- keras-nlp ==0.6.1
- keras-preprocessing ==1.1.2
- keras-tuner ==1.4.7
- kt-legacy ==1.0.5
- langcodes ==3.4.0
- language-data ==1.2.0
- lazy-loader ==0.4
- libclang ==18.1.1
- lightgbm ==4.3.0
- lightning ==2.3.3
- lightning-utilities ==0.11.5
- llvmlite ==0.41.1
- marisa-trie ==1.2.0
- markdown ==3.6
- markdown-it-py ==3.0.0
- mccabe ==0.7.0
- mdurl ==0.1.2
- minio ==7.2.7
- mlforecast ==0.10.0
- model-index ==0.1.11
- mpmath ==1.3.0
- msgpack ==1.0.8
- multidict ==6.0.5
- multiprocess ==0.70.16
- murmurhash ==1.0.10
- mypy ==1.10.0
- mypy-extensions ==1.0.0
- namex ==0.0.8
- networkx ==3.1
- nlpaug ==1.1.11
- nltk ==3.8.1
- nptyping ==2.4.1
- numba ==0.58.1
- nvidia-cublas-cu11 ==11.11.3.6
- nvidia-cublas-cu12 ==12.1.3.1
- nvidia-cuda-cupti-cu12 ==12.1.105
- nvidia-cuda-nvrtc-cu12 ==12.1.105
- nvidia-cuda-runtime-cu12 ==12.1.105
- nvidia-cudnn-cu11 ==8.6.0.163
- nvidia-cudnn-cu12 ==8.9.2.26
- nvidia-cufft-cu12 ==11.0.2.54
- nvidia-curand-cu12 ==10.3.2.106
- nvidia-cusolver-cu12 ==11.4.5.107
- nvidia-cusparse-cu12 ==12.1.0.106
- nvidia-ml-py3 ==7.352.0
- nvidia-nccl-cu12 ==2.20.5
- nvidia-nvjitlink-cu12 ==12.5.82
- nvidia-nvtx-cu12 ==12.1.105
- oauthlib ==3.2.2
- omegaconf ==2.2.3
- onnx ==1.16.1
- onnxruntime ==1.18.1
- opencensus ==0.11.4
- opencensus-context ==0.1.3
- opencv-python ==4.10.0.84
- opendatalab ==0.0.10
- openmim ==0.3.9
- openml ==0.14.2
- openxlab ==0.0.11
- opt-einsum ==3.3.0
- optimum ==1.18.1
- ordered-set ==4.1.0
- orjson ==3.10.6
- panda ==0.3.1
- parso ==0.7.1
- patsy ==0.5.6
- pdf2image ==1.17.0
- plotly ==5.14.1
- preshed ==3.0.9
- proto-plus ==1.24.0
- protobuf ==3.19.6
- py-spy ==0.3.14
- py4j ==0.10.9.7
- pyarrow ==16.1.0
- pyarrow-hotfix ==0.6
- pyasn1 ==0.6.0
- pyasn1-modules ==0.4.0
- pycodestyle ==2.12.0
- pycryptodome ==3.20.0
- pydantic ==2.8.2
- pydantic-core ==2.20.1
- pydocstyle ==6.3.0
- pyflakes ==3.2.0
- pytesseract ==0.3.10
- python-graphviz ==0.20.3
- pytorch-lightning ==2.3.3
- pytorch-metric-learning ==2.3.0
- pywavelets ==1.4.1
- ray ==2.10.0
- regex ==2024.5.15
- requests-oauthlib ==2.0.0
- rich ==13.7.1
- rsa ==4.9
- s3transfer ==0.10.2
- safetensors ==0.4.3
- scikit-image ==0.20.0
- scikit-learn ==1.3.2
- scipy ==1.9.1
- sentencepiece ==0.2.0
- seqeval ==1.2.2
- setuptools ==70.3.0
- shapely ==2.0.4
- shellingham ==1.5.4
- smart-open ==7.0.4
- spacy ==3.7.5
- spacy-legacy ==3.0.12
- spacy-loggers ==1.0.5
- srsly ==2.4.8
- statsforecast ==1.4.0
- statsmodels ==0.14.1
- sympy ==1.13.0
- tabulate ==0.9.0
- tblib ==3.0.0
- tenacity ==8.4.1
- tensorboard ==2.10.0
- tensorboard-data-server ==0.6.1
- tensorboard-plugin-wit ==1.8.1
- tensorboardx ==2.6.2.2
- tensorflow ==2.10.0
- tensorflow-estimator ==2.10.0
- tensorflow-hub ==0.16.1
- tensorflow-io-gcs-filesystem ==0.34.0
- tensorflow-text ==2.10.0
- termcolor ==2.4.0
- text-unidecode ==1.3
- tf-keras ==2.15.0
- thinc ==8.2.5
- tifffile ==2023.7.10
- timm ==0.9.16
- tokenizers ==0.15.2
- toolz ==0.12.1
- torch ==2.3.1
- torchmetrics ==1.2.1
- torchvision ==0.18.1
- transformers ==4.39.3
- triton ==2.3.1
- typer ==0.12.3
- urllib3 ==1.26.19
- utilsforecast ==0.0.10
- wasabi ==1.1.3
- weasel ==0.4.1
- werkzeug ==3.0.3
- window-ops ==0.0.15
- wrapt ==1.16.0
- xgboost ==2.1.0
- xmltodict ==0.13.0
- xxhash ==3.4.1
- yarl ==1.9.4
- Keras-Preprocessing ==1.1.2
- Markdown ==3.6
- PyQt5 ==5.15.10
- PyWavelets ==1.4.1
- Werkzeug ==3.0.3
- absl-py ==2.1.0
- accelerate ==0.21.0
- aiohttp ==3.9.5
- aiohttp-cors ==0.7.0
- aiosignal ==1.3.1
- annotated-types ==0.7.0
- antlr4-python3-runtime ==4.9.3
- astunparse ==1.6.3
- async-timeout ==4.0.3
- autogluon ==1.1.1
- autogluon.common ==1.1.1
- autogluon.core ==1.1.1
- autogluon.features ==1.1.1
- autogluon.multimodal ==1.1.1
- autogluon.tabular ==1.1.1
- autogluon.timeseries ==1.1.1
- autokeras ==1.1.0
- blis ==0.7.11
- boto3 ==1.34.144
- botocore ==1.34.144
- catalogue ==2.0.10
- catboost ==1.2.5
- category-encoders ==2.6.3
- cloudpathlib ==0.18.1
- coloredlogs ==15.0.1
- colorful ==0.5.6
- colorlog ==5.0.1
- confection ==0.1.5
- cymem ==2.0.8
- datasets ==2.20.0
- dill ==0.3.8
- dm-tree ==0.1.8
- evaluate ==0.4.2
- fastai ==2.7.15
- fastcore ==1.5.54
- fastdownload ==0.0.7
- fastprogress ==1.0.3
- flake8 ==7.1.0
- flake8-mypy ==17.8.0
- flatbuffers ==24.3.25
- frozenlist ==1.4.1
- gast ==0.4.0
- gdown ==5.2.0
- gluonts ==0.15.1
- google-api-core ==2.19.1
- google-auth ==2.30.0
- google-auth-oauthlib ==1.0.0
- google-pasta ==0.2.0
- googleapis-common-protos ==1.63.2
- graphviz ==0.20.3
- grpcio ==1.64.1
- h5py ==3.11.0
- huggingface-hub ==0.23.4
- humanfriendly ==10.0
- hyperopt ==0.2.7
- imagecorruptions ==1.1.2
- imageio ==2.34.1
- imgaug ==0.4.0
- jedi ==0.17.2
- jmespath ==1.0.1
- kaleido ==0.2.1
- keras ==2.13.1
- keras-core ==0.1.5
- keras-nlp ==0.6.1
- keras-tuner ==1.4.7
- kt-legacy ==1.0.5
- langcodes ==3.4.0
- language_data ==1.2.0
- lazy_loader ==0.4
- libclang ==18.1.1
- lightgbm ==4.3.0
- lightning ==2.3.3
- lightning-utilities ==0.11.5
- llvmlite ==0.41.1
- marisa-trie ==1.2.0
- markdown-it-py ==3.0.0
- mccabe ==0.7.0
- mdurl ==0.1.2
- minio ==7.2.7
- mkl-service ==2.4.0
- mlforecast ==0.10.0
- model-index ==0.1.11
- multidict ==6.0.5
- multiprocess ==0.70.16
- murmurhash ==1.0.10
- mypy ==1.10.0
- mypy-extensions ==1.0.0
- namex ==0.0.8
- nlpaug ==1.1.11
- nltk ==3.8.1
- nptyping ==2.4.1
- numba ==0.58.1
- nvidia-cublas-cu11 ==11.11.3.6
- nvidia-cublas-cu12 ==12.1.3.1
- nvidia-cuda-cupti-cu12 ==12.1.105
- nvidia-cuda-nvrtc-cu12 ==12.1.105
- nvidia-cuda-runtime-cu12 ==12.1.105
- nvidia-cudnn-cu11 ==8.6.0.163
- nvidia-cudnn-cu12 ==8.9.2.26
- nvidia-cufft-cu12 ==11.0.2.54
- nvidia-curand-cu12 ==10.3.2.106
- nvidia-cusolver-cu12 ==11.4.5.107
- nvidia-cusparse-cu12 ==12.1.0.106
- nvidia-ml-py3 ==7.352.0
- nvidia-nccl-cu12 ==2.20.5
- nvidia-nvjitlink-cu12 ==12.5.82
- nvidia-nvtx-cu12 ==12.1.105
- oauthlib ==3.2.2
- omegaconf ==2.2.3
- onnx ==1.16.1
- onnxruntime ==1.18.1
- opencensus ==0.11.4
- opencensus-context ==0.1.3
- opencv-python ==4.10.0.84
- opendatalab ==0.0.10
- openmim ==0.3.9
- openml ==0.14.2
- openxlab ==0.0.11
- opt-einsum ==3.3.0
- optimum ==1.18.1
- ordered-set ==4.1.0
- orjson ==3.10.6
- panda ==0.3.1
- parso ==0.7.1
- patsy ==0.5.6
- pdf2image ==1.17.0
- plotly ==5.14.1
- ply ==3.11
- preshed ==3.0.9
- proto-plus ==1.24.0
- protobuf ==4.25.4
- py-spy ==0.3.14
- py4j ==0.10.9.7
- pyarrow ==16.1.0
- pyarrow-hotfix ==0.6
- pyasn1 ==0.6.0
- pyasn1_modules ==0.4.0
- pycodestyle ==2.12.0
- pycryptodome ==3.20.0
- pydantic ==2.8.2
- pydantic_core ==2.20.1
- pydocstyle ==6.3.0
- pyflakes ==3.2.0
- pytesseract ==0.3.10
- pytorch-lightning ==2.3.3
- pytorch-metric-learning ==2.3.0
- ray ==2.10.0
- regex ==2024.5.15
- requests-oauthlib ==2.0.0
- rich ==13.7.1
- rsa ==4.9
- s3transfer ==0.10.2
- safetensors ==0.4.3
- scikit-image ==0.20.0
- scikit-learn ==1.3.2
- scipy ==1.9.1
- sentencepiece ==0.2.0
- seqeval ==1.2.2
- shapely ==2.0.4
- shellingham ==1.5.4
- smart-open ==7.0.4
- spacy ==3.7.5
- spacy-legacy ==3.0.12
- spacy-loggers ==1.0.5
- srsly ==2.4.8
- statsforecast ==1.4.0
- statsmodels ==0.14.1
- tabulate ==0.9.0
- tenacity ==8.4.1
- tensorboard ==2.13.0
- tensorboard-data-server ==0.7.2
- tensorboard-plugin-wit ==1.8.1
- tensorboardX ==2.6.2.2
- tensorflow ==2.13.1
- tensorflow-estimator ==2.13.0
- tensorflow-hub ==0.16.1
- tensorflow-io-gcs-filesystem ==0.34.0
- tensorflow-text ==2.10.0
- tensorrt ==10.3.0
- tensorrt-cu12 ==10.3.0
- tensorrt-cu12-bindings ==10.3.0
- tensorrt-cu12-libs ==10.3.0
- termcolor ==2.4.0
- text-unidecode ==1.3
- tf-keras ==2.15.0
- thinc ==8.2.5
- tifffile ==2023.7.10
- timm ==0.9.16
- tokenizers ==0.15.2
- torch ==2.3.1
- torchaudio ==2.3.1
- torchmetrics ==1.2.1
- torchvision ==0.18.1
- transformers ==4.39.3
- triton ==2.3.1
- typer ==0.12.3
- typing_extensions ==4.5.0
- urllib3 ==1.26.19
- utilsforecast ==0.0.10
- wasabi ==1.1.3
- weasel ==0.4.1
- webencodings ==0.5.1
- window_ops ==0.0.15
- wrapt ==1.16.0
- xarray ==2023.1.0
- xgboost ==2.1.0
- xmltodict ==0.13.0
- xxhash ==3.4.1
- yarl ==1.9.4
- zstandard ==0.22.0
- pytest *
- pytest-cov *
- flake8 *
- flake8-mypy *
- imagecorruptions *
- imgaug *
- jsonpickle *
- tensorflow-data-validation *