Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: frontiersin.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.4%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: ptsialis
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 22.8 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog License Citation Authors

README.md

Source Code for the Master's Thesis: Assessing and Predicting the Optimal Imputation Method Regarding the Predictive Performance of Machine Learning Models

Target

An optimization approach for decision making in data science based on benchmarking the impact of different data imputation methods on the predictive performance of machine learning models.
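As a rough illustration of what such a benchmark measures, the sketch below corrupts a feature column completely at random (MCAR), fills the gaps with the most frequent value (mode imputation), and compares the accuracy of a fixed toy prediction rule before and after. MCAR and mode imputation are options named in the usage notes of this README; the helper names (`corrupt_mcar`, `impute_mode`, `predict`), the toy data, and the fixed rule standing in for a trained model are assumptions made for illustration and are not taken from the project's code.

```python
import random
from collections import Counter


def corrupt_mcar(values, fraction, seed=0):
    """Return a copy of `values` with a fraction of entries set to None.

    The missing positions are drawn uniformly at random, independent of the
    values themselves (missing completely at random, MCAR).
    """
    rng = random.Random(seed)
    n_missing = round(fraction * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def impute_mode(values):
    """Fill None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def predict(feature_column):
    """Fixed toy rule standing in for a trained model (hypothetical)."""
    return [value == "red" for value in feature_column]


def accuracy(predictions, labels):
    """Share of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Toy data: the label is fully determined by the feature.
colors = ["red"] * 6 + ["blue"] * 4
labels = [c == "red" for c in colors]

baseline_acc = accuracy(predict(colors), labels)      # clean data
corrupted = corrupt_mcar(colors, fraction=0.3, seed=0)
imputed = impute_mode(corrupted)
imputed_acc = accuracy(predict(imputed), labels)      # after imputation

print(f"baseline accuracy: {baseline_acc:.2f}")
print(f"accuracy after mode imputation: {imputed_acc:.2f}")
```

The gap between the two accuracies is the kind of quantity a benchmark would compare across imputation methods, missing fractions, and missingness patterns.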

Disclaimer

This is a research project and is not intended for production use.

This master's thesis builds on the work of Sebastian Jäger, Arndt Allhorn and Felix Bießmann, as described in their paper "A Benchmark for Data Imputation Methods": https://www.frontiersin.org/article/10.3389/fdata.2021.693674

Installation

Steps to set up the required conda environment:

  1. Create an environment Data-Imputation-Thesis with conda:
       conda env create -f environment.yaml
  2. Activate the new environment:
       conda activate Data-Imputation-Thesis
  3. Install jenga:
       cd src/jenga
       python setup.py develop
  4. Install data-imputation-paper:
       cd ../..
       python setup.py develop  # or `install`

It might be necessary to install the required GPU drivers manually (the version might change based on the hardware used):

    conda install -c conda-forge cudatoolkit=11.7.0
    pip install nvidia-cudnn-cu11==8.6.0.163

Activate the packages every time you activate the environment:

    CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib

See https://www.tensorflow.org/install/pip
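To make clear what the `export LD_LIBRARY_PATH=...` line assembles, here is a minimal Python sketch of the same string construction. The function name `build_ld_library_path` and the example paths are hypothetical. Unlike the shell line, the sketch drops empty entries: if `LD_LIBRARY_PATH` is unset, the shell version produces a leading empty entry, which the dynamic loader interprets as the current directory.

```python
def build_ld_library_path(existing, conda_prefix, cudnn_path):
    """Append the conda and cuDNN library directories to a search path.

    existing:     current value of LD_LIBRARY_PATH (may be an empty string)
    conda_prefix: value of CONDA_PREFIX for the active environment
    cudnn_path:   directory that contains the nvidia.cudnn package
    """
    # Keep only non-empty entries so no empty path component is produced.
    entries = [entry for entry in existing.split(":") if entry]
    entries.append(conda_prefix.rstrip("/") + "/lib")
    entries.append(cudnn_path.rstrip("/") + "/lib")
    return ":".join(entries)


# Hypothetical paths, for illustration only.
example = build_ld_library_path(
    existing="",
    conda_prefix="/opt/conda/envs/Data-Imputation-Thesis",
    cudnn_path="/opt/site-packages/nvidia/cudnn",
)
print(example)
```

This only shows how the value is put together. Setting the variable from inside an already running Python process would come too late for the loader, which is why the README exports it in the shell before starting Python.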

Usage

  • Imputation Experiments: execute run-experiment.py with the required settings (explained below). The experiment name must contain `corrupted`.

  • Baseline Experiments: execute run-experiment.py with the required settings (explained below). The experiment name must not contain `corrupted`.

  • Imputation Experiments: execute run-experiment-subset.py with the required settings (explained below). The experiment name must contain `corrupted`.

  • Baseline Experiments: execute run-experiment-subset.py with the required settings (explained below). The experiment name must not contain `corrupted`.

  • Example of starting an experiment: a run requires the ID of the dataset (737), the imputation method (mode), the experiment name (test_experiment), the missing fractions (0.3, 0.5), the missing patterns (MAR, MCAR), the strategies (single_single), the number of repetitions (3), and a path to the storage folder for results (../results).

    python run-experiment.py 737 mode test_experiment --missing-fractions 0.3,0.5 --missing-types MAR,MCAR --strategies single_single --num-repetitions 3 --base-path ../results
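When many runs have to be launched, the command line can be assembled programmatically. The following is a minimal sketch that rebuilds the example command and applies the naming rule from the list above. The helpers `build_command` and `experiment_kind` are hypothetical and not part of the repository; only the script name and the flags are taken from the example command. Treating the `corrupted` check as a case-sensitive substring test is also an assumption.

```python
import shlex


def experiment_kind(name):
    """Classify a run by its name, following the README's naming rule.

    Assumption: the rule is a plain, case-sensitive substring check.
    """
    return "imputation" if "corrupted" in name else "baseline"


def build_command(dataset_id, imputer, name, fractions, types,
                  strategies, repetitions, base_path,
                  script="run-experiment.py"):
    """Return the argument list for one experiment run."""
    return [
        "python", script, str(dataset_id), imputer, name,
        "--missing-fractions", ",".join(str(f) for f in fractions),
        "--missing-types", ",".join(types),
        "--strategies", ",".join(strategies),
        "--num-repetitions", str(repetitions),
        "--base-path", base_path,
    ]


cmd = build_command(
    dataset_id=737,
    imputer="mode",
    name="test_experiment",
    fractions=[0.3, 0.5],
    types=["MAR", "MCAR"],
    strategies=["single_single"],
    repetitions=3,
    base_path="../results",
)

print(shlex.join(cmd))                     # the command as a shell string
print(experiment_kind("test_experiment"))  # baseline
```

The argument list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains the script. Passing `script="run-experiment-subset.py"` targets the subset variant, assuming it accepts the same arguments.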

Note from Sebastian Jäger

This project has been set up using PyScaffold 3.2.2 and the dsproject extension 0.4. For details and usage information on PyScaffold see https://pyscaffold.org/.

Owner

  • Login: ptsialis
  • Kind: user

Citation (CITATION.bib)

@ARTICLE{optimal_imputation_dittrich_2023,
	AUTHOR={Dittrich, Pascal},
	TITLE={Assessing and Predicting the Optimal Imputation Method Regarding the Predictive Performance of Machine Learning Models},
	YEAR={2023},
	MONTH={May},
	URL={-},
	ABSTRACT={-}
}

GitHub Events

Total
  • Push event: 1
Last Year
  • Push event: 1

Dependencies

cluster/docker/Dockerfile docker
  • python 3.8.8 build
.eggs/PyScaffold-3.2.3-py3.8.egg/EGG-INFO/requires.txt pypi
  • cookiecutter *
  • django *
  • flake8 *
  • pyscaffoldext-custom-extension *
  • pyscaffoldext-dsproject *
  • pyscaffoldext-markdown *
  • pyscaffoldext-pyproject *
  • pytest *
  • pytest-cov *
  • pytest-fixture-config *
  • pytest-shutil *
  • pytest-virtualenv *
  • pytest-xdist *
  • setuptools >=38.3
  • sphinx *
condaenv.q_vpwh2j.requirements.txt pypi
  • autokeras *
  • flake8 *
  • flake8-mypy *
  • jedi ==0.17.2
  • openml *
  • pydocstyle *
  • tensorflow *
  • typer *
environment.yaml pypi
  • autokeras *
  • flake8 *
  • flake8-mypy *
  • jedi ==0.17.2
  • openml *
  • plotly ==5.14.1
  • pydocstyle *
  • tensorflow ==2.10
  • tensorflow-text ==2.10
  • typer *
environment.yml pypi
  • absl-py ==2.1.0
  • accelerate ==0.21.0
  • aiohttp ==3.9.5
  • aiohttp-cors ==0.7.0
  • aiosignal ==1.3.1
  • annotated-types ==0.7.0
  • antlr4-python3-runtime ==4.9.3
  • astunparse ==1.6.3
  • async-timeout ==4.0.3
  • autogluon ==1.1.1
  • autogluon-common ==1.1.1
  • autogluon-core ==1.1.1
  • autogluon-features ==1.1.1
  • autogluon-multimodal ==1.1.1
  • autogluon-tabular ==1.1.1
  • autogluon-timeseries ==1.1.1
  • autokeras ==1.1.0
  • blis ==0.7.11
  • boto3 ==1.34.144
  • botocore ==1.34.144
  • catalogue ==2.0.10
  • catboost ==1.2.5
  • category-encoders ==2.6.3
  • cloudpathlib ==0.18.1
  • cloudpickle ==3.0.0
  • coloredlogs ==15.0.1
  • colorful ==0.5.6
  • confection ==0.1.5
  • cymem ==2.0.8
  • dask ==2023.5.0
  • datasets ==2.20.0
  • dill ==0.3.8
  • distributed ==2023.5.0
  • dm-tree ==0.1.8
  • evaluate ==0.4.2
  • fastai ==2.7.15
  • fastcore ==1.5.54
  • fastdownload ==0.0.7
  • fastprogress ==1.0.3
  • flake8 ==7.1.0
  • flake8-mypy ==17.8.0
  • flatbuffers ==24.3.25
  • frozenlist ==1.4.1
  • fsspec ==2024.6.1
  • gast ==0.4.0
  • gdown ==5.2.0
  • gluonts ==0.15.1
  • google-api-core ==2.19.1
  • google-auth ==2.30.0
  • google-auth-oauthlib ==0.4.6
  • google-pasta ==0.2.0
  • googleapis-common-protos ==1.63.2
  • grpcio ==1.64.1
  • h5py ==3.11.0
  • huggingface-hub ==0.23.4
  • humanfriendly ==10.0
  • hyperopt ==0.2.7
  • imagecorruptions ==1.1.2
  • imageio ==2.34.1
  • imgaug ==0.4.0
  • jedi ==0.17.2
  • jmespath ==1.0.1
  • keras ==2.10.0
  • keras-core ==0.1.5
  • keras-nlp ==0.6.1
  • keras-preprocessing ==1.1.2
  • keras-tuner ==1.4.7
  • kt-legacy ==1.0.5
  • langcodes ==3.4.0
  • language-data ==1.2.0
  • lazy-loader ==0.4
  • libclang ==18.1.1
  • lightgbm ==4.3.0
  • lightning ==2.3.3
  • lightning-utilities ==0.11.5
  • llvmlite ==0.41.1
  • marisa-trie ==1.2.0
  • markdown ==3.6
  • markdown-it-py ==3.0.0
  • mccabe ==0.7.0
  • mdurl ==0.1.2
  • minio ==7.2.7
  • mlforecast ==0.10.0
  • model-index ==0.1.11
  • mpmath ==1.3.0
  • msgpack ==1.0.8
  • multidict ==6.0.5
  • multiprocess ==0.70.16
  • murmurhash ==1.0.10
  • mypy ==1.10.0
  • mypy-extensions ==1.0.0
  • namex ==0.0.8
  • networkx ==3.1
  • nlpaug ==1.1.11
  • nltk ==3.8.1
  • nptyping ==2.4.1
  • numba ==0.58.1
  • nvidia-cublas-cu11 ==11.11.3.6
  • nvidia-cublas-cu12 ==12.1.3.1
  • nvidia-cuda-cupti-cu12 ==12.1.105
  • nvidia-cuda-nvrtc-cu12 ==12.1.105
  • nvidia-cuda-runtime-cu12 ==12.1.105
  • nvidia-cudnn-cu11 ==8.6.0.163
  • nvidia-cudnn-cu12 ==8.9.2.26
  • nvidia-cufft-cu12 ==11.0.2.54
  • nvidia-curand-cu12 ==10.3.2.106
  • nvidia-cusolver-cu12 ==11.4.5.107
  • nvidia-cusparse-cu12 ==12.1.0.106
  • nvidia-ml-py3 ==7.352.0
  • nvidia-nccl-cu12 ==2.20.5
  • nvidia-nvjitlink-cu12 ==12.5.82
  • nvidia-nvtx-cu12 ==12.1.105
  • oauthlib ==3.2.2
  • omegaconf ==2.2.3
  • onnx ==1.16.1
  • onnxruntime ==1.18.1
  • opencensus ==0.11.4
  • opencensus-context ==0.1.3
  • opencv-python ==4.10.0.84
  • opendatalab ==0.0.10
  • openmim ==0.3.9
  • openml ==0.14.2
  • openxlab ==0.0.11
  • opt-einsum ==3.3.0
  • optimum ==1.18.1
  • ordered-set ==4.1.0
  • orjson ==3.10.6
  • panda ==0.3.1
  • parso ==0.7.1
  • patsy ==0.5.6
  • pdf2image ==1.17.0
  • plotly ==5.14.1
  • preshed ==3.0.9
  • proto-plus ==1.24.0
  • protobuf ==3.19.6
  • py-spy ==0.3.14
  • py4j ==0.10.9.7
  • pyarrow ==16.1.0
  • pyarrow-hotfix ==0.6
  • pyasn1 ==0.6.0
  • pyasn1-modules ==0.4.0
  • pycodestyle ==2.12.0
  • pycryptodome ==3.20.0
  • pydantic ==2.8.2
  • pydantic-core ==2.20.1
  • pydocstyle ==6.3.0
  • pyflakes ==3.2.0
  • pytesseract ==0.3.10
  • python-graphviz ==0.20.3
  • pytorch-lightning ==2.3.3
  • pytorch-metric-learning ==2.3.0
  • pywavelets ==1.4.1
  • ray ==2.10.0
  • regex ==2024.5.15
  • requests-oauthlib ==2.0.0
  • rich ==13.7.1
  • rsa ==4.9
  • s3transfer ==0.10.2
  • safetensors ==0.4.3
  • scikit-image ==0.20.0
  • scikit-learn ==1.3.2
  • scipy ==1.9.1
  • sentencepiece ==0.2.0
  • seqeval ==1.2.2
  • setuptools ==70.3.0
  • shapely ==2.0.4
  • shellingham ==1.5.4
  • smart-open ==7.0.4
  • spacy ==3.7.5
  • spacy-legacy ==3.0.12
  • spacy-loggers ==1.0.5
  • srsly ==2.4.8
  • statsforecast ==1.4.0
  • statsmodels ==0.14.1
  • sympy ==1.13.0
  • tabulate ==0.9.0
  • tblib ==3.0.0
  • tenacity ==8.4.1
  • tensorboard ==2.10.0
  • tensorboard-data-server ==0.6.1
  • tensorboard-plugin-wit ==1.8.1
  • tensorboardx ==2.6.2.2
  • tensorflow ==2.10.0
  • tensorflow-estimator ==2.10.0
  • tensorflow-hub ==0.16.1
  • tensorflow-io-gcs-filesystem ==0.34.0
  • tensorflow-text ==2.10.0
  • termcolor ==2.4.0
  • text-unidecode ==1.3
  • tf-keras ==2.15.0
  • thinc ==8.2.5
  • tifffile ==2023.7.10
  • timm ==0.9.16
  • tokenizers ==0.15.2
  • toolz ==0.12.1
  • torch ==2.3.1
  • torchmetrics ==1.2.1
  • torchvision ==0.18.1
  • transformers ==4.39.3
  • triton ==2.3.1
  • typer ==0.12.3
  • urllib3 ==1.26.19
  • utilsforecast ==0.0.10
  • wasabi ==1.1.3
  • weasel ==0.4.1
  • werkzeug ==3.0.3
  • window-ops ==0.0.15
  • wrapt ==1.16.0
  • xgboost ==2.1.0
  • xmltodict ==0.13.0
  • xxhash ==3.4.1
  • yarl ==1.9.4
requirements_new.txt pypi
  • Keras-Preprocessing ==1.1.2
  • Markdown ==3.6
  • PyQt5 ==5.15.10
  • PyWavelets ==1.4.1
  • Werkzeug ==3.0.3
  • absl-py ==2.1.0
  • accelerate ==0.21.0
  • aiohttp ==3.9.5
  • aiohttp-cors ==0.7.0
  • aiosignal ==1.3.1
  • annotated-types ==0.7.0
  • antlr4-python3-runtime ==4.9.3
  • astunparse ==1.6.3
  • async-timeout ==4.0.3
  • autogluon ==1.1.1
  • autogluon.common ==1.1.1
  • autogluon.core ==1.1.1
  • autogluon.features ==1.1.1
  • autogluon.multimodal ==1.1.1
  • autogluon.tabular ==1.1.1
  • autogluon.timeseries ==1.1.1
  • autokeras ==1.1.0
  • blis ==0.7.11
  • boto3 ==1.34.144
  • botocore ==1.34.144
  • catalogue ==2.0.10
  • catboost ==1.2.5
  • category-encoders ==2.6.3
  • cloudpathlib ==0.18.1
  • coloredlogs ==15.0.1
  • colorful ==0.5.6
  • colorlog ==5.0.1
  • confection ==0.1.5
  • cymem ==2.0.8
  • datasets ==2.20.0
  • dill ==0.3.8
  • dm-tree ==0.1.8
  • evaluate ==0.4.2
  • fastai ==2.7.15
  • fastcore ==1.5.54
  • fastdownload ==0.0.7
  • fastprogress ==1.0.3
  • flake8 ==7.1.0
  • flake8-mypy ==17.8.0
  • flatbuffers ==24.3.25
  • frozenlist ==1.4.1
  • gast ==0.4.0
  • gdown ==5.2.0
  • gluonts ==0.15.1
  • google-api-core ==2.19.1
  • google-auth ==2.30.0
  • google-auth-oauthlib ==1.0.0
  • google-pasta ==0.2.0
  • googleapis-common-protos ==1.63.2
  • graphviz ==0.20.3
  • grpcio ==1.64.1
  • h5py ==3.11.0
  • huggingface-hub ==0.23.4
  • humanfriendly ==10.0
  • hyperopt ==0.2.7
  • imagecorruptions ==1.1.2
  • imageio ==2.34.1
  • imgaug ==0.4.0
  • jedi ==0.17.2
  • jmespath ==1.0.1
  • kaleido ==0.2.1
  • keras ==2.13.1
  • keras-core ==0.1.5
  • keras-nlp ==0.6.1
  • keras-tuner ==1.4.7
  • kt-legacy ==1.0.5
  • langcodes ==3.4.0
  • language_data ==1.2.0
  • lazy_loader ==0.4
  • libclang ==18.1.1
  • lightgbm ==4.3.0
  • lightning ==2.3.3
  • lightning-utilities ==0.11.5
  • llvmlite ==0.41.1
  • marisa-trie ==1.2.0
  • markdown-it-py ==3.0.0
  • mccabe ==0.7.0
  • mdurl ==0.1.2
  • minio ==7.2.7
  • mkl-service ==2.4.0
  • mlforecast ==0.10.0
  • model-index ==0.1.11
  • multidict ==6.0.5
  • multiprocess ==0.70.16
  • murmurhash ==1.0.10
  • mypy ==1.10.0
  • mypy-extensions ==1.0.0
  • namex ==0.0.8
  • nlpaug ==1.1.11
  • nltk ==3.8.1
  • nptyping ==2.4.1
  • numba ==0.58.1
  • nvidia-cublas-cu11 ==11.11.3.6
  • nvidia-cublas-cu12 ==12.1.3.1
  • nvidia-cuda-cupti-cu12 ==12.1.105
  • nvidia-cuda-nvrtc-cu12 ==12.1.105
  • nvidia-cuda-runtime-cu12 ==12.1.105
  • nvidia-cudnn-cu11 ==8.6.0.163
  • nvidia-cudnn-cu12 ==8.9.2.26
  • nvidia-cufft-cu12 ==11.0.2.54
  • nvidia-curand-cu12 ==10.3.2.106
  • nvidia-cusolver-cu12 ==11.4.5.107
  • nvidia-cusparse-cu12 ==12.1.0.106
  • nvidia-ml-py3 ==7.352.0
  • nvidia-nccl-cu12 ==2.20.5
  • nvidia-nvjitlink-cu12 ==12.5.82
  • nvidia-nvtx-cu12 ==12.1.105
  • oauthlib ==3.2.2
  • omegaconf ==2.2.3
  • onnx ==1.16.1
  • onnxruntime ==1.18.1
  • opencensus ==0.11.4
  • opencensus-context ==0.1.3
  • opencv-python ==4.10.0.84
  • opendatalab ==0.0.10
  • openmim ==0.3.9
  • openml ==0.14.2
  • openxlab ==0.0.11
  • opt-einsum ==3.3.0
  • optimum ==1.18.1
  • ordered-set ==4.1.0
  • orjson ==3.10.6
  • panda ==0.3.1
  • parso ==0.7.1
  • patsy ==0.5.6
  • pdf2image ==1.17.0
  • plotly ==5.14.1
  • ply ==3.11
  • preshed ==3.0.9
  • proto-plus ==1.24.0
  • protobuf ==4.25.4
  • py-spy ==0.3.14
  • py4j ==0.10.9.7
  • pyarrow ==16.1.0
  • pyarrow-hotfix ==0.6
  • pyasn1 ==0.6.0
  • pyasn1_modules ==0.4.0
  • pycodestyle ==2.12.0
  • pycryptodome ==3.20.0
  • pydantic ==2.8.2
  • pydantic_core ==2.20.1
  • pydocstyle ==6.3.0
  • pyflakes ==3.2.0
  • pytesseract ==0.3.10
  • pytorch-lightning ==2.3.3
  • pytorch-metric-learning ==2.3.0
  • ray ==2.10.0
  • regex ==2024.5.15
  • requests-oauthlib ==2.0.0
  • rich ==13.7.1
  • rsa ==4.9
  • s3transfer ==0.10.2
  • safetensors ==0.4.3
  • scikit-image ==0.20.0
  • scikit-learn ==1.3.2
  • scipy ==1.9.1
  • sentencepiece ==0.2.0
  • seqeval ==1.2.2
  • shapely ==2.0.4
  • shellingham ==1.5.4
  • smart-open ==7.0.4
  • spacy ==3.7.5
  • spacy-legacy ==3.0.12
  • spacy-loggers ==1.0.5
  • srsly ==2.4.8
  • statsforecast ==1.4.0
  • statsmodels ==0.14.1
  • tabulate ==0.9.0
  • tenacity ==8.4.1
  • tensorboard ==2.13.0
  • tensorboard-data-server ==0.7.2
  • tensorboard-plugin-wit ==1.8.1
  • tensorboardX ==2.6.2.2
  • tensorflow ==2.13.1
  • tensorflow-estimator ==2.13.0
  • tensorflow-hub ==0.16.1
  • tensorflow-io-gcs-filesystem ==0.34.0
  • tensorflow-text ==2.10.0
  • tensorrt ==10.3.0
  • tensorrt-cu12 ==10.3.0
  • tensorrt-cu12-bindings ==10.3.0
  • tensorrt-cu12-libs ==10.3.0
  • termcolor ==2.4.0
  • text-unidecode ==1.3
  • tf-keras ==2.15.0
  • thinc ==8.2.5
  • tifffile ==2023.7.10
  • timm ==0.9.16
  • tokenizers ==0.15.2
  • torch ==2.3.1
  • torchaudio ==2.3.1
  • torchmetrics ==1.2.1
  • torchvision ==0.18.1
  • transformers ==4.39.3
  • triton ==2.3.1
  • typer ==0.12.3
  • typing_extensions ==4.5.0
  • urllib3 ==1.26.19
  • utilsforecast ==0.0.10
  • wasabi ==1.1.3
  • weasel ==0.4.1
  • webencodings ==0.5.1
  • window_ops ==0.0.15
  • wrapt ==1.16.0
  • xarray ==2023.1.0
  • xgboost ==2.1.0
  • xmltodict ==0.13.0
  • xxhash ==3.4.1
  • yarl ==1.9.4
  • zstandard ==0.22.0
setup.py pypi
src/data_imputation_paper.egg-info/requires.txt pypi
  • pytest *
  • pytest-cov *
src/jenga/environment.yaml pypi
  • flake8 *
  • flake8-mypy *
  • imagecorruptions *
  • imgaug *
  • jsonpickle *
  • tensorflow-data-validation *
src/jenga/setup.py pypi