data-imputation-paper

Research code for the paper "A Benchmark for Data Imputation Methods".

https://github.com/se-jaeger/data-imputation-paper

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 3 DOI reference(s) in README
✓ Academic publication links: Links to frontiersin.org
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (15.5%) to scientific vocabulary

Keywords

data-imputation data-quality machine-learning

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: se-jaeger
License: apache-2.0
Language: Jupyter Notebook
Default Branch: main
Homepage: https://www.frontiersin.org/articles/10.3389/fdata.2021.693674/full
Size: 7.88 MB

Statistics

Stars: 8
Watchers: 2
Forks: 2
Open Issues: 0
Releases: 0

Topics

data-imputation data-quality machine-learning

Created about 5 years ago · Last pushed about 3 years ago

Metadata Files

Readme Changelog License Citation Authors

Source Code for the Paper: Benchmark for Data Imputation Methods

Check out the final paper at: https://www.frontiersin.org/articles/10.3389/fdata.2021.693674/full

Disclaimer

This is research code and in no way ready for production usage!

Citing us!

If you want to reference our paper or this code, please use the following BibTeX entry:

@ARTICLE{imputation_benchmark_jaeger_2021,
  AUTHOR={Jäger, Sebastian and Allhorn, Arndt and Bießmann, Felix},
  TITLE={A Benchmark for Data Imputation Methods},
  JOURNAL={Frontiers in Big Data},
  VOLUME={4},
  PAGES={48},
  YEAR={2021},
  URL={https://www.frontiersin.org/article/10.3389/fdata.2021.693674},
  DOI={10.3389/fdata.2021.693674},
  ISSN={2624-909X},
  ABSTRACT={With the increasing importance and complexity of data pipelines, data quality became one of the key challenges in modern software applications. The importance of data quality has been recognized beyond the field of data engineering and database management systems (DBMSs). Also, for machine learning (ML) applications, high data quality standards are crucial to ensure robust predictive performance and responsible usage of automated decision making. One of the most frequent data quality problems is missing values. Incomplete datasets can break data pipelines and can have a devastating impact on downstream ML applications when not detected. While statisticians and, more recently, ML researchers have introduced a variety of approaches to impute missing values, comprehensive benchmarks comparing classical and modern imputation approaches under fair and realistic conditions are underrepresented. Here, we aim to fill this gap. We conduct a comprehensive suite of experiments on a large number of datasets with heterogeneous data and realistic missingness conditions, comparing both novel deep learning approaches and classical ML imputation methods when either only test or train and test data are affected by missing data. Each imputation method is evaluated regarding the imputation quality and the impact imputation has on a downstream ML task. Our results provide valuable insights into the performance of a variety of imputation methods under realistic conditions. We hope that our results help researchers and engineers to guide their data preprocessing method selection for automated data quality improvement.}
}

Installation

In order to set up the necessary environment:

1. Create an environment data-imputation-paper with the help of conda:
   conda env create -f environment.yaml
2. Activate the new environment with:
   conda activate data-imputation-paper
3. Install jenga with:
   cd src/jenga
   python setup.py develop
4. Install data-imputation-paper with:
   cd ../..
   python setup.py develop # or `install`

Optional and needed only once after git clone:

Install several pre-commit git hooks with `pre-commit install` and check out the configuration under .pre-commit-config.yaml. The -n, --no-verify flag of git commit can be used to deactivate pre-commit hooks temporarily.

Then take a look into the scripts and notebooks folders.
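After the installation steps, a quick import check can confirm that both packages are visible in the active environment. The following is only a minimal sketch: the import names `jenga` and `data_imputation_paper` are assumptions (the README does not state them), and the helper `missing_modules` is hypothetical.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be located for import."""
    return [
        name
        for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    # Assumed import names; adjust them to the packages actually found in src/.
    expected = ["jenga", "data_imputation_paper"]
    missing = missing_modules(expected)
    if missing:
        print("Not importable in this environment:", ", ".join(missing))
    else:
        print("All expected packages are importable.")
```

Run it inside the activated conda environment; any name it reports points to an installation step that did not complete.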

Dependency Management & Reproducibility

1. Always keep your abstract (unpinned) dependencies updated in environment.yaml and eventually in setup.cfg if you want to ship and install your package via pip later on.
2. Create concrete dependencies as environment.lock.yaml for the exact reproduction of your environment with:
   conda env export -n data-imputation-paper -f environment.lock.yaml
   For multi-OS development, consider using --no-builds during the export.
3. Update your current environment with respect to a new environment.lock.yaml using:
   conda env update -f environment.lock.yaml --prune

Project Organization

├── AUTHORS.rst               <- List of developers and maintainers.
├── CHANGELOG.rst             <- Changelog to keep track of new features and fixes.
├── LICENSE.txt               <- License as chosen on the command-line.
├── README.md                 <- The top-level README for developers.
├── configs                   <- Directory for configurations of model & application.
├── data
│   ├── external              <- Data from third party sources.
│   ├── interim               <- Intermediate data that has been transformed.
│   ├── processed             <- The final, canonical data sets for modeling.
│   └── raw                   <- The original, immutable data dump.
├── docs                      <- Directory for Sphinx documentation in rst or md.
├── environment.yaml          <- The conda environment file for reproducibility.
├── models                    <- Trained and serialized models, model predictions,
│                                or model summaries.
├── notebooks                 <- Jupyter notebooks. Naming convention is a number (for
│                                ordering), the creator's initials and a description,
│                                e.g. `1.0-fw-initial-data-exploration`.
├── references                <- Data dictionaries, manuals, and all other materials.
├── reports                   <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures               <- Generated plots and figures for reports.
├── scripts                   <- Analysis and production scripts which import the
│                                actual PYTHON_PKG, e.g. train_model.
├── setup.cfg                 <- Declarative configuration of your project.
├── setup.py                  <- Use `python setup.py develop` to install for development
│                                or create a distribution with `python setup.py bdist_wheel`.
├── src
│   ├── data-imputation-paper <- Actual Python package where the main functionality goes.
│   └── jenga                 <- Jenga code, used to add data corruptions/missingness.
├── tests                     <- Unit tests which can be run with `py.test`.
├── .coveragerc               <- Configuration for coverage reports of unit tests.
├── .isort.cfg                <- Configuration for git hook that sorts imports.
└── .pre-commit-config.yaml   <- Configuration of pre-commit git hooks.
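
The notebook naming convention described above (a number for ordering, the creator's initials, and a description) could be checked with a small helper like the one below. This is a sketch under assumptions that the README does not spell out: initials are taken to be two or three lowercase letters, the description is taken to be lowercase words joined by hyphens, and the function name `parse_notebook_name` is hypothetical.

```python
import re

# Assumed pattern: <number>-<initials>-<description>, e.g. 1.0-fw-initial-data-exploration
NOTEBOOK_NAME = re.compile(
    r"^(?P<number>\d+(?:\.\d+)*)"               # ordering number such as 1.0
    r"-(?P<initials>[a-z]{2,3})"                # creator's initials (assumed 2-3 letters)
    r"-(?P<description>[a-z0-9]+(?:-[a-z0-9]+)*)$"  # hyphen-separated description
)


def parse_notebook_name(name):
    """Split a notebook name into (number, initials, description), or return None."""
    # Strip the extension by hand: the ordering number itself contains a dot,
    # so generic suffix handling would cut the name in the wrong place.
    if name.endswith(".ipynb"):
        name = name[: -len(".ipynb")]
    match = NOTEBOOK_NAME.match(name)
    if match is None:
        return None
    return match.group("number"), match.group("initials"), match.group("description")


if __name__ == "__main__":
    print(parse_notebook_name("1.0-fw-initial-data-exploration.ipynb"))
    print(parse_notebook_name("exploration.ipynb"))
```

A check like this could run in a pre-commit hook or a unit test over the notebooks folder, if stricter naming is wanted.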

Note

This project has been set up using PyScaffold 3.2.2 and the dsproject extension 0.4. For details and usage information on PyScaffold see https://pyscaffold.org/.

Owner

Name: Sebastian Jäger
Login: se-jaeger
Kind: user
Location: Berlin
Company: @calgo-lab

Website: https://sebastian-jaeger.me
Twitter: se_jaeger
Repositories: 8
Profile: https://github.com/se-jaeger

Mastodon - (at)seja(at)sigmoid.social | LinkedIn - se-jaeger | Google Scholar - https://scholar.google.co.uk/citations?user=oQb71zUAAAAJ

Citation (CITATION.bib)

@ARTICLE{imputation_benchmark_jaeger_2021,
	AUTHOR={Jäger, Sebastian and Allhorn, Arndt and Bießmann, Felix},
	TITLE={A Benchmark for Data Imputation Methods},
	JOURNAL={Frontiers in Big Data},
	VOLUME={4},
	PAGES={48},
	YEAR={2021},
	URL={https://www.frontiersin.org/article/10.3389/fdata.2021.693674},
	DOI={10.3389/fdata.2021.693674},
	ISSN={2624-909X},
	ABSTRACT={With the increasing importance and complexity of data pipelines, data quality became one of the key challenges in modern software applications. The importance of data quality has been recognized beyond the field of data engineering and database management systems (DBMSs). Also, for machine learning (ML) applications, high data quality standards are crucial to ensure robust predictive performance and responsible usage of automated decision making. One of the most frequent data quality problems is missing values. Incomplete datasets can break data pipelines and can have a devastating impact on downstream ML applications when not detected. While statisticians and, more recently, ML researchers have introduced a variety of approaches to impute missing values, comprehensive benchmarks comparing classical and modern imputation approaches under fair and realistic conditions are underrepresented. Here, we aim to fill this gap. We conduct a comprehensive suite of experiments on a large number of datasets with heterogeneous data and realistic missingness conditions, comparing both novel deep learning approaches and classical ML imputation methods when either only test or train and test data are affected by missing data. Each imputation method is evaluated regarding the imputation quality and the impact imputation has on a downstream ML task. Our results provide valuable insights into the performance of a variety of imputation methods under realistic conditions. We hope that our results help researchers and engineers to guide their data preprocessing method selection for automated data quality improvement.}
}

GitHub Events

Total

Fork event: 1

Last Year

Fork event: 1

Committers

Last synced: 8 months ago

All Time

Total Commits: 47
Total Committers: 4
Avg Commits per committer: 11.75
Development Distribution Score (DDS): 0.085

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Sebastian Jäger	m**e@s**e	43
tripl3a	3****a	2
felixbiessmann	f****n	1
Sebastian Jäger	g**b@s**e	1
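
The All Time committer statistics can be reproduced from the commit counts in the table above. The sketch below assumes that the Development Distribution Score is one minus the top committer's share of all commits; this page does not state the definition, but the assumption matches the reported 0.085. The function names are hypothetical.

```python
def average_commits(commit_counts):
    """Mean number of commits per committer (0.0 when there are no committers)."""
    if not commit_counts:
        return 0.0
    return sum(commit_counts) / len(commit_counts)


def development_distribution_score(commit_counts):
    """Assumed definition: 1 - (commits by the top committer / total commits)."""
    total = sum(commit_counts)
    if total == 0:
        return 0.0
    return 1 - max(commit_counts) / total


if __name__ == "__main__":
    # Commit counts taken from the Top Committers table.
    all_time = [43, 2, 1, 1]
    print("Total commits:", sum(all_time))
    print("Avg commits per committer:", average_commits(all_time))
    print("DDS:", round(development_distribution_score(all_time), 3))
```

Under this definition a score near 0 means one committer wrote almost all commits, which fits a repository where 43 of 47 commits come from a single identity.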

Committer Domains (Top 20 + Academic)

sebastian-jaeger.me: 2

Issues and Pull Requests

Last synced: 8 months ago

All Time

Total issues: 4
Total pull requests: 55
Average time to close issues: 23 days
Average time to close pull requests: 6 days
Total issue authors: 1
Total pull request authors: 3
Average comments per issue: 1.0
Average comments per pull request: 0.45
Merged pull requests: 47
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
