https://github.com/calgo-lab/error-paper

Code for the Paper "Towards Realistic Error Models for Tabular Data" submitted to the Journal of Data and Information Quality

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (9.0%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: calgo-lab
Language: Jupyter Notebook
Default Branch: main
Homepage:
Size: 256 MB

Statistics

Stars: 1
Watchers: 3
Forks: 0
Open Issues: 0
Releases: 0

Created almost 2 years ago · Last pushed 10 months ago

Metadata Files

Readme

Towards Realistic Error Models for Tabular Data

The notebooks in this repository were used to execute the experimental evaluation of our paper "Towards Realistic Error Models for Tabular Data". Specifically,

1) The notebook dataset_generation.ipynb contains the procedure we followed to generate datasets corresponding to the error scenarios we describe in our paper.
2) The notebook dataset_analysis.ipynb contains our analysis of the HOSP dataset.
3) The notebook plots.ipynb contains the procedure we use to generate the figures in our publication. It reads the experiment results from the error_paper/measurements/ directory; check the notebook's code for details.

Installation

We use poetry to manage dependencies. Simply run poetry install to install all dependencies.

Experiments

In our experiments, we use tab_err to examine the impact of errors on data cleaning and on downstream machine learning tasks.

- In the first part of the data cleaning experiments, we generate various erroneous versions of the HOSP dataset and clean them with HoloClean (benchmarks/hosp-impact).
- We then generate various erroneous versions of the datasets bridges, beers, restaurant and cars and correct them with the algorithms baran&raha, holoclean and renuver (benchmarks/cleaning-impact).
- In the downstream machine learning experiments, we look at how ML models behave when given data with various errors (benchmarks/ml_downstream_experiments).
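
To make "erroneous versions of a dataset" concrete, here is a minimal sketch of one simple error type: missing values injected into a single column at a fixed rate. It is written in plain Python and deliberately does not use tab_err, whose API is not shown in this README. The function names inject_missing_values and cell_error_rate, the column names, and the sample rows are all hypothetical and are not taken from the repository.

```python
import random


def inject_missing_values(rows, column, error_rate, seed=0):
    """Return a copy of `rows` with about `error_rate` of the `column` cells set to None.

    Hypothetical helper for illustration only; this is not tab_err's API.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    rng = random.Random(seed)  # a seeded generator keeps the corruption reproducible
    n_errors = round(len(rows) * error_rate)
    targets = set(rng.sample(range(len(rows)), n_errors))
    # Copy every row so the clean dataset is left untouched.
    return [
        {**row, column: None} if i in targets else dict(row)
        for i, row in enumerate(rows)
    ]


def cell_error_rate(clean, dirty, column):
    """Fraction of cells in `column` that differ between the clean and dirty versions."""
    changed = sum(c[column] != d[column] for c, d in zip(clean, dirty))
    return changed / len(clean)


# Made-up sample rows; the real experiments use datasets such as HOSP.
clean = [{"record_id": str(i), "city": "Springfield"} for i in range(10)]
dirty = inject_missing_values(clean, "city", error_rate=0.3, seed=42)
print(cell_error_rate(clean, dirty, "city"))
```

Keeping the clean version alongside the corrupted one is what allows a cleaning algorithm's output to be scored afterwards, and fixing the seed makes each erroneous version reproducible.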

Check the documentation in benchmarks/README.md for instructions on how to replicate our measurements.

Profiling

We also measured the memory usage and runtime of tab_err for various error models and dataset sizes. See the directory benchmarks/profiling for examples.
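
As a rough illustration of this kind of measurement, the sketch below times a function and records its peak memory with Python's standard time and tracemalloc modules, repeated over several dataset sizes. It is an assumption about how such profiling could be done, not the code in benchmarks/profiling; profile_call, make_rows, and blank_every_tenth are hypothetical names, and blank_every_tenth merely stands in for a real error model.

```python
import time
import tracemalloc


def profile_call(func, *args, **kwargs):
    """Run `func` once and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes since start().
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def make_rows(n):
    """Hypothetical stand-in for a tabular dataset with n rows."""
    return [{"id": i, "value": str(i)} for i in range(n)]


def blank_every_tenth(rows):
    """Hypothetical stand-in for an error model: blank out every tenth value."""
    return [{**r, "value": None} if r["id"] % 10 == 0 else dict(r) for r in rows]


if __name__ == "__main__":
    for n in (1_000, 10_000):
        _, seconds, peak = profile_call(blank_every_tenth, make_rows(n))
        print(f"{n:>6} rows: {seconds:.4f} s, peak {peak / 1e6:.2f} MB")
```

Note that tracemalloc adds overhead of its own, so runtimes measured while it is active are slower than in normal operation; measuring time and memory in separate runs avoids that distortion.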

Owner

Name: Cognitive Algorithms Lab
Login: calgo-lab
Kind: organization
Location: Germany

Repositories: 2
Profile: https://github.com/calgo-lab

GitHub Events

Total

Delete event: 2
Push event: 42
Pull request review comment event: 6
Pull request review event: 2
Pull request event: 5
Create event: 7

Last Year

Delete event: 2
Push event: 42
Pull request review comment event: 6
Pull request review event: 2
Pull request event: 5
Create event: 7

Dependencies

poetry.lock pypi

121 dependencies

pyproject.toml pypi

error-generation *
jupyter ^1.0.0
matplotlib ^3.9.2
pandas ^2.2.2
pyarrow ^16.1.0
python >=3.9,<3.12
seaborn ^0.13.2

benchmarks/holoclean/Dockerfile docker

python 3.7 build

benchmarks/holoclean/docker-compose.yml docker

hc36 latest
postgres 11

benchmarks/holoclean/requirements.txt pypi

enum34 ==1.1.6
gensim ==3.7.1
numpy ==1.16.1
pandas ==0.24.1
psycopg2-binary ==2.7.7
pyitlib ==0.2.0
pytest-xdist ==1.26.1
python-Levenshtein ==0.12.0
scikit-learn ==0.20.0
scipy ==1.2.1
sqlalchemy ==1.2.17
torch ==1.0.1
tqdm ==4.31.0
