active_imputer
Implementation of confidence-based active data imputation using distribution matching via optimal transport.
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.3%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: ZarinTahia
- Language: Jupyter Notebook
- Default Branch: Main
- Size: 1.32 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Active Imputation
Overview
This repository contains the code for the paper "Confidence-Based Active Data Imputation via Distribution Matching".
Introduction
Handling missing data is critical for building reliable machine learning models. While automated imputation methods are common, they often overlook the value of human expertise, especially in cases of high uncertainty.
This project introduces an active imputation framework that uses confidence-based guidance to involve human input effectively. We estimate confidence by measuring the variability of Optimal Transport (OT) imputations and propose an efficient approximation using dropout noise injection.
We implement two active strategies:
- Batch-Impute: Queries uncertain values in a single step.
- Iter-Impute: Distributes queries across multiple iterations for progressive refinement.
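As a rough illustration of confidence-guided querying, the sketch below scores each missing cell by the spread of its value across repeated imputation runs and picks the most uncertain cells up to a query budget. The function names `cell_uncertainty` and `select_queries` are hypothetical and do not come from this repository, and the standard deviation is only one possible measure of variability; the paper's exact confidence score may differ.

```python
import numpy as np

def cell_uncertainty(imputations, missing_mask):
    """Per-cell standard deviation across repeated imputations.

    imputations: array of shape (n_runs, n_rows, n_cols)
    missing_mask: boolean array of shape (n_rows, n_cols), True where missing
    """
    std = np.asarray(imputations, dtype=float).std(axis=0)
    # Observed cells are never query candidates, so give them the lowest score.
    return np.where(missing_mask, std, -np.inf)

def select_queries(uncertainty, budget):
    """Return (row, col) pairs of the `budget` most uncertain missing cells."""
    n_candidates = int(np.isfinite(uncertainty).sum())
    k = min(budget, n_candidates)
    # Sort all cells by uncertainty, highest first, and keep the top k.
    flat_order = np.argsort(uncertainty, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat_order, uncertainty.shape)
    return list(zip(rows.tolist(), cols.tolist()))
```

Under this reading, Batch-Impute would call `select_queries` once with the full budget, whereas Iter-Impute would call it repeatedly with a share of the budget after each round of answers.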
Experiments across five real-world datasets show that Iter-Impute outperforms both baseline and random querying methods, achieving better accuracy with limited user interaction. This makes our approach well-suited for practical, semi-automated data cleaning.
Data Preparation
All datasets are automatically downloaded using fetch_openml from scikit-learn. No manual download is needed.
We used the following real-world datasets in our experiments:
- Adult
- Wine
- German Credit
- Diabetic
- Breast Cancer
Preprocessing Steps:
- Feature Scaling: All numerical features are scaled to the [0, 1] range using MinMaxScaler from scikit-learn.
- Missing Value Injection: Missing values are injected artificially using the Inject_Missing_Value.py script. This includes support for:
  - MCAR (Missing Completely at Random)
  - MAR (Missing At Random)
  - MNAR (Missing Not At Random)
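The preprocessing steps above can be sketched in plain NumPy as follows. This is not the code in Inject_Missing_Value.py: the function names, the `rate` and `driver_col` parameters, and the specific probability rules for MAR and MNAR are assumptions chosen for illustration, and `min_max_scale` is a stand-in for scikit-learn's MinMaxScaler.

```python
import numpy as np

def min_max_scale(X):
    """Scale each column to [0, 1]; constant columns map to 0."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (X - lo) / span

def inject_mcar(X, rate, rng):
    """MCAR: every cell is masked with the same probability `rate`."""
    mask = rng.random(X.shape) < rate
    return np.where(mask, np.nan, X), mask

def inject_mar(X, rate, rng, driver_col=0):
    """MAR: masking in the other columns depends on an observed driver column."""
    n, d = X.shape
    # Rank of each row's driver value, rescaled to [0, 1].
    ranks = X[:, driver_col].argsort().argsort() / max(n - 1, 1)
    # Rows with larger driver values are more likely to lose entries.
    prob = np.clip(2.0 * rate * ranks, 0.0, 1.0)[:, None]
    mask = rng.random((n, d)) < prob
    mask[:, driver_col] = False  # the driver column stays fully observed
    return np.where(mask, np.nan, X), mask

def inject_mnar(X, rate, rng):
    """MNAR: a cell's masking probability depends on its own value."""
    prob = np.clip(2.0 * rate * X, 0.0, 1.0)  # assumes X is scaled to [0, 1]
    mask = rng.random(X.shape) < prob
    return np.where(mask, np.nan, X), mask
```

Each injector returns both the corrupted matrix and the boolean mask, so the held-out true values can later be used to score imputations or to simulate user answers.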
Experiments
All experiments are conducted in the notebooks/ folder using Jupyter notebooks, with one notebook dedicated to each dataset.
Brier Score Evaluation
- The notebook experiment.ipynb calculates the Brier score across all datasets to evaluate the confidence calibration of the imputation algorithms.
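For reference, the Brier score is the mean squared difference between predicted confidences and binary outcomes. The sketch below is a minimal version, not the notebook's code. How an imputed value is turned into a binary "correct" outcome is not stated here, so the tolerance rule in `imputation_correct` (and the `tol` parameter) is purely an assumption.

```python
import numpy as np

def brier_score(confidence, correct):
    """Mean squared gap between confidences in [0, 1] and 0/1 outcomes."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return float(np.mean((confidence - correct) ** 2))

def imputation_correct(imputed, truth, tol=0.1):
    """Assumed binarisation: an imputation counts as correct within `tol`."""
    diff = np.abs(np.asarray(imputed, dtype=float) - np.asarray(truth, dtype=float))
    return diff <= tol
```

Lower scores indicate better calibration: a perfectly confident and always correct imputer scores 0, while one that always reports 0.5 scores 0.25.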
Baseline Imputation Methods
Each dataset-specific notebook begins by applying the following three core methods under various missingness mechanisms (MCAR, MAR, MNAR): OT-Impute, OT-Rand, and Batch-Impute. These three methods are first compared to evaluate the effectiveness of active selection over random or passive strategies.
In a later section of the notebook, these methods are further compared with additional baseline imputation techniques, including:
- Mean/Median imputation
- MICE
- KNN-Impute
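To make the comparison concrete, here is a minimal sketch of the simplest baseline, column-mean imputation, together with the mean absolute error (MAE) computed only on the cells that were artificially masked. The helper names `mean_impute` and `mae_on_missing` are illustrative and are not taken from the notebooks.

```python
import numpy as np

def mean_impute(X_missing):
    """Fill each NaN with the mean of the observed values in its column.

    Assumes every column has at least one observed value.
    """
    X = np.array(X_missing, dtype=float)  # copy so the input is untouched
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def mae_on_missing(X_imputed, X_true, missing_mask):
    """Mean absolute error restricted to the originally missing cells."""
    diff = np.abs(np.asarray(X_imputed, dtype=float) - np.asarray(X_true, dtype=float))
    return float(diff[missing_mask].mean())
```

Scoring only the masked cells keeps the metric from being diluted by observed entries, which every method reproduces exactly.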
Optimization: Dropout Noise Sensitivity
- To improve confidence estimation in Batch-Impute, we inject different levels of dropout noise during the OT optimization process.
- We empirically evaluate how the choice of dropout noise affects imputation quality.
- Based on these experiments, the best-performing dropout level is selected.
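The selection step described above amounts to a small grid sweep. The sketch below assumes a user-supplied `evaluate` callable that runs the pipeline at a given dropout level and returns an error such as MAE; both `sweep_dropout` and `evaluate` are hypothetical names, and the sketch does not reproduce how dropout noise enters the OT optimization.

```python
from typing import Callable, Dict, Iterable, Tuple

def sweep_dropout(
    levels: Iterable[float],
    evaluate: Callable[[float], float],
) -> Tuple[float, Dict[float, float]]:
    """Evaluate each dropout level and return the one with the lowest error."""
    scores = {p: evaluate(p) for p in levels}
    best = min(scores, key=scores.get)
    return best, scores
```

Returning the full score table alongside the winner makes it easy to plot the sensitivity curve rather than reporting only the selected level.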
Iterative Active Imputation (Iter-Impute)
- In the later sections of each notebook, Iter-Impute is executed across multiple iterations.
- This method simulates active user involvement over time, gradually refining the imputation by querying the most uncertain values at each step.
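The iterative loop can be sketched as follows, under several loudly stated assumptions. The stochastic imputer `toy_impute_runs` is a stand-in (column mean plus Gaussian noise) and is NOT the OT-based imputer used in this project. User answers are simulated by reading the ground-truth matrix `oracle`. The budget is split evenly across iterations, with any remainder left unused. All names are hypothetical.

```python
import numpy as np

def toy_impute_runs(X, n_runs=20, seed=0):
    """Stand-in stochastic imputer (not OT): column mean plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    filled = mean + rng.normal(size=(n_runs,) + X.shape) * std
    # Keep observed values; use noisy fills only where X is missing.
    return np.where(np.isnan(X), filled, X)

def iter_impute(X_missing, oracle, impute_runs, budget, n_iters):
    """Query the most uncertain cells in rounds, then impute what remains."""
    X = np.array(X_missing, dtype=float)
    per_iter = budget // n_iters  # even split; remainder is not spent
    queried = []
    for _ in range(n_iters):
        missing = np.isnan(X)
        if per_iter == 0 or not missing.any():
            break
        runs = impute_runs(X)  # shape (n_runs, n_rows, n_cols)
        std = np.where(missing, runs.std(axis=0), -np.inf)
        k = min(per_iter, int(missing.sum()))
        top = np.argsort(std, axis=None)[::-1][:k]
        rows, cols = np.unravel_index(top, X.shape)
        X[rows, cols] = oracle[rows, cols]  # simulated user answers
        queried.extend(zip(rows.tolist(), cols.tolist()))
    # Fill the cells that were never queried with the mean over runs.
    missing = np.isnan(X)
    X[missing] = impute_runs(X).mean(axis=0)[missing]
    return X, queried
```

Because answered cells become observed values, each later round re-estimates uncertainty with more information, which is the intuition behind spreading queries over several iterations.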
Visualization
- At the end of each notebook, all results are visualized.
- Plots include:
  - MAE comparisons across methods
  - Runtime analysis
  - Dropout noise impact
  - Performance across different iteration steps and budgets
All figures and result CSVs are saved in the Output/ directory for easy access.
Results
All CSV files and figures are saved in the Output directory.
Owner
- Name: Zarin Tahia Hossain
- Login: ZarinTahia
- Kind: user
- Repositories: 1
- Profile: https://github.com/ZarinTahia
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this code, please cite our paper:"
title: "Confidence-Based Active Data Imputation via Distribution Matching [Regular]"
authors:
  - family-names: Hossain
    given-names: Zarin Tahia
    affiliation: Department of Computer Science, Western University
    email: zarin.hossain@uwo.ca
  - family-names: Milani
    given-names: Mostafa
    affiliation: Department of Computer Science, Western University
    email: mostafa.milani@uwo.ca
date-released: 2025-05-13
version: 1.0.0
#doi: 10.48550/arXiv.2505.01234 # replace with real DOI/arXiv
url: https://github.com/ZarinTahia/Confidence-Based-Active-Data-Imputation-via-Distribution-Matching-Regular-.git
GitHub Events
Total
- Push event: 5
Last Year
- Push event: 5