active_imputer

Implementation of confidence-based active data imputation using distribution matching via optimal transport.

https://github.com/zarintahia/active_imputer


Repository

Basic Info
  • Host: GitHub
  • Owner: ZarinTahia
  • Language: Jupyter Notebook
  • Default Branch: Main
  • Size: 1.32 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme Citation

README.md

Active Imputation

Overview

This repository contains the code for the paper "Confidence-Based Active Data Imputation via Distribution Matching".

Introduction

Handling missing data is critical for building reliable machine learning models. While automated imputation methods are common, they often overlook the value of human expertise—especially in cases of high uncertainty.

This project introduces an active imputation framework that uses confidence-based guidance to involve human input effectively. We estimate confidence by measuring the variability of Optimal Transport (OT) imputations and propose an efficient approximation using dropout noise injection.
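As a rough illustration of the confidence idea, the sketch below scores each missing cell by how much its imputed value varies across repeated noisy runs. It is not the repository's implementation: `impute_once` is a hypothetical stand-in that fills cells with column means after randomly dropping observed entries, where the paper uses OT-based imputation with dropout noise. The names `imputation_spread`, `drop_rate`, and `n_runs` are likewise assumptions made for this example.

```python
import numpy as np

def impute_once(X, mask, rng, drop_rate=0.2):
    """Hypothetical stand-in for one noisy imputation run.

    Randomly drops a fraction of the observed entries, then fills every
    missing cell with the mean of the column's remaining observed values.
    """
    observed = ~mask
    keep = observed & (rng.random(X.shape) >= drop_rate)
    filled = X.copy()
    for j in range(X.shape[1]):
        col_keep = keep[:, j]
        # Fall back to all observed values if dropout removed everything.
        source = X[col_keep, j] if col_keep.any() else X[observed[:, j], j]
        filled[mask[:, j], j] = source.mean()
    return filled

def imputation_spread(X, mask, n_runs=20, seed=0):
    """Per-cell standard deviation of imputed values across noisy runs."""
    rng = np.random.default_rng(seed)
    runs = np.stack([impute_once(X, mask, rng) for _ in range(n_runs)])
    spread = runs.std(axis=0)
    spread[~mask] = 0.0  # observed cells carry no imputation uncertainty
    return spread

X = np.array([[0.1, 0.9], [0.2, np.nan], [np.nan, 0.5], [0.4, 0.7]])
mask = np.isnan(X)
spread = imputation_spread(X, mask)
```

A larger spread is read here as lower confidence, so cells with the largest spread would be the natural candidates for a human query.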

We implement two active strategies:

  • Batch-Impute: Queries uncertain values in a single step.
  • Iter-Impute: Distributes queries across multiple iterations for progressive refinement.
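The batch-style selection step can be pictured as picking the missing cells with the highest uncertainty until the query budget is spent. The following is a minimal sketch under that assumption; `top_uncertain_cells` and the `budget` parameter are illustrative names, not identifiers from the repository, and the uncertainty matrix is made up for the example.

```python
import numpy as np

def top_uncertain_cells(uncertainty, mask, budget):
    """Return (row, col) indices of the `budget` most uncertain missing cells."""
    scores = np.where(mask, uncertainty, -np.inf)  # never pick observed cells
    budget = min(budget, int(mask.sum()))
    flat = np.argsort(scores, axis=None)[::-1][:budget]
    return [tuple(int(i) for i in np.unravel_index(f, scores.shape)) for f in flat]

uncertainty = np.array([[0.0, 0.3], [0.8, 0.1], [0.5, 0.0]])
mask = np.array([[False, True], [True, True], [True, False]])

# Batch-style: spend the whole budget in one round.
batch_queries = top_uncertain_cells(uncertainty, mask, budget=2)
print(batch_queries)  # [(1, 0), (2, 0)]
```

An iterative variant would call the same selection several times with a smaller per-round budget, re-estimating uncertainty between rounds.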

Experiments across five real-world datasets show that Iter-Impute outperforms both baseline and random querying methods, achieving better accuracy with limited user interaction. This makes our approach well-suited for practical, semi-automated data cleaning.

Data Preparation

All datasets are automatically downloaded using fetch_openml from scikit-learn. No manual download is needed.

We used the following real-world datasets in our experiments:

  • Adult
  • Wine
  • German Credit
  • Diabetic
  • Breast Cancer

Preprocessing Steps:

  1. Feature Scaling:
    All numerical features are scaled to the [0, 1] range using MinMaxScaler from scikit-learn.

  2. Missing Value Injection:
    Missing values are injected artificially using the Inject_Missing_Value.py script.
    This includes support for:

    • MCAR (Missing Completely at Random)
    • MAR (Missing At Random)
    • MNAR (Missing Not At Random)
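The contents of Inject_Missing_Value.py are not shown here, so the following is only a sketch of how two of the mechanisms are commonly simulated. The function names, the `rate` parameter, and the rank-based MAR rule are assumptions for illustration. MNAR, where missingness depends on the masked value itself, is not covered.

```python
import numpy as np

def inject_mcar(X, rate, rng):
    """MCAR: every cell is masked with the same probability `rate`."""
    mask = rng.random(X.shape) < rate
    X_missing = X.astype(float).copy()
    X_missing[mask] = np.nan
    return X_missing, mask

def inject_mar(X, target_col, driver_col, rate, rng):
    """MAR: missingness in `target_col` depends only on the fully observed
    `driver_col`; rows with a larger driver value are more likely masked."""
    ranks = X[:, driver_col].argsort().argsort() / max(len(X) - 1, 1)
    prob = np.clip(2.0 * rate * ranks, 0.0, 1.0)  # averages to about `rate`
    mask = np.zeros(X.shape, dtype=bool)
    mask[:, target_col] = rng.random(len(X)) < prob
    X_missing = X.astype(float).copy()
    X_missing[mask] = np.nan
    return X_missing, mask

rng = np.random.default_rng(0)
X = rng.random((200, 3))  # stand-in for a dataset already scaled to [0, 1]
X_mcar, mask_mcar = inject_mcar(X, rate=0.2, rng=rng)
X_mar, mask_mar = inject_mar(X, target_col=0, driver_col=1, rate=0.2, rng=rng)
```

Returning the boolean mask alongside the corrupted matrix keeps the ground truth available for evaluating imputations later.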

Experiments

All experiments are conducted in the notebooks/ folder using Jupyter notebooks, with one notebook dedicated to each dataset.

Brier Score Evaluation

  • The notebook experiment.ipynb calculates the Brier score across all datasets to evaluate the confidence calibration of the imputation algorithms.
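For reference, the Brier score is the mean squared difference between a predicted probability and a binary outcome, so lower values indicate better calibration. The sketch below assumes each imputed cell has a confidence in [0, 1] and a 0/1 label saying whether the imputation counted as correct. How the notebook derives that label is not stated here, so the labels in the example are made up.

```python
import numpy as np

def brier_score(confidence, correct):
    """Mean squared gap between predicted confidence and the 0/1 outcome."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return float(np.mean((confidence - correct) ** 2))

confidence = [0.9, 0.8, 0.3, 0.6]  # illustrative per-cell confidences
correct = [1, 1, 0, 0]             # illustrative correctness labels
score = brier_score(confidence, correct)
print(round(score, 3))  # 0.125
```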

Baseline Imputation Methods

Each dataset-specific notebook begins by applying three core methods under each missingness mechanism (MCAR, MAR, MNAR): OT-Impute, OT-Rand, and Batch-Impute. These methods are compared first to evaluate the effectiveness of active selection over random or passive strategies.

In a later section of the notebook, these methods are further compared with additional baseline imputation techniques, including:

  • Mean/Median imputation
  • MICE
  • KNN-Impute
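As a small worked example of the simplest baseline, the sketch below fills missing cells with the column mean or median and measures MAE only on the cells that were masked. The helper names are assumptions, and the notebooks may rely on library imputers instead of hand-written code like this.

```python
import numpy as np

def fill_with_statistic(X_missing, statistic=np.nanmean):
    """Fill each column's NaNs with that column's mean (or median)."""
    filled = X_missing.copy()
    col_values = statistic(X_missing, axis=0)
    rows, cols = np.where(np.isnan(X_missing))
    filled[rows, cols] = col_values[cols]
    return filled

def mae_on_missing(X_true, X_imputed, mask):
    """Mean absolute error restricted to the cells that were masked."""
    return float(np.abs(X_true[mask] - X_imputed[mask]).mean())

X_true = np.array([[0.2, 0.4], [0.4, 0.8], [0.6, 0.6]])
mask = np.array([[False, True], [True, False], [False, False]])
X_missing = np.where(mask, np.nan, X_true)

mean_filled = fill_with_statistic(X_missing, np.nanmean)
median_filled = fill_with_statistic(X_missing, np.nanmedian)
mean_mae = mae_on_missing(X_true, mean_filled, mask)
print(round(mean_mae, 2))  # 0.15
```

Restricting the error to masked cells avoids rewarding a method for cells it never had to impute.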

Optimization: Dropout Noise Sensitivity

  • To improve confidence estimation in Batch-Impute, we inject different levels of dropout noise during the OT optimization process.
  • We empirically evaluate how the choice of dropout noise affects imputation quality.
  • Based on these experiments, the best-performing dropout level is selected.

Iterative Active Imputation (Iter-Impute)

  • In the later sections of each notebook, Iter-Impute is executed across multiple iterations.
  • This method simulates active user involvement over time, gradually refining the imputation by querying the most uncertain values at each step.
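The loop below sketches the iterative idea under simplifying assumptions: a fixed budget is split evenly across rounds, the ground-truth matrix plays the role of the user (`oracle`), and both the imputer and the uncertainty score are deliberately crude placeholders (column means and column spread) rather than the OT-based components described above. All function names are hypothetical.

```python
import numpy as np

def column_mean_impute(X, mask):
    """Placeholder imputer: fill missing cells with observed column means."""
    filled = X.copy()
    for j in range(X.shape[1]):
        filled[mask[:, j], j] = X[~mask[:, j], j].mean()
    return filled

def placeholder_uncertainty(X, mask):
    """Placeholder score: spread of the observed values in each column."""
    col_std = np.array([X[~mask[:, j], j].std() for j in range(X.shape[1])])
    return np.where(mask, col_std, -np.inf)  # observed cells are never picked

def iter_impute(X, mask, oracle, budget, n_rounds):
    X, mask = X.copy(), mask.copy()
    per_round = budget // n_rounds
    for _ in range(n_rounds):
        k = min(per_round, int(mask.sum()))
        if k == 0:
            break
        scores = placeholder_uncertainty(X, mask)  # re-estimated every round
        picks = np.argsort(scores, axis=None)[::-1][:k]
        rows, cols = np.unravel_index(picks, X.shape)
        X[rows, cols] = oracle[rows, cols]  # simulated user answers
        mask[rows, cols] = False            # answered cells become observed
    return column_mean_impute(X, mask), mask

rng = np.random.default_rng(0)
oracle = rng.random((6, 3))
mask = np.zeros((6, 3), dtype=bool)
mask[[0, 1, 2, 3, 4, 5], [0, 1, 2, 0, 1, 2]] = True  # six missing cells
X = np.where(mask, np.nan, oracle)

imputed, remaining = iter_impute(X, mask, oracle, budget=4, n_rounds=2)
```

Because answered cells are treated as observed in later rounds, each round's scores can reflect what the previous queries revealed.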

Visualization

  • At the end of each notebook, all results are visualized.
  • Plots include:
    • MAE comparisons across methods
    • Runtime analysis
    • Dropout noise impact
    • Performance across different iteration steps and budgets

All figures and result CSVs are saved in the Output/ directory for easy access.

Results

All CSV files and figures are saved in the Output/ directory.

Owner

  • Name: Zarin Tahia Hossain
  • Login: ZarinTahia
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this code, please cite our paper:"
title: "Confidence-Based Active Data Imputation via Distribution Matching [Regular]"
authors:
  - family-names: Hossain
    given-names: Zarin Tahia
    affiliation: Department of Computer Science, Western University
    email: zarin.hossain@uwo.ca
  - family-names: Milani
    given-names: Mostafa
    affiliation: Department of Computer Science, Western University
    email: mostafa.milani@uwo.ca
date-released: 2025-05-13
version: 1.0.0
#doi: 10.48550/arXiv.2505.01234  # replace with real DOI/arXiv
url: https://github.com/ZarinTahia/Confidence-Based-Active-Data-Imputation-via-Distribution-Matching-Regular-.git

GitHub Events

Total
  • Push event: 5
Last Year
  • Push event: 5