mlimputer

MLimputer: Missing Data Imputation Framework for Machine Learning

https://github.com/tslu1s/mlimputer

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.3%) to scientific vocabulary

Keywords

automated-machine-learning data-science imputation-algorithm imputation-methods imputation-optimizer machine-learning missing-data missing-data-handling missing-data-imputation null-imputation predictive-imputation python
Last synced: 6 months ago

Repository

MLimputer: Missing Data Imputation Framework for Machine Learning

Basic Info
  • Host: GitHub
  • Owner: TsLu1s
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 4.22 MB
Statistics
  • Stars: 8
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Topics
automated-machine-learning data-science imputation-algorithm imputation-methods imputation-optimizer machine-learning missing-data missing-data-handling missing-data-imputation null-imputation predictive-imputation python
Created about 3 years ago · Last pushed about 1 year ago
Metadata Files
Readme License Citation

README.md



MLimputer: Missing Data Imputation Framework for Machine Learning

Framework Contextualization

The MLimputer project constitutes a complete and integrated pipeline that automates the handling of missing values in datasets through regression prediction, aiming to reduce bias and increase the precision of imputation results compared to classic imputation methods. The package provides multiple algorithm options for imputing your data: every column with missing values is fitted with a robust preprocessing approach and subsequently predicted.

The architecture design includes three main sections: missing data analysis, data preprocessing, and supervised model imputation, organized in a customizable pipeline structure.
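The core idea behind the supervised-imputation stage can be illustrated with a minimal, stdlib-only sketch (a hypothetical illustration, not MLimputer's internals): fit a regression on the rows where the target column is observed, then predict its missing entries from a complete feature column.

```python
# Rows of (feature, target); one target value is missing (None).
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, None), (5.0, 10.1)]

# Fit a least-squares line on the observed rows only
observed = [(x, y) for x, y in rows if y is not None]
n = len(observed)
mean_x = sum(x for x, _ in observed) / n
mean_y = sum(y for _, y in observed) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in observed) / \
        sum((x - mean_x) ** 2 for x, _ in observed)
intercept = mean_y - slope * mean_x

# Predict the missing entries from the feature column
imputed = [(x, y if y is not None else intercept + slope * x) for x, y in rows]
```

MLimputer applies this same principle with full ML models (RandomForest, XGBoost, etc.) and a preprocessing pipeline per missing-data column.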

This project aims to provide the following application capabilities:

  • General applicability on tabular datasets: The developed imputation procedures are applicable to any tabular dataset in any supervised ML scope, based on the missing-data columns to be imputed.

  • Robustness and improvement of predictive results: MLimputer preprocessing aims to improve predictive performance through customization and optimization of the imputation of missing values in the input columns.

Main Development Tools

Major frameworks used to build this project: scikit-learn, XGBoost, LightGBM, CatBoost.

Where to get it

A binary installer for the latest released version is available at the Python Package Index (PyPI).

Installation

To install this package from the PyPI repository, run the following command:

pip install mlimputer

MLimputer - Usage Examples

The first step after importing the package is to load a dataset, split it, and define your chosen imputation model. The imputation model options for handling missing data in your dataset are the following:

  • RandomForest
  • ExtraTrees
  • GBR
  • KNN
  • XGBoost
  • Lightgbm
  • Catboost

After creating an MLimputer object with your selected imputation model, you can fit the missing data through the fit_imput method. From there you can impute future datasets (validation, test, ...) with transform_imput, provided they share the same data properties. Note that, as the example below shows, you can also customize your imputer's model parameters by changing its configurations and then passing them through the imputer_configs parameter.

Through the cross_validation function you can also compare the predictive performance of multiple imputations, allowing you to validate which imputation model best fits your future predictions.

```py
from mlimputer.imputation import MLimputer
import mlimputer.model_selection as ms
from mlimputer.parameters import imputer_parameters
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore", category=Warning) # -> For a clean console

data = pd.read_csv('csv_directory_path') # Dataframe Loading Example

# Important note: If Classification, target should be categorical.
# -> data[target] = data[target].astype('object')

train, test = train_test_split(data, train_size=0.8)
train, test = train.reset_index(drop=True), test.reset_index(drop=True) # <- Required

# All model imputation options -> "RandomForest","ExtraTrees","GBR","KNN","XGBoost","Lightgbm","Catboost"

# Customizing Hyperparameters Example
hparameters = imputer_parameters()
print(hparameters)
hparameters["KNN"]["n_neighbors"] = 5
hparameters["RandomForest"]["n_estimators"] = 30

# Imputation Example 1 : KNN
mli_knn = MLimputer(imput_model = "KNN", imputer_configs = hparameters)
mli_knn.fit_imput(X = train)
train_knn = mli_knn.transform_imput(X = train)
test_knn = mli_knn.transform_imput(X = test)

# Imputation Example 2 : RandomForest
mli_rf = MLimputer(imput_model = "RandomForest", imputer_configs = hparameters)
mli_rf.fit_imput(X = train)
train_rf = mli_rf.transform_imput(X = train)
test_rf = mli_rf.transform_imput(X = test)

# (...)

# Export Imputation Metadata
import pickle
output = open("imputer_rf.pkl", 'wb')
pickle.dump(mli_rf, output)
```
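To reuse the exported imputer later, load it back with pickle. The sketch below uses a plain dictionary as a hypothetical stand-in for the fitted MLimputer object, so it runs without the package installed; in practice you would pickle and unpickle the fitted instance itself.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted MLimputer instance
fitted_imputer = {"imput_model": "RandomForest",
                  "hparameters": {"n_estimators": 30}}

# Serialize to disk, mirroring the export step above
path = os.path.join(tempfile.mkdtemp(), "imputer_rf.pkl")
with open(path, "wb") as f:
    pickle.dump(fitted_imputer, f)

# Restore the object in a later session
with open(path, "rb") as f:
    restored = pickle.load(f)
```

The restored object can then be applied to new data with transform_imput, exactly as in the examples above.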

Performance Evaluation

The MLimputer framework includes a robust evaluation module that enables users to assess and compare the performance of different imputation strategies. This evaluation process is crucial for selecting the most effective imputation approach for your specific dataset and use case.

Evaluation Process Overview

The framework implements a comprehensive two-stage evaluation approach:

  1. Cross-Validation Assessment: Evaluates multiple imputation models using k-fold cross-validation to ensure robust performance metrics.
  2. Test Set Validation: Validates the selected imputation strategy on a separate test set to confirm generalization capability.
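The cross-validation stage partitions the training data so that each fold serves once as the validation set. A stdlib-only sketch of that split logic (a hypothetical helper, not MLimputer's API) looks like this:

```python
def kfold_indices(n, n_splits):
    """Return (train_indices, val_indices) pairs for k-fold cross-validation."""
    # Distribute n samples as evenly as possible across the folds
    fold_sizes = [n // n_splits + (1 if i < n % n_splits else 0)
                  for i in range(n_splits)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    # Each fold is the validation set once; the rest form the training set
    return [([idx for j, f in enumerate(folds) if j != i for idx in f], val)
            for i, val in enumerate(folds)]
```

Each imputation model is scored across all folds, and the average metric decides which imputer advances to the test-set validation stage.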

Implementation Example:

The following example demonstrates how to evaluate imputation models and select the best performing approach for your data:

```py
from mlimputer.evaluation import Evaluator
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBRegressor

# Define evaluation parameters
imputation_models = ["RandomForest", "ExtraTrees", "GBR", "KNN"] # "XGBoost", "Lightgbm", "Catboost" # List of imputation models to evaluate
n_splits = 3 # Number of splits for cross-validation

# Selected models for classification and regression
if train[target].dtypes == "object":
    models = [RandomForestClassifier(), DecisionTreeClassifier()]
else:
    models = [XGBRegressor(), RandomForestRegressor()]

# Initialize the evaluator
evaluator = Evaluator(imputation_models = imputation_models,
                      train = train,
                      target = target,
                      n_splits = n_splits,
                      hparameters = hparameters)

# Perform evaluations
cv_results = evaluator.evaluate_imputation_models(models = models)

best_imputer = evaluator.get_best_imputer() # Get best-performing imputation model

test_results = evaluator.evaluate_test_set(test = test,
                                           imput_model = best_imputer,
                                           models = models)
```

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Luis Santos - LinkedIn

Owner

  • Name: Luís Santos
  • Login: TsLu1s
  • Kind: user
  • Location: Braga, Portugal

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this Python package, please cite it as below."
authors:
- family-names: "Fernando da Silva Santos"
  given-names: "Luís"
  orcid: "https://orcid.org/0000-0002-4121-1133"
title: "MLimputer - Null Imputation Framework for Supervised Machine Learning"
version: 0.0.9
doi: ""
date-released: 2023-02-07
url: "https://github.com/TsLu1s/MLimputer"

GitHub Events

Total
  • Issues event: 2
  • Watch event: 3
  • Issue comment event: 4
  • Push event: 6
Last Year
  • Issues event: 2
  • Watch event: 3
  • Issue comment event: 4
  • Push event: 6

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 64
  • Total Committers: 2
  • Avg Commits per committer: 32.0
  • Development Distribution Score (DDS): 0.109
Top Committers
Name Email Commits
Luís Santos 8****s@u****m 57
TsLuis l****8@h****m 7

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 1
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • dbuscombe-usgs (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 117 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 20
  • Total maintainers: 1
pypi.org: mlimputer

MLimputer - Missing Data Imputation Framework for Machine Learning

  • Versions: 20
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 117 Last month
Rankings
Dependent packages count: 6.6%
Downloads: 18.3%
Average: 21.6%
Stargazers count: 21.8%
Forks count: 30.5%
Dependent repos count: 30.6%
Maintainers (1)
Last synced: 6 months ago

Dependencies

requirements.txt pypi
  • atlantic >=1.0.12
  • catboost >=1.1.1
  • lightgbm >=3.3.5
  • scikit-learn >=1.0.50
  • xgboost >=1.7.3
setup.py pypi