mlimputer

MLimputer: Missing Data Imputation Framework for Machine Learning

https://github.com/tslu1s/mlimputer

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.3%) to scientific vocabulary

Keywords

automated-machine-learning data-science imputation-algorithm imputation-methods imputation-optimizer machine-learning missing-data missing-data-handling missing-data-imputation null-imputation predictive-imputation python
Last synced: 6 months ago

Repository

MLimputer: Missing Data Imputation Framework for Machine Learning

Basic Info
  • Host: GitHub
  • Owner: TsLu1s
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 4.22 MB
Statistics
  • Stars: 8
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Topics
automated-machine-learning data-science imputation-algorithm imputation-methods imputation-optimizer machine-learning missing-data missing-data-handling missing-data-imputation null-imputation predictive-imputation python
Created about 3 years ago · Last pushed about 1 year ago
Metadata Files
Readme License Citation

README.md



MLimputer: Missing Data Imputation Framework for Machine Learning

Framework Contextualization

The MLimputer project constitutes a complete and integrated pipeline that automates the handling of missing values in datasets through regression prediction, aiming to reduce bias and increase the precision of imputation results compared to classic imputation methods. The package provides multiple algorithm options for imputing your data: every column with missing values is fitted with a robust preprocessing approach and subsequently predicted.

The architecture design includes three main sections: missing data analysis, data preprocessing, and supervised model imputation, organized in a customizable pipeline structure.
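The core idea behind the supervised-imputation stage can be illustrated with a minimal, stdlib-only sketch (a hypothetical illustration, not MLimputer's internals): fit a regression on the rows where the target column is observed, then predict its missing entries from a complete feature column.

```python
# Rows of (feature, target); one target value is missing (None).
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, None), (5.0, 10.1)]

# Fit a least-squares line on the observed rows only
observed = [(x, y) for x, y in rows if y is not None]
n = len(observed)
mean_x = sum(x for x, _ in observed) / n
mean_y = sum(y for _, y in observed) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in observed) / \
        sum((x - mean_x) ** 2 for x, _ in observed)
intercept = mean_y - slope * mean_x

# Predict the missing entries from the feature column
imputed = [(x, y if y is not None else intercept + slope * x) for x, y in rows]
```

MLimputer applies this same principle with full ML models (RandomForest, XGBoost, etc.) and a preprocessing pipeline per missing-data column.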

This project aims to provide the following application capabilities:

  • General applicability on tabular datasets: The developed imputation procedures are applicable to any tabular dataset in any supervised ML scope, based on the missing-data columns to be imputed.

  • Robustness and improvement of predictive results: MLimputer preprocessing aims to improve predictive performance through customization and optimization of the imputation of missing values in the input columns.

Main Development Tools

Major frameworks used to build this project: scikit-learn, XGBoost, LightGBM, CatBoost.

Where to get it

A binary installer for the latest released version is available at the Python Package Index (PyPI).

Installation

To install this package from the PyPI repository, run the following command:

pip install mlimputer

MLimputer - Usage Examples

The first step after importing the package is to load a dataset, split it, and define your chosen imputation model. The imputation model options for handling missing data in your dataset are the following:

  • RandomForest
  • ExtraTrees
  • GBR
  • KNN
  • XGBoost
  • Lightgbm
  • Catboost

After creating an MLimputer object with your selected imputation model, you can fit the missing data through the fit_imput method. From there you can impute future datasets (validation, test, ...) with transform_imput, provided they share the same data properties. Note that, as the example below shows, you can also customize your imputer's model parameters by changing its configurations and then passing them through the imputer_configs parameter.

Through the cross_validation function you can also compare the predictive performance of multiple imputations, allowing you to validate which imputation model best fits your future predictions.

```py
from mlimputer.imputation import MLimputer
import mlimputer.model_selection as ms
from mlimputer.parameters import imputer_parameters
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore", category=Warning) # -> For a clean console

data = pd.read_csv('csv_directory_path') # Dataframe Loading Example

# Important note: If Classification, target should be categorical.
# -> data[target] = data[target].astype('object')

train, test = train_test_split(data, train_size=0.8)
train, test = train.reset_index(drop=True), test.reset_index(drop=True) # <- Required

# All model imputation options -> "RandomForest","ExtraTrees","GBR","KNN","XGBoost","Lightgbm","Catboost"

# Customizing Hyperparameters Example
hparameters = imputer_parameters()
print(hparameters)
hparameters["KNN"]["n_neighbors"] = 5
hparameters["RandomForest"]["n_estimators"] = 30

# Imputation Example 1 : KNN
mli_knn = MLimputer(imput_model = "KNN", imputer_configs = hparameters)
mli_knn.fit_imput(X = train)
train_knn = mli_knn.transform_imput(X = train)
test_knn = mli_knn.transform_imput(X = test)

# Imputation Example 2 : RandomForest
mli_rf = MLimputer(imput_model = "RandomForest", imputer_configs = hparameters)
mli_rf.fit_imput(X = train)
train_rf = mli_rf.transform_imput(X = train)
test_rf = mli_rf.transform_imput(X = test)

# (...)

# Export Imputation Metadata
import pickle
output = open("imputer_rf.pkl", 'wb')
pickle.dump(mli_rf, output)
```
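To reuse the exported imputer later, load it back with pickle. The sketch below uses a plain dictionary as a hypothetical stand-in for the fitted MLimputer object, so it runs without the package installed; in practice you would pickle and unpickle the fitted instance itself.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted MLimputer instance
fitted_imputer = {"imput_model": "RandomForest",
                  "hparameters": {"n_estimators": 30}}

# Serialize to disk, mirroring the export step above
path = os.path.join(tempfile.mkdtemp(), "imputer_rf.pkl")
with open(path, "wb") as f:
    pickle.dump(fitted_imputer, f)

# Restore the object in a later session
with open(path, "rb") as f:
    restored = pickle.load(f)
```

The restored object can then be applied to new data with transform_imput, exactly as in the examples above.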

Performance Evaluation

The MLimputer framework includes a robust evaluation module that enables users to assess and compare the performance of different imputation strategies. This evaluation process is crucial for selecting the most effective imputation approach for your specific dataset and use case.

Evaluation Process Overview

The framework implements a comprehensive two-stage evaluation approach:

  1. Cross-Validation Assessment: Evaluates multiple imputation models using k-fold cross-validation to ensure robust performance metrics.
  2. Test Set Validation: Validates the selected imputation strategy on a separate test set to confirm generalization capability.
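The cross-validation stage partitions the training data so that each fold serves once as the validation set. A stdlib-only sketch of that split logic (a hypothetical helper, not MLimputer's API) looks like this:

```python
def kfold_indices(n, n_splits):
    """Return (train_indices, val_indices) pairs for k-fold cross-validation."""
    # Distribute n samples as evenly as possible across the folds
    fold_sizes = [n // n_splits + (1 if i < n % n_splits else 0)
                  for i in range(n_splits)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    # Each fold is the validation set once; the rest form the training set
    return [([idx for j, f in enumerate(folds) if j != i for idx in f], val)
            for i, val in enumerate(folds)]
```

Each imputation model is scored across all folds, and the average metric decides which imputer advances to the test-set validation stage.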

Implementation Example:

The following example demonstrates how to evaluate imputation models and select the best performing approach for your data:

```py
from mlimputer.evaluation import Evaluator
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBRegressor

# Define evaluation parameters
imputation_models = ["RandomForest", "ExtraTrees", "GBR", "KNN"] # "XGBoost", "Lightgbm", "Catboost" # List of imputation models to evaluate
n_splits = 3 # Number of splits for cross-validation

# Selected models for classification and regression
if train[target].dtypes == "object":
    models = [RandomForestClassifier(), DecisionTreeClassifier()]
else:
    models = [XGBRegressor(), RandomForestRegressor()]

# Initialize the evaluator
evaluator = Evaluator(imputation_models = imputation_models,
                      train = train,
                      target = target,
                      n_splits = n_splits,
                      hparameters = hparameters)

# Perform evaluations
cv_results = evaluator.evaluate_imputation_models(models = models)

best_imputer = evaluator.get_best_imputer() # Get best-performing imputation model

test_results = evaluator.evaluate_test_set(test = test,
                                           imput_model = best_imputer,
                                           models = models)
```

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Luis Santos - LinkedIn

Owner

  • Name: Luís Santos
  • Login: TsLu1s
  • Kind: user
  • Location: Braga, Portugal

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this Python package, please cite it as below."
authors:
- family-names: "Fernando da Silva Santos"
  given-names: "Luís"
  orcid: "https://orcid.org/0000-0002-4121-1133"
title: "MLimputer - Null Imputation Framework for Supervised Machine Learning"
version: 0.0.9
doi: ""
date-released: 2023-02-07
url: "https://github.com/TsLu1s/MLimputer"

GitHub Events

Total
  • Issues event: 2
  • Watch event: 3
  • Issue comment event: 4
  • Push event: 6
Last Year
  • Issues event: 2
  • Watch event: 3
  • Issue comment event: 4
  • Push event: 6

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 64
  • Total Committers: 2
  • Avg Commits per committer: 32.0
  • Development Distribution Score (DDS): 0.109
Top Committers
Name Email Commits
Luís Santos 8****s@u****m 57
TsLuis l****8@h****m 7

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 1
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • dbuscombe-usgs (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 117 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 20
  • Total maintainers: 1
pypi.org: mlimputer

MLimputer - Missing Data Imputation Framework for Machine Learning

  • Versions: 20
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 117 Last month
Rankings
Dependent packages count: 6.6%
Downloads: 18.3%
Average: 21.6%
Stargazers count: 21.8%
Forks count: 30.5%
Dependent repos count: 30.6%
Maintainers (1)
Last synced: 6 months ago

Dependencies

requirements.txt pypi
  • atlantic >=1.0.12
  • catboost >=1.1.1
  • lightgbm >=3.3.5
  • scikit-learn >=1.0.50
  • xgboost >=1.7.3
setup.py pypi