mlimputer
MLimputer: Missing Data Imputation Framework for Machine Learning
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.3%) to scientific vocabulary
Keywords
Repository
MLimputer: Missing Data Imputation Framework for Machine Learning
Basic Info
Statistics
- Stars: 8
- Watchers: 1
- Forks: 1
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
MLimputer: Missing Data Imputation Framework for Machine Learning
Framework Contextualization
The MLimputer project constitutes a complete and integrated pipeline that automates the handling of missing values in datasets through regression prediction, aiming to reduce bias and increase the precision of imputation results compared to classic imputation methods.
This package provides multiple algorithm options for imputing your data: every column that contains missing values is fitted with a robust preprocessing approach and its missing entries are subsequently predicted.
The architecture comprises three main sections: missing data analysis, data preprocessing, and supervised model imputation, organized in a customizable pipeline structure.
This project aims to provide the following capabilities:
General applicability on tabular datasets: the imputation procedures apply to any data table used in supervised ML contexts, targeting the columns that contain missing values.
Robustness and improvement of predictive results: applying MLimputer preprocessing aims to improve predictive performance through customizable, optimized imputation of the missing values in the input columns.
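As a rough illustration of the underlying idea, the following is a minimal sketch of regression-based imputation for a single column, written against scikit-learn directly. It is not the MLimputer implementation, and the column names are invented for the example: a model is fitted on the rows where the column is observed and then predicts its missing entries from the other columns.
```py
# Minimal, self-contained sketch of regression-based column imputation using
# scikit-learn directly -- illustration only, not the MLimputer implementation.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.random(200),
    "feature_b": rng.random(200),
    "with_gaps": rng.random(200),
})
df.loc[rng.random(200) < 0.2, "with_gaps"] = np.nan  # introduce ~20% missing values

observed = df["with_gaps"].notna()
model = RandomForestRegressor(n_estimators=50, random_state=0)
# Fit on the rows where the target column is observed...
model.fit(df.loc[observed, ["feature_a", "feature_b"]], df.loc[observed, "with_gaps"])
# ...then predict the missing entries from the remaining columns.
df.loc[~observed, "with_gaps"] = model.predict(df.loc[~observed, ["feature_a", "feature_b"]])
```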
Main Development Tools
Major frameworks used to build this project:
Where to get it
A binary installer for the latest released version is available at the Python Package Index (PyPI).
Installation
To install this package from the PyPI repository, run the following command:
pip install mlimputer
MLImputer - Usage Examples
The first step after importing the package is to load a dataset, split it, and define your chosen imputation model.
The imputation model options for handling the missing data in your dataset are the following:
* RandomForest
* ExtraTrees
* GBR
* KNN
* XGBoost
* Lightgbm
* Catboost
After creating an MLimputer object with your selected imputation model, you can fit the missing data through the fit_imput method. From there you can impute future datasets with the same data properties (validation, test, ...) using transform_imput. Note that, as shown in the example below, you can also customize your imputation model's hyperparameters by changing its configurations and passing them through the imputer_configs parameter.
Through the cross_validation function you can also compare the predictive performance of multiple imputations, allowing you to validate which imputation model best fits your future predictions.
```py
from mlimputer.imputation import MLimputer
import mlimputer.model_selection as ms
from mlimputer.parameters import imputer_parameters
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore", category=Warning) # -> For a clean console

data = pd.read_csv('csv_directory_path') # Dataframe Loading Example

# Important note: If Classification, target should be categorical. -> data[target] = data[target].astype('object')

train, test = train_test_split(data, train_size = 0.8)
train, test = train.reset_index(drop = True), test.reset_index(drop = True) # <- Required

# All model imputation options -> "RandomForest", "ExtraTrees", "GBR", "KNN", "XGBoost", "Lightgbm", "Catboost"

# Customizing Hyperparameters Example
hparameters = imputer_parameters()
print(hparameters)
hparameters["KNN"]["n_neighbors"] = 5
hparameters["RandomForest"]["n_estimators"] = 30

# Imputation Example 1 : KNN
mli_knn = MLimputer(imput_model = "KNN", imputer_configs = hparameters)
mli_knn.fit_imput(X = train)
train_knn = mli_knn.transform_imput(X = train)
test_knn = mli_knn.transform_imput(X = test)

# Imputation Example 2 : RandomForest
mli_rf = MLimputer(imput_model = "RandomForest", imputer_configs = hparameters)
mli_rf.fit_imput(X = train)
train_rf = mli_rf.transform_imput(X = train)
test_rf = mli_rf.transform_imput(X = test)

# (...)

# Export Imputation Metadata
import pickle
output = open("imputer_rf.pkl", 'wb')
pickle.dump(mli_rf, output)
```
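The exported imputer can be reused in a later session by loading the pickled object back and applying it to new data with the same columns. A minimal sketch, assuming the imputer_rf.pkl file created by the export step above:
```py
# Reload the previously exported imputer and impute a new dataset
# (minimal sketch; assumes "imputer_rf.pkl" was written by the export step above).
import pickle

with open("imputer_rf.pkl", "rb") as f:
    mli_rf = pickle.load(f)

new_data_imputed = mli_rf.transform_imput(X = test)  # same columns/data properties as the fitted data
```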
Performance Evaluation
The MLimputer framework includes a robust evaluation module that enables users to assess and compare the performance of different imputation strategies. This evaluation process is crucial for selecting the most effective imputation approach for your specific dataset and use case.
Evaluation Process Overview
The framework implements a comprehensive two-stage evaluation approach:
1. Cross-Validation Assessment: evaluates multiple imputation models using k-fold cross-validation to ensure robust performance metrics.
2. Test Set Validation: validates the selected imputation strategy on a separate test set to confirm generalization capability.
Implementation Example:
The following example demonstrates how to evaluate imputation models and select the best performing approach for your data:
```py
from mlimputer.evaluation import Evaluator
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBRegressor

# Define evaluation parameters
imputation_models = ["RandomForest", "ExtraTrees", "GBR", "KNN"] # "XGBoost", "Lightgbm", "Catboost" are also available
n_splits = 3 # Number of splits for cross-validation

# Selected models for classification and regression
if train[target].dtypes == "object":
    models = [RandomForestClassifier(), DecisionTreeClassifier()]
else:
    models = [XGBRegressor(), RandomForestRegressor()]

# Initialize the evaluator
evaluator = Evaluator(
    imputation_models = imputation_models,
    train = train,
    target = target,
    n_splits = n_splits,
    hparameters = hparameters)

# Perform evaluations
cv_results = evaluator.evaluate_imputation_models(models = models)

best_imputer = evaluator.get_best_imputer() # Get best-performing imputation model

test_results = evaluator.evaluate_test_set(test = test, imput_model = best_imputer, models = models)
```
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Luis Santos - LinkedIn
Owner
- Name: Luís Santos
- Login: TsLu1s
- Kind: user
- Location: Braga, Portugal
- Repositories: 4
- Profile: https://github.com/TsLu1s
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this Python package, please cite it as below."
authors:
  - family-names: "Fernando da Silva Santos"
    given-names: "Luís"
    orcid: "https://orcid.org/0000-0002-4121-1133"
title: "MLimputer - Null Imputation Framework for Supervised Machine Learning"
version: 0.0.9
doi: ""
date-released: 2023-02-07
url: "https://github.com/TsLu1s/MLimputer"
GitHub Events
Total
- Issues event: 2
- Watch event: 3
- Issue comment event: 4
- Push event: 6
Last Year
- Issues event: 2
- Watch event: 3
- Issue comment event: 4
- Push event: 6
Committers
Last synced: almost 3 years ago
All Time
- Total Commits: 64
- Total Committers: 2
- Avg Commits per committer: 32.0
- Development Distribution Score (DDS): 0.109
Top Committers
| Name | Email | Commits |
|---|---|---|
| Luís Santos | 8****s@u****m | 57 |
| TsLuis | l****8@h****m | 7 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 1
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- dbuscombe-usgs (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: 117 last-month (PyPI)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 20
- Total maintainers: 1
pypi.org: mlimputer
MLimputer - Missing Data Imputation Framework for Machine Learning
- Homepage: https://github.com/TsLu1s/MLimputer
- Documentation: https://mlimputer.readthedocs.io/
- License: MIT
- Latest release: 1.0.80 (published about 1 year ago)
Rankings
Maintainers (1)
Dependencies
- atlantic >=1.0.12
- catboost >=1.1.1
- lightgbm >=3.3.5
- scikit-learn >=1.0.50
- xgboost >=1.7.3