zoish

Zoish is a Python package that streamlines machine learning by leveraging SHAP values for feature selection and interpretability, making model development more efficient and user-friendly

https://github.com/torkamanilab/zoish

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.3%) to scientific vocabulary

Keywords

automl data-science feature-engineering feature-selection machine-learning python scikit-learn
Last synced: 6 months ago

Repository

Zoish is a Python package that streamlines machine learning by leveraging SHAP values for feature selection and interpretability, making model development more efficient and user-friendly

Basic Info
Statistics
  • Stars: 11
  • Watchers: 3
  • Forks: 1
  • Open Issues: 2
  • Releases: 5
Topics
automl data-science feature-engineering feature-selection machine-learning python scikit-learn
Created over 3 years ago · Last pushed about 2 years ago
Metadata Files
Readme · License · Citation

README.md


Zoish

Zoish is a Python package that simplifies the machine learning process by using SHAP values for feature importance. It integrates with a range of machine learning models, provides feature selection to enhance performance, and improves model interpretability. With Zoish, users can also visualize feature importance through SHAP summary and bar plots, creating an efficient and user-friendly environment for machine learning development.

Introduction

Zoish is a powerful tool for streamlining your machine learning pipeline by leveraging SHAP (SHapley Additive exPlanations) values for feature selection. Designed to work seamlessly with binary and multi-class classification models as well as regression models from sklearn, Zoish is also compatible with gradient boosting frameworks such as CatBoost, LightGBM and GPBoost.

Features

  • Model Flexibility: Zoish works with most scikit-learn-compatible estimators: binary and multi-class Sklearn classification models, all Sklearn regression models, advanced gradient boosting frameworks such as CatBoost, LightGBM, and GPBoost, and even a superior estimator emerging from a tree-based optimization process (a minimal usage sketch follows this list).

  • Feature Selection: By utilizing SHAP values, Zoish efficiently determines the most influential features for your predictive models. This improves the interpretability of your model and can potentially enhance model performance by reducing overfitting.

  • Visualization: Zoish includes capabilities for plotting important features using SHAP summary plots and SHAP bar plots, providing a clear and visual representation of feature importance.
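
As a quick illustration of the selection and plotting features above, here is a minimal sketch that wires a LightGBM classifier into ShapFeatureSelector. The constructor arguments mirror those used in the examples later in this README; the synthetic dataset, the choice of LGBMClassifier, and the assumption that the selector follows the scikit-learn fit/transform API (as its use inside a Pipeline below suggests) are illustrative, not taken verbatim from the documentation.

```python
# Minimal sketch: SHAP-based feature selection + plots with Zoish.
# Assumptions: LGBMClassifier as the underlying model and a synthetic dataset.
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from zoish.feature_selectors.shap_selectors import ShapFeatureSelector, ShapPlotFeatures

X, y = make_classification(n_samples=500, n_features=20, n_informative=6, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])

# Fit the model whose SHAP values will drive the selection
model = LGBMClassifier().fit(X, y)

# Keep the 5 most influential features according to their SHAP values
selector = ShapFeatureSelector(
    model=model,
    num_features=5,
    cv=5,
    scoring="accuracy",
    direction="maximum",
    n_iter=10,
    algorithm="auto",
)
X_selected = selector.fit_transform(X, y)

# Visualize feature importance for the selected features
plot_factory = ShapPlotFeatures(selector)
plot_factory.summary_plot()
plot_factory.bar_plot()
```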

Dependencies

The core dependency of Zoish is the shap package, which is used to compute SHAP values for tree-based machine learning models as well as other estimators. SHAP values are a unified measure of feature importance and offer an improved interpretation of machine learning models. They are based on concepts from cooperative game theory and provide a fair allocation of each feature's contribution to the prediction for each instance.
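
For readers new to SHAP, the following self-contained sketch shows how the shap package computes these values for a tree-based model, independently of Zoish (the model and dataset here are purely illustrative):

```python
# Illustrative only: computing SHAP values directly with the shap package.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer computes SHAP values efficiently for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global view of feature importance: mean absolute SHAP value per feature
shap.summary_plot(shap_values, X, plot_type="bar")
```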

Installation

The Zoish package is available on PyPI and can be installed with pip:

```sh
pip install zoish
```

For log configuration in a development environment, use:

```sh
export env=dev
```

For log configuration in a production environment, use:

```sh
export env=prod
```
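
If you would rather set this from Python than from the shell, a rough equivalent is to set the variable before Zoish is first imported. This assumes Zoish reads the `env` variable at import time, which the shell commands above suggest but the README does not state explicitly.

```python
import os

# Equivalent of `export env=dev`; set before the first zoish import so the
# package can pick it up when configuring its logger (assumption, see above).
os.environ["env"] = "dev"  # or "prod"

from zoish import logger
```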

Examples

```python
# Built-in libraries
import pandas as pd

# Scikit-learn libraries for model selection, metrics, pipeline, impute,
# preprocessing, compose, and ensemble
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix, f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Other libraries
from category_encoders import TargetEncoder
from xgboost import XGBClassifier
from zoish.feature_selectors.shap_selectors import ShapFeatureSelector, ShapPlotFeatures
import logging
from zoish import logger
from feature_engine.imputation import (
    CategoricalImputer,
    MeanMedianImputer,
)

# Set logging level
logger.setLevel(logging.ERROR)
```

Example: Lymphography Data Set

https://archive.ics.uci.edu/ml/datasets/Lymphography

Read data

```python
url_data = "https://archive.ics.uci.edu/ml/machine-learning-databases/lymphography/lymphography.data"
url_name = "https://archive.ics.uci.edu/ml/machine-learning-databases/lung-cancer/lung-cancer.names"

# Column names
col_names = [
    "class",
    "lymphatics",
    "block of affere",
    "bl. of lymph. c",
    "bl. of lymph. s",
    "by pass",
    "extravasates",
    "regeneration of",
    "early uptake in",
    "lym.nodes dimin",
    "lym.nodes enlar",
    "changes in lym.",
    "defect in node",
    "changes in node",
    "changes in stru",
    "special forms",
    "dislocation of",
    "exclusion of no",
    "no. of nodes in",
]

data = pd.read_csv(url_data, names=col_names)
data.head()
```

Define labels and train-test split

```python
data.loc[(data["class"] == 1) | (data["class"] == 2), "class"] = 0
data.loc[data["class"] == 3, "class"] = 1
data.loc[data["class"] == 4, "class"] = 2
data["class"] = data["class"].astype(int)
```

Train test split

```python
X = data.loc[:, data.columns != "class"]
y = data.loc[:, data.columns == "class"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
```

Defining the feature pipeline steps:

Here, we use an untuned XGBClassifier model with the ShapFeatureSelector. In the next section, we repeat the same process with a tuned XGBClassifier, to demonstrate that a better estimator can yield improved results when used with the ShapFeatureSelector.

```python
estimator_for_feature_selector = XGBClassifier()
estimator_for_feature_selector.fit(X_train, y_train)

shap_feature_selector = ShapFeatureSelector(
    model=estimator_for_feature_selector,
    num_features=5,
    cv=5,
    scoring='accuracy',
    direction='maximum',
    n_iter=10,
    algorithm='auto',
)

# Define pre-processing for numeric columns (float and integer types)
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

# Define pre-processing for categorical features
categorical_features = X_train.select_dtypes(include=['object']).columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', TargetEncoder(handle_missing='return_nan'))])

# Combine preprocessing into one column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Feature selection using the ShapFeatureSelector
feature_selection = shap_feature_selector

# Classifier model
classifier = RandomForestClassifier(n_estimators=100)

# Create a pipeline that combines the preprocessor with feature selection and a classifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selection', feature_selection),
    ('classifier', classifier)])

# Fit the model
pipeline.fit(X_train, y_train)

# Predict on test data
y_test_pred = pipeline.predict(X_test)

# Output first 10 predictions
print(y_test_pred[:10])
```

Check performance of the Pipeline

```python
print("F1 score : ")
print(f1_score(y_test, y_test_pred, average='micro'))
print("Classification report : ")
print(classification_report(y_test, y_test_pred))
print("Confusion matrix : ")
print(confusion_matrix(y_test, y_test_pred))
```

Use a better estimator:

In this iteration, we use the optimally tuned estimator with the ShapFeatureSelector, which is expected to yield improved results.

```python
int_cols = X_train.select_dtypes(include=['int']).columns.tolist()

# Define the XGBClassifier
xgb_clf = XGBClassifier()

# Define the parameter grid for XGBClassifier
param_grid = {
    'learning_rate': [0.01, 0.1],
    'max_depth': [4, 5],
    'min_child_weight': [1, 2, 3],
    'gamma': [0, 0.1, 0.2],
}

# Define the scoring function
scoring = make_scorer(f1_score, average='micro')  # Use 'micro' average in case of multiclass target

# Set up and fit the GridSearchCV object
grid_search = GridSearchCV(xgb_clf, param_grid, cv=5, scoring=scoring, verbose=1)
grid_search.fit(X_train, y_train)

# Use the tuned estimator with the ShapFeatureSelector
estimator_for_feature_selector = grid_search.best_estimator_
shap_feature_selector = ShapFeatureSelector(
    model=estimator_for_feature_selector,
    num_features=5,
    scoring='accuracy',
    algorithm='auto',
    cv=5,
    n_iter=10,
    direction='maximum',
)

pipeline = Pipeline([
    # Impute missing values in the integer columns
    ('float_imputer', MeanMedianImputer(imputation_method='mean', variables=int_cols)),
    ('shap_feature_selector', shap_feature_selector),
    ('classifier', RandomForestClassifier(n_estimators=100)),
])

# Fit the model
pipeline.fit(X_train, y_train)

# Predict on test data
y_test_pred = pipeline.predict(X_test)

# Output first 10 predictions
print(y_test_pred[:10])
```

Performance has improved

```python
print("F1 score : ")
print(f1_score(y_test, y_test_pred, average='micro'))
print("Classification report : ")
print(classification_report(y_test, y_test_pred))
print("Confusion matrix : ")
print(confusion_matrix(y_test, y_test_pred))
```

SHAP-related plots

Plot the feature importance

```python
plot_factory = ShapPlotFeatures(shap_feature_selector)
```

Summary Plot of the selected features

```python
plot_factory.summary_plot()
```

[summary plot]

Summary Plot of all the features

```python
plot_factory.summary_plot_full()
```

[summary plot full]

Bar Plot of the selected features

```python
plot_factory.bar_plot()
```

[bar plot]

Bar Plot of all the features

```python
plot_factory.bar_plot_full()
```

[bar plot full]

More examples are available in the examples directory of the repository.

License

Licensed under the BSD 2-Clause License.

Owner

  • Name: The Scripps Research Institute - Torkamani Lab
  • Login: TorkamaniLab
  • Kind: organization
  • Email: atorkama@scripps.edu

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Javedani Sadaei"
  given-names: "Hossein"
  orcid: "https://orcid.org/0000-0002-0848-9280"
- family-names: "Torkamani"
  given-names: "Ali"
  orcid: "https://orcid.org/0000-0003-0232-8053"
title: "Zoish: Automated feature selectoion tools"
version: 3
doi: 'DOI: 10.5281/zenodo.8336342'
date-released: 2023-04-18
url: "https://github.com/TorkamaniLab/zoish"

GitHub Events

Total
  • Issues event: 1
  • Watch event: 1
Last Year
  • Issues event: 1
  • Watch event: 1

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 407
  • Total Committers: 2
  • Avg Commits per committer: 203.5
  • Development Distribution Score (DDS): 0.002
Past Year
  • Commits: 135
  • Committers: 2
  • Avg Commits per committer: 67.5
  • Development Distribution Score (DDS): 0.007
Top Committers
Name Email Commits
drhosseinjavedani h****i@g****m 406
Vigneshwaran s****3@g****m 1

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 11
  • Total pull requests: 73
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 33 minutes
  • Total issue authors: 3
  • Total pull request authors: 3
  • Average comments per issue: 0.82
  • Average comments per pull request: 0.08
  • Merged pull requests: 71
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ShaunFChen (8)
  • HajarSigarchian (1)
  • brcopeland (1)
Pull Request Authors
  • drhosseinjavedani (69)
  • ShaunFChen (2)
  • neshvig10 (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 127 last-month
  • Total dependent packages: 1
  • Total dependent repositories: 0
  • Total versions: 36
  • Total maintainers: 1
pypi.org: zoish

Zoish is a Python package that streamlines machine learning by leveraging SHAP values for feature selection and interpretability, making model development more efficient and user-friendly.

  • Versions: 36
  • Dependent Packages: 1
  • Dependent Repositories: 0
  • Downloads: 127 Last month
Rankings
Dependent packages count: 6.6%
Downloads: 20.3%
Average: 25.4%
Forks count: 30.5%
Dependent repos count: 30.6%
Stargazers count: 39.1%
Maintainers (1)
Last synced: 6 months ago

Dependencies

poetry.lock pypi
  • 188 dependencies
pyproject.toml pypi
  • black ^22.3.0 develop
  • bump2version ^1.0.1 develop
  • bumpver ^2022.1116 develop
  • flake8 ^4.0.1 develop
  • ipykernel ^6.15.1 develop
  • nox ^2022.1.7 develop
  • pytest >=6.2.4 develop
  • catboost ^1.0.6
  • category-encoders ^2.5.0
  • click ^8.1.3
  • fasttreeshap ^0.1.2
  • feature-engine ^1.4.1
  • imblearn ^0.0
  • lightgbm ^3.3.2
  • lohrasb ^2.1.0
  • matplotlib ^3.5.2
  • numba ^0.55.2
  • numpy <1.63.0
  • optuna ^2.10.1
  • pandas ^1.4.3
  • pip-licenses ^3.5.4
  • pycox ^0.2.3
  • python >=3.8,<3.11
  • python-dotenv ^0.21.0
  • scikit-learn ^1.1.1
  • scipy ^1.8.1
  • shap ^0.41.0
  • xgboost ^1.6.1
  • xgbse ^0.2.3