pyCeterisParibus

pyCeterisParibus: explaining Machine Learning models with Ceteris Paribus Profiles in Python - Published in JOSS (2019)

https://github.com/modeloriented/pyceterisparibus

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org, zenodo.org
  • Committers with academic emails
    1 of 4 committers (25.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

ceteris-paribus-plots explainable-ai python-library

Scientific Fields

Computer Science - 84% confidence
Last synced: 4 months ago

Repository

Python library for Ceteris Paribus Plots (What-if plots)

Basic Info
  • Host: GitHub
  • Owner: ModelOriented
  • License: apache-2.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 1.33 MB
Statistics
  • Stars: 25
  • Watchers: 3
  • Forks: 5
  • Open Issues: 1
  • Releases: 2
Topics
ceteris-paribus-plots explainable-ai python-library
Created about 7 years ago · Last pushed over 4 years ago
Metadata Files
Readme Contributing License Zenodo

README.md


pyCeterisParibus

Please note that the Ceteris Paribus method has been moved to the dalex Python package, which is actively maintained. If you experience any problems with pyCeterisParibus, please consider the dalex implementation at https://dalex.drwhy.ai/python/api/.
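For reference, a rough equivalent in dalex might look like the sketch below. This is a hedged sketch based on the public dalex API; `model`, `X`, `y` and `new_observation` are placeholders, and the exact arguments should be checked at the link above.

```python
import dalex as dx

# `model` is a fitted scikit-learn-style model; `X`, `y` its training data
# (all placeholders, not objects defined in this README)
explainer = dx.Explainer(model, X, y, label="my model")

# Ceteris Paribus (what-if) profile for a single observation
cp = explainer.predict_profile(new_observation)
cp.plot()
```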

pyCeterisParibus is a Python library based on the R package CeterisParibus. It implements Ceteris Paribus Plots, which show how the model response would change if a selected variable were changed, making them a perfect tool for What-If scenarios. Ceteris Paribus is a Latin phrase meaning "all else unchanged". These plots present the change in model response as the values of one feature change, with all others held fixed. The Ceteris Paribus method is model-agnostic: it works for any Machine Learning model. The idea is an extension of PDP (Partial Dependence Plots) and ICE (Individual Conditional Expectation) plots, and it allows explaining single observations for multiple variables at the same time. The plot engine is developed in a separate repository.
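To make the idea concrete, here is a minimal, library-independent sketch of computing a single Ceteris Paribus profile. All names here (`predict`, `observation`, `feature`, `grid`) are illustrative placeholders, not the pyCeterisParibus API:

```python
import numpy as np

def ceteris_paribus_profile(predict, observation, feature, grid):
    """Vary `feature` of a single observation over `grid`, keeping
    all other features fixed, and record the model responses."""
    responses = []
    for value in grid:
        modified = observation.copy()
        modified[feature] = value            # change only the selected variable
        responses.append(predict(modified))  # all else unchanged (ceteris paribus)
    return np.array(responses)

# Usage sketch (placeholders): a grid over the observed range of the feature
# grid = np.linspace(X[feature].min(), X[feature].max(), 101)
# profile = ceteris_paribus_profile(model_predict, x_obs, feature, grid)
```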

Why is it so useful?

There might be several motivations behind utilizing this idea. Imagine a person who gets a low credit score. The client wants to understand how to increase the score, and the scoring institution (e.g. a bank) should be able to answer such questions. Moreover, this method is useful for researchers and developers to analyze, debug, explain and improve Machine Learning models, assisting the entire process of model design.

Setup

Tested on Python 3.5+

pyCeterisParibus is on PyPI. Simply run:

```bash
pip install pyCeterisParibus
```

or install the newest version from GitHub by executing:

```bash
pip install git+https://github.com/ModelOriented/pyCeterisParibus
```

or download the sources, enter the main directory, and run:

```bash
git clone https://github.com/ModelOriented/pyCeterisParibus.git
cd pyCeterisParibus
python setup.py install  # alternatively: pip install .
```

Docs

A detailed description of all methods and their parameters can be found in the documentation.

To build the documentation locally:

```bash
pip install -r requirements-dev.txt
cd docs
make html
```

and open `_build/html/index.html`.

Examples

Below we present use cases on two well-known datasets: Titanic and Iris. More examples, e.g. for regression problems, can be found in the examples and Jupyter notebooks in the repository.

Note that in order to run the examples you need to install the extra requirements from requirements-dev.txt.

Use case - Titanic survival

We demonstrate Ceteris Paribus Plots using the well-known Titanic dataset. In this problem, we examine the chance of survival for Titanic passengers. We start by preprocessing the data and creating an XGBoost model.

```python
import pandas as pd

df = pd.read_csv('titanic_train.csv')

y = df['Survived']
x = df.drop(['Survived', 'PassengerId', 'Name', 'Cabin', 'Ticket'], inplace=False, axis=1)

# drop observations with missing Age or Embarked
valid = x['Age'].isnull() | x['Embarked'].isnull()
x = x[~valid]
y = y[~valid]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
```

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['Embarked', 'Sex']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
```

```python
from xgboost import XGBClassifier

xgb_clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', XGBClassifier())])
xgb_clf.fit(X_train, y_train)
```

This is where pyCeterisParibus starts. Since the library works in a model-agnostic fashion, we first need to create a wrapper around the model with a uniform predict interface.

```python
from ceteris_paribus.explainer import explain

explainer_xgb = explain(xgb_clf, data=x, y=y, label='XGBoost',
                        predict_function=lambda X: xgb_clf.predict_proba(X)[:, 1])
```

Single variable profile

Let's look at Mr Ernest James Crease, a 19-year-old man travelling in third class from Southampton with an 8-pound ticket in his pocket. He died on the Titanic. Most likely, this would not have been the case had Ernest been a few years younger. Figure 1 presents the chance of survival for a person like Ernest at different ages. We can see things were tough for people like him unless they were children.

```python
from ceteris_paribus.profiles import individual_variable_profile

ernest = X_test.iloc[10]
label_ernest = y_test.iloc[10]
cp_xgb = individual_variable_profile(explainer_xgb, ernest, label_ernest)
```

Having calculated the profile, we can plot it. Note that plot_notebook can be used instead of plot in Jupyter notebooks.

```python
from ceteris_paribus.plots.plots import plot

plot(cp_xgb, selected_variables=["Age"])
```

Figure 1: Chance of survival depending on age.
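Following the note above, in a notebook the same chart would presumably be produced as follows (an assumption: that plot_notebook shares plot's signature, as the text suggests):

```python
# Assumption: plot_notebook mirrors plot's interface, per the note above
from ceteris_paribus.plots.plots import plot_notebook

plot_notebook(cp_xgb, selected_variables=["Age"])
```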

Many models

The picture above explains the prediction of the XGBoost model. What if we compare various models?

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rf_clf = Pipeline(steps=[('preprocessor', preprocessor),
                         ('classifier', RandomForestClassifier())])
linear_clf = Pipeline(steps=[('preprocessor', preprocessor),
                             ('classifier', LogisticRegression())])

rf_clf.fit(X_train, y_train)
linear_clf.fit(X_train, y_train)

explainer_rf = explain(rf_clf, data=x, y=y, label='RandomForest',
                       predict_function=lambda X: rf_clf.predict_proba(X)[:, 1])
explainer_linear = explain(linear_clf, data=x, y=y, label='LogisticRegression',
                           predict_function=lambda X: linear_clf.predict_proba(X)[:, 1])

# profiles for the two new models, for the same observation as before
cp_rf = individual_variable_profile(explainer_rf, ernest, label_ernest)
cp_linear = individual_variable_profile(explainer_linear, ernest, label_ernest)

plot(cp_xgb, cp_rf, cp_linear, selected_variables=["Age"])
```

The probability of survival estimated with various models.

Clearly, XGBoost offers a better fit than Logistic Regression. It also predicts a higher chance of survival at a child's age than the Random Forest model does.

Profiles for many variables

This time we have a look at Miss Elizabeth Mussey Eustis. She is 54 years old and travels in first class with her sister Marta, as they return to the US from their tour of southern Europe. They both survived the disaster.

```python
elizabeth = X_test.iloc[1]
label_elizabeth = y_test.iloc[1]
cp_xgb_2 = individual_variable_profile(explainer_xgb, elizabeth, label_elizabeth)
```

```python
plot(cp_xgb_2, selected_variables=["Pclass", "Sex", "Age", "Embarked"])
```

Profiles for many variables.

Would she have returned home if she had travelled in third class or if she had been a man? As we can observe, this is less likely. On the other hand, for a first-class female passenger, the chances of survival were high regardless of age. Note that this was different in the case of Ernest. The place of embarkation (Cherbourg) has no influence, which is the expected behaviour.

Feature interactions and average response

Now, what if we look at the passengers most similar to Miss Eustis (middle-aged, upper class)?

```python
from ceteris_paribus.select_data import select_neighbours

neighbours = select_neighbours(X_train, elizabeth,
                               selected_variables=['Pclass', 'Age', 'SibSp',
                                                   'Parch', 'Fare', 'Embarked'],
                               n=15)
cp_xgb_ns = individual_variable_profile(explainer_xgb, neighbours)
```

```python
plot(cp_xgb_ns, color="Sex", selected_variables=["Pclass", "Age"],
     aggregate_profiles='mean', size_pdps=6, alpha_pdps=1, size=2)
```

Interaction with gender. Apart from the charts with Ceteris Paribus Profiles (top of the visualisation), we can plot a table with the observations used to calculate these profiles (bottom of the visualisation).

There are two distinct clusters of passengers determined by their gender, so a PDP average plot (in grey) does not show the whole picture. Children of both genders were likely to survive, but then we see a large gap. Also, being female increased the chance of survival mostly for second- and first-class passengers.

The plot function comes with extensive customization options; a list of all parameters can be found in the documentation. Additionally, one can interact with the plot by hovering over a point of interest to see more details. Similarly, there is an interactive table with options for highlighting relevant elements as well as filtering and sorting rows.
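For instance, the call above can be read parameter by parameter. The comments reflect my reading of the parameter names as used earlier in this README; the documentation is authoritative:

```python
# The same call as above, with each (already demonstrated) parameter annotated
plot(cp_xgb_ns,
     color="Sex",                           # colour individual profiles by a variable
     selected_variables=["Pclass", "Age"],  # restrict the chart to chosen features
     aggregate_profiles='mean',             # overlay a mean (PDP-like) profile
     size_pdps=6, alpha_pdps=1,             # width and opacity of the aggregated line
     size=2)                                # width of individual profile lines
```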

Multiclass models - Iris dataset

Prepare the dataset and model:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

def random_forest_classifier():
    rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_model.fit(iris['data'], iris['target'])
    return rf_model, iris['data'], iris['target'], iris['feature_names']
```

Wrap the model into explainers:

```python
rf_model, iris_x, iris_y, iris_var_names = random_forest_classifier()

explainer_rf1 = explain(rf_model, iris_var_names, iris_x, iris_y,
                        predict_function=lambda X: rf_model.predict_proba(X)[:, 0],
                        label=iris.target_names[0])
explainer_rf2 = explain(rf_model, iris_var_names, iris_x, iris_y,
                        predict_function=lambda X: rf_model.predict_proba(X)[:, 1],
                        label=iris.target_names[1])
explainer_rf3 = explain(rf_model, iris_var_names, iris_x, iris_y,
                        predict_function=lambda X: rf_model.predict_proba(X)[:, 2],
                        label=iris.target_names[2])
```

Calculate profiles and plot:

```python
cp_rf1 = individual_variable_profile(explainer_rf1, iris_x[0], iris_y[0])
cp_rf2 = individual_variable_profile(explainer_rf2, iris_x[0], iris_y[0])
cp_rf3 = individual_variable_profile(explainer_rf3, iris_x[0], iris_y[0])

plot(cp_rf1, cp_rf2, cp_rf3,
     selected_variables=['petal length (cm)', 'petal width (cm)', 'sepal length (cm)'])
```

Multiclass models.

Contributing

You're more than welcome to contribute to this package. See the contributing guidelines.

Acknowledgments

Work on this package was financially supported by the NCN Opus grant 2016/21/B/ST6/0217.

Owner

  • Name: Model Oriented
  • Login: ModelOriented
  • Kind: organization
  • Location: MI2DataLab @ Warsaw University of Technology

JOSS Publication

pyCeterisParibus: explaining Machine Learning models with Ceteris Paribus Profiles in Python
Published
May 06, 2019
Volume 4, Issue 37, Page 1389
Authors
Michał Kuźba ORCID
Faculty of Mathematics and Information Science, Warsaw University of Technology, Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw
Ewa Baranowska ORCID
Faculty of Mathematics and Information Science, Warsaw University of Technology
Przemysław Biecek ORCID
Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Faculty of Mathematics and Information Science, Warsaw University of Technology
Editor
Kathryn Huff ORCID
Tags
xAI Machine Learning

GitHub Events

Total
  • Watch event: 5
Last Year
  • Watch event: 5

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 175
  • Total Committers: 4
  • Avg Commits per committer: 43.75
  • Development Distribution Score (DDS): 0.051 (see the quick check after the committer table below)
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
| Name | Email | Commits |
| --- | --- | --- |
| Michał Kuźba | k****8@g****m | 166 |
| Przemysław Biecek | p****k | 3 |
| Justin Shenk | s****n@g****m | 3 |
| Mateusz Staniak | m****k@m****l | 3 |
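The DDS reported above is consistent with defining the score as one minus the top committer's share of commits; a quick check, assuming that definition (it is not stated in this listing):

```python
# Assumed definition: DDS = 1 - top_committer_commits / total_commits
total_commits = 175
top_committer_commits = 166  # Michał Kuźba, per the table above
dds = 1 - top_committer_commits / total_commits
print(round(dds, 3))  # 0.051, matching the reported score
```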

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 13
  • Total pull requests: 15
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 15 minutes
  • Total issue authors: 2
  • Total pull request authors: 3
  • Average comments per issue: 0.62
  • Average comments per pull request: 0.27
  • Merged pull requests: 15
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • mstaniak (11)
  • justinshenk (2)
Pull Request Authors
  • kmichael08 (13)
  • mstaniak (1)
  • justinshenk (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi: 112 last month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 5
  • Total maintainers: 1
pypi.org: pyceterisparibus

Ceteris Paribus python package

  • Versions: 5
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 112 last month
Rankings
Dependent packages count: 10.1%
Stargazers count: 13.7%
Forks count: 14.3%
Average: 16.5%
Dependent repos count: 21.6%
Downloads: 22.9%
Last synced: 4 months ago

Dependencies

requirements-dev.txt pypi
  • Keras >=2.2.4 development
  • Sphinx >=1.8.3 development
  • codecov >=2.0.15 development
  • coverage >=4.5.2 development
  • m2r ==0.2.1 development
  • pytest >=4.0.1 development
  • pytest-cov >=2.6.0 development
  • scikit-learn * development
  • sklearn ==0.0 development
  • tensorflow >=1.12.0 development
  • xgboost >=0.82 development
requirements.txt pypi
  • Flask >=1.0.2
  • numpy >=1.15.4
  • pandas >=0.23.4