effectplots

Fast Effect Plots in R

https://github.com/mayer79/effectplots

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.6%) to scientific vocabulary

Keywords

machine-learning r regression xai
Last synced: 6 months ago

Repository

Fast Effect Plots in R

Basic Info
Statistics
  • Stars: 21
  • Watchers: 2
  • Forks: 1
  • Open Issues: 1
  • Releases: 4
Created over 1 year ago · Last pushed 9 months ago
Metadata Files
Readme Changelog License

README.md

effectplots


{effectplots} is an R package for calculating and plotting feature effects of any model. It is very fast thanks to {collapse}.

The main function feature_effects() crunches these statistics per feature X over values/bins:

  • Average observed y values: Descriptive associations between response y and features.
  • Average predictions: Combined effect of X and other features (M Plots, Apley [1]).
  • Partial dependence (Friedman [2]): How the average prediction reacts to changes in X when the other features are held fixed.
  • Accumulated local effects (Apley [1]): Alternative to partial dependence.

Furthermore, it calculates counts, weight sums, average residuals, and standard deviations of observed y and residuals. All statistics respect optional case weights.
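To make these statistics concrete, here is a minimal base R sketch that computes counts, average observed values, average predictions, and partial dependence over bins by hand. It is not the package's implementation: the toy data, the `lm()` model, the equal-width bins, and the helper `wmean()` are all assumptions chosen for illustration.

```r
# Toy data: y depends on x1, and x2 is correlated with x1
set.seed(1)
n <- 1000
x1 <- runif(n)
x2 <- x1 + rnorm(n, sd = 0.2)
y <- 2 * x1 + x2 + rnorm(n, sd = 0.1)
w <- rep(1, n)  # case weights (all equal here)
dat <- data.frame(x1 = x1, x2 = x2)

fit <- lm(y ~ x1 + x2, data = dat)
pred <- predict(fit, dat)

# Hypothetical helper: weighted mean
wmean <- function(z, w) sum(z * w) / sum(w)

# Bin x1 into 5 equal-width intervals and collect row indices per bin
bins <- cut(x1, breaks = seq(0, 1, by = 0.2), include.lowest = TRUE)
by_bin <- split(seq_len(n), bins)

res <- data.frame(
  bin = names(by_bin),
  N = lengths(by_bin),
  y_mean = sapply(by_bin, function(i) wmean(y[i], w[i])),
  pred_mean = sapply(by_bin, function(i) wmean(pred[i], w[i]))
)

# Partial dependence at the bin midpoints: set x1 to the grid value
# in ALL rows, predict, and average
grid <- c(0.1, 0.3, 0.5, 0.7, 0.9)
res$pd <- sapply(grid, function(g) {
  tmp <- dat
  tmp$x1 <- g
  wmean(predict(fit, tmp), w)
})

print(res)
```

In this toy setup the average predictions rise faster across bins than the partial dependence, because they also absorb the effect of the correlated feature `x2`.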

We highly recommend Christoph Molnar's book [3] for more info on feature effects.

On a normal laptop, it takes 1 second to get all statistics for 10 features on 10 million rows (plus prediction time).

Workflow

  1. Crunch values via feature_effects() or the little helpers average_observed(), partial_dependence() etc.
  2. Update the results with update(): Combine rare levels of categorical features, sort results by importance, turn values of discrete features into factors, etc.
  3. Plot the results with plot(): Choose between ggplot2/patchwork and plotly.

Outlier capping: Extreme outliers in numeric features are capped by default (but not deleted). To avoid capping, set outlier_iqr = Inf.
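As an illustration of what IQR-based capping means, the sketch below clips values that lie more than `k` interquartile ranges beyond the quartiles. The helper `cap_outliers()` and the exact rule are assumptions for illustration; the rule behind `outlier_iqr` in the package may differ in detail.

```r
# Hypothetical helper, not part of {effectplots}: Tukey-style capping
cap_outliers <- function(x, k = 2) {
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  if (is.infinite(k) || iqr == 0) {
    return(x)  # nothing to cap
  }
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  pmin(pmax(x, lower), upper)  # cap values, but keep every observation
}

x <- c(1:9, 1000)
cap_outliers(x)           # the value 1000 is pulled down to the upper bound
cap_outliers(x, k = Inf)  # unchanged
```

Capping keeps the number of rows intact, so counts and weight sums per feature are unaffected; only the position of extreme values on the x axis changes.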

Installation

You can install the development version of {effectplots} from GitHub with:

``` r
# install.packages("pak")
pak::pak("mayer79/effectplots", dependencies = TRUE)
```

Usage

We use a dataset with 1 million rows on Motor TPL insurance. The aim is to model claim frequency. Before modeling, we want to study the association between features and response.

``` r
library(effectplots)
library(OpenML)
library(lightgbm)

set.seed(1)

df <- getOMLDataSet(data.id = 45106L)$data

xvars <- c("year", "town", "driver_age", "car_weight", "car_power", "car_age")

# 0.1s on laptop
average_observed(df[xvars], y = df$claim_nb) |>
  update(to_factor = TRUE) |>  # turn discrete numerics to factors
  plot(share_y = "all")
```

A shared y axis helps to compare the strength of the association across features.

Fit model

Next, let's fit a boosted trees model.

```r
ix <- sample(nrow(df), 0.8 * nrow(df))
train <- df[ix, ]
test <- df[-ix, ]
X_train <- data.matrix(train[xvars])
X_test <- data.matrix(test[xvars])

# Training, using slightly optimized parameters found via cross-validation
params <- list(
  learning_rate = 0.05,
  objective = "poisson",
  num_leaves = 7,
  min_data_in_leaf = 50,
  min_sum_hessian_in_leaf = 0.001,
  colsample_bynode = 0.8,
  bagging_fraction = 0.8,
  lambda_l1 = 3,
  lambda_l2 = 5,
  num_threads = 7
)

fit <- lgb.train(
  params = params,
  data = lgb.Dataset(X_train, label = train$claim_nb),
  nrounds = 300
)
```

Inspect model

Let's crunch all statistics on the test data. Sorting is done by weighted variance of partial dependence, a main-effect importance measure related to [4].

The average predictions closely follow the average observed values, i.e., the model seems to do a good job. Comparing partial dependence/ALE with the average predictions shows whether an effect mainly comes from the feature on the x axis or from other, correlated features.

```r
# 0.1s + 0.15s prediction time
feature_effects(fit, v = xvars, data = X_test, y = test$claim_nb) |>
  update(sort_by = "pd") |>
  plot()
```
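The sorting criterion mentioned above, the weighted variance of partial dependence, can be sketched in a few lines of base R. The partial dependence values and bin weights below are made up, and the helper `weighted_var()` is hypothetical; the package may normalize differently.

```r
# Made-up partial dependence values per bin and bin weights (e.g., counts)
pd_a <- c(0.08, 0.10, 0.12, 0.15)  # feature "a": PD varies noticeably
pd_b <- c(0.11, 0.11, 0.12, 0.11)  # feature "b": PD is almost flat
w <- c(400, 300, 200, 100)

# Hypothetical helper: weighted variance around the weighted mean
weighted_var <- function(z, w) {
  m <- sum(w * z) / sum(w)
  sum(w * (z - m)^2) / sum(w)
}

imp <- c(a = weighted_var(pd_a, w), b = weighted_var(pd_b, w))
sort(imp, decreasing = TRUE)  # "a" ranks first
```

A flat partial dependence curve has variance close to zero, so features without a main effect end up last. Weighting by bin size keeps sparsely populated bins from dominating the ranking.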

Flexibility

What about combining training and test results? Or comparing different models or subgroups? No problem:

```r
m_train <- feature_effects(fit, v = xvars, data = X_train, y = train$claim_nb)
m_test <- feature_effects(fit, v = xvars, data = X_test, y = test$claim_nb)

# Pick top 3 based on train
m_train <- m_train |>
  update(sort_by = "pd") |>
  head(3)
m_test <- m_test[names(m_train)]

# Concatenate train and test results and plot them
c(m_train, m_test) |>
  plot(
    share_y = "rows",
    ncol = 2,
    byrow = FALSE,
    stats = c("y_mean", "pred_mean"),
    subplot_titles = FALSE,
    # plotly = TRUE,
    title = "Left: Train - Right: Test"
  )
```

To look closer at bias, let's select the statistic "resid_mean" along with pointwise 95% confidence intervals for the true conditional bias.

```r
c(m_train, m_test) |>
  update(drop_below_n = 50) |>
  plot(
    ylim = c(-0.07, 0.08),
    ncol = 2,
    byrow = FALSE,
    stats = "resid_mean",
    subplot_titles = FALSE,
    title = "Left: Train - Right: Test",
    # plotly = TRUE,
    interval = "ci"
  )
```
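For intuition, a pointwise 95% confidence interval for the mean residual in a single bin can be sketched with a normal approximation. The residuals below are simulated, and the unweighted formula is an assumption; the package's exact calculation (for example with case weights) may differ.

```r
set.seed(1)
resid <- rnorm(200, mean = 0.01, sd = 0.3)  # made-up residuals within one bin

n <- length(resid)
m <- mean(resid)               # estimated bias in this bin
se <- sd(resid) / sqrt(n)      # standard error of the mean
z <- qnorm(0.975)              # about 1.96

ci <- c(lower = m - z * se, upper = m + z * se)
ci
```

Because the standard error shrinks with the square root of the bin size, intervals for very small bins are wide, which is one reason to drop bins with few observations before plotting.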

More examples

Most models work out of the box, including DALEX explainers and Tidymodels models. If not, a tailored prediction function can be specified.
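As a sketch of what a tailored prediction function could look like, consider a logistic GLM, where `predict()` returns the linear predictor unless `type = "response"` is set. The wrapper `pred_prob()` is hypothetical, and the argument name `pred_fun` in the commented call is an assumption that should be checked against `?feature_effects`.

```r
# Logistic regression on a built-in dataset
fit <- glm(am ~ hp + wt, data = mtcars, family = binomial())

# Hypothetical wrapper: returns probabilities as a plain numeric vector
pred_prob <- function(model, newdata, ...) {
  as.numeric(predict(model, newdata = newdata, type = "response", ...))
}

p <- pred_prob(fit, mtcars)
range(p)  # all values lie in [0, 1]

# Assumed usage (argument name `pred_fun` is an assumption):
# feature_effects(fit, v = c("hp", "wt"), data = mtcars, y = "am", pred_fun = pred_prob)
```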

DALEX

```r
library(effectplots)
library(DALEX)
library(ranger)

set.seed(1)

fit <- ranger(Sepal.Length ~ ., data = iris)
ex <- DALEX::explain(fit, data = iris[, -1], y = iris[, 1])

feature_effects(ex, breaks = 5) |>
  plot(share_y = "all")
```

Tidymodels

Note that ALE plots are only available for continuous variables.

```r
library(effectplots)
library(tidymodels)

set.seed(1)

xvars <- c("carat", "color", "clarity", "cut")

split <- initial_split(diamonds)
train <- training(split)
test <- testing(split)

dia_recipe <- train |>
  recipe(reformulate(xvars, "price"))

mod <- rand_forest(trees = 100) |>
  set_engine("ranger") |>
  set_mode("regression")

dia_wf <- workflow() |>
  add_recipe(dia_recipe) |>
  add_model(mod)

fit <- dia_wf |>
  fit(train)

M_train <- feature_effects(fit, v = xvars, data = train, y = "price")
M_test <- feature_effects(fit, v = xvars, data = test, y = "price")

plot(
  M_train + M_test,
  byrow = FALSE,
  ncol = 2,
  share_y = "rows",
  rotate_x = rep(45 * xvars %in% c("clarity", "cut"), each = 2),
  subplot_titles = FALSE,
  # plotly = TRUE,
  title = "Left: train - Right: test"
)
```
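The restriction of ALE to continuous variables follows from how ALE is built: local prediction differences are computed within ordered bins and then accumulated along the x axis. The base R sketch below shows the idea on toy data. The function `f()` stands in for a fitted model, and the binning and the lack of centering are simplifications, not the package's implementation.

```r
set.seed(1)
n <- 2000
x1 <- runif(n)
x2 <- x1 + rnorm(n, sd = 0.2)  # correlated with x1
dat <- data.frame(x1 = x1, x2 = x2)

# Stand-in for a model's prediction function (assumption for illustration)
f <- function(d) d$x1^2 + d$x2

breaks <- seq(0, 1, by = 0.25)
bin <- cut(dat$x1, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Local effect per bin: average change in prediction when x1 moves from the
# lower to the upper bin edge, using only the rows that fall into that bin
local_effect <- sapply(seq_len(length(breaks) - 1), function(k) {
  d <- dat[bin == k, , drop = FALSE]
  lo <- d
  lo$x1 <- breaks[k]
  hi <- d
  hi$x1 <- breaks[k + 1]
  mean(f(hi) - f(lo))
})

# Accumulate over the ordered bins: uncentered ALE at the right bin edges
ale <- cumsum(local_effect)
data.frame(x1 = breaks[-1], ale = ale)
```

Since each difference only uses rows whose `x1` already lies in the bin, the model is never evaluated at unrealistic feature combinations, unlike partial dependence. The accumulation step needs a natural order of the bins, which unordered categories do not have.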

Probabilistic classification

We focus on a single class.

```r
library(effectplots)
library(ranger)

set.seed(1)

fit <- ranger(Species ~ ., data = iris, probability = TRUE)

M <- partial_dependence(
  fit,
  v = colnames(iris[1:4]),
  data = iris,
  which_pred = 1  # "setosa" is the first class
)
plot(M, bar_height = 0.33, ylim = c(0, 0.7))
```

References

  1. Apley, Daniel W., and Jingyu Zhu. 2020. Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82 (4): 1059–1086. doi:10.1111/rssb.12377.
  2. Friedman, Jerome H. 2001. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29 (5): 1189–1232. doi:10.1214/aos/1013203451.
  3. Molnar, Christoph. 2019. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. https://christophm.github.io/interpretable-ml-book/.
  4. Greenwell, Brandon M., Bradley C. Boehmke, and Andrew J. McCarthy. 2018. A Simple and Effective Model-Based Variable Importance Measure. arXiv preprint. https://arxiv.org/abs/1805.04755.

Owner

  • Name: Michael Mayer
  • Login: mayer79
  • Kind: user

Responsible statistics | ML

GitHub Events

Total
  • Create event: 57
  • Issues event: 16
  • Release event: 4
  • Watch event: 19
  • Delete event: 56
  • Issue comment event: 10
  • Push event: 250
  • Pull request event: 99
  • Fork event: 1
Last Year
  • Create event: 57
  • Issues event: 16
  • Release event: 4
  • Watch event: 19
  • Delete event: 56
  • Issue comment event: 10
  • Push event: 250
  • Pull request event: 99
  • Fork event: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 10
  • Total pull requests: 65
  • Average time to close issues: 10 days
  • Average time to close pull requests: about 2 hours
  • Total issue authors: 3
  • Total pull request authors: 2
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.08
  • Merged pull requests: 60
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 10
  • Pull requests: 65
  • Average time to close issues: 10 days
  • Average time to close pull requests: about 2 hours
  • Issue authors: 3
  • Pull request authors: 2
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.08
  • Merged pull requests: 60
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • mayer79 (10)
  • SebKrantz (1)
Pull Request Authors
  • mayer79 (97)
  • btupper (1)
Top Labels
Issue Labels
enhancement (4) bug (1)
Pull Request Labels
enhancement (10)

Packages

  • Total packages: 1
  • Total downloads:
    • cran 647 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 4
  • Total maintainers: 1
cran.r-project.org: effectplots

Effect Plots

  • Versions: 4
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 647 Last month
Rankings
Dependent packages count: 27.8%
Dependent repos count: 34.2%
Average: 49.7%
Downloads: 87.0%
Maintainers (1)
Last synced: 7 months ago

Dependencies

.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v4 composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pkgdown.yaml actions
  • JamesIves/github-pages-deploy-action v4.5.0 composite
  • actions/checkout v4 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
DESCRIPTION cran
  • R >= 4.1.0 depends
  • ggplot2 * imports
  • patchwork * imports
  • plotly * imports
  • stats * imports
  • testthat >= 3.0.0 suggests