Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ✓ DOI references: found 2 DOI reference(s) in README
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.6%) to scientific vocabulary
Keywords
Repository
Fast Effect Plots in R
Basic Info
- Host: GitHub
- Owner: mayer79
- License: gpl-3.0
- Language: R
- Default Branch: main
- Homepage: https://mayer79.github.io/effectplots/
- Size: 11 MB
Statistics
- Stars: 21
- Watchers: 2
- Forks: 1
- Open Issues: 1
- Releases: 4
Topics
Metadata Files
README.md
effectplots 
{effectplots} is an R package for calculating and plotting feature effects of any model. It is very fast thanks to {collapse}.
The main function `feature_effects()` crunches these statistics per feature X over values/bins:
- Average observed y values: Descriptive associations between response y and features.
- Average predictions: Combined effect of X and other features (M Plots, Apley [1]).
- Partial dependence (Friedman [2]): How the average prediction changes with X, holding all other features fixed.
- Accumulated local effects (Apley [1]): Alternative to partial dependence.
Furthermore, it calculates counts, weight sums, average residuals, and standard deviations of observed y and residuals. All statistics respect optional case weights.
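For reference, the partial dependence of a fitted model $\hat f$ on feature $X_j$ is the average prediction with $X_j$ fixed at a value $x$ while all other features keep their observed values (Friedman [2]):

$$
\mathrm{PD}_j(x) = \frac{1}{n} \sum_{i=1}^{n} \hat f\bigl(x, \mathbf{x}^{(i)}_{\setminus j}\bigr),
$$

where $\mathbf{x}^{(i)}_{\setminus j}$ denotes the values of all features except $X_j$ for observation $i$.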
We highly recommend Christoph Molnar's book [3] for more info on feature effects.
It takes about 1 second on a normal laptop to get all statistics for 10 features on 10 million rows (+ prediction time).
Workflow
- Crunch values via `feature_effects()` or the little helpers `average_observed()`, `partial_dependence()`, etc.
- Update the results with `update()`: combine rare levels of categorical features, sort results by importance, turn values of discrete features into factors, etc.
- Plot the results with `plot()`: choose between ggplot2/patchwork and plotly.
Outlier capping: Extreme outliers in numeric features are capped by default (but not deleted).
To avoid capping, set `outlier_iqr = Inf`.
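Conceptually, the capping works like the following base R sketch (an illustration of the idea only, not the package's exact implementation; `k` plays the role of `outlier_iqr`):

``` r
# Clip values outside [Q1 - k * IQR, Q3 + k * IQR]; values are capped, not dropped
cap_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  pmin(pmax(x, q[1] - k * iqr), q[2] + k * iqr)
}

cap_outliers(c(1, 2, 3, 4, 100))  # the extreme value 100 is pulled toward the bulk
```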
Installation
You can install the development version of {effectplots} from GitHub with:
``` r
install.packages("pak")
pak::pak("mayer79/effectplots", dependencies = TRUE)
```
Usage
We use a 1-million-row dataset on Motor TPL insurance. The aim is to model claim frequency. Before modeling, we want to study the association between features and response.
``` r
library(effectplots)
library(OpenML)
library(lightgbm)

set.seed(1)

df <- getOMLDataSet(data.id = 45106L)$data
xvars <- c("year", "town", "driver_age", "car_weight", "car_power", "car_age")

# 0.1s on laptop
average_observed(df[xvars], y = df$claim_nb) |>
  update(to_factor = TRUE) |>  # turn discrete numerics to factors
  plot(share_y = "all")
```
A shared y axis helps to compare the strength of the association across features.
Fit model
Next, let's fit a boosted trees model.
```r
ix <- sample(nrow(df), 0.8 * nrow(df))
train <- df[ix, ]
test <- df[-ix, ]
X_train <- data.matrix(train[xvars])
X_test <- data.matrix(test[xvars])

# Training, using slightly optimized parameters found via cross-validation
params <- list(
  learning_rate = 0.05,
  objective = "poisson",
  num_leaves = 7,
  min_data_in_leaf = 50,
  min_sum_hessian_in_leaf = 0.001,
  colsample_bynode = 0.8,
  bagging_fraction = 0.8,
  lambda_l1 = 3,
  lambda_l2 = 5,
  num_threads = 7
)

fit <- lgb.train(
  params = params,
  data = lgb.Dataset(X_train, label = train$claim_nb),
  nrounds = 300
)
```
Inspect model
Let's crunch all statistics on the test data. Sorting is done by weighted variance of partial dependence, a main-effect importance measure related to [4].
The average predictions closely follow the average observed, i.e., the model seems to do a good job. Comparing partial dependence/ALE with average predicted gives insights on whether an effect mainly comes from the feature on the x axis or from other, correlated, features.
```r
# 0.1s + 0.15s prediction time
feature_effects(fit, v = xvars, data = X_test, y = test$claim_nb) |>
  update(sort_by = "pd") |>
  plot()
```
Flexibility
What about combining training and test results? Or comparing different models or subgroups? No problem:
```r
m_train <- feature_effects(fit, v = xvars, data = X_train, y = train$claim_nb)
m_test <- feature_effects(fit, v = xvars, data = X_test, y = test$claim_nb)

# Pick top 3 based on train
m_train <- m_train |>
  update(sort_by = "pd") |>
  head(3)
m_test <- m_test[names(m_train)]

# Concatenate train and test results and plot them
c(m_train, m_test) |>
  plot(
    share_y = "rows",
    ncol = 2,
    byrow = FALSE,
    stats = c("y_mean", "pred_mean"),
    subplot_titles = FALSE,
    # plotly = TRUE,
    title = "Left: Train - Right: Test"
  )
```
To look closer at bias, let's select the statistic "resid_mean" along with pointwise 95% confidence intervals for the true conditional bias.
```r
c(m_train, m_test) |>
  update(drop_below_n = 50) |>
  plot(
    ylim = c(-0.07, 0.08),
    ncol = 2,
    byrow = FALSE,
    stats = "resid_mean",
    subplot_titles = FALSE,
    title = "Left: Train - Right: Test",
    # plotly = TRUE,
    interval = "ci"
  )
```
More examples
Most models work out-of-the box, including DALEX explainers and Tidymodels models. If not, a tailored prediction function can be specified.
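As a sketch of the tailored-prediction-function route: the wrapper below reduces a model's predictions to the plain numeric vector the statistics need. The `pred_fun` argument name is an assumption here (it matches the author's related packages); check the package documentation for the exact interface.

``` r
# Hypothetical wrapper: turn any model's predictions into a numeric vector
my_pred <- function(model, data) {
  as.numeric(predict(model, newdata = data))
}

fit <- lm(Sepal.Length ~ ., data = iris)

# Guarded so the sketch runs even without {effectplots} installed
if (requireNamespace("effectplots", quietly = TRUE)) {
  effectplots::feature_effects(
    fit, v = c("Sepal.Width", "Petal.Length"), data = iris, y = "Sepal.Length",
    pred_fun = my_pred
  ) |> plot()
}
```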
DALEX
```r
library(effectplots)
library(DALEX)
library(ranger)

set.seed(1)

fit <- ranger(Sepal.Length ~ ., data = iris)
ex <- DALEX::explain(fit, data = iris[, -1], y = iris[, 1])

feature_effects(ex, breaks = 5) |>
  plot(share_y = "all")
```
Tidymodels
Note that ALE plots are only available for continuous variables.
```r
library(effectplots)
library(tidymodels)

set.seed(1)

xvars <- c("carat", "color", "clarity", "cut")

split <- initial_split(diamonds)
train <- training(split)
test <- testing(split)

dia_recipe <- train |>
  recipe(reformulate(xvars, "price"))

mod <- rand_forest(trees = 100) |>
  set_engine("ranger") |>
  set_mode("regression")

dia_wf <- workflow() |>
  add_recipe(dia_recipe) |>
  add_model(mod)

fit <- dia_wf |> fit(train)

M_train <- feature_effects(fit, v = xvars, data = train, y = "price")
M_test <- feature_effects(fit, v = xvars, data = test, y = "price")

plot(
  M_train + M_test,
  byrow = FALSE,
  ncol = 2,
  share_y = "rows",
  rotate_x = rep(45 * xvars %in% c("clarity", "cut"), each = 2),
  subplot_titles = FALSE,
  # plotly = TRUE,
  title = "Left: train - Right: test"
)
```
Probabilistic classification
We focus on a single class.
```r
library(effectplots)
library(ranger)

set.seed(1)

fit <- ranger(Species ~ ., data = iris, probability = TRUE)

M <- partial_dependence(
  fit,
  v = colnames(iris[1:4]),
  data = iris,
  which_pred = 1  # "setosa" is the first class
)
plot(M, bar_height = 0.33, ylim = c(0, 0.7))
```
References
1. Apley, Daniel W., and Jingyu Zhu. 2020. Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models. Journal of the Royal Statistical Society Series B: Statistical Methodology 82 (4): 1059–1086. doi:10.1111/rssb.12377.
2. Friedman, Jerome H. 2001. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29 (5): 1189–1232. doi:10.1214/aos/1013203451.
3. Molnar, Christoph. 2019. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. https://christophm.github.io/interpretable-ml-book/.
4. Greenwell, Brandon M., Bradley C. Boehmke, and Andrew J. McCarthy. 2018. A Simple and Effective Model-Based Variable Importance Measure. arXiv preprint. https://arxiv.org/abs/1805.04755.
Owner
- Name: Michael Mayer
- Login: mayer79
- Kind: user
- Repositories: 12
- Profile: https://github.com/mayer79
Responsible statistics | ML
GitHub Events
Total
- Create event: 57
- Issues event: 16
- Release event: 4
- Watch event: 19
- Delete event: 56
- Issue comment event: 10
- Push event: 250
- Pull request event: 99
- Fork event: 1
Last Year
- Create event: 57
- Issues event: 16
- Release event: 4
- Watch event: 19
- Delete event: 56
- Issue comment event: 10
- Push event: 250
- Pull request event: 99
- Fork event: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 10
- Total pull requests: 65
- Average time to close issues: 10 days
- Average time to close pull requests: about 2 hours
- Total issue authors: 3
- Total pull request authors: 2
- Average comments per issue: 1.0
- Average comments per pull request: 0.08
- Merged pull requests: 60
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 10
- Pull requests: 65
- Average time to close issues: 10 days
- Average time to close pull requests: about 2 hours
- Issue authors: 3
- Pull request authors: 2
- Average comments per issue: 1.0
- Average comments per pull request: 0.08
- Merged pull requests: 60
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- mayer79 (10)
- SebKrantz (1)
Pull Request Authors
- mayer79 (97)
- btupper (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: 647 last month (CRAN)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 4
- Total maintainers: 1
cran.r-project.org: effectplots
Effect Plots
- Homepage: https://github.com/mayer79/effectplots
- Documentation: http://cran.r-project.org/web/packages/effectplots/effectplots.pdf
- License: GPL (≥ 3)
- Latest release: 0.2.2 (published 12 months ago)
Rankings
Maintainers (1)
Dependencies
- actions/checkout v4 composite
- r-lib/actions/check-r-package v2 composite
- r-lib/actions/setup-pandoc v2 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite
- JamesIves/github-pages-deploy-action v4.5.0 composite
- actions/checkout v4 composite
- r-lib/actions/setup-pandoc v2 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite
- R >= 4.1.0 depends
- ggplot2 * imports
- patchwork * imports
- plotly * imports
- stats * imports
- testthat >= 3.0.0 suggests