Repository
Explainable Ensemble Trees
Basic Info
Statistics
- Stars: 6
- Watchers: 5
- Forks: 3
- Open Issues: 0
- Releases: 1
Topics
explainable-machine-learning
Created over 3 years ago
· Last pushed 7 months ago
README.Rmd
---
output: github_document
---
# Explainable Ensemble Trees (e2tree)
[R-CMD-check](https://github.com/massimoaria/e2tree/actions/workflows/R-CMD-check.yaml)
[CRAN status](https://CRAN.R-project.org/package=e2tree) `r badger::badge_cran_download("e2tree", "grand-total")`
The key idea of **Explainable Ensemble Trees** (**e2tree**) is an algorithm that represents any ensemble of decision trees through a single tree-like structure. The goal is to explain the results of the ensemble algorithm while preserving its level of accuracy, which is typically higher than that of a single decision tree. The method identifies a tree-like structure that explains the classification or regression paths and summarizes the whole ensemble process. e2tree has two main advantages:
- it builds an explainable tree that preserves the predictive performance of a Random Forest (RF) model;
- it gives the decision-maker an intuitive, tree-like structure to work with.
In this example, we focus on Random Forest but, again, the algorithm can be generalized to every ensemble approach based on decision trees.
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  dpi = 300
)
```
## Setup
You can install the **development version** of e2tree from [GitHub](https://github.com) with:
```{r eval=FALSE}
install.packages("remotes")
remotes::install_github("massimoaria/e2tree")
```
You can install the **released version** of e2tree from [CRAN](https://CRAN.R-project.org) with:
```{r eval=FALSE}
if (!require("e2tree", quietly=TRUE)) install.packages("e2tree")
```
```{r warning=FALSE, message=FALSE}
require(e2tree)
require(randomForest)
require(ranger)
require(dplyr)
require(ggplot2)
if (!(require(rsample, quietly=TRUE))){install.packages("rsample"); require(rsample, quietly=TRUE)}
options(dplyr.summarise.inform = FALSE)
```
```{r set-theme, include=FALSE}
theme_set(
  theme_classic() +
    theme(
      plot.background = element_rect(fill = "transparent", colour = NA),
      panel.background = element_rect(fill = "transparent", colour = NA)
    )
)
knitr::opts_chunk$set(dev.args = list(bg = "transparent"))
```
## Warning
This package is still under development and, for the time being, the following limitations apply:
- Only ensembles trained with the **randomForest** and **ranger** packages are currently supported. Support for additional packages and approaches will be added in the future.
- Currently **e2tree** works only for classification and regression problems. It will gradually be extended to handle other types of response variables, such as count data, multivariate responses, and more.
## Example 1: IRIS dataset
In this example, we show the main functions of the e2tree package.
Starting from the iris dataset, we train a tree ensemble with the ranger package (randomForest is also supported) and then use e2tree to obtain an explainable tree that summarizes the ensemble classifier.
We first fit a Random Forest (RF) model and obtain the proximity matrix of the observations. The idea behind the proximity matrix is that if a pair of observations often falls in the same terminal node across the trees, the two observations share an underlying relationship.
From this we compute, for every pair of observations, how often they co-occur in the same node, obtaining a co-occurrence matrix O that is then used to build the graphical E2Tree output.
The final aim is to explain the relationship between predictors and response by reconstructing the same structure as the proximity matrix produced by the RF model.
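To make the co-occurrence idea concrete, here is a minimal base-R sketch on toy data. It is only an illustration, not the computation e2tree performs internally: the `nodes` matrix and the helper `co_occurrence()` are hypothetical, and the counts are unweighted.

```{r}
# Toy terminal-node assignments: 4 observations (rows) x 3 trees (columns).
# Entry [i, t] is the id of the terminal node that observation i reaches in tree t.
nodes <- matrix(c(1, 1, 2,
                  1, 1, 2,
                  2, 1, 3,
                  3, 2, 3), nrow = 4, byrow = TRUE)

# Hypothetical helper: share of trees in which each pair of observations
# ends up in the same terminal node (unweighted co-occurrence).
co_occurrence <- function(nodes) {
  n <- nrow(nodes)
  O <- matrix(0, n, n)
  for (t in seq_len(ncol(nodes))) {
    # TRUE where observations i and j share a terminal node in tree t
    O <- O + outer(nodes[, t], nodes[, t], "==")
  }
  O / ncol(nodes)
}

O <- co_occurrence(nodes)
O
```

Observations 1 and 2 share a leaf in every tree, so their entry is 1, while observations 1 and 4 never meet, so their entry is 0. With a fitted ranger model, a node matrix of this shape can be obtained with `predict(ensemble, data = training, type = "terminalNodes")$predictions`.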
```{r}
# Set random seed to make results reproducible:
set.seed(0)
# Initialize the split
iris_split <- iris %>% initial_split(prop = 0.6)
iris_split
# Assign the data to the correct sets
training <- iris_split %>% training()
validation <- iris_split %>% testing()
response_training <- training[,5]
response_validation <- validation[,5]
```
Train a Random Forest model with 1000 trees (weak learners):
```{r}
# Perform training with the "ranger" or "randomForest" package:
## RF with "ranger" package
ensemble <- ranger(Species ~ ., data = training, num.trees = 1000, importance = 'impurity')
## RF with "randomForest" package (ntree = 1000 matches the ranger call above)
#ensemble <- randomForest(Species ~ ., data = training, ntree = 1000, importance = TRUE, proximity = TRUE)
```
Next, we create the dissimilarity matrix between observations with the `createDisMatrix` function:
```{r}
D = createDisMatrix(ensemble, data = training, label = "Species", parallel = list(active = FALSE, no_cores = NULL))
```
Set the e2tree parameters:
```{r}
setting=list(impTotal=0.1, maxDec=0.01, n=2, level=5)
```
Build an explainable tree for RF
```{r}
tree <- e2tree(Species ~ ., data = training, D, ensemble, setting)
```
Plot the Explainable Ensemble Tree
```{r}
expl_plot <- rpart2Tree(tree, ensemble)
# Plot using rpart.plot package:
plot_e2tree <- rpart.plot::rpart.plot(expl_plot,
  type = 1,
  fallen.leaves = TRUE,
  cex = 0.55,
  branch.lty = 6,
  nn = TRUE,
  roundint = FALSE,
  digits = 2,
  box.palette = "lightgrey"
)
```
Prediction with the new tree (example on training)
```{r}
pred <- ePredTree(tree, training[,-5], target="virginica")
```
Comparison of predictions (training sample) of RF and e2tree
```{r}
# "ranger" package
table(pred$fit, ensemble$predictions)
# "randomForest" package
#table(pred$fit, ensemble$predicted)
```
Comparison of predictions (training sample) of RF and correct response
```{r}
# "ranger" package
table(ensemble$predictions, response_training)
## "randomForest" package
#table(ensemble$predicted, response_training)
```
Comparison of predictions (training sample) of e2tree and correct response
```{r}
table(pred$fit,response_training)
```
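The three cross-tabulations above can also be condensed into single agreement rates. The sketch below uses made-up prediction vectors and a hypothetical helper, `agreement_rate()`, that is not part of e2tree. In the example above, the same helper could be applied to `pred$fit`, `ensemble$predictions` and `response_training`.

```{r}
# Hypothetical helper: share of observations on which two label vectors agree.
agreement_rate <- function(a, b) {
  stopifnot(length(a) == length(b))
  mean(as.character(a) == as.character(b))
}

# Made-up labels for six observations (illustration only).
rf_pred   <- c("setosa", "setosa", "versicolor", "virginica", "virginica",  "versicolor")
tree_pred <- c("setosa", "setosa", "versicolor", "virginica", "versicolor", "versicolor")
truth     <- c("setosa", "setosa", "versicolor", "virginica", "virginica",  "virginica")

fidelity      <- agreement_rate(tree_pred, rf_pred)  # how closely the tree mimics the RF
rf_accuracy   <- agreement_rate(rf_pred, truth)      # RF vs. correct response
tree_accuracy <- agreement_rate(tree_pred, truth)    # tree vs. correct response

c(fidelity = fidelity, rf_accuracy = rf_accuracy, tree_accuracy = tree_accuracy)
```

Fidelity (tree vs. RF) and accuracy (tree vs. truth) answer different questions: a tree can mimic the ensemble closely and still inherit its mistakes.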
Variable Importance
```{r}
V <- vimp(tree, training)
V
```
Comparison with the validation sample
```{r}
ensemble.pred <- predict(ensemble, validation[,-5])
pred_val<- ePredTree(tree, validation[,-5], target="virginica")
```
Comparison of predictions (validation sample) of RF and e2tree
```{r}
## "ranger" package
table(pred_val$fit, ensemble.pred$predictions)
## "randomForest" package
#table(pred_val$fit, ensemble.pred$predicted)
```
Comparison of predictions (validation sample) of e2tree and correct response
```{r}
table(pred_val$fit, response_validation)
roc_res <- roc(response_validation, pred_val$score, target="virginica")
roc_res$auc
```
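As a reminder of what an AUC value measures, the following base-R sketch computes it from ranks (the Mann-Whitney formulation) for a binary target. It is independent of the `roc()` call above; the helper `auc_rank()` and the toy scores are assumptions made for illustration.

```{r}
# Hypothetical helper: AUC as the probability that a randomly chosen positive
# observation receives a higher score than a randomly chosen negative one.
auc_rank <- function(score, is_positive) {
  r <- rank(score)                 # average ranks handle ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy scores for the target class and the true class membership.
score       <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
is_positive <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

auc_toy <- auc_rank(score, is_positive)
auc_toy
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9.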
To evaluate how well our tree captures the structure of the RF and replicates its classification, we introduce a procedure to measure the goodness of explainability.
We start by visualizing the final partition generated by the RF through a heatmap, a graphical representation of the co-occurrence matrix that reflects how often pairs of observations are grouped together across the ensemble.
Each cell shows a pairwise similarity: the darker the cell, the closer the similarity is to 1, meaning the two observations were frequently assigned to the same leaf.
The second matrix is the one estimated by E2Tree. Comparing the two matrices, both visually and statistically, allows us to assess how well E2Tree reproduces the ensemble structure.
To formally test this alignment, we use the [Mantel test](https://aacrjournals.org/cancerres/article/27/2_Part_1/209/476508/The-Detection-of-Disease-Clustering-and-a), a non-parametric method that quantifies the correlation between two distance or similarity matrices. It is particularly useful for studying relationships between dissimilarity structures. The test uses permutations to generate a null distribution and compares the observed statistic against the values obtained under random reordering.
```{r}
eComparison(training, tree, D, graph = TRUE)
```
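The permutation logic of the Mantel test can be sketched in a few lines of base R. This is a simplified illustration on simulated distance matrices, not the code run by `eComparison()`; the function `mantel_sketch()` and its arguments are hypothetical.

```{r}
# Hypothetical helper: one-sided permutation Mantel test on two square,
# symmetric matrices with matching row/column order.
mantel_sketch <- function(A, B, n_perm = 999) {
  lower <- lower.tri(A)
  observed <- cor(A[lower], B[lower])          # correlation of pairwise entries
  n <- nrow(A)
  permuted <- replicate(n_perm, {
    idx <- sample(n)                           # relabel the observations of A
    cor(A[idx, idx][lower], B[lower])
  })
  # Share of permutations at least as extreme as the observed statistic
  p_value <- (sum(permuted >= observed) + 1) / (n_perm + 1)
  list(statistic = observed, p.value = p_value)
}

set.seed(1)
pts <- matrix(rnorm(40), ncol = 2)                                  # 20 simulated points
A <- as.matrix(dist(pts))                                           # first distance matrix
B <- as.matrix(dist(pts + matrix(rnorm(40, sd = 0.1), ncol = 2)))   # noisy copy
res <- mantel_sketch(A, B)
res
```

Rows and columns are permuted together so that each permuted matrix is still a valid distance matrix; only the correspondence between the two matrices is broken.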
Owner
- Name: Massimo Aria
- Login: massimoaria
- Kind: user
- Location: Naples, Italy
- Company: University of Naples Federico II (www.unina.it)
- Website: www.massimoaria.com
- Repositories: 5
- Profile: https://github.com/massimoaria
Massimo Aria is a full professor in Statistics for Social Sciences at the Department of Economics and Statistics of the University of Naples Federico II
GitHub Events
Total
- Release event: 1
- Issues event: 6
- Watch event: 2
- Issue comment event: 2
- Push event: 15
- Pull request event: 9
- Fork event: 1
Last Year
- Release event: 1
- Issues event: 6
- Watch event: 2
- Issue comment event: 2
- Push event: 15
- Pull request event: 9
- Fork event: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 4
- Total pull requests: 6
- Average time to close issues: 6 months
- Average time to close pull requests: 2 days
- Total issue authors: 2
- Total pull request authors: 2
- Average comments per issue: 0.0
- Average comments per pull request: 0.17
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 3
- Pull requests: 6
- Average time to close issues: about 1 month
- Average time to close pull requests: 2 days
- Issue authors: 2
- Pull request authors: 2
- Average comments per issue: 0.0
- Average comments per pull request: 0.17
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- massimoaria (3)
- talegari (1)
Pull Request Authors
- agostinognasso (12)
- talegari (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
-
Total downloads:
- cran 285 last-month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 2
- Total maintainers: 1
cran.r-project.org: e2tree
Explainable Ensemble Trees
- Homepage: https://github.com/massimoaria/e2tree
- Documentation: http://cran.r-project.org/web/packages/e2tree/e2tree.pdf
- License: MIT + file LICENSE
-
Latest release: 0.2.0
published 7 months ago
Rankings
Dependent packages count: 26.8%
Dependent repos count: 33.0%
Average: 48.8%
Downloads: 86.7%
Maintainers (1)
Last synced:
6 months ago
Dependencies
DESCRIPTION
cran
- Matrix * imports
- dplyr * imports
- future.apply * imports
- ggplot2 * imports
- partitions * imports
- purrr * imports
- tidyr * imports
.github/workflows/R-CMD-check.yaml
actions
- actions/checkout v2 composite
- r-lib/actions/check-r-package v2 composite
- r-lib/actions/setup-pandoc v2 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pkgdown.yaml
actions
- JamesIves/github-pages-deploy-action 4.1.4 composite
- actions/checkout v2 composite
- r-lib/actions/setup-pandoc v2 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite