e2tree

Explainable Ensemble Trees

https://github.com/massimoaria/e2tree

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (17.8%) to scientific vocabulary

Keywords

explainable-machine-learning

Last synced: 9 months ago · JSON representation

Repository

Explainable Ensemble Trees

Basic Info

Host: GitHub
Owner: massimoaria
License: other
Language: R
Default Branch: main
Homepage:
Size: 5.3 MB

Statistics

Stars: 6
Watchers: 5
Forks: 3
Open Issues: 0
Releases: 1

Topics

explainable-machine-learning

Created over 3 years ago · Last pushed 10 months ago

Metadata Files

Readme Changelog License

README.Rmd

---
output: github_document
---



# Explainable Ensemble Trees (e2tree)


[![R-CMD-check](https://github.com/massimoaria/e2tree/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/massimoaria/e2tree/actions/workflows/R-CMD-check.yaml)
[![CRAN status](https://www.r-pkg.org/badges/version/e2tree)](https://CRAN.R-project.org/package=e2tree) `r badger::badge_cran_download("e2tree", "grand-total")`








The key idea of **Explainable Ensemble Trees** (**e2tree**) is an algorithm that represents any ensemble of decision trees through a single tree-like structure. The goal is to explain the results of the ensemble algorithm while preserving its level of accuracy, which is typically higher than that of a single decision tree. The method identifies a tree-like structure that explains the classification or regression paths summarizing the whole ensemble process. e2tree has two main advantages:

- it builds an explainable tree that retains the predictive performance of a Random Forest (RF) model;
- it gives the decision-maker an intuitive, tree-like structure to work with.

In this example, we focus on Random Forest but, again, the algorithm can be generalized to every ensemble approach based on decision trees.


```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  dpi = 300
)
```

## Setup

You can install the **development version** of e2tree from [GitHub](https://github.com) with:

```{r eval=FALSE}
install.packages("remotes")
remotes::install_github("massimoaria/e2tree")
```

You can install the **released version** of e2tree from [CRAN](https://CRAN.R-project.org) with:

```{r eval=FALSE}
if (!require("e2tree", quietly=TRUE)) install.packages("e2tree")
```

```{r warning=FALSE, message=FALSE}
require(e2tree)
require(randomForest)
require(ranger)
require(dplyr)
require(ggplot2)
if (!(require(rsample, quietly=TRUE))){install.packages("rsample"); require(rsample, quietly=TRUE)} 
options(dplyr.summarise.inform = FALSE)
```

```{r set-theme, include=FALSE}
theme_set(
  theme_classic() +
    theme(
      plot.background = element_rect(fill = "transparent", colour = NA),
      panel.background = element_rect(fill = "transparent", colour = NA)
    )
)
knitr::opts_chunk$set(dev.args = list(bg = "transparent"))
```

## Warning

This package is still under development and, for the time being, the following limitations apply:

- Only ensembles trained with the **randomForest** and **ranger** packages are currently supported. Support for additional packages and approaches will be added in the future.

- Currently **e2tree** works only for classification and regression problems. It will gradually be extended to handle other types of response variables, such as count data, multivariate responses, and more.


## Example 1: IRIS dataset

This example shows the main functions of the e2tree package.

Starting from the IRIS dataset, we train a tree ensemble with the ranger package (randomForest is also supported) and then use e2tree to obtain an explainable tree that summarizes the ensemble classifier.
We fit a Random Forest (RF) model and then derive the proximity matrix of the observations. The idea behind the proximity matrix is that if a pair of observations often falls in the same terminal node across several trees, the two observations share an underlying relationship.
From this we count the co-occurrences of pairs of observations at terminal nodes and obtain a co-occurrence matrix O, which is then used to construct the graphical E2Tree output.
The final aim is to explain the relationship between predictors and response by reconstructing the same structure as the proximity matrix produced by the RF model.
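
To make the co-occurrence idea concrete, here is a minimal base-R sketch on made-up data. The `leaves` matrix is hypothetical (it is not produced by e2tree, ranger, or randomForest), and the plain proportion of shared terminal nodes is a simplifying assumption; the matrix that the package actually computes may be weighted differently.

```{r}
# Hypothetical terminal-node ids: rows = 4 observations, columns = 3 trees
leaves <- matrix(c(1, 1, 2,
                   1, 1, 2,
                   2, 1, 3,
                   2, 3, 3),
                 nrow = 4, byrow = TRUE)

n_obs <- nrow(leaves)
co_occurrence <- matrix(0, n_obs, n_obs)

# For each tree, add 1 to every pair of observations sharing a terminal node
for (b in seq_len(ncol(leaves))) {
  co_occurrence <- co_occurrence + outer(leaves[, b], leaves[, b], "==")
}

# Proportion of trees in which each pair co-occurs, and its complement
co_occurrence <- co_occurrence / ncol(leaves)
dissimilarity <- 1 - co_occurrence

co_occurrence
```

In this toy case observations 1 and 2 share a leaf in every tree, so their co-occurrence is 1, while observations 1 and 4 never meet, so their co-occurrence is 0.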


```{r}
# Set random seed to make results reproducible:
set.seed(0)

# Initialize the split
iris_split <- iris %>% initial_split(prop = 0.6)
iris_split
# Assign the data to the correct sets
training <- iris_split %>% training()
validation <- iris_split %>% testing()
response_training <- training[,5]
response_validation <- validation[,5]

```


Train a Random Forest model with 1000 weak learners

```{r}
# Perform training with "ranger" or "randomForest" package:
## RF with "ranger" package
ensemble <- ranger(Species ~ ., data = training, num.trees = 1000, importance = 'impurity')

## RF with "randomForest" package
#ensemble <- randomForest(Species ~ ., data = training, ntree = 1000, importance = TRUE, proximity = TRUE)
```

Here, we create the dissimilarity matrix between observations with the `createDisMatrix` function.

```{r}
D <- createDisMatrix(ensemble, data = training, label = "Species", parallel = list(active = FALSE, no_cores = NULL))
```

Set the e2tree parameters

```{r}
setting <- list(impTotal = 0.1, maxDec = 0.01, n = 2, level = 5)
```


Build an explainable tree for RF

```{r}
tree <- e2tree(Species ~ ., data = training, D, ensemble, setting)
```

Plot the Explainable Ensemble Tree
```{r}
expl_plot <- rpart2Tree(tree, ensemble)

# Plot using rpart.plot package:
plot_e2tree <- rpart.plot::rpart.plot(expl_plot,
                                      type = 1,
                                      fallen.leaves = TRUE,
                                      cex = 0.55,
                                      branch.lty = 6,
                                      nn = TRUE,
                                      roundint = FALSE,
                                      digits = 2,
                                      box.palette = "lightgrey"
                                      )

```


Prediction with the new tree (example on training)

```{r}
pred <- ePredTree(tree, training[,-5], target="virginica")
```

Comparison of predictions (training sample) of RF and e2tree

```{r}
# "ranger" package
table(pred$fit, ensemble$predictions)

# "randomForest" package
#table(pred$fit, ensemble$predicted)
```

Comparison of predictions (training sample) of RF and correct response

```{r}
# "ranger" package
table(ensemble$predictions, response_training)

## "randomForest" package
#table(ensemble$predicted, response_training)
```

Comparison of predictions (training sample) of e2tree and correct response

```{r}
table(pred$fit,response_training)
```
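
The cross-tabulations above can be condensed into a single agreement figure. The helper below is a hypothetical convenience function (`agreement_rate` is not part of e2tree) and it is demonstrated on made-up labels rather than on the objects created earlier.

```{r}
# Share of positions where two label vectors coincide.
# Comparing as character avoids problems with differing factor levels.
agreement_rate <- function(x, y) {
  stopifnot(length(x) == length(y))
  mean(as.character(x) == as.character(y))
}

# Made-up labels for illustration only
toy_tree   <- factor(c("setosa", "versicolor", "virginica",  "virginica"))
toy_forest <- factor(c("setosa", "versicolor", "versicolor", "virginica"))

toy_agreement <- agreement_rate(toy_tree, toy_forest)
toy_agreement  # 3 of the 4 labels match
```

Assuming `pred$fit` holds one predicted class per training row, the same helper could be called as `agreement_rate(pred$fit, ensemble$predictions)` for fidelity to the RF, or as `agreement_rate(pred$fit, response_training)` for accuracy.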

Variable Importance

```{r}
V <- vimp(tree, training)
V
  
```


Comparison with the validation sample

```{r}
ensemble.pred <- predict(ensemble, validation[,-5])

pred_val <- ePredTree(tree, validation[,-5], target="virginica")
```

Comparison of predictions (validation sample) of RF and e2tree

```{r}
## "ranger" package
table(pred_val$fit, ensemble.pred$predictions)

## "randomForest" package
#table(pred_val$fit, ensemble.pred$predicted)
```


Comparison of predictions (validation sample) of e2tree and correct response

```{r}
table(pred_val$fit, response_validation)
roc_res <- roc(response_validation, pred_val$score, target="virginica")
roc_res$auc
```
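
As a cross-check on the AUC reported above, the area under the ROC curve can also be obtained from ranks (the Mann-Whitney formulation). The function `auc_by_ranks` is a hypothetical helper written for this sketch, not the package's `roc`, and the scores below are made up.

```{r}
# AUC = probability that a random positive case scores higher than a random
# negative case, computed from the rank sum of the positive cases.
auc_by_ranks <- function(is_positive, score) {
  stopifnot(is.logical(is_positive), length(is_positive) == length(score))
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  stopifnot(n_pos > 0, n_neg > 0)
  ranks <- rank(score)  # tied scores receive average ranks
  (sum(ranks[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up data: TRUE marks the target class
toy_is_target <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
toy_score     <- c(0.9, 0.4, 0.35, 0.8, 0.7)

toy_auc <- auc_by_ranks(toy_is_target, toy_score)
toy_auc  # 4 of the 6 positive/negative pairs are ordered correctly
```

Assuming `pred_val$score` is the score for the target class, a comparable call on the validation sample would be `auc_by_ranks(response_validation == "virginica", pred_val$score)`.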

To evaluate how well our tree captures the structure of the RF and replicates its classification, we introduce a procedure to measure the goodness of explainability.
We start by visualizing the final partition generated by the RF through a heatmap, a graphical representation of the co-occurrence matrix, which reflects how often pairs of observations are grouped together across the ensemble.
Each cell shows a pairwise similarity: the darker the cell, the closer the similarity is to 1, meaning the two observations were frequently assigned to the same leaf.
Comparing this matrix with the corresponding matrix derived from the E2Tree partition, both visually and statistically, allows us to assess how well E2Tree reproduces the ensemble structure.
To formally test this alignment, we use the [Mantel test](https://aacrjournals.org/cancerres/article/27/2_Part_1/209/476508/The-Detection-of-Disease-Clustering-and-a), a non-parametric method that quantifies the correlation between two distance or similarity matrices. It is particularly useful for studying the relationship between dissimilarity structures. The test uses permutations to generate a null distribution and compares the observed statistic against the values obtained under random reordering.



```{r}
eComparison(training, tree, D, graph = TRUE)
```
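
The permutation logic of the Mantel test can be sketched in a few lines of base R. This is a generic, one-sided illustration on simulated data; `mantel_sketch` is a hypothetical name, and the computation performed inside `eComparison` may differ in its statistic and details.

```{r}
# One-sided Mantel test sketch: correlate the lower triangles of two square
# matrices, then rebuild the null distribution by permuting the rows and
# columns of the first matrix together.
mantel_sketch <- function(m1, m2, n_perm = 999) {
  stopifnot(all(dim(m1) == dim(m2)), nrow(m1) == ncol(m1))
  lower <- lower.tri(m1)
  observed <- cor(m1[lower], m2[lower])
  n <- nrow(m1)
  permuted <- replicate(n_perm, {
    idx <- sample(n)
    cor(m1[idx, idx][lower], m2[lower])
  })
  # The observed value counts as one of the permutations
  p_value <- (sum(permuted >= observed) + 1) / (n_perm + 1)
  list(statistic = observed, p_value = p_value)
}

# Simulated example: two distance matrices built from nearly identical points
set.seed(1)
pts <- matrix(rnorm(40), ncol = 2)
d1 <- as.matrix(dist(pts))
d2 <- as.matrix(dist(pts + rnorm(40, sd = 0.1)))

toy_mantel <- mantel_sketch(d1, d2)
toy_mantel
```

Permuting rows and columns jointly keeps each matrix internally consistent while breaking the pairing between the two, which is what makes the resulting correlations a valid reference distribution.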

Owner

Name: Massimo Aria
Login: massimoaria
Kind: user
Location: Naples, Italy
Company: University of Naples Federico II (www.unina.it)

Website: www.massimoaria.com
Repositories: 5
Profile: https://github.com/massimoaria

Massimo Aria is a full professor in Statistics for Social Sciences at the Department of Economics and Statistics of the University of Naples Federico II

GitHub Events

Total

Release event: 1
Issues event: 6
Watch event: 2
Issue comment event: 2
Push event: 15
Pull request event: 9
Fork event: 1

Last Year

Release event: 1
Issues event: 6
Watch event: 2
Issue comment event: 2
Push event: 15
Pull request event: 9
Fork event: 1

Issues and Pull Requests

Last synced: 9 months ago

All Time

Total issues: 4
Total pull requests: 6
Average time to close issues: 6 months
Average time to close pull requests: 2 days
Total issue authors: 2
Total pull request authors: 2
Average comments per issue: 0.0
Average comments per pull request: 0.17
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 3
Pull requests: 6
Average time to close issues: about 1 month
Average time to close pull requests: 2 days
Issue authors: 2
Pull request authors: 2
Average comments per issue: 0.0
Average comments per pull request: 0.17
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

massimoaria (3)
talegari (1)

Pull Request Authors

agostinognasso (12)
talegari (1)

Packages

Total packages: 1
Total downloads:
- cran 285 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 2
Total maintainers: 1

cran.r-project.org: e2tree

Explainable Ensemble Trees

Homepage: https://github.com/massimoaria/e2tree
Documentation: http://cran.r-project.org/web/packages/e2tree/e2tree.pdf
License: MIT + file LICENSE
Latest release: 0.2.0
published 10 months ago

Versions: 2
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 285 Last month

Rankings

Dependent packages count: 26.8%

Dependent repos count: 33.0%

Average: 48.8%

Downloads: 86.7%

Maintainers (1)

aria@unina.it

Last synced: 9 months ago

Dependencies

DESCRIPTION cran

Matrix * imports
dplyr * imports
future.apply * imports
ggplot2 * imports
partitions * imports
purrr * imports
tidyr * imports

.github/workflows/R-CMD-check.yaml actions

actions/checkout v2 composite
r-lib/actions/check-r-package v2 composite
r-lib/actions/setup-pandoc v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite

.github/workflows/pkgdown.yaml actions

JamesIves/github-pages-deploy-action 4.1.4 composite
actions/checkout v2 composite
r-lib/actions/setup-pandoc v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite
