fgclustering
Explainability for Random Forest Models.
https://github.com/helmholtzai-consultants-munich/fg-clustering
Science Score: 65.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 6 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ✓ Institutional organization owner: Organization helmholtzai-consultants-munich has institutional domain (www.helmholtz.ai)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.8%) to scientific vocabulary
Keywords
Repository
Explainability for Random Forest Models.
Basic Info
- Host: GitHub
- Owner: HelmholtzAI-Consultants-Munich
- License: mit
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: https://forest-guided-clustering.readthedocs.io/
- Size: 125 MB
Statistics
- Stars: 31
- Watchers: 2
- Forks: 12
- Open Issues: 3
- Releases: 13
Topics
Metadata Files
README.md
# *Forest-Guided Clustering* - Shedding light into the Random Forest Black Box
[Documentation](https://forest-guided-clustering.readthedocs.io/en/latest/) | [PyPI](https://pypi.org/project/fgclustering) | [Downloads](https://pepy.tech/projects/fgclustering) | [GitHub stars](https://github.com/HelmholtzAI-Consultants-Munich/forest_guided_clustering/stargazers) | [MIT License](https://opensource.org/licenses/MIT) | [arXiv DOI](https://doi.org/10.48550/arXiv.2507.19455) | [Tests](https://github.com/HelmholtzAI-Consultants-Munich/fg-clustering/actions/workflows/test.yml)
✨ About this Package
Why Use Forest-Guided Clustering?
Forest-Guided Clustering (FGC) is an explainability method for Random Forest models that addresses one of the key limitations of many standard XAI techniques: the inability to effectively handle correlated features and complex decision patterns. Traditional methods like permutation importance, SHAP, and LIME often assume feature independence and focus on individual feature contributions, which can lead to misleading or incomplete explanations. As machine learning models are increasingly deployed in sensitive domains like healthcare, finance, and HR, understanding why a model makes a decision is as important as the decision itself. This is not only a matter of trust and fairness, but also a legal requirement in many jurisdictions, such as the European Union's GDPR which mandates a “right to explanation” for automated decisions.
FGC offers a different approach: instead of approximating the model with simpler surrogates, it uses the internal structure of the Random Forest itself. By analyzing the tree traversal patterns of individual samples, FGC clusters data points that follow similar decision paths. This reveals how the forest segments the input space, enabling a human-interpretable view of the model's internal logic. FGC is particularly useful when features are highly correlated, as it does not rely on assumptions of feature independence. It bridges the gap between model accuracy and model transparency, offering a powerful tool for global, model-specific interpretation of Random Forests.
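To make this intuition concrete, here is a minimal, hypothetical sketch of proximity-based clustering on a scikit-learn Random Forest. It is not the fgclustering implementation (which uses k-medoids and adds cluster-stability optimization); it simply illustrates the idea of measuring sample similarity through shared leaves, and it assumes scikit-learn >= 1.2 for the `metric="precomputed"` argument of `AgglomerativeClustering`.
```python
# Minimal sketch of the core idea (not the fgclustering implementation):
# samples that land in the same leaves across many trees are "close",
# and clustering that proximity exposes the forest's decision structure.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import AgglomerativeClustering

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# leaf index of every sample in every tree: shape (n_samples, n_trees)
leaves = model.apply(X)

# proximity = fraction of trees in which two samples share a leaf
proximity = np.mean(leaves[:, None, :] == leaves[None, :, :], axis=-1)
distance = 1.0 - proximity

# cluster on the precomputed forest-based distance (FGC uses k-medoids here)
labels = AgglomerativeClustering(
    n_clusters=3, metric="precomputed", linkage="average"
).fit_predict(distance)
```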
📢 New! Forest-Guided Clustering is now on arXiv
Please see our paper Forest-Guided Clustering - Shedding Light into the Random Forest Black Box for a detailed description of the method, its theoretical foundations, and practical applications. Check it out to learn more about how FGC reveals structure in your Random Forest models!
Prefer a visual walkthrough? Watch our short introduction video.
Curious how Forest-Guided Clustering compares to standard methods? See our notebook: Introduction to FGC: Comparison of Forest-Guided Clustering and Feature Importance.
Want to dive deeper? Visit our full documentation for:
- Getting Started – Installation and quick start
- Tutorials – Use cases for classification, regression, and large datasets
- API Reference – Detailed descriptions of functions and classes
🛠️ Installation
Requirements
This package was tested for Python 3.8 - 3.13 on Ubuntu, macOS, and Windows. It depends on the kmedoids Python package. If you are using Windows or macOS, you may need to first install Rust/Cargo with:
conda install -c conda-forge rust
If this does not work, please try to install Cargo from source:
git clone https://github.com/rust-lang/cargo
cd cargo
cargo build --release
For further information on the kmedoids package, please visit its documentation.
All other required packages are automatically installed if installation is done via pip.
Install Options
The installation of the package is done via pip. Note: if you are using conda, first install pip with `conda install pip`.
PyPI install:
pip install fgclustering
Installation from source:
git clone https://github.com/HelmholtzAI-Consultants-Munich/fg-clustering.git
Installation as python package (run inside directory):
pip install .
Development installation as python package (run inside directory):
pip install -e .
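As an optional sanity check after installation, one can confirm that the package imports and report the installed version; the following is a minimal sketch using only the standard library.
```python
# Optional sanity check: confirm the package imports and show its version.
import importlib.metadata

import fgclustering  # raises ImportError if the installation failed

print(importlib.metadata.version("fgclustering"))
```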
💻 How to Use Forest-Guided Clustering
Basic Usage
To apply Forest-Guided Clustering (FGC) to explain a Random Forest model, follow a simple three-step workflow: compute the forest-guided clusters, evaluate feature importance, and visualize the results.
```python
# compute the forest-guided clusters
fgc = forest_guided_clustering(
    estimator=model, X=X, y=y,
    clustering_distance_metric=DistanceRandomForestProximity(),
    clustering_strategy=ClusteringKMedoids(),
)

# evaluate feature importance
feature_importance = forest_guided_feature_importance(
    X=X, y=y, cluster_labels=fgc.cluster_labels, model_type=fgc.model_type,
)

# visualize the results
plot_forest_guided_feature_importance(
    feature_importance_local=feature_importance.feature_importance_local,
    feature_importance_global=feature_importance.feature_importance_global,
)
plot_forest_guided_decision_paths(
    data_clustering=feature_importance.data_clustering, model_type=fgc.model_type,
)
```
where
- estimator is the trained Random Forest model
- X is the feature matrix
- y is the target variable
- clustering_distance_metric defines how similarity between samples is measured based on the Random Forest structure
- clustering_strategy determines how the proximity-based clustering is performed
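The snippet above assumes a Random Forest has already been trained. A hypothetical setup of `estimator`, `X`, and `y` on a toy dataset could look like this:
```python
# Hypothetical setup for the FGC inputs: a trained Random Forest (estimator),
# a feature matrix X, and a target vector y, here on a toy dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
model = RandomForestClassifier(n_estimators=300, random_state=42).fit(X, y)
```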
For a detailed walkthrough, refer to the Introduction to FGC: Simple Use Cases notebook.
Using FGC on Large Datasets
When working with datasets containing a large number of samples, Forest-Guided Clustering (FGC) provides several strategies to ensure efficient performance and scalability:
- Parallelize Cluster Optimization: Leverage multiple CPU cores by setting the `n_jobs` parameter to a value greater than 1 in the `forest_guided_clustering()` function. This parallelizes the bootstrapping process used to evaluate cluster stability.
- Use a Faster Clustering Algorithm: Improve the efficiency of the K-Medoids clustering step by using the optimized `"fasterpam"` algorithm. Set the `method` parameter of your clustering strategy (e.g., `ClusteringKMedoids(method="fasterpam")`) to activate this faster implementation.
- Enable Subsampling with CLARA: For extremely large datasets, consider using the CLARA (Clustering Large Applications) variant by choosing `ClusteringClara()` as your clustering strategy. CLARA performs clustering on smaller random subsamples, making it suitable for high-volume data.
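Putting these options together, a minimal sketch could look like the following; the names mirror the list above, but exact signatures may differ between versions, so consult the API reference.
```python
# Sketch of the large-dataset options described above (names as in the README;
# check the API reference for the exact signatures in your installed version).

# Faster PAM variant plus parallelized cluster-stability bootstrapping:
fgc = forest_guided_clustering(
    estimator=model,
    X=X,
    y=y,
    clustering_distance_metric=DistanceRandomForestProximity(),
    clustering_strategy=ClusteringKMedoids(method="fasterpam"),
    n_jobs=4,  # use 4 CPU cores
)

# For extremely large datasets, switch to CLARA-based subsampling instead:
fgc = forest_guided_clustering(
    estimator=model,
    X=X,
    y=y,
    clustering_distance_metric=DistanceRandomForestProximity(),
    clustering_strategy=ClusteringClara(),
)
```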
For a detailed example, please refer to the notebook Special Case: FGC for Big Datasets.
🤝 Contributing
We welcome contributions of all kinds—whether it’s improvements to the code, documentation, tutorials, or examples. Your input helps make Forest-Guided Clustering more robust and useful for the community.
To contribute:
- Fork the repository.
- Make your changes in a feature branch.
- Submit a pull request to the main branch.
We’ll review your submission and work with you to get it merged.
If you have any questions or ideas you'd like to discuss before contributing, feel free to reach out to Lisa Barros de Andrade e Sousa.
📝 How to cite
If you find Forest-Guided Clustering useful in your research or applications, please consider citing it:
```bibtex
@article{barros2025forest,
  title   = {Forest-Guided Clustering -- Shedding Light into the Random Forest Black Box},
  author  = {Barros de Andrade e Sousa, Lisa and Miller, Gregor and Le Gleut, Ronan and Thalmeier, Dominik and Pelin, Helena and Piraud, Marie},
  journal = {arXiv},
  year    = {2025},
  url     = {https://doi.org/10.48550/arXiv.2507.19455}
}
```
🛡️ License
The fgclustering package is released under the MIT License. You are free to use, modify, and distribute it under the terms outlined in the LICENSE file.
Owner
- Name: HelmholtzAI-consultants-HMGU
- Login: HelmholtzAI-Consultants-Munich
- Kind: organization
- Email: contact@helmholtz.ai
- Location: Munich, Germany
- Website: www.helmholtz.ai
- Repositories: 19
- Profile: https://github.com/HelmholtzAI-Consultants-Munich
Leading in applied AI by combining unique research questions, data and expertise with democratized access to new tools.
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: >-
  Forest-Guided Clustering - Explainability for
  Random Forest Models
message: >-
  "If you use this software, please cite it as
  below."
type: software
authors:
  - given-names: Lisa
    family-names: Barros de Andrade e Sousa
    email: lisa.barros.andrade.sousa@gmail.com
    affiliation: Helmholtz AI
    orcid: 'https://orcid.org/0000-0001-7702-9782'
  - given-names: Dominik
    family-names: Thalmeier
  - given-names: Helena
    family-names: Pelin
    email: helena.pelin@helmholtz-muenchen.de
    affiliation: Helmholtz AI
    orcid: 'https://orcid.org/0000-0001-8875-4285'
  - given-names: Marie
    family-names: Piraud
identifiers:
  - type: doi
    value: 10.5281/zenodo.6445529
repository-code: >-
  https://github.com/HelmholtzAI-Consultants-Munich/fg-clustering
url: >-
  https://forest-guided-clustering.readthedocs.io/en/latest/
abstract: >-
  This python package is about explainability of
  Random Forest models. Standard explainability
  methods (e.g. feature importance) assume
  independence of model features and hence, are not
  suited in the presence of correlated features. The
  Forest-Guided Clustering algorithm does not assume
  independence of model features, because it computes
  the feature importance based on subgroups of
  instances that follow similar decision rules within
  the Random Forest model. Hence, this method is well
  suited for cases with high correlation among model
  features.
keywords:
  - XAI
  - explainable machine learning
  - Random Forest
license: MIT
version: v0.2.0
date-released: '2022-04-11'
GitHub Events
Total
- Create event: 2
- Release event: 2
- Issues event: 2
- Watch event: 5
- Delete event: 2
- Issue comment event: 2
- Push event: 33
- Pull request review event: 3
- Pull request event: 6
- Fork event: 3
Last Year
- Create event: 2
- Release event: 2
- Issues event: 2
- Watch event: 5
- Delete event: 2
- Issue comment event: 2
- Push event: 33
- Pull request review event: 3
- Pull request event: 6
- Fork event: 3
Packages
- Total packages: 1
- Total downloads: 83 last month (PyPI)
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 15
- Total maintainers: 3
pypi.org: fgclustering
Forest-Guided Clustering - Explainability method for Random Forest models.
- Documentation: https://fgclustering.readthedocs.io/
- License: MIT License
- Latest release: 2.0.2 (published 7 months ago)
Rankings
Maintainers (3)
Dependencies
- ipykernel *
- nbsphinx *
- actions/checkout v3 composite
- actions/setup-python v4 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- s-weigand/setup-conda v1 composite
- kmedoids *
- matplotlib *
- numba *
- numexpr >=2.8.4
- numpy *
- pandas *
- scikit-learn *
- scipy *
- seaborn >=0.12
- statsmodels >=0.13.5
- tqdm *