Recent Releases of fgclustering

fgclustering - Preprint Release, Fixes & Enhancements

This release brings several improvements and additions to Forest-Guided Clustering, including bug fixes, feature enhancements, and support for the new preprint publication.

🔧 Bug Fixes

  • Consistent Cluster Ordering: Fixed an issue where the cluster ordering differed between the feature importance plots and the heatmap/boxplot visualizations.

✨ Enhancements

  • CLARA Subsampling Improvements: Added support for subsampling in CLARA clustering with preservation of the original target label distribution, improving robustness and interpretability.
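Subsampling that preserves the target label distribution is commonly called stratified subsampling. The following is a minimal sketch of that idea using only the Python standard library; the function name `stratified_subsample` and its parameters are hypothetical and are not taken from the fgclustering API.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, sample_size, seed=0):
    """Return sorted indices of a subsample whose label proportions
    approximate those of `labels` (hypothetical helper, not package code)."""
    rng = random.Random(seed)

    # Group sample indices by their target label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    fraction = sample_size / len(labels)
    chosen = []
    for indices in by_label.values():
        # Draw the same fraction from every class, but keep at least one
        # sample per class so that no label disappears from the subsample.
        k = min(len(indices), max(1, round(fraction * len(indices))))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


labels = ["a"] * 80 + ["b"] * 20
subset = stratified_subsample(labels, sample_size=10)
counts = {lab: sum(labels[i] == lab for i in subset) for lab in set(labels)}
```

With an 80/20 label split and a requested size of 10, the subsample keeps the same 80/20 ratio. Because of per-class rounding, the returned size can differ slightly from `sample_size` for less convenient proportions.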

📄 New Additions

  • Preprint Publication Scripts: Added new scripts and Jupyter notebooks used in the preparation of the FGC preprint, enabling full reproducibility.
  • README Update: Repository overview now includes a citation badge and link to the preprint on arXiv.

- Jupyter Notebook
Published by lisa-sousa 7 months ago

fgclustering - v2.0.1 Bug Fixes

This patch release includes two minor but important bug fixes:

Bug Fixes

  • Fixed an inconsistency in plot_heatmap_classification related to renaming of variable names
  • Removed an unnecessary else branch in DistanceJensenShannon.calculate_distance_cluster_vs_background

Published by lisa-sousa 8 months ago

fgclustering - v2.0.0 Refactored, Scalable, and Faster

Major Release

This release delivers major performance improvements, scalable computation, and a full API refactor to align with the scikit-learn ecosystem. It introduces a memory-efficient distance matrix implementation, enhanced stability analysis, and a modular architecture for flexible usage and extension.

Performance & Storage Improvements

  • Faster distance matrix computation using optimized logic.
  • Memory-efficient distance matrix via on-disk memmap storage format.
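As a rough illustration of what on-disk, memory-mapped storage means, the sketch below uses only Python's standard `mmap` and `struct` modules. The class name `OnDiskDistanceMatrix`, the file layout, and the float32 precision are assumptions made for this example; they are not taken from the fgclustering code base, which may use a different mechanism such as NumPy's memmap.

```python
import mmap
import os
import struct
import tempfile


class OnDiskDistanceMatrix:
    """Square float32 matrix backed by a memory-mapped file (illustrative only)."""

    ITEM = struct.Struct("<f")  # one little-endian 32-bit float per cell

    def __init__(self, path, n):
        self.n = n
        size = n * n * self.ITEM.size
        # Pre-allocate a zero-filled file of the required size, then map it.
        with open(path, "wb") as fh:
            fh.truncate(size)
        self._file = open(path, "r+b")
        self._map = mmap.mmap(self._file.fileno(), size)

    def _offset(self, i, j):
        # Row-major position of cell (i, j) in bytes.
        return (i * self.n + j) * self.ITEM.size

    def __setitem__(self, key, value):
        i, j = key
        self.ITEM.pack_into(self._map, self._offset(i, j), value)

    def __getitem__(self, key):
        i, j = key
        return self.ITEM.unpack_from(self._map, self._offset(i, j))[0]

    def close(self):
        self._map.flush()
        self._map.close()
        self._file.close()


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "distances.bin")
    dist = OnDiskDistanceMatrix(path, n=4)
    dist[1, 2] = 0.25  # distances are symmetric, so write both cells
    dist[2, 1] = 0.25
    value = dist[1, 2]
    untouched = dist[0, 0]
    file_size = os.path.getsize(path)
    dist.close()
```

Only the pages that are actually read or written need to be resident in memory, which is why this kind of layout scales to matrices that would not fit in RAM.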

Stability Estimation with Jaccard Index

  • Switched from minimum to mean Jaccard Index as the stability metric.
  • Each cluster’s stability is now reported individually, and the reporting format has changed.
  • Bootstrapping now supports sampling only a fraction (x%) of the dataset.
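To make the difference between the minimum and the mean Jaccard Index concrete, here is a minimal sketch. It assumes the common scheme in which each original cluster is matched to its most similar cluster in every bootstrap run; the function names and the toy data are hypothetical, and the package's actual matching logic may differ.

```python
from statistics import mean


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of sample indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_stability(original, bootstrap_runs, aggregate):
    """Best-match Jaccard index per bootstrap run, combined with `aggregate`."""
    best = [max(jaccard(original, c) for c in clusters) for clusters in bootstrap_runs]
    return aggregate(best)


original_cluster = {0, 1, 2, 3}
bootstrap_runs = [
    [{0, 1, 2, 3}, {4, 5}],  # best match: 1.0
    [{0, 1}, {2, 3, 4, 5}],  # best match: 0.5
    [{0, 1, 2}, {3, 4, 5}],  # best match: 0.75
]

mean_stability = cluster_stability(original_cluster, bootstrap_runs, mean)
min_stability = cluster_stability(original_cluster, bootstrap_runs, min)
```

In this toy case the mean is 0.75 while the minimum is 0.5: a single poorly matching bootstrap run dominates the minimum but only partly lowers the mean.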

Refactored API (Scikit-learn Compatible)

  • Main logic now exposed as a functional API returning sklearn.utils.Bunch.
  • Distance metrics (e.g., Random Forest proximity) refactored into reusable classes.
  • Clustering algorithms (KMedoids, CLARA) encapsulated as configurable classes.
  • Feature importance computation and cluster optimization structured as modular classes.

New: CLARA Clustering Algorithm

  • Integrated ClusteringClara for handling large datasets.
  • Supports bootstrapped inputs with missing sample indices.
  • Verified inertia and label assignment logic.
  • Includes unit tests for stability and correctness.

Documentation Updates

  • Updated README, example notebooks, docstrings, and ReadTheDocs.
  • Standardized function and class documentation for clarity and completeness.

Published by lisa-sousa 8 months ago

fgclustering - v1.2.0 - New Importance Calculation

This is a major release where we:

  • replace p-values with the Wasserstein and Jensen-Shannon distances as the measure of feature importance
  • enable custom colors for heatmap plotting and saving interactive heatmaps as HTML files
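For readers unfamiliar with the two distances named in the first item, the sketch below implements both with the Python standard library. It assumes the distances compare a feature's values within a cluster against a reference distribution; the function names are hypothetical, and how the package bins or aggregates features is not described in these notes.

```python
import math
from bisect import bisect_right


def jensen_shannon_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence between two
    discrete probability distributions given as equally long lists."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))


def wasserstein_distance_1d(u, v):
    """First Wasserstein distance between two empirical 1-D samples,
    computed as the area between their empirical CDFs."""
    u, v = sorted(u), sorted(v)
    points = sorted(set(u) | set(v))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_u = bisect_right(u, left) / len(u)
        cdf_v = bisect_right(v, left) / len(v)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


# Continuous feature: the second sample is the first one shifted by 5.
cluster_values = [0.0, 1.0, 3.0]
reference_values = [5.0, 6.0, 8.0]
w = wasserstein_distance_1d(cluster_values, reference_values)

# Categorical feature: class frequencies that are identical vs. disjoint.
js_same = jensen_shannon_distance([0.5, 0.5], [0.5, 0.5])
js_disjoint = jensen_shannon_distance([1.0, 0.0], [0.0, 1.0])
```

The Wasserstein distance grows with the size of the shift and is expressed in the feature's own units, whereas the base-2 Jensen-Shannon distance is bounded between 0 (identical distributions) and 1 (no overlap).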

Published by lisa-sousa 9 months ago

fgclustering - v1.1.1 - decision path plotting heatmap

This is a minor release where we:

  • refactored the code for the decision path heatmap plot
  • added the possibility to plot interactive heatmaps using plotly

Published by lisa-sousa over 1 year ago

fgclustering - v1.1.0 - importance computation and plotting

This is a major release where we changed:

  • Importance computation: adjusted the importance value computation from 1 - p-value to a normalized negative log transformation. The p-values are clipped at 1e-50 to avoid log10(0), and the result is normalized by -log10(1e-50): -log10(p-value) / -log10(1e-50). The new importance values range between 0 and 1 but stretch the smaller p-values (closer to 0) more and compress the larger ones (closer to 1).
  • Plotting: merged the two feature importance plotting functions (plot_global_feature_importance and plot_local_feature_importance) into one function, plot_feature_importance, which plots global and local feature importance in one grid plot.
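The transformation can be sketched in a few lines. The signs below are arranged so that the result lands in the stated range of 0 to 1; the function name `importance_from_p_value` is hypothetical and not part of the package.

```python
import math

P_FLOOR = 1e-50  # p-values are clipped here to avoid log10(0)


def importance_from_p_value(p_value, floor=P_FLOOR):
    """Map a p-value in [0, 1] to an importance in [0, 1] via a
    normalized negative log10 transformation (illustrative sketch)."""
    clipped = max(p_value, floor)
    return -math.log10(clipped) / -math.log10(floor)


old_style = 1 - 0.05                          # previous definition: 1 - p-value
new_style = importance_from_p_value(0.05)     # new definition
at_floor = importance_from_p_value(0.0)       # clipped to 1e-50
midpoint = importance_from_p_value(1e-25)
no_signal = importance_from_p_value(1.0)
```

Under the old definition, p-values of 0.05 and 1e-25 both map to values very close to 1. Under the new one they map to roughly 0.026 and 0.5, so differences between very small p-values stay visible.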

Published by lisa-sousa over 1 year ago

fgclustering - v1.0.4 - additional functionalities and bug fix

Major changes in this release:

  • fixed the Miniconda installation error in GitHub Actions
  • fixed a pandas groupby error that occurred with new pandas versions
  • enabled selection of only the top n features in the plotting functions
  • added multiple testing correction to the p-value calculation across clusters
  • fixed the calculation of the chi-square test for multiple categories over multiple clusters
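The notes do not name the correction method that was added. Purely as an illustration of what correcting p-values across clusters involves, the sketch below applies a Bonferroni correction; both the choice of method and the function name are assumptions, not a description of the package's implementation.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of
    tests and cap the result at 1 (assumed method, for illustration)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# One p-value per cluster for a single feature (made-up numbers).
per_cluster_p_values = [0.01, 0.04, 0.5]
adjusted = bonferroni(per_cluster_p_values)
```

With three clusters, a raw p-value of 0.04 becomes 0.12 after adjustment and would no longer pass a 0.05 threshold.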

Published by lisa-sousa over 1 year ago

fgclustering - v1.0.3 - bug fix

This is a minor release with a small bug fix: fix pytest.

Published by lisa-sousa almost 3 years ago

fgclustering - v1.0.2 - bug fix

This is a minor release with a small bug fix: added a missing import statement.

Published by lisa-sousa almost 3 years ago

fgclustering - v1.0.1 - speed-up

This is a major release with the following changes:

  • added new functions that speed up the code
  • switched to a new package for the k-medoids calculation, which enables faster computation
  • parallelized the cluster optimization step
  • introduced new parameters such as n_jobs (number of parallel jobs to execute) and verbose (printing the output)
  • updated the documentation
  • updated and extended the tutorial
  • minor bug fixes: fgc now runs on categorical target inputs; metric = precomputed is used for the k-medoids calculation
  • added an extra function, calculate_statistics, which enables visualisations with features not seen by the random forest or fgc

Published by hpelin over 3 years ago

fgclustering - v0.3 - New Plotting Functionalities

This is a minor release with the following changes:

  • changed plotboxplots() into plotdistribution(), which displays categorical features as barplots
  • merged plotheatmap() and plotboxplots() into plotdecisionpaths(), as both plots are used to derive decision rules from Random Forests
  • minor bug fix: X is now copied when computing the global feature, so that the original X is not modified

Published by lisa-sousa over 3 years ago

fgclustering - v0.2.0 - first release of FGC package

The first version of the package was developed by @lisa-sousa and @DoTha.

Full Changelog: https://github.com/HelmholtzAI-Consultants-Munich/fg-clustering/commits/v0.2.0

Published by lisa-sousa almost 4 years ago