clustpy

A Python library for advanced clustering algorithms

https://github.com/collinleiber/clustpy

Last synced: 10 months ago

Repository

A Python library for advanced clustering algorithms

Basic Info

Host: GitHub
Owner: collinleiber
License: bsd-3-clause
Language: Python
Default Branch: main
Homepage: https://clustpy.readthedocs.io/en/
Size: 3.89 MB

Statistics

Stars: 130
Watchers: 10
Forks: 15
Open Issues: 19
Releases: 4

Created over 5 years ago · Last pushed 11 months ago

Metadata Files

Readme License Citation

The package provides a simple way to perform clustering in Python. For this purpose it provides a variety of algorithms from different domains. Additionally, ClustPy includes methods that are often needed for research purposes, such as plots, clustering metrics or evaluation methods. Further, it integrates various frequently used datasets (e.g., from the UCI repository) through largely automated loading options.

ClustPy does not focus on efficiency (for that we recommend, e.g., pyclustering) but on making it possible to try out a wide range of modern scientific methods. In particular, it aims to make lesser-known methods accessible in a simple and convenient way. To get an initial overview of the integrated deep clustering methods, the following survey paper may be helpful:
An Introductory Survey to Autoencoder-based Deep Clustering - Sandboxes for Combining Clustering with Deep Learning

Since ClustPy largely follows the implementation conventions of sklearn clustering, it can be combined with many other packages (see below).

Installation

For Users

Stable Version

The current stable version can be installed by the following command:

pip install clustpy

If you want to install the complete package including all data loader functions, you should use:

pip install clustpy[full]

Note that a C compiler (e.g., gcc) is required for installation. Therefore, in case of an installation error, make sure that:
- Windows: Microsoft C++ Build Tools is installed
- Linux/Mac: Python dev is installed (e.g., by running apt-get install python-dev - the exact command may differ depending on the Linux distribution)

The error messages may look like this:
- 'error: command 'gcc' failed: No such file or directory'
- 'Could not build wheels for clustpy, which is required to install pyproject.toml-based projects'
- 'Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools"'
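Before retrying a failed installation, it can help to check whether any C compiler is visible on the PATH. The following is a minimal sketch using only the standard library; the list of candidate compiler names is an assumption and not something ClustPy prescribes.

```python
import shutil
import sys


def find_c_compilers(candidates=("gcc", "cc", "clang", "cl")):
    """Map each candidate compiler name to its location on PATH (None if absent)."""
    return {name: shutil.which(name) for name in candidates}


found = find_c_compilers()
if not any(found.values()):
    # Nothing found: the build step of the installation would most likely fail
    print("No C compiler found on PATH.", file=sys.stderr)
else:
    for name, path in found.items():
        if path:
            print(f"{name}: {path}")
```

On Windows, the Microsoft compiler may not show up on the PATH of a regular shell even when the Build Tools are installed, so an empty result there is only a hint, not proof of a missing compiler.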

Development Version

The current development version can be installed directly from git by executing:

sudo pip install git+https://github.com/collinleiber/ClustPy.git

Alternatively, clone the repository, go to the directory and execute:

sudo python setup.py install

If you have no sudo rights you can use:

python setup.py install --prefix ~/.local

For Developers

Clone the repository, go to the directory and do the following (NumPy must be installed beforehand).

Install package locally and compile C files:

python setup.py install --prefix ~/.local

Copy compiled C files to correct file location:

python setup.py build_ext --inplace

Remove clustpy via pip to avoid ambiguities during development, e.g., when changing files in the code:

pip uninstall clustpy

Components

Clustering Algorithms

Partition-based Clustering

| Algorithm | Publication | Published at | Original Code | Docs |
|---|---|---|---|---|
| DipInit (incl. DipExt) | Utilizing Structure-Rich Features to Improve Clustering | ECML PKDD 2020 | Link (R) | Link |
| DipMeans | Dip-means: an incremental clustering method for estimating the number of clusters | NIPS 2012 | Link (Matlab) | Link |
| Dip'n'sub (incl. TailoredDip) | Extension of the Dip-test Repertoire - Efficient and Differentiable p-value Calculation for Clustering | SIAM SDM 2023 | Link (Python) | Link |
| GapStatistic | Estimating the number of clusters in a data set via the gap statistic | RSS: Series B 2002 | - | Link |
| G-Means | Learning the k in k-means | NIPS 2003 | - | Link |
| LDA-K-Means | Adaptive dimension reduction using discriminant analysis and K-means clustering | ICML 2007 | - | Link |
| PG-Means | PG-means: learning the number of clusters in data | NIPS 2006 | - | Link |
| Projected Dip-Means | The Projected Dip-means Clustering Algorithm | SETN 2018 | - | Link |
| SkinnyDip (incl. UniDip) | Skinny-dip: Clustering in a Sea of Noise | KDD 2016 | Link (R) | Link |
| SpecialK | k Is the Magic Number - Inferring the Number of Clusters Through Nonparametric Concentration Inequalities | ECML PKDD 2019 | Link (Python) | Link |
| SubKmeans | Towards an Optimal Subspace for K-Means | KDD 2017 | Link (Scala) | Link |
| X-Means | X-means: Extending k-means with efficient estimation of the number of clusters | ICML 2000 | - | Link |

Density-based Clustering

| Algorithm | Publication | Published at | Original Code | Docs |
|---|---|---|---|---|
| Multi Density DBSCAN | Multi Density DBSCAN | IDEAL 2011 | - | Link |

Hierarchical Clustering

| Algorithm | Publication | Published at | Original Code | Docs |
|---|---|---|---|---|
| DIANA | Finding Groups in Data: An Introduction to Cluster Analysis | JASA 1991 | - | Link |

Alternative Clustering / Non-redundant Clustering

| Algorithm | Publication | Published at | Original Code | Docs |
|---|---|---|---|---|
| AutoNR | Automatic Parameter Selection for Non-Redundant Clustering | SIAM SDM 2022 | Link (Python) | Link |
| NR-Kmeans | Discovering Non-Redundant K-means Clusterings in Optimal Subspaces | KDD 2018 | Link (Scala) | Link |
| Orth1 + Orth2 | Non-redundant multi-view clustering via orthogonalization | ICDM 2007 | - | Link |

Deep Clustering

| Algorithm | Publication | Published at | Original Code | Docs |
|---|---|---|---|---|
| ACe/DeC | Details (Don't) Matter: Isolating Cluster Information in Deep Embedded Spaces | IJCAI 2021 | Link (Python + PyTorch) | Link |
| AEC | Auto-encoder based data clustering | CIARP 2013 | Link (Matlab) | Link |
| DCN | Towards K-means-friendly spaces: simultaneous deep learning and clustering | ICML 2017 | Link (Python + Theano) | Link |
| DDC | Deep density-based image clustering | Knowledge-Based Systems 2020 | Link (Python + Keras) | Link |
| DEC | Unsupervised deep embedding for clustering analysis | ICML 2016 | Link (Python + Caffe) | Link |
| DeepECT | Deep embedded cluster tree | ICDM 2019 | Link (Python + PyTorch) | Link |
| DEN | Deep Embedding Network for Clustering | ICPR 2014 | - | Link |
| DipDECK | Dip-based Deep Embedded Clustering with k-Estimation | KDD 2021 | Link (Python + PyTorch) | Link |
| DipEncoder | The DipEncoder: Enforcing Multimodality in Autoencoders | KDD 2022 | Link (Python + PyTorch) | Link |
| DKM | Deep k-Means: Jointly clustering with k-Means and learning representations | Pattern Recognition Letters 2020 | Link (Python + Tensorflow) | Link |
| ENRC | Deep Embedded Non-Redundant Clustering | AAAI 2020 | Link (Python + PyTorch) | Link |
| IDEC | Improved Deep Embedded Clustering with Local Structure Preservation | IJCAI 2017 | Link (Python + Keras) | Link |
| N2D | N2d: (not too) deep clustering via clustering the local manifold of an autoencoded embedding | ICPR 2021 | Link (Python + Keras) | Link |
| VaDE | Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering | IJCAI 2017 | Link (Python + Keras) | Link |

Neural Networks

| Algorithm | Publication | Published at | Original Code | Docs |
|---|---|---|---|---|
| Convolutional Autoencoder (ResNet) | Deep Residual Learning for Image Recognition | CVPR 2016 | - | Link |
| Feedforward Autoencoder | Modular Learning in Neural Networks | AAAI 1987 | - | Link |
| Neighbor Encoder | Representation Learning by Reconstructing Neighborhoods | arXiv 2018 | - | Link |
| Stacked Autoencoder | Greedy Layer-Wise Training of Deep Networks | NIPS 2006 | - | Link |
| Variational Autoencoder | Auto-Encoding Variational Bayes | ICLR 2014 | - | Link |

Other implementations

Metrics
- Confusion Matrix [Docs]
- Fair Normalized Mutual Information (FNMI) [Publication] [Docs]
- Hierarchical Metrics
  - Dendrogram Purity [Publication] [Docs]
  - Leaf Purity [Publication] [Docs]
- Information-Theoretic External Cluster-Validity Measure (DOM) [Publication] [Docs]
- Pair Counting Scores (f1, rand, jaccard, recall, precision) [Publication] [Docs]
- Purity [Publication] [Docs]
- Scores for multiple labelings (see alternative clustering algorithms)
  - Multiple Labelings Confusion Matrix [Docs]
  - Multiple Labelings Pair Counting Scores [Publication] [Docs]
- Unsupervised Clustering Accuracy [Publication] [Docs]
- Variation of information [Publication] [Docs]
Utils
- Automatic evaluation methods [Docs]
- Hartigans Dip-test [Publication] [Docs]
- Various plots [Docs]
Datasets
- Synthetic dataset creators
  - For common subspace clustering [Docs]
  - For alternative clustering [Docs]
- Real-world dataset loaders (e.g., Iris, Wine, Mice protein, Optdigits, MNIST, ...)
  - UCI Repository [Website]
  - UEA & UCR Time Series Classification Repository [Website]
  - MedMNIST [Website]
  - Torchvision Datasets [Website]
  - Sklearn Datasets [Website]
  - Others
- Dataset loaders for datasets with multiple labelings
  - ALOI (subset) [Website]
  - CMU Face [Website]
  - Dancing Stickfigures [Publication]
  - Fruit [Publication]
  - NRLetters [Publication]
  - WebKB [Website]

Python environments

ClustPy reads global environment variables in some places. These can be defined in Python using os.environ['VARIABLE_NAME'] = VARIABLE_VALUE. The following variable names are used:

'CLUSTPY_DATA': Defines the path where downloaded datasets should be saved.
'CLUSTPY_DEVICE': Defines the device to be used for PyTorch applications. Example: `os.environ['CLUSTPY_DEVICE'] = 'cuda:1'`
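As a minimal sketch, both variables could be set at the top of a script, before any ClustPy code that reads them is executed. The spelling with underscore separators ('CLUSTPY_DATA', 'CLUSTPY_DEVICE'), the directory name, and the assumption that "cpu" is accepted as a device value are not confirmed here and should be checked against the documentation.

```python
import os
from pathlib import Path

# Hypothetical download location; any writable directory should do
data_dir = Path.home() / "clustpy_data"

# Values stored in os.environ must be strings
os.environ["CLUSTPY_DATA"] = str(data_dir)
# Assumed to take a PyTorch device string, e.g. "cuda:1" for the second GPU
os.environ["CLUSTPY_DEVICE"] = "cpu"

print(os.environ["CLUSTPY_DATA"], os.environ["CLUSTPY_DEVICE"])
```

Variables set this way only affect the current Python process and the processes it starts.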

Compatible packages

We stick as closely as possible to the implementation details of sklearn clustering. Therefore, our methods are compatible with many other packages. Examples are:

sklearn clustering
- K-Means
- Affinity propagation
- Mean-shift
- Spectral clustering
- Ward hierarchical clustering
- Agglomerative clustering
- DBSCAN
- OPTICS
- Gaussian mixtures
- BIRCH
kmodes
- k-modes
- k-prototypes
HDBSCAN
- HDBSCAN
scikit-learn-extra
- k-medoids
- Density-Based common-nearest-neighbors clustering
Density Peak Clustering
- DPC
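What makes this compatibility possible is the shared estimator convention: an object is fitted with fit(X) and afterwards exposes its result as labels_. The sketch below illustrates this with a hypothetical toy estimator (ThresholdClusterer is made up for this example and is not part of any package); the helper run_clusterers only relies on the convention, so estimators from ClustPy or the packages listed above should be usable in its place, assuming they follow it.

```python
class ThresholdClusterer:
    """Hypothetical toy estimator: splits samples on their first feature."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        # sklearn convention: fitted attributes end with an underscore, fit returns self
        self.labels_ = [int(row[0] > self.threshold) for row in X]
        return self


def run_clusterers(estimators, X):
    """Fit every (name, estimator) pair on X and collect the resulting labels_."""
    return {name: estimator.fit(X).labels_ for name, estimator in estimators}


X = [[-2.0, 1.0], [-1.5, 0.5], [1.0, 0.2], [3.0, -1.0]]
results = run_clusterers(
    [("split_at_0", ThresholdClusterer()), ("split_at_2", ThresholdClusterer(2.0))], X
)
print(results)
```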

Coding Examples

1)

In this first example, the subspace algorithm SubKmeans is run on a synthetic subspace dataset. Afterward, the clustering accuracy is calculated to evaluate the result.

```python
from clustpy.partition import SubKmeans
from clustpy.data import create_subspace_data
from clustpy.metrics import unsupervised_clustering_accuracy as acc

# Synthetic dataset with 1000 samples and 4 clusters
data, labels = create_subspace_data(1000, n_clusters=4, subspace_features=[2,5])
sk = SubKmeans(4)
sk.fit(data)
# Compare the ground truth with the predicted labels
acc_res = acc(labels, sk.labels_)
print("Clustering accuracy:", acc_res)
```
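Since cluster ids are arbitrary, a clustering accuracy first has to match predicted clusters to ground truth classes. The following brute-force sketch shows the usual definition (best fraction of agreeing samples over all one-to-one mappings); it is for illustration only and is not ClustPy's implementation. The function name is hypothetical.

```python
from itertools import permutations


def clustering_accuracy(labels_true, labels_pred):
    """Best fraction of matching samples over all one-to-one cluster-to-class mappings."""
    classes = sorted(set(labels_true))
    clusters = sorted(set(labels_pred))
    # Pad with None so that surplus clusters can be mapped to "no class"
    size = max(len(classes), len(clusters))
    targets = classes + [None] * (size - len(classes))
    best_hits = 0
    for perm in permutations(targets, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t for t, p in zip(labels_true, labels_pred))
        best_hits = max(best_hits, hits)
    return best_hits / len(labels_true)


print(clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))
```

Trying all permutations is only feasible for a handful of clusters; in practice the matching is typically solved with the Hungarian algorithm (e.g., scipy.optimize.linear_sum_assignment).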

2)

The second example covers the topic of non-redundant/alternative clustering. Here, the NrKmeans algorithm is run on the Fruit dataset. Beware that NrKmeans as a non-redundant clustering algorithm returns multiple labelings. Therefore, we calculate the confusion matrix by comparing each combination of labels using the normalized mutual information (nmi). The confusion matrix will be printed and finally the best matching nmi will be stated for each set of labels.

```python
from clustpy.alternative import NrKmeans
from clustpy.data import load_fruit
from clustpy.metrics import MultipleLabelingsConfusionMatrix
from sklearn.metrics import normalized_mutual_info_score as nmi
import numpy as np

data, labels = load_fruit(return_X_y=True)
# Search for two clusterings with three clusters each
nk = NrKmeans([3, 3])
nk.fit(data)
# Compare every ground truth labeling with every predicted labeling
mlcm = MultipleLabelingsConfusionMatrix(labels, nk.labels_, nmi)
mlcm.rearrange()
print(mlcm.confusion_matrix)
# Best matching nmi for each set of labels
print(np.max(mlcm.confusion_matrix, axis=1))
```
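To make the idea of such a confusion matrix concrete, the sketch below scores every ground truth labeling against every predicted labeling with a small pure-Python NMI (arithmetic-mean normalization). It is an illustration under simplifying assumptions, not ClustPy's implementation: labelings are passed as plain lists rather than array columns, the helper names are hypothetical, and the rearranging step is left out.

```python
from collections import Counter
from math import log


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def nmi(a, b):
    """Normalized mutual information of two labelings (arithmetic-mean normalization)."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(c / n * log(c * n / (count_a[x] * count_b[y])) for (x, y), c in joint.items())
    denom = (entropy(a) + entropy(b)) / 2
    # Both labelings consist of a single cluster: treat them as identical
    return mi / denom if denom > 0 else 1.0


def labeling_confusion_matrix(true_labelings, pred_labelings, score=nmi):
    """Rows: ground truth labelings, columns: predicted labelings."""
    return [[score(t, p) for p in pred_labelings] for t in true_labelings]


shape = [0, 0, 1, 1, 0, 0, 1, 1]   # first ground truth labeling
color = [0, 1, 0, 1, 0, 1, 0, 1]   # second ground truth labeling
pred_a = [1, 0, 1, 0, 1, 0, 1, 0]  # recovers "color" with swapped ids
pred_b = [0, 0, 1, 1, 0, 0, 1, 1]  # recovers "shape"

matrix = labeling_confusion_matrix([shape, color], [pred_a, pred_b])
print(matrix)
print([max(row) for row in matrix])  # best match per ground truth labeling
```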

3)

One notable feature of the ClustPy package is the ability to run various modern deep clustering algorithms out of the box. For example, the following code runs the DEC algorithm on the Optdigits dataset. To evaluate the result, we compute the adjusted Rand index (ari).

```python
from clustpy.deep import DEC
from clustpy.data import load_optdigits
from sklearn.metrics import adjusted_rand_score as ari

data, labels = load_optdigits(return_X_y=True)
# Optdigits contains the ten digits 0-9
dec = DEC(10)
dec.fit(data)
my_ari = ari(labels, dec.labels_)
print(my_ari)
```
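For readers unfamiliar with the adjusted Rand index, the following sketch computes it from pair counts in plain Python. It follows the standard formula (agreeing pairs, corrected for the number expected by chance) and is meant as an illustration, not as a replacement for sklearn's adjusted_rand_score; the function name is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from pair counts."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs of samples that share a cluster in both, the first, or the second labeling
    sum_joint = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (sum_joint - expected) / (max_index - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Unlike the clustering accuracy, the index needs no matching of cluster ids and can become negative when two labelings agree less than expected by chance.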

4)

In this more complex example, we use ClustPy's evaluation functions, which automatically run the specified algorithms multiple times on previously defined datasets. All results of the given metrics are stored in a Pandas dataframe.

```python
from clustpy.utils import EvaluationDataset, EvaluationAlgorithm, EvaluationMetric, evaluate_multiple_datasets
from clustpy.partition import ProjectedDipMeans, SubKmeans
from sklearn.metrics import normalized_mutual_info_score as nmi, silhouette_score
from sklearn.cluster import KMeans, DBSCAN
from clustpy.data import load_breast_cancer, load_iris, load_wine
from clustpy.metrics import unsupervised_clustering_accuracy as acc
from sklearn.decomposition import PCA
import numpy as np


# Preprocessing methods that can be applied per dataset or per algorithm
def reduce_dimensionality(X, dims):
    pca = PCA(dims)
    X_new = pca.fit_transform(X)
    return X_new


def znorm(X):
    return (X - np.mean(X)) / np.std(X)


def minmax(X):
    return (X - np.min(X)) / (np.max(X) - np.min(X))


datasets = [
    EvaluationDataset("Breast_pca_znorm", data=load_breast_cancer,
                      preprocess_methods=[reduce_dimensionality, znorm],
                      preprocess_params=[{"dims": 0.9}, {}], ignore_algorithms=["pdipmeans"]),
    EvaluationDataset("Iris_pca", data=load_iris, preprocess_methods=reduce_dimensionality,
                      preprocess_params={"dims": 0.9}),
    EvaluationDataset("Wine", data=load_wine),
    EvaluationDataset("Wine_znorm", data=load_wine, preprocess_methods=znorm)]

algorithms = [
    EvaluationAlgorithm("SubKmeans", SubKmeans, {"n_clusters": None}),
    EvaluationAlgorithm("pdipmeans", ProjectedDipMeans, {}),  # Determines n_clusters automatically
    EvaluationAlgorithm("dbscan", DBSCAN, {"eps": 0.01, "min_samples": 5},
                        preprocess_methods=minmax, deterministic=True),
    EvaluationAlgorithm("kmeans", KMeans, {"n_clusters": None}),
    EvaluationAlgorithm("kmeans_minmax", KMeans, {"n_clusters": None}, preprocess_methods=minmax)]

metrics = [EvaluationMetric("NMI", nmi), EvaluationMetric("ACC", acc),
           EvaluationMetric("Silhouette", silhouette_score, metric_type="internal")]

# Run every algorithm five times on every dataset and aggregate the results
df = evaluate_multiple_datasets(datasets, algorithms, metrics, n_repetitions=5,
                                aggregation_functions=[np.mean, np.std, np.max, np.min],
                                add_runtime=True, add_n_clusters=True, save_path=None,
                                save_intermediate_results=False)
print(df)
```
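Conceptually, such an evaluation repeats a non-deterministic run several times and summarizes the resulting scores. The sketch below shows this repeat-and-aggregate pattern with the standard library only. It is not how ClustPy implements its evaluation functions; repeat_and_aggregate and toy_run are hypothetical names, and toy_run merely stands in for "fit an algorithm and compute a metric".

```python
import random
import statistics


def repeat_and_aggregate(run_once, n_repetitions=5, seed=0):
    """Call run_once with a fresh seed per repetition and summarize the scores."""
    rng = random.Random(seed)
    scores = [run_once(rng.randrange(2**32)) for _ in range(n_repetitions)]
    return {
        "mean": statistics.fmean(scores),
        "std": statistics.pstdev(scores),  # population std, like np.std by default
        "max": max(scores),
        "min": min(scores),
    }


def toy_run(run_seed):
    """Stand-in for a clustering run: returns a pseudo-random score in [0.6, 0.9]."""
    return random.Random(run_seed).uniform(0.6, 0.9)


summary = repeat_and_aggregate(toy_run)
print(summary)
```

Deriving one seed per repetition from a single master seed keeps the whole experiment reproducible while still varying the individual runs.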

Citation

If you use the ClustPy package in the context of a scientific publication, please cite it as follows:

Leiber, C., Miklautz, L., Plant, C., Böhm, C. (2023, December). Benchmarking Deep Clustering Algorithms With ClustPy. 2023 IEEE International Conference on Data Mining Workshops (ICDMW). [DOI]

BibTeX:

@inproceedings{leiber2023benchmarking,
  title = {Benchmarking Deep Clustering Algorithms With ClustPy},
  author = {Leiber, Collin and Miklautz, Lukas and Plant, Claudia and Böhm, Christian},
  booktitle = {2023 IEEE International Conference on Data Mining Workshops (ICDMW)},
  year = {2023},
  pages = {625-632},
  publisher = {IEEE},
  doi = {10.1109/ICDMW60847.2023.00087}
}

Publications using ClustPy

Application of Deep Clustering Algorithms (CIKM 10/2023)
Benchmarking Deep Clustering Algorithms With ClustPy (ICDMW 12/2023)
Data with Density-Based Clusters: A Generator for Systematic Evaluation of Clustering Algorithms (ECML PKDD 08/2024)
Statistical Modeling of Univariate Multimodal Data (arXiv 12/2024)
SHADE: Deep Density-based Clustering (ICDM 12/2024)
Dying Clusters Is All You Need - Deep Clustering With an Unknown Number of Clusters (ICDMW 12/2024)
A Symmetric Purity Measure for Clustering Comparison (Annals of Data Science 04/2025)
Breaking the Reclustering Barrier in Centroid-based Deep Clustering (ICLR 04/2025)

Owner

Name: Collin Leiber
Login: collinleiber
Kind: user
Location: Munich
Company: Ludwig-Maximilians-Universität München

Repositories: 4
Profile: https://github.com/collinleiber

GitHub Events

Total

Issues event: 4
Watch event: 37
Issue comment event: 8
Push event: 20
Pull request review comment event: 2
Pull request review event: 5
Pull request event: 9
Fork event: 4
Create event: 3

Last Year

Issues event: 4
Watch event: 37
Issue comment event: 8
Push event: 20
Pull request review comment event: 2
Pull request review event: 5
Pull request event: 9
Fork event: 4
Create event: 3

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 57
Total pull requests: 49
Average time to close issues: 6 months
Average time to close pull requests: 13 days
Total issue authors: 12
Total pull request authors: 10
Average comments per issue: 0.44
Average comments per pull request: 0.53
Merged pull requests: 45
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 5
Pull requests: 7
Average time to close issues: 2 months
Average time to close pull requests: 6 days
Issue authors: 3
Pull request authors: 2
Average comments per issue: 1.8
Average comments per pull request: 0.71
Merged pull requests: 5
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

collinleiber (21)
Lumik7 (19)
timoklein (4)
Han007 (2)
randomn4me (2)
max6457 (1)
kwonsik0404 (1)
AndyRandy (1)
rpsdm (1)
Fulin-Gao (1)
Philloraptor (1)
jaanisfehling (1)

Pull Request Authors

collinleiber (30)
Lumik7 (10)
zoidberg77 (3)
AndyRandy (3)
julianSchilcher (2)
lor-enz (2)
randomn4me (2)
timoklein (1)
rpsdm (1)
FridoSa (1)

Top Labels

Issue Labels

enhancement (25) bug (2)

Pull Request Labels

enhancement (1)

Packages

Total packages: 1
Total downloads:
- pypi 97 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 4
Total maintainers: 1

pypi.org: clustpy

A Python library for advanced clustering algorithms

Homepage: https://clustpy.readthedocs.io/en/latest/
Documentation: https://clustpy.readthedocs.io/
License: BSD-3-Clause License
Latest release: 0.0.2
published almost 2 years ago

Versions: 4
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 97 Last month

Rankings

Stargazers count: 9.8%

Dependent packages count: 10.1%

Forks count: 11.9%

Average: 16.0%

Dependent repos count: 21.6%

Downloads: 26.7%

Maintainers (1)

cleiber

Last synced: 11 months ago

clustpy

Science Score: 49.0%
