Fast k-medoids Clustering in Rust and Python

Fast k-medoids Clustering in Rust and Python - Published in JOSS (2022)

https://github.com/kno10/python-kmedoids

Science Score: 77.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 10 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
    2 of 3 committers (66.7%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.0%) to scientific vocabulary

Scientific Fields

Artificial Intelligence and Machine Learning Computer Science - 37% confidence
Last synced: 4 months ago · JSON representation ·

Repository

Fast K-Medoids clustering in Python with FasterPAM

Basic Info
Statistics
  • Stars: 84
  • Watchers: 3
  • Forks: 7
  • Open Issues: 0
  • Releases: 11
Created almost 5 years ago · Last pushed about 1 year ago
Metadata Files
Readme Changelog Funding License Citation

README.md

k-Medoids Clustering in Python with FasterPAM

PyPI version Conda Version Conda Platforms

This python package implements k-medoids clustering with PAM and variants of clustering by direct optimization of the (Medoid) Silhouette. It can be used with arbitrary dissimilarites, as it requires a dissimilarity matrix as input.

This software package has been introduced in JOSS:

Erich Schubert and Lars Lenssen
Fast k-medoids Clustering in Rust and Python
Journal of Open Source Software 7(75), 4183
https://doi.org/10.21105/joss.04183 (open access)

For further details on the implemented algorithm FasterPAM, see:

Erich Schubert, Peter J. Rousseeuw
Fast and Eager k-Medoids Clustering:
O(k) Runtime Improvement of the PAM, CLARA, and CLARANS Algorithms
Information Systems (101), 2021, 101804
https://doi.org/10.1016/j.is.2021.101804 (open access)

an earlier (slower, and now obsolete) version was published as:

Erich Schubert, Peter J. Rousseeuw:
Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms
In: 12th International Conference on Similarity Search and Applications (SISAP 2019), 171-187.
https://doi.org/10.1007/978-3-030-32047-8_16
Preprint: https://arxiv.org/abs/1810.05691

This is a port of the original Java code from ELKI to Rust. The Rust version is then wrapped for use with Python.

For further details on medoid Silhouette clustering with automatic cluster number selection (FasterMSC, DynMSC), see:

Lars Lenssen, Erich Schubert:
Medoid silhouette clustering with automatic cluster number selection
Information Systems (120), 2024, 102290
https://doi.org/10.1016/j.is.2023.102290
Preprint: https://arxiv.org/abs/2309.03751

the basic FasterMSC method was first published as:

Lars Lenssen, Erich Schubert:
Clustering by Direct Optimization of the Medoid Silhouette
In: 15th International Conference on Similarity Search and Applications (SISAP 2022)
https://doi.org/10.1007/978-3-031-17849-8_15

If you use this code in scientific work, please cite above papers. Thank you.

Documentation

Full python documentation is included, and available on python-kmedoids.readthedocs.io

Installation

Installation with pip or conda

Pre-built packages for many Linux, Windows, and OSX systems are available in PyPI and conda-forge can be installed with * pip install kmedoids respectively * conda install -c conda-forge kmedoids.

On uncommon architectures, you may need to first install Cargo (i.e., the Rust programming language) first, and a subsequent pip install kmedoids will try to compile the package for your CPU architecture and operating system.

Compilation from source

You need to have Python 3 installed.

Unless you already have Rust, install Rust/Cargo.

Installation uses maturin for compiling and installing the Rust extension. Maturin is best used within a Python virtual environment: ```sh

activate your desired virtual environment first, then:

pip install maturin git clone https://github.com/kno10/python-kmedoids.git cd python-kmedoids

build and install the package:

maturin develop --release Integration test to validate the installation. sh pip install numpy python -m unittest discover tests ```

This procedure uses the latest git version from https://github.com/kno10/rust-kmedoids. If you want to use local modifications to the Rust code, you need to provide the source folder of the Rust module in Cargo.toml by setting the path= option of the kmedoids dependency.

Example

Given a distance matrix distmatrix, cluster into k = 5 clusters:

python import kmedoids c = kmedoids.fasterpam(distmatrix, 5) print("Loss is:", c.loss)

Using the sklearn-compatible API

Note that KMedoids defaults to the "precomputed" metric, expecting a pairwise distance matrix. If you have sklearn installed, you can also use metric="euclidean" and other distances supported by sklearn.

python import kmedoids km = kmedoids.KMedoids(5, method='fasterpam') c = km.fit(distmatrix) print("Loss is:", c.inertia_)

MNIST (10k samples)

python import kmedoids, numpy, time from sklearn.datasets import fetch_openml from sklearn.metrics.pairwise import euclidean_distances X, _ = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False) X = X[:10000] diss = euclidean_distances(X) start = time.time() fp = kmedoids.fasterpam(diss, 100) print("FasterPAM took: %.2f ms" % ((time.time() - start)*1000)) print("Loss with FasterPAM:", fp.loss) start = time.time() pam = kmedoids.pam(diss, 100) print("PAM took: %.2f ms" % ((time.time() - start)*1000)) print("Loss with PAM:", pam.loss)

Choose the optimal number of clusters

This package includes DynMSC, an algorithm that optimizes the Medoid Silhouette, and chooses the "optimal" number of clusters in a range of kmin..kmax. Beware that if you allow a too large kmax, the optimum result will likely have many one-elemental clusters. A too high kmax may mask more desirable results, hence it is recommended that you choose only 2-3 times the number of clusters you expect as maximum.

python import kmedoids, numpy from sklearn.datasets import fetch_openml from sklearn.metrics.pairwise import euclidean_distances X, _ = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False) X = X[:10000] diss = euclidean_distances(X) kmin, kmax = 10, 20 dm = kmedoids.dynmsc(diss, kmax, kmin) print("Optimal number of clusters according to the Medoid Silhouette:", dm.bestk) print("Medoid Silhouette over range of k:", dm.losses) print("Range of k:", dm.rangek)

Full Colab notebook example.

Memory Requirements

Because the algorithms require a distance matrix as input, you need O(N²) memory to use these implementations. With single precision, this matrix needs 4·N² bytes, so a typical laptop with 8 GB of RAM could handle data sets of over 40.000 instances, but if your computation of the distance matrix incurs copying the matrix, only 30.000 or less may be feasible.

The majority of run time usually is the distance matrix computation, so it is recommended you only compute it once, then experiment with different algorithm settings. Avoid recomputing it repeatedly.

For larger data sets, it is recommended to only cluster a representative sample of the data. Usually, this will still yield sufficient result quality.

Implemented Algorithms

  • FasterPAM (Schubert and Rousseeuw, 2020, 2021)
  • FastPAM1 (Schubert and Rousseeuw, 2019, 2021)
  • PAM (Kaufman and Rousseeuw, 1987) with BUILD and SWAP
  • Alternating optimization (k-means-style algorithm)
  • Silhouette index for evaluation (Rousseeuw, 1987)
  • FasterMSC (Lenssen and Schubert, 2022)
  • FastMSC (Lenssen and Schubert, 2022)
  • DynMSC (Lenssen and Schubert, 2023)
  • PAMSIL (Van der Laan and Pollard, 2003)
  • PAMMEDSIL (Van der Laan and Pollard, 2003)
  • Medoid Silhouette index for evaluation (Van der Laan and Pollard, 2003)

Note that the k-means-like algorithm for k-medoids tends to find much worse solutions.

Contributing to python-kmedoids

Third-party contributions are welcome. Please use pull requests to submit patches.

Reporting issues

Please report errors as an issue within the repository's issue tracker.

Support requests

If you need help, please submit an issue within the repository's issue tracker.

License: GPL-3 or later

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.

Owner

  • Name: Erich Schubert
  • Login: kno10
  • Kind: user
  • Location: Dortmund, Germany
  • Company: TU Dortmund University

Professor of Data Mining. Working on: Outlier Detection; Cluster Analysis; Text Mining; Intrinsic Dimensionality; High-Dimensional Data

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Schubert
    given-names: Erich
    orcid: "https://orcid.org/0000-0001-9143-4880"
  - family-names: Lenssen
    given-names: Lars
    orcid: "https://orcid.org/0000-0003-0037-0418"
title: "Fast k-medoids Clustering in Rust and Python"
journal: "J. Open Source Softw."
doi: 10.21105/joss.04183
version: 0.5.3
date-released: 2024-12-02
license: GPL-3.0
preferred-citation:
  title: "Fast k-medoids Clustering in Rust and Python"
  year: "2022"
  type: article
  journal: "J. Open Source Softw."
  doi: 10.21105/joss.04183
  authors:
    - family-names: Schubert
      given-names: Erich
    - family-names: Lenssen
      given-names: Lars
references:
- title: "Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms"
  doi: "10.1007/978-3-030-32047-8_16"
  year: "2019"
  type: conference-paper
  conference: "Similarity Search and Applications - 12th International Conference, SISAP 2019, Newark, NJ, USA, October 2-4, 2019, Proceedings"
  authors:
    - family-names: Schubert
      given-names: Erich
    - family-names: Rousseeuw
      given-names: Peter J.
- title: "Fast and eager k-medoids clustering: O(k) runtime improvement of the PAM, CLARA, and CLARANS algorithms"
  doi: "10.1016/j.is.2021.101804"
  year: "2021"
  type: article
  journal: "Inf. Syst."
  volume: 101
  start: 101804
  authors:
    - family-names: Schubert
      given-names: Erich
    - family-names: Rousseeuw
      given-names: Peter J.
- title: "Clustering by Direct Optimization of the Medoid Silhouette"
  doi: "10.1007/978-3-031-17849-8_15"
  year: "2022"
  type: conference-paper
  conference: "Similarity Search and Applications - 15th International Conference, SISAP 2022, Bologna, Italy, October 5-7, 2022, Proceedings"
  authors:
    - family-names: Lenssen
      given-names: Lars
    - family-names: Schubert
      given-names: Erich
- title: "Medoid silhouette clustering with automatic cluster number selection"
  doi: "10.1016/j.is.2023.102290"
  year: "2024"
  type: article
  journal: "Inf. Syst."
  volume: 120
  start: 102290
  authors:
    - family-names: Lenssen
      given-names: Lars
    - family-names: Schubert
      given-names: Erich

GitHub Events

Total
  • Create event: 2
  • Release event: 1
  • Issues event: 5
  • Watch event: 16
  • Delete event: 1
  • Issue comment event: 4
  • Push event: 11
  • Fork event: 1
Last Year
  • Create event: 2
  • Release event: 1
  • Issues event: 5
  • Watch event: 16
  • Delete event: 1
  • Issue comment event: 4
  • Push event: 11
  • Fork event: 1

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 90
  • Total Committers: 3
  • Avg Commits per committer: 30.0
  • Development Distribution Score (DDS): 0.322
Past Year
  • Commits: 11
  • Committers: 1
  • Avg Commits per committer: 11.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Erich Schubert e****t@t****e 61
lale1009 l****n@t****e 26
David Muhr m****d@g****m 3
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 22
  • Total pull requests: 8
  • Average time to close issues: 13 days
  • Average time to close pull requests: 12 days
  • Total issue authors: 17
  • Total pull request authors: 6
  • Average comments per issue: 3.09
  • Average comments per pull request: 1.75
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 4
  • Pull requests: 0
  • Average time to close issues: 4 days
  • Average time to close pull requests: N/A
  • Issue authors: 3
  • Pull request authors: 0
  • Average comments per issue: 2.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • timClicks (4)
  • davnn (2)
  • khieu (2)
  • apcamargo (2)
  • Theylikeme (1)
  • j-adamczyk (1)
  • bzm3r (1)
  • hpelin (1)
  • awarebayes (1)
  • rjustindavis (1)
  • matudor (1)
  • heshpdx (1)
  • RoyiAvital (1)
  • finloop (1)
  • AlexHermansson (1)
Pull Request Authors
  • davnn (3)
  • RoyiAvital (1)
  • lale1009 (1)
  • rjustindavis (1)
  • bitsnaps (1)
  • danielskatz (1)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 19,697 last-month
  • Total dependent packages: 4
    (may contain duplicates)
  • Total dependent repositories: 6
    (may contain duplicates)
  • Total versions: 19
  • Total maintainers: 2
pypi.org: kmedoids

k-Medoids Clustering in Python with FasterPAM

  • Versions: 17
  • Dependent Packages: 4
  • Dependent Repositories: 6
  • Downloads: 19,567 Last month
Rankings
Downloads: 3.5%
Dependent packages count: 4.8%
Dependent repos count: 6.0%
Average: 7.4%
Stargazers count: 9.4%
Forks count: 13.3%
Maintainers (1)
Last synced: 4 months ago
pypi.org: kmedoids-debug

k-Medoids Clustering in Python with FasterPAM

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 130 Last month
Rankings
Dependent packages count: 6.6%
Stargazers count: 10.0%
Downloads: 14.1%
Forks count: 14.5%
Average: 15.2%
Dependent repos count: 30.6%
Maintainers (1)
Last synced: 9 months ago

Dependencies

Cargo.toml cargo
  • ndarray 0.15
  • numpy 0.16
  • pyo3 0.16.5
  • rand 0.8
  • rayon 1.5
  • rustkmedoids 0.3.3
docs/requirements.txt pypi
  • kmedoids *
  • scikit-learn *
requirements-dev.txt pypi
  • maturin >=0.12
  • pip >=21.3
  • pytest >=3.5.0
requirements.txt pypi
  • numpy *
.github/workflows/wheels.yml actions
  • actions-rs/toolchain v1 composite
  • actions/checkout v3 composite
  • actions/download-artifact v3 composite
  • actions/setup-python v4 composite
  • actions/upload-artifact v3 composite
  • pypa/gh-action-pypi-publish release/v1 composite
pyproject.toml pypi