k-means-constrained

K-Means clustering - constrained with minimum and maximum cluster size. Documentation: https://joshlk.github.io/k-means-constrained

https://github.com/joshlk/k-means-constrained

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.1%) to scientific vocabulary

Keywords

clustering k-means kmeans-constrained maximum-cluster-sizes minimum-cluster-sizes ml optimization python
Last synced: 6 months ago · JSON representation ·

Repository

K-Means clustering - constrained with minimum and maximum cluster size. Documentation: https://joshlk.github.io/k-means-constrained

Basic Info
Statistics
  • Stars: 214
  • Watchers: 5
  • Forks: 45
  • Open Issues: 1
  • Releases: 2
Topics
clustering k-means kmeans-constrained maximum-cluster-sizes minimum-cluster-sizes ml optimization python
Created over 7 years ago · Last pushed 8 months ago
Metadata Files
Readme License Citation

README.md

PyPI Python Build Documentation

k-means-constrained

K-means clustering implementation whereby a minimum and/or maximum size for each cluster can be specified.

This K-means implementation modifies the cluster assignment step (E in EM) by formulating it as a Minimum Cost Flow (MCF) linear network optimisation problem. This is then solved using a cost-scaling push-relabel algorithm and uses Google's Operations Research tools's SimpleMinCostFlow which is a fast C++ implementation.

This package is inspired by Bradley et al.. The original Minimum Cost Flow (MCF) network proposed by Bradley et al. has been modified so maximum cluster sizes can also be specified along with minimum cluster size.

The code is based on scikit-lean's KMeans and implements the same API with modifications.

Ref: 1. Bradley, P. S., K. P. Bennett, and Ayhan Demiriz. "Constrained k-means clustering." Microsoft Research, Redmond (2000): 1-8. 2. Google's SimpleMinCostFlow C++ implementation

Installation

You can install the k-means-constrained from PyPI:

pip install k-means-constrained

It is supported on Python 3.10, 3.11, 3.12 and 3.13. Previous versions of k-means-constrained support older versions of Python and Numpy.

Example

More details can be found in the API documentation.

```python

from kmeansconstrained import KMeansConstrained import numpy as np X = np.array([[1, 2], [1, 4], [1, 0], ... [4, 2], [4, 4], [4, 0]]) clf = KMeansConstrained( ... nclusters=2, ... sizemin=2, ... sizemax=5, ... randomstate=0 ... ) clf.fitpredict(X) array([0, 0, 0, 1, 1, 1], dtype=int32) clf.clustercenters_ array([[ 1., 2.], [ 4., 2.]]) clf.labels_ array([0, 0, 0, 1, 1, 1], dtype=int32) ```

Code only ``` from k_means_constrained import KMeansConstrained import numpy as np X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]]) clf = KMeansConstrained( n_clusters=2, size_min=2, size_max=5, random_state=0 ) clf.fit_predict(X) clf.cluster_centers_ clf.labels_ ```

Time complexity and runtime

k-means-constrained is a more complex algorithm than vanilla k-means and therefore will take longer to execute and has worse scaling characteristics.

Given a number of data points $n$ and clusters $c$, the time complexity of: * k-means: $\mathcal{O}(nc)$ * k-means-constrained1: $\mathcal{O}((n^3c+n^2c^2+nc^3)\log(n+c)))$

This assumes a constant number of algorithm iterations and data-point features/dimensions.

If you consider the case where $n$ is the same order as $c$ ($n \backsim c$) then: * k-means: $\mathcal{O}(n^2)$ * k-means-constrained1: $\mathcal{O}(n^4\log(n)))$

Below is a runtime comparison between k-means and k-means-constrained whereby the number of iterations, initializations, multi-process pool size and dimension size are fixed. The number of clusters is also always one-tenth the number of data points $n=10c$. It is shown above that the runtime is independent of the minimum or maximum cluster size, and so none is included below.

Data-points vs execution time for k-means vs k-means-constrained. Data-points=10*clusters. No min/max constraints

System details * OS: Linux-5.15.0-75-generic-x86_64-with-glibc2.35 * CPU: AMD EPYC 7763 64-Core Processor * CPU cores: 120 * k-means-constrained version: 0.7.3 * numpy version: 1.24.2 * scipy version: 1.11.1 * ortools version: 9.6.2534 * joblib version: 1.3.1 * sklearn version: 1.3.0

1: Ortools states the time complexity of their cost-scaling push-relabel algorithm for the min-cost flow problem as $\mathcal{O}(n^2m\log(nC))$ where $n$ is the number of nodes, $m$ is the number of edges and $C$ is the maximum absolute edge cost.

Change log

  • v0.7.6 (2025-06-30) Add Python v3.13 and Linux ARM support.
  • v0.7.5 fix comment in README on Python version that is supported
  • v0.7.4 compatible with Numpy +v2.1.1. Added Python 3.12 support and dropped Python 3.8 and 3.9 support (due to Numpy). Linux ARM support has been dropped as we use GitHub runners to build the package and ARM machines was being emulated using QEMU. This however was producing numerical errors. GitHub should natively support Ubuntu ARM images soon and then we can start to re-build them.
  • v0.7.3 compatible with Numpy v1.23.0 to 1.26.4

Citations

If you use this software in your research, please use the following citation:

@software{Levy-Kramer_k-means-constrained_2018, author = {Levy-Kramer, Josh}, month = apr, title = {{k-means-constrained}}, url = {https://github.com/joshlk/k-means-constrained}, year = {2018} }

Owner

  • Name: Josh Levy-Kramer
  • Login: joshlk
  • Kind: user
  • Location: London, UK

ML Research Engineer. Views are my own.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Levy-Kramer"
  given-names: "Josh"
  orcid: "https://orcid.org/0000-0002-4350-6197"
title: "k-means-constrained"
date-released: 2018-04-23
url: "https://github.com/joshlk/k-means-constrained"

GitHub Events

Total
  • Issues event: 8
  • Watch event: 26
  • Delete event: 1
  • Issue comment event: 7
  • Push event: 17
  • Pull request event: 3
  • Fork event: 3
  • Create event: 1
Last Year
  • Issues event: 8
  • Watch event: 26
  • Delete event: 1
  • Issue comment event: 7
  • Push event: 17
  • Pull request event: 3
  • Fork event: 3
  • Create event: 1

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 248
  • Total Committers: 2
  • Avg Commits per committer: 124.0
  • Development Distribution Score (DDS): 0.004
Past Year
  • Commits: 51
  • Committers: 1
  • Avg Commits per committer: 51.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Josh Levy-Kramer j****h@l****k 247
Esmail e****l 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 49
  • Total pull requests: 13
  • Average time to close issues: 3 months
  • Average time to close pull requests: about 1 month
  • Total issue authors: 47
  • Total pull request authors: 4
  • Average comments per issue: 2.69
  • Average comments per pull request: 0.92
  • Merged pull requests: 9
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 4
  • Pull requests: 2
  • Average time to close issues: 7 days
  • Average time to close pull requests: less than a minute
  • Issue authors: 4
  • Pull request authors: 1
  • Average comments per issue: 0.25
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • amks1 (2)
  • mosikh (2)
  • calebhallinan (1)
  • ryanpeel (1)
  • avranil26 (1)
  • meteve (1)
  • joamatab (1)
  • esmail (1)
  • PeterIlinovich (1)
  • drcandacemakedamoore (1)
  • ismail-khan-tajir (1)
  • adinarayanaPalvadi (1)
  • nsitlokeshjain (1)
  • GiuliaFrigerio (1)
  • Jhellewaard (1)
Pull Request Authors
  • joshlk (10)
  • bibblybobblyben (2)
  • AdityaSavara (1)
  • esmail (1)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 71,361 last-month
  • Total docker downloads: 1,097
  • Total dependent packages: 7
  • Total dependent repositories: 9
  • Total versions: 14
  • Total maintainers: 1
pypi.org: k-means-constrained

K-Means clustering constrained with minimum and maximum cluster size

  • Versions: 14
  • Dependent Packages: 7
  • Dependent Repositories: 9
  • Downloads: 71,361 Last month
  • Docker Downloads: 1,097
Rankings
Dependent packages count: 1.2%
Downloads: 1.4%
Docker downloads count: 2.0%
Average: 3.6%
Dependent repos count: 4.9%
Stargazers count: 5.4%
Forks count: 6.7%
Maintainers (1)
Last synced: 6 months ago

Dependencies

requirements-dev.txt pypi
  • bump2version * development
  • cython >=0.29 development
  • nose * development
  • numpydoc * development
  • pandas >=1.0.4 development
  • pytest >=5.1 development
  • scikit-learn >=0.24.2 development
  • setuptools * development
  • sphinx * development
  • sphinx-rtd-theme * development
  • twine * development
  • wheel * development
requirements.txt pypi
  • joblib *
  • numpy >=1.22.0
  • ortools >=9.0.9048
  • scipy >=1.6.3
  • six *