banditpam

BanditPAM C++ implementation and Python package

https://github.com/motiwari/banditpam

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 20 committers (5.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.9%) to scientific vocabulary

Keywords

clustering machine-learning python
Last synced: 6 months ago · JSON representation

Repository

BanditPAM C++ implementation and Python package

Basic Info
  • Host: GitHub
  • Owner: motiwari
  • License: mit
  • Language: C++
  • Default Branch: main
  • Homepage:
  • Size: 73.9 MB
Statistics
  • Stars: 652
  • Watchers: 9
  • Forks: 43
  • Open Issues: 77
  • Releases: 17
Topics
clustering machine-learning python
Created over 5 years ago · Last pushed 6 months ago
Metadata Files
Readme License

README.md

BanditPAM: Almost Linear-Time $k$-Medoids Clustering

Linux - build package and run tests Linux - build source distribution and wheels Mac ARM64 - build package and run tests Mac ARM64 - Run CMake Build and Tests Mac Intel - build package and run tests Mac Intel - Run CMake Build and Tests MacOS - build wheels pages-build-deployment R-CMD-check.yaml Run style checks

This repo contains a high-performance implementation of BanditPAM from BanditPAM: Almost Linear-Time k-Medoids Clustering and BanditPAM++: Faster k-medoids Clustering. The code can be called directly from Python, R, or C++.

If you use this software, please cite:

Mo Tiwari, Martin Jinye Zhang, James Mayclin, Sebastian Thrun, Chris Piech, Ilan Shomorony. "BanditPAM: Almost Linear Time k-medoids Clustering via Multi-Armed Bandits" Advances in Neural Information Processing Systems (NeurIPS) 2020.

Mo Tiwari, Ryan Kang, Donghyun Lee, Sebastian Thrun, Chris Piech, Ilan Shomorony, Martin Jinye Zhang. "BanditPAM++: Faster k-medoids Clustering" Advances in Neural Information Processing Systems (NeurIPS) 2023.

```python @inproceedings{tiwari2020banditpam, title={BanditPAM: Almost Linear Time $k$-medoids Clustering via Multi-Armed Bandits}, author={Tiwari, Mo and Zhang, Martin J and Mayclin, James and Thrun, Sebastian and Piech, Chris and Shomorony, Ilan}, booktitle={Advances in Neural Information Processing Systems}, pages={368--374}, year={2020} }

@inproceedings{tiwari2023banditpam++, title={BanditPAM++: Faster $k$-medoids Clustering}, author={Tiwari, Mo and Kang, Ryan and Lee, Donghyun and Thrun, Sebastian and Shomorony, Ilan and Zhang, Martin J}, journal={Advances in Neural Information Processing Systems}, volume={36}, pages={73371--73382}, year={2023} } ```

Requirements

TL;DR run python -m pip install banditpam or install.packages(banditpam) and jump to the examples.

If you have any difficulties, please see the platform-specific guides and file a Github issue if you have additional trouble.

Further Reading

Python Quickstart

Install the repo and its dependencies:

This can be done either through PyPI (recommended) python /BanditPAM/: python -m pip install -r requirements.txt /BanditPAM/: python -m pip install banditpam OR through the source code via python /BanditPAM/: git submodule update --init --recursive /BanditPAM/: cd headers/carma /BanditPAM/: mkdir build && cd build && cmake -DCARMA_INSTALL_LIB=ON .. && sudo cmake --build . --config Release --target install /BanditPAM/: cd ../../.. /BanditPAM/: python -m pip install -r requirements.txt /BanditPAM/: sudo python -m pip install .

Example 1: Synthetic data from a Gaussian Mixture Model

```python from banditpam import KMedoids import numpy as np import matplotlib.pyplot as plt

Generate data from a Gaussian Mixture Model with the given means:

np.random.seed(0) npercluster = 40 means = np.array([[0,0], [-5,5], [5,5]]) X = np.vstack([np.random.randn(npercluster, 2) + mu for mu in means])

Fit the data with BanditPAM:

kmed = KMedoids(n_medoids=3, algorithm="BanditPAM") kmed.fit(X, 'L2')

print(kmed.average_loss) # prints 1.2482391595840454 print(kmed.labels) # prints cluster assignments [0] * 40 + [1] * 40 + [2] * 40

Visualize the data and the medoids:

for pidx, point in enumerate(X): if pidx in map(int, kmed.medoids): plt.scatter(X[pidx, 0], X[pidx, 1], color='red', s = 40) else: plt.scatter(X[pidx, 0], X[pidx, 1], color='blue', s = 10)

plt.show() ```

png

Example 2: MNIST and its medoids visualized via t-SNE

```python

Start in the repository root directory, i.e. '/BanditPAM/'.

from banditpam import KMedoids import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn.manifold import TSNE

Load the 1000-point subset of MNIST and calculate its t-SNE embeddings for visualization:

X = pd.readcsv('data/MNIST1k.csv', sep=' ', header=None).tonumpy() Xtsne = TSNE(ncomponents=2).fittransform(X)

Fit the data with BanditPAM:

kmed = KMedoids(n_medoids=10, algorithm="BanditPAM") kmed.fit(X, 'L2')

Visualize the data and the medoids via t-SNE:

for pidx, point in enumerate(X): if pidx in map(int, kmed.medoids): plt.scatter(Xtsne[pidx, 0], Xtsne[pidx, 1], color='red', s = 40) else: plt.scatter(Xtsne[pidx, 0], Xtsne[pidx, 1], color='blue', s = 5)

plt.show() ```

R Examples

Please see here.

Documentation

Documentation for BanditPAM can be found on read the docs.

Building the C++ executable from source

Please note that it is NOT necessary to build the C++ executable from source to use the Python code above. However, if you would like to use the C++ executable directly, follow the instructions below.

Option 1: Building with Docker

We highly recommend building using Docker. One can download and install Docker by following instructions at the Docker install page. Once you have Docker installed and the Docker Daemon is running, run the following commands:

/BanditPAM/scripts/docker$ chmod +x env_setup.sh /BanditPAM/scripts/docker$ ./env_setup.sh /BanditPAM/scripts/docker$ ./run_docker.sh

which will start a Docker instance with the necessary dependencies. Then:

/BanditPAM$ mkdir build && cd build /BanditPAM/build$ cmake .. && make

This will create an executable named BanditPAM in BanditPAM/build/src.

Option 2: Installing requirements and building directly

Building this repository requires four external requirements: * CMake >= 3.17 * Armadillo >= 10.5.3 * OpenMP >= 2.5 (OpenMP is supported by default on most Linux platforms, and can be downloaded through homebrew on MacOS) * CARMA >= 0.6.2

If installing these requirements from source, one can generally use the following procedure to install each requirement from the library's root folder (with armadillo used as an example here): /armadillo$ mkdir build && cd build /armadillo/build$ cmake .. && make && sudo make install

Note that CARMA has different installation instructions; see its instructions.

Platform-specific installation guides

Further installation information for MacOS, Linux, and Windows is available in the docs folder. Ensure all the requirements above are installed and then run:

/BanditPAM$ mkdir build && cd build /BanditPAM/build$ cmake .. && make

This will create an executable named BanditPAM in BanditPAM/build/src.

C++ Usage

Once the executable has been built, it can be invoked with: /BanditPAM/build/src/BanditPAM -f [path/to/input.csv] -k [number of clusters]

  • -f is mandatory and specifies the path to the dataset
  • -k is mandatory and specifies the number of clusters with which to fit the data

For example, if you ran ./env_setup.sh and downloaded the MNIST dataset, you could run:

/BanditPAM/build/src/BanditPAM -f ../data/MNIST_1k.csv -k 10

The expected output in the command line will be: Medoids: 694,168,306,714,324,959,527,251,800,737

Implementing a custom distance metric

One of the advantages of $k$-medoids is that it works with arbitrary distance metrics; in fact, your "metric" need not even be a real metric -- it can be negative, asymmetric, and/or not satisfy the triangle inequality or homogeneity. Any pairwise dissimilarity function works with $k$-medoids.

This also allows for clustering of "exotic" objects like trees, graphs, natural language, and more -- settings where running $k$-means wouldn't even make sense. We talk about one such setting in the full paper.

The package currently supports a number of distance metrics, including all $L_p$ losses and cosine distance.

If you're willing to write a little C++, you only need to add a few lines to kmedoids_algorithm.cpp and kmedoids_algorithm.hpp to implement your distance metric / pairwise dissimilarity!

Then, be sure to re-install the repository with a python -m pip install . (note the trailing .).

The maintainers of this repository are working on permitting arbitrary dissimilarity metrics that users write in Python, as well; see #4.

Testing

To run the full suite of tests, run in the root directory:

/BanditPAM$ python -m unittest discover -s tests

Alternatively, to run a "smaller" set of tests, from the main repo folder run python tests/test_smaller.py or python tests/test_larger.py to run a set of longer, more intensive tests.

Credits

Mo Tiwari wrote the original Python implementation of BanditPAM and many features of the C++ implementation. Mo and Adarsh Kumarappan now maintain the implementations.

James Mayclin developed the initial C++ implementation of BanditPAM.

The original BanditPAM paper was published by Mo Tiwari, Martin Jinye Zhang, James Mayclin, Sebastian Thrun, Chris Piech, and Ilan Shomorony.

We would like to thank Jerry Quinn, David Durst, Geet Sethi, and Max Horton for helpful guidance regarding the C++ implementation.

Owner

  • Name: Mo Tiwari
  • Login: motiwari
  • Kind: user
  • Location: Stanford, CA
  • Company: Stanford University

Ph.D. Student in Artificial Intelligence/Machine Learning at Stanford University

GitHub Events

Total
  • Create event: 7
  • Issues event: 1
  • Release event: 3
  • Watch event: 9
  • Delete event: 1
  • Issue comment event: 8
  • Push event: 91
  • Pull request review event: 2
  • Pull request event: 17
  • Fork event: 2
Last Year
  • Create event: 7
  • Issues event: 1
  • Release event: 3
  • Watch event: 9
  • Delete event: 1
  • Issue comment event: 8
  • Push event: 91
  • Pull request review event: 2
  • Pull request event: 17
  • Fork event: 2

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 1,016
  • Total Committers: 20
  • Avg Commits per committer: 50.8
  • Development Distribution Score (DDS): 0.373
Past Year
  • Commits: 246
  • Committers: 10
  • Avg Commits per committer: 24.6
  • Development Distribution Score (DDS): 0.533
Top Committers
Name Email Commits
motiwari m****b@g****m 637
Adarsh321123 a****7@g****m 91
esfrankel e****7@g****m 79
Thomas Cheung t****g@T****l 58
James Mayclin 2****n 53
Mo Tiwari m****c@g****m 36
Luke Donghyun Lee l****i@L****l 14
Balasubramanian Narasimhan b****s@g****m 11
dhaval737 d****7@g****m 10
System Administrator r****t@L****l 5
Ben-H3 b****a@i****m 4
Luke Donghyun Lee l****i@D****t 3
Luke Donghyun Lee l****i@L****l 3
Prasoon Tiwari d****k@g****m 3
Ubuntu u****u@i****l 2
Ubuntu u****u@i****l 2
Adarsh Kumarappan 5****3 2
Ying Sheng s****5@g****m 1
Mo Tiwari m****i@s****u 1
Luke Donghyun Lee l****i@D****t 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 96
  • Total pull requests: 34
  • Average time to close issues: 4 months
  • Average time to close pull requests: 21 days
  • Total issue authors: 19
  • Total pull request authors: 9
  • Average comments per issue: 1.31
  • Average comments per pull request: 0.97
  • Merged pull requests: 20
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 8
  • Average time to close issues: N/A
  • Average time to close pull requests: 7 minutes
  • Issue authors: 1
  • Pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 1.13
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • motiwari (66)
  • kno10 (5)
  • lukeleeai (3)
  • Sushil-Thapa (2)
  • Adarsh321123 (2)
  • retzerjj (2)
  • afaghiri (1)
  • m-muaz (1)
  • SehongOh (1)
  • tanweer-mahdi (1)
  • kayuksel (1)
  • Dobatymo (1)
  • bnaras (1)
  • KeyCode17 (1)
  • tazitoo (1)
Pull Request Authors
  • motiwari (21)
  • Chris-Alex-7 (6)
  • Adarsh321123 (5)
  • bnaras (4)
  • lukeleeai (4)
  • mikldk (2)
  • dasvision0212 (2)
  • eltociear (1)
  • aumrp77 (1)
Top Labels
Issue Labels
feature (3) v4.0.4 (2) release (1) documentation (1) OpenMP (1) v4.0.3 (1)
Pull Request Labels

Packages

  • Total packages: 2
  • Total downloads:
    • cran 234 last-month
    • pypi 2,992 last-month
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 1
    (may contain duplicates)
  • Total versions: 22
  • Total maintainers: 2
pypi.org: banditpam

BanditPAM: A state-of-the-art, high-performance k-medoids algorithm.

  • Versions: 19
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 2,992 Last month
Rankings
Stargazers count: 2.5%
Forks count: 6.8%
Downloads: 8.2%
Average: 9.8%
Dependent packages count: 10.1%
Dependent repos count: 21.6%
Maintainers (1)
Last synced: 6 months ago
cran.r-project.org: banditpam

Almost Linear-Time k-Medoids Clustering

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 234 Last month
Rankings
Stargazers count: 1.5%
Forks count: 3.6%
Average: 28.2%
Dependent packages count: 29.8%
Dependent repos count: 35.5%
Downloads: 70.7%
Maintainers (1)
Last synced: 6 months ago

Dependencies

requirements.txt pypi
  • matplotlib >=3.2.1
  • myst-parser >=0.16.0
  • numpy >=1.16.2
  • pandas >=0.24.1
  • pybind11 >=2.5.0
.github/workflows/build_linux_wheels.yml actions
  • actions/checkout v2 composite
  • actions/download-artifact v2 composite
  • actions/upload-artifact v2 composite
  • pypa/cibuildwheel v2.3.1 composite
  • pypa/gh-action-pypi-publish v1.4.2 composite
.github/workflows/build_mac_wheels.yml actions
  • actions/checkout v2 composite
  • actions/download-artifact v2 composite
  • actions/upload-artifact v2 composite
  • pypa/cibuildwheel v2.3.1 composite
  • pypa/gh-action-pypi-publish v1.4.2 composite
.github/workflows/run_linux_tests.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/run_mac_tests.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/run_style_checks.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/run_windows_tests.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
scripts/docker/Dockerfile docker
  • quay.io/pypa/manylinux1_x86_64 latest build