https://github.com/btraven00/clustering_example

A clustering omnibenchmark

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
○ DOI references
✓ Academic publication links (links to: sciencedirect.com)
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity (low similarity, 10.5%, to scientific vocabulary)

Last synced: 10 months ago · JSON representation

Repository

A clustering omnibenchmark

Basic Info

Host: GitHub
Owner: btraven00
License: mit
Language: Shell
Default Branch: main
Homepage:
Size: 132 KB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 1
Releases: 0

Fork of omnibenchmark/clustering_example

Created over 1 year ago · Last pushed about 1 year ago

https://github.com/btraven00/clustering_example/blob/main/

A clustering example for omnibenchmark

# How to run

1. Install omnibenchmark: `pip install "omnibenchmark>=0.2.0"` (the quotes keep the shell from treating `>` as a redirection)
2. Clone the benchmark definition in this repository with `git clone https://github.com/omnibenchmark/clustering_example`
3. Move into the cloned folder: `cd clustering_example`
4. Run locally, with the desired degree of parallelism:
   `ob run benchmark -b <Clustering_*.yml> --local --cores 6`.
   Replace `<Clustering_*.yml>` with the specification that matches the backend you want to run (conda, easybuild or apptainer). [More details about the available backends and how to build or enable them](https://github.com/omnibenchmark/clustering_example/blob/main/envs/README.md).
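
The steps above can be collected into one script. This is a minimal sketch, not something shipped with the repository: `BENCHMARK_YAML`, `CORES`, `DRY_RUN`, `run` and `run_steps` are hypothetical names introduced here, and no default specification file is assumed. By default the commands are only printed.

```shell
#!/bin/sh
# Sketch of the "How to run" steps; all variable and function names here are
# illustrative assumptions, not part of omnibenchmark or this repository.
set -eu

BENCHMARK_YAML="${BENCHMARK_YAML:-}"  # one of the Clustering_*.yml files (you choose)
CORES="${CORES:-6}"                   # degree of parallelism
DRY_RUN="${DRY_RUN:-1}"               # 1 = only print commands, 0 = execute them

# Print a command and, unless this is a dry run, execute it.
run() {
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

run_steps() {
  if [ -z "$BENCHMARK_YAML" ]; then
    echo "set BENCHMARK_YAML to one of the Clustering_*.yml files" >&2
    return 1
  fi
  # The requirement is quoted so the shell does not read '>' as a redirection.
  run pip install "omnibenchmark>=0.2.0"
  run git clone https://github.com/omnibenchmark/clustering_example
  run cd clustering_example
  run ob run benchmark -b "$BENCHMARK_YAML" --local --cores "$CORES"
}
```

To execute the steps for real, set `BENCHMARK_YAML` to your chosen specification and `DRY_RUN=0`, then call `run_steps`.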

# Software backends and variants

* All needed recipes can be found under `envs`: conda, apptainer, easybuild (lmod modules)
* The `_smoketest` variants are meant for [quick testing](https://en.wikipedia.org/wiki/Smoke_testing_(software))
* The default `apptainer` container fetches images from an online registry.
* `apptainer-vanilla` references a container image with stock Python (`3.12`).
* `apptainer-optimized` references a container image with a custom-compiled Python (`3.12.9`), to check whether optimization flags have a noticeable effect.
* `envmodules` requires you to first build the `.eb` easyconfigs with EasyBuild. We plan to make these modules publicly available in the future.
* `conda` environments fetch software from the configured conda channels and PyPI. Nothing is compiled: pre-built binaries are downloaded, provided those channels have a build for your architecture.

[More info in the envs/ folder](https://github.com/omnibenchmark/clustering_example/blob/main/envs/README.md).


# Summary

- Data. Example datasets (not a comprehensive list; there are >79 of them):
  - https://github.com/imallona/clustbench_data
    - args: ["--dataset_generator", "mnist", "--dataset_name", "fashion"]
    - args: ["--dataset_generator", "other", "--dataset_name", "iris"]
    - args: ["--dataset_generator", "mnist", "--dataset_name", "digits"]
    - args: ["--dataset_generator", "wut", "--dataset_name", "circles"]
- Method families/providers (they include several methods each)
  - https://github.com/imallona/clustbench_fastcluster
    - args: ["--linkage", "complete"]
    - args: ["--linkage", "ward"]
    - args: ["--linkage", "average"]
    - args: ["--linkage", "weighted"]
    - args: ["--linkage", "median"]
    - args: ["--linkage", "centroid"]
  - https://github.com/imallona/clustbench_sklearn
    - args: ["--method", "birch"]
    - args: ["--method", "kmeans"]
    - args: ["--method", "spectral"] ## too slow
    - args: ["--method", "gm"]
  - https://github.com/imallona/clustbench_agglomerative
    - args: ["--linkage", "average"]
    - args: ["--linkage", "complete"]
    - args: ["--linkage", "ward"]
  - https://github.com/imallona/clustbench_genieclust
    - args: ["--method", "genie", "--gini_threshold", 0.5]
    - args: ["--method", "gic"]
    - args: ["--method", "ica"]
  - https://github.com/imallona/clustbench_fcps
    - args: ["--method", "FCPS_Minimax"]
    - args: ["--method", "FCPS_MinEnergy"]
    - args: ["--method", "FCPS_HDBSCAN_2"]
    - args: ["--method", "FCPS_HDBSCAN_4"]
    - args: ["--method", "FCPS_HDBSCAN_8"]
    - args: ["--method", "FCPS_Diana"]
    - args: ["--method", "FCPS_Fanny"]
    - args: ["--method", "FCPS_Hardcl"]
    - args: ["--method", "FCPS_Softcl"]
    - args: ["--method", "FCPS_Clara"]
    - args: ["--method", "FCPS_PAM"]
- Metric providers (several metrics)
  - https://github.com/imallona/clustbench_metrics
    - args: ["--metric", "normalized_clustering_accuracy"]
    - args: ["--metric", "adjusted_fm_score"]
    - args: ["--metric", "adjusted_mi_score"]
    - args: ["--metric", "adjusted_rand_score"]
    - args: ["--metric", "fm_score"]
    - args: ["--metric", "mi_score"]
    - args: ["--metric", "normalized_clustering_accuracy"]
    - args: ["--metric", "normalized_mi_score"]
    - args: ["--metric", "normalized_pivoted_accuracy"]
    - args: ["--metric", "pair_sets_index"]
    - args: ["--metric", "rand_score"]
- Metric collector
  - https://github.com/imallona/clustering_report
- Daniel's modules (independent of clustbench)
  - https://github.com/omnibenchmark-example/iris.git
  - https://github.com/omnibenchmark-example/penguins.git
  - https://github.com/omnibenchmark-example/kmeans.git
  - https://github.com/omnibenchmark-example/ward.git
  - https://github.com/omnibenchmark-example/ari.git
  - https://github.com/omnibenchmark-example/accuracy.git


# Omnibenchmark YAML generation

The current repo has base templates for different runs.
Install [yq](https://github.com/mikefarah/yq) and run `make generate` if you want to modify the base template in your tests.

# Warnings

Note that we try to run each clustering with the true number of clusters ± 2. When the true number is small, e.g. k=3, values below 2 are replaced by k=2, so we run `k=2, k=2, k=3, k=4, k=5`. Some values are therefore computed more than once; this keeps runtimes comparable across datasets, regardless of their true number of clusters.
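
The rule above can be sketched as a small shell function. This is one reading of it, under the assumption that any candidate below 2 is simply replaced by 2; the helper name `k_values` is hypothetical and not taken from the clustbench modules.

```shell
# Sketch: print the five k values tried for a dataset with a given true k.
# Assumption: any value below 2 is replaced by 2, so five runs always happen.
k_values() {
  ks=""
  for offset in -2 -1 0 1 2; do
    k=$(( $1 + $offset ))
    # Clamp invalid cluster counts (k < 2) up to 2.
    if [ "$k" -lt 2 ]; then k=2; fi
    ks="${ks:+$ks }$k"
  done
  printf '%s\n' "$ks"
}
```

For a dataset with ten true clusters, `k_values 10` prints `8 9 10 11 12`; nothing is clamped, but the number of runs is the same as for a dataset with very few clusters.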

Also, the modules by Daniel are not fully incorporated into Gagolewski's flow.

# Data disclaimer

Some datasets are commented out to speed up calculations.

From [Are cluster validity measures (in)valid?](https://www.sciencedirect.com/science/article/pii/S0020025521010082):

> The original benchmark battery consists of 79 data instances, however 16 datasets are accompanied by labels that yield ; they were omitted for their computation would be too lengthy (namely: mnist/digits, mnist/fashion, other/chameleon_t7_10k, other/chameleon_t8_8k, sipu/a1, sipu/a2, sipu/a3, sipu/birch1, sipu/birch2, sipu/d31, sipu/s1, sipu/s2, sipu/s3, sipu/s4, sipu/worms_2, sipu/worms_64). Also uci/glass has been removed as one of its 25-near-neighbour graphs connected components was too small for the NN-based methods to succeed. This leaves us with 62 datasets in total, see Table 1.

A YAML such as [0a88c91](https://github.com/omnibenchmark/clustering_example/blob/0a88c910bbda62d1b593f4215a682770227f39ff/Clustering.yaml), run with 30 cores, should complete about half of the benchmark in ~4 h and reach 97% completion in ~8 h.

# Clustbench attribution

by Marek Gagolewski, modified by Izaskun Mallona

Owner

Name: btraven
Login: btraven00
Kind: user

Repositories: 1
Profile: https://github.com/btraven00

GitHub Events

Total

Push event: 52
Pull request event: 1
Create event: 7

Last Year

Push event: 52
Pull request event: 1
Create event: 7
