https://github.com/btraven00/omnibenchmark_paper


Repository

Basic Info

Host: GitHub
Owner: btraven00
License: mit
Default Branch: imallona
Size: 144 KB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Fork of omnibenchmark/omnibenchmark_paper

Created about 1 year ago · Last pushed about 1 year ago

https://github.com/btraven00/omnibenchmark_paper/blob/imallona/

A clustering example for omnibenchmark

# How to run

1. Install omnibenchmark using [our tutorial](https://omnibenchmark.org/tutorial/)
2. Clone the benchmark definition / this repository with `git clone git@github.com:omnibenchmark/clustering_example.git`
3. Move to the cloned repository `cd clustering_example`
4. Run locally with some parallelism: `ob run benchmark -b CLUSTERING.YAML --local --threads 6`. Replace `CLUSTERING.YAML` with the `Clustering.yml` specification that matches your software backend (conda, easybuild, apptainer, etc.). [More details about the available backends](https://github.com/omnibenchmark/clustering_example/blob/main/envs/README.md).

# Clustbench attribution

by Marek Gagolewski, modified by Izaskun Mallona

# Data disclaimer

Some datasets are commented out to speed up calculations.

From [Are cluster validity measures (in) valid?](https://www.sciencedirect.com/science/article/pii/S0020025521010082):

> The original benchmark battery consists of 79 data instances, however 16 datasets are accompanied by labels that yield ; they were omitted for their computation would be too lengthy (namely: mnist/digits, mnist/fashion, other/chameleon_t7_10k, other/chameleon_t8_8k, sipu/a1, sipu/a2, sipu/a3, sipu/birch1, sipu/birch2, sipu/d31, sipu/s1, sipu/s2, sipu/s3, sipu/s4, sipu/worms_2, sipu/worms_64). Also uci/glass has been removed as one of its 25-near-neighbour graphs connected components was too small for the NN-based methods to succeed. This leaves us with 62 datasets in total, see Table 1.
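
The dataset counts in the quote can be checked with a short sketch. The dataset names are copied from the quote; the variable names are hypothetical and only serve this illustration.

```python
# Datasets the quoted paper omitted because computation would be too lengthy.
OMITTED_FOR_RUNTIME = [
    "mnist/digits", "mnist/fashion",
    "other/chameleon_t7_10k", "other/chameleon_t8_8k",
    "sipu/a1", "sipu/a2", "sipu/a3",
    "sipu/birch1", "sipu/birch2", "sipu/d31",
    "sipu/s1", "sipu/s2", "sipu/s3", "sipu/s4",
    "sipu/worms_2", "sipu/worms_64",
]
# Dataset removed because of the nearest-neighbour graph issue.
REMOVED_FOR_NN_GRAPH = ["uci/glass"]
ORIGINAL_BATTERY_SIZE = 79

remaining = ORIGINAL_BATTERY_SIZE - len(OMITTED_FOR_RUNTIME) - len(REMOVED_FOR_NN_GRAPH)
print(len(OMITTED_FOR_RUNTIME), remaining)  # 16 62
```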

With 30 cores, a YAML specification such as [0a88c91](https://github.com/omnibenchmark/clustering_example/blob/0a88c910bbda62d1b593f4215a682770227f39ff/Clustering.yaml) should complete about half of the benchmark in ~4 h and reach 97% completion in ~8 h.

# Summary

- Data. Example datasets (not a comprehensive list; there are >79 of them):
  - https://github.com/imallona/clustbench_data 
    - args: ["--dataset_generator", "mnist", "--dataset_name", "fashion"]
    - args: ["--dataset_generator", "other", "--dataset_name", "iris"]
    - args: ["--dataset_generator", "mnist", "--dataset_name", "digits"]
    - args: ["--dataset_generator", "wut", "--dataset_name", "circles"]
- Method families/providers (they include several methods each)
  - https://github.com/imallona/clustbench_fastcluster
    - args: ["--linkage", "complete"]
    - args: ["--linkage", "ward"]
    - args: ["--linkage", "average"]
    - args: ["--linkage", "weighted"]
    - args: ["--linkage", "median"]
    - args: ["--linkage", "centroid"]
  - https://github.com/imallona/clustbench_sklearn 
    - args: ["--method", "birch"]
    - args: ["--method", "kmeans"]
    - args: ["--method", "spectral"] ## too slow
    - args: ["--method", "gm"]
  - https://github.com/imallona/clustbench_agglomerative
    - args: ["--linkage", "average"]
    - args: ["--linkage", "complete"]
    - args: ["--linkage", "ward"]
  - https://github.com/imallona/clustbench_genieclust
    - args: ["--method", "genie", "--gini_threshold", 0.5]
    - args: ["--method", "gic"]
    - args: ["--method", "ica"]
  - https://github.com/imallona/clustbench_fcps
    - args: ["--method", "FCPS_Minimax"]
    - args: ["--method", "FCPS_MinEnergy"]
    - args: ["--method", "FCPS_HDBSCAN_2"]
    - args: ["--method", "FCPS_HDBSCAN_4"]
    - args: ["--method", "FCPS_HDBSCAN_8"]
    - args: ["--method", "FCPS_Diana"]
    - args: ["--method", "FCPS_Fanny"]
    - args: ["--method", "FCPS_Hardcl"]
    - args: ["--method", "FCPS_Softcl"]
    - args: ["--method", "FCPS_Clara"]
    - args: ["--method", "FCPS_PAM"]
- Metric providers (several metrics)
  - https://github.com/imallona/clustbench_metrics
    - args: ["--metric", "normalized_clustering_accuracy"]
    - args: ["--metric", "adjusted_fm_score"]
    - args: ["--metric", "adjusted_mi_score"]
    - args: ["--metric", "adjusted_rand_score"]
    - args: ["--metric", "fm_score"]
    - args: ["--metric", "mi_score"]
    - args: ["--metric", "normalized_mi_score"]
    - args: ["--metric", "normalized_pivoted_accuracy"]
    - args: ["--metric", "pair_sets_index"]
    - args: ["--metric", "rand_score"]
- Metric collector
  - https://github.com/imallona/clustering_report
- Modules by Daniel (independent of clustbench)
  - https://github.com/omnibenchmark-example/iris.git
  - https://github.com/omnibenchmark-example/penguins.git
  - https://github.com/omnibenchmark-example/kmeans.git
  - https://github.com/omnibenchmark-example/ward.git
  - https://github.com/omnibenchmark-example/ari.git
  - https://github.com/omnibenchmark-example/accuracy.git
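
Each `args` entry above reads as a list of command-line flags handed to the corresponding module. As a minimal sketch, assuming a module parses them with Python's `argparse` (the real modules may do this differently; `build_parser` is a hypothetical name), the genieclust entry could be consumed like this:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the flag names come from the args lists above.
    parser = argparse.ArgumentParser(description="Illustrative method-module entrypoint")
    parser.add_argument("--method", type=str, required=True)
    # Optional numeric parameter; on the command line it arrives as a string.
    parser.add_argument("--gini_threshold", type=float, default=None)
    return parser


# Parse an explicit list instead of sys.argv so the example is self-contained.
args = build_parser().parse_args(["--method", "genie", "--gini_threshold", "0.5"])
print(args.method, args.gini_threshold)
```

Flags that a given entry does not set, such as `--gini_threshold` for `gic`, fall back to `None` in this sketch.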
  
  
# Software backends

In `envs`: conda, apptainer, easybuild (lmod modules)

# Warnings

Note that we run each clustering with the true number of clusters ± 2. When the true number is small, e.g. k=3, values below k=2 are not valid, so we run `k=2, k=2, k=3, k=4, k=5`: we fill with k=2 as needed and recompute the same value multiple times, so that runtimes are comparable across datasets regardless of their true number of clusters.

Also, the modules by Daniel are not yet fully incorporated into Gagolewski's flow.
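
The rule for choosing the number of clusters can be sketched as follows. This is an illustration under assumptions, not the modules' actual code: the function name `candidate_ks` and its parameters are hypothetical, and it applies the ± 2 rule literally, replacing any value below 2 with 2.

```python
def candidate_ks(true_k: int, spread: int = 2, min_k: int = 2) -> list[int]:
    """Return true_k - spread .. true_k + spread, raising values below min_k to min_k.

    The list always has 2 * spread + 1 entries, so every dataset triggers the
    same number of clustering runs even when some values are repeated.
    """
    return [max(min_k, k) for k in range(true_k - spread, true_k + spread + 1)]


print(candidate_ks(3))  # [2, 2, 3, 4, 5]
print(candidate_ks(5))  # [3, 4, 5, 6, 7]
```

Keeping the repeated values, rather than deduplicating them, is what keeps the number of runs fixed at five per dataset.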

Owner

Name: btraven
Login: btraven00
Kind: user

Repositories: 1
Profile: https://github.com/btraven00

