gap-stat

Dynamically get the suggested clusters in the data for unsupervised learning.

https://github.com/milesgranger/gap_statistic

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 6 committers (16.7%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.2%) to scientific vocabulary

Keywords

cluster cluster-count clustering kmeans python scikit-learn unsupervised unsupervised-learning
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: milesgranger
  • License: unlicense
  • Language: Rust
  • Default Branch: master
  • Homepage:
  • Size: 392 KB
Statistics
  • Stars: 226
  • Watchers: 5
  • Forks: 47
  • Open Issues: 7
  • Releases: 15
Created almost 10 years ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog License

README.md

Python implementation of the Gap Statistic



Maintenance mode

I've lost interest/time in developing this further, other things have taken priority for some time now. However, all is not lost. I will be willing to review/comment on any issues/PRs but will not complete any fixes or feature requests myself.


Purpose

Dynamically identify the suggested number of clusters in a data-set using the gap statistic.


A full example is available in the example notebook in the repository.


Install:

Bleeding edge (GitHub no longer serves the unauthenticated git:// protocol, so install over https):

    pip install git+https://github.com/milesgranger/gap_statistic.git

PyPI:

    pip install --upgrade gap-stat

With Rust extension:

    pip install --upgrade gap-stat[rust]


Uninstall:

    pip uninstall gap-stat


Methodology:

This package provides several methods to assist in choosing the optimal number of clusters for a given dataset, based on the Gap method presented in "Estimating the number of clusters in a data set via the gap statistic" (Tibshirani et al.).
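For orientation, the quantities used in the rest of this section can be written as below, following the definitions in Tibshirani et al. The symbols are not taken from this README: W_k is the within-cluster dispersion of the data when clustered into k groups, W*_kb is the same quantity for the b-th random reference dataset, B is the number of reference datasets (n_refs in this package), and sd_k is the standard deviation of log W*_kb over b.

```latex
\mathrm{Gap}(k) = \frac{1}{B}\sum_{b=1}^{B} \log W^{*}_{kb} \;-\; \log W_k,
\qquad
s(k) = \mathrm{sd}_k \,\sqrt{1 + \frac{1}{B}}
```

The Gap* variant of Mohajer, Englmeier and Schmid uses the same construction without the logarithms, that is, the mean of W*_kb minus W_k.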

The methods implemented can cluster a given dataset using a range of provided k values, and provide you with statistics that can help in choosing the right number of clusters for your dataset. Three possible methods are:

  • Taking the k maximizing the Gap value, which is calculated for each k. This, however, might not always be possible, as for many datasets this value is monotonically increasing or decreasing.
  • Taking the smallest k such that Gap(k) >= Gap(k+1) - s(k+1). This is the method suggested in Tibshirani et al. (consult the paper for details). The measure diff = Gap(k) - Gap(k+1) + s(k+1) is calculated for each k; the parallel here, then, is to take the smallest k for which diff is positive. Note that in some cases this can be true for the entire range of k.
  • Taking the k maximizing the Gap* value, an alternative measure suggested in "A comparison of Gap statistic definitions with and with-out logarithm function" by Mohajer, Englmeier and Schmid. The authors claim this measure avoids the over-estimation of the number of clusters from which the original Gap statistic suffers, and can also suggest an optimal value for k in cases where Gap cannot. They do warn, however, that the original Gap statistic performs better than Gap* when clusters overlap, precisely because of its tendency to overestimate the number of clusters.

Note that none of the above methods is guaranteed to find an optimal value for k, and that they often contradict one another. Rather, they can provide more information on which to base your choice of k, which should take numerous other factors into account.
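The three rules above can be sketched as plain array operations. This is a minimal illustration, not the package's code: `suggest_k` is a hypothetical helper, the input numbers are made up, and the second rule uses `>= 0` to match the inequality Gap(k) >= Gap(k+1) - s(k+1).

```python
import numpy as np


def suggest_k(ks, gap, sk, gap_star):
    """Apply the three selection rules to per-k statistics.

    All arguments are equal-length 1-D sequences ordered by increasing k.
    Returns one suggestion per rule (None when the second rule never fires).
    """
    ks = np.asarray(ks)
    gap = np.asarray(gap, dtype=float)
    sk = np.asarray(sk, dtype=float)
    gap_star = np.asarray(gap_star, dtype=float)

    # Rule 1: the k with the largest Gap value.
    k_max_gap = int(ks[np.argmax(gap)])

    # Rule 2: the smallest k with diff = Gap(k) - Gap(k+1) + s(k+1) >= 0.
    # diff is undefined for the last k because it needs Gap(k+1).
    diff = gap[:-1] - gap[1:] + sk[1:]
    hits = np.flatnonzero(diff >= 0)
    k_tibshirani = int(ks[hits[0]]) if hits.size else None

    # Rule 3: the k with the largest Gap* value.
    k_max_gap_star = int(ks[np.argmax(gap_star)])

    return {
        "max_gap": k_max_gap,
        "tibshirani": k_tibshirani,
        "max_gap_star": k_max_gap_star,
    }


# Made-up statistics for k = 1..5, for illustration only.
print(suggest_k(
    ks=[1, 2, 3, 4, 5],
    gap=[0.10, 0.35, 0.80, 0.78, 0.82],
    sk=[0.02, 0.03, 0.03, 0.04, 0.05],
    gap_star=[5.0, 9.0, 14.0, 12.0, 11.0],
))
```

With these numbers the first rule picks a different k than the other two, which is the kind of disagreement the note above warns about.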


Use:

First, construct an OptimalK object. Optional initialization parameters are:

  • n_jobs - Splits computation into this number of parallel jobs. Requires choosing a parallel backend.
  • parallel_backend - Possible values are joblib, rust, or multiprocessing (the built-in Python backend). If parallel_backend == 'rust', all cores are used.
  • clusterer - Takes a custom clusterer function to be used when clustering. See the example notebook for more details.
  • clusterer_kwargs - Any keyword arguments to be forwarded to the custom clusterer function on each call.

An example initialization:

    from gap_statistic import OptimalK

    optimalK = OptimalK(n_jobs=4, parallel_backend='joblib')
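The clusterer parameter described above can be illustrated with a small NumPy-only k-means. This sketch assumes the convention used in the project's example notebook: the function receives the data and the cluster count and returns a tuple of (cluster centers, labels). Check the notebook before relying on that. `simple_kmeans` and its `n_iter` and `seed` keywords are hypothetical names chosen for this example.

```python
import numpy as np


def simple_kmeans(X, k, n_iter=20, seed=0):
    """Tiny Lloyd's-algorithm k-means returning (centers, labels).

    Assumed contract (not confirmed by this README): take the data and the
    number of clusters, return the cluster centers and a label per sample.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)

    # Start from k distinct data points chosen at random (indexing copies them).
    centers = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iter):
        # Assign every point to its nearest center (squared Euclidean distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its members; leave empty clusters alone.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)

    # Final assignment so the labels match the returned centers.
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, dists.argmin(axis=1)


# Hypothetical wiring, left commented out because it needs gap-stat installed:
# optimalK = OptimalK(clusterer=simple_kmeans, clusterer_kwargs={"n_iter": 50})
```

Keyword arguments such as `n_iter` would reach the function through clusterer_kwargs, which is forwarded on each call.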

After the object is created, it can be called like a function, and provided with a dataset for which the optimal K is found and returned. Parameters are:

  • X - A pandas dataframe or numpy array of data points of shape (n_samples, n_features).
  • n_refs - The number of random reference datasets to generate; their dispersion (inertia) is the baseline against which the dispersion of the actual data is compared. Optional.
  • cluster_array - A 1-dimensional iterable of integers; each representing n_clusters to try on the data. Optional.

For example:

    import numpy as np

    n_clusters = optimalK(X, cluster_array=np.arange(1, 15))
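What n_refs controls can be illustrated with a sketch of the reference-data step. Following Tibshirani et al., each reference dataset here is drawn uniformly over the bounding box of the observed features. Whether the package samples in exactly this way is an assumption, and `make_reference_sets` is a hypothetical helper, not part of the package.

```python
import numpy as np


def make_reference_sets(X, n_refs=3, seed=None):
    """Draw n_refs uniform reference datasets with the same shape as X.

    Each feature is sampled uniformly between its observed minimum and
    maximum. The result has shape (n_refs, n_samples, n_features).
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # lo and hi broadcast along the last axis, one range per feature.
    return rng.uniform(lo, hi, size=(n_refs,) + X.shape)


X = np.array([[0.0, 10.0], [2.0, 20.0], [1.0, 15.0]])
refs = make_reference_sets(X, n_refs=4, seed=0)
print(refs.shape)
```

Each reference set is then clustered for every candidate cluster count, so a larger n_refs gives a steadier baseline at a proportional cost in runtime.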

After performing the search procedure, a DataFrame of gap values and other useful statistics for each passed cluster count is available as the gap_df attribute of the OptimalK object:

    optimalK.gap_df.head()

The columns of the dataframe are:

  • n_clusters - The number of clusters for which the statistics in this row were calculated.
  • gap_value - The Gap value for this n.
  • gap* - The Gap* value for this n.
  • ref_dispersion_std - The standard deviation of the reference distributions for this n.
  • sk - The standard error of the Gap statistic for this n.
  • sk* - The standard error of the Gap* statistic for this n.
  • diff - The diff value for this n (see the methodology section for details).
  • diff* - The diff* value for this n (corresponding to the diff value for Gap*).
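How these columns relate to one another can be sketched from the definitions in the two papers cited in the methodology section. Given the dispersion of the clustered data and the dispersions of the reference datasets for a single n, the statistics follow directly. This is a sketch under stated assumptions, not the package's code: `gap_row` is a hypothetical helper, and treating ref_dispersion_std as the standard deviation of the raw reference dispersions is one reading of the column description.

```python
import numpy as np


def gap_row(dispersion, ref_dispersions):
    """Compute gap statistics for one cluster count.

    dispersion      -- within-cluster dispersion of the real data
    ref_dispersions -- dispersions of the reference datasets (length n_refs)
    """
    ref = np.asarray(ref_dispersions, dtype=float)
    n_refs = ref.size
    log_ref = np.log(ref)

    # Gap compares log dispersions; Gap* compares the raw dispersions.
    gap_value = log_ref.mean() - np.log(dispersion)
    gap_star = ref.mean() - dispersion

    # Standard deviations (population form, ddof=0) are inflated by
    # sqrt(1 + 1/n_refs) to account for the simulation error.
    inflate = np.sqrt(1.0 + 1.0 / n_refs)
    sk = log_ref.std() * inflate
    sk_star = ref.std() * inflate

    return {
        "gap_value": float(gap_value),
        "gap*": float(gap_star),
        "ref_dispersion_std": float(ref.std()),
        "sk": float(sk),
        "sk*": float(sk_star),
    }


# Made-up dispersions, for illustration only.
print(gap_row(40.0, [95.0, 102.0, 98.0]))
```

The diff and diff* columns then combine neighbouring rows, as described in the methodology section: diff for n uses the gap value at n and both the gap value and sk at n + 1.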

Additionally, the relation between the above measures and the number of clusters can be plotted by calling the OptimalK.plot_results() method (meant to be used inside a Jupyter Notebook or a similar IPython-based notebook), which draws four plots:

  • A plot of the Gap value versus n, the number of clusters.
  • A plot of diff versus n.
  • A plot of the Gap* value versus n, the number of clusters.
  • A plot of the diff* value versus n.

Owner

  • Name: Miles
  • Login: milesgranger
  • Kind: user
  • Location: Bergen, Norway
  • Company: Noetic AS

Just a happy engineer.

GitHub Events

Total
  • Watch event: 9
  • Issue comment event: 2
  • Fork event: 1
Last Year
  • Watch event: 9
  • Issue comment event: 2
  • Fork event: 1

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 129
  • Total Committers: 6
  • Avg Commits per committer: 21.5
  • Development Distribution Score (DDS): 0.047
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
| Name | Email | Commits |
|---|---|---|
| Miles Granger | m****3@g****m | 123 |
| Dmitry Vukolov | d****o@g****m | 2 |
| psads-git | 7****t | 1 |
| Shay Palachy | s****5 | 1 |
| Lev E. Givon | l****v@c****u | 1 |
| Alina Selega | a****a@g****m | 1 |

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 39
  • Total pull requests: 26
  • Average time to close issues: 3 months
  • Average time to close pull requests: 2 days
  • Total issue authors: 23
  • Total pull request authors: 6
  • Average comments per issue: 1.82
  • Average comments per pull request: 1.62
  • Merged pull requests: 23
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 3.0
  • Average comments per pull request: 3.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • milesgranger (8)
  • nry123 (4)
  • shaypal5 (4)
  • psads-git (2)
  • shahabh (2)
  • dvukolov (2)
  • Li-Xue (1)
  • lebedov (1)
  • sinanazeri (1)
  • supersaiyanesee (1)
  • johnvorsten (1)
  • kikiegoguma (1)
  • hanhanwu (1)
  • rakshita95 (1)
  • Charles-Z (1)
Pull Request Authors
  • milesgranger (19)
  • lebedov (3)
  • dvukolov (2)
  • alinaselega (1)
  • psads-git (1)
  • shaypal5 (1)
Top Labels
Issue Labels
enhancement (1) help wanted (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 1,019 last-month
  • Total dependent packages: 1
  • Total dependent repositories: 10
  • Total versions: 16
  • Total maintainers: 1
pypi.org: gap-stat

Python implementation of the gap statistic with optional Rust optimizations.

  • Versions: 16
  • Dependent Packages: 1
  • Dependent Repositories: 10
  • Downloads: 1,019 Last month
Rankings
Downloads: 4.3%
Dependent repos count: 4.6%
Stargazers count: 4.8%
Average: 6.0%
Forks count: 6.0%
Dependent packages count: 10.1%
Maintainers (1)
Last synced: 6 months ago

Dependencies

Cargo.lock cargo
  • aho-corasick 0.7.8
  • atty 0.2.14
  • autocfg 0.1.7
  • autocfg 1.0.0
  • bitflags 1.2.1
  • bstr 0.2.11
  • byteorder 1.3.4
  • cast 0.2.3
  • cfg-if 0.1.10
  • clap 2.33.0
  • cloudabi 0.0.3
  • criterion 0.2.11
  • criterion-plot 0.3.1
  • crossbeam-deque 0.7.2
  • crossbeam-epoch 0.8.0
  • crossbeam-queue 0.2.1
  • crossbeam-utils 0.7.0
  • csv 1.1.3
  • csv-core 0.1.10
  • ctor 0.1.12
  • either 1.5.3
  • fuchsia-cprng 0.1.1
  • ghost 0.1.1
  • hermit-abi 0.1.6
  • indoc 0.3.4
  • indoc-impl 0.3.4
  • inventory 0.1.5
  • inventory-impl 0.1.5
  • itertools 0.7.11
  • itertools 0.8.2
  • itoa 0.4.5
  • lazy_static 1.4.0
  • libc 0.2.66
  • matrixmultiply 0.1.15
  • memchr 2.3.2
  • memoffset 0.5.3
  • ndarray 0.12.1
  • ndarray-parallel 0.9.1
  • ndarray-rand 0.9.0
  • num-complex 0.2.4
  • num-traits 0.2.11
  • num_cpus 1.12.0
  • numpy 0.7.0
  • paste 0.1.6
  • paste-impl 0.1.6
  • proc-macro-hack 0.5.11
  • proc-macro2 1.0.8
  • pyo3 0.8.5
  • pyo3-derive-backend 0.8.5
  • pyo3cls 0.8.5
  • quote 1.0.2
  • rand 0.3.23
  • rand 0.4.6
  • rand 0.6.5
  • rand_chacha 0.1.1
  • rand_core 0.3.1
  • rand_core 0.4.2
  • rand_hc 0.1.0
  • rand_isaac 0.1.1
  • rand_jitter 0.1.4
  • rand_os 0.1.3
  • rand_pcg 0.1.2
  • rand_xorshift 0.1.1
  • rand_xoshiro 0.1.0
  • rawpointer 0.1.0
  • rayon 1.3.0
  • rayon-core 1.7.0
  • rdrand 0.4.0
  • regex 1.3.4
  • regex-automata 0.1.8
  • regex-syntax 0.6.14
  • rustc_version 0.2.3
  • ryu 1.0.2
  • same-file 1.0.6
  • scopeguard 1.0.0
  • semver 0.9.0
  • semver-parser 0.7.0
  • serde 1.0.104
  • serde_derive 1.0.104
  • serde_json 1.0.48
  • spin 0.5.2
  • statrs 0.9.0
  • syn 1.0.14
  • textwrap 0.11.0
  • thread_local 1.0.1
  • tinytemplate 1.0.3
  • unicode-width 0.1.7
  • unicode-xid 0.2.0
  • unindent 0.1.5
  • version_check 0.9.1
  • walkdir 2.3.1
  • winapi 0.3.8
  • winapi-i686-pc-windows-gnu 0.4.0
  • winapi-util 0.1.3
  • winapi-x86_64-pc-windows-gnu 0.4.0
Cargo.toml cargo
  • criterion 0.2 development
  • ndarray 0.12.0
  • ndarray-parallel 0.9.0
  • ndarray-rand 0.9.0
  • num-traits 0.2.4
  • numpy 0.7
  • pyo3 0.8
  • rand 0.6.0
  • rayon 1.0.1
  • statrs 0.9.0
setup.py pypi
  • numpy *
.github/workflows/python.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
.github/workflows/release.yml actions
  • actions-rs/toolchain v1 composite
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
.github/workflows/rust.yml actions
  • actions-rs/toolchain v1 composite
  • actions/checkout v2 composite
  • actions/setup-python v1 composite