gap-stat

Dynamically get the suggested clusters in the data for unsupervised learning.

https://github.com/milesgranger/gap_statistic

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
○
.zenodo.json file
○
DOI references
○
Academic publication links
✓
Committers with academic emails
1 of 6 committers (16.7%) from academic institutions
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (16.2%) to scientific vocabulary

Keywords

cluster cluster-count clustering kmeans python scikit-learn unsupervised unsupervised-learning

Last synced: 6 months ago

Repository

Dynamically get the suggested clusters in the data for unsupervised learning.

Basic Info

Host: GitHub
Owner: milesgranger
License: unlicense
Language: Rust
Default Branch: master
Homepage:
Size: 392 KB

Statistics

Stars: 226
Watchers: 5
Forks: 47
Open Issues: 7
Releases: 15

Topics

cluster cluster-count clustering kmeans python scikit-learn unsupervised unsupervised-learning

Created almost 10 years ago · Last pushed over 1 year ago

Metadata Files

Readme Changelog License

Python implementation of the Gap Statistic

Maintenance mode

I've lost interest/time in developing this further; other things have taken priority for some time now. However, all is not lost: I am willing to review/comment on any issues/PRs, but will not complete any fixes or feature requests myself.

Purpose

Dynamically identify the suggested number of clusters in a data-set using the gap statistic.

Full example available in a notebook HERE

Install:

Bleeding edge (GitHub no longer serves the unauthenticated git:// protocol, so the https:// form is used here):

    pip install git+https://github.com/milesgranger/gap_statistic.git

PyPI:

    pip install --upgrade gap-stat

With Rust extension:

    pip install --upgrade gap-stat[rust]

Uninstall:

    pip uninstall gap-stat

Methodology:

This package provides several methods to assist in choosing the optimal number of clusters for a given dataset, based on the Gap method presented in "Estimating the number of clusters in a data set via the gap statistic" (Tibshirani et al.).
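As a rough illustration of the quantity behind this method, the sketch below computes a Gap value and its standard error for a single k from within-cluster dispersions, using the definitions in Tibshirani et al.: Gap(k) = mean(log W_ref) - log W_k and s(k) = sd(log W_ref) * sqrt(1 + 1/B), where B is the number of reference datasets. The function name `gap_and_error` and the input numbers are invented for illustration; this is not the package's internal code.

```python
import math
from statistics import fmean, pstdev


def gap_and_error(dispersion, ref_dispersions):
    """Return (Gap(k), s(k)) for one value of k.

    dispersion:      within-cluster dispersion W_k of the real data
    ref_dispersions: W_k measured on each of the B random reference datasets
    """
    log_refs = [math.log(w) for w in ref_dispersions]
    b = len(log_refs)
    # Gap(k): expected log-dispersion under the reference minus the observed one.
    gap = fmean(log_refs) - math.log(dispersion)
    # Population standard deviation of log(W_ref), inflated by sqrt(1 + 1/B)
    # to account for simulation error in the reference draws.
    s_k = pstdev(log_refs) * math.sqrt(1 + 1 / b)
    return gap, s_k


# Hypothetical dispersions: the real data clusters more tightly than the references.
gap, s_k = gap_and_error(50.0, [100.0, 100.0, 100.0])
```

A larger Gap value means the real data is more tightly clustered at this k than data with no cluster structure would be.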

The methods implemented can cluster a given dataset using a range of provided k values, and provide you with statistics that can help in choosing the right number of clusters for your dataset. Three possible methods are:

Taking the k maximizing the Gap value, which is calculated for each k. This, however, might not always be possible, as for many datasets this value is monotonically increasing or decreasing.
Taking the smallest k such that Gap(k) >= Gap(k+1) - s(k+1). This is the method suggested in Tibshirani et al. (consult the paper for details). The measure diff = Gap(k) - Gap(k+1) + s(k+1) is calculated for each k, so this rule is equivalent to taking the smallest k for which diff is non-negative. Note that in some cases this holds for the entire range of k.
Taking the k maximizing the Gap* value, an alternative measure suggested in "A comparison of Gap statistic definitions with and with-out logarithm function" by Mohajer, Englmeier and Schmid. The authors claim this measure avoids the over-estimation of the number of clusters from which the original Gap statistic suffers, and can also suggest an optimal value for k in cases where Gap cannot. They do warn, however, that the original Gap statistic performs better than Gap* in the case of overlapping clusters, due to its tendency to overestimate the number of clusters.

Note that none of the above methods is guaranteed to find an optimal value for k, and that they often contradict one another. Rather, they can provide more information on which to base your choice of k, which should take numerous other factors into account.
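To make the first two selection rules concrete, here is a minimal sketch that applies them to precomputed Gap values and standard errors. The helper `suggest_k` and the numbers are hypothetical and not part of the package; the same function could be fed Gap* and its standard error to apply the third rule.

```python
def suggest_k(ks, gaps, sks):
    """Suggest k by two rules, given parallel sequences.

    ks:   candidate cluster counts, in ascending order
    gaps: Gap(k) for each candidate
    sks:  s(k), the standard error of Gap(k), for each candidate
    """
    # Rule 1: the k with the largest Gap value (first one on ties).
    best = max(range(len(ks)), key=lambda i: gaps[i])

    # Rule 2: the smallest k with Gap(k) >= Gap(k+1) - s(k+1),
    # i.e. the first k whose diff is non-negative. None if no k qualifies.
    tibshirani_k = None
    for i in range(len(ks) - 1):
        diff = gaps[i] - gaps[i + 1] + sks[i + 1]
        if diff >= 0:
            tibshirani_k = ks[i]
            break

    return {"max_gap": ks[best], "tibshirani": tibshirani_k}


# Made-up values in which Gap keeps increasing over the whole range.
ks = [1, 2, 3, 4, 5]
gaps = [0.10, 0.35, 0.60, 0.62, 0.70]
sks = [0.02, 0.03, 0.03, 0.05, 0.04]
result = suggest_k(ks, gaps, sks)
```

With these numbers the two rules disagree: the maximum Gap sits at the end of the range, while the standard-error rule stops earlier, which is the kind of contradiction the note above warns about.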

Use:

First, construct an OptimalK object. Optional initialization parameters are:

n_jobs - Splits computation into this number of parallel jobs. Requires choosing a parallel backend.
parallel_backend - Possible values are joblib, rust, or multiprocessing (the built-in Python backend). If parallel_backend == 'rust', all cores are used.
clusterer - Takes a custom clusterer function to be used when clustering. See the example notebook for more details.
clusterer_kwargs - Any keyword arguments to be forwarded to the custom clusterer function on each call.

An example initialization:

    from gap_statistic import OptimalK
    optimalK = OptimalK(n_jobs=4, parallel_backend='joblib')

Once created, the object can be called like a function on a dataset; it finds and returns the suggested number of clusters. Parameters are:

X - A pandas dataframe or numpy array of data points of shape (n_samples, n_features).
n_refs - The number of random reference datasets whose inertia is compared against that of the actual data. Optional.
cluster_array - A 1-dimensional iterable of integers; each representing n_clusters to try on the data. Optional.

For example:

    import numpy as np
    n_clusters = optimalK(X, cluster_array=np.arange(1, 15))

After the search procedure has run, a DataFrame of gap values and other useful statistics for each passed cluster count is available as the gap_df attribute of the OptimalK object:

    optimalK.gap_df.head()

The columns of the dataframe are:

n_clusters - The number of clusters for which the statistics in this row were calculated.
gap_value - The Gap value for this n.
gap* - The Gap* value for this n.
ref_dispersion_std - The standard deviation of the reference distributions for this n.
sk - The standard error of the Gap statistic for this n.
sk* - The standard error of the Gap* statistic for this n.
diff - The diff value for this n (see the methodology section for details).
diff* - The diff* value for this n (corresponding to the diff value for Gap*).

Additionally, the relation between the above measures and the number of clusters can be plotted by calling the OptimalK.plot_results() method (meant to be used inside a Jupyter Notebook or a similar IPython-based notebook), which draws four plots:

A plot of the Gap value versus n, the number of clusters.
A plot of diff versus n.
A plot of the Gap* value versus n, the number of clusters.
A plot of the diff* value versus n.

Owner

Name: Miles
Login: milesgranger
Kind: user
Location: Bergen, Norway
Company: Noetic AS

Repositories: 71
Profile: https://github.com/milesgranger

Just a happy engineer.

GitHub Events

Total

Watch event: 9
Issue comment event: 2
Fork event: 1

Last Year

Watch event: 9
Issue comment event: 2
Fork event: 1

Committers

Last synced: 9 months ago

All Time

Total Commits: 129
Total Committers: 6
Avg Commits per committer: 21.5
Development Distribution Score (DDS): 0.047

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Miles Granger	m**3@g**m	123
Dmitry Vukolov	d**o@g**m	2
psads-git	7****t	1
Shay Palachy	s****5	1
Lev E. Givon	l**v@c**u	1
Alina Selega	a**a@g**m	1

Committer Domains (Top 20 + Academic)

columbia.edu: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 39
Total pull requests: 26
Average time to close issues: 3 months
Average time to close pull requests: 2 days
Total issue authors: 23
Total pull request authors: 6
Average comments per issue: 1.82
Average comments per pull request: 1.62
Merged pull requests: 23
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 1
Average comments per issue: 3.0
Average comments per pull request: 3.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

milesgranger (8)
nry123 (4)
shaypal5 (4)
psads-git (2)
shahabh (2)
dvukolov (2)
Li-Xue (1)
lebedov (1)
sinanazeri (1)
supersaiyanesee (1)
johnvorsten (1)
kikiegoguma (1)
hanhanwu (1)
rakshita95 (1)
Charles-Z (1)

Pull Request Authors

milesgranger (19)
lebedov (3)
dvukolov (2)
alinaselega (1)
psads-git (1)
shaypal5 (1)

Top Labels

Issue Labels

enhancement (1) help wanted (1)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 1,019 last-month

Total dependent packages: 1
Total dependent repositories: 10
Total versions: 16
Total maintainers: 1

pypi.org: gap-stat

Python implementation of the gap statistic with optional Rust optimizations.

Homepage: https://github.com/milesgranger/gap_statistic
Documentation: https://gap-stat.readthedocs.io/
License: MIT
Latest release: 2.0.3
published over 2 years ago

Versions: 16
Dependent Packages: 1
Dependent Repositories: 10
Downloads: 1,019 Last month

Rankings

Downloads: 4.3%

Dependent repos count: 4.6%

Stargazers count: 4.8%

Average: 6.0%

Forks count: 6.0%

Dependent packages count: 10.1%

Maintainers (1)

milesg

Last synced: 6 months ago

Dependencies

Cargo.lock cargo

aho-corasick 0.7.8
atty 0.2.14
autocfg 0.1.7
autocfg 1.0.0
bitflags 1.2.1
bstr 0.2.11
byteorder 1.3.4
cast 0.2.3
cfg-if 0.1.10
clap 2.33.0
cloudabi 0.0.3
criterion 0.2.11
criterion-plot 0.3.1
crossbeam-deque 0.7.2
crossbeam-epoch 0.8.0
crossbeam-queue 0.2.1
crossbeam-utils 0.7.0
csv 1.1.3
csv-core 0.1.10
ctor 0.1.12
either 1.5.3
fuchsia-cprng 0.1.1
ghost 0.1.1
hermit-abi 0.1.6
indoc 0.3.4
indoc-impl 0.3.4
inventory 0.1.5
inventory-impl 0.1.5
itertools 0.7.11
itertools 0.8.2
itoa 0.4.5
lazy_static 1.4.0
libc 0.2.66
matrixmultiply 0.1.15
memchr 2.3.2
memoffset 0.5.3
ndarray 0.12.1
ndarray-parallel 0.9.1
ndarray-rand 0.9.0
num-complex 0.2.4
num-traits 0.2.11
num_cpus 1.12.0
numpy 0.7.0
paste 0.1.6
paste-impl 0.1.6
proc-macro-hack 0.5.11
proc-macro2 1.0.8
pyo3 0.8.5
pyo3-derive-backend 0.8.5
pyo3cls 0.8.5
quote 1.0.2
rand 0.3.23
rand 0.4.6
rand 0.6.5
rand_chacha 0.1.1
rand_core 0.3.1
rand_core 0.4.2
rand_hc 0.1.0
rand_isaac 0.1.1
rand_jitter 0.1.4
rand_os 0.1.3
rand_pcg 0.1.2
rand_xorshift 0.1.1
rand_xoshiro 0.1.0
rawpointer 0.1.0
rayon 1.3.0
rayon-core 1.7.0
rdrand 0.4.0
regex 1.3.4
regex-automata 0.1.8
regex-syntax 0.6.14
rustc_version 0.2.3
ryu 1.0.2
same-file 1.0.6
scopeguard 1.0.0
semver 0.9.0
semver-parser 0.7.0
serde 1.0.104
serde_derive 1.0.104
serde_json 1.0.48
spin 0.5.2
statrs 0.9.0
syn 1.0.14
textwrap 0.11.0
thread_local 1.0.1
tinytemplate 1.0.3
unicode-width 0.1.7
unicode-xid 0.2.0
unindent 0.1.5
version_check 0.9.1
walkdir 2.3.1
winapi 0.3.8
winapi-i686-pc-windows-gnu 0.4.0
winapi-util 0.1.3
winapi-x86_64-pc-windows-gnu 0.4.0

Cargo.toml cargo

criterion 0.2 development
ndarray 0.12.0
ndarray-parallel 0.9.0
ndarray-rand 0.9.0
num-traits 0.2.4
numpy 0.7
pyo3 0.8
rand 0.6.0
rayon 1.0.1
statrs 0.9.0

setup.py pypi

numpy *

.github/workflows/python.yml actions

actions/checkout v2 composite
actions/setup-python v1 composite

.github/workflows/release.yml actions

actions-rs/toolchain v1 composite
actions/checkout v2 composite
actions/setup-python v1 composite

.github/workflows/rust.yml actions

actions-rs/toolchain v1 composite
actions/checkout v2 composite
actions/setup-python v1 composite
