gap-stat
Dynamically get the suggested clusters in the data for unsupervised learning.
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 1 of 6 committers (16.7%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (16.2%) to scientific vocabulary
Statistics
- Stars: 226
- Watchers: 5
- Forks: 47
- Open Issues: 7
- Releases: 15
README.md
Python implementation of the Gap Statistic
Maintenance mode
I've lost interest/time in developing this further; other things have taken priority for some time now. However, all is not lost: I am willing to review/comment on any issues/PRs, but will not complete any fixes or feature requests myself.
Purpose
Dynamically identify the suggested number of clusters in a data-set using the gap statistic.
A full example is available in the example notebook.
Install:
Bleeding edge:
```commandline
pip install git+https://github.com/milesgranger/gap_statistic.git
```
PyPI:
```commandline
pip install --upgrade gap-stat
```
With Rust extension:
```commandline
pip install --upgrade "gap-stat[rust]"
```
Uninstall:
```commandline
pip uninstall gap-stat
```
Methodology:
This package provides several methods to assist in choosing the optimal number of clusters for a given dataset, based on the Gap method presented in "Estimating the number of clusters in a data set via the gap statistic" (Tibshirani et al.).
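For readers without the paper at hand, the Gap statistic and its standard error can be summarised as below. The notation ($W_k$ for the pooled within-cluster dispersion with $k$ clusters, $W^{*}_{kb}$ for the dispersion of the $b$-th of $B$ reference data sets) follows Tibshirani et al. rather than this package's code, so treat it as a reference sketch and not as a description of the implementation.

```latex
\begin{aligned}
\mathrm{Gap}(k) &= \frac{1}{B}\sum_{b=1}^{B}\log W^{*}_{kb} \;-\; \log W_k, \\
\bar{l}_k &= \frac{1}{B}\sum_{b=1}^{B}\log W^{*}_{kb}, \qquad
\mathrm{sd}_k = \sqrt{\frac{1}{B}\sum_{b=1}^{B}\left(\log W^{*}_{kb} - \bar{l}_k\right)^{2}}, \\
s_k &= \mathrm{sd}_k\,\sqrt{1 + \frac{1}{B}}.
\end{aligned}
```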
The methods implemented can cluster a given dataset using a range of provided k values, and provide you with statistics that can help in choosing the right number of clusters for your dataset. Three possible methods are:
- Taking the `k` maximizing the Gap value, which is calculated for each `k`. This, however, might not always be possible, as for many datasets this value is monotonically increasing or decreasing.
- Taking the smallest `k` such that Gap(k) >= Gap(k+1) - s(k+1). This is the method suggested in Tibshirani et al. (consult the paper for details). The measure `diff = Gap(k) - Gap(k+1) + s(k+1)` is calculated for each `k`; the parallel here, then, is to take the smallest `k` for which `diff` is non-negative. Note that in some cases this can be true for the entire range of `k`.
- Taking the `k` maximizing the Gap* value, an alternative measure suggested in "A comparison of Gap statistic definitions with and with-out logarithm function" by Mohajer, Englmeier and Schmid. The authors claim this measure avoids the over-estimation of the number of clusters from which the original Gap statistic suffers, and can also suggest an optimal value for `k` in cases where Gap cannot. They do warn, however, that the original Gap statistic performs better than Gap* in the case of overlapping clusters, due to its tendency to overestimate the number of clusters.
Note that none of the above methods is guaranteed to find an optimal value for k, and that they often contradict one another. Rather, they can provide more information on which to base your choice of k, which should take numerous other factors into account.
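The three selection rules above can be sketched in a few lines of plain Python. This is only an illustration, not the package's own code: the function name `suggest_k` and the input numbers are made up, and the sketch assumes `ks` holds consecutive cluster counts in ascending order so that index `i + 1` corresponds to `k + 1`.

```python
def suggest_k(ks, gap, sk, gap_star):
    """Return the k suggested by each of the three rules (hypothetical helper)."""
    # Rule 1: the k with the largest Gap value.
    k_max_gap = ks[max(range(len(ks)), key=lambda i: gap[i])]

    # Rule 2 (Tibshirani et al.): the smallest k for which
    # diff(k) = Gap(k) - Gap(k+1) + s(k+1) is non-negative.
    # diff is undefined for the last k, which has no successor.
    k_first_diff = None
    for i in range(len(ks) - 1):
        diff = gap[i] - gap[i + 1] + sk[i + 1]
        if diff >= 0:
            k_first_diff = ks[i]
            break

    # Rule 3: the k with the largest Gap* value.
    k_max_gap_star = ks[max(range(len(ks)), key=lambda i: gap_star[i])]
    return k_max_gap, k_first_diff, k_max_gap_star


# Made-up statistics for k = 1..5, for illustration only.
ks = [1, 2, 3, 4, 5]
gap = [0.10, 0.35, 0.80, 0.78, 0.85]
sk = [0.02, 0.03, 0.03, 0.04, 0.04]
gap_star = [5.0, 9.0, 14.0, 12.0, 11.0]

print(suggest_k(ks, gap, sk, gap_star))  # prints (5, 3, 3)
```

With these numbers the first rule disagrees with the other two, which illustrates why the rules are best treated as evidence to weigh and not as a single answer.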
Use:
First, construct an `OptimalK` object. Optional initialization parameters are:
- `n_jobs` - Splits computation into this number of parallel jobs. Requires choosing a parallel backend.
- `parallel_backend` - Possible values are `joblib`, `rust`, or `multiprocessing` (the built-in Python backend). If `parallel_backend == 'rust'` it will use all cores.
- `clusterer` - Takes a custom clusterer function to be used when clustering. See the example notebook for more details.
- `clusterer_kwargs` - Any keyword arguments to be forwarded to the custom clusterer function on each call.
An example initialization:
```python
from gap_statistic import OptimalK

optimalK = OptimalK(n_jobs=4, parallel_backend='joblib')
```
After the object is created, it can be called like a function with a dataset; it finds and returns the optimal K for that dataset. Parameters are:
- `X` - A pandas DataFrame or NumPy array of data points of shape `(n_samples, n_features)`.
- `n_refs` - The number of random reference data sets to use as inertia reference to actual data. Optional.
- `cluster_array` - A 1-dimensional iterable of integers, each representing `n_clusters` to try on the data. Optional.
For example:
```python
import numpy as np

# X is your dataset of shape (n_samples, n_features), defined beforehand.
# np.arange(1, 15) tries every cluster count from 1 to 14 inclusive.
n_clusters = optimalK(X, cluster_array=np.arange(1, 15))
```
After performing the search procedure, a DataFrame of gap values and other useful statistics for each passed cluster count is available as the `gap_df` attribute of the `OptimalK` object:
```python
optimalK.gap_df.head()
```
The columns of the dataframe are:
- `n_clusters` - The number of clusters for which the statistics in this row were calculated.
- `gap_value` - The Gap value for this `n`.
- `gap*` - The Gap* value for this `n`.
- `ref_dispersion_std` - The standard deviation of the reference distributions for this `n`.
- `sk` - The standard error of the Gap statistic for this `n`.
- `sk*` - The standard error of the Gap* statistic for this `n`.
- `diff` - The diff value for this `n` (see the methodology section for details).
- `diff*` - The diff* value for this `n` (corresponding to the diff value for Gap*).
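To make the columns more concrete, here is a minimal sketch of how the `gap_value`, `sk` and `gap*` entries of a single row could be derived from within-cluster dispersions, following the definitions in the two papers cited in the methodology section. The helper name `gap_row` and the sample numbers are hypothetical, and the package may compute these values differently in detail.

```python
import math
import statistics


def gap_row(dispersion, ref_dispersions):
    """Compute Gap, s_k and Gap* for one k (hypothetical helper).

    dispersion      -- within-cluster dispersion of the real data for this k
    ref_dispersions -- dispersions of the random reference data sets for this k
    """
    n_refs = len(ref_dispersions)
    log_refs = [math.log(w) for w in ref_dispersions]

    # Gap: mean log dispersion of the references minus log dispersion of the data.
    gap_value = statistics.fmean(log_refs) - math.log(dispersion)

    # s_k: population standard deviation of the log reference dispersions,
    # scaled by sqrt(1 + 1/B) to account for the simulation error.
    sk = statistics.pstdev(log_refs) * math.sqrt(1.0 + 1.0 / n_refs)

    # Gap*: the same comparison without the logarithm (Mohajer et al.).
    gap_star = statistics.fmean(ref_dispersions) - dispersion

    return {"gap_value": gap_value, "sk": sk, "gap*": gap_star}


# Made-up dispersions, for illustration only.
row = gap_row(dispersion=50.0, ref_dispersions=[90.0, 100.0, 110.0])
print(row)
```

A larger `gap_value` means the real data clusters much more tightly than structureless reference data does at the same `k`.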
Additionally, the relation between the above measures and the number of clusters can be plotted by calling the `OptimalK.plot_results()` method (meant to be used inside a Jupyter Notebook or a similar IPython-based notebook), which draws four plots:
- A plot of the Gap value versus n, the number of clusters.
- A plot of diff versus n.
- A plot of the Gap* value versus n, the number of clusters.
- A plot of the diff* value versus n.
Owner
- Name: Miles
- Login: milesgranger
- Kind: user
- Location: Bergen, Norway
- Company: Noetic AS
- Repositories: 71
- Profile: https://github.com/milesgranger
Just a happy engineer.
GitHub Events
Total
- Watch event: 9
- Issue comment event: 2
- Fork event: 1
Last Year
- Watch event: 9
- Issue comment event: 2
- Fork event: 1
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Miles Granger | m****3@g****m | 123 |
| Dmitry Vukolov | d****o@g****m | 2 |
| psads-git | 7****t | 1 |
| Shay Palachy | s****5 | 1 |
| Lev E. Givon | l****v@c****u | 1 |
| Alina Selega | a****a@g****m | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 39
- Total pull requests: 26
- Average time to close issues: 3 months
- Average time to close pull requests: 2 days
- Total issue authors: 23
- Total pull request authors: 6
- Average comments per issue: 1.82
- Average comments per pull request: 1.62
- Merged pull requests: 23
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 1
- Average comments per issue: 3.0
- Average comments per pull request: 3.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- milesgranger (8)
- nry123 (4)
- shaypal5 (4)
- psads-git (2)
- shahabh (2)
- dvukolov (2)
- Li-Xue (1)
- lebedov (1)
- sinanazeri (1)
- supersaiyanesee (1)
- johnvorsten (1)
- kikiegoguma (1)
- hanhanwu (1)
- rakshita95 (1)
- Charles-Z (1)
Pull Request Authors
- milesgranger (19)
- lebedov (3)
- dvukolov (2)
- alinaselega (1)
- psads-git (1)
- shaypal5 (1)
Packages
- Total packages: 1
- Total downloads: 1,019 last month (PyPI)
- Total dependent packages: 1
- Total dependent repositories: 10
- Total versions: 16
- Total maintainers: 1
pypi.org: gap-stat
Python implementation of the gap statistic with optional Rust optimizations.
- Homepage: https://github.com/milesgranger/gap_statistic
- Documentation: https://gap-stat.readthedocs.io/
- License: MIT
- Latest release: 2.0.3 (published over 2 years ago)
Dependencies
- aho-corasick 0.7.8
- atty 0.2.14
- autocfg 0.1.7
- autocfg 1.0.0
- bitflags 1.2.1
- bstr 0.2.11
- byteorder 1.3.4
- cast 0.2.3
- cfg-if 0.1.10
- clap 2.33.0
- cloudabi 0.0.3
- criterion 0.2.11
- criterion-plot 0.3.1
- crossbeam-deque 0.7.2
- crossbeam-epoch 0.8.0
- crossbeam-queue 0.2.1
- crossbeam-utils 0.7.0
- csv 1.1.3
- csv-core 0.1.10
- ctor 0.1.12
- either 1.5.3
- fuchsia-cprng 0.1.1
- ghost 0.1.1
- hermit-abi 0.1.6
- indoc 0.3.4
- indoc-impl 0.3.4
- inventory 0.1.5
- inventory-impl 0.1.5
- itertools 0.7.11
- itertools 0.8.2
- itoa 0.4.5
- lazy_static 1.4.0
- libc 0.2.66
- matrixmultiply 0.1.15
- memchr 2.3.2
- memoffset 0.5.3
- ndarray 0.12.1
- ndarray-parallel 0.9.1
- ndarray-rand 0.9.0
- num-complex 0.2.4
- num-traits 0.2.11
- num_cpus 1.12.0
- numpy 0.7.0
- paste 0.1.6
- paste-impl 0.1.6
- proc-macro-hack 0.5.11
- proc-macro2 1.0.8
- pyo3 0.8.5
- pyo3-derive-backend 0.8.5
- pyo3cls 0.8.5
- quote 1.0.2
- rand 0.3.23
- rand 0.4.6
- rand 0.6.5
- rand_chacha 0.1.1
- rand_core 0.3.1
- rand_core 0.4.2
- rand_hc 0.1.0
- rand_isaac 0.1.1
- rand_jitter 0.1.4
- rand_os 0.1.3
- rand_pcg 0.1.2
- rand_xorshift 0.1.1
- rand_xoshiro 0.1.0
- rawpointer 0.1.0
- rayon 1.3.0
- rayon-core 1.7.0
- rdrand 0.4.0
- regex 1.3.4
- regex-automata 0.1.8
- regex-syntax 0.6.14
- rustc_version 0.2.3
- ryu 1.0.2
- same-file 1.0.6
- scopeguard 1.0.0
- semver 0.9.0
- semver-parser 0.7.0
- serde 1.0.104
- serde_derive 1.0.104
- serde_json 1.0.48
- spin 0.5.2
- statrs 0.9.0
- syn 1.0.14
- textwrap 0.11.0
- thread_local 1.0.1
- tinytemplate 1.0.3
- unicode-width 0.1.7
- unicode-xid 0.2.0
- unindent 0.1.5
- version_check 0.9.1
- walkdir 2.3.1
- winapi 0.3.8
- winapi-i686-pc-windows-gnu 0.4.0
- winapi-util 0.1.3
- winapi-x86_64-pc-windows-gnu 0.4.0
- criterion 0.2 development
- ndarray 0.12.0
- ndarray-parallel 0.9.0
- ndarray-rand 0.9.0
- num-traits 0.2.4
- numpy 0.7
- pyo3 0.8
- rand 0.6.0
- rayon 1.0.1
- statrs 0.9.0
- numpy *
- actions/checkout v2 composite
- actions/setup-python v1 composite
- actions-rs/toolchain v1 composite
- actions/checkout v2 composite
- actions/setup-python v1 composite
- actions-rs/toolchain v1 composite
- actions/checkout v2 composite
- actions/setup-python v1 composite