PyCVI

PyCVI: A Python package for internal Cluster Validity Indices, compatible with time-series data - Published in JOSS (2024)

https://github.com/nglm/pycvi

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 6 DOI reference(s) in README and JOSS metadata
✓ Academic publication links: Links to joss.theoj.org
○ Committers with academic emails
○ Institutional organization owner
✓ JOSS paper metadata: Published in Journal of Open Source Software

Keywords

cluster-validity-index clustering machine-learning python time-series

Scientific Fields

Psychology Social Sciences - 40% confidence

Last synced: 6 months ago

Repository

Internal Cluster Validity Indices in Python, compatible with time-series data

Basic Info

Host: GitHub
Owner: nglm
License: mit
Language: Python
Default Branch: master
Homepage:
Size: 37.1 MB

Statistics

Stars: 4
Watchers: 1
Forks: 0
Open Issues: 1
Releases: 6

Topics

cluster-validity-index clustering machine-learning python time-series

Created over 2 years ago · Last pushed over 1 year ago

Metadata Files

Readme License

PyCVI

PyCVI is a Python package specialized in internal Cluster Validity Indices (CVIs). Internal CVIs are used to select the best clustering among a set of pre-computed clusterings when no external information, such as the labels of the datapoints, is available.
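As an illustration of this selection step, here is a minimal sketch in plain Python. It does not use PyCVI's API: the `silhouette` function is written from the standard definition of the Silhouette index, and the toy data, the `candidates` dictionary and all names are assumptions made for the example.

```python
from math import dist
from statistics import mean


def silhouette(points, labels):
    """Mean silhouette width of one clustering (higher is better)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    widths = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point itself contributes a distance of 0 to the sum)
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the closest other cluster
        b = min(
            mean(dist(point, q) for q in members)
            for other, members in clusters.items()
            if other != label
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(widths)


# Toy data (illustrative): two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
# Pre-computed candidate clusterings, keyed by their number of clusters.
candidates = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 2, 1, 1, 1],
}
scores = {k: silhouette(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

No labels are needed at any point: the index only looks at the datapoints and at the candidate partitions, which is what makes it an internal CVI.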

All CVIs rely on a distance between datapoints, and most of them also rely on the notion of a cluster center.

For non-time-series data, the distance used is usually the Euclidean distance and the cluster center is defined as the usual average. Libraries such as scipy, numpy and scikit-learn offer a large selection of distance measures that are compatible with all their functions.

For time-series data, however, the distance commonly used is Dynamic Time Warping (DTW)[^DTW], and the barycenter of a group of time series is then defined not as the usual mean but as the DTW Barycentric Average (DBA)[^DBA]. Unfortunately, DTW and DBA are not compatible with the libraries mentioned above, which is one of the reasons why additional machine learning libraries specialized in time-series data, such as aeon, sktime and tslearn, became necessary.
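To make the difference with the Euclidean distance concrete, the following sketch implements the textbook dynamic-programming formulation of DTW for univariate series. It is not the implementation PyCVI uses; the function name `dtw_distance`, the squared pointwise cost and the absence of a warping window are assumptions made for the example.

```python
from math import dist, sqrt


def dtw_distance(s, t):
    """DTW distance between two univariate series, without a warping window."""
    n, m = len(s), len(t)
    inf = float("inf")
    # cost[i][j]: cheapest cumulative cost of aligning s[:i] with t[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (s[i - 1] - t[j - 1]) ** 2
            # extend the cheapest of the three allowed predecessor alignments
            cost[i][j] = local + min(
                cost[i - 1][j],      # s advances, t stays
                cost[i][j - 1],      # t advances, s stays
                cost[i - 1][j - 1],  # both advance
            )
    return sqrt(cost[n][m])


# The same bump, shifted by one time step (illustrative data).
a = [0, 0, 1, 2, 1, 0]
b = [0, 1, 2, 1, 0, 0]
print("Euclidean:", dist(a, b))
print("DTW:", dtw_distance(a, b))
```

The Euclidean distance compares the series index by index and penalizes the shift, whereas DTW is allowed to warp the time axis and aligns the two bumps.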

PyCVI implements 12 state-of-the-art internal CVIs and extends them to make them compatible with DTW and DBA when using time-series data. To compute DTW and DBA, PyCVI relies on the aeon library.

Documentation

The full documentation is available at pycvi.readthedocs.io.

Features

12 internal CVIs implemented: Hartigan[^Hart], Calinski-Harabasz[^CH], GapStatistic[^Gap], Silhouette[^Sil], ScoreFunction[^SF], Maulik-Bandyopadhyay[^MB], SD[^SD], SDbw[^SDbw], Dunn[^D], Xie-Beni[^XB], XB*[^XB*] and Davies-Bouldin[^DB].
Compute CVI values and select the best clustering based on the results.
Compatible with time-series data, Dynamic Time Warping (DTW) and DTW Barycentric Average (DBA).
Compatible with scikit-learn, scikit-learn-extra, aeon and sktime, for easy integration into any clustering pipeline in Python.
Can compute the clusterings beforehand if provided with a sklearn-like clustering class.
Users can define custom CVIs.
Multiple CVIs can easily be combined to select the best clustering based on a majority vote.
Variation of Information[^VI] implemented (distance between clusterings).
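Two of the features listed above, the majority vote and the Variation of Information, can be sketched in a few lines of plain Python. This is an illustration of the underlying ideas, not PyCVI's API: the function names, the tie-breaking rule and the example selections are assumptions.

```python
from collections import Counter
from math import log


def majority_vote(selected):
    """Most frequent choice among the numbers of clusters selected by each CVI.

    `selected` maps a CVI name to the number of clusters it selected.
    Assumption: ties go to the value that appears first in `selected`.
    """
    return Counter(selected.values()).most_common(1)[0][0]


def variation_of_information(labels_a, labels_b):
    """Variation of Information between two clusterings of the same datapoints."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same datapoints")
    n = len(labels_a)
    size_a = Counter(labels_a)
    size_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        # VI = -sum_ab p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
        vi -= (n_ab / n) * (log(n_ab / size_a[a]) + log(n_ab / size_b[b]))
    return vi


# Hypothetical selections made by three CVIs.
votes = {"Silhouette": 3, "Dunn": 3, "Davies-Bouldin": 4}
print(majority_vote(votes))

# Identical partitions up to a relabeling, then two unrelated partitions.
print(variation_of_information([0, 0, 1, 1], [1, 1, 0, 0]))
print(variation_of_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

The Variation of Information is zero exactly when the two clusterings define the same partition, regardless of how the clusters are named, and it grows as the partitions share less information.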

Install

With poetry:

```bash
# From PyPI
poetry add pycvi-lib

# Alternatively, from GitHub directly
poetry add git+https://github.com/nglm/pycvi.git
```

With pip:

```bash
# From PyPI
pip install pycvi-lib

# Alternatively, from GitHub directly
pip install git+https://github.com/nglm/pycvi.git
```

With anaconda:

```bash
# Activate your environment (replace myEnv with your environment name)
conda activate myEnv

# Install pip first in your environment
conda install pip

# Install pycvi in your anaconda environment with pip
pip install pycvi-lib
```
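Whichever installer is used, a quick way to confirm that the package landed in the active environment is to query its metadata. The sketch below relies only on the standard library and on the distribution name `pycvi-lib` used in the commands above; the helper name `installed_version` is an assumption made for the example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="pycvi-lib"):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("pycvi-lib is not installed in this environment")
else:
    print(f"pycvi-lib {found} is installed")
```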

Extra dependencies

In order to run the example scripts, extra dependencies are necessary. The install command is then:

```bash
# For poetry (the quotes keep shells such as zsh from expanding the brackets)
poetry add "pycvi-lib[examples]"

# For pip and anaconda
pip install "pycvi-lib[examples]"
```

Alternatively, you can manually install in your environment the packages that are necessary to run the example scripts (matplotlib and/or scikit-learn-extra depending on the example).

If you wish to run the example scripts on your own computer, please follow the instructions detailed in the documentation first: Running example scripts on your computer.

Contribute

Issue Tracker: github.com/nglm/pycvi/issues.
Source Code: github.com/nglm/pycvi.

Support

If you are having issues, please let me know or create an issue.

License

The project is licensed under the MIT license.

How to cite PyCVI

If you are using PyCVI in your work, please cite us by using one of the following entries referring to the JOSS paper "PyCVI: A Python package for internal Cluster Validity Indices, compatible with time-series data" by N. Galmiche:

BibTeX

@article{Galmiche2024,
  author = {Natacha Galmiche},
  title = {PyCVI: A Python package for internal Cluster Validity Indices, compatible with time-series data},
  doi = {10.21105/joss.06841},
  url = {https://doi.org/10.21105/joss.06841},
  year = {2024},
  publisher = {The Open Journal},
  volume = {9},
  number = {102},
  pages = {6841},
  journal = {Journal of Open Source Software}
}

Plain text

Galmiche, N., (2024). PyCVI: A Python package for internal Cluster Validity Indices, compatible with time-series data. Journal of Open Source Software, 9(102), 6841, https://doi.org/10.21105/joss.06841

[^DTW]: D. J. Berndt and J. Clifford, “Using dynamic time warping to find patterns in time series,” in Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, AAAIWS’94, pp. 359–370, AAAI Press, 1994.
[^DBA]: F. Petitjean, A. Ketterlin, and P. Gançarski, “A global averaging method for dynamic time warping, with applications to clustering,” Pattern Recognition, vol. 44, pp. 678–693, Mar. 2011.
[^Hart]: D. J. Strauss and J. A. Hartigan, “Clustering algorithms,” Biometrics, vol. 31, p. 793, Sept. 1975.
[^CH]: T. Calinski and J. Harabasz, “A dendrite method for cluster analysis,” Communications in Statistics - Theory and Methods, vol. 3, no. 1, pp. 1–27, 1974.
[^Gap]: R. Tibshirani, G. Walther, and T. Hastie, “Estimating the number of clusters in a data set via the gap statistic,” Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 63, pp. 411–423, July 2001.
[^Sil]: P. J. Rousseeuw, “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,” Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[^D]: J. C. Dunn, “Well-separated clusters and optimal fuzzy partitions,” Journal of Cybernetics, vol. 4, pp. 95–104, Jan. 1974.
[^DB]: D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, pp. 224–227, Apr. 1979.
[^SD]: M. Halkidi, M. Vazirgiannis, and Y. Batistakis, “Quality scheme assessment in the clustering process,” in Principles of Data Mining and Knowledge Discovery, pp. 265–276, Springer Berlin Heidelberg, 2000.
[^SDbw]: M. Halkidi and M. Vazirgiannis, “Clustering validity assessment: finding the optimal partitioning of a data set,” in Proceedings 2001 IEEE International Conference on Data Mining, pp. 187–194, IEEE Comput. Soc, 2001.
[^XB]: X. Xie and G. Beni, “A validity measure for fuzzy clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.
[^XB*]: M. Kim and R. Ramakrishna, “New indices for cluster validity assessment,” Pattern Recognition Letters, vol. 26, pp. 2353–2363, Nov. 2005.
[^SF]: S. Saitta, B. Raphael, and I. F. C. Smith, “A bounded index for cluster validity,” in Machine Learning and Data Mining in Pattern Recognition, pp. 174–187, Springer Berlin Heidelberg, 2007.
[^MB]: U. Maulik and S. Bandyopadhyay, “Performance evaluation of some clustering algorithms and validity indices,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 1650–1654, Dec. 2002.
[^VI]: M. Meilă, “Comparing clusterings by the variation of information,” pp. 173–187, Springer Berlin Heidelberg, 2003.

Owner

Name: nglm
Login: nglm
Kind: user

Repositories: 4
Profile: https://github.com/nglm

JOSS Publication

PyCVI: A Python package for internal Cluster Validity Indices, compatible with time-series data

Published

October 08, 2024

DOI

10.21105/joss.06841

Volume 9, Issue 102, Page 6841

Authors

Natacha Galmiche

University of Bergen, Norway

Editor

Oskar Laverny

GitHub Events

Total

Release event: 1
Delete event: 2
Issue comment event: 1
Push event: 7
Create event: 2

Last Year

Release event: 1
Delete event: 2
Issue comment event: 1
Push event: 7
Create event: 2

Committers

Last synced: 7 months ago

All Time

Total Commits: 360
Total Committers: 1
Avg Commits per committer: 360.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 69
Committers: 1
Avg Commits per committer: 69.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Natacha, Galmiche	n**e@u**o	360

Committer Domains (Top 20 + Academic)

uib.no: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 5
Total pull requests: 0
Average time to close issues: about 1 month
Average time to close pull requests: N/A
Total issue authors: 4
Total pull request authors: 0
Average comments per issue: 2.2
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 0
Average time to close issues: 4 days
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 2.5
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

benjaminh (2)
laurent-vouriot (1)
wob86 (1)
robcaulk (1)

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 32 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 6
Total maintainers: 1

pypi.org: pycvi-lib

Internal Cluster Validity Indices in Python, compatible with time-series data

Homepage: https://pycvi.readthedocs.io/en/latest/
Documentation: https://pycvi-lib.readthedocs.io/
License: mit
Latest release: 0.1.5
published over 1 year ago

Versions: 6
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 32 Last month

Rankings

Dependent packages count: 10.1%

Average: 38.4%

Dependent repos count: 66.7%

Maintainers (1)

nglm

Last synced: 6 months ago

Dependencies

poetry.lock pypi

colorama 0.4.6
exceptiongroup 1.1.3
importlib-metadata 6.8.0
iniconfig 2.0.0
joblib 1.3.2
llvmlite 0.40.1
numba 0.57.1
numpy 1.24.4
packaging 23.1
pandas 2.0.3
pluggy 1.3.0
pytest 7.4.0
python-dateutil 2.8.2
pytz 2023.3
scikit-learn 1.3.0
scipy 1.9.3
six 1.16.0
threadpoolctl 3.2.0
tomli 2.0.1
tslearn 0.6.2
tzdata 2023.3
zipp 3.16.2

pyproject.toml pypi

python ^3.8
tslearn ^0.6.2
