https://github.com/awslabs/amazon-denseclus

Clustering for mixed-type data

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.9%) to scientific vocabulary

Keywords

clustering embedding machinelearning-python python

Keywords from Contributors

diagram interactive transformers labels
Last synced: 5 months ago

Repository

Clustering for mixed-type data

Basic Info
Statistics
  • Stars: 99
  • Watchers: 6
  • Forks: 21
  • Open Issues: 11
  • Releases: 6
Topics
clustering embedding machinelearning-python python
Created over 4 years ago · Last pushed over 1 year ago
Metadata Files
Readme Contributing License Code of conduct

README.md

Amazon DenseClus

DenseClus is a Python module for clustering mixed type data using UMAP and HDBSCAN. Allowing for both categorical and numerical data, DenseClus makes it possible to incorporate all features in clustering.

Installation

```bash
python3 -m pip install amazon-denseclus
```

Quick Start

DenseClus requires a pandas DataFrame as input, with both numerical and categorical columns. All preprocessing and extraction are done under the hood: just call `fit` and then retrieve the clusters.

```python
from denseclus import DenseClus
from denseclus.utils import make_dataframe

df = make_dataframe()
clf = DenseClus(df)
clf.fit(df)

scores = clf.evaluate()
print(scores[0:10])
```

Usage

Prediction

DenseClus provides a `predict` method when `umap_combine_method` is set to `ensemble`. Results are returned in a 2D array, with the labels first and the probabilities second.

```python
from denseclus import DenseClus
from denseclus.utils import make_dataframe

RANDOM_STATE = 10

df = make_dataframe(random_state=RANDOM_STATE)
train = df.sample(frac=0.8, random_state=RANDOM_STATE)
test = df.drop(train.index)
clf = DenseClus(random_state=RANDOM_STATE, umap_combine_method="ensemble")
clf.fit(train)

predictions = clf.predict(test)
print(predictions)  # labels, probabilities
```
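As a small illustration of working with that return value, the sketch below splits a prediction array into labels and probabilities and drops noise points. It assumes one `[label, probability]` row per sample and uses hand-written values as a stand-in for `clf.predict(test)`. The `-1` noise label is the HDBSCAN convention, and `split_predictions` is a hypothetical helper, not part of DenseClus.

```python
# Stand-in for the output of clf.predict(test). The values are made up and
# the [label, probability] row layout is an assumption.
predictions = [
    [0, 0.95],
    [1, 0.80],
    [-1, 0.00],  # HDBSCAN marks noise points with the label -1
    [0, 0.60],
]


def split_predictions(rows):
    """Return (labels, probabilities) as two parallel lists."""
    labels = [int(label) for label, _ in rows]
    probabilities = [float(prob) for _, prob in rows]
    return labels, probabilities


labels, probabilities = split_predictions(predictions)

# Keep only the samples that were assigned to a cluster.
clustered = [(lab, prob) for lab, prob in zip(labels, probabilities) if lab != -1]
print(clustered)
```

If the real array uses a different layout, only `split_predictions` needs to change.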

On Combination Method

For slower but more stable results, select `intersection_union_mapper` to combine the embedding layers via a third UMAP, which gives equal weight to the numerical and categorical columns. By default, the random seed is set, which prevents UMAP from running in parallel but helps circumvent some of the randomness of the algorithm.

```python
clf = DenseClus(
    umap_combine_method="intersection_union_mapper",
)
```
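The trade-off between a fixed seed and parallel execution can be captured in a small helper that builds the constructor arguments. This is only a sketch: `denseclus_kwargs` is a hypothetical function, and the seed value 42 is an arbitrary choice made here for illustration.

```python
def denseclus_kwargs(reproducible, seed=42):
    """Build keyword arguments for DenseClus (hypothetical helper).

    reproducible=True  -> fixed seed, repeatable results, no parallel UMAP
    reproducible=False -> random_state=None, so UMAP can run in parallel
    """
    return {
        "umap_combine_method": "intersection_union_mapper",
        "random_state": seed if reproducible else None,
    }


print(denseclus_kwargs(reproducible=True))
print(denseclus_kwargs(reproducible=False))
```

The resulting dictionary could then be unpacked into the constructor, for example `DenseClus(**denseclus_kwargs(reproducible=False))`.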

To Use with GPU with Ensemble

To run on a GPU, first install RAPIDS. You can do this during setup by specifying your CUDA version as an install extra: `pip install amazon-denseclus[gpu-cu12]`

Then to run:

```python
clf = DenseClus(
    umap_combine_method="ensemble",
    use_gpu=True,
)
```
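One way to avoid requesting the GPU on a machine without RAPIDS is to probe for the package first. The sketch below assumes that the relevant RAPIDS module is cuML, importable as `cuml`; that module name is an assumption and `module_available` is a hypothetical helper.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


# Assumption: the RAPIDS dependency shows up as the `cuml` module.
use_gpu = module_available("cuml")
print(f"use_gpu={use_gpu}")
```

The flag can then be passed on, for example `DenseClus(umap_combine_method="ensemble", use_gpu=use_gpu)`.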

Advanced Usage

Advanced users can exercise more fine-grained control over the underlying algorithms by passing dictionaries to the DenseClus class for either UMAP or HDBSCAN.

For example:

```python
from denseclus import DenseClus
from denseclus.utils import make_dataframe

umap_params = {
    "categorical": {"n_neighbors": 15, "min_dist": 0.1},
    "numerical": {"n_neighbors": 20, "min_dist": 0.1},
}
hdbscan_params = {"min_cluster_size": 10}

df = make_dataframe()

clf = DenseClus(
    umap_combine_method="union",
    umap_params=umap_params,
    hdbscan_params=hdbscan_params,
    random_state=None,  # this will run in parallel
)

clf.fit(df)
```
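When the categorical and numerical settings share most of their values, the nested dictionary can be built from common defaults plus per-type overrides. The following is a minimal sketch: `build_umap_params` is a hypothetical helper, and only the `"categorical"` and `"numerical"` keys come from the example above.

```python
def build_umap_params(shared, categorical=None, numerical=None):
    """Merge shared UMAP settings with per-type overrides (hypothetical helper)."""
    return {
        "categorical": {**shared, **(categorical or {})},
        "numerical": {**shared, **(numerical or {})},
    }


umap_params = build_umap_params(
    shared={"min_dist": 0.1},
    categorical={"n_neighbors": 15},
    numerical={"n_neighbors": 20},
)
print(umap_params)
```

Overrides win over shared values because they are unpacked last.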

Examples

Notebooks

A hands-on example with an overview of how to use DenseClus is currently available in the form of an Example Jupyter Notebook.

Should you need to tune HDBSCAN, here is an optional approach: Tuning with HDBSCAN Notebook

Should you need to validate UMAP embeddings, there is an approach to do so in the Validation for UMAP Notebook

Blogs

AWS Blog: Introducing DenseClus, an open source clustering package for mixed-type data

TDS Blog: How To Tune HDBSCAN

TDS Blog: On the Validation of UMAP

References

```bibtex
@article{mcinnes2018umap-software,
  title={UMAP: Uniform Manifold Approximation and Projection},
  author={McInnes, Leland and Healy, John and Saul, Nathaniel and Grossberger, Lukas},
  journal={The Journal of Open Source Software},
  volume={3},
  number={29},
  pages={861},
  year={2018}
}
```

```bibtex
@article{mcinnes2017hdbscan,
  title={hdbscan: Hierarchical density based clustering},
  author={McInnes, Leland and Healy, John and Astels, Steve},
  journal={The Journal of Open Source Software},
  volume={2},
  number={11},
  pages={205},
  year={2017}
}
```

Owner

  • Name: Amazon Web Services - Labs
  • Login: awslabs
  • Kind: organization
  • Location: Seattle, WA

AWS Labs

GitHub Events

Total
  • Watch event: 6
  • Fork event: 2
Last Year
  • Watch event: 6
  • Fork event: 2

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 31
  • Total Committers: 9
  • Avg Commits per committer: 3.444
  • Development Distribution Score (DDS): 0.742
Top Committers

| Name | Email | Commits |
|---|---|---|
| Baichuan Sun | b****n@a****m | 8 |
| Charles Frenzel | 6****l@u****m | 7 |
| Charles Frenzel | f****c@a****m | 6 |
| Baichuan Sun | s****0@u****m | 4 |
| dependabot[bot] | 4****]@u****m | 2 |
| Alexandros Metsai | a****i@g****m | 1 |
| Amazon GitHub Automation | 5****o@u****m | 1 |
| Monk | a****a@g****m | 1 |
| itaiara | 9****a@u****m | 1 |
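The summary figures above can be reproduced from the per-committer commit counts. The sketch below assumes that the Development Distribution Score is defined as one minus the top committer's share of all commits; that definition is not stated on this page, but it matches the reported value.

```python
# Commit counts per committer entry, as listed under Top Committers.
commits = [8, 7, 6, 4, 2, 1, 1, 1, 1]

total_commits = sum(commits)
avg_per_committer = total_commits / len(commits)

# Assumed definition: DDS = 1 - (commits by top committer / total commits).
dds = 1 - max(commits) / total_commits

print(total_commits, round(avg_per_committer, 3), round(dds, 3))
# prints: 31 3.444 0.742
```

A score near 0 would mean a single committer wrote almost everything, while values closer to 1 indicate that the work is spread across more people.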

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 23
  • Total pull requests: 122
  • Average time to close issues: 6 months
  • Average time to close pull requests: 22 days
  • Total issue authors: 14
  • Total pull request authors: 9
  • Average comments per issue: 1.04
  • Average comments per pull request: 0.7
  • Merged pull requests: 33
  • Bot issues: 0
  • Bot pull requests: 99
Past Year
  • Issues: 1
  • Pull requests: 16
  • Average time to close issues: N/A
  • Average time to close pull requests: 12 days
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.56
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 16
Top Authors
Issue Authors
  • momonga-ml (6)
  • jbdatascience (3)
  • cwk20 (2)
  • monk1337 (1)
  • jovanaarsic (1)
  • ghost (1)
  • relifeted (1)
  • yusuftalhatamer (1)
  • Nick3523 (1)
  • porygonseverywhere (1)
  • AlexMetsai (1)
  • saujosai (1)
  • sgbaird (1)
  • timofeytkachenko (1)
Pull Request Authors
  • dependabot[bot] (174)
  • momonga-ml (13)
  • sunbc0120 (4)
  • srushtii-aws (3)
  • itaiara (1)
  • monk1337 (1)
  • AlexMetsai (1)
  • bharven (1)
  • amorisot (1)
Top Labels
Issue Labels
  • enhancement (6)
  • duplicate (3)
  • question (2)
  • help wanted (2)
  • good first issue (1)
  • documentation (1)
  • bug (1)
Pull Request Labels
  • dependencies (174)
  • enhancement (7)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 228 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 9
  • Total maintainers: 2
pypi.org: amazon-denseclus

Dense Clustering for Mixed Data Types

  • Versions: 9
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 228 Last month
Rankings
Downloads: 4.7%
Stargazers count: 7.7%
Forks count: 8.1%
Dependent packages count: 10.1%
Average: 10.5%
Dependent repos count: 21.6%
Maintainers (2)
Last synced: 6 months ago

Dependencies

setup.py pypi
  • hdbscan >=0.8.27
  • numba >=0.51.2
  • numpy >=1.20.2
  • pandas >=1.2.4
  • scikit_learn >=0.24.2
  • umap_learn >=0.5.1
pyproject.toml pypi
.github/workflows/cd.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/ci.yml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
requirements-dev.txt pypi
  • black ==23.11.0 development
  • coverage ==7.3.2 development
  • mypy ==1.7.0 development
  • nbqa ==1.7.0 development
  • pre-commit ==3.5.0 development
  • pylint ==3.0.2 development
  • pytest ==7.4.3 development
  • pytest-cov ==4.1.0 development
  • ruff ==0.1.6 development
  • tox ==4.11.3 development
  • tox-gh-actions ==3.1.3 development
requirements.txt pypi
  • hdbscan >=0.8.27
  • numba >=0.51.2
  • numpy >=1.20.2
  • pandas >=1.2.4
  • scikit_learn >=0.24.2
  • umap_learn >=0.5.1
environment.yaml pypi