https://github.com/awslabs/amazon-denseclus

Clustering for mixed-type data

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (Found codemeta.json file)
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity (Low similarity (10.9%) to scientific vocabulary)

Keywords

clustering embedding machinelearning-python python

Keywords from Contributors

diagram interactive transformers labels

Last synced: 9 months ago

Repository

Clustering for mixed-type data

Basic Info

Host: GitHub
Owner: awslabs
License: mit-0
Language: Jupyter Notebook
Default Branch: main
Homepage: https://aws.amazon.com/blogs/opensource/introducing-denseclus-an-open-source-clustering-package-for-mixed-type-data/
Size: 4.53 MB

Statistics

Stars: 99
Watchers: 6
Forks: 21
Open Issues: 11
Releases: 6

Topics

clustering embedding machinelearning-python python

Created almost 5 years ago · Last pushed almost 2 years ago

Metadata Files

Readme Contributing License Code of conduct

Amazon DenseClus

DenseClus is a Python module for clustering mixed-type data using UMAP and HDBSCAN. Because it accepts both categorical and numerical data, DenseClus makes it possible to incorporate all features in clustering.

Installation

```bash
python3 -m pip install amazon-denseclus
```

Quick Start

DenseClus requires a pandas DataFrame as input, with both numerical and categorical columns. All preprocessing and feature extraction are done under the hood: just call fit and then retrieve the clusters.

```python
from denseclus import DenseClus
from denseclus.utils import make_dataframe

df = make_dataframe()
clf = DenseClus(df)
clf.fit(df)

scores = clf.evaluate()
print(scores[0:10])
```

Usage

Prediction

DenseClus provides a predict method when umap_combine_method is set to ensemble. Results are returned in a 2D array, with the first part holding the labels and the second part the probabilities.

```python
from denseclus import DenseClus
from denseclus.utils import make_dataframe

RANDOM_STATE = 10

# Build a sample dataframe and hold out 20% of the rows for prediction.
df = make_dataframe(random_state=RANDOM_STATE)
train = df.sample(frac=0.8, random_state=RANDOM_STATE)
test = df.drop(train.index)

clf = DenseClus(random_state=RANDOM_STATE, umap_combine_method="ensemble")
clf.fit(train)

predictions = clf.predict(test)
print(predictions)  # labels, probabilities
```
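Once labels and probabilities are in hand, a quick summary helps judge the result. The following is a minimal sketch, not part of DenseClus: `summarize_clusters` and its `min_probability` threshold are hypothetical names introduced here, and the input values are stand-ins rather than real output. It relies only on the HDBSCAN convention that noise points carry the label -1. How to slice the array returned by predict into labels and probabilities depends on its shape, which is not specified here.

```python
from collections import Counter


def summarize_clusters(labels, probabilities, min_probability=0.5):
    """Summarize cluster sizes, noise share, and low-confidence points."""
    labels = [int(label) for label in labels]
    probabilities = [float(p) for p in probabilities]
    if len(labels) != len(probabilities):
        raise ValueError("labels and probabilities must have the same length")

    total = len(labels)
    # HDBSCAN assigns the label -1 to points it treats as noise.
    noise = sum(1 for label in labels if label == -1)
    sizes = Counter(label for label in labels if label != -1)
    # Count clustered points whose membership probability is below the threshold.
    weak = sum(
        1
        for label, p in zip(labels, probabilities)
        if label != -1 and p < min_probability
    )
    return {
        "cluster_sizes": dict(sizes),
        "noise_fraction": noise / total if total else 0.0,
        "low_confidence": weak,
    }


# Stand-in values; in practice these would come from clf.predict(test).
example_labels = [0, 0, 1, -1, 1, 1]
example_probabilities = [0.9, 0.4, 1.0, 0.0, 0.8, 0.7]
summary = summarize_clusters(example_labels, example_probabilities)
print(summary)
```

A high noise fraction or many low-confidence points may suggest revisiting the HDBSCAN settings before relying on the predictions.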

On Combination Method

For slower but more stable results, select intersection_union_mapper to combine the embedding layers via a third UMAP, which gives equal weight to the numerical and categorical columns. By default, a random seed is set, which prevents UMAP from running in parallel but helps circumvent some of the algorithm's randomness.

```python
clf = DenseClus(
    umap_combine_method="intersection_union_mapper",
)
```

To Use with GPU with Ensemble

To run on a GPU, first install RAPIDS. You can do this at install time by specifying your CUDA version as an extra: `pip install amazon-denseclus[gpu-cu12]`

Then to run:

```python
clf = DenseClus(
    umap_combine_method="ensemble",
    use_gpu=True,
)
```
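Since the GPU path depends on RAPIDS being installed, it can be useful to fall back to CPU when it is missing. The sketch below is one possible approach and is not part of DenseClus: `rapids_available` and `denseclus_kwargs` are hypothetical helpers, and treating the presence of the `cuml` module (the RAPIDS machine learning library) as a sufficient signal is an assumption. It does not verify that a usable GPU or a matching CUDA version is present.

```python
import importlib.util


def rapids_available(module_name="cuml"):
    """Return True if the named module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def denseclus_kwargs(prefer_gpu=True):
    """Build keyword arguments, enabling the GPU path only if RAPIDS is found."""
    use_gpu = prefer_gpu and rapids_available()
    return {"umap_combine_method": "ensemble", "use_gpu": use_gpu}


kwargs = denseclus_kwargs()
print(kwargs)
# The result could then be passed on, for example: clf = DenseClus(**kwargs)
```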

Advanced Usage

For advanced users, it's possible to gain more fine-grained control of the underlying algorithms by passing dictionaries into the DenseClus class for either UMAP or HDBSCAN.

For example:

```python
from denseclus import DenseClus
from denseclus.utils import make_dataframe

# Separate UMAP settings for the categorical and numerical embeddings.
umap_params = {
    "categorical": {"n_neighbors": 15, "min_dist": 0.1},
    "numerical": {"n_neighbors": 20, "min_dist": 0.1},
}
hdbscan_params = {"min_cluster_size": 10}

df = make_dataframe()

clf = DenseClus(
    umap_combine_method="union",
    umap_params=umap_params,
    hdbscan_params=hdbscan_params,
    random_state=None,  # this will run in parallel
)

clf.fit(df)
```
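When several HDBSCAN settings need to be compared, the parameter dictionaries can be generated rather than written by hand. This is a minimal sketch under stated assumptions: `hdbscan_param_grid` is a hypothetical helper, the candidate values are arbitrary, and it assumes DenseClus forwards a `min_samples` key to HDBSCAN in the same way as `min_cluster_size`.

```python
from itertools import product


def hdbscan_param_grid(min_cluster_sizes, min_samples_options):
    """Return one HDBSCAN parameter dictionary per combination of values."""
    return [
        {"min_cluster_size": size, "min_samples": samples}
        for size, samples in product(min_cluster_sizes, min_samples_options)
    ]


grid = hdbscan_param_grid([10, 25, 50], [1, 5])
for params in grid:
    print(params)
    # Each dictionary could be tried in turn, for example:
    # clf = DenseClus(hdbscan_params=params); clf.fit(df)
```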

Examples

Notebooks

A hands-on example with an overview of how to use DenseClus is currently available in the form of an Example Jupyter Notebook.

Should you need to tune HDBSCAN, here is an optional approach: Tuning with HDBSCAN Notebook

Should you need to validate UMAP embeddings, there is an approach to do so in the Validation for UMAP Notebook.

Blogs

AWS Blog: Introducing DenseClus, an open source clustering package for mixed-type data

TDS Blog: How To Tune HDBSCAN

TDS Blog: On the Validation of UMAP

References

```bibtex
@article{mcinnes2018umap-software,
  title={UMAP: Uniform Manifold Approximation and Projection},
  author={McInnes, Leland and Healy, John and Saul, Nathaniel and Grossberger, Lukas},
  journal={The Journal of Open Source Software},
  volume={3},
  number={29},
  pages={861},
  year={2018}
}
```

```bibtex
@article{mcinnes2017hdbscan,
  title={hdbscan: Hierarchical density based clustering},
  author={McInnes, Leland and Healy, John and Astels, Steve},
  journal={The Journal of Open Source Software},
  volume={2},
  number={11},
  pages={205},
  year={2017}
}
```

Owner

Name: Amazon Web Services - Labs
Login: awslabs
Kind: organization
Location: Seattle, WA

Website: http://amazon.com/aws/
Repositories: 914
Profile: https://github.com/awslabs

AWS Labs

GitHub Events

Total

Watch event: 6
Fork event: 2

Last Year

Watch event: 6
Fork event: 2

Committers

Last synced: about 3 years ago

All Time

Total Commits: 31
Total Committers: 9
Avg Commits per committer: 3.444
Development Distribution Score (DDS): 0.742

Top Committers

Name	Email	Commits
Baichuan Sun	b**n@a**m	8
Charles Frenzel	6**l@u**m	7
Charles Frenzel	f**c@a**m	6
Baichuan Sun	s**0@u**m	4
dependabot[bot]	4**]@u**m	2
Alexandros Metsai	a**i@g**m	1
Amazon GitHub Automation	5**o@u**m	1
Monk	a**a@g**m	1
itaiara	9**a@u**m	1

Committer Domains (Top 20 + Academic)

amazon.com: 2

Issues and Pull Requests

Last synced: 9 months ago

All Time

Total issues: 23
Total pull requests: 122
Average time to close issues: 6 months
Average time to close pull requests: 22 days
Total issue authors: 14
Total pull request authors: 9
Average comments per issue: 1.04
Average comments per pull request: 0.7
Merged pull requests: 33
Bot issues: 0
Bot pull requests: 99

Past Year

Issues: 1
Pull requests: 16
Average time to close issues: N/A
Average time to close pull requests: 12 days
Issue authors: 1
Pull request authors: 1
Average comments per issue: 0.0
Average comments per pull request: 0.56
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 16


Top Authors

Issue Authors

momonga-ml (6)
jbdatascience (3)
cwk20 (2)
monk1337 (1)
jovanaarsic (1)
ghost (1)
relifeted (1)
yusuftalhatamer (1)
Nick3523 (1)
porygonseverywhere (1)
AlexMetsai (1)
saujosai (1)
sgbaird (1)
timofeytkachenko (1)

Pull Request Authors

dependabot[bot] (174)
momonga-ml (13)
sunbc0120 (4)
srushtii-aws (3)
itaiara (1)
monk1337 (1)
AlexMetsai (1)
bharven (1)
amorisot (1)

Top Labels

Issue Labels

enhancement (6) duplicate (3) question (2) help wanted (2) good first issue (1) documentation (1) bug (1)

Pull Request Labels

dependencies (174) enhancement (7)

Packages

Total packages: 1
Total downloads:
- pypi 228 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 9
Total maintainers: 2

pypi.org: amazon-denseclus

Dense Clustering for Mixed Data Types

Homepage: https://github.com/awslabs/amazon-denseclus
Documentation: https://github.com/awslabs/amazon-denseclus/notebooks
License: MIT License
Latest release: 0.2.2
published about 2 years ago

Versions: 9
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 228 Last month

Rankings

Downloads: 4.7%

Stargazers count: 7.7%

Forks count: 8.1%

Dependent packages count: 10.1%

Average: 10.5%

Dependent repos count: 21.6%

Maintainers (2)

smart-patrol sunbc0120

Last synced: 9 months ago

Dependencies

setup.py pypi

hdbscan >=0.8.27
numba >=0.51.2
numpy >=1.20.2
pandas >=1.2.4
scikit_learn >=0.24.2
umap_learn >=0.5.1

pyproject.toml pypi

.github/workflows/cd.yml actions

actions/checkout v2 composite
actions/setup-python v2 composite

.github/workflows/ci.yml actions

actions/cache v2 composite
actions/checkout v2 composite
actions/setup-python v2 composite

requirements-dev.txt pypi

black ==23.11.0 development
coverage ==7.3.2 development
mypy ==1.7.0 development
nbqa ==1.7.0 development
pre-commit ==3.5.0 development
pylint ==3.0.2 development
pytest ==7.4.3 development
pytest-cov ==4.1.0 development
ruff ==0.1.6 development
tox ==4.11.3 development
tox-gh-actions ==3.1.3 development

requirements.txt pypi

hdbscan >=0.8.27
numba >=0.51.2
numpy >=1.20.2
pandas >=1.2.4
scikit_learn >=0.24.2
umap_learn >=0.5.1

environment.yaml pypi
