https://github.com/awslabs/amazon-denseclus
Clustering for mixed-type data
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.9%) to scientific vocabulary
Repository
Clustering for mixed-type data
Basic Info
- Host: GitHub
- Owner: awslabs
- License: mit-0
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: https://aws.amazon.com/blogs/opensource/introducing-denseclus-an-open-source-clustering-package-for-mixed-type-data/
- Size: 4.53 MB
Statistics
- Stars: 99
- Watchers: 6
- Forks: 21
- Open Issues: 11
- Releases: 6
Topics
Metadata Files
README.md
Amazon DenseClus
DenseClus is a Python module for clustering mixed-type data using UMAP and HDBSCAN. Because it handles both categorical and numerical data, DenseClus makes it possible to use all features when clustering.
Installation
```bash
python3 -m pip install amazon-denseclus
```
Quick Start
DenseClus requires a pandas DataFrame as input, with both numerical and categorical columns. All preprocessing and extraction are done under the hood: call fit, then retrieve the clusters.
```python
from denseclus import DenseClus
from denseclus.utils import make_dataframe

df = make_dataframe()
clf = DenseClus(df)
clf.fit(df)

scores = clf.evaluate()
print(scores[0:10])
```
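To cluster your own data rather than the generated sample, the input needs to be a pandas DataFrame that mixes numerical and categorical columns. The sketch below builds such a frame and lists which columns fall into each group. The column names and values are invented for illustration, and splitting by dtype is an assumption about how the two kinds of feature are told apart, not something this README confirms.

```python
import pandas as pd

# Hypothetical customer table: two numerical and two categorical columns.
df = pd.DataFrame(
    {
        "age": [23, 35, 41, 29, 52, 38],
        "monthly_spend": [120.5, 80.0, 310.2, 45.9, 150.0, 99.9],
        "plan": ["basic", "pro", "pro", "basic", "enterprise", "pro"],
        "region": ["emea", "amer", "apac", "emea", "amer", "apac"],
    }
)

# Assumption: numeric dtypes are numerical features, everything else is categorical.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("numerical:", numerical_cols)
print("categorical:", categorical_cols)
```

A frame shaped like this could then be passed to fit in place of the generated sample.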
Usage
Prediction
DenseClus provides a predict method when umap_combine_method is set to ensemble.
Results are returned as a 2D array, with the labels first and the probabilities second.
```python
from denseclus import DenseClus
from denseclus.utils import make_dataframe

RANDOM_STATE = 10

df = make_dataframe(random_state=RANDOM_STATE)
train = df.sample(frac=0.8, random_state=RANDOM_STATE)
test = df.drop(train.index)
clf = DenseClus(random_state=RANDOM_STATE, umap_combine_method='ensemble')
clf.fit(train)

predictions = clf.predict(test)
print(predictions)  # labels, probabilities
```
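The train/test split in the prediction example relies on two pandas calls: sample draws a reproducible 80% of the rows, and drop(train.index) keeps the remainder. The stand-alone check below uses a made-up frame in place of make_dataframe() and shows that the two parts are disjoint and together cover every row. It assumes the index is unique, which is what makes drop(train.index) remove exactly the sampled rows.

```python
import pandas as pd

RANDOM_STATE = 10

# Stand-in data with a unique default index (not the DenseClus sample data).
df = pd.DataFrame({"x": range(100), "label": ["a", "b"] * 50})

train = df.sample(frac=0.8, random_state=RANDOM_STATE)
test = df.drop(train.index)

# No row appears in both parts, and no row is lost.
overlap = set(train.index) & set(test.index)
print(len(train), len(test), len(overlap))
```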
On Combination Method
For slower but more stable results, select intersection_union_mapper to combine the embedding layers via a third UMAP, which gives equal weight to the numerical and categorical columns. By default a random seed is set, which prevents UMAP from running in parallel but helps circumvent some of the algorithm's randomness.
```python
clf = DenseClus(
    umap_combine_method="intersection_union_mapper",
)
```
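The trade-off described above (a fixed seed gives repeatable results but stops UMAP from running in parallel) can be made explicit in configuration code. The helper below is hypothetical and not part of DenseClus. It only assembles keyword arguments using the umap_combine_method and random_state parameter names that appear in this README, so it runs without the package installed. The seed value 10 is arbitrary.

```python
from typing import Any, Dict


def denseclus_kwargs(stable: bool, seed: int = 10) -> Dict[str, Any]:
    """Build keyword arguments for DenseClus (hypothetical helper).

    stable=True  -> combine embeddings with a third UMAP and keep a fixed seed.
    stable=False -> drop the seed so UMAP is free to run in parallel.
    """
    if stable:
        return {
            "umap_combine_method": "intersection_union_mapper",
            "random_state": seed,
        }
    return {"umap_combine_method": "union", "random_state": None}


print(denseclus_kwargs(stable=True))
print(denseclus_kwargs(stable=False))
```

With the package installed, the result would be unpacked into the constructor, for example DenseClus(**denseclus_kwargs(stable=True)).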
To Use with GPU with Ensemble
To use the GPU, first install RAPIDS.
You can do this at install time by specifying your CUDA version:
```bash
pip install amazon-denseclus[gpu-cu12]
```
Then to run:
```python
clf = DenseClus(
    umap_combine_method="ensemble",
    use_gpu=True
)
```
Advanced Usage
Advanced users can exercise finer-grained control over the underlying algorithms by passing
dictionaries of UMAP or HDBSCAN parameters into the DenseClus class.
For example:

```python
from denseclus import DenseClus
from denseclus.utils import make_dataframe

umap_params = {
    "categorical": {"n_neighbors": 15, "min_dist": 0.1},
    "numerical": {"n_neighbors": 20, "min_dist": 0.1},
}
hdbscan_params = {"min_cluster_size": 10}

df = make_dataframe()

clf = DenseClus(
    umap_combine_method="union",
    umap_params=umap_params,
    hdbscan_params=hdbscan_params,
    random_state=None,  # this will run in parallel
)

clf.fit(df)
```
Examples
Notebooks
A hands-on example with an overview of how to use DenseClus is available in the Example Jupyter Notebook.
Should you need to tune HDBSCAN, an optional approach is described in the Tuning with HDBSCAN Notebook.
Should you need to validate UMAP embeddings, an approach is described in the Validation for UMAP Notebook.
Blogs
AWS Blog: Introducing DenseClus, an open source clustering package for mixed-type data
TDS Blog: On the Validation of UMAP
References
```bibtex
@article{mcinnes2018umap-software,
  title={UMAP: Uniform Manifold Approximation and Projection},
  author={McInnes, Leland and Healy, John and Saul, Nathaniel and Grossberger, Lukas},
  journal={The Journal of Open Source Software},
  volume={3},
  number={29},
  pages={861},
  year={2018}
}
```

```bibtex
@article{mcinnes2017hdbscan,
  title={hdbscan: Hierarchical density based clustering},
  author={McInnes, Leland and Healy, John and Astels, Steve},
  journal={The Journal of Open Source Software},
  volume={2},
  number={11},
  pages={205},
  year={2017}
}
```
Owner
- Name: Amazon Web Services - Labs
- Login: awslabs
- Kind: organization
- Location: Seattle, WA
- Website: http://amazon.com/aws/
- Repositories: 914
- Profile: https://github.com/awslabs
AWS Labs
GitHub Events
Total
- Watch event: 6
- Fork event: 2
Last Year
- Watch event: 6
- Fork event: 2
Committers
Last synced: almost 3 years ago
All Time
- Total Commits: 31
- Total Committers: 9
- Avg Commits per committer: 3.444
- Development Distribution Score (DDS): 0.742
Top Committers
| Name | Email | Commits |
|---|---|---|
| Baichuan Sun | b****n@a****m | 8 |
| Charles Frenzel | 6****l@u****m | 7 |
| Charles Frenzel | f****c@a****m | 6 |
| Baichuan Sun | s****0@u****m | 4 |
| dependabot[bot] | 4****]@u****m | 2 |
| Alexandros Metsai | a****i@g****m | 1 |
| Amazon GitHub Automation | 5****o@u****m | 1 |
| Monk | a****a@g****m | 1 |
| itaiara | 9****a@u****m | 1 |
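
The summary figures reported under All Time can be reproduced from the table. Average commits per committer is the total commit count divided by the number of committer identities. The Development Distribution Score is assumed here to be one minus the top committer's share of all commits. That definition matches the reported value, but this page does not state the formula. Identities are counted per email address, so a person who committed under two addresses appears twice.

```python
# Commit counts per committer identity, copied from the Top Committers table.
commits = [8, 7, 6, 4, 2, 1, 1, 1, 1]

total_commits = sum(commits)
avg_per_committer = total_commits / len(commits)

# Assumed definition: 1 - (top committer's commits / total commits).
dds = 1 - max(commits) / total_commits

print(total_commits, round(avg_per_committer, 3), round(dds, 3))
# prints: 31 3.444 0.742
```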
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 23
- Total pull requests: 122
- Average time to close issues: 6 months
- Average time to close pull requests: 22 days
- Total issue authors: 14
- Total pull request authors: 9
- Average comments per issue: 1.04
- Average comments per pull request: 0.7
- Merged pull requests: 33
- Bot issues: 0
- Bot pull requests: 99
Past Year
- Issues: 1
- Pull requests: 16
- Average time to close issues: N/A
- Average time to close pull requests: 12 days
- Issue authors: 1
- Pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.56
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 16
Top Authors
Issue Authors
- momonga-ml (6)
- jbdatascience (3)
- cwk20 (2)
- monk1337 (1)
- jovanaarsic (1)
- ghost (1)
- relifeted (1)
- yusuftalhatamer (1)
- Nick3523 (1)
- porygonseverywhere (1)
- AlexMetsai (1)
- saujosai (1)
- sgbaird (1)
- timofeytkachenko (1)
Pull Request Authors
- dependabot[bot] (174)
- momonga-ml (13)
- sunbc0120 (4)
- srushtii-aws (3)
- itaiara (1)
- monk1337 (1)
- AlexMetsai (1)
- bharven (1)
- amorisot (1)
Packages
- Total packages: 1
- Total downloads: pypi 228 last-month
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 9
- Total maintainers: 2
pypi.org: amazon-denseclus
Dense Clustering for Mixed Data Types
- Homepage: https://github.com/awslabs/amazon-denseclus
- Documentation: https://github.com/awslabs/amazon-denseclus/notebooks
- License: MIT License
- Latest release: 0.2.2 (published almost 2 years ago)
Dependencies
- hdbscan >=0.8.27
- numba >=0.51.2
- numpy >=1.20.2
- pandas >=1.2.4
- scikit_learn >=0.24.2
- umap_learn >=0.5.1
- actions/checkout v2 composite
- actions/setup-python v2 composite
- actions/cache v2 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- black ==23.11.0 development
- coverage ==7.3.2 development
- mypy ==1.7.0 development
- nbqa ==1.7.0 development
- pre-commit ==3.5.0 development
- pylint ==3.0.2 development
- pytest ==7.4.3 development
- pytest-cov ==4.1.0 development
- ruff ==0.1.6 development
- tox ==4.11.3 development
- tox-gh-actions ==3.1.3 development