kmodes

Python implementations of the k-modes and k-prototypes clustering algorithms, for clustering categorical data

https://github.com/nicodv/kmodes

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 22 committers (4.5%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.5%) to scientific vocabulary

Keywords

clustering-algorithm k-modes k-prototypes python scikit-learn
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: nicodv
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 479 KB
Statistics
  • Stars: 1,261
  • Watchers: 51
  • Forks: 415
  • Open Issues: 17
  • Releases: 11
Created over 12 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Codeowners

README.rst

.. image:: https://img.shields.io/pypi/v/kmodes.svg
    :target: https://pypi.python.org/pypi/kmodes/
    :alt: Version
.. image:: https://anaconda.org/conda-forge/kmodes/badges/version.svg
    :target: https://anaconda.org/conda-forge/kmodes
    :alt: Conda forge page
.. image:: https://github.com/nicodv/kmodes/actions/workflows/python-package.yml/badge.svg?branch=master
    :target: https://github.com/nicodv/kmodes/actions/workflows/python-package.yml
    :alt: Build status
.. image:: https://coveralls.io/repos/nicodv/kmodes/badge.svg
    :target: https://coveralls.io/r/nicodv/kmodes
    :alt: Test coverage
.. image:: https://api.codacy.com/project/badge/Grade/cb19f1f1093a44fa845ebfdaf76975f6
   :alt: Codacy
   :target: https://app.codacy.com/app/nicodv/kmodes?utm_source=github.com&utm_medium=referral&utm_content=nicodv/kmodes&utm_campaign=Badge_Grade_Dashboard
.. image:: https://img.shields.io/pypi/dm/kmodes.svg
    :target: https://pypi.python.org/pypi/kmodes/
    :alt: Monthly downloads
.. image:: https://img.shields.io/pypi/pyversions/kmodes.svg
    :target: https://pypi.python.org/pypi/kmodes/
    :alt: Supported Python versions
.. image:: https://img.shields.io/pypi/l/kmodes.svg
    :target: https://github.com/nicodv/kmodes/blob/master/LICENSE
    :alt: License

kmodes
======

Description
-----------

Python implementations of the k-modes and k-prototypes clustering
algorithms. Relies on numpy for a lot of the heavy lifting.

k-modes is used for clustering categorical variables. It defines clusters
based on the number of matching categories between data points. (This is
in contrast to the better-known k-means algorithm, which clusters
numerical data based on Euclidean distance.) The k-prototypes algorithm
combines k-modes and k-means and is able to cluster mixed numerical /
categorical data.
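
As a rough illustration of the idea (a minimal sketch in plain Python, not
the library's numpy-based implementation), the dissimilarity between two
categorical records can be taken as the number of attributes on which they
differ, and a cluster's mode as the most frequent category per attribute.
The function names and the toy data below are made up for this example.

.. code:: python

    from collections import Counter


    def count_mismatches(a, b):
        """Count the attributes on which two categorical records differ."""
        return sum(x != y for x, y in zip(a, b))


    def column_modes(records):
        """Return the most frequent category per attribute (the 'mode')."""
        return [Counter(col).most_common(1)[0][0] for col in zip(*records)]


    cluster = [
        ("red", "small", "round"),
        ("red", "large", "round"),
        ("blue", "small", "round"),
    ]
    mode = column_modes(cluster)
    print(mode)
    # Distance of every record to the cluster mode
    print([count_mismatches(rec, mode) for rec in cluster])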

Implemented are:

- k-modes [HUANG97]_ [HUANG98]_
- k-modes with initialization based on density [CAO09]_
- k-prototypes [HUANG97]_
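
For intuition about the density-based initialization, the sketch below
computes an average density per record: for each attribute, the share of
records that carry the same category, averaged over all attributes. This is
a simplified reading of the density idea in [CAO09]_; the function name is
hypothetical and the library's implementation may differ in detail.

.. code:: python

    from collections import Counter


    def average_density(records):
        """Average, over attributes, of the share of records sharing each value."""
        n = len(records)
        n_attrs = len(records[0])
        # Frequency of every category, counted separately per attribute.
        counts = [Counter(col) for col in zip(*records)]
        return [
            sum(counts[j][rec[j]] for j in range(n_attrs)) / (n * n_attrs)
            for rec in records
        ]


    records = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "y")]
    dens = average_density(records)
    # The densest record is a natural candidate for the first initial centroid.
    first = records[dens.index(max(dens))]
    print(dens, first)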

The code is modeled after the clustering algorithms in :code:`scikit-learn`
and has the same familiar interface.

I would love to have more people play around with this and give me
feedback on my implementation. If you come across any issues in running or
installing kmodes,
`please submit a bug report <https://github.com/nicodv/kmodes/issues>`_.

Enjoy!

Installation
------------

:code:`kmodes` can be installed using :code:`pip`:

.. code:: bash

    pip install kmodes

To upgrade to the latest version (recommended), run it like this:

.. code:: bash

    pip install --upgrade kmodes

:code:`kmodes` can also conveniently be installed with :code:`conda` from the :code:`conda-forge` channel:

.. code:: bash

    conda install -c conda-forge kmodes

Alternatively, you can build the latest development version from source:

.. code:: bash

    git clone https://github.com/nicodv/kmodes.git
    cd kmodes
    pip install .

Usage
-----

.. code:: python

    import numpy as np
    from kmodes.kmodes import KModes

    # random categorical data
    data = np.random.choice(20, (100, 10))

    km = KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)

    clusters = km.fit_predict(data)

    # Print the cluster centroids
    print(km.cluster_centroids_)

The examples directory showcases simple use cases of both k-modes
('soybean.py') and k-prototypes ('stocks.py').

Parallel execution
------------------

The k-modes and k-prototypes implementations both offer support for
multiprocessing via the joblib library,
similar to e.g. scikit-learn's implementation of k-means, using the
:code:`n_jobs` parameter. It generally does not make sense to set more jobs
than there are processor cores available on your system.

This can speed up any run with more than one initialization attempt
(:code:`n_init > 1`), which may help reduce the execution time for
larger problems. Whether multiprocessing actually helps depends on your
problem, so be sure to try it out first. You can check out the
examples for some benchmarks.
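
One way to choose a value for :code:`n_jobs` is sketched below, assuming
that the parallelism is spread over initialization runs, so that more jobs
than :code:`n_init` or than processor cores bring no benefit. The helper
:code:`pick_n_jobs` is hypothetical and not part of :code:`kmodes`.

.. code:: python

    import os


    def pick_n_jobs(n_init, n_cores=None):
        """Cap the number of parallel jobs at both n_init and the core count."""
        if n_cores is None:
            n_cores = os.cpu_count() or 1  # cpu_count() can return None
        return max(1, min(n_init, n_cores))


    n_init = 8
    n_jobs = pick_n_jobs(n_init)
    print(n_jobs)
    # The result could then be passed along, e.g.:
    # KModes(n_clusters=4, init='Huang', n_init=n_init, n_jobs=n_jobs)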

FAQ
---

**Q: I'm seeing errors such as "TypeError: '<' not supported between instances of 'str' and 'float'"
when using the k-prototypes algorithm.**

A: One or more of your numerical feature columns have string values in them. Make sure that all 
columns have consistent data types.
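
A quick way to locate such columns before clustering is sketched below in
plain Python. The helper name and the sample rows are made up for
illustration; with a pandas DataFrame you would inspect the column dtypes
instead.

.. code:: python

    def mixed_type_columns(rows):
        """Return {column index: sorted type names} for columns mixing types."""
        mixed = {}
        for idx, column in enumerate(zip(*rows)):
            type_names = sorted({type(value).__name__ for value in column})
            if len(type_names) > 1:
                mixed[idx] = type_names
        return mixed


    rows = [
        (1.5, "a", 10.0),
        (2.0, "b", "n/a"),  # a stray string in a numerical column
        (0.7, "a", 12.5),
    ]
    print(mixed_type_columns(rows))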

----

**Q: How does k-prototypes know which of my features are numerical and which are categorical?**

A: You tell it which column indices are categorical using the :code:`categorical` argument. All others are assumed numerical. E.g., :code:`clusters = KPrototypes().fit_predict(X, categorical=[1, 2])`
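
If your data has named columns, the indices can be derived from the names.
The following is a small sketch; the helper and the column names are
hypothetical.

.. code:: python

    def categorical_indices(columns, categorical_names):
        """Map column names to the positional indices of the categorical ones."""
        wanted = set(categorical_names)
        missing = wanted - set(columns)
        if missing:
            raise KeyError(f"unknown columns: {sorted(missing)}")
        return [i for i, name in enumerate(columns) if name in wanted]


    columns = ["price", "sector", "country", "volume"]
    cat_idx = categorical_indices(columns, ["sector", "country"])
    print(cat_idx)
    # cat_idx could then be passed as the `categorical` argument, e.g.:
    # KPrototypes(n_clusters=3).fit_predict(X, categorical=cat_idx)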

----

**Q: I'm getting the following error, what gives? "ModuleNotFoundError: No module named 'kmodes.kmodes'; 'kmodes' is not a package".**

A: Make sure your working file is not called 'kmodes.py', because it would shadow the :code:`kmodes` package on import.
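
To see which file Python would actually load for a given name, the standard
library's :code:`importlib.util.find_spec` can help. The two helpers below
are illustrative only. A local 'kmodes.py' would show up as a single file
rather than as a package.

.. code:: python

    import importlib.util


    def module_origin(name):
        """Return the file a top-level module would be loaded from, or None."""
        spec = importlib.util.find_spec(name)
        return None if spec is None else spec.origin


    def is_package(name):
        """True if the name resolves to a package rather than a single file."""
        spec = importlib.util.find_spec(name)
        return spec is not None and spec.submodule_search_locations is not None


    # Demonstrated on the standard library so the snippet runs anywhere;
    # substitute "kmodes" to inspect your own environment.
    print(module_origin("json"), is_package("json"))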

----

**Q: I'm getting the following error: "ValueError: Clustering algorithm could not initialize. Consider assigning the initial clusters manually."**

A: This is a feature, not a bug. :code:`kmodes` is telling you that it cannot make sense of the data you are presenting, at least not with the parameters you have chosen. It is up to you, the data scientist, to figure out why. Some hints toward possible solutions:

- Run with fewer clusters as the data might not support a large number of clusters
- Explore and visualize your data, checking for weird distributions, outliers, etc.
- Clean and normalize the data
- Increase the ratio of rows to columns
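
A few of these hints can be turned into quick checks before fitting. The
sketch below is a heuristic of my own, not something :code:`kmodes` runs
internally: it reports the ratio of rows to columns and whether there are
at least as many distinct rows as requested clusters.

.. code:: python

    def initialization_sanity(rows, n_clusters):
        """Summarize simple facts that relate to the hints above."""
        n_rows = len(rows)
        n_cols = len(rows[0]) if rows else 0
        n_unique = len({tuple(row) for row in rows})
        return {
            "rows_per_column": n_rows / n_cols if n_cols else 0.0,
            "unique_rows": n_unique,
            # Fewer distinct rows than clusters leaves some clusters
            # without a distinct starting point.
            "enough_unique_rows": n_unique >= n_clusters,
        }


    rows = [("a", "x"), ("a", "x"), ("b", "y"), ("a", "x")]
    print(initialization_sanity(rows, n_clusters=3))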

----

**Q: I'm getting the following error: "ValueError: Input contains NaN, infinity, or a value too large for dtype('float64')."**

A: Following scikit-learn, the k-modes algorithm does not accept :code:`np.nan`
values in the :code:`X` matrix. Users are advised to fill in the missing
data in a way that makes sense for the problem at hand.
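
For categorical columns, one option among several is to treat missingness
as a category of its own. The sketch below does this in plain Python; the
helper name and the placeholder label are assumptions, and whether this
choice is appropriate depends on your data.

.. code:: python

    import math


    def fill_missing(rows, placeholder="missing"):
        """Replace None and float NaN entries with an explicit category label."""
        def is_missing(value):
            return value is None or (
                isinstance(value, float) and math.isnan(value)
            )

        return [
            tuple(placeholder if is_missing(v) else v for v in row)
            for row in rows
        ]


    rows = [("red", "small"), (float("nan"), "large"), ("blue", None)]
    print(fill_missing(rows))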

----

**Q: How would you like your library to be cited?**

A: Something along these lines would do nicely:

.. code-block::

  @Misc{devos2015,
    author = {Nelis J. de Vos},
    title = {kmodes categorical clustering library},
    howpublished = {\url{https://github.com/nicodv/kmodes}},
    year = {2015--2024}
  }


References
----------

.. [HUANG97] Huang, Z.: Clustering large data sets with mixed numeric and
   categorical values, Proceedings of the First Pacific Asia Knowledge
   Discovery and Data Mining Conference, Singapore, pp. 21-34, 1997.

.. [HUANG98] Huang, Z.: Extensions to the k-modes algorithm for clustering
   large data sets with categorical values, Data Mining and Knowledge
   Discovery 2(3), pp. 283-304, 1998.

.. [CAO09] Cao, F., Liang, J., Bai, L.: A new initialization method for
   categorical data clustering, Expert Systems with Applications 36(7),
   pp. 10223-10228, 2009.

Owner

  • Name: Nico de Vos
  • Login: nicodv
  • Kind: user
  • Location: United States
  • Company: Salesforce

GitHub Events

Total
  • Watch event: 44
  • Issue comment event: 1
  • Fork event: 4
Last Year
  • Watch event: 44
  • Issue comment event: 1
  • Fork event: 4

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 479
  • Total Committers: 22
  • Avg Commits per committer: 21.773
  • Development Distribution Score (DDS): 0.376
Past Year
  • Commits: 4
  • Committers: 1
  • Avg Commits per committer: 4.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Nico de Vos n****s@g****m 299
nicodv n****s@a****m 99
Nico de Vos n****s@a****m 16
kklein k****u@g****m 12
BikashPandey17 p****8@g****m 9
Robin Hes r****s@o****m 8
Ben Andow b****w@n****u 7
Henry Wilde h****e@g****m 6
Ben Andow b****w@B****l 5
BikashPandey17 B****Y@s****m 3
lukeg l****z@g****m 2
Pedro Larroy p****s@g****m 2
Genie-liu f****s@1****m 2
Jeason Liu 3****u 1
nilkeshpatra 4****a 1
Dmitri Tchebotarev d****v@i****m 1
Fei f****n@j****m 1
Harish B d****h@o****m 1
Ian Warrington i****n@m****m 1
Rebecca r****s@g****m 1
j-hurwitz j****i@e****g 1
trevorstephens t****s@g****m 1

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 80
  • Total pull requests: 27
  • Average time to close issues: 6 months
  • Average time to close pull requests: 3 months
  • Total issue authors: 70
  • Total pull request authors: 16
  • Average comments per issue: 2.59
  • Average comments per pull request: 1.85
  • Merged pull requests: 21
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • nicodv (6)
  • regorsmitz (3)
  • KhatriVivek (2)
  • wwjd1234 (2)
  • xinruxchen (2)
  • ori-katz100 (1)
  • tathianabarreto (1)
  • patryk-kowalski95 (1)
  • imamilion (1)
  • daffidwilde (1)
  • Francis-oa7 (1)
  • eli232323 (1)
  • mrlonely001 (1)
  • asmitakulkarni (1)
  • crixus5678 (1)
Pull Request Authors
  • nicodv (9)
  • ManuelConcepcion (2)
  • daffidwilde (2)
  • kklein (2)
  • joaquin-tempelsman (2)
  • kristina100 (1)
  • rggelles (1)
  • trevorstephens (1)
  • SatishDivakarla (1)
  • b-harish (1)
  • j-hurwitz (1)
  • larroy (1)
  • Genie-Liu (1)
  • BikashPandey17 (1)
  • AGMortimer (1)
Top Labels
Issue Labels
question (21) bug (20) enhancement (18) expected behavior (2) easy (2) performance (1) difficult (1) releases (1)
Pull Request Labels

Packages

  • Total packages: 4
  • Total downloads:
    • pypi 155,813 last-month
  • Total docker downloads: 1,476
  • Total dependent packages: 13
    (may contain duplicates)
  • Total dependent repositories: 297
    (may contain duplicates)
  • Total versions: 24
  • Total maintainers: 2
pypi.org: kmodes

Python implementations of the k-modes and k-prototypes clustering algorithms for clustering categorical data.

  • Versions: 16
  • Dependent Packages: 12
  • Dependent Repositories: 291
  • Downloads: 155,813 Last month
  • Docker Downloads: 1,476
Rankings
Downloads: 0.7%
Dependent repos count: 0.9%
Dependent packages count: 1.1%
Docker downloads count: 1.4%
Average: 1.4%
Stargazers count: 1.9%
Forks count: 2.6%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: kmodes
  • Versions: 6
  • Dependent Packages: 1
  • Dependent Repositories: 3
Rankings
Forks count: 8.5%
Stargazers count: 11.9%
Average: 16.9%
Dependent repos count: 18.1%
Dependent packages count: 29.0%
Last synced: 6 months ago
spack.io: py-kmodes

Python implementations of the k-modes and k-prototypes clustering algorithms for clustering categorical data.

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 0.0%
Forks count: 4.6%
Stargazers count: 6.9%
Average: 17.2%
Dependent packages count: 57.3%
Maintainers (1)
Last synced: 6 months ago
anaconda.org: kmodes

Python implementations of the k-modes and k-prototypes clustering algorithms for clustering categorical data.

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 3
Rankings
Forks count: 16.6%
Stargazers count: 22.0%
Average: 34.1%
Dependent repos count: 46.6%
Dependent packages count: 51.2%
Last synced: 6 months ago

Dependencies

.github/workflows/python-package.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
setup.py pypi
  • joblib >=0.11
  • numpy >=1.10.4
  • scikit-learn >=0.22.0
  • scipy >=0.13.3