mixdir

Cluster high dimensional categorical datasets

https://github.com/const-ae/mixdir

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: ieee.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.9%) to scientific vocabulary

Keywords

categorical-data clustering questionnaires r-package variational-inference
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: const-ae
  • Language: R
  • Default Branch: master
  • Size: 336 KB
Statistics
  • Stars: 15
  • Watchers: 3
  • Forks: 0
  • Open Issues: 0
  • Releases: 3
Created about 8 years ago · Last pushed over 2 years ago
Metadata Files
Readme

README.Rmd

---
output:
  md_document:
    variant: markdown_github
---



```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README_plots/"
)
```

# mixdir

The goal of mixdir is to cluster high-dimensional categorical datasets.

It can

* handle missing data
* infer a reasonable number of latent classes (try `mixdir(select_latent=TRUE)`)
* cluster datasets with more than 70,000 observations and 60 features
* propagate uncertainty and produce a soft clustering


A detailed description of the algorithm and the features of the package can 
be found in the accompanying [paper](http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8631438&isnumber=8631391).
If you find the package useful, please cite:

>C. Ahlmann-Eltze and C. Yau, "MixDir: Scalable Bayesian Clustering for High-Dimensional Categorical Data", 
2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), Turin, Italy, 2018, pp. 526-539.

## Installation

```{r installation, eval=FALSE, include=TRUE}
install.packages("mixdir")

# Or to get the latest version from GitHub
devtools::install_github("const-ae/mixdir")
```


## Example

Clustering the [mushroom](https://archive.ics.uci.edu/ml/datasets/mushroom) data set.

![](man/figures/README_plots/clustering_overview.png)

```{r example_load}
# Loading the library and the data
library(mixdir)
set.seed(1)

data("mushroom")
# High-dimensional dataset: 8124 mushrooms and 23 different features
mushroom[1:10, 1:5]
```

Calling the clustering function `mixdir` on a subset of the data:

```{r}
# Clustering into 3 latent classes
result <- mixdir(mushroom[1:1000,  1:5], n_latent=3)
```


Analyzing the result:

```{r example}
# Latent class of the first 10 mushrooms
head(result$pred_class, n=10)

# Soft clustering for the first 10 mushrooms
head(result$class_prob, n=10)
pheatmap::pheatmap(result$class_prob, cluster_cols=FALSE,
                  labels_col = paste("Class", 1:3))

# Structure of latent class 1
# (bruises, cap color either yellow or white, edible etc.)
purrr::map(result$category_prob, 1)

# The most predictive features for each class
find_predictive_features(result, top_n=3)
# For example: if all I know about a mushroom is that it has a
# yellow cap, then I am 99% certain that it will be in class 1
predict(result, c(`cap-color`="yellow"))

# Note the most predictive features are different from the most typical ones
find_typical_features(result, top_n=3)
```
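
To make the difference between the soft and the hard clustering concrete, here is a minimal base R sketch on a hypothetical probability matrix (not output from the package). It assumes that a hard assignment like `pred_class` corresponds to the most probable column of a matrix like `class_prob`, and it uses the Shannon entropy of each row as one possible measure of assignment uncertainty.

```r
# Hypothetical soft clustering for 4 observations and 3 classes
class_prob <- matrix(c(0.90, 0.05, 0.05,
                       0.10, 0.80, 0.10,
                       0.34, 0.33, 0.33,
                       0.00, 0.00, 1.00),
                     ncol = 3, byrow = TRUE)

# Each row is a probability distribution over the classes
stopifnot(all(abs(rowSums(class_prob) - 1) < 1e-8))

# Hard assignment: index of the most probable class in each row
hard_class <- apply(class_prob, 1, which.max)

# Uncertainty: Shannon entropy per row (0 = certain, log(3) = uniform)
row_entropy <- apply(class_prob, 1, function(p) {
  p <- p[p > 0]       # treat 0 * log(0) as 0
  -sum(p * log(p))
})

data.frame(hard_class, entropy = round(row_entropy, 2))
```

The third row shows why the soft clustering is worth keeping: it is assigned to class 1, but with almost no margin over the other two classes.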

Dimensionality reduction:

```{r fig.width=8, fig.asp=0.31}
# Defining Features
def_feat <- find_defining_features(result, mushroom[1:1000,  1:5], n_features = 3)
print(def_feat)

# Plotting the most important features gives an immediate impression
# of how the clusters differ
plot_features(def_feat$features, result$category_prob)
```



# Underlying Model

The package implements a variational inference algorithm to fit a Bayesian latent class model (LCM).


![](man/figures/README_plots/model_plate_notation.png)
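
For readers unfamiliar with latent class models, the following base R sketch simulates data from a generic LCM with Dirichlet priors. It is meant only to convey the generative idea: every observation belongs to one hidden class, and each class has its own category distribution per feature. The symbol names (`lambda`, `U`, `z`), the sizes, and the prior concentrations are assumptions chosen for illustration; the exact priors used by the package are described in the paper.

```r
set.seed(42)

# Draw one sample from a Dirichlet distribution via normalised gamma variates
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

n_obs <- 500     # observations
n_latent <- 3    # latent classes
n_feat <- 4      # categorical features
n_cat <- 3       # categories per feature

# Mixture weights over the latent classes
lambda <- rdirichlet1(rep(1, n_latent))

# Category probabilities: U[[j]] is an n_latent x n_cat matrix whose
# row k is the category distribution of feature j in class k
U <- lapply(seq_len(n_feat), function(j) {
  t(sapply(seq_len(n_latent), function(k) rdirichlet1(rep(0.5, n_cat))))
})

# Hidden class of every observation
z <- sample(n_latent, n_obs, replace = TRUE, prob = lambda)

# Observed categories, drawn feature by feature given the hidden class
X <- sapply(seq_len(n_feat), function(j) {
  vapply(z, function(k) sample(n_cat, 1, prob = U[[j]][k, ]), integer(1))
})
X <- as.data.frame(lapply(as.data.frame(X), factor, levels = seq_len(n_cat)))

head(X)
```

Inference runs in the opposite direction: given only `X`, the algorithm approximates the posterior over `z`, `lambda`, and `U`.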

Owner

  • Name: Constantin
  • Login: const-ae
  • Kind: user
  • Location: Heidelberg, Germany
  • Company: EMBL

PhD Student, Biostats, R

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 61
  • Total Committers: 2
  • Avg Commits per committer: 30.5
  • Development Distribution Score (DDS): 0.016
Past Year
  • Commits: 1
  • Committers: 1
  • Avg Commits per committer: 1.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
const-ae a****5@g****m 60
Constantin c****e 1

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 2
  • Total pull requests: 0
  • Average time to close issues: 1 day
  • Average time to close pull requests: N/A
  • Total issue authors: 2
  • Total pull request authors: 0
  • Average comments per issue: 5.5
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • jmelero611 (1)
  • nandobonf (1)

Packages

  • Total packages: 1
  • Total downloads:
    • cran 207 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 3
  • Total maintainers: 1
cran.r-project.org: mixdir

Cluster High Dimensional Categorical Datasets

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 207 Last month
Rankings
Stargazers count: 15.1%
Forks count: 28.8%
Dependent packages count: 29.8%
Average: 35.2%
Dependent repos count: 35.5%
Downloads: 66.9%
Maintainers (1)
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 2.10 depends
  • Rcpp * imports
  • extraDistr * imports
  • dplyr * suggests
  • ggplot2 * suggests
  • mcclust * suggests
  • pheatmap * suggests
  • purrr * suggests
  • rmutil * suggests
  • testthat * suggests
  • tibble * suggests
  • tidyr * suggests
  • utils * suggests