Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to: ieee.org)
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.9%) to scientific vocabulary
Keywords
categorical-data
clustering
questionnaires
r-package
variational-inference
Repository
Cluster high dimensional categorical datasets
Basic Info
- Host: GitHub
- Owner: const-ae
- Language: R
- Default Branch: master
- Size: 336 KB
Statistics
- Stars: 15
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Releases: 3
Created about 8 years ago
· Last pushed over 2 years ago
Metadata Files
Readme
README.Rmd
---
output:
md_document:
variant: markdown_github
---
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README_plots/"
)
```
# mixdir
The goal of mixdir is to cluster high dimensional categorical datasets.
It can
* handle missing data
* infer a reasonable number of latent classes (try `mixdir(select_latent=TRUE)`)
* cluster datasets with more than 70,000 observations and 60 features
* propagate uncertainty and produce a soft clustering
A detailed description of the algorithm and the features of the package can
be found in the accompanying [paper](http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8631438&isnumber=8631391).
If you find the package useful please cite
>C. Ahlmann-Eltze and C. Yau, "MixDir: Scalable Bayesian Clustering for High-Dimensional Categorical Data",
2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), Turin, Italy, 2018, pp. 526-539.
## Installation
```{r installation, eval=FALSE, include=TRUE}
install.packages("mixdir")
# Or to get the latest version from github
devtools::install_github("const-ae/mixdir")
```
## Example
Clustering the [mushroom](https://archive.ics.uci.edu/ml/datasets/mushroom) data set.

```{r example_load}
# Loading the library and the data
library(mixdir)
set.seed(1)
data("mushroom")
# High dimensional dataset: 8124 mushrooms and 23 different features
mushroom[1:10, 1:5]
```
Calling the clustering function `mixdir` on a subset of the data:
```{r}
# Clustering into 3 latent classes
result <- mixdir(mushroom[1:1000, 1:5], n_latent=3)
```
Analyzing the result
```{r example}
# Latent class of the first 10 mushrooms
head(result$pred_class, n=10)
# Soft Clustering for first 10 mushrooms
head(result$class_prob, n=10)
pheatmap::pheatmap(result$class_prob, cluster_cols=FALSE,
labels_col = paste("Class", 1:3))
# Structure of latent class 1
# (bruises, cap color either yellow or white, edible etc.)
purrr::map(result$category_prob, 1)
# The most predictive features for each class
find_predictive_features(result, top_n=3)
# For example: if all I know about a mushroom is that it has a
# yellow cap, then I am 99% certain that it will be in class 1
predict(result, c(`cap-color`="yellow"))
# Note the most predictive features are different from the most typical ones
find_typical_features(result, top_n=3)
```
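The relationship between a soft clustering and a hard class assignment can be illustrated without the package. The sketch below uses a made-up probability matrix in the same shape as `result$class_prob` (one row per observation, one column per latent class). It assumes that a hard label is simply the most probable class per row; whether `result$pred_class` is computed exactly this way is an assumption, not something stated above.

```r
# Toy soft clustering: rows are observations, columns are latent classes.
# All numbers are made up for illustration.
class_prob <- matrix(c(0.90, 0.05, 0.05,
                       0.10, 0.70, 0.20,
                       0.30, 0.30, 0.40),
                     ncol = 3, byrow = TRUE,
                     dimnames = list(NULL, paste("Class", 1:3)))

# Each row is a probability distribution over the classes
stopifnot(all(abs(rowSums(class_prob) - 1) < 1e-12))

# Hard assignment (assumed rule): the most probable class per observation
hard_class <- max.col(class_prob, ties.method = "first")

# Entropy of each row as a measure of assignment uncertainty
# (0 means fully certain, log(3) is the maximum for three classes)
entropy <- -rowSums(class_prob * log(class_prob))
```

Rows with high entropy are the observations for which a hard label hides the most uncertainty, which is the information a soft clustering preserves.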
Dimensionality Reduction
```{r fig.width=8, fig.asp=0.31}
# Defining Features
def_feat <- find_defining_features(result, mushroom[1:1000, 1:5], n_features = 3)
print(def_feat)
# Plotting the most important features gives an immediate impression
# of how the clusters differ
plot_features(def_feat$features, result$category_prob)
```
# Underlying Model
The package implements a variational inference algorithm to fit a Bayesian latent class model (LCM).
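To make the structure of a latent class model concrete, here is a minimal base R sketch. It is not the package's implementation: the parameters `lambda` and `U`, the helper names `simulate_one` and `posterior`, and all numbers are hypothetical, the parameters are treated as known instead of being inferred, and the priors and variational updates are left out. Skipping `NA` features in the likelihood is likewise an assumption about one simple way to handle missing values.

```r
set.seed(42)

# Hypothetical parameters for 2 latent classes and 2 categorical features
lambda <- c(0.6, 0.4)  # mixing proportions of the latent classes
U <- list(
  color = rbind(c(yellow = 0.8, brown = 0.2),   # class 1
                c(yellow = 0.1, brown = 0.9)),  # class 2
  odor  = rbind(c(none = 0.7, foul = 0.3),      # class 1
                c(none = 0.2, foul = 0.8))      # class 2
)

# Generative process: draw a class, then draw each feature
# independently given that class
simulate_one <- function() {
  k <- sample(length(lambda), 1, prob = lambda)
  vapply(U, function(u) sample(colnames(u), 1, prob = u[k, ]), character(1))
}

# Posterior class probabilities for one observation via Bayes' rule.
# Features that are NA do not contribute to the likelihood.
posterior <- function(x) {
  lik <- lambda
  for (j in names(x)) {
    if (!is.na(x[[j]])) lik <- lik * U[[j]][, x[[j]]]
  }
  lik / sum(lik)
}

x <- simulate_one()
p <- posterior(c(color = "yellow", odor = NA))
```

Calling `posterior()` with a single known feature mirrors the idea behind `predict(result, c(`cap-color`="yellow"))` in the example: the class probabilities are updated using only the information that is available.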

Owner
- Name: Constantin
- Login: const-ae
- Kind: user
- Location: Heidelberg, Germany
- Company: EMBL
- Website: https://twitter.com/const_ae
- Repositories: 64
- Profile: https://github.com/const-ae
PhD Student, Biostats, R
Committers
Last synced: over 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| const-ae | a****5@g****m | 60 |
| Constantin | c****e | 1 |
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 2
- Total pull requests: 0
- Average time to close issues: 1 day
- Average time to close pull requests: N/A
- Total issue authors: 2
- Total pull request authors: 0
- Average comments per issue: 5.5
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- jmelero611 (1)
- nandobonf (1)
Packages
- Total packages: 1
- Total downloads: 207 last month (CRAN)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 3
- Total maintainers: 1
cran.r-project.org: mixdir
Cluster High Dimensional Categorical Datasets
- Homepage: https://github.com/const-ae/mixdir
- Documentation: http://cran.r-project.org/web/packages/mixdir/mixdir.pdf
- License: GPL-3
- Latest release: 0.3.0 (published over 6 years ago)
Rankings
Stargazers count: 15.1%
Forks count: 28.8%
Dependent packages count: 29.8%
Average: 35.2%
Dependent repos count: 35.5%
Downloads: 66.9%
Maintainers (1)
Dependencies
DESCRIPTION
cran
- R >= 2.10 depends
- Rcpp * imports
- extraDistr * imports
- dplyr * suggests
- ggplot2 * suggests
- mcclust * suggests
- pheatmap * suggests
- purrr * suggests
- rmutil * suggests
- testthat * suggests
- tibble * suggests
- tidyr * suggests
- utils * suggests