DiscreteGapStatistic

https://github.com/ecortesgomez/discretegapstatistic
Last synced: 10 months ago · JSON representation
Repository

Basic Info

Host: GitHub
Owner: ecortesgomez
License: other
Language: R
Default Branch: main
Size: 5.72 MB
Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0
Created over 2 years ago · Last pushed 12 months ago
Metadata Files

Readme License
README.Rmd

---
output: github_document
always_allow_html: true
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# DiscreteGapStatistic




DiscreteGapStatistic estimates the number of clusters from (multiple choice) categorical response format data given any clustering algorithm extending the well-known gap statistic using a discrete distance based approach.

## Installation

You can install the development version of DiscreteGapStatistic from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("ecortesgomez/DiscreteGapStatistic")
```

# Example: Math Anxiety Data

## Summarization and Data Exploration

Basic usage of `DiscreteGapStatistic` is shown using `likert`'s Math Anxiety data.

```{r example}
library(DiscreteGapStatistic)
library(dplyr)
library(ggplot2)
```

Libraries are uploaded. Questions are lightly reformatted and the categories are shortened.

```{r LoadData, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
## mass dataset is loaded automatically with DiscreteGapStatistic

data(mass)

massSh <- mass[, -1]

sID <- substring(mass$Gender, 1, 1) ## Removing Gender variable
sID[sID == 'F'] <- paste0(sID[sID == 'F'], 1:sum(sID == 'F'))
sID[sID == 'M'] <- paste0(sID[sID == 'M'], 1:sum(sID == 'M'))
rownames(massSh) <- sID

Cats <- setNames(c('SD', 'D', 'N', 'A', 'SA'),
   c('Strongly Disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly Agree') )

massSh <- data.frame(apply(massSh, 2,
   function(x) Cats[as.character(x)] %>%
   factor(levels = Cats)),
   check.names=FALSE, row.names = sID)
colnames(massSh) <- c('Q1: Math Interesting', 'Q2: Uptight Math Test',
   'Q3: Use Math In Future', 'Q4: Mind Goes Blank', 'Q5: Math Relates to Life',
   'Q6: Ability Solve Math Probls', 'Q7: Sinking Feeling', 'Q8: Math Challenging',
   'Q9: Math Makes Me Nervous', 'Q10: Take More Math Classes',
   'Q11: Math Makes Me Uneasy','Q12: Math Favorite Subj.',
   'Q13: Enjoy Learning Math', 'Q14: Math Makes Feel Confused')
```

## Likert Heatmap {#LikertHeat}

The dataset is visualized using a heatmap similar to the one produced by `likert::likert.heatmap.plot`.

```{r LikertHeat,  out.width=800, dpi=500, fig.retina=NULL, echo=TRUE, fig.align = 'center'}
likert.heat.plot2(massSh,
                  allLevels = Cats,
                  text.size = 1.5)+
   labs(title = 'Math Anxiety Data Likert Heatmap Summary')+
   theme(axis.text = element_text(size = 6), 
         title = element_text(size = 6))
```

## Distance Matrices 

Five categorical distance functions are introduced to quantify dissimilarities/discrepancies between two categorical vectors (see function `distancematrix`). The following table describes the available distances and the names used within the package.

```{r Distances, fig.align = 'center', echo=FALSE}
library(knitr)
library(kableExtra)

data.frame(Distance = c('Hamming', 'chi-square', "Cramer's V", "Hellinger", 'Bhattacharyya'), 
           Name = c('hamming', 'chisquare', 'cramerV', 'hellinger', 'bhattacharyya')) %>%
   kbl() %>%
   kable_paper(full_width = FALSE) %>%
   column_spec(column = 1, border_left = TRUE) %>%
   column_spec(column = 2, monospace = TRUE, border_left = TRUE, border_right=TRUE)
```

The resulting distance matrix from a rectangular dataset can easily be displayed and organized using heatmaps. The following plots visualize the `massData` using the defined distances with the function `distanceHeat`. This function only requires the dataset object and the name of the distance. It can also take advantage of the options and functionalities available in the `pheatmap` function (see the code below) from the `pheatmap` R package [@kolde2019]. For instance, by default, the columns are clustered using `clustering_method = 'complete'` but this parameter can be specified according to `pheatmap`'s options. Different aesthetic options are exemplified in the plots below using the introduced distances. 

```{r DistanceMats1, out.width=360, dpi=500, echo=TRUE, results='hold', fig.show="hold"}
distanceHeat(x = massSh, 
             distName = 'hamming',
             main = 'Hamming Distance\nMath Anxiety Data', 
             fontsize = 6.5)

distanceHeat(x = massSh, 
             distName = 'cramerV',
             main = "Cramer's V Distance\nMath Anxiety Data", 
             fontsize = 6.5, 
             show_rownames = FALSE, 
             cluster_rows = FALSE, 
             cluster_cols=FALSE)
```

```{r DistanceMats2, out.width=360, dpi=500, echo=TRUE, results='hold', fig.show="hold"}
distanceHeat(x = massSh, 
             distName = 'hellinger',
             main = 'Hellinger Distance\nMath Anxiety Data', 
             fontsize = 6.5, 
             show_rownames = TRUE, 
             border_color = 'black', 
             cluster_rows = FALSE, 
             cluster_cols=FALSE)

distanceHeat(x = massSh, 
             distName = 'bhattacharyya',
             main = 'Bhattacharyya Distance\nMath Anxiety Data', 
             fontsize = 6.5, 
             show_rownames = TRUE, 
             border_color = 'lightgrey', 
             cluster_rows = FALSE, 
             cluster_cols=FALSE)
```

# `clusGapDiscr`: The Gap Statistic for discrete data {#chiclust}

The Gap Statistic (GS) in its original formulation by Tibshirani et. al aims to determine an optimal number of clusters given a rectangular dataset with continuous columns and a pre-specified clustering method given $k$ number of clusters (i.e. $k$-means clustering or partitioning around medoids (pam)). This is done by comparing the within-cluster dispersion of the data to a randomly generated reference null distribution of the data obtaining estimations via bootstrapping. The idea is to quantify whether the clustering structure observed in the data is considerably different compared to what would be expected by random chance. `clusGap` is a widely used implementation of the GS from the `cluster` package [@maechler2023cluster] with the following basic arguments:

```{r clusGap,  fig.align = 'center', echo=TRUE, eval=FALSE}
clusGap(x, 
        FUNcluster, 
        K.max, 
        B = 100, 
        verbose = interactive(), 
        ...)
```

- `x`: argument can be `data.frame` or `matrix` object where rows represent observations and columns correspond to data features or variables. This matrix is expected to be contain exclusively numerical input. 
- `FUNcluster`: clustering function accepting on its first argument a rectangular data object (like `x`) and a second one specifying a desired `k` number of clusters.  
- `K.max`: maximum number of clusters to consider.  
- `B`: integer indicating the number of times to perform the Montecarlo repetitions. 

The discrete GS (dGS) proposed applies and adapts the principle behind the original formulation, but for categorical data using a distance-based approach. The GS assumes that the distance between observations can be correctly described with the Euclidean distance, since it assumes the data is continuous. In the case of categorical data, this distance assumption is not appropriate and a different class of distances needs to be defined. `clusGapDiscr` is the main function that performs and implements dGS. Parameters `x`, `K.max` and `B` found in `clusGapDiscr` have the same meaning and usage to the ones found in `clusGap`. `clusterFUN` expects the name of a well-known clustering algorithm implementation available in `R` amenable to the proposed methodology. The input options available for this parameter are the following: 

```{r clusterFUNopt, echo=FALSE}
data.frame(clusterFUN = c('pam', 'fanny', 'diana', 
                          'agnes-\\{average, single, complete, ward, weighted\\}', 
                          'hclust-\\{average, single, complete, ward.D, ward.D2,  mcquitty, median, centroid\\}', 
                          'kmodes-\\{1, 2, ..., \\}'),
           Package = c(rep('cluster', 4), 'stats', 'klaR')) %>%
   kbl(caption = 'List of appliable clustering algorithms for DiscreteGapStatistic') %>%
   kable_paper(full_width = FALSE)
```

The first four options are straight-forward since originally these functions can output a clustering partition providing a distance matrix and a given number of clusters. The following two implementations use a flexible hierarchical clustering strategy. The number of desired clusters can be obtained appropriately by cutting the resulting hierarchical tree. This parameter requires the name of the implementation and a dash (-) followed by the exact name of the clustering strategy described in the package corresponding R package. Lastly, a k-modes implementation is included using code found on `klaR` with options `fast=TRUE, weighted = FALSE`. The expected string after the dash is a non-negative integer specifying the maximum number of iterations to carry out within the algorithm (`iter.max` option; inputting `kmodes` alone runs `iter.max = 10` by default). 

Additionally, the function requires a categorical distance from the list mentioned above. Another feature of the dGS is the reference null distribution used, which can be a discrete uniform distribution of the unique column-wise categories found in the data; this setting is defined as the Data Support (DS). In other settings, the categories to be used must be user-specified via a character vector or list. This setting will be referred as the Known Support (KS). 

```{r clusGapDiscr, echo=TRUE, eval=FALSE}
clusGapDiscr(x, 
             clusterFUN, 
             K.max, 
             B = nrow(x),
             distName, 
             value.range = 'DS', 
             verbose = interactive(), 
             useLog = TRUE ...)
```

- `distName`: name of discrete distance. The available options are `'hamming'`, `'chisquare'`, `'cramersV'`, `'bhattacharyya'` and `'hellinger'`.

- `value.range`: specifies the values of the reference null distribution. Possible values: `'DS'` or a character vector with unique values. `'DS'` option extracts the unique values found in `x`. 

- `useLog`: Binary truth variable specifying whether to evaluate the Gap Statistic with or without $log$ function.

Similar to `clusGap`, `clusGapDiscr` returns a matrix object with `K.max` rows corresponding to the user-specified number of clusters with an estimate of the GS and its corresponding standard error. This information would then determine the number of clusters according to the cluster-selection criterion. The one used in `DiscreteGapStatistic` is the 1-SE criterion implemented by the function `findK`, which accepts `clusGap` objects. The following example applies the dGS on the Math Anxiety data using the Hamming, $\chi^2$, and Cramer's V distances applying the `cluster::pam` algorithm. A plot is then generated displaying the dGS against the number of clusters. Each estimate is accompanied by corresponding standard-error bars and a blue dashed line indicating the chosen number of clusters.  

## Basics

The math anxiety data is further explored under different categorical distances using the `cluster::pam` algorithm. The number of bootstraps `B` is increased to `100` since the sample size for this dataset is not too large. The maximum number of clusters to consider is `9`.

```{r clusGapDiscrFit, echo=TRUE, eval=TRUE, fig.retina=NULL, out.width=250, dpi=500, echo=TRUE, results='hold', fig.show="hold"}
## Recall Cats:
# Cats <- setNames(object = c('SD', 'D', 'N', 'A', 'SA'),
#                  nm = c('Strongly Disagree', 'Disagree', 
#                         'Neutral', 
#                         'Agree', 'Strongly Agree') )

HammRun <- clusGapDiscr(x = massSh,
                        clusterFUN = 'pam',
                        B = 100,
                        K.max = 9,
                        value.range = 'DS',
                        distName = 'hamming')

chisqRun <- clusGapDiscr(x = massSh,
                        clusterFUN = 'pam',
                        B = 100,
                        K.max = 9,
                        value.range = Cats,
                        distName = 'chisquare')

crVRun <- clusGapDiscr(x = massSh,
                        clusterFUN = 'pam',
                        B = 100,
                        K.max = 9,
                        value.range = 'DS',
                        distName = 'cramerV')

plot(HammRun,
     main = "Discrete Gap statistic: Hamming Distance\nMath Anxiety Data",
     cex = 2, 
     cex.lab=1.2, 
     cex.axis=1.5, 
     cex.main=1.5)
abline(v = findK(HammRun), lty=3, lwd=2, col="Blue")

plot(chisqRun,
     main = "Discrete Gap statistic: chi-square Distance\nMath Anxiety Data",
     cex = 2, 
     cex.lab=1.2, 
     cex.axis=1.5, 
     cex.main=1.5)
abline(v = findK(chisqRun), lty=3, lwd=2, col="Blue")

plot(crVRun,
     main = "Discrete Gap statistic: Cramer's V Distance\nMath Anxiety Data",
     cex = 2, 
     cex.lab=1.2, 
     cex.axis=1.5, 
     cex.main=1.5)
abline(v = findK(crVRun), lty=3, lwd=2, col="Blue")
```

Notice that since all possible categorical values are available in the data, using the option `value.range = Cats` would yield exact same results. 

Heatmaps can be created to visualize the commonalities within each cluster and the differences between them.

```{r resHeatMaps, echo=TRUE, eval=TRUE, fig.retina=NULL,  dpi=500, fig.retina=NULL, out.width=230, dpi=500, results='hold', fig.show="hold"}

ResHeatmap(x = massSh,
           distName = 'hamming',
           clusterFUN = 'pam', 
           catVals = Cats,
           nCl = findK(HammRun),
           out = 'heatmap',
           prefObs = NULL, height = 6)

ResHeatmap(x = massSh,
           clusterFUN = 'pam', 
           distName = 'chisquare',
           catVals = Cats,
           nCl = findK(chisqRun),
           out = 'heatmap',
           prefObs = NULL, height = 6)

ResHeatmap(x = massSh,
           clusterFUN = 'pam', 
           distName = 'cramerV',
           catVals = Cats,
           nCl = findK(crVRun),
           out = 'heatmap',
           prefObs = NULL, height = 6)
```

## Re-arranging clusters

Notice that Hamming distance detects subclusters present in Cluster 2 from the $\chi^2$ distance run. To compare the similar clusters side-to-side, the parameter `nCl` can be used to reorder the clusters relative to the given ordering. This case `nCl = 2:1` will alter (invert) the order of the two clusters on the $\chi^2$ based cluster. A third heatmap is displayed relabeling the clusters from the $\chi^2$ run using the `clusterNames = 'renumber'` argument.

```{r resHeatMaps2, echo=TRUE, eval=TRUE, out.width=230, dpi=500, results='hold', fig.show="hold", message=FALSE, warning=FALSE, error=FALSE}

ResHeatmap(x = massSh,
           clusterFUN = 'pam', 
           distName = 'hamming',
           catVals = Cats,
           nCl = findK(HammRun),
           out = 'heatmap', 
           height = 6)

ResHeatmap(x = massSh,
           clusterFUN = 'pam', 
           distName = 'chisquare',
           catVals = Cats,
           nCl = 2:1,
           out = 'heatmap', 
           height = 6)

ResHeatmap(x = massSh,
           clusterFUN = 'pam', 
           distName = 'chisquare',
           catVals = Cats,
           nCl = 2:1,
           clusterNames = 'renumber',
           out = 'heatmap',
           height = 6)
```

## Alternative clustering algorithms

Other compatible clustering algorithms are considered additional to `cluster::pam` only using the Hamming distance.

### Other `cluster` algorithms: `diana` and `fanny`

```{r clusterRun, echo=TRUE, eval=TRUE, out.width=230, dpi=500, results='hold', fig.show="hold", message=FALSE, warning=FALSE, error=FALSE}

hDiaMa <- clusGapDiscr(massSh, 
                       clusterFUN = 'diana', 
                       K.max = 9, 
                       B = 100)
ResHeatmap(x = massSh,
           distName = 'hamming',
           clusterFUN = 'diana',
           catVals = Cats,
           nCl = findK(hDiaMa),
           out = 'heatmap',
           height = 6)

hFanMa <- clusGapDiscr(massSh, 
                       clusterFUN = 'fanny', 
                       K.max = 9, 
                       B = 100)
ResHeatmap(x = massSh,
           clusterFUN = 'fanny',
           distName = 'hamming',
           catVals = Cats,
           nCl = findK(hFanMa),
           out = 'heatmap', 
           height = 6)
```

### Hierarchical Clustering

```{r hclustRun, echo=TRUE, eval=TRUE, out.width=230, dpi=500, results='hold', fig.show="hold", message=FALSE, warning=FALSE, error=FALSE}
hAgnComMa <- clusGapDiscr(massSh, 
                          clusterFUN = 'agnes-complete', 
                          K.max = 9, 
                          B = 100)
ResHeatmap(x = massSh,
           distName = 'hamming',
           clusterFUN = 'agnes-complete',
           catVals = Cats,
           nCl = findK(hAgnComMa),
           out = 'heatmap',
           height = 6)

hhclComMa <- clusGapDiscr(massSh, 
                          clusterFUN = 'hclust-complete', 
                          K.max = 9, 
                          B = 100)
ResHeatmap(x = massSh,
           distName = 'hamming',
           clusterFUN = 'hclust-complete',
           catVals = Cats,
           nCl = findK(hhclComMa),
           out = 'heatmap',
           height = 6)

hhclMcqMa <- clusGapDiscr(massSh, 
                          clusterFUN = 'hclust-mcquitty', 
                          K.max = 9, 
                          B = 100)
ResHeatmap(x = massSh,
           distName = 'hamming',
           clusterFUN = 'hclust-mcquitty',
           catVals = Cats,
           nCl = findK(hhclMcqMa),
           out = 'heatmap', 
           height = 6)
```
Owner

Login: ecortesgomez
Kind: user
Repositories: 1
Profile: https://github.com/ecortesgomez
GitHub Events

Total

Push event: 19
Create event: 1
Last Year

Push event: 19
Create event: 1
Packages

Total packages: 1
Total downloads:
- cran 190 last-month
Total dependent packages: 0
Total dependent repositories: 0
Total versions: 3
Total maintainers: 1
cran.r-project.org: DiscreteGapStatistic

An Extension of the Gap Statistic for Ordinal/Categorical Data
Homepage: https://github.com/ecortesgomez/DiscreteGapStatistic
Documentation: http://cran.r-project.org/web/packages/DiscreteGapStatistic/DiscreteGapStatistic.pdf
License: MIT + file LICENSE
Latest release: 1.1.2
published over 1 year ago
Versions: 3
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 190 Last month
Rankings

Dependent packages count: 27.9%
Forks count: 29.0%
Dependent repos count: 34.4%
Stargazers count: 36.6%
Average: 43.0%
Downloads: 86.9%
Maintainers (1)

ecortesg@buffalo.edu
Last synced: 11 months ago
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science