iclustervb

https://github.com/abdalkarima/iclustervb

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: AbdalkarimA
License: other
Language: C++
Default Branch: main
Size: 80.4 MB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 2 years ago · Last pushed over 1 year ago

Metadata Files

Readme License

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "75%"
)
```

# iClusterVB


[![R-CMD-check](https://github.com/AbdalkarimA/iClusterVB/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/AbdalkarimA/iClusterVB/actions/workflows/R-CMD-check.yaml)


iClusterVB allows for fast integrative clustering and feature selection for high dimensional data.
    
__iClusterVB__ uses a variational Bayes approach. Its key features (clustering of mixed-type data, automated determination of the number of clusters, and feature selection in high-dimensional settings) address limitations of traditional clustering methods, and it offers an alternative to MCMC algorithms that is potentially faster.

## Installation

You can install iClusterVB from CRAN with:

``` r
install.packages("iClusterVB")
```

You can install the development version of iClusterVB from [GitHub](https://github.com/AbdalkarimA/iClusterVB) with:

``` r
# install.packages("devtools")
devtools::install_github("AbdalkarimA/iClusterVB")
```

## iClusterVB - The Main Function

***Mandatory arguments***

-   `mydata`: A list of length R, where R is the number of datasets,
    containing the input data.

    -   Note: For **categorical** data, `0`'s must be re-coded to
        another, non-`0` value.

-   `dist`: A vector of length R specifying the type of data or
    distribution. Options include: "gaussian" (for continuous data),
    "multinomial" (for binary or categorical data), and "poisson"
    (for count data).
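
As a minimal sketch of preparing these two arguments, the toy example below builds one continuous and one binary view and shifts the binary codes so that no `0` remains. The matrices, the `view_*` names, and the `+ 1` recoding scheme are illustrative assumptions, not part of the package; any mapping to non-zero codes would satisfy the note above.

``` r
# Toy data (hypothetical): 5 individuals, two data views
set.seed(1)
continuous_view <- matrix(rnorm(5 * 3), nrow = 5)                  # continuous features
binary_raw <- matrix(rbinom(5 * 4, size = 1, prob = 0.5), nrow = 5) # coded 0/1

# Recode so that 0 becomes 1 and 1 becomes 2 (no zeros left)
binary_recoded <- binary_raw + 1L

# `mydata` is a list with one element per data view;
# `dist` names the distribution of each view, in the same order
mydata <- list(view_1 = continuous_view, view_2 = binary_recoded)
dist <- c("gaussian", "multinomial")

# Sanity checks before fitting
stopifnot(length(mydata) == length(dist))
stopifnot(!any(binary_recoded == 0))
```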

***Optional arguments***

-   `K`: The maximum number of clusters, with a default value of 10. The
    algorithm will converge to a model with dominant clusters, removing
    redundant clusters and automating the process of determining the
    number of clusters.

-   `initial_method`: The method for the initial cluster allocation,
    which the iClusterVB algorithm will then use to determine the final
    cluster allocation. Options include "VarSelLCM" (default) for
    VarSelLCM, "random" for a random sample, "kproto"
    for k-prototypes, "kmeans" for k-means (continuous
    data only), "mclust" for mclust (continuous data only),
    or "lca" for poLCA (categorical data only).

-   `VS_method`: The feature selection method. The options are 0
    (default) for clustering without feature selection and 1 for
    clustering with feature selection.

-   `initial_cluster`: The initial cluster membership. The default is
    NULL, which uses `initial_method` for initial cluster allocation. If
    it is not NULL, it overrides any other initial values set for this
    parameter.

-   `initial_vs_prob`: The initial feature selection probability, a
    scalar. The default is NULL, which assigns a value of 0.5.

-   `initial_fit`: Initial values based on a previously fitted
    iClusterVB model (an iClusterVB object). The default is NULL.

-   `initial_omega`: Customized initial values for feature inclusion
    probabilities. The default is NULL. If the argument is not NULL, it
    overrides any other initial values set for this
    parameter. If `VS_method = 1`, `initial_omega` is a list of length
    R, and each element of the list is an array with
    `dim = c(N, p[[r]])`, where N is the sample size and `p[[r]]` is the
    number of features for dataset r, r = 1,...,R.

-   `initial_hyper_parameters`: A list of the initial hyper-parameters
    of the prior distributions for the model. The default is NULL, which
    assigns `alpha_00 = 0.001, mu_00 = 0, s2_00 = 100, a_00 = 1, b_00 = 1, kappa_00 = 1, u_00 = 1, v_00 = 1`.
    These are
    $\boldsymbol{\alpha}_0, \mu_0, s^2_0, a_0, b_0, \boldsymbol{\kappa}_0, c_0, \text{and } d_0$
    described in <https://dx.doi.org/10.2139/ssrn.4971680>.

-   `max_iter`: The maximum number of iterations for the VB algorithm.
    The default is 200.

-   `early_stop`: Whether to stop the algorithm upon convergence or to
    continue until `max_iter` is reached. Options are 1 (default) to
    stop when the algorithm converges, and 0 to stop only when
    `max_iter` is reached.

-   `per`: Print information every `per` iterations. The default is 10.

-   `convergence_threshold`: The convergence threshold for the change in
    ELBO. The default is 0.0001.
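
The shapes expected by `initial_omega` and `initial_hyper_parameters` can be sketched as follows. The sizes `N` and `p` and the starting value of 0.5 are illustrative assumptions (0.5 mirrors the documented default of `initial_vs_prob`); the hyper-parameter names and values are the defaults listed above. The final call is left commented out because it assumes `mydata` and `dist` objects with matching dimensions already exist.

``` r
# Hypothetical sizes: N individuals, R = 2 views with p[[r]] features each
N <- 6
p <- list(4, 3)
R <- length(p)

# One N x p[[r]] array of inclusion probabilities per data view
initial_omega <- lapply(p, function(p_r) array(0.5, dim = c(N, p_r)))

# Hyper-parameters spelled out explicitly (same values as the documented defaults)
hyper_parameters <- list(
  alpha_00 = 0.001, mu_00 = 0, s2_00 = 100,
  a_00 = 1, b_00 = 1, kappa_00 = 1, u_00 = 1, v_00 = 1
)

# Check the shapes: each column of the result is the dim() of one view's array
sapply(initial_omega, dim)

# Assumed usage, given suitable `mydata` and `dist`:
# fit <- iClusterVB(mydata = mydata, dist = dist, VS_method = 1,
#                   initial_omega = initial_omega,
#                   initial_hyper_parameters = hyper_parameters)
```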


## Simulated Data

We will demonstrate the clustering and feature selection performance of `iClusterVB` using a simulated dataset comprising $N = 240$ individuals and $R = 3$ data views with different data types. Two views were continuous and one was count, a setup commonly found in genomics data, where gene or mRNA expression (continuous) and DNA copy number (count) are observed. The true number of clusters ($K$) was set to 4, with balanced cluster proportions ($\pi_1 = 0.25, \pi_2 = 0.25, \pi_3 = 0.25, \pi_4 = 0.25$). Each data view consisted of $p_r = 500$ features ($r = 1, \dots, 3$), totaling $p = \sum_{r=1}^3 p_r = 1500$ features across all views. Within each view, only 50 features (10%) were relevant for clustering, while the remaining features were noise. The relevant features were distributed across clusters as described in the table below:


| Data View      | Cluster   | Distribution                     |
|----------------|-----------|----------------------------------|
| 1 (Continuous) | Cluster 1 | $\mathcal{N}(10, 1)$ (Relevant)  |
|                | Cluster 2 | $\mathcal{N}(5, 1)$ (Relevant)   |
|                | Cluster 3 | $\mathcal{N}(-5, 1)$ (Relevant)  |
|                | Cluster 4 | $\mathcal{N}(-10, 1)$ (Relevant) |
|                |           | $\mathcal{N}(0, 1)$ (Noise)      |
| 2 (Continuous) | Cluster 1 | $\mathcal{N}(-10, 1)$ (Relevant) |
|                | Cluster 2 | $\mathcal{N}(-5, 1)$ (Relevant)  |
|                | Cluster 3 | $\mathcal{N}(5, 1)$ (Relevant)   |
|                | Cluster 4 | $\mathcal{N}(10, 1)$ (Relevant)  |
|                |           | $\mathcal{N}(0, 1)$ (Noise)      |
| 3 (Count)      | Cluster 1 | $\text{Poisson}(50)$ (Relevant)  |
|                | Cluster 2 | $\text{Poisson}(35)$ (Relevant)  |
|                | Cluster 3 | $\text{Poisson}(20)$ (Relevant)  |
|                | Cluster 4 | $\text{Poisson}(10)$ (Relevant)  |
|                |           | $\text{Poisson}(2)$ (Noise)      |

Table: Distribution of relevant and noise features across clusters in each data view
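
To make the design in the table concrete, here is a rough sketch of how one continuous view and the count view could be generated in base R. This is not the authors' simulation code: placing the relevant features in the first 50 columns, ordering individuals by cluster, and the seed are all assumptions made for illustration.

``` r
set.seed(42)
N <- 240; K <- 4; p_r <- 500; n_relevant <- 50

# Balanced clusters: 60 individuals per cluster (ordering is an assumption)
cluster_true <- rep(seq_len(K), each = N / K)

# View 1 (continuous): noise features are N(0, 1)
view1 <- matrix(rnorm(N * p_r, mean = 0, sd = 1), nrow = N, ncol = p_r)
means_view1 <- c(10, 5, -5, -10)
# Relevant features: the cluster-specific mean is recycled down each column,
# so row i always receives the mean of its own cluster
view1[, seq_len(n_relevant)] <- matrix(
  rnorm(N * n_relevant, mean = means_view1[cluster_true], sd = 1),
  nrow = N
)

# View 3 (count): noise features are Poisson(2)
view3 <- matrix(rpois(N * p_r, lambda = 2), nrow = N, ncol = p_r)
rates_view3 <- c(50, 35, 20, 10)
view3[, seq_len(n_relevant)] <- matrix(
  rpois(N * n_relevant, lambda = rates_view3[cluster_true]),
  nrow = N
)

# Cluster-wise means of the relevant block should sit near the table values
tapply(rowMeans(view1[, seq_len(n_relevant)]), cluster_true, mean)
```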

The simulated dataset is included as a list in the package.

### Data pre-processing

```{r sim_data_example}
library(iClusterVB)

# Input data must be a list

dat1 <- list(gauss_1 = sim_data$continuous1_data,
             gauss_2 = sim_data$continuous2_data,
             multinomial_1 = sim_data$binary_data)

dist <- c("gaussian", "gaussian",
          "multinomial")

```

### Running the model

```{r model}
set.seed(123)
fit_iClusterVB <- iClusterVB(
  mydata = dat1,
  dist = dist,
  K = 8,
  initial_method = "VarSelLCM",
  VS_method = 1, # Variable Selection is on
  max_iter = 100,
  per = 100
)
```
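
The interplay of `max_iter`, `early_stop`, and `convergence_threshold` can be illustrated with a toy loop. This is only a sketch of the general idea, not the package's internal algorithm: `toy_elbo()` is a made-up sequence standing in for the ELBO, and comparing the absolute change between consecutive iterations against the threshold is an assumption about how convergence is judged.

``` r
# Hypothetical ELBO trajectory that increases towards 0
toy_elbo <- function(iter) -100 * exp(-0.5 * iter)

run_toy_vb <- function(max_iter = 200, early_stop = 1,
                       convergence_threshold = 1e-4) {
  elbo_prev <- -Inf
  for (iter in seq_len(max_iter)) {
    elbo <- toy_elbo(iter)
    # Converged when the ELBO barely changes between iterations
    converged <- abs(elbo - elbo_prev) < convergence_threshold
    if (early_stop == 1 && converged) break
    elbo_prev <- elbo
  }
  list(iterations = iter, elbo = elbo)
}

stopped_early <- run_toy_vb(early_stop = 1)  # stops once the change is tiny
ran_to_end    <- run_toy_vb(early_stop = 0)  # always uses all max_iter iterations
c(stopped_early$iterations, ran_to_end$iterations)
```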

### Comparing to True Cluster Membership

```{r table}
table(fit_iClusterVB$cluster, sim_data$cluster_true)
```
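
Cluster labels are arbitrary, so a good fit need not place its counts on the diagonal of this table; what matters is that each estimated cluster maps mostly to a single true cluster. The sketch below shows one simple way to summarise that with cluster purity, using small made-up label vectors rather than the fitted model. Purity is chosen here only for illustration and is not a metric reported by the package.

``` r
# Toy labels (hypothetical): estimated labels are permuted relative to the truth,
# and one individual from true cluster 2 is misassigned
cluster_true <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
cluster_est  <- c(2, 2, 2, 3, 3, 1, 1, 1, 1)

tab <- table(estimated = cluster_est, true = cluster_true)
print(tab)

# Purity: each estimated cluster is credited with its largest overlap
purity <- sum(apply(tab, 1, max)) / sum(tab)
purity
```

With a fitted model, the same two lines could be applied to `table(fit_iClusterVB$cluster, sim_data$cluster_true)`.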


### Summary of the Model

```{r summary}
# We can obtain a summary using summary()
summary(fit_iClusterVB)
```


### Generic Plots

```{r plots}
plot(fit_iClusterVB)
```


### Probability of Inclusion Plots

```{r piplot}
# The `piplot` function can be used to visualize the probability of inclusion

piplot(fit_iClusterVB)
```


### Heat maps to visualize the clusters


```{r chmap, echo = TRUE, fig.show='hide'}
# The `chmap` function can be used to display heat maps for each data view

list_of_plots <- chmap(
  fit_iClusterVB,
  rho = 0,
  cols = c("green", "blue", "purple", "red"),
  scale = "none"
)
```

```{r gridExtra}
# The `grid.arrange` function from gridExtra can be used to display all the 
# plots together
gridExtra::grid.arrange(grobs = list_of_plots, ncol = 2, nrow = 2)
```

Owner

Login: AbdalkarimA
Kind: user

Repositories: 1
Profile: https://github.com/AbdalkarimA

GitHub Events

Total

Push event: 1

Last Year

Push event: 1

Packages

Total packages: 1
Total downloads:
- cran 175 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 4
Total maintainers: 1

cran.r-project.org: iClusterVB

Fast Integrative Clustering and Feature Selection for High Dimensional Data

Homepage: https://github.com/AbdalkarimA/iClusterVB
Documentation: http://cran.r-project.org/web/packages/iClusterVB/iClusterVB.pdf
License: MIT + file LICENSE
Latest release: 0.1.4
published over 1 year ago


Rankings

Dependent packages count: 28.5%

Dependent repos count: 35.1%

Average: 50.1%

Downloads: 86.8%

Maintainers (1)

abdalkarim.alnajjar@queensu.ca

Last synced: 10 months ago
