Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.8%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: AbdalkarimA
  • License: other
  • Language: C++
  • Default Branch: main
  • Size: 80.4 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 2 years ago · Last pushed over 1 year ago
Metadata Files

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "75%"
)
```

# iClusterVB


[![R-CMD-check](https://github.com/AbdalkarimA/iClusterVB/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/AbdalkarimA/iClusterVB/actions/workflows/R-CMD-check.yaml)


iClusterVB allows for fast integrative clustering and feature selection for high-dimensional data.

It uses a variational Bayes approach and offers three key features: clustering of mixed-type data, automated determination of the number of clusters, and feature selection in high-dimensional settings. These address limitations of traditional clustering methods and provide an alternative to MCMC algorithms that is potentially faster, making __iClusterVB__ a useful tool for contemporary data analysis.

## Installation

You can install iClusterVB from CRAN with:

``` r
install.packages("iClusterVB")
```

You can install the development version of iClusterVB from [GitHub](https://github.com/AbdalkarimA/iClusterVB) with:

``` r
# install.packages("devtools")
devtools::install_github("AbdalkarimA/iClusterVB")
```

## iClusterVB - The Main Function

***Mandatory arguments***

-   `mydata`: A list of length R, where R is the number of datasets,
    containing the input data.

    -   Note: For **categorical** data, `0`'s must be re-coded to
        another, non-`0` value.

-   `dist`: A vector of length R specifying the type of data or
    distribution. Options include: `"gaussian"` (for continuous data),
    `"multinomial"` (for binary or categorical data), and `"poisson"`
    (for count data).
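
As a minimal sketch of how these two arguments fit together, the snippet below builds a toy input in base R. The matrices, their sizes, and the choice of `2` as the replacement code for `0` are illustrative assumptions, not values taken from the package.

```r
set.seed(1)

# Hypothetical toy data: 5 individuals, one continuous and one binary view
continuous_view <- matrix(rnorm(5 * 4), nrow = 5)
binary_raw <- matrix(rbinom(5 * 3, size = 1, prob = 0.5), nrow = 5)

# Re-code 0's to a non-0 value (here 2, an arbitrary choice) so that
# no category of the categorical view is labelled 0
binary_view <- binary_raw
binary_view[binary_view == 0] <- 2

# `mydata` is a list with one element per view; `dist` names the
# distribution of each view, in the same order
mydata <- list(cont_1 = continuous_view, binary_1 = binary_view)
dist <- c("gaussian", "multinomial")

# Basic sanity checks before fitting
stopifnot(length(mydata) == length(dist))
stopifnot(length(unique(vapply(mydata, nrow, integer(1)))) == 1)
```

The last check assumes that every view stores the same individuals in its rows, which matches the description of N as a single sample size shared across datasets.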

***Optional arguments***

-   `K`: The maximum number of clusters, with a default value of 10. The
    algorithm will converge to a model with dominant clusters, removing
    redundant clusters and automating the process of determining the
    number of clusters.

-   `initial_method`: The method for the initial cluster allocation,
    which the iClusterVB algorithm will then use to determine the final
    cluster allocation. Options include `"VarSelLCM"` (default) for
    VarSelLCM, `"random"` for a random sample, `"kproto"`
    for k-prototypes, `"kmeans"` for k-means (continuous
    data only), `"mclust"` for mclust (continuous data only),
    or `"lca"` for poLCA (categorical data only).

-   `VS_method`: The feature selection method. The options are 0
    (default) for clustering without feature selection and 1 for
    clustering with feature selection.

-   `initial_cluster`: The initial cluster membership. The default is
    NULL, which uses `initial_method` for initial cluster allocation. If
    it is not NULL, it overrides any previously set initial values for
    this parameter.

-   `initial_vs_prob`: The initial feature selection probability, a
    scalar. The default is NULL, which assigns a value of 0.5.

-   `initial_fit`: Initial values based on a previously fitted
    iClusterVB model (an iClusterVB object). The default is NULL.

-   `initial_omega`: Customized initial values for feature inclusion
    probabilities. The default is NULL. If the argument is not NULL, it
    overrides any previously set initial values for this parameter. If
    `VS_method = 1`, `initial_omega` is a list of length R, and each
    element of the list is an array with `dim = c(N, p[[r]])`, where N
    is the sample size and `p[[r]]` is the number of features in
    dataset r, r = 1, ..., R.

-   `initial_hyper_parameters`: A list of the initial hyper-parameters
    of the prior distributions for the model. The default is NULL, which
    assigns `alpha_00 = 0.001`, `mu_00 = 0`, `s2_00 = 100`, `a_00 = 1`,
    `b_00 = 1`, `kappa_00 = 1`, `u_00 = 1`, and `v_00 = 1`. These
    correspond to
    $\boldsymbol{\alpha}_0, \mu_0, s^2_0, a_0, b_0, \boldsymbol{\kappa}_0, c_0, \text{and } d_0$
    as described in <https://dx.doi.org/10.2139/ssrn.4971680>.

-   `max_iter`: The maximum number of iterations for the VB algorithm.
    The default is 200.

-   `early_stop`: Whether to stop the algorithm upon convergence or to
    continue until `max_iter` is reached. Options are 1 (default) to
    stop when the algorithm converges, and 0 to stop only when
    `max_iter` is reached.

-   `per`: Print information every `per` iterations. The default is 10.

-   `convergence_threshold`: The convergence threshold for the change in
    ELBO. The default is 0.0001.
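
To make the shapes of `initial_omega` and `initial_hyper_parameters` concrete, here is a minimal base-R sketch. The toy dimensions and the constant starting value of 0.5 are assumptions chosen for illustration. The hyper-parameter names and values are the defaults listed above, and gathering them in a named list is an assumption about how the argument is supplied.

```r
# Hypothetical toy dimensions: N individuals and R = 2 views
N <- 6
p <- list(4, 3)  # p[[r]] = number of features in view r

# One array per view with dim = c(N, p[[r]]); every entry starts at 0.5
initial_omega <- lapply(p, function(p_r) array(0.5, dim = c(N, p_r)))

# The default prior hyper-parameters, gathered into a named list
initial_hyper_parameters <- list(
  alpha_00 = 0.001, mu_00 = 0, s2_00 = 100,
  a_00 = 1, b_00 = 1, kappa_00 = 1, u_00 = 1, v_00 = 1
)

# These objects could then be supplied alongside the data (not run here):
# fit <- iClusterVB(mydata = mydata, dist = dist, VS_method = 1,
#                   initial_omega = initial_omega,
#                   initial_hyper_parameters = initial_hyper_parameters)
```

The commented call is only a usage outline; `mydata` and `dist` would need to describe views with matching dimensions.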


## Simulated Data

We will demonstrate the clustering and feature selection performance of `iClusterVB` using a simulated dataset comprising $N = 240$ individuals and $R = 4$ data views with different data types. Two views were continuous and one was count, a setup commonly found in genomics data, where gene or mRNA expression (continuous) and DNA copy number (count) are observed. The true number of clusters ($K$) was set to 4, with balanced cluster proportions ($\pi_1 = \pi_2 = \pi_3 = \pi_4 = 0.25$). Each data view consisted of $p_r = 500$ features ($r = 1, \dots, 3$), totaling $p = \sum_{r=1}^3 p_r = 1500$ features across all views. Within each view, only 50 features (10%) were relevant for clustering, while the remaining features were noise. The relevant features were distributed across clusters as described in the table below:


::: {#tab:simulated-dataset}
  **Data View**    **Cluster**   **Distribution**
  ---------------- ------------- -------------------------------------
  1 (Continuous)   Cluster 1     $\mathcal{N}(10, 1)$ (Relevant)
                   Cluster 2     $\mathcal{N}(5, 1)$ (Relevant)
                   Cluster 3     $\mathcal{N}(-5, 1)$ (Relevant)
                   Cluster 4     $\mathcal{N}(-10, 1)$ (Relevant)
                                 $\mathcal{N}(0, 1)$ (Noise)
  2 (Continuous)   Cluster 1     $\mathcal{N}(-10, 1)$ (Relevant)
                   Cluster 2     $\mathcal{N}(-5, 1)$ (Relevant)
                   Cluster 3     $\mathcal{N}(5, 1)$ (Relevant)
                   Cluster 4     $\mathcal{N}(10, 1)$ (Relevant)
                                 $\mathcal{N}(0, 1)$ (Noise)
  3 (Count)        Cluster 1     $\text{Poisson}(50)$ (Relevant)
                   Cluster 2     $\text{Poisson}(35)$ (Relevant)
                   Cluster 3     $\text{Poisson}(20)$ (Relevant)
                   Cluster 4     $\text{Poisson}(10)$ (Relevant)
                                 $\text{Poisson}(2)$ (Noise)

  : Distribution of relevant and noise features across clusters in each
  data view
:::

The simulated dataset is included as a list in the package.
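
For readers who want to see how views with the structure in the table could be generated, the following base-R sketch simulates data view 1 and data view 3. It is not the code used to build `sim_data`. The seed, the ordering of individuals by cluster, and the placement of the 50 relevant features in the first 50 columns are all assumptions made for illustration.

```r
set.seed(2024)

N <- 240      # individuals
K <- 4        # true clusters, 60 individuals each
p <- 500      # features per view
p_rel <- 50   # relevant features (assumed to be the first 50 columns)

cluster_true <- rep(seq_len(K), each = N / K)

# View 1 (continuous): noise features are N(0, 1); relevant features
# have a cluster-specific mean and unit standard deviation
means_view1 <- c(10, 5, -5, -10)
view1 <- matrix(rnorm(N * p, mean = 0, sd = 1), nrow = N)
view1[, seq_len(p_rel)] <- rnorm(N * p_rel,
                                 mean = means_view1[cluster_true], sd = 1)

# View 3 (count): noise features are Poisson(2); relevant features
# have a cluster-specific rate
rates_view3 <- c(50, 35, 20, 10)
view3 <- matrix(rpois(N * p, lambda = 2), nrow = N)
view3[, seq_len(p_rel)] <- rpois(N * p_rel,
                                 lambda = rates_view3[cluster_true])

# Average of the relevant features within each true cluster
tapply(rowMeans(view1[, seq_len(p_rel)]), cluster_true, mean)
tapply(rowMeans(view3[, seq_len(p_rel)]), cluster_true, mean)
```

The per-cluster averages printed at the end should sit close to the means and rates in the table, which is a quick way to confirm the relevant block was filled as intended.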

### Data pre-processing

```{r sim_data_example}
library(iClusterVB)

# Input data must be a list

dat1 <- list(gauss_1 = sim_data$continuous1_data,
             gauss_2 = sim_data$continuous2_data,
             multinomial_1 = sim_data$binary_data)

dist <- c("gaussian", "gaussian",
          "multinomial")

```

### Running the model

```{r model}
set.seed(123)
fit_iClusterVB <- iClusterVB(
  mydata = dat1,
  dist = dist,
  K = 8,
  initial_method = "VarSelLCM",
  VS_method = 1, # Variable Selection is on
  max_iter = 100,
  per = 100
)
```
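
The way `max_iter`, `early_stop`, `convergence_threshold`, and `per` interact can be sketched with a toy loop. This is not the package's internal code: `run_toy_vb` and its ELBO trajectory are hypothetical, and measuring convergence as the absolute change in ELBO between consecutive iterations is an assumption.

```r
# Hypothetical stand-in for an iterative VB fit; returns the number of
# iterations that were actually run
run_toy_vb <- function(max_iter = 200, early_stop = 1,
                       convergence_threshold = 1e-4, per = 10) {
  # Made-up ELBO trajectory that increases towards 0
  toy_elbo <- function(t) -100 * 0.5^t
  elbo_prev <- toy_elbo(0)
  for (iter in seq_len(max_iter)) {
    elbo <- toy_elbo(iter)
    # Report progress every `per` iterations
    if (iter %% per == 0) {
      message("iteration ", iter, ": ELBO = ", signif(elbo, 6))
    }
    # Assumed stopping rule: absolute change in ELBO below the threshold
    converged <- abs(elbo - elbo_prev) < convergence_threshold
    if (early_stop == 1 && converged) break
    elbo_prev <- elbo
  }
  iter
}

iters_early <- run_toy_vb(early_stop = 1)  # stops once the ELBO settles
iters_full <- run_toy_vb(early_stop = 0)   # always runs to max_iter
```

With `early_stop = 1` the loop ends as soon as the change drops below the threshold, while `early_stop = 0` keeps iterating until `max_iter` regardless of convergence.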

### Comparing to True Cluster Membership

```{r table}
table(fit_iClusterVB$cluster, sim_data$cluster_true)
```
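
Cluster labels are arbitrary, so a fitted cluster 3 may correspond to true cluster 1. When reading a cross-tabulation like the one above, it can help to match each estimated cluster to the true cluster it overlaps most and to summarise agreement as purity. The sketch below uses made-up label vectors rather than the fitted model, and purity is one possible summary, not a metric reported by the package.

```r
# Hypothetical labels: 240 individuals in 4 balanced true clusters
truth <- rep(1:4, each = 60)

# Estimated labels: the same partition with permuted label names ...
est <- c(3, 1, 4, 2)[truth]
# ... except for six individuals moved to the wrong cluster
est[1:6] <- 1

tab <- table(estimated = est, truth = truth)

# For each estimated cluster, the true cluster with the largest overlap
best_match <- as.integer(colnames(tab)[apply(tab, 1, which.max)])

# Purity: share of individuals that fall in their cluster's best match
purity <- sum(apply(tab, 1, max)) / sum(tab)
```

A purity of 1 means every estimated cluster contains individuals from a single true cluster; here the six misallocated individuals pull it slightly below 1.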


### Summary of the Model

```{r summary}
# We can obtain a summary using summary()
summary(fit_iClusterVB)
```


### Generic Plots

```{r plots}
plot(fit_iClusterVB)
```


### Probability of Inclusion Plots

```{r piplot}
# The `piplot` function can be used to visualize the probability of inclusion

piplot(fit_iClusterVB)
```


### Heat maps to visualize the clusters


```{r chmap, echo = TRUE, fig.show='hide'}
# The `chmap` function can be used to display heat maps for each data view

list_of_plots <- chmap(
  fit_iClusterVB,
  rho = 0,
  cols = c("green", "blue", "purple", "red"),
  scale = "none"
)
```

```{r gridExtra}
# The `grid.arrange` function from gridExtra can be used to display all the 
# plots together
gridExtra::grid.arrange(grobs = list_of_plots, ncol = 2, nrow = 2)
```

Owner

  • Login: AbdalkarimA
  • Kind: user

GitHub Events

Total
  • Push event: 1
Last Year
  • Push event: 1

Packages

  • Total packages: 1
  • Total downloads:
    • cran 175 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 4
  • Total maintainers: 1
cran.r-project.org: iClusterVB

Fast Integrative Clustering and Feature Selection for High Dimensional Data

Rankings
Dependent packages count: 28.5%
Dependent repos count: 35.1%
Average: 50.1%
Downloads: 86.8%