iclustervb
Repository
Basic Info
- Host: GitHub
- Owner: AbdalkarimA
- License: other
- Language: C++
- Default Branch: main
- Size: 80.4 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Created about 2 years ago
· Last pushed over 1 year ago
README.Rmd
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "75%"
)
```
# iClusterVB
[R-CMD-check](https://github.com/AbdalkarimA/iClusterVB/actions/workflows/R-CMD-check.yaml)
iClusterVB allows for fast integrative clustering and feature selection for high-dimensional data.
__iClusterVB__ uses a variational Bayes approach. Its key features (clustering of mixed-type data, automated determination of the number of clusters, and feature selection in high-dimensional settings) address limitations of traditional clustering methods, and the variational approach offers an alternative to MCMC algorithms that is potentially faster.
## Installation
You can install iClusterVB from CRAN with:
``` r
install.packages("iClusterVB")
```
You can install the development version of iClusterVB from [GitHub](https://github.com/AbdalkarimA/iClusterVB) with:
``` r
# install.packages("devtools")
devtools::install_github("AbdalkarimA/iClusterVB")
```
## iClusterVB - The Main Function
***Mandatory arguments***
- `mydata`: A list of length R, where R is the number of datasets,
containing the input data.
- Note: For **categorical** data, `0`'s must be re-coded to
another, non-`0` value.
- `dist`: A vector of length R specifying the type of data or
distribution. Options include: `"gaussian"` (for continuous data),
`"multinomial"` (for binary or categorical data), and `"poisson"`
(for count data).
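
As an illustration of the two mandatory arguments, the sketch below builds a small input list in base R and re-codes the `0`s in a binary view before it is labelled `"multinomial"`. The toy matrices, their names, and the choice of `2` as the replacement value are assumptions made for this example only; any non-`0` value that does not clash with an existing category should serve the same purpose.

```r
set.seed(1)
n <- 20  # toy sample size (assumption for this sketch)

# Three toy data views; rows are the same n individuals in every view
continuous_view <- matrix(rnorm(n * 5), nrow = n)
count_view      <- matrix(rpois(n * 5, lambda = 3), nrow = n)
binary_view     <- matrix(rbinom(n * 5, size = 1, prob = 0.4), nrow = n)

# Categorical data must not contain 0, so re-code 0 to another value (here 2)
binary_view[binary_view == 0] <- 2

# `mydata` is a list of length R; `dist` gives one distribution per view,
# in the same order as the list elements
mydata <- list(continuous_view, count_view, binary_view)
dist   <- c("gaussian", "poisson", "multinomial")

stopifnot(length(mydata) == length(dist))
```

Only the categorical view is re-coded here; zeros in the count view are left untouched.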
***Optional arguments***
- `K`: The maximum number of clusters, with a default value of 10. The
algorithm will converge to a model with dominant clusters, removing
redundant clusters and automating the process of determining the
number of clusters.
- `initial_method`: The method for the initial cluster allocation,
which the iClusterVB algorithm will then use to determine the final
cluster allocation. Options include `"VarSelLCM"` (default) for
VarSelLCM, `"random"` for a random sample, `"kproto"`
for k-prototypes, `"kmeans"` for k-means (continuous
data only), `"mclust"` for mclust (continuous data only),
or `"lca"` for poLCA (categorical data only).
- `VS_method`: The feature selection method. The options are 0
(default) for clustering without feature selection and 1 for
clustering with feature selection.
- `initial_cluster`: The initial cluster membership. The default is
NULL, which uses `initial_method` for initial cluster allocation. If
it is not NULL, it overrides the initial values that would otherwise
be used for this parameter.
- `initial_vs_prob`: The initial feature selection probability, a
scalar. The default is NULL, which assigns a value of 0.5.
- `initial_fit`: Initial values based on a previously fitted
iClusterVB model (an iClusterVB object). The default is NULL.
- `initial_omega`: Customized initial values for feature inclusion
probabilities. The default is NULL. If the argument is not NULL, it
overrides the initial values that would otherwise be used for this
parameter. If `VS_method = 1`, `initial_omega` is a list of length
R, and each element of the list is an array with
`dim = c(N, p[[r]])`, where N is the sample size and `p[[r]]` is the
number of features for dataset r, r = 1,...,R.
- `initial_hyper_parameters`: A list of the initial hyper-parameters
of the prior distributions for the model. The default is NULL, which
assigns `alpha_00 = 0.001, mu_00 = 0, s2_00 = 100, a_00 = 1, b_00 = 1, kappa_00 = 1, u_00 = 1, v_00 = 1`.
These are
$\boldsymbol{\alpha}_0, \mu_0, s^2_0, a_0, b_0, \boldsymbol{\kappa}_0, c_0, \text{and } d_0$
described in <https://dx.doi.org/10.2139/ssrn.4971680>.
- `max_iter`: The maximum number of iterations for the VB algorithm.
The default is 200.
- `early_stop`: Whether to stop the algorithm upon convergence or to
continue until `max_iter` is reached. Options are 1 (default) to
stop when the algorithm converges, and 0 to stop only when
`max_iter` is reached.
- `per`: Print information every `per` iterations. The default is 10.
- `convergence_threshold`: The convergence threshold for the change in
ELBO. The default is 0.0001.
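
The shapes expected by `initial_hyper_parameters` and `initial_omega` can be easier to see in code. The sketch below builds both objects in base R from the descriptions above. The sample size, the feature counts, the changed `s2_00` value, and the object names `hyper` and `omega` are assumptions for illustration, and whether a partial hyper-parameter list is accepted is not stated above, so the full list is given.

```r
N <- 240                  # sample size (assumed for this sketch)
p <- list(500, 500, 500)  # number of features in each of R = 3 views (assumed)

# Hyper-parameters: start from the documented defaults ...
hyper <- list(
  alpha_00 = 0.001, mu_00 = 0, s2_00 = 100,
  a_00 = 1, b_00 = 1, kappa_00 = 1, u_00 = 1, v_00 = 1
)
# ... and change one value (a hypothetical choice, not a recommendation)
hyper$s2_00 <- 10

# initial_omega: a list of length R whose r-th element is an array with
# dim = c(N, p[[r]]); here every inclusion probability starts at 0.5
omega <- lapply(p, function(p_r) array(0.5, dim = c(N, p_r)))

# These objects would then be supplied through the documented arguments:
# iClusterVB(mydata = ..., dist = ..., VS_method = 1,
#            initial_omega = omega, initial_hyper_parameters = hyper)
```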
## Simulated Data
We will demonstrate the clustering and feature selection performance of `iClusterVB` using a simulated dataset comprising $N = 240$ individuals and $R = 4$ data views with different data types. Two views were continuous, and one was count, a setup commonly found in genomics data where gene or mRNA expression (continuous) and DNA copy number (count) are observed. The true number of clusters ($K$) was set to 4, with balanced cluster proportions ($\pi_1 = 0.25, \pi_2 = 0.25, \pi_3 = 0.25, \pi_4 = 0.25$). Each data view consisted of $p_r = 500$ features ($r = 1, \dots, 3$), totaling $p = \sum_{r=1}^3 p_r = 1500$ features across all views. Within each view, only 50 features (10%) were relevant for clustering, while the remaining features were noise. The relevant features were distributed across clusters as described in the table below:
::: {#tab:simulated-dataset}
| Data View      | Cluster   | Distribution                      |
|:---------------|:----------|:----------------------------------|
| 1 (Continuous) | Cluster 1 | $\mathcal{N}(10, 1)$ (Relevant)   |
|                | Cluster 2 | $\mathcal{N}(5, 1)$ (Relevant)    |
|                | Cluster 3 | $\mathcal{N}(-5, 1)$ (Relevant)   |
|                | Cluster 4 | $\mathcal{N}(-10, 1)$ (Relevant)  |
|                |           | $\mathcal{N}(0, 1)$ (Noise)       |
| 2 (Continuous) | Cluster 1 | $\mathcal{N}(-10, 1)$ (Relevant)  |
|                | Cluster 2 | $\mathcal{N}(-5, 1)$ (Relevant)   |
|                | Cluster 3 | $\mathcal{N}(5, 1)$ (Relevant)    |
|                | Cluster 4 | $\mathcal{N}(10, 1)$ (Relevant)   |
|                |           | $\mathcal{N}(0, 1)$ (Noise)       |
| 3 (Count)      | Cluster 1 | $\text{Poisson}(50)$ (Relevant)   |
|                | Cluster 2 | $\text{Poisson}(35)$ (Relevant)   |
|                | Cluster 3 | $\text{Poisson}(20)$ (Relevant)   |
|                | Cluster 4 | $\text{Poisson}(10)$ (Relevant)   |
|                |           | $\text{Poisson}(2)$ (Noise)       |

: Distribution of relevant and noise features across clusters in each data view
:::
The simulated dataset is included as a list in the package.
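
To make the design in the table concrete, the following base R sketch generates data with the same structure for the three tabulated views. It is not the code used to build `sim_data`. It assumes that the relevant features occupy the first 50 columns of each view, that individuals are ordered by cluster, and that the second parameter of $\mathcal{N}(\cdot, 1)$ is a variance of 1; the function and object names are hypothetical.

```r
set.seed(2024)
N <- 240; K <- 4; p_r <- 500; n_relevant <- 50

# Balanced clusters: 60 individuals each, ordered by cluster (assumption)
cluster_true <- rep(seq_len(K), each = N / K)

# Cluster-specific parameters taken from the table
means_view1 <- c(10, 5, -5, -10)
means_view2 <- c(-10, -5, 5, 10)
rates_view3 <- c(50, 35, 20, 10)

simulate_gaussian <- function(means) {
  x <- matrix(rnorm(N * p_r, mean = 0, sd = 1), nrow = N)  # noise: N(0, 1)
  # Relevant block: each row's mean depends on its cluster; the mean vector
  # of length N is recycled down each column, so rows and clusters line up
  x[, seq_len(n_relevant)] <- rnorm(N * n_relevant,
                                    mean = means[cluster_true], sd = 1)
  x
}

simulate_poisson <- function(rates) {
  x <- matrix(rpois(N * p_r, lambda = 2), nrow = N)        # noise: Poisson(2)
  x[, seq_len(n_relevant)] <- rpois(N * n_relevant,
                                    lambda = rates[cluster_true])
  x
}

toy_data <- list(
  continuous1 = simulate_gaussian(means_view1),
  continuous2 = simulate_gaussian(means_view2),
  count       = simulate_poisson(rates_view3)
)
sapply(toy_data, dim)  # each view is N x p_r
```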
### Data pre-processing
```{r sim_data_example}
library(iClusterVB)
# Input data must be a list
dat1 <- list(
  gauss_1 = sim_data$continuous1_data,
  gauss_2 = sim_data$continuous2_data,
  multinomial_1 = sim_data$binary_data
)
dist <- c("gaussian", "gaussian", "multinomial")
```
### Running the model
```{r model}
set.seed(123)
fit_iClusterVB <- iClusterVB(
  mydata = dat1,
  dist = dist,
  K = 8,
  initial_method = "VarSelLCM",
  VS_method = 1, # Variable Selection is on
  max_iter = 100,
  per = 100
)
```
### Comparing to True Cluster Membership
```{r table}
table(fit_iClusterVB$cluster, sim_data$cluster_true)
```
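
Cluster labels are arbitrary, so a good fit does not have to put its counts on the diagonal of this table. One way to summarise the agreement in a single number is the adjusted Rand index, which is unaffected by relabelling. The base R sketch below is an addition for illustration and is not part of the package; the function name and the toy label vectors are hypothetical stand-ins for `fit_iClusterVB$cluster` and `sim_data$cluster_true`.

```r
# Adjusted Rand index computed from the contingency table of two partitions
adjusted_rand_index <- function(x, y) {
  tab <- table(x, y)
  n_pairs   <- choose(sum(tab), 2)
  sum_cells <- sum(choose(tab, 2))           # pairs together in both partitions
  sum_rows  <- sum(choose(rowSums(tab), 2))  # pairs together in x
  sum_cols  <- sum(choose(colSums(tab), 2))  # pairs together in y
  expected  <- sum_rows * sum_cols / n_pairs
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# Toy stand-ins for the true and estimated memberships
truth     <- rep(1:4, each = 60)
estimated <- c(3, 1, 4, 2)[truth]  # same partition, different labels

table(estimated, truth)            # counts fall off the diagonal
adjusted_rand_index(estimated, truth)

# Number of non-empty clusters in the estimated allocation
length(unique(estimated))
```

In this toy case the two partitions are identical up to relabelling, so the index equals 1 even though the cross-table is not diagonal.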
### Summary of the Model
```{r summary}
# We can obtain a summary using summary()
summary(fit_iClusterVB)
```
### Generic Plots
```{r plots}
plot(fit_iClusterVB)
```
### Probability of Inclusion Plots
```{r piplot}
# The `piplot` function can be used to visualize the probability of inclusion
piplot(fit_iClusterVB)
```
### Heat maps to visualize the clusters
```{r chmap, echo = TRUE, fig.show='hide'}
# The `chmap` function can be used to display heat maps for each data view
list_of_plots <- chmap(
  fit_iClusterVB,
  rho = 0,
  cols = c("green", "blue", "purple", "red"),
  scale = "none"
)
```
```{r gridExtra}
# The `grid.arrange` function from gridExtra can be used to display all the
# plots together
gridExtra::grid.arrange(grobs = list_of_plots, ncol = 2, nrow = 2)
```
Owner
- Login: AbdalkarimA
- Kind: user
- Repositories: 1
- Profile: https://github.com/AbdalkarimA
Packages
- Total packages: 1
-
Total downloads:
- cran 175 last-month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 4
- Total maintainers: 1
cran.r-project.org: iClusterVB
Fast Integrative Clustering and Feature Selection for High Dimensional Data
- Homepage: https://github.com/AbdalkarimA/iClusterVB
- Documentation: http://cran.r-project.org/web/packages/iClusterVB/iClusterVB.pdf
- License: MIT + file LICENSE
-
Latest release: 0.1.4
published over 1 year ago