baysc: An R package for Bayesian survey clustering
baysc: An R package for Bayesian survey clustering - Published in JOSS (2026)
https://github.com/smwu/baysc
Published in Journal of Open Source Software
BAYesian Survey Clustering is an R package for running Bayesian clustering methods on survey data.
Basic Info
- Host: GitHub
- Owner: smwu
- License: other
- Language: R
- Default Branch: main
- Size: 106 MB
# baysc: BAYesian Survey Clustering
An R package for running Bayesian supervised and unsupervised clustering methods on survey data.
Maintainer: Stephanie M. Wu (stephanie.wu@ucl.ac.uk)
Contributors: Matthew R. Williams (mrwilliams@rti.org); Terrance D. Savitsky (savitsky.terrance@bls.gov); Briana J.K. Stephenson (bstephenson@hsph.harvard.edu)
Citation: Wu S, Williams M, Savitsky T, Stephenson B (2025). baysc: BAYesian Survey Clustering. R package version 0.1.0, https://github.com/smwu/baysc.
## Installation
``` r
# Install devtools for package loading
install.packages("devtools")
library(devtools)
# Install baysc from GitHub
devtools::install_github("smwu/baysc")
library(baysc)
```
During installation, the following errors may arise:
- *No package called 'rstantools'*: Please install the `rstantools` package using `install.packages("rstantools")`.
- *Library 'gfortran' not found*: This is a compiler configuration issue that can arise when using `Rcpp` on Mac computers with Apple silicon (e.g., M1-M4 chips). Users may need to install Xcode, GNU Fortran, and OpenMP, and edit the `~/.R/Makevars` file. For more details, see the "Supplementary" section below.
- *Library 'emutls_w' not found*: This is a toolchain mismatch issue that can arise when using `Rcpp`-dependent packages on Mac computers with Apple silicon (e.g., M1-M4 chips). Users may need to install gfortran and edit the `~/.R/Makevars` file. For more details, see the "Supplementary" section below.
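The first error above can usually be avoided by checking for prerequisite packages before installing. The following is a minimal sketch, not part of `baysc`: the helper name `missing_packages()` is hypothetical, and the prerequisite list only covers the packages named in this section.

``` r
# Hypothetical helper (not part of baysc): return the names of packages
# that cannot be found in the current R library paths.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Prerequisites mentioned in this section
prereqs <- c("devtools", "rstantools")
to_install <- missing_packages(prereqs)

if (length(to_install) > 0) {
  message("Install before baysc: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```

This check does not address the `gfortran` or `emutls_w` errors, which depend on the compiler toolchain rather than on R packages.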
## Overview
`baysc` is an R package for running Bayesian clustering methods on survey data. A Bayesian latent class analysis (LCA), termed the Weighted Overfitted Latent Class Analysis (WOLCA), is available for eliciting underlying cluster patterns from multivariate categorical data, incorporating survey sampling weights and other survey design elements. Options also exist for relating the patterns to a binary outcome, either by using a two-step approach that applies WOLCA and then runs a survey-weighted regression, or by utilizing a one-step supervised approach where creation of the cluster patterns is directly informed by the outcome, referred to as the Supervised Weighted Overfitted Latent Class Analysis (SWOLCA). Summary and plotting functions for visualizing output are also available, as are diagnostic functions for examining convergence of the sampler. More information about the models can be found in the following paper.
> Wu, S. M., Williams, M. R., Savitsky, T. D., & Stephenson, B. J. (2024). Derivation of outcome-dependent dietary patterns for low-income women obtained from survey data using a supervised weighted overfitted latent class analysis. Biometrics, 80(4), ujae122.
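As a rough illustration of how sampling weights can enter a latent class model, one common construction is a weighted pseudo-likelihood, in which each individual's mixture likelihood is raised to the power of a normalized weight. The notation below is assumed for this sketch and is not taken from the package: $x_{ij}$ is the response of individual $i$ to item $j$, $R_j$ is the number of levels of item $j$, $K$ is the number of latent classes, $\pi_k$ is the probability of class $k$, $\theta_{jkr}$ is the probability of level $r$ for item $j$ in class $k$, and $w_i$ is the sampling weight. The exact specification used by WOLCA and SWOLCA is given in the paper cited above.

$$
\prod_{i=1}^{n} \left[ \sum_{k=1}^{K} \pi_k \prod_{j=1}^{J} \prod_{r=1}^{R_j} \theta_{jkr}^{\mathbb{1}(x_{ij} = r)} \right]^{\tilde{w}_i},
\qquad
\tilde{w}_i = \frac{n \, w_i}{\sum_{i'=1}^{n} w_{i'}}
$$

Normalizing the weights so that they sum to $n$ is an assumption of this sketch; it keeps the total amount of information in the pseudo-likelihood on the scale of the observed sample.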
## Functions
Use the `wolca()` function to run an unsupervised WOLCA and obtain pattern profiles. `wolca_var_adjust()` provides a post-hoc variance adjustment that enables correct uncertainty estimation. To examine the association of pattern profiles with a binary outcome through a two-step approach, run `wolca_svyglm()`. Use the `swolca()` function to run a SWOLCA model that allows information about the binary outcome to directly inform the creation of the pattern profiles. `swolca_var_adjust()` provides a post-hoc variance adjustment that enables correct uncertainty estimation. Detailed information about the functions and related statistical details can be found in the vignette, "[An introduction to the baysc package](https://raw.githubusercontent.com/smwu/baysc/refs/heads/main/vignettes/baysc.pdf)," in the [JOSS manuscript](https://github.com/smwu/baysc/blob/joss-paper/JOSS/paper.pdf), and in the paper cited above.
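To see which arguments each of these functions accepts, you can inspect them from the R console once the package is installed. The sketch below uses only the function names listed above and base R introspection. It assumes all five functions are exported by `baysc`, and it deliberately does not guess at any argument names.

``` r
# Function names as listed in the paragraph above
baysc_functions <- c("wolca", "wolca_var_adjust", "wolca_svyglm",
                     "swolca", "swolca_var_adjust")

if (requireNamespace("baysc", quietly = TRUE)) {
  # Print the argument list of each function (assumes each is exported)
  for (fn in baysc_functions) {
    cat("\n", fn, ":\n", sep = "")
    print(args(getExportedValue("baysc", fn)))
  }
} else {
  message("baysc is not installed; see the Installation section.")
}
```

The help page for any of these functions can be opened with `help()`, for example `help("wolca", package = "baysc")`.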
## Data
`baysc` applies Bayesian latent class analysis using the following input data:
- Multivariate categorical exposure: $n \times J$ matrix, where $n$ is the sample size and $J$ is the number of categorical item variables. Each item must be a categorical variable.
- (Optional) survey design elements such as stratum indicators, cluster indicators, and sampling weights: each formatted as an $n \times 1$ vector.
- (Optional) binary outcome: $n \times 1$ vector.
- (Optional) additional confounders to adjust for when evaluating the exposure-outcome association: $n \times Q$ data frame, where $Q$ is the number of additional confounders.
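The shapes listed above can be illustrated with a small simulated dataset. This is a sketch in base R only. The object names (`x_mat`, `stratum_id`, `cluster_id`, `sampling_wt`, `y_all`, `V_data`), the confounder columns, and the coding of item levels as integers `1..R` are assumptions made for illustration, and the values are random rather than real survey data.

``` r
set.seed(1)
n <- 200  # sample size
J <- 5    # number of categorical items
R <- 3    # levels per item (assumed equal across items for simplicity)
Q <- 2    # number of additional confounders

# n x J categorical exposure matrix, levels coded 1..R (assumed coding)
x_mat <- matrix(sample.int(R, n * J, replace = TRUE), nrow = n, ncol = J)

# Survey design elements, each a length-n vector
stratum_id  <- sample.int(2, n, replace = TRUE)   # stratum indicators
cluster_id  <- sample.int(20, n, replace = TRUE)  # cluster indicators
sampling_wt <- runif(n, min = 50, max = 500)      # sampling weights

# Binary outcome, length-n vector of 0/1 values
y_all <- rbinom(n, size = 1, prob = 0.3)

# n x Q data frame of confounders (hypothetical variables)
V_data <- data.frame(
  age_group = factor(sample(c("20-39", "40-59", "60+"), n, replace = TRUE)),
  smoker    = rbinom(n, size = 1, prob = 0.2)
)

# Check that every input has the expected dimensions
stopifnot(
  identical(dim(x_mat), c(n, J)),
  length(stratum_id) == n,
  length(cluster_id) == n,
  length(sampling_wt) == n,
  length(y_all) == n,
  identical(dim(V_data), c(n, Q))
)
```

Checking dimensions up front makes it easier to catch misaligned rows, for example when the design variables and the exposure matrix come from separately filtered data sources.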
We provide an example dataset from the National Health and Nutrition Examination Survey (NHANES) that includes multivariate categorical dietary intake data as well as binary hypertension data for low-income women in the United States. Survey sampling weights and information on stratification and clustering are included to allow for adjustment for survey design when conducting estimation and inference.