baysc: An R package for Bayesian survey clustering

baysc: An R package for Bayesian survey clustering - Published in JOSS (2026)

https://github.com/smwu/baysc

Science Score: 87.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README and JOSS metadata
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Repository

BAYesian Survey Clustering is an R package for running Bayesian clustering methods on survey data.

Basic Info
  • Host: GitHub
  • Owner: smwu
  • License: other
  • Language: R
  • Default Branch: main
  • Size: 106 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 2
  • Open Issues: 0
  • Releases: 3
Created over 2 years ago · Last pushed 3 months ago

README.md

baysc: BAYesian Survey Clustering


An R package for running Bayesian supervised and unsupervised clustering methods on survey data.

Maintainer: Stephanie M. Wu (stephanie.wu@ucl.ac.uk)

Contributors: Matthew R. Williams (mrwilliams@rti.org); Terrance D. Savitsky (savitsky.terrance@bls.gov); Briana J.K. Stephenson (bstephenson@hsph.harvard.edu)

Citation: Wu S, Williams M, Savitsky T, Stephenson B (2025). baysc: BAYesian Survey Clustering. R package version 0.1.0, https://github.com/smwu/baysc.


## Installation

``` r
# Install devtools for package loading
install.packages("devtools")
library(devtools)

# Install baysc from GitHub
devtools::install_github("smwu/baysc")
library(baysc)
```

During installation, the following errors may arise:

- *No package called 'rstantools'*: Please install the `rstantools` package using `install.packages("rstantools")`.
- *Library 'gfortran' not found*: This is a compiler configuration issue that can arise when using `Rcpp` on Mac computers with Apple silicon (e.g., M1-M4 chips). Users may need to install Xcode, GNU Fortran, and OpenMP, and edit the `~/.R/Makevars` file. For more details, see the "Supplementary" section below.
- *Library 'emutls_w' not found*: This is a toolchain mismatch issue that can arise when using `Rcpp`-dependent packages on Mac computers with Apple silicon (e.g., M1-M4 chips). Users may need to install gfortran and edit the `~/.R/Makevars` file. For more details, see the "Supplementary" section below.
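
Before installing, it can help to check which of the packages mentioned above are already available. The following is a minimal sketch in base R; `missing_packages()` is a hypothetical helper name used for illustration and is not part of `baysc`.

```r
# Hypothetical helper (not part of baysc): return the names of packages
# that cannot be loaded in the current R library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages named in the installation notes
to_install <- missing_packages(c("devtools", "rstantools"))
if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```

The install call is left commented out so that the check itself has no side effects beyond loading namespaces.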
## Overview

`baysc` is an R package for running Bayesian clustering methods on survey data. A Bayesian latent class analysis (LCA), termed the Weighted Overfitted Latent Class Analysis (WOLCA), is available for eliciting underlying cluster patterns from multivariate categorical data, incorporating survey sampling weights and other survey design elements. Options also exist for relating the patterns to a binary outcome, either by using a two-step approach that applies WOLCA and then runs a survey-weighted regression, or by using a one-step supervised approach where creation of the cluster patterns is directly informed by the outcome, referred to as the Supervised Weighted Overfitted Latent Class Analysis (SWOLCA). Summary and plotting functions for visualizing output are also available, as are diagnostic functions for examining convergence of the sampler.

More information about the models can be found in the following paper.

> Wu, S. M., Williams, M. R., Savitsky, T. D., & Stephenson, B. J. (2024). Derivation of outcome-dependent dietary patterns for low-income women obtained from survey data using a supervised weighted overfitted latent class analysis. Biometrics, 80(4), ujae122.
## Functions

Use the `wolca()` function to run an unsupervised WOLCA and obtain pattern profiles. `wolca_var_adjust()` provides a post-hoc variance adjustment that enables correct uncertainty estimation. To examine the association of pattern profiles with a binary outcome through a two-step approach, run `wolca_svyglm()`.

Use the `swolca()` function to run a SWOLCA model that allows information about the binary outcome to directly inform the creation of the pattern profiles. `swolca_var_adjust()` provides a post-hoc variance adjustment that enables correct uncertainty estimation.

Detailed information about the functions and related statistical details can be found in the vignette, "[An introduction to the baysc package](https://raw.githubusercontent.com/smwu/baysc/refs/heads/main/vignettes/baysc.pdf)," in the [JOSS manuscript](https://github.com/smwu/baysc/blob/joss-paper/JOSS/paper.pdf), and in the paper linked above.
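
As a rough illustration of the two-step approach, the three unsupervised functions might be chained as sketched below. The wrapper `run_two_step()` is a hypothetical name, and the argument names passed to `wolca()`, `wolca_var_adjust()`, and `wolca_svyglm()` are assumed to mirror those of `swolca()` and `swolca_var_adjust()` in the Example section; consult the vignette for the actual signatures.

```r
# Hypothetical wrapper (not part of baysc). Argument names are ASSUMED to
# mirror the swolca() call shown in the Example section and may differ.
run_two_step <- function(x_mat, y_all, V_data, glm_form,
                         sampling_wt, cluster_id, stratum_id) {
  # Step 1: unsupervised WOLCA to derive pattern profiles
  res <- baysc::wolca(x_mat = x_mat, sampling_wt = sampling_wt,
                      cluster_id = cluster_id, stratum_id = stratum_id,
                      adapt_seed = 888, n_runs = 300, burn = 150, thin = 3,
                      update = 50, save_res = FALSE)

  # Post-hoc variance adjustment for uncertainty estimation
  res_adjust <- baysc::wolca_var_adjust(res = res, adjust_seed = 888,
                                        num_reps = 100, save_res = FALSE)

  # Step 2: survey-weighted regression of the outcome on the patterns
  baysc::wolca_svyglm(res = res_adjust, y_all = y_all, V_data = V_data,
                      glm_form = glm_form, save_res = FALSE)
}
```

Defining the steps inside a function keeps the sketch inert until it is called with real data, for instance with the objects created in the Example section.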
## Data

`baysc` applies Bayesian latent class analysis using the following input data:

- Multivariate categorical exposure: $n \times J$ matrix, where $n$ is the sample size and $J$ is the number of categorical item variables. Each item must be a categorical variable.
- (Optional) survey design elements such as stratum indicators, cluster indicators, and sampling weights: each formatted as an $n \times 1$ vector.
- (Optional) binary outcome: $n \times 1$ vector.
- (Optional) additional confounders to adjust for when evaluating the exposure-outcome association: $n \times Q$ dataframe, where $Q$ is the number of additional confounders.

We provide an example dataset from the National Health and Nutrition Examination Survey (NHANES) that includes multivariate categorical dietary intake data as well as binary hypertension data for low-income women in the United States. Survey sampling weights and information on stratification and clustering are included to allow for adjustment for survey design when conducting estimation and inference.
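
To make the expected shapes concrete, the sketch below simulates toy inputs in base R. All values are random and purely illustrative. The sizes, the variable names in `V_data`, and the integer coding of item categories as `1, 2, ...` are assumptions for this example, not requirements stated above.

```r
set.seed(1)
n <- 100  # sample size
J <- 5    # number of categorical items
Q <- 2    # number of additional confounders

# n x J exposure matrix; categories assumed to be coded as integers 1..4
x_mat <- matrix(sample(1:4, n * J, replace = TRUE), nrow = n, ncol = J)

# Survey design elements, each a length-n vector
stratum_id  <- sample(1:2, n, replace = TRUE)
cluster_id  <- sample(1:10, n, replace = TRUE)
sampling_wt <- runif(n, min = 0.5, max = 3)

# Binary outcome, length n
y_all <- rbinom(n, size = 1, prob = 0.3)

# n x Q dataframe of confounders (illustrative names)
V_data <- data.frame(
  age_cat = factor(sample(c("young", "old"), n, replace = TRUE)),
  smoker  = factor(sample(0:1, n, replace = TRUE))
)

# Basic shape checks before passing the data to a model
stopifnot(nrow(x_mat) == n, ncol(x_mat) == J,
          length(sampling_wt) == n, length(y_all) == n,
          nrow(V_data) == n, ncol(V_data) == Q)
```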
## Example

``` r
library(baysc)

#==== Create data ====
# Load NHANES dataset
data("data_nhanes")

# Exposure matrix composed of food groups
x_mat <- as.matrix(data_nhanes[, 11:38])
# Survey stratum indicators
stratum_id <- data_nhanes$stratum_id
# Survey cluster indicators
cluster_id <- data_nhanes$cluster_id
# Survey sampling weights
sampling_wt <- data_nhanes$sample_wt
# Outcome data on hypertension
y_all <- data_nhanes$BP_flag
# Create dataframe of additional confounders
V_data <- data_nhanes[, c("age_cat", "racethnic", "smoker", "physactive")]
# Regression formula for additional confounders
glm_form <- "~ age_cat + racethnic + smoker + physactive"

#==== Run model ====
# Run SWOLCA
res_swolca <- swolca(x_mat = x_mat, y_all = y_all, V_data = V_data,
                     glm_form = glm_form, sampling_wt = sampling_wt,
                     cluster_id = cluster_id, stratum_id = stratum_id,
                     adapt_seed = 888, n_runs = 300, burn = 150, thin = 3,
                     update = 50, save_res = FALSE)

# Apply variance adjustment
res_swolca_adjust <- swolca_var_adjust(res = res_swolca, adjust_seed = 888,
                                       num_reps = 100, save_res = FALSE)

#==== Display results ====
# Plot derived patterns
plot_pattern_profiles(res = res_swolca_adjust)

# Plot outcome regression coefficients
regr_coefs <- get_regr_coefs(res = res_swolca_adjust, ci_level = 0.95,
                             digits = 2)
plot_regr_coefs(regr_coefs = regr_coefs, res = res_swolca_adjust)
```
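
The `n_runs`, `burn`, and `thin` settings together determine how many posterior draws are kept. As a rough illustration, assuming the common convention that the first `burn` iterations are discarded and every `thin`-th remaining iteration is retained (the package's exact indexing is not described here), the settings in the example leave 50 draws:

```r
# Sampler settings from the example above
n_runs <- 300
burn   <- 150
thin   <- 3

# ASSUMED convention: drop the first `burn` iterations, then keep every
# `thin`-th iteration of the remainder
kept_iters <- seq(from = burn + 1, to = n_runs, by = thin)
n_kept <- length(kept_iters)
n_kept
```

These small values keep the example quick to run; a real analysis would normally use many more iterations.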
## Supplementary

### gfortran error

For users experiencing a "**library 'gfortran' not found**" error message during installation, additional steps are needed to install the `baysc` package. This is a compiler configuration issue that can arise when using Rcpp-dependent packages on Mac computers with Apple silicon (e.g., M1, M2, M3). Please follow the instructions listed below, adapted from instructions posted at [https://stackoverflow.com/questions/70638118/configuring-compilers-on-apple-silicon-m1-m2-m3-for-rcpp-and-other-tool](https://stackoverflow.com/questions/70638118/configuring-compilers-on-apple-silicon-m1-m2-m3-for-rcpp-and-other-tool).

1. Download an R binary from CRAN. Select the binary built for Apple silicon (M1-M3), which will typically be the top link on the left under “Latest release”, and install.
2. Go through the instructions to install R, entering passwords if necessary.
3. Install Xcode by opening Terminal and running the following in the command line: `sudo xcode-select --install`. This will install the latest release version of Apple’s Command Line Tools for Xcode, which includes Apple Clang.
4. Download the GNU Fortran binary (.tar.xz file) and install GNU Fortran by running the following code in the command line:

   ```
   curl -LO https://github.com/R-macos/gcc-12-branch/releases/download/12.2-darwin-r0/gfortran-12.2-darwin20-r0-universal.tar.xz
   sudo tar xvf gfortran-12.2-darwin20-r0-universal.tar.xz -C /
   sudo ln -sfn $(xcrun --show-sdk-path) /opt/gfortran/SDK
   ```

5. Check your Apple Clang version by running the following in the command line: `clang --version`.
6. Download the OpenMP Release.tar.gz archive corresponding to your Apple Clang version.
7. Install OpenMP by running the following in the command line, making sure to include the correct version.
   For example, for Apple clang version 15.0.0 (i.e., clang 1500.x), the corresponding OpenMP version is *16.0.4*, so the commands would be:

   ```
   curl -LO https://mac.r-project.org/openmp/openmp-16.0.4-darwin20-Release.tar.gz
   sudo mkdir -p /opt/R/$(uname -m)
   sudo tar -xvf openmp-16.0.4-darwin20-Release.tar.gz --strip-components=2 -C /opt/R/$(uname -m)
   ```

8. Locate the R Makevars file at `~/.R/Makevars`. If the `~/.R` directory does not exist, create it by running `mkdir -p ~/.R` in the command line.
9. In the command line, run `vi ~/.R/Makevars` (this creates the file if it does not exist), add the lines below to the file, then save and exit by typing `ESC` followed by `:wq`.

   ```
   CPPFLAGS += -Xclang -fopenmp
   LDFLAGS += -lomp
   ```

10. Retry the `baysc` package installation by running `devtools::install_github("smwu/baysc")` in R.

### emutls_w error

For users experiencing a "**library 'emutls_w' not found**" error message during installation, additional steps are needed to install the `baysc` package. This is likely due to a toolchain mismatch issue that can arise when using `Rcpp`-dependent packages on Mac computers with Apple silicon (e.g., M1, M2, M3, M4 chips). The issue occurs on these computers because R uses a Clang-based compiler toolchain that can sometimes conflict with gfortran, especially when the compilers are built for different architectures. In particular, the system may fail to locate the `emutls_w` library, which is needed by the Clang compiler to support thread-local storage in multithreaded C++ code. Please follow the instructions listed below to resolve the issue.

1. Install gfortran. On the download page, under “Mandatory tools” and the bullet point “GNU Fortran compiler”, download the latest .dmg (e.g., `gfortran-14.2-arm64.dmg`). Make sure the version you install matches the version that is missing.
2. Confirm installation of gfortran: running `ls /opt/gfortran` in Terminal should yield folders including “bin”, “lib”, etc.
3. Find out where `libemutls_w.a` is stored.
   Run the line below in Terminal. You should see something like `/opt/gfortran/lib/gcc/aarch64-apple-darwin23/14.2.0/libemutls_w.a`.

   ```
   find /opt/gfortran -name "libemutls_w.a"
   ```

4. Explicitly point R to the directory where `libemutls_w.a` lives by updating your R Makevars file. If the `.R` folder doesn’t exist, first run `mkdir ~/.R/` in Terminal. Then open the Makevars file by running `cd ~/.R/` followed by `vi Makevars`.
5. Add the lines below to the Makevars file, then save and exit by typing `ESC` followed by `:wq`. In the last line, the text after `-L` should be the directory from Step 3 that contains `libemutls_w.a`. For example, here, it is `/opt/gfortran/lib/gcc/aarch64-apple-darwin23/14.2.0`.

   ```
   CC = clang
   CXX = clang++
   FC = /opt/gfortran/bin/gfortran
   CFLAGS = -O2 -arch arm64
   CXXFLAGS = -O2 -arch arm64
   FFLAGS = -O2 -arch arm64
   # Linker flags to find emutls_w
   LDFLAGS = -L/opt/gfortran/lib/gcc/aarch64-apple-darwin23/14.2.0 -arch arm64
   ```

6. Restart R and retry the `baysc` package installation.
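
Some of the checks in the steps above can also be run from within R. The sketch below uses only base R; `check_toolchain()` is a hypothetical helper name (not part of `baysc`), and the default paths are the ones mentioned in the instructions, which may differ on a given machine.

```r
# Hypothetical diagnostic helper (not part of baysc). Default paths follow
# the instructions above and are assumptions about the local setup.
check_toolchain <- function(gfortran_root = "/opt/gfortran",
                            makevars = "~/.R/Makevars") {
  makevars <- path.expand(makevars)

  # Search for libemutls_w.a under the gfortran installation, if present
  lib_hits <- if (dir.exists(gfortran_root)) {
    list.files(gfortran_root, pattern = "^libemutls_w\\.a$",
               recursive = TRUE, full.names = TRUE)
  } else {
    character(0)
  }

  list(
    machine         = Sys.info()[["machine"]],     # e.g., "arm64"
    gfortran_found  = dir.exists(gfortran_root),
    emutls_dirs     = unique(dirname(lib_hits)),   # candidate -L directories
    makevars_exists = file.exists(makevars),
    makevars_lines  = if (file.exists(makevars)) {
      readLines(makevars, warn = FALSE)
    } else {
      character(0)
    }
  )
}

str(check_toolchain())
```

If `emutls_dirs` is non-empty, one of its entries is the directory to place after `-L` in the `LDFLAGS` line of the Makevars file.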
## Contributing and Getting Help

Please report bugs by opening an [issue](https://github.com/smwu/baysc/issues/new/choose). If you wish to contribute, please make a pull request. If you have questions, you can open a [discussion thread](https://github.com/smwu/baysc/discussions).

JOSS Publication

baysc: An R package for Bayesian survey clustering
Published
March 11, 2026
Volume 11, Issue 119, Page 8382
Authors
Stephanie M. Wu ORCID
Division of Psychiatry, UCL, London, U.K.
Matthew R. Williams ORCID
RTI International, Research Triangle Park, North Carolina, U.S.A
Terrance D. Savitsky ORCID
Office of Survey Methods Research, U.S. Bureau of Labor Statistics, Washington, DC, U.S.A
Briana J.K. Stephenson ORCID
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, U.S.A
Editor
Sehrish Kanwal ORCID
Tags
Bayesian survey model-based clustering dietary patterns

GitHub Events

Total
  • Release event: 2
  • Pull request event: 4
  • Fork event: 2
  • Issues event: 3
  • Issue comment event: 3
  • Push event: 16
  • Create event: 1
Last Year
  • Release event: 2
  • Pull request event: 4
  • Fork event: 2
  • Issues event: 3
  • Issue comment event: 3
  • Push event: 14
  • Create event: 1

Issues and Pull Requests


All Time
  • Total issues: 2
  • Total pull requests: 3
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 2 days
  • Total issue authors: 1
  • Total pull request authors: 2
  • Average comments per issue: 4.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 3
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 2 days
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 4.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • CalebAbhulimhen (2)
Pull Request Authors
  • smwu (2)
  • hackdna (1)