simChef

simChef: High-quality data science simulations in R - Published in JOSS (2024)

https://github.com/yu-group/simchef

Science Score: 98.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in JOSS metadata
✓ Academic publication links: Links to arxiv.org
✓ Committers with academic emails: 2 of 4 committers (50.0%) from academic institutions
✓ Institutional organization owner: Organization yu-group has institutional domain (www.stat.berkeley.edu)
✓ JOSS paper metadata: Published in Journal of Open Source Software

Scientific Fields

Physics Physical Sciences - 40% confidence

Last synced: 6 months ago

Repository

An R package to facilitate PCS simulation studies.

Basic Info

Host: GitHub
Owner: Yu-Group
License: gpl-3.0
Language: R
Default Branch: main
Homepage: https://yu-group.github.io/simChef/
Size: 51.3 MB

Statistics

Stars: 23
Watchers: 5
Forks: 0
Open Issues: 21
Releases: 3

Created over 4 years ago · Last pushed about 1 year ago

Metadata Files

Readme Changelog Contributing License

simChef

Overview

The goal of simChef is to help you quickly cook up a fully-realized, high-quality, reproducible, and transparently-documented simulation study in a flexible, efficient, and low-code manner. simChef removes many of the administrative burdens of simulation design through:

- An intuitive tidy grammar of data science simulations
- Powerful abstractions for distributed simulation processing backed by future
- Automated generation of interactive R Markdown simulation documentation, situating results next to the workflows needed to reproduce them.

Installation

simChef is under active development. To install the package directly from GitHub, please use:

devtools::install_github("Yu-Group/simChef")

Example Usage

Consider the following toy simulation experiment, where we want to study the prediction accuracy of linear regression and random forests under both linear and non-linear data-generating processes for varying signal-to-noise ratios.

Let us first code up the necessary simulation components, namely, the linear and nonlinear (here, an exclusive-or) data-generating processes as well as the linear regression and random forest models. To evaluate the methods and visualize the results, one can also write custom code, but we will leverage built-in evaluation and visualization functions (e.g., summarize_pred_err and plot_pred_err) from simChef for convenience.

```r
# Generate data via linear model
linear_dgp_fun <- function(n_train, n_test, p, beta, noise_sd) {
  n <- n_train + n_test
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)
  y <- X %*% beta + rnorm(n, sd = noise_sd)
  data_list <- list(
    X_train = X[1:n_train, , drop = FALSE],
    y_train = y[1:n_train],
    X_test = X[(n_train + 1):n, , drop = FALSE],
    y_test = y[(n_train + 1):n]
  )
  return(data_list)
}

# Generate data via exclusive-or model
xor_dgp_fun <- function(n_train, n_test, p, thresh, beta, noise_sd) {
  n <- n_train + n_test
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)
  xor <- (((X[, 1] > thresh) + (X[, 2] > thresh)) == 1)
  y <- beta * xor + rnorm(n, sd = noise_sd)
  data_list <- list(
    X_train = X[1:n_train, , drop = FALSE],
    y_train = y[1:n_train],
    X_test = X[(n_train + 1):n, , drop = FALSE],
    y_test = y[(n_train + 1):n]
  )
  return(data_list)
}

# Fit linear regression model
linear_reg_fun <- function(X_train, y_train, X_test, y_test) {
  train_df <- dplyr::bind_cols(data.frame(X_train), y = y_train)
  fit <- lm(y ~ ., data = train_df)
  predictions <- predict(fit, data.frame(X_test))
  return(list(predictions = predictions, y_test = y_test))
}

# Fit random forest model
rf_fun <- function(X_train, y_train, X_test, y_test, ...) {
  train_df <- dplyr::bind_cols(data.frame(X_train), y = y_train)
  fit <- ranger::ranger(y ~ ., data = train_df, ...)
  predictions <- predict(fit, data.frame(X_test))$predictions
  return(list(predictions = predictions, y_test = y_test))
}
```

From here, there is minimal coding on the user's end, as simChef provides a powerful tidy grammar to instantiate, assemble, and run various configurations of the simulation experiment.

```r
library(simChef)

# Uncomment to run experiment across multiple processors
# library(future)
# plan(multisession, workers = 5)

# Create `simChef` DGPs (data-generating processes)
linear_dgp <- create_dgp(
  .dgp_fun = linear_dgp_fun, .name = "Linear DGP",
  # additional named parameters to pass to .dgp_fun()
  n_train = 200, n_test = 200, p = 2, beta = c(1, 0), noise_sd = 1
)
xor_dgp <- create_dgp(
  .dgp_fun = xor_dgp_fun, .name = "XOR DGP",
  # additional named parameters to pass to .dgp_fun()
  n_train = 200, n_test = 200, p = 2, thresh = 0, beta = 1, noise_sd = 1
)

# Create `simChef` Methods
linear_reg <- create_method(
  .method_fun = linear_reg_fun, .name = "Linear Regression"
  # additional named parameters to pass to .method_fun()
)
rf <- create_method(
  .method_fun = rf_fun, .name = "Random Forest",
  # additional named parameters to pass to .method_fun()
  num.threads = 1
)

# Create `simChef` Evaluators
pred_err <- create_evaluator(
  .eval_fun = summarize_pred_err, .name = 'Prediction Accuracy',
  # additional named parameters to pass to .eval_fun()
  truth_col = "y_test", estimate_col = "predictions"
)

# Create `simChef` Visualizers
pred_err_plot <- create_visualizer(
  .viz_fun = plot_pred_err, .name = 'Prediction Accuracy Plot',
  # additional named parameters to pass to .viz_fun()
  eval_name = 'Prediction Accuracy'
)

# Create experiment
experiment <- create_experiment(name = "Test Experiment") |>
  add_dgp(linear_dgp) |>
  add_dgp(xor_dgp) |>
  add_method(linear_reg) |>
  add_method(rf) |>
  add_evaluator(pred_err) |>
  add_visualizer(pred_err_plot) |>
  # vary across noise parameter in linear dgp
  add_vary_across(
    .dgp = "Linear DGP",
    noise_sd = c(0.1, 0.5, 1, 2)
  ) |>
  # vary across noise parameter in xor dgp
  add_vary_across(
    .dgp = "XOR DGP",
    noise_sd = c(0.1, 0.5, 1, 2)
  )

# Run experiment over n_reps
results <- run_experiment(experiment, n_reps = 100, save = TRUE)

# Render automated documentation and view results
render_docs(experiment)
```

Simulation experiment complete!

In addition, the code, narrative, and results of the simulation experiment have been automatically rendered into an interactive HTML document via R Markdown (see ?render_docs), such as the one shown below:

Interactive R Markdown simulation documentation

For a more detailed walkthrough of this example usage, please see vignette("simChef").
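Before handing component functions to simChef, it can help to exercise them on their own. The following is a minimal base-R sketch, not part of the package: `xor_dgp_sketch()` and `lm_method_sketch()` are hypothetical stand-ins that mirror the exclusive-or DGP and the linear regression method from the example (built with `data.frame()` alone so that no dplyr dependency is needed). It checks that the DGP output has the expected shapes, that fixing the seed reproduces the same draw, and that the method returns one prediction per test row.

```r
# Hypothetical stand-in for the XOR DGP in the example (base R only).
xor_dgp_sketch <- function(n_train, n_test, p, thresh, beta, noise_sd) {
  n <- n_train + n_test
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)
  # Signal is present when exactly one of the first two features exceeds thresh.
  xor <- ((X[, 1] > thresh) + (X[, 2] > thresh)) == 1
  y <- beta * xor + rnorm(n, sd = noise_sd)
  list(
    X_train = X[1:n_train, , drop = FALSE],
    y_train = y[1:n_train],
    X_test = X[(n_train + 1):n, , drop = FALSE],
    y_test = y[(n_train + 1):n]
  )
}

# Hypothetical stand-in for the linear regression method in the example.
lm_method_sketch <- function(X_train, y_train, X_test, y_test) {
  train_df <- data.frame(X_train, y = y_train)
  fit <- lm(y ~ ., data = train_df)
  predictions <- predict(fit, data.frame(X_test))
  list(predictions = predictions, y_test = y_test)
}

# Draw the same data twice under the same seed.
set.seed(331)
data_a <- xor_dgp_sketch(n_train = 50, n_test = 20, p = 2,
                         thresh = 0, beta = 1, noise_sd = 0.5)
set.seed(331)
data_b <- xor_dgp_sketch(n_train = 50, n_test = 20, p = 2,
                         thresh = 0, beta = 1, noise_sd = 0.5)

# The DGP returns a named list, so it can be passed to the method via do.call().
fit_out <- do.call(lm_method_sketch, data_a)

dim(data_a$X_train)          # 50 rows, 2 columns
identical(data_a, data_b)    # same seed, same data
length(fit_out$predictions)  # one prediction per test row
```

Because the names in the DGP's output list match the method's arguments, the two pieces compose without any glue code, which is the same convention the example functions follow.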

For examples of real-world case studies using simChef to develop novel statistical methodology, please check out:

Boileau et al. (2022). A Flexible Approach for Predictive Biomarker Discovery. (GitHub, simChef docs)
Huang et al. (2025). Distilling heterogeneous treatment effects: Stable subgroup estimation in causal inference. (GitHub, simChef docs)

More examples of the rendered documentation for different simulation experiments:

Grammar of a `simChef` Simulation Experiment

The simChef API distills a simulation experiment into four modular concepts, two of which are optional (but highly recommended): data-generating processes (DGPs), methods, evaluation (optional), and visualization (optional). simChef takes an object-oriented approach to encapsulate these simulation concepts, using R6 classes to make them concrete. These four classes are:

- DGP: corresponds to the data-generating process from which to generate data.
  - DGPs simply generate data in a reproducible and flexible manner, in the size and manner that you specify. For a library of preset but highly customizable DGPs, simChef has a sibling R package, dgpoix (currently in early development).
  - Ex: In the above example usage, there are two DGPs: the linear DGP and the exclusive-or DGP.
- Method: corresponds to the method (or model) to fit to the data in the experiment.
  - Methods can be either a new method under study, a baseline comparison method, or any means by which to transform the simulated data (i.e., the output of DGP).
  - Ex: In the above example usage, there are two methods: linear regression and random forests.
- Evaluator: corresponds to the evaluation metrics/functions to evaluate the methods' performance.
  - Evaluators receive the results of the fitted methods and summarize them to produce meaningful statistics about the experiment.
  - Ex: In the above example usage, there is one evaluation function that evaluates the test prediction accuracy.
- Visualizer: corresponds to the visualization tools/functions to visualize results.
  - These visualizations can be applied directly to the raw method outputs, the evaluation transformations/summaries, or both. Visualizers can output anything that can be rendered in an R Markdown document: static or interactive plots, tables, strings and captured output, markdown, generic HTML, etc.
  - Ex: In the above example usage, there is one visualization function that visualizes the test prediction accuracy, averaged across experimental replicates.
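To make the Evaluator concept concrete, here is a minimal base-R sketch of the kind of summary an evaluation function computes. It assumes, for illustration only, that the fitted results arrive as a data frame with one row per replicate, DGP, and method, and with list-columns `predictions` and `y_test`. The column names `.rep`, `.dgp_name`, and `.method_name`, the function `rmse_eval_sketch()`, and the mock values are assumptions of this sketch rather than a statement of the package's interface; in the example usage, the built-in summarize_pred_err plays this role.

```r
# Mock fitted results: 2 replicates x 2 methods on one DGP (made-up values).
fit_results <- data.frame(
  .rep = c(1, 1, 2, 2),
  .dgp_name = "Linear DGP",
  .method_name = c("Linear Regression", "Random Forest",
                   "Linear Regression", "Random Forest")
)
fit_results$y_test <- list(c(1, 2, 3), c(1, 2, 3), c(0, 1, 2), c(0, 1, 2))
fit_results$predictions <- list(c(1, 2, 3), c(2, 3, 4), c(0, 1, 4), c(0, 1, 2))

# Hypothetical evaluation function: RMSE per replicate, then the mean RMSE
# across replicates for each DGP-method pair.
rmse_eval_sketch <- function(fit_results) {
  fit_results$rmse <- mapply(
    function(truth, estimate) sqrt(mean((truth - estimate)^2)),
    fit_results$y_test, fit_results$predictions
  )
  aggregate(
    x = fit_results["rmse"],
    by = fit_results[c(".dgp_name", ".method_name")],
    FUN = mean
  )
}

eval_summary <- rmse_eval_sketch(fit_results)
eval_summary
```

The result has one row per DGP-method pair, which is the shape a Visualizer would typically consume when plotting accuracy averaged across replicates.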

A fifth R6 class and concept, Experiment, unites the four concepts above. More precisely, an Experiment is a collection of DGP(s), Method(s), Evaluator(s), and Visualizer(s), which are thoughtfully composed to answer a particular question of interest. An Experiment can also include references to DGP and/or Method parameters that should be varied and combined during the simulation run (see ?add_vary_across).

Using the DGP, Method, Evaluator, and Visualizer classes, users can easily build a simChef Experiment using reusable building blocks and customizable functions.

Once an Experiment has been constructed, users can finally run the simulation experiment via the function run_experiment(). As summarized in the figure below, running the experiment will (1) fit each Method on each DGP (and for each of the varying parameter configurations), (2) evaluate the experiment according to the given Evaluator(s), and (3) visualize the experiment according to the given Visualizer(s).

Overview of running a `simChef` `Experiment`. The `Experiment` class handles relationships among the four classes: `DGP`, `Method`, `Evaluator`, and `Visualizer`. Experiments may have multiple `DGP`s and `Method`s, which are combined across the Cartesian product of their varying parameters (represented by `*`). Once computed, each `Evaluator` and `Visualizer` takes in the fitted simulation replicates, while `Visualizer` additionally receives evaluation summaries.
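As a rough illustration of the Cartesian product described above, the base-R sketch below enumerates the fits implied by the example experiment: two DGPs, each varied over the same four values of `noise_sd`, crossed with two methods and 100 replicates. This is only a counting exercise with `expand.grid()`; it does not reflect how simChef schedules work internally, and the column names are made up for the sketch.

```r
# Enumerate every (DGP, noise level, method, replicate) combination
# implied by the example experiment.
fit_grid <- expand.grid(
  dgp = c("Linear DGP", "XOR DGP"),
  noise_sd = c(0.1, 0.5, 1, 2),
  method = c("Linear Regression", "Random Forest"),
  rep = seq_len(100),
  stringsAsFactors = FALSE
)

# Total number of method fits: 2 * 4 * 2 * 100
nrow(fit_grid)

# Fits per DGP-method pair: 4 noise levels x 100 replicates each
table(fit_grid$dgp, fit_grid$method)
```

A full grid is appropriate here only because both DGPs vary over the same noise levels; if each DGP had its own set of varying parameters, the combinations would need to be enumerated per DGP and then stacked.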

Origins of `simChef`

Towards veridical data science

In their 2020 paper "Veridical Data Science", Yu and Kumbier propose the predictability, computability, and stability (PCS) framework, a workflow and documentation for "responsible, reliable, reproducible, and transparent results across the data science life cycle". Under the umbrella of the PCS framework, we began the process of deriving a set of guidelines tailored specifically for simulation studies, inspired by both high-quality simulation studies from the literature and our own simulations to examine the statistical properties of methods within the PCS framework. While creating our own simulations, we soon found that no existing R package could fully satisfy our developing requirements. What began as a toolbox for our own simulations became `simChef`. We believe these tools will be useful for anyone intending to create their own simulation studies in R.

Thinking like a chef

The development of `simChef` has been guided by our love of... cooking? Perhaps surprisingly, we found that cooking serves as a useful analogy for the process of creating a simulation study. For the aspiring chefs, consider the following components of a high-quality meal:

- **Nutritious and delicious ingredients** -- All good meals start with good ingredients, and the same is true of simulation experiments. If realistic simulation data (entirely synthetic or driven by real-world data) is not available, then there is no hope of producing high-quality simulations. **Creating realistic synthetic data is the primary goal of our sibling package [`dgpoix`](https://yu-group.github.io/dgpoix/), which was initially integrated into `simChef`.**
- **Skill and experience of the chef** -- Just as every chef's cooking is informed by the handful of cuisines in which they specialize, simulation experiments are motivated by scientific questions from a particular domain. Just as a chef does not have to become an expert knifemaker before cooking their first meal, nor should the domain scientist have to waste time writing boilerplate code for the computation and documentation of their simulations. **`simChef` takes care of the details of running your experiments across the potentially large number of data and method perturbations you care about, freeing up time for you to focus on your scientific question.**
- **High-quality tools in the kitchen** -- Our package should be like an excellent chef's knife or other kitchen essential. If a chef's knife doesn't cut straight or isn't sharpened, then kitchen speed and safety suffer, as does the final presentation. **`simChef` won't cook a good simulation experiment for you, but it will get you there with less effort and higher-quality presentation while helping you follow best practices like reproducibility with minimal effort on your part.** No sharpening required!
- **A high-quality meal is possible in almost any environment** -- While the scale of a delicious meal may be limited by environment, high-quality meals are not only found in the world's Michelin-starred restaurants but also in home kitchens and street food carts around the world. An effective simulation framework should also be agnostic to environment, and **`simChef` runs equally well on your laptop as on a high-performance computing cluster.**
- **Appetizing and approachable presentation** -- Ultimately, a chef prepares food for a specific audience, and presentation is almost equal in importance to the underlying substance of the meal. However, a chef doesn't have to build the plate on which they serve their food. **`simChef` provides tools to turn your simulation experiment results into effective displays of quantitative information which are populated within preset and customizable R Markdown templates.**

Related R packages

Below, we examine the main functionality of a number of existing tools for running reproducible simulation experiments that are currently available on CRAN and have been updated within the last couple of years.

- batchtools implements abstractions for "problems" (similar to our DGP concept), "algorithms" (Method in simChef), and "experiments". In addition to shared-memory computation via the parallel and snow packages, it also provides a number of utilities for working with high performance computing batch systems such as Slurm and Torque, which simChef supports via the future.batchtools package.
- simulator provides a similar tidy human-readable framework for performing simulations such as those common in methodological statistics papers. simulator includes code for running simulations in parallel, storing simulation outputs, summarizing simulation results with plots and tables, and generating reports, among many other features.
- SimDesign provides helper functions to define experimental conditions and then pass those experimental conditions to a user-defined data generation function, analysis function, and summary function. The package also provides a number of these functions for the user to choose from. Each experimental condition can be run over many replicates, computing results in parallel via the parallel package.
- simhelpers defines functions to calculate Monte Carlo standard errors of simulation performance metrics, generate skeleton simulation code, and evaluate in parallel across simulation parameters via the future package.
- The simTool package has two main functions: expand_tibble() and eval_tibble(). The former wraps the base R function expand.grid() to create a Cartesian product of simulation functions and parameters, while the latter evaluates those functions in parallel via the parallel package.
- The parSim package implements a single function of the same name which allows for parallelization of arbitrary R expressions across replicates and simulation conditions. parSim uses the snow package to set up parallel backends.
- rsimsum is an R implementation of the Stata command simsum and provides helper functions for summarizing and visualizing the results of a simulation study.

Citing `simChef`

To cite simChef in publications, please use:

@software{duncan2024simchef,
  title={simChef: High-quality data science simulations in R},
  author={Duncan, James and Tang, Tiffany and Elliott, Corrine F and Boileau, Philippe and Yu, Bin},
  journal={Journal of Open Source Software},
  volume={9},
  number={95},
  pages={6156},
  year={2024}
}

Owner

Name: Yu-Group
Login: Yu-Group
Kind: organization
Email: chandan_singh@berkeley.edu
Location: Berkeley, CA

Website: https://www.stat.berkeley.edu/~yugroup/
Repositories: 19
Profile: https://github.com/Yu-Group

Bin Yu Group at UC Berkeley

JOSS Publication

simChef: High-quality data science simulations in R

Published

March 28, 2024

DOI

10.21105/joss.06156

Volume 9, Issue 95, Page 6156

Authors

James Duncan

Graduate Group in Biostatistics, University of California, Berkeley, United States of America

Tiffany Tang

Department of Statistics, University of California, Berkeley, United States of America

Corrine F. Elliott

Department of Statistics, University of California, Berkeley, United States of America

Philippe Boileau

Graduate Group in Biostatistics, University of California, Berkeley, United States of America

Bin Yu

Graduate Group in Biostatistics, University of California, Berkeley, United States of America; Department of Statistics, University of California, Berkeley, United States of America; Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, United States of America; Center for Computational Biology, University of California, Berkeley, United States of America

Editor

Kevin M. Moerman

GitHub Events

Total

Create event: 14
Release event: 1
Issues event: 21
Watch event: 2
Delete event: 15
Issue comment event: 2
Push event: 64
Pull request event: 20

Last Year

Create event: 14
Release event: 1
Issues event: 21
Watch event: 2
Delete event: 15
Issue comment event: 2
Push event: 64
Pull request event: 20

Committers

Last synced: 7 months ago

All Time

Total Commits: 415
Total Committers: 4
Avg Commits per committer: 103.75
Development Distribution Score (DDS): 0.347

Past Year

Commits: 16
Committers: 1
Avg Commits per committer: 16.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Tiffany Tang	t**5@g**m	271
James Duncan	j**n@b**u	123
Philippe Boileau	p**u@b**u	18
GitHub Actions	a****s	3

Committer Domains (Top 20 + Academic)

berkeley.edu: 2

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 63
Total pull requests: 47
Average time to close issues: 7 months
Average time to close pull requests: 17 days
Total issue authors: 5
Total pull request authors: 3
Average comments per issue: 0.71
Average comments per pull request: 0.51
Merged pull requests: 39
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 6
Pull requests: 21
Average time to close issues: 3 days
Average time to close pull requests: about 6 hours
Issue authors: 1
Pull request authors: 1
Average comments per issue: 0.17
Average comments per pull request: 0.0
Merged pull requests: 17
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

tiffanymtang (36)
jpdunc23 (21)
rcannood (3)
yanshuotan (1)
PhilBoileau (1)

Pull Request Authors

tiffanymtang (38)
jpdunc23 (10)
PhilBoileau (2)

Top Labels

Issue Labels

enhancement (30) bug (11) documentation (8) question (5) testing (2) github_workflow (2) deprecation (1)

Pull Request Labels

enhancement (14) bug (5) documentation (1)

Dependencies

DESCRIPTION cran

R.utils * imports
R6 * imports
data.table * imports
dplyr * imports
fontawesome * imports
future * imports
future.apply * imports
magrittr * imports
methods * imports
purrr * imports
rlang * imports
rmarkdown * imports
rmdformats * imports
stringr * imports
tibble * imports
tidyselect * imports
vthemes * imports
yardstick * imports
MASS * suggests
broom * suggests
callr * suggests
dgpoix * suggests
fs * suggests
future.callr * suggests
ggplot2 * suggests
glmnet * suggests
here * suggests
knitr * suggests
lobstr * suggests
plotly * suggests
prettydoc * suggests
progressr >= 0.9.0 suggests
ranger * suggests
testthat >= 3.1.0 suggests
tidyr * suggests
vdiffr * suggests
withr >= 2.5.0 suggests

.github/workflows/check-standard.yaml actions

actions/checkout v3 composite
r-lib/actions/check-r-package v2 composite
r-lib/actions/setup-pandoc v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite

.github/workflows/pkgdown.yaml actions

JamesIves/github-pages-deploy-action v4.4.1 composite
actions/checkout v3 composite
r-lib/actions/setup-pandoc v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite
