https://github.com/clevelandclinicqhs/burgle
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (15.2%) to scientific vocabulary
Last synced: 9 months ago
·
JSON representation
Repository
Basic Info
- Host: GitHub
- Owner: ClevelandClinicQHS
- License: other
- Language: R
- Default Branch: main
- Size: 136 KB
Statistics
- Stars: 3
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 3
Created over 2 years ago
· Last pushed 10 months ago
Metadata Files
Readme
Changelog
License
README.Rmd
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# burgle
[](https://cran.r-project.org/package=burgle)

The goal of burgle is to "steal" only the necessary parts of model objects for applications in simulation.
## Installation
``` {r, eval = FALSE}
install.packages("burgle")
```
Or you can install the development version of burgle like so:
``` {r, eval = FALSE}
devtools::install_github("ClevelandClinicQHS/burgle")
```
## Pros of `burgle`
1. The reduction in size can save on memory and storage space. A great advantage since R works with physical memory
2. The removal of data from the model objects can allow them to be shared freely, if there are data sharing concerns or requirements.
3. A streamlined method to simulate response for parameter uncertainty and probabilistic sampling
## Linear Model Example
```{r example}
set.seed(287453)
library(burgle)
fit <- lm(Sepal.Length ~., data = iris)
bfit <- burgle(fit)
pryr::object_size(fit)
pryr::object_size(bfit)
as.numeric(pryr::object_size(bfit)/pryr::object_size(fit))*100
```
Our `burgle_lm` is roughly `r paste0(round(as.numeric(pryr::object_size(bfit)/pryr::object_size(fit))*100,2), "%")` the size of the original `lm` object, the iris dataset has `r nrow(iris)` observations and `r ncol(iris)` columns.
Another example is the using the `nycflights13::flights` dataset.
```{r larger_example}
fit2 <- lm(arr_delay ~ as.factor(month) + dep_delay + origin + distance + hour, data = nycflights13::flights)
b_fit2 <- burgle(fit2)
as.numeric(pryr::object_size(b_fit2)/pryr::object_size(fit2))*100
```
Our `burgle_lm` is roughly `r paste0(round(as.numeric(pryr::object_size(b_fit2)/pryr::object_size(fit2))*100,2), "%")` the size of the original `lm` object. This dataset has `r nrow(nycflights13::flights)` observations and our model has used 5 of the `r ncol(nycflights13::flights)` columns as predictors.
## Simulation Massive Example
Here one can see a simulated dataset of 10 million data points with 3 random covariates.
```{r massive example}
N <- 1e7
df <- data.frame(y = rnorm(N), x1 = runif(N), x2 = runif(N, -1, 1), x3 = runif(N, -2, 2))
mfit <- lm(y~., data = df)
b_mfit <- burgle(mfit)
m0 <- pryr::object_size(mfit)
print(m0, units = "Gb")
pryr::object_size(b_mfit)
```
The `lm` is 1.6 Gb while the `burgle_lm` object is `r pryr::object_size(b_mfit)` bytes. A reduction of size by 10^6!
## Predictions
The new predict methods for our `burgle` objects allow for one to easily predict new values and multiple simluated responses of `newdata`.
The structure is as follows:
* The rows are indexed by the original row order given in the `newdata`
* The columns are the different sets of sampled model parameters from the coefficients and covariance matrix of the original model (number of `draws` set to 1 by default)
* If more than one simulation is done, then the simulations are items in a list per model and row observation (number of `sims` set to 1 by default)
If one wants to predict using the original model simply set `original = TRUE`.
Depending on the model object there are different types of predictions. By default it will return the linear predictor `(lp)`. If one wants to see the response (`type = "response"`), which makes more sense when using `glm`objects and survival models. The `se = FALSE` is an argument on whether to include the standard error of the model when simluating responses. `TRUE` means to use the model standard error when sampling. We recommend setting it to TRUE when doing more than one simulation or when setting `type = "response"`.
```{r predictions}
predict(bfit, newdata = head(iris), original = TRUE, draws = 1, se = FALSE, type = "lp")
predict(bfit, newdata = head(iris), original = FALSE, draws = 5, type = "lp")
## These two should be similar
predict(bfit, newdata = head(iris), original = FALSE, draws = 5, sims = 5, se = TRUE, type = "response")
```
## Generalized Linear Model
The framework should work with all `glm` family options, just the binomial example is demonstrated below.
```{r competiting risk}
b_glm <- burgle(glm(I(Species == "versicolor") ~ ., family = "binomial", data = iris))
predict(b_glm, head(iris), original = FALSE, se = TRUE, draws = 5, type = "lp")
predict(b_glm, head(iris), original = FALSE, se = TRUE, draws = 5, type = "response")
```
## Cox Proporiontal Hazards Model
```{r}
library(survival)
lung <- survival::lung
lung$status <- lung$status - 1
cox <- coxph(Surv(time, status) ~ age + sex + ph.ecog + ph.karno + pat.karno, data = lung)
cox_sm <- coxph(Surv(time, status) ~ age + sex + ph.ecog + ph.karno + pat.karno, data = lung, x = FALSE, y = FALSE)
b_cox <- burgle(cox)
pryr::object_size(cox)
pryr::object_size(cox_sm)
pryr::object_size(b_cox)
as.numeric(pryr::object_size(b_cox)/pryr::object_size(cox))*100
as.numeric(pryr::object_size(b_cox)/pryr::object_size(cox_sm))*100
```
Our `burgle_coxph` model is `r paste0(round(as.numeric(pryr::object_size(b_cox)/pryr::object_size(cox_sm))*100,2), "%")` the size of the original Cox proportional hazards model even after setting `x=FALSE` and `y= FALSE`. The lung dataset has `r nrow(lung)` observations.
One way to further reduce the size of the `burgle_coxph` is to reduce the number of unique time points in the data, since the it contains the baseline hazard of the model. The lung dataset as `r length(unique(lung$time))` unique values. If we were to round these to the nearest 14 days that would reduce the number of timepoints to `r length(unique(plyr::round_any(lung$time, 14)))`.
```{r}
lung$time2 <- plyr::round_any(lung$time, 14)
cox2 <- coxph(Surv(time2, status) ~ age + sex + ph.ecog + ph.karno + pat.karno, data = lung, x = TRUE, y = FALSE)
b_cox2 <- burgle(cox2)
pryr::object_size(cox2)
pryr::object_size(b_cox2)
as.numeric(pryr::object_size(b_cox2)/pryr::object_size(cox2))*100
```
This reduce the size to `r paste0(round(as.numeric(pryr::object_size(b_cox2)/pryr::object_size(cox2))*100,2), "%")` of the original `coxph` object.
## Survival predictions
Predictions are a slightly different structure for survival or longitudinal models if `type = "response"` or `"risk"`, since a time point is required. If `type = "response"` the returned results is a simulated 1 or 0 if the event has been experienced or not at a given time point(s).
The structure is as follows:
* The rows are indexed by the original row order given in `newdata`
* The columns are the different time points at which the risk is calculated
* The different elements of the lists are the different sampled model parameters (`draws`)
* If more than one simulation is done, then the different simulations are returned as lists within the list element for each model (see below for an example example)
```{r}
predict(b_cox, newdata = head(lung), original = TRUE, draws = 1, type = "lp")
predict(b_cox, newdata = head(lung), original = TRUE, draws = 1, type = "risk", times = 500)
predict(b_cox, newdata = head(lung), original = TRUE, draws = 1, type = "risk", times = c(500, 1000))
predict(b_cox, newdata = head(lung), original = TRUE, draws = 1, type = "response", times = c(500, 1000))
predict(b_cox, newdata = head(lung), original = FALSE, draws = 5, sims = 2, type = "response", times = c(500, 1000))
```
## Larger Simulation Example
```{r}
## The original model at time 500
predict(b_cox, newdata = head(lung), original = TRUE, draws = 1, type = "risk", times = c(500))
## Doing 1000 simulations from the original model and calculating the death rate
a0 <- predict(b_cox, newdata = head(lung), original = TRUE, draws = 1, sims = 1000, type = "response", times = c(500)) |>
purrr::list_flatten() |>
Reduce(f = cbind, x = _) |>
apply(1, mean)
a0
## Average survival death rate based on 1000 different models
a1 <- predict(b_cox, newdata = head(lung), original = FALSE, draws = 1000, type = "response", times = c(500)) |>
purrr::list_flatten() |>
Reduce(f = cbind, x = _) |>
apply(1, mean)
a1
## Average survival rate based on 100 simlutions for each of the 1000 models
a2 <- predict(b_cox, newdata = head(lung), original = FALSE, draws = 1000, sims = 100, type = "response", times = c(500))
### Average death per model
a3 <- lapply(a2, function(x) apply(Reduce(cbind, x), 1, mean))
## Median death rate across 1000 models and 100 simulations for each model
Reduce(rbind, a3) |>
apply(2, median)
```
This structure has also been implemented for `riskRegression::CSC` and `flexsurv::flexsurvreg` objects and numerous others on the works and the plan is to also incorporate `rstan` objects, and an overall `predict.burgle_default` method and which will only a mean and covariance matrix as inputs.
Owner
- Name: Cleveland Clinic Quantitative Health Sciences
- Login: ClevelandClinicQHS
- Kind: organization
- Email: QHSGitHubAdmin@ccf.org
- Location: United States of America
- Website: https://www.lerner.ccf.org/quantitative-health/
- Repositories: 12
- Profile: https://github.com/ClevelandClinicQHS
Open-source projects from the Department of Quantitative Health Sciences at Cleveland Clinic
GitHub Events
Total
- Push event: 3
Last Year
- Push event: 3
Dependencies
DESCRIPTION
cran
- R >= 4.0.0 depends
- MASS * imports
- stats * imports
- riskRegression * suggests
- survival * suggests