SuperNOVA

SuperNOVA: Semi-Parametric Identification and Estimation of Interaction and Effect Modification in Mixed Exposures using Stochastic Interventions in R - Published in JOSS (2023)

https://github.com/blind-contours/supernova

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 7 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org, zenodo.org
  • Committers with academic emails
    1 of 6 committers (16.7%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

interactions machine-learning mixed-exposure statistics targeted-learning variable-importance

Keywords from Contributors

causal-effects causal-inference decision-trees exposure-mixtures robust-statistics
Last synced: 6 months ago

Repository

:dizzy: :dart: Automatic identification of variable and interaction importance using basis functions and non-parametric estimation of interactions/effect modification using joint stochastic interventions.

Basic Info
  • Host: GitHub
  • Owner: blind-contours
  • License: other
  • Language: R
  • Default Branch: main
  • Homepage:
  • Size: 193 MB
Statistics
  • Stars: 9
  • Watchers: 2
  • Forks: 1
  • Open Issues: 2
  • Releases: 2
Topics
interactions machine-learning mixed-exposure statistics targeted-learning variable-importance
Created about 4 years ago · Last pushed over 2 years ago
Metadata Files
Readme Contributing License

README.Rmd

---
output:
  rmarkdown::github_document
bibliography: "inst/references.bib"
always_allow_html: true
---



```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```

# R/`SuperNOVA` 


[![R-CMD-check](https://github.com/blind-contours/SuperNOVA/workflows/R-CMD-check/badge.svg)](https://github.com/blind-contours/SuperNOVA/actions)
[![Coverage Status](https://img.shields.io/codecov/c/github/blind-contours/SuperNOVA/master.svg)](https://codecov.io/github/blind-contours/SuperNOVA?branch=master)
[![CRAN](https://www.r-pkg.org/badges/version/SuperNOVA)](https://www.r-pkg.org/pkg/SuperNOVA)
[![CRAN downloads](https://cranlogs.r-pkg.org/badges/SuperNOVA)](https://CRAN.R-project.org/package=SuperNOVA)
[![CRAN total downloads](http://cranlogs.r-pkg.org/badges/grand-total/SuperNOVA)](https://CRAN.R-project.org/package=SuperNOVA)
[![Project Status: Active - The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![MIT license](https://img.shields.io/badge/license-MIT-brightgreen.svg)](https://opensource.org/licenses/MIT)




> Efficient Estimation of the Causal Effects of Non-Parametric Interactions, Effect Modifications and Mediation using Stochastic Interventions

__Authors:__ [David McCoy](https://davidmccoy.org)

---

## What's `SuperNOVA`?

The `SuperNOVA` R package offers a comprehensive toolset for identifying predictive variable sets, be it subsets of exposures, exposure-covariates, or exposure-mediators, for a specified outcome. It further assists in creating efficient estimators for the counterfactual mean of the outcome under stochastic interventions on these variable sets. This means making exposure changes that depend on naturally observed values, as described in past literature [@diaz2012population; @haneuse2013estimation].
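As a sketch in notation (the symbols $d$ and $\psi$ are introduced here for illustration and are not taken from the package documentation), let $A$ be an exposure, $W$ the covariates, and $Y$ the outcome. A simple additive shift, one common form of an intervention that depends on the naturally observed exposure value, and the corresponding counterfactual mean can be written as:

$$
d(a, w) = a + \delta, \qquad \psi(\delta) = \mathbb{E}\left[ Y_{d(A, W)} \right]
$$

Here $Y_{d(A, W)}$ denotes the outcome that would be observed if each unit's exposure were moved from its observed value $A$ to $A + \delta$. In practice such a shift is usually only considered where the shifted value remains within the support of the exposure.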

`SuperNOVA` introduces several estimators, constructed based on the patterns found in the data. At present, semi-parametric estimators are available for various contexts: interaction, effect modification, marginal impacts, and mediation. The target parameters that this package calculates are grounded in previously published works: one regarding interaction and effect modification [@mccoy2023semiparametric] and another dedicated to mediation [@McCoy2023mediation].

The `SuperNOVA` package builds upon the capabilities of the `txshift` package, which implements the TML estimator for a stochastic shift causal parameter [@diaz2012population]. A notable extension in SuperNOVA is its support for joint stochastic interventions on two exposures. This allows for the creation of a non-parametric interaction parameter, shedding light on the combined effect of concurrently shifting two variables relative to the cumulative effect of individual shifts. At this stage, the focus is on two-way shifts.

Various parameters are calculated based on identified patterns:

* Interaction Parameter: When two exposures show signs of interaction, this parameter is calculated.

* Individual Stochastic Interventions: When marginal effects are identified, this estimates the difference in outcomes upon making a specific shift compared to no intervention.

* Effect Modification: In scenarios where effect modification is evident, the target parameter for it is the mean outcome under intervention within specific regions of the covariate space that the data suggests. This contrasts with the marginal effect by diving deeper into strata of covariates, for example, gauging the effects of exposure shifts among distinct genders.

* Mediation: In scenarios where mediation is found, the target parameters are the natural direct and indirect effects as defined by stochastic interventions. That is, `SuperNOVA` estimates the natural indirect effect, the impact of shifting an exposure that operates through a mediator, and the natural direct effect, the impact that does not operate through the mediator.
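The first three parameters in the list above can be sketched in notation as follows. This is an illustrative formalization under the assumption of additive shifts $\delta_j$ and $\delta_k$ on exposures $A_j$ and $A_k$; the symbols $\psi_j$, $\psi_{jk}^{\text{intxn}}$, $\psi_{j \mid V}$ and the covariate region $V$ are names chosen here, not the package's own notation. The mediation parameters are left out of this sketch.

$$
\begin{aligned}
\psi_{j} &= \mathbb{E}\left[ Y_{A_j + \delta_j} \right] - \mathbb{E}[Y] \\
\psi_{jk}^{\text{intxn}} &= \left( \mathbb{E}\left[ Y_{A_j + \delta_j,\, A_k + \delta_k} \right] - \mathbb{E}[Y] \right) - \left( \psi_j + \psi_k \right) \\
\psi_{j \mid V} &= \mathbb{E}\left[ Y_{A_j + \delta_j} \mid W \in V \right]
\end{aligned}
$$

Under this reading, $\psi_{jk}^{\text{intxn}}$ is zero when the effect of shifting both exposures together equals the sum of the two individual shift effects.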

Mediation in `SuperNOVA` is a new feature. Here, `SuperNOVA` brings to the table estimates initially conceived in another work [@Diaz2020a], while also supporting mediation estimates for continuous exposures. 

The package ensures robustness by employing a k-fold cross-validation framework. This framework estimates a data-adaptive parameter: the stochastic shift target parameters for the variable sets identified as influential for the outcome. The process begins by partitioning the data into parameter-generating and estimation samples. In the parameter-generating sample, a collection of basis function estimators is fit to the data, and the one with the lowest cross-validated mean squared error is selected. Key variable sets are then extracted using an ANOVA-like variance decomposition. In the estimation sample, targeted learning is used to estimate the causal target parameters across the different contexts: interaction, mediation, effect modification, and individual variable shifts.
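The sample-splitting idea can be sketched with base R alone. This is a toy illustration, not the package's implementation: the data, the two candidate formulas (standing in for basis function estimators), and the helper `cv_mse` are all hypothetical, and the variance decomposition and targeted learning steps are not shown.

```r
set.seed(1)
n <- 300
toy <- data.frame(A1 = rnorm(n), A2 = rnorm(n), W = rnorm(n))
toy$Y <- 1 + 2 * toy$A1 + toy$A1 * toy$A2 + rnorm(n)

# Assign each observation to one of the outer folds.
n_folds <- 3
fold_id <- sample(rep(seq_len(n_folds), length.out = n))

# Hypothetical candidates standing in for basis function estimators.
candidates <- list(
  main_effects = Y ~ A1 + A2 + W,
  with_interaction = Y ~ A1 * A2 + W
)

# Cross-validated mean squared error of one candidate on one data set.
cv_mse <- function(formula, data, v = 5) {
  inner <- sample(rep(seq_len(v), length.out = nrow(data)))
  errs <- vapply(seq_len(v), function(j) {
    fit <- lm(formula, data = data[inner != j, ])
    held <- data[inner == j, ]
    mean((held$Y - predict(fit, newdata = held))^2)
  }, numeric(1))
  mean(errs)
}

# For each outer fold, select a candidate using only the
# parameter-generating sample; fold k is held out for estimation.
selected <- vapply(seq_len(n_folds), function(k) {
  param_gen <- toy[fold_id != k, ]
  mses <- vapply(candidates, cv_mse, numeric(1), data = param_gen)
  names(which.min(mses))
}, character(1))

selected
```

Because the candidate is chosen without touching the held-out fold, whatever is estimated on that fold is not affected by the selection step, which is the motivation for splitting the data this way.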

By using SuperNOVA, users get access to a tool that offers both k-fold specific and aggregated results for each target parameter, ensuring that researchers can glean the most information from their data. For a more in-depth exploration, there's an accompanying vignette.

To use the package, users provide the exposures, covariates, mediators, and outcome. They also specify the respective $\delta$ for each exposure (the degree of shift) and whether this delta should adapt in response to positivity violations. A detailed guide is provided in the vignette. With these inputs, `SuperNOVA` processes the data and returns tables of fold-specific and pooled results.

`SuperNOVA` also incorporates features from the `sl3` package [@coyle-sl3-rpkg], facilitating ensemble machine learning in the estimation process. If the user does not specify any stack parameters, `SuperNOVA` will automatically create an ensemble of machine learning algorithms that strike a balance between flexibility and computational efficiency.


---

## Installation

*Note:* `SuperNOVA` (currently) depends on `sl3`, which allows ensemble
machine learning to be used for nuisance parameter estimation. Because `sl3`
is not on CRAN, `SuperNOVA` is not available on CRAN either and must be
installed from GitHub.

`SuperNOVA` has many dependencies, so it is easier to install them in
stages to make sure each one installs properly.

First install the basis estimators used in the data-adaptive
variable discovery of the exposure and covariate space: 

```{r tree-packages, eval = FALSE}
install.packages("earth")
install.packages("hal9001")
```

`SuperNOVA` uses the `sl3` package to build ensemble machine learners for each nuisance parameter. 
`sl3` has to be installed from its development branch. First install these packages, which `sl3` uses:

```{r super-learner-packages,  eval = FALSE}
install.packages(c("ranger", "arm", "xgboost", "nnls"))
```

Now install `sl3` from its `devel` branch:

```{r sl3_devel,  eval = FALSE}
# requires the remotes package: install.packages("remotes")
remotes::install_github("tlverse/sl3@devel")
```

Make sure `sl3` installs correctly, then install `SuperNOVA`:

```{r SuperNOVA_install,  eval = FALSE}
remotes::install_github("blind-contours/SuperNOVA@main")
```

`SuperNOVA` has some other miscellaneous dependencies that are used in the examples as well as in the plotting functions. 

```{r msc_package,  eval = FALSE}
install.packages(c("kableExtra", "hrbrthemes", "viridis"))
```
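After the steps above, a quick check can confirm that everything is in place. This is a minimal sketch using only base R; the package list simply repeats the names used in the installation commands above.

```r
# Packages named in the installation steps above.
pkgs <- c(
  "earth", "hal9001", "ranger", "arm", "xgboost", "nnls",
  "sl3", "SuperNOVA", "kableExtra", "hrbrthemes", "viridis"
)

# requireNamespace() returns FALSE instead of failing when a package is absent.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are installed.")
}
```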

---

## Example

To illustrate how `SuperNOVA` may be used to ascertain the effect of a mixed exposure, consider the following example:

```{r example, warning=FALSE}
library(SuperNOVA)
library(devtools)
library(kableExtra)
library(sl3)

set.seed(429153)
# simulate simple data
n_obs <- 100000
```

`simulate_data` is a function for generating synthetic data with a complex structure to study the causal effects of shifting values in a mixture of exposures. Its primary purpose is to provide a controlled environment for testing and validating the estimates given by the `SuperNOVA` package.

The `simulate_data` function generates synthetic data for `n_obs` observations with a pre-specified covariance structure (`sigma_mod`) and a shift parameter (`delta`). It simulates four mixture components (M1, M2, M3, M4) and three covariates (W1, W2, W3) with specific relationships between them. The outcome variable Y is generated as a function of these mixtures and covariates.

After generating the data, the function applies a shift (delta) to each mixture component separately and calculates the average treatment effect for each component. Additionally, it calculates the interaction effect of shifting two mixture components simultaneously (m14_intxn). These ground truth effects can be used for validating and comparing the performance of various causal inference methods on this synthetic dataset.

The function returns a list containing the generated data and the ground truth effects of the shifts applied to each mixture component, the interaction effect, and the modified effect results based on a specific level in the W3 covariate.
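To make the idea of ground-truth shift effects concrete, here is a toy sketch. It does not reproduce the data-generating process inside `simulate_data`; the outcome function `mean_y`, its coefficients, and the variable names are hypothetical and chosen only so that the truth is easy to see.

```r
set.seed(2)
n <- 10000
A1 <- rnorm(n)
A2 <- rnorm(n)

# Hypothetical outcome mean with a product term between the two exposures.
mean_y <- function(a1, a2) 1 + 2 * a1 + 0.5 * a2 + 1.5 * a1 * a2

delta <- 1
no_shift <- mean(mean_y(A1, A2))

# Effect of shifting each exposure on its own, relative to no shift.
a1_effect <- mean(mean_y(A1 + delta, A2)) - no_shift
a2_effect <- mean(mean_y(A1, A2 + delta)) - no_shift

# Effect of shifting both exposures at once, relative to no shift.
joint_effect <- mean(mean_y(A1 + delta, A2 + delta)) - no_shift

# Interaction: joint shift effect minus the sum of the individual effects.
intxn <- joint_effect - (a1_effect + a2_effect)

c(a1 = a1_effect, a2 = a2_effect, joint = joint_effect, intxn = intxn)
```

In this toy setup the interaction equals the product-term coefficient times `delta^2`, because every other term cancels when the individual shift effects are subtracted from the joint one.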


```{r simulate data, eval = TRUE}
sim_out <- simulate_data(n_obs = n_obs)
data <- sim_out$data
head(data) %>%
  kbl(caption = "Simulated Data") %>%
  kable_classic(full_width = F, html_font = "Cambria")
```



Given these ground-truth effects, we would expect most of the fold-specific
CIs from `SuperNOVA` to cover the truth, and the pooled estimate to cover it
as well. Let's run `SuperNOVA` to see if it correctly identifies the exposures
that drive the outcome and any interaction/effect modification that exists in
the DGP.

Of note, there are four exposures (M1, M2, M3, M4). M1 and M4 have individual
effects and interact with each other, and W3 modifies the effect of M3. M2 has
no effect on the outcome.


```{r run SuperNOVA, eval = TRUE, message=FALSE, warning=FALSE}
data_sample <- data[sample(nrow(data), 4000), ]

w <- data_sample[, c("W1", "W2", "W3")]
a <- data_sample[, c("M1", "M2", "M3", "M4")]
y <- data_sample$Y

deltas <- list("M1" = 1, "M2" = 1, "M3" = 1, "M4" = 1)

ptm <- proc.time()
sim_results <- SuperNOVA(
  w = w,
  a = a,
  y = y,
  delta = deltas,
  n_folds = 3,
  num_cores = 6,
  outcome_type = "continuous",
  quantile_thresh = 0,
  seed = 294580
)
proc.time() - ptm

basis_in_folds <- sim_results$`Basis Fold Proportions`
indiv_shift_results <- sim_results$`Indiv Shift Results`
em_results <- sim_results$`Effect Mod Results`
joint_shift_results <- sim_results$`Joint Shift Results`
```

Let's first look at the variable relationships used in the folds: 

```{r basis in folds}
basis_in_folds
```

The above list shows that marginal effects for exposures M1 and M4 were found, an interaction for M1 and M4, and effect modification for M3 and W3 - all of which are correct as the outcome is generated from these relationships. There is no effect for M2 which we correctly reject.

Next, let's look at the results for individual stochastic shifts by delta compared to no shift:

```{r individual shift results}
indiv_shift_results$M1 %>%
  kbl(caption = "Individual Stochastic Intervention Results for M1") %>%
  kable_classic(full_width = F, html_font = "Cambria")
```

The true effect for a shifted M1 vs observed M1 is:

```{r m1 truth}
sim_out$m1_effect
```

And so we see that we have proper coverage.
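Checking coverage by eye works for one table, but it can also be done in code. The sketch below uses a made-up results table: the column names (`fold`, `estimate`, `lower`, `upper`), the numbers, and `truth` are all hypothetical and do not reflect the actual structure or values returned by `SuperNOVA`.

```r
# Hypothetical results table; names and numbers are for illustration only.
toy_results <- data.frame(
  fold = c("1", "2", "3", "pooled"),
  estimate = c(1.9, 2.6, 2.0, 2.03),
  lower = c(1.5, 2.2, 1.6, 1.80),
  upper = c(2.3, 3.0, 2.4, 2.26)
)
truth <- 2.1 # hypothetical true effect

# A confidence interval covers the truth when the truth lies between its bounds.
toy_results$covers <- toy_results$lower <= truth & truth <= toy_results$upper

# Share of fold-specific intervals that cover the truth.
fold_rows <- toy_results$fold != "pooled"
fold_coverage <- mean(toy_results$covers[fold_rows])

toy_results
fold_coverage
```

With real output, the same comparison would be applied to whichever columns hold the interval bounds, using the matching truth from `sim_out`.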

Next we can look at effect modifications: 

```{r effect modification results}
em_results$M3W3 %>%
  kbl(caption = "Effect Modification Stochastic Intervention Results for M3 and W3") %>%
  kable_classic(full_width = F, html_font = "Cambria")
```

Now let's look at the truth:

```{r em truth}
sim_out$effect_mod
```

When W3 is 1, the true effect is 11 and our estimate is 11, with CIs that cover the truth. When W3 is 0, the truth is 1 and our estimate is 1.9, with CIs that cover the truth as well. 

Finally, we look at the joint shift results, which compare the effect of shifting both exposures together to the sum of the effects of the individual shifts.

```{r joint results}
joint_shift_results$M1M4 %>%
  kbl(caption = "Interactions Stochastic Intervention Results for M1 and M4") %>%
  kable_classic(full_width = F, html_font = "Cambria")
```

Let's look at the truth again: 

```{r interaction truths}
sim_out$m1_effect
sim_out$m4_effect
sim_out$m14_effect
sim_out$m14_intxn
```

Comparing these truths to the pooled rows of the table above, we can see that our estimates for the marginal shifts, the dual shift, and the difference between the dual shift and the sum of the marginals all have CIs that cover the truth. 

---

## Issues

If you encounter any bugs or have any specific feature requests, 
please [file an
issue](https://github.com/blind-contours/SuperNOVA/issues). Further details
on filing issues are provided in our [contribution
guidelines](https://github.com/blind-contours/SuperNOVA/main/contributing.md).

---

## Contributions

Contributions are very welcome. Interested contributors should consult our
[contribution
guidelines](https://github.com/blind-contours/SuperNOVA/blob/master/CONTRIBUTING.md)
prior to submitting a pull request.

---

## Citation

After using the `SuperNOVA` R package, please cite the following:


---

## Related

* [R/`tmle3shift`](https://github.com/tlverse/tmle3shift) - An R package
  providing an independent implementation of the same core routines for the TML
  estimation procedure and statistical methodology as is made available here,
  through reliance on a unified interface for Targeted Learning provided by the
  [`tmle3`](https://github.com/tlverse/tmle3) engine of the [`tlverse`
  ecosystem](https://github.com/tlverse).

* [R/`medshift`](https://github.com/nhejazi/medshift) - An R package providing
  facilities to estimate the causal effect of stochastic treatment regimes in
  the mediation setting, including classical (IPW) and augmented double robust
  (one-step) estimators. This is an implementation of the methodology explored
  by @diaz2020causal.

* [R/`haldensify`](https://github.com/nhejazi/haldensify) - A minimal package
  for estimating the conditional density treatment mechanism component of this
  parameter based on using the [highly adaptive
  lasso](https://github.com/tlverse/hal9001) [@coyle-hal9001-rpkg;
  @hejazi2020hal9001-joss] in combination with a pooled hazard regression. This
  package implements a variant of the approach advocated by @diaz2011super.

---

## Funding

The development of this software was supported in part through grants from the


---

## License

© 2020-2022 [David B. McCoy](https://davidmccoy.org)

The contents of this repository are distributed under the MIT license. See below
for details:
```
MIT License
Copyright (c) 2020-2022 David B. McCoy
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```

---

## References

Owner

  • Name: David McCoy
  • Login: blind-contours
  • Kind: user

JOSS Publication

SuperNOVA: Semi-Parametric Identification and Estimation of Interaction and Effect Modification in Mixed Exposures using Stochastic Interventions in R
Published
November 05, 2023
Volume 8, Issue 91, Page 5422
Authors
David McCoy ORCID
Department of Biostatistics, University of California Berkeley, Berkeley, CA 94704, U.S.A.
Alejandro Schuler ORCID
Department of Biostatistics, University of California Berkeley, Berkeley, CA 94704, U.S.A.
Alan Hubbard ORCID
Department of Biostatistics, University of California Berkeley, Berkeley, CA 94704, U.S.A.
Mark van der Laan ORCID
Department of Biostatistics, University of California Berkeley, Berkeley, CA 94704, U.S.A.
Editor
Charlotte Soneson ORCID
Tags
causal inference machine learning stochastic interventions efficient estimation targeted learning mixed exposures

GitHub Events

Total
  • Issues event: 1
  • Watch event: 1
Last Year
  • Issues event: 1
  • Watch event: 1

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 167
  • Total Committers: 6
  • Avg Commits per committer: 27.833
  • Development Distribution Score (DDS): 0.15
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
blind-contours d****y@b****u 142
David Brenton McCoy d****y@l****c 10
David Brenton McCoy d****y@l****c 7
David McCoy d****y@P****l 4
David Brenton McCoy d****y@l****c 3
David McCoy d****y@P****n 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 6
  • Total pull requests: 0
  • Average time to close issues: 10 days
  • Average time to close pull requests: N/A
  • Total issue authors: 4
  • Total pull request authors: 0
  • Average comments per issue: 1.5
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • lucasmiranda42 (3)
  • jlhanson5 (1)
  • hasdk (1)
  • rkmccord (1)

Dependencies

.github/workflows/draft-pdf.yml actions
  • actions/checkout v3 composite
  • actions/upload-artifact v1 composite
  • openjournals/openjournals-draft-action master composite
.github/workflows/r.yml actions
  • actions/checkout v2 composite
  • r-lib/actions/setup-pandoc v1 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-tinytex v1 composite
DESCRIPTION cran
  • R >= 2.10 depends
  • MASS * imports
  • Rdpack * imports
  • assertthat * imports
  • cvTools * imports
  • data.table * imports
  • dplyr * imports
  • foreach * imports
  • furrr * imports
  • future * imports
  • ggplot2 * imports
  • haldensify * imports
  • magrittr * imports
  • partykit * imports
  • polspline * imports
  • pracma * imports
  • purrr * imports
  • rlang * imports
  • sl3 * imports
  • stringr * imports
  • kableExtra * suggests
  • knitr * suggests
  • rmarkdown * suggests
  • testthat >= 3.0.0 suggests