Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.6%) to scientific vocabulary
Last synced: 10 months ago

Repository

Statistics
  • Stars: 6
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 1
Created almost 2 years ago · Last pushed 10 months ago
Metadata Files: Readme, License

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = file.path("causalDT", "man", "figures", "README-")
)
```

# Causal Distillation Trees

[Causal Distillation Trees](https://arxiv.org/abs/2502.07275) (CDT) is a novel machine learning method for estimating interpretable subgroups in causal inference. CDT lets researchers fit *any* machine learning model of their choice to estimate individual-level treatment effects, and then uses a simple second-stage tree-based model to "distill" the estimated treatment effects into meaningful subgroups. As a result, CDT inherits the predictive performance of black-box machine learning models while preserving the interpretability of a simple decision tree.

![](causalDT/man/figures/cdt_diagram.png)

Briefly, CDT is a two-stage learner. In the first stage, a teacher model (e.g., a black-box metalearner) is fit to estimate individual-level treatment effects. In the second stage, a student model (e.g., a decision tree) is fit to predict those estimated effects, in effect distilling them into interpretable subgroups. Both stages are fit on the training data. Finally, given the estimated subgroups, the subgroup average treatment effects are honestly estimated on a held-out estimation set.
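To make the two stages concrete, here is a minimal sketch on simulated data. It is not the `causalDT` implementation: the teacher is a deliberately simple T-learner built from spline regressions (a stand-in for a black-box metalearner), the student is an `rpart` tree, and the data-generating process, the variable names (`x1`, `x2`, `tau_hat`, `leaf`), and the tuning values are all assumptions chosen for illustration.

```r
library(rpart)    # student: a shallow regression tree
library(splines)  # spline basis for the stand-in teacher

set.seed(1)

# Simulated randomized trial (illustrative only; not the ACTG 175 data).
# The true treatment effect is 2 when x1 > 0.5 and 0 otherwise.
n <- 2000
dat <- data.frame(x1 = runif(n), x2 = runif(n), Z = rbinom(n, 1, 0.5))
dat$Y <- dat$x2 + dat$Z * 2 * (dat$x1 > 0.5) + rnorm(n, sd = 0.5)

# Split into a training set and a held-out estimation set
train_idx <- sample(n, n / 2)
train <- dat[train_idx, ]
est <- dat[-train_idx, ]

# Stage 1 (teacher): T-learner with one outcome regression per treatment arm
fit1 <- lm(Y ~ ns(x1, df = 5) + ns(x2, df = 5), data = train[train$Z == 1, ])
fit0 <- lm(Y ~ ns(x1, df = 5) + ns(x2, df = 5), data = train[train$Z == 0, ])
train$tau_hat <- predict(fit1, newdata = train) - predict(fit0, newdata = train)

# Stage 2 (student): a shallow tree distills tau_hat into subgroups
student <- rpart(
  tau_hat ~ x1 + x2, data = train,
  control = rpart.control(maxdepth = 2, minbucket = 100)
)

# Honest estimation: assign held-out units to leaves (identified here by the
# leaf's predicted value), then take a difference in means within each leaf
est$leaf <- factor(round(predict(student, newdata = est), 3))
subgroup_ate <- sapply(split(est, est$leaf), function(d) {
  mean(d$Y[d$Z == 1]) - mean(d$Y[d$Z == 0])
})

print(student)
print(subgroup_ate)
```

Because the subgroup effects are computed only from units that played no role in fitting the teacher or the student, the difference in means within each leaf is not biased by the subgroup search itself.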

For more details, check out [Huang, M., Tang, T. M., Kenney, A. M. "Distilling heterogeneous treatment effects: Stable subgroup estimation in causal inference." (2025).](https://arxiv.org/abs/2502.07275)

## Organization

This repository contains: 

1. An R package `causalDT` to run causal distillation trees on your own data (see [causalDT/](causalDT/))
2. All code necessary to reproduce the analysis and figures in [Huang et al. (2025)](https://arxiv.org/abs/2502.07275) (see [causalDT-manuscript/](causalDT-manuscript/) and additional results [here](https://tiffanymtang.github.io/causalDT/simulation_results.html))

## Installation of the R package

You can install the `causalDT` R package via:

``` r
# install.packages("devtools")
devtools::install_github("tiffanymtang/causalDT", subdir = "causalDT")
```

## Example Usage

To illustrate how to use `causalDT`, we will use data from the AIDS Clinical Trials Group Study 175 (ACTG 175), a randomized controlled trial that compared monotherapy with combination therapy in HIV-1-infected patients. The data are available in the `speff2trial` R package.

```{r load-data}
# install.packages("speff2trial")
library(speff2trial)
library(dplyr)

data <- speff2trial::ACTG175 |>
  dplyr::filter(arms %in% c(0, 2))

# pre-treatment covariates data
X <- data |> 
  dplyr::select(
    age, wtkg, hemo, homo, drugs, karnof, race, 
    gender, symptom, preanti, strat, cd80
  ) |> 
  as.matrix()
# treatment indicator variable
Z <- data |>
  dplyr::pull(treat)
# response variable
Y <- data |>
  dplyr::pull(cens)
```

Given the pre-treatment covariates $X$, the treatment indicator $Z$, and the response variable $Y$, we can run CDT as follows:

```{r causalDT, fig.width=12}
library(causalDT)

set.seed(331)
causal_forest_cdt <- causalDT(
  X = X, Y = Y, Z = Z,
  teacher_model = "causal_forest"
)

plot_cdt(causal_forest_cdt)
```

Note that when using CDT, a teacher model must be chosen (the default is a causal forest). To help researchers select an appropriate teacher model, the Jaccard subgroup stability index (SSI) was introduced in [Huang et al. (2025)](https://arxiv.org/abs/2502.07275). Generally, a higher Jaccard SSI indicates a better teacher model. This teacher model selection procedure can be run as follows:

```{r jaccard}
## uncomment to install rlearner, which is needed to run rboost
# remotes::install_github("xnie/rlearner")

# selecting between causal forest versus rboost
rboost_cdt <- causalDT(
  X = as.matrix(X), Y = Y, Z = Z,
  teacher_model = rlearner_teacher(rlearner::rboost)
)
plot_jaccard(`Causal Forest` = causal_forest_cdt, `Rboost` = rboost_cdt)
```
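For intuition about what a Jaccard-type stability measure captures, the sketch below computes a Jaccard similarity between two partitions of the same units, based on which pairs of units share a subgroup. The helper `partition_jaccard()` is hypothetical and is not part of `causalDT`; the exact definition of the Jaccard SSI is given in the paper and may differ from this simplified version.

```r
# Jaccard similarity between two partitions of the same units.
# Hypothetical helper for illustration; not a causalDT function.
partition_jaccard <- function(a, b) {
  stopifnot(length(a) == length(b))
  same_a <- outer(a, a, "==")    # TRUE where two units share a subgroup in a
  same_b <- outer(b, b, "==")    # TRUE where two units share a subgroup in b
  pairs <- upper.tri(same_a)     # count each unordered pair of units once
  both <- sum(same_a[pairs] & same_b[pairs])
  either <- sum(same_a[pairs] | same_b[pairs])
  if (either == 0) {
    return(1)  # every unit is alone in both partitions
  }
  both / either
}

g1 <- c(1, 1, 1, 2, 2, 2)
g2 <- c(1, 1, 2, 2, 2, 2)                # one unit moved to the other subgroup
g3 <- c("a", "a", "a", "b", "b", "b")    # same partition as g1, relabeled

partition_jaccard(g1, g3)  # identical partitions give 1
partition_jaccard(g1, g2)  # 4 shared pairs out of 9, about 0.44
```

Under this reading, a teacher whose distilled subgroups change little across repeated fits (for example, on different subsamples) would score close to 1, which is consistent with preferring the teacher with the higher stability index.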

## Citation

```
@article{huang2025distilling,
  title={Distilling heterogeneous treatment effects: Stable subgroup estimation in causal inference}, 
  author={Melody Huang and Tiffany M. Tang and Ana M. Kenney},
  year={2025},
  eprint={2502.07275},
  archivePrefix={arXiv},
  primaryClass={stat.ME},
  url={https://arxiv.org/abs/2502.07275}, 
}
```

Owner

  • Name: Tiffany Tang
  • Login: tiffanymtang
  • Kind: user
  • Location: Berkeley, CA
  • Company: University of California, Berkeley

PhD student in Statistics

GitHub Events

Total
  • Release event: 1
  • Watch event: 5
  • Delete event: 2
  • Push event: 25
  • Public event: 1
  • Pull request event: 4
  • Fork event: 1
  • Create event: 3
Last Year
  • Release event: 1
  • Watch event: 5
  • Delete event: 2
  • Push event: 25
  • Public event: 1
  • Pull request event: 4
  • Fork event: 1
  • Create event: 3

Issues and Pull Requests

All Time
  • Total issues: 0
  • Total pull requests: 4
  • Average time to close issues: N/A
  • Average time to close pull requests: about 1 month
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 4
  • Average time to close issues: N/A
  • Average time to close pull requests: about 1 month
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Pull Request Authors
  • tiffanymtang (4)

Packages

  • Total packages: 1
  • Total downloads: 13 last month (CRAN)
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 1
  • Total maintainers: 1
cran.r-project.org: causalDT

Causal Distillation Trees

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 13 last month
Rankings
Dependent packages count: 25.6%
Dependent repos count: 31.5%
Average: 47.4%
Downloads: 85.3%
Maintainers (1)

Dependencies

.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v4 composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pkgdown.yaml actions
  • JamesIves/github-pages-deploy-action v4.5.0 composite
  • actions/checkout v4 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
causalDT/DESCRIPTION cran
  • Rcpp * imports
  • dplyr * imports
  • grf * imports
  • partykit * imports
  • purrr * imports
  • rlearner >= 1.1.0 imports
  • rpart * imports
  • stringr * imports
  • tibble * imports
  • tidyselect * imports
  • testthat >= 3.0.0 suggests