causalDT

https://github.com/tiffanymtang/causaldt

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
✓ .zenodo.json file (found .zenodo.json file)
○ DOI references
✓ Academic publication links (links to: arxiv.org)
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity (low similarity, 14.6%, to scientific vocabulary)

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: tiffanymtang
License: mit
Language: HTML
Default Branch: main
Homepage: https://tiffanymtang.github.io/causalDT/
Size: 116 MB

Statistics

Stars: 6
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 1

Created almost 2 years ago · Last pushed 10 months ago

Metadata Files

Readme License

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = file.path("causalDT", "man", "figures", "README-")
)
```

# Causal Distillation Trees

[Causal Distillation Trees](https://arxiv.org/abs/2502.07275) (CDT) is a novel machine learning method for estimating interpretable subgroups in causal inference. CDT allows researchers to fit *any* machine learning model of their choice to estimate the individual-level treatment effect, and then leverages a simple, second-stage tree-based model to "distill" the estimated treatment effect into meaningful subgroups. As a result, CDT inherits the improvements in predictive performance from black-box machine learning models while preserving the interpretability of a simple decision tree. 

![](causalDT/man/figures/cdt_diagram.png)

Briefly, CDT is a two-stage learner. It first fits a teacher model (e.g., a black-box metalearner) to estimate individual-level treatment effects, and then fits a student model (e.g., a decision tree) to predict those estimates, in effect distilling them into interpretable subgroups. Both stages are fit on the training data. Finally, the subgroup average treatment effects are honestly estimated on a held-out estimation set, using the subgroups learned in training.
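
The two stages and the honest estimation step described above can be sketched in a few lines of R. This is only an illustration under stated assumptions, not the `causalDT` implementation: the data are simulated, the teacher is a simple T-learner built from two linear models, the student is a depth-one `rpart` tree, and all object names (`sim`, `train_dat`, `est_dat`, `subgroup_ate`) are hypothetical.

```r
library(rpart)  # recommended package that ships with standard R installs

set.seed(1)
n <- 2000

# Simulated randomized trial (assumption): the true treatment effect is 2
# when x1 > 0.5 and 0 otherwise, so there are two true subgroups.
sim <- data.frame(x1 = runif(n), x2 = runif(n), Z = rbinom(n, 1, 0.5))
sim$Y <- sim$x2 + 2 * (sim$x1 > 0.5) * sim$Z + rnorm(n, sd = 0.5)

# Split into a training set (teacher + student) and a held-out estimation set.
train_idx <- sample(n, n / 2)
train_dat <- sim[train_idx, ]
est_dat <- sim[-train_idx, ]

# Stage 1 (teacher): T-learner with one linear model per treatment arm.
fit1 <- lm(Y ~ x1 + x2, data = subset(train_dat, Z == 1))
fit0 <- lm(Y ~ x1 + x2, data = subset(train_dat, Z == 0))
train_dat$tau_hat <- predict(fit1, train_dat) - predict(fit0, train_dat)

# Stage 2 (student): a shallow tree that predicts the teacher's estimates.
student <- rpart(
  tau_hat ~ x1 + x2, data = train_dat,
  control = rpart.control(maxdepth = 1)
)

# Honest estimation: assign held-out units to the student's leaves, then take
# the difference in mean outcomes between arms within each leaf.
est_dat$leaf <- factor(predict(student, newdata = est_dat))
subgroup_ate <- sapply(split(est_dat, est_dat$leaf), function(d) {
  mean(d$Y[d$Z == 1]) - mean(d$Y[d$Z == 0])
})
print(subgroup_ate)
```

Because the leaves are learned on the training set and the subgroup effects are computed on the estimation set, the reported effects are not biased by the search for subgroups.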

For more details, check out [Huang, M., Tang, T. M., Kenney, A. M. "Distilling heterogeneous treatment effects: Stable subgroup estimation in causal inference." (2025).](https://arxiv.org/abs/2502.07275)

## Organization

This repository contains: 

1. An R package `causalDT` to run causal distillation trees on your own data (see [causalDT/](causalDT/))
2. All code necessary to reproduce the analysis and figures in [Huang et al. (2025)](https://arxiv.org/abs/2502.07275) (see [causalDT-manuscript/](causalDT-manuscript/) and the [additional simulation results](https://tiffanymtang.github.io/causalDT/simulation_results.html))

## Installation of the R package

You can install the `causalDT` R package via:

``` r
# install.packages("devtools")
devtools::install_github("tiffanymtang/causalDT", subdir = "causalDT")
```

## Example Usage

To illustrate how to use `causalDT`, we will use the AIDS Clinical Trials Group Study 175 (ACTG 175), a randomized controlled trial comparing the effectiveness of monotherapy and combination therapy in HIV-1-infected patients. The data are available in the `speff2trial` R package.

```{r load-data}
# install.packages("speff2trial")
library(speff2trial)
library(dplyr)

data <- speff2trial::ACTG175 |>
  dplyr::filter(arms %in% c(0, 2))

# pre-treatment covariates data
X <- data |> 
  dplyr::select(
    age, wtkg, hemo, homo, drugs, karnof, race, 
    gender, symptom, preanti, strat, cd80
  ) |> 
  as.matrix()
# treatment indicator variable
Z <- data |>
  dplyr::pull(treat)
# response variable
Y <- data |>
  dplyr::pull(cens)
```

Given the pre-treatment covariates data $X$, the treatment variable $Z$, and the response variable $Y$, we can run CDT as follows:

```{r causalDT, fig.width=12}
library(causalDT)

set.seed(331)
causal_forest_cdt <- causalDT(
  X = X, Y = Y, Z = Z,
  teacher_model = "causal_forest"
)

plot_cdt(causal_forest_cdt)
```

Note that CDT requires a teacher model (the default is a causal forest). To help researchers select an appropriate teacher model, [Huang et al. (2025)](https://arxiv.org/abs/2502.07275) introduced the Jaccard subgroup stability index (SSI). Generally, a higher Jaccard SSI indicates a better teacher model. This teacher model selection procedure can be run as follows:

```{r jaccard}
## uncomment to install rlearner, which is needed to run rboost
# remotes::install_github("xnie/rlearner")

# selecting between causal forest versus rboost
rboost_cdt <- causalDT(
  X = X, Y = Y, Z = Z,  # X is already a matrix (see the data-loading chunk)
  teacher_model = rlearner_teacher(rlearner::rboost)
)
plot_jaccard(`Causal Forest` = causal_forest_cdt, `Rboost` = rboost_cdt)
```
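
For intuition about what a Jaccard-based stability measure compares, the sketch below computes the plain Jaccard similarity between subgroups from two hypothetical fits. It is not the SSI definition from Huang et al. (2025), which should be taken from the paper; the helper `jaccard()` and the membership lists `fit_a` and `fit_b` are made up for illustration.

```r
# Jaccard similarity between two sets of unit indices: |A n B| / |A u B|.
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical subgroup memberships (unit indices) from two fits of the
# same teacher-student pipeline, e.g., on two resamples of the data.
fit_a <- list(low = 1:50, high = 51:100)
fit_b <- list(low = 1:45, high = 46:100)

# For each subgroup in fit_a, find its most similar subgroup in fit_b.
best_match <- sapply(fit_a, function(a) {
  max(sapply(fit_b, function(b) jaccard(a, b)))
})

# Averaging the best-match similarities gives one crude stability summary;
# values near 1 mean the two fits recovered nearly identical subgroups.
stability <- mean(best_match)
print(best_match)
print(stability)
```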

## Citation

```
@article{huang2025distilling,
  title={Distilling heterogeneous treatment effects: Stable subgroup estimation in causal inference}, 
  author={Melody Huang and Tiffany M. Tang and Ana M. Kenney},
  year={2025},
  eprint={2502.07275},
  archivePrefix={arXiv},
  primaryClass={stat.ME},
  url={https://arxiv.org/abs/2502.07275}, 
}
```

Owner

Name: Tiffany Tang
Login: tiffanymtang
Kind: user
Location: Berkeley, CA
Company: University of California, Berkeley

Website: tiffanymtang.github.io
Repositories: 6
Profile: https://github.com/tiffanymtang

PhD student in Statistics

GitHub Events

Total

Release event: 1
Watch event: 5
Delete event: 2
Push event: 25
Public event: 1
Pull request event: 4
Fork event: 1
Create event: 3

Last Year

Release event: 1
Watch event: 5
Delete event: 2
Push event: 25
Public event: 1
Pull request event: 4
Fork event: 1
Create event: 3

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 0
Total pull requests: 4
Average time to close issues: N/A
Average time to close pull requests: about 1 month
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 4
Average time to close issues: N/A
Average time to close pull requests: about 1 month
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

Pull Request Authors

tiffanymtang (4)

Packages

Total packages: 1
Total downloads: 13 last month (CRAN)

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 1
Total maintainers: 1

cran.r-project.org: causalDT

Causal Distillation Trees

Homepage: https://tiffanymtang.github.io/causalDT/
Documentation: http://cran.r-project.org/web/packages/causalDT/causalDT.pdf
License: MIT + file LICENSE
Latest release: 1.0.0 (published 10 months ago)

Versions: 1
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 13 Last month

Rankings

Dependent packages count: 25.6%

Dependent repos count: 31.5%

Average: 47.4%

Downloads: 85.3%

Maintainers (1)

ttang4@nd.edu