mldid

https://github.com/jhatamyar/mldid

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
✓
Academic publication links
Links to: arxiv.org
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (11.3%) to scientific vocabulary

Last synced: 10 months ago · JSON representation ·

Repository

Basic Info

Host: GitHub
Owner: jhatamyar
Language: R
Default Branch: main
Size: 80.1 KB

Statistics

Stars: 13
Watchers: 2
Forks: 8
Open Issues: 1
Releases: 1

Created over 3 years ago · Last pushed about 2 years ago

Metadata Files

Readme Citation

MLDID

The MLDID package computes average and conditional average treatment effects on the treated (ATT and CATT), allowing for them to be used to study detailed drivers of treatment effect heterogeneity, see:

Hatamyar, J., Kreif, N., Rocha, R., Huber, M. Machine Learning for Staggered Difference-in-Differences and Dynamic Treatment Effect Heterogeneity (2023) Arxiv

Installation

You can install the package using

```

install.packages("devtools")

devtools::install_github("jhatamyar/MLDID") ```

Basics

This code demonstrates the basic functionality of the package using the data on minimum wage and county-level employment from Callaway & Sant'Anna (2021). Note that it is used here for demonstrative purposes only. Vignette for more detailed usage forthcoming.

``` R library(MLDID) library(did) ## to get the data data(mpdta) data <- mpdta

There is only one covariate in the data so generate a noisy extra, as MLDID requires more than one covariate

data$noise <- rnorm(nrow(mpdta)) ```

The Group-Time ATT and CATT estimator

The function MLDID() performs the main work of the package, implementing Algorithm 1 in the paper until the aggregation step. Functionality to estimate the nuisance functions using either rlearner or causal forest has been added, with the option to use the Superlearner to estimate delta, and the tune_penalty parameter can be set to TRUE to implement cross-validated parameter tuning for causal forest and the rlasso.

R att_gt.mp <- MLDID(outcome = 'lemp', group = 'first.treat', time = 'year', id_name = 'countyreal', data, xformla = ~lpop + noise, tune_penalty = F #nu_model = "cf", #sigma_model = "cf", #delta_model = "SL", ) The att_gt.mp object holds the estimated ATT and CATT for each group-time, as well as other information needed in post-processing steps.

Aggregating the estimates to Dynamic (time-to-event) time

The functions dynamic_att() and dynamic_cates() perform the aggregation steps in Equation 20 and 21 of the paper. The dynamic_cates() function can also aggregate the estimated robust scores.

```R

aggregate the ATTs

ATT.dynamic.MP <- dynamicattgt(attgt.mp)

print the dynamic estimates and the DRDID as in Callaway & Sant'Anna (2021) version

ATT.dynamic.MP[["dynamic.att.e"]] ATT.dynamic.MP[["dynamic.att.e.csa"]]

aggregate the CATTs and scores

cates.dynamic.MP <- dynamiccates(attgt.mp, type = "cates") scores.dynamic.MP <- dynamiccates(attgt.mp, type = "scores") ```

Heterogeneity Analysis

The functions BLP_eventtimes(), CLAN_glhtest() and CLAN_ttest() allow for inference on the estimated CATTs and scores. The dynamic CATTs/scores must first be appended to the original data according to event-time using the function het_prep(). The function BLP_summary() provides a readable output of the coefficients for each time period and their standard errors.

```R

prepare the data for analysis. Note depending on the size of your data this may also be slow

het.data.MP <- hetprep(attgt.mp, cates.dynamic.MP)

create a list of variables to test for heterogeneity, must be string

affected.mp <- c("lpop", "noise")

formulate them for the input - note that here you could also just use "lpop + noise" as the argument for the functions instead of "affected_str.mp"

affected_str.mp <- paste(affected.mp, collapse = " + ")

Run the BLP regressions: if error, reduce nperiods.

BLP.bye.MP <- BLPeventtimes(data = het.data.MP, nperiods = 3, rhsformula = affected_str.mp)

summarize the BLP output

BLPsummary(BLP.bye.MP, affected.mp)

you can also check the summary for regression output at each event-time e using BLP.bye.MP[[e+1]][["coefficients"]]:

BLP.bye.MP[[1]][["coefficients"]] ## will be output for e=0 BLP.bye.MP[[2]][["coefficients"]] ## output for e=1

Run the CLANs two ways, first using the glh test as in Chernozhukov et al (2018), then a simple ttest of means of the most/least affected groups

CLAN.glh.cates.MP <- CLANglhtest(het.data.MP, affected=affected.mp) CLAN.ttest.cates.MP <- CLANttest(het.data.MP, affected=affected.mp)

note there is no post-processing function for CLANS as of this update, but will be forthcoming

can check the output by printing the object. Each matrix represents an event time.

CLAN.glh.cates.MP

You can also visualize the BLP coefficients - we recommend interpreting with caution, as ideally the lpop variable should be discretized:

plotBLP(BLP.bye.MP, affected.mp) ```

packagedemo

Troubleshooting

If you experience issues with this package: - Ensure that there are no missing values (NA) - Ensure that you are using at least two covariates in the model formula, as the ML methods may require two dimensions - Check for small number of observations - if very few per group-time, methods may not converge - Ensure that you are using a balanced panel (work is ongoing to allow for unbalanced panels to be used) - Check whether setting t_func = F resolves the issue - Please create a github "issue" if problems persist!

References

Callaway, B. and P. H. Sant’Anna (2021). Difference-in-differences with multiple time periods. In: Journal of Econometrics 225.2, pp. 200–230.

Chernozhukov, V., M. Demirer, E. Duflo, and I. Fernandez-Val (2018). Generic machine learning inference on heterogeneous treatment effects in randomized experiments, with an application to immunization in India. Tech. rep. National Bureau of Economic Research.

Wager, S. and S. Athey (2018). Estimation and inference of heterogeneous treatment effects using random forests. In: Journal of the American Statistical Association 113.523, pp. 1228–1242.

This work was generously funded by the UK Medical Research Council (Grant #: MR/T04487X/1)

Owner

Name: Julia Hatamyar
Login: jhatamyar
Kind: user
Company: University of York

Repositories: 1
Profile: https://github.com/jhatamyar

Research Fellow at University of York, Centre for Health Economics - causal ML

Citation (CITATION.cff)

cff-version: 1.1.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Hatamyar
    given-names: Julia
    orcid: https://https://orcid.org/0000-0003-4145-1265
title: "MLDID: Machine Learning for Staggered Difference-in-Differences"
version: 1.1.0
date-released: 2023-12-10

GitHub Events

Total

Issues event: 1
Watch event: 7
Issue comment event: 2
Fork event: 1

Last Year

Issues event: 1
Watch event: 7
Issue comment event: 2
Fork event: 1

Dependencies

DESCRIPTION cran

DRDID * imports
Matrix * imports
SuperLearner * imports
dplyr * imports
ggplot2 * imports
glmnet * imports
grf * imports
magrittr * imports
matrixStats * imports
stats * imports