csuite

CSuite: A Suite of Benchmark Datasets for Causality

https://github.com/microsoft/csuite

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: microsoft
License: mit
Language: TeX
Default Branch: main
Homepage:
Size: 414 KB

Statistics

Stars: 74
Watchers: 7
Forks: 7
Open Issues: 2
Releases: 1

Created over 3 years ago · Last pushed about 3 years ago

Metadata Files

Readme License Code of conduct Citation Security Support

CSuite: A Suite of Benchmark Datasets for Causality

CSuite is a collection of synthetic datasets for benchmarking causal machine learning algorithms. Each dataset consists of

- the true causal graph, for benchmarking causal discovery;
- 4000 rows of observational training data;
- 2000 rows of observational test data;
- interventional test data, for benchmarking estimation of average treatment effect (ATE) and conditional average treatment effect (CATE), 2000 rows per interventional environment.

The data was generated from known hand-crafted structural equation models (SEMs). Different datasets are intended to test different features of causal discovery and inference algorithms. CSuite was originally introduced in the paper "Deep End-to-end Causal Inference" (Geffner et al., 2022; arXiv:2202.02195). The data generation code for CSuite is publicly available.

Versioning

CSuite datasets are versioned so that we can amend and add datasets, whilst ensuring backwards compatibility with older versions of the data. Full reproducibility with CSuite requires specifying the correct version.

Summary of datasets

The download URLs here are for the latest version.

| Dataset | No. nodes | No. edges | Additive noise model? | Discrete/continuous | ATE benchmarking | CATE benchmarking | Download link |
| :------------ | :------------ | :------------ | :------------ | :------------ | :------------ | :------------ | :------------ |
| lingauss | 2 | 1 | Y | Continuous | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_lingauss.zip |
| linexp | 2 | 1 | Y | Continuous | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_linexp.zip |
| nonlingauss | 2 | 1 | Y | Continuous | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_nonlingauss.zip |
| nonlin_simpson | 4 | 4 | Y | Continuous | Y | Y | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_nonlin_simpson.zip |
| symprod_simpson | 4 | 4 | Y | Continuous | Y | Y | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_symprod_simpson.zip |
| large_backdoor | 9 | 10 | Y | Continuous | Y | Y | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_large_backdoor.zip |
| weak_arrows | 9 | 15 | Y | Continuous | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_weak_arrows.zip |
| cat_to_cts | 2 | 1 | N | Mixed | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_cat_to_cts.zip |
| cts_to_cat | 2 | 1 | N | Mixed | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_cts_to_cat.zip |
| mixed_simpson | 4 | 4 | N | Mixed | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_mixed_simpson.zip |
| large_backdoor_binary_t | 9 | 10 | N | Mixed | Y | Y | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_large_backdoor_binary_t.zip |
| weak_arrows_binary_t | 9 | 15 | N | Mixed | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_weak_arrows_binary_t.zip |
| mixed_confounding | 12 | 15 | N | Mixed | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_mixed_confounding.zip |
| cat_chain | 3 | 2 | N | Discrete | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_cat_chain.zip |
| cat_collider | 3 | 2 | N | Discrete | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_cat_collider.zip |

Data format

Each dataset consists of the following files:

- adj_matrix.csv, which describes the causal graph used to generate the data; a value 1 in row i, column j indicates an edge from node i to node j;
- train.csv, the observational training data;
- test.csv, the observational test data;
- interventions.json, a JSON file containing interventional test data.
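The row/column convention for adj_matrix.csv can be turned into an explicit edge list. The following is a minimal sketch, assuming the file is a header-less, comma-separated square matrix of 0/1 values (the exact layout is an assumption; adjust the parsing if the file carries a header or index column). The helper name `edges_from_adjacency` and the inline example matrix are hypothetical and not part of CSuite.

```python
import io

import numpy as np


def edges_from_adjacency(csv_source):
    """Return (i, j) pairs for every directed edge i -> j in an adjacency matrix CSV."""
    # ndmin=2 keeps even a single-row file two-dimensional.
    adj = np.loadtxt(csv_source, delimiter=",", ndmin=2)
    if adj.shape[0] != adj.shape[1]:
        raise ValueError("adjacency matrix must be square")
    # A nonzero entry in row i, column j denotes an edge from node i to node j.
    rows, cols = np.nonzero(adj)
    return [(int(i), int(j)) for i, j in zip(rows, cols)]


# Hypothetical 3-node chain X0 -> X1 -> X2 (illustrative data, not a CSuite file).
example_csv = io.StringIO("0,1,0\n0,0,1\n0,0,0\n")
example_edges = edges_from_adjacency(example_csv)
```

For a real dataset, pass the path to the downloaded adj_matrix.csv instead of the in-memory example.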

The interventional data JSON consists of pairs of interventional environments, which can be used to estimate (C)ATE. The two environments are the 'primary' and 'reference' environments. Conditional data was generated using HMC. The format of the interventional data is

    {
        "environments": [
            {
                "conditioning_idxs": <optional list containing indices of nodes that were conditioned on>,
                "conditioning_values": <list of values set on the conditioning nodes>,
                "effect_idxs": <list containing indices of nodes to be considered effect variables>,
                "intervention_idxs": <list of indices of nodes that were acted on with do-intervention>,
                "intervention_values": <list of values set on the intervention nodes in the primary do-intervention: for example, receiving a medicine>,
                "intervention_reference": <list of values set on the intervention nodes in the reference do-intervention: for example, not receiving the medicine>,
                "test_data": <array of data from the primary do-intervention, same number of columns as train.csv>,
                "reference_data": <array of data from the reference do-intervention>
            },
            ...
        ],
        "metadata": {
            "columns_to_nodes": <maps columns to their corresponding nodes, only important for vector-valued nodes>
        }
    }
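One way to use a primary/reference pair is to take the difference of sample means of the effect variables, which gives a Monte Carlo estimate of the ATE (or of the CATE when the environment also carries conditioning values). The sketch below assumes that `test_data` and `reference_data` are lists of rows and that every node is scalar, so node indices coincide with column indices; for vector-valued nodes the `columns_to_nodes` metadata would be needed. The function name `ate_from_environment` and the toy environment are hypothetical.

```python
import numpy as np


def ate_from_environment(env):
    """Difference in effect-variable means between the primary and reference samples."""
    # Assumption: scalar nodes, so node indices can be used as column indices.
    effect_idxs = list(env["effect_idxs"])
    primary = np.asarray(env["test_data"], dtype=float)
    reference = np.asarray(env["reference_data"], dtype=float)
    return primary[:, effect_idxs].mean(axis=0) - reference[:, effect_idxs].mean(axis=0)


# Hypothetical two-node environment (illustrative numbers, not CSuite data):
# do(X0 = 1) is the primary intervention, do(X0 = 0) the reference.
toy_env = {
    "effect_idxs": [1],
    "intervention_idxs": [0],
    "intervention_values": [1.0],
    "intervention_reference": [0.0],
    "test_data": [[1.0, 0.6], [1.0, 0.4]],
    "reference_data": [[0.0, 0.1], [0.0, -0.1]],
}
toy_ate = ate_from_environment(toy_env)
```

With a real file, each entry of the "environments" list loaded from interventions.json could be passed to the same function.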

Download

From the terminal

You can download CSuite datasets from any previous version using the following URL pattern

    $ curl -LO https://github.com/microsoft/csuite/releases/download/v<version>/csuite_<name>.zip

where <name> and <version> should be set appropriately. The -L flag tells curl to follow the redirect that GitHub issues for release downloads.
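The same URL pattern can be assembled programmatically. This is a small sketch; the helper name `csuite_release_url` is hypothetical, and the default version "0.1" is taken from the download links in the summary table.

```python
def csuite_release_url(name, version="0.1"):
    """Build the release download URL for a dataset name and version string."""
    base = "https://github.com/microsoft/csuite/releases/download"
    return f"{base}/v{version}/csuite_{name}.zip"


url = csuite_release_url("lingauss")
```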

From Python

The uncompressed files listed under Data format are also directly available from a public storage account. These may either be accessed through their HTTP links, e.g. https://azuastoragepublic.blob.core.windows.net/datasets/csuite_linexp/train.csv or their equivalent Azure blob storage paths. To load these directly in python:

```python
import pandas as pd

# Load over HTTP
df = pd.read_csv("https://azuastoragepublic.blob.core.windows.net/datasets/csuite_linexp/train.csv")

# Load using `adlfs` (`pip install adlfs`)
df = pd.read_csv("az://datasets@azuastoragepublic.blob.core.windows.net/csuite_linexp/train.csv")
```

Citation

If you use CSuite datasets in your work, please cite the following paper, which originally introduced these datasets:

    @article{geffner2022deep,
        title={Deep End-to-end Causal Inference},
        author={Geffner, Tomas and Antoran, Javier and Foster, Adam and Gong, Wenbo and Ma, Chao and Kiciman, Emre and Sharma, Amit and Lamb, Angus and Kukla, Martin and Pawlowski, Nick and Allamanis, Miltiadis and Zhang, Cheng},
        journal={arXiv preprint arXiv:2202.02195},
        year={2022}
    }

Detailed descriptions of datasets

lingauss

Two node graph X0 -> X1

A two node linear Gaussian system. The structural equations are

$$
\begin{align}
X_0 &\sim N(0, 1) \\
X_1 &= \frac{1}{2}X_0 + \frac{\sqrt{3}}{2}Z_1
\end{align}
$$

where $Z_1 \sim N(0,1)$ is independent of $X_0$. The dataset is constructed so that the observational distribution is the same if $X_0$ and $X_1$ are swapped and both nodes have the same marginal variance of 1. This model is not structurally identifiable from observational data.

linexp

nonlin_simpson

Simpson's Paradox using a continuous SEM. The dataset is constructed so that $\textup{Cov}(X_1,X_2)$ has the opposite sign to $\textup{Cov}(X_1,X_2\mid X_0)$. Estimating the treatment effects correctly in this SEM is highly sensitive to accurate causal discovery.

The structural equations are

$$
\begin{align}
X_0 &\sim N(0,1) \\
X_1 &= s(1 - X_0) + \sqrt{\frac{3}{20}} Z_1 \\
X_2 &= \tanh(2X_1) + \frac{3}{2}X_0 - 1 + \tanh(Z_2) \\
X_3 &= 5 \tanh\left(\frac{X_2 - 4}{5}\right) + 3 + \frac{1}{\sqrt{10}} Z_3
\end{align}
$$

where $Z_1,Z_2 \sim N(0,1)$ and $Z_3 \sim \textup{Laplace}(1)$ are mutually independent and independent of $X_0$, and $s(x) = \log(1+\exp(x))$ is the softplus function. Constants were chosen so that each variable has a marginal variance of (approximately) 1.
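The sign flip described for this SEM can be checked by simulating the structural equations above. The following is a sketch, not the CSuite generation code; it assumes that Laplace(1) denotes a zero-location Laplace distribution with scale 1. Because $X_0$ is a root node, holding it fixed at a value and sampling the remaining equations draws from the conditional distribution given that value. The function names and the choice of 1.0 as the fixed value are illustrative.

```python
import numpy as np


def softplus(x):
    """s(x) = log(1 + exp(x)), computed in a numerically stable way."""
    return np.logaddexp(0.0, x)


def sample_nonlin_simpson(n, rng, x0=None):
    """Draw n samples with columns X0..X3; if x0 is given, X0 is held fixed at x0."""
    X0 = rng.normal(size=n) if x0 is None else np.full(n, float(x0))
    Z1 = rng.normal(size=n)
    Z2 = rng.normal(size=n)
    Z3 = rng.laplace(loc=0.0, scale=1.0, size=n)  # assumed parameterisation of Laplace(1)
    X1 = softplus(1.0 - X0) + np.sqrt(3.0 / 20.0) * Z1
    X2 = np.tanh(2.0 * X1) + 1.5 * X0 - 1.0 + np.tanh(Z2)
    X3 = 5.0 * np.tanh((X2 - 4.0) / 5.0) + 3.0 + Z3 / np.sqrt(10.0)
    return np.column_stack([X0, X1, X2, X3])


rng = np.random.default_rng(0)

# Marginal covariance of X1 and X2 under the observational distribution.
obs = sample_nonlin_simpson(200_000, rng)
marginal_cov = np.cov(obs[:, 1], obs[:, 2])[0, 1]

# Covariance of X1 and X2 given X0 = 1.0.
fixed = sample_nonlin_simpson(200_000, rng, x0=1.0)
conditional_cov = np.cov(fixed[:, 1], fixed[:, 2])[0, 1]
```

Under these assumptions the marginal covariance comes out negative while the covariance at fixed $X_0$ is positive, which is the Simpson's Paradox pattern the dataset is built around.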

symprod_simpson

Four node graph X0 -> X1, X0 -> X2, X1 -> X2, X2 -> X3

A dataset exhibiting multi-modality that is suitable for benchmarking CATE estimation. Nonlinear function estimation is important since $\textup{Cov}(X_0,X_2)=\textup{Cov}(X_1,X_2)=0$. The structural equations are

$$
\begin{align}
X_0 &\sim N(0,1) \\
X_1 &= 2\tanh(2X_0) + \frac{1}{\sqrt{10}} Z_1 \\
X_2 &= \frac{1}{2}X_0 X_1 + \frac{1}{\sqrt{2}} Z_2 \\
X_3 &= \tanh\left(\frac{3}{2} X_0\right) + \sqrt{\frac{3}{10}} Z_3
\end{align}
$$

where $Z_1 \sim t_3$, $Z_2 \sim \textup{Laplace}(1)$ and $Z_3 \sim N(0,1)$ are mutually independent and independent of $X_0$. Constants were chosen so that each variable has a marginal variance of (approximately) 1.

large_backdoor

A larger dataset with a pyramidal graph structure. This dataset is constructed so that there are many possible choices of backdoor adjustment set for estimating the treatment effect of $X_7$ on $X_8$. While both minimal and maximal adjustment sets can result in a correct solution, a minimal adjustment set results in a much lower-dimensional adjustment problem and thus in lower-variance solutions.

A complete description of the structural equations can be found in the data generation code for CSuite.
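To illustrate why a backdoor adjustment set matters at all, here is a toy sketch on a hypothetical three-variable linear SEM. It is not the large_backdoor SEM and does not use CSuite data; the variable names, coefficients, and the helper `ols_coefficient` are all assumptions made for illustration. A single confounder `c` affects both the treatment `t` and the outcome `y`, and the true effect of `t` on `y` is set to 2.

```python
import numpy as np


def ols_coefficient(y, regressors):
    """Coefficient on the first regressor from a least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(y))] + list(regressors))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


rng = np.random.default_rng(0)
n = 50_000

# Hypothetical linear SEM: c -> t, c -> y, t -> y with true effect 2.
c = rng.normal(size=n)
t = c + rng.normal(size=n)
y = 2.0 * t + 3.0 * c + rng.normal(size=n)

naive_effect = ols_coefficient(y, [t])        # ignores the backdoor path through c
adjusted_effect = ols_coefficient(y, [t, c])  # adjusts for the set {c}
```

In this toy model the unadjusted regression is biased upwards by the open backdoor path, while adjusting for `c` recovers a value close to 2. In large_backdoor the same idea applies, except that several valid adjustment sets of different sizes exist.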

weak_arrows

Weak arrows graph

A larger dataset that is similar to large_backdoor, but with many additional edges. The causal discovery challenge revolves around finding all arrows, which are scaled to be relatively weak, but which have significant predictive power for $X_8$ in aggregate.

A complete description of the structural equations can be found in the data generation code for CSuite.

cat_to_cts

Two node graph X0 -> X1

| Variable | Discrete/continuous |
| ------------ | ------------ |
| $X_0$ | Discrete on $\{0,1,2\}$ |
| $X_1$ | Continuous |

A two node system with one categorical and one continuous variable. The structural equations are

$$
\begin{align}
X_0 &\sim \text{Cat}\left(\frac{1}{4}, \frac{1}{4}, \frac{1}{2}\right) \\
X_1 &= \frac{1}{2}(X_0-1) + \frac{9}{25}\mathbf{1}_{\{X_0=2\}} + \frac{8}{5}(s(Z_1) - 1)
\end{align}
$$

where $s(x) = \log(1+\exp(x))$ is the softplus function, and $Z_1 \sim N(0,1)$ is independent of $X_0$.

cts_to_cat

mixed_simpson

Simpson's Paradox using a mixed-type SEM. The dataset is constructed so that $\textup{Cov}(X_0,X_1)$ has the opposite sign to $\textup{Cov}(X_0,X_1\mid X_2)$. Estimating the treatment effects correctly in this SEM is highly sensitive to accurate causal discovery.

The structural equations are

$$
\begin{align}
X_2 &\sim \text{Cat}\left(\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6}\right) \\
p(X_0|X_2=x) &= \begin{cases}
\left(\tfrac{1}{12},\tfrac{11}{12} \right) & \text{ if } x < 3 \\
\left(\tfrac{11}{12},\tfrac{1}{12} \right) & \text{ if } x \ge 3
\end{cases} \\
X_1 &= \frac{7}{10}\left(X_0 + X_2 - 4\right) + s\left(\frac{1}{2} Z_1 \right) \\
X_3 &= \frac{10}{3} \tanh\left(\frac{X_1}{3}\right) + \frac{1}{10}(Z_3 -1)
\end{align}
$$

where $Z_1 \sim N(0,1)$ and $Z_3\sim \textup{Exp}(1)$ are independent noise random variables and $s(x)=\log(1+\exp(x))$ is the softplus function.
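The covariance sign flip in this mixed-type SEM can also be checked by simulation. The sketch below is not the CSuite generation code and rests on several stated assumptions: the six categories of the confounder are labelled 0 to 5, each probability pair in the conditional distribution of the binary variable is ordered as (probability of 0, probability of 1), and Exp(1) has rate 1. The function name `sample_mixed_simpson` is hypothetical.

```python
import numpy as np


def sample_mixed_simpson(n, rng):
    """Draw n samples with columns X0..X3 from the structural equations above."""
    X2 = rng.integers(0, 6, size=n)  # uniform over the assumed labels 0..5
    # Assumed ordering of each probability pair: (P(X0 = 0), P(X0 = 1)).
    p_x0_is_1 = np.where(X2 < 3, 11.0 / 12.0, 1.0 / 12.0)
    X0 = (rng.random(n) < p_x0_is_1).astype(int)
    Z1 = rng.normal(size=n)
    Z3 = rng.exponential(scale=1.0, size=n)  # assumed rate-1 exponential
    X1 = 0.7 * (X0 + X2 - 4) + np.logaddexp(0.0, 0.5 * Z1)  # softplus(Z1 / 2)
    X3 = (10.0 / 3.0) * np.tanh(X1 / 3.0) + 0.1 * (Z3 - 1.0)
    return np.column_stack([X0, X1, X2, X3])


rng = np.random.default_rng(0)
data = sample_mixed_simpson(200_000, rng)

# Marginal covariance between X0 and X1.
marginal_cov = np.cov(data[:, 0], data[:, 1])[0, 1]

# Covariance between X0 and X1 within the stratum where X2 equals 0.
stratum = data[data[:, 2] == 0]
conditional_cov = np.cov(stratum[:, 0], stratum[:, 1])[0, 1]
```

Under these assumptions the marginal covariance is negative and the within-stratum covariance is positive, matching the description of the dataset.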

large_backdoor_binary_t

Large backdoor graph

| Variable | Discrete/continuous |
| ------------ | ------------ |
| $X_0$ | Continuous |
| $X_1$ | Continuous |
| $X_2$ | Continuous |
| $X_3$ | Continuous |
| $X_4$ | Continuous |
| $X_5$ | Continuous |
| $X_6$ | Continuous |
| $X_7$ | Discrete on $\{0,1\}$ |
| $X_8$ | Continuous |

An adaptation of large_backdoor with a binary variable $X_7$ which is considered the treatment variable.

A complete description of the structural equations can be found in the data generation code for CSuite.

weak_arrows_binary_t

Weak arrows graph

| Variable | Discrete/continuous |
| ------------ | ------------ |
| $X_0$ | Continuous |
| $X_1$ | Continuous |
| $X_2$ | Continuous |
| $X_3$ | Continuous |
| $X_4$ | Continuous |
| $X_5$ | Continuous |
| $X_6$ | Continuous |
| $X_7$ | Discrete on $\{0,1\}$ |
| $X_8$ | Continuous |

An adaptation of weak_arrows with a binary variable $X_7$ which is considered the treatment variable.

A complete description of the structural equations can be found in the data generation code for CSuite.

mixed_confounding

Mixed confounding graph

| Variable | Discrete/continuous |
| ------------ | ------------ |
| $X_0$ | Discrete on $\{0,1\}$ |
| $X_1$ | Continuous |
| $X_2$ | Continuous |
| $X_3$ | Continuous |
| $X_4$ | Discrete on $\{0,1\}$ |
| $X_5$ | Discrete on $\{0,1,2\}$ |
| $X_6$ | Continuous |
| $X_7$ | Discrete on $\{0,1,2\}$ |
| $X_8$ | Continuous |
| $X_9$ | Continuous |
| $X_{10}$ | Continuous |
| $X_{11}$ | Continuous |

A larger dataset with treatment node $X_0$ and outcome node $X_1$. The remaining variables play different roles: confounders, causes of $X_0$ only, causes of $X_1$ only, variables downstream of $X_0$, variables downstream of $X_1$, and a collider caused by both $X_0$ and $X_1$.

A complete description of the structural equations can be found in the data generation code for CSuite.

cat_chain

Chain graph X0 -> X1 -> X2

| Variable | Discrete/continuous |
| ------------ | ------------ |
| $X_0$ | Discrete on $\{0,1,2\}$ |
| $X_1$ | Discrete on $\{0,1,2\}$ |
| $X_2$ | Discrete on $\{0,1\}$ |

A chain graph with discrete variables. The structural equations are

$$
\begin{align}
X_0 &\sim \text{Cat}\left(\frac{1}{4}, \frac{1}{4}, \frac{1}{2}\right) \\
p(X_1|X_0=x) &= \begin{cases}
\left(\tfrac{3}{4},\tfrac{1}{8},\tfrac{1}{8} \right) & \text{ if } x=0 \\
\left(\tfrac{1}{8},\tfrac{3}{4},\tfrac{1}{8} \right) & \text{ if } x=1 \\
\left(\tfrac{1}{8},\tfrac{1}{8},\tfrac{3}{4} \right) & \text{ if } x=2
\end{cases} \\
p(X_2|X_1=y) &= \begin{cases}
\left(\tfrac{6}{7},\tfrac{1}{7} \right) & \text{ if } y=0 \\
\left(\tfrac{6}{7},\tfrac{1}{7} \right) & \text{ if } y=1 \\
\left(\tfrac{1}{7},\tfrac{6}{7} \right) & \text{ if } y=2.
\end{cases}
\end{align}
$$

cat_collider

Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

Owner

Name: Microsoft
Login: microsoft
Kind: organization
Email: opensource@microsoft.com
Location: Redmond, WA

Website: https://opensource.microsoft.com
Twitter: OpenAtMicrosoft
Repositories: 7,257
Profile: https://github.com/microsoft

Open source projects and samples from Microsoft

Citation (CITATION.cff)

message: "If you use this dataset, please cite the following paper."
authors:
- family-names: "Geffner"
  given-names: "Tomas"
- family-names: "Antoran"
  given-names: "Javier"
- family-names: "Foster"
  given-names: "Adam"
- family-names: "Gong"
  given-names: "Wenbo"
- family-names: "Ma"
  given-names: "Chao"
- family-names: "Kiciman"
  given-names: "Emre"
- family-names: "Sharma"
  given-names: "Amit"
- family-names: "Lamb"
  given-names: "Angus"
- family-names: "Kukla"
  given-names: "Martin"
- family-names: "Pawlowski"
  given-names: "Nick"
- family-names: "Allamanis"
  given-names: "Miltiadis"
- family-names: "Zhang"
  given-names: "Cheng"
title: "Deep End-to-end Causal Inference"
date-released: 2022-02-04
url: "https://arxiv.org/abs/2202.02195"
journal: "arXiv preprint arXiv:2202.02195"
repository-code: "https://github.com/microsoft/csuite"

GitHub Events

Total

Watch event: 15
Fork event: 1

Last Year

Watch event: 15
Fork event: 1

Committers

Last synced: 12 months ago

All Time

Total Commits: 27
Total Committers: 4
Avg Commits per committer: 6.75
Development Distribution Score (DDS): 0.259

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Adam Foster	a****r	20
Microsoft Open Source	m****e	5
microsoft-github-operations[bot]	5****]	1
Agrin Hilmkil	a****h	1

Issues and Pull Requests

Last synced: 12 months ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Science Score: 54.0%
