csuite

CSuite: A Suite of Benchmark Datasets for Causality

https://github.com/microsoft/csuite

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.3%) to scientific vocabulary

Repository

CSuite: A Suite of Benchmark Datasets for Causality

Basic Info
  • Host: GitHub
  • Owner: microsoft
  • License: mit
  • Language: TeX
  • Default Branch: main
  • Homepage:
  • Size: 414 KB
Statistics
  • Stars: 74
  • Watchers: 7
  • Forks: 7
  • Open Issues: 2
  • Releases: 1
Created over 3 years ago · Last pushed almost 3 years ago
Metadata Files
Readme License Code of conduct Citation Security Support

README.md

CSuite: A Suite of Benchmark Datasets for Causality

CSuite is a collection of synthetic datasets for benchmarking causal machine learning algorithms. Each dataset consists of

- the true causal graph, for benchmarking causal discovery;
- 4000 rows of observational training data;
- 2000 rows of observational test data;
- interventional test data, for benchmarking estimation of average treatment effect (ATE) and conditional average treatment effect (CATE), 2000 rows per interventional environment.

The data was generated from known hand-crafted structural equation models (SEMs). Different datasets are intended to test different features of causal discovery and inference algorithms. CSuite was originally introduced in the paper "Deep End-to-end Causal Inference" (Geffner et al., 2022; see Citation). The data generation code for CSuite is publicly available.

Versioning

CSuite datasets are versioned so that we can amend and add datasets, whilst ensuring backwards compatibility with older versions of the data. Full reproducibility with CSuite requires specifying the correct version.

Summary of datasets

The download URLs here are for the latest version.

| Dataset | No. nodes | No. edges | Additive noise model? | Discrete/continuous | ATE benchmarking | CATE benchmarking | Download link |
| :------------ | :------------ | :------------ | :------------ | :------------ | :------------ | :------------ | :------------ |
| lingauss | 2 | 1 | Y | Continuous | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_lingauss.zip |
| linexp | 2 | 1 | Y | Continuous | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_linexp.zip |
| nonlingauss | 2 | 1 | Y | Continuous | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_nonlingauss.zip |
| nonlin_simpson | 4 | 4 | Y | Continuous | Y | Y | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_nonlin_simpson.zip |
| symprod_simpson | 4 | 4 | Y | Continuous | Y | Y | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_symprod_simpson.zip |
| large_backdoor | 9 | 10 | Y | Continuous | Y | Y | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_large_backdoor.zip |
| weak_arrows | 9 | 15 | Y | Continuous | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_weak_arrows.zip |
| cat_to_cts | 2 | 1 | N | Mixed | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_cat_to_cts.zip |
| cts_to_cat | 2 | 1 | N | Mixed | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_cts_to_cat.zip |
| mixed_simpson | 4 | 4 | N | Mixed | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_mixed_simpson.zip |
| large_backdoor_binary_t | 9 | 10 | N | Mixed | Y | Y | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_large_backdoor_binary_t.zip |
| weak_arrows_binary_t | 9 | 15 | N | Mixed | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_weak_arrows_binary_t.zip |
| mixed_confounding | 12 | 15 | N | Mixed | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_mixed_confounding.zip |
| cat_chain | 3 | 2 | N | Discrete | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_cat_chain.zip |
| cat_collider | 3 | 2 | N | Discrete | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_cat_collider.zip |

Data format

Each dataset consists of the following files

- adj_matrix.csv, which describes the causal graph used to generate the data; a value 1 in row i, column j indicates an edge from node i to node j;
- train.csv, the observational training data;
- test.csv, the observational test data;
- interventions.json, a JSON containing interventional test data.
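As a minimal sketch of the edge convention above, the snippet below turns an adjacency matrix into an edge list. The matrix is a hand-written stand-in for a three-node chain, not a real CSuite file, and `edges_from_adjacency` is a hypothetical helper name. Whether the shipped adj_matrix.csv carries a header row is not stated here, so the parsing step assumes a plain, header-free grid of 0s and 1s.

```python
import io

import numpy as np

# Stand-in for the contents of adj_matrix.csv (assumed header-free, comma-separated).
# This is a made-up 3-node chain X0 -> X1 -> X2, not data from CSuite.
ADJ_CSV = "0,1,0\n0,0,1\n0,0,0\n"


def edges_from_adjacency(adj):
    """Return (i, j) pairs for every nonzero entry adj[i, j], i.e. an edge i -> j."""
    rows, cols = np.nonzero(adj)
    return [(int(i), int(j)) for i, j in zip(rows, cols)]


adj = np.loadtxt(io.StringIO(ADJ_CSV), delimiter=",", dtype=int)
edges = edges_from_adjacency(adj)
print(edges)
```

For a real file, the same helper could be applied to the array loaded from adj_matrix.csv once its header convention has been checked.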

The interventional data JSON consists of pairs of interventional environments, which can be used to estimate (C)ATE. The two environments in each pair are the 'primary' and 'reference' environments. Conditional data was generated using Hamiltonian Monte Carlo (HMC). The format of the interventional data is

    {
      "environments": [
        {
          "conditioning_idxs": <optional list containing indices of nodes that were conditioned on>,
          "conditioning_values": <list of values set on the conditioning nodes>,
          "effect_idxs": <list containing indices of nodes to be considered effect variables>,
          "intervention_idxs": <list of indices of nodes that were acted on with do-intervention>,
          "intervention_values": <list of values set on the intervention nodes in the primary do-intervention: for example, receiving a medicine>,
          "intervention_reference": <list of values set on the intervention nodes in the reference do-intervention: for example, not receiving the medicine>,
          "test_data": <array of data from the primary do-intervention, same number of columns as train.csv>,
          "reference_data": <array of data from the reference do-intervention>
        },
        ...
      ],
      "metadata": {
        "columns_to_nodes": <maps columns to their corresponding nodes, only important for vector-valued nodes>
      }
    }
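To illustrate how a primary/reference pair can be turned into an ATE estimate, here is a small sketch on a made-up environment in the format above. The helper `empirical_ate` is hypothetical, the numbers are invented, and the code treats `effect_idxs` as column indices, which is only an assumption that holds when every node occupies a single column (otherwise `columns_to_nodes` would need to be consulted).

```python
import json

import numpy as np

# Tiny invented interventions file (2 nodes, 3 rows per environment).
# Real CSuite files have 2000 rows per interventional environment.
EXAMPLE_JSON = """
{
  "environments": [
    {
      "conditioning_idxs": [],
      "conditioning_values": [],
      "effect_idxs": [1],
      "intervention_idxs": [0],
      "intervention_values": [1.0],
      "intervention_reference": [0.0],
      "test_data": [[1.0, 0.6], [1.0, 0.4], [1.0, 0.5]],
      "reference_data": [[0.0, 0.1], [0.0, -0.1], [0.0, 0.0]]
    }
  ],
  "metadata": {"columns_to_nodes": [0, 1]}
}
"""


def empirical_ate(environment):
    """Mean of the effect columns under the primary intervention minus the reference."""
    test = np.asarray(environment["test_data"], dtype=float)
    reference = np.asarray(environment["reference_data"], dtype=float)
    effect = environment["effect_idxs"]  # assumed to be usable as column indices
    return test[:, effect].mean(axis=0) - reference[:, effect].mean(axis=0)


interventions = json.loads(EXAMPLE_JSON)
ate = empirical_ate(interventions["environments"][0])
print(ate)
```

An environment with a non-empty `conditioning_idxs` would give a CATE estimate by the same difference of means, since its samples are already drawn conditionally.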

Download

From the terminal

You can download CSuite datasets from any previous version using the following URL pattern

    $ curl -LO https://github.com/microsoft/csuite/releases/download/v<version>/csuite_<name>.zip

where <name> and <version> should be set appropriately. The -L flag makes curl follow the redirect that GitHub issues for release assets.
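The same URL pattern can be assembled programmatically. The function below is a hypothetical convenience helper, not part of CSuite, and only builds the string without downloading anything.

```python
def release_url(name, version):
    """Build the release download URL for dataset `name` at release `version`."""
    return (
        "https://github.com/microsoft/csuite/releases/download/"
        f"v{version}/csuite_{name}.zip"
    )


url = release_url("linexp", "0.1")
print(url)
```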

From Python

The uncompressed files listed under Data format are also directly available from a public storage account. They can be accessed either through their HTTP links, e.g. https://azuastoragepublic.blob.core.windows.net/datasets/csuite_linexp/train.csv, or through their equivalent Azure blob storage paths. To load these directly in Python:

```python
import pandas as pd

# Load over HTTP
df = pd.read_csv("https://azuastoragepublic.blob.core.windows.net/datasets/csuite_linexp/train.csv")

# Load using adlfs (pip install adlfs)
df = pd.read_csv("az://datasets@azuastoragepublic.blob.core.windows.net/csuite_linexp/train.csv")
```

Citation

If you use CSuite datasets in your work, please cite the following paper, which originally introduced these datasets:

    @article{geffner2022deep,
      title={Deep End-to-end Causal Inference},
      author={Geffner, Tomas and Antoran, Javier and Foster, Adam and Gong, Wenbo and Ma, Chao and Kiciman, Emre and Sharma, Amit and Lamb, Angus and Kukla, Martin and Pawlowski, Nick and Allamanis, Miltiadis and Zhang, Cheng},
      journal={arXiv preprint arXiv:2202.02195},
      year={2022}
    }

Detailed descriptions of datasets

lingauss

Two Node Graph X0 -> X1

A two-node linear Gaussian system. The structural equations are

$$
\begin{align}
X_0 &\sim N(0, 1) \\
X_1 &= \frac{1}{2}X_0 + \frac{\sqrt{3}}{2}Z_1
\end{align}
$$

where $Z_1 \sim N(0,1)$ is independent of $X_0$. The dataset is constructed so that the observational distribution is the same if $X_0$ and $X_1$ are swapped and both nodes have the same marginal variance of 1. This model is not structurally identifiable from observational data.
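The symmetry claim follows from the coefficients: the variance of the second node is 1/4 + 3/4 = 1 and its covariance with the first is 1/2, so the joint distribution is a zero-mean bivariate Gaussian whose covariance matrix is unchanged when the two nodes are swapped. The following sketch checks this numerically; the sample size and seed are arbitrary choices, not values from CSuite.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # arbitrary sample size for the check

# Sample from the two structural equations of lingauss.
x0 = rng.standard_normal(n)
z1 = rng.standard_normal(n)
x1 = 0.5 * x0 + np.sqrt(3) / 2 * z1

cov = np.cov(x0, x1)  # 2x2 sample covariance matrix
print(cov)
```

The printed matrix should be close to unit variances on the diagonal and 0.5 off the diagonal, which is what makes the two causal directions indistinguishable from observational data alone.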

linexp

nonlin_simpson

Simpson's Paradox using a continuous SEM. The dataset is constructed so that $\textup{Cov}(X_1,X_2)$ has the opposite sign to $\textup{Cov}(X_1,X_2\mid X_0)$. Estimating the treatment effects correctly in this SEM is highly sensitive to accurate causal discovery.

The structural equations are

$$
\begin{align}
X_0 &\sim N(0,1) \\
X_1 &= s(1 - X_0) + \sqrt{\frac{3}{20}} Z_1 \\
X_2 &= \tanh(2X_1) + \frac{3}{2}X_0 - 1 + \tanh(Z_2) \\
X_3 &= 5 \tanh\left(\frac{X_2 - 4}{5}\right) + 3 + \frac{1}{\sqrt{10}} Z_3
\end{align}
$$

where $Z_1,Z_2 \sim N(0,1)$ and $Z_3 \sim \textup{Laplace}(1)$ are mutually independent and independent of $X_0$, and $s(x) = \log(1+\exp(x))$ is the softplus function. Constants were chosen so that each variable has a marginal variance of (approximately) 1.
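The sign reversal can be checked by simulating the first three structural equations. This is only a sketch: the helper names, seed, sample size, and the conditioning value (the root node held at 1) are illustrative choices. Because the root node has no parents, conditioning on it is the same as holding it fixed and resampling the noise.

```python
import numpy as np


def softplus(x):
    return np.logaddexp(0.0, x)  # log(1 + exp(x)), numerically stable


def sample_x1_x2(x0, rng):
    """Sample (X1, X2) from the structural equations for the given X0 values."""
    z1 = rng.standard_normal(x0.shape)
    z2 = rng.standard_normal(x0.shape)
    x1 = softplus(1.0 - x0) + np.sqrt(3 / 20) * z1
    x2 = np.tanh(2 * x1) + 1.5 * x0 - 1.0 + np.tanh(z2)
    return x1, x2


rng = np.random.default_rng(0)
n = 200_000

# Marginal covariance: X0 is drawn from its N(0, 1) distribution.
x1, x2 = sample_x1_x2(rng.standard_normal(n), rng)
marginal_cov = np.cov(x1, x2)[0, 1]

# Conditional covariance at one illustrative value, X0 = 1.
x1_c, x2_c = sample_x1_x2(np.full(n, 1.0), rng)
conditional_cov = np.cov(x1_c, x2_c)[0, 1]

print(marginal_cov, conditional_cov)
```

The marginal covariance comes out negative because both variables depend on the root node in opposite directions, while the conditional covariance is positive because the direct effect through the tanh term is increasing.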

symprod_simpson

Four Node Graph X0 -> X1, X0 -> X2, X1 -> X2, X2 -> X3

A dataset exhibiting multi-modality that is suitable for benchmarking CATE estimation. Nonlinear function estimation is important since $\textup{Cov}(X_0,X_2)=\textup{Cov}(X_1,X_2)=0$.

The structural equations are

$$
\begin{align}
X_0 &\sim N(0,1) \\
X_1 &= 2\tanh(2X_0) + \frac{1}{\sqrt{10}} Z_1 \\
X_2 &= \frac{1}{2}X_0 X_1 + \frac{1}{\sqrt{2}} Z_2 \\
X_3 &= \tanh\left(\frac{3}{2} X_0\right) + \sqrt{\frac{3}{10}} Z_3
\end{align}
$$

where $Z_1 \sim t_3$, $Z_2 \sim \textup{Laplace}(1)$ and $Z_3 \sim N(0,1)$ are mutually independent and independent of $X_0$. Constants were chosen so that each variable has a marginal variance of (approximately) 1.
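A short simulation makes the zero-covariance property concrete: the third node is uncorrelated with each of its parents, yet strongly related to their product. This is a sketch under assumptions: Laplace(1) is read as a Laplace distribution with unit scale, and the seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

x0 = rng.standard_normal(n)
z1 = rng.standard_t(3, size=n)       # Student-t with 3 degrees of freedom
z2 = rng.laplace(0.0, 1.0, size=n)   # Laplace with unit scale (assumed meaning of Laplace(1))
x1 = 2 * np.tanh(2 * x0) + z1 / np.sqrt(10)
x2 = 0.5 * x0 * x1 + z2 / np.sqrt(2)

cov_02 = np.cov(x0, x2)[0, 1]
cov_12 = np.cov(x1, x2)[0, 1]
# X2 is uncorrelated with each parent but not with their product.
corr_product = np.corrcoef(x0 * x1, x2)[0, 1]
print(cov_02, cov_12, corr_product)
```

Both covariances should be near zero, so a linear model of the third node on either parent would find no signal, whereas the correlation with the product is clearly positive.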

large_backdoor

A larger dataset with a pyramidal graph structure. This dataset is constructed so that there are many possible choices of backdoor adjustment set for estimating the treatment effect of $X_7$ on $X_8$. While both minimal and maximal adjustment sets can result in a correct solution, a minimal adjustment set results in a much lower-dimensional adjustment problem and thus in lower-variance solutions.

A complete description of the structural equations can be found in the data generation code for CSuite.

weak_arrows

Weak arrows graph

A larger dataset that is similar to large_backdoor, but with many additional edges. The causal discovery challenge revolves around finding all arrows, which are scaled to be relatively weak, but which have significant predictive power for $X_8$ in aggregate.

A complete description of the structural equations can be found in the data generation code for CSuite.

cat_to_cts

Two Node Graph X0 -> X1

| Variable | Discrete/continuous |
| ------------ | ------------ |
| $X_0$ | Discrete on $\{0,1,2\}$ |
| $X_1$ | Continuous |

A two-node system with one categorical and one continuous variable. The structural equations are

$$
\begin{align}
X_0 &\sim \text{Cat}\left(\frac{1}{4}, \frac{1}{4}, \frac{1}{2}\right) \\
X_1 &= \frac{1}{2}(X_0-1) + \frac{9}{25}\mathbf{1}_{\{X_0=2\}} + \frac{8}{5}(s(Z_1) - 1)
\end{align}
$$

where $s(x) = \log(1+\exp(x))$ is the softplus function, and $Z_1 \sim N(0,1)$ is independent of $X_0$.
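Because the noise term is the same under every intervention on the categorical node, it cancels in a difference of means, so the ATE can be read off the equation. As a worked example, comparing the intervention values 2 and 0 (chosen here for illustration, not necessarily the pair stored in interventions.json) gives (1/2 + 9/25) - (-1/2) = 34/25 = 1.36. The sketch below checks this against a Monte Carlo estimate; `sample_x1_do_x0` is a hypothetical helper.

```python
import numpy as np


def softplus(x):
    return np.logaddexp(0.0, x)  # log(1 + exp(x))


def sample_x1_do_x0(x0_value, n, rng):
    """Sample X1 under do(X0 = x0_value) using the structural equation above."""
    z1 = rng.standard_normal(n)
    indicator = 1.0 if x0_value == 2 else 0.0
    return 0.5 * (x0_value - 1) + (9 / 25) * indicator + (8 / 5) * (softplus(z1) - 1)


rng = np.random.default_rng(0)
n = 200_000

# Monte Carlo ATE of do(X0 = 2) versus do(X0 = 0).
ate_mc = sample_x1_do_x0(2, n, rng).mean() - sample_x1_do_x0(0, n, rng).mean()
# Closed form: the noise term has the same mean in both arms and cancels.
ate_exact = (0.5 + 9 / 25) - (-0.5)
print(ate_mc, ate_exact)
```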

cts_to_cat

mixed_simpson

Simpson's Paradox using a mixed-type SEM. The dataset is constructed so that $\textup{Cov}(X_0,X_1)$ has the opposite sign to $\textup{Cov}(X_0,X_1\mid X_2)$. Estimating the treatment effects correctly in this SEM is highly sensitive to accurate causal discovery.

The structural equations are

$$
\begin{align}
X_2 &\sim \text{Cat}\left(\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6}\right) \\
p(X_0|X_2=x) &= \begin{cases}
    \left(\tfrac{1}{12},\tfrac{11}{12} \right) & \text{ if } x < 3 \\
    \left(\tfrac{11}{12},\tfrac{1}{12} \right) & \text{ if } x \ge 3
    \end{cases} \\
X_1 &= \frac{7}{10}\left(X_0 + X_2 - 4\right) + s\left(\frac{1}{2} Z_1 \right) \\
X_3 &= \frac{10}{3} \tanh\left(\frac{X_1}{3}\right) + \frac{1}{10}(Z_3 - 1)
\end{align}
$$

where $Z_1 \sim N(0,1)$ and $Z_3 \sim \textup{Exp}(1)$ are independent noise random variables and $s(x)=\log(1+\exp(x))$.
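The reversal can be reproduced by sampling the first three equations. The sketch assumes the six categories are labelled 0 to 5 and that each probability pair lists the probability of 0 first and of 1 second; neither convention is spelled out above. Under those assumptions the marginal covariance works out to about -0.26 and the covariance within each category to about 0.05.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

x2 = rng.integers(0, 6, size=n)                 # uniform on {0, ..., 5} (assumed labels)
p_x0_is_1 = np.where(x2 < 3, 11 / 12, 1 / 12)   # second tuple entry, read as P(X0 = 1)
x0 = (rng.random(n) < p_x0_is_1).astype(float)
z1 = rng.standard_normal(n)
x1 = 0.7 * (x0 + x2 - 4) + np.logaddexp(0.0, 0.5 * z1)  # softplus(z1 / 2)

marginal_cov = np.cov(x0, x1)[0, 1]
# Covariance of X0 and X1 within each value of the confounder X2.
conditional_covs = [np.cov(x0[x2 == k], x1[x2 == k])[0, 1] for k in range(6)]
print(marginal_cov, conditional_covs)
```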

large_backdoor_binary_t

Large backdoor graph

| Variable | Discrete/continuous |
| ------------ | ------------ |
| $X_0$ | Continuous |
| $X_1$ | Continuous |
| $X_2$ | Continuous |
| $X_3$ | Continuous |
| $X_4$ | Continuous |
| $X_5$ | Continuous |
| $X_6$ | Continuous |
| $X_7$ | Discrete on $\{0,1\}$ |
| $X_8$ | Continuous |

An adaptation of large_backdoor with a binary variable $X_7$ which is considered the treatment variable.

A complete description of the structural equations can be found in the data generation code for CSuite.

weak_arrows_binary_t

Weak arrows graph

| Variable | Discrete/continuous |
| ------------ | ------------ |
| $X_0$ | Continuous |
| $X_1$ | Continuous |
| $X_2$ | Continuous |
| $X_3$ | Continuous |
| $X_4$ | Continuous |
| $X_5$ | Continuous |
| $X_6$ | Continuous |
| $X_7$ | Discrete on $\{0,1\}$ |
| $X_8$ | Continuous |

An adaptation of weak_arrows with a binary variable $X_7$ which is considered the treatment variable.

A complete description of the structural equations can be found in the data generation code for CSuite.

mixed_confounding

Mixed confounding graph

| Variable | Discrete/continuous |
| ------------ | ------------ |
| $X_0$ | Discrete on $\{0,1\}$ |
| $X_1$ | Continuous |
| $X_2$ | Continuous |
| $X_3$ | Continuous |
| $X_4$ | Discrete on $\{0,1\}$ |
| $X_5$ | Discrete on $\{0,1,2\}$ |
| $X_6$ | Continuous |
| $X_7$ | Discrete on $\{0,1,2\}$ |
| $X_8$ | Continuous |
| $X_9$ | Continuous |
| $X_{10}$ | Continuous |
| $X_{11}$ | Continuous |

A larger dataset with treatment node $X_0$ and outcome node $X_1$. The other variables cover several roles: confounders, causes of $X_0$ only, causes of $X_1$ only, variables downstream of $X_0$, variables downstream of $X_1$, and a collider caused by $X_0$ and $X_1$.

A complete description of the structural equations can be found in the data generation code for CSuite.

cat_chain

Chain graph X0->X1->X2

| Variable | Discrete/continuous |
| ------------ | ------------ |
| $X_0$ | Discrete on $\{0,1,2\}$ |
| $X_1$ | Discrete on $\{0,1,2\}$ |
| $X_2$ | Discrete on $\{0,1\}$ |

A chain graph with discrete variables. The structural equations are

$$
\begin{align}
X_0 &\sim \text{Cat}\left(\frac{1}{4}, \frac{1}{4}, \frac{1}{2}\right) \\
p(X_1|X_0=x) &= \begin{cases}
    \left(\tfrac{3}{4},\tfrac{1}{8},\tfrac{1}{8} \right) & \text{ if } x=0 \\
    \left(\tfrac{1}{8},\tfrac{3}{4},\tfrac{1}{8} \right) & \text{ if } x=1 \\
    \left(\tfrac{1}{8},\tfrac{1}{8},\tfrac{3}{4} \right) & \text{ if } x=2
    \end{cases} \\
p(X_2|X_1=y) &= \begin{cases}
    \left(\tfrac{6}{7},\tfrac{1}{7} \right) & \text{ if } y=0 \\
    \left(\tfrac{6}{7},\tfrac{1}{7} \right) & \text{ if } y=1 \\
    \left(\tfrac{1}{7},\tfrac{6}{7} \right) & \text{ if } y=2.
    \end{cases}
\end{align}
$$
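Since every variable in the chain is categorical, the marginal distributions follow from the conditional tables by matrix multiplication. The sketch below writes each conditional as a row-stochastic matrix (the variable names are illustrative) and propagates the root distribution down the chain.

```python
import numpy as np

p_x0 = np.array([1 / 4, 1 / 4, 1 / 2])

# Row x holds p(X1 | X0 = x).
p_x1_given_x0 = np.array([
    [3 / 4, 1 / 8, 1 / 8],
    [1 / 8, 3 / 4, 1 / 8],
    [1 / 8, 1 / 8, 3 / 4],
])

# Row y holds p(X2 | X1 = y).
p_x2_given_x1 = np.array([
    [6 / 7, 1 / 7],
    [6 / 7, 1 / 7],
    [1 / 7, 6 / 7],
])

p_x1 = p_x0 @ p_x1_given_x0   # marginal distribution of X1
p_x2 = p_x1 @ p_x2_given_x1   # marginal distribution of X2
print(p_x1, p_x2)
```

Working the products by hand gives (9/32, 9/32, 7/16) for the middle node and (61/112, 51/112) for the last node. An interventional distribution on the last node under a do-intervention on the root is obtained the same way by replacing `p_x0` with a one-hot vector.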

cat_collider

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Owner

  • Name: Microsoft
  • Login: microsoft
  • Kind: organization
  • Email: opensource@microsoft.com
  • Location: Redmond, WA

Open source projects and samples from Microsoft

Citation (CITATION.cff)

message: "If you use this dataset, please cite the following paper."
authors:
- family-names: "Geffner"
  given-names: "Tomas"
- family-names: "Antoran"
  given-names: "Javier"
- family-names: "Foster"
  given-names: "Adam"
- family-names: "Gong"
  given-names: "Wenbo"
- family-names: "Ma"
  given-names: "Chao"
- family-names: "Kiciman"
  given-names: "Emre"
- family-names: "Sharma"
  given-names: "Amit"
- family-names: "Lamb"
  given-names: "Angus"
- family-names: "Kukla"
  given-names: "Martin"
- family-names: "Pawlowski"
  given-names: "Nick"
- family-names: "Allamanis"
  given-names: "Miltiadis"
- family-names: "Zhang"
  given-names: "Cheng"
title: "Deep End-to-end Causal Inference"
date-released: 2022-02-04
url: "https://arxiv.org/abs/2202.02195"
journal: "arXiv preprint arXiv:2202.02195"
repository-code: "https://github.com/microsoft/csuite"

GitHub Events

Total
  • Watch event: 15
  • Fork event: 1
Last Year
  • Watch event: 15
  • Fork event: 1

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 27
  • Total Committers: 4
  • Avg Commits per committer: 6.75
  • Development Distribution Score (DDS): 0.259
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Adam Foster a****r 20
Microsoft Open Source m****e 5
microsoft-github-operations[bot] 5****] 1
Agrin Hilmkil a****h 1
