Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
✓Academic publication links
Links to: arxiv.org -
○Committers with academic emails
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (7.3%) to scientific vocabulary
Repository
CSuite: A Suite of Benchmark Datasets for Causality
Basic Info
Statistics
- Stars: 74
- Watchers: 7
- Forks: 7
- Open Issues: 2
- Releases: 1
Metadata Files
README.md
CSuite: A Suite of Benchmark Datasets for Causality
CSuite is a collection of synthetic datasets for benchmarking causal machine learning algorithms. Each dataset consists of
- the true causal graph, for benchmarking causal discovery;
- 4000 rows of observational training data;
- 2000 rows of observational test data;
- interventional test data, for benchmarking estimation of average treatment effect (ATE) and conditional average treatment effect (CATE), 2000 rows per interventional environment.
The data was generated from known hand-crafted structural equation models (SEMs). Different datasets are intended to test different features of causal discovery and inference algorithms. CSuite was originally introduced in this paper. The data generation code for CSuite is publicly available.
Versioning
CSuite datasets are versioned so that we can amend and add datasets, whilst ensuring backwards compatibility with older versions of the data. Full reproducibility with CSuite requires specifying the correct version.
Summary of datasets
The download URLs here are for the latest version.
| Dataset | No. nodes | No. edges | Additive noise model? | Discrete/continuous | ATE benchmarking | CATE benchmarking | Download link |
| :------------ | :------------ | :------------ | :------------ | :------------ | :------------ | :------------ | :------------ |
| lingauss | 2 | 1 | Y | Continuous | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_lingauss.zip |
| linexp | 2 | 1 | Y | Continuous | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_linexp.zip |
| nonlingauss | 2 | 1 | Y | Continuous | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_nonlingauss.zip |
| nonlin_simpson | 4 | 4 | Y | Continuous | Y | Y | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_nonlin_simpson.zip |
| symprod_simpson | 4 | 4 | Y | Continuous | Y | Y | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_symprod_simpson.zip |
| large_backdoor | 9 | 10 | Y | Continuous | Y | Y | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_large_backdoor.zip |
| weak_arrows | 9 | 15 | Y | Continuous | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_weak_arrows.zip |
| cat_to_cts | 2 | 1 | N | Mixed | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_cat_to_cts.zip |
| cts_to_cat | 2 | 1 | N | Mixed | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_cts_to_cat.zip |
| mixed_simpson | 4 | 4 | N | Mixed | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_mixed_simpson.zip |
| large_backdoor_binary_t | 9 | 10 | N | Mixed | Y | Y | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_large_backdoor_binary_t.zip |
| weak_arrows_binary_t | 9 | 15 | N | Mixed | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_weak_arrows_binary_t.zip |
| mixed_confounding | 12 | 15 | N | Mixed | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_mixed_confounding.zip |
| cat_chain | 3 | 2 | N | Discrete | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_cat_chain.zip |
| cat_collider | 3 | 2 | N | Discrete | Y | N | https://github.com/microsoft/csuite/releases/download/v0.1/csuite_cat_collider.zip |
Data format
Each dataset consists of the following files
- adj_matrix.csv, which describes the causal graph used to generate the data; a value 1 in row i, column j indicates an edge from node i to node j;
- train.csv, the observational training data;
- test.csv, the observational test data;
- interventions.json, a JSON containing interventional test data.
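As a minimal sketch of the adjacency convention above (row i, column j marks an edge from node i to node j), the following turns a matrix into an edge list. The 3-node matrix is a made-up stand-in for adj_matrix.csv, and the header-free CSV layout is an assumption; the real files may need different parsing options.

```python
import io

import numpy as np

# Hypothetical 3-node chain 0 -> 1 -> 2, standing in for adj_matrix.csv.
# Assumption: the file is a plain comma-separated 0/1 matrix with no header row.
ADJ_CSV = "0,1,0\n0,0,1\n0,0,0\n"


def edges_from_adjacency(adj: np.ndarray) -> list[tuple[int, int]]:
    """Return (i, j) for every nonzero entry adj[i, j], i.e. an edge i -> j."""
    rows, cols = np.nonzero(adj)
    return [(int(i), int(j)) for i, j in zip(rows, cols)]


adj = np.loadtxt(io.StringIO(ADJ_CSV), delimiter=",")
edges = edges_from_adjacency(adj)
print(edges)
```

The helper name `edges_from_adjacency` is illustrative and is not part of CSuite.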
The interventional data JSON consists of pairs of interventional environments, which can be used to estimate (C)ATE. The two environments in each pair are the 'primary' and 'reference' environments. Conditional data was generated using Hamiltonian Monte Carlo (HMC). The format of the interventional data is
{
  "environments": [
    {
      "conditioning_idxs": <optional list containing indices of nodes that were conditioned on>,
      "conditioning_values": <list of values set on the conditioning nodes>,
      "effect_idxs": <list containing indices of nodes to be considered effect variables>,
      "intervention_idxs": <list of indices of nodes that were acted on with do-intervention>,
      "intervention_values": <list of values set on the intervention nodes in the primary do-intervention: for example, receiving a medicine>,
      "intervention_reference": <list of values set on the intervention nodes in the reference do-intervention: for example, not receiving the medicine>,
      "test_data": <array of data from the primary do-intervention, same number of columns as train.csv>,
      "reference_data": <array of data from the reference do-intervention>
    },
    ...
  ],
  "metadata": {
    "columns_to_nodes": <maps columns to their corresponding nodes, only important for vector-valued nodes>
  }
}
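One way to turn a primary/reference pair into an ATE estimate is to take the difference of the mean outcomes on the effect variables. The sketch below does that on a tiny made-up payload that follows the schema above. It assumes every node occupies a single column, so node indices can be used directly as column indices; with vector-valued nodes the `columns_to_nodes` mapping would have to be applied first. The shape given to `columns_to_nodes` here is a guess.

```python
import json

import numpy as np

# Tiny hypothetical payload following the schema above; all numbers are made up.
INTERVENTIONS_JSON = """
{
  "environments": [
    {
      "conditioning_idxs": [],
      "conditioning_values": [],
      "effect_idxs": [1],
      "intervention_idxs": [0],
      "intervention_values": [1.0],
      "intervention_reference": [0.0],
      "test_data": [[1.0, 3.0], [1.0, 5.0]],
      "reference_data": [[0.0, 1.0], [0.0, 2.0]]
    }
  ],
  "metadata": {"columns_to_nodes": [0, 1]}
}
"""


def ate_from_environment(env: dict) -> np.ndarray:
    """Difference of mean outcomes between primary and reference interventions."""
    test = np.asarray(env["test_data"], dtype=float)
    reference = np.asarray(env["reference_data"], dtype=float)
    # Assumption: one column per node, so effect_idxs index columns directly.
    cols = env["effect_idxs"]
    return test[:, cols].mean(axis=0) - reference[:, cols].mean(axis=0)


payload = json.loads(INTERVENTIONS_JSON)
ate = ate_from_environment(payload["environments"][0])
print(ate)
```

For environments with a non-empty conditioning set, the same difference of means would estimate the CATE at the given conditioning values.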
Download
From the terminal
You can download CSuite datasets from any previous version using the following URL pattern
$ curl -LO https://github.com/microsoft/csuite/releases/download/v<version>/csuite_<name>.zip
where <name> and <version> should be set appropriately.
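For scripted downloads, the URL pattern can be wrapped in a small helper. This is only a sketch: `csuite_zip_url` is a hypothetical name, and the function simply fills in the pattern shown above without checking that the release or dataset exists.

```python
def csuite_zip_url(name: str, version: str) -> str:
    """Fill in the release URL pattern for a dataset name and version string."""
    return (
        "https://github.com/microsoft/csuite/releases/download/"
        f"v{version}/csuite_{name}.zip"
    )


url = csuite_zip_url("linexp", "0.1")
print(url)
```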
From Python
The uncompressed files listed under Data format are also directly available from a public storage account. They can be accessed either through their HTTP links, e.g. https://azuastoragepublic.blob.core.windows.net/datasets/csuite_linexp/train.csv, or through their equivalent Azure blob storage paths. To load them directly in Python:
```python
import pandas as pd

# Load over HTTP
df = pd.read_csv("https://azuastoragepublic.blob.core.windows.net/datasets/csuite_linexp/train.csv")

# Load using adlfs (pip install adlfs)
df = pd.read_csv("az://datasets@azuastoragepublic.blob.core.windows.net/csuite_linexp/train.csv")
```
Citation
If you use CSuite datasets in your work, please cite the following paper which originally introduced these datasets
@article{geffner2022deep,
title={Deep End-to-end Causal Inference},
author={Geffner, Tomas and Antoran, Javier and Foster, Adam and Gong, Wenbo and Ma, Chao and Kiciman, Emre and Sharma, Amit and Lamb, Angus and Kukla, Martin and Pawlowski, Nick and Allamanis, Miltiadis and Zhang, Cheng},
journal={arXiv preprint arXiv:2202.02195},
year={2022}
}
Detailed descriptions of datasets
lingauss
nonlin_simpson
Simpson's Paradox using a continuous SEM. The dataset is constructed so that $\textup{Cov}(X_1,X_2)$ has the opposite sign to $\textup{Cov}(X_1,X_2\mid X_0)$. Estimating the treatment effects correctly in this SEM is highly sensitive to accurate causal discovery.
The structural equations are
$$ \begin{align} X_0 &\sim N(0,1) \\ X_1 &= s(1 - X_0) + \sqrt{\frac{3}{20}} Z_1 \\ X_2 &= \tanh(2X_1) + \frac{3}{2}X_0 - 1 + \tanh(Z_2) \\ X_3 &= 5 \tanh\left(\frac{X_2 - 4}{5}\right) + 3 + \frac{1}{\sqrt{10}} Z_3 \end{align} $$
where $Z_1,Z_2 \sim N(0,1)$ and $Z_3 \sim \textup{Laplace}(1)$ are mutually independent and independent of $X_0$, and $s(x) = \log(1+\exp(x))$ is the softplus function. Constants were chosen so that each variable has a marginal variance of (approximately) 1.
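The structural equations above can be simulated directly with NumPy. The following is a rough sketch, not the official data generation code: it assumes that Laplace(1) means location 0 and scale 1, and the function name and seed are illustrative. It prints the empirical marginal variances and the marginal covariance between the second and third variables.

```python
import numpy as np


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)


def sample_sem(n: int, seed: int = 0) -> np.ndarray:
    """Draw n samples of (X0, X1, X2, X3) from the equations above."""
    rng = np.random.default_rng(seed)
    x0 = rng.standard_normal(n)
    z1 = rng.standard_normal(n)
    z2 = rng.standard_normal(n)
    # Assumption: Laplace(1) is a Laplace distribution with location 0, scale 1.
    z3 = rng.laplace(loc=0.0, scale=1.0, size=n)
    x1 = softplus(1.0 - x0) + np.sqrt(3.0 / 20.0) * z1
    x2 = np.tanh(2.0 * x1) + 1.5 * x0 - 1.0 + np.tanh(z2)
    x3 = 5.0 * np.tanh((x2 - 4.0) / 5.0) + 3.0 + z3 / np.sqrt(10.0)
    return np.column_stack([x0, x1, x2, x3])


samples = sample_sem(10_000)
marginal_cov = np.cov(samples[:, 1], samples[:, 2])[0, 1]
print("marginal variances:", samples.var(axis=0))
print("Cov(X1, X2):", marginal_cov)
```

The marginal covariance comes out negative because the second variable decreases with the first one while the third increases with it. Conditional on the first variable, the only link between the second and third is the increasing $\tanh$ term.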
symprod_simpson
large_backdoor
A larger dataset with a pyramidal graph structure. This dataset is constructed so that there are many possible choices of backdoor adjustment set for estimating the treatment effect of $X_7$ on $X_8$. While both minimal and maximal adjustment sets can result in a correct solution, a minimal adjustment set results in a much lower-dimensional adjustment problem and thus in lower-variance solutions.
A complete description of the structural equations can be found in the data generation code for CSuite.
weak_arrows
A larger dataset that is similar to large_backdoor, but with many additional edges. The causal discovery challenge revolves
around finding all arrows, which are scaled to be relatively weak, but which have significant predictive power for $X_8$ in aggregate.
A complete description of the structural equations can be found in the data generation code for CSuite.
cat_to_cts
mixed_simpson
Simpson's Paradox using a mixed-type SEM. The dataset is constructed so that $\textup{Cov}(X_0,X_1)$ has the opposite sign to $\textup{Cov}(X_0,X_1\mid X_2)$. Estimating the treatment effects correctly in this SEM is highly sensitive to accurate causal discovery.
The structural equations are
$$ \begin{align} X_2 &\sim \text{Cat}\left(\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6},\frac{1}{6}\right) \\ p(X_0|X_2=x) &= \begin{cases} \left(\tfrac{1}{12},\tfrac{11}{12} \right) & \text{ if } x < 3 \\ \left(\tfrac{11}{12},\tfrac{1}{12} \right) & \text{ if } x \ge 3 \end{cases} \\ X_1 &= \frac{7}{10}\left(X_0 + X_2 - 4\right) + s\left(\frac{1}{2} Z_1 \right) \\ X_3 &= \frac{10}{3} \tanh\left(\frac{X_1}{3}\right) + \frac{1}{10}(Z_3 - 1) \end{align} $$
where $Z_1 \sim N(0,1)$ and $Z_3\sim \textup{Exp}(1)$ are independent noise random variables and $s(x)=\log(1+\exp(x))$.
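The sign flip described above can be checked numerically by sampling from these equations. The sketch below is not the official generation code and rests on two assumptions: the six categories are labelled 0 to 5, and each probability vector is ordered as (probability of 0, probability of 1). It compares the marginal covariance of the binary variable and the continuous variable it feeds into with the covariance inside each category of the conditioning variable.

```python
import numpy as np


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)


def sample_mixed_sem(n: int, seed: int = 0):
    """Draw n samples of (X0, X1, X2, X3) from the equations above."""
    rng = np.random.default_rng(seed)
    # Assumption: the six equally likely categories are labelled 0..5.
    x2 = rng.integers(0, 6, size=n)
    # Assumption: probability vectors are ordered (P(X0 = 0), P(X0 = 1)).
    p_one = np.where(x2 < 3, 11.0 / 12.0, 1.0 / 12.0)
    x0 = (rng.random(n) < p_one).astype(float)
    z1 = rng.standard_normal(n)
    z3 = rng.exponential(1.0, size=n)
    x1 = 0.7 * (x0 + x2 - 4.0) + softplus(0.5 * z1)
    x3 = (10.0 / 3.0) * np.tanh(x1 / 3.0) + 0.1 * (z3 - 1.0)
    return x0, x1, x2, x3


x0, x1, x2, x3 = sample_mixed_sem(20_000)
marginal_cov = np.cov(x0, x1)[0, 1]
conditional_covs = [np.cov(x0[x2 == k], x1[x2 == k])[0, 1] for k in range(6)]
print("marginal:", marginal_cov)
print("per category:", conditional_covs)
```

Under these assumptions the marginal covariance is negative while every within-category covariance is positive, which matches the construction described in the text.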
large_backdoor_binary_t
| Variable | Discrete/continuous |
| ------------ | ------------ |
| $X_0$ | Continuous |
| $X_1$ | Continuous |
| $X_2$ | Continuous |
| $X_3$ | Continuous |
| $X_4$ | Continuous |
| $X_5$ | Continuous |
| $X_6$ | Continuous |
| $X_7$ | Discrete on $\{0,1\}$ |
| $X_8$ | Continuous |
An adaptation of large_backdoor with a binary variable $X_7$ which is considered the treatment variable.
A complete description of the structural equations can be found in the data generation code for CSuite.
weak_arrows_binary_t
| Variable | Discrete/continuous |
| ------------ | ------------ |
| $X_0$ | Continuous |
| $X_1$ | Continuous |
| $X_2$ | Continuous |
| $X_3$ | Continuous |
| $X_4$ | Continuous |
| $X_5$ | Continuous |
| $X_6$ | Continuous |
| $X_7$ | Discrete on $\{0,1\}$ |
| $X_8$ | Continuous |
An adaptation of weak_arrows with a binary variable $X_7$ which is considered the treatment variable.
A complete description of the structural equations can be found in the data generation code for CSuite.
mixed_confounding
| Variable | Discrete/continuous |
| ------------ | ------------ |
| $X_0$ | Discrete on $\{0,1\}$ |
| $X_1$ | Continuous |
| $X_2$ | Continuous |
| $X_3$ | Continuous |
| $X_4$ | Discrete on $\{0,1\}$ |
| $X_5$ | Discrete on $\{0,1,2\}$ |
| $X_6$ | Continuous |
| $X_7$ | Discrete on $\{0,1,2\}$ |
| $X_8$ | Continuous |
| $X_9$ | Continuous |
| $X_{10}$ | Continuous |
| $X_{11}$ | Continuous |
A larger dataset with treatment node $X_0$ and outcome node $X_1$. The remaining variables play distinct roles: confounders, causes of $X_0$ only, causes of $X_1$ only, variables downstream of $X_0$, variables downstream of $X_1$, and a collider caused by $X_0$ and $X_1$.
A complete description of the structural equations can be found in the data generation code for CSuite.
cat_chain
Microsoft Open Source Code of Conduct.
For more information see the Code of Conduct FAQ or
contact opencode@microsoft.com with any additional questions or comments.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
Owner
- Name: Microsoft
- Login: microsoft
- Kind: organization
- Email: opensource@microsoft.com
- Location: Redmond, WA
- Website: https://opensource.microsoft.com
- Twitter: OpenAtMicrosoft
- Repositories: 7,257
- Profile: https://github.com/microsoft
Open source projects and samples from Microsoft
Citation (CITATION.cff)
message: "If you use this dataset, please cite the following paper."
authors:
  - family-names: "Geffner"
    given-names: "Tomas"
  - family-names: "Antoran"
    given-names: "Javier"
  - family-names: "Foster"
    given-names: "Adam"
  - family-names: "Gong"
    given-names: "Wenbo"
  - family-names: "Ma"
    given-names: "Chao"
  - family-names: "Kiciman"
    given-names: "Emre"
  - family-names: "Sharma"
    given-names: "Amit"
  - family-names: "Lamb"
    given-names: "Angus"
  - family-names: "Kukla"
    given-names: "Martin"
  - family-names: "Pawlowski"
    given-names: "Nick"
  - family-names: "Allamanis"
    given-names: "Miltiadis"
  - family-names: "Zhang"
    given-names: "Cheng"
title: "Deep End-to-end Causal Inference"
date-released: 2022-02-04
url: "https://arxiv.org/abs/2202.02195"
journal: "arXiv preprint arXiv:2202.02195"
repository-code: "https://github.com/microsoft/csuite"
GitHub Events
Total
- Watch event: 15
- Fork event: 1
Last Year
- Watch event: 15
- Fork event: 1
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Adam Foster | a****r | 20 |
| Microsoft Open Source | m****e | 5 |
| microsoft-github-operations[bot] | 5****] | 1 |
| Agrin Hilmkil | a****h | 1 |
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- fred887 (1)