https://github.com/bytedance/causalmatch

CausalMatch is a Bytedance research project aimed at integrating cutting-edge machine learning and econometrics methods to bring automation to the decision-making process.


Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.2%) to scientific vocabulary

Keywords

causal-inference econometrics machine-learning
Last synced: 7 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: bytedance
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 754 KB
Statistics
  • Stars: 87
  • Watchers: 6
  • Forks: 5
  • Open Issues: 1
  • Releases: 5
Topics
causal-inference econometrics machine-learning
Created over 1 year ago · Last pushed 7 months ago
Metadata Files
Readme License

README.md

CausalMatch: A Python Package for Propensity Score Matching and Coarsened Exact Matching


CausalMatch is a Python package that implements two classic matching methods, propensity score matching (PSM) and coarsened exact matching (CEM), to estimate average treatment effects from observational data. The package was designed and built as part of the ByteDance data science research program, with the goal of combining state-of-the-art machine learning techniques with econometrics to bring automation to complex causal inference problems. Our toolkit possesses the following features:

* Implements classic matching techniques from the literature at the intersection of econometrics and machine learning
* Maintains flexibility in modeling the propensity score (via various machine learning classification models), while preserving the causal interpretation of the learned model and often offering valid confidence intervals
* Uses a unified API
* Builds on standard Python packages for machine learning and data analysis

Table of Contents

* [News](#news)
* [Getting Started](#getting-started)
* [Installation](#installation)
* [Usage Examples](#usage-examples)
* [Estimation Methods](#estimation-methods)
* [References](#references)

News

If you'd like to contribute to this project, please contact xiaoyuzhou@bytedance.com. If you have any questions, feel free to raise them in the issues section.

**March 19, 2025:** Release v0.0.5, see release notes [here](https://github.com/bytedance/CausalMatch/releases/tag/v0.0.5)

Previous releases:

* **December 10, 2024:** Release v0.0.4, see release notes [here](https://github.com/bytedance/CausalMatch/releases/tag/v0.0.4)
* **August 20, 2024:** Release v0.0.2, see release notes [here](https://github.com/bytedance/CausalMatch/releases/tag/v0.0.2)
* **August 2, 2024:** Release 0.0.1.

Getting Started

Installation

Install the latest release from PyPI:

```
pip install causalmatch==0.0.5
```

Usage Examples

Estimation Methods

Propensity Score Matching (PSM)

* Simple PSM

```Python
from causalmatch import matching, gen_test_data
from sklearn.ensemble import GradientBoostingClassifier

df = gen_test_data(n = 10000, c_ratio = 0.5)
df.head()

X = ['c_1', 'c_2', 'c_3', 'd_1', 'gender']
y = ['y', 'y2']
T = 'treatment'
id = 'user_id'

# STEP 1: initialize object
match_obj = matching(data = df, T = T, X = X, y = y, id = id)

# STEP 2: propensity score matching
match_obj.psm(n_neighbors = 1,                      # number of neighbors
              model = GradientBoostingClassifier(), # p-score model
              trim_percentage = 0.1,                # trim x percent of data based on propensity score
              caliper = 0.1)                        # caliper for p-score diff

# STEP 3: balance check after propensity score matching
match_obj.balance_check(include_discrete = True)

# STEP 4: obtain average partial effect
print(match_obj.ate())
```

* PSM with multiple p-score models, selecting the best one based on F1 score

```Python
# STEP 0: define all classification models you need
from causalmatch import matching
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

ps_model1 = LogisticRegression(C=1e6)
ps_model2 = SVC(probability=True)
ps_model3 = GaussianNB()
ps_model4 = KNeighborsClassifier()
ps_model5 = DecisionTreeClassifier()
ps_model6 = RandomForestClassifier()
ps_model7 = GradientBoostingClassifier()
ps_model8 = LGBMClassifier()
ps_model9 = XGBClassifier()

model_list = [ps_model1, ps_model2, ps_model3,
              ps_model4, ps_model5, ps_model6,
              ps_model7, ps_model8, ps_model9]

match_obj = matching(data = df, T = T, X = X, id = id)
match_obj.psm(n_neighbors = 1,
              model_list = model_list, # list of models you want to try
              trim_percentage = 0,
              caliper = 1,
              test_size = 0.2)         # train-test split: portion used as the test sample

print(match_obj.balance_check(include_discrete = True))
df_out = match_obj.df_out_final_post_trim.merge(df[y + X + [id]], how = 'left', on = id)
```
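The PSM recipe boils down to three mechanical steps: fit a classifier for P(T = 1 | X), match each treated unit to its nearest control on the fitted score subject to a caliper, and average the matched outcome differences. Below is a minimal self-contained sketch of those mechanics using plain scikit-learn on simulated data; the variable names and the simulation are illustrative only, and this is not causalmatch's actual implementation:

```python
# Conceptual sketch of 1-nearest-neighbor propensity score matching
# (illustration only, not causalmatch's internal code).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=(n, 3))                          # covariates
t = (x[:, 0] + rng.normal(size=n) > 0).astype(int)   # treatment depends on x
y = 2.0 * t + x[:, 0] + rng.normal(size=n)           # simulated true effect = 2

# STEP 1: fit a propensity score model P(T = 1 | X)
ps = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]

# STEP 2: match each treated unit to its nearest control on the score
treated = np.where(t == 1)[0]
control = np.where(t == 0)[0]
nn = NearestNeighbors(n_neighbors=1).fit(ps[control].reshape(-1, 1))
dist, idx = nn.kneighbors(ps[treated].reshape(-1, 1))

# STEP 3: keep only pairs whose score difference is within the caliper
caliper = 0.1
keep = dist[:, 0] <= caliper
pairs_t = treated[keep]
pairs_c = control[idx[keep, 0]]

# STEP 4: treatment effect estimate = mean outcome difference across pairs
att = (y[pairs_t] - y[pairs_c]).mean()
print(att)  # should land near the simulated effect of 2
```

Matching on the one-dimensional score rather than on the full covariate vector is what makes the method scale to many covariates; the caliper guards against pairing units whose scores are too far apart to be comparable.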
Coarsened Exact Matching (CEM)

* Simple CEM

```Python
match_obj_cem = matching(data = df,
                         y = ['y'],
                         T = 'treatment',
                         X = ['c_1', 'd_1', 'd_3'],
                         id = 'user_id')

# coarsened exact matching
match_obj_cem.cem(n_bins = 10, # number of bins for continuous x variables, cut by percentile
                  k2k = True)  # k2k: trim exp/base groups to the same number of observations
print(match_obj_cem.balance_check(include_discrete = True))
print(match_obj_cem.ate())
```

* CEM with customized bin cuts

```Python
match_obj_cem = matching(data = df,
                         y = ['y'],
                         T = 'treatment',
                         X = ['c_1', 'd_1', 'd_3'],
                         id = 'user_id')
match_obj_cem.cem(n_bins = 10,
                  break_points = {'c_1': [-1, 0.3, 0.6, 2]}, # cut points for continuous variables
                  cluster_criteria = {'d_1': [['apple', 'pear'], ['cat', 'dog'], ['bee']],
                                      'd_3': [['0.0', '1.0', '2.0'],
                                              ['3.0', '4.0', '5.0'],
                                              ['6.0', '7.0', '8.0', '9.0']]}, # group values for discrete variables
                  k2k = True)
```
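The mechanics behind CEM are simple enough to sketch directly: coarsen each continuous covariate into bins, treat the resulting (bin, category) tuple as a stratum, discard strata that lack either treated or control units, and average within-stratum outcome differences weighted by stratum size. The toy example below uses only pandas on simulated data; the column names and bin count are hypothetical, and this is not causalmatch's internal code:

```python
# Conceptual sketch of coarsened exact matching (illustration only).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "c_1": rng.normal(size=n),                            # continuous covariate
    "d_1": rng.choice(["apple", "pear", "bee"], size=n),  # discrete covariate
})
df["treatment"] = (df["c_1"] + rng.normal(size=n) > 0).astype(int)
df["y"] = 2.0 * df["treatment"] + df["c_1"] + rng.normal(size=n)  # true effect = 2

# STEP 1: coarsen the continuous covariate into percentile bins
df["c_1_bin"] = pd.qcut(df["c_1"], q=10, labels=False)

# STEP 2: exact-match on the (bin, category) stratum: keep only strata
# containing both treated and control units
ok = df.groupby(["c_1_bin", "d_1"])["treatment"].transform("nunique") == 2
matched = df[ok]

# STEP 3: within-stratum mean differences, weighted by stratum size
means = matched.groupby(["c_1_bin", "d_1", "treatment"])["y"].mean().unstack("treatment")
eff = means[1] - means[0]
w = matched.groupby(["c_1_bin", "d_1"]).size()
ate = (eff * w).sum() / w.sum()
print(ate)  # should land near the simulated effect of 2
```

causalmatch layers conveniences on top of this basic idea, such as `k2k` trimming and user-supplied `break_points` and `cluster_criteria`, per its API.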

See the References section for more details.

References

S. Athey, J. Tibshirani, S. Wager. Generalized random forests. Annals of Statistics, 47, no. 2, 1148--1178, 2019.

V. Chernozhukov, D. Nekipelov, V. Semenova, V. Syrgkanis. Plug-in Regularized Estimation of High-Dimensional Parameters in Nonlinear Semiparametric Models. ArXiv preprint arXiv:1806.04823, 2018.

S. Wager, S. Athey. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Journal of the American Statistical Association, 113:523, 1228-1242, 2018.

V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey. Double Machine Learning for Treatment and Causal Parameters. ArXiv preprint arXiv:1608.00060, 2016.

P. Bajari, B. Burdick, G. W. Imbens, L. Masoero, J. McQueen, T. Richardson, I. M. Rosen. Multiple Randomization Designs. ArXiv preprint arXiv:2112.13495, 2021.

Owner

  • Name: Bytedance Inc.
  • Login: bytedance
  • Kind: organization
  • Location: Singapore

GitHub Events

Total
  • Release event: 3
  • Watch event: 47
  • Issue comment event: 1
  • Member event: 2
  • Push event: 20
  • Fork event: 4
  • Create event: 4
Last Year
  • Release event: 3
  • Watch event: 47
  • Issue comment event: 1
  • Member event: 2
  • Push event: 20
  • Fork event: 4
  • Create event: 4

Committers

Last synced: 11 months ago

All Time
  • Total Commits: 50
  • Total Committers: 4
  • Avg Commits per committer: 12.5
  • Development Distribution Score (DDS): 0.06
Past Year
  • Commits: 50
  • Committers: 4
  • Avg Commits per committer: 12.5
  • Development Distribution Score (DDS): 0.06
Top Committers
Name Email Commits
周小羽 x****u@b****m 47
lx-byte l****g@b****m 1
ajw-gelaoguan j****o@b****m 1
Willem Jiang 1****d 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 1
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Total issue authors: 1
  • Total pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • nickeubank (1)
Pull Request Authors
  • lx-byte (2)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 56 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 6
  • Total maintainers: 2
pypi.org: causalmatch

Propensity score matching and coarsened exact matching

  • Versions: 6
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 56 Last month
Rankings
Dependent packages count: 10.5%
Average: 34.9%
Dependent repos count: 59.3%
Maintainers (2)
Last synced: 7 months ago

Dependencies

setup.py pypi