https://github.com/csinva/disentangled-attribution-curves

Using / reproducing DAC from the paper "Disentangled Attribution Curves for Interpreting Random Forests and Boosted Trees"

Science Score: 33.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
    1 of 3 committers (33.3%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.9%) to scientific vocabulary

Keywords

ai artificial-intelligence boosting ensemble-model explainable-ai feature-engineering feature-importance interpretability machine-learning ml python random-forest random-forests scikit-learn statistics stats
Last synced: 5 months ago

Repository

Using / reproducing DAC from the paper "Disentangled Attribution Curves for Interpreting Random Forests and Boosted Trees"

Basic Info
Statistics
  • Stars: 27
  • Watchers: 6
  • Forks: 4
  • Open Issues: 1
  • Releases: 0
Topics
ai artificial-intelligence boosting ensemble-model explainable-ai feature-engineering feature-importance interpretability machine-learning ml python random-forest random-forests scikit-learn statistics stats
Created over 7 years ago · Last pushed about 5 years ago
Metadata Files
Readme License

readme.md

Disentangled attribution curves (DAC) 🔎

Official code for using / reproducing DAC from the paper Disentangled Attribution Curves for Interpreting Random Forests and Boosted Trees (arXiv 2019 pdf)

Note: this repo is actively maintained. For any questions please file an issue.

documentation

using DAC on new models

  • quick install: pip install git+https://github.com/csinva/disentangled-attribution-curves
  • the core method code lives in the dac folder and is compatible with scikit-learn
  • the examples/xor_dac.ipynb notebook shows how to use DAC on a new dataset, with some simple synthetic examples (e.g. XOR)
  • the basic API consists of two functions: from dac import dac, dac_plot
  • dac(forest, input_space_x, outcome_space_y, assignment, S, continuous_y=True, class_id=1)

    • inputs:
      • forest: an sklearn ensemble of decision trees
      • input_space_x: the matrix of training data (feature values), a numpy 2D array
      • outcome_space_y: the array of training data (labels/regression targets), a numpy 1D array
      • assignment: a matrix of feature values that will have their DAC importance score evaluated, a numpy 2D array
      • S: a binary indicator of whether to include each feature in the importance calculation, a numpy 1D array with values 0 and 1 only
      • continuous_y: a boolean indicator of whether the y targets are regression (True) or classification (False); defaults to True
      • class_id: for classification, the class whose proportions are returned; defaults to 1
    • returns
      • dac_curve
        • for regression: a numpy array whose length equals the number of samples in the assignment input; each entry is a DAC importance score, a float between min(outcome_space_y) and max(outcome_space_y)
        • for classification: a numpy array whose length equals the number of samples in the assignment input; each entry is a DAC importance score, a float between 0 and 1
  • dac_plot(forest, input_space_x, outcome_space_y, S, interval_x, interval_y, di_x, di_y, C, continuous_y, weights)

    • inputs
      • forest: an sklearn ensemble of decision trees (random forest or adaboosted forest)
      • input_space_x: the matrix of training data (feature values), a numpy 2D array
      • outcome_space_y: the array of training data (labels/regression targets), a numpy 1D array
      • S: a binary indicator of whether to include each feature in the importance calculation, a numpy 1D array with values 0 and 1 only
      • interval_x: an interval for the x axis of the plot, defaults to None. If None, a reasonable interval will be extrapolated from the range of the first feature specified in S.
      • interval_y: an interval for the y axis of the plot (only applicable to heat maps), defaults to None.
        If None, a reasonable interval will be extrapolated from the range of the second feature specified in S.
      • di_x: a step length for the x axis of the plot, defaults to None. If None, a reasonable step length will be extrapolated from the range of the first feature specified in S.
      • di_y: a step length for the y axis of the plot (only applicable to heat maps), defaults to None. If None, a reasonable step length will be extrapolated from the range of the second feature specified in S.
      • C: a hyper-parameter specifying the number of standard deviations samples can be from the mean of the leaf and be counted into the curve. Smaller values yield a more sensitive curve, larger values yield a smoother curve.
      • continuous_y: a boolean indicator of whether the y targets are regression (True) or classification (False); defaults to True
      • weights: weights for the individual estimators' contributions to the curve; defaults to None. If None, weights are inferred from the forest type.
    • returns
      • dac_curve: a numpy array containing values for the DAC curve or heat map describing the interaction of the variables specified in S
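Putting the inputs above together, a minimal sketch of calling dac looks like the following. The forest, data, assignment, and S mask follow the shapes described in the documentation; the XOR data, sizes, and variable names are illustrative, and the dac import is guarded since the package may not be installed:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy XOR-style regression data (sizes and names are illustrative).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 2)).astype(float)  # input_space_x: 2D feature matrix
y = np.logical_xor(X[:, 0], X[:, 1]).astype(float)   # outcome_space_y: 1D targets

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# assignment: points at which DAC importance scores are evaluated (2D array).
assignment = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# S: binary mask selecting which features enter the importance calculation.
S = np.array([1, 1])

# Guard the import: the dac package may not be installed in this environment.
try:
    from dac import dac
    scores = dac(forest, X, y, assignment, S, continuous_y=True)
    print(scores)  # one DAC importance score per row of `assignment`
except ImportError:
    print("install with: pip install git+https://github.com/csinva/disentangled-attribution-curves")
```

dac_plot takes the same forest, training data, and S mask, plus the interval/step and C parameters above to lay out the resulting curve or heat map.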

reproducing results from the paper

  • the examples/bikesharingdac.ipynb notebook shows how to use DAC to reproduce the qualitative curves on the bike-sharing dataset from the paper
  • the simulation script replicates the paper's simulation experiments
  • the pmlb script replicates the automatic feature-engineering experiments on PMLB datasets

dac animation

a gif demonstrating how a DAC curve is calculated for a simple tree

related work

  • this work is part of an overarching project on interpretable machine learning, guided by the PDR framework for interpretable machine learning
  • for related work, see the github repo for disentangled hierarchical dnn interpretations (ICLR 2019)

reference

  • feel free to use/share this code openly

  • citation for this work:

@article{devlin2019disentangled,
  title={Disentangled Attribution Curves for Interpreting Random Forests and Boosted Trees},
  author={Devlin, Summer and Singh, Chandan and Murdoch, W James and Yu, Bin},
  journal={arXiv preprint arXiv:1905.07631},
  year={2019}
}

Owner

  • Name: Chandan Singh
  • Login: csinva
  • Kind: user
  • Company: Microsoft Research (Senior researcher)

Senior researcher @Microsoft interpreting ML models in science and medicine. PhD from UC Berkeley.

GitHub Events

Total
  • Watch event: 2
Last Year
  • Watch event: 2

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 81
  • Total Committers: 3
  • Avg Commits per committer: 27.0
  • Development Distribution Score (DDS): 0.346
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Chandan Singh c****h@b****u 53
Summer Devlin s****r@S****t 17
Summer Devlin s****r@S****l 11
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 2
  • Total pull requests: 7
  • Average time to close issues: 39 minutes
  • Average time to close pull requests: 17 days
  • Total issue authors: 2
  • Total pull request authors: 1
  • Average comments per issue: 1.5
  • Average comments per pull request: 0.0
  • Merged pull requests: 7
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • stephenpardy (1)
  • kumar-hardik (1)
Pull Request Authors
  • devlins96 (7)
Top Labels
Issue Labels
Pull Request Labels