https://github.com/csinva/disentangled-attribution-curves

Using / reproducing DAC from the paper "Disentangled Attribution Curves for Interpreting Random Forests and Boosted Trees"

Science Score: 33.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
    1 of 3 committers (33.3%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.9%) to scientific vocabulary

Keywords

ai artificial-intelligence boosting ensemble-model explainable-ai feature-engineering feature-importance interpretability machine-learning ml python random-forest random-forests scikit-learn statistics stats
Last synced: 5 months ago

Repository

Using / reproducing DAC from the paper "Disentangled Attribution Curves for Interpreting Random Forests and Boosted Trees"

Basic Info
Statistics
  • Stars: 27
  • Watchers: 6
  • Forks: 4
  • Open Issues: 1
  • Releases: 0
Topics
ai artificial-intelligence boosting ensemble-model explainable-ai feature-engineering feature-importance interpretability machine-learning ml python random-forest random-forests scikit-learn statistics stats
Created over 7 years ago · Last pushed about 5 years ago
Metadata Files
Readme License

readme.md

Disentangled attribution curves (DAC) 🔎

Official code for using / reproducing DAC from the paper Disentangled Attribution Curves for Interpreting Random Forests and Boosted Trees (arXiv 2019 pdf)

Note: this repo is actively maintained. For any questions please file an issue.

documentation

using DAC on new models

  • quick install: pip install git+https://github.com/csinva/disentangled-attribution-curves
  • the core method code lives in the dac folder and is compatible with scikit-learn
  • the examples/xor_dac.ipynb notebook shows how to use DAC on a new dataset, with some simple synthetic examples (e.g. XOR)
  • the basic API consists of two functions: from dac import dac, dac_plot
  • dac(forest, input_space_x, outcome_space_y, assignment, S, continuous_y=True, class_id=1)

    • inputs:
      • forest: an sklearn ensemble of decision trees
      • input_space_x: the matrix of training data (feature values), a numpy 2D array
      • outcome_space_y: the array of training data (labels/regression targets), a numpy 1D array
      • assignment: a matrix of feature values that will have their DAC importance score evaluated, a numpy 2D array
      • S: a binary indicator of whether to include each feature in the importance calculation, a numpy 1D array with values 0 and 1 only
      • continuous_y: a boolean indicator of whether the y targets are regression (True) or classification (False); defaults to True
      • class_id: for classification, the class whose proportions are returned; defaults to 1
    • returns
      • dac_curve
        • for regression: a numpy array whose length equals the number of samples in the assignment input; each entry is a DAC importance score, a float between min(outcome_space_y) and max(outcome_space_y)
        • for classification: a numpy array whose length equals the number of samples in the assignment input; each entry is a DAC importance score, a float between 0 and 1
  • dac_plot(forest, input_space_x, outcome_space_y, S, interval_x, interval_y, di_x, di_y, C, continuous_y, weights)

    • inputs
      • forest: an sklearn ensemble of decision trees (random forest or adaboosted forest)
      • input_space_x: the matrix of training data (feature values), a numpy 2D array
      • outcome_space_y: the array of training data (labels/regression targets), a numpy 1D array
      • S: a binary indicator of whether to include each feature in the importance calculation, a numpy 1D array with values 0 and 1 only
      • interval_x: an interval for the x axis of the plot, defaults to None. If None, a reasonable interval will be extrapolated from the range of the first feature specified in S.
      • interval_y: an interval for the y axis of the plot (only applicable to heat maps), defaults to None.
        If None, a reasonable interval will be extrapolated from the range of the second feature specified in S.
      • di_x: a step length for the x axis of the plot, defaults to None. If None, a reasonable step length will be extrapolated from the range of the first feature specified in S.
      • di_y: a step length for the y axis of the plot (only applicable to heat maps), defaults to None. If None, a reasonable step length will be extrapolated from the range of the second feature specified in S.
      • C: a hyper-parameter specifying the number of standard deviations samples can be from the mean of the leaf and be counted into the curve. Smaller values yield a more sensitive curve, larger values yield a smoother curve.
      • continuous_y: a boolean indicator of whether the y targets are regression (True) or classification (False); defaults to True
      • weights: weights for the individual estimators' contributions to the curve; defaults to None. If None, weights are inferred from the forest type.
    • returns
      • dac_curve: a numpy array containing values for the DAC curve or heat map describing the interaction of the variables specified in S
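Putting the inputs above together, a minimal sketch of calling dac looks like the following. The forest, data, assignment, and S mask follow the shapes described in the documentation; the XOR data, sizes, and variable names are illustrative, and the dac import is guarded since the package may not be installed:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy XOR-style regression data (sizes and names are illustrative).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 2)).astype(float)  # input_space_x: 2D feature matrix
y = np.logical_xor(X[:, 0], X[:, 1]).astype(float)   # outcome_space_y: 1D targets

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# assignment: points at which DAC importance scores are evaluated (2D array).
assignment = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# S: binary mask selecting which features enter the importance calculation.
S = np.array([1, 1])

# Guard the import: the dac package may not be installed in this environment.
try:
    from dac import dac
    scores = dac(forest, X, y, assignment, S, continuous_y=True)
    print(scores)  # one DAC importance score per row of `assignment`
except ImportError:
    print("install with: pip install git+https://github.com/csinva/disentangled-attribution-curves")
```

dac_plot takes the same forest, training data, and S mask, plus the interval/step and C parameters above to lay out the resulting curve or heat map.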

reproducing results from the paper

  • the examples/bikesharingdac.ipynb notebook shows how to use DAC to reproduce the qualitative curves on the bike-sharing dataset from the paper
  • the simulation script replicates the paper's simulation experiments
  • the pmlb script replicates the automatic feature-engineering experiments on PMLB datasets

dac animation

a gif demonstrating how a DAC curve is calculated for a simple tree

related work

  • this work is part of an overarching project on interpretable machine learning, guided by the PDR framework for interpretable machine learning
  • for related work, see the github repo for disentangled hierarchical dnn interpretations (ICLR 2019)

reference

  • feel free to use/share this code openly

  • citation for this work:

@article{devlin2019disentangled,
  title={Disentangled Attribution Curves for Interpreting Random Forests and Boosted Trees},
  author={Devlin, Summer and Singh, Chandan and Murdoch, W James and Yu, Bin},
  journal={arXiv preprint arXiv:1905.07631},
  year={2019}
}

Owner

  • Name: Chandan Singh
  • Login: csinva
  • Kind: user
  • Company: Microsoft Research (Senior researcher)

Senior researcher @Microsoft interpreting ML models in science and medicine. PhD from UC Berkeley.

GitHub Events

Total
  • Watch event: 2
Last Year
  • Watch event: 2

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 81
  • Total Committers: 3
  • Avg Commits per committer: 27.0
  • Development Distribution Score (DDS): 0.346
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Chandan Singh c****h@b****u 53
Summer Devlin s****r@S****t 17
Summer Devlin s****r@S****l 11
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 2
  • Total pull requests: 7
  • Average time to close issues: 39 minutes
  • Average time to close pull requests: 17 days
  • Total issue authors: 2
  • Total pull request authors: 1
  • Average comments per issue: 1.5
  • Average comments per pull request: 0.0
  • Merged pull requests: 7
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • stephenpardy (1)
  • kumar-hardik (1)
Pull Request Authors
  • devlins96 (7)
Top Labels
Issue Labels
Pull Request Labels