efficient-rashomon-rule-set

[KDD 2024] Efficient Exploration of the Rashomon Set of Rule Set Models

https://github.com/xiaohan2012/efficient-rashomon-rule-set

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.2%) to scientific vocabulary
Last synced: 7 months ago · JSON representation ·

Repository

[KDD 2024] Efficient Exploration of the Rashomon Set of Rule Set Models

Basic Info
  • Host: GitHub
  • Owner: xiaohan2012
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 82.1 MB
Statistics
  • Stars: 3
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created almost 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme Citation

README.md

About Unittests Lint TestCoverage PythonVersion

Efficient algorithms to explore the Rashomon set of rule set models

This repository contains the source code of the paper "Efficient Exploration of the Rashomon Set of Rule Set Models" (KDD 2024)

What is a Rashomon set and why studying it?

The Rashomon set of an ML problem refers to the set of models of near-optimal predictive performance.

Why studying it? Because models with similar performance may exhibit drastically different properties (such as fairness), therefore a single model does not offer an adequate representation of the reality.

An example showcasing the Rashomon set of rule set models for the COMPAS dataset.

  • Each rule set is plotted as a point, whose position is determined by the statistical parity (SP) of the rule set on race and gender (in the X and Y axis, respectively).
    • Statistical parity quantifies the fairness of classification models.
  • You can see that two highlighted models have very different SP[race] scores, though their accuracy scores are close.

Project overview

  • We designed efficient algorithms to explore the Rashomon set of rule-set models for binary classification problems.
    • Our focus is on rule set models, due to their inherent interpretability 💡.
  • We investigated two exploration modes -- counting and uniform sampling from the Rashomon set.
  • Instead of tackling exact counting and uniform sampling, we study the approximate versions of them, which reduces the search space drastically.
  • For both problems, we invented theoretically-sound algorithms sped up by effective pruning bounds, and a efficient implementation of it powered by Numba and Ray.
    • Compared to off-the-shelf tools (such as Google OR-tools), our implementation is often >1000x faster ⚡⚡⚡

Environment setup

The source code is tested against Python 3.8 on MacOS 14.2.1

shell pip install -r requirements.txt

Verify that unit tests pass

shell pytest tests

Example usage

We illustrate the usage of approximate counter and almost-uniform sampler applied on synthetic data.

Preparation

Set up a Ray cluster for parallel computing, e.g.,

python import ray ray.init()

Approximate counting

``` python from bds.ruleutils import generaterandomrulesandy from bds.meel import approxmc2

ub = 0.9 # upper bound on the rule set objective function lmbd = 0.1 # complexity penalty term

eps = 0.8 # error parameter related to estimation accuracy delta = 0.8 # the estimation confidence parameter

numpts, numrules = 100, 10

generate the input data

randomrules, randomy = generaterandomrulesandy( numpts, numrules, rand_seed=42 )

get an approximate estimation of the number of good rule set models

estimatedcount = approxmc2( randomrules, randomy, lmbd=lmbd, ub=ub, delta=delta, eps=eps, rand_seed=42, parallel=True, # using paralle run ) ```

Almost uniform sampling

``` python from bds.ruleutils import generaterandomrulesand_y from bds.meel import UniGen

numpts, numrules = 100, 10 randomrules, randomy = generaterandomrulesandy( numpts, numrules, rand_seed=42 )

ub = 0.9 eps = 8 # epsilon parameter that controls the closeness between the sampled distribution and uniform distribution lmbd = 0.1 # complexity penalty term

sampler = UniGen(randomrules, randomy, lmbd, ub, eps, rand_seed=42)

sampler.prepare() # collect necessary statistics required for sampling

sample 10 rule sets almost uniformly from the Rashomon set

samples = sampler.sample(10, exclude_none=True) ```

Candidate rules extraction on real-world datasets

When working with real-world datasets, the first step is often extract a list of candidate rules.

For this purpose, you may rely on extract_rules_with_min_support to extract a list of rules with support above a given threshold.

``` python import pandas as pd from bds.candidategeneration import extractruleswithmin_support

dataset = "compas" data = pd.readcsv('data/compastrain-binary.csv') # the features are binary X = data.to_numpy()[:,:-2] # extract the feature matrix

attribute_names = list(data.columns[:-2])

candidaterules = extractruleswithminsupport(X, attributenames, min_support=70)

then you may apply the sampler or count estimator on the candidate rules

```

Contact persons

  • Han Xiao: xiaohan2012@gmail.com
  • Martino Ciaperoni: martino.ciaperoni@aalto.fi

Citing this work

If you find this work useful, please consider citing it.

Bibtex entry ``` bibtex @inproceedings{ciaperoni2024efficient, title={Efficient Exploration of the Rashomon Set of Rule-Set Models}, author={Ciaperoni, Martino and Xiao, Han and Gionis, Aristides}, booktitle={Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining}, pages={478--489}, year={2024} } ```

TODO

  • [ ] rename package to ers
  • [ ] packaging
  • [ ] maybe add a logo?

Owner

  • Name: Xiao Han
  • Login: xiaohan2012
  • Kind: user
  • Location: Jyväskylä, Finland
  • Company: Unity

Algorithm and Data Science lover

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Xiao"
  given-names: "Han"
  orcid: "https://orcid.org/0000-0003-3529-083X"
- family-names: "Ciaperoni"
  given-names: "Martino"
  orcid: ""
title: "Efficient Exploration of the Rashomon Set of Rule Set Models"
version: 0.0.0
doi: 10.5281/zenodo.14219954
date-released: 2024-11-26
url: "https://github.com/xiaohan2012/efficient-rashomon-rule-set/"

GitHub Events

Total
  • Release event: 1
  • Watch event: 3
  • Push event: 22
  • Create event: 1
Last Year
  • Release event: 1
  • Watch event: 3
  • Push event: 22
  • Create event: 1

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 328
  • Total Committers: 3
  • Avg Commits per committer: 109.333
  • Development Distribution Score (DDS): 0.076
Past Year
  • Commits: 30
  • Committers: 1
  • Avg Commits per committer: 30.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Han Xiao x****2@g****m 303
martinociaperoni m****p@g****m 22
martinociaperoni 3****i 3

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels