efficient-rashomon-rule-set

[KDD 2024] Efficient Exploration of the Rashomon Set of Rule Set Models

https://github.com/xiaohan2012/efficient-rashomon-rule-set

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
✓
Academic publication links
Links to: arxiv.org
○
Committers with academic emails
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (12.2%) to scientific vocabulary

Last synced: 10 months ago · JSON representation ·

Repository

[KDD 2024] Efficient Exploration of the Rashomon Set of Rule Set Models

Basic Info

Host: GitHub
Owner: xiaohan2012
Language: Jupyter Notebook
Default Branch: main
Homepage:
Size: 82.1 MB

Statistics

Stars: 3
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 1

Created about 2 years ago · Last pushed over 1 year ago

Metadata Files

Readme Citation

Efficient algorithms to explore the Rashomon set of rule set models

This repository contains the source code of the paper "Efficient Exploration of the Rashomon Set of Rule Set Models" (KDD 2024)

What is a Rashomon set and why studying it?

The Rashomon set of an ML problem refers to the set of models of near-optimal predictive performance.

Why studying it? Because models with similar performance may exhibit drastically different properties (such as fairness), therefore a single model does not offer an adequate representation of the reality.

An example showcasing the Rashomon set of rule set models for the COMPAS dataset.

Each rule set is plotted as a point, whose position is determined by the statistical parity (SP) of the rule set on race and gender (in the X and Y axis, respectively).
- Statistical parity quantifies the fairness of classification models.
You can see that two highlighted models have very different SP[race] scores, though their accuracy scores are close.

Project overview

We designed efficient algorithms to explore the Rashomon set of rule-set models for binary classification problems.
- Our focus is on rule set models, due to their inherent interpretability 💡.
We investigated two exploration modes -- counting and uniform sampling from the Rashomon set.
Instead of tackling exact counting and uniform sampling, we study the approximate versions of them, which reduces the search space drastically.
For both problems, we invented theoretically-sound algorithms sped up by effective pruning bounds, and a efficient implementation of it powered by Numba and Ray.
- Compared to off-the-shelf tools (such as Google OR-tools), our implementation is often >1000x faster ⚡⚡⚡

Environment setup

The source code is tested against Python 3.8 on MacOS 14.2.1

shell pip install -r requirements.txt

Verify that unit tests pass

shell pytest tests

Example usage

We illustrate the usage of approximate counter and almost-uniform sampler applied on synthetic data.

Preparation

Set up a Ray cluster for parallel computing, e.g.,

python import ray ray.init()

Approximate counting

``` python from bds.ruleutils import generaterandomrulesandy from bds.meel import approxmc2

ub = 0.9 # upper bound on the rule set objective function lmbd = 0.1 # complexity penalty term

eps = 0.8 # error parameter related to estimation accuracy delta = 0.8 # the estimation confidence parameter

numpts, numrules = 100, 10

generate the input data

randomrules, randomy = generaterandomrulesandy( numpts, numrules, rand_seed=42 )

get an approximate estimation of the number of good rule set models

estimatedcount = approxmc2( randomrules, randomy, lmbd=lmbd, ub=ub, delta=delta, eps=eps, rand_seed=42, parallel=True, # using paralle run ) ```

Almost uniform sampling

``` python from bds.ruleutils import generaterandomrulesand_y from bds.meel import UniGen

numpts, numrules = 100, 10 randomrules, randomy = generaterandomrulesandy( numpts, numrules, rand_seed=42 )

ub = 0.9 eps = 8 # epsilon parameter that controls the closeness between the sampled distribution and uniform distribution lmbd = 0.1 # complexity penalty term

sampler = UniGen(randomrules, randomy, lmbd, ub, eps, rand_seed=42)

sampler.prepare() # collect necessary statistics required for sampling

sample 10 rule sets almost uniformly from the Rashomon set

samples = sampler.sample(10, exclude_none=True) ```

Candidate rules extraction on real-world datasets

When working with real-world datasets, the first step is often extract a list of candidate rules.

For this purpose, you may rely on extract_rules_with_min_support to extract a list of rules with support above a given threshold.

``` python import pandas as pd from bds.candidategeneration import extractruleswithmin_support

dataset = "compas" data = pd.readcsv('data/compastrain-binary.csv') # the features are binary X = data.to_numpy()[:,:-2] # extract the feature matrix

attribute_names = list(data.columns[:-2])

candidaterules = extractruleswithminsupport(X, attributenames, min_support=70)

then you may apply the sampler or count estimator on the candidate rules

```

Contact persons

Han Xiao: xiaohan2012@gmail.com
Martino Ciaperoni: martino.ciaperoni@aalto.fi

Citing this work

If you find this work useful, please consider citing it.

Bibtex entry

``` bibtex @inproceedings{ciaperoni2024efficient, title={Efficient Exploration of the Rashomon Set of Rule-Set Models}, author={Ciaperoni, Martino and Xiao, Han and Gionis, Aristides}, booktitle={Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining}, pages={478--489}, year={2024} } ```

TODO

[ ] rename package to ers
[ ] packaging
[ ] maybe add a logo?

Owner

Name: Xiao Han
Login: xiaohan2012
Kind: user
Location: Jyväskylä, Finland
Company: Unity

Website: https://xiaohan2012.github.io/
Repositories: 155
Profile: https://github.com/xiaohan2012

Algorithm and Data Science lover

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Xiao"
  given-names: "Han"
  orcid: "https://orcid.org/0000-0003-3529-083X"
- family-names: "Ciaperoni"
  given-names: "Martino"
  orcid: ""
title: "Efficient Exploration of the Rashomon Set of Rule Set Models"
version: 0.0.0
doi: 10.5281/zenodo.14219954
date-released: 2024-11-26
url: "https://github.com/xiaohan2012/efficient-rashomon-rule-set/"

GitHub Events

Total

Release event: 1
Watch event: 3
Push event: 22
Create event: 1

Last Year

Release event: 1
Watch event: 3
Push event: 22
Create event: 1

Committers

Last synced: 11 months ago

All Time

Total Commits: 328
Total Committers: 3
Avg Commits per committer: 109.333
Development Distribution Score (DDS): 0.076

Past Year

Commits: 30
Committers: 1
Avg Commits per committer: 30.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Han Xiao	x**2@g**m	303
martinociaperoni	m**p@g**m	22
martinociaperoni	3****i	3

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

efficient-rashomon-rule-set

Science Score: 54.0%

Repository

Basic Info

Statistics

Metadata Files

README.md

Efficient algorithms to explore the Rashomon set of rule set models

What is a Rashomon set and why studying it?

Project overview

Environment setup

Example usage

Preparation

Approximate counting

generate the input data

get an approximate estimation of the number of good rule set models

Almost uniform sampling

sample 10 rule sets almost uniformly from the Rashomon set

Candidate rules extraction on real-world datasets

then you may apply the sampler or count estimator on the candidate rules

Contact persons

Citing this work

TODO

Owner

Citation (CITATION.cff)

GitHub Events

Total

Last Year

Committers

All Time

Past Year

Top Committers

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels