califair-em

Official implementation of GUIDE-AI @ SIGMOD paper "Threshold-Independent Fair Matching through Score Calibration"

https://github.com/mhmoslemi2338/califair-em

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 2 DOI reference(s) in README
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.3%) to scientific vocabulary

Keywords

classification entity-resolution fairness optimal-transport record-linkage sigmod

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: mhmoslemi2338
License: mit
Language: Python
Default Branch: master
Homepage: https://dl.acm.org/doi/abs/10.1145/3665601.3669845
Size: 19.3 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created almost 2 years ago · Last pushed 8 months ago

Readme

Official implementation of GUIDE-AI @ SIGMOD 2024 paper "Threshold-Independent Fair Matching through Score Calibration"

Abstract

Entity Matching (EM) is a critical task in numerous fields, such as healthcare, finance, and public administration, as it identifies records that refer to the same entity within or across different databases. EM faces considerable challenges, particularly with false positives and negatives. These are typically addressed by generating matching scores and applying thresholds to balance false positives and negatives in various contexts. However, adjusting these thresholds can affect the fairness of the outcomes, a critical factor that remains largely overlooked in current fair EM research. The existing body of research on fair EM tends to concentrate on static thresholds, neglecting their critical impact on fairness. To address this, we introduce a new approach in EM using recent metrics for evaluating biases in score-based binary classification, particularly through the lens of distributional parity. This approach enables the application of various bias metrics like equalized odds, equal opportunity, and demographic parity without depending on threshold settings. Our experiments with leading matching methods reveal potential biases, and by applying a calibration technique for EM scores using Wasserstein barycenters, we not only mitigate these biases but also preserve accuracy across real-world datasets. This paper contributes to the field of fairness in data cleaning, especially within EM, which is a central task in data cleaning, by promoting a method for generating matching scores that reduce biases across different thresholds.
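
To make the calibration idea in the abstract concrete, here is a minimal sketch. It is not the repository's implementation. It assumes one-dimensional matching scores in [0, 1] that are split by a sensitive group, and it relies on a standard fact: in one dimension, the quantile function of the Wasserstein-2 barycenter is the weighted average of the group quantile functions. The function names (`empirical_cdf`, `empirical_quantile`, `barycenter_calibrate`), the group labels, and the group-size weights are all assumptions made for illustration.

```python
from bisect import bisect_right


def empirical_cdf(sorted_scores, s):
    """Fraction of scores in the sorted list that are <= s."""
    return bisect_right(sorted_scores, s) / len(sorted_scores)


def empirical_quantile(sorted_scores, q):
    """Linearly interpolated empirical quantile for q in [0, 1]."""
    pos = q * (len(sorted_scores) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_scores) - 1)
    frac = pos - lo
    return sorted_scores[lo] * (1 - frac) + sorted_scores[hi] * frac


def barycenter_calibrate(scores_by_group):
    """Map each group's scores onto the 1D Wasserstein-2 barycenter.

    scores_by_group: dict mapping a (hypothetical) group label to a
    non-empty list of matching scores. Weights are proportional to
    group size, which is an assumption of this sketch.
    """
    sorted_by_group = {g: sorted(v) for g, v in scores_by_group.items()}
    total = sum(len(v) for v in sorted_by_group.values())
    weights = {g: len(v) / total for g, v in sorted_by_group.items()}

    calibrated = {}
    for g, scores in scores_by_group.items():
        out = []
        for s in scores:
            # Rank of the score within its own group ...
            q = empirical_cdf(sorted_by_group[g], s)
            # ... mapped to the weighted average of all group quantiles.
            out.append(sum(weights[h] * empirical_quantile(sorted_by_group[h], q)
                           for h in sorted_by_group))
        calibrated[g] = out
    return calibrated


# Toy scores for two hypothetical groups with very different distributions.
demo = barycenter_calibrate({"a": [0.1, 0.2, 0.3, 0.4],
                             "b": [0.5, 0.6, 0.7, 0.8]})
print(demo)
```

Because every group is pushed onto the same target distribution while the within-group ranking is preserved, the calibrated score distributions coincide across groups, so the gap in positive rates shrinks at every threshold rather than at one chosen cut-off.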

Data Directory

You can find all the data we used in the DATA directory.

The datasets come from the paper "Deep Learning for Entity Matching: A Design Space Exploration" (SIGMOD 2018); see https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md

Implementation Details

Each of the DITTO, DeepMatcher, EMTransformer, HierGAT, HierMatcher, and Magellan directories contains the implementation of that method and instructions for obtaining its results. Each directory also contains a Python script whose name starts with Train_, which you can use to retrain the network. After training, the scores for the test data are saved automatically in the SCORES directory.

Regenerating Experiments

You can regenerate the experiments with the experiments.ipynb notebook, which uses the scores in the SCORES directory. The notebook also saves some variables in .pkl format, the final results and measurements in .csv format, and the figures in .pdf format in the FIGURES directory.
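
As a rough illustration of the kind of threshold-independent measurement described in the abstract, the sketch below averages the demographic parity gap over a grid of thresholds instead of fixing one. It is not taken from experiments.ipynb. The function names, the two in-memory score lists, and the uniform threshold grid on [0, 1] are assumptions; how the files in SCORES are laid out is not shown here.

```python
def positive_rate(scores, threshold):
    """Share of pairs predicted as matches at the given threshold."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(s > threshold for s in scores) / len(scores)


def threshold_averaged_dp_gap(scores_a, scores_b, num_thresholds=101):
    """Average absolute gap in positive rates over evenly spaced thresholds.

    scores_a and scores_b are matching scores in [0, 1] for two
    (hypothetical) groups. A value of 0 means the groups have the same
    positive rate at every threshold on the grid.
    """
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    gaps = [abs(positive_rate(scores_a, t) - positive_rate(scores_b, t))
            for t in thresholds]
    return sum(gaps) / len(gaps)


# Toy example: one group receives systematically higher scores.
print(threshold_averaged_dp_gap([0.9, 0.8, 0.7], [0.3, 0.2, 0.1]))
```

The same pattern extends to equal opportunity or equalized odds by restricting each score list to the true matches or true non-matches before computing the rates.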

Citation

If you use this code, please cite our paper:

```bibtex
@inproceedings{moslemi2024threshold,
  title={Threshold-Independent Fair Matching through Score Calibration},
  author={Moslemi, Mohammad Hossein and Milani, Mostafa},
  booktitle={Proceedings of the Conference on Governance, Understanding and Integration of Data for Effective and Responsible AI},
  pages={40--44},
  year={2024}
}
```

Owner

Name: mohammad hosein moslemi
Login: mhmoslemi2338
Kind: user
Location: Tehran, Iran
Company: Sharif University of Technology, Tehran

Website: http://ee.sharif.edu/~moslemi.mohammdhosein/
Twitter: mh_moslemi
Repositories: 7
Profile: https://github.com/mhmoslemi2338

BSc in Electrical Engineering at Sharif University of Technology. My main interests are computer vision and image processing, especially medical image processing.

Citation (CITATION.cff)

cff-version: 1.2.0
message: >
  If you use this code, please cite our GUIDE-AI 2024 paper.

title: Threshold-Independent Fair Matching through Score Calibration
version: "1.0.0"
doi: 10.1145/3665601.3669845
date-released: 2024-06-14

authors:
  - family-names: Moslemi
    given-names: Mohammad Hossein
    orcid: https://orcid.org/0009-0002-0278-4665
  - family-names: Milani
    given-names: Mostafa

repository-code: https://github.com/mhmoslemi2338/CaliFair-EM
url: https://doi.org/10.1145/3665601.3669845
license: MIT

preferred-citation:
  type: conference-paper
  title: Threshold-Independent Fair Matching through Score Calibration
  authors:
    - family-names: Moslemi
      given-names: Mohammad Hossein
      orcid: https://orcid.org/0009-0002-0278-4665
    - family-names: Milani
      given-names: Mostafa
  conference-name: Proceedings of the Conference on Governance, Understanding and Integration of Data for Effective and Responsible AI (GUIDE-AI '24)
  year: 2024
  pages: 40–44
  doi: 10.1145/3665601.3669845
