pre-em-bias
Official implementation of the IEEE Big Data 2024 paper "Evaluating Blocking Biases in Entity Matching"
Repository
Basic Info
- Host: GitHub
- Owner: mhmoslemi2338
- License: mit
- Language: Jupyter Notebook
- Default Branch: master
- Homepage: https://ieeexplore.ieee.org/document/10825531
- Size: 38.1 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Official implementation of the IEEE Big Data 2024 paper "Evaluating Blocking Biases in Entity Matching"
1- Dataset characteristics
The table reports summary statistics for each dataset. Numbers in parentheses are the corresponding counts for the minority group. For example, the WAL–AMZ entry for |D₁|, 2.6k (96), means that D₁ contains 2.6k entities, 96 of which belong to the minority group. Majority-group counts can be obtained by subtracting the minority counts from the totals.
| Dataset | #Attr. | \|D₁\| | \|D₂\| | \|P\| | \|M\| |
|------------|--------|------------|-------------|---------------|-----------|
| WAL–AMZ | 5 | 2.6k (96) | 22.0k (172) | 56.4m (2.5m) | 962 (88) |
| BEER | 4 | 4.3k (1.3k) | 3.0k (932) | 13.0m (6.8m) | 68 (29) |
| AMZ–GOO | 3 | 1.4k (83) | 3.2k (4) | 4.4m (272.9k) | 1.2k (60) |
| FOD–ZAG | 6 | 533 (72) | 331 (63) | 176.4k (25.2k) | 111 (10) |
| ITU–AMZ | 8 | 6.9k (1.9k) | 55.9k (12.7k) | 386.2m (171.0m) | 132 (40) |
| DBLP–GOO | 4 | 2.6k (191) | 64.3k (389) | 168.1m (13.2m) | 5.3k (403) |
| DBLP–ACM | 4 | 2.6k (251) | 2.3k (225) | 6.0m (1.1m) | 2.2k (310) |
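The subtraction rule can be illustrated with the FOD–ZAG row, whose entity and match counts are small enough to be printed exactly. This is only a sketch: the dictionary layout and the `majority_counts` helper are hypothetical, and treating |P| as the product |D₁| × |D₂| is an assumption (it does agree with the rounded 176.4k in the table).

```python
# FOD-ZAG row from the dataset table: (total, minority) per column.
# Entries such as "2.6k" in other rows are rounded, so subtracting them
# would only give approximate majority counts.
fod_zag = {
    "D1": (533, 72),  # |D1|
    "D2": (331, 63),  # |D2|
    "M": (111, 10),   # |M|
}


def majority_counts(row):
    """Majority count = total count - minority count, per column."""
    return {name: total - minority for name, (total, minority) in row.items()}


majority = majority_counts(fod_zag)
print(majority)

# Assumption: |P| counts all cross-dataset pairs, i.e. |D1| * |D2|.
candidate_pairs = fod_zag["D1"][0] * fod_zag["D2"][0]
print(candidate_pairs)  # compare with the rounded 176.4k in the table
```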
4- Assessing Bias Propagating from Blocking to Matching
This document presents the results of assessing bias propagation from blocking to matching using various blocking methods. The experiments focus on comparing fairness metrics such as Equal Opportunity (EO), Equalized Odds (EOP), and Demographic Parity (DP), along with confusion matrix elements (TP, FN, FP, TN) for minority and majority groups across several datasets.
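The README does not spell out the metric formulas, so the following is a minimal sketch under stated assumptions rather than the authors' implementation: EO is taken as the absolute difference in true-positive rate between groups, and DP as the majority positive-prediction rate minus the minority one (signed). Function and variable names are hypothetical. EOP is not computed separately here; every reported row has FP=0 and lists identical EOP and EO values.

```python
def rates(c):
    """Return (true-positive rate, positive-prediction rate) for one group."""
    total = c["TP"] + c["FN"] + c["FP"] + c["TN"]
    tpr = c["TP"] / (c["TP"] + c["FN"])
    positive_rate = (c["TP"] + c["FP"]) / total
    return tpr, positive_rate


def fairness_gaps(minority, majority):
    tpr_min, pr_min = rates(minority)
    tpr_maj, pr_maj = rates(majority)
    eo = abs(tpr_min - tpr_maj)  # assumed: absolute TPR difference
    dp = pr_maj - pr_min         # assumed: signed, majority minus minority
    return eo, dp


# Beer dataset, StandardBlocking (SB) counts as reported in this README.
beer_sb_minority = {"TP": 28, "FN": 1, "FP": 0, "TN": 6754455}
beer_sb_majority = {"TP": 37, "FN": 2, "FP": 0, "TN": 6280477}

eo, dp = fairness_gaps(beer_sb_minority, beer_sb_majority)
print(f"EO={eo:.4f}, DP={dp:.4e}")
```

Under these assumptions the sketch reproduces the Beer SB values listed in the results (EO 0.0168, DP 1.7458e-06); the signed DP convention also accounts for the negative DP values reported for Walmart-Amazon.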
Blocking Methods
- SB: StandardBlocking
- EQG: ExtendedQGramsBlocking
- ESA: ExtendedSuffixArraysBlocking
- QG: QGramsBlocking
- SA: SuffixArraysBlocking
- CTT: CTT
- AE: AUTO
Beer Dataset Results
StandardBlocking (SB)
- EOP: 0.0168
- EO: 0.0168
- DP: 1.7458e-06
- Confusion Matrix:
  - Minority: TP=28, FN=1, FP=0, TN=6754455
  - Majority: TP=37, FN=2, FP=0, TN=6280477
ExtendedQGramsBlocking (EQG)
- EOP: 0.0937
- EO: 0.0937
- DP: 1.2682e-06
- Confusion Matrix:
  - Minority: TP=28, FN=1, FP=0, TN=6754455
  - Majority: TP=34, FN=5, FP=0, TN=6280477
ExtendedSuffixArraysBlocking (ESA)
- EOP: 0.0592
- EO: 0.0592
- DP: 1.4162e-06
- Confusion Matrix:
  - Minority: TP=27, FN=2, FP=0, TN=6754455
  - Majority: TP=34, FN=5, FP=0, TN=6280477
QGramsBlocking (QG)
- EOP: 0.0681
- EO: 0.0681
- DP: 1.4274e-06
- Confusion Matrix:
  - Minority: TP=28, FN=1, FP=0, TN=6754455
  - Majority: TP=35, FN=4, FP=0, TN=6280477
SuffixArraysBlocking (SA)
- EOP: 0.0849
- EO: 0.0849
- DP: 1.2570e-06
- Confusion Matrix:
  - Minority: TP=27, FN=2, FP=0, TN=6754455
  - Majority: TP=33, FN=6, FP=0, TN=6280477
Fodors-Zagat Dataset Results
StandardBlocking (SB)
- EOP: 0.0
- EO: 0.0
- DP: 0.0002317
- Confusion Matrix:
  - Minority: TP=11, FN=0, FP=0, TN=25204
  - Majority: TP=101, FN=0, FP=0, TN=151107
ExtendedQGramsBlocking (EQG)
- EOP: 0.0
- EO: 0.0
- DP: 0.0002317
- Confusion Matrix:
  - Minority: TP=11, FN=0, FP=0, TN=25204
  - Majority: TP=101, FN=0, FP=0, TN=151107
ExtendedSuffixArraysBlocking (ESA)
- EOP: 0.0297
- EO: 0.0297
- DP: 0.0002119
- Confusion Matrix:
  - Minority: TP=11, FN=0, FP=0, TN=25204
  - Majority: TP=98, FN=3, FP=0, TN=151107
QGramsBlocking (QG)
- EOP: 0.0
- EO: 0.0
- DP: 0.0002317
- Confusion Matrix:
  - Minority: TP=11, FN=0, FP=0, TN=25204
  - Majority: TP=101, FN=0, FP=0, TN=151107
SuffixArraysBlocking (SA)
- EOP: 0.0
- EO: 0.0
- DP: 0.0002317
- Confusion Matrix:
  - Minority: TP=11, FN=0, FP=0, TN=25204
  - Majority: TP=101, FN=0, FP=0, TN=151107
Walmart-Amazon Dataset Results
StandardBlocking (SB)
- EOP: 0.0147
- EO: 0.0147
- DP: -1.7728e-05
- Confusion Matrix:
  - Minority: TP=86, FN=2, FP=0, TN=2541792
  - Majority: TP=867, FN=7, FP=0, TN=53834242
ExtendedQGramsBlocking (EQG)
- EOP: 0.0113
- EO: 0.0113
- DP: -1.7784e-05
- Confusion Matrix:
  - Minority: TP=86, FN=2, FP=0, TN=2541792
  - Majority: TP=864, FN=10, FP=0, TN=53834242
ExtendedSuffixArraysBlocking (ESA)
- EOP: 0.0106
- EO: 0.0106
- DP: -1.5915e-05
- Confusion Matrix:
  - Minority: TP=77, FN=11, FP=0, TN=2541792
  - Majority: TP=774, FN=100, FP=0, TN=53834242
QGramsBlocking (QG)
- EOP: 0.0147
- EO: 0.0147
- DP: -1.7728e-05
- Confusion Matrix:
  - Minority: TP=86, FN=2, FP=0, TN=2541792
  - Majority: TP=867, FN=7, FP=0, TN=53834242
SuffixArraysBlocking (SA)
- EOP: 0.0528
- EO: 0.0528
- DP: -1.5020e-05
- Confusion Matrix:
  - Minority: TP=76, FN=12, FP=0, TN=2541792
  - Majority: TP=801, FN=73, FP=0, TN=53834242
Amazon-Google Dataset Results
StandardBlocking (SB)
- EOP: 0.0171
- EO: 0.0171
- DP: 5.1505e-05
- Confusion Matrix:
  - Minority: TP=58, FN=2, FP=0, TN=272818
  - Majority: TP=1089, FN=18, FP=0, TN=4123053
ExtendedQGramsBlocking (EQG)
- EOP: 0.0616
- EO: 0.0616
- DP: 5.9401e-05
- Confusion Matrix:
  - Minority: TP=53, FN=7, FP=0, TN=272818
  - Majority: TP=1046, FN=61, FP=0, TN=4123053
ExtendedSuffixArraysBlocking (ESA)
- EOP: 0.1816
- EO: 0.1816
- DP: 8.1097e-05
- Confusion Matrix:
  - Minority: TP=40, FN=20, FP=0, TN=272818
  - Majority: TP=939, FN=168, FP=0, TN=4123053
QGramsBlocking (QG)
- EOP: 0.0100
- EO: 0.0100
- DP: 4.4230e-05
- Confusion Matrix:
  - Minority: TP=58, FN=2, FP=0, TN=272818
  - Majority: TP=1059, FN=48, FP=0, TN=4123053
SuffixArraysBlocking (SA)
- EOP: 0.1601
- EO: 0.1601
- DP: 7.8562e-05
- Confusion Matrix:
  - Minority: TP=44, FN=16, FP=0, TN=272818
  - Majority: TP=989, FN=118, FP=0, TN=4123053
DBLP-GoogleScholar Dataset Results
StandardBlocking (SB)
- EOP: 0.0065
- EO: 0.0065
- DP: 4.3342e-08
- Confusion Matrix:
  - Minority: TP=37, FN=0, FP=0, TN=1128018
  - Majority: TP=460, FN=3, FP=0, TN=14005497
ExtendedQGramsBlocking (EQG)
- EOP: 0.0108
- EO: 0.0108
- DP: -9.9454e-08
- Confusion Matrix:
  - Minority: TP=37, FN=0, FP=0, TN=1128018
  - Majority: TP=458, FN=5, FP=0, TN=14005497
ExtendedSuffixArraysBlocking (ESA)
- EOP: 0.0421
- EO: 0.0421
- DP: -1.1407e-06
- Confusion Matrix:
  - Minority: TP=36, FN=1, FP=0, TN=1128018
  - Majority: TP=431, FN=32, FP=0, TN=14005497
QGramsBlocking (QG)
- EOP: 0.0065
- EO: 0.0065
- DP: 4.3342e-08
- Confusion Matrix:
  - Minority: TP=37, FN=0, FP=0, TN=1128018
  - Majority: TP=460, FN=3, FP=0, TN=14005497
SuffixArraysBlocking (SA)
- EOP: 0.0518
- EO: 0.0518
- DP: -1.4560e-06
- Confusion Matrix:
  - Minority: TP=37, FN=0, FP=0, TN=1128018
  - Majority: TP=439, FN=24, FP=0, TN=14005497
iTunes-Amazon Dataset Results
StandardBlocking (SB)
- EOP: 0.0
- EO: 0.0
- DP: 4.5370e-07
- Confusion Matrix:
  - Minority: TP=2, FN=0, FP=0, TN=15804497
  - Majority: TP=11, FN=0, FP=0, TN=18957434
ExtendedQGramsBlocking (EQG)
- EOP: 0.5
- EO: 0.5
- DP: 5.1697e-07
- Confusion Matrix:
  - Minority: TP=1, FN=1, FP=0, TN=15804497
  - Majority: TP=11, FN=0, FP=0, TN=18957434
ExtendedSuffixArraysBlocking (ESA)
- EOP: 0.2727
- EO: 0.2727
- DP: 2.9545e-07
- Confusion Matrix:
  - Minority: TP=2, FN=0, FP=0, TN=15804497
  - Majority: TP=8, FN=3, FP=0, TN=18957434
QGramsBlocking (QG)
- EOP: 0.0
- EO: 0.0
- DP: 4.5370e-07
- Confusion Matrix:
  - Minority: TP=2, FN=0, FP=0, TN=15804497
  - Majority: TP=11, FN=0, FP=0, TN=18957434
SuffixArraysBlocking (SA)
- EOP: 0.0
- EO: 0.0
- DP: 4.5370e-07
- Confusion Matrix:
  - Minority: TP=2, FN=0, FP=0, TN=15804497
  - Majority: TP=11, FN=0, FP=0, TN=18957434
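The results show that the choice of blocking method changes both the fairness gaps and the per-group confusion matrices. For example, on Amazon-Google the EO gap ranges from 0.0100 (QG) to 0.1816 (ESA), and on iTunes-Amazon a single missed minority match under EQG yields an EO of 0.5.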
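To compare methods programmatically, the confusion-matrix lines can be parsed directly from their `TP=..., FN=..., FP=..., TN=...` text form. The sketch below does this for the Amazon-Google rows and ranks the blocking methods by the absolute true-positive-rate gap between groups. The helper names are hypothetical, and the gap definition is an assumption that happens to match the EO values listed for this dataset.

```python
import re

# Amazon-Google rows copied from this README: method -> (minority, majority).
ROWS = {
    "SB": ("TP=58, FN=2, FP=0, TN=272818", "TP=1089, FN=18, FP=0, TN=4123053"),
    "EQG": ("TP=53, FN=7, FP=0, TN=272818", "TP=1046, FN=61, FP=0, TN=4123053"),
    "ESA": ("TP=40, FN=20, FP=0, TN=272818", "TP=939, FN=168, FP=0, TN=4123053"),
    "QG": ("TP=58, FN=2, FP=0, TN=272818", "TP=1059, FN=48, FP=0, TN=4123053"),
    "SA": ("TP=44, FN=16, FP=0, TN=272818", "TP=989, FN=118, FP=0, TN=4123053"),
}


def parse_counts(text):
    """Turn 'TP=58, FN=2, ...' into {'TP': 58, 'FN': 2, ...}."""
    return {key: int(value) for key, value in re.findall(r"(TP|FN|FP|TN)=(\d+)", text)}


def tpr(counts):
    """True-positive rate: matches kept out of all true matches."""
    return counts["TP"] / (counts["TP"] + counts["FN"])


def tpr_gap(minority_text, majority_text):
    """Assumed EO definition: absolute TPR difference between groups."""
    return abs(tpr(parse_counts(minority_text)) - tpr(parse_counts(majority_text)))


gaps = {method: tpr_gap(*rows) for method, rows in ROWS.items()}
ranking = sorted(gaps, key=gaps.get)  # smallest gap first
for method in ranking:
    print(f"{method}: {gaps[method]:.4f}")
```

Sorting by the gap puts QG and SB first and the two suffix-array variants last, which is the same ordering as the EO values reported for Amazon-Google.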
Citation
If you use this code, please cite our paper:
```bibtex
@INPROCEEDINGS{10825531,
  author={Moslemi, Mohammad Hossein and Balamurugan, Harini and Milani, Mostafa},
  booktitle={2024 IEEE International Conference on Big Data (BigData)},
  title={Evaluating Blocking Biases in Entity Matching},
  year={2024},
  volume={},
  number={},
  pages={64-73},
  keywords={Measurement;Data integration;Big Data;Computational complexity},
  doi={10.1109/BigData62323.2024.10825531}
}
```
Owner
- Name: mohammad hosein moslemi
- Login: mhmoslemi2338
- Kind: user
- Location: tehran,Iran
- Company: sharif university of Tech, Tehran
- Website: http://ee.sharif.edu/~moslemi.mohammdhosein/
- Twitter: mh_moslemi
- Repositories: 7
- Profile: https://github.com/mhmoslemi2338
BSc in Electrical Engineering at Sharif University of Technology. My main interests are computer vision and image processing, especially medical image processing.
Citation (CITATION.cff)
cff-version: 1.2.0
message: >
  If you use this code, please cite our IEEE Big Data 2024 paper.
title: Evaluating Blocking Biases in Entity Matching
version: "1.0.0"
doi: 10.1109/BigData62323.2024.10825531
date-released: 2024-12-15
authors:
  - family-names: Moslemi
    given-names: Mohammad Hossein
    orcid: https://orcid.org/0009-0002-0278-4665
  - family-names: Balamurugan
    given-names: Harini
  - family-names: Milani
    given-names: Mostafa
repository-code: https://github.com/mhmoslemi2338/pre-EM-bias
url: https://ieeexplore.ieee.org/document/10825531
license: MIT
preferred-citation:
  type: conference-paper
  title: Evaluating Blocking Biases in Entity Matching
  authors:
    - family-names: Moslemi
      given-names: Mohammad Hossein
    - family-names: Balamurugan
      given-names: Harini
    - family-names: Milani
      given-names: Mostafa
  conference-name: 2024 IEEE International Conference on Big Data (BigData)
  year: 2024
  pages: 64–73
  doi: 10.1109/BigData62323.2024.10825531
GitHub Events
Total
- Watch event: 2
- Push event: 3
Last Year
- Watch event: 2
- Push event: 3