leveraging_data_histories_to_improve_er

https://github.com/wolv3rine876/leveraging_data_histories_to_improve_er

Repository

Basic Info

Host: GitHub
Owner: wolv3rine876
License: mit
Language: Python
Default Branch: main
Size: 229 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 2 years ago · Last pushed over 2 years ago

Leveraging Data Histories to Improve Entity Resolution

This repository contains the source code and data to test the impact of data histories on entity resolution using Ditto [1]. It accompanies the master's thesis of the same name, which was submitted on the 1st of December 2023 at the Hasso Plattner Institut and supervised by Prof. Dr. Naumann.

Summary

In short, the code extracts a set of (table) rows from Wikipedia, including their entire history. Based on that, it provides a set of command-line utilities to generate so-called prompts that Ditto can classify. Please refer to the corresponding master's thesis for an extensive description of the pipeline and experimental setup.

Prerequisites

:warning: Archives are stored here (pw: Gnbsfkywge). Download them to the repository root. Feel free to contact me if the link has expired.

To run the source code, Python is required. Install the remaining requirements by running:

```shell
./install.sh
pip3 install -r requirements.txt
```

In addition, the file ditto_env.yaml contains information about the environment used to run Ditto.

Set up the repository as follows:

```shell
# Set up the submodules
git submodule init
git submodule update

# Move Ditto's config file into place
mv -f ditto_config.json ditto/config.json

# Extract the data archives
mkdir data
7z x prompts.7z -odata/
7z x train.7z -odata/
7z x gold-standard.7z -odata/
```

Running

In general, the full pipeline runs for multiple days. To make it possible to skip some stages, the archives gold-standard.7z, train.7z, and prompts.7z contain 1) the annotated gold standard, 2) the sampled training data with 200,000 pairs, and 3) the generated prompts. This data is based on a dump of the English Wikipedia from the 1st of June 2023.

Stages 1-4

Please refer to the submodule wiki-row-col-matcher.

Filtering

This stage takes the input from Stage 4 and applies a set of filter rules. Moreover, it extracts the wikilinks in the subject column that are used as labels.

```shell
python3 cli.py filter <input_path> data/gold-standard --force
```
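To illustrate the label extraction described above, here is a minimal sketch of pulling wikilink targets out of a subject cell. It assumes the cell is raw wikitext; the function names and the rule "use the link as a label only if the cell has exactly one distinct target" are hypothetical and not taken from the repository, whose actual filter rules are described in the thesis.

```python
import re
from typing import List, Optional

# Matches [[Target]] and [[Target|label]]; a section anchor (#...) is dropped.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def extract_wikilinks(cell: str) -> List[str]:
    """Return the normalized link targets found in a wikitext cell."""
    targets = []
    for match in WIKILINK.finditer(cell):
        target = match.group(1).strip().replace("_", " ")
        if target:
            # Wikipedia treats the first character of a title as case-insensitive.
            targets.append(target[0].upper() + target[1:])
    return targets


def label_for_row(subject_cell: str) -> Optional[str]:
    """Hypothetical rule: label a row only if its subject cell links to one article."""
    targets = set(extract_wikilinks(subject_cell))
    return targets.pop() if len(targets) == 1 else None
```

Rows whose subject cell has no link, or links to several different articles, get no label under this assumed rule and would be filtered out.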

Grouping & Sampling

Grouping by links and sampling the training pairs is done in a single command. Given the sample size s (passed via -s), the command randomly chooses s/2 row pairs that refer to the same article (considered a match) and s/2 row pairs with different links (considered a non-match) that follow the (Jaccard) similarity distribution of the matches. That way, matches and non-matches cannot be distinguished by their similarity alone, which makes the learning task harder for the model.

```shell
python3 cli.py sample data/gold-standard data/train -s 200000
```
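The idea of sampling non-matches that follow the similarity distribution of the matches can be sketched as follows. This is not the repository's implementation: it assumes token-level Jaccard similarity, equal-width similarity bins, and pairs given as plain text tuples, and all function names are hypothetical.

```python
import random
from collections import defaultdict


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the whitespace-token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def bucket(sim: float, n_bins: int = 10) -> int:
    """Map a similarity in [0, 1] to one of n_bins equal-width bins."""
    return min(int(sim * n_bins), n_bins - 1)


def sample_pairs(matches, non_matches, s, n_bins=10, seed=0):
    """Draw s/2 matches, then non-matches with the same per-bin similarity counts.

    Each input pair is a (left_text, right_text) tuple.
    Returns (left_text, right_text, label) triples with label 1 for matches.
    """
    rng = random.Random(seed)
    pos = rng.sample(matches, s // 2)

    # Count how many sampled matches fall into each similarity bin.
    wanted = defaultdict(int)
    for left, right in pos:
        wanted[bucket(jaccard(left, right), n_bins)] += 1

    # Index the candidate non-matches by the same bins.
    pool = defaultdict(list)
    for left, right in non_matches:
        pool[bucket(jaccard(left, right), n_bins)].append((left, right))

    neg = []
    for b, count in wanted.items():
        # If a bin holds too few non-matches, take all of them (bin stays under-filled).
        neg.extend(rng.sample(pool[b], min(count, len(pool[b]))))

    return [(l, r, 1) for l, r in pos] + [(l, r, 0) for l, r in neg]
```

Matching the histogram per bin is one simple way to align the two distributions; with finer bins the match is closer, but more bins may run out of non-match candidates.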

Generation

A set of formatters implements the proposed serialization methods. Please refer to the documentation of the classes in sampling/formatter. To generate prompts in the format called roUnq, use the following command:

```shell
python3 cli.py gen-prompts data/train data/prompts -fmt ro_concat_hist -fs DISTINCT True -fs SEP False
```
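As a rough illustration of what serializing a row together with its history can look like, the sketch below concatenates the historical values of each column in Ditto's `COL ... VAL ...` style and writes a pair as one tab-separated line. The function names and the interpretation of the `distinct` and `sep` options are assumptions made for this example; the authoritative behavior of `ro_concat_hist` and its DISTINCT and SEP settings is documented in sampling/formatter.

```python
def serialize_row(history, distinct=True, sep=False):
    """Serialize one row with its history in Ditto's COL/VAL style.

    history: list of row versions, oldest first; each version maps column -> value.
    distinct: keep each value of a column only once (assumed meaning of DISTINCT).
    sep: put " | " between the versions of a value (assumed meaning of SEP).
    """
    columns = list(history[-1])  # use the newest version's column order
    parts = []
    for col in columns:
        values = [str(v[col]) for v in history if v.get(col) not in (None, "")]
        if distinct:
            values = list(dict.fromkeys(values))  # drop repeats, keep first-seen order
        joined = (" | " if sep else " ").join(values)
        parts.append(f"COL {col} VAL {joined}")
    return " ".join(parts)


def serialize_pair(left_history, right_history, label, **options):
    """One line of a Ditto input file: left <TAB> right <TAB> label."""
    left = serialize_row(left_history, **options)
    right = serialize_row(right_history, **options)
    return f"{left}\t{right}\t{label}"
```

For example, a row whose team column changed from A to B serializes to `COL name VAL Bob COL team VAL A B` under these assumptions, so the model sees both the old and the current value.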

Classification

For the final classification using Ditto, use the following command:

```shell
export CUDA_VISIBLE_DEVICES=0;
python3 ditto/train_ditto.py \
  --task xl_roUnq \
  --batch_size 64 \
  --max_len 256 \
  --lr 3e-5 \
  --n_epochs 40 \
  --lm roberta \
  --fp16 \
  --logdir <log_path> \
  --save_model
```

Bibliography

[1] Y. Li, J. Li, Y. Suhara, A. Doan, and W.-C. Tan, “Deep entity matching with pre-trained language models”, Proceedings of the VLDB Endowment, vol. 14, no. 1, 2020, ISSN: 2150-8097. doi: 10.14778/3421424.3421431. [Online]. Available: https://arxiv.org/abs/2004.00584

Owner

Login: wolv3rine876
Kind: user

Repositories: 1
Profile: https://github.com/wolv3rine876

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Leveraging Data Histories to Improve Entity Resolution
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Sebastian
    family-names: Apitz
    affiliation: Hasso Plattner Institut
repository-code: >-
  https://github.com/wolv3rine876/Leveraging_Data_Histories_to_Improve_ER
keywords:
  - Entity Resolution
  - Temporal Data
  - Wikipedia
license: MIT
