https://github.com/ighina/latinwsd
Repository for the paper "Language Pivoting from Parallel Corpora for Word Sense Disambiguation of Historical Languages: a Case Study on Latin", presented at LREC-COLING 2024
Science Score: 13.0%
This score indicates how likely this project is to be science-related, based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 11.4%, to scientific vocabulary)
Basic Info
- Host: GitHub
- Owner: Ighina
- Language: Jupyter Notebook
- Default Branch: main
- Size: 14.6 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
LatinWSD
Code for the paper "Language Pivoting from Parallel Corpora for Word Sense Disambiguation of Historical Languages: a Case Study on Latin" presented at LREC-COLING 2024.
Basic Usage
Downloading the model and tokenizer
To use the code, first download the Latin BERT model and its tokenizer, following the instructions in the Latin BERT repository. Then create a base_models folder:
```
mkdir base_models
```
Move both latin.subword.encoder and the latin_bert folder into base_models (the training commands below expect the encoder under base_models/subword_tokenizer_latin/).
Installing libraries
Install the required libraries into your environment with:
pip install -r requirements.txt
Language Pivoting
The language pivoting approach described in the paper was performed by first preprocessing the Dynamic Lexicon Latin-English parallel corpus with the scripts available in the preprocess folder. Once the final csv file was obtained from preprocessing, the English column was passed through the AMuSE-WSD system and the annotations were propagated with one of the two methods described in the paper (for the align method, we use the target word column in the csv file to identify the English lemma corresponding to the target Latin one).
The datasets thus obtained are stored in the data folder of this repository.
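The align method can be illustrated with a minimal sketch; the row schema, function name, and sense id below are hypothetical illustrations, not the repository's actual code or data format:

```python
# Minimal sketch of the "align" propagation method: project an English
# sense annotation (e.g. from AMuSE-WSD) onto the Latin target lemma,
# using the csv's target word column to pick the aligned English lemma.
# All names and ids here are made up for illustration.

def propagate_align(row, english_senses):
    """Return the sense tagged on the English lemma aligned to the
    Latin target, or None if that lemma was left unannotated."""
    english_lemma = row["target_word"]        # English lemma aligned to the Latin target
    return english_senses.get(english_lemma)  # None if no annotation for that lemma

# Toy example: the English side of a sentence containing Latin "acies"
# was sense-tagged, and the target word column gives the aligned lemma.
row = {"latin_sentence": "aciem instruxit", "target_word": "line"}
senses = {"line": "wn:line.n.01"}             # made-up sense id
print(propagate_align(row, senses))           # → wn:line.n.01
```

The None case matters in practice: rows whose aligned English lemma received no annotation contribute no silver label for the Latin target.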
Run the main script
To fine-tune LatinBERT models on all the target lemmas from the semeval dataset with the addition of the Pers_inter data, run:
```
python scripts/latin_wsd_bert.py train --bertPath base_models/latin_bert --tokenizerPath base_models/subword_tokenizer_latin/latin.subword.encoder -f data/semeval_wsd_bert.model --max_epochs 20 -i data/semeval_wsd.data -add data/silver_inter_wsd.data -name semeval_with_inter -save -pre -nod
```
Similarly, substitute "silver_inter_wsd.data" with "silver_align_wsd.data" or "silver_rare_wsd.data" to run the models with the addition of the Pers_align or Pers_rare datasets, also changing the "-name" option to a different name under which to store the results.
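The three silver-data variants can be launched in sequence with a small driver script. This is a convenience sketch, not part of the repository: the flags are copied from the command above, and the subprocess call is left commented out so the loop only prints the commands.

```python
# Build and (optionally) run the three silver-data training commands.
# Dataset/name pairs follow the substitution rule described above;
# by default the loop only prints each command (dry run).
import subprocess

def silver_command(setting):
    return [
        "python", "scripts/latin_wsd_bert.py", "train",
        "--bertPath", "base_models/latin_bert",
        "--tokenizerPath", "base_models/subword_tokenizer_latin/latin.subword.encoder",
        "-f", "data/semeval_wsd_bert.model",
        "--max_epochs", "20",
        "-i", "data/semeval_wsd.data",
        "-add", f"data/silver_{setting}_wsd.data",
        "-name", f"semeval_with_{setting}",
        "-save", "-pre", "-nod",
    ]

for setting in ("inter", "align", "rare"):
    print(" ".join(silver_command(setting)))
    # subprocess.run(silver_command(setting), check=True)  # uncomment to actually train
```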
To train on just one of the above datasets instead, run:
```
python scripts/latin_wsd_bert.py train --bertPath base_models/latin_bert --tokenizerPath base_models/subword_tokenizer_latin/latin.subword.encoder -f data/semeval_wsd_bert.model --max_epochs 20 -i data/semeval_wsd.data -name semeval_only -save -pre -nod
```
Again, change "data/semeval_wsd.data" to another of the datasets available in the data folder to train on a different one.
In all cases, results will be stored in a {name}.json file, where {name} is the value passed to the "-name" option. The model checkpoints will be saved under saved_models in the format {lemma}.bin, where {lemma} is the target lemma on which that specific instance was fine-tuned. Make sure to save the checkpoints separately before re-running the experiments with different settings, as they will otherwise be overwritten.
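One way to save checkpoints between runs is a small helper that copies them aside. In this sketch, saved_models and the {lemma}.bin pattern come from the description above, while the archive location and helper name are arbitrary assumptions:

```python
# Copy saved_models/*.bin into a per-run archive folder so the next run
# does not overwrite them. "archived_models" is an arbitrary choice for
# this sketch; only "saved_models" is named by the README.
import shutil
from pathlib import Path

def archive_checkpoints(run_name, src="saved_models", dst_root="archived_models"):
    """Copy every {lemma}.bin checkpoint into dst_root/run_name and
    return the list of archived filenames."""
    dst = Path(dst_root) / run_name
    dst.mkdir(parents=True, exist_ok=True)
    archived = []
    for ckpt in sorted(Path(src).glob("*.bin")):  # one {lemma}.bin per target lemma
        shutil.copy2(ckpt, dst / ckpt.name)
        archived.append(ckpt.name)
    return archived
```

For example, calling archive_checkpoints("semeval_with_inter") after a run would copy the current checkpoints into archived_models/semeval_with_inter/ before the next experiment starts.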
Owner
- Name: Iacopo Ghinassi
- Login: Ighina
- Kind: user
- Location: London
- Company: Queen Mary University of London
- Repositories: 5
- Profile: https://github.com/Ighina
PhD Candidate in Speech and Language Processing, passionate about Digital Humanities and building things from scratch.
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Dependencies
- Unidecode ==1.3.8
- beautifulsoup4 ==4.12.3
- cltk ==1.1.6
- future ==1.0.0
- pygame ==2.5.2
- tensor2tensor ==1.15.7
- tensorflow ==2.15.0
- tokenizers ==0.15.2
- torch ==2.2.1
- torchdata ==0.7.1
- torchtext ==0.17.1
- tqdm ==4.66.2
- transformers ==4.38.1