https://github.com/ighina/latinwsd
Repository for the paper "Language Pivoting from Parallel Corpora for Word Sense Disambiguation of Historical Languages: a Case Study on Latin", presented at LREC-COLING 2024
Science Score: 13.0%
This score indicates how likely this project is to be science-related, based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 11.4%, to scientific vocabulary)
Basic Info
- Host: GitHub
- Owner: Ighina
- Language: Jupyter Notebook
- Default Branch: main
- Size: 14.6 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
LatinWSD
Code for the paper "Language Pivoting from Parallel Corpora for Word Sense Disambiguation of Historical Languages: a Case Study on Latin" presented at LREC-COLING 2024.
Basic Usage
Downloading the model and tokenizer
To use the code, first download the Latin BERT model and its tokenizer, following the instructions in the Latin BERT repository. Then create a base_models folder:
```
mkdir base_models
```
Move both latin.subword.encoder and the latin_bert folder into base_models (the training commands below expect the encoder under base_models/subword_tokenizer_latin/).
Installing libraries
Install the required libraries into your environment with:
pip install -r requirements.txt
Language Pivoting
The language pivoting approach described in the paper was performed by first preprocessing the Dynamic Lexicon Latin-English parallel corpus with the scripts available in the preprocess folder. Once the final csv file was obtained from preprocessing, the English column was passed through the AMuSE-WSD system and the annotations were propagated with one of the two methods described in the paper (for the align method, we use the target word column in the csv file to identify the English lemma corresponding to the target Latin one).
The datasets thus obtained are stored in the data folder of this repository.
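The align method can be illustrated with a minimal sketch; the row schema, function name, and sense id below are hypothetical illustrations, not the repository's actual code or data format:

```python
# Minimal sketch of the "align" propagation method: project an English
# sense annotation (e.g. from AMuSE-WSD) onto the Latin target lemma,
# using the csv's target word column to pick the aligned English lemma.
# All names and ids here are made up for illustration.

def propagate_align(row, english_senses):
    """Return the sense tagged on the English lemma aligned to the
    Latin target, or None if that lemma was left unannotated."""
    english_lemma = row["target_word"]        # English lemma aligned to the Latin target
    return english_senses.get(english_lemma)  # None if no annotation for that lemma

# Toy example: the English side of a sentence containing Latin "acies"
# was sense-tagged, and the target word column gives the aligned lemma.
row = {"latin_sentence": "aciem instruxit", "target_word": "line"}
senses = {"line": "wn:line.n.01"}             # made-up sense id
print(propagate_align(row, senses))           # → wn:line.n.01
```

The None case matters in practice: rows whose aligned English lemma received no annotation contribute no silver label for the Latin target.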
Run the main script
To fine-tune LatinBERT models on all the target lemmas from the semeval dataset with the addition of the Pers_inter data, run:
```
python scripts/latin_wsd_bert.py train --bertPath base_models/latin_bert --tokenizerPath base_models/subword_tokenizer_latin/latin.subword.encoder -f data/semeval_wsd_bert.model --max_epochs 20 -i data/semeval_wsd.data -add data/silver_inter_wsd.data -name semeval_with_inter -save -pre -nod
```
Similarly, substitute "silver_inter_wsd.data" with "silver_align_wsd.data" or "silver_rare_wsd.data" to run the models with the addition of the Pers_align or Pers_rare datasets, also changing the "-name" option to a different name under which to store the results.
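The three silver-data variants can be launched in sequence with a small driver script. This is a convenience sketch, not part of the repository: the flags are copied from the command above, and the subprocess call is left commented out so the loop only prints the commands.

```python
# Build and (optionally) run the three silver-data training commands.
# Dataset/name pairs follow the substitution rule described above;
# by default the loop only prints each command (dry run).
import subprocess

def silver_command(setting):
    return [
        "python", "scripts/latin_wsd_bert.py", "train",
        "--bertPath", "base_models/latin_bert",
        "--tokenizerPath", "base_models/subword_tokenizer_latin/latin.subword.encoder",
        "-f", "data/semeval_wsd_bert.model",
        "--max_epochs", "20",
        "-i", "data/semeval_wsd.data",
        "-add", f"data/silver_{setting}_wsd.data",
        "-name", f"semeval_with_{setting}",
        "-save", "-pre", "-nod",
    ]

for setting in ("inter", "align", "rare"):
    print(" ".join(silver_command(setting)))
    # subprocess.run(silver_command(setting), check=True)  # uncomment to actually train
```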
To train on just one of the above datasets instead, run:
```
python scripts/latin_wsd_bert.py train --bertPath base_models/latin_bert --tokenizerPath base_models/subword_tokenizer_latin/latin.subword.encoder -f data/semeval_wsd_bert.model --max_epochs 20 -i data/semeval_wsd.data -name semeval_only -save -pre -nod
```
Again, change "data/semeval_wsd.data" to another of the datasets available in the data folder to train on a different one.
In all cases, results will be stored in a {name}.json file, where {name} is the value passed to the "-name" option. The model checkpoints will be saved under saved_models in the format {lemma}.bin, where {lemma} is the target lemma on which that specific instance was fine-tuned. Make sure to save the checkpoints separately before re-running the experiments with different settings, as they will otherwise be overwritten.
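One way to save checkpoints between runs is a small helper that copies them aside. In this sketch, saved_models and the {lemma}.bin pattern come from the description above, while the archive location and helper name are arbitrary assumptions:

```python
# Copy saved_models/*.bin into a per-run archive folder so the next run
# does not overwrite them. "archived_models" is an arbitrary choice for
# this sketch; only "saved_models" is named by the README.
import shutil
from pathlib import Path

def archive_checkpoints(run_name, src="saved_models", dst_root="archived_models"):
    """Copy every {lemma}.bin checkpoint into dst_root/run_name and
    return the list of archived filenames."""
    dst = Path(dst_root) / run_name
    dst.mkdir(parents=True, exist_ok=True)
    archived = []
    for ckpt in sorted(Path(src).glob("*.bin")):  # one {lemma}.bin per target lemma
        shutil.copy2(ckpt, dst / ckpt.name)
        archived.append(ckpt.name)
    return archived
```

For example, calling archive_checkpoints("semeval_with_inter") after a run would copy the current checkpoints into archived_models/semeval_with_inter/ before the next experiment starts.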
Owner
- Name: Iacopo Ghinassi
- Login: Ighina
- Kind: user
- Location: London
- Company: Queen Mary University of London
- Repositories: 5
- Profile: https://github.com/Ighina
PhD Candidate in Speech and Language Processing, passionate about Digital Humanities and building things from scratch.
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Dependencies
- Unidecode ==1.3.8
- beautifulsoup4 ==4.12.3
- cltk ==1.1.6
- future ==1.0.0
- pygame ==2.5.2
- tensor2tensor ==1.15.7
- tensorflow ==2.15.0
- tokenizers ==0.15.2
- torch ==2.2.1
- torchdata ==0.7.1
- torchtext ==0.17.1
- tqdm ==4.66.2
- transformers ==4.38.1