libertus
Multilingual BERT model for Ancient and Historical Languages for SIGTYP Shared Task 2024
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.8%) to scientific vocabulary
Repository
Multilingual BERT model for Ancient and Historical Languages for SIGTYP Shared Task 2024
Basic Info
Statistics
- Stars: 3
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
🪐 LiBERTus - A Multilingual Language Model for Ancient and Historical Languages
Submission to Task 1 (Constrained) of the SIGTYP 2024 Shared Task on Word
Embedding Evaluation for Ancient and Historical
Languages. The system is built by
first pretraining a multilingual language model and then finetuning it for a
downstream task. The submissions for Phases 1 and 2 of the Shared Task can be
found in the submission_p1 and submission_p2 directories.
📋 project.yml
The project.yml defines the data assets required by the
project, as well as the available commands and workflows. For details, see the
Weasel documentation.
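For orientation, a Weasel project.yml generally follows the shape sketched below. This is an illustrative fragment only, not the repository's actual file: the asset source, script paths, and variable names here are hypothetical.

```yaml
title: "LiBERTus"
description: "Multilingual LM for ancient and historical languages"
vars:
  seed: 42                      # hypothetical variable
directories: ["assets", "corpus", "training", "metrics"]
assets:
  - dest: "assets/train/"
    git:
      repo: "https://github.com/example/shared-task-data"  # hypothetical source
      path: "data/train"
commands:
  - name: "create-pretraining"
    help: "Create corpus for multilingual LM pretraining"
    script:
      - "python scripts/create_pretraining.py"  # hypothetical script path
    outputs:
      - "corpus/pretraining.txt"
workflows:
  pretrain:
    - create-pretraining
    - create-vocab
    - pretrain-model
```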
⏯ Commands
The following commands are defined by the project. They
can be executed using weasel run [name].
Commands are only re-run if their inputs have changed.
| Command | Description |
| --- | --- |
| create-pretraining | Create corpus for multilingual LM pretraining |
| create-vocab | Train a tokenizer to create a vocabulary |
| pretrain-model | Pretrain a multilingual LM from a corpus |
| pretrain-model-from-checkpoint | Pretrain a multilingual LM from a corpus based on a checkpoint |
| upload-to-hf | Upload pretrained model and corresponding tokenizer to the HuggingFace repository |
| convert-to-spacy-merged | Convert CoNLL-U files into spaCy format for finetuning |
| convert-to-spacy | Convert CoNLL-U files into spaCy format for finetuning |
| finetune-tok2vec-model | Finetune a tok2vec model given training and validation corpora |
| finetune-trf-model | Finetune a transformer model given training and validation corpora |
| finetune-with-merged-corpus | Finetune a transformer model on the combined training and validation corpora |
| package-model | Package model and upload to HuggingFace |
| evaluate-model-dev | Evaluate a model on the validation set |
| plot-figures | Plot figures for the writeup |
| setup-test | Install models from HuggingFace via pip |
| download-models-locally | Download models from HuggingFace |
| get-test-results | Get results from the test file |
| zip-results-p1 | Zip the results into a single file for submission (Phase 1) |
| zip-results-p2 | Zip the results into a single file for submission (Phase 2) |
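As a usage sketch (assuming Weasel is installed, e.g. via pip install weasel), a single command from the table is invoked by name from the project root:

```sh
# Run one command by name; Weasel skips it if its inputs are unchanged
weasel run create-pretraining

# Force a re-run even when inputs look unchanged
weasel run create-pretraining --force
```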
⏭ Workflows
The following workflows are defined by the project. They
can be executed using weasel run [name]
and will run the specified commands in order. Commands are only re-run if their
inputs have changed.
| Workflow | Steps |
| --- | --- |
| pretrain | create-pretraining → create-vocab → pretrain-model |
| finetune | convert-to-spacy → finetune-trf-model → evaluate-model-dev |
| experiment-merged | convert-to-spacy-merged → finetune-with-merged-corpus |
| experiment-sampling | create-vocab → pretrain-model |
| make-submission-p1 | setup-test → get-test-results → zip-results-p1 |
| make-submission-p2 | download-models-locally → zip-results-p2 |
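A workflow is invoked the same way as a single command; its steps run in order, and steps whose inputs are unchanged are skipped. For example (assuming assets have already been fetched):

```sh
# Run the full pretraining pipeline:
# create-pretraining -> create-vocab -> pretrain-model
weasel run pretrain

# Then finetune and evaluate on the validation set
weasel run finetune
```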
🗂 Assets
The following assets are defined by the project. They can
be fetched by running weasel assets
in the project directory.
| File | Source | Description |
| --- | --- | --- |
| assets/train/ | Git | CoNLL-U training datasets for Task 0 (morphology/lemma/POS) |
| assets/dev/ | Git | CoNLL-U validation datasets for Task 0 (morphology/lemma/POS) |
| assets/test/ | Git | CoNLL-U test datasets for Task 0 (morphology/lemma/POS) |
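Fetching the assets is a one-liner from the project root; after it completes, the CoNLL-U splits should appear under the directories listed above:

```sh
# Download the datasets declared in project.yml
weasel assets

# The splits land under assets/train, assets/dev, and assets/test
ls assets/
```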
📄 Cite
If you use any of the code or models, please cite:
@inproceedings{miranda-2024-allen,
title = "{A}llen Institute for {AI} @ {SIGTYP} 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages",
author = "Miranda, Lester",
booktitle = "Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP",
month = mar,
year = "2024",
address = "St. Julian's, Malta",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.sigtyp-1.18",
pages = "151--159",
}
Owner
- Name: Lj Miranda
- Login: ljvmiranda921
- Kind: user
- Company: @explosion
- Website: https://ljvmiranda921.github.io/
- Twitter: ljvmiranda
- Repositories: 40
- Profile: https://github.com/ljvmiranda921
Machine Learning Engineer at @explosion 💥
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: >-
Allen Institute for AI @ SIGTYP 2024 Shared Task on Word
Embedding Evaluation for Ancient and Historical Languages
message: 'https://aclanthology.org/2024.sigtyp-1.18/'
type: software
authors:
  - given-names: Lester James
    family-names: Miranda
    email: ljm@allenai.org
    affiliation: Allen Institute for Artificial Intelligence
    orcid: 'https://orcid.org/0000-0002-7872-6464'
repository-code: 'https://github.com/ljvmiranda921/LiBERTus'
abstract: >-
In this paper, we describe Allen AI’s submission to the
constrained track of the SIGTYP 2024 Shared Task. Using
only the data provided by the organizers, we pretrained a
transformer-based multilingual model, then finetuned it on
the Universal Dependencies (UD) annotations of a given
language for a downstream task. Our systems achieved
decent performance on the test set, beating the baseline
in most language-task pairs, yet struggles with subtoken
tags in multiword expressions as seen in Coptic and
Ancient Hebrew. On the validation set, we obtained ≥70%
F1-score on most language-task pairs. In addition, we
also explored the cross-lingual capability of our trained
models. This paper highlights our pretraining and
finetuning process, and our findings from our internal
evaluations.
keywords:
- ancient languages
- multilingual nlp
- word embeddings
license: MIT
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 3
- Total pull requests: 16
- Average time to close issues: 17 days
- Average time to close pull requests: 2 days
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 16
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- ljvmiranda921 (3)
Pull Request Authors
- ljvmiranda921 (18)
Dependencies
- cloudpathlib *
- conllu *
- matplotlib *
- numpy *
- spacy >=3.6.0,<3.7.0
- spacy-huggingface-hub *
- spacy-transformers *
- torch *
- tqdm *
- transformers >=4.35.0
- typer *
- wandb *
- wasabi *
- weasel *
- numpy *
- pytest *
- spacy >=3.6.1,<3.7.0
- spacy-transformers *
- wasabi *