libertus
Multilingual BERT model for Ancient and Historical Languages for SIGTYP Shared Task 2024
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.8%) to scientific vocabulary
Repository
Multilingual BERT model for Ancient and Historical Languages for SIGTYP Shared Task 2024
Basic Info
Statistics
- Stars: 3
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
🪐 LiBERTus - A Multilingual Language Model for Ancient and Historical Languages
Submission to Task 1 (Constrained) of the SIGTYP 2024 Shared Task on Word
Embedding Evaluation for Ancient and Historical
Languages. The system is built by
first pretraining a multilingual language model and then finetuning it for a
downstream task. The submissions for Phases 1 and 2 of the Shared Task can be
found in the submission_p1 and submission_p2 directories.
📋 project.yml
The project.yml defines the data assets required by the
project, as well as the available commands and workflows. For details, see the
Weasel documentation.
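For orientation, a Weasel project.yml generally follows the shape sketched below. This is an illustrative fragment only, not the repository's actual file: the asset source, script paths, and variable names here are hypothetical.

```yaml
title: "LiBERTus"
description: "Multilingual LM for ancient and historical languages"
vars:
  seed: 42                      # hypothetical variable
directories: ["assets", "corpus", "training", "metrics"]
assets:
  - dest: "assets/train/"
    git:
      repo: "https://github.com/example/shared-task-data"  # hypothetical source
      path: "data/train"
commands:
  - name: "create-pretraining"
    help: "Create corpus for multilingual LM pretraining"
    script:
      - "python scripts/create_pretraining.py"  # hypothetical script path
    outputs:
      - "corpus/pretraining.txt"
workflows:
  pretrain:
    - create-pretraining
    - create-vocab
    - pretrain-model
```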
⏯ Commands
The following commands are defined by the project. They
can be executed using weasel run [name].
Commands are only re-run if their inputs have changed.
| Command | Description |
| --- | --- |
| create-pretraining | Create corpus for multilingual LM pretraining |
| create-vocab | Train a tokenizer to create a vocabulary |
| pretrain-model | Pretrain a multilingual LM from a corpus |
| pretrain-model-from-checkpoint | Pretrain a multilingual LM from a corpus based on a checkpoint |
| upload-to-hf | Upload pretrained model and corresponding tokenizer to the HuggingFace repository |
| convert-to-spacy-merged | Convert CoNLL-U files into spaCy format for finetuning |
| convert-to-spacy | Convert CoNLL-U files into spaCy format for finetuning |
| finetune-tok2vec-model | Finetune a tok2vec model given training and validation corpora |
| finetune-trf-model | Finetune a transformer model given training and validation corpora |
| finetune-with-merged-corpus | Finetune a transformer model on the combined training and validation corpora |
| package-model | Package model and upload to HuggingFace |
| evaluate-model-dev | Evaluate a model on the validation set |
| plot-figures | Plot figures for the writeup |
| setup-test | Install models from HuggingFace via pip |
| download-models-locally | Download models from HuggingFace |
| get-test-results | Get results from the test file |
| zip-results-p1 | Zip the results into a single file for submission (Phase 1) |
| zip-results-p2 | Zip the results into a single file for submission (Phase 2) |
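As a usage sketch (assuming Weasel is installed, e.g. via pip install weasel), a single command from the table is invoked by name from the project root:

```sh
# Run one command by name; Weasel skips it if its inputs are unchanged
weasel run create-pretraining

# Force a re-run even when inputs look unchanged
weasel run create-pretraining --force
```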
⏭ Workflows
The following workflows are defined by the project. They
can be executed using weasel run [name]
and will run the specified commands in order. Commands are only re-run if their
inputs have changed.
| Workflow | Steps |
| --- | --- |
| pretrain | create-pretraining → create-vocab → pretrain-model |
| finetune | convert-to-spacy → finetune-trf-model → evaluate-model-dev |
| experiment-merged | convert-to-spacy-merged → finetune-with-merged-corpus |
| experiment-sampling | create-vocab → pretrain-model |
| make-submission-p1 | setup-test → get-test-results → zip-results-p1 |
| make-submission-p2 | download-models-locally → zip-results-p2 |
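A workflow is invoked the same way as a single command; its steps run in order, and steps whose inputs are unchanged are skipped. For example (assuming assets have already been fetched):

```sh
# Run the full pretraining pipeline:
# create-pretraining -> create-vocab -> pretrain-model
weasel run pretrain

# Then finetune and evaluate on the validation set
weasel run finetune
```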
🗂 Assets
The following assets are defined by the project. They can
be fetched by running weasel assets
in the project directory.
| File | Source | Description |
| --- | --- | --- |
| assets/train/ | Git | CoNLL-U training datasets for Task 0 (morphology/lemma/POS) |
| assets/dev/ | Git | CoNLL-U validation datasets for Task 0 (morphology/lemma/POS) |
| assets/test/ | Git | CoNLL-U test datasets for Task 0 (morphology/lemma/POS) |
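Fetching the assets is a one-liner from the project root; after it completes, the CoNLL-U splits should appear under the directories listed above:

```sh
# Download the datasets declared in project.yml
weasel assets

# The splits land under assets/train, assets/dev, and assets/test
ls assets/
```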
📄 Cite
If you use any of the code or models, please cite:
@inproceedings{miranda-2024-allen,
title = "{A}llen Institute for {AI} @ {SIGTYP} 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages",
author = "Miranda, Lester",
booktitle = "Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP",
month = mar,
year = "2024",
address = "St. Julian's, Malta",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.sigtyp-1.18",
pages = "151--159",
}
Owner
- Name: Lj Miranda
- Login: ljvmiranda921
- Kind: user
- Company: @explosion
- Website: https://ljvmiranda921.github.io/
- Twitter: ljvmiranda
- Repositories: 40
- Profile: https://github.com/ljvmiranda921
Machine Learning Engineer at @explosion 💥
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: >-
Allen Institute for AI @ SIGTYP 2024 Shared Task on Word
Embedding Evaluation for Ancient and Historical Languages
message: 'https://aclanthology.org/2024.sigtyp-1.18/'
type: software
authors:
  - given-names: Lester James
    family-names: Miranda
    email: ljm@allenai.org
    affiliation: Allen Institute for Artificial Intelligence
    orcid: 'https://orcid.org/0000-0002-7872-6464'
repository-code: 'https://github.com/ljvmiranda921/LiBERTus'
abstract: >-
In this paper, we describe Allen AI’s submission to the
constrained track of the SIGTYP 2024 Shared Task. Using
only the data provided by the organizers, we pretrained a
transformer-based multilingual model, then finetuned it on
the Universal Dependencies (UD) annotations of a given
language for a downstream task. Our systems achieved
decent performance on the test set, beating the baseline
in most language-task pairs, yet struggles with subtoken
tags in multiword expressions as seen in Coptic and
Ancient Hebrew. On the validation set, we obtained ≥70%
F1-score on most language-task pairs. In addition, we
also explored the cross-lingual capability of our trained
models. This paper highlights our pretraining and
finetuning process, and our findings from our internal
evaluations.
keywords:
- ancient languages
- multilingual nlp
- word embeddings
license: MIT
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 3
- Total pull requests: 16
- Average time to close issues: 17 days
- Average time to close pull requests: 2 days
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 16
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- ljvmiranda921 (3)
Pull Request Authors
- ljvmiranda921 (18)
Dependencies
- cloudpathlib *
- conllu *
- matplotlib *
- numpy *
- spacy >=3.6.0,<3.7.0
- spacy-huggingface-hub *
- spacy-transformers *
- torch *
- tqdm *
- transformers >=4.35.0
- typer *
- wandb *
- wasabi *
- weasel *
- numpy *
- pytest *
- spacy >=3.6.1,<3.7.0
- spacy-transformers *
- wasabi *