libertus

Multilingual BERT model for Ancient and Historical Languages for SIGTYP Shared Task 2024

https://github.com/ljvmiranda921/libertus

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.8%) to scientific vocabulary
Last synced: 8 months ago

Repository

Multilingual BERT model for Ancient and Historical Languages for SIGTYP Shared Task 2024

Basic Info
  • Host: GitHub
  • Owner: ljvmiranda921
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 12.8 MB
Statistics
  • Stars: 3
  • Watchers: 3
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed about 2 years ago
Metadata Files
Readme Citation

README.md

🪐 LiBERTus - A Multilingual Language Model for Ancient and Historical Languages

Submission to Task 1 (Constrained) of the SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages. The system is built by first pretraining a multilingual language model and then finetuning it for a downstream task. The submissions for Phases 1 and 2 of the Shared Task can be found in the submission_p1 and submission_p2 directories.

📋 project.yml

The project.yml defines the data assets required by the project, as well as the available commands and workflows. For details, see the Weasel documentation.
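The file itself is not reproduced on this page; as a rough sketch of how such a Weasel configuration ties assets, commands, and workflows together (all names, paths, and the asset URL below are illustrative assumptions, not copied from the repository):

```yaml
# Illustrative sketch of a Weasel project.yml layout.
# The actual file in the repository defines the commands,
# workflows, and assets listed in the sections below.
title: "LiBERTus"

assets:
  - dest: "assets/train/"
    git:
      repo: "https://github.com/sigtyp/ST2024"   # assumed data source
      branch: "main"
      path: "morphology"
    description: "CoNLL-U training datasets"

commands:
  - name: "create-pretraining"
    help: "Create corpus for multilingual LM pretraining"
    script:
      - "python scripts/build_corpus.py"          # hypothetical script

workflows:
  pretrain:
    - create-pretraining
    - create-vocab
    - pretrain-model
```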

⏯ Commands

The following commands are defined by the project. They can be executed using weasel run [name]. Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| create-pretraining | Create corpus for multilingual LM pretraining |
| create-vocab | Train a tokenizer to create a vocabulary |
| pretrain-model | Pretrain a multilingual LM from a corpus |
| pretrain-model-from-checkpoint | Pretrain a multilingual LM from a corpus based on a checkpoint |
| upload-to-hf | Upload pretrained model and corresponding tokenizer to the HuggingFace repository |
| convert-to-spacy-merged | Convert CoNLL-U files into spaCy format for finetuning |
| convert-to-spacy | Convert CoNLL-U files into spaCy format for finetuning |
| finetune-tok2vec-model | Finetune a tok2vec model given training and validation corpora |
| finetune-trf-model | Finetune a transformer model given training and validation corpora |
| finetune-with-merged-corpus | Finetune a transformer model on the combined training and validation corpora |
| package-model | Package model and upload to HuggingFace |
| evaluate-model-dev | Evaluate a model on the validation set |
| plot-figures | Plot figures for the writeup |
| setup-test | Install models from HuggingFace via pip |
| download-models-locally | Download models from HuggingFace |
| get-test-results | Get results from the test file |
| zip-results-p1 | Zip the results into a single file for submission (Phase 1) |
| zip-results-p2 | Zip the results into a single file for submission (Phase 2) |

⏭ Workflows

The following workflows are defined by the project. They can be executed using weasel run [name] and will run the specified commands in order. Commands are only re-run if their inputs have changed.

| Workflow | Steps |
| --- | --- |
| pretrain | create-pretraining → create-vocab → pretrain-model |
| finetune | convert-to-spacy → finetune-trf-model → evaluate-model-dev |
| experiment-merged | convert-to-spacy-merged → finetune-with-merged-corpus |
| experiment-sampling | create-vocab → pretrain-model |
| make-submission-p1 | setup-test → get-test-results → zip-results-p1 |
| make-submission-p2 | download-models-locally → zip-results-p2 |

🗂 Assets

The following assets are defined by the project. They can be fetched by running weasel assets in the project directory.

| File | Source | Description |
| --- | --- | --- |
| assets/train/ | Git | CoNLL-U training datasets for Task 0 (morphology/lemma/POS) |
| assets/dev/ | Git | CoNLL-U validation datasets for Task 0 (morphology/lemma/POS) |
| assets/test/ | Git | CoNLL-U test datasets for Task 0 (morphology/lemma/POS) |
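The assets are CoNLL-U files, a tab-separated format with ten columns per token. As a minimal standard-library sketch of what these files look like (the sample sentence below is invented for illustration, not taken from the shared-task data; the repository itself uses the conllu and spaCy libraries for this):

```python
# Minimal sketch: parse CoNLL-U token lines with only the standard
# library. The sample sentence is a made-up example.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]

sample = """\
# sent_id = example-1
# text = arma virumque cano
1\tarma\tarma\tNOUN\t_\t_\t3\tobj\t_\t_
2\tvirumque\tvir\tNOUN\t_\t_\t3\tobj\t_\t_
3\tcano\tcano\tVERB\t_\t_\t0\troot\t_\t_
"""

def parse_conllu(block: str):
    """Yield one dict per token line, skipping comments and blanks."""
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue
        yield dict(zip(CONLLU_FIELDS, line.split("\t")))

tokens = list(parse_conllu(sample))
print([t["form"] for t in tokens])   # ['arma', 'virumque', 'cano']
print([t["upos"] for t in tokens])   # ['NOUN', 'NOUN', 'VERB']
```

The upos, lemma, and feats columns are exactly the targets of the shared task's POS tagging, lemmatization, and morphology subtasks.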

📄 Cite

If you use any of the code or the models, don't forget to cite:

@inproceedings{miranda-2024-allen,
    title = "{A}llen Institute for {AI} @ {SIGTYP} 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages",
    author = "Miranda, Lester",
    booktitle = "Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP",
    month = mar,
    year = "2024",
    address = "St. Julian's, Malta",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.sigtyp-1.18",
    pages = "151--159",
}

Owner

  • Name: Lj Miranda
  • Login: ljvmiranda921
  • Kind: user
  • Company: @explosion

Machine Learning Engineer at @explosion 💥

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  Allen Institute for AI @ SIGTYP 2024 Shared Task on Word
  Embedding Evaluation for Ancient and Historical Languages
message: 'https://aclanthology.org/2024.sigtyp-1.18/'
type: software
authors:
  - given-names: Lester James
    family-names: Miranda
    email: ljm@allenai.org
    affiliation: Allen Institute for Artificial Intelligence
    orcid: 'https://orcid.org/0000-0002-7872-6464'
repository-code: 'https://github.com/ljvmiranda921/LiBERTus'
abstract: >-
  In this paper, we describe Allen AI’s submission to the
  constrained track of the SIGTYP 2024 Shared Task. Using
  only the data provided by the organizers, we pretrained a
  transformer-based multilingual model, then finetuned it on
  the Universal Dependencies (UD) annotations of a given
  language for a downstream task. Our systems achieved
  decent performance on the test set, beating the baseline
  in most language-task pairs, yet struggles with subtoken
  tags in multiword expressions as seen in Coptic and
  Ancient Hebrew. On the validation set, we obtained ≥70%
  F1 score on most language-task pairs. In addition, we
  also explored the cross-lingual capability of our trained
  models. This paper highlights our pretraining and
  finetuning process, and our findings from our internal
  evaluations.
keywords:
  - ancient languages
  - multilingual nlp
  - word embeddings
license: MIT

GitHub Events

Total
Last Year

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 74
  • Total Committers: 1
  • Avg Commits per committer: 74.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Lj Miranda l****a@g****m 74

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 3
  • Total pull requests: 16
  • Average time to close issues: 17 days
  • Average time to close pull requests: 2 days
  • Total issue authors: 1
  • Total pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 16
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ljvmiranda921 (3)
Pull Request Authors
  • ljvmiranda921 (18)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

requirements.txt pypi
  • cloudpathlib *
  • conllu *
  • matplotlib *
  • numpy *
  • spacy >=3.6.0,<3.7.0
  • spacy-huggingface-hub *
  • spacy-transformers *
  • torch *
  • tqdm *
  • transformers >=4.35.0
  • typer *
  • wandb *
  • wasabi *
  • weasel *
submission_p2/requirements.txt pypi
  • numpy *
  • pytest *
  • spacy >=3.6.1,<3.7.0
  • spacy-transformers *
  • wasabi *