https://github.com/bminixhofer/focus
Official Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models" @ EMNLP 2023
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
○codemeta.json file
-
○.zenodo.json file
-
✓DOI references
Found 1 DOI reference(s) in README -
✓Academic publication links
Links to: arxiv.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (8.3%) to scientific vocabulary
Last synced: 9 months ago
·
JSON representation
Repository
Official Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models" @ EMNLP 2023
Basic Info
- Host: GitHub
- Owner: bminixhofer
- License: mit
- Language: Python
- Default Branch: main
- Homepage: https://arxiv.org/abs/2305.14481
- Size: 20 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Fork of konstantinjdobler/focus
Created over 2 years ago
· Last pushed over 2 years ago
https://github.com/bminixhofer/focus/blob/main/
# FOCUS
Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models" accepted at the EMNLP 2023 main conference.
Paper on arXiv: https://arxiv.org/abs/2305.14481.
## Installation
We provide the package via `pip install deepfocus`.
Alternatively, you can simply copy the `deepfocus` folder and drop it into your project.
The necessary dependencies are listed in `requirements.txt` (`pip install -r requirements.txt`).
## Usage
The following example shows how to use FOCUS to specialize `xlm-roberta-base` on German with a custom, language-specific tokenizer. The code is also available in [`focus_example.py`](focus_example.py).
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
from deepfocus import FOCUS
source_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
source_model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
target_tokenizer = AutoTokenizer.from_pretrained(
"./tokenizers/de/xlmr-unigram-50k"
)
# Example for training a new tokenizer:
# target_tokenizer = source_tokenizer.train_new_from_iterator(
# load_dataset("cc100", lang="de", split="train")["text"],
# vocab_size=50_048
# )
# target_tokenizer.save_pretrained("./target_tokenizer_test")
target_embeddings = FOCUS(
source_embeddings=source_model.get_input_embeddings().weight,
source_tokenizer=source_tokenizer,
target_tokenizer=target_tokenizer,
target_training_data_path="/path/to/data.txt"
# fasttext_model_path="/path/to/fasttext.bin", # or directly provide path to token-level fasttext model
# In the paper, we use `target_training_data_path` but we also implement using
# WECHSEL's word-to-subword mapping if the language has pretrained fasttext word embeddings available online
# To use, supply a two-letter `language_identifier` (e.g. "de" for German) instead of `target_training_data_path` and set:
# auxiliary_embedding_mode="fasttext-wordlevel",
# language_identifier="de",
)
source_model.resize_token_embeddings(len(target_tokenizer))
source_model.get_input_embeddings().weight.data = target_embeddings
# Continue training the model on the target language with `target_tokenizer`.
# ...
```
## Checkpoints
We publish the checkpoints trained with FOCUS on HuggingFace:
| Language | Vocabulary Replacement (preferred) | Vocabulary Extension |
|-------------|-----------------------------------------------|------------------------------------------------|
| German | [`konstantindobler/xlm-roberta-base-focus-german`](https://huggingface.co/konstantindobler/xlm-roberta-base-focus-german) | [`konstantindobler/xlm-roberta-base-focus-extend-german`](https://huggingface.co/konstantindobler/xlm-roberta-base-focus-extend-german) |
| Arabic | [`konstantindobler/xlm-roberta-base-focus-arabic`](https://huggingface.co/konstantindobler/xlm-roberta-base-focus-arabic) | [`konstantindobler/xlm-roberta-base-focus-extend-arabic`](https://huggingface.co/konstantindobler/xlm-roberta-base-focus-extend-arabic) |
| Kiswahili | [`konstantindobler/xlm-roberta-base-focus-kiswahili`](https://huggingface.co/konstantindobler/xlm-roberta-base-focus-kiswahili) | [`konstantindobler/xlm-roberta-base-focus-extend-kiswahili`](https://huggingface.co/konstantindobler/xlm-roberta-base-focus-extend-kiswahili)|
| Hausa | [`konstantindobler/xlm-roberta-base-focus-hausa`](https://huggingface.co/konstantindobler/xlm-roberta-base-focus-hausa) | [`konstantindobler/xlm-roberta-base-focus-extend-hausa`](https://huggingface.co/konstantindobler/xlm-roberta-base-focus-extend-hausa) |
| isiXhosa | [`konstantindobler/xlm-roberta-base-focus-isixhosa`](https://huggingface.co/konstantindobler/xlm-roberta-base-focus-isixhosa) | [`konstantindobler/xlm-roberta-base-focus-extend-isixhosa`](https://huggingface.co/konstantindobler/xlm-roberta-base-focus-extend-isixhosa)|
In our experiments, full vocabulary replacement coupled with FOCUS outperformed extending XLM-R's original vocabulary, while also resulting in a smaller model and being faster to train.
## Citation
You can cite FOCUS like this:
```bibtex
@inproceedings{dobler-de-melo-2023-focus,
title = "{FOCUS}: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models",
author = "Dobler, Konstantin and
de Melo, Gerard",
editor = "Bouamor, Houda and
Pino, Juan and
Bali, Kalika",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.emnlp-main.829",
doi = "10.18653/v1/2023.emnlp-main.829",
pages = "13440--13454",
}
```
If you use the "WECHSEL-style" word-to-subword mapping, please consider also citing their [original work](https://github.com/CPJKU/wechsel).
Owner
- Name: Benjamin Minixhofer
- Login: bminixhofer
- Kind: user
- Location: Linz, Austria
- Website: bmin.ai
- Twitter: bminixhofer
- Repositories: 31
- Profile: https://github.com/bminixhofer
PhD Student @cambridgeltl
GitHub Events
Total
- Fork event: 1
Last Year
- Fork event: 1