deepfocus
[EMNLP'23] Official Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models"
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 1 DOI reference in README
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.5%) to scientific vocabulary
Repository
[EMNLP'23] Official Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models"
Basic Info
- Host: GitHub
- Owner: konstantinjdobler
- License: MIT
- Language: Python
- Default Branch: main
- Homepage: https://arxiv.org/abs/2305.14481
- Size: 20 MB
Statistics
- Stars: 33
- Watchers: 1
- Forks: 5
- Open Issues: 0
- Releases: 2
Metadata Files
README.md
FOCUS
Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models" accepted at the EMNLP 2023 main conference.
Paper on arXiv: https://arxiv.org/abs/2305.14481.
Installation
We provide the package via `pip install deepfocus`.
Alternatively, you can simply copy the `deepfocus` folder and drop it into your project.
The necessary dependencies are listed in `requirements.txt` (`pip install -r requirements.txt`).
Usage
The following example shows how to use FOCUS to specialize xlm-roberta-base on German with a custom, language-specific tokenizer. The code is also available in focus_example.py.
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
from deepfocus import FOCUS

source_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
source_model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

target_tokenizer = AutoTokenizer.from_pretrained("./tokenizers/de/xlmr-unigram-50k")

# Example for training a new tokenizer (requires `from datasets import load_dataset`):
# target_tokenizer = source_tokenizer.train_new_from_iterator(
#     load_dataset("cc100", lang="de", split="train")["text"],
#     vocab_size=50_048,
# )
# target_tokenizer.save_pretrained("./target_tokenizer_test")

target_embeddings = FOCUS(
    source_embeddings=source_model.get_input_embeddings().weight,
    source_tokenizer=source_tokenizer,
    target_tokenizer=target_tokenizer,
    target_training_data_path="/path/to/data.jsonl",
    # Data should be a .jsonl file where each line is a sample like {"text": "Lorem ipsum..."}.
    # fasttext_model_path="/path/to/fasttext.bin",  # or directly provide a path to a token-level fastText model
    # In the paper, we use `target_training_data_path`, but we also implement
    # WECHSEL's word-to-subword mapping if the language has pretrained fastText
    # word embeddings available online. To use it, supply a two-letter
    # `language_identifier` (e.g. "de" for German) instead of `target_training_data_path` and set:
    # auxiliary_embedding_mode="fasttext-wordlevel",
    # language_identifier="de",
)
source_model.resize_token_embeddings(len(target_tokenizer))
source_model.get_input_embeddings().weight.data = target_embeddings

# If the model has separate output embeddings, apply FOCUS separately.
if not source_model.config.tie_word_embeddings:
    target_output_embeddings = FOCUS(
        source_embeddings=source_model.get_output_embeddings().weight,
        source_tokenizer=source_tokenizer,
        target_tokenizer=target_tokenizer,
        target_training_data_path="/path/to/data.jsonl",
        # Same argument options as above; fastText models are cached!
    )
    source_model.get_output_embeddings().weight.data = target_output_embeddings

# Continue training the model on the target language with `target_tokenizer`.
...
```
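As noted in the example above, `target_training_data_path` expects a `.jsonl` file with one JSON object per line containing a `"text"` field. A minimal sketch of producing such a file (the sample sentences are placeholders):

```python
import json

# Write a tiny .jsonl training file in the format FOCUS expects:
# one JSON object per line with a "text" field.
samples = ["Lorem ipsum dolor sit amet.", "Guten Tag, wie geht es Ihnen?"]
with open("data.jsonl", "w", encoding="utf-8") as f:
    for text in samples:
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")

# Read it back to verify the format.
with open("data.jsonl", encoding="utf-8") as f:
    lines = [json.loads(line) for line in f]
```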
Checkpoints
We publish the checkpoints trained with FOCUS on HuggingFace:
| Language | Vocabulary Replacement (preferred) | Vocabulary Extension |
|-------------|-----------------------------------------------|------------------------------------------------|
| German | konstantindobler/xlm-roberta-base-focus-german | konstantindobler/xlm-roberta-base-focus-extend-german |
| Arabic | konstantindobler/xlm-roberta-base-focus-arabic | konstantindobler/xlm-roberta-base-focus-extend-arabic |
| Kiswahili | konstantindobler/xlm-roberta-base-focus-kiswahili | konstantindobler/xlm-roberta-base-focus-extend-kiswahili|
| Hausa | konstantindobler/xlm-roberta-base-focus-hausa | konstantindobler/xlm-roberta-base-focus-extend-hausa |
| isiXhosa | konstantindobler/xlm-roberta-base-focus-isixhosa | konstantindobler/xlm-roberta-base-focus-extend-isixhosa|
In our experiments, full vocabulary replacement coupled with FOCUS outperformed extending XLM-R's original vocabulary, while also resulting in a smaller model and being faster to train.
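The checkpoint names in the table above follow a consistent pattern on the HuggingFace Hub. A small hypothetical helper (`focus_checkpoint` is not part of `deepfocus`) that builds a model id before loading it with `transformers`:

```python
# Hypothetical helper (not part of deepfocus): build the Hub id for a
# FOCUS checkpoint based on the naming pattern in the table above.
def focus_checkpoint(language: str, extend: bool = False) -> str:
    variant = "focus-extend" if extend else "focus"
    return f"konstantindobler/xlm-roberta-base-{variant}-{language}"

# Usage with transformers (downloads the checkpoint):
# from transformers import AutoModelForMaskedLM, AutoTokenizer
# model = AutoModelForMaskedLM.from_pretrained(focus_checkpoint("german"))
# tokenizer = AutoTokenizer.from_pretrained(focus_checkpoint("german"))
```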
Citation
You can cite FOCUS like this:
```bibtex
@inproceedings{dobler-de-melo-2023-focus,
    title = "{FOCUS}: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models",
    author = "Dobler, Konstantin and
      de Melo, Gerard",
    editor = "Bouamor, Houda and
      Pino, Juan and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.829",
    doi = "10.18653/v1/2023.emnlp-main.829",
    pages = "13440--13454",
}
```
If you use the "WECHSEL-style" word-to-subword mapping, please consider also citing their original work.
Owner
- Name: Konstantin Dobler
- Login: konstantinjdobler
- Kind: user
- Location: Potsdam, Germany
- Company: Hasso Plattner Institute
- Website: konstantindobler.me
- Repositories: 36
- Profile: https://github.com/konstantinjdobler
PhD student @ HPI working with language models
Citation (CITATION.cff)
cff-version: 1.2.0
title: "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models"
message: "If you use this software, please cite the paper from preferred-citation."
type: software
url: "https://github.com/konstantinjdobler/focus"
authors:
- given-names: Konstantin
family-names: Dobler
- given-names: Gerard
family-names: Melo
name-particle: de
preferred-citation:
abbreviation: dobler-demelo-2023-focus
type: conference-paper
authors:
- given-names: Konstantin
family-names: Dobler
- given-names: Gerard
family-names: Melo
name-particle: de
title: "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models"
year: 2023
publisher: "Empirical Methods in Natural Language Processing"
url: "https://arxiv.org/abs/2305.14481"
GitHub Events
Total
- Watch event: 8
- Push event: 2
- Fork event: 2
Last Year
- Watch event: 8
- Push event: 2
- Fork event: 2
Packages
- Total packages: 1
- Total downloads: 352 last month (PyPI)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 2
- Total maintainers: 1
pypi.org: deepfocus
Official Python implementation of "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models" published at EMNLP 2023.
- Homepage: https://github.com/konstantinjdobler/focus
- Documentation: https://deepfocus.readthedocs.io/
- License: Apache Software License
- Latest release: 1.0.1 (published over 2 years ago)