deepfocus

[EMNLP'23] Official Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models"

https://github.com/konstantinjdobler/focus

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.5%) to scientific vocabulary
Last synced: 6 months ago

Repository

[EMNLP'23] Official Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models"

Basic Info
Statistics
  • Stars: 33
  • Watchers: 1
  • Forks: 5
  • Open Issues: 0
  • Releases: 2
Created about 3 years ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md

FOCUS

Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models" accepted at the EMNLP 2023 main conference.

Paper on arXiv: https://arxiv.org/abs/2305.14481.

Installation

We provide the package via `pip install deepfocus`.

Alternatively, you can simply copy the `deepfocus` folder and drop it into your project. The necessary dependencies are listed in `requirements.txt` (`pip install -r requirements.txt`).

Usage

The following example shows how to use FOCUS to specialize `xlm-roberta-base` on German with a custom, language-specific tokenizer. The code is also available in `focus_example.py`.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
from deepfocus import FOCUS

source_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
source_model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

target_tokenizer = AutoTokenizer.from_pretrained(
    "./tokenizers/de/xlmr-unigram-50k"
)
# Example for training a new tokenizer:
# target_tokenizer = source_tokenizer.train_new_from_iterator(
#     load_dataset("cc100", lang="de", split="train")["text"],
#     vocab_size=50_048
# )
# target_tokenizer.save_pretrained("./target_tokenizer_test")

target_embeddings = FOCUS(
    source_embeddings=source_model.get_input_embeddings().weight,
    source_tokenizer=source_tokenizer,
    target_tokenizer=target_tokenizer,
    target_training_data_path="/path/to/data.jsonl",
    # data should be .jsonl where each line is a sample like {"text": "Lorem ipsum..."}
    # fasttext_model_path="/path/to/fasttext.bin",  # or directly provide a path to a token-level fasttext model
    # In the paper, we use `target_training_data_path`, but we also implement
    # WECHSEL's word-to-subword mapping if the language has pretrained fasttext
    # word embeddings available online. To use it, supply a two-letter
    # `language_identifier` (e.g. "de" for German) instead of
    # `target_training_data_path` and set:
    # auxiliary_embedding_mode="fasttext-wordlevel",
    # language_identifier="de",
)
source_model.resize_token_embeddings(len(target_tokenizer))
source_model.get_input_embeddings().weight.data = target_embeddings

# If the model has separate output embeddings, apply FOCUS to them separately
if not source_model.config.tie_word_embeddings:
    target_output_embeddings = FOCUS(
        source_embeddings=source_model.get_output_embeddings().weight,
        source_tokenizer=source_tokenizer,
        target_tokenizer=target_tokenizer,
        target_training_data_path="/path/to/data.jsonl",
        # same argument options as above; fasttext models are cached!
    )
    source_model.get_output_embeddings().weight.data = target_output_embeddings

# Continue training the model on the target language with `target_tokenizer`.
# ...
```
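FOCUS expects `target_training_data_path` to point to a `.jsonl` file where each line is a JSON object like `{"text": "..."}`. A minimal sketch of producing such a file (the sample sentences and the temp-file path are illustrative, not from the repository):

```python
import json
import os
import tempfile

# Hypothetical target-language samples; in practice these would come from a
# corpus in the language you are specializing on.
samples = [
    "Der schnelle braune Fuchs springt.",
    "Ein weiterer deutscher Beispielsatz.",
]

path = os.path.join(tempfile.gettempdir(), "data.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for text in samples:
        # One JSON object per line, with the text under the "text" key.
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")

# Reading it back yields one {"text": ...} record per line.
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
```

The resulting `path` can then be passed as `target_training_data_path`.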

Checkpoints

We publish the checkpoints trained with FOCUS on HuggingFace:

| Language  | Vocabulary Replacement (preferred)                 | Vocabulary Extension                                      |
|-----------|----------------------------------------------------|-----------------------------------------------------------|
| German    | konstantindobler/xlm-roberta-base-focus-german     | konstantindobler/xlm-roberta-base-focus-extend-german     |
| Arabic    | konstantindobler/xlm-roberta-base-focus-arabic     | konstantindobler/xlm-roberta-base-focus-extend-arabic     |
| Kiswahili | konstantindobler/xlm-roberta-base-focus-kiswahili  | konstantindobler/xlm-roberta-base-focus-extend-kiswahili  |
| Hausa     | konstantindobler/xlm-roberta-base-focus-hausa      | konstantindobler/xlm-roberta-base-focus-extend-hausa      |
| isiXhosa  | konstantindobler/xlm-roberta-base-focus-isixhosa   | konstantindobler/xlm-roberta-base-focus-extend-isixhosa   |

In our experiments, full vocabulary replacement coupled with FOCUS outperformed extending XLM-R's original vocabulary, while also resulting in a smaller model and being faster to train.
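As a back-of-the-envelope illustration of why replacement yields a smaller model: the embedding matrix has one row per vocabulary entry, so its parameter count scales directly with vocabulary size. The numbers below are a sketch (XLM-R base uses a vocabulary of roughly 250k and 768-dimensional embeddings; the extension size is hypothetical):

```python
hidden_size = 768            # XLM-R base embedding dimension
source_vocab = 250_002       # approximate XLM-R vocabulary size
target_vocab = 50_048        # language-specific vocabulary (full replacement)
extension = 30_000           # hypothetical number of added tokens (extension)

# Embedding parameters = vocabulary size x embedding dimension.
replacement_params = target_vocab * hidden_size
extension_params = (source_vocab + extension) * hidden_size
```

With replacement, the embedding matrix shrinks to roughly a fifth of the extended variant's, which is where the smaller model size and faster training come from.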

Citation

You can cite FOCUS like this:

```bibtex
@inproceedings{dobler-de-melo-2023-focus,
    title = "{FOCUS}: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models",
    author = "Dobler, Konstantin and de Melo, Gerard",
    editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.829",
    doi = "10.18653/v1/2023.emnlp-main.829",
    pages = "13440--13454",
}
```

If you use the "WECHSEL-style" word-to-subword mapping, please consider also citing the original WECHSEL work.

Owner

  • Name: Konstantin Dobler
  • Login: konstantinjdobler
  • Kind: user
  • Location: Potsdam, Germany
  • Company: Hasso Plattner Institute

PhD student @ HPI working with language models

Citation (CITATION.cff)

cff-version: 1.2.0
title: "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models"
message: "If you use this software, please cite the paper from preferred-citation."
type: software
url: "https://github.com/konstantinjdobler/focus"
authors:
  - given-names: Konstantin
    family-names: Dobler
  - given-names: Gerard
    family-names: Melo
    name-particle: de
preferred-citation:
  abbreviation: dobler-demelo-2023-focus
  type: conference-paper
  authors:
    - given-names: Konstantin
      family-names: Dobler
    - given-names: Gerard
      family-names: Melo
      name-particle: de
  title: "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models"
  year: 2023
  publisher: "Empirical Methods in Natural Language Processing"
  url: "https://arxiv.org/abs/2305.14481"

GitHub Events

Total
  • Watch event: 8
  • Push event: 2
  • Fork event: 2
Last Year
  • Watch event: 8
  • Push event: 2
  • Fork event: 2

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 352 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 2
  • Total maintainers: 1
pypi.org: deepfocus

Official Python implementation of "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models" published at EMNLP 2023.

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 352 Last month
Rankings
Dependent packages count: 9.8%
Average: 38.8%
Dependent repos count: 67.9%
Maintainers (1)
Last synced: 6 months ago