deepfocus
[EMNLP'23] Official Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models"
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 1 DOI reference in README
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.5%) to scientific vocabulary
Repository
[EMNLP'23] Official Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models"
Basic Info
- Host: GitHub
- Owner: konstantinjdobler
- License: MIT
- Language: Python
- Default Branch: main
- Homepage: https://arxiv.org/abs/2305.14481
- Size: 20 MB
Statistics
- Stars: 33
- Watchers: 1
- Forks: 5
- Open Issues: 0
- Releases: 2
Metadata Files
README.md
FOCUS
Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models" accepted at the EMNLP 2023 main conference.
Paper on arXiv: https://arxiv.org/abs/2305.14481.
Installation
We provide the package via `pip install deepfocus`.
Alternatively, you can simply copy the `deepfocus` folder and drop it into your project.
The necessary dependencies are listed in `requirements.txt` (`pip install -r requirements.txt`).
Usage
The following example shows how to use FOCUS to specialize xlm-roberta-base on German with a custom, language-specific tokenizer. The code is also available in focus_example.py.
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
from deepfocus import FOCUS

source_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
source_model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

target_tokenizer = AutoTokenizer.from_pretrained("./tokenizers/de/xlmr-unigram-50k")

# Example for training a new tokenizer (requires `from datasets import load_dataset`):
# target_tokenizer = source_tokenizer.train_new_from_iterator(
#     load_dataset("cc100", lang="de", split="train")["text"],
#     vocab_size=50_048,
# )
# target_tokenizer.save_pretrained("./target_tokenizer_test")

target_embeddings = FOCUS(
    source_embeddings=source_model.get_input_embeddings().weight,
    source_tokenizer=source_tokenizer,
    target_tokenizer=target_tokenizer,
    target_training_data_path="/path/to/data.jsonl",
    # Data should be a .jsonl file where each line is a sample like {"text": "Lorem ipsum..."}.
    # fasttext_model_path="/path/to/fasttext.bin",  # or directly provide a path to a token-level fastText model
    # In the paper, we use `target_training_data_path`, but we also implement
    # WECHSEL's word-to-subword mapping if the language has pretrained fastText
    # word embeddings available online. To use it, supply a two-letter
    # `language_identifier` (e.g. "de" for German) instead of `target_training_data_path` and set:
    # auxiliary_embedding_mode="fasttext-wordlevel",
    # language_identifier="de",
)
source_model.resize_token_embeddings(len(target_tokenizer))
source_model.get_input_embeddings().weight.data = target_embeddings

# If the model has separate output embeddings, apply FOCUS separately.
if not source_model.config.tie_word_embeddings:
    target_output_embeddings = FOCUS(
        source_embeddings=source_model.get_output_embeddings().weight,
        source_tokenizer=source_tokenizer,
        target_tokenizer=target_tokenizer,
        target_training_data_path="/path/to/data.jsonl",
        # Same argument options as above; fastText models are cached!
    )
    source_model.get_output_embeddings().weight.data = target_output_embeddings

# Continue training the model on the target language with `target_tokenizer`.
...
```
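As noted in the example above, `target_training_data_path` expects a `.jsonl` file with one JSON object per line containing a `"text"` field. A minimal sketch of producing such a file (the sample sentences are placeholders):

```python
import json

# Write a tiny .jsonl training file in the format FOCUS expects:
# one JSON object per line with a "text" field.
samples = ["Lorem ipsum dolor sit amet.", "Guten Tag, wie geht es Ihnen?"]
with open("data.jsonl", "w", encoding="utf-8") as f:
    for text in samples:
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")

# Read it back to verify the format.
with open("data.jsonl", encoding="utf-8") as f:
    lines = [json.loads(line) for line in f]
```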
Checkpoints
We publish the checkpoints trained with FOCUS on HuggingFace:
| Language | Vocabulary Replacement (preferred) | Vocabulary Extension |
|-------------|-----------------------------------------------|------------------------------------------------|
| German | konstantindobler/xlm-roberta-base-focus-german | konstantindobler/xlm-roberta-base-focus-extend-german |
| Arabic | konstantindobler/xlm-roberta-base-focus-arabic | konstantindobler/xlm-roberta-base-focus-extend-arabic |
| Kiswahili | konstantindobler/xlm-roberta-base-focus-kiswahili | konstantindobler/xlm-roberta-base-focus-extend-kiswahili|
| Hausa | konstantindobler/xlm-roberta-base-focus-hausa | konstantindobler/xlm-roberta-base-focus-extend-hausa |
| isiXhosa | konstantindobler/xlm-roberta-base-focus-isixhosa | konstantindobler/xlm-roberta-base-focus-extend-isixhosa|
In our experiments, full vocabulary replacement coupled with FOCUS outperformed extending XLM-R's original vocabulary, while also resulting in a smaller model and being faster to train.
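The checkpoint names in the table above follow a consistent pattern on the HuggingFace Hub. A small hypothetical helper (`focus_checkpoint` is not part of `deepfocus`) that builds a model id before loading it with `transformers`:

```python
# Hypothetical helper (not part of deepfocus): build the Hub id for a
# FOCUS checkpoint based on the naming pattern in the table above.
def focus_checkpoint(language: str, extend: bool = False) -> str:
    variant = "focus-extend" if extend else "focus"
    return f"konstantindobler/xlm-roberta-base-{variant}-{language}"

# Usage with transformers (downloads the checkpoint):
# from transformers import AutoModelForMaskedLM, AutoTokenizer
# model = AutoModelForMaskedLM.from_pretrained(focus_checkpoint("german"))
# tokenizer = AutoTokenizer.from_pretrained(focus_checkpoint("german"))
```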
Citation
You can cite FOCUS like this:
```bibtex
@inproceedings{dobler-de-melo-2023-focus,
    title = "{FOCUS}: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models",
    author = "Dobler, Konstantin and
      de Melo, Gerard",
    editor = "Bouamor, Houda and
      Pino, Juan and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.829",
    doi = "10.18653/v1/2023.emnlp-main.829",
    pages = "13440--13454",
}
```
If you use the "WECHSEL-style" word-to-subword mapping, please consider also citing their original work.
Owner
- Name: Konstantin Dobler
- Login: konstantinjdobler
- Kind: user
- Location: Potsdam, Germany
- Company: Hasso Plattner Institute
- Website: konstantindobler.me
- Repositories: 36
- Profile: https://github.com/konstantinjdobler
PhD student @ HPI working with language models
Citation (CITATION.cff)
cff-version: 1.2.0
title: "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models"
message: "If you use this software, please cite the paper from preferred-citation."
type: software
url: "https://github.com/konstantinjdobler/focus"
authors:
- given-names: Konstantin
family-names: Dobler
- given-names: Gerard
family-names: Melo
name-particle: de
preferred-citation:
abbreviation: dobler-demelo-2023-focus
type: conference-paper
authors:
- given-names: Konstantin
family-names: Dobler
- given-names: Gerard
family-names: Melo
name-particle: de
title: "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models"
year: 2023
publisher: "Empirical Methods in Natural Language Processing"
url: "https://arxiv.org/abs/2305.14481"
GitHub Events
Total
- Watch event: 8
- Push event: 2
- Fork event: 2
Last Year
- Watch event: 8
- Push event: 2
- Fork event: 2
Packages
- Total packages: 1
- Total downloads: 352 last month (PyPI)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 2
- Total maintainers: 1
pypi.org: deepfocus
Official Python implementation of "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models" published at EMNLP 2023.
- Homepage: https://github.com/konstantinjdobler/focus
- Documentation: https://deepfocus.readthedocs.io/
- License: Apache Software License
- Latest release: 1.0.1 (published over 2 years ago)