https://github.com/bminixhofer/zett

Code for Zero-Shot Tokenizer Transfer


Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.7%) to scientific vocabulary

Keywords

language-model llm llms multilingual tokenization transfer-learning
Last synced: 5 months ago

Repository

Code for Zero-Shot Tokenizer Transfer

Basic Info
Statistics
  • Stars: 128
  • Watchers: 2
  • Forks: 11
  • Open Issues: 8
  • Releases: 0
Topics
language-model llm llms multilingual tokenization transfer-learning
Created almost 2 years ago · Last pushed about 1 year ago
Metadata Files
Readme

README.md

Zero-Shot Tokenizer Transfer

This repository contains the code for the paper Zero-Shot Tokenizer Transfer. ZeTT frees language models from their tokenizer, allowing you to use any model with any tokenizer, with little or no extra training.

Available pretrained hypernetworks

| Hypernetwork | ..for Model | Comments |
|---|---|---|
| benjamin/zett-hypernetwork-xlm-roberta-base | xlm-roberta-base | multilingual, 26 languages |
| benjamin/zett-hypernetwork-Mistral-7B-v0.1 | mistralai/Mistral-7B-v0.1 | English + Code |
| benjamin/zett-hypernetwork-multilingual-Mistral-7B-v0.1 | mistralai/Mistral-7B-v0.1 | multilingual, 26 languages |
| benjamin/zett-hypernetwork-TinyLlama-1.1B-intermediate-step-1431k-3T | TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T | English + Code |
| benjamin/zett-hypernetwork-Meta-Llama-3-8B-experimental | meta-llama/Meta-Llama-3-8B | experimental English + Code, seems to underperform on Code |
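Each hypernetwork can be loaded directly from the Hugging Face Hub. A minimal sketch (the sections below walk through the full transfer workflow; this just pulls down one of the checkpoints listed above using the PyTorch bindings described later in this README):

```python
# Minimal sketch: load one of the pretrained hypernetworks listed above.
# Uses the trust_remote_code PyTorch bindings shown further below.
from transformers import AutoModel, AutoTokenizer

hypernet = AutoModel.from_pretrained(
    "benjamin/zett-hypernetwork-Mistral-7B-v0.1", trust_remote_code=True
)
hn_tokenizer = AutoTokenizer.from_pretrained("benjamin/zett-hypernetwork-Mistral-7B-v0.1")
```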

Using a pretrained hypernetwork

Environment Setup

Requirements are in requirements.txt. The following, for example, creates a working environment:

```
conda create -n zett python=3.11
conda activate zett

pip install -r requirements.txt
pip install -U "jax[cuda12_pip]==0.4.23" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html # adjust based on your CUDA version
pip install -e .
```
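To check that the GPU-enabled JAX install above worked, a quick sanity check (a sketch; the exact device list depends on your machine and CUDA setup):

```python
# Sanity check for the JAX install (output depends on your hardware/CUDA setup).
import jax

print(jax.__version__)  # expected: 0.4.23, matching the pin above
print(jax.devices())    # e.g. [cuda(id=0)] if the CUDA wheels were picked up
```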

Transferring to a new tokenizer

Let's transfer XLM-RoBERTa to the GPT2 tokenizer.

```bash
git clone https://huggingface.co/benjamin/zett-hypernetwork-xlm-roberta-base

python3 scripts/transfer.py \
    --target_model=FacebookAI/xlm-roberta-base \
    --tokenizer_name=gpt2 \
    --output=my-new-fancy-xlm-r \
    --model_class=AutoModelForMaskedLM \
    --lang_code=en \
    --checkpoint_path=zett-hypernetwork-xlm-roberta-base \
    --save_pt # otherwise saves only Flax weights
```

Tada!

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-new-fancy-xlm-r")
model = AutoModelForMaskedLM.from_pretrained("my-new-fancy-xlm-r")

out = model(**tokenizer("Hello world!", return_tensors="pt"))
```
..or Mistral-7B to the GPT-NeoX tokenizer:

```bash
git clone https://huggingface.co/benjamin/zett-hypernetwork-Mistral-7B-v0.1

# because Flax weights are not merged in the main branch, we need to specify the revision of a PR containing Flax weights
python3 scripts/transfer.py \
    --target_model=mistralai/Mistral-7B-v0.1 \
    --revision=refs/pr/95 \
    --tokenizer_name=EleutherAI/gpt-neox-20b \
    --output=my-new-fancy-mistral \
    --model_class=AutoModelForCausalLM \
    --checkpoint_path=zett-hypernetwork-Mistral-7B-v0.1 \
    --save_pt # otherwise saves only Flax weights
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-new-fancy-mistral")
model = AutoModelForCausalLM.from_pretrained("my-new-fancy-mistral")

out = model(**tokenizer("Hello world!", return_tensors="pt"))
```

Although the codebase is in JAX/Flax, there are PyTorch bindings for the model in ./hf_hypernet. You can use them as follows:

```python
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer
from zett.utils import get_surface_form_matrix

base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
hypernet = AutoModel.from_pretrained("benjamin/zett-hypernetwork-Mistral-7B-v0.1", trust_remote_code=True)

source_embeddings = torch.concatenate([
    base_model.get_input_embeddings().weight.data,
    base_model.get_output_embeddings().weight.data,
], axis=1)

hn_tokenizer = AutoTokenizer.from_pretrained("benjamin/zett-hypernetwork-Mistral-7B-v0.1")

target_surface_forms = get_surface_form_matrix(
    ["hello", "world"],  # byte representation of the tokens to predict
    maxlen=hypernet.config.hn_surface_maxlen,
    tokenizer_to_use=hn_tokenizer,
)[0]

# the last output is the predicted bias in case the model uses a bias (e.g. XLM-R)
predicted_input_embeddings, predicted_output_embeddings, _ = hypernet(
    torch.from_numpy(target_surface_forms),
    source_embeddings=source_embeddings,
)
```

However, transfer.py is currently not ported to PyTorch (PRs welcome!).
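If you nevertheless want to experiment with plugging the predicted embeddings into the PyTorch model yourself, here is a rough sketch continuing from the snippet above. This is not the official transfer procedure: scripts/transfer.py additionally handles special tokens, the tokenizer itself, and other details.

```python
import torch

# Rough sketch only: resize the base model to the predicted vocabulary and copy
# in the hypernetwork outputs (transfer.py handles special tokens etc. on top).
new_vocab_size = predicted_input_embeddings.shape[0]
base_model.resize_token_embeddings(new_vocab_size)

with torch.no_grad():
    base_model.get_input_embeddings().weight.copy_(predicted_input_embeddings)
    base_model.get_output_embeddings().weight.copy_(predicted_output_embeddings)
```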

Advanced usage

Training a Hypernetwork

The script used to train the hypernetwork is train.py.

But first, you'll need to download and prepare the data via data/prepare.py and data/prepare_code.py.

You'll also need to install the Rust module in rust_utils (used to quickly sample tokenizers) via e.g. cd rust_utils && maturin develop --release.

Once finished, you can run training using the configs in configs/. For example:

```bash
python3 train.py configs/zeroshot/v7:tinyllama_en+code:lw=0.5_long.json
```

to train a hypernetwork for TinyLlama on English and Code.

Transferring fine-tuned models to a new tokenizer using a base model hypernetwork

Use scripts/apply_to_ft.py to transfer a fine-tuned model to a new tokenizer, given a base model whose tokenizer has already been transferred. For example:

```bash
python3 scripts/apply_to_ft.py \
    --output=transferred-chat-mistral \
    --base_model_path=mistralai/Mistral-7B-v0.1 \
    --ft_model_path=mistralai/Mistral-7B-Instruct-v0.1 \
    --tokenizer_swapped_base_model_path=path-to-base-model-with-new-tokenizer \
    --lambdas 0.5
```
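Conceptually (a heavily simplified sketch, not the script itself), this amounts to adding the fine-tuning delta, scaled by --lambdas, on top of the tokenizer-swapped base model, while keeping the swapped base's embeddings; see scripts/apply_to_ft.py for the actual logic, including how the differently sized embedding matrices are handled.

```python
import torch
from transformers import AutoModelForCausalLM

# Conceptual sketch of the delta transfer; scripts/apply_to_ft.py is the real implementation.
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
ft = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
swapped = AutoModelForCausalLM.from_pretrained("path-to-base-model-with-new-tokenizer")

lam = 0.5  # corresponds to --lambdas 0.5
base_sd, ft_sd = base.state_dict(), ft.state_dict()

with torch.no_grad():
    for name, param in swapped.named_parameters():
        if "embed" in name or "lm_head" in name:
            continue  # embedding/output matrices stay those of the tokenizer-swapped base
        param.add_(lam * (ft_sd[name] - base_sd[name]))
```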

Reproducing the experiments from the paper

There are bash scripts in experiments/ to allow reproducing the main results from the paper.

Evaluation on code is still missing because we are using a fork of bigcode-evaluation-harness to fix some issues we encountered; the corresponding scripts will be added soon.

Unigramifying, using n-shot transferred models, reproducing the tokenizers from the paper, etc.

Guide coming soon... (but feel free to dig into scripts/ in the meantime)

Disclaimer

I prioritized releasing the code quickly instead of making it perfectly clean. There may still be remnants of my personal environment used to train the models and other non-niceties. I am in the process of cleaning this up. If you run into any problems or have any questions, please open an issue.

Owner

  • Name: Benjamin Minixhofer
  • Login: bminixhofer
  • Kind: user
  • Location: Linz, Austria

PhD Student @cambridgeltl

GitHub Events

Total
  • Issues event: 1
  • Watch event: 26
  • Issue comment event: 1
  • Push event: 5
  • Fork event: 4
Last Year
  • Issues event: 1
  • Watch event: 26
  • Issue comment event: 1
  • Push event: 5
  • Fork event: 4

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 18
  • Total Committers: 2
  • Avg Commits per committer: 9.0
  • Development Distribution Score (DDS): 0.056
Past Year
  • Commits: 8
  • Committers: 2
  • Avg Commits per committer: 4.0
  • Development Distribution Score (DDS): 0.125
Top Committers
Name Email Commits
Benjamin Minixhofer b****r@g****m 17
haemmerl h****l@c****e 1

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 11
  • Total pull requests: 1
  • Average time to close issues: 16 days
  • Average time to close pull requests: 27 days
  • Total issue authors: 11
  • Total pull request authors: 1
  • Average comments per issue: 2.09
  • Average comments per pull request: 2.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 7
  • Pull requests: 1
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 27 days
  • Issue authors: 7
  • Pull request authors: 1
  • Average comments per issue: 0.86
  • Average comments per pull request: 2.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • KathyHaem (1)
  • LorrinWWW (1)
  • jubgjf (1)
  • Mnji4 (1)
  • kdcyberdude (1)
  • gushu333 (1)
  • zaidalyafeai (1)
  • zf0x00 (1)
  • noforit (1)
  • FannyDucel (1)
  • elements72 (1)
  • BenoitDalFerro (1)
Pull Request Authors
  • KathyHaem (2)

Dependencies

rust_utils/Cargo.lock cargo
  • 107 dependencies
rust_utils/Cargo.toml cargo
  • pyo3 0.19 development
  • tempfile 3.1 development
  • onig 6.0
  • pyo3 0.19
  • rand 0.8
  • rand_distr 0.4.3
  • tokenizers 0.14.0
requirements.txt pypi
  • accelerate ==0.23.0
  • datasets ==2.18.0
  • flax ==0.8.0
  • h5py ==3.8.0
  • matplotlib ==3.7.2
  • maturin ==1.3.0
  • optax ==0.1.5
  • pandas ==2.0.3
  • pyahocorasick ==2.0.0
  • scikit-learn ==1.4.2
  • transformers ==4.34.1
  • wandb ==0.15.4
rust_utils/pyproject.toml pypi
setup.py pypi