https://github.com/bminixhofer/zett
Code for Zero-Shot Tokenizer Transfer
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org)
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity (6.7%) to scientific vocabulary)
Keywords
Repository
Code for Zero-Shot Tokenizer Transfer
Basic Info
- Host: GitHub
- Owner: bminixhofer
- Language: Python
- Default Branch: main
- Homepage: https://arxiv.org/abs/2405.07883
- Size: 1.04 MB
Statistics
- Stars: 128
- Watchers: 2
- Forks: 11
- Open Issues: 8
- Releases: 0
Topics
Metadata Files
README.md
Zero-Shot Tokenizer Transfer
This repository contains the code for the paper [Zero-Shot Tokenizer Transfer](https://arxiv.org/abs/2405.07883). ZeTT frees language models from their tokenizer, allowing you to use any model with any tokenizer, with little or no extra training.
Available pretrained hypernetworks
| Hypernetwork | ...for Model | Comments |
|---|---|---|
| benjamin/zett-hypernetwork-xlm-roberta-base | xlm-roberta-base | multilingual, 26 languages |
| benjamin/zett-hypernetwork-Mistral-7B-v0.1 | mistralai/Mistral-7B-v0.1 | English + Code |
| benjamin/zett-hypernetwork-multilingual-Mistral-7B-v0.1 | mistralai/Mistral-7B-v0.1 | multilingual, 26 languages |
| benjamin/zett-hypernetwork-TinyLlama-1.1B-intermediate-step-1431k-3T | TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T | English + Code |
| benjamin/zett-hypernetwork-Meta-Llama-3-8B-experimental | meta-llama/Meta-Llama-3-8B | experimental English + Code, seems to underperform on Code |
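If you prefer not to use `git clone` (as in the examples below), a checkpoint can also be fetched programmatically with huggingface_hub. This is a minimal sketch, not part of the repository; the resulting local directory is then what you pass as `--checkpoint_path`:

```python
# Hedged sketch: download a pretrained hypernetwork checkpoint to a local folder.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="benjamin/zett-hypernetwork-xlm-roberta-base",
    local_dir="zett-hypernetwork-xlm-roberta-base",  # use this path as --checkpoint_path below
)
print(local_path)
```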
Using a pretrained hypernetwork
Environment Setup
Requirements are in requirements.txt. This, for example, creates a working environment:
```
conda create -n zett python=3.11
conda activate zett

pip install -r requirements.txt
pip install -U "jax[cuda12_pip]==0.4.23" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html # adjust based on your CUDA version
pip install -e .
```
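To verify that the CUDA-enabled JAX install actually sees your GPU, a quick check (a sanity-check sketch, not part of the repository's scripts) is:

```python
# Minimal sanity check: JAX should list GPU devices, not only the CPU.
import jax

print(jax.__version__)  # expected: 0.4.23 with the pin above
print(jax.devices())    # e.g. [cuda(id=0)] on a working CUDA install
```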
Transferring to a new tokenizer
Let's transfer XLM-RoBERTa to the GPT2 tokenizer.
```bash
git clone https://huggingface.co/benjamin/zett-hypernetwork-xlm-roberta-base

python3 scripts/transfer.py \
    --target_model=FacebookAI/xlm-roberta-base \
    --tokenizer_name=gpt2 \
    --output=my-new-fancy-xlm-r \
    --model_class=AutoModelForMaskedLM \
    --lang_code=en \
    --checkpoint_path=zett-hypernetwork-xlm-roberta-base \
    --save_pt # otherwise saves only Flax weights
```

Tada!

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-new-fancy-xlm-r")
model = AutoModelForMaskedLM.from_pretrained("my-new-fancy-xlm-r")

out = model(**tokenizer("Hello world!", return_tensors="pt"))
```

...or Mistral-7B to the GPT-NeoX tokenizer:
```bash
git clone https://huggingface.co/benjamin/zett-hypernetwork-Mistral-7B-v0.1

# because Flax weights are not merged in the main branch, we need to specify the revision of a PR containing Flax weights
python3 scripts/transfer.py \
    --target_model=mistralai/Mistral-7B-v0.1 \
    --revision=refs/pr/95 \
    --tokenizer_name=EleutherAI/gpt-neox-20b \
    --output=my-new-fancy-mistral \
    --model_class=AutoModelForCausalLM \
    --checkpoint_path=zett-hypernetwork-Mistral-7B-v0.1 \
    --save_pt # otherwise saves only Flax weights
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-new-fancy-mistral")
model = AutoModelForCausalLM.from_pretrained("my-new-fancy-mistral")

out = model(**tokenizer("Hello world!", return_tensors="pt"))
```

Although the codebase is in Jax/Flax, there are PyTorch bindings for the model in ./hf_hypernet. You can use them as follows:
```python
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer
from zett.utils import get_surface_form_matrix

base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
hypernet = AutoModel.from_pretrained("benjamin/zett-hypernetwork-Mistral-7B-v0.1", trust_remote_code=True)

source_embeddings = torch.concatenate([
    base_model.get_input_embeddings().weight.data,
    base_model.get_output_embeddings().weight.data,
], axis=1)

hn_tokenizer = AutoTokenizer.from_pretrained("benjamin/zett-hypernetwork-Mistral-7B-v0.1")

target_surface_forms = get_surface_form_matrix(
    ["hello", "world"],  # byte representation of the tokens to predict
    maxlen=hypernet.config.hn_surface_maxlen,
    tokenizer_to_use=hn_tokenizer,
)[0]

# the last output is the predicted bias in case the model uses a bias (e.g. XLM-R)
predicted_input_embeddings, predicted_output_embeddings, _ = hypernet(
    torch.from_numpy(target_surface_forms),
    source_embeddings=source_embeddings,
)
```
However, transfer.py is currently not ported to PyTorch (PRs welcome!).
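For illustration only, here is one way such predictions could be written into a PyTorch copy of the base model. This is a hedged sketch reusing the objects from the snippet above, not the logic of transfer.py; it assumes the hypernetwork can process the whole target vocabulary in one batch and that the raw vocabulary strings are in the form get_surface_form_matrix expects (check scripts/transfer.py for the exact preprocessing).

```python
# Hedged sketch: predict embeddings for a whole target vocabulary and install them
# into a resized copy of the base model. `new_tokenizer` is a hypothetical choice.
import torch
from transformers import AutoTokenizer

new_tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
# order the vocabulary by token id so row i of the predictions matches token id i
new_vocab = [tok for tok, _ in sorted(new_tokenizer.get_vocab().items(), key=lambda kv: kv[1])]

surface_forms = get_surface_form_matrix(
    new_vocab,
    maxlen=hypernet.config.hn_surface_maxlen,
    tokenizer_to_use=hn_tokenizer,
)[0]

with torch.no_grad():
    pred_in, pred_out, _ = hypernet(
        torch.from_numpy(surface_forms), source_embeddings=source_embeddings
    )
    base_model.resize_token_embeddings(len(new_vocab))
    base_model.get_input_embeddings().weight.copy_(pred_in)
    base_model.get_output_embeddings().weight.copy_(pred_out)
```

In practice you would likely need to batch the vocabulary and handle special tokens the way the Flax transfer.py does.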
Advanced usage
Training a Hypernetwork
The script used to train the hypernetwork is train.py.
But first, you'll need to download and prepare the data via data/prepare.py and data/prepare_code.py.
You'll also need to install the Rust module in rust_utils (used to quickly sample tokenizers) via e.g. cd rust_utils && maturin develop --release.
Once finished, you can run training using the configs in configs/. For example:
```bash
python3 train.py configs/zeroshot/v7:tinyllama_en+code:lw=0.5_long.json
```
to train a hypernetwork for TinyLlama on English and Code.
Transferring fine-tuned models to a new tokenizer using a base model hypernetwork
Use scripts/apply_to_ft.py to transfer the tokenizer of a fine-tuned model, given a base model whose tokenizer has already been transferred. For example:
```bash
python3 scripts/apply_to_ft.py \
    --output=transferred-chat-mistral \
    --base_model_path=mistralai/Mistral-7B-v0.1 \
    --ft_model_path=mistralai/Mistral-7B-Instruct-v0.1 \
    --tokenizer_swapped_base_model_path=path-to-base-model-with-new-tokenizer \
    --lambdas 0.5
```
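The output directory can then be loaded like the transferred base models above (a minimal usage sketch, assuming the script wrote weights loadable by transformers to transferred-chat-mistral):

```python
# Hedged usage sketch: load the transferred fine-tuned model produced above.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("transferred-chat-mistral")
model = AutoModelForCausalLM.from_pretrained("transferred-chat-mistral")

out = model(**tokenizer("Hello world!", return_tensors="pt"))
```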
Reproducing the experiments from the paper
There are bash scripts in experiments/ to allow reproducing the main results from the paper.
Evaluation on code is still missing because we are using a fork of bigcode-evaluation-harness to fix some issues we encountered; the code-evaluation scripts will be added soon.
Unigramifying, using n-shot transferred models, reproducing the tokenizers from the paper, etc.
Guide coming soon... (but feel free to dig into scripts/ in the meantime)
Disclaimer
I prioritized releasing the code quickly instead of making it perfectly clean. There may still be remnants of my personal environment used to train the models and other non-niceties. I am in the process of cleaning this up. If you run into any problems or have any questions, please open an issue.
Owner
- Name: Benjamin Minixhofer
- Login: bminixhofer
- Kind: user
- Location: Linz, Austria
- Website: bmin.ai
- Twitter: bminixhofer
- Repositories: 31
- Profile: https://github.com/bminixhofer
- Bio: PhD Student @cambridgeltl
GitHub Events
Total
- Issues event: 1
- Watch event: 26
- Issue comment event: 1
- Push event: 5
- Fork event: 4
Last Year
- Issues event: 1
- Watch event: 26
- Issue comment event: 1
- Push event: 5
- Fork event: 4
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Benjamin Minixhofer | b****r@g****m | 17 |
| haemmerl | h****l@c****e | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 11
- Total pull requests: 1
- Average time to close issues: 16 days
- Average time to close pull requests: 27 days
- Total issue authors: 11
- Total pull request authors: 1
- Average comments per issue: 2.09
- Average comments per pull request: 2.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 7
- Pull requests: 1
- Average time to close issues: about 1 month
- Average time to close pull requests: 27 days
- Issue authors: 7
- Pull request authors: 1
- Average comments per issue: 0.86
- Average comments per pull request: 2.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- KathyHaem (1)
- LorrinWWW (1)
- jubgjf (1)
- Mnji4 (1)
- kdcyberdude (1)
- gushu333 (1)
- zaidalyafeai (1)
- zf0x00 (1)
- noforit (1)
- FannyDucel (1)
- elements72 (1)
- BenoitDalFerro (1)
Pull Request Authors
- KathyHaem (2)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- 107 dependencies
- pyo3 0.19 development
- tempfile 3.1 development
- onig 6.0
- pyo3 0.19
- rand 0.8
- rand_distr 0.4.3
- tokenizers 0.14.0
- accelerate ==0.23.0
- datasets ==2.18.0
- flax ==0.8.0
- h5py ==3.8.0
- matplotlib ==3.7.2
- maturin ==1.3.0
- optax ==0.1.5
- pandas ==2.0.3
- pyahocorasick ==2.0.0
- scikit-learn ==1.4.2
- transformers ==4.34.1
- wandb ==0.15.4