https://github.com/bminixhofer/tokenkit

A toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.


Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.6%) to scientific vocabulary

Keywords

distillation jax llms machine-learning tokenization tokenizer-transfer transfer-learning
Last synced: 5 months ago

Repository

A toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.

Basic Info
  • Host: GitHub
  • Owner: bminixhofer
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 366 KB
Statistics
  • Stars: 30
  • Watchers: 3
  • Forks: 4
  • Open Issues: 2
  • Releases: 0
Topics
distillation jax llms machine-learning tokenization tokenizer-transfer transfer-learning
Created 11 months ago · Last pushed 8 months ago
Metadata Files
Readme License

README.md

tokenkit

Tokenization Transfer for LLMs

tokenkit is a toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.

News

Contents

Why Transfer Across Tokenizers?

LLMs are bound to the tokenizer they were pretrained with. This limits their adaptability, reusability and modularity. Tokenizer transfer can lift this limitation. For example:

- If we want to reuse an LLM trained primarily on English in another language, we might want to update its tokenizer to one that is more suitable for the new language.
- If we want to combine (e.g., token-level ensemble) two LLMs, we need to transfer them to a common tokenizer.
- If we want to experiment with better tokenization schemes (e.g., byte-level tokenization), we might want to transfer an existing LLM to this tokenizer instead of expensively training a new one from scratch.
- If we want to transfer knowledge from a large teacher model to a smaller student model (which uses another tokenizer), we might want to use cross-tokenizer distillation to directly transfer the teacher's knowledge to the student without the need to first transfer the teacher to the student's tokenizer.

This library aims to let you accomplish all of this.

Installation

tokenkit is primarily implemented in Jax, using PyTorch for data loading (so your PyTorch installation does not need to support an accelerator). Recommended installation:

TPU

```bash
# Clone the repository & install the library
git clone https://github.com/bminixhofer/tokenkit
cd tokenkit

# Create a new virtual environment
# Currently requires Python <=3.10, but we are working on this: https://github.com/bminixhofer/tokenkit/issues/4
python -m venv tokenkit_env
. tokenkit_env/bin/activate

# Install torch & jax 0.5.0
pip install torch jax[tpu]==0.5.0 -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

# Currently, tokenkit relies on a fork of `lm_eval`
pip install git+https://github.com/bminixhofer/lm-evaluation-harness

# Install the library and the remaining dependencies
pip install -r requirements.txt
pip install -e .
```

GPU

```bash
# Clone the repository & install the library
git clone https://github.com/bminixhofer/tokenkit
cd tokenkit

# Create a new virtual environment
# Currently requires Python <=3.10, but we are working on this: https://github.com/bminixhofer/tokenkit/issues/4
python -m venv tokenkit_env
. tokenkit_env/bin/activate

# Install torch & jax 0.5.0
# You may need to substitute cuda12 with the version of CUDA you are using:
pip install torch jax[cuda12]==0.5.0

# Currently, tokenkit relies on a fork of `lm_eval`
pip install git+https://github.com/bminixhofer/lm-evaluation-harness

# Install the library and the remaining dependencies
pip install -r requirements.txt
pip install -e .
```
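As a quick sanity check after installation (not part of the tokenkit setup itself), you can confirm that the accelerated JAX backend was picked up by listing the visible devices:

```python
import jax

# Should list TPU or GPU devices rather than only CPU devices
# if the accelerator-enabled jax build was installed correctly.
print(jax.devices())
print(jax.default_backend())
```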

Quickstart

After installing the library, you can play around with the scripts in examples/ to get started immediately. For example:

```bash
bash examples/llama3_to_byte_tokenizer_gpu.sh
```

If you're interested in reproducing or improving on a public model which has been trained via ALM, you can also take a look at the tokenkit command used to train that model, for example in the Training section of the Llama3-2-3B-IT-Byte model card.

Guides

Features

Cross-Tokenizer Distillation

tokenkit supports Approximate Likelihood Matching (ALM) for cross-tokenizer distillation. ALM usually performs best, but we have also implemented the following baselines:

You can run cross-tokenizer distillation using the scripts/cross_tokenizer_distill.py script. See examples/ for examples of transferring to different subword tokenizers and to byte-level tokenization.
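As a rough conceptual sketch of the idea behind ALM (not tokenkit's implementation; the chunking, alignment, and loss below are made up for illustration): per-token log-likelihoods are aggregated over chunks of text that both tokenizations cover, and the student is trained to match the teacher at the chunk level, even though their token boundaries differ.

```python
import numpy as np

def chunk_loglikelihoods(token_logprobs, chunk_boundaries):
    """Sum per-token log-probs over each aligned chunk (half-open index ranges)."""
    return np.array([token_logprobs[s:e].sum() for s, e in chunk_boundaries])

# Dummy per-token log-probs for the same text under two different tokenizers.
teacher_logprobs = np.array([-0.2, -1.1, -0.5, -0.9, -0.3])        # 5 teacher tokens
student_logprobs = np.array([-0.4, -0.6, -0.8, -0.5, -0.2, -0.7])  # 6 student tokens

# Hypothetical alignment: chunks covering the same underlying text spans.
teacher_chunks = [(0, 2), (2, 5)]
student_chunks = [(0, 3), (3, 6)]

teacher_chunk_ll = chunk_loglikelihoods(teacher_logprobs, teacher_chunks)
student_chunk_ll = chunk_loglikelihoods(student_logprobs, student_chunks)

# Distillation signal: bring the student's chunk-level likelihoods close to the teacher's.
loss = np.abs(teacher_chunk_ll - student_chunk_ll).mean()
print(loss)
```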

Zero-Shot Tokenizer Transfer

tokenkit supports Zero-Shot Tokenizer Transfer (ZeTT) via Fast Vocabulary Transfer (FVT). Zero-Shot Tokenizer Transfer is usually used to obtain a good initialization for additional training, but can in some cases also be useful on its own. See our ZeTT paper for more details.

You can run Zero-Shot Tokenizer Transfer using the scripts/zett.py script.
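To illustrate the FVT heuristic (the names, shapes, and toy tokenizer below are made up for this sketch and are not tokenkit's API): each new token's embedding is initialized from the old-tokenizer embeddings of the tokens its string decomposes into, e.g. by averaging them.

```python
import numpy as np

def fvt_init(new_vocab, old_encode, old_embeddings):
    """Initialize embeddings for a new vocabulary by averaging the old-token
    embeddings that each new token's string decomposes into."""
    new_embeddings = np.zeros((len(new_vocab), old_embeddings.shape[1]))
    for i, token_str in enumerate(new_vocab):
        old_ids = old_encode(token_str)  # ids of the old tokens covering this string
        new_embeddings[i] = old_embeddings[old_ids].mean(axis=0)
    return new_embeddings

# Toy "old" vocabulary with 2-dimensional embeddings.
old_vocab = {"he": 0, "llo": 1, "wor": 2, "ld": 3}
old_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])

def toy_encode(s):
    # Greedy longest-match toy tokenizer standing in for the old tokenizer.
    ids, rest = [], s
    while rest:
        match = max((t for t in old_vocab if rest.startswith(t)), key=len, default=None)
        if match is None:
            rest = rest[1:]  # skip characters the toy vocabulary cannot cover
            continue
        ids.append(old_vocab[match])
        rest = rest[len(match):]
    return ids

print(fvt_init(["hello", "world"], toy_encode, old_embeddings))
```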

**We are working on implementing more ZeTT methods (including hypernetwork training introduced here).**

Token-Level Ensembling & Evaluating Transferred Models

tokenkit supports autoregressive generation & loglikelihood scoring evaluation by implementing a Jax backend to the LM Evaluation Harness. Alongside generating from single models, you can also generate from token-level ensembles of models. There are some predefined ensembles in configs/models. For example, this evaluates a token-level ensemble of Llama and Qwen on MMLU:

```bash
python3 scripts/eval_lockstep.py \
    models=llama_qwen \
    eval.tasks=[mmlu]
```
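Conceptually, a token-level ensemble combines the models' next-token distributions at each decoding step, which is only possible once the models share a tokenizer. The sketch below (with made-up logits, not tokenkit's implementation) shows the basic averaging step:

```python
import numpy as np

def ensemble_next_token(per_model_logits, weights=None):
    """Average per-model next-token distributions and pick the argmax token.
    Assumes all models share the same tokenizer, i.e. the same vocabulary axis."""
    probs = []
    for logits in per_model_logits:
        e = np.exp(logits - logits.max())  # numerically stable softmax
        probs.append(e / e.sum())
    if weights is None:
        weights = [1.0 / len(probs)] * len(probs)
    mixed = sum(w * p for w, p in zip(weights, probs))
    return int(mixed.argmax()), mixed

# Toy next-token logits from two models over a shared 5-token vocabulary.
logits_a = np.array([2.0, 0.5, 0.1, -1.0, 0.0])
logits_b = np.array([0.1, 2.5, 0.0, -0.5, 0.3])

token_id, mixed_probs = ensemble_next_token([logits_a, logits_b])
print(token_id, mixed_probs)
```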

To evaluate pretrained byte-level models, you'll need to pass embeddings to expand the input ids with (i.e., to use as n-gram embeddings). For example:

```bash
python3 scripts/eval.py \
    model.pretrained_model_name_or_path=\'benjamin/Gemma2-2B-IT-Byte\' \
    model.tokenizer_name=\'google/gemma-2-2b-it:source=Gemma2:conversion=byte\' \
    expand_model.pretrained_model_name_or_path=\'benjamin/gemma-2-2b-it-flax\' \
    expand_model.tokenizer_name=\'google/gemma-2-2b-it:source=Gemma2\' \
    eval.tasks=[mmlu]
```

To evaluate any other model (e.g., subword-to-subword transferred models), use something like the following:

```bash
python3 scripts/eval.py \
    model.pretrained_model_name_or_path=\'benjamin/Gemma2-2B-IT-with-Qwen2-Tokenizer\' \
    model.tokenizer_name=\'benjamin/Gemma2-2B-IT-with-Qwen2-Tokenizer:source=Gemma2:conversion=prebyteified\' \
    eval.tasks=[mmlu]
```

Citation

To refer to this repository or to cite Approximate Likelihood Matching, please use this citation:

@article{alm,
  title={Cross-Tokenizer Distillation via Approximate Likelihood Matching},
  author={Minixhofer, Benjamin and Vuli{\'c}, Ivan and Ponti, Edoardo Maria},
  journal={arXiv preprint arXiv:2503.20083},
  year={2025}
}

Please use this citation for Zero-Shot Tokenizer Transfer:

@inproceedings{zett,
  title={Zero-Shot Tokenizer Transfer},
  author={Benjamin Minixhofer and Edoardo Ponti and Ivan Vuli{\'c}},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024},
  url={https://openreview.net/forum?id=RwBObRsIzC}
}

Acknowledgments

Constituent projects (ALM, ZeTT) were supported by a Royal Society University Research Fellowship Inclusive and Sustainable Language Technology for a Truly Multilingual World (no 221137; 2022-) awarded to Ivan Vulić, by the Google Cloud Research Credits program with the award GCP329647813, and by Cloud TPUs from Google's TPU Research Cloud (TRC). The name tokenkit and the README layout were inspired by mergekit. big_vision was extremely useful as a high-quality reference JAX training codebase.

Owner

  • Name: Benjamin Minixhofer
  • Login: bminixhofer
  • Kind: user
  • Location: Linz, Austria

PhD Student @cambridgeltl

GitHub Events

Total
  • Issues event: 6
  • Watch event: 37
  • Issue comment event: 4
  • Push event: 130
  • Pull request event: 8
  • Fork event: 3
  • Create event: 8
Last Year
  • Issues event: 6
  • Watch event: 37
  • Issue comment event: 4
  • Push event: 130
  • Pull request event: 8
  • Fork event: 3
  • Create event: 8

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 144
  • Total Committers: 1
  • Avg Commits per committer: 144.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 144
  • Committers: 1
  • Avg Commits per committer: 144.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Benjamin Minixhofer b****r@g****m 144

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 3
  • Total pull requests: 3
  • Average time to close issues: 1 day
  • Average time to close pull requests: about 8 hours
  • Total issue authors: 2
  • Total pull request authors: 1
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 3
  • Average time to close issues: 1 day
  • Average time to close pull requests: about 8 hours
  • Issue authors: 2
  • Pull request authors: 1
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • GorkaUrbizu (2)
  • bminixhofer (1)
  • Mnji4 (1)
Pull Request Authors
  • bminixhofer (7)
Top Labels
Issue Labels
Pull Request Labels