https://github.com/bminixhofer/tokenkit

A toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.


Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.6%) to scientific vocabulary

Keywords

distillation jax llms machine-learning tokenization tokenizer-transfer transfer-learning
Last synced: 5 months ago

Repository

A toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.

Basic Info
  • Host: GitHub
  • Owner: bminixhofer
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 366 KB
Statistics
  • Stars: 30
  • Watchers: 3
  • Forks: 4
  • Open Issues: 2
  • Releases: 0
Topics
distillation jax llms machine-learning tokenization tokenizer-transfer transfer-learning
Created 11 months ago · Last pushed 8 months ago
Metadata Files
Readme License

README.md

tokenkit

Tokenization Transfer for LLMs

tokenkit is a toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.

News

Contents

Why Transfer Across Tokenizers?

LLMs are bound to the tokenizer they were pretrained with. This limits their adaptability, reusability and modularity. Tokenizer transfer can lift this limitation. For example:

- If we want to reuse an LLM trained primarily on English in another language, we might want to update its tokenizer to one that is more suitable for the new language.
- If we want to combine (e.g., token-level ensemble) two LLMs, we need to transfer them to a common tokenizer.
- If we want to experiment with better tokenization schemes (e.g., byte-level tokenization), we might want to transfer an existing LLM to this tokenizer instead of expensively training a new one from scratch.
- If we want to transfer knowledge from a large teacher model to a smaller student model (which uses another tokenizer), we might want to use cross-tokenizer distillation to directly transfer the teacher's knowledge to the student without the need to first transfer the teacher to the student's tokenizer.

This library aims to let you accomplish all of this.

Installation

tokenkit is primarily implemented in Jax, using PyTorch for data loading (so your PyTorch installation does not need to support an accelerator). Recommended installation:

TPU

```bash
# Clone the repository & install the library
git clone https://github.com/bminixhofer/tokenkit
cd tokenkit

# Create a new virtual environment
# Currently requires Python <=3.10, but we are working on this: https://github.com/bminixhofer/tokenkit/issues/4
python -m venv tokenkit_env
. tokenkit_env/bin/activate

# Install torch & jax 0.5.0
pip install torch jax[tpu]==0.5.0 -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

# Currently, tokenkit relies on a fork of `lm_eval`
pip install git+https://github.com/bminixhofer/lm-evaluation-harness

# Install the library and the remaining dependencies
pip install -r requirements.txt
pip install -e .
```

GPU

```bash
# Clone the repository & install the library
git clone https://github.com/bminixhofer/tokenkit
cd tokenkit

# Create a new virtual environment
# Currently requires Python <=3.10, but we are working on this: https://github.com/bminixhofer/tokenkit/issues/4
python -m venv tokenkit_env
. tokenkit_env/bin/activate

# Install torch & jax 0.5.0
# You may need to substitute cuda12 with the version of CUDA you are using:
pip install torch jax[cuda12]==0.5.0

# Currently, tokenkit relies on a fork of `lm_eval`
pip install git+https://github.com/bminixhofer/lm-evaluation-harness

# Install the library and the remaining dependencies
pip install -r requirements.txt
pip install -e .
```
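As a quick sanity check after installation (not part of the tokenkit setup itself), you can confirm that the accelerated JAX backend was picked up by listing the visible devices:

```python
import jax

# Should list TPU or GPU devices rather than only CPU devices
# if the accelerator-enabled jax build was installed correctly.
print(jax.devices())
print(jax.default_backend())
```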

Quickstart

After installing the library, you can play around with the scripts in examples/ to get started immediately. For example:

```bash
bash examples/llama3_to_byte_tokenizer_gpu.sh
```

If you're interested in reproducing or improving on a public model which has been trained via ALM, you can also take a look at the tokenkit command used to train that model, for example in the Training section of the Llama3-2-3B-IT-Byte model card.

Guides

Features

Cross-Tokenizer Distillation

tokenkit supports Approximate Likelihood Matching (ALM) for cross-tokenizer distillation. ALM usually performs best, but we have also implemented the following baselines:

You can run cross-tokenizer distillation using the scripts/cross_tokenizer_distill.py script. See examples/ for examples of transferring to different subword tokenizers and to byte-level tokenization.
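As a rough conceptual sketch of the idea behind ALM (not tokenkit's implementation; the chunking, alignment, and loss below are made up for illustration): per-token log-likelihoods are aggregated over chunks of text that both tokenizations cover, and the student is trained to match the teacher at the chunk level, even though their token boundaries differ.

```python
import numpy as np

def chunk_loglikelihoods(token_logprobs, chunk_boundaries):
    """Sum per-token log-probs over each aligned chunk (half-open index ranges)."""
    return np.array([token_logprobs[s:e].sum() for s, e in chunk_boundaries])

# Dummy per-token log-probs for the same text under two different tokenizers.
teacher_logprobs = np.array([-0.2, -1.1, -0.5, -0.9, -0.3])        # 5 teacher tokens
student_logprobs = np.array([-0.4, -0.6, -0.8, -0.5, -0.2, -0.7])  # 6 student tokens

# Hypothetical alignment: chunks covering the same underlying text spans.
teacher_chunks = [(0, 2), (2, 5)]
student_chunks = [(0, 3), (3, 6)]

teacher_chunk_ll = chunk_loglikelihoods(teacher_logprobs, teacher_chunks)
student_chunk_ll = chunk_loglikelihoods(student_logprobs, student_chunks)

# Distillation signal: bring the student's chunk-level likelihoods close to the teacher's.
loss = np.abs(teacher_chunk_ll - student_chunk_ll).mean()
print(loss)
```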

Zero-Shot Tokenizer Transfer

tokenkit supports Zero-Shot Tokenizer Transfer (ZeTT) via Fast Vocabulary Transfer (FVT). Zero-Shot Tokenizer Transfer is usually used to obtain a good initialization for additional training, but can in some cases also be useful on its own. See our ZeTT paper for more details.

You can run Zero-Shot Tokenizer Transfer using the scripts/zett.py script.
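To illustrate the FVT heuristic (the names, shapes, and toy tokenizer below are made up for this sketch and are not tokenkit's API): each new token's embedding is initialized from the old-tokenizer embeddings of the tokens its string decomposes into, e.g. by averaging them.

```python
import numpy as np

def fvt_init(new_vocab, old_encode, old_embeddings):
    """Initialize embeddings for a new vocabulary by averaging the old-token
    embeddings that each new token's string decomposes into."""
    new_embeddings = np.zeros((len(new_vocab), old_embeddings.shape[1]))
    for i, token_str in enumerate(new_vocab):
        old_ids = old_encode(token_str)  # ids of the old tokens covering this string
        new_embeddings[i] = old_embeddings[old_ids].mean(axis=0)
    return new_embeddings

# Toy "old" vocabulary with 2-dimensional embeddings.
old_vocab = {"he": 0, "llo": 1, "wor": 2, "ld": 3}
old_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])

def toy_encode(s):
    # Greedy longest-match toy tokenizer standing in for the old tokenizer.
    ids, rest = [], s
    while rest:
        match = max((t for t in old_vocab if rest.startswith(t)), key=len, default=None)
        if match is None:
            rest = rest[1:]  # skip characters the toy vocabulary cannot cover
            continue
        ids.append(old_vocab[match])
        rest = rest[len(match):]
    return ids

print(fvt_init(["hello", "world"], toy_encode, old_embeddings))
```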

**We are working on implementing more ZeTT methods (including hypernetwork training introduced here).**

Token-Level Ensembling & Evaluating Transferred Models

tokenkit supports autoregressive generation & loglikelihood scoring evaluation by implementing a Jax backend to the LM Evaluation Harness. Alongside generating from single models, you can also generate from token-level ensembles of models. There are some predefined ensembles in configs/models. For example, this evaluates a token-level ensemble of Llama and Qwen on MMLU:

```bash
python3 scripts/eval_lockstep.py \
    models=llama_qwen \
    eval.tasks=[mmlu]
```
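Conceptually, a token-level ensemble combines the models' next-token distributions at each decoding step, which is only possible once the models share a tokenizer. The sketch below (with made-up logits, not tokenkit's implementation) shows the basic averaging step:

```python
import numpy as np

def ensemble_next_token(per_model_logits, weights=None):
    """Average per-model next-token distributions and pick the argmax token.
    Assumes all models share the same tokenizer, i.e. the same vocabulary axis."""
    probs = []
    for logits in per_model_logits:
        e = np.exp(logits - logits.max())  # numerically stable softmax
        probs.append(e / e.sum())
    if weights is None:
        weights = [1.0 / len(probs)] * len(probs)
    mixed = sum(w * p for w, p in zip(weights, probs))
    return int(mixed.argmax()), mixed

# Toy next-token logits from two models over a shared 5-token vocabulary.
logits_a = np.array([2.0, 0.5, 0.1, -1.0, 0.0])
logits_b = np.array([0.1, 2.5, 0.0, -0.5, 0.3])

token_id, mixed_probs = ensemble_next_token([logits_a, logits_b])
print(token_id, mixed_probs)
```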

To evaluate pretrained byte-level models, you'll need to pass embeddings to expand the input ids with (i.e., to use as n-gram embeddings). For example:

```bash
python3 scripts/eval.py \
    model.pretrained_model_name_or_path=\'benjamin/Gemma2-2B-IT-Byte\' \
    model.tokenizer_name=\'google/gemma-2-2b-it:source=Gemma2:conversion=byte\' \
    expand_model.pretrained_model_name_or_path=\'benjamin/gemma-2-2b-it-flax\' \
    expand_model.tokenizer_name=\'google/gemma-2-2b-it:source=Gemma2\' \
    eval.tasks=[mmlu]
```

To evaluate any other model (e.g., subword-to-subword transferred models), use something like the following:

```bash
python3 scripts/eval.py \
    model.pretrained_model_name_or_path=\'benjamin/Gemma2-2B-IT-with-Qwen2-Tokenizer\' \
    model.tokenizer_name=\'benjamin/Gemma2-2B-IT-with-Qwen2-Tokenizer:source=Gemma2:conversion=prebyteified\' \
    eval.tasks=[mmlu]
```

Citation

To refer to this repository or to cite Approximate Likelihood Matching, please use this citation:

@article{alm,
  title={Cross-Tokenizer Distillation via Approximate Likelihood Matching},
  author={Minixhofer, Benjamin and Vuli{\'c}, Ivan and Ponti, Edoardo Maria},
  journal={arXiv preprint arXiv:2503.20083},
  year={2025}
}

Please use this citation for Zero-Shot Tokenizer Transfer:

@inproceedings{zett,
  title={Zero-Shot Tokenizer Transfer},
  author={Benjamin Minixhofer and Edoardo Ponti and Ivan Vuli{\'c}},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024},
  url={https://openreview.net/forum?id=RwBObRsIzC}
}

Acknowledgments

Constituent projects (ALM, ZeTT) were supported by a Royal Society University Research Fellowship Inclusive and Sustainable Language Technology for a Truly Multilingual World (no 221137; 2022-) awarded to Ivan Vulić, by the Google Cloud Research Credits program with the award GCP329647813, and by Cloud TPUs from Google's TPU Research Cloud (TRC). The name tokenkit and the README layout were inspired by mergekit. big_vision was extremely useful as a high-quality reference JAX training codebase.

Owner

  • Name: Benjamin Minixhofer
  • Login: bminixhofer
  • Kind: user
  • Location: Linz, Austria

PhD Student @cambridgeltl

GitHub Events

Total
  • Issues event: 6
  • Watch event: 37
  • Issue comment event: 4
  • Push event: 130
  • Pull request event: 8
  • Fork event: 3
  • Create event: 8
Last Year
  • Issues event: 6
  • Watch event: 37
  • Issue comment event: 4
  • Push event: 130
  • Pull request event: 8
  • Fork event: 3
  • Create event: 8

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 144
  • Total Committers: 1
  • Avg Commits per committer: 144.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 144
  • Committers: 1
  • Avg Commits per committer: 144.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Benjamin Minixhofer b****r@g****m 144

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 3
  • Total pull requests: 3
  • Average time to close issues: 1 day
  • Average time to close pull requests: about 8 hours
  • Total issue authors: 2
  • Total pull request authors: 1
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 3
  • Average time to close issues: 1 day
  • Average time to close pull requests: about 8 hours
  • Issue authors: 2
  • Pull request authors: 1
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • GorkaUrbizu (2)
  • bminixhofer (1)
  • Mnji4 (1)
Pull Request Authors
  • bminixhofer (7)
Top Labels
Issue Labels
Pull Request Labels