tokenizers
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file (found)
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: peblove
- License: apache-2.0
- Language: Rust
- Default Branch: main
- Size: 20.5 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
BBPE16 Tokenizer
Why BBPE16 (UTF-16 Byte-level Byte-Pair Encoding)?
BBPE16 (UTF-16 Byte-level Byte-Pair Encoding) represents a different approach from traditional Byte-Level BPE (BBPE), particularly in how it handles character encoding internally:
UTF-16 vs UTF-8 Processing: While standard BBPE processes text at the UTF-8 byte level, BBPE16 operates on UTF-16 byte sequences internally. This fundamental difference affects how characters are segmented and merged during the BPE algorithm.
Seamless Integration: When encoding, BBPE16 automatically converts UTF-8 strings to UTF-16 internally for processing, and during decoding, it processes tokens based on UTF-16 before returning standard UTF-8 strings. This makes it transparent to users—no manual encoding or decoding steps are required.
Different Character Representation: UTF-16 uses 16-bit code units, representing most common characters in a single unit while using surrogate pairs for characters beyond the Basic Multilingual Plane. UTF-8, in contrast, uses variable-length encoding with 1-4 bytes per character. This encoding difference affects how the BPE algorithm identifies and merges byte patterns.
CJK Language Characteristics: CJK characters typically require 3 bytes in UTF-8 encoding but fit within a single 16-bit unit in UTF-16 (for characters in the Basic Multilingual Plane). This difference in representation affects the byte-level patterns that BPE algorithms learn and apply.
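The byte-count difference described above is easy to verify with Python's standard codecs (a stdlib-only illustration, not part of this library):

```python
# Compare encoded lengths of a BMP CJK character in UTF-8 vs UTF-16.
# "utf-16-le" is used to avoid the 2-byte BOM that plain "utf-16" prepends.
ch = "한"  # Hangul syllable, inside the Basic Multilingual Plane

print(len(ch.encode("utf-8")))     # 3 bytes in UTF-8
print(len(ch.encode("utf-16-le"))) # 2 bytes (one 16-bit code unit) in UTF-16
```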
Same Alphabet Foundation: BBPE16 maintains compatibility with the standard 256-character byte-level alphabet used in traditional BBPE, ensuring consistent vocabulary mapping strategies.
Automatic Model Detection: The tokenizer can automatically distinguish between BBPE and BBPE16 models when loading, providing seamless integration without requiring manual configuration.
UTF16ByteLevelBPETokenizer
A specialized tokenizer that processes text at the UTF-16 byte level while maintaining compatibility with the standard ByteLevel alphabet, offering an alternative approach to traditional UTF-8 byte-level processing.
Key Characteristics:
- Processes text using UTF-16 internal representation
- Lossless round-trip (encode then decode) across all languages
- Compatible byte-level alphabet (256 characters, same mapping strategy as ByteLevel)
- Transparent UTF-8 to UTF-16 conversion during processing
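The 256-character alphabet referenced above follows the byte-to-unicode mapping introduced by GPT-2's byte-level BPE. A stdlib-only sketch of how such a mapping can be built (illustrative, not this library's internal code):

```python
def bytes_to_unicode():
    """Map every byte 0-255 to a distinct printable unicode character,
    as in GPT-2's byte-level BPE alphabet."""
    # Bytes that are already printable keep their own code point.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Remaining bytes (controls, space, etc.) are shifted into the
    # 256+ range so every entry stays printable.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

alphabet = bytes_to_unicode()
print(len(alphabet))  # 256 distinct byte-to-character entries
print(alphabet[32])   # the space byte (0x20) maps to "Ġ"
```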
Usage:
```python
from tokenizers.implementations import UTF16ByteLevelBPETokenizer

# Initialize tokenizer
tokenizer = UTF16ByteLevelBPETokenizer()

# Train on your dataset
tokenizer.train(files=["training_data.txt"], vocab_size=1000)

# Encode text with UTF-16 internal processing
output = tokenizer.encode("안녕하세요 你好 こんにちは")
print(f"Tokens: {output.tokens}")
```
When to Consider:
- Working with applications that benefit from UTF-16 internal processing
- Exploring alternative byte-level tokenization approaches
- Building systems where UTF-16 representation aligns with other components
- Research comparing different encoding approaches in tokenization
Technical Differences:
The core distinction lies in the internal character representation during BPE processing. While UTF-8 BBPE works with variable-length byte sequences (1-4 bytes per character), UTF-16 BBPE processes 16-bit code units, which affects how byte patterns are identified and merged during vocabulary construction.
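To illustrate, the same string presents different byte patterns to the two variants (a stdlib-only sketch; the tokenizer performs the conversion internally):

```python
text = "你好"  # two BMP CJK characters

# Byte sequence a UTF-8 byte-level BPE would see: 3 bytes per character
utf8_seq = list(text.encode("utf-8"))

# Byte sequence a UTF-16 (little-endian) variant would see: 2 bytes per character
utf16_seq = list(text.encode("utf-16-le"))

print(len(utf8_seq), len(utf16_seq))  # 6 4

# Characters beyond the BMP need a surrogate pair in UTF-16,
# so both encodings use 4 bytes there:
emoji = "😁"
print(len(emoji.encode("utf-16-le")))  # 4
print(len(emoji.encode("utf-8")))      # 4
```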
Citation:
If you use UTF16ByteLevelBPETokenizer in your research or applications, please cite:
```bibtex
@misc{kim2025utf16bytelevel,
  title={UTF16ByteLevelBPETokenizer: Alternative UTF-16 Based Tokenization},
  author={Hyunsik Kim},
  year={2025},
  month={May},
  note={Implementation based on HuggingFace Tokenizers library},
  email={avantkim@gmail.com},
  url={https://github.com/peblove/tokenizers}
}
```
Technical References:
- Base Implementation: HuggingFace Tokenizers - https://github.com/huggingface/tokenizers
- GPT-2 Byte-Level BPE: Radford et al. (2019) - https://github.com/openai/gpt-2
- UTF-16 Encoding Standard: Unicode Consortium - https://unicode.org/standard/standard.html
Performance Benchmarks
Performance characteristics vary with hardware configuration; running the ~/bindings/python/benches/test_tiktoken.py benchmark provides a comparative analysis across systems.
Bindings
We provide bindings to the following languages (more to come!):
- Rust (Original implementation)
- Python
- Node.js
- Ruby (Contributed by @ankane, external repo)
Installation
You can install from source using:
```bash
pip install git+https://github.com/peblove/tokenizers.git#subdirectory=bindings/python
```
or install the released versions with:
```bash
pip install tokenizers
```
Quick Example Using Python
Choose your model between Byte-Pair Encoding, WordPiece or Unigram and instantiate a tokenizer:
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
```
You can customize how pre-tokenization (e.g., splitting into words) is done:
```python
from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()
```
Then training your tokenizer on a set of files just takes two lines of code:
```python
from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
```
Once your tokenizer is trained, encode any text with just one line:

```python
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
```
Check the documentation or the quicktour to learn more!
Owner
- Name: Hyunsik Kim
- Login: peblove
- Kind: user
- Company: home
- Repositories: 1
- Profile: https://github.com/peblove
AI/ML Researcher with expertise in Speech Recognition and ARM Architecture
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: HuggingFace's Tokenizers
message: >-
Fast State-of-the-Art Tokenizers optimized for Research
and Production.
type: software
authors:
- given-names: Anthony
family-names: Moi
email: m.anthony.moi@gmail.com
affiliation: HuggingFace
- given-names: Nicolas
family-names: Patry
affiliation: HuggingFace
repository-code: 'https://github.com/huggingface/tokenizers'
url: 'https://github.com/huggingface/tokenizers'
repository: 'https://huggingface.co'
abstract: >-
Fast State-of-the-Art Tokenizers optimized for Research
and Production.
keywords:
- Rust
- Tokenizer
- NLP
license: Apache-2.0
commit: 37372b6
version: 0.13.4
date-released: '2023-04-05'
GitHub Events
Total
- Public event: 3
- Push event: 34
- Create event: 2
Last Year
- Public event: 3
- Push event: 34
- Create event: 2
Dependencies
- PyO3/maturin-action v1 composite
- actions/attest-build-provenance v1 composite
- actions/checkout v4 composite
- actions/download-artifact v4 composite
- actions/setup-python v5 composite
- actions/upload-artifact v4 composite
- actions/checkout v4 composite
- actions/setup-python v5 composite
- actions/upload-artifact v4 composite
- dtolnay/rust-toolchain stable composite
- actions/cache v4 composite
- actions/checkout v4 composite
- actions/download-artifact v4 composite
- actions/setup-node v4 composite
- actions/setup-python v5 composite
- actions/upload-artifact v4 composite
- dtolnay/rust-toolchain stable composite
- actions-rs/cargo v1 composite
- actions/cache v4 composite
- actions/checkout v4 composite
- actions/setup-node v4 composite
- dtolnay/rust-toolchain stable composite
- PyO3/maturin-action v1 composite
- actions/checkout v4 composite
- actions/download-artifact v4 composite
- actions/setup-python v5 composite
- actions/upload-artifact v4 composite
- actions-rs/cargo v1 composite
- actions-rs/toolchain v1 composite
- actions/cache v4 composite
- actions/checkout v4 composite
- actions/setup-python v5 composite
- actions/cache v4 composite
- actions/checkout v4 composite
- dtolnay/rust-toolchain stable composite
- actions-rs/cargo v1 composite
- actions-rs/toolchain v1 composite
- actions/checkout v4 composite
- actions/stale v9 composite
- actions/checkout v4 composite
- trufflesecurity/trufflehog 853e1e8d249fd1e29d0fcc7280d29b03df3d643d composite
- pyo3 0.23 development
- tempfile 3.10 development
- env_logger 0.11
- itertools 0.12
- libc 0.2
- ndarray 0.16
- numpy 0.23
- pyo3 0.23
- rayon 1.10
- serde 1.0
- serde_json 1.0
- assert_approx_eq 1.1 development
- criterion 0.5 development
- tempfile 3.10 development
- tracing 0.1 development
- tracing-subscriber 0.3.18 development
- aho-corasick 1.1
- derive_builder 0.20
- esaxx-rs 0.1.10
- fancy-regex 0.14
- getrandom 0.2.10
- hf-hub 0.4.1
- indicatif 0.17
- itertools 0.13
- log 0.4
- macro_rules_attribute 0.2.0
- monostate 0.1.12
- onig 6.4
- paste 1.0.14
- rand 0.8
- rayon 1.10
- rayon-cond 0.3
- regex 1.10
- regex-syntax 0.8
- serde 1.0
- serde_json 1.0
- spm_precompiled 0.1.3
- thiserror 2
- unicode-normalization-alignments 0.1
- unicode-segmentation 1.11
- unicode_categories 0.1
- wasm-bindgen-test 0.3.13 development
- console_error_panic_hook 0.1.6
- wasm-bindgen 0.2.63
- wee_alloc 0.4.5
- @napi-rs/cli ^2.14.6 development
- @swc-node/register ^1.5.5 development
- @swc/core ^1.3.32 development
- @taplo/cli ^0.5.2 development
- @types/jest ^29.5.1 development
- @typescript-eslint/eslint-plugin ^5.50.0 development
- @typescript-eslint/parser ^5.50.0 development
- ava ^5.1.1 development
- benny ^3.7.1 development
- chalk ^5.2.0 development
- eslint ^8.33.0 development
- eslint-config-prettier ^8.6.0 development
- eslint-plugin-import ^2.27.5 development
- eslint-plugin-prettier ^4.2.1 development
- husky ^8.0.3 development
- jest ^29.5.0 development
- lint-staged ^13.1.0 development
- npm-run-all ^4.1.5 development
- prettier ^2.8.3 development
- ts-jest ^29.1.0 development
- typescript ^5.0.0 development
- 686 dependencies
- 318 dependencies
- copy-webpack-plugin ^11.0.0 development
- webpack ^5.75.0 development
- webpack-cli ^5.0.1 development
- webpack-dev-server ^4.10.0 development
- unstable_wasm file:../pkg
- huggingface_hub >=0.16.4,<1.0