tokenizers
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file (found)
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: peblove
- License: apache-2.0
- Language: Rust
- Default Branch: main
- Size: 20.5 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
BBPE16 Tokenizer
Why BBPE16 (UTF-16 Byte-level Byte-Pair Encoding)?
BBPE16 (UTF-16 Byte-level Byte-Pair Encoding) represents a different approach from traditional Byte-Level BPE (BBPE), particularly in how it handles character encoding internally:
UTF-16 vs UTF-8 Processing: While standard BBPE processes text at the UTF-8 byte level, BBPE16 operates on UTF-16 byte sequences internally. This fundamental difference affects how characters are segmented and merged during the BPE algorithm.
Seamless Integration: When encoding, BBPE16 automatically converts UTF-8 strings to UTF-16 internally for processing, and during decoding, it processes tokens based on UTF-16 before returning standard UTF-8 strings. This makes it transparent to users—no manual encoding or decoding steps are required.
Different Character Representation: UTF-16 uses 16-bit code units, representing most common characters in a single unit while using surrogate pairs for characters beyond the Basic Multilingual Plane. UTF-8, in contrast, uses variable-length encoding with 1-4 bytes per character. This encoding difference affects how the BPE algorithm identifies and merges byte patterns.
CJK Language Characteristics: CJK characters typically require 3 bytes in UTF-8 encoding but fit within a single 16-bit unit in UTF-16 (for characters in the Basic Multilingual Plane). This difference in representation affects the byte-level patterns that BPE algorithms learn and apply.
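The byte-count difference described above is easy to verify with Python's standard codecs (a stdlib-only illustration, not part of this library):

```python
# Compare encoded lengths of a BMP CJK character in UTF-8 vs UTF-16.
# "utf-16-le" is used to avoid the 2-byte BOM that plain "utf-16" prepends.
ch = "한"  # Hangul syllable, inside the Basic Multilingual Plane

print(len(ch.encode("utf-8")))     # 3 bytes in UTF-8
print(len(ch.encode("utf-16-le"))) # 2 bytes (one 16-bit code unit) in UTF-16
```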
Same Alphabet Foundation: BBPE16 maintains compatibility with the standard 256-character byte-level alphabet used in traditional BBPE, ensuring consistent vocabulary mapping strategies.
Automatic Model Detection: The tokenizer can automatically distinguish between BBPE and BBPE16 models when loading, providing seamless integration without requiring manual configuration.
UTF16ByteLevelBPETokenizer
A specialized tokenizer that processes text at the UTF-16 byte level while maintaining compatibility with the standard ByteLevel alphabet, offering an alternative approach to traditional UTF-8 byte-level processing.
Key Characteristics:
- Processes text using UTF-16 internal representation
- Lossless round-trip (encode then decode) across all languages
- Compatible byte-level alphabet (256 characters, same mapping strategy as ByteLevel)
- Transparent UTF-8 to UTF-16 conversion during processing
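The 256-character alphabet referenced above follows the byte-to-unicode mapping introduced by GPT-2's byte-level BPE. A stdlib-only sketch of how such a mapping can be built (illustrative, not this library's internal code):

```python
def bytes_to_unicode():
    """Map every byte 0-255 to a distinct printable unicode character,
    as in GPT-2's byte-level BPE alphabet."""
    # Bytes that are already printable keep their own code point.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Remaining bytes (controls, space, etc.) are shifted into the
    # 256+ range so every entry stays printable.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

alphabet = bytes_to_unicode()
print(len(alphabet))  # 256 distinct byte-to-character entries
print(alphabet[32])   # the space byte (0x20) maps to "Ġ"
```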
Usage:
```python
from tokenizers.implementations import UTF16ByteLevelBPETokenizer

# Initialize tokenizer
tokenizer = UTF16ByteLevelBPETokenizer()

# Train on your dataset
tokenizer.train(files=["training_data.txt"], vocab_size=1000)

# Encode text with UTF-16 internal processing
output = tokenizer.encode("안녕하세요 你好 こんにちは")
print(f"Tokens: {output.tokens}")
```
When to Consider:
- Working with applications that benefit from UTF-16 internal processing
- Exploring alternative byte-level tokenization approaches
- Building systems where UTF-16 representation aligns with other components
- Research comparing different encoding approaches in tokenization
Technical Differences:
The core distinction lies in the internal character representation during BPE processing. While UTF-8 BBPE works with variable-length byte sequences (1-4 bytes per character), UTF-16 BBPE processes 16-bit code units, which affects how byte patterns are identified and merged during vocabulary construction.
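To illustrate, the same string presents different byte patterns to the two variants (a stdlib-only sketch; the tokenizer performs the conversion internally):

```python
text = "你好"  # two BMP CJK characters

# Byte sequence a UTF-8 byte-level BPE would see: 3 bytes per character
utf8_seq = list(text.encode("utf-8"))

# Byte sequence a UTF-16 (little-endian) variant would see: 2 bytes per character
utf16_seq = list(text.encode("utf-16-le"))

print(len(utf8_seq), len(utf16_seq))  # 6 4

# Characters beyond the BMP need a surrogate pair in UTF-16,
# so both encodings use 4 bytes there:
emoji = "😁"
print(len(emoji.encode("utf-16-le")))  # 4
print(len(emoji.encode("utf-8")))      # 4
```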
Citation:
If you use UTF16ByteLevelBPETokenizer in your research or applications, please cite:
```bibtex
@misc{kim2025utf16bytelevel,
  title={UTF16ByteLevelBPETokenizer: Alternative UTF-16 Based Tokenization},
  author={Hyunsik Kim},
  year={2025},
  month={May},
  note={Implementation based on HuggingFace Tokenizers library},
  email={avantkim@gmail.com},
  url={https://github.com/peblove/tokenizers}
}
```
Technical References:
- Base Implementation: HuggingFace Tokenizers - https://github.com/huggingface/tokenizers
- GPT-2 Byte-Level BPE: Radford et al. (2019) - https://github.com/openai/gpt-2
- UTF-16 Encoding Standard: Unicode Consortium - https://unicode.org/standard/standard.html
Performance Benchmarks
Performance characteristics vary with hardware configuration; running the ~/bindings/python/benches/test_tiktoken.py benchmark provides a comparative analysis across systems.
Bindings
We provide bindings to the following languages (more to come!):
- Rust (Original implementation)
- Python
- Node.js
- Ruby (Contributed by @ankane, external repo)
Installation
You can install from source using:
```bash
pip install git+https://github.com/peblove/tokenizers.git#subdirectory=bindings/python
```
or install the released versions with:
```bash
pip install tokenizers
```
Quick Example Using Python
Choose your model between Byte-Pair Encoding, WordPiece or Unigram and instantiate a tokenizer:
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
```
You can customize how pre-tokenization (e.g., splitting into words) is done:
```python
from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()
```
Then training your tokenizer on a set of files just takes two lines of code:
```python
from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
```
Once your tokenizer is trained, encode any text with just one line:

```python
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
```
Check the documentation or the quicktour to learn more!
Owner
- Name: Hyunsik Kim
- Login: peblove
- Kind: user
- Company: home
- Repositories: 1
- Profile: https://github.com/peblove
AI/ML Researcher with expertise in Speech Recognition and ARM Architecture
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: HuggingFace's Tokenizers
message: >-
Fast State-of-the-Art Tokenizers optimized for Research
and Production.
type: software
authors:
- given-names: Anthony
family-names: Moi
email: m.anthony.moi@gmail.com
affiliation: HuggingFace
- given-names: Nicolas
family-names: Patry
affiliation: HuggingFace
repository-code: 'https://github.com/huggingface/tokenizers'
url: 'https://github.com/huggingface/tokenizers'
repository: 'https://huggingface.co'
abstract: >-
Fast State-of-the-Art Tokenizers optimized for Research
and Production.
keywords:
- Rust
- Tokenizer
- NLP
license: Apache-2.0
commit: 37372b6
version: 0.13.4
date-released: '2023-04-05'
GitHub Events
Total
- Public event: 3
- Push event: 34
- Create event: 2
Last Year
- Public event: 3
- Push event: 34
- Create event: 2
Dependencies
- PyO3/maturin-action v1 composite
- actions/attest-build-provenance v1 composite
- actions/checkout v4 composite
- actions/download-artifact v4 composite
- actions/setup-python v5 composite
- actions/upload-artifact v4 composite
- actions/checkout v4 composite
- actions/setup-python v5 composite
- actions/upload-artifact v4 composite
- dtolnay/rust-toolchain stable composite
- actions/cache v4 composite
- actions/checkout v4 composite
- actions/download-artifact v4 composite
- actions/setup-node v4 composite
- actions/setup-python v5 composite
- actions/upload-artifact v4 composite
- dtolnay/rust-toolchain stable composite
- actions-rs/cargo v1 composite
- actions/cache v4 composite
- actions/checkout v4 composite
- actions/setup-node v4 composite
- dtolnay/rust-toolchain stable composite
- PyO3/maturin-action v1 composite
- actions/checkout v4 composite
- actions/download-artifact v4 composite
- actions/setup-python v5 composite
- actions/upload-artifact v4 composite
- actions-rs/cargo v1 composite
- actions-rs/toolchain v1 composite
- actions/cache v4 composite
- actions/checkout v4 composite
- actions/setup-python v5 composite
- actions/cache v4 composite
- actions/checkout v4 composite
- dtolnay/rust-toolchain stable composite
- actions-rs/cargo v1 composite
- actions-rs/toolchain v1 composite
- actions/checkout v4 composite
- actions/stale v9 composite
- actions/checkout v4 composite
- trufflesecurity/trufflehog 853e1e8d249fd1e29d0fcc7280d29b03df3d643d composite
- pyo3 0.23 development
- tempfile 3.10 development
- env_logger 0.11
- itertools 0.12
- libc 0.2
- ndarray 0.16
- numpy 0.23
- pyo3 0.23
- rayon 1.10
- serde 1.0
- serde_json 1.0
- assert_approx_eq 1.1 development
- criterion 0.5 development
- tempfile 3.10 development
- tracing 0.1 development
- tracing-subscriber 0.3.18 development
- aho-corasick 1.1
- derive_builder 0.20
- esaxx-rs 0.1.10
- fancy-regex 0.14
- getrandom 0.2.10
- hf-hub 0.4.1
- indicatif 0.17
- itertools 0.13
- log 0.4
- macro_rules_attribute 0.2.0
- monostate 0.1.12
- onig 6.4
- paste 1.0.14
- rand 0.8
- rayon 1.10
- rayon-cond 0.3
- regex 1.10
- regex-syntax 0.8
- serde 1.0
- serde_json 1.0
- spm_precompiled 0.1.3
- thiserror 2
- unicode-normalization-alignments 0.1
- unicode-segmentation 1.11
- unicode_categories 0.1
- wasm-bindgen-test 0.3.13 development
- console_error_panic_hook 0.1.6
- wasm-bindgen 0.2.63
- wee_alloc 0.4.5
- @napi-rs/cli ^2.14.6 development
- @swc-node/register ^1.5.5 development
- @swc/core ^1.3.32 development
- @taplo/cli ^0.5.2 development
- @types/jest ^29.5.1 development
- @typescript-eslint/eslint-plugin ^5.50.0 development
- @typescript-eslint/parser ^5.50.0 development
- ava ^5.1.1 development
- benny ^3.7.1 development
- chalk ^5.2.0 development
- eslint ^8.33.0 development
- eslint-config-prettier ^8.6.0 development
- eslint-plugin-import ^2.27.5 development
- eslint-plugin-prettier ^4.2.1 development
- husky ^8.0.3 development
- jest ^29.5.0 development
- lint-staged ^13.1.0 development
- npm-run-all ^4.1.5 development
- prettier ^2.8.3 development
- ts-jest ^29.1.0 development
- typescript ^5.0.0 development
- 686 dependencies
- 318 dependencies
- copy-webpack-plugin ^11.0.0 development
- webpack ^5.75.0 development
- webpack-cli ^5.0.1 development
- webpack-dev-server ^4.10.0 development
- unstable_wasm file:../pkg
- huggingface_hub >=0.16.4,<1.0