Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.4%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: peblove
  • License: apache-2.0
  • Language: Rust
  • Default Branch: main
  • Size: 20.5 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 10 months ago · Last pushed 7 months ago
Metadata Files
Readme License Citation

README.md

BBPE16 Tokenizer

Why BBPE16 (UTF-16 Byte-level Byte-Pair Encoding)?

BBPE16 (UTF-16 Byte-level Byte-Pair Encoding) represents a different approach from traditional Byte-Level BPE (BBPE), particularly in how it handles character encoding internally:

  • UTF-16 vs UTF-8 Processing: While standard BBPE processes text at the UTF-8 byte level, BBPE16 operates on UTF-16 byte sequences internally. This fundamental difference affects how characters are segmented and merged during the BPE algorithm.

  • Seamless Integration: When encoding, BBPE16 automatically converts UTF-8 strings to UTF-16 internally for processing, and during decoding, it processes tokens based on UTF-16 before returning standard UTF-8 strings. This makes it transparent to users—no manual encoding or decoding steps are required.

  • Different Character Representation: UTF-16 uses 16-bit code units, representing most common characters in a single unit while using surrogate pairs for characters beyond the Basic Multilingual Plane. UTF-8, in contrast, uses variable-length encoding with 1-4 bytes per character. This encoding difference affects how the BPE algorithm identifies and merges byte patterns.

  • CJK Language Characteristics: CJK characters typically require 3 bytes in UTF-8 encoding but fit within a single 16-bit unit in UTF-16 (for characters in the Basic Multilingual Plane). This difference in representation affects the byte-level patterns that BPE algorithms learn and apply.

  • Same Alphabet Foundation: BBPE16 maintains compatibility with the standard 256-character byte-level alphabet used in traditional BBPE, ensuring consistent vocabulary mapping strategies.

  • Automatic Model Detection: The tokenizer can automatically distinguish between BBPE and BBPE16 models when loading, providing seamless integration without requiring manual configuration.
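The encoding differences described in the bullets above can be verified directly in Python with the standard library; the sample strings below are illustrative only (the tokenizer performs this conversion internally):

```python
# Compare UTF-8 byte counts with UTF-16 code-unit counts for sample text.
samples = ["hello", "안녕하세요", "你好", "こんにちは", "𝄞"]  # last char is outside the BMP

for text in samples:
    utf8_bytes = len(text.encode("utf-8"))
    # "utf-16-le" avoids the 2-byte BOM that plain "utf-16" prepends
    utf16_units = len(text.encode("utf-16-le")) // 2
    print(f"{text!r}: {utf8_bytes} UTF-8 bytes, {utf16_units} UTF-16 code units")
```

For `"안녕하세요"` this reports 15 UTF-8 bytes but only 5 UTF-16 code units, while the non-BMP character `"𝄞"` takes 4 UTF-8 bytes and 2 UTF-16 code units (a surrogate pair), matching the trade-offs described above.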

UTF16ByteLevelBPETokenizer

A specialized tokenizer that processes text at the UTF-16 byte level while maintaining compatibility with the standard ByteLevel alphabet, offering an alternative approach to traditional UTF-8 byte-level processing.

Key Characteristics:

  • Processes text using UTF-16 internal representation
  • Maintains 100% round-trip (encode/decode) accuracy across all languages
  • Compatible byte-level alphabet (256 characters, same mapping strategy as ByteLevel)
  • Transparent UTF-8 to UTF-16 conversion during processing
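The "256-character byte-level alphabet" mentioned above is, in the standard HuggingFace ByteLevel implementation, the GPT-2-style byte-to-unicode mapping; the sketch below reproduces that well-known mapping as an assumption about what "same mapping strategy" means here:

```python
# GPT-2-style byte-to-unicode mapping used by ByteLevel pre-tokenizers.
# Printable bytes map to themselves; the rest are shifted into the 256+
# range so every byte has a distinct, visible character.
def bytes_to_unicode():
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

mapping = bytes_to_unicode()
assert len(mapping) == 256
print(mapping[ord("A")])  # prints "A" -- printable bytes map to themselves
print(mapping[0x20])      # prints "Ġ" -- space is remapped to a visible character
```

Because the alphabet is defined over raw bytes, it applies equally to UTF-8 and UTF-16 byte streams, which is what makes the two tokenizer variants compatible at the vocabulary level.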

Usage:

```python
from tokenizers.implementations import UTF16ByteLevelBPETokenizer

# Initialize tokenizer
tokenizer = UTF16ByteLevelBPETokenizer()

# Train on your dataset
tokenizer.train(files=["training_data.txt"], vocab_size=1000)

# Encode text with UTF-16 internal processing
output = tokenizer.encode("안녕하세요 你好 こんにちは")
print(f"Tokens: {output.tokens}")
```

When to Consider:

  • Working with applications that benefit from UTF-16 internal processing
  • Exploring alternative byte-level tokenization approaches
  • Building systems where UTF-16 representation aligns with other components
  • Research comparing different encoding approaches in tokenization

Technical Differences:

The core distinction lies in the internal character representation during BPE processing. While UTF-8 BBPE works with variable-length byte sequences (1-4 bytes per character), UTF-16 BBPE processes 16-bit code units, which affects how byte patterns are identified and merged during vocabulary construction.
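Concretely, the raw byte sequences the two variants learn merges over differ even for a single character; a quick standard-library illustration:

```python
# Byte sequences a BPE learner would see for one CJK character (U+4F60)
# under each internal encoding; UTF-16 shown in little-endian order.
ch = "你"
print(ch.encode("utf-8").hex(" "))      # e4 bd a0  -> three single-byte symbols
print(ch.encode("utf-16-le").hex(" "))  # 60 4f     -> one 16-bit code unit (two bytes)
```

Under UTF-8 the learner must merge three bytes to recover the character; under UTF-16 it starts from a single two-byte unit, which changes the merge statistics for CJK-heavy corpora.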

Citation:

If you use UTF16ByteLevelBPETokenizer in your research or applications, please cite:

```bibtex
@misc{kim2025utf16bytelevel,
  title={UTF16ByteLevelBPETokenizer: Alternative UTF-16 Based Tokenization},
  author={Hyunsik Kim},
  year={2025},
  month={May},
  note={Implementation based on HuggingFace Tokenizers library},
  email={avantkim@gmail.com},
  url={https://github.com/peblove/tokenizers}
}
```

Technical References:

Performance Benchmarks

Performance characteristics can vary depending on hardware configuration. Running the ~/bindings/python/benches/test_tiktoken.py benchmark provides comparative analysis on various systems:


Bindings

We provide bindings to the following languages (more to come!):

  • Rust (original implementation)
  • Python
  • Node.js
  • Ruby (contributed by @ankane, external repo)

Installation

You can install from source using:

```bash
pip install git+https://github.com/peblove/tokenizers.git#subdirectory=bindings/python
```

or install the released versions with:

```bash
pip install tokenizers
```

Quick Example Using Python

Choose your model from Byte-Pair Encoding, WordPiece, or Unigram and instantiate a tokenizer:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
```

You can customize how pre-tokenization (e.g., splitting into words) is done:

```python
from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()
```

Then training your tokenizer on a set of files just takes two lines of code:

```python
from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
```

Once your tokenizer is trained, encode any text with just one line:

```python
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
```

Check the documentation or the quicktour to learn more!

Owner

  • Name: Hyunsik Kim
  • Login: peblove
  • Kind: user
  • Company: home

AI/ML Researcher with expertise in Speech Recognition and ARM Architecture

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: HuggingFace's Tokenizers
message: >-
  Fast State-of-the-Art Tokenizers optimized for Research
  and Production.
type: software
authors:
  - given-names: Anthony
    family-names: Moi
    email: m.anthony.moi@gmail.com
    affiliation: HuggingFace
  - given-names: Nicolas
    family-names: Patry
    affiliation: HuggingFace
repository-code: 'https://github.com/huggingface/tokenizers'
url: 'https://github.com/huggingface/tokenizers'
repository: 'https://huggingface.co'
abstract: >-
  Fast State-of-the-Art Tokenizers optimized for Research
  and Production.
keywords:
  - Rust
  - Tokenizer
  - NLP
license: Apache-2.0
commit: 37372b6
version: 0.13.4
date-released: '2023-04-05'

GitHub Events

Total
  • Public event: 3
  • Push event: 34
  • Create event: 2
Last Year
  • Public event: 3
  • Push event: 34
  • Create event: 2

Dependencies

.github/workflows/CI.yml actions
  • PyO3/maturin-action v1 composite
  • actions/attest-build-provenance v1 composite
  • actions/checkout v4 composite
  • actions/download-artifact v4 composite
  • actions/setup-python v5 composite
  • actions/upload-artifact v4 composite
.github/workflows/build_documentation.yml actions
.github/workflows/build_pr_documentation.yml actions
.github/workflows/delete_doc_comment.yml actions
.github/workflows/delete_doc_comment_trigger.yml actions
.github/workflows/docs-check.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • actions/upload-artifact v4 composite
  • dtolnay/rust-toolchain stable composite
.github/workflows/node-release.yml actions
  • actions/cache v4 composite
  • actions/checkout v4 composite
  • actions/download-artifact v4 composite
  • actions/setup-node v4 composite
  • actions/setup-python v5 composite
  • actions/upload-artifact v4 composite
  • dtolnay/rust-toolchain stable composite
.github/workflows/node.yml actions
  • actions-rs/cargo v1 composite
  • actions/cache v4 composite
  • actions/checkout v4 composite
  • actions/setup-node v4 composite
  • dtolnay/rust-toolchain stable composite
.github/workflows/python-release.yml actions
  • PyO3/maturin-action v1 composite
  • actions/checkout v4 composite
  • actions/download-artifact v4 composite
  • actions/setup-python v5 composite
  • actions/upload-artifact v4 composite
.github/workflows/python.yml actions
  • actions-rs/cargo v1 composite
  • actions-rs/toolchain v1 composite
  • actions/cache v4 composite
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
.github/workflows/rust-release.yml actions
  • actions/cache v4 composite
  • actions/checkout v4 composite
  • dtolnay/rust-toolchain stable composite
.github/workflows/rust.yml actions
  • actions-rs/cargo v1 composite
  • actions-rs/toolchain v1 composite
  • actions/checkout v4 composite
.github/workflows/stale.yml actions
  • actions/stale v9 composite
.github/workflows/trufflehog.yml actions
  • actions/checkout v4 composite
  • trufflesecurity/trufflehog 853e1e8d249fd1e29d0fcc7280d29b03df3d643d composite
.github/workflows/upload_pr_documentation.yml actions
bindings/node/Cargo.toml cargo
bindings/python/Cargo.toml cargo
  • pyo3 0.23 development
  • tempfile 3.10 development
  • env_logger 0.11
  • itertools 0.12
  • libc 0.2
  • ndarray 0.16
  • numpy 0.23
  • pyo3 0.23
  • rayon 1.10
  • serde 1.0
  • serde_json 1.0
tokenizers/Cargo.toml cargo
  • assert_approx_eq 1.1 development
  • criterion 0.5 development
  • tempfile 3.10 development
  • tracing 0.1 development
  • tracing-subscriber 0.3.18 development
  • aho-corasick 1.1
  • derive_builder 0.20
  • esaxx-rs 0.1.10
  • fancy-regex 0.14
  • getrandom 0.2.10
  • hf-hub 0.4.1
  • indicatif 0.17
  • itertools 0.13
  • log 0.4
  • macro_rules_attribute 0.2.0
  • monostate 0.1.12
  • onig 6.4
  • paste 1.0.14
  • rand 0.8
  • rayon 1.10
  • rayon-cond 0.3
  • regex 1.10
  • regex-syntax 0.8
  • serde 1.0
  • serde_json 1.0
  • spm_precompiled 0.1.3
  • thiserror 2
  • unicode-normalization-alignments 0.1
  • unicode-segmentation 1.11
  • unicode_categories 0.1
tokenizers/examples/unstable_wasm/Cargo.toml cargo
  • wasm-bindgen-test 0.3.13 development
  • console_error_panic_hook 0.1.6
  • wasm-bindgen 0.2.63
  • wee_alloc 0.4.5
bindings/node/npm/android-arm-eabi/package.json npm
bindings/node/npm/android-arm64/package.json npm
bindings/node/npm/darwin-arm64/package.json npm
bindings/node/npm/darwin-x64/package.json npm
bindings/node/npm/freebsd-x64/package.json npm
bindings/node/npm/linux-arm-gnueabihf/package.json npm
bindings/node/npm/linux-arm64-gnu/package.json npm
bindings/node/npm/linux-arm64-musl/package.json npm
bindings/node/npm/linux-x64-gnu/package.json npm
bindings/node/npm/linux-x64-musl/package.json npm
bindings/node/npm/win32-arm64-msvc/package.json npm
bindings/node/npm/win32-ia32-msvc/package.json npm
bindings/node/npm/win32-x64-msvc/package.json npm
bindings/node/package.json npm
  • @napi-rs/cli ^2.14.6 development
  • @swc-node/register ^1.5.5 development
  • @swc/core ^1.3.32 development
  • @taplo/cli ^0.5.2 development
  • @types/jest ^29.5.1 development
  • @typescript-eslint/eslint-plugin ^5.50.0 development
  • @typescript-eslint/parser ^5.50.0 development
  • ava ^5.1.1 development
  • benny ^3.7.1 development
  • chalk ^5.2.0 development
  • eslint ^8.33.0 development
  • eslint-config-prettier ^8.6.0 development
  • eslint-plugin-import ^2.27.5 development
  • eslint-plugin-prettier ^4.2.1 development
  • husky ^8.0.3 development
  • jest ^29.5.0 development
  • lint-staged ^13.1.0 development
  • npm-run-all ^4.1.5 development
  • prettier ^2.8.3 development
  • ts-jest ^29.1.0 development
  • typescript ^5.0.0 development
bindings/node/yarn.lock npm
  • 686 dependencies
tokenizers/examples/unstable_wasm/www/package-lock.json npm
  • 318 dependencies
tokenizers/examples/unstable_wasm/www/package.json npm
  • copy-webpack-plugin ^11.0.0 development
  • webpack ^5.75.0 development
  • webpack-cli ^5.0.1 development
  • webpack-dev-server ^4.10.0 development
  • unstable_wasm file:../pkg
bindings/python/pyproject.toml pypi
  • huggingface_hub >=0.16.4,<1.0