https://github.com/cahya-wirawan/rwkv-tokenizer
A fast RWKV Tokenizer written in Rust
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 9.5%, to scientific vocabulary)
Repository
A fast RWKV Tokenizer written in Rust
Basic Info
- Host: GitHub
- Owner: cahya-wirawan
- License: apache-2.0
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: https://github.com/cahya-wirawan/rwkv-tokenizer
- Size: 1.93 MB
Statistics
- Stars: 49
- Watchers: 2
- Forks: 3
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
RWKV Tokenizer
A fast RWKV Tokenizer written in Rust that supports the World Tokenizer used by the RWKV v5 and newer models.
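The World tokenizer is vocabulary-based: at each position it consumes the longest vocabulary entry that matches the input. A minimal Python sketch of that greedy longest-match principle, using a tiny hypothetical vocabulary (the real RWKV World vocabulary is byte-level and has roughly 65k entries):

```python
# Toy sketch of greedy longest-match tokenization. The vocabulary below is
# made up for illustration; it is NOT the RWKV World vocabulary.

def build_trie(vocab):
    """Store each token string in a nested-dict trie, keyed by its id."""
    trie = {}
    for tid, tok in enumerate(vocab):
        node = trie
        for ch in tok:
            node = node.setdefault(ch, {})
        node["_id"] = tid  # marks the end of a vocabulary entry
    return trie

def encode(text, trie):
    """At each position, emit the id of the longest matching entry."""
    ids, i = [], 0
    while i < len(text):
        node, best, j = trie, None, i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "_id" in node:
                best = (node["_id"], j)  # longest match seen so far
        if best is None:
            raise ValueError(f"no vocabulary entry matches at position {i}")
        ids.append(best[0])
        i = best[1]
    return ids

trie = build_trie(["a", "b", "ab", "abc", "c"])
print(encode("abcab", trie))  # [3, 2] -- "abc" then "ab", always the longest match
```

The real implementation works on bytes rather than characters and is heavily optimized, but the matching rule is the same.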
Installation and Usage
Rust
Add rwkv-tokenizer to the dependency list in Cargo.toml, or add it with cargo add rwkv-tokenizer.
Usage
Following is a simple Rust program using it:

```rust
use rwkv_tokenizer;

fn main() {
    let tokenizer = rwkv_tokenizer::WorldTokenizer::new(None).unwrap();
    let text = "Today is a beautiful day. 今天是美好的一天。";
    let ids = tokenizer.encode(text);
    let tokens = tokenizer.decode(ids.clone()).unwrap();
    println!("Text: {text}");
    println!("Ids: {ids:?}");
    println!("Tokens: {tokens:?}");
}
```

And run it with `cargo run`:

```
$ cargo run
   Compiling hellorwkv v0.1.0 (/home/cahya/Work/MachineLearning/Rust/hellorwkv)
    Finished dev profile [unoptimized + debuginfo] target(s) in 0.44s
     Running `target/debug/hellorwkv`
Text: Today is a beautiful day. 今天是美好的一天。
Ids: [33520, 4600, 332, 59219, 21509, 47, 33, 10381, 11639, 13091, 15597, 11685, 14734, 10250, 11639, 10080]
Tokens: "Today is a beautiful day. 今天是美好的一天。"
```
Python binding
Install the rwkv-tokenizer python module:
$ pip install pyrwkv-tokenizer
Usage
```python
>>> import pyrwkv_tokenizer
>>> tokenizer = pyrwkv_tokenizer.RWKVTokenizer()
>>> tokenizer.encode("Today is a beautiful day. 今天是美好的一天。")
[33520, 4600, 332, 59219, 21509, 47, 33, 10381, 11639, 13091, 15597, 11685, 14734, 10250, 11639, 10080]
>>> tokenizer.decode([33520, 4600, 332, 59219, 21509, 47, 33, 10381, 11639, 13091, 15597, 11685, 14734, 10250, 11639, 10080])
'Today is a beautiful day. 今天是美好的一天。'
>>> tokenizer.encode_batch(["Today is a beautiful day.", " 今天是美好的一天。"])
[[33520, 4600, 332, 59219, 21509, 47], [33, 10381, 11639, 13091, 15597, 11685, 14734, 10250, 11639, 10080]]
```
WebAssembly binding
There are two WebAssembly modules: a Node.js module and a web module. Install the WebAssembly package rwkv-tokenizer if the application runs only under Node.js, or rwkv-tokenizer-web if it is a web application. Following is an example of installing and running it in a Node.js application:
$ npm install rwkv-tokenizer
Usage
Create an example JavaScript file wasm.js with the following content:

```javascript
const { WorldTokenizer } = require('rwkv-tokenizer');

async function runWasm() {
    try {
        const textToEncode = "Today is a beautiful day. 今天是美好的一天。";
        const tokenizer = new WorldTokenizer();

        let encodedText = tokenizer.encode(textToEncode);
        console.log("Encoded text using tokenizer:", encodedText);

        let decodedText = tokenizer.decode(encodedText);
        console.log("Decoded text using tokenizer:", decodedText);

        let encodedTextBatch = tokenizer.encode_batch([textToEncode, "Another sentence."]);
        console.log("Encoded text using tokenizer:", encodedTextBatch);
    } catch (error) {
        console.error("Error loading or using WASM module:", error);
    }
}

runWasm();
```
and execute it:

```bash
$ node wasm.js
```

The output should look like:
```
Encoded text using tokenizer: Uint16Array(16) [
33520, 4600, 332, 59219,
21509, 47, 33, 10381,
11639, 13091, 15597, 11685,
14734, 10250, 11639, 10080
]
Decoded text using tokenizer: Today is a beautiful day. 今天是美好的一天。
Encoded text using tokenizer: [
Uint16Array(16) [
33520, 4600, 332, 59219,
21509, 47, 33, 10381,
11639, 13091, 15597, 11685,
14734, 10250, 11639, 10080
],
Uint16Array(3) [ 48358, 57192, 47 ]
]
```
A demo of the Webassembly RWKV Tokenizer running as web application is available at https://cahya-wirawan.github.io/rwkv-tokenizer-wasm/ with its source code https://github.com/cahya-wirawan/rwkv-tokenizer-wasm.
Performance and Validity Test
We compared the encoding results of the Rust RWKV tokenizer and the original tokenizer on
the English Wikipedia and Chinese poetry datasets. The results are identical, and the Rust RWKV
tokenizer also passes the original tokenizer's unit tests.
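The validity check boils down to running both implementations over the same corpus and verifying that every document encodes to identical ids. A minimal sketch of that methodology, with trivial stand-in tokenizers in place of the real ones:

```python
# Sketch of the validity-test idea: compare two tokenizer implementations
# document by document. The stand-in encoders below are hypothetical; in the
# real test they would be the original Python tokenizer and the Rust binding.

def reference_encode(text):
    return [ord(c) for c in text]  # stand-in for the original tokenizer

def fast_encode(text):
    return [ord(c) for c in text]  # stand-in for the Rust tokenizer

corpus = ["Today is a beautiful day.", "今天是美好的一天。"]
mismatches = [t for t in corpus if reference_encode(t) != fast_encode(t)]
print(len(mismatches))  # 0 -- the implementations agree on every document
```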
The following steps run the unit tests:

```
$ pip install pytest pyrwkv-tokenizer
$ git clone https://github.com/cahya-wirawan/rwkv-tokenizer.git
$ cd rwkv-tokenizer
$ pytest
```
We did a performance comparison on the simple English Wikipedia dataset 20220301.simple* among the following tokenizers:
- The original RWKV tokenizer (BlinkDL)
- Huggingface implementation of the RWKV tokenizer
- Huggingface Llama tokenizer
- Huggingface Mistral tokenizer
- Bert tokenizer
- OpenAI Tiktoken
- The Rust RWKV tokenizer
The comparison was done with this Jupyter notebook on an M2 Mac mini. The Rust RWKV tokenizer is around 17x faster than the original tokenizer and 9.6x faster than OpenAI Tiktoken.

We updated the Rust RWKV world tokenizer to support batch encoding with multithreading. We ran the same comparison script from the Huggingface Tokenizers with the rwkv tokenizer added. The result shows that the rwkv world tokenizer is significantly faster than the Tiktoken and Huggingface tokenizers across all thread counts and document sizes (on average, around ten times faster).
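The interface of batch encoding is simple: distribute a list of documents across workers and return the per-document id lists in the original order. A Python sketch of that interface using a thread pool and a trivial stand-in tokenizer (illustrative only: Python's GIL limits CPU-bound threading, whereas the Rust implementation parallelizes natively):

```python
# Sketch of the encode_batch idea. The encode() below is a hypothetical
# stand-in (one "id" per character), not the real RWKV encoding.
from concurrent.futures import ThreadPoolExecutor

def encode(text):
    return [ord(c) for c in text]

def encode_batch(texts, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order even when workers finish out of order
        return list(pool.map(encode, texts))

print(encode_batch(["ab", "c"]))  # [[97, 98], [99]]
```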
*The simple English Wikipedia dataset can be downloaded as jsonl file from https://huggingface.co/datasets/cahya/simple-wikipedia/resolve/main/simple-wikipedia.jsonl?download=true
Tools using this tokenizer
We also created the json2bin application to convert datasets from JSONL format into the binidx format, a data format used for training RWKV models. It uses multithreading to scale performance and converts a dataset more than 70 times faster (around 360 MB/s) than the original json2binidx_tool written in Python.
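The core of such a conversion is straightforward: parse one JSON document per line, tokenize its text field, and append the token ids to a binary file. A rough Python sketch of that pipeline, with a hypothetical stand-in tokenizer; note the real binidx format also includes an index file and header metadata beyond this sketch:

```python
# Sketch of a JSONL-to-binary token pipeline. encode() is a stand-in, and
# the raw little-endian u16 layout is a simplification of the real format.
import io
import json
import struct

def encode(text):
    return [ord(c) % 65536 for c in text]  # stand-in, not the RWKV vocab

def jsonl_to_bin(jsonl_lines, out):
    """Tokenize the "text" field of each JSONL line and append the ids
    to `out` as little-endian u16 words."""
    for line in jsonl_lines:
        doc = json.loads(line)
        ids = encode(doc["text"])
        out.write(struct.pack(f"<{len(ids)}H", *ids))

buf = io.BytesIO()
jsonl_to_bin(['{"text": "ab"}'], buf)
print(buf.getvalue())  # b'a\x00b\x00' -- ids 97, 98 as u16 little-endian
```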
Changelog
- Version 0.10.0
- Added a function to create the tokenizer from a vocabulary stored in a buffer.
- Added WebAssembly binding
- Version 0.9.1
- Added UTF-8 error handling to the decoder
- Version 0.9.0
- Added multithreading for the function encode_batch()
- Added batch/multithreading comparison
- Version 0.3.0
- Fixed the issue where some characters were not encoded correctly
This tokenizer is my very first Rust program, so it might still have many bugs and some silly code :-)
Owner
- Name: Cahya Wirawan
- Login: cahya-wirawan
- Kind: user
- Location: Vienna, Austria
- Website: https://www.linkedin.com/in/cahyawirawan/
- Twitter: CahyaWr
- Repositories: 171
- Profile: https://github.com/cahya-wirawan
System engineer, currently working on NLP, CV and Speech Recognition for fun and curiosity
GitHub Events
Total
- Watch event: 18
- Issue comment event: 5
- Push event: 18
- Pull request event: 6
- Fork event: 1
- Create event: 4
Last Year
- Watch event: 18
- Issue comment event: 5
- Push event: 18
- Pull request event: 6
- Fork event: 1
- Create event: 4
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Cahya Wirawan | c****n@g****m | 101 |
| Christian Balcom | r****r@g****m | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 0
- Total pull requests: 3
- Average time to close issues: N/A
- Average time to close pull requests: 5 days
- Total issue authors: 0
- Total pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 2.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 3
- Average time to close issues: N/A
- Average time to close pull requests: 5 days
- Issue authors: 0
- Pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 2.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- computer-whisperer (4)
- cahya-wirawan (2)
Packages
- Total packages: 4
- Total downloads:
  - pypi: 3,866 last month
  - npm: 65 last month
- Total dependent packages: 0 (may contain duplicates)
- Total dependent repositories: 0 (may contain duplicates)
- Total versions: 36
- Total maintainers: 1
npmjs.org: rwkv-tokenizer-web
RWKV Tokenizer - WASM
- Homepage: https://github.com/cahya-wirawan/rwkv-tokenizer#readme
- License: MIT/Apache-2.0
- Latest release: 0.3.4 (published 8 months ago)
Rankings
Maintainers (1)
npmjs.org: rwkv-tokenizer
RWKV Tokenizer - WASM
- Homepage: https://github.com/cahya-wirawan/rwkv-tokenizer#readme
- License: MIT/Apache-2.0
- Latest release: 0.3.3 (published 8 months ago)
Rankings
Maintainers (1)
pypi.org: pyrwkv-tokenizer
RWKV Tokenizer
- Documentation: https://pyrwkv-tokenizer.readthedocs.io/
- License: Apache-2.0
- Latest release: 0.9.1 (published 11 months ago)
Rankings
Maintainers (1)
pypi.org: rwkv-tokenizer
RWKV Tokenizer
- Documentation: https://rwkv-tokenizer.readthedocs.io/
- License: Apache-2.0
- Latest release: 0.11.0 (published 7 months ago)
Rankings
Maintainers (1)
Dependencies
- PyO3/maturin-action v1 composite
- actions/checkout v4 composite
- actions/download-artifact v4 composite
- actions/setup-python v5 composite
- actions/upload-artifact v4 composite
- aho-corasick 1.1.3
- autocfg 1.3.0
- bitflags 2.5.0
- cfg-if 1.0.0
- heck 0.4.1
- indoc 2.0.5
- libc 0.2.155
- lock_api 0.4.12
- memchr 2.7.2
- memoffset 0.9.1
- once_cell 1.19.0
- parking_lot 0.12.3
- parking_lot_core 0.9.10
- portable-atomic 1.6.0
- proc-macro2 1.0.84
- pyo3 0.21.2
- pyo3-build-config 0.21.2
- pyo3-ffi 0.21.2
- pyo3-macros 0.21.2
- pyo3-macros-backend 0.21.2
- quote 1.0.36
- redox_syscall 0.5.1
- regex 1.10.4
- regex-automata 0.4.6
- regex-syntax 0.8.3
- rwkv-tokenizer 0.8.0
- scopeguard 1.2.0
- smallvec 1.13.2
- syn 2.0.66
- target-lexicon 0.12.14
- unescape 0.1.0
- unicode-ident 1.0.12
- unindent 0.2.3
- windows-targets 0.52.5
- windows_aarch64_gnullvm 0.52.5
- windows_aarch64_msvc 0.52.5
- windows_i686_gnu 0.52.5
- windows_i686_gnullvm 0.52.5
- windows_i686_msvc 0.52.5
- windows_x86_64_gnu 0.52.5
- windows_x86_64_gnullvm 0.52.5
- windows_x86_64_msvc 0.52.5
- aho-corasick 1.1.3
- memchr 2.7.4
- regex 1.10.5
- regex-automata 0.4.7
- regex-syntax 0.8.4
- unescape 0.1.0