Fast, Consistent Tokenization of Natural Language Text
Published in JOSS (2018)
nlpo3
Thai natural language processing library in Rust, with Python and Node bindings.
simplemma
Simple multilingual lemmatizer for Python, designed for speed and efficiency
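Dictionary-lookup lemmatization with a rule-based fallback is the general approach behind tools like simplemma. The sketch below illustrates that idea only; the lookup table and suffix rules are invented for the example and are not simplemma's data or API.

```python
# Toy dictionary-lookup lemmatizer with a crude English suffix fallback.
# The table and rules are illustrative, not simplemma's actual resources.
LEMMA_TABLE = {"mice": "mouse", "ran": "run", "better": "good"}

def lemmatize(token):
    """Return a lemma for `token`: exact lookup first, then suffix stripping."""
    token = token.lower()
    if token in LEMMA_TABLE:
        return LEMMA_TABLE[token]
    # Fallback: strip a common inflectional suffix if the stem stays long enough.
    for suffix in ("ies", "es", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            stem = token[: -len(suffix)]
            return stem + ("y" if suffix == "ies" else "")
    return token
```

Real lemmatizers ship large per-language tables; the fallback only matters for out-of-vocabulary tokens.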
pyonmttok
Fast and customizable text tokenization library with BPE and SentencePiece support
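Since the entry mentions BPE support, here is a minimal sketch of the byte-pair-encoding merge-learning loop itself, in plain Python. It does not use pyonmttok's API; the word-frequency input is made up for illustration.

```python
# Toy BPE training: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: frequency} dict."""
    # Represent each word as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing occurrences of the chosen pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 10)
```

With these frequencies the first learned merge is `("e", "s")`, since "es" occurs in both "newest" and "widest".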
token-wars-dataviz
A data visualisation in `matplotlib` of the number of parameters in major LLMs as well as the number of tokens of text they were trained on.
https://github.com/cahya-wirawan/rwkv-tokenizer
A fast RWKV Tokenizer written in Rust
ekphrasis
Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and 330 million English tweets).
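Hashtag splitting of the kind ekphrasis performs is typically done with dynamic programming over corpus word frequencies. The sketch below shows that technique in general form; the frequency table is invented and this is not ekphrasis's code.

```python
# Dynamic-programming word segmentation (for splitting hashtags) using
# unigram word statistics. Frequencies here are made up for illustration.
import math

FREQ = {"i": 500, "love": 120, "new": 300, "york": 80, "so": 200, "much": 90}
TOTAL = sum(FREQ.values())

def word_cost(w):
    """Negative log probability; unseen strings get a heavy length penalty."""
    if w in FREQ:
        return -math.log(FREQ[w] / TOTAL)
    return 10.0 * len(w)

def segment(text):
    """Split a lowercase string into the cheapest sequence of words."""
    n = len(text)
    best = [0.0] + [math.inf] * n   # best[i] = min cost of text[:i]
    back = [0] * (n + 1)            # back[i] = start of the last word in text[:i]
    for i in range(1, n + 1):
        for j in range(max(0, i - 12), i):   # cap candidate word length at 12
            cost = best[j] + word_cost(text[j:i])
            if cost < best[i]:
                best[i], back[i] = cost, j
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

print(segment("ilovenewyork"))
```

Ekphrasis builds its statistics from the corpora mentioned above; the same algorithm also underlies its spell-correction candidate ranking.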
biologicaltokenizers
Effect of tokenization on transformers for biological sequences
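One common tokenization scheme for biological sequences is overlapping k-mers, which turns a DNA or protein string into fixed-length subword tokens. A minimal sketch of that scheme follows; it is a generic illustration, not code from this repository.

```python
# Overlapping k-mer tokenization of a biological sequence.
def kmer_tokenize(seq, k=3, stride=1):
    """Split `seq` into k-mers, advancing `stride` characters per token."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

print(kmer_tokenize("ATGCGT", k=3))
```

With `stride=k` the k-mers become non-overlapping, which shortens the token sequence at the cost of positional coverage.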
https://github.com/centre-for-humanities-computing/chinese-tokenizer
A Rusty way of tokenizing Chinese texts
https://github.com/bernard-ng/drc-native-tokenizer
Tokenization for Low-Resource Congolese Languages: Efficiency, Coverage, and Downstream Impact
https://github.com/alexeyev/mystem-scala
Morphological analyzer `mystem` (Russian language) wrapper for JVM languages