Scientific Software
Updated 6 months ago

Fast, Consistent Tokenization of Natural Language Text — Peer-reviewed • Rank 19.9 • Science 95%

Fast, Consistent Tokenization of Natural Language Text - Published in JOSS (2018)

Updated 6 months ago

fugashi • Rank 22.1 • Science 64%

A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.

Updated 6 months ago

nlpo3 • Rank 15.7 • Science 67%

Thai natural language processing library in Rust, with Python and Node bindings.

Updated 6 months ago

lexikanon • Rank 5.5 • Science 67%

A HyFI plugin for Tokenizers

Updated 6 months ago

bpeasy • Rank 12.4 • Science 54%

Fast bare-bones BPE for modern tokenizer training

Updated 6 months ago

pinyintokenizer • Rank 10.4 • Science 44%

pinyintokenizer: a pinyin tokenizer that splits a continuous pinyin string into a list of single-character pinyin syllables.

Updated 6 months ago

pyonmttok • Rank 18.5 • Science 26%

Fast and customizable text tokenization library with BPE and SentencePiece support

Updated 6 months ago

token-wars-dataviz • Rank 0.0 • Science 44%

A data visualisation in `matplotlib` of the number of parameters in major LLMs as well as the number of tokens of text they were trained on.

Updated 6 months ago

ekphrasis • Rank 15.7 • Science 23%

Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and a Twitter corpus of 330 million English tweets).

Updated 6 months ago

biologicaltokenizers • Science 57%

Effect of tokenization on transformers for biological sequence

Updated 5 months ago

https://github.com/bernard-ng/drc-native-tokenizer • Science 26%

Tokenization for Low-Resource Congolese Languages: Efficiency, Coverage, and Downstream Impact