Fast, Consistent Tokenization of Natural Language Text
Published in JOSS (2018)
nlpo3
Thai natural language processing library in Rust, with Python and Node bindings.
simplemma
Simple multilingual lemmatizer for Python, designed for speed and efficiency
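Dictionary-lookup lemmatization with a rule-based fallback is the general approach behind tools like simplemma. The sketch below illustrates that idea only; the lookup table and suffix rules are invented for the example and are not simplemma's data or API.

```python
# Toy dictionary-lookup lemmatizer with a crude English suffix fallback.
# The table and rules are illustrative, not simplemma's actual resources.
LEMMA_TABLE = {"mice": "mouse", "ran": "run", "better": "good"}

def lemmatize(token):
    """Return a lemma for `token`: exact lookup first, then suffix stripping."""
    token = token.lower()
    if token in LEMMA_TABLE:
        return LEMMA_TABLE[token]
    # Fallback: strip a common inflectional suffix if the stem stays long enough.
    for suffix in ("ies", "es", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            stem = token[: -len(suffix)]
            return stem + ("y" if suffix == "ies" else "")
    return token
```

Real lemmatizers ship large per-language tables; the fallback only matters for out-of-vocabulary tokens.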
pyonmttok
Fast and customizable text tokenization library with BPE and SentencePiece support
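Since the entry mentions BPE support, here is a minimal sketch of the byte-pair-encoding merge-learning loop itself, in plain Python. It does not use pyonmttok's API; the word-frequency input is made up for illustration.

```python
# Toy BPE training: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: frequency} dict."""
    # Represent each word as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing occurrences of the chosen pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 10)
```

With these frequencies the first learned merge is `("e", "s")`, since "es" occurs in both "newest" and "widest".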
token-wars-dataviz
A data visualisation in `matplotlib` of the number of parameters in major LLMs as well as the number of tokens of text they were trained on.
https://github.com/cahya-wirawan/rwkv-tokenizer
A fast RWKV Tokenizer written in Rust
ekphrasis
Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and 330 million English tweets).
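Hashtag splitting of the kind ekphrasis performs is typically done with dynamic programming over corpus word frequencies. The sketch below shows that technique in general form; the frequency table is invented and this is not ekphrasis's code.

```python
# Dynamic-programming word segmentation (for splitting hashtags) using
# unigram word statistics. Frequencies here are made up for illustration.
import math

FREQ = {"i": 500, "love": 120, "new": 300, "york": 80, "so": 200, "much": 90}
TOTAL = sum(FREQ.values())

def word_cost(w):
    """Negative log probability; unseen strings get a heavy length penalty."""
    if w in FREQ:
        return -math.log(FREQ[w] / TOTAL)
    return 10.0 * len(w)

def segment(text):
    """Split a lowercase string into the cheapest sequence of words."""
    n = len(text)
    best = [0.0] + [math.inf] * n   # best[i] = min cost of text[:i]
    back = [0] * (n + 1)            # back[i] = start of the last word in text[:i]
    for i in range(1, n + 1):
        for j in range(max(0, i - 12), i):   # cap candidate word length at 12
            cost = best[j] + word_cost(text[j:i])
            if cost < best[i]:
                best[i], back[i] = cost, j
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

print(segment("ilovenewyork"))
```

Ekphrasis builds its statistics from the corpora mentioned above; the same algorithm also underlies its spell-correction candidate ranking.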
biologicaltokenizers
Effect of tokenization on transformers for biological sequences
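One common tokenization scheme for biological sequences is overlapping k-mers, which turns a DNA or protein string into fixed-length subword tokens. A minimal sketch of that scheme follows; it is a generic illustration, not code from this repository.

```python
# Overlapping k-mer tokenization of a biological sequence.
def kmer_tokenize(seq, k=3, stride=1):
    """Split `seq` into k-mers, advancing `stride` characters per token."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

print(kmer_tokenize("ATGCGT", k=3))
```

With `stride=k` the k-mers become non-overlapping, which shortens the token sequence at the cost of positional coverage.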
https://github.com/centre-for-humanities-computing/chinese-tokenizer
A Rusty way of tokenizing Chinese texts
https://github.com/bernard-ng/drc-native-tokenizer
Tokenization for Low-Resource Congolese Languages: Efficiency, Coverage, and Downstream Impact
https://github.com/alexeyev/mystem-scala
Morphological analyzer `mystem` (Russian language) wrapper for JVM languages