WordTokenizers.jl
WordTokenizers.jl: Basic tools for tokenizing natural language in Julia - Published in JOSS (2020)
KeemenaPreprocessing.jl
KeemenaPreprocessing.jl: Unicode-Robust Cleaning, Multi-Level Tokenisation & Streaming Offset Bundling for Julia NLP - Published in JOSS (2026)
simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
pyonmttok
Fast and customizable text tokenization library with BPE and SentencePiece support
token-wars-dataviz
A data visualisation in `matplotlib` of the number of parameters in major LLMs as well as the number of tokens of text they were trained on.
https://github.com/bminixhofer/tokenkit
A toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.
ekphrasis
Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora: English Wikipedia and Twitter (330 million English tweets).
klmbr
klmbr - a prompt pre-processing technique to break through the barrier of entropy while generating text with LLMs
transform-emr
A decoder-based transformer model that frames event prediction from electronic medical records (EMR) as a sequential text generation problem. This project is part of my thesis research.
https://github.com/aveek-saha/sentiment-based-stock-price-forecasting
Apple Stock Price Forecasting using Sentiment Analysis
https://github.com/cosmaadrian/strawberry-problem
Official repository for "The Strawberry Problem 🍓: Emergence of Character-level Understanding in Tokenized Language Models"
double-jeopardy-in-llms
Code for "Double Jeopardy and Climate Impact in the Use of Large Language Models." Includes scripts for analyzing socio-economic disparities, tokenization inefficiencies, and LLM utility using FLORES-200, Ethnologue, WDI, and GPT-4 APIs.