WordTokenizers.jl
WordTokenizers.jl: Basic tools for tokenizing natural language in Julia - Published in JOSS (2020)
simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
pyonmttok
Fast and customizable text tokenization library with BPE and SentencePiece support
token-wars-dataviz
A data visualisation in `matplotlib` of the number of parameters in major LLMs as well as the number of tokens of text they were trained on.
https://github.com/bminixhofer/tokenkit
A toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.
ekphrasis
Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and a Twitter corpus of 330 million English tweets).
https://github.com/aveek-saha/sentiment-based-stock-price-forecasting
Apple Stock Price Forecasting using Sentiment Analysis
klmbr
klmbr - a prompt pre-processing technique to break through the barrier of entropy while generating text with LLMs
transform-emr
A transformer-decoder-based model that frames event prediction from EMR records as a sequential text generation problem. This project is part of my thesis research.
double-jeopardy-in-llms
Code for "Double Jeopardy and Climate Impact in the Use of Large Language Models." Includes scripts for analyzing socio-economic disparities, tokenization inefficiencies, and LLM utility using FLORES-200, Ethnologue, WDI, and GPT-4 APIs.
https://github.com/cosmaadrian/strawberry-problem
Official repository for "The Strawberry Problem 🍓: Emergence of Character-level Understanding in Tokenized Language Models"