Projects

Scientific Software

Updated 10 months ago

WordTokenizers.jl — Peer-reviewed • Rank 13.2 • Science 95%

WordTokenizers.jl: Basic tools for tokenizing natural language in Julia - Published in JOSS (2020)

data-mining information-retrieval lexer nlp tokenization

Engineering (40%) Earth and Environmental Sciences (40%)

Scientific Software · Peer-reviewed

Updated 5 months ago

KeemenaPreprocessing.jl: Unicode-Robust Cleaning, Multi-Level Tokenisation & Streaming Offset Bundling for Julia NLP • Rank 1.4 • Science 87%

KeemenaPreprocessing.jl: Unicode-Robust Cleaning, Multi-Level Tokenisation & Streaming Offset Bundling for Julia NLP - Published in JOSS (2026)

julia natural-language-processing nlp text-encoding textprocessing tokenization

Updated 10 months ago

simplemma • Rank 16.8 • Science 57%

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

corpus-tools language-detection language-identification lemmatiser lemmatization lemmatizer low-resource-nlp morphological-analysis nlp tokenization tokenizer wordlist

Updated 10 months ago

spacy • Rank 37.7 • Science 36%

💫 Industrial-strength Natural Language Processing (NLP) in Python

ai artificial-intelligence cython data-science deep-learning entity-linking machine-learning named-entity-recognition natural-language-processing neural-network neural-networks nlp nlp-library python spacy text-classification tokenization

Updated 10 months ago

bpeasy • Rank 12.4 • Science 54%

Fast bare-bones BPE for modern tokenizer training

bpe tokenization tokenizer

Updated 10 months ago

pyonmttok • Rank 18.5 • Science 26%

Fast and customizable text tokenization library with BPE and SentencePiece support

bpe cpp icu machine-translation natural-language-processing python sentencepiece tokenization tokenizer unicode

Updated 10 months ago

token-wars-dataviz • Rank 0.0 • Science 44%

A data visualisation in `matplotlib` of the number of parameters in major LLMs as well as the number of tokens of text they were trained on.

llm llms tokenization tokenizer tokens tokenwars

Updated 10 months ago

https://github.com/bminixhofer/tokenkit • Rank 3.5 • Science 36%

A toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.

distillation jax llms machine-learning tokenization tokenizer-transfer transfer-learning

Updated 10 months ago

ekphrasis • Rank 15.7 • Science 23%

Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and 330 million English tweets from Twitter).

nlp nlp-library semeval spell-corrector spelling-correction text-processing text-segmentation tokenization tokenizer word-normalization word-segmentation

Updated 10 months ago

https://github.com/bminixhofer/zett • Rank 5.6 • Science 23%

Code for Zero-Shot Tokenizer Transfer

language-model llm llms multilingual tokenization transfer-learning

Updated 10 months ago

wisesight-sentiment • Science 67%

Thai social media text sentiment dataset

classification corpus sentiment-analysis thai tokenization

Updated 10 months ago

double-jeopardy-in-llms • Science 54%

Code for "Double Jeopardy and Climate Impact in the Use of Large Language Models." Includes scripts for analyzing socio-economic disparities, tokenization inefficiencies, and LLM utility using FLORES-200, Ethnologue, WDI, and GPT-4 APIs.

climate-change ethnologue gdp language llm openai-api tokenization wdi

Updated 10 months ago

https://github.com/cosmaadrian/strawberry-problem • Science 36%

Official repository for "The Strawberry Problem 🍓: Emergence of Character-level Understanding in Tokenized Language Models"

character-understanding cross-attention llms paper tokenization transformer

Updated 10 months ago

com.rootroo • Science 67%

Multilingual Natural Language Processing for Java

java maven natural-language-processing nlg nlp tokenization

Updated 10 months ago

klmbr • Science 44%

klmbr - a prompt pre-processing technique to break through the barrier of entropy while generating text with LLMs

inference llm prompts tokenization

Updated 10 months ago

https://github.com/aveek-saha/sentiment-based-stock-price-forecasting • Science 13%

Apple Stock Price Forecasting using Sentiment Analysis

keras-tensorflow linear-regression lstm naive-bayes-classifier natural-language-processing news-sentiment-analyser sentiment-analysis svc-svm tokenization twitter-sentiment-analysis

Updated 10 months ago

transform-emr • Science 54%

A decoder transformer-based model that frames event prediction from EMR records as a sequential text generation problem. This project is part of my thesis research.

medical-informatics pretraining tokenization transformer-architecture

