Scientific Software
Updated 6 months ago

WordTokenizers.jl — Peer-reviewed • Rank 13.2 • Science 95%

WordTokenizers.jl: Basic tools for tokenizing natural language in Julia - Published in JOSS (2020)

Engineering (40%) Earth and Environmental Sciences (40%)
Updated 6 months ago

bpeasy • Rank 12.4 • Science 54%

Fast bare-bones BPE for modern tokenizer training

Updated 6 months ago

pyonmttok • Rank 18.5 • Science 26%

Fast and customizable text tokenization library with BPE and SentencePiece support

Updated 6 months ago

token-wars-dataviz • Rank 0.0 • Science 44%

A data visualisation in `matplotlib` of the number of parameters in major LLMs as well as the number of tokens of text they were trained on.

Updated 5 months ago

https://github.com/bminixhofer/tokenkit • Rank 3.5 • Science 36%

A toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.

Updated 6 months ago

ekphrasis • Rank 15.7 • Science 23%

Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and Twitter, the latter comprising 330 million English tweets).

Updated 6 months ago

wisesight-sentiment • Science 67%

Thai social media text sentiment dataset

Updated 6 months ago

klmbr • Science 44%

klmbr - a prompt pre-processing technique to break through the barrier of entropy while generating text with LLMs

Updated 6 months ago

transform-emr • Science 54%

A decoder-based transformer model that treats event prediction from EMR records as a sequential text generation problem. This project is part of my thesis research.

Updated 6 months ago

double-jeopardy-in-llms • Science 54%

Code for "Double Jeopardy and Climate Impact in the Use of Large Language Models." Includes scripts for analyzing socio-economic disparities, tokenization inefficiencies, and LLM utility using FLORES-200, Ethnologue, WDI, and GPT-4 APIs.

Updated 5 months ago

https://github.com/cosmaadrian/strawberry-problem • Science 36%

Official repository for "The Strawberry Problem 🍓: Emergence of Character-level Understanding in Tokenized Language Models"

Updated 6 months ago

com.rootroo • Science 67%

Multilingual Natural Language Processing for Java