Updated 9 months ago

bpeasy • Rank 12.4 • Science 54%

Fast bare-bones BPE for modern tokenizer training

Updated 9 months ago

token-wars-dataviz • Rank 0.0 • Science 44%

A data visualisation in `matplotlib` of the number of parameters in major LLMs as well as the number of tokens of text they were trained on.

Updated 9 months ago

wisesight-sentiment • Science 67%

Thai social media text sentiment dataset

Updated 9 months ago

com.rootroo • Science 67%

Multilingual Natural Language Processing for Java

Updated 9 months ago

klmbr • Science 44%

klmbr - a prompt pre-processing technique to break through the barrier of entropy while generating text with LLMs

Updated 9 months ago

transform-emr • Science 54%

A decoder-based transformer model that frames event prediction from EMR records as a sequential text generation problem. This project is part of my thesis research.

Updated 9 months ago

double-jeopardy-in-llms • Science 54%

Code for "Double Jeopardy and Climate Impact in the Use of Large Language Models." Includes scripts for analyzing socio-economic disparities, tokenization inefficiencies, and LLM utility using FLORES-200, Ethnologue, WDI, and GPT-4 APIs.