Projects with CITATION.cff

Updated 11 months ago

simplemma • Rank 16.8 • Science 57%

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

corpus-tools language-detection language-identification lemmatiser lemmatization lemmatizer low-resource-nlp morphological-analysis nlp tokenization tokenizer wordlist

Updated 11 months ago

bpeasy • Rank 12.4 • Science 54%

Fast bare-bones BPE for modern tokenizer training

bpe tokenization tokenizer

Updated 11 months ago

token-wars-dataviz • Rank 0.0 • Science 44%

A data visualisation in `matplotlib` of the number of parameters in major LLMs as well as the number of tokens of text they were trained on.

llm llms tokenization tokenizer tokens tokenwars

Updated 11 months ago

klmbr • Science 44%

klmbr - a prompt pre-processing technique to break through the barrier of entropy while generating text with LLMs

inference llm prompts tokenization

Updated 11 months ago

transform-emr • Science 54%

A decoder-based transformer model that frames event prediction from EMR records as a sequential text generation problem. This project is part of my thesis research.

medical-informatics pretraining tokenization transformer-architecture

Updated 11 months ago

com.rootroo • Science 67%

Multilingual Natural Language Processing for Java

java maven natural-language-processing nlg nlp tokenization

Updated 11 months ago

wisesight-sentiment • Science 67%

Thai social media text sentiment dataset

classification corpus sentiment-analysis thai tokenization

Updated 11 months ago

double-jeopardy-in-llms • Science 54%

Code for "Double Jeopardy and Climate Impact in the Use of Large Language Models." Includes scripts for analyzing socio-economic disparities, tokenization inefficiencies, and LLM utility using FLORES-200, Ethnologue, WDI, and GPT-4 APIs.

climate-change ethnologue gdp language llm openai-api tokenization wdi
