https://github.com/cosmaadrian/strawberry-problem
Official repository for "The Strawberry Problem 🍓: Emergence of Character-level Understanding in Tokenized Language Models"
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org, scholar.google)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 9.2%, to scientific vocabulary)
Keywords
Basic Info
- Host: GitHub
- Owner: cosmaadrian
- License: other
- Language: Python
- Default Branch: master
- Homepage: https://github.com/cosmaadrian/strawberry-problem
- Size: 56.6 KB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
The Strawberry Problem 🍓
Emergence of Character-level Understanding in Tokenized Language Models
📘 Abstract
Despite their remarkable progress across diverse domains, Large Language Models (LLMs) consistently fail at simple character-level tasks, such as counting letters in words, due to a fundamental limitation: tokenization. In this work, we frame this limitation as a problem of low mutual information and analyze it in terms of concept emergence. Using a suite of 19 synthetic tasks that isolate character-level reasoning in a controlled setting, we show that such capabilities emerge slowly, suddenly, and only late in training. We further show that percolation-based models of concept emergence explain these patterns, suggesting that learning character composition is not fundamentally different from learning commonsense knowledge. To address this bottleneck, we propose a lightweight architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of subword models. Together, our results bridge low-level perceptual gaps in tokenized LMs and provide a principled framework for understanding and mitigating their structural blind spots. We make our code publicly available.
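A minimal Python sketch of the mismatch the abstract describes: character-level counting is trivial when the characters are visible, but a subword model receives token IDs, so no single input unit exposes all of a word's letters. The subword split below is hypothetical and for illustration only; it is not the tokenization used in the paper.

```python
def count_letter(word: str, letter: str) -> int:
    """Character-level count -- trivial when the characters are visible."""
    return word.count(letter)

# A tokenized model sees something like ["straw", "berry"] as opaque IDs;
# the letters inside each token are not part of its input representation.
tokens = ["straw", "berry"]  # hypothetical subword split
per_token = [count_letter(t, "r") for t in tokens]

print(count_letter("strawberry", "r"))  # character-level view: 3
print(per_token)                        # token-level view: [1, 2]
```

The correct answer (3) is only recoverable by composing counts across token boundaries, which is the character-level capability the paper's 19 synthetic tasks isolate and measure.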
⚒️ Usage
Change into the experiments/ directory (cd experiments/) and run:
Step 1 - Generate vocabularies
bash generate_datasets.sh
Step 2 - Train the models
bash train.sh
bash wiki_train.sh
Step 3 - Perform ablation studies
bash ablation.sh
📖 Citation
If you found our work useful, please cite our paper:
@misc{cosma2025strawberry,
  title={The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models},
  author={Adrian Cosma and Stefan Ruseti and Emilian Radoi and Mihai Dascalu},
  year={2025},
  eprint={2505.14172},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.14172},
}
📝 License
This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
Owner
- Name: Adrian Cosma
- Login: cosmaadrian
- Kind: user
- Location: Bucharest, Romania
- Company: University Politehnica of Bucharest
- Repositories: 21
- Profile: https://github.com/cosmaadrian
- Bio: Mercenary Researcher
GitHub Events
Total
- Watch event: 2
- Push event: 1
- Create event: 1
Last Year
- Watch event: 2
- Push event: 1
- Create event: 1