https://github.com/cosmaadrian/strawberry-problem
Official repository for "The Strawberry Problem 🍓: Emergence of Character-level Understanding in Tokenized Language Models"
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org, scholar.google)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 9.2%, to scientific vocabulary)
Keywords
Basic Info
- Host: GitHub
- Owner: cosmaadrian
- License: other
- Language: Python
- Default Branch: master
- Homepage: https://github.com/cosmaadrian/strawberry-problem
- Size: 56.6 KB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
The Strawberry Problem 🍓
Emergence of Character-level Understanding in Tokenized Language Models
📘 Abstract
Despite their remarkable progress across diverse domains, Large Language Models (LLMs) consistently fail at simple character-level tasks, such as counting letters in words, due to a fundamental limitation: tokenization. In this work, we frame this limitation as a problem of low mutual information and analyze it in terms of concept emergence. Using a suite of 19 synthetic tasks that isolate character-level reasoning in a controlled setting, we show that such capabilities emerge slowly, suddenly, and only late in training. We further show that percolation-based models of concept emergence explain these patterns, suggesting that learning character composition is not fundamentally different from learning commonsense knowledge. To address this bottleneck, we propose a lightweight architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of subword models. Together, our results bridge low-level perceptual gaps in tokenized LMs and provide a principled framework for understanding and mitigating their structural blind spots. We make our code publicly available.
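A minimal Python sketch of the mismatch the abstract describes: character-level counting is trivial when the characters are visible, but a subword model receives token IDs, so no single input unit exposes all of a word's letters. The subword split below is hypothetical and for illustration only; it is not the tokenization used in the paper.

```python
def count_letter(word: str, letter: str) -> int:
    """Character-level count -- trivial when the characters are visible."""
    return word.count(letter)

# A tokenized model sees something like ["straw", "berry"] as opaque IDs;
# the letters inside each token are not part of its input representation.
tokens = ["straw", "berry"]  # hypothetical subword split
per_token = [count_letter(t, "r") for t in tokens]

print(count_letter("strawberry", "r"))  # character-level view: 3
print(per_token)                        # token-level view: [1, 2]
```

The correct answer (3) is only recoverable by composing counts across token boundaries, which is the character-level capability the paper's 19 synthetic tasks isolate and measure.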
⚒️ Usage
Change into the experiments/ directory (cd experiments/) and run:
Step 1 - Generate vocabularies
bash generate_datasets.sh
Step 2 - Train the models
bash train.sh
bash wiki_train.sh
Step 3 - Perform ablation studies
bash ablation.sh
📖 Citation
If you found our work useful, please cite our paper:
@misc{cosma2025strawberry,
  title={The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models},
  author={Adrian Cosma and Stefan Ruseti and Emilian Radoi and Mihai Dascalu},
  year={2025},
  eprint={2505.14172},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.14172},
}
📝 License
This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
Owner
- Name: Adrian Cosma
- Login: cosmaadrian
- Kind: user
- Location: Bucharest, Romania
- Company: University Politehnica of Bucharest
- Repositories: 21
- Profile: https://github.com/cosmaadrian
- Bio: Mercenary Researcher
GitHub Events
Total
- Watch event: 2
- Push event: 1
- Create event: 1
Last Year
- Watch event: 2
- Push event: 1
- Create event: 1