pico

A byte-level language model architecture

https://github.com/loristns/pico

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.5%) to scientific vocabulary
Last synced: 6 months ago

Repository

A byte-level language model architecture

Basic Info
  • Host: GitHub
  • Owner: loristns
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 1020 KB
Statistics
  • Stars: 2
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 2 years ago · Last pushed about 1 year ago
Metadata Files
Readme License Citation

README.md

🤏Pico

Pico is my experimental decoder-only language model architecture, designed to bypass tokenization by processing raw bytes directly.
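To make the "no tokenizer" idea concrete, here is a minimal sketch of what byte-level input looks like (illustrative, not code from the repo): every UTF-8 byte becomes a token id, so the vocabulary has a fixed size of 256 and no learned merges.

```python
# Byte-level "tokenization": each UTF-8 byte is a token id in [0, 256).
# No learned vocabulary is needed, and decoding is just the inverse map.

def encode(text: str) -> list[int]:
    """Map text to a sequence of byte ids (vocabulary size 256)."""
    return list(text.encode("utf-8"))

def decode(ids: list[int]) -> str:
    """Map byte ids back to text."""
    return bytes(ids).decode("utf-8", errors="replace")

ids = encode("Pico 🤏")
print(ids)          # five ASCII bytes, then the 4-byte UTF-8 emoji
print(decode(ids))  # round-trips to "Pico 🤏"
```

The trade-off is that sequences get several times longer than with a subword tokenizer, which is what motivates the sliding-window and Mixture-of-Depths machinery below.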

Pico architecture

To offset the computational cost of working at the byte level, the model employs sliding window attention over a small context. To reintroduce longer-range dependencies, intermediate transformer blocks operate on larger windows, but they are selectively applied over a small fixed fraction of the bytes using a Mixture-of-Depths router.
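The routing idea can be sketched in a few lines (a toy illustration of top-k Mixture-of-Depths routing; the function names, scalar states, and capacity value are mine, not taken from the Pico source): a learned router scores every byte position, only the top fraction is sent through the expensive intermediate block, and the rest pass through unchanged on the residual path.

```python
# Toy Mixture-of-Depths routing: apply `block` only to the top `capacity`
# fraction of positions, ranked by a router score; others are untouched.

def mod_route(xs, router_score, block, capacity=0.125):
    """Route the top `capacity` fraction of positions through `block`."""
    k = max(1, int(len(xs) * capacity))
    # indices of the k highest-scoring positions
    chosen = sorted(range(len(xs)), key=lambda i: router_score(xs[i]), reverse=True)[:k]
    out = list(xs)
    for i in chosen:
        out[i] = xs[i] + block(xs[i])  # residual update on selected positions only
    return out, sorted(chosen)

# Toy example with scalar "hidden states"; the router prefers large values.
xs = [0.1, 2.0, 0.3, 5.0, 0.2, 0.4, 1.5, 0.05]
out, chosen = mod_route(xs, router_score=lambda x: x, block=lambda x: 10 * x, capacity=0.25)
print(chosen)  # → [1, 3]: only the two top-scoring bytes get the extra compute
```

Because the capacity is a fixed fraction, the extra cost of the larger-window intermediate blocks stays bounded regardless of what the router learns to select.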

This architecture shares similarities with SpaceByte, but instead of applying intermediate transformer blocks at whitespace, Pico learns end-to-end where to insert them.

I also experimented with multi-byte prediction, which lets the model predict not only the next byte but also the next few bytes. This has been shown to improve the performance of byte-level models (though probably not at the scale of my experiments), and it also enables self-speculative decoding at inference time.

Other architectural choices include:

  • Sliding window, grouped-query attention
  • SwiGLU
  • ALiBi positional encoding (might try RoPE later)
  • SOAP optimizer for pretraining
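For reference, the ALiBi bias is simple enough to write out in full (this sketches the general technique from the ALiBi paper, not Pico's code): each head adds a fixed linear penalty `-slope * (query_pos - key_pos)` to its pre-softmax attention scores, with a geometric schedule of slopes across heads instead of any learned position embedding.

```python
# ALiBi in miniature: fixed linear attention biases, no position embeddings.

def alibi_slopes(n_heads: int) -> list[float]:
    """Geometric slope schedule from the ALiBi paper (power-of-two head counts)."""
    return [2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)]

def alibi_bias(slope: float, seq_len: int) -> list[list[float]]:
    """Causal bias matrix: 0 on the diagonal, more negative the further back the key."""
    return [[-slope * (q - k) if k <= q else float("-inf") for k in range(seq_len)]
            for q in range(seq_len)]

print(alibi_slopes(8))     # [0.5, 0.25, ..., 0.00390625]
print(alibi_bias(0.5, 4))  # recent keys are penalized least; future keys are masked
```

Because the bias depends only on distance, ALiBi extrapolates gracefully to sequences longer than those seen in training, which is convenient for byte-level inputs.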

Here is an example of the output of a ~10M parameters Pico model trained on the TinyStories dataset:

```
Once_ upon_ a time, there_ was a little_ girl named_ Lily._ She_ loved_ to_ play_ outside in the_ park._ One_ day,_ she_ saw a bee flying around_ her_ and said, "Hello, bee! Do_ you_ want to_ play_ with_ me?_"

The bee nodded_ and said, "Yes, I_ would love_ to_ join me,_ too!_"
```

The `_` marks the bytes where the model decided to insert the intermediate transformer blocks. As we can see, the model has somehow chosen to insert them at word boundaries, like a form of "tokenization".

When the intermediate transformer blocks are completely disabled, the model can still generate well-formed words, but loses coherence at the sentence level:

Once upon a time in the wet borided in the safe in all the wonderful and thought again. He smilly came of the brave enough to listen that missides! He said whines waved as she funny thanks the weak down that is all the bricycle.

This suggests that the intermediate blocks may produce "higher-level" representations that help the model maintain coherence over longer ranges.

That being said, I also observed harder-to-interpret patterns on models trained on other datasets, so it's not clear that the Mixture-of-Depths is systematically used by the model as a form of tokenization.

Usage

```bash
# Installation
uv sync

# Initialize a new model
python picolm.py init models/my-model --dim 372 --latent-num-blocks 8 --fb-att-window-size 32

# Train the model
python picolm.py train models/my-model my-train-run roneneldan/TinyStories --dataset-column-name text --batch-size 8

# Generate text
python picolm.py run models/my-model my-train-run --temperature 0.8 --prompt "Once upon a time"
```

Owner

  • Name: Loris Nezan
  • Login: loristns
  • Kind: user
  • Location: France
  • Company: @Ikomia-dev

i like centering divs and pretending my python scripts are sentient

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Nezan"
  given-names: "Loris"
title: "Pico byte-level language model architecture"
license: Apache-2.0
date-released: "2025-01-25"
url: "https://github.com/loristns/pico"

GitHub Events

Total
  • Watch event: 2
  • Push event: 16
  • Public event: 1
  • Pull request review event: 2
  • Pull request review comment event: 2
  • Pull request event: 1
  • Create event: 2
Last Year
  • Watch event: 2
  • Push event: 16
  • Public event: 1
  • Pull request review event: 2
  • Pull request review comment event: 2
  • Pull request event: 1
  • Create event: 2

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • loristns (1)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

pyproject.toml pypi
  • accelerate >=1.3.0
  • datasets >=3.0.2
  • einops >=0.8.0
  • flash-attn *
  • pydantic >=2.10.6
  • safetensors >=0.5.2
  • setuptools >=75.8.0
  • torch >=2.5.1
  • triton >=3.0.0
  • typer >=0.12.5
  • wandb >=0.19.4
uv.lock pypi
  • aiohappyeyeballs 2.4.4
  • aiohttp 3.11.11
  • aiosignal 1.3.2
  • annotated-types 0.7.0
  • attrs 24.3.0
  • certifi 2024.12.14
  • charset-normalizer 3.4.1
  • click 8.1.8
  • colorama 0.4.6
  • datasets 3.2.0
  • dill 0.3.8
  • docker-pycreds 0.4.0
  • einops 0.8.0
  • filelock 3.17.0
  • flash-attn 2.7.3
  • frozenlist 1.5.0
  • fsspec 2024.9.0
  • gitdb 4.0.12
  • gitpython 3.1.44
  • huggingface-hub 0.27.1
  • idna 3.10
  • jinja2 3.1.5
  • markdown-it-py 3.0.0
  • markupsafe 3.0.2
  • mdurl 0.1.2
  • mpmath 1.3.0
  • multidict 6.1.0
  • multiprocess 0.70.16
  • networkx 3.4.2
  • numpy 2.2.2
  • nvidia-cublas-cu12 12.1.3.1
  • nvidia-cuda-cupti-cu12 12.1.105
  • nvidia-cuda-nvrtc-cu12 12.1.105
  • nvidia-cuda-runtime-cu12 12.1.105
  • nvidia-cudnn-cu12 9.1.0.70
  • nvidia-cufft-cu12 11.0.2.54
  • nvidia-curand-cu12 10.3.2.106
  • nvidia-cusolver-cu12 11.4.5.107
  • nvidia-cusparse-cu12 12.1.0.106
  • nvidia-nccl-cu12 2.21.5
  • nvidia-nvjitlink-cu12 12.8.61
  • nvidia-nvtx-cu12 12.1.105
  • packaging 24.2
  • pandas 2.2.3
  • pico 0.1.0
  • platformdirs 4.3.6
  • propcache 0.2.1
  • protobuf 5.29.3
  • psutil 6.1.1
  • pyarrow 19.0.0
  • pydantic 2.10.6
  • pydantic-core 2.27.2
  • pygments 2.19.1
  • python-dateutil 2.9.0.post0
  • pytz 2024.2
  • pyyaml 6.0.2
  • requests 2.32.3
  • rich 13.9.4
  • ruff 0.9.3
  • safetensors 0.5.2
  • sentry-sdk 2.20.0
  • setproctitle 1.3.4
  • setuptools 75.8.0
  • shellingham 1.5.4
  • six 1.17.0
  • smmap 5.0.2
  • sympy 1.13.1
  • torch 2.5.1+cu121
  • tqdm 4.67.1
  • triton 3.1.0
  • typer 0.15.1
  • typing-extensions 4.12.2
  • tzdata 2025.1
  • urllib3 2.3.0
  • wandb 0.19.4
  • xxhash 3.5.0
  • yarl 1.18.3