pico

A byte-level language model architecture

https://github.com/loristns/pico

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.5%) to scientific vocabulary
Last synced: 6 months ago

Repository

A byte-level language model architecture

Basic Info
  • Host: GitHub
  • Owner: loristns
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 1020 KB
Statistics
  • Stars: 2
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 2 years ago · Last pushed about 1 year ago
Metadata Files
Readme License Citation

README.md

🤏Pico

Pico is my experimental decoder-only language model architecture, designed to bypass tokenization by processing raw bytes directly.
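To make the "no tokenizer" idea concrete, here is a minimal sketch of what byte-level input looks like (illustrative, not code from the repo): every UTF-8 byte becomes a token id, so the vocabulary has a fixed size of 256 and no learned merges.

```python
# Byte-level "tokenization": each UTF-8 byte is a token id in [0, 256).
# No learned vocabulary is needed, and decoding is just the inverse map.

def encode(text: str) -> list[int]:
    """Map text to a sequence of byte ids (vocabulary size 256)."""
    return list(text.encode("utf-8"))

def decode(ids: list[int]) -> str:
    """Map byte ids back to text."""
    return bytes(ids).decode("utf-8", errors="replace")

ids = encode("Pico 🤏")
print(ids)          # five ASCII bytes, then the 4-byte UTF-8 emoji
print(decode(ids))  # round-trips to "Pico 🤏"
```

The trade-off is that sequences get several times longer than with a subword tokenizer, which is what motivates the sliding-window and Mixture-of-Depths machinery below.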

Pico architecture

To offset the computational cost of working at the byte level, the model employs sliding window attention over a small context. To reintroduce longer-range dependencies, intermediate transformer blocks operate on larger windows, but they are selectively applied over a small fixed fraction of the bytes using a Mixture-of-Depths router.
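The routing idea can be sketched in a few lines (a toy illustration of top-k Mixture-of-Depths routing; the function names, scalar states, and capacity value are mine, not taken from the Pico source): a learned router scores every byte position, only the top fraction is sent through the expensive intermediate block, and the rest pass through unchanged on the residual path.

```python
# Toy Mixture-of-Depths routing: apply `block` only to the top `capacity`
# fraction of positions, ranked by a router score; others are untouched.

def mod_route(xs, router_score, block, capacity=0.125):
    """Route the top `capacity` fraction of positions through `block`."""
    k = max(1, int(len(xs) * capacity))
    # indices of the k highest-scoring positions
    chosen = sorted(range(len(xs)), key=lambda i: router_score(xs[i]), reverse=True)[:k]
    out = list(xs)
    for i in chosen:
        out[i] = xs[i] + block(xs[i])  # residual update on selected positions only
    return out, sorted(chosen)

# Toy example with scalar "hidden states"; the router prefers large values.
xs = [0.1, 2.0, 0.3, 5.0, 0.2, 0.4, 1.5, 0.05]
out, chosen = mod_route(xs, router_score=lambda x: x, block=lambda x: 10 * x, capacity=0.25)
print(chosen)  # → [1, 3]: only the two top-scoring bytes get the extra compute
```

Because the capacity is a fixed fraction, the extra cost of the larger-window intermediate blocks stays bounded regardless of what the router learns to select.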

This architecture shares similarities with SpaceByte, but instead of applying intermediate transformer blocks at whitespace, Pico learns end-to-end where to insert them.

I also experimented with multi-byte prediction, which lets the model predict not only the next byte but also the next few bytes. This has been shown to improve the performance of byte-level models (though probably not at the scale of my experiments), and it also enables self-speculative decoding at inference time.

Other architectural choices include:

  • Sliding window, grouped-query attention
  • SwiGLU
  • ALiBi positional encoding (might try RoPE later)
  • SOAP optimizer for pretraining
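For reference, the ALiBi bias is simple enough to write out in full (this sketches the general technique from the ALiBi paper, not Pico's code): each head adds a fixed linear penalty `-slope * (query_pos - key_pos)` to its pre-softmax attention scores, with a geometric schedule of slopes across heads instead of any learned position embedding.

```python
# ALiBi in miniature: fixed linear attention biases, no position embeddings.

def alibi_slopes(n_heads: int) -> list[float]:
    """Geometric slope schedule from the ALiBi paper (power-of-two head counts)."""
    return [2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)]

def alibi_bias(slope: float, seq_len: int) -> list[list[float]]:
    """Causal bias matrix: 0 on the diagonal, more negative the further back the key."""
    return [[-slope * (q - k) if k <= q else float("-inf") for k in range(seq_len)]
            for q in range(seq_len)]

print(alibi_slopes(8))     # [0.5, 0.25, ..., 0.00390625]
print(alibi_bias(0.5, 4))  # recent keys are penalized least; future keys are masked
```

Because the bias depends only on distance, ALiBi extrapolates gracefully to sequences longer than those seen in training, which is convenient for byte-level inputs.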

Here is an example of the output of a ~10M parameters Pico model trained on the TinyStories dataset:

```
Once_ upon_ a time, there_ was a little_ girl named_ Lily._ She_ loved_ to_ play_ outside in the_ park._ One_ day,_ she_ saw a bee flying around_ her_ and said, "Hello, bee! Do_ you_ want to_ play_ with_ me?_"

The bee nodded_ and said, "Yes, I_ would love_ to_ join me,_ too!_"
```

The `_` marks the bytes where the model decided to insert the intermediate transformer blocks. As we can see, the model has somehow chosen to insert them at word boundaries, like a form of "tokenization".

When the intermediate transformer blocks are completely disabled, the model can still generate well-formed words, but loses coherence at the sentence level:

Once upon a time in the wet borided in the safe in all the wonderful and thought again. He smilly came of the brave enough to listen that missides! He said whines waved as she funny thanks the weak down that is all the bricycle.

This suggests that the intermediate blocks may produce "higher-level" representations that help the model maintain coherence over longer ranges.

That being said, I also observed harder-to-interpret patterns on models trained on other datasets, so it's not clear that the Mixture-of-Depths is systematically used by the model as a form of tokenization.

Usage

```bash
# Installation
uv sync

# Initialize a new model
python picolm.py init models/my-model --dim 372 --latent-num-blocks 8 --fb-att-window-size 32

# Train the model
python picolm.py train models/my-model my-train-run roneneldan/TinyStories --dataset-column-name text --batch-size 8

# Generate text
python picolm.py run models/my-model my-train-run --temperature 0.8 --prompt "Once upon a time"
```

Owner

  • Name: Loris Nezan
  • Login: loristns
  • Kind: user
  • Location: France
  • Company: @Ikomia-dev

i like centering divs and pretending my python scripts are sentient

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Nezan"
  given-names: "Loris"
title: "Pico byte-level language model architecture"
license: Apache-2.0
date-released: "2025-01-25"
url: "https://github.com/loristns/pico"

GitHub Events

Total
  • Watch event: 2
  • Push event: 16
  • Public event: 1
  • Pull request review event: 2
  • Pull request review comment event: 2
  • Pull request event: 1
  • Create event: 2
Last Year
  • Watch event: 2
  • Push event: 16
  • Public event: 1
  • Pull request review event: 2
  • Pull request review comment event: 2
  • Pull request event: 1
  • Create event: 2

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • loristns (1)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

pyproject.toml pypi
  • accelerate >=1.3.0
  • datasets >=3.0.2
  • einops >=0.8.0
  • flash-attn *
  • pydantic >=2.10.6
  • safetensors >=0.5.2
  • setuptools >=75.8.0
  • torch >=2.5.1
  • triton >=3.0.0
  • typer >=0.12.5
  • wandb >=0.19.4
uv.lock pypi
  • aiohappyeyeballs 2.4.4
  • aiohttp 3.11.11
  • aiosignal 1.3.2
  • annotated-types 0.7.0
  • attrs 24.3.0
  • certifi 2024.12.14
  • charset-normalizer 3.4.1
  • click 8.1.8
  • colorama 0.4.6
  • datasets 3.2.0
  • dill 0.3.8
  • docker-pycreds 0.4.0
  • einops 0.8.0
  • filelock 3.17.0
  • flash-attn 2.7.3
  • frozenlist 1.5.0
  • fsspec 2024.9.0
  • gitdb 4.0.12
  • gitpython 3.1.44
  • huggingface-hub 0.27.1
  • idna 3.10
  • jinja2 3.1.5
  • markdown-it-py 3.0.0
  • markupsafe 3.0.2
  • mdurl 0.1.2
  • mpmath 1.3.0
  • multidict 6.1.0
  • multiprocess 0.70.16
  • networkx 3.4.2
  • numpy 2.2.2
  • nvidia-cublas-cu12 12.1.3.1
  • nvidia-cuda-cupti-cu12 12.1.105
  • nvidia-cuda-nvrtc-cu12 12.1.105
  • nvidia-cuda-runtime-cu12 12.1.105
  • nvidia-cudnn-cu12 9.1.0.70
  • nvidia-cufft-cu12 11.0.2.54
  • nvidia-curand-cu12 10.3.2.106
  • nvidia-cusolver-cu12 11.4.5.107
  • nvidia-cusparse-cu12 12.1.0.106
  • nvidia-nccl-cu12 2.21.5
  • nvidia-nvjitlink-cu12 12.8.61
  • nvidia-nvtx-cu12 12.1.105
  • packaging 24.2
  • pandas 2.2.3
  • pico 0.1.0
  • platformdirs 4.3.6
  • propcache 0.2.1
  • protobuf 5.29.3
  • psutil 6.1.1
  • pyarrow 19.0.0
  • pydantic 2.10.6
  • pydantic-core 2.27.2
  • pygments 2.19.1
  • python-dateutil 2.9.0.post0
  • pytz 2024.2
  • pyyaml 6.0.2
  • requests 2.32.3
  • rich 13.9.4
  • ruff 0.9.3
  • safetensors 0.5.2
  • sentry-sdk 2.20.0
  • setproctitle 1.3.4
  • setuptools 75.8.0
  • shellingham 1.5.4
  • six 1.17.0
  • smmap 5.0.2
  • sympy 1.13.1
  • torch 2.5.1+cu121
  • tqdm 4.67.1
  • triton 3.1.0
  • typer 0.15.1
  • typing-extensions 4.12.2
  • tzdata 2025.1
  • urllib3 2.3.0
  • wandb 0.19.4
  • xxhash 3.5.0
  • yarl 1.18.3