Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.5%) to scientific vocabulary
Repository
A byte-level language model architecture
Basic Info
Statistics
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
🤏 Pico
Pico is my experimental decoder-only language model architecture, designed to bypass tokenization by processing raw bytes directly.
To offset the computational cost of working at the byte level, the model employs sliding window attention over a small context. To reintroduce longer-range dependencies, intermediate transformer blocks operate on larger windows, but they are selectively applied over a small fixed fraction of the bytes using a Mixture-of-Depths router.
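The routing idea above can be sketched in PyTorch. This is a minimal, hypothetical illustration of a Mixture-of-Depths router, not Pico's actual code: a linear scorer picks a fixed top-k fraction of byte positions, only those positions pass through the expensive intermediate block, and the rest skip it unchanged. The class and parameter names (`MoDRouter`, `capacity`) are my own for this sketch.

```python
import torch
import torch.nn as nn

class MoDRouter(nn.Module):
    """Route a fixed fraction of byte positions through an expensive block.

    Hypothetical sketch: names and shapes are illustrative, not Pico's code.
    """
    def __init__(self, dim: int, capacity: float = 0.125):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # one scalar routing score per byte
        self.capacity = capacity         # fraction of bytes that take the block

    def forward(self, x: torch.Tensor, block: nn.Module) -> torch.Tensor:
        B, T, D = x.shape
        k = max(1, int(T * self.capacity))
        scores = self.scorer(x).squeeze(-1)           # [B, T]
        topk = scores.topk(k, dim=1).indices          # indices of routed bytes
        idx = topk.unsqueeze(-1).expand(-1, -1, D)    # [B, k, D]
        selected = x.gather(1, idx)                   # gather routed positions
        # Scale the block output by the router score so routing stays differentiable.
        gate = torch.sigmoid(scores.gather(1, topk)).unsqueeze(-1)
        updated = selected + gate * block(selected)
        return x.scatter(1, idx, updated)             # scatter back; others skip

# Usage: route 1/8 of 32 positions through a small MLP.
mlp = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
router = MoDRouter(dim=16, capacity=0.125)
out = router(torch.randn(2, 32, 16), mlp)
print(out.shape)  # torch.Size([2, 32, 16])
```

Because only `capacity * T` positions enter the intermediate block, its cost scales with the selected fraction rather than the full byte sequence.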
This architecture shares similarities with SpaceByte, but instead of applying intermediate transformer blocks at whitespace boundaries, Pico learns end-to-end where to insert them.
I also experimented with multi-byte prediction, which lets the model predict not only the next byte but the next few bytes. This has been shown to improve the performance of byte-level models (although probably not at the scale of my experiments), and it also enables self-speculative decoding at inference time.
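A minimal sketch of what multi-byte prediction could look like, assuming one parallel linear head per future offset on a shared trunk representation (the class name `MultiBytePredictor` and head layout are my own illustration, not Pico's implementation):

```python
import torch
import torch.nn as nn

class MultiBytePredictor(nn.Module):
    """Predict the next n bytes from a single hidden state via parallel heads.

    Hypothetical sketch: one linear head per future offset over a 256-way
    byte vocabulary; head i is trained to predict byte t+1+i.
    """
    def __init__(self, dim: int, n_future: int = 4, vocab: int = 256):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(dim, vocab) for _ in range(n_future))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [B, T, D] -> logits: [B, T, n_future, vocab]
        return torch.stack([head(h) for head in self.heads], dim=2)

pred = MultiBytePredictor(dim=32, n_future=4)
logits = pred(torch.randn(1, 8, 32))
print(logits.shape)  # torch.Size([1, 8, 4, 256])
```

At inference, the extra heads can draft several bytes ahead, and the drafts can then be verified in a single forward pass, which is the basis of self-speculative decoding.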
Other architectural choices include:
- Sliding window, grouped-query attention
- SwiGLU
- ALiBi positional encoding (might try RoPE later)
- SOAP optimizer for pretraining
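For reference, ALiBi replaces positional embeddings with a per-head linear distance penalty added to the attention logits. A short sketch of the bias computation (the slope schedule follows the ALiBi paper for a power-of-two head count; future positions get zero bias here and are assumed to be handled by the causal mask):

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Build the ALiBi attention bias: [n_heads, seq_len, seq_len].

    bias[h, i, j] = -slope_h * (i - j) for past positions j <= i;
    zero on the diagonal and for future positions (masked elsewhere).
    """
    # Geometric slope sequence: 2^(-8/n), 2^(-16/n), ..., 2^(-8).
    slopes = torch.tensor([2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)])
    pos = torch.arange(seq_len)
    rel = pos[None, :] - pos[:, None]          # rel[i, j] = j - i (<= 0 for past)
    return slopes[:, None, None] * rel.clamp(max=0)

bias = alibi_bias(n_heads=8, seq_len=4)
print(bias.shape)  # torch.Size([8, 4, 4])
```

The penalty grows linearly with distance, so more distant bytes are attended to less, which composes naturally with sliding-window attention.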
Here is an example of the output of a ~10M-parameter Pico model trained on the TinyStories dataset:
```
Once_ upon_ a time, there_ was a little_ girl named_ Lily._ She_ loved_ to_ play_ outside in the_ park._ One_ day,_ she_ saw a bee flying around_ her_ and said, "Hello, bee! Do_ you_ want to_ play_ with_ me?_"

The bee nodded_ and said, "Yes, I_ would love_ to_ join me,_ too!_"
```
The `_` prefix marks the bytes where the model decided to insert the intermediate transformer blocks.
As we can see, the model has somehow chosen to insert them at word boundaries, like a form of "tokenization".
When the intermediate transformer blocks are completely disabled, the model can still generate well-formed words but loses coherence at the sentence level:
Once upon a time in the wet borided in the safe in all the wonderful and thought again. He smilly came of the brave enough to listen that missides! He said whines waved as she funny thanks the weak down that is all the bricycle.
This suggests that the intermediate blocks may produce "higher-level" representations that help the model maintain coherence over longer ranges.
That being said, I also observed harder-to-interpret patterns in models trained on other datasets, so it's not clear that the Mixture-of-Depths is systematically used by the model as a form of tokenization.
Usage
```bash
# Installation
uv sync

# Initialize a new model
python picolm.py init models/my-model --dim 372 --latent-num-blocks 8 --fb-att-window-size 32

# Train the model
python picolm.py train models/my-model my-train-run roneneldan/TinyStories --dataset-column-name text --batch-size 8

# Generate text
python picolm.py run models/my-model my-train-run --temperature 0.8 --prompt "Once upon a time"
```
Owner
- Name: Loris Nezan
- Login: loristns
- Kind: user
- Location: France
- Company: @Ikomia-dev
- Website: xtns.dev
- Twitter: loristns
- Repositories: 19
- Profile: https://github.com/loristns
i like centering divs and pretending my python scripts are sentient
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Nezan"
    given-names: "Loris"
title: "Pico byte-level language model architecture"
license: Apache-2.0
date-released: "2025-01-25"
url: "https://github.com/loristns/pico"
GitHub Events
Total
- Watch event: 2
- Push event: 16
- Public event: 1
- Pull request review event: 2
- Pull request review comment event: 2
- Pull request event: 1
- Create event: 2
Last Year
- Watch event: 2
- Push event: 16
- Public event: 1
- Pull request review event: 2
- Pull request review comment event: 2
- Pull request event: 1
- Create event: 2
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- loristns (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- accelerate >=1.3.0
- datasets >=3.0.2
- einops >=0.8.0
- flash-attn *
- pydantic >=2.10.6
- safetensors >=0.5.2
- setuptools >=75.8.0
- torch >=2.5.1
- triton >=3.0.0
- typer >=0.12.5
- wandb >=0.19.4
- aiohappyeyeballs 2.4.4
- aiohttp 3.11.11
- aiosignal 1.3.2
- annotated-types 0.7.0
- attrs 24.3.0
- certifi 2024.12.14
- charset-normalizer 3.4.1
- click 8.1.8
- colorama 0.4.6
- datasets 3.2.0
- dill 0.3.8
- docker-pycreds 0.4.0
- einops 0.8.0
- filelock 3.17.0
- flash-attn 2.7.3
- frozenlist 1.5.0
- fsspec 2024.9.0
- gitdb 4.0.12
- gitpython 3.1.44
- huggingface-hub 0.27.1
- idna 3.10
- jinja2 3.1.5
- markdown-it-py 3.0.0
- markupsafe 3.0.2
- mdurl 0.1.2
- mpmath 1.3.0
- multidict 6.1.0
- multiprocess 0.70.16
- networkx 3.4.2
- numpy 2.2.2
- nvidia-cublas-cu12 12.1.3.1
- nvidia-cuda-cupti-cu12 12.1.105
- nvidia-cuda-nvrtc-cu12 12.1.105
- nvidia-cuda-runtime-cu12 12.1.105
- nvidia-cudnn-cu12 9.1.0.70
- nvidia-cufft-cu12 11.0.2.54
- nvidia-curand-cu12 10.3.2.106
- nvidia-cusolver-cu12 11.4.5.107
- nvidia-cusparse-cu12 12.1.0.106
- nvidia-nccl-cu12 2.21.5
- nvidia-nvjitlink-cu12 12.8.61
- nvidia-nvtx-cu12 12.1.105
- packaging 24.2
- pandas 2.2.3
- pico 0.1.0
- platformdirs 4.3.6
- propcache 0.2.1
- protobuf 5.29.3
- psutil 6.1.1
- pyarrow 19.0.0
- pydantic 2.10.6
- pydantic-core 2.27.2
- pygments 2.19.1
- python-dateutil 2.9.0.post0
- pytz 2024.2
- pyyaml 6.0.2
- requests 2.32.3
- rich 13.9.4
- ruff 0.9.3
- safetensors 0.5.2
- sentry-sdk 2.20.0
- setproctitle 1.3.4
- setuptools 75.8.0
- shellingham 1.5.4
- six 1.17.0
- smmap 5.0.2
- sympy 1.13.1
- torch 2.5.1+cu121
- tqdm 4.67.1
- triton 3.1.0
- typer 0.15.1
- typing-extensions 4.12.2
- tzdata 2025.1
- urllib3 2.3.0
- wandb 0.19.4
- xxhash 3.5.0
- yarl 1.18.3