llammlein

Repo used to train our German, decoder-only LLäMmlein model family from scratch. We currently provide three model sizes — 120M, 1B, and 7B — all trained on the same curated dataset for consistent scaling comparisons. All components of the training process, including code, datasets, and intermediate checkpoints, are openly released.

https://github.com/lsx-uniwue/llammlein

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.3%) to scientific vocabulary
Last synced: 6 months ago

Repository


Basic Info
Statistics
  • Stars: 0
  • Watchers: 6
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 9 months ago · Last pushed 7 months ago
Metadata Files
Readme License Citation

README.md

🐑 LLäMmlein Codebase

Welcome to the LLäMmlein codebase — used to train our German, decoder-only LLäMmlein model family from scratch. We currently provide three model sizes — 120M, 1B, and 7B — all trained on the same curated dataset for consistent scaling comparisons. All components of the training process, including code, datasets, and intermediate checkpoints, are openly released. Find more information in our paper. This repository builds on TinyLlama as its backbone.

🔍 New Highlight Features

  • ⚡️ Flash Attention 3
  • 🏎️ Fast dataloader with caching for improved speed
  • ✏️ Datapoint logging for better tracking

Model Family

All models are available on Hugging Face, along with:

  • Intermediate checkpoints
  • Data logging metadata

| Model Size | Hugging Face Link |
|------------|-------------------|
| LLäMmlein 7B | LLäMmlein_7B |
| LLäMmlein 1B | LLäMmlein_1B |
| LLäMmlein 120M | LLäMmlein_120M |

Legacy Models (Preregistered, No Data Logging)

These earlier versions were used in our initial experiments (see accompanying paper):

| Model | Hugging Face Link |
|-------|-------------------|
| LLäMmlein 1B (prerelease) | LLäMmlein_1B_prerelease |
| LLäMmlein 120M (prerelease) | LLäMmlein_120M_prerelease |

Data

All our models are trained on our filtered version of the RedPajama V2 dataset, available as LLaMmlein-Dataset. In addition to paragraph deduplication, we applied a token-to-word ratio filter to exclude low-quality or noisy text.
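The token-to-word ratio filter described above can be sketched as follows. This is a minimal illustration, not the actual pipeline code: the threshold, the helper names, and the toy tokenizer are all hypothetical; the real filter uses the project's own tokenizer and cutoff.

```python
# Minimal sketch of a token-to-word ratio filter (hypothetical threshold;
# the real pipeline's tokenizer and cutoff may differ).

def token_to_word_ratio(text: str, tokenize) -> float:
    """Ratio of tokenizer tokens to whitespace-separated words."""
    words = text.split()
    if not words:
        return float("inf")  # empty text is always filtered out
    return len(tokenize(text)) / len(words)

def keep_paragraph(text: str, tokenize, max_ratio: float = 3.0) -> bool:
    """Noisy text (URLs, markup, boilerplate) tends to explode into many
    sub-word tokens per word, so a high ratio indicates low quality."""
    return token_to_word_ratio(text, tokenize) <= max_ratio

# Toy "tokenizer": splits each word into 4-character chunks.
def toy_tokenize(text: str):
    return [w[i:i + 4] for w in text.split() for i in range(0, len(w), 4)]

print(keep_paragraph("Das ist ein normaler Satz.", toy_tokenize))           # True
print(keep_paragraph("xXx__zZz-9817__aaaaaaaaaaaaaaaaaaaaaa__xXx", toy_tokenize))  # False
```

The clean German sentence yields roughly 1.4 tokens per word and passes; the single noisy "word" fragments into 11 tokens and is dropped.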

Usage

Install Requirements 🚀

We provide our requirements in TinyLlama/cluster/building/requirements.txt. They can be installed using:

```bash
pip install -r requirements.txt
```

In addition, we provide a containerized environment using Singularity (Apptainer). The .sif image can be built using the provided definition file llammlein.def (located in TinyLlama/cluster/building/):

```bash
bash build-image.sh
```

Training

The tinyllama_fabric.py file contains the general training script. Before starting, several parameters have to be set; the most important are listed here:

  • model_name: Identifier specified in lit-gpt/config.py
  • global_batch_size: Batch size across all GPUs
  • learning_rate
  • micro_batch_size
  • max_steps
  • sharding_strategy: The efficiency of the FSDP sharding strategy can differ drastically between cluster settings
  • train_data_dir: Path to the training dataset (line 112)
  • state_dict_type: Depending on the size of the model you are training and the available GPUs, it can make sense to set this parameter to sharded instead of full (line 131)
  • tokenizer_path: Path to the tokenizer (line 417)

Once all training parameters are configured, training can be launched using a script similar to this one:

```bash
#!/bin/bash
#SBATCH --job-name="training"
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:h100:4
#SBATCH --partition=h100
#SBATCH --time=24:00:00

# Set all cluster-specific variables, i.e. HTTP_PROXY, NCCL_IB_HCA ...

export MODEL_NAME="$OUT_DIR" # location to save intermediate checkpoints
srun apptainer exec llammlein.sif python pretrain/tinyllama_fabric.py
```

Our specific .sh file with all our selected environment variables can be found in TinyLlama/cluster/exec. Training can be resumed from a specific checkpoint using:

```bash
srun apptainer exec llammlein.sif python pretrain/tinyllama_fabric.py --resume $MODEL_NAME/iter-00100000-ckpt.pth
```

Transformation

The checkpoints saved during training are not Hugging Face compatible; therefore, the exemplary script scripts/create.sh has to be executed. Please make sure to set all paths correctly in the .sh file.
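Conceptually, such conversions remap the keys of the training state dict to the names the Hugging Face implementation expects. The toy sketch below illustrates only that key-renaming idea; the `transformer.h.` → `model.layers.` mapping is made up for illustration and is not the actual mapping used by scripts/create.sh:

```python
# Toy illustration of checkpoint key remapping. The prefix mapping here is
# invented for the example; the real lit-gpt -> Hugging Face mapping lives
# in the conversion helpers invoked by scripts/create.sh.

def remap_keys(state_dict: dict, mapping: dict) -> dict:
    """Rename state-dict keys by longest-prefix substitution."""
    out = {}
    for key, tensor in state_dict.items():
        new_key = key
        for old, new in sorted(mapping.items(), key=lambda kv: -len(kv[0])):
            if key.startswith(old):
                new_key = new + key[len(old):]
                break
        out[new_key] = tensor
    return out

fake_ckpt = {"transformer.h.0.attn.weight": 0, "lm_head.weight": 1}
print(remap_keys(fake_ckpt, {"transformer.h.": "model.layers."}))
# {'model.layers.0.attn.weight': 0, 'lm_head.weight': 1}
```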


Citation

```bib
@inproceedings{pfister-etal-2025-llammlein,
    title = {{LL}{\"a}{M}mlein: Transparent, Compact and Competitive {G}erman-Only Language Models from Scratch},
    author = "Pfister, Jan and Wunderle, Julia and Hotho, Andreas",
    editor = "Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.111/",
    pages = "2227--2246",
    ISBN = "979-8-89176-251-0",
    abstract = {We transparently create two German-only decoder models, LL{\"a}Mmlein 120M and 1B, from scratch and publish them, along with the training data, for the (German) NLP research community to use. The model training involved several key steps, including data preprocessing/filtering, the creation of a German tokenizer, the training itself, as well as the evaluation of the final models on various benchmarks, also against existing models. Throughout the training process, multiple checkpoints were saved in equal intervals and analyzed using the German SuperGLEBer benchmark to gain insights into the models' learning process. Compared to state-of-the-art models on the SuperGLEBer benchmark, both LL{\"a}Mmlein models performed competitively, consistently matching or surpassing models with similar parameter sizes. The results show that the models' quality scales with size as expected, but performance improvements on some tasks plateaued early during training, offering valuable insights into resource allocation for future models.}
}
```

Owner

  • Name: Chair of Computer Science X - Data Science
  • Login: LSX-UniWue
  • Kind: organization
  • Location: Germany

Citation (CITATION.bib)

@inproceedings{pfister-etal-2025-llammlein,
    title = {{LL}{\"a}{M}mlein: Transparent, Compact and Competitive {G}erman-Only Language Models from Scratch},
    author = "Pfister, Jan  and
      Wunderle, Julia  and
      Hotho, Andreas",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.111/",
    pages = "2227--2246",
    ISBN = "979-8-89176-251-0",
    abstract = {We transparently create two German-only decoder models, LL{\"a}Mmlein 120M and 1B, from scratch and publish them, along with the training data, for the (German) NLP research community to use. The model training involved several key steps, including data preprocessing/filtering, the creation of a German tokenizer, the training itself, as well as the evaluation of the final models on various benchmarks, also against existing models. Throughout the training process, multiple checkpoints were saved in equal intervals and analyzed using the German SuperGLEBer benchmark to gain insights into the models' learning process.Compared to state-of-the-art models on the SuperGLEBer benchmark, both LL{\"a}Mmlein models performed competitively, consistently matching or surpassing models with similar parameter sizes. The results show that the models' quality scales with size as expected, but performance improvements on some tasks plateaued early during training, offering valuable insights into resource allocation for future models.}
}

GitHub Events

Total
  • Watch event: 1
  • Push event: 1
  • Create event: 1
Last Year
  • Watch event: 1
  • Push event: 1
  • Create event: 1

Dependencies

TinyLlama/chat_gradio/requirements.txt pypi
  • gradio >=4.13.0
  • torch >=2.0
  • transformers >=4.35.0
TinyLlama/cluster/building/requirements.txt pypi
  • Cython ==0.29.28
  • GitPython ==3.1.43
  • Jinja2 ==3.1.3
  • MarkupSafe ==2.1.5
  • Pillow ==9.0.1
  • PyJWT ==2.10.1
  • PyYAML ==6.0.2
  • Pygments ==2.18.0
  • aiohappyeyeballs ==2.4.4
  • aiohttp ==3.11.10
  • aiosignal ==1.3.2
  • annotated-types ==0.7.0
  • anyio ==4.7.0
  • arrow ==1.3.0
  • async-timeout ==5.0.1
  • attrs ==24.3.0
  • autocommand ==2.2.2
  • backoff ==2.2.1
  • backports.tarfile ==1.2.0
  • beautifulsoup4 ==4.12.3
  • blessed ==1.20.0
  • boto3 ==1.35.81
  • botocore ==1.35.81
  • certifi ==2024.12.14
  • charset-normalizer ==3.4.0
  • click ==8.1.7
  • croniter ==1.4.1
  • dateutils ==0.6.12
  • deepdiff ==6.7.1
  • deepspeed ==0.16.2
  • docker-pycreds ==0.4.0
  • docstring_parser ==0.16
  • docutils ==0.17.1
  • editor ==1.6.6
  • einops ==0.8.0
  • exceptiongroup ==1.2.2
  • fastapi ==0.115.6
  • filelock ==3.13.1
  • flash-attn ==2.7.2.post1
  • flashattn-hopper ==3.0.0b1
  • frozenlist ==1.5.0
  • fsspec ==2023.12.2
  • gitdb ==4.0.11
  • h11 ==0.14.0
  • hjson ==3.1.0
  • huggingface-hub ==0.27.0
  • idna ==3.10
  • importlib_metadata ==8.0.0
  • importlib_resources ==6.4.5
  • inflect ==7.3.1
  • inquirer ==3.4.0
  • itsdangerous ==2.2.0
  • jaraco.collections ==5.1.0
  • jaraco.context ==5.3.0
  • jaraco.functools ==4.0.1
  • jaraco.text ==3.12.1
  • jmespath ==1.0.1
  • joblib ==1.4.2
  • jsonargparse ==4.35.0
  • lightning ==2.1.2
  • lightning-cloud ==0.5.52
  • lightning-utilities ==0.11.9
  • markdown-it-py ==3.0.0
  • mdurl ==0.1.2
  • more-itertools ==10.3.0
  • mpmath ==1.3.0
  • msgpack ==1.1.0
  • multidict ==6.1.0
  • networkx ==3.2.1
  • ninja ==1.11.1.3
  • nltk ==3.9.1
  • numpy ==2.2.0
  • nvidia-cublas-cu12 ==12.4.5.8
  • nvidia-cuda-cupti-cu12 ==12.4.127
  • nvidia-cuda-nvrtc-cu12 ==12.4.127
  • nvidia-cuda-runtime-cu12 ==12.4.127
  • nvidia-cudnn-cu12 ==9.1.0.70
  • nvidia-cufft-cu12 ==11.2.1.3
  • nvidia-curand-cu12 ==10.3.5.147
  • nvidia-cusolver-cu12 ==11.6.1.9
  • nvidia-cusparse-cu12 ==12.3.1.170
  • nvidia-nccl-cu12 ==2.21.5
  • nvidia-nvjitlink-cu12 ==12.4.127
  • nvidia-nvtx-cu12 ==12.4.127
  • olefile ==0.46
  • ordered-set ==4.1.0
  • packaging ==24.2
  • pandas ==2.2.3
  • platformdirs ==4.3.6
  • propcache ==0.2.1
  • protobuf ==5.29.1
  • psutil ==5.9.8
  • py-cpuinfo ==9.0.0
  • pyarrow ==18.1.0
  • pydantic ==2.10.3
  • pydantic_core ==2.27.1
  • python-dateutil ==2.9.0.post0
  • python-multipart ==0.0.19
  • pytorch-lightning ==2.4.0
  • pytz ==2024.2
  • readchar ==4.2.1
  • regex ==2024.11.6
  • requests ==2.32.3
  • rich ==13.9.4
  • roman ==3.3
  • runs ==1.2.2
  • s3transfer ==0.10.4
  • safetensors ==0.4.5
  • sentencepiece ==0.2.0
  • sentry-sdk ==2.19.2
  • setproctitle ==1.3.4
  • six ==1.17.0
  • smmap ==5.0.1
  • sniffio ==1.3.1
  • soupsieve ==2.6
  • starlette ==0.41.3
  • starsessions ==1.3.0
  • sympy ==1.13.1
  • tokenizers ==0.21.0
  • tomli ==2.0.1
  • torch ==2.5.1
  • torchmetrics ==1.6.0
  • tqdm ==4.67.1
  • traitlets ==5.14.3
  • transformers ==4.47.0
  • triton ==3.1.0
  • typeguard ==4.3.0
  • types-python-dateutil ==2.9.0.20241206
  • typeshed_client ==2.7.0
  • typing_extensions ==4.12.2
  • tzdata ==2024.2
  • urllib3 ==2.2.3
  • uvicorn ==0.34.0
  • wandb ==0.19.1
  • wcwidth ==0.2.13
  • websocket-client ==1.8.0
  • websockets ==11.0.3
  • xformers ==0.0.28.post3
  • xmod ==1.8.1
  • yarl ==1.18.3
  • zipp ==3.19.2
  • zstandard ==0.23.0
  • zstd ==1.5.5.1