loro-main

Official implementation of ICLR 2025 'LORO: Parameter and Memory Efficient Pretraining via Low-rank Riemannian Optimization'

https://github.com/mzf666/loro-main

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.0%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: mzf666
License: apache-2.0
Language: Python
Default Branch: main
Size: 2.15 MB

Statistics

Stars: 5
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme License Citation

Official Implementation of Low-rank Riemannian Optimizer (ICLR 2025)

Overview

This repo contains the pre-release implementation of the Low-rank Riemannian Optimizer (LORO), proposed in 'LORO: Parameter and Memory Efficient Pretraining via Low-rank Riemannian Optimization' (ICLR 2025).

To achieve efficient yet effective pretraining for low-rank language models, this repo implements a Low-rank Riemannian Optimizer (LORO). At each exact LORO update step, the low-rank factor pairs $\mathbf{B}$ and $\mathbf{A}$ are jointly updated to ensure their full-size product $(\mathbf{BA})$ moves along the steepest descent direction on the low-rank manifold, without the need to compute any memory-intensive full-size matrices or gradients.
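The claim that no full-size matrix is needed can be illustrated with the standard tangent-space projection for fixed-rank matrices. The sketch below is not the paper's equation 10; it assumes the usual Euclidean metric on the rank-$r$ manifold, and the function name `tangent_projection_factored` is hypothetical. It shows that the projected gradient can be written as $\mathbf{B}\mathbf{X} + \mathbf{Y}\mathbf{A}$ using only the factor gradients $\partial L/\partial \mathbf{B} = \mathbf{G}\mathbf{A}^\top$ and $\partial L/\partial \mathbf{A} = \mathbf{B}^\top\mathbf{G}$ plus two $r \times r$ inverses.

```python
import numpy as np


def tangent_projection_factored(B, A, grad_B, grad_A):
    """Return (X, Y) such that B @ X + Y @ A equals the projection of the
    full gradient G onto the tangent space of the rank-r manifold at B @ A.

    Only grad_B = G @ A.T and grad_A = B.T @ G are used, so the full
    m-by-n gradient G never has to be materialized.
    """
    BtB_inv = np.linalg.inv(B.T @ B)  # r x r
    AAt_inv = np.linalg.inv(A @ A.T)  # r x r
    Y = grad_B @ AAt_inv              # m x r, so Y @ A = G P_A
    # r x n, chosen so that B @ X = P_B G - P_B G P_A
    X = BtB_inv @ (grad_A - (B.T @ grad_B) @ AAt_inv @ A)
    return X, Y


# Small check against the textbook formula P_B G + G P_A - P_B G P_A.
rng = np.random.default_rng(0)
m, n, r = 6, 5, 2
B = rng.standard_normal((m, r))
A = rng.standard_normal((r, n))
G = rng.standard_normal((m, n))  # formed here only to verify the result

X, Y = tangent_projection_factored(B, A, G @ A.T, B.T @ G)

P_B = B @ np.linalg.inv(B.T @ B) @ B.T  # projector onto the column space of B
P_A = A.T @ np.linalg.inv(A @ A.T) @ A  # projector onto the row space of A
reference = P_B @ G + G @ P_A - P_B @ G @ P_A
print(np.allclose(B @ X + Y @ A, reference))
```

The factored form costs $O((m+n)r^2)$ on top of the usual backward pass, which is why an update of this kind stays cheap when $r$ is much smaller than $m$ and $n$.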

In practice, LORO periodically employs an exact low-rank Riemannian update step (equation 10), while in between, it uses the approximate low-rank Riemannian update (equation 12). LORO accumulates the gradients of $\mathbf{B}$ and $\mathbf{A}$ using the same momentum strategy as the Adam optimizer. Pseudocode for LORO is outlined below.
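Separately from the authors' pseudocode, the alternation between exact and approximate steps can be sketched as a simple schedule. This is a minimal illustration, assuming the exact step fires whenever the step counter is a multiple of the refresh frequency; the helper `loro_step_kind` is hypothetical and the optimizer in this repo may count steps differently.

```python
def loro_step_kind(step, freq):
    """Return "exact" on every freq-th step and "approx" otherwise."""
    if freq <= 0:
        raise ValueError("freq must be positive")
    return "exact" if step % freq == 0 else "approx"


# With freq=4, steps 4 and 8 are exact and the rest are approximate.
kinds = [loro_step_kind(step, freq=4) for step in range(1, 9)]
print(kinds)
```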

Updates

Currently, we are exploring LORO on larger models and datasets with lower ranks. We welcome any discussions, suggestions, and feedback on LORO.

31-Jan-2025: LORO implementation for LLaMA-60M / 130M / 350M / 1B released.
23-Jan-2025: LORO is accepted to ICLR 2025.

Installation

To install the LORO optimizer from source:

```bash
git clone git@github.com:mzf666/LORO-main.git
cd LORO-main
pip install -e .
```

To install the dependencies:

```bash
pip install -r requirements.txt
```

Usage

Our scripts are tested on NVIDIA A100 GPUs with Python 3.8.12 and PyTorch 2.0.0.

Train low-rank LLMs from scratch with LORO

```python
from loro_torch.lowrank_module import apply_lowrank_param, get_lowrank_param
from loro_torch.loro_optim import LOROAdamW

# load model, e.g. LLaMA-60M
model_config = load_config(...)
model = load_model(model_config)

# apply low-rank parameterization
rank = 128
apply_lowrank_param(
    model,
    model_config,
    model_type="llama",
    scope="all",
    attn_rank=rank,
    mlp_rank=rank,
    init="xavier",
)

# load LORO optimizer
param_groups = get_lowrank_param(model, model_config)

optimizer = LOROAdamW(
    param_groups,
    lr=args.lr,
    weight_decay=args.weight_decay,
    loro_type=args.loro_type,
    model=model,
)

# train model with exact / approximate LORO steps
...
loss = model(**data_batch).loss
loss.backward()
optimizer.step()
...
```
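To see why a low-rank parameterization saves parameters, compare a full $m \times n$ weight with its factors $\mathbf{B}$ ($m \times r$) and $\mathbf{A}$ ($r \times n$). The sizes below are illustrative assumptions, not values read from this repo's configs, and both helper functions are hypothetical.

```python
def full_param_count(m, n):
    """Number of entries in a dense m-by-n weight matrix."""
    return m * n


def lowrank_param_count(m, n, r):
    """Number of entries in the factors B (m-by-r) and A (r-by-n)."""
    return r * (m + n)


m, n, r = 512, 512, 128  # illustrative layer size and rank
full = full_param_count(m, n)
low = lowrank_param_count(m, n, r)
print(full, low, low / full)
```

The saving grows as the rank shrinks relative to the layer width: the ratio is $r(m+n)/(mn)$, which is $2r/m$ for a square layer.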

Pre-Training LLaMA on C4 dataset

run_c4.py is the main script for training low-rank LLaMA models on C4 with LORO. Our benchmark scripts for models of various sizes are in the scripts/loro_c4 folder. To train a rank-128 LLaMA-60M model on C4, execute the following:

```bash
rank=128
freq=500 # use exact LORO step every ${freq} iterations

CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nproc_per_node 1 run_c4.py \
    --single_gpu \
    --model_config configs/llama_60m.json \
    --dtype bfloat16 \
    --batch_size 256 \
    --total_batch_size 512 \
    --num_training_steps 10000 \
    --save_every 1000 \
    --eval_every 1000 \
    --lr 0.01 \
    --scheduler "cosine_restart" \
    --warmup_steps 1000 \
    --min_lr_ratio 0.1 \
    --cosine_restart_freq $freq \
    --lr_adjust_steps -1000 \
    --weight_decay 0 \
    --optimizer loro_adamw \
    --loro_refresh all \
    --loro_refresh_freq $freq \
    --loro_scope all \
    --loro_init xavier \
    --loro_attn_rank $rank \
    --loro_mlp_rank $rank \
    --loro_type loro \
    --loro_freq $freq \
    --loro_lr_scaler -1
```
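The command above passes a per-process batch size of 256 and a total batch size of 512 on a single process. One plausible reading, common in training scripts of this style, is that the difference is covered by gradient accumulation. The sketch below shows that arithmetic; it is an assumption about how run_c4.py treats these flags rather than something this README states, and `grad_accumulation_steps` is a hypothetical helper.

```python
def grad_accumulation_steps(total_batch_size, batch_size, world_size):
    """Micro-batches to accumulate per optimizer update (assumed convention)."""
    per_pass = batch_size * world_size  # samples processed in one forward/backward pass
    if total_batch_size % per_pass != 0:
        raise ValueError("total_batch_size must be divisible by batch_size * world_size")
    return total_batch_size // per_pass


steps = grad_accumulation_steps(total_batch_size=512, batch_size=256, world_size=1)
print(steps)  # 2
```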

Fine-Tuning RoBERTa on GLUE tasks

run_glue.py is the main script for fine-tuning RoBERTa models on GLUE tasks with LORO. Note that in fine-tuning scenarios, we adopt the LoRA parameterization $\mathbf{W}+\mathbf{BA}$, where $\mathbf{W}$ is the full-size pretrained weight and we only train the low-rank factors $\mathbf{B}$ and $\mathbf{A}$. An example script is shown below:

```bash
CUDA_VISIBLE_DEVICES=0 python run_glue.py \
    --model_name_or_path roberta-base \
    --task_name mrpc \
    --max_length 512 \
    --seed 0 \
    --optimizer "loro_adamw" \
    --per_device_train_batch_size 16 \
    --num_train_epochs 20 \
    --learning_rate 0.0002 \
    --lr_scheduler_type linear \
    --weight_decay 0 \
    --loro_type "loro" \
    --loro_rank 8 \
    --loro_alpha 8 \
    --loro_freq 100 \
    --loro_init "xavier" \
    --loro_scope "qv" \
    --loro_lr_scaler 1
```
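The LoRA parameterization $\mathbf{W}+\mathbf{BA}$ used for fine-tuning can be sketched as a forward pass that never forms the full-size product $\mathbf{BA}$. The `alpha / rank` scaling is the common LoRA convention and is assumed here; whether run_glue.py applies `--loro_alpha` and `--loro_rank` in exactly this way is not confirmed by this README. The function name and shapes are illustrative.

```python
import numpy as np


def lora_forward(x, W, B, A, alpha, rank):
    """Compute x @ (W + (alpha / rank) * B @ A).T without forming B @ A.

    x: (batch, d_in), W: (d_out, d_in) frozen, B: (d_out, rank), A: (rank, d_in).
    """
    scale = alpha / rank
    return x @ W.T + scale * (x @ A.T) @ B.T


rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 16, 12, 8, 8
x = rng.standard_normal((4, d_in))
W = rng.standard_normal((d_out, d_in))  # frozen pretrained weight
B = rng.standard_normal((d_out, rank))  # trainable factor
A = rng.standard_normal((rank, d_in))   # trainable factor

y = lora_forward(x, W, B, A, alpha, rank)
y_dense = x @ (W + (alpha / rank) * (B @ A)).T  # reference using the full product
print(np.allclose(y, y_dense))
```

With rank 8 and alpha 8, as in the example command, the scale factor is 1, so under this convention the adapted weight is exactly $\mathbf{W}+\mathbf{BA}$.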

Thanks

This repo borrows heavily from GaLore.

Citation

```bibtex
@inproceedings{
    LORO_iclr2025,
    title={Parameter and Memory Efficient Pretraining via Low-rank Riemannian Optimization},
    author={Zhanfeng Mo and Long-kai Huang and Sinno Jialin Pan},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=i0zzO7Hslk}
}
```

Owner

Login: mzf666
Kind: user

Repositories: 1
Profile: https://github.com/mzf666

GitHub Events

Total

Watch event: 10
Issue comment event: 1
Push event: 6
Create event: 2

Last Year

Watch event: 10
Issue comment event: 1
Push event: 6
Create event: 2

Dependencies

requirements.txt pypi

bitsandbytes ==0.43.1
datasets ==2.20.0
evaluate ==0.4.2
huggingface_hub ==0.28.0
lion-pytorch ==0.2.2
loguru ==0.7.2
matplotlib ==3.7.0
numpy ==1.24.4
nvitop ==1.3.2
pandas ==2.0.0
scikit-learn ==1.3.0
scipy ==1.10.0
tensorly ==0.8.1
tokenizers ==0.13.3
torch ==2.0.0
torchaudio ==2.0.1
torchvision ==0.15.1
transformers ==4.31.0
wandb ==0.17.3

setup.py pypi
