loro-main
Official implementation of ICLR 2025 'LORO: Parameter and Memory Efficient Pretraining via Low-rank Riemannian Optimization'
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: mzf666
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 2.15 MB
Statistics
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Official Implementation of Low-rank Riemannian Optimizer (ICLR 2025)
Overview
This repo contains the pre-release implementation of the Low-rank Riemannian Optimizer (LORO), proposed in LORO: Parameter and Memory Efficient Pretraining via Low-rank Riemannian Optimization, ICLR 2025.
To achieve efficient yet effective pretraining for low-rank language models, this repo implements a Low-rank Riemannian Optimizer (LORO). At each exact LORO update step, the low-rank factor pairs $\mathbf{B}$ and $\mathbf{A}$ are jointly updated to ensure their full-size product $(\mathbf{BA})$ moves along the steepest descent direction on the low-rank manifold, without the need to compute any memory-intensive full-size matrices or gradients.
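As a minimal NumPy sketch of why no full-size gradient is needed: under the usual embedded metric, the steepest descent direction on the fixed-rank manifold is the projection of the gradient onto the tangent space at $\mathbf{BA}$, and that projection can be assembled from the factor gradients $\nabla_{\mathbf{B}} = \mathbf{G}\mathbf{A}^\top$ and $\nabla_{\mathbf{A}} = \mathbf{B}^\top\mathbf{G}$ alone. This illustrates the standard tangent-space projection only; it is not the repo's equation 10 or 12, and the function names are hypothetical.

```python
import numpy as np


def project_with_full_gradient(G, B, A):
    """Reference: project a dense gradient G (m x n) onto the tangent space
    of the rank-r manifold at W = B @ A. Builds m x m and n x n projectors."""
    P_col = B @ np.linalg.solve(B.T @ B, B.T)  # projector onto span of B's columns
    P_row = A.T @ np.linalg.solve(A @ A.T, A)  # projector onto span of A's rows
    return P_col @ G + G @ P_row - P_col @ G @ P_row


def project_with_factor_gradients(grad_B, grad_A, B, A):
    """Same projection from the factor gradients only, where
    grad_B = G @ A.T (m x r) and grad_A = B.T @ G (r x n).
    Only r x r linear systems are solved; G itself is never used."""
    BtB = B.T @ B
    AAt = A @ A.T
    left = B @ np.linalg.solve(BtB, grad_A)        # equals P_col @ G
    right = np.linalg.solve(AAt, grad_B.T).T @ A   # equals G @ P_row
    core = np.linalg.solve(BtB, B.T @ grad_B)      # (B^T B)^{-1} B^T G A^T
    core = np.linalg.solve(AAt, core.T).T          # ... times (A A^T)^{-1}
    # Returned dense here only so the two results can be compared.
    return left + right - B @ core @ A


rng = np.random.default_rng(0)
m, n, r = 6, 5, 2
B, A = rng.standard_normal((m, r)), rng.standard_normal((r, n))
G = rng.standard_normal((m, n))  # stand-in for the full-size gradient

dense = project_with_full_gradient(G, B, A)
factored = project_with_factor_gradients(G @ A.T, B.T @ G, B, A)
print(np.allclose(dense, factored))
```

The two routes agree up to floating-point error, which is the sense in which a jointly updated pair of factors can follow the manifold's steepest descent direction while only ever touching $m \times r$, $r \times n$, and $r \times r$ quantities.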
In practice, LORO periodically performs an exact low-rank Riemannian update step (equation 10) and uses the approximate low-rank Riemannian update (equation 12) in between. LORO accumulates the gradients of $\mathbf{B}$ and $\mathbf{A}$ using the same momentum strategy as the Adam optimizer. Pseudo-code for LORO is outlined below.
Updates
Currently, we are exploring LORO on larger models and datasets with lower ranks. We welcome any discussions, suggestions, and feedback on LORO.
- 31-Jan-2025: LORO implementation for LLaMA-60M / 130M / 350M / 1B released.
- 23-Jan-2025: LORO is accepted to ICLR 2025.
Installation
To install the LORO optimizer from source:
```bash
git clone git@github.com:mzf666/LORO-main.git
cd LORO-main
pip install -e .
```
To install the dependencies:
```bash
pip install -r requirements.txt
```
Usage
Our scripts are tested on NVIDIA A100 GPUs with Python 3.8.12 and PyTorch 2.0.0.
Train low-rank LLMs from scratch with LORO
```python
from loro_torch.lowrank_module import apply_lowrank_param, get_lowrank_param
from loro_torch.loro_optim import LOROAdamW

# load model, e.g. LLaMA-60M
model_config = load_config(...)
model = load_model(model_config)

# apply low-rank parameterization
rank = 128
apply_lowrank_param(
    model,
    model_config,
    model_type="llama",
    scope="all",
    attn_rank=rank,
    mlp_rank=rank,
    init="xavier",
)

# load LORO optimizer
param_groups = get_lowrank_param(model, model_config)

optimizer = LOROAdamW(
    param_groups,
    lr=args.lr,
    weight_decay=args.weight_decay,
    loro_type=args.loro_type,
    model=model,
)

# train model with exact / approximate LORO steps
...
loss = model(**data_batch).loss
loss.backward()
optimizer.step()
...
```
Pre-Training LLaMA on C4 dataset
run_c4.py is the main script for training low-rank LLaMA models on C4 with LORO. Benchmark scripts for the various model sizes are in the scripts/loro_c4 folder. To train a rank-128 LLaMA-60M model on C4, run the following:
```bash
rank=128
freq=500 # use exact LORO step every ${freq} iteration

CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nproc_per_node 1 run_c4.py \
    --single_gpu \
    --model_config configs/llama_60m.json \
    --dtype bfloat16 \
    --batch_size 256 \
    --total_batch_size 512 \
    --num_training_steps 10000 \
    --save_every 1000 \
    --eval_every 1000 \
    --lr 0.01 \
    --scheduler "cosine_restart" \
    --warmup_steps 1000 \
    --min_lr_ratio 0.1 \
    --cosine_restart_freq $freq \
    --lr_adjust_steps -1000 \
    --weight_decay 0 \
    --optimizer loro_adamw \
    --loro_refresh all \
    --loro_refresh_freq $freq \
    --loro_scope all \
    --loro_init xavier \
    --loro_attn_rank $rank \
    --loro_mlp_rank $rank \
    --loro_type loro \
    --loro_freq $freq \
    --loro_lr_scaler -1
```
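To make the exact / approximate split in this command concrete, here is a small sketch of the schedule implied by `freq=500` over 10000 training steps. It assumes steps are counted from 1 and that the exact step fires whenever the step index is a multiple of `freq`; the trigger condition actually used in run_c4.py may differ, and `loro_step_kind` is a hypothetical helper, not part of the repo.

```python
def loro_step_kind(step, freq):
    """Return "exact" on every freq-th step and "approx" otherwise.

    Assumption: 1-indexed steps, exact step when step is a multiple of freq.
    """
    return "exact" if step % freq == 0 else "approx"


num_training_steps, freq = 10000, 500  # values taken from the command above
kinds = [loro_step_kind(s, freq) for s in range(1, num_training_steps + 1)]
print(kinds.count("exact"), kinds.count("approx"))  # 20 9980
```

Under these assumptions only 20 of the 10000 steps use the exact update, so the per-step cost is dominated by the approximate update.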
Fine-Tuning RoBERTa on GLUE tasks
run_glue.py is the main script for fine-tuning RoBERTa models on GLUE tasks with LORO. Note that in fine-tuning scenarios, we adopt the LoRA parameterization $\mathbf{W}+\mathbf{BA}$, where $\mathbf{W}$ is the frozen full-size pretrained weight and only the low-rank factors $\mathbf{B}$ and $\mathbf{A}$ are trained. An example script is shown below:
```bash
CUDA_VISIBLE_DEVICES=0 python run_glue.py \
    --model_name_or_path roberta-base \
    --task_name mrpc \
    --max_length 512 \
    --seed 0 \
    --optimizer "loro_adamw" \
    --per_device_train_batch_size 16 \
    --num_train_epochs 20 \
    --learning_rate 0.0002 \
    --lr_scheduler_type linear \
    --weight_decay 0 \
    --loro_type "loro" \
    --loro_rank 8 \
    --loro_alpha 8 \
    --loro_freq 100 \
    --loro_init "xavier" \
    --loro_scope "qv" \
    --loro_lr_scaler 1
```
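For a sense of scale in the $\mathbf{W}+\mathbf{BA}$ parameterization, the sketch below counts trainable parameters for a single projection at the rank used in the example script. It assumes a 768 x 768 weight (the roberta-base hidden size) and ignores biases; the helper names are hypothetical and not part of the repo.

```python
def full_param_count(m, n):
    """Parameters in a dense m x n weight W."""
    return m * n


def lowrank_param_count(m, n, r):
    """Parameters in the factors B (m x r) and A (r x n)."""
    return r * (m + n)


m = n = 768  # roberta-base hidden size, so query/value projections are 768 x 768
r = 8        # matches --loro_rank 8 in the script above
print(lowrank_param_count(m, n, r), full_param_count(m, n))  # 12288 589824
```

With these numbers the trainable factors hold about 2% as many parameters as the frozen weight they adapt.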
Thanks
This repo borrows heavily from GaLore.
Citation
```bibtex
@inproceedings{LORO_iclr2025,
  title={Parameter and Memory Efficient Pretraining via Low-rank Riemannian Optimization},
  author={Zhanfeng Mo and Long-kai Huang and Sinno Jialin Pan},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=i0zzO7Hslk}
}
```
Owner
- Login: mzf666
- Kind: user
- Repositories: 1
- Profile: https://github.com/mzf666
GitHub Events
Total
- Watch event: 10
- Issue comment event: 1
- Push event: 6
- Create event: 2
Last Year
- Watch event: 10
- Issue comment event: 1
- Push event: 6
- Create event: 2
Dependencies
- bitsandbytes ==0.43.1
- datasets ==2.20.0
- evaluate ==0.4.2
- huggingface_hub ==0.28.0
- lion-pytorch ==0.2.2
- loguru ==0.7.2
- matplotlib ==3.7.0
- numpy ==1.24.4
- nvitop ==1.3.2
- pandas ==2.0.0
- scikit-learn ==1.3.0
- scipy ==1.10.0
- tensorly ==0.8.1
- tokenizers ==0.13.3
- torch ==2.0.0
- torchaudio ==2.0.1
- torchvision ==0.15.1
- transformers ==4.31.0
- wandb ==0.17.3