loro-main

Official implementation of ICLR 2025 'LORO: Parameter and Memory Efficient Pretraining via Low-rank Riemannian Optimization'

https://github.com/mzf666/loro-main


Repository

Basic Info
  • Host: GitHub
  • Owner: mzf666
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 2.15 MB
Statistics
  • Stars: 5
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed about 1 year ago

README.md

Official Implementation of Low-rank Riemannian Optimizer (ICLR 2025)

Overview

This repo contains the pre-release implementation of the Low-rank Riemannian Optimizer (LORO), proposed in LORO: Parameter and Memory Efficient Pretraining via Low-rank Riemannian Optimization, ICLR 2025.

To achieve efficient yet effective pretraining for low-rank language models, this repo implements a Low-rank Riemannian Optimizer (LORO). At each exact LORO update step, the low-rank factor pairs $\mathbf{B}$ and $\mathbf{A}$ are jointly updated to ensure their full-size product $(\mathbf{BA})$ moves along the steepest descent direction on the low-rank manifold, without the need to compute any memory-intensive full-size matrices or gradients.
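As background for why factor-level quantities can suffice, the block below writes out the textbook tangent-space projection on the manifold of rank-$r$ matrices under the embedded Euclidean metric. It is a sketch under assumptions, not necessarily the form used in the paper's update equations: $\mathbf{B}$ and $\mathbf{A}$ are assumed to have full rank $r$, $\mathbf{G}$ denotes the full-size Euclidean gradient with respect to $\mathbf{W}=\mathbf{BA}$ (introduced here only for the derivation), and $\mathbf{G}_B=\mathbf{G}\mathbf{A}^\top$, $\mathbf{G}_A=\mathbf{B}^\top\mathbf{G}$ are the factor gradients given by the chain rule.

$$
\begin{aligned}
\mathbf{P}_B &= \mathbf{B}(\mathbf{B}^\top\mathbf{B})^{-1}\mathbf{B}^\top, \qquad
\mathbf{P}_A = \mathbf{A}^\top(\mathbf{A}\mathbf{A}^\top)^{-1}\mathbf{A}, \\
\operatorname{grad} L(\mathbf{W}) &= \mathbf{P}_B\mathbf{G} + \mathbf{G}\mathbf{P}_A - \mathbf{P}_B\mathbf{G}\mathbf{P}_A \\
&= \mathbf{B}(\mathbf{B}^\top\mathbf{B})^{-1}\mathbf{G}_A
 + \mathbf{G}_B(\mathbf{A}\mathbf{A}^\top)^{-1}\mathbf{A}
 - \mathbf{B}(\mathbf{B}^\top\mathbf{B})^{-1}\mathbf{B}^\top\mathbf{G}_B(\mathbf{A}\mathbf{A}^\top)^{-1}\mathbf{A}.
\end{aligned}
$$

The last line involves only $\mathbf{B}$, $\mathbf{A}$, their gradients, and $r \times r$ inverses, so the projected direction can be represented without ever forming $\mathbf{G}$ explicitly.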

In practice, LORO periodically employs an exact low-rank Riemannian update step (equation 10), while in between, it uses the approximated low-rank Riemannian update (equation 12). LORO accumulates the gradients of $\mathbf{B}$ and $\mathbf{A}$ using the same momentum strategy as the Adam optimizer. Pseudo-code for LORO is outlined below.
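The alternation between exact and approximate steps can be illustrated with a few lines of Python. This is only a sketch of the schedule, not the paper's pseudo-code: the helper `loro_step_kind` and its argument `exact_freq` are hypothetical names, and the choice to treat step 0 as an exact step is an assumption.

```python
def loro_step_kind(step: int, exact_freq: int) -> str:
    """Return which kind of update to apply at a 0-indexed training step.

    Assumption (not taken from the repo): an exact step is used whenever
    `step` is a multiple of `exact_freq`; all other steps are approximate.
    """
    if exact_freq <= 0:
        raise ValueError("exact_freq must be a positive integer")
    return "exact" if step % exact_freq == 0 else "approx"


# With exact_freq=500, only steps 0, 500, 1000, ... use the exact update.
schedule = [loro_step_kind(s, exact_freq=500) for s in range(1001)]
num_exact = schedule.count("exact")
```

Under these assumptions, the vast majority of steps use the cheaper approximate update, which matches the intent described above.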


Updates

Currently, we are exploring LORO on larger models and datasets with lower ranks. We welcome any discussions, suggestions, and feedback on LORO.

  • 31-Jan-2025: LORO implementation for LLaMA-60M / 130M / 350M / 1B released.
  • 23-Jan-2025: LORO is accepted to ICLR 2025.

Installation

To install the LORO optimizer from source:

```bash
git clone git@github.com:mzf666/LORO-main.git
cd LORO-main
pip install -e .
```

To install the dependencies:

```bash
pip install -r requirements.txt
```


Usage

Our scripts are tested on NVIDIA A100 GPUs with Python 3.8.12 and PyTorch 2.0.0.

Train low-rank LLMs from scratch with LORO

```python
from loro_torch.lowrank_module import apply_lowrank_param, get_lowrank_param
from loro_torch.loro_optim import LOROAdamW

# load model, e.g. LLaMA-60M
model_config = load_config(...)
model = load_model(model_config)

# apply low-rank parameterization
rank = 128
apply_lowrank_param(
    model,
    model_config,
    model_type="llama",
    scope="all",
    attn_rank=rank,
    mlp_rank=rank,
    init="xavier",
)

# load LORO optimizer
param_groups = get_lowrank_param(model, model_config)

optimizer = LOROAdamW(
    param_groups,
    lr=args.lr,
    weight_decay=args.weight_decay,
    loro_type=args.loro_type,
    model=model,
)

# train model with exact / approximate LORO steps
...
loss = model(**data_batch).loss
loss.backward()
optimizer.step()
...
```
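To get a feel for what the low-rank parameterization above saves, the following sketch compares the parameter count of one full-size weight with that of its rank-$r$ factors $\mathbf{B}$ and $\mathbf{A}$. The helper names and the 2048 x 2048 layer shape are illustrative assumptions, not values read from the repo's model configs.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Number of entries in a dense d_out x d_in weight matrix."""
    return d_out * d_in


def lowrank_params(d_out: int, d_in: int, rank: int) -> int:
    """Number of entries in the factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Illustrative layer shape (an assumption, not a LLaMA config value).
d_out, d_in, rank = 2048, 2048, 128
dense = full_params(d_out, d_in)
factored = lowrank_params(d_out, d_in, rank)
ratio = factored / dense
print(f"dense={dense}, factored={factored}, ratio={ratio:.3f}")
```

The saving grows as the rank shrinks relative to the layer width, since the factored count scales with `rank * (d_out + d_in)` rather than `d_out * d_in`.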

Pre-Training LLaMA on C4 dataset

run_c4.py is the main script for training low-rank LLaMA models on C4 with LORO. Our benchmark scripts for various model sizes are in the scripts/loro_c4 folder. To train a rank-128 LLaMA-60M model on C4, execute the following:

```bash
rank=128
freq=500 # use an exact LORO step every ${freq} iterations

CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nproc_per_node 1 run_c4.py \
    --single_gpu \
    --model_config configs/llama_60m.json \
    --dtype bfloat16 \
    --batch_size 256 \
    --total_batch_size 512 \
    --num_training_steps 10000 \
    --save_every 1000 \
    --eval_every 1000 \
    --lr 0.01 \
    --scheduler "cosine_restart" \
    --warmup_steps 1000 \
    --min_lr_ratio 0.1 \
    --cosine_restart_freq $freq \
    --lr_adjust_steps -1000 \
    --weight_decay 0 \
    --optimizer loro_adamw \
    --loro_refresh all \
    --loro_refresh_freq $freq \
    --loro_scope all \
    --loro_init xavier \
    --loro_attn_rank $rank \
    --loro_mlp_rank $rank \
    --loro_type loro \
    --loro_freq $freq \
    --loro_lr_scaler -1
```
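The command above passes both a per-device batch size (256) and a total batch size (512). A plausible reading, assuming the script follows the convention common in GaLore-style training scripts, is that the gap is covered by gradient accumulation. The helper below is a hypothetical sketch of that arithmetic and is not taken from run_c4.py.

```python
def grad_accumulation_steps(
    total_batch_size: int, batch_size: int, world_size: int = 1
) -> int:
    """Micro-batches to accumulate before each optimizer step.

    Assumption: total_batch_size = batch_size * world_size * accumulation.
    """
    per_step = batch_size * world_size
    if per_step <= 0 or total_batch_size % per_step != 0:
        raise ValueError(
            "total_batch_size must be a positive multiple of batch_size * world_size"
        )
    return total_batch_size // per_step


# Values from the example command: one GPU, batch 256, total batch 512.
accum = grad_accumulation_steps(total_batch_size=512, batch_size=256, world_size=1)
```

Under this assumption, the single-GPU example would accumulate gradients over two micro-batches per optimizer step.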

Fine-Tuning RoBERTa on GLUE tasks

run_glue.py is the main script for fine-tuning RoBERTa models on GLUE tasks with LORO. Note that in fine-tuning scenarios, we adopt the LoRA parameterization $\mathbf{W}+\mathbf{BA}$, where $\mathbf{W}$ is the frozen full-size pretrained weight and only the low-rank factors $\mathbf{B}$ and $\mathbf{A}$ are trained. An example script is shown below:

```bash
CUDA_VISIBLE_DEVICES=0 python run_glue.py \
    --model_name_or_path roberta-base \
    --task_name mrpc \
    --max_length 512 \
    --seed 0 \
    --optimizer "loro_adamw" \
    --per_device_train_batch_size 16 \
    --num_train_epochs 20 \
    --learning_rate 0.0002 \
    --lr_scheduler_type linear \
    --weight_decay 0 \
    --loro_type "loro" \
    --loro_rank 8 \
    --loro_alpha 8 \
    --loro_freq 100 \
    --loro_init "xavier" \
    --loro_scope "qv" \
    --loro_lr_scaler 1
```
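To make the $\mathbf{W}+\mathbf{BA}$ parameterization concrete, here is a minimal pure-Python sketch of the forward pass for a single input vector. The function names are hypothetical, and the `alpha / rank` scaling follows the usual LoRA convention, which is an assumption about how `--loro_alpha` and `--loro_rank` interact rather than something confirmed by this README.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(
    W: Matrix, B: Matrix, A: Matrix, x: Vector, alpha: float, rank: int
) -> Vector:
    """Compute (W + scale * B A) x without forming the product B A.

    Assumption: scale = alpha / rank, as in the usual LoRA convention.
    """
    scale = alpha / rank
    base = matvec(W, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))  # trainable low-rank path
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2 x 2 weight
B = [[1.0], [2.0]]            # 2 x 1 factor
A = [[0.5, 0.5]]              # 1 x 2 factor
x = [2.0, 4.0]
y = lora_forward(W, B, A, x, alpha=1.0, rank=1)
```

Applying $\mathbf{A}$ first and then $\mathbf{B}$ keeps the intermediate result at size `rank`, so the full-size product $\mathbf{BA}$ is never materialized.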


Thanks

This repo borrows heavily from GaLore.


Citation

```bibtex
@inproceedings{
    LORO_iclr2025,
    title={Parameter and Memory Efficient Pretraining via Low-rank Riemannian Optimization},
    author={Zhanfeng Mo and Long-kai Huang and Sinno Jialin Pan},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=i0zzO7Hslk}
}
```

Owner

  • Login: mzf666
  • Kind: user

GitHub Events

Total
  • Watch event: 10
  • Issue comment event: 1
  • Push event: 6
  • Create event: 2
Last Year
  • Watch event: 10
  • Issue comment event: 1
  • Push event: 6
  • Create event: 2

Dependencies

requirements.txt pypi
  • bitsandbytes ==0.43.1
  • datasets ==2.20.0
  • evaluate ==0.4.2
  • huggingface_hub ==0.28.0
  • lion-pytorch ==0.2.2
  • loguru ==0.7.2
  • matplotlib ==3.7.0
  • numpy ==1.24.4
  • nvitop ==1.3.2
  • pandas ==2.0.0
  • scikit-learn ==1.3.0
  • scipy ==1.10.0
  • tensorly ==0.8.1
  • tokenizers ==0.13.3
  • torch ==2.0.0
  • torchaudio ==2.0.1
  • torchvision ==0.15.1
  • transformers ==4.31.0
  • wandb ==0.17.3
setup.py pypi