Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.9%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: zqiu24
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 1.62 MB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 12 months ago · Last pushed 12 months ago
Metadata Files
Readme License Citation

README.md

GaLore

This repo contains the pre-release version of the GaLore algorithm, proposed in the paper GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection.

Gradient Low-Rank Projection (GaLore) is a memory-efficient low-rank training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods, such as LoRA. As a gradient projection method, GaLore is independent of the choice of optimizers and can be easily plugged into existing ones with only two lines of code, as shown in Algorithm 1 below.

Image 2

News

  • 2024-09-01: We are working on GaLore 2, a more efficient and accessible version of GaLore. Please stay tuned!
  • 2024-07-11: We released Q-GaLore: Quantized GaLore with INT4 Projection. [paper] [code]
  • 2024-07-01: GaLore was accepted to ICML 2024 as an oral presentation!
  • 2024-04-20: Please join our Slack workspace GaLore-Social to discuss with us and the community.

Installation

Install GaLore optimizer

Install from pip:

```bash
conda create -n galore python=3.10 -y
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install -r exp_requirements.txt

pip uninstall galore-torch
```

or if you want to install from source:

```bash
git clone git@github.com:jiaweizzhao/GaLore.git
cd GaLore
pip install -e .
```

Install experiment dependencies

```bash
pip install -r exp_requirements.txt
```

Our experiment scripts are tested on Python 3.8 with PyTorch 2.1.

Usage

Save optimizer memory using GaLore optimizers

```python
from galore_torch import GaLoreAdamW, GaLoreAdamW8bit, GaLoreAdafactor

# define param groups as galore_params and non_galore_params
param_groups = [{'params': non_galore_params},
                {'params': galore_params, 'rank': 128, 'update_proj_gap': 200, 'scale': 0.25, 'proj_type': 'std'}]
optimizer = GaLoreAdamW(param_groups, lr=0.01)
```
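The two parameter lists used in the param groups above have to be built by the caller. One plausible way to do so is sketched below; it is an assumption, not necessarily the rule this repo applies. The helper name `split_param_groups` and the default `target_keys` are hypothetical, and the sketch assumes only 2-D weights whose names match a key should be projected. Stand-in objects are used so the example runs without PyTorch.

```python
from types import SimpleNamespace

def split_param_groups(named_params, target_keys=("attn", "mlp")):
    """Split (name, param) pairs into GaLore and non-GaLore lists.

    Assumption: only 2-D weights whose name contains one of `target_keys`
    are projected; everything else (biases, norms, embeddings) is not.
    """
    galore_params, non_galore_params = [], []
    for name, param in named_params:
        is_matrix = getattr(param, "ndim", 0) == 2
        if is_matrix and any(key in name for key in target_keys):
            galore_params.append(param)
        else:
            non_galore_params.append(param)
    return galore_params, non_galore_params

# Stand-in parameters; with a real model, pass model.named_parameters() instead.
fake_named_params = [
    ("embed_tokens.weight", SimpleNamespace(ndim=2)),
    ("layers.0.self_attn.q_proj.weight", SimpleNamespace(ndim=2)),
    ("layers.0.mlp.up_proj.weight", SimpleNamespace(ndim=2)),
    ("layers.0.input_layernorm.weight", SimpleNamespace(ndim=1)),
]
galore_params, non_galore_params = split_param_groups(fake_named_params)
print(len(galore_params), len(non_galore_params))  # 2 2
```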

Save weight gradient memory using per-layer weight updates

We use register_post_accumulate_grad_hook provided by PyTorch (torch>=2.1.0) to enable per-layer weight updates. An example is shown below:

```python
# define an optimizer for each parameter p, and store them in optimizer_dict
optimizer_dict = {}
for p in model.parameters():
    if p.requires_grad:
        optimizer_dict[p] = GaLoreAdamW([{'params': p, 'rank': 128, 'update_proj_gap': 200, 'scale': 0.25, 'proj_type': 'std'}], lr=0.01)

# define a hook function to update the parameter p during the backward pass
def optimizer_hook(p):
    if p.grad is None:
        return
    optimizer_dict[p].step()
    optimizer_dict[p].zero_grad()

# Register the hook onto every parameter
for p in model.parameters():
    if p.requires_grad:
        p.register_post_accumulate_grad_hook(optimizer_hook)
```

More details can be found in torchrun_main.py.
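A rough back-of-the-envelope calculation shows why per-layer updates reduce weight gradient memory. The sketch below is illustrative only: the helper `gradient_memory_bytes` and the layer sizes are hypothetical, and it assumes each gradient is freed as soon as its hook has run.

```python
def gradient_memory_bytes(layer_param_counts, bytes_per_value=2, per_layer=False):
    """Peak bytes held by weight gradients.

    Without per-layer updates, every gradient stays alive until the single
    optimizer step. With them, only the gradient of the layer currently
    being updated is alive (assuming it is freed right after the hook runs).
    """
    if per_layer:
        return max(layer_param_counts) * bytes_per_value
    return sum(layer_param_counts) * bytes_per_value

# Hypothetical model: 8 layers of 4096 x 4096 weights, gradients in bfloat16.
layers = [4096 * 4096] * 8
full = gradient_memory_bytes(layers)
peak = gradient_memory_bytes(layers, per_layer=True)
print(full // 2**20, "MiB vs", peak // 2**20, "MiB")  # 256 MiB vs 32 MiB
```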

Benchmark 1: Pre-Training LLaMA on C4 dataset

torchrun_main.py is the main script for training LLaMA models on C4 with GaLore. Our benchmark scripts for various sizes of models are in scripts/benchmark_c4 folder. For example, to train a 60m model on C4, do the following:

```bash
# LLaMA-60M, GaLore-Adam, 1 A100, 1 Node
torchrun --standalone --nproc_per_node 1 torchrun_main.py \
    --model_config configs/llama_60m.json \
    --lr 0.01 \
    --galore_scale 0.25 \
    --rank 128 \
    --update_proj_gap 200 \
    --batch_size 256 \
    --total_batch_size 512 \
    --num_training_steps 10000 \
    --warmup_steps 1000 \
    --weight_decay 0 \
    --dtype bfloat16 \
    --eval_every 1000 \
    --optimizer galore_adamw
```
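The command above passes both a batch size (256) and a total batch size (512). A common reading, assumed here rather than confirmed by this README, is that the first is the per-GPU micro-batch and the second is the global batch per optimizer step, with the difference covered by gradient accumulation. The helper below is a hypothetical sketch of that arithmetic; check torchrun_main.py for the exact rule.

```python
def gradient_accumulation_steps(total_batch_size, batch_size, world_size=1):
    """Micro-batches to accumulate per optimizer step (assumed relation)."""
    per_step = batch_size * world_size
    if total_batch_size % per_step != 0:
        raise ValueError("total_batch_size must be divisible by batch_size * world_size")
    return total_batch_size // per_step

# Values from the 60M command, on 1 GPU and on a hypothetical 2-GPU node.
print(gradient_accumulation_steps(512, 256))                # 2
print(gradient_accumulation_steps(512, 256, world_size=2))  # 1
```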

Train 7B model with a single GPU with 24GB memory

To train a 7B model on a single GPU such as an NVIDIA RTX 4090, you only need to specify --optimizer=galore_adamw8bit_per_layer, which enables GaLoreAdamW8bit with per-layer weight updates. With activation checkpointing, a batch size of 16 fits in memory (tested on an NVIDIA RTX 4090).

```bash
# LLaMA-7B, 8-bit GaLore-Adam, single GPU, activation checkpointing
# bsz=16, 22.8G,
torchrun --standalone --nproc_per_node 1 torchrun_main.py \
    --model_config configs/llama_7b.json \
    --lr 0.005 \
    --galore_scale 0.25 \
    --rank 1024 \
    --update_proj_gap 500 \
    --batch_size 16 \
    --total_batch_size 512 \
    --activation_checkpointing \
    --num_training_steps 150000 \
    --warmup_steps 15000 \
    --weight_decay 0 \
    --grad_clipping 1.0 \
    --dtype bfloat16 \
    --eval_every 1000 \
    --single_gpu \
    --optimizer galore_adamw8bit_per_layer
```

Currently, per-layer weight updates are only supported for single-GPU training (--single_gpu), without nn.parallel.DistributedDataParallel. We are working on supporting multi-GPU training with per-layer weight updates.

Benchmark 2: Fine-Tuning RoBERTa on GLUE tasks

run_glue.py is the main script for fine-tuning RoBERTa models on GLUE tasks with GaLore. An example script is shown below:

```bash
python run_glue.py \
    --model_name_or_path roberta-base \
    --task_name mrpc \
    --enable_galore \
    --lora_all_modules \
    --max_length 512 \
    --seed=1234 \
    --lora_r 4 \
    --galore_scale 4 \
    --per_device_train_batch_size 16 \
    --update_proj_gap 500 \
    --learning_rate 3e-5 \
    --num_train_epochs 30 \
    --output_dir results/ft/roberta_base/mrpc
```

Citation

```bibtex
@misc{zhao2024galore,
      title={GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection},
      author={Jiawei Zhao and Zhenyu Zhang and Beidi Chen and Zhangyang Wang and Anima Anandkumar and Yuandong Tian},
      year={2024},
      eprint={2403.03507},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```

Owner

  • Name: Zeju Qiu
  • Login: zqiu24
  • Kind: user
  • Location: Munich

Citation (CITATION.cff)

cff-version: 1.2.0
title: "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection"
version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Zhao"
    given-names: "Jiawei"
year: 2024
repository-code: "https://arxiv.org/abs/2403.03507"

GitHub Events

Total
  • Member event: 1
  • Push event: 4
Last Year
  • Member event: 1
  • Push event: 4

Dependencies

exp_requirements.txt pypi
  • bitsandbytes *
  • datasets *
  • evaluate *
  • lion-pytorch *
  • loguru *
  • matplotlib *
  • nvitop *
  • peft *
  • scikit-learn *
  • scipy *
  • tokenizers *
  • transformers ==4.31.0
  • wandb *
requirements.txt pypi
  • bitsandbytes *
  • tensorly *
  • torch *
  • transformers ==4.31.0
setup.py pypi