galore-torch

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

https://github.com/jiaweizzhao/galore

Science Score: 64.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
    1 of 6 committers (16.7%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.4%) to scientific vocabulary
Last synced: 6 months ago

Repository

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Basic Info
  • Host: GitHub
  • Owner: jiaweizzhao
  • License: apache-2.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 396 KB
Statistics
  • Stars: 1,588
  • Watchers: 17
  • Forks: 160
  • Open Issues: 46
  • Releases: 0
Created almost 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

GaLore

This repo contains the pre-release version of the GaLore algorithm, proposed in GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection.

Gradient Low-Rank Projection (GaLore) is a memory-efficient low-rank training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods, such as LoRA. As a gradient projection method, GaLore is independent of the choice of optimizers and can be easily plugged into existing ones with only two lines of code, as shown in Algorithm 1 below.
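The core idea can be sketched in a few lines (an illustrative sketch only, not the galore_torch internals; `update_projection`, the rank of 4, and the scale of 0.25 are assumptions for the example): project the full gradient into a low-rank subspace with an SVD-derived matrix, let any optimizer operate there, and project the resulting update back to full shape.

```python
import torch

def update_projection(grad, rank):
    # Recompute the projector from the current gradient via SVD
    # (in the paper's description, refreshed every update_proj_gap steps).
    U, _, _ = torch.linalg.svd(grad, full_matrices=False)
    return U[:, :rank]  # (m, rank)

m, n, rank = 64, 32, 4
W = torch.randn(m, n)      # a weight matrix
grad = torch.randn(m, n)   # its full-rank gradient

P = update_projection(grad, rank)    # (m, rank) projector
low_rank_grad = P.T @ grad           # (rank, n): optimizer state lives at this size
# ... any optimizer (Adam, Adafactor, ...) would transform low_rank_grad here ...
full_update = P @ low_rank_grad      # project the update back to (m, n)
W = W - 0.01 * 0.25 * full_update    # lr * scale, matching the README's settings
```

The memory saving comes from the optimizer state: for Adam, the moment buffers shrink from m×n to rank×n per projected matrix.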

(Image: Algorithm 1)

News

  • 2024-09-01: We are working on GaLore 2, which is a more efficient and accessible version of GaLore. Please stay tuned!
  • 2024-07-11: We release Q-GaLore: Quantized GaLore with INT4 Projection. [paper] [code]

  • 2024-07-01: GaLore is accepted to ICML 2024 as an Oral presentation!

  • 2024-04-20: Please join our Slack workspace GaLore-Social to discuss with us and the community.

Installation

Install GaLore optimizer

Install from pip:

```bash
pip install galore-torch
```

or if you want to install from source:

```bash
git clone git@github.com:jiaweizzhao/GaLore.git
cd GaLore
pip install -e .
```

Install experiment dependencies

```bash
pip install -r exp_requirements.txt
```

Our experiment scripts are tested on Python 3.8 with PyTorch 2.1.

Usage

Save optimizer memory using GaLore optimizers

```python
from galore_torch import GaLoreAdamW, GaLoreAdamW8bit, GaLoreAdafactor

# define param groups as galore_params and non_galore_params
param_groups = [
    {'params': non_galore_params},
    {'params': galore_params, 'rank': 128, 'update_proj_gap': 200, 'scale': 0.25, 'proj_type': 'std'},
]
optimizer = GaLoreAdamW(param_groups, lr=0.01)
```
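The snippet above leaves galore_params and non_galore_params undefined. One common split (an assumption for illustration, not a rule enforced by the library) is to route 2D weight matrices through GaLore and keep biases and other 1D parameters in a regular group:

```python
import torch.nn as nn

# A toy model standing in for a real network
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# Assumption: GaLore projects matrices, so pick the 2D weights;
# everything else (biases here) stays in a plain parameter group.
galore_params = [p for p in model.parameters() if p.dim() == 2]
non_galore_params = [p for p in model.parameters() if p.dim() != 2]

param_groups = [
    {'params': non_galore_params},
    {'params': galore_params, 'rank': 128, 'update_proj_gap': 200,
     'scale': 0.25, 'proj_type': 'std'},
]
# optimizer = GaLoreAdamW(param_groups, lr=0.01)  # from galore_torch
```

A real model would need a more careful selection (e.g. only attention and MLP projection weights), but the group structure stays the same.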

Save weight gradient memory using per-layer weight updates

We use register_post_accumulate_grad_hook provided by PyTorch (torch>=2.1.0) to enable per-layer weight updates. An example is shown below:

```python
# define an optimizer for each parameter p, and store them in optimizer_dict
for p in model.parameters():
    if p.requires_grad:
        optimizer_dict[p] = GaLoreAdamW(
            [{'params': p, 'rank': 128, 'update_proj_gap': 200, 'scale': 0.25, 'proj_type': 'std'}],
            lr=0.01,
        )

# define a hook function to update the parameter p during the backward pass
def optimizer_hook(p):
    if p.grad is None:
        return
    optimizer_dict[p].step()
    optimizer_dict[p].zero_grad()

# register the hook onto every parameter
for p in model.parameters():
    if p.requires_grad:
        p.register_post_accumulate_grad_hook(optimizer_hook)
```

More details can be found in torchrun_main.py.

Benchmark 1: Pre-Training LLaMA on C4 dataset

torchrun_main.py is the main script for training LLaMA models on C4 with GaLore. Our benchmark scripts for various model sizes are in the scripts/benchmark_c4 folder. For example, to train a 60M model on C4, run the following:

```bash
# LLaMA-60M, GaLore-Adam, 1 A100, 1 Node
torchrun --standalone --nproc_per_node 1 torchrun_main.py \
    --model_config configs/llama_60m.json \
    --lr 0.01 \
    --galore_scale 0.25 \
    --rank 128 \
    --update_proj_gap 200 \
    --batch_size 256 \
    --total_batch_size 512 \
    --num_training_steps 10000 \
    --warmup_steps 1000 \
    --weight_decay 0 \
    --dtype bfloat16 \
    --eval_every 1000 \
    --optimizer galore_adamw
```

Train 7B model with a single GPU with 24GB memory

To train a 7B model on a single GPU such as the NVIDIA RTX 4090, all you need to do is specify --optimizer=galore_adamw8bit_per_layer, which enables GaLoreAdamW8bit with per-layer weight updates. With activation checkpointing, you can maintain a batch size of 16, as tested on an NVIDIA RTX 4090.

```bash
# LLaMA-7B, 8-bit GaLore-Adam, single GPU, activation checkpointing
# bsz=16, 22.8G
torchrun --standalone --nproc_per_node 1 torchrun_main.py \
    --model_config configs/llama_7b.json \
    --lr 0.005 \
    --galore_scale 0.25 \
    --rank 1024 \
    --update_proj_gap 500 \
    --batch_size 16 \
    --total_batch_size 512 \
    --activation_checkpointing \
    --num_training_steps 150000 \
    --warmup_steps 15000 \
    --weight_decay 0 \
    --grad_clipping 1.0 \
    --dtype bfloat16 \
    --eval_every 1000 \
    --single_gpu \
    --optimizer galore_adamw8bit_per_layer
```

Currently, per-layer weight updates are only supported for single-GPU training (--single_gpu), without nn.parallel.DistributedDataParallel. We are working on supporting multi-GPU training with per-layer weight updates.

Benchmark 2: Fine-Tuning RoBERTa on GLUE tasks

run_glue.py is the main script for fine-tuning RoBERTa models on GLUE tasks with GaLore. An example script is shown below:

```bash
python run_glue.py \
    --model_name_or_path roberta-base \
    --task_name mrpc \
    --enable_galore \
    --lora_all_modules \
    --max_length 512 \
    --seed=1234 \
    --lora_r 4 \
    --galore_scale 4 \
    --per_device_train_batch_size 16 \
    --update_proj_gap 500 \
    --learning_rate 3e-5 \
    --num_train_epochs 30 \
    --output_dir results/ft/roberta_base/mrpc
```

Citation

```bibtex
@misc{zhao2024galore,
  title={GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection},
  author={Jiawei Zhao and Zhenyu Zhang and Beidi Chen and Zhangyang Wang and Anima Anandkumar and Yuandong Tian},
  year={2024},
  eprint={2403.03507},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```

Owner

  • Name: Jiawei Zhao
  • Login: jiaweizzhao
  • Kind: user
  • Company: Caltech

Citation (CITATION.cff)

cff-version: 1.2.0
title: "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection"
version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Zhao"
    given-names: "Jiawei"
year: 2024
repository-code: "https://github.com/jiaweizzhao/galore"

GitHub Events

Total
  • Issues event: 9
  • Watch event: 195
  • Issue comment event: 11
  • Push event: 1
  • Pull request event: 3
  • Fork event: 22
Last Year
  • Issues event: 9
  • Watch event: 195
  • Issue comment event: 11
  • Push event: 1
  • Pull request event: 3
  • Fork event: 22

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 16
  • Total Committers: 6
  • Avg Commits per committer: 2.667
  • Development Distribution Score (DDS): 0.438
Past Year
  • Commits: 7
  • Committers: 2
  • Avg Commits per committer: 3.5
  • Development Distribution Score (DDS): 0.143
Top Committers
Name Email Commits
Jiawei Zhao j****w@g****m 9
Robert r****1@g****m 3
darthjaja 1****6 1
Zhenyu (Allen) Zhang z****g@u****u 1
Hoang Manh Linh j****a@g****m 1
Andrew Gu a****u@f****m 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 58
  • Total pull requests: 16
  • Average time to close issues: 5 days
  • Average time to close pull requests: 23 days
  • Total issue authors: 55
  • Total pull request authors: 14
  • Average comments per issue: 2.03
  • Average comments per pull request: 0.38
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 10
  • Pull requests: 6
  • Average time to close issues: 5 days
  • Average time to close pull requests: N/A
  • Issue authors: 9
  • Pull request authors: 4
  • Average comments per issue: 0.4
  • Average comments per pull request: 0.17
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • CrazyElements (2)
  • nikitaved (2)
  • liveck (1)
  • imrankh46 (1)
  • hiyouga (1)
  • Edenzzzz (1)
  • bhavnicksm (1)
  • peterjc123 (1)
  • calebmor460 (1)
  • MYT677 (1)
  • threewayhandshake (1)
  • gaotianyu1350 (1)
  • wsp317 (1)
  • jie040109 (1)
  • fy817 (1)
Pull Request Authors
  • awgu (2)
  • jeromeku (2)
  • tomas-gajarsky (2)
  • yao-matrix (2)
  • eltociear (2)
  • Kyriection (2)
  • jetaudio (2)
  • Robertboy18 (2)
  • marcandrelarochelle (2)
  • darthjaja6 (2)
  • winglian (2)
  • jiaweizzhao (2)
  • Explorergt92 (2)
  • gslama12 (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 14,684 last-month
  • Total dependent packages: 1
  • Total dependent repositories: 0
  • Total versions: 1
  • Total maintainers: 1
pypi.org: galore-torch

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

  • Versions: 1
  • Dependent Packages: 1
  • Dependent Repositories: 0
  • Downloads: 14,684 Last month
Rankings
Dependent packages count: 9.7%
Average: 36.9%
Dependent repos count: 64.0%
Maintainers (1)
Last synced: 6 months ago