galore-torch
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Science Score: 64.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to arxiv.org
- ✓ Committers with academic emails: 1 of 6 committers (16.7%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.4%) to scientific vocabulary
Repository
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Statistics
- Stars: 1,588
- Watchers: 17
- Forks: 160
- Open Issues: 46
- Releases: 0
Metadata Files
README.md
GaLore
This repo contains the pre-release version of the GaLore algorithm, proposed in the paper GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection.
Gradient Low-Rank Projection (GaLore) is a memory-efficient low-rank training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods, such as LoRA. As a gradient projection method, GaLore is independent of the choice of optimizers and can be easily plugged into existing ones with only two lines of code, as shown in Algorithm 1 below.
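To make the idea of gradient projection concrete, here is a minimal pure-Python sketch of one rank-1 step. It is an illustration under simplifying assumptions, not the code shipped in `galore_torch`: the projector is found by power iteration instead of an SVD, plain SGD stands in for Adam, and the helper names (`matmul`, `top_left_singular_vector`, `galore_sgd_step`) are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def top_left_singular_vector(g, iters=100):
    """Power iteration on G G^T: dominant left singular vector as an m x 1 column.

    Assumes G is non-zero and the all-ones start vector is not orthogonal
    to the dominant direction.
    """
    ggt = matmul(g, transpose(g))
    v = [[1.0] for _ in range(len(g))]
    for _ in range(iters):
        v = matmul(ggt, v)
        norm = math.sqrt(sum(x[0] ** 2 for x in v))
        v = [[x[0] / norm] for x in v]
    return v

def galore_sgd_step(w, g, p, lr=0.1, scale=0.25):
    """One SGD step taken inside the rank-1 subspace spanned by p."""
    r = matmul(transpose(p), g)   # project: (1 x m) @ (m x n) -> 1 x n
    # A stateful optimizer would keep its moments at the shape of r, not of g.
    g_back = matmul(p, r)         # project back: (m x 1) @ (1 x n) -> m x n
    return [[wij - lr * scale * gij for wij, gij in zip(w_row, g_row)]
            for w_row, g_row in zip(w, g_back)]

# Toy rank-1 gradient, so the rank-1 projection loses nothing here.
grad = [[2.0, 4.0], [1.0, 2.0]]
projector = top_left_singular_vector(grad)
weights = galore_sgd_step([[0.0, 0.0], [0.0, 0.0]], grad, projector)
```

The point of the sketch is the shape of `r`: any per-parameter optimizer state would be stored at that reduced size, while the weight itself still receives a full-size update.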
News
- 2024-09-01: We are working on GaLore 2, which is a more efficient and accessible version of GaLore. Please stay tuned!
- 2024-07-11: We release Q-GaLore: Quantized GaLore with INT4 Projection. [paper] [code]
- 2024-07-01: GaLore is accepted to ICML 2024 as Oral!
- 2024-04-20: Please join our Slack workspace GaLore-Social to discuss with us and the community.
Installation
Install GaLore optimizer
Install from pip:
```bash
pip install galore-torch
```
or if you want to install from source:
```bash
git clone git@github.com:jiaweizzhao/GaLore.git
cd GaLore
pip install -e .
```
Install experiment dependencies
```bash
pip install -r exp_requirements.txt
```
Our experiment scripts are tested on Python 3.8 with PyTorch 2.1.
Usage
Save optimizer memory using GaLore optimizers
```python
from galore_torch import GaLoreAdamW, GaLoreAdamW8bit, GaLoreAdafactor

# define param groups as galore_params and non_galore_params
param_groups = [{'params': non_galore_params},
                {'params': galore_params, 'rank': 128, 'update_proj_gap': 200, 'scale': 0.25, 'proj_type': 'std'}]
optimizer = GaLoreAdamW(param_groups, lr=0.01)
```
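The snippet above leaves the construction of `galore_params` and `non_galore_params` to the reader. One possible way to build them is sketched below. The selection rule (trainable 2-D tensors, optionally filtered by substrings in the parameter name) is an assumption made for illustration, and `split_galore_params` and `name_keys` are hypothetical names, not part of `galore_torch`.

```python
def split_galore_params(named_params, name_keys=None):
    """Split (name, param) pairs into GaLore and non-GaLore parameter lists.

    Assumption: only trainable 2-D weight matrices are projected; biases,
    norm weights and other tensors stay on the regular update path.
    name_keys (hypothetical) optionally restricts projection to parameters
    whose name contains one of the given substrings.
    """
    galore_params, non_galore_params = [], []
    for name, p in named_params:
        if not p.requires_grad:
            continue  # frozen parameters are not handed to the optimizer
        matches = name_keys is None or any(key in name for key in name_keys)
        if p.ndim == 2 and matches:
            galore_params.append(p)
        else:
            non_galore_params.append(p)
    return galore_params, non_galore_params
```

With a PyTorch model this would be called as `split_galore_params(model.named_parameters())`. Passing something like `name_keys=("attn", "mlp")` keeps embeddings and output heads out of the projected group, if that is what you want.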
Save weight gradient memory using per-layer weight updates
We use register_post_accumulate_grad_hook provided by PyTorch (torch>=2.1.0) to enable per-layer weight updates. An example is shown below:
```python
from galore_torch import GaLoreAdamW

# define an optimizer for each parameter p, and store them in optimizer_dict
optimizer_dict = {}
for p in model.parameters():
    if p.requires_grad:
        optimizer_dict[p] = GaLoreAdamW([{'params': p, 'rank': 128, 'update_proj_gap': 200, 'scale': 0.25, 'proj_type': 'std'}], lr=0.01)

# define a hook function to update the parameter p during the backward pass
def optimizer_hook(p):
    if p.grad is None:
        return
    optimizer_dict[p].step()
    optimizer_dict[p].zero_grad()

# Register the hook onto every parameter
for p in model.parameters():
    if p.requires_grad:
        p.register_post_accumulate_grad_hook(optimizer_hook)
```
More details can be found in torchrun_main.py.
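A rough way to see why per-layer updates reduce gradient memory: if each gradient is released as soon as its hook has stepped and zeroed it, only one parameter's gradient needs to be alive at a time. The sketch below encodes that simplified model. The function name and the parameter sizes are hypothetical, and real peak memory also depends on activations and allocator behavior.

```python
def peak_grad_elements(param_sizes, per_layer_updates):
    """Peak number of gradient elements alive at once, in a simplified model.

    Assumption: with per-layer updates, each gradient is freed right after
    its parameter is updated; otherwise all gradients coexist until the
    optimizer step at the end of the backward pass.
    """
    if per_layer_updates:
        return max(param_sizes)
    return sum(param_sizes)

# Hypothetical model with four weight matrices of 4096 x 4096 elements each.
sizes = [4096 * 4096] * 4
full = peak_grad_elements(sizes, per_layer_updates=False)
per_layer = peak_grad_elements(sizes, per_layer_updates=True)
```

In this toy setting the peak drops by a factor equal to the number of equally sized parameters. The saving is smaller when one parameter dominates the model.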
Benchmark 1: Pre-Training LLaMA on C4 dataset
torchrun_main.py is the main script for training LLaMA models on C4 with GaLore. Our benchmark scripts for various model sizes are in the scripts/benchmark_c4 folder.
For example, to train a 60m model on C4, do the following:
```bash
# LLaMA-60M, GaLore-Adam, 1 A100, 1 Node
torchrun --standalone --nproc_per_node 1 torchrun_main.py \
    --model_config configs/llama_60m.json \
    --lr 0.01 \
    --galore_scale 0.25 \
    --rank 128 \
    --update_proj_gap 200 \
    --batch_size 256 \
    --total_batch_size 512 \
    --num_training_steps 10000 \
    --warmup_steps 1000 \
    --weight_decay 0 \
    --dtype bfloat16 \
    --eval_every 1000 \
    --optimizer galore_adamw
```
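The batch-size and projection flags in this command interact in ways that are easy to misread, so here is a small arithmetic sketch. It assumes that `--batch_size` is the per-device batch, that `--total_batch_size` is the effective global batch reached through gradient accumulation, and that the projector is recomputed every `--update_proj_gap` optimizer steps starting at step 0. Both helper names are hypothetical.

```python
def gradient_accumulation_steps(total_batch_size, batch_size, world_size=1):
    """Micro-batches accumulated per optimizer step (assumed relation)."""
    per_step = batch_size * world_size
    if total_batch_size % per_step != 0:
        raise ValueError("total_batch_size must be divisible by batch_size * world_size")
    return total_batch_size // per_step

def projector_refresh_steps(num_training_steps, update_proj_gap):
    """Optimizer steps at which the low-rank projector would be recomputed."""
    return list(range(0, num_training_steps, update_proj_gap))

# Values taken from the LLaMA-60M command above.
accum = gradient_accumulation_steps(total_batch_size=512, batch_size=256, world_size=1)
refreshes = projector_refresh_steps(num_training_steps=10000, update_proj_gap=200)
```

Under these assumptions the 60M run accumulates 2 micro-batches per update and refreshes the projector 50 times over 10,000 steps.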
Train 7B model with a single GPU with 24GB memory
To train a 7B model with a single GPU such as NVIDIA RTX 4090, all you need to do is to specify --optimizer=galore_adamw8bit_per_layer, which enables GaLoreAdamW8bit with per-layer weight updates.
With activation checkpointing, you can maintain a batch size of 16 (tested on an NVIDIA RTX 4090).
```bash
# LLaMA-7B, 8-bit GaLore-Adam, single GPU, activation checkpointing
# bsz=16, 22.8G
torchrun --standalone --nproc_per_node 1 torchrun_main.py \
    --model_config configs/llama_7b.json \
    --lr 0.005 \
    --galore_scale 0.25 \
    --rank 1024 \
    --update_proj_gap 500 \
    --batch_size 16 \
    --total_batch_size 512 \
    --activation_checkpointing \
    --num_training_steps 150000 \
    --warmup_steps 15000 \
    --weight_decay 0 \
    --grad_clipping 1.0 \
    --dtype bfloat16 \
    --eval_every 1000 \
    --single_gpu \
    --optimizer galore_adamw8bit_per_layer
```
Currently, the per-layer weight update technique is only supported for single-GPU training (--single_gpu), without nn.parallel.DistributedDataParallel. We are working on supporting multi-GPU training with per-layer weight updates.
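To get a feel for what `--rank 1024` buys in optimizer state, the sketch below counts state elements for a single weight matrix. It assumes the accounting used in the GaLore paper for an m x n weight: Adam keeps two moments of size m x n, while GaLore keeps one projector on the smaller dimension plus two moments in the projected space. The 4096 x 4096 shape is a hypothetical example, and the count is in elements, not bytes, so the extra effect of 8-bit states is not included.

```python
def adam_state_elements(m, n):
    """First and second moments, each the size of the full m x n weight."""
    return 2 * m * n

def galore_adam_state_elements(m, n, rank):
    """Projector on the smaller dimension plus two low-rank moments (assumed accounting)."""
    small, large = min(m, n), max(m, n)
    projector = small * rank      # e.g. an m x r matrix when m <= n
    moments = 2 * rank * large    # two r x n moment matrices when m <= n
    return projector + moments

# Hypothetical 4096 x 4096 weight with the rank used in the 7B command above.
full_state = adam_state_elements(4096, 4096)
low_rank_state = galore_adam_state_elements(4096, 4096, rank=1024)
ratio = low_rank_state / full_state
```

For this shape the low-rank state is 37.5% of the full Adam state. The ratio falls further as the rank shrinks relative to the matrix dimensions.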
Benchmark 2: Fine-Tuning RoBERTa on GLUE tasks
run_glue.py is the main script for fine-tuning RoBERTa models on GLUE tasks with GaLore. An example script is shown below:
```bash
python run_glue.py \
    --model_name_or_path roberta-base \
    --task_name mrpc \
    --enable_galore \
    --lora_all_modules \
    --max_length 512 \
    --seed=1234 \
    --lora_r 4 \
    --galore_scale 4 \
    --per_device_train_batch_size 16 \
    --update_proj_gap 500 \
    --learning_rate 3e-5 \
    --num_train_epochs 30 \
    --output_dir results/ft/roberta_base/mrpc
```
Citation
```bibtex
@misc{zhao2024galore,
      title={GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection},
      author={Jiawei Zhao and Zhenyu Zhang and Beidi Chen and Zhangyang Wang and Anima Anandkumar and Yuandong Tian},
      year={2024},
      eprint={2403.03507},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```
Owner
- Name: Jiawei Zhao
- Login: jiaweizzhao
- Kind: user
- Company: Caltech
- Website: http://jiawei-zhao.netlify.com
- Repositories: 20
- Profile: https://github.com/jiaweizzhao
Citation (CITATION.cff)
cff-version: 1.2.0
title: "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection"
version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Jiawei"
given-names: "Zhao"
year: 2024
repository-code: "https://arxiv.org/abs/2403.03507"
GitHub Events
Total
- Issues event: 9
- Watch event: 195
- Issue comment event: 11
- Push event: 1
- Pull request event: 3
- Fork event: 22
Last Year
- Issues event: 9
- Watch event: 195
- Issue comment event: 11
- Push event: 1
- Pull request event: 3
- Fork event: 22
Committers
Last synced: 10 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Jiawei Zhao | j****w@g****m | 9 |
| Robert | r****1@g****m | 3 |
| darthjaja | 1****6 | 1 |
| Zhenyu (Allen) Zhang | z****g@u****u | 1 |
| Hoang Manh Linh | j****a@g****m | 1 |
| Andrew Gu | a****u@f****m | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 58
- Total pull requests: 16
- Average time to close issues: 5 days
- Average time to close pull requests: 23 days
- Total issue authors: 55
- Total pull request authors: 14
- Average comments per issue: 2.03
- Average comments per pull request: 0.38
- Merged pull requests: 5
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 10
- Pull requests: 6
- Average time to close issues: 5 days
- Average time to close pull requests: N/A
- Issue authors: 9
- Pull request authors: 4
- Average comments per issue: 0.4
- Average comments per pull request: 0.17
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- CrazyElements (2)
- nikitaved (2)
- liveck (1)
- imrankh46 (1)
- hiyouga (1)
- Edenzzzz (1)
- bhavnicksm (1)
- peterjc123 (1)
- calebmor460 (1)
- MYT677 (1)
- threewayhandshake (1)
- gaotianyu1350 (1)
- wsp317 (1)
- jie040109 (1)
- fy817 (1)
Pull Request Authors
- awgu (2)
- jeromeku (2)
- tomas-gajarsky (2)
- yao-matrix (2)
- eltociear (2)
- Kyriection (2)
- jetaudio (2)
- Robertboy18 (2)
- marcandrelarochelle (2)
- darthjaja6 (2)
- winglian (2)
- jiaweizzhao (2)
- Explorergt92 (2)
- gslama12 (1)
Packages
- Total packages: 1
- Total downloads:
  - pypi 14,684 last-month
- Total dependent packages: 1
- Total dependent repositories: 0
- Total versions: 1
- Total maintainers: 1
pypi.org: galore-torch
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
- Homepage: https://github.com/jiaweizzhao/GaLore
- Documentation: https://galore-torch.readthedocs.io/
- License: Apache 2.0
- Latest release: 1.0 (published almost 2 years ago)