poet

https://github.com/zqiu24/poet

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.9%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: zqiu24
License: apache-2.0
Language: Python
Default Branch: main
Size: 1.62 MB

Statistics

Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0

Created over 1 year ago · Last pushed over 1 year ago

Metadata Files

Readme License Citation

GaLore

This repo contains the pre-release version of the GaLore algorithm, proposed in GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection.

Gradient Low-Rank Projection (GaLore) is a low-rank training strategy that allows full-parameter learning while using less memory than common low-rank adaptation methods such as LoRA. As a gradient projection method, GaLore is independent of the choice of optimizer and can be plugged into existing ones with only two lines of code, as shown in Algorithm 1 of the paper.
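To make the memory claim concrete, the sketch below counts optimizer-state elements for a single m x n weight matrix under full-rank Adam, GaLore, and LoRA. It follows the accounting used in the GaLore paper (one m x r projector plus two r x n Adam moments for GaLore, assuming m <= n); the function names and the 4096 x 4096 layer shape are illustrative assumptions, not part of this repo.

```python
def adam_state_elements(m: int, n: int) -> int:
    """Full-rank Adam keeps two moment tensors, each the size of the weight."""
    return 2 * m * n


def galore_state_elements(m: int, n: int, r: int) -> int:
    """One m x r projector plus two r x n Adam moments (assumes m <= n)."""
    return m * r + 2 * n * r


def lora_state_elements(m: int, n: int, r: int) -> int:
    """Two Adam moments for each adapter factor (m x r and r x n)."""
    return 2 * m * r + 2 * n * r


# Hypothetical layer shape and rank, chosen only for illustration.
m, n, r = 4096, 4096, 128
for label, count in [
    ("Adam", adam_state_elements(m, n)),
    ("GaLore", galore_state_elements(m, n, r)),
    ("LoRA", lora_state_elements(m, n, r)),
]:
    print(f"{label:>6}: {count:,} optimizer-state elements")
```

These counts cover optimizer states only; LoRA additionally stores the adapter weights themselves, while GaLore trains the original weights directly.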

News

2024-09-01: We are working on GaLore 2, which is a more efficient and accessible version of GaLore. Please stay tuned!
2024-07-11: We release Q-GaLore: Quantized GaLore with INT4 Projection. [paper] [code]
2024-07-01: GaLore is accepted to ICML 2024 as Oral!
2024-04-20: Please join our Slack workspace GaLore-Social to discuss with us and the community.

Installation

Install GaLore optimizer

Install from pip:

```bash
conda create -n galore python=3.10 -y
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install -r exp_requirements.txt

pip uninstall galore-torch
```

or if you want to install from source:

```bash
git clone git@github.com:jiaweizzhao/GaLore.git
cd GaLore
pip install -e .
```

Install experiment dependencies

```bash
pip install -r exp_requirements.txt
```

Our experiment scripts are tested on Python 3.8 with PyTorch 2.1.

Usage

Save optimizer memory using GaLore optimizers

```python
from galore_torch import GaLoreAdamW, GaLoreAdamW8bit, GaLoreAdafactor

# define param groups as galore_params and non_galore_params
param_groups = [
    {'params': non_galore_params},
    {'params': galore_params, 'rank': 128, 'update_proj_gap': 200, 'scale': 0.25, 'proj_type': 'std'},
]
optimizer = GaLoreAdamW(param_groups, lr=0.01)
```
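The snippet above leaves the construction of `galore_params` and `non_galore_params` to the user. One possible way to build them is sketched below: 2-D weight matrices whose names contain certain keywords go to the GaLore group, and everything else (embeddings, norms, biases) stays in the regular group. The helper name `split_param_groups` and the default keywords `"attn"` and `"mlp"` are assumptions suited to LLaMA-style parameter names; check them against your own model and against torchrun_main.py.

```python
from typing import Any, Iterable, List, Sequence, Tuple


def split_param_groups(
    named_params: Iterable[Tuple[str, Any]],
    target_keywords: Sequence[str] = ("attn", "mlp"),  # assumed naming scheme
) -> Tuple[List[Any], List[Any]]:
    """Split trainable parameters into (galore_params, non_galore_params).

    Only 2-D parameters whose name contains one of `target_keywords` are
    selected for low-rank gradient projection; frozen parameters are skipped.
    """
    galore_params: List[Any] = []
    non_galore_params: List[Any] = []
    for name, param in named_params:
        if not param.requires_grad:
            continue  # frozen parameters need no optimizer state
        is_matrix = param.ndim == 2
        if is_matrix and any(key in name for key in target_keywords):
            galore_params.append(param)
        else:
            non_galore_params.append(param)
    return galore_params, non_galore_params
```

With a PyTorch model this would be called as `galore_params, non_galore_params = split_param_groups(model.named_parameters())`, since `torch.nn.Parameter` exposes both `ndim` and `requires_grad`.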

Save weight gradient memory using per-layer weight updates

We use register_post_accumulate_grad_hook provided by PyTorch (torch>=2.1.0) to enable per-layer weight updates. An example is shown below:

```python
# define an optimizer for each parameter p, and store them in optimizer_dict
optimizer_dict = {}
for p in model.parameters():
    if p.requires_grad:
        optimizer_dict[p] = GaLoreAdamW([{'params': p, 'rank': 128, 'update_proj_gap': 200, 'scale': 0.25, 'proj_type': 'std'}], lr=0.01)

# define a hook function to update the parameter p during the backward pass
def optimizer_hook(p):
    if p.grad is None:
        return
    optimizer_dict[p].step()
    optimizer_dict[p].zero_grad()

# register the hook onto every parameter
for p in model.parameters():
    if p.requires_grad:
        p.register_post_accumulate_grad_hook(optimizer_hook)
```

More details can be found in torchrun_main.py.

Benchmark 1: Pre-Training LLaMA on C4 dataset

torchrun_main.py is the main script for training LLaMA models on C4 with GaLore. Benchmark scripts for various model sizes are in the scripts/benchmark_c4 folder. For example, to train a 60M model on C4, run the following:

```bash
# LLaMA-60M, GaLore-Adam, 1 A100, 1 Node
torchrun --standalone --nproc_per_node 1 torchrun_main.py \
    --model_config configs/llama_60m.json \
    --lr 0.01 \
    --galore_scale 0.25 \
    --rank 128 \
    --update_proj_gap 200 \
    --batch_size 256 \
    --total_batch_size 512 \
    --num_training_steps 10000 \
    --warmup_steps 1000 \
    --weight_decay 0 \
    --dtype bfloat16 \
    --eval_every 1000 \
    --optimizer galore_adamw
```
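The command above passes a per-device batch size of 256 but a total batch size of 512 on a single process, which suggests that the remaining factor is covered by gradient accumulation. The sketch below shows that arithmetic under the assumption that the total batch size equals batch size x number of processes x accumulation steps; the function name is hypothetical, and the exact logic in torchrun_main.py may differ.

```python
def gradient_accumulation_steps(
    total_batch_size: int, batch_size: int, world_size: int = 1
) -> int:
    """Number of micro-batches to accumulate before each optimizer update.

    Assumes total_batch_size = batch_size * world_size * accumulation_steps.
    """
    samples_per_micro_step = batch_size * world_size
    if total_batch_size % samples_per_micro_step != 0:
        raise ValueError(
            "total_batch_size must be divisible by batch_size * world_size"
        )
    return total_batch_size // samples_per_micro_step


# Settings taken from the 60M command: one process, batch 256, total 512.
print(gradient_accumulation_steps(total_batch_size=512, batch_size=256))
```

Under the same assumption, a per-device batch size of 16 with a total batch size of 512 on one GPU corresponds to 32 accumulation steps.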

Train 7B model with a single GPU with 24GB memory

To train a 7B model on a single GPU such as an NVIDIA RTX 4090, specify --optimizer=galore_adamw8bit_per_layer, which enables GaLoreAdamW8bit with per-layer weight updates. With activation checkpointing, you can maintain a batch size of 16, as tested on an NVIDIA RTX 4090.

```bash
# LLaMA-7B, 8-bit GaLore-Adam, single GPU, activation checkpointing
# bsz=16, 22.8G
torchrun --standalone --nproc_per_node 1 torchrun_main.py \
    --model_config configs/llama_7b.json \
    --lr 0.005 \
    --galore_scale 0.25 \
    --rank 1024 \
    --update_proj_gap 500 \
    --batch_size 16 \
    --total_batch_size 512 \
    --activation_checkpointing \
    --num_training_steps 150000 \
    --warmup_steps 15000 \
    --weight_decay 0 \
    --grad_clipping 1.0 \
    --dtype bfloat16 \
    --eval_every 1000 \
    --single_gpu \
    --optimizer galore_adamw8bit_per_layer
```

Currently, the per-layer weight update technique is only supported for single-GPU training (--single_gpu), without nn.parallel.DistributedDataParallel. We are working on supporting multi-GPU training with per-layer weight updates.

Benchmark 2: Fine-Tuning RoBERTa on GLUE tasks

run_glue.py is the main script for fine-tuning RoBERTa models on GLUE tasks with GaLore. An example script is shown below:

```bash
python run_glue.py \
    --model_name_or_path roberta-base \
    --task_name mrpc \
    --enable_galore \
    --lora_all_modules \
    --max_length 512 \
    --seed=1234 \
    --lora_r 4 \
    --galore_scale 4 \
    --per_device_train_batch_size 16 \
    --update_proj_gap 500 \
    --learning_rate 3e-5 \
    --num_train_epochs 30 \
    --output_dir results/ft/roberta_base/mrpc
```

Citation

```bibtex
@misc{zhao2024galore,
      title={GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection},
      author={Jiawei Zhao and Zhenyu Zhang and Beidi Chen and Zhangyang Wang and Anima Anandkumar and Yuandong Tian},
      year={2024},
      eprint={2403.03507},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```

Owner

Name: Zeju Qiu
Login: zqiu24
Kind: user
Location: Munich

Repositories: 1
Profile: https://github.com/zqiu24

Citation (CITATION.cff)

cff-version: 1.2.0
title: "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection"
version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Jiawei"
    given-names: "Zhao"
year: 2024
repository-code: "https://arxiv.org/abs/2403.03507"

GitHub Events

Total

Member event: 1
Push event: 4

Last Year

Member event: 1
Push event: 4

Dependencies

exp_requirements.txt pypi

bitsandbytes *
datasets *
evaluate *
lion-pytorch *
loguru *
matplotlib *
nvitop *
peft *
scikit-learn *
scipy *
tokenizers *
transformers ==4.31.0
wandb *

requirements.txt pypi

bitsandbytes *
tensorly *
torch *
transformers ==4.31.0

setup.py pypi
