torchtune

A Native-PyTorch Library for LLM Fine-tuning

https://github.com/hitkumar/torchtune

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.7%) to scientific vocabulary

Last synced: 11 months ago

Repository

Basic Info

Host: GitHub
Owner: hitkumar
License: bsd-3-clause
Language: Python
Default Branch: main
Homepage:
Size: 10.9 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created about 2 years ago · Last pushed about 1 year ago

Metadata Files

Readme Contributing License Code of conduct Citation

torchtune

📣 Recent updates 📣

April 2025: Llama4 is now available in torchtune! Try out our full finetuning configs here (LoRA coming soon!)
February 2025: Multi-node training is officially open for business in torchtune! Full finetune on multiple nodes to take advantage of larger batch sizes and models.
December 2024: torchtune now supports Llama 3.3 70B! Try it out by following our installation instructions here, then run any of the configs here.
November 2024: torchtune has released v0.4.0 which includes stable support for exciting features like activation offloading and multimodal QLoRA
November 2024: torchtune has added Gemma2 to its models!
October 2024: torchtune added support for Qwen2.5 models - find the configs here
September 2024: torchtune has support for Llama 3.2 11B Vision, Llama 3.2 3B, and Llama 3.2 1B models! Try them out by following our installation instructions here, then run any of the text configs here or vision configs here.

Overview 📚

torchtune is a PyTorch library for easily authoring, post-training, and experimenting with LLMs. It provides:

Hackable training recipes for SFT, knowledge distillation, DPO, PPO, GRPO, and quantization-aware training
Simple PyTorch implementations of popular LLMs like Llama, Gemma, Mistral, Phi, Qwen, and more
Best-in-class memory efficiency, performance improvements, and scaling, utilizing the latest PyTorch APIs
YAML configs for easily configuring training, evaluation, quantization or inference recipes

Post-training recipes

torchtune supports the entire post-training lifecycle. A successfully post-trained model will likely use several of the methods below.

Supervised Finetuning (SFT)

| Type of Weight Update | 1 Device | >1 Device | >1 Node |
|-----------------------|:--------:|:---------:|:-------:|
| Full | ✅ | ✅ | ✅ |
| LoRA/QLoRA | ✅ | ✅ | ❌ |

Example: tune run lora_finetune_single_device --config llama3_2/3B_lora_single_device
You can also run e.g. tune ls lora_finetune_single_device for a full list of available configs.

Knowledge Distillation (KD)

| Type of Weight Update | 1 Device | >1 Device | >1 Node |
|-----------------------|:--------:|:---------:|:-------:|
| Full | ❌ | ❌ | ❌ |
| LoRA/QLoRA | ✅ | ✅ | ❌ |

Example: tune run knowledge_distillation_distributed --config qwen2/1.5B_to_0.5B_KD_lora_distributed
You can also run e.g. tune ls knowledge_distillation_distributed for a full list of available configs.

Reinforcement Learning / Reinforcement Learning from Human Feedback (RLHF)

| Method | Type of Weight Update | 1 Device | >1 Device | >1 Node |
|------------------------------|-----------------------|:--------:|:---------:|:-------:|
| DPO | Full | ❌ | ✅ | ❌ |
| | LoRA/QLoRA | ✅ | ✅ | ❌ |
| PPO | Full | ✅ | ❌ | ❌ |
| | LoRA/QLoRA | ❌ | ❌ | ❌ |
| GRPO | Full | 🚧 | ✅ | ✅ |
| | LoRA/QLoRA | ❌ | ❌ | ❌ |

Example: tune run lora_dpo_single_device --config llama3_1/8B_dpo_single_device
You can also run e.g. tune ls full_dpo_distributed for a full list of available configs.

Quantization-Aware Training (QAT)

| Type of Weight Update | 1 Device | >1 Device | >1 Node |
|-----------------------|:--------:|:---------:|:-------:|
| Full | ❌ | ✅ | ❌ |
| LoRA/QLoRA | ❌ | ✅ | ❌ |

Example: tune run qat_distributed --config llama3_1/8B_qat_lora
You can also run e.g. tune ls qat_distributed for a full list of available configs.

The above configs are just examples to get you started. The full list of recipes can be found here. If you'd like to work on one of the gaps you see, please submit a PR! If there's an entirely new post-training method you'd like to see implemented in torchtune, feel free to open an Issue.

Models

For the above recipes, torchtune supports many state-of-the-art models available on the Hugging Face Hub or Kaggle Hub. Some of our supported models:

| Model | Sizes |
|-----------------------------------------------|-----------|
| Llama4 | Scout (17B x 16E) [models, configs] |
| Llama3.3 | 70B [models, configs] |
| Llama3.2-Vision | 11B, 90B [models, configs] |
| Llama3.2 | 1B, 3B [models, configs] |
| Llama3.1 | 8B, 70B, 405B [models, configs] |
| Mistral | 7B [models, configs] |
| Gemma2 | 2B, 9B, 27B [models, configs] |
| Microsoft Phi4 | 14B [models, configs] |
| Microsoft Phi3 | Mini [models, configs] |
| Qwen2.5 | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B [models, configs] |
| Qwen2 | 0.5B, 1.5B, 7B [models, configs] |

We're always adding new models, but feel free to file an issue if there's a new one you would like to see in torchtune.

Memory and training speed

Below is an example of the memory requirements and training speed for different Llama 3.1 models.

> [!NOTE]
> For ease of comparison, all the numbers below are provided for batch size 2 (without gradient accumulation), a dataset packed to sequence length 2048, and torch compile enabled.

If you are interested in running on different hardware or with different models, check out our documentation on memory optimizations here to find the right setup for you.

| Model | Finetuning Method | Runnable On | Peak Memory per GPU | Tokens/sec * |
|:-:|:-:|:-:|:-:|:-:|
| Llama 3.1 8B | Full finetune | 1x 4090 | 18.9 GiB | 1650 |
| Llama 3.1 8B | Full finetune | 1x A6000 | 37.4 GiB | 2579 |
| Llama 3.1 8B | LoRA | 1x 4090 | 16.2 GiB | 3083 |
| Llama 3.1 8B | LoRA | 1x A6000 | 30.3 GiB | 4699 |
| Llama 3.1 8B | QLoRA | 1x 4090 | 7.4 GiB | 2413 |
| Llama 3.1 70B | Full finetune | 8x A100 | 13.9 GiB ** | 1568 |
| Llama 3.1 70B | LoRA | 8x A100 | 27.6 GiB | 3497 |
| Llama 3.1 405B | QLoRA | 8x A100 | 44.8 GB | 653 |

\* = Measured over one full training epoch
\*\* = Uses CPU offload with fused optimizer
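As a rough sanity check on the table above, the memory needed for the weights alone can be estimated from the parameter count and the precision. The sketch below is only an illustration, not a torchtune utility: the helper names are hypothetical, the parameter counts are nominal (8B is taken as 8e9), and the estimate ignores gradients, optimizer state, and activations, so measured peak memory will differ. It also shows a GB to GiB conversion, since one entry in the table is reported in GB.

```python
GIB = 1024 ** 3  # bytes per gibibyte


def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


def gb_to_gib(gigabytes: float) -> float:
    """Convert decimal gigabytes (10**9 bytes) to gibibytes (2**30 bytes)."""
    return gigabytes * 10**9 / GIB


# Nominal parameter counts; real checkpoints differ slightly.
for label, params in [("8B", 8e9), ("70B", 70e9), ("405B", 405e9)]:
    bf16 = weight_memory_gib(params, 16)  # bf16 stores 16 bits per weight
    int4 = weight_memory_gib(params, 4)   # 4-bit weights, as in QLoRA-style quantization
    print(f"{label}: bf16 weights ~{bf16:.1f} GiB, 4-bit weights ~{int4:.1f} GiB")

print(f"44.8 GB is about {gb_to_gib(44.8):.1f} GiB")
```

Under these assumptions the 8B model needs roughly 14.9 GiB for bf16 weights and roughly 3.7 GiB at 4 bits, which gives a sense of why the quantized configuration has the smallest footprint in the table.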

Optimization flags

torchtune exposes a number of levers for memory efficiency and performance. The table below demonstrates the effects of applying some of these techniques sequentially to the Llama 3.2 3B model. Each technique is added on top of the previous one, except for LoRA and QLoRA, which do not use optimizer_in_bwd or the AdamW8bit optimizer.

Baseline uses Recipe=full_finetune_single_device, Model=Llama 3.2 3B, Batch size=2, Max sequence length=4096, Precision=bf16, Hardware=A100

| Technique | Peak Memory Active (GiB) | % Change Memory vs Previous | Tokens Per Second | % Change Tokens/sec vs Previous |
|:--|:-:|:-:|:-:|:-:|
| Baseline | 25.5 | - | 2091 | - |
| + Packed Dataset | 60.0 | +135.16% | 7075 | +238.40% |
| + Compile | 51.0 | -14.93% | 8998 | +27.18% |
| + Chunked Cross Entropy | 42.9 | -15.83% | 9174 | +1.96% |
| + Activation Checkpointing | 24.9 | -41.93% | 7210 | -21.41% |
| + Fuse optimizer step into backward | 23.1 | -7.29% | 7309 | +1.38% |
| + Activation Offloading | 21.8 | -5.48% | 7301 | -0.11% |
| + 8-bit AdamW | 17.6 | -19.63% | 6960 | -4.67% |
| LoRA | 8.5 | -51.61% | 8210 | +17.96% |
| QLoRA | 4.6 | -45.71% | 8035 | -2.13% |

Compared with the Baseline row, the final row in the table (QLoRA, with the packed dataset and the other optimizations applied) uses 81.9% less memory with a 284.3% increase in tokens per second.
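The percentage columns and the summary figures can be recomputed from the table values. The sketch below copies the rounded numbers from the table and applies the usual relative-change formula; the function and variable names are illustrative only and are not part of torchtune.

```python
# (technique, peak memory in GiB, tokens per second), copied from the table above
ROWS = [
    ("Baseline", 25.5, 2091),
    ("+ Packed Dataset", 60.0, 7075),
    ("+ Compile", 51.0, 8998),
    ("+ Chunked Cross Entropy", 42.9, 9174),
    ("+ Activation Checkpointing", 24.9, 7210),
    ("+ Fuse optimizer step into backward", 23.1, 7309),
    ("+ Activation Offloading", 21.8, 7301),
    ("+ 8-bit AdamW", 17.6, 6960),
    ("LoRA", 8.5, 8210),
    ("QLoRA", 4.6, 8035),
]


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Change of each row relative to the row directly above it.
for (_, prev_mem, prev_tps), (name, mem, tps) in zip(ROWS, ROWS[1:]):
    print(
        f"{name}: memory {pct_change(prev_mem, mem):+.2f}%, "
        f"tokens/sec {pct_change(prev_tps, tps):+.2f}%"
    )

# Final row compared with the Baseline row.
_, base_mem, base_tps = ROWS[0]
_, last_mem, last_tps = ROWS[-1]
memory_vs_baseline = pct_change(base_mem, last_mem)
speed_vs_baseline = pct_change(base_tps, last_tps)
print(f"QLoRA vs Baseline: memory {memory_vs_baseline:+.1f}%, tokens/sec {speed_vs_baseline:+.1f}%")
```

Because the inputs are the rounded table values, the recomputed figures can differ slightly from the published percentages, which were presumably derived from unrounded measurements.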

Command to reproduce final row.

```bash
tune run lora_finetune_single_device --config llama3_2/3B_qlora_single_device \
dataset.packed=True \
compile=True \
loss=torchtune.modules.loss.CEWithChunkedOutputLoss \
enable_activation_checkpointing=True \
optimizer_in_bwd=False \
enable_activation_offloading=True \
optimizer=torch.optim.AdamW \
tokenizer.max_seq_len=4096 \
gradient_accumulation_steps=1 \
epochs=1 \
batch_size=2
```

Installation 🛠️

torchtune is tested with the latest stable PyTorch release as well as the preview nightly version. torchtune leverages torchvision for finetuning multimodal LLMs and torchao for the latest in quantization techniques; you should install these as well.

Install stable release

```bash
# Install stable PyTorch, torchvision, torchao stable releases
pip install torch torchvision torchao
pip install torchtune
```

Install nightly release

```bash
# Install PyTorch, torchvision, torchao nightlies
pip install --pre --upgrade torch torchvision torchao --index-url https://download.pytorch.org/whl/nightly/cu126 # full options are cpu/cu118/cu121/cu124/cu126
pip install --pre --upgrade torchtune --extra-index-url https://download.pytorch.org/whl/nightly/cpu
```

You can also check out our install documentation for more information, including installing torchtune from source.

To confirm that the package is installed correctly, you can run the following command:

```bash
tune --help
```

You should see the following output:

```bash
usage: tune [-h] {ls,cp,download,run,validate} ...

Welcome to the torchtune CLI!

options:
  -h, --help            show this help message and exit

...
```
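The same check can be scripted, for example in a setup script or a notebook. The sketch below uses only the Python standard library; the function name is hypothetical, and it only tests whether the torchtune package can be found and whether a tune executable is on the PATH, not whether the installation is fully functional.

```python
import importlib.util
import shutil


def torchtune_install_status() -> dict:
    """Report whether the torchtune package and the tune CLI can be located."""
    return {
        # find_spec returns None when the top-level package is not installed
        "package_importable": importlib.util.find_spec("torchtune") is not None,
        # which returns None when no executable named tune is on the PATH
        "cli_on_path": shutil.which("tune") is not None,
    }


status = torchtune_install_status()
for check, ok in status.items():
    print(f"{check}: {'ok' if ok else 'missing'}")
```

If either check reports missing, re-running the install commands above in the active environment is a reasonable first step.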

Get Started 🚀

To get started with torchtune, see our First Finetune Tutorial. Our End-to-End Workflow Tutorial will show you how to evaluate, quantize, and run inference with a Llama model. The rest of this section will provide a quick overview of these steps with Llama3.1.

Downloading a model

Follow the instructions on the official meta-llama repository to ensure you have access to the official Llama model weights. Once you have confirmed access, you can run the following command to download the weights to your local machine. This will also download the tokenizer model and a responsible use guide.

To download Llama3.1, you can run:

```bash
tune download meta-llama/Meta-Llama-3.1-8B-Instruct \
--output-dir /tmp/Meta-Llama-3.1-8B-Instruct \
--ignore-patterns "original/consolidated.00.pth" \
--hf-token <HF_TOKEN>
```

> [!Tip]
> Set your environment variable HF_TOKEN or pass in --hf-token to the command in order to validate your access. You can find your token at https://huggingface.co/settings/tokens

Running finetuning recipes

You can finetune Llama3.1 8B with LoRA on a single GPU using the following command:

```bash
tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device
```

For distributed training, the tune CLI integrates with torchrun. To run a full finetune of Llama3.1 8B on two GPUs:

```bash
tune run --nproc_per_node 2 full_finetune_distributed --config llama3_1/8B_full
```

> [!Tip]
> Make sure to place any torchrun arguments before the recipe specification. Any CLI arguments after the recipe will override the config and will not affect distributed training.
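The ordering rule in the tip can be made explicit when a launch command is assembled programmatically. The sketch below is a hypothetical helper, not part of torchtune: it only builds the argument list in the documented order (torchrun arguments, then the recipe and its config, then config overrides) and prints it without running anything.

```python
import shlex


def build_tune_run(recipe, config, torchrun_args=None, overrides=None):
    """Assemble a tune run command with torchrun arguments before the recipe."""
    cmd = ["tune", "run"]
    cmd += list(torchrun_args or [])      # e.g. --nproc_per_node 2, must precede the recipe
    cmd += [recipe, "--config", config]   # recipe name followed by its config
    cmd += [f"{key}={value}" for key, value in (overrides or {}).items()]  # config overrides last
    return cmd


cmd = build_tune_run(
    "full_finetune_distributed",
    "llama3_1/8B_full",
    torchrun_args=["--nproc_per_node", "2"],
    overrides={"batch_size": 8},
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run if launching from Python is desired; keeping the three groups separate makes it harder to place a torchrun flag after the recipe by accident.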

Modify Configs

There are two ways in which you can modify configs:

Config Overrides

You can directly overwrite config fields from the command line:

```bash
tune run lora_finetune_single_device \
--config llama2/7B_lora_single_device \
batch_size=8 \
enable_activation_checkpointing=True \
max_steps_per_epoch=128
```
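To make the override semantics concrete, the sketch below shows one way key=value pairs with dotted keys could be merged into a nested config. It is a simplified, standard-library illustration and not torchtune's actual implementation; the helper names and the default values in the example config are assumptions made up for the demonstration.

```python
import ast


def parse_value(raw: str):
    """Interpret literals such as 8 or True; fall back to the raw string."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw


def apply_overrides(config: dict, overrides: list) -> dict:
    """Apply key=value overrides in place; dotted keys address nested fields."""
    for item in overrides:
        key, _, raw = item.partition("=")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})  # create intermediate sections as needed
        node[leaf] = parse_value(raw)
    return config


# Hypothetical defaults standing in for values loaded from a YAML config.
config = {"batch_size": 2, "dataset": {"packed": False}}
apply_overrides(
    config,
    [
        "batch_size=8",
        "enable_activation_checkpointing=True",
        "max_steps_per_epoch=128",
        "dataset.packed=True",
    ],
)
print(config)
```

Values that are not Python literals, such as a dotted class path, are kept as plain strings in this sketch.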

Update a Local Copy

You can also copy the config to your local directory and modify the contents directly:

```bash
tune cp llama3_1/8B_full ./my_custom_config.yaml
Copied to ./my_custom_config.yaml
```

Then, you can run your custom recipe by directing the tune run command to your local files:

```bash
tune run full_finetune_distributed --config ./my_custom_config.yaml
```

Check out tune --help for all possible CLI commands and options. For more information on using and updating configs, take a look at our config deep-dive.

Custom Datasets

torchtune supports finetuning on a variety of different datasets, including instruct-style, chat-style, preference datasets, and more. If you want to learn more about how to apply these components to finetune on your own custom dataset, please check out the provided links along with our API docs.

Community 🌍

torchtune focuses on integrating with popular tools and libraries from the ecosystem. These are just a few examples, with more under development:

Hugging Face Hub for accessing model weights
EleutherAI's LM Eval Harness for evaluating trained models
Hugging Face Datasets for access to training and evaluation datasets
PyTorch FSDP2 for distributed training
torchao for lower precision dtypes and post-training quantization techniques
Weights & Biases for logging metrics and checkpoints, and tracking training progress
Comet as another option for logging
ExecuTorch for on-device inference using finetuned models
bitsandbytes for low memory optimizers for our single-device recipes
PEFT for continued finetuning or inference with torchtune models in the Hugging Face ecosystem

Community Contributions

We really value our community and the contributions made by our wonderful users. We'll use this section to call out some of these contributions. If you'd like to help out as well, please see the CONTRIBUTING guide.

@SalmanMohammadi for adding a comprehensive end-to-end recipe for Reinforcement Learning from Human Feedback (RLHF) finetuning with PPO to torchtune
@fyabc for adding Qwen2 models, tokenizer, and recipe integration to torchtune
@solitude-alive for adding the Gemma 2B model to torchtune, including recipe changes, numeric validations of the models and recipe correctness
@yechenzhi for adding Direct Preference Optimization (DPO) to torchtune, including the recipe and config along with correctness checks
@Optimox for adding all the Gemma2 variants to torchtune!

Acknowledgements 🙏

The transformer code in this repository is inspired by the original Llama2 code. We also want to give a huge shout-out to EleutherAI, Hugging Face and Weights & Biases for being wonderful collaborators and for working with us on some of these integrations within torchtune. In addition, we want to acknowledge some other awesome libraries and tools from the ecosystem:

gpt-fast for performant LLM inference techniques which we've adopted out-of-the-box
llama recipes for spring-boarding the llama2 community
bitsandbytes for bringing several memory and performance based techniques to the PyTorch ecosystem
@winglian and axolotl for early feedback and brainstorming on torchtune's design and feature set.
lit-gpt for pushing the LLM finetuning community forward.
HF TRL for making reward modeling more accessible to the PyTorch community.

Citing torchtune 📝

If you find the torchtune library useful, please cite it in your work as below.

```bibtex
@software{torchtune,
  title = {torchtune: PyTorch's finetuning library},
  author = {torchtune maintainers and contributors},
  url = {https://github.com/pytorch/torchtune},
  license = {BSD-3-Clause},
  month = apr,
  year = {2024}
}
```

License

torchtune is released under the BSD 3-Clause license. However, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models.

Local changes from hitkumar

Llama 3.2 3B model stored in /tmp/Llama-3.2-3B-Instruct

Llama 4 17B 16E instruct model stored in /tmp/Llama-4-Scout-17B-16E-Instruct

Virtual envs to use:
- torchtune: for general development (kernel name is torchtune)
- vllm: for inference related to vllm (kernel name is vllm)

Owner

Name: Hitesh
Login: hitkumar
Kind: user
Location: New York City
Company: Meta

Website: https://www.linkedin.com/in/hiteshkumar90/
Repositories: 13
Profile: https://github.com/hitkumar

Machine Learning Engineer

Citation (CITATION.cff)

cff-version: 1.2.0
title: "torchtune: PyTorch's post-training library"
message: "If you use this software, please cite it as below."
type: software
authors:
  - given-names: "torchtune maintainers and contributors"
url: "https://github.com/pytorch/torchtune"
license: "BSD-3-Clause"
date-released: "2024-04-14"

GitHub Events

Total

Push event: 22

Last Year

Push event: 22

Dependencies

.github/workflows/build_docs.yaml actions

actions/checkout v4 composite
actions/download-artifact v4 composite
actions/upload-artifact v4 composite
conda-incubator/setup-miniconda v2 composite
seemethere/upload-artifact-s3 v5 composite

.github/workflows/build_linux_wheels.yaml actions

.github/workflows/export.yaml actions

actions/checkout v3 composite
codecov/codecov-action v3 composite
conda-incubator/setup-miniconda v2 composite

.github/workflows/gpu_test.yaml actions

actions/checkout v3 composite
codecov/codecov-action v3 composite
conda-incubator/setup-miniconda v2 composite

.github/workflows/lint.yaml actions

actions/checkout v3 composite
actions/setup-python v4 composite
tj-actions/changed-files d6e91a2266cdb9d62096cebf1e8546899c6aa18f composite

.github/workflows/regression_test.yaml actions

actions/checkout v3 composite
aws-actions/configure-aws-credentials v1.7.0 composite
codecov/codecov-action v3 composite
conda-incubator/setup-miniconda v2 composite

.github/workflows/unit_test.yaml actions

actions/checkout v3 composite
codecov/codecov-action v3 composite
conda-incubator/setup-miniconda v2 composite

docs/requirements.txt pypi

matplotlib *
sphinx ==5.0.0
sphinx-gallery >0.11
sphinx-tabs *
sphinx_copybutton *
sphinx_design *

pyproject.toml pypi

Pillow >=9.4.0
blobfile >=2
datasets *
huggingface_hub [hf_transfer]
kagglehub *
numpy *
omegaconf *
psutil *
safetensors *
sentencepiece *
tiktoken *
tokenizers *
torchdata *
tqdm *
