llm-recipes

Ongoing research project for continual pre-training of LLMs (dense models)

https://github.com/okoge-kaz/llm-recipes

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.3%) to scientific vocabulary
Last synced: 10 months ago

Repository

Ongoing research project for continual pre-training of LLMs (dense models)

Basic Info
  • Host: GitHub
  • Owner: okoge-kaz
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 18.4 MB
Statistics
  • Stars: 39
  • Watchers: 3
  • Forks: 4
  • Open Issues: 3
  • Releases: 0
Created over 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme Citation

README.md

llm-recipes

User-friendly tool for seamless continual pre-training of Large Language Models

llm-recipes is a tool designed to make the continual pre-training of Large Language Models (LLMs) easy and efficient. With an intuitive interface and flexible configuration options, researchers and developers can effortlessly manage training on any model or dataset. The tool supports distributed training on large GPU clusters and offers extensive customization, enabling users to leverage cutting-edge techniques with ease.

What sets llm-recipes apart is its integration with Hugging Face Transformers: you can continue pre-training or perform instruction tuning on dense LLMs (non-MoE models) with minimal changes. There is no need to convert checkpoints before training or to deal with complex workflows, so you can focus on refining your model.

| Feature                              | llm-recipes | llama-recipes | torchtune |
|--------------------------------------|-------------|---------------|-----------|
| SFT (Supervised Fine-Tuning)         | ✅          | ✅            | ✅        |
| Continual Pre-Training               | ✅          | ✅            | ✅        |
| DPO (Direct Preference Optimization) | ✅          | ❌            | ✅        |
| Llama Models Support                 | ✅          | ✅            | ✅        |
| Non-Llama Models Support             | ✅          | ❌            | ✅        |
| Multi-Node Support                   | ✅          | ✅            | ❌        |

Installation

This package has been tested with Python 3.10 and 3.11. CUDA Toolkit 12.1 is the recommended environment.

To install the required packages, simply run:

```bash
pip install -r requirements.txt
```

Note: The requirements.txt assumes that CUDA Toolkit 12.1 is installed on your system.

Multi-node Support

For multi-node support, ensure you have the following dependencies installed:

```bash
module load openmpi/4.x.x

pip install mpi4py
```
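When a job is launched with Open MPI's `mpirun`, each process receives its rank through environment variables such as `OMPI_COMM_WORLD_RANK`. The sketch below shows one way to copy them to the `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` variables that the `env://` initialization of `torch.distributed` reads. The helper name `export_torch_dist_env` is hypothetical, and it is an assumption that llm-recipes handles this mapping in a comparable way.

```python
import os


def export_torch_dist_env(env=None):
    """Copy Open MPI rank variables to the names torch.distributed expects.

    Hypothetical helper for illustration; not part of llm-recipes.
    """
    env = os.environ if env is None else env
    mapping = {
        "RANK": "OMPI_COMM_WORLD_RANK",
        "WORLD_SIZE": "OMPI_COMM_WORLD_SIZE",
        "LOCAL_RANK": "OMPI_COMM_WORLD_LOCAL_RANK",
    }
    for torch_name, mpi_name in mapping.items():
        if mpi_name not in env:
            raise KeyError(f"{mpi_name} is not set; was the job started with mpirun?")
        # Environment values are strings, so they are copied unchanged.
        env[torch_name] = env[mpi_name]
    # Return the parsed integers for convenience.
    return {name: int(env[name]) for name in mapping}
```

Calling the helper once at process start, before the process group is initialized, is enough for each rank to see consistent values.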

FlashAttention

For GPU-accelerated FlashAttention, follow these steps:

```bash
pip install ninja packaging wheel
pip install flash-attn --no-build-isolation
```

Usage

LLM Instruction Tuning

1. Data Preparation

Prepare your data in the format below and save it as a JSONL file, with one JSON object per line:

```jsonl
{"input": [{"role": "user", "content": "What is the weather like today?"}], "output": {"role": "assistant", "content": "The weather is sunny with a high of 25 degrees."}}
```
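As a minimal sketch, records in this layout could be checked and written with the Python standard library. The function names are hypothetical, and the checks (exactly the keys `role` and `content`, and an `assistant` role for `output`) are assumptions inferred from the example record rather than rules stated by llm-recipes.

```python
import json


def validate_record(record):
    """Check that one record follows the input/output layout of the example."""
    messages = record["input"]
    if not isinstance(messages, list) or not messages:
        raise ValueError("'input' must be a non-empty list of messages")
    for message in messages + [record["output"]]:
        # Assumption: every message carries exactly these two keys.
        if set(message) != {"role", "content"}:
            raise ValueError(f"unexpected message keys: {sorted(message)}")
    # Assumption: the target message is always written by the assistant.
    if record["output"]["role"] != "assistant":
        raise ValueError("'output' must be an assistant message")


def write_jsonl(records, path):
    """Write one JSON object per line, which is what JSONL requires."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            validate_record(record)
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Validating before writing keeps malformed records out of the file, so problems surface during data preparation rather than during training.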

2. Change Dataset Class

Modify the Dataset class in src/llama_recipes/utils/instruction_tuning.py to match the format your model expects. Because almost all models ship with a chat template, you may not need to change the Dataset class at all.
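To illustrate how a chat template relates to the record layout, the sketch below merges the `input` messages and the `output` message into one conversation and renders it with the tokenizer's `apply_chat_template` method from Hugging Face Transformers. Both function names are hypothetical, and this is not the Dataset class from the repository; how that class builds its training text is not shown here.

```python
def to_conversation(record):
    """Merge the 'input' messages and the 'output' message into one chat."""
    return list(record["input"]) + [record["output"]]


def render_with_chat_template(tokenizer, record):
    """Render the conversation as text using the tokenizer's chat template.

    `tokenizer` is expected to be a Hugging Face tokenizer that defines a
    chat template; tokenize=False returns the formatted string, not IDs.
    """
    return tokenizer.apply_chat_template(to_conversation(record), tokenize=False)
```

With a real tokenizer loaded through `AutoTokenizer.from_pretrained`, the returned string shows the special tokens and role markers that the model expects.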

3. Indexing

To load the dataset efficiently, create an index file using the following command:

```bash
python tools/pre-process/index_dataset.py \
  --data-file-path <path-to-jsonl-file>
```

After indexing, an .index_cache directory is created in the same directory as the JSONL file.
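The contents of the .index_cache directory are not described here. As a general illustration of why an index speeds up loading, the sketch below records the byte offset of every line so that a single record can be read without scanning the whole file. Both functions are hypothetical, and it is an assumption that tools/pre-process/index_dataset.py follows a similar idea.

```python
import json


def build_line_offsets(path):
    """Return the byte offset at which each non-blank line starts."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            position = f.tell()
            line = f.readline()
            if not line:  # end of file
                break
            if line.strip():  # ignore blank lines
                offsets.append(position)
    return offsets


def read_record(path, offsets, i):
    """Load the i-th record by seeking directly to its offset."""
    with open(path, "rb") as f:
        f.seek(offsets[i])
        return json.loads(f.readline().decode("utf-8"))
```

Building the offsets costs one pass over the file; afterwards each lookup is a seek plus one line read, which suits random-access sampling during training.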

4. Training

We provide an example script for instruction tuning for Llama-3-8B in scripts/tsubame/instruct/Llama-3-8B/Llama-3-8B-instruct-v0.2.sh. You can modify the script to suit your needs.

LLM Continual Pre-Training

1. Data Preparation

Prepare your data in the format below and save it as a JSONL file, with one JSON object per line:

```jsonl
{"text": "What is the weather like today?\nThe weather is sunny with a high of 25 degrees."}
```
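A document usually spans several lines, while JSONL needs each record on a single line. The sketch below shows that `json.dumps` takes care of this by escaping newlines inside the `text` value. The function name `documents_to_jsonl` is hypothetical, and skipping empty documents is an assumption made for the example.

```python
import json


def documents_to_jsonl(documents):
    """Serialize each document as one {"text": ...} object per line."""
    lines = []
    for text in documents:
        if not text.strip():
            continue  # assumption: empty documents are not useful for training
        # json.dumps escapes embedded newlines, so the record stays on one line.
        lines.append(json.dumps({"text": text}, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)
```

Setting `ensure_ascii=False` keeps non-ASCII text, such as Japanese, readable in the output file instead of turning it into escape sequences.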

2. Tokenize Data

Tokenize your data using the tokenizer provided by the model you are using. For example, to tokenize data for Codestral (Mistral AI), run the following command:

```bash
DATASET_DIR=/path/to/datasets/samples
OUTPUT_DIR=/path/to/datasets/debug/Codestral-22B-v0.1

mkdir -p ${OUTPUT_DIR}

python megatron_lm/tools/preprocess_data.py \
  --input ${DATASET_DIR}/jawiki.jsonl \
  --output-prefix ${OUTPUT_DIR}/jawiki \
  --tokenizer-type Llama2Tokenizer \
  --tokenizer-model /path/to/hf_checkpoints/Codestral-22B-v0.1/tokenizer.model \
  --append-eod \
  --workers 64
```

3. Training

We support Llama-2, Llama-3, Llama-3.1, Mistral, Codestral, Phi-3, Yi-1.5, and gemma-2. If you want to continually pre-train or instruction tune other models, you should modify src/llama_recipes/get_models.py and src/llama_recipes/get_model_decoder_layer.py.

We provide an example script for continual pre-training of Codestral-22B in scripts/gcp/codestral-22b.sh. You can modify the script to suit your needs.

LLM DPO

DPO support is experimental and has not been fully tested. The documentation will be updated soon.

Checkpoint formats

llm-recipes format

llm-recipes supports two types of checkpoints: PyTorch format and PyTorch distributed format. The PyTorch format is a simple checkpoint format. A checkpoint in this format looks like this:

```bash
model.pt
optimizer.pt
rng.pt
sampler.pt
scheduler.pt
```

The PyTorch distributed format is a checkpoint format that can be loaded in a distributed manner using torch.distributed. A checkpoint in this format looks like this:

```bash
__0_0.distcp
__1_0.distcp
__2_0.distcp
__3_0.distcp
__4_0.distcp
__5_0.distcp
__6_0.distcp
__7_0.distcp
rng.pt
sampler.pt
scheduler.pt
```
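Since the two formats differ in the files they contain, a script can tell them apart by file name before choosing a conversion path. The sketch below is a hypothetical helper, not part of llm-recipes; the returned labels are made up for the example, and the detection relies only on the file names listed above.

```python
from pathlib import Path


def detect_checkpoint_format(checkpoint_dir):
    """Guess the checkpoint format from the file names in a directory."""
    directory = Path(checkpoint_dir)
    # Sharded files with the .distcp suffix mark the distributed format.
    if any(directory.glob("*.distcp")):
        return "pytorch_distributed"
    # A single model.pt marks the plain PyTorch format.
    if (directory / "model.pt").is_file():
        return "pytorch"
    return "unknown"
```

A wrapper script could use the result to pick between the two conversion commands described in the following sections.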

PyTorch format to Hugging Face format

You can convert the PyTorch format to the Hugging Face format using the following command:

```bash
ITERATION=1000
FORMATTED_ITERATION=$(printf "iter_%07d" $ITERATION)

CHECKPOINT_PATH=/path/to/train/checkpoint/${FORMATTED_ITERATION}/model.pt
OUTPUT_PATH=/path/to/converted/checkpoint/${FORMATTED_ITERATION}

mkdir -p $OUTPUT_PATH

BASE_MODEL_CHECKPOINT=/path/to/huggingface-checkpoint/Llama-2-7b-hf

python tools/checkpoint-convert/convert_ckpt.py \
  --model $BASE_MODEL_CHECKPOINT \
  --ckpt $CHECKPOINT_PATH \
  --out $OUTPUT_PATH \
  --sequence-length 4096
```
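Checkpoint directories are named after the iteration, zero-padded to seven digits, as in the iter_0004000 path used in the inference example. The sketch below reproduces that naming in Python and finds the most recent iteration under a checkpoint root. Both helper names are hypothetical, and the `iter_` prefix with seven digits is an assumption taken from that example path.

```python
import re
from pathlib import Path

# Assumed directory naming: "iter_" followed by seven digits.
ITER_PATTERN = re.compile(r"^iter_(\d{7})$")


def format_iteration(iteration):
    """Return the directory name for an iteration, e.g. 1000 -> iter_0001000."""
    return f"iter_{iteration:07d}"


def latest_iteration(checkpoint_root):
    """Return the highest iteration found under checkpoint_root, or None."""
    iterations = [
        int(match.group(1))
        for entry in Path(checkpoint_root).iterdir()
        if entry.is_dir() and (match := ITER_PATTERN.match(entry.name))
    ]
    return max(iterations, default=None)
```

Passing the result of `latest_iteration` to `format_iteration` gives the directory name to use in the conversion command.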

PyTorch distributed format to Hugging Face format

You can convert the PyTorch distributed format to the Hugging Face format using the following command:

```bash
ITERATION=1000
FORMATTED_ITERATION=$(printf "iter_%07d" $ITERATION)

CHECKPOINT_PATH=/path/to/fsdp/checkpoint/${FORMATTED_ITERATION}
OUTPUT_PATH=/path/to/converted-hf-checkpoint/${FORMATTED_ITERATION}

echo "convert FSDP ${CHECKPOINT_PATH} to ${OUTPUT_PATH}"

mkdir -p $OUTPUT_PATH

BASE_MODEL_CHECKPOINT=/path/to/hf-checkpoints/Meta-Llama-3-8B-Instruct

python tools/checkpoint-convert/convert_fsdp.py \
  --hf-base-model-path $BASE_MODEL_CHECKPOINT \
  --tokenizer-path $BASE_MODEL_CHECKPOINT \
  --fsdp-checkpoint-path $CHECKPOINT_PATH \
  --checkpoint-output-path $OUTPUT_PATH \
  --sequence-length 8192
```

Inference

After checkpoint conversion, you can use the Hugging Face Transformers library to load the converted checkpoint and perform inference.

The following example runs inference with a converted checkpoint (Hugging Face format):

```bash
python tools/inference/inference.py \
  --model-path /path/to/converted/iter_0004000 \
  --tokenizer-path /path/to/tokenizer/path \
  --prompt "Tokyo is the capital of"
```
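For readers who prefer to call Transformers directly, here is a minimal sketch of loading a converted checkpoint and generating a continuation. The function `generate` and its defaults are hypothetical, bfloat16 loading is an assumption about the available hardware, and it is not confirmed that tools/inference/inference.py is implemented this way.

```python
def generate(model_path, tokenizer_path, prompt, max_new_tokens=64):
    """Load a converted Hugging Face checkpoint and continue `prompt`."""
    # Imported lazily so this module can be imported without the GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    # Assumption: bfloat16 weights fit the target hardware.
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)
    model.to(device).eval()

    # Tokenize the prompt and move the tensors to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode the full sequence, which includes the prompt.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, `generate("/path/to/converted/iter_0004000", "/path/to/tokenizer/path", "Tokyo is the capital of")` mirrors the command-line call above.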

Training Speed and Scalability

We are currently working on improving the training speed and scalability of llm-recipes. We will update this section with more information soon.

Projects Using llm-recipes

Below are some of the projects where we have directly used llm-recipes:

Citation

We are currently submitting a paper to an SC24 workshop, and the citation will be updated soon.

```bibtex
@software{Fujii_llm-recipes_2024,
  author = {Kazuki Fujii and Taishi Nakamura and Rio Yokota},
  month = may,
  title = {{llm-recipes}},
  url = {https://github.com/okoge-kaz/llm-recipes},
  version = {1.0.0},
  year = {2024}
}
```

Owner

  • Name: Kazuki Fujii
  • Login: okoge-kaz
  • Kind: user
  • Location: Tokyo Japan

Bachelor's student in Computer Science at Tokyo Institute of Technology

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Fujii"
  given-names: "Kazuki"
- family-names: "Nakamura"
  given-names: "Taishi"
- family-names: "Yokota"
  given-names: "Rio"
title: "llm-recipes"
version: 1.0.0
date-released: 2024-5-24
url: "https://github.com/okoge-kaz/llm-recipes"

GitHub Events

Total
  • Issues event: 2
  • Watch event: 17
  • Push event: 15
  • Create event: 2
Last Year
  • Issues event: 2
  • Watch event: 17
  • Push event: 15
  • Create event: 2

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 2
  • Total pull requests: 24
  • Average time to close issues: N/A
  • Average time to close pull requests: 3 days
  • Total issue authors: 1
  • Total pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 24
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 24
  • Average time to close issues: N/A
  • Average time to close pull requests: 3 days
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 24
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • mchi-zg (1)
  • etsurin (1)
  • okoge-kaz (1)
Pull Request Authors
  • okoge-kaz (29)
Top Labels
Issue Labels
Pull Request Labels
bug (2)

Dependencies

requirements.txt pypi
  • accelerate *
  • appdirs *
  • bitsandbytes *
  • black *
  • datasets *
  • deepspeed *
  • fire *
  • flake8 *
  • loralib *
  • mpi4py *
  • nltk *
  • optimum *
  • peft *
  • py7zr *
  • pybind11 *
  • scipy *
  • sentencepiece *
  • torch ==2.1.2
  • transformers >=4.35.0
  • wandb *
tools/requirements.txt pypi
  • accelerate *
  • appdirs *
  • bitsandbytes *
  • black *
  • datasets *
  • deepspeed *
  • fire *
  • flake8 *
  • loralib *
  • mpi4py *
  • nltk *
  • optimum *
  • peft *
  • py7zr *
  • pybind11 *
  • scipy *
  • sentencepiece *
  • torch ==2.1.2
  • transformers >=4.35.0
  • wandb *