lm-engine

LM engine is a library for pretraining/finetuning LLMs

https://github.com/open-lm-engine/lm-engine

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

LM engine is a library for pretraining/finetuning LLMs

Basic Info
  • Host: GitHub
  • Owner: open-lm-engine
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 6.12 MB
Statistics
  • Stars: 65
  • Watchers: 6
  • Forks: 21
  • Open Issues: 13
  • Releases: 0
Created over 1 year ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.md

LM Engine

Introduction

This repository contains code used for training new model architectures; it has also been used to train IBM's Granite models. It includes the following key innovations in model architectures, finetuning methods, and systems optimizations:

  1. Saving Memory Using Padding-Free Transformer Layers during Finetuning
     Mayank Mishra
  2. Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
     William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly
  3. Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler
     Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox, Rameswar Panda
  4. Scattered Mixture-of-Experts Implementation
     Shawn Tan, Yikang Shen, Rameswar Panda, Aaron Courville
  5. Stick-breaking Attention
     Shawn Tan, Yikang Shen, Songlin Yang, Aaron Courville, Rameswar Panda

Discord Server

Join the Discord server if you are interested in LLM architecture or distributed training/inference research.

Getting Started

Run make install to install the requirements for this repository. You might need to install flash-attn.

Distributed finetuning

This repository is meant for pretraining and finetuning large language models.

The repository currently supports only generative models but can be easily extended to non-generative models if needed. Two main classes of models from HuggingFace are supported:

  1. decoder models (AutoModelForCausalLM) like Granite, Llama, BLOOM etc
  2. encoder-decoder models (AutoModelForSeq2SeqLM) like T5, BART etc

Please note that this repository doesn't support Tensor Parallel or Pipeline Parallel (yet :wink:).

HuggingFace compatible custom models

This repository works out-of-the-box with all HuggingFace models (text-to-text only for the moment). Checkpoints have to be in safetensors format; if they are not, you can convert them with `tools/pt_to_safetensors.py`.

[!TIP] You might be able to enjoy additional memory and computation savings when finetuning your models using the padding free transformer optimization. This optimization is currently only supported for decoder models and requires converting your model (say Llama-3, for example) to a custom class implemented in this repo. This is completely optional and not required for finetuning. The conversion can be achieved as follows:

```python
from lm_engine.hf_models import import_from_huggingface

import_from_huggingface(
    pretrained_model_name_or_path="ibm-granite/granite-3b-code-base",
    save_path="lm_engine_compatible_model",
)
```

Once done training, you can convert the model back to the HF class as:

```python
from lm_engine.hf_models import export_to_huggingface

export_to_huggingface(
    pretrained_model_name_or_path="trained_checkpoint",
    save_path="hf_compatible_model",
    model_type="llama",
)
```

If you are interested in using this optimization outside this repo for some reason, you can do as follows:

```python
from lm_engine.enums import Kernel
from lm_engine.hf_models import GPTBaseForCausalLM
from lm_engine.kernels import enable_kernels

# we need unpadded lists here to avoid any useless computation on pad tokens
# this is a bit different from the standard transformer, which takes in tensors and an attention mask
# if you turn off padding free transformers, you can use the tensor inputs with this class too
input_ids = [[1, 2, 3, 4, 5, 0], [6, 7, 8, 0]]
labels = [[-100, -100, -100, 4, 5, 0], [-100, -100, 8, 0]]

# this will throw a warning saying that the model is of gpt_bigcode class
# ignore the warning
model = GPTBaseForCausalLM.from_pretrained(<model_path>, use_padding_free_transformer=True).cuda()

with enable_kernels([Kernel.flash_attention_2]):
    loss = model(input_ids=input_ids, labels=labels).loss
```

Note that padding free transformers don't support generation; to run generation on the model, you will need to load it without padding free transformers.
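In the snippet above, label positions set to -100 mark prompt tokens, which are excluded from the loss (the standard HuggingFace ignore-index convention). A minimal sketch of how such labels might be built from prompt and response token lists — the helper name is hypothetical, not part of lm-engine's API:

```python
def build_labels(prompt_ids, response_ids, ignore_index=-100):
    """Mask prompt positions so only response tokens contribute to the loss."""
    # one label per input token: prompt positions get ignore_index,
    # response positions keep their token id
    return [ignore_index] * len(prompt_ids) + list(response_ids)

# mirrors the first example sequence above: prompt [1, 2, 3], response [4, 5, 0]
labels = build_labels([1, 2, 3], [4, 5, 0])
print(labels)  # [-100, -100, -100, 4, 5, 0]
```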

Usage

The typical training workflow looks like:

  1. Pretraining or finetuning: this is the actual training process.

```shell
# for finetuning
sh scripts/common/finetune.sh configs/sst2/training.yml

# for pretraining
sh scripts/common/pretrain.sh configs/pretraining-examples/pretrain-1.yml
```

  2. Unshard the checkpoint: this converts the model to a safetensors checkpoint, since lm-engine saves a sharded model during training.

```shell
sh scripts/common/unshard.sh configs/sst2/unshard.yml
```

Running basic inference

For a simple HuggingFace inference example, refer to `tools/inference.py`. For an example running tensor parallel inference, refer to `tools/tensor_parallel_inference.py`.

Using custom datasets

The data directory should obey the following structure:

```text
📦data
┣ 📂train
┃ ┣ 📜filename1.jsonl
┃ ┣ 📜filename2.jsonl
┃ ┗ 📜filename3.jsonl
┣ 📂val
┃ ┣ 📜filename1.jsonl
┃ ┣ 📜filename2.jsonl
┃ ┗ 📜filename3.jsonl
┗ 📂test
  ┣ 📜filename1.jsonl
  ┣ 📜filename2.jsonl
  ┗ 📜filename3.jsonl
```

Filenames can be anything as long as there are no whitespaces in them. Each line in each file should be a JSON object (jsonlines file format) with entries looking like:

```json
{"input": "The movie sucks", "output": "negative"}
{"input": "The movie was awesome", "output": "positive"}
```

Note that for the test set, only the `input` field is needed in the JSON instances on each line; the `output` field is not needed.

All the files in each directory are concatenated to form the respective split.
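As a sanity check, here is a small stdlib-only Python sketch (directory and file names are illustrative) that writes two jsonlines files into a train split and reads them back concatenated, mimicking the documented behavior:

```python
import json
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp()) / "data"
(root / "train").mkdir(parents=True)

# two jsonlines files; all files in a directory form one split
examples = [
    {"input": "The movie sucks", "output": "negative"},
    {"input": "The movie was awesome", "output": "positive"},
]
(root / "train" / "filename1.jsonl").write_text(json.dumps(examples[0]) + "\n")
(root / "train" / "filename2.jsonl").write_text(json.dumps(examples[1]) + "\n")

# read every file in the split and concatenate the parsed lines
split = []
for f in sorted((root / "train").glob("*.jsonl")):
    split.extend(json.loads(line) for line in f.read_text().splitlines())

print(len(split))  # 2
```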

If you need reformatting of the examples, you can use the `input_format` and `output_format` arguments. For example, `input_format = 'Classify the sentiment of the sentence:\n__input__\nSentiment:'` and `output_format = ' __output__'` reformat the input and output examples to:

```text
INPUT:
Classify the sentiment of the sentence:
The movie sucks
Sentiment:

OUTPUT:
 negative
```

If you don't need any reformatting, leave the arguments `input_format` and `output_format` at their default values `__input__` and `__output__` respectively.

Please note that the user is expected to provide this at both training and inference time.

Try not to have trailing spaces in `input_format`; if you need a space between input and output, the space should be part of `output_format`, as in the above example.
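The placeholder substitution described above can be sketched in plain Python — the helper name here is hypothetical, lm-engine applies the templates internally:

```python
def apply_format(template: str, placeholder: str, value: str) -> str:
    """Replace the placeholder token in a format template with the actual text."""
    return template.replace(placeholder, value)

input_format = "Classify the sentiment of the sentence:\n__input__\nSentiment:"
output_format = " __output__"  # leading space separates input from output

prompt = apply_format(input_format, "__input__", "The movie sucks")
target = apply_format(output_format, "__output__", "negative")

print(prompt)
# Classify the sentiment of the sentence:
# The movie sucks
# Sentiment:
print(repr(target))  # ' negative'
```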

[!TIP] Alternatively, you can also add your own dataset class in the repository if you don't want to use the jsonlines format or need custom logic to load your own dataset.

Currently, the repo has the following dataset classes implemented:

```text
AlpacaDataset
DebugDataset
DollyDataset
HuggingFaceDataset
SlimOrcaDataset
SST2Dataset
```

Using Megatron Dataset outside of this repository

This repository implements the dataloader from Megatron-LM for efficient pretraining. If for some reason you need to use that dataloader outside this repository, take a look at this example.

Supported optimizers

```python
# https://pytorch.org/docs/stable/optim.html
from torch.optim.adadelta import Adadelta as TorchAdadelta
from torch.optim.adagrad import Adagrad as TorchAdagrad
from torch.optim.adam import Adam as TorchAdam
from torch.optim.adamax import Adamax as TorchAdamax
from torch.optim.adamw import AdamW as TorchAdamW
from torch.optim.asgd import ASGD as TorchASGD
from torch.optim.lbfgs import LBFGS as TorchLBFGS
from torch.optim.nadam import NAdam as TorchNAdam
from torch.optim.radam import RAdam as TorchRAdam
from torch.optim.rmsprop import RMSprop as TorchRMSprop
from torch.optim.rprop import Rprop as TorchRprop
from torch.optim.sgd import SGD as TorchSGD
```

Citation

If you find this repository useful, please consider citing it in your research:

```bibtex
@software{Mishra_lm_engine_A_2024,
  author = {Mishra, Mayank},
  month = jun,
  title = {{LM Engine: A Hyper-Optimized Library for Pretraining and Finetuning}},
  url = {https://github.com/ibm/lm-engine},
  year = {2024}
}
```

Owner

  • Name: open-lm-engine
  • Login: open-lm-engine
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
date-released: 2024-06-23
message: "If you use this software, please cite it using this metadata."
title: "LM Engine: A Hyper-Optimized Library for Pretraining and Finetuning"
url: "https://github.com/open-lm-engine/lm-engine"
authors: 
  - family-names: Mishra
    given-names: Mayank

GitHub Events

Total
  • Watch event: 8
  • Delete event: 40
  • Issue comment event: 1
  • Push event: 361
  • Pull request review event: 10
  • Pull request review comment event: 4
  • Pull request event: 78
  • Fork event: 1
  • Create event: 43
Last Year
  • Watch event: 8
  • Delete event: 40
  • Issue comment event: 1
  • Push event: 361
  • Pull request review event: 10
  • Pull request review comment event: 4
  • Pull request event: 78
  • Fork event: 1
  • Create event: 43

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 32
  • Average time to close issues: N/A
  • Average time to close pull requests: about 8 hours
  • Total issue authors: 0
  • Total pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 17
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 32
  • Average time to close issues: N/A
  • Average time to close pull requests: about 8 hours
  • Issue authors: 0
  • Pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 17
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Pull Request Authors

Dependencies

.github/workflows/style.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/unit-tests.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
requirements-dev.txt pypi
  • parameterized * development
  • pre-commit * development
  • pytest * development
requirements.txt pypi
  • datasets *
  • peft *
  • pydantic *
  • safetensors *
  • torch >=2.3
  • transformers *
setup.py pypi