moe-infinity

PyTorch library for cost-effective, fast and easy serving of MoE models.

https://github.com/efficientmoe/moe-infinity

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.8%) to scientific vocabulary

Keywords

huggingface inference-engine large-language-models llm-inference mixture-of-experts pytorch
Last synced: 6 months ago

Repository

PyTorch library for cost-effective, fast and easy serving of MoE models.

Basic Info
  • Host: GitHub
  • Owner: EfficientMoE
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 614 KB
Statistics
  • Stars: 200
  • Watchers: 4
  • Forks: 17
  • Open Issues: 9
  • Releases: 0
Topics
huggingface inference-engine large-language-models llm-inference mixture-of-experts pytorch
Created about 2 years ago · Last pushed 8 months ago
Metadata Files
Readme License Code of conduct Citation

README.md

MoE-Infinity

MoE-Infinity is a cost-effective, fast, and easy-to-use library for Mixture-of-Experts (MoE) inference.

MoE-Infinity is cost-effective yet fast:

  • Offloading MoE's experts to host memory, allowing memory-constrained GPUs to serve MoE models.
  • Minimizing the expert offloading overheads through several novel techniques: expert activation tracing, activation-aware expert prefetching, and activation-aware expert caching (a toy sketch of this idea appears after this list).
  • Supporting LLM acceleration techniques (such as FlashAttention).
  • Supporting multi-GPU environments with numerous OS-level performance optimizations.
  • Achieving SOTA latency performance when serving MoEs in a resource-constrained GPU environment (in comparison with vLLM, HuggingFace Accelerate, DeepSpeed, Mixtral-Offloading, and Ollama/llama.cpp).
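
The tracing, prefetching, and caching logic lives inside the library; purely as an illustration of the general idea behind activation-aware expert caching (a toy sketch, not MoE-Infinity's actual implementation), a cache can keep the most frequently activated experts resident on the GPU and evict the "coldest" ones first:

```python
# Toy illustration of activation-aware expert caching (NOT MoE-Infinity's code).
# Experts stay on the GPU while capacity allows; eviction targets the expert with
# the lowest observed activation count, so frequently used experts remain resident.
from collections import Counter


class ToyExpertCache:
    def __init__(self, capacity: int):
        self.capacity = capacity        # max number of experts resident on GPU
        self.resident = set()           # expert ids currently on GPU
        self.activations = Counter()    # observed activation counts (the "trace")

    def fetch(self, expert_id: int) -> None:
        """Ensure an expert is resident, evicting the least-activated one if needed."""
        self.activations[expert_id] += 1
        if expert_id in self.resident:
            return
        if len(self.resident) >= self.capacity:
            coldest = min(self.resident, key=lambda e: self.activations[e])
            self.resident.remove(coldest)   # real code would move weights back to host memory
        self.resident.add(expert_id)        # real code would copy weights host -> GPU here


cache = ToyExpertCache(capacity=2)
for expert in [0, 1, 0, 2, 0, 1]:
    cache.fetch(expert)
print(sorted(cache.resident))  # the most frequently activated experts stay resident
```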

MoE-Infinity is easy-to-use:

Note: the open-sourced MoE-Infinity has been redesigned to be friendly to HuggingFace users. It differs from the version reported in the paper, which puts extreme performance first; as a result, distributed inference is currently not supported in this open-sourced version.

Performance

Single GPU (A5000, 24 GB memory), per-token latency in seconds for generation on a mixed dataset including LongBench, GSM8K, FLAN, BIG-Bench, and MMLU. Lower per-token latency is better.

| | Switch-large-128 | NLLB-MoE-54B | Mixtral-8x7b | DeepSeek-V2-Lite |
| :---: | :---: | :---: | :---: | :---: |
| MoE-Infinity | 0.130 | 0.119 | 0.735 | 0.155 |
| Accelerate | 1.043 | 3.071 | 6.633 | 1.743 |
| DeepSpeed | 4.578 | 8.381 | 2.486 | 0.737 |
| Mixtral Offloading | X | X | 1.752 | X |
| Ollama | X | X | 0.903 | 1.250 |
| vLLM | X | X | 2.137 | 0.493 |

Installation

We recommend installing MoE-Infinity in a virtual environment. To install MoE-Infinity, you can either install it from PyPI or build it from source.

```bash
conda create -n moe-infinity python=3.9
conda activate moe-infinity

# installing from either PyPI or source will pull in requirements.txt automatically
```

Install from PyPI

```bash
# install stable release
pip install moe-infinity

# install nightly release
pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ moe-infinity
```

Install from Source

```bash
git clone https://github.com/EfficientMoE/MoE-Infinity.git
cd MoE-Infinity
pip install -e .
conda install -c conda-forge libstdcxx-ng=12  # assumes conda; otherwise install libstdcxx-ng=12 or gcc=12 with your package manager
```

Enable FlashAttention (Optional)

Install FlashAttention (>=2.5.2) for faster inference with the following command:

```bash
FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn
```

Post-installation, MoE-Infinity will automatically integrate with FlashAttention to enhance performance.
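
As an optional sanity check (not part of the official instructions), you can confirm that flash-attn imports correctly and meets the minimum version before launching inference:

```python
# Optional check that flash-attn is installed and new enough.
# Assumes flash_attn exposes __version__ (recent releases do).
import importlib.util

if importlib.util.find_spec("flash_attn") is None:
    print("flash-attn is not installed")
else:
    import flash_attn
    print("flash-attn version:", flash_attn.__version__)  # expect >= 2.5.2
```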

Usage and Examples

We provide a simple API for diverse setups, including single GPU, multiple GPUs, and multiple nodes. The following examples show how to use MoE-Infinity to run generation with a HuggingFace LLM.

Important Note

  • The offload_path must be unique for each MoE model. Reusing the same offload_path for different MoE models will result in unexpected behavior.
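
One simple way to honor this constraint (a suggested convention, not part of the library's API) is to derive the offload path from the checkpoint name:

```python
# Hypothetical helper: give every checkpoint its own offload directory so that
# two MoE models never share the same offload_path.
import os


def offload_path_for(checkpoint: str, base_dir: str = "~/moe-infinity") -> str:
    # "deepseek-ai/DeepSeek-V2-Lite-Chat" -> "~/moe-infinity/deepseek-ai--DeepSeek-V2-Lite-Chat"
    return os.path.join(os.path.expanduser(base_dir), checkpoint.replace("/", "--"))


print(offload_path_for("deepseek-ai/DeepSeek-V2-Lite-Chat"))
```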

Sample Code of Huggingface LLM Inference

```python
import torch
import os
from transformers import AutoTokenizer
from moe_infinity import MoE

user_home = os.path.expanduser('~')

checkpoint = "deepseek-ai/DeepSeek-V2-Lite-Chat"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

config = {
    "offload_path": os.path.join(user_home, "moe-infinity"),
    # 75% of the device memory is used for caching; reduce this value if you hit OOM
    "device_memory_ratio": 0.75,
}

model = MoE(checkpoint, config)

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda:0")

output_ids = model.generate(input_ids)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(output_text)
```

Running Inference

This command runs the script on the selected GPUs:

```bash
CUDA_VISIBLE_DEVICES=0,1 python script.py
```

We provide a simple example of running inference on a HuggingFace LLM. The script downloads the model checkpoint, runs inference on the specified input text, and prints the output to the console.

```bash
CUDA_VISIBLE_DEVICES=0 python examples/interface_example.py --model_name_or_path "deepseek-ai/DeepSeek-V2-Lite-Chat" --offload_dir <your local path on SSD>
```

OpenAI-Compatible Server

Start the OpenAI-compatible server locally:

```bash
python -m moe_infinity.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-V2-Lite-Chat --offload-dir ./offload_dir
```

Query the model via /v1/completions. (We currently only support the required fields, i.e., "model" and "prompt".)

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V2-Lite-Chat",
    "prompt": "Hello, my name is"
  }'
```

You can also use the openai Python package to query the model:

```bash
pip install openai
python tests/test_oai_completions.py
```
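
The test script above already covers this call; as a minimal standalone sketch with the openai Python client (>=1.0), assuming the local server does not validate the API key:

```python
# Minimal sketch: query the local MoE-Infinity server with the openai client.
# The API key is a placeholder; the local server is assumed not to check it.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="deepseek-ai/DeepSeek-V2-Lite-Chat",
    prompt="Hello, my name is",
)
print(completion.choices[0].text)
```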

Query the model via /v1/chat/completions. (We currently only support the required fields, i.e., "model" and "messages".)

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V2-Lite-Chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me a joke"}
    ]
  }'
```

You can also use the openai Python package to query the model:

```bash
pip install openai
python tests/test_oai_chat_completions.py
```
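
The same sketch for the chat endpoint, again with a placeholder API key:

```python
# Minimal sketch: chat completion against the local server (placeholder API key).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
chat = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V2-Lite-Chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke"},
    ],
)
print(chat.choices[0].message.content)
```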

Release Plan

We plan to release the following features in the coming months:

  • We currently support PyTorch as the default inference engine and are working on supporting vLLM as an additional inference runtime, including KV-cache offloading.
  • Supporting expert parallelism for distributed MoE inference.
  • More (We welcome contributors to join us!)

Citation

If you use MoE-Infinity for your research, please cite our paper:

```bibtex
@misc{moe-infinity,
  author       = {Leyang Xue and Yao Fu and Zhan Lu and Luo Mai and Mahesh Marina},
  title        = {MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache},
  archivePrefix= {arXiv},
  eprint       = {2401.14361},
  year         = {2024}
}
```

Owner

  • Name: EfficientMoE
  • Login: EfficientMoE
  • Kind: organization

Citation (CITATIONS.md)

```bibtex
@misc{moe-infinity,
  author       = {Leyang Xue and
                  Yao Fu and
                  Zhan Lu and
                  Luo Mai and
                  Mahesh Marina},
  title        = {MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache},
  archivePrefix= {arXiv},
  eprint       = {2401.14361},
  year         = {2024}
}
```

GitHub Events

Total
  • Issues event: 22
  • Watch event: 69
  • Delete event: 11
  • Issue comment event: 26
  • Push event: 42
  • Pull request review comment event: 7
  • Pull request event: 18
  • Pull request review event: 16
  • Fork event: 4
  • Create event: 9
Last Year
  • Issues event: 22
  • Watch event: 69
  • Delete event: 11
  • Issue comment event: 26
  • Push event: 42
  • Pull request review comment event: 7
  • Pull request event: 18
  • Pull request review event: 16
  • Fork event: 4
  • Create event: 9