moe-infinity

PyTorch library for cost-effective, fast and easy serving of MoE models.

https://github.com/efficientmoe/moe-infinity

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.8%) to scientific vocabulary

Keywords

huggingface inference-engine large-language-models llm-inference mixture-of-experts pytorch
Last synced: 6 months ago

Repository

PyTorch library for cost-effective, fast and easy serving of MoE models.

Basic Info
  • Host: GitHub
  • Owner: EfficientMoE
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 614 KB
Statistics
  • Stars: 200
  • Watchers: 4
  • Forks: 17
  • Open Issues: 9
  • Releases: 0
Topics
huggingface inference-engine large-language-models llm-inference mixture-of-experts pytorch
Created about 2 years ago · Last pushed 8 months ago
Metadata Files
Readme License Code of conduct Citation

README.md

MoE-Infinity

MoE-Infinity is a cost-effective, fast, and easy-to-use library for Mixture-of-Experts (MoE) inference.

MoE-Infinity is cost-effective yet fast:

  • Offloading MoE's experts to host memory, allowing memory-constrained GPUs to serve MoE models.
  • Minimizing the expert offloading overheads through several novel techniques: expert activation tracing, activation-aware expert prefetching, and activation-aware expert caching (a toy sketch of this idea appears after this list).
  • Supporting LLM acceleration techniques (such as FlashAttention).
  • Supporting multi-GPU environments with numerous OS-level performance optimizations.
  • Achieving SOTA latency performance when serving MoEs in a resource-constrained GPU environment (in comparison with vLLM, HuggingFace Accelerate, DeepSpeed, Mixtral-Offloading, and Ollama/llama.cpp).
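
The tracing, prefetching, and caching logic lives inside the library; purely as an illustration of the general idea behind activation-aware expert caching (a toy sketch, not MoE-Infinity's actual implementation), a cache can keep the most frequently activated experts resident on the GPU and evict the "coldest" ones first:

```python
# Toy illustration of activation-aware expert caching (NOT MoE-Infinity's code).
# Experts stay on the GPU while capacity allows; eviction targets the expert with
# the lowest observed activation count, so frequently used experts remain resident.
from collections import Counter


class ToyExpertCache:
    def __init__(self, capacity: int):
        self.capacity = capacity        # max number of experts resident on GPU
        self.resident = set()           # expert ids currently on GPU
        self.activations = Counter()    # observed activation counts (the "trace")

    def fetch(self, expert_id: int) -> None:
        """Ensure an expert is resident, evicting the least-activated one if needed."""
        self.activations[expert_id] += 1
        if expert_id in self.resident:
            return
        if len(self.resident) >= self.capacity:
            coldest = min(self.resident, key=lambda e: self.activations[e])
            self.resident.remove(coldest)   # real code would move weights back to host memory
        self.resident.add(expert_id)        # real code would copy weights host -> GPU here


cache = ToyExpertCache(capacity=2)
for expert in [0, 1, 0, 2, 0, 1]:
    cache.fetch(expert)
print(sorted(cache.resident))  # the most frequently activated experts stay resident
```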

MoE-Infinity is easy-to-use:

Note: the open-sourced MoE-Infinity has been redesigned to be friendly to HuggingFace users. It differs from the version reported in the paper, which puts extreme performance first; as a result, distributed inference is currently not supported in this open-sourced version.

Performance

Single GPU (A5000, 24 GB memory), per-token latency in seconds for generation on a mixed dataset including LongBench, GSM8K, FLAN, BIG-Bench, and MMLU. Lower per-token latency is better.

| | Switch-large-128 | NLLB-MoE-54B | Mixtral-8x7b | DeepSeek-V2-Lite |
| :---: | :---: | :---: | :---: | :---: |
| MoE-Infinity | 0.130 | 0.119 | 0.735 | 0.155 |
| Accelerate | 1.043 | 3.071 | 6.633 | 1.743 |
| DeepSpeed | 4.578 | 8.381 | 2.486 | 0.737 |
| Mixtral Offloading | X | X | 1.752 | X |
| Ollama | X | X | 0.903 | 1.250 |
| vLLM | X | X | 2.137 | 0.493 |

Installation

We recommend installing MoE-Infinity in a virtual environment. To install MoE-Infinity, you can either install it from PyPI or build it from source.

```bash
conda create -n moe-infinity python=3.9
conda activate moe-infinity

# installing from either PyPI or source will pull in requirements.txt automatically
```

Install from PyPI

```bash
# install stable release
pip install moe-infinity

# install nightly release
pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ moe-infinity
```

Install from Source

```bash
git clone https://github.com/EfficientMoE/MoE-Infinity.git
cd MoE-Infinity
pip install -e .
conda install -c conda-forge libstdcxx-ng=12  # assumes conda; otherwise install libstdcxx-ng=12 or gcc=12 with your package manager
```

Enable FlashAttention (Optional)

Install FlashAttention (>=2.5.2) for faster inference with the following command:

```bash
FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn
```

Post-installation, MoE-Infinity will automatically integrate with FlashAttention to enhance performance.
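
As an optional sanity check (not part of the official instructions), you can confirm that flash-attn imports correctly and meets the minimum version before launching inference:

```python
# Optional check that flash-attn is installed and new enough.
# Assumes flash_attn exposes __version__ (recent releases do).
import importlib.util

if importlib.util.find_spec("flash_attn") is None:
    print("flash-attn is not installed")
else:
    import flash_attn
    print("flash-attn version:", flash_attn.__version__)  # expect >= 2.5.2
```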

Usage and Examples

We provide a simple API for diverse setups, including single GPU, multiple GPUs, and multiple nodes. The following examples show how to use MoE-Infinity to run generation with a HuggingFace LLM.

Important Note

  • The offload_path must be unique for each MoE model. Reusing the same offload_path for different MoE models will result in unexpected behavior.
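
One simple way to honor this constraint (a suggested convention, not part of the library's API) is to derive the offload path from the checkpoint name:

```python
# Hypothetical helper: give every checkpoint its own offload directory so that
# two MoE models never share the same offload_path.
import os


def offload_path_for(checkpoint: str, base_dir: str = "~/moe-infinity") -> str:
    # "deepseek-ai/DeepSeek-V2-Lite-Chat" -> "~/moe-infinity/deepseek-ai--DeepSeek-V2-Lite-Chat"
    return os.path.join(os.path.expanduser(base_dir), checkpoint.replace("/", "--"))


print(offload_path_for("deepseek-ai/DeepSeek-V2-Lite-Chat"))
```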

Sample Code of Huggingface LLM Inference

```python
import torch
import os
from transformers import AutoTokenizer
from moe_infinity import MoE

user_home = os.path.expanduser('~')

checkpoint = "deepseek-ai/DeepSeek-V2-Lite-Chat"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

config = {
    "offload_path": os.path.join(user_home, "moe-infinity"),
    # 75% of the device memory is used for caching; reduce this value if you hit OOM
    "device_memory_ratio": 0.75,
}

model = MoE(checkpoint, config)

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda:0")

output_ids = model.generate(input_ids)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(output_text)
```

Running Inference

This command runs the script on the selected GPUs:

```bash
CUDA_VISIBLE_DEVICES=0,1 python script.py
```

We provide a simple example of running inference on a HuggingFace LLM. The script downloads the model checkpoint, runs inference on the specified input text, and prints the output to the console.

```bash
CUDA_VISIBLE_DEVICES=0 python examples/interface_example.py --model_name_or_path "deepseek-ai/DeepSeek-V2-Lite-Chat" --offload_dir <your local path on SSD>
```

OpenAI-Compatible Server

Start the OpenAI-compatible server locally:

```bash
python -m moe_infinity.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-V2-Lite-Chat --offload-dir ./offload_dir
```

Query the model via /v1/completions. (We currently only support the required fields, i.e., "model" and "prompt".)

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V2-Lite-Chat",
    "prompt": "Hello, my name is"
  }'
```

You can also use the openai Python package to query the model:

```bash
pip install openai
python tests/test_oai_completions.py
```
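
The test script above already covers this call; as a minimal standalone sketch with the openai Python client (>=1.0), assuming the local server does not validate the API key:

```python
# Minimal sketch: query the local MoE-Infinity server with the openai client.
# The API key is a placeholder; the local server is assumed not to check it.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="deepseek-ai/DeepSeek-V2-Lite-Chat",
    prompt="Hello, my name is",
)
print(completion.choices[0].text)
```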

Query the model via /v1/chat/completions. (We currently only support the required fields, i.e., "model" and "messages".)

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V2-Lite-Chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me a joke"}
    ]
  }'
```

You can also use the openai Python package to query the model:

```bash
pip install openai
python tests/test_oai_chat_completions.py
```
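
The same sketch for the chat endpoint, again with a placeholder API key:

```python
# Minimal sketch: chat completion against the local server (placeholder API key).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
chat = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V2-Lite-Chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke"},
    ],
)
print(chat.choices[0].message.content)
```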

Release Plan

We plan to release the following features in the coming months:

  • We currently support PyTorch as the default inference engine and are working on supporting vLLM as an additional inference runtime, including KV-cache offloading.
  • Supporting expert parallelism for distributed MoE inference.
  • More (We welcome contributors to join us!)

Citation

If you use MoE-Infinity for your research, please cite our paper:

```bibtex
@misc{moe-infinity,
  author       = {Leyang Xue and Yao Fu and Zhan Lu and Luo Mai and Mahesh Marina},
  title        = {MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache},
  archivePrefix= {arXiv},
  eprint       = {2401.14361},
  year         = {2024}
}
```

Owner

  • Name: EfficientMoE
  • Login: EfficientMoE
  • Kind: organization

Citation (CITATIONS.md)

```bibtex
@misc{moe-infinity,
  author       = {Leyang Xue and
                  Yao Fu and
                  Zhan Lu and
                  Luo Mai and
                  Mahesh Marina},
  title        = {MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache},
  archivePrefix= {arXiv},
  eprint       = {2401.14361},
  year         = {2024}
}
```

GitHub Events

Total
  • Issues event: 22
  • Watch event: 69
  • Delete event: 11
  • Issue comment event: 26
  • Push event: 42
  • Pull request review comment event: 7
  • Pull request event: 18
  • Pull request review event: 16
  • Fork event: 4
  • Create event: 9
Last Year
  • Issues event: 22
  • Watch event: 69
  • Delete event: 11
  • Issue comment event: 26
  • Push event: 42
  • Pull request review comment event: 7
  • Pull request event: 18
  • Pull request review event: 16
  • Fork event: 4
  • Create event: 9