https://github.com/all-secure-src/tgi-2.0

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
○ .zenodo.json file
○ DOI references
✓ Academic publication links (links to: arxiv.org)
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (11.2%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: all-secure-src
License: apache-2.0
Language: Python
Default Branch: main
Size: 1.91 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 2 years ago · Last pushed about 2 years ago

Metadata Files

Readme License

README.md

# Text Generation Inference

A Rust, Python and gRPC server for text generation inference. Used in production at [HuggingFace](https://huggingface.co) to power Hugging Chat, the Inference API and Inference Endpoint.

Get Started
Optimized architectures
Run Mistral
- Run
- Quantization
Develop
Testing

Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more. TGI implements many features, such as:

Simple launcher to serve most popular LLMs
Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
Tensor Parallelism for faster inference on multiple GPUs
Token streaming using Server-Sent Events (SSE)
Continuous batching of incoming requests for increased total throughput
Optimized transformers code for inference using Flash Attention and Paged Attention on the most popular architectures
Quantization with:
- bitsandbytes
- GPT-Q
- EETQ
- AWQ
Safetensors weight loading
Watermarking with A Watermark for Large Language Models
Logits warper (temperature scaling, top-p, top-k, repetition penalty; for more details, see transformers.LogitsProcessor)
Stop sequences
Log probabilities
Speculation (~2x lower latency)
Guidance/JSON: specify an output format to speed up inference and make sure the output is valid according to a given spec.
Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output
Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance

Hardware support

Get Started

Docker

For a detailed starting guide, please see the Quick Tour. The easiest way of getting started is using the official Docker container:

```shell
model=HuggingFaceH4/zephyr-7b-beta
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model
```

And then you can make requests like

```bash
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
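The same request can be issued from Python. The sketch below uses only the standard library: it builds the JSON body from the curl command and decodes one Server-Sent Events line at a time. The helper names (`build_payload`, `parse_sse_line`, `stream_tokens`) are hypothetical, and the shape of each streamed event (a `token` object with a `text` field) is an assumption about the server's output, not something this README states.

```python
import json
import urllib.request


def build_payload(prompt, max_new_tokens=20):
    """Return the JSON body used by the curl example."""
    return {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}


def parse_sse_line(line):
    """Decode one SSE line; return the event dict, or None for non-data lines."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank separator lines, comments, and so on
    return json.loads(line[len("data:"):])


def stream_tokens(prompt, url="http://127.0.0.1:8080/generate_stream"):
    """Yield token texts from a running server (assumed event shape)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        for raw in response:
            event = parse_sse_line(raw.decode("utf-8"))
            if event is not None:
                # Assumption: each event carries {"token": {"text": ...}}.
                yield event["token"]["text"]
```

With a server listening on port 8080, `for text in stream_tokens("What is Deep Learning?"): print(text, end="")` would print tokens as they arrive; the two pure helpers can be used without a server.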

Note: To use NVIDIA GPUs, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. To run the Docker container on a machine with no GPUs or CUDA support, remove the --gpus all flag and add --disable-custom-kernels. Note that CPU is not the intended platform for this project, so performance might be subpar.

Note: TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the Supported Hardware documentation. To use AMD GPUs, please use docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0-rocm --model-id $model instead of the command above.
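The three variants described so far (NVIDIA, AMD, and CPU-only) differ only in a few flags. As a rough sketch, the hypothetical helper `docker_command` below assembles the argument list for each case. The flags and image tags are the ones quoted in this README; passing --disable-custom-kernels after the image name, as a launcher argument, is an assumption.

```python
import shlex

IMAGE = "ghcr.io/huggingface/text-generation-inference"


def docker_command(model, volume, backend="cuda"):
    """Assemble the `docker run` argument list for one of three backends."""
    cmd = ["docker", "run"]
    tag = "2.0"
    launcher_flags = []
    if backend == "cuda":
        cmd += ["--gpus", "all"]
    elif backend == "rocm":
        cmd += ["--device", "/dev/kfd", "--device", "/dev/dri"]
        tag = "2.0-rocm"
    elif backend == "cpu":
        # Assumption: this flag is consumed by the launcher inside the image.
        launcher_flags = ["--disable-custom-kernels"]
    else:
        raise ValueError(f"unknown backend: {backend!r}")
    cmd += ["--shm-size", "1g", "-p", "8080:80", "-v", f"{volume}:/data"]
    cmd += [f"{IMAGE}:{tag}", "--model-id", model] + launcher_flags
    return cmd


# Print a copy-pasteable command line for the AMD variant.
print(shlex.join(docker_command("HuggingFaceH4/zephyr-7b-beta", "/tmp/data", "rocm")))
```

Returning a list rather than a string keeps the result safe to hand to `subprocess.run` without shell quoting issues.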

To see all options to serve your models (in the code or in the cli): text-generation-launcher --help

API documentation

You can consult the OpenAPI documentation of the text-generation-inference REST API using the /docs route. The Swagger UI is also available at: https://huggingface.github.io/text-generation-inference.

Using a private or gated model

Set the HUGGING_FACE_HUB_TOKEN environment variable to configure the token that text-generation-inference uses to access protected resources.

For example, if you want to serve the gated Llama V2 model variants:

1. Go to https://huggingface.co/settings/tokens
2. Copy your cli READ token
3. Export HUGGING_FACE_HUB_TOKEN=<your cli READ token>

or with Docker:

```shell
model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>

docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model
```

A note on Shared Memory (shm)

NCCL is a communication framework used by PyTorch to do distributed training/inference. text-generation-inference makes use of NCCL to enable Tensor Parallelism to dramatically speed up inference for large language models.

In order to share data between the different devices of an NCCL group, NCCL might fall back to using the host memory if peer-to-peer using NVLink or PCI is not possible.

To allow the container to use 1G of Shared Memory and support SHM sharing, we add --shm-size 1g to the above command.

If you are running text-generation-inference inside Kubernetes, you can also add Shared Memory to the container by creating a volume with:

```yaml
- name: shm
  emptyDir:
    medium: Memory
    sizeLimit: 1Gi
```

and mounting it to /dev/shm.
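To show how the volume and its mount fit together in a pod spec, here is a minimal sketch that works on a plain dictionary. The helper name `with_shared_memory` and the container name `tgi` are hypothetical; the field names follow the standard Kubernetes pod schema.

```python
import json


def with_shared_memory(pod_spec, container_name, size_limit="1Gi"):
    """Return a copy of pod_spec with a memory-backed volume mounted at /dev/shm."""
    spec = json.loads(json.dumps(pod_spec))  # deep copy of plain JSON data
    spec.setdefault("volumes", []).append(
        {"name": "shm", "emptyDir": {"medium": "Memory", "sizeLimit": size_limit}}
    )
    for container in spec.get("containers", []):
        if container["name"] == container_name:
            container.setdefault("volumeMounts", []).append(
                {"name": "shm", "mountPath": "/dev/shm"}
            )
            return spec
    raise KeyError(f"no container named {container_name!r}")


# Hypothetical single-container pod spec used only for illustration.
pod_spec = {
    "containers": [
        {"name": "tgi", "image": "ghcr.io/huggingface/text-generation-inference:2.0"}
    ]
}
print(json.dumps(with_shared_memory(pod_spec, "tgi"), indent=2))
```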

Finally, you can also disable SHM sharing by using the NCCL_SHM_DISABLE=1 environment variable. However, note that this will impact performance.

Distributed Tracing

text-generation-inference is instrumented with distributed tracing using OpenTelemetry. You can use this feature by setting the address to an OTLP collector with the --otlp-endpoint argument.

Architecture

TGI architecture

Local install

You can also opt to install text-generation-inference locally.

First install Rust and create a Python virtual environment with at least Python 3.9, e.g. using conda:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

conda create -n text-generation-inference python=3.11
conda activate text-generation-inference
```

You may also need to install Protoc.

On Linux:

```shell
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
```

On MacOS, using Homebrew:

```shell
brew install protobuf
```

Then run:

```shell
BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
```

Note: on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

```shell
sudo apt-get install libssl-dev gcc -y
```

Optimized architectures

TGI works out of the box with optimized implementations of all modern model architectures. They can be found in this list.

Other architectures are supported on a best-effort basis using:

AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")

AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")

Run locally

Run

```shell
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
```

Quantization

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

```shell
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes
```

4bit quantization is available using the NF4 and FP4 data types from bitsandbytes. It can be enabled by providing --quantize bitsandbytes-nf4 or --quantize bitsandbytes-fp4 as a command line argument to text-generation-launcher.
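When launching from a script, it can help to validate the quantization mode before starting the server. The sketch below is a hypothetical helper (`launcher_command`) that accepts only the bitsandbytes modes named in this section; treating plain `bitsandbytes` as a valid value is an assumption, and other releases may accept more or fewer values.

```python
import shlex

# Modes named in this README section; plain "bitsandbytes" is an assumption.
BITSANDBYTES_MODES = ("bitsandbytes", "bitsandbytes-nf4", "bitsandbytes-fp4")


def launcher_command(model_id, quantize=None):
    """Build the text-generation-launcher argument list, checking --quantize."""
    cmd = ["text-generation-launcher", "--model-id", model_id]
    if quantize is not None:
        if quantize not in BITSANDBYTES_MODES:
            raise ValueError(
                f"unsupported mode {quantize!r}; expected one of {BITSANDBYTES_MODES}"
            )
        cmd += ["--quantize", quantize]
    return cmd


# Print a copy-pasteable command line for 4-bit NF4 quantization.
print(shlex.join(launcher_command("mistralai/Mistral-7B-Instruct-v0.2", "bitsandbytes-nf4")))
```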

Develop

```shell
make server-dev
make router-dev
```

Testing

```shell
# python
make python-server-tests
make python-client-tests
# or both server and client tests
make python-tests
# rust cargo tests
make rust-tests
# integration tests
make integration-tests
```

Owner

Name: all-secure-src
Login: all-secure-src
Kind: organization

Repositories: 1
Profile: https://github.com/all-secure-src


Dependencies

.github/workflows/autodocs.yml actions

actions/checkout v2 composite

.github/workflows/build.yaml actions

actions/checkout v2 composite
actions/checkout v3 composite
actions/setup-python v4 composite
aws-actions/configure-aws-credentials v1 composite
docker/build-push-action v4 composite
docker/login-action v2.1.0 composite
docker/login-action v2 composite
docker/metadata-action v4.3.0 composite
docker/setup-buildx-action v2.0.0 composite
philschmid/philschmid-ec2-github-runner main composite
rlespinasse/github-slug-action v4.4.1 composite
tailscale/github-action 7bd8039bf25c23c4ab1b8d6e2cc2da2280601966 composite

.github/workflows/build_documentation.yml actions

.github/workflows/build_pr_documentation.yml actions

.github/workflows/client-tests.yaml actions

actions/checkout v2 composite
actions/setup-python v1 composite

.github/workflows/load_test.yaml actions

actions/checkout v3 composite
aws-actions/configure-aws-credentials v1 composite
philschmid/philschmid-ec2-github-runner main composite

.github/workflows/stale.yml actions

actions/stale v8 composite

.github/workflows/tests.yaml actions

actions-rs/toolchain v1 composite
actions/cache v3 composite
actions/checkout v2 composite
actions/github-script v6 composite
actions/setup-python v1 composite
arduino/setup-protoc v1 composite

.github/workflows/upload_pr_documentation.yml actions

Cargo.lock cargo

459 dependencies

Cargo.toml cargo

benchmark/Cargo.toml cargo

launcher/Cargo.toml cargo

float_eq 1.0.1 development
reqwest 0.11.20 development
clap 4.4.5
ctrlc 3.4.1
hf-hub 0.3.2
nix 0.28.0
once_cell 1.19.0
serde 1.0.188
serde_json 1.0.107
tracing 0.1.37
tracing-subscriber 0.3.17

router/Cargo.toml cargo

router/client/Cargo.toml cargo

router/grpc-metadata/Cargo.toml cargo

Dockerfile docker

base latest build
chef latest build
kernel-builder latest build
lukemathwalker/cargo-chef latest-rust-1.75 build
nvidia/cuda 12.1.0-devel-ubuntu22.04 build
nvidia/cuda 12.1.0-base-ubuntu22.04 build
pytorch-install latest build

clients/python/poetry.lock pypi

aiohttp 3.8.5
aiosignal 1.3.1
annotated-types 0.5.0
async-timeout 4.0.3
asynctest 0.13.0
atomicwrites 1.4.1
attrs 23.1.0
certifi 2023.7.22
charset-normalizer 3.2.0
colorama 0.4.6
coverage 7.2.7
filelock 3.12.2
frozenlist 1.3.3
fsspec 2023.1.0
huggingface-hub 0.16.4
idna 3.4
importlib-metadata 6.7.0
iniconfig 2.0.0
multidict 6.0.4
packaging 23.1
pluggy 1.2.0
py 1.11.0
pydantic 2.5.3
pydantic-core 2.14.6
pytest 6.2.5
pytest-asyncio 0.17.2
pytest-cov 3.0.0
pyyaml 6.0.1
requests 2.31.0
toml 0.10.2
tomli 2.0.1
tqdm 4.66.1
typing-extensions 4.7.1
urllib3 2.0.5
yarl 1.9.2
zipp 3.15.0

clients/python/pyproject.toml pypi

pytest ^6.2.5 develop
pytest-asyncio ^0.17.2 develop
pytest-cov ^3.0.0 develop
aiohttp ^3.8
huggingface-hub >= 0.12, < 1.0
pydantic > 2, < 3
python ^3.7

integration-tests/poetry.lock pypi

aiohttp 3.8.5
aiosignal 1.3.1
annotated-types 0.6.0
async-timeout 4.0.3
attrs 23.1.0
certifi 2023.7.22
charset-normalizer 3.2.0
colorama 0.4.6
colored 1.4.4
docker 6.1.3
exceptiongroup 1.1.3
filelock 3.12.3
frozenlist 1.4.0
fsspec 2023.6.0
huggingface-hub 0.16.4
idna 3.4
iniconfig 2.0.0
multidict 6.0.4
packaging 23.1
pluggy 1.3.0
pydantic 2.6.4
pydantic-core 2.16.3
pytest 7.4.0
pytest-asyncio 0.21.1
pywin32 306
pyyaml 6.0.1
requests 2.31.0
syrupy 4.0.1
text-generation 0.6.1
tomli 2.0.1
tqdm 4.66.1
typing-extensions 4.7.1
urllib3 2.0.4
websocket-client 1.6.2
yarl 1.9.2

integration-tests/pyproject.toml pypi

docker ^6.1.3
pydantic > 2, < 3
pytest ^7.4.0
pytest-asyncio ^0.21.1
python >=3.9,<3.13
syrupy 4.0.1
text-generation ^0.6.0

integration-tests/requirements.txt pypi

aiohttp ==3.8.5 test
aiosignal ==1.3.1 test
annotated-types ==0.6.0 test
async-timeout ==4.0.3 test
attrs ==23.1.0 test
certifi ==2023.7.22 test
charset-normalizer ==3.2.0 test
colorama ==0.4.6 test
colored ==1.4.4 test
docker ==6.1.3 test
exceptiongroup ==1.1.3 test
filelock ==3.12.3 test
frozenlist ==1.4.0 test
fsspec ==2023.6.0 test
huggingface-hub ==0.16.4 test
idna ==3.4 test
iniconfig ==2.0.0 test
multidict ==6.0.4 test
packaging ==23.1 test
pluggy ==1.3.0 test
pydantic ==2.6.4 test
pydantic-core ==2.16.3 test
pytest ==7.4.0 test
pytest-asyncio ==0.21.1 test
pywin32 ==306 test
pyyaml ==6.0.1 test
requests ==2.31.0 test
syrupy ==4.0.1 test
text-generation ==0.6.1 test
tomli ==2.0.1 test
tqdm ==4.66.1 test
typing-extensions ==4.7.1 test
urllib3 ==2.0.4 test
websocket-client ==1.6.2 test
yarl ==1.9.2 test

server/custom_kernels/setup.py pypi

server/exllama_kernels/setup.py pypi

server/exllamav2_kernels/setup.py pypi

server/poetry.lock pypi

110 dependencies

server/pyproject.toml pypi

grpcio-tools ^1.51.1 develop
pytest ^7.3.0 develop
accelerate ^0.29.1
bitsandbytes ^0.43.0
datasets ^2.14.0
einops ^0.6.1
grpc-interceptor ^0.15.0
grpcio ^1.51.1
grpcio-reflection ^1.51.1
grpcio-status ^1.51.1
hf-transfer ^0.1.2
huggingface-hub ^0.19.3
loguru ^0.6.0
opentelemetry-api ^1.15.0
opentelemetry-exporter-otlp ^1.15.0
opentelemetry-instrumentation-grpc ^0.36b0
outlines ^0.0.36
peft ^0.10
pillow ^10.0.0
protobuf ^4.21.7
python >=3.9,<3.13
safetensors ^0.4
scipy ^1.11.1
sentencepiece ^0.1.97
texttable ^1.6.7
tokenizers ^0.15.0
torch ^2.1.1
transformers ^4.39
typer ^0.6.1

server/requirements_cuda.txt pypi

backoff ==2.2.1
certifi ==2024.2.2
charset-normalizer ==3.3.2
click ==8.1.7
colorama ==0.4.6
deprecated ==1.2.14
einops ==0.6.1
filelock ==3.13.3
fsspec ==2024.2.0
googleapis-common-protos ==1.63.0
grpc-interceptor ==0.15.4
grpcio ==1.62.1
grpcio-reflection ==1.62.1
grpcio-status ==1.62.1
hf-transfer ==0.1.6
huggingface-hub ==0.19.4
idna ==3.6
loguru ==0.6.0
numpy ==1.26.4
opentelemetry-api ==1.15.0
opentelemetry-exporter-otlp ==1.15.0
opentelemetry-exporter-otlp-proto-grpc ==1.15.0
opentelemetry-exporter-otlp-proto-http ==1.15.0
opentelemetry-instrumentation ==0.36b0
opentelemetry-instrumentation-grpc ==0.36b0
opentelemetry-proto ==1.15.0
opentelemetry-sdk ==1.15.0
opentelemetry-semantic-conventions ==0.36b0
packaging ==24.0
pillow ==10.3.0
protobuf ==4.25.3
pyyaml ==6.0.1
regex ==2023.12.25
requests ==2.31.0
safetensors ==0.4.2
scipy ==1.13.0
sentencepiece ==0.1.99
setuptools ==69.2.0
tokenizers ==0.15.2
tqdm ==4.66.2
transformers ==4.39.3
typer ==0.6.1
typing-extensions ==4.11.0
urllib3 ==2.2.1
win32-setctime ==1.1.0
wrapt ==1.16.0

server/requirements_rocm.txt pypi

backoff ==2.2.1
certifi ==2024.2.2
charset-normalizer ==3.3.2
click ==8.1.7
colorama ==0.4.6
deprecated ==1.2.14
einops ==0.6.1
filelock ==3.13.3
fsspec ==2024.2.0
googleapis-common-protos ==1.63.0
grpc-interceptor ==0.15.4
grpcio ==1.62.1
grpcio-reflection ==1.62.1
grpcio-status ==1.62.1
hf-transfer ==0.1.6
huggingface-hub ==0.19.4
idna ==3.6
loguru ==0.6.0
numpy ==1.26.4
opentelemetry-api ==1.15.0
opentelemetry-exporter-otlp ==1.15.0
opentelemetry-exporter-otlp-proto-grpc ==1.15.0
opentelemetry-exporter-otlp-proto-http ==1.15.0
opentelemetry-instrumentation ==0.36b0
opentelemetry-instrumentation-grpc ==0.36b0
opentelemetry-proto ==1.15.0
opentelemetry-sdk ==1.15.0
opentelemetry-semantic-conventions ==0.36b0
packaging ==24.0
pillow ==10.3.0
protobuf ==4.25.3
pyyaml ==6.0.1
regex ==2023.12.25
requests ==2.31.0
safetensors ==0.4.2
scipy ==1.13.0
sentencepiece ==0.1.99
setuptools ==69.2.0
tokenizers ==0.15.2
tqdm ==4.66.2
transformers ==4.39.3
typer ==0.6.1
typing-extensions ==4.11.0
urllib3 ==2.2.1
win32-setctime ==1.1.0
wrapt ==1.16.0
