https://github.com/all-secure-src/tgi-2.0

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.2%) to scientific vocabulary
Last synced: 10 months ago · JSON representation

Repository

Basic Info
  • Host: GitHub
  • Owner: all-secure-src
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 1.91 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 2 years ago · Last pushed about 2 years ago
Metadata Files
Readme License

README.md

Making TGI deployment optimal

# Text Generation Inference

A Rust, Python and gRPC server for text generation inference. Used in production at [HuggingFace](https://huggingface.co) to power Hugging Chat, the Inference API and Inference Endpoint.

Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more. TGI implements many features, such as:

  • Simple launcher to serve most popular LLMs
  • Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
  • Tensor Parallelism for faster inference on multiple GPUs
  • Token streaming using Server-Sent Events (SSE)
  • Continuous batching of incoming requests for increased total throughput
  • Optimized transformers code for inference using Flash Attention and Paged Attention on the most popular architectures
  • Quantization (for example with bitsandbytes)
  • Safetensors weight loading
  • Watermarking, following the paper "A Watermark for Large Language Models"
  • Logits warper (temperature scaling, top-p, top-k, repetition penalty; for more details, see transformers.LogitsProcessor)
  • Stop sequences
  • Log probabilities
  • Speculation, for roughly 2x lower latency
  • Guidance/JSON: specify the output format to speed up inference and make sure the output is valid according to a given specification
  • Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output
  • Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance
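To make the logits warper bullet concrete, here is a minimal pure-Python sketch of temperature scaling followed by top-k and top-p filtering. It is an illustration only and not TGI's implementation: the function name `warp_logits` and the dict-based representation are assumptions made for this example, and repetition penalty is left out.

```python
import math


def _normalise(pairs):
    """Rescale (token, weight) pairs so the weights sum to 1."""
    total = sum(weight for _, weight in pairs)
    return [(tok, weight / total) for tok, weight in pairs]


def warp_logits(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return a token -> probability dict after temperature, top-k and top-p."""
    # Temperature scaling: values below 1 sharpen the distribution,
    # values above 1 flatten it.
    scaled = {tok: value / temperature for tok, value in logits.items()}

    # Numerically stable softmax, sorted from most to least likely.
    peak = max(scaled.values())
    ranked = sorted(
        ((tok, math.exp(value - peak)) for tok, value in scaled.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    ranked = _normalise(ranked)

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = _normalise(ranked[:top_k])

    # Top-p: keep the smallest prefix whose cumulative probability
    # reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    return dict(_normalise(kept))
```

Calling `warp_logits({"a": 2.0, "b": 1.0, "c": 0.0}, top_k=2)` drops the least likely token and renormalises the remaining two.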

Hardware support

Get Started

Docker

For a detailed starting guide, please see the Quick Tour. The easiest way of getting started is using the official Docker container:

```shell
model=HuggingFaceH4/zephyr-7b-beta
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model
```

And then you can make requests like

```bash
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
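The same request can be sent from Python using only the standard library. The sketch below assumes the server is listening on 127.0.0.1:8080 as in the Docker command above, and that each Server-Sent Event arrives as a single `data:` line carrying a JSON object. The helper names (`build_request`, `parse_sse_line`, `stream_events`) are made up for this example and are not part of any TGI client.

```python
import json
import urllib.request


def build_request(prompt, max_new_tokens=20, base_url="http://127.0.0.1:8080"):
    """Build the POST request for the /generate_stream route."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        f"{base_url}/generate_stream",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_sse_line(raw_line):
    """Decode one line of the stream; return its JSON payload or None."""
    line = raw_line.decode("utf-8").strip()
    if not line.startswith("data:"):
        return None  # blank separator lines and other SSE fields are skipped
    return json.loads(line[len("data:"):])


def stream_events(prompt, **kwargs):
    """Yield decoded events one by one as the server produces them."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as response:
        for raw_line in response:
            event = parse_sse_line(raw_line)
            if event is not None:
                yield event
```

With a server running, `for event in stream_events("What is Deep Learning?"): print(event)` prints each event as it arrives. The exact fields of each event are not shown here; consult the API documentation for the response schema.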

Note: To use NVIDIA GPUs, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. To run the Docker container on a machine with no GPUs or CUDA support, remove the --gpus all flag and add --disable-custom-kernels. Note that CPU is not the intended platform for this project, so performance might be subpar.
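The GPU and CPU invocations differ only in the two flags named in the note. A small sketch that assembles either variant as an argument list may make the difference easier to see; the helper name `docker_run_command` and its parameters are illustrative assumptions, while the flags, port mapping and image name are the ones used above.

```python
import shlex


def docker_run_command(model, volume, use_gpu=True, image_tag="2.0"):
    """Assemble the `docker run` argument list for a GPU or CPU-only host."""
    command = ["docker", "run"]
    if use_gpu:
        command += ["--gpus", "all"]
    command += [
        "--shm-size", "1g",
        "-p", "8080:80",
        "-v", f"{volume}:/data",
        f"ghcr.io/huggingface/text-generation-inference:{image_tag}",
        "--model-id", model,
    ]
    if not use_gpu:
        # Launcher flag for machines without GPUs or CUDA support.
        command.append("--disable-custom-kernels")
    return command


cpu_command = docker_run_command("HuggingFaceH4/zephyr-7b-beta", "./data", use_gpu=False)
print(shlex.join(cpu_command))
```

The list form can be passed directly to `subprocess.run`, which avoids shell quoting issues with the volume path.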

Note: TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the Supported Hardware documentation. To use AMD GPUs, please use docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0-rocm --model-id $model instead of the command above.

To see all options to serve your models (in the code or in the cli): text-generation-launcher --help

API documentation

You can consult the OpenAPI documentation of the text-generation-inference REST API using the /docs route. The Swagger UI is also available at: https://huggingface.github.io/text-generation-inference.

Using a private or gated model

Set the HUGGING_FACE_HUB_TOKEN environment variable to configure the token that text-generation-inference uses. This gives it access to protected resources.

For example, if you want to serve the gated Llama V2 model variants:

  1. Go to https://huggingface.co/settings/tokens
  2. Copy your cli READ token
  3. Export HUGGING_FACE_HUB_TOKEN=<your cli READ token>

or with Docker:

```shell
model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>

docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model
```

A note on Shared Memory (shm)

NCCL is a communication framework used by PyTorch for distributed training and inference. text-generation-inference makes use of NCCL to enable Tensor Parallelism, which dramatically speeds up inference for large language models.

To share data between the different devices of an NCCL group, NCCL might fall back to using host memory if peer-to-peer communication over NVLink or PCI is not possible.

To allow the container to use 1G of Shared Memory and support SHM sharing, we add --shm-size 1g to the above command.

If you are running text-generation-inference inside Kubernetes, you can also add Shared Memory to the container by creating a volume with:

```yaml
- name: shm
  emptyDir:
    medium: Memory
    sizeLimit: 1Gi
```

and mounting it to /dev/shm.

Finally, you can also disable SHM sharing by using the NCCL_SHM_DISABLE=1 environment variable. However, note that this will impact performance.
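For the Kubernetes case, the volume and its mount have to refer to each other by name. The sketch below builds the relevant fragment of a pod spec as a Python dict. It is deliberately incomplete (no image, no resources), and the function name `shm_pod_fragment` and the container name `tgi` are placeholders chosen for this example.

```python
import json


def shm_pod_fragment(container_name, size_limit="1Gi"):
    """Return the pod-spec fragment that backs /dev/shm with memory."""
    volume = {
        "name": "shm",
        "emptyDir": {"medium": "Memory", "sizeLimit": size_limit},
    }
    # The mount refers to the volume by name and places it at /dev/shm.
    mount = {"name": "shm", "mountPath": "/dev/shm"}
    return {
        "spec": {
            "containers": [{"name": container_name, "volumeMounts": [mount]}],
            "volumes": [volume],
        }
    }


# "tgi" is a hypothetical container name.
print(json.dumps(shm_pod_fragment("tgi"), indent=2))
```

The fragment can be merged into an existing Deployment or Pod manifest; only the `volumes` entry and the matching `volumeMounts` entry are needed.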

Distributed Tracing

text-generation-inference is instrumented with distributed tracing using OpenTelemetry. You can use this feature by setting the address to an OTLP collector with the --otlp-endpoint argument.

Architecture

TGI architecture

Local install

You can also opt to install text-generation-inference locally.

First install Rust and create a Python virtual environment with at least Python 3.9, e.g. using conda:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

conda create -n text-generation-inference python=3.11
conda activate text-generation-inference
```

You may also need to install Protoc.

On Linux:

```shell
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
```

On MacOS, using Homebrew:

```shell
brew install protobuf
```

Then run:

```shell
BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
```

Note: on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

```shell
sudo apt-get install libssl-dev gcc -y
```

Optimized architectures

TGI works out of the box and serves optimized implementations of all modern models. They can be found in this list.

Other architectures are supported on a best-effort basis using:

AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")

or

AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")

Run locally

Run

```shell
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
```

Quantization

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

```shell
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes
```

4-bit quantization is available using the NF4 and FP4 data types from bitsandbytes. Enable it by passing --quantize bitsandbytes-nf4 or --quantize bitsandbytes-fp4 as a command line argument to text-generation-launcher.
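As a rough back-of-the-envelope check of why quantization reduces the VRAM requirement, the sketch below estimates the memory taken by the weights alone at different bit widths. It assumes a round 7 billion parameters for a 7B model and ignores the KV cache, activations and any per-block quantization metadata, so real usage will be higher. The function name `weight_memory_gib` is made up for this example.

```python
def weight_memory_gib(num_parameters, bits_per_weight):
    """Approximate memory used by the weights alone, in GiB."""
    return num_parameters * bits_per_weight / 8 / 1024**3


# Assumption: a "7B" model is treated as exactly 7e9 parameters.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(7e9, bits):.2f} GiB")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts the weight footprint by a factor of four, from about 13 GiB to a little over 3 GiB.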

Develop

```shell
make server-dev
make router-dev
```

Testing

```shell
# python
make python-server-tests
make python-client-tests
# or both server and client tests
make python-tests
# rust cargo tests
make rust-tests
# integration tests
make integration-tests
```

Owner

  • Name: all-secure-src
  • Login: all-secure-src
  • Kind: organization

Dependencies

.github/workflows/autodocs.yml actions
  • actions/checkout v2 composite
.github/workflows/build.yaml actions
  • actions/checkout v2 composite
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • aws-actions/configure-aws-credentials v1 composite
  • docker/build-push-action v4 composite
  • docker/login-action v2.1.0 composite
  • docker/login-action v2 composite
  • docker/metadata-action v4.3.0 composite
  • docker/setup-buildx-action v2.0.0 composite
  • philschmid/philschmid-ec2-github-runner main composite
  • rlespinasse/github-slug-action v4.4.1 composite
  • tailscale/github-action 7bd8039bf25c23c4ab1b8d6e2cc2da2280601966 composite
.github/workflows/build_documentation.yml actions
.github/workflows/build_pr_documentation.yml actions
.github/workflows/client-tests.yaml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
.github/workflows/load_test.yaml actions
  • actions/checkout v3 composite
  • aws-actions/configure-aws-credentials v1 composite
  • philschmid/philschmid-ec2-github-runner main composite
.github/workflows/stale.yml actions
  • actions/stale v8 composite
.github/workflows/tests.yaml actions
  • actions-rs/toolchain v1 composite
  • actions/cache v3 composite
  • actions/checkout v2 composite
  • actions/github-script v6 composite
  • actions/setup-python v1 composite
  • arduino/setup-protoc v1 composite
.github/workflows/upload_pr_documentation.yml actions
Cargo.lock cargo
  • 459 dependencies
Cargo.toml cargo
benchmark/Cargo.toml cargo
launcher/Cargo.toml cargo
  • float_eq 1.0.1 development
  • reqwest 0.11.20 development
  • clap 4.4.5
  • ctrlc 3.4.1
  • hf-hub 0.3.2
  • nix 0.28.0
  • once_cell 1.19.0
  • serde 1.0.188
  • serde_json 1.0.107
  • tracing 0.1.37
  • tracing-subscriber 0.3.17
router/Cargo.toml cargo
router/client/Cargo.toml cargo
router/grpc-metadata/Cargo.toml cargo
Dockerfile docker
  • base latest build
  • chef latest build
  • kernel-builder latest build
  • lukemathwalker/cargo-chef latest-rust-1.75 build
  • nvidia/cuda 12.1.0-devel-ubuntu22.04 build
  • nvidia/cuda 12.1.0-base-ubuntu22.04 build
  • pytorch-install latest build
clients/python/poetry.lock pypi
  • aiohttp 3.8.5
  • aiosignal 1.3.1
  • annotated-types 0.5.0
  • async-timeout 4.0.3
  • asynctest 0.13.0
  • atomicwrites 1.4.1
  • attrs 23.1.0
  • certifi 2023.7.22
  • charset-normalizer 3.2.0
  • colorama 0.4.6
  • coverage 7.2.7
  • filelock 3.12.2
  • frozenlist 1.3.3
  • fsspec 2023.1.0
  • huggingface-hub 0.16.4
  • idna 3.4
  • importlib-metadata 6.7.0
  • iniconfig 2.0.0
  • multidict 6.0.4
  • packaging 23.1
  • pluggy 1.2.0
  • py 1.11.0
  • pydantic 2.5.3
  • pydantic-core 2.14.6
  • pytest 6.2.5
  • pytest-asyncio 0.17.2
  • pytest-cov 3.0.0
  • pyyaml 6.0.1
  • requests 2.31.0
  • toml 0.10.2
  • tomli 2.0.1
  • tqdm 4.66.1
  • typing-extensions 4.7.1
  • urllib3 2.0.5
  • yarl 1.9.2
  • zipp 3.15.0
clients/python/pyproject.toml pypi
  • pytest ^6.2.5 develop
  • pytest-asyncio ^0.17.2 develop
  • pytest-cov ^3.0.0 develop
  • aiohttp ^3.8
  • huggingface-hub >= 0.12, < 1.0
  • pydantic > 2, < 3
  • python ^3.7
integration-tests/poetry.lock pypi
  • aiohttp 3.8.5
  • aiosignal 1.3.1
  • annotated-types 0.6.0
  • async-timeout 4.0.3
  • attrs 23.1.0
  • certifi 2023.7.22
  • charset-normalizer 3.2.0
  • colorama 0.4.6
  • colored 1.4.4
  • docker 6.1.3
  • exceptiongroup 1.1.3
  • filelock 3.12.3
  • frozenlist 1.4.0
  • fsspec 2023.6.0
  • huggingface-hub 0.16.4
  • idna 3.4
  • iniconfig 2.0.0
  • multidict 6.0.4
  • packaging 23.1
  • pluggy 1.3.0
  • pydantic 2.6.4
  • pydantic-core 2.16.3
  • pytest 7.4.0
  • pytest-asyncio 0.21.1
  • pywin32 306
  • pyyaml 6.0.1
  • requests 2.31.0
  • syrupy 4.0.1
  • text-generation 0.6.1
  • tomli 2.0.1
  • tqdm 4.66.1
  • typing-extensions 4.7.1
  • urllib3 2.0.4
  • websocket-client 1.6.2
  • yarl 1.9.2
integration-tests/pyproject.toml pypi
  • docker ^6.1.3
  • pydantic > 2, < 3
  • pytest ^7.4.0
  • pytest-asyncio ^0.21.1
  • python >=3.9,<3.13
  • syrupy 4.0.1
  • text-generation ^0.6.0
integration-tests/requirements.txt pypi
  • aiohttp ==3.8.5 test
  • aiosignal ==1.3.1 test
  • annotated-types ==0.6.0 test
  • async-timeout ==4.0.3 test
  • attrs ==23.1.0 test
  • certifi ==2023.7.22 test
  • charset-normalizer ==3.2.0 test
  • colorama ==0.4.6 test
  • colored ==1.4.4 test
  • docker ==6.1.3 test
  • exceptiongroup ==1.1.3 test
  • filelock ==3.12.3 test
  • frozenlist ==1.4.0 test
  • fsspec ==2023.6.0 test
  • huggingface-hub ==0.16.4 test
  • idna ==3.4 test
  • iniconfig ==2.0.0 test
  • multidict ==6.0.4 test
  • packaging ==23.1 test
  • pluggy ==1.3.0 test
  • pydantic ==2.6.4 test
  • pydantic-core ==2.16.3 test
  • pytest ==7.4.0 test
  • pytest-asyncio ==0.21.1 test
  • pywin32 ==306 test
  • pyyaml ==6.0.1 test
  • requests ==2.31.0 test
  • syrupy ==4.0.1 test
  • text-generation ==0.6.1 test
  • tomli ==2.0.1 test
  • tqdm ==4.66.1 test
  • typing-extensions ==4.7.1 test
  • urllib3 ==2.0.4 test
  • websocket-client ==1.6.2 test
  • yarl ==1.9.2 test
server/custom_kernels/setup.py pypi
server/exllama_kernels/setup.py pypi
server/exllamav2_kernels/setup.py pypi
server/poetry.lock pypi
  • 110 dependencies
server/pyproject.toml pypi
  • grpcio-tools ^1.51.1 develop
  • pytest ^7.3.0 develop
  • accelerate ^0.29.1
  • bitsandbytes ^0.43.0
  • datasets ^2.14.0
  • einops ^0.6.1
  • grpc-interceptor ^0.15.0
  • grpcio ^1.51.1
  • grpcio-reflection ^1.51.1
  • grpcio-status ^1.51.1
  • hf-transfer ^0.1.2
  • huggingface-hub ^0.19.3
  • loguru ^0.6.0
  • opentelemetry-api ^1.15.0
  • opentelemetry-exporter-otlp ^1.15.0
  • opentelemetry-instrumentation-grpc ^0.36b0
  • outlines ^0.0.36
  • peft ^0.10
  • pillow ^10.0.0
  • protobuf ^4.21.7
  • python >=3.9,<3.13
  • safetensors ^0.4
  • scipy ^1.11.1
  • sentencepiece ^0.1.97
  • texttable ^1.6.7
  • tokenizers ^0.15.0
  • torch ^2.1.1
  • transformers ^4.39
  • typer ^0.6.1
server/requirements_cuda.txt pypi
  • backoff ==2.2.1
  • certifi ==2024.2.2
  • charset-normalizer ==3.3.2
  • click ==8.1.7
  • colorama ==0.4.6
  • deprecated ==1.2.14
  • einops ==0.6.1
  • filelock ==3.13.3
  • fsspec ==2024.2.0
  • googleapis-common-protos ==1.63.0
  • grpc-interceptor ==0.15.4
  • grpcio ==1.62.1
  • grpcio-reflection ==1.62.1
  • grpcio-status ==1.62.1
  • hf-transfer ==0.1.6
  • huggingface-hub ==0.19.4
  • idna ==3.6
  • loguru ==0.6.0
  • numpy ==1.26.4
  • opentelemetry-api ==1.15.0
  • opentelemetry-exporter-otlp ==1.15.0
  • opentelemetry-exporter-otlp-proto-grpc ==1.15.0
  • opentelemetry-exporter-otlp-proto-http ==1.15.0
  • opentelemetry-instrumentation ==0.36b0
  • opentelemetry-instrumentation-grpc ==0.36b0
  • opentelemetry-proto ==1.15.0
  • opentelemetry-sdk ==1.15.0
  • opentelemetry-semantic-conventions ==0.36b0
  • packaging ==24.0
  • pillow ==10.3.0
  • protobuf ==4.25.3
  • pyyaml ==6.0.1
  • regex ==2023.12.25
  • requests ==2.31.0
  • safetensors ==0.4.2
  • scipy ==1.13.0
  • sentencepiece ==0.1.99
  • setuptools ==69.2.0
  • tokenizers ==0.15.2
  • tqdm ==4.66.2
  • transformers ==4.39.3
  • typer ==0.6.1
  • typing-extensions ==4.11.0
  • urllib3 ==2.2.1
  • win32-setctime ==1.1.0
  • wrapt ==1.16.0
server/requirements_rocm.txt pypi
  • backoff ==2.2.1
  • certifi ==2024.2.2
  • charset-normalizer ==3.3.2
  • click ==8.1.7
  • colorama ==0.4.6
  • deprecated ==1.2.14
  • einops ==0.6.1
  • filelock ==3.13.3
  • fsspec ==2024.2.0
  • googleapis-common-protos ==1.63.0
  • grpc-interceptor ==0.15.4
  • grpcio ==1.62.1
  • grpcio-reflection ==1.62.1
  • grpcio-status ==1.62.1
  • hf-transfer ==0.1.6
  • huggingface-hub ==0.19.4
  • idna ==3.6
  • loguru ==0.6.0
  • numpy ==1.26.4
  • opentelemetry-api ==1.15.0
  • opentelemetry-exporter-otlp ==1.15.0
  • opentelemetry-exporter-otlp-proto-grpc ==1.15.0
  • opentelemetry-exporter-otlp-proto-http ==1.15.0
  • opentelemetry-instrumentation ==0.36b0
  • opentelemetry-instrumentation-grpc ==0.36b0
  • opentelemetry-proto ==1.15.0
  • opentelemetry-sdk ==1.15.0
  • opentelemetry-semantic-conventions ==0.36b0
  • packaging ==24.0
  • pillow ==10.3.0
  • protobuf ==4.25.3
  • pyyaml ==6.0.1
  • regex ==2023.12.25
  • requests ==2.31.0
  • safetensors ==0.4.2
  • scipy ==1.13.0
  • sentencepiece ==0.1.99
  • setuptools ==69.2.0
  • tokenizers ==0.15.2
  • tqdm ==4.66.2
  • transformers ==4.39.3
  • typer ==0.6.1
  • typing-extensions ==4.11.0
  • urllib3 ==2.2.1
  • win32-setctime ==1.1.0
  • wrapt ==1.16.0