server - Release 2.60.0 corresponding to NGC container 25.08

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

* Added CUDA 13 support.

Known Issues

* The Triton ONNX Runtime backend build uses [microsoft/onnxruntime/commit/1d1712fdaf](https://github.com/microsoft/onnxruntime/commit/1d1712fdafb9e61b2d6d033c4433c1033395d7e7) and may have some limitations on DGX Spark hardware, which will be addressed in future versions.
* CuPy has issues with the CUDA 13 Device API in multithreaded contexts. Avoid using the tritonclient `cuda_shared_memory` APIs in multithreaded environments until this is fixed by CuPy.
* CuPy does not support CUDA 13 at the time of writing. Issues may be encountered when using CuPy before it officially supports CUDA 13; see https://github.com/triton-inference-server/server/tree/r25.08/python/openai#pre-requisites for more details.
* TensorRT calibration cache may require size adjustment in some cases, which was observed for the IGX platform.
* The core Python binding may incur an additional D2H and H2D copy if the backend and frontend both specify device memory to be used for response tensors.
* A segmentation fault related to DCGM and NSCQ may be encountered during server shutdown on NVSwitch systems. A possible workaround for this issue is to disable the collection of GPU metrics: `tritonserver --allow-gpu-metrics false ...`
* When using TensorRT models, if auto-complete configuration is disabled and `is_non_linear_format_io:true` for [reformat-free tensors](https://github.com/triton-inference-server/server/blob/r24.08/docs/user_guide/model_configuration.md#non-linear-io-formats) is not provided in the model configuration, the model may not load successfully.
* When using Python models in [decoupled mode](https://github.com/triton-inference-server/python_backend/tree/main?tab=readme-ov-file#decoupled-mode), users need to ensure that the `ResponseSender` goes out of scope or is properly cleaned up before unloading the model to guarantee that the unloading process executes correctly.
* Restart support was temporarily removed for Python models.
* Triton Inference Server with the vLLM backend currently does not support running vLLM models with tensor parallelism sizes greater than 1 and the default "distributed_executor_backend" setting when using explicit model control mode. When attempting to load a vLLM model (tp > 1) in explicit mode, users could potentially see a failure at the `initialize` step: `could not acquire lock for <_io.BufferedWriter name=''> at interpreter shutdown, possibly due to daemon threads`. For the default model control mode, vLLM-related sub-processes are not killed after server shutdown. Related vLLM issue: [vllm-project/vllm#6766](https://github.com/vllm-project/vllm/issues/6766). Please specify "distributed_executor_backend":"ray" in the `model.json` when deploying vLLM models with tensor parallelism > 1.
* When loading models with file override, multiple model configuration files are not supported. Users must provide the model configuration by setting the parameter `"config" : ""` instead of a custom configuration file in the following format: `"file:configs/.pbtxt" : ""`.
* The TensorRT-LLM [backend](https://github.com/triton-inference-server/tensorrtllm_backend) provides limited support of Triton extensions and features.
* The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
* The Java CAPI is known to have intermittent segfaults.
* Some systems which implement `malloc()` may not release memory back to the operating system right away, causing a false memory leak. This can be mitigated by using a different malloc implementation. `TCMalloc` and `jemalloc` are installed in the Triton container and can be [used by specifying the library in LD_PRELOAD](https://github.com/triton-inference-server/server/blob/r25.01/docs/user_guide/model_management.md). NVIDIA recommends experimenting with both `tcmalloc` and `jemalloc` to determine which one works better for your use case.
* Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with `--disable-auto-complete-config`.
* Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names match what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: [https://github.com/pytorch/pytorch/issues/38273](https://github.com/pytorch/pytorch/issues/38273)
* Triton Client PIP wheels for Arm SBSA are not available from PyPI, and pip will install an incorrect Jetson version of the Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
* Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to [pytorch/pytorch#66930](https://github.com/pytorch/pytorch/issues/66930) for more information.
* Triton cannot retrieve GPU metrics with [MIG-enabled GPU devices](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-gpus).
* Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
* When cloud storage (AWS, GCS, Azure) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model's folder in the temporary directory, which is deleted upon server shutdown.
* Python backend support for Windows is limited and does not currently support the following features:
  - GPU tensors
  - CPU and GPU-related metrics
  - Custom execution environments
  - The model load/unload APIs
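The vLLM tensor-parallelism workaround in the list above comes down to one extra key in `model.json`. The following is a minimal sketch, assuming the vLLM backend's usual layout in which `model.json` holds vLLM engine arguments such as `model` and `tensor_parallel_size`. The model name and the `build_vllm_model_json` helper are hypothetical and only illustrate the shape of the file.

```python
import json


def build_vllm_model_json(model: str, tensor_parallel_size: int) -> dict:
    """Return engine arguments for a vLLM backend model.json (illustrative helper)."""
    engine_args = {
        "model": model,  # hypothetical model identifier
        "tensor_parallel_size": tensor_parallel_size,
    }
    # The known issue asks for the Ray executor whenever tensor parallelism > 1.
    if tensor_parallel_size > 1:
        engine_args["distributed_executor_backend"] = "ray"
    return engine_args


if __name__ == "__main__":
    # Print the JSON that would be saved as model.json in the model version directory.
    print(json.dumps(build_vllm_model_json("facebook/opt-125m", 2), indent=2))
```

With a tensor parallelism size of 1 the helper leaves the executor setting out, so the backend default still applies.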

Client Libraries and Examples

Ubuntu 24.04 builds of the client libraries and examples are included in this release in the attached `v2.60.0_ubuntu2404.clients.tar.gz` file. The SDK is also available as an Ubuntu 24.04 based [NGC Container](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver/tags). The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See [Getting the Client Libraries](https://github.com/triton-inference-server/client/tree/r25.08#getting-the-client-libraries-and-examples) for more information on each of these options.

Windows Support

> [!NOTE]
> There is no Windows release for 25.08; the latest release is [25.01](https://github.com/triton-inference-server/server/releases/tag/v2.54.0).

Jetson iGPU Support

A release of Triton for [IGX](https://www.nvidia.com/en-us/edge-computing/products/igx/) is provided in the attached tar file: [`tritonserver2.60.0-igpu.tar`](https://github.com/triton-inference-server/server/releases/download/v2.60.0/tritonserver2.60.0-igpu.tar).

* This release supports **CUDA** `12.9`, **TensorRT** `10.11.0.33`, **ONNX Runtime** `1.23.0+1d1712fdaf`, **PyTorch** [`2.8.0a+34c6371d24`](https://docs.nvidia.com/deeplearning/frameworks/install-pytorch-jetson-platform-release-notes/pytorch-jetson-rel.html), **Python** `3.12`, as well as _ensembles_.
* ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
* System shared memory is supported on Jetson. CUDA shared memory is not supported.
* GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries, as well as the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack, refer to [`jetson.md`](https://github.com/triton-inference-server/server/blob/r25.07/docs/user_guide/jetson.md).

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

```
python3 -m pip install --upgrade clients/python/tritonclient-2.59.0-py3-none-manylinux2014_aarch64.whl[all]
```

Jetson AGX Systems Support

A release of Triton for [AGX Systems](https://www.nvidia.com/en-us/deep-learning-ai/products/agx-systems) is provided in the attached tar file: [`tritonserver2.60.0-agx.tar`](https://github.com/triton-inference-server/server/releases/download/v2.60.0/tritonserver2.60.0-agx.tar).

> [!NOTE]
> The Jetson AGX release for 25.08 requires DCGM version 4 to be installed in order to use GPU metrics.
> Please use the following command to install DCGM 4:
> ```
> curl -o /tmp/cuda-keyring.deb \
>     https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb \
>     && apt install /tmp/cuda-keyring.deb \
>     && rm /tmp/cuda-keyring.deb \
>     && apt update \
>     && apt install --yes --no-install-recommends \
>         datacenter-gpu-manager-4-core=1:4.4.0-1
> ```

* This release supports **CUDA** `13.0`, **TensorRT** `10.13.2.6`, **ONNX Runtime** `1.23.0+1d1712fdaf`, **PyTorch** [`2.8.0a0+34c6371`](https://docs.nvidia.com/deeplearning/frameworks/install-pytorch-jetson-platform-release-notes/pytorch-jetson-rel.html), **Python** `3.12`, as well as _ensembles_.
* This package is a subset of the `nvcr.io/nvidia/tritonserver:25.08-py3` ARM container image assets.

Triton TRT-LLM Container Support Matrix

The Triton TensorRT-LLM container is built from the 25.04 image [`nvcr.io/nvidia/tritonserver:25.06-py3-min`](http://nvcr.io/nvidia/tritonserver:25.06-py3-min). Please refer to the [support matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html) and [compatibility.md](https://github.com/triton-inference-server/server/blob/v2.60.0/docs/introduction/compatibility.md#container-name-trtllm-python-py3) for all dependency versions related to 25.04. However, the packages listed below have different versions than those specified in the support matrix.

| Dependency | Version |
| :------------: | :---------------: |
| TensorRT-LLM | 0.21.0 |
| TensorRT | 10.11.0.33 |

Published by mc-nv 11 months ago

server - Release 2.59.1 corresponding to NGC container 25.07

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

* Fixed vulnerabilities in the Triton Inference Server.

Known Issues

* No Python wheel packages were released as part of the 25.07 release.
* TensorRT calibration cache may require size adjustment in some cases, which was observed for the IGX platform.
* The core Python binding may incur an additional D2H and H2D copy if the backend and frontend both specify device memory to be used for response tensors.
* A segmentation fault related to DCGM and NSCQ may be encountered during server shutdown on NVSwitch systems. A possible workaround for this issue is to disable the collection of GPU metrics: `tritonserver --allow-gpu-metrics false ...`
* The vLLM backend currently does not take advantage of the [vLLM v0.6](https://blog.vllm.ai/2024/09/05/perf-update.html) performance improvement when metrics are enabled.
* When using TensorRT models, if auto-complete configuration is disabled and `is_non_linear_format_io:true` for [reformat-free tensors](https://github.com/triton-inference-server/server/blob/r24.08/docs/user_guide/model_configuration.md#non-linear-io-formats) is not provided in the model configuration, the model may not load successfully.
* When using Python models in [decoupled mode](https://github.com/triton-inference-server/python_backend/tree/main?tab=readme-ov-file#decoupled-mode), users need to ensure that the `ResponseSender` goes out of scope or is properly cleaned up before unloading the model to guarantee that the unloading process executes correctly.
* Restart support was temporarily removed for Python models.
* Triton Inference Server with the vLLM backend currently does not support running vLLM models with tensor parallelism sizes greater than 1 and the default "distributed_executor_backend" setting when using explicit model control mode. When attempting to load a vLLM model (tp > 1) in explicit mode, users could potentially see a failure at the `initialize` step: `could not acquire lock for <_io.BufferedWriter name=''> at interpreter shutdown, possibly due to daemon threads`. For the default model control mode, vLLM-related sub-processes are not killed after server shutdown. Related vLLM issue: [vllm-project/vllm#6766](https://github.com/vllm-project/vllm/issues/6766). Please specify "distributed_executor_backend":"ray" in the `model.json` when deploying vLLM models with tensor parallelism > 1.
* When loading models with file override, multiple model configuration files are not supported. Users must provide the model configuration by setting the parameter `"config" : ""` instead of a custom configuration file in the following format: `"file:configs/.pbtxt" : ""`.
* The TensorRT-LLM [backend](https://github.com/triton-inference-server/tensorrtllm_backend) provides limited support of Triton extensions and features.
* The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
* The Java CAPI is known to have intermittent segfaults.
* Some systems which implement `malloc()` may not release memory back to the operating system right away, causing a false memory leak. This can be mitigated by using a different malloc implementation. `TCMalloc` and `jemalloc` are installed in the Triton container and can be [used by specifying the library in LD_PRELOAD](https://github.com/triton-inference-server/server/blob/r25.01/docs/user_guide/model_management.md). NVIDIA recommends experimenting with both `tcmalloc` and `jemalloc` to determine which one works better for your use case.
* Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with `--disable-auto-complete-config`.
* Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names match what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: [https://github.com/pytorch/pytorch/issues/38273](https://github.com/pytorch/pytorch/issues/38273)
* Triton Client PIP wheels for Arm SBSA are not available from PyPI, and pip will install an incorrect Jetson version of the Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
* Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to [pytorch/pytorch#66930](https://github.com/pytorch/pytorch/issues/66930) for more information.
* Triton cannot retrieve GPU metrics with [MIG-enabled GPU devices](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-gpus).
* Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
* When cloud storage (AWS, GCS, Azure) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model's folder in the temporary directory, which is deleted upon server shutdown.
* Python backend support for Windows is limited and does not currently support the following features:
  - GPU tensors
  - CPU and GPU-related metrics
  - Custom execution environments
  - The model load/unload APIs
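The file-override restriction in the list above is easier to see in the shape of a load request body. The following is a sketch, assuming the model repository extension's `parameters` object, where `config` carries the model configuration as a JSON string and each `file:`-prefixed key carries base64-encoded file content. The model fields, the file path, and the `build_load_request` helper are hypothetical.

```python
import base64
import json
from typing import Dict, Optional


def build_load_request(model_config: dict,
                       files: Optional[Dict[str, bytes]] = None) -> dict:
    """Build a load request body that always passes the config via "config"."""
    # The configuration travels as a JSON string, not as an overridden config file.
    parameters = {"config": json.dumps(model_config)}
    for relative_path, content in (files or {}).items():
        # File overrides are sent base64-encoded under a "file:" key.
        parameters[f"file:{relative_path}"] = base64.b64encode(content).decode("ascii")
    return {"parameters": parameters}


if __name__ == "__main__":
    body = build_load_request(
        {"name": "my_model", "backend": "onnxruntime"},  # hypothetical config
        {"1/model.onnx": b"placeholder bytes"},          # hypothetical model file
    )
    print(sorted(body["parameters"]))
```

The resulting dictionary would be sent as the JSON body of a `POST v2/repository/models/<model name>/load` request.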

Client Libraries and Examples

Ubuntu 24.04 builds of the client libraries and examples are included in this release in the attached `v2.59.1_ubuntu2404.clients.tar.gz` file. The SDK is also available as an Ubuntu 24.04 based [NGC Container](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver/tags). The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See [Getting the Client Libraries](https://github.com/triton-inference-server/client/tree/r25.05#getting-the-client-libraries-and-examples) for more information on each of these options.

Windows Support

> [!NOTE]
> There is no Windows release for 25.07; the latest release is [25.01](https://github.com/triton-inference-server/server/releases/tag/v2.54.0).

Jetson iGPU Support

A release of Triton for [IGX](https://www.nvidia.com/en-us/edge-computing/products/igx/) is provided in the attached tar file: [`tritonserver2.59.1-igpu.tar`](https://github.com/triton-inference-server/server/releases/download/v2.59.1/tritonserver2.59.1-igpu.tar).

* This release supports **TensorRT** `10.11.0.33`, **ONNX Runtime** `1.22.0`, **PyTorch** [`2.8.0a0+5228986c39.nv25.6`](https://docs.nvidia.com/deeplearning/frameworks/install-pytorch-jetson-platform-release-notes/pytorch-jetson-rel.html), **Python** `3.12`, as well as _ensembles_.
* ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
* System shared memory is supported on Jetson. CUDA shared memory is not supported.
* GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries, as well as the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack, refer to [`jetson.md`](https://github.com/triton-inference-server/server/blob/r25.07/docs/user_guide/jetson.md).

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

```
python3 -m pip install --upgrade clients/python/tritonclient-2.59.0-py3-none-manylinux2014_aarch64.whl[all]
```

Triton TRT-LLM Container Support Matrix

The Triton TensorRT-LLM container is built from the 25.04 image [`nvcr.io/nvidia/tritonserver:25.04-py3-min`](http://nvcr.io/nvidia/tritonserver:25.04-py3-min). Please refer to the [support matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html) and [compatibility.md](https://github.com/triton-inference-server/server/blob/v2.59.1/docs/introduction/compatibility.md#container-name-trtllm-python-py3) for all dependency versions related to 25.04. However, the packages listed below have different versions than those specified in the support matrix.

| Dependency | Version |
| :------------: | :---------------: |
| TensorRT-LLM | 0.20.0 |
| TensorRT | 10.10.0.31 |

Published by mc-nv 12 months ago

server - Release 2.59.0 corresponding to NGC container 25.06

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

* Improved ensemble model performance in scenarios that allow out-of-order responses by increasing maximum throughput and reducing latency.

Known Issues

* TensorRT calibration cache may require size adjustment in some cases, which was observed for the IGX platform.
* The core Python binding may incur an additional D2H and H2D copy if the backend and frontend both specify device memory to be used for response tensors.
* A segmentation fault related to DCGM and NSCQ may be encountered during server shutdown on NVSwitch systems. A possible workaround for this issue is to disable the collection of GPU metrics: `tritonserver --allow-gpu-metrics false ...`
* The vLLM backend currently does not take advantage of the [vLLM v0.6](https://blog.vllm.ai/2024/09/05/perf-update.html) performance improvement when metrics are enabled.
* When using TensorRT models, if auto-complete configuration is disabled and `is_non_linear_format_io:true` for [reformat-free tensors](https://github.com/triton-inference-server/server/blob/r24.08/docs/user_guide/model_configuration.md#non-linear-io-formats) is not provided in the model configuration, the model may not load successfully.
* When using Python models in [decoupled mode](https://github.com/triton-inference-server/python_backend/tree/main?tab=readme-ov-file#decoupled-mode), users need to ensure that the `ResponseSender` goes out of scope or is properly cleaned up before unloading the model to guarantee that the unloading process executes correctly.
* Restart support was temporarily removed for Python models.
* Triton Inference Server with the vLLM backend currently does not support running vLLM models with tensor parallelism sizes greater than 1 and the default "distributed_executor_backend" setting when using explicit model control mode. When attempting to load a vLLM model (tp > 1) in explicit mode, users could potentially see a failure at the `initialize` step: `could not acquire lock for <_io.BufferedWriter name=''> at interpreter shutdown, possibly due to daemon threads`. For the default model control mode, vLLM-related sub-processes are not killed after server shutdown. Related vLLM issue: [vllm-project/vllm#6766](https://github.com/vllm-project/vllm/issues/6766). Please specify "distributed_executor_backend":"ray" in the `model.json` when deploying vLLM models with tensor parallelism > 1.
* When loading models with file override, multiple model configuration files are not supported. Users must provide the model configuration by setting the parameter `"config" : ""` instead of a custom configuration file in the following format: `"file:configs/.pbtxt" : ""`.
* The TensorRT-LLM [backend](https://github.com/triton-inference-server/tensorrtllm_backend) provides limited support of Triton extensions and features.
* The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
* The Java CAPI is known to have intermittent segfaults.
* Some systems which implement `malloc()` may not release memory back to the operating system right away, causing a false memory leak. This can be mitigated by using a different malloc implementation. `TCMalloc` and `jemalloc` are installed in the Triton container and can be [used by specifying the library in LD_PRELOAD](https://github.com/triton-inference-server/server/blob/r25.01/docs/user_guide/model_management.md). NVIDIA recommends experimenting with both `tcmalloc` and `jemalloc` to determine which one works better for your use case.
* Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with `--disable-auto-complete-config`.
* Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names match what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: [https://github.com/pytorch/pytorch/issues/38273](https://github.com/pytorch/pytorch/issues/38273)
* Triton Client PIP wheels for Arm SBSA are not available from PyPI, and pip will install an incorrect Jetson version of the Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
* Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to [pytorch/pytorch#66930](https://github.com/pytorch/pytorch/issues/66930) for more information.
* Triton cannot retrieve GPU metrics with [MIG-enabled GPU devices](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-gpus).
* Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
* When cloud storage (AWS, GCS, Azure) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model's folder in the temporary directory, which is deleted upon server shutdown.
* Python backend support for Windows is limited and does not currently support the following features:
  - GPU tensors
  - CPU and GPU-related metrics
  - Custom execution environments
  - The model load/unload APIs
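The `malloc()` mitigation in the list above relies on `LD_PRELOAD` being set in the environment of the server process. The following is a minimal sketch of preparing such an environment. The library path is an assumption and should be checked inside the container actually in use, and the `with_preload` helper is hypothetical.

```python
import os
from typing import Mapping, Optional


def with_preload(lib_path: str,
                 base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with lib_path prepended to LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # Keep any library that was already preloaded, but put the allocator first.
    env["LD_PRELOAD"] = f"{lib_path}:{existing}" if existing else lib_path
    return env


if __name__ == "__main__":
    # Assumed location of TCMalloc; verify the real path in your container.
    tcmalloc = "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"
    print(with_preload(tcmalloc, base_env={})["LD_PRELOAD"])
```

The returned dictionary can be passed as `env=` to `subprocess.run(["tritonserver", "--model-repository=/models"], env=...)`, where `/models` is a placeholder repository path. Swapping in the jemalloc library path allows the two allocators to be compared under the same workload.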

Client Libraries and Examples

Ubuntu 24.04 builds of the client libraries and examples are included in this release in the attached `v2.59.0_ubuntu2404.clients.tar.gz` file. The SDK is also available as an Ubuntu 24.04 based [NGC Container](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver/tags). The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See [Getting the Client Libraries](https://github.com/triton-inference-server/client/tree/r25.05#getting-the-client-libraries-and-examples) for more information on each of these options.

Windows Support

> [!NOTE]
> There is no Windows release for 25.06; the latest release is [25.01](https://github.com/triton-inference-server/server/releases/tag/v2.54.0).

Jetson iGPU Support

A release of Triton for [IGX](https://www.nvidia.com/en-us/edge-computing/products/igx/) is provided in the attached tar file: [`tritonserver2.59.0-igpu.tar`](https://github.com/triton-inference-server/server/releases/download/v2.59.0/tritonserver2.59.0-igpu.tar).

* This release supports **TensorRT** `10.11.0.33`, **ONNX Runtime** `1.22.0`, **PyTorch** [`2.8.0a0+5228986c39.nv25.6`](https://docs.nvidia.com/deeplearning/frameworks/install-pytorch-jetson-platform-release-notes/pytorch-jetson-rel.html), **Python** `3.12`, as well as _ensembles_.
* ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
* System shared memory is supported on Jetson. CUDA shared memory is not supported.
* GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries, as well as the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack, refer to [`jetson.md`](https://github.com/triton-inference-server/server/blob/r25.06/docs/user_guide/jetson.md).

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

```
python3 -m pip install --upgrade clients/python/tritonclient-2.59.0-py3-none-manylinux2014_aarch64.whl[all]
```

Triton TRT-LLM Container Support Matrix

The Triton TensorRT-LLM container is built from the 25.04 image [`nvcr.io/nvidia/tritonserver:25.04-py3-min`](http://nvcr.io/nvidia/tritonserver:25.04-py3-min). Please refer to the [support matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html) and [compatibility.md](https://github.com/triton-inference-server/server/blob/v2.59.0/docs/introduction/compatibility.md#container-name-trtllm-python-py3) for all dependency versions related to 25.04. However, the packages listed below have different versions than those specified in the support matrix.

| Dependency | Version |
| :------------: | :---------------: |
| TensorRT-LLM | 0.20.0 |
| TensorRT | 10.10.0.31 |

Published by mc-nv about 1 year ago

server - Release 2.58.0 corresponding to NGC container 25.05

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

- The optional `execution_context_allocation_strategy` parameter in the TensorRT backend configuration allows selection of memory allocation behavior.
- Support for tool calling functionality with Llama 3 and Mistral models in the OpenAI frontend.
- Improvements around memory allocation and various bug fixes.
- GenAI-Perf now offers a new configuration file alongside the command line.
- GenAI-Perf now collects GPU metrics from the /metrics endpoint exposed by DCGM Exporter.
- GenAI-Perf supports new Power, Utilization, ECC, Errors and PCIe metrics.

Known Issues

* vLLM backend for 25.05 might be unstable with the vLLM V1 architecture. We recommend switching to V0 for this release by setting the `VLLM_USE_V1` environment variable to `0`. However, users should be aware that vLLM's V0 API is affected by vulnerabilities.
* vLLM containers include vLLM version 0.8.4, which is affected by vulnerabilities. Until a fix is available, the workarounds are:
  - Do not expose the vLLM host to a network where any untrusted connections may reach the host.
  - Ensure that only the other vLLM hosts are able to connect to the TCP port used for the XPUB socket. Note that the port used is random.
* The core Python binding may incur an additional D2H and H2D copy if the backend and frontend both specify device memory to be used for response tensors.
* A segmentation fault related to DCGM and NSCQ may be encountered during server shutdown on NVSwitch systems. A possible workaround for this issue is to disable the collection of GPU metrics: `tritonserver --allow-gpu-metrics false ...`
* vLLM backend currently does not take advantage of the [vLLM v0.6](https://blog.vllm.ai/2024/09/05/perf-update.html) performance improvement when metrics are enabled.
* When using TensorRT models, if auto-complete configuration is disabled and `is_non_linear_format_io:true` for [reformat-free tensors](https://github.com/triton-inference-server/server/blob/r24.08/docs/user_guide/model_configuration.md#non-linear-io-formats) is not provided in the model configuration, the model may not load successfully.
* When using Python models in [decoupled mode](https://github.com/triton-inference-server/python_backend/tree/main?tab=readme-ov-file#decoupled-mode), users need to ensure that the `ResponseSender` goes out of scope or is properly cleaned up before unloading the model to guarantee that the unloading process executes correctly.
* Restart support was temporarily removed for Python models.
* Triton Inference Server with vLLM backend currently does not support running vLLM models with tensor parallelism sizes greater than 1 and the default "distributed_executor_backend" setting when using explicit model control mode. When attempting to load a vLLM model (tp > 1) in explicit mode, users may see a failure at the `initialize` step: `could not acquire lock for <_io.BufferedWriter name=''> at interpreter shutdown, possibly due to daemon threads`. For the default model control mode, vLLM-related sub-processes are not killed after server shutdown. Related vLLM issue: https://github.com/vllm-project/vllm/issues/6766 . Please specify "distributed_executor_backend":"ray" in the `model.json` when deploying vLLM models with tensor parallelism > 1.
* When loading models with file override, multiple model configuration files are not supported. Users must provide the model configuration by setting parameter `"config" : ""` instead of a custom configuration file in the following format: `"file:configs/.pbtxt" : ""`.
* TensorRT-LLM [backend](https://github.com/triton-inference-server/tensorrtllm_backend) provides limited support of Triton extensions and features.
* The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
* The Java CAPI is known to have intermittent segfaults.
* Some systems which implement `malloc()` may not release memory back to the operating system right away, causing a false memory leak. This can be mitigated by using a different malloc implementation. `TCMalloc` and `jemalloc` are installed in the Triton container and can be [used by specifying the library in LD_PRELOAD](https://github.com/triton-inference-server/server/blob/r25.01/docs/user_guide/model_management.md). NVIDIA recommends experimenting with both `tcmalloc` and `jemalloc` to determine which one works better for your use case.
* Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with `--disable-auto-complete-config`.
* Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names match what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: [https://github.com/pytorch/pytorch/issues/38273](https://github.com/pytorch/pytorch/issues/38273)
* Triton Client PIP wheels for ARM SBSA are not available from PyPI, and pip will install an incorrect Jetson version of the Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
* Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to [pytorch/pytorch#66930](https://github.com/pytorch/pytorch/issues/66930) for more information.
* Triton cannot retrieve GPU metrics with [MIG-enabled GPU devices](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-gpus).
* Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
* When cloud storage (AWS, GCS, AZURE) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model’s folder in the temporary directory, which is deleted upon server shutdown.
* Python backend support for Windows is limited and does not currently support the following features:
  - GPU tensors
  - CPU and GPU-related metrics
  - Custom execution environments
  - The model load/unload APIs
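The tensor parallelism workaround in the list above can be sketched as a small helper that writes the engine arguments for a vLLM `model.json`. This is only an illustration: the model name is a hypothetical placeholder, and the helper assumes `model.json` carries the vLLM engine arguments `model`, `tensor_parallel_size`, and `distributed_executor_backend`.

```python
import json


def build_vllm_model_json(model: str, tensor_parallel_size: int) -> dict:
    """Return engine arguments for a vLLM `model.json` file (illustrative helper)."""
    config = {"model": model, "tensor_parallel_size": tensor_parallel_size}
    if tensor_parallel_size > 1:
        # Workaround from the known issue: do not rely on the default
        # executor backend when tensor parallelism is greater than 1.
        config["distributed_executor_backend"] = "ray"
    return config


# "example-org/example-model" is a placeholder, not a real model name.
config = build_vllm_model_json("example-org/example-model", tensor_parallel_size=2)
print(json.dumps(config, indent=4))
```

The printed JSON can be saved as the model's `model.json`; with a tensor parallelism size of 1 the helper leaves the executor backend at its default.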

Client Libraries and Examples

Ubuntu 24.04 builds of the client libraries and examples are included in this release in the attached `v2.58.0_ubuntu2404.clients.tar.gz` file. The SDK is also available as an Ubuntu 24.04-based [NGC Container](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver/tags). The SDK container includes the client libraries and examples, Performance Analyzer, and Model Analyzer. Some components are also available in the tritonclient pip package. See [Getting the Client Libraries](https://github.com/triton-inference-server/client/tree/r25.05#getting-the-client-libraries-and-examples) for more information on each of these options.

Windows Support

> [!NOTE]
> There is no Windows release for 25.05, the latest release is [25.02](https://github.com/triton-inference-server/server/releases/tag/v2.55.0).

Jetson iGPU Support

A release of Triton for [IGX](https://www.nvidia.com/en-us/edge-computing/products/igx/) is provided in the attached tar file: [`tritonserver2.58.0-igpu.tar`](https://github.com/triton-inference-server/server/releases/download/v2.58.0/tritonserver2.58.0-igpu.tar).

* This release supports **TensorRT** `10.10.0.31`, **ONNX Runtime** `1.22.0`, **PyTorch** [`2.8.0a0+5228986c39.nv25.5`](https://docs.nvidia.com/deeplearning/frameworks/install-pytorch-jetson-platform-release-notes/pytorch-jetson-rel.html), **Python** `3.12`, as well as _ensembles_.
* ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
* System shared memory is supported on Jetson. CUDA shared memory is not supported.
* GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries, as well as the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack, refer to [`jetson.md`](https://github.com/triton-inference-server/server/blob/r25.05/docs/user_guide/jetson.md).

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

```
python3 -m pip install --upgrade clients/python/tritonclient-2.58.0-py3-none-manylinux2014_aarch64.whl[all]
```

Triton TRT-LLM Container Support Matrix

The Triton TensorRT-LLM container is built from the 25.03 image [`nvcr.io/nvidia/tritonserver:25.03-py3-min`](http://nvcr.io/nvidia/tritonserver:25.03-py3-min). Please refer to the [support matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html) and [compatibility.md](https://github.com/triton-inference-server/server/blob/v2.58.0/docs/introduction/compatibility.md#container-name-trtllm-python-py3) for all dependency versions related to 25.03. However, the packages listed below have different versions than those specified in the support matrix.

| Dependency | Version |
| :------------: | :---------------: |
| TensorRT-LLM | 0.19.0 |
| TensorRT | 10.9.0.34 |

Published by dmitry-tokarev-nv about 1 year ago

server - Release 2.57.0 corresponding to NGC container 25.04

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

- Exposed gRPC infer thread count as a server option.
- Improved server stability during gRPC client cancellation.
- Improved server stability in tracing mode.
- Added BLS decoupled request cancellation in the Python Backend.
- GenAI-Perf now offers a new configuration file alongside the command line.
- GenAI-Perf now supports the Huggingface TGI generated endpoint.
- GenAI-Perf added a Token per second per user (TPS/user) metric.
- GenAI-Perf metric parsing speed was increased by 60%.

Known Issues

* vLLM backend for 25.04 might be unstable with the vLLM V1 architecture. We recommend switching to V0 for this release by setting the `VLLM_USE_V1` environment variable to `0`. However, users should be aware that vLLM's V0 API is affected by vulnerabilities.
* vLLM containers include vLLM version 0.8.1, which is affected by new vulnerabilities. Until a fix is available, the workarounds are:
  - Do not expose the vLLM host to a network where any untrusted connections may reach the host.
  - Ensure that only the other vLLM hosts are able to connect to the TCP port used for the XPUB socket. Note that the port used is random.
* The core Python binding may incur an additional D2H and H2D copy if the backend and frontend both specify device memory to be used for response tensors.
* A segmentation fault related to DCGM and NSCQ may be encountered during server shutdown on NVSwitch systems. A possible workaround for this issue is to disable the collection of GPU metrics: `tritonserver --allow-gpu-metrics false ...`
* vLLM backend currently does not take advantage of the [vLLM v0.6](https://blog.vllm.ai/2024/09/05/perf-update.html) performance improvement when metrics are enabled.
* When using TensorRT models, if auto-complete configuration is disabled and `is_non_linear_format_io:true` for [reformat-free tensors](https://github.com/triton-inference-server/server/blob/r24.08/docs/user_guide/model_configuration.md#non-linear-io-formats) is not provided in the model configuration, the model may not load successfully.
* When using Python models in [decoupled mode](https://github.com/triton-inference-server/python_backend/tree/main?tab=readme-ov-file#decoupled-mode), users need to ensure that the `ResponseSender` goes out of scope or is properly cleaned up before unloading the model to guarantee that the unloading process executes correctly.
* Restart support was temporarily removed for Python models.
* Triton Inference Server with vLLM backend currently does not support running vLLM models with tensor parallelism sizes greater than 1 and the default "distributed_executor_backend" setting when using explicit model control mode. When attempting to load a vLLM model (tp > 1) in explicit mode, users may see a failure at the `initialize` step: `could not acquire lock for <_io.BufferedWriter name=''> at interpreter shutdown, possibly due to daemon threads`. For the default model control mode, vLLM-related sub-processes are not killed after server shutdown. Related vLLM issue: https://github.com/vllm-project/vllm/issues/6766 . Please specify "distributed_executor_backend":"ray" in the `model.json` when deploying vLLM models with tensor parallelism > 1.
* When loading models with file override, multiple model configuration files are not supported. Users must provide the model configuration by setting parameter `"config" : ""` instead of a custom configuration file in the following format: `"file:configs/.pbtxt" : ""`.
* TensorRT-LLM [backend](https://github.com/triton-inference-server/tensorrtllm_backend) provides limited support of Triton extensions and features.
* The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
* The Java CAPI is known to have intermittent segfaults.
* Some systems which implement `malloc()` may not release memory back to the operating system right away, causing a false memory leak. This can be mitigated by using a different malloc implementation. `TCMalloc` and `jemalloc` are installed in the Triton container and can be [used by specifying the library in LD_PRELOAD](https://github.com/triton-inference-server/server/blob/r25.01/docs/user_guide/model_management.md). NVIDIA recommends experimenting with both `tcmalloc` and `jemalloc` to determine which one works better for your use case.
* Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with `--disable-auto-complete-config`.
* Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names match what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: [https://github.com/pytorch/pytorch/issues/38273](https://github.com/pytorch/pytorch/issues/38273)
* Triton Client PIP wheels for ARM SBSA are not available from PyPI, and pip will install an incorrect Jetson version of the Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
* Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to [pytorch/pytorch#66930](https://github.com/pytorch/pytorch/issues/66930) for more information.
* Triton cannot retrieve GPU metrics with [MIG-enabled GPU devices](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-gpus).
* Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
* When cloud storage (AWS, GCS, AZURE) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model’s folder in the temporary directory, which is deleted upon server shutdown.
* Python backend support for Windows is limited and does not currently support the following features:
  - GPU tensors
  - CPU and GPU-related metrics
  - Custom execution environments
  - The model load/unload APIs
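Two of the workarounds in the list above are launch-time flags. As a rough sketch, a deployment script might assemble the `tritonserver` command line as follows; the `/models` repository path is a placeholder, and the helper itself is hypothetical rather than part of Triton.

```python
import shlex


def build_tritonserver_command(model_repository: str,
                               disable_auto_complete: bool = False,
                               disable_gpu_metrics: bool = False) -> list:
    """Assemble a tritonserver argument list (illustrative helper)."""
    command = ["tritonserver", f"--model-repository={model_repository}"]
    if disable_auto_complete:
        # Skips auto-complete; the full model configuration must then be provided.
        command.append("--disable-auto-complete-config")
    if disable_gpu_metrics:
        # Workaround for the DCGM/NSCQ shutdown segfault on NVSwitch systems,
        # written the same way as in the known issue.
        command += ["--allow-gpu-metrics", "false"]
    return command


# "/models" is a placeholder path for the model repository.
command = build_tritonserver_command("/models",
                                     disable_auto_complete=True,
                                     disable_gpu_metrics=True)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` inside the container; building it as a list avoids shell quoting problems.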

Client Libraries and Examples

Ubuntu 24.04 builds of the client libraries and examples are included in this release in the attached `v2.57.0_ubuntu2404.clients.tar.gz` file. The SDK is also available as an Ubuntu 24.04-based [NGC Container](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver/tags). The SDK container includes the client libraries and examples, Performance Analyzer, and Model Analyzer. Some components are also available in the tritonclient pip package. See [Getting the Client Libraries](https://github.com/triton-inference-server/client/tree/r25.04#getting-the-client-libraries-and-examples) for more information on each of these options.

Windows Support

> [!NOTE]
> There is no Windows release for 25.04, the latest release is [25.02](https://github.com/triton-inference-server/server/releases/tag/v2.55.0).

Jetson iGPU Support

A release of Triton for [IGX](https://www.nvidia.com/en-us/edge-computing/products/igx/) is provided in the attached tar file: [`tritonserver2.57.0-igpu.tar`](https://github.com/triton-inference-server/server/releases/download/v2.57.0/tritonserver2.57.0-igpu.tar).

* This release supports **TensorRT** `10.9.0.34`, **ONNX Runtime** `1.21.0`, **PyTorch** [`2.7.0a0+79aa17489c.nv25.4`](https://docs.nvidia.com/deeplearning/frameworks/install-pytorch-jetson-platform-release-notes/pytorch-jetson-rel.html), **Python** `3.12`, as well as _ensembles_.
* ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
* System shared memory is supported on Jetson. CUDA shared memory is not supported.
* GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries, as well as the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack, refer to [`jetson.md`](https://github.com/triton-inference-server/server/blob/r25.04/docs/user_guide/jetson.md).

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

```
python3 -m pip install --upgrade clients/python/tritonclient-2.57.0-py3-none-manylinux2014_aarch64.whl[all]
```

Triton TRT-LLM Container Support Matrix

The Triton TensorRT-LLM container is built from the 25.03 image [`nvcr.io/nvidia/tritonserver:25.03-py3-min`](http://nvcr.io/nvidia/tritonserver:25.03-py3-min). Please refer to the [support matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html) and [compatibility.md](https://github.com/triton-inference-server/server/blob/v2.57.0/docs/introduction/compatibility.md#container-name-trtllm-python-py3) for all dependency versions related to 25.03. However, the packages listed below have different versions than those specified in the support matrix.

| Dependency | Version |
| :------------: | :---------------: |
| TensorRT-LLM | 0.18.2 |
| TensorRT | 10.9.0.34 |

Published by dmitry-tokarev-nv about 1 year ago

server - Release 2.56.0 corresponding to NGC container 25.03

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

* The TensorFlow Backend has been deprecated starting in 25.03. The last release of Triton Inference Server with the TensorFlow Backend is 25.02. Users wishing to continue using the TensorFlow Backend in 25.03 and later can build the [TensorFlow Backend](https://github.com/triton-inference-server/tensorflow_backend?tab=readme-ov-file#build-the-tensorflow-backend) from source and install the result into the `/opt/tritonserver/backends/` directory.
* The “XX.YY-tf2-python-py3” container will no longer be available starting in 25.03. See the TensorFlow Backend deprecation above.
* Added `generate` and `generate_stream` inference types to the SageMaker server. Customers can choose the inference type, `infer` (default), `generate`, or `generate_stream`, by setting the `SAGEMAKER_TRITON_INFERENCE_TYPE` environment variable during server launch.
* To allow quick, on-demand metric retrieval for external load balancers such as the [Kubernetes Inference Gateway API](https://gateway-api-inference-extension.sigs.k8s.io/), Triton, when used with TRT-LLM, can include live KV-cache utilization and capacity metrics in the HTTP response header when processing inference requests.
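As an illustration of the SageMaker inference type selection, the sketch below prepares a launch environment with `SAGEMAKER_TRITON_INFERENCE_TYPE` set to one of the three values named in these notes. The helper name and the validation step are assumptions added for this example; how the server itself reacts to an unknown value is not covered here.

```python
import os

# Inference types named in the release notes; "infer" is the default.
VALID_INFERENCE_TYPES = ("infer", "generate", "generate_stream")


def sagemaker_launch_env(inference_type: str = "infer", base_env=None) -> dict:
    """Return a copy of the environment with the inference type set (illustrative helper)."""
    if inference_type not in VALID_INFERENCE_TYPES:
        raise ValueError(f"unsupported inference type: {inference_type!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["SAGEMAKER_TRITON_INFERENCE_TYPE"] = inference_type
    return env


env = sagemaker_launch_env("generate_stream", base_env={})
print(env)
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the server process.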

Known Issues

* The core Python binding may incur an additional D2H and H2D copy if the backend and frontend both specify device memory to be used for response tensors.
* A segmentation fault related to DCGM and NSCQ may be encountered during server shutdown on NVSwitch systems. A possible workaround for this issue is to disable the collection of GPU metrics: `tritonserver --allow-gpu-metrics false ...`
* vLLM backend currently does not take advantage of the [vLLM v0.6](https://blog.vllm.ai/2024/09/05/perf-update.html) performance improvement when metrics are enabled.
* Incorrect results are known to occur when using the TensorRT (TRT) Backend for inference with the int8 data type for I/O on the Blackwell GPU architecture.
* When using TensorRT models, if auto-complete configuration is disabled and `is_non_linear_format_io:true` for [reformat-free tensors](https://github.com/triton-inference-server/server/blob/r24.08/docs/user_guide/model_configuration.md#non-linear-io-formats) is not provided in the model configuration, the model may not load successfully.
* When using Python models in [decoupled mode](https://github.com/triton-inference-server/python_backend/tree/main?tab=readme-ov-file#decoupled-mode), users need to ensure that the `ResponseSender` goes out of scope or is properly cleaned up before unloading the model to guarantee that the unloading process executes correctly.
* Restart support was temporarily removed for Python models.
* Triton Inference Server with vLLM backend currently does not support running vLLM models with tensor parallelism sizes greater than 1 and the default "distributed_executor_backend" setting when using explicit model control mode. When attempting to load a vLLM model (tp > 1) in explicit mode, users may see a failure at the `initialize` step: `could not acquire lock for <_io.BufferedWriter name=''> at interpreter shutdown, possibly due to daemon threads`. For the default model control mode, vLLM-related sub-processes are not killed after server shutdown. Related vLLM issue: https://github.com/vllm-project/vllm/issues/6766 . Please specify "distributed_executor_backend":"ray" in the `model.json` when deploying vLLM models with tensor parallelism > 1.
* When loading models with file override, multiple model configuration files are not supported. Users must provide the model configuration by setting parameter `"config" : ""` instead of a custom configuration file in the following format: `"file:configs/.pbtxt" : ""`.
* TensorRT-LLM [backend](https://github.com/triton-inference-server/tensorrtllm_backend) provides limited support of Triton extensions and features.
* The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
* The Java CAPI is known to have intermittent segfaults.
* Some systems which implement `malloc()` may not release memory back to the operating system right away, causing a false memory leak. This can be mitigated by using a different malloc implementation. `TCMalloc` and `jemalloc` are installed in the Triton container and can be [used by specifying the library in LD_PRELOAD](https://github.com/triton-inference-server/server/blob/r25.01/docs/user_guide/model_management.md). NVIDIA recommends experimenting with both `tcmalloc` and `jemalloc` to determine which one works better for your use case.
* Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with `--disable-auto-complete-config`.
* Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names match what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: [https://github.com/pytorch/pytorch/issues/38273](https://github.com/pytorch/pytorch/issues/38273)
* Triton Client PIP wheels for ARM SBSA are not available from PyPI, and pip will install an incorrect Jetson version of the Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
* Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to [pytorch/pytorch#66930](https://github.com/pytorch/pytorch/issues/66930) for more information.
* Triton cannot retrieve GPU metrics with [MIG-enabled GPU devices](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-gpus).
* Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
* When cloud storage (AWS, GCS, AZURE) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model’s folder in the temporary directory, which is deleted upon server shutdown.
* Python backend support for Windows is limited and does not currently support the following features:
  - GPU tensors
  - CPU and GPU-related metrics
  - Custom execution environments
  - The model load/unload APIs
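For the file-override limitation in the list above, the following sketch builds the body of a model load request that passes the configuration through the `config` parameter. It assumes the load request carries a `parameters` object whose `config` value is the model configuration serialized as a JSON string; the model name and configuration fields shown are placeholders chosen for illustration.

```python
import json


def build_load_request(model_config: dict) -> str:
    """Serialize a load request body with an inline model configuration (illustrative helper)."""
    # The configuration is embedded as a JSON string under the "config" parameter,
    # rather than being supplied as a separate "file:..." configuration entry.
    return json.dumps({"parameters": {"config": json.dumps(model_config)}})


# Placeholder configuration; a real model needs its complete configuration here.
example_config = {"name": "example_model", "backend": "python", "max_batch_size": 8}
body = build_load_request(example_config)
print(body)
```

The resulting string would be sent as the body of the load call for the model in explicit model control mode.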

Client Libraries and Examples

Ubuntu 24.04 builds of the client libraries and examples are included in this release in the attached `v2.56.0_ubuntu2404.clients.tar.gz` file. The SDK is also available as an Ubuntu 24.04-based [NGC Container](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver/tags). The SDK container includes the client libraries and examples, Performance Analyzer, and Model Analyzer. Some components are also available in the tritonclient pip package. See [Getting the Client Libraries](https://github.com/triton-inference-server/client/tree/r25.03#getting-the-client-libraries-and-examples) for more information on each of these options.

Windows Support

> [!NOTE]
> There is no Windows release for 25.03, the latest release is [25.02](https://github.com/triton-inference-server/server/releases/tag/v2.55.0).

Jetson iGPU Support

A release of Triton for [IGX](https://www.nvidia.com/en-us/edge-computing/products/igx/) is provided in the attached tar file: [`tritonserver2.56.0-igpu.tgz`](https://github.com/triton-inference-server/server/releases/download/v2.56.0/tritonserver2.56.0-igpu.tgz).

* This release supports **TensorRT** `10.9.0.34`, **ONNX Runtime** `1.21.0`, **PyTorch** [`2.7.0a0+7c8ec84dab.nv25.3`](https://docs.nvidia.com/deeplearning/frameworks/install-pytorch-jetson-platform-release-notes/pytorch-jetson-rel.html), **Python** `3.12`, as well as _ensembles_.
* ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
* System shared memory is supported on Jetson. CUDA shared memory is not supported.
* GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries, as well as the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack, refer to [`jetson.md`](https://github.com/triton-inference-server/server/blob/r25.03/docs/user_guide/jetson.md).

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

```
python3 -m pip install --upgrade clients/python/tritonclient-2.56.0-py3-none-manylinux2014_aarch64.whl[all]
```

Triton TRT-LLM Container Support Matrix

The Triton TensorRT-LLM container is built from the 25.03 image [`nvcr.io/nvidia/tritonserver:25.03-py3-min`](http://nvcr.io/nvidia/tritonserver:25.03-py3-min). Please refer to the [support matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html) for all dependency versions related to 25.03. However, the packages listed below have different versions than those specified in the support matrix.

| Dependency | Version |
| :------------: | :---------------: |
| TensorRT-LLM | 0.18.0 |
| TensorRT | 10.9.0.34 |

Published by mc-nv over 1 year ago

server - Release 2.55.0 corresponding to NGC container 25.02

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

* Python backend now supports setting and retrieving [Inference Response Parameters](https://github.com/triton-inference-server/python_backend/tree/r25.02#inference-response-parameters) on `InferenceResponse` objects in `model.py`.
* Optimized the core Python binding architecture, leading to improved OpenAI frontend performance.
* Added dynamic sampling parameter handling, improving flexibility and consistency across vLLM interactions. Added support for the “guided_generation” request parameter for efficient constrained decoding workflows.
* Improved Multi-LoRA handling in the TRT-LLM gRPC client `end_to_end_grpc_client.py`.
* GenAI-Perf added the ability to format output using Jinja2 templates.
* GenAI-Perf telemetry now supports multiple metric endpoints.
* GenAI-Perf now supports increased corpus size, 90x the previously supported size.
* GenAI-Perf now supports keys without values as input.
* GenAI-Perf fixed the OSL issue due to Performance Analyzer not removing the first 4 bytes from output.
* GenAI-Perf added a chat template option for the TRT-LLM engine.
* Performance Analyzer fixed the `TRITON_ENABLE_GPU` compile definition bug.
* Performance Analyzer bumped the minimum required C++ version to C++20.
* Performance Analyzer now disallows using concurrency and warmup together with the schedule flag.

Known Issues

* The core Python binding may incur an additional D2H and H2D copy if the backend and frontend both specify device memory to be used for response tensors.
* A segmentation fault related to DCGM and NSCQ may be encountered during server shutdown on NVSwitch systems. A possible workaround for this issue is to disable the collection of GPU metrics: `tritonserver --allow-gpu-metrics false ...`
* vLLM backend currently does not take advantage of the [vLLM v0.6](https://blog.vllm.ai/2024/09/05/perf-update.html) performance improvement when metrics are enabled.
* Incorrect results are known to occur when using the TensorRT (TRT) Backend for inference with the int8 data type for I/O on the Blackwell GPU architecture.
* When using TensorRT models, if auto-complete configuration is disabled and `is_non_linear_format_io:true` for [reformat-free tensors](https://github.com/triton-inference-server/server/blob/r24.08/docs/user_guide/model_configuration.md#non-linear-io-formats) is not provided in the model configuration, the model may not load successfully.
* When using Python models in [decoupled mode](https://github.com/triton-inference-server/python_backend/tree/main?tab=readme-ov-file#decoupled-mode), users need to ensure that the `ResponseSender` goes out of scope or is properly cleaned up before unloading the model to guarantee that the unloading process executes correctly.
* Restart support was temporarily removed for Python models.
* Triton Inference Server with vLLM backend currently does not support running vLLM models with tensor parallelism sizes greater than 1 and the default "distributed_executor_backend" setting when using explicit model control mode. When attempting to load a vLLM model (tp > 1) in explicit mode, users may see a failure at the `initialize` step: `could not acquire lock for <_io.BufferedWriter name=''> at interpreter shutdown, possibly due to daemon threads`. For the default model control mode, vLLM-related sub-processes are not killed after server shutdown. Related vLLM issue: https://github.com/vllm-project/vllm/issues/6766 . Please specify "distributed_executor_backend":"ray" in the `model.json` when deploying vLLM models with tensor parallelism > 1.
* When loading models with file override, multiple model configuration files are not supported. Users must provide the model configuration by setting parameter `"config" : ""` instead of a custom configuration file in the following format: `"file:configs/.pbtxt" : ""`.
* TensorRT-LLM [backend](https://github.com/triton-inference-server/tensorrtllm_backend) provides limited support of Triton extensions and features.
* The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
* The Java CAPI is known to have intermittent segfaults.
* Some systems which implement `malloc()` may not release memory back to the operating system right away, causing a false memory leak. This can be mitigated by using a different malloc implementation. `TCMalloc` and `jemalloc` are installed in the Triton container and can be [used by specifying the library in LD_PRELOAD](https://github.com/triton-inference-server/server/blob/r25.01/docs/user_guide/model_management.md). NVIDIA recommends experimenting with both `tcmalloc` and `jemalloc` to determine which one works better for your use case.
* Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with `--disable-auto-complete-config`.
* Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names match what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: [https://github.com/pytorch/pytorch/issues/38273](https://github.com/pytorch/pytorch/issues/38273)
* Triton Client PIP wheels for ARM SBSA are not available from PyPI, and pip will install an incorrect Jetson version of the Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
* Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to [pytorch/pytorch#66930](https://github.com/pytorch/pytorch/issues/66930) for more information.
* Triton cannot retrieve GPU metrics with [MIG-enabled GPU devices](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-gpus).
* Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
* When cloud storage (AWS, GCS, AZURE) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model’s folder in the temporary directory, which is deleted upon server shutdown.
* Python backend support for Windows is limited and does not currently support the following features:
  - GPU tensors
  - CPU and GPU-related metrics
  - Custom execution environments
  - The model load/unload APIs
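The malloc mitigation in the list above relies on `LD_PRELOAD`. A minimal sketch of preparing such an environment is shown below; the allocator library path is an assumption for illustration and should be checked against the actual location inside the container, and the helper is not part of Triton.

```python
def with_preloaded_allocator(env: dict, library_path: str) -> dict:
    """Return a copy of `env` with `library_path` prepended to LD_PRELOAD (illustrative helper)."""
    updated = dict(env)
    existing = updated.get("LD_PRELOAD", "")
    # Keep any libraries that were already preloaded, placing the allocator first.
    updated["LD_PRELOAD"] = f"{library_path}:{existing}" if existing else library_path
    return updated


# Assumed path; verify where the tcmalloc or jemalloc library lives in your container.
ASSUMED_TCMALLOC_PATH = "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"

launch_env = with_preloaded_allocator({}, ASSUMED_TCMALLOC_PATH)
print(launch_env["LD_PRELOAD"])
```

Passing `launch_env` (merged with the regular process environment) when starting `tritonserver` makes it straightforward to compare runs with `tcmalloc`, `jemalloc`, and the default allocator.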

Client Libraries and Examples

Ubuntu 24.04 builds of the client libraries and examples are included in this release in the attached `v2.55.0_ubuntu2404.clients.tar.gz` file. The SDK is also available as an Ubuntu 24.04-based [NGC Container](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver/tags). The SDK container includes the client libraries and examples, Performance Analyzer, and Model Analyzer. Some components are also available in the tritonclient pip package. See [Getting the Client Libraries](https://github.com/triton-inference-server/client/tree/r25.02#getting-the-client-libraries-and-examples) for more information on each of these options.

Windows Support

> [!NOTE]
> The 25.02 Windows release is under development.

Jetson iGPU Support

A release of Triton for [IGX](https://www.nvidia.com/en-us/edge-computing/products/igx/) is provided in the attached tar file: [`tritonserver2.55.0-igpu.tgz`](https://github.com/triton-inference-server/server/releases/download/v2.55.0/tritonserver2.55.0-igpu.tgz).

* This release supports **TensorFlow** `2.17.0`, **TensorRT** `10.8.0.40`, **Onnx Runtime** `1.20.1`, **PyTorch** [`2.6.0a0+ecf3bae40a`](https://docs.nvidia.com/deeplearning/frameworks/install-pytorch-jetson-platform-release-notes/pytorch-jetson-rel.html), **Python** `3.12`, as well as _ensembles_.
* ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
* System shared memory is supported on Jetson. CUDA shared memory is not supported.
* GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to [`jetson.md`](https://github.com/triton-inference-server/server/blob/r25.02/docs/user_guide/jetson.md).

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

```
python3 -m pip install --upgrade clients/python/tritonclient-2.55.0-py3-none-manylinux2014_aarch64.whl[all]
```

Triton TRT-LLM Container Support Matrix

The Triton TensorRT-LLM container is built from the 25.02 image [`nvcr.io/nvidia/tritonserver:25.02-py3-min`](http://nvcr.io/nvidia/tritonserver:25.02-py3-min). Please refer to the [support matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html) for all dependency versions related to 25.02. However, the packages listed below have different versions than those specified in the support matrix.

| Dependency | Version |
| :------------: | :---------------: |
| TensorRT-LLM | 0.17.0 |
| TensorRT | 10.8.0.40 |

- Python
Published by nv-kmcgill53 over 1 year ago

server - Release 2.54.0 corresponding to NGC container 25.01

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

* Starting with the 25.01 release, Triton Inference Server supports Blackwell GPU architectures.
* Starting from 25.01, the vLLM container shipped by Triton is NVIDIA optimized. Users who wish to use the public version of vLLM can continue to build a Triton-vLLM container on their end.
* Fixed a bug when passing the correlation ID of string type to python_backend. Added datatype checks to correlation ID values.
* vLLM backend can now take advantage of the vLLM v0.6 performance improvement by communicating with the vLLM engine via ZMQ.
* GenAI-Perf now provides the exact input sequence length requested for synthetic text generation.
* GenAI-Perf supports the creation of a prefix pool to emulate system prompts via `--num-system-prompts` and `--system-prompt-length`.
* GenAI-Perf improved error visibility via returning more detailed errors when OpenAI frontends return an error or metric generation fails.
* GenAI-Perf reports time-to-second-token and request count in its metrics.
* GenAI-Perf allows the use of a custom tokenizer in its “compare” subcommand for comparing multiple profiles.
* GenAI-Perf natively supports `--request-count` for sending a specific number of requests and `--header` for sending a list of headers with every request.
* Model Analyzer functionality has been migrated to GenAI-Perf via the “analyze” subcommand, enabling the tool to sweep and find the optimal model configuration.
* A bytes appending bug was fixed in GenAI-Perf, resulting in more accurate output sequence lengths for Triton.
* _Update February 12th, 2025_: Triton Windows release now has CUDA context sharing support in the TensorRT Backend.
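
The GenAI-Perf flags named in the list above can be combined into one invocation. The following is a minimal sketch that only assembles the argument list; it assumes the `profile` subcommand with `-m` for the model name, and the model name, header text and numeric values are hypothetical placeholders rather than recommended settings.

```python
def genai_perf_args(model, request_count, num_system_prompts,
                    system_prompt_length, headers=()):
    """Assemble a genai-perf command line from the flags listed above."""
    args = [
        "genai-perf", "profile",          # assumed subcommand
        "-m", model,                      # assumed model-name flag
        "--request-count", str(request_count),
        "--num-system-prompts", str(num_system_prompts),
        "--system-prompt-length", str(system_prompt_length),
    ]
    # One --header flag per header that should accompany every request.
    for header in headers:
        args += ["--header", header]
    return args


# Hypothetical values, for illustration only.
cmd = genai_perf_args("my_model", 100, 4, 64, headers=["X-Example: demo"])
print(" ".join(cmd))
```

The list form can be passed to `subprocess.run` without shell quoting concerns.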

Known Issues

* A segmentation fault related to DCGM and NSCQ may be encountered during server shutdown on NVSwitch systems. A possible workaround for this issue is to disable the collection of GPU metrics `tritonserver --allow-gpu-metrics false ...`
* vLLM backend currently does not take advantage of the [vLLM v0.6](https://blog.vllm.ai/2024/09/05/perf-update.html) performance improvement when metrics are enabled.
* Note that the vLLM version provided in the 25.01 container is 0.6.3.post1. Due to some issues with vLLM library versioning, `vllm.__version__` displays `0.6.3`.
* Incorrect results are known to occur when using TensorRT (TRT) Backend for inference using int8 data type for I/O on the Blackwell GPU architecture.
* When running Torch TRT models, the output may differ from running the same model on a previous release.
* When using TensorRT models, if auto-complete configuration is disabled and `is_non_linear_format_io:true` for [reformat-free tensors](https://github.com/triton-inference-server/server/blob/r24.08/docs/user_guide/model_configuration.md#non-linear-io-formats) is not provided in the model configuration, the model may not load successfully.
* When using Python models in [decoupled mode](https://github.com/triton-inference-server/python_backend/tree/main?tab=readme-ov-file#decoupled-mode), users need to ensure that the `ResponseSender` goes out of scope or is properly cleaned up before unloading the model to guarantee that the unloading process executes correctly.
* Restart support was temporarily removed for Python models.
* Triton Inference Server with vLLM backend currently does not support running vLLM models with tensor parallelism sizes greater than 1 and the default "distributed_executor_backend" setting when using explicit model control mode. When attempting to load a vLLM model (tp > 1) in explicit mode, users may see a failure at the `initialize` step: `could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads`. For the default model control mode, after server shutdown, vLLM-related sub-processes are not killed. Related vLLM issue: https://github.com/vllm-project/vllm/issues/6766. Please specify "distributed_executor_backend":"ray" in the `model.json` when deploying vLLM models with tensor parallelism > 1.
* When loading models with file override, multiple model configuration files are not supported. Users must provide the model configuration by setting parameter `"config" : ""` instead of custom configuration file in the following format: `"file:configs/.pbtxt" : ""`.
* TensorRT-LLM [backend](https://github.com/triton-inference-server/tensorrtllm_backend) provides limited support of Triton extensions and features.
* The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
* The Java CAPI is known to have intermittent segfaults.
* Some systems which implement `malloc()` may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. `TCMalloc` and `jemalloc` are installed in the Triton container and can be [used by specifying the library in LD_PRELOAD](https://github.com/triton-inference-server/server/blob/r25.01/docs/user_guide/model_management.md). NVIDIA recommends experimenting with both `tcmalloc` and `jemalloc` to determine which one works better for your use case.
* Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with `--disable-auto-complete-config`.
* Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names match what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: [https://github.com/pytorch/pytorch/issues/38273](https://github.com/pytorch/pytorch/issues/38273)
* Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
* Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to [pytorch/pytorch#66930](https://github.com/pytorch/pytorch/issues/66930) for more information.
* Triton cannot retrieve GPU metrics with [MIG-enabled GPU devices](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-gpus).
* Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
* When cloud storage (AWS, GCS, AZURE) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model’s folder in the temporary directory, which is deleted upon server’s shutdown.
* Python backend support for Windows is limited and does not currently support the following features:
  - GPU tensors
  - CPU and GPU-related metrics
  - Custom execution environments
  - The model load/unload APIs
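
For the vLLM tensor-parallelism issue listed above, the workaround is a single extra field in `model.json`. The following is a minimal sketch of generating such a file, assuming `model.json` holds vLLM engine arguments as a flat JSON object; the model identifier is a hypothetical placeholder, and `tensor_parallel_size` is assumed to be the engine argument that sets the tensor parallelism size.

```python
import json


def build_vllm_model_json(model_id: str, tp_size: int) -> str:
    """Return model.json text, selecting the Ray executor when tp_size > 1."""
    engine_args = {
        "model": model_id,                # hypothetical model identifier
        "tensor_parallel_size": tp_size,  # assumed engine argument name
    }
    if tp_size > 1:
        # Workaround from the known issue: avoid the default executor backend.
        engine_args["distributed_executor_backend"] = "ray"
    return json.dumps(engine_args, indent=2)


model_json = build_vllm_model_json("example-org/example-model", tp_size=2)
print(model_json)
```

The resulting text would be saved as the `model.json` of the vLLM model in the model repository.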

Client Libraries and Examples

Ubuntu 24.04 builds of the client libraries and examples are included in this release in the attached `v2.54.0_ubuntu2404.clients.tar.gz` file. The SDK is also available as an Ubuntu 24.04 based [NGC Container](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver/tags). The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See [Getting the Client Libraries](https://github.com/triton-inference-server/client/tree/r25.01#getting-the-client-libraries-and-examples) for more information on each of these options.

Windows Support

A beta release of Triton for Windows is provided in the attached file: `tritonserver2.54.0-win.zip`. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

* HTTP/REST and GRPC endpoints are supported.
* ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is `1.20.1`. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
* OpenVINO models are supported. The OpenVINO version is `2024.5.0`.
* Prometheus metrics endpoint is not supported.
* System and CUDA shared memory are not supported.

Known Issues

* In our internal testing, we observed large latency in retrieving inference results from HTTP client. We recommend using gRPC to circumvent this issue.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the [Dockerfile.win10.min](https://github.com/triton-inference-server/server/blob/r25.01/Dockerfile.win10.min). The Dockerfile includes the following CUDA-related components:

* Python `3.12.3`
* CUDA `12.6.3`
* cuDNN `9.6.0.74`
* TensorRT `10.7.0.23`

Jetson iGPU Support

A release of Triton for [IGX](https://www.nvidia.com/en-us/edge-computing/products/igx/) is provided in the attached tar file: [`tritonserver2.54.0-igpu.tgz`](https://github.com/triton-inference-server/server/releases/download/v2.54.0/tritonserver2.54.0-igpu.tgz).

* This release supports **TensorFlow** `2.17.0`, **TensorRT** `10.8.0.40`, **Onnx Runtime** `1.20.1`, **PyTorch** [`2.6.0a0+ecf3bae40a`](https://docs.nvidia.com/deeplearning/frameworks/install-pytorch-jetson-platform-release-notes/pytorch-jetson-rel.html), **Python** `3.12`, as well as _ensembles_.
* ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
* System shared memory is supported on Jetson. CUDA shared memory is not supported.
* GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to [`jetson.md`](https://github.com/triton-inference-server/server/blob/r25.01/docs/user_guide/jetson.md).

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

```
python3 -m pip install --upgrade clients/python/tritonclient-2.54.0-py3-none-manylinux2014_aarch64.whl[all]
```

Triton TRT-LLM Container Support Matrix

The Triton TensorRT-LLM container is built from the 25.01 image [`nvcr.io/nvidia/tritonserver:25.01-py3-min`](http://nvcr.io/nvidia/tritonserver:25.01-py3-min). Please refer to the [support matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html) for all dependency versions related to 25.01. However, the packages listed below have different versions than those specified in the support matrix.

| Dependency | Version |
| :------------: | :---------------: |
| TensorRT-LLM | 0.17.0 |
| TensorRT | 10.8.0.43 |

- Python
Published by mc-nv over 1 year ago

server - Release 2.53.0 corresponding to NGC container 24.12

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

A vLLM backend health check can optionally be enabled; it unloads the model if the vLLM engine health check fails.
vLLM backend supports sending additional outputs from vLLM if requested.
Improved server stability during the gRPC client cancellation.
Perf Analyzer: Added trtllm multi node process support.
Windows executables and DLLs are signed by NVIDIA. This should remove the un-trusted software popup when starting Triton outside of administrator mode.
Triton on Windows supports long path notation for model repositories.
Triton on Windows supports wide character encoding, UTF-16, for model repositories.

Known Issues

To build the Llama 3.1 engine inside the 24.09-trtllm-python-py3 image, make sure to upgrade the transformer library to 4.43+ due to the bug in 4.43.x. One option to do so is to run pip install -U transformers. For more information, please refer to the discussion: https://github.com/NVIDIA/TensorRT-LLM/issues/2121
The Triton vLLM container ships a vLLM version that has a known vulnerability: https://github.com/advisories/GHSA-w2r7-9579-27hf. Note that the affected code is not invoked at runtime, so the Triton vLLM container is not affected by this issue.
When running Torch TRT models, the output may differ from running the same model on a previous release.
When using TensorRT models, if auto-complete configuration is disabled and `is_non_linear_format_io:true` for reformat-free tensors is not provided in the model configuration, the model may not load successfully.
When using Python models in decoupled mode, users need to ensure that the ResponseSender goes out of scope or is properly cleaned up before unloading the model to guarantee that the unloading process executes correctly.
Restart support was temporarily removed for Python models.
Triton TensorRT-LLM Backend container image uses TensorRT-LLM version 0.16.0 and is built from nvcr.io/nvidia/tritonserver:24.11-py3-min. Please refer to the Triton TRT-LLM Container Support Matrix section in the GitHub release note for more details.
Triton Inference Server with vLLM backend currently does not support running vLLM models with tensor parallelism sizes greater than 1 and the default "distributed_executor_backend" setting when using explicit model control mode. When attempting to load a vLLM model (tp > 1) in explicit mode, users may see a failure at the `initialize` step: `could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads`. For the default model control mode, after server shutdown, vLLM-related sub-processes are not killed. Related vLLM issue: https://github.com/vllm-project/vllm/issues/6766. Please specify "distributed_executor_backend":"ray" in the `model.json` when deploying vLLM models with tensor parallelism > 1.
When loading models with file override, multiple model configuration files are not supported. Users must provide the model configuration by setting parameter `"config" : ""` instead of custom configuration file in the following format: `"file:configs/.pbtxt" : ""`.
Perf Analyzer no longer supports --trace-file option.
TensorRT-LLM backend provides limited support of Triton extensions and features.
The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
The Java CAPI is known to have intermittent segfaults.
Some systems which implement malloc() may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc and jemalloc are installed in the Triton container and can be used by specifying the library in LD_PRELOAD. NVIDIA recommends experimenting with both tcmalloc and jemalloc to determine which one works better for your use case.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names matches what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to pytorch/pytorch#66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices.
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
When cloud storage (AWS, GCS, AZURE) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model’s folder in the temporary directory, which is deleted upon server’s shutdown.
Python backend support for Windows is limited and does not currently support the following features:
- GPU tensors
- CPU and GPU-related metrics
- Custom execution environments
- The model load/unload APIs
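
The file-override constraint listed above (configuration passed through the `config` parameter, never as a file under `configs/`) can be sketched as a request-body builder. This is a minimal sketch, assuming the load request carries a `parameters` object in which `config` is a JSON string and each `file:` entry holds base64-encoded file content; the configuration fields and file bytes are hypothetical.

```python
import base64
import json


def build_load_request(config: dict, files: dict) -> str:
    """Return a JSON body for a model load request with file override."""
    # The model configuration travels as a JSON string under "config".
    parameters = {"config": json.dumps(config)}
    for rel_path, content in files.items():
        if rel_path.startswith("configs/"):
            # Custom configuration files are not supported with file override.
            raise ValueError("pass the configuration via 'config' instead")
        parameters["file:" + rel_path] = base64.b64encode(content).decode("ascii")
    return json.dumps({"parameters": parameters})


body = build_load_request(
    {"backend": "onnxruntime", "max_batch_size": 8},  # hypothetical config
    {"1/model.onnx": b"<model bytes>"},               # hypothetical content
)
print(body)
```

Rejecting `configs/` paths up front surfaces the limitation on the client side instead of at load time.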

Client Libraries and Examples

Ubuntu 24.04 builds of the client libraries and examples are included in this release in the attached v2.53.0_ubuntu2404.clients.tar.gz file. The SDK is also available as an Ubuntu 24.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

Windows Support

> [!IMPORTANT]
> This release of Triton on Windows includes features specific to the Windows platform and does not affect the 2.53.0 release for other platforms. This is released as a patch to 2.53.0 because of the differing feature commits.

A beta release of Triton for Windows is provided in the attached file: tritonserver2.53.1-win.zip. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

* HTTP/REST and GRPC endpoints are supported.
* ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.20.1. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
* OpenVINO models are supported. The OpenVINO version is 2024.5.0.
* Prometheus metrics endpoint is not supported.
* System and CUDA shared memory are not supported.

Known Issues

In our internal testing, we observed large latency in retrieving inference results from HTTP client. We recommend using gRPC to circumvent this issue.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

Python 3.12.3
CUDA 12.6.3
cuDNN 9.6.0.74
TensorRT 10.7.0.23

Jetson iGPU Support

A release of Triton for IGX is provided in the attached tar file: tritonserver2.53.0-igpu.tgz.

This release supports TensorFlow 2.17.0, TensorRT 10.7.0.23, Onnx Runtime 1.20.1, PyTorch 2.6.0a0+df5bbc0, Python 3.12, as well as ensembles.
ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.53.0-py3-none-manylinux2014_aarch64.whl[all]

Triton TRT-LLM Container Support Matrix

The Triton TensorRT-LLM container is built from the 24.11 image nvcr.io/nvidia/tritonserver:24.11-py3-min. Please refer to the support matrix for all dependency versions related to 24.11. However, the packages listed below have different versions than those specified in the support matrix.

| Dependency | Version |
| :------------: | :---------------: |
| TensorRT-LLM | 0.16.0 |
| TensorRT | 10.6.0.26 |

- Python
Published by mc-nv over 1 year ago

server - Release 2.52.0 corresponding to NGC container 24.11

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

Conceptual Guides were enhanced with a comprehensive tutorial on Semantic Caching optimization for LLM workloads.
Triton Metrics:
- Added a new histogram metric "Request to First Response Time" for decoupled models. It is enabled by setting --metrics-config histogram_latencies=true (see the user guide).
- Added a new model configuration field, model_metrics, that allows overriding the default buckets for histogram metric families.

Known Issues

TensorFlow backend may leak memory due to a known issue with the cuDNN library shipped with the container.
The latest GenAI-Perf package on pypi.org is version 0.0.9dev while the latest Triton SDK container (24.11) contains GenAI-Perf version 0.0.8.
NumPy 2.x is not currently supported for Python Backend models and may cause them to return empty tensors unexpectedly. Please use NumPy 1.x until support is added.
The Triton vLLM container ships a vLLM version that has a known vulnerability: https://github.com/advisories/GHSA-w2r7-9579-27hf. Note that the affected code is not invoked at runtime, so the Triton vLLM container is not affected by this issue.
When running Torch TRT models, the output may differ from running the same model on a previous release.
When using TensorRT models, if auto-complete configuration is disabled and `is_non_linear_format_io:true` for reformat-free tensors is not provided in the model configuration, the model may not load successfully.
When using Python models in decoupled mode, users need to ensure that the ResponseSender goes out of scope or is properly cleaned up before unloading the model to guarantee that the unloading process executes correctly.
Restart support was temporarily removed for Python models.
Triton TensorRT-LLM Backend container image uses TensorRT-LLM version 0.15.0 and is built from nvcr.io/nvidia/tritonserver:24.10-py3-min. Please refer to the Triton TRT-LLM Container Support Matrix section in the GitHub release note for more details.
Triton Inference Server with vLLM backend currently does not support running vLLM models with tensor parallelism sizes greater than 1 and the default "distributed_executor_backend" setting when using explicit model control mode. When attempting to load a vLLM model (tp > 1) in explicit mode, users may see a failure at the `initialize` step: `could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads`. For the default model control mode, after server shutdown, vLLM-related sub-processes are not killed. Related vLLM issue: https://github.com/vllm-project/vllm/issues/6766. Please specify "distributed_executor_backend":"ray" in the `model.json` when deploying vLLM models with tensor parallelism > 1.
When loading models with file override, multiple model configuration files are not supported. Users must provide the model configuration by setting parameter `"config" : ""` instead of custom configuration file in the following format: `"file:configs/.pbtxt" : ""`.
Perf Analyzer no longer supports --trace-file option.
TensorRT-LLM backend provides limited support of Triton extensions and features.
The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
The Java CAPI is known to have intermittent segfaults.
Some systems which implement malloc() may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc and jemalloc are installed in the Triton container and can be used by specifying the library in LD_PRELOAD. NVIDIA recommends experimenting with both tcmalloc and jemalloc to determine which one works better for your use case.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names matches what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to pytorch/pytorch#66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices.
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
When cloud storage (AWS, GCS, AZURE) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model’s folder in the temporary directory, which is deleted upon server’s shutdown.
Python backend support for Windows is limited and does not currently support the following features:
- GPU tensors
- CPU and GPU-related metrics
- Custom execution environments
- The model load/unload APIs
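
The malloc workaround listed among the known issues above amounts to setting `LD_PRELOAD` before the server starts. The following is a minimal sketch that prepares such an environment; the library path and model repository path are assumptions and depend on the container image and architecture.

```python
import os


def with_preload(allocator_lib, base_env=None):
    """Return a copy of the environment with allocator_lib first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # Keep any libraries that were already preloaded, after the allocator.
    env["LD_PRELOAD"] = f"{allocator_lib}:{existing}" if existing else allocator_lib
    return env


# Assumed library location; check the actual path inside the container.
TCMALLOC = "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"
env = with_preload(TCMALLOC, base_env={"PATH": "/usr/bin"})
print(env["LD_PRELOAD"])

# The server could then be launched with this environment, for example:
# subprocess.run(["tritonserver", "--model-repository=/models"], env=env)
```

Swapping the path for the jemalloc library allows the two allocators to be compared under the same workload.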

Client Libraries and Examples

Ubuntu 22.04 builds of the client libraries and examples are included in this release in the attached v2.52.0_ubuntu22.04.clients.tar.gz file. The SDK is also available as an Ubuntu 22.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

Windows Support

> [!NOTE]
> There is no Windows release for 24.11; the latest release is 24.10.

Jetson iGPU Support

A release of Triton for IGX is provided in the attached tar file: tritonserver2.52.0-igpu.tgz.

This release supports TensorFlow 2.17.0, TensorRT 10.6.0.26, Onnx Runtime 1.19.2, PyTorch 2.6.0a0+df5bbc0, Python 3.12, as well as ensembles.
ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.52.0-py3-none-manylinux2014_aarch64.whl[all]

Triton TRT-LLM Container Support Matrix

The Triton TensorRT-LLM container is built from the 24.10 image nvcr.io/nvidia/tritonserver:24.10-py3-min. Please refer to the support matrix for all dependency versions related to 24.10. However, the packages listed below have different versions than those specified in the support matrix.

| Dependency | Version |
| :------------: | :---------------: |
| TensorRT-LLM | 0.15.0 |
| TensorRT | 10.6.0.26 |

- Python
Published by mc-nv over 1 year ago

server - Release 2.51.0 corresponding to NGC container 24.10

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

Optimized vLLM performance with custom metrics.

Known Issues

NumPy 2.x is not currently supported for Python Backend models and may cause them to return empty tensors unexpectedly. Please use NumPy 1.x until support is added.
To build the Llama 3.1 engine inside the 24.09-trtllm-python-py3 image, make sure to upgrade the transformer library to 4.43+ due to the bug in 4.43.x. One option to do so is to run pip install -U transformers. For more information, please refer to the discussion: https://github.com/NVIDIA/TensorRT-LLM/issues/2121
The Triton vLLM container ships a vLLM version that has a known vulnerability: https://github.com/advisories/GHSA-w2r7-9579-27hf. Note that the affected code is not invoked at runtime, so the Triton vLLM container is not affected by this issue.
When running Torch TRT models, the output may differ from running the same model on a previous release.
When using TensorRT models, if auto-complete configuration is disabled and `is_non_linear_format_io:true` for reformat-free tensors is not provided in the model configuration, the model may not load successfully.
When using Python models in decoupled mode, users need to ensure that the ResponseSender goes out of scope or is properly cleaned up before unloading the model to guarantee that the unloading process executes correctly.
Restart support was temporarily removed for Python models.
Triton TensorRT-LLM Backend container image uses TensorRT-LLM version 0.14.0 and is built from nvcr.io/nvidia/tritonserver:24.07-py3-min. Please refer to the Triton TRT-LLM Container Support Matrix section in the GitHub release note for more details.
Triton Inference Server with vLLM backend currently does not support running vLLM models with tensor parallelism sizes greater than 1 and the default `distributed_executor_backend` setting when using explicit model control mode. When attempting to load a vLLM model (tp > 1) in explicit mode, users may see a failure at the initialize step: could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads. In the default model control mode, vLLM-related subprocesses are not killed after server shutdown. Related vLLM issue: https://github.com/vllm-project/vllm/issues/6766. Please specify "distributed_executor_backend":"ray" in the `model.json` when deploying vLLM models with tensor parallelism > 1.
When loading models with file override, multiple model configuration files are not supported. Users must provide the model configuration by setting parameter "config" : "" instead of custom configuration file in the following format: "file:configs/.pbtxt" : "".
Perf Analyzer no longer supports --trace-file option.
TensorRT-LLM backend provides limited support of Triton extensions and features.
The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
The Java CAPI is known to have intermittent segfaults.
Some systems which implement malloc() may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc and jemalloc are installed in the Triton container and can be used by specifying the library in LD_PRELOAD. NVIDIA recommends experimenting with both tcmalloc and jemalloc to determine which one works better for your use case.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names matches what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to pytorch/pytorch#66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices.
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
When cloud storage (AWS, GCS, AZURE) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model’s folder in the temporary directory, which is deleted upon server’s shutdown.
Python backend support for Windows is limited and does not currently support the following features:
- GPU tensors
- CPU and GPU-related metrics
- Custom execution environments
- The model load/unload APIs
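
The vLLM tensor-parallelism issue listed above comes down to one extra key in the model's model.json. The following is a minimal sketch of generating such a file, not the backend's own tooling. It assumes that model.json holds vLLM engine arguments; the helper name `build_vllm_model_json` and the model identifier are hypothetical placeholders.

```python
import json


def build_vllm_model_json(model: str, tensor_parallel_size: int) -> dict:
    """Return engine arguments for a vLLM model.json (hypothetical helper)."""
    engine_args = {
        "model": model,
        "tensor_parallel_size": tensor_parallel_size,
    }
    # Per the known issue above, tensor parallelism > 1 should not rely on
    # the default executor backend, so request Ray explicitly.
    if tensor_parallel_size > 1:
        engine_args["distributed_executor_backend"] = "ray"
    return engine_args


# "example-org/example-model" is a placeholder, not a real model name.
print(json.dumps(build_vllm_model_json("example-org/example-model", 2), indent=2))
```

With a tensor parallelism size of 1 the key is omitted, which leaves the backend's default behavior untouched.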

Client Libraries and Examples

Ubuntu 22.04 builds of the client libraries and examples are included in this release in the attached v2.51.0_ubuntu22.04.clients.tar.gz file. The SDK is also available as an Ubuntu 22.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.51.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.51.0-win.zip. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.19.2. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2024.4.0.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

Known Issues

In our internal testing, we observed large latency in retrieving inference results from HTTP client. We recommend using gRPC to circumvent this issue.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

Python 3.10.11
CUDA 12.5.1
cuDNN 9.5.0.50
TensorRT 10.5.0.18

Jetson iGPU Support

A release of Triton for IGX is provided in the attached tar file: tritonserver2.51.0-igpu.tgz.

This release supports TensorFlow 2.16.1, TensorRT 10.5.0.18, ONNX Runtime 1.19.2, PyTorch 2.5.0a0+e000cf0, and Python 3.10, as well as ensembles.
ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.51.0-py3-none-manylinux2014_aarch64.whl[all]

Published by pvijayakrish over 1 year ago

server - Release 2.50.0 corresponding to NGC container 24.09

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

Our tutorials were updated with 2 extensive guides on constrained decoding implementation in TensorRT-LLM python backend and function/tool calling. Guides can be found here
Our tutorials were also updated for Kubernetes Multi-Node and Multi-Instance Scaling with Triton and TRT-LLM; they can be found here.
vLLM backend now supports these additional metrics. For additional details, see vllm_backend.
- `vllm:e2e_request_latency_seconds`
- `vllm:request_prompt_tokens`
- `vllm:request_generation_tokens`
- `vllm:request_params_best_of`
- `vllm:request_params_n`

Known Issues

To build the Llama 3.1 engine inside the 24.09-trtllm-python-py3 image, make sure to upgrade the transformers library to 4.43+ due to the bug in 4.43.x. One option to do so is to run pip install -U transformers. For more information, please refer to the discussion: https://github.com/NVIDIA/TensorRT-LLM/issues/2121.
The Triton vLLM container ships with a vLLM version that has a known vulnerability: https://github.com/advisories/GHSA-w2r7-9579-27hf. Note that the affected code is not invoked at runtime, so the Triton vLLM container is not affected by this issue.
When running Torch TRT models, the output may differ from running the same model on a previous release.
When using TensorRT models, if auto-complete configuration is disabled and `is_non_linear_format_io:true` for reformat-free tensors is not provided in the model configuration, the model may not load successfully.
When using Python models in decoupled mode, users need to ensure that the ResponseSender goes out of scope or is properly cleaned up before unloading the model to guarantee that the unloading process executes correctly.
Restart support was temporarily removed for Python models.
Triton TensorRT-LLM Backend container image uses TensorRT-LLM version 0.13.0 and is built from nvcr.io/nvidia/tritonserver:24.07-py3-min. Please refer to the Triton TRT-LLM Container Support Matrix section in the GitHub release note for more details.
Triton Inference Server with vLLM backend currently does not support running vLLM models with tensor parallelism sizes greater than 1 and the default `distributed_executor_backend` setting when using explicit model control mode. When attempting to load a vLLM model (tp > 1) in explicit mode, users may see a failure at the initialize step: could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads. In the default model control mode, vLLM-related subprocesses are not killed after server shutdown. Related vLLM issue: https://github.com/vllm-project/vllm/issues/6766. Please specify "distributed_executor_backend":"ray" in the `model.json` when deploying vLLM models with tensor parallelism > 1.
When loading models with file override, multiple model configuration files are not supported. Users must provide the model configuration by setting parameter "config" : "" instead of custom configuration file in the following format: "file:configs/.pbtxt" : "".
Perf Analyzer no longer supports --trace-file option.
TensorRT-LLM backend provides limited support of Triton extensions and features.
The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
The Java CAPI is known to have intermittent segfaults.
Some systems which implement malloc() may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc and jemalloc are installed in the Triton container and can be used by specifying the library in LD_PRELOAD. NVIDIA recommends experimenting with both tcmalloc and jemalloc to determine which one works better for your use case.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names matches what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to pytorch/pytorch#66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices.
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
When cloud storage (AWS, GCS, AZURE) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model’s folder in the temporary directory, which is deleted upon server’s shutdown.
Python backend support for Windows is limited and does not currently support the following features:
- GPU tensors
- CPU and GPU-related metrics
- Custom execution environments
- The model load/unload APIs
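
The malloc workaround in the list above relies on LD_PRELOAD. The sketch below shows one way to build a launch environment that preloads tcmalloc before starting the server. The library path and the /models repository path are assumptions; check the actual locations inside your container. The helper name `with_preload` is hypothetical.

```python
import os

# Assumed location of tcmalloc; verify the path inside your container.
TCMALLOC = "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"


def with_preload(library: str, base_env: dict) -> dict:
    """Return a copy of base_env with library prepended to LD_PRELOAD."""
    env = dict(base_env)
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = f"{library}:{existing}" if existing else library
    return env


env = with_preload(TCMALLOC, os.environ)
# /models is a placeholder model repository path.
command = ["tritonserver", "--model-repository=/models"]
# To launch inside the container, uncomment:
# import subprocess; subprocess.run(command, env=env, check=True)
print(env["LD_PRELOAD"])
```

Swapping `TCMALLOC` for the jemalloc library path gives the second variant to compare, as the note recommends trying both.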

Client Libraries and Examples

Ubuntu 22.04 builds of the client libraries and examples are included in this release in the attached v2.50.0_ubuntu22.04.clients.tar.gz file. The SDK is also available as an Ubuntu 22.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.50.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.50.0-win.zip. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.19.2. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2024.0.0.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

Known Issues

In our internal testing, we observed large latency in retrieving inference results from HTTP client. We recommend using gRPC to circumvent this issue.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

Python 3.10.11
CUDA 12.6.1
cuDNN 9.4.0.58
TensorRT 10.4.0.26

RHEL8 Support

Split tar files are included in the 'Assets' section of this release that comprise an early access (EA) release of Triton for RHEL8 for both x86 and aarch64 architectures. Once downloaded, you can untar the assets with the following commands: cat *x86*.tar.gz* | tar xvfz - and cat *aarch*.tar.gz* | tar xvfz - for x86 and aarch64, respectively.
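
Where a shell is not available, the same concatenate-then-extract step can be sketched in Python. This is an illustrative equivalent of the `cat ... | tar xvfz -` commands above, not part of the release. The function name `extract_split_archive` is hypothetical, and the sketch assumes the parts sort into the correct order by file name.

```python
import glob
import os
import shutil
import tarfile
import tempfile


def extract_split_archive(pattern: str, destination: str) -> list:
    """Join split .tar.gz parts matching pattern and extract them."""
    parts = sorted(glob.glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no files match {pattern!r}")
    os.makedirs(destination, exist_ok=True)
    # Stream the parts into one temporary file instead of holding
    # the whole archive in memory.
    with tempfile.TemporaryFile() as joined:
        for part in parts:
            with open(part, "rb") as handle:
                shutil.copyfileobj(handle, joined)
        joined.seek(0)
        with tarfile.open(fileobj=joined, mode="r:gz") as archive:
            names = archive.getnames()
            archive.extractall(destination)
    return names


# Example call for the x86 assets (run in the download directory):
# extract_split_archive("*x86*.tar.gz*", ".")
```

The temporary file needs roughly as much free disk space as the combined parts, in addition to the space for the extracted files.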

This release was compiled with AlmaLinux 8.9 to be compatible with RHEL 8. See the included README.md for complete details about installation, verification, and support. This release supports TensorFlow 2.16.1, TensorRT 10.3.0.26, ONNX Runtime 1.18.1, PyTorch 2.5.0a0+872d972e41, and Python 3.10, as well as ensembles. GCS storage and S3 storage are not supported; however, Azure storage is supported. Some optional backend features such as the PyTorch backend's TorchTRT extension are not currently supported.

The split tar files contain the Triton server executable and shared libraries to run the server. Triton server clients and examples are not included at this time.

Jetson iGPU Support

A release of Triton for IGX is provided in the attached tar file: tritonserver2.50.0-igpu.tgz.

This release supports TensorFlow 2.16.1, TensorRT 10.4.0.26, ONNX Runtime 1.19.2, PyTorch 2.5.0a0+b465a58, and Python 3.10, as well as ensembles.
ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.50.0-py3-none-manylinux2014_aarch64.whl[all]

Triton TRT-LLM Container Support Matrix

The Triton TensorRT-LLM container is built from the 24.07 image [nvcr.io/nvidia/tritonserver:24.07-py3-min](http://nvcr.io/nvidia/tritonserver:24.07-py3-min). Please refer to the support matrix for all dependency versions related to 24.07. However, the packages listed below have different versions than those specified in the support matrix.

| Dependency | Version |
| :------------: | :---------------: |
| TensorRT-LLM | 0.13.0 |
| TensorRT | 10.4.0.26 |

Published by pvijayakrish almost 2 years ago

server - Release 2.49.0 corresponding to NGC container 24.08

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

OpenAI-compatible embeddings and Hugging Face TEI re-ranker API-compatible rankings can now be profiled via GenAI-Perf.
GenAI-Perf can now receive multiple user-specified prompts via --input-file.
The request-rate for async requests has been updated in the OpenAI and HTTP clients to send requests at exactly that rate. Users submitting more requests than their models can handle can see increased latency.
The stabilization metric for Perf Analyzer has been updated due to these changes, so if latency does not stabilize for async models, a warning will be printed but Perf Analyzer will still complete.
Perf Analyzer will now validate any user-supplied inputs and outputs, returning an error if the model does not contain them.
Python backend now supports BF16 tensors via DLPack.
vLLM backend now supports these reporting metrics.
- `vllm:prompt_tokens_total`
- `vllm:generation_tokens_total`
- `vllm:time_to_first_token_seconds`

To enable the vLLM model's metrics reporting, add these lines to config.pbtxt: parameters: { key: "REPORT_CUSTOM_METRICS" value: { string_value:"yes" } }

TensorRT-LLM backend now supports specifying GPU device IDs per instance using the "gpu_device_ids" field.
After the model config is updated to load new model versions, any loaded model versions whose model files are unmodified will not be reloaded.
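
Once vLLM metrics reporting is enabled as described above, the counters appear in the Prometheus text served by Triton's metrics endpoint (port 8002 by default). The sketch below shows one way to pull the two token counters out of that text. The sample payload, its label values, and its numbers are made up for illustration, and the helper name `parse_vllm_counters` is hypothetical. It assumes each sample line ends with the value and carries no trailing timestamp.

```python
WANTED = ("vllm:prompt_tokens_total", "vllm:generation_tokens_total")

# Illustrative payload only; real output contains many more series.
SAMPLE = """\
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model="vllm_model",version="1"} 10
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model="vllm_model",version="1"} 16
nv_inference_count{model="vllm_model",version="1"} 1
"""


def parse_vllm_counters(text: str) -> dict:
    """Sum the wanted vLLM counters across all label sets."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        if name in WANTED:
            totals[name] = totals.get(name, 0.0) + float(value)
    return totals


print(parse_vllm_counters(SAMPLE))
```

The histogram metric for time to first token is left out on purpose, since its bucket, sum, and count series need different handling than plain counters.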

Known Issues

When running Torch TRT models, the output may differ from running the same model on a previous release. This issue is expected to be fixed on the next release.
When using TensorRT models, if auto-complete configuration is disabled and `is_non_linear_format_io:true` for reformat-free tensors is not provided in the model configuration, the model may not load successfully.
When using Python models in decoupled mode, users need to ensure that the ResponseSender goes out of scope or is properly cleaned up before unloading the model to guarantee that the unloading process executes correctly.
Restart support was temporarily removed for Python models.
Triton TensorRT-LLM Backend container image uses TensorRT-LLM version 0.12.0 and is built from nvcr.io/nvidia/tritonserver:24.07-py3-min. Please refer to the Triton TRT-LLM Container Support Matrix section in the GitHub release note for more details.
Triton Inference Server with vLLM backend currently does not support running vLLM models with tensor parallelism sizes greater than 1 and the default `distributed_executor_backend` setting when using explicit model control mode. When attempting to load a vLLM model (tp > 1) in explicit mode, users may see a failure at the initialize step: could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads. In the default model control mode, vLLM-related subprocesses are not killed after server shutdown. Related vLLM issue: https://github.com/vllm-project/vllm/issues/6766. Please specify "distributed_executor_backend":"ray" in the `model.json` when deploying vLLM models with tensor parallelism > 1.
When loading models with file override, multiple model configuration files are not supported. Users must provide the model configuration by setting parameter "config" : "" instead of custom configuration file in the following format: "file:configs/.pbtxt" : "".
Perf Analyzer no longer supports --trace-file option.
TensorRT-LLM backend provides limited support of Triton extensions and features.
The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
The Java CAPI is known to have intermittent segfaults.
Some systems which implement malloc() may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc and jemalloc are installed in the Triton container and can be used by specifying the library in LD_PRELOAD. NVIDIA recommends experimenting with both tcmalloc and jemalloc to determine which one works better for your use case.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names matches what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to pytorch/pytorch#66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices.
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
When cloud storage (AWS, GCS, AZURE) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model’s folder in the temporary directory, which is deleted upon server’s shutdown.
Python backend support for Windows is limited and does not currently support the following features:
- GPU tensors
- CPU and GPU-related metrics
- Custom execution environments
- The model load/unload APIs
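
For the file-override limitation listed above, the model configuration travels in the "config" parameter of the load request rather than as a file entry. The sketch below builds such a request body for the model repository load API. It assumes file contents are sent base64-encoded under "file:"-prefixed keys; the helper name `build_load_request`, the model name, and the file bytes are placeholders.

```python
import base64
import json


def build_load_request(model_config: dict, files: dict) -> dict:
    """Build a load request body with an inline config and file overrides."""
    # The configuration is passed as a JSON string under "config",
    # not as a "file:configs/..." entry.
    parameters = {"config": json.dumps(model_config)}
    for relative_path, content in files.items():
        encoded = base64.b64encode(content).decode("ascii")
        parameters[f"file:{relative_path}"] = encoded
    return {"parameters": parameters}


# Placeholder model name, config values, and file bytes.
payload = build_load_request(
    {"name": "my_model", "backend": "onnxruntime", "max_batch_size": 8},
    {"1/model.onnx": b"placeholder model bytes"},
)
print(sorted(payload["parameters"]))
```

The resulting dictionary can be serialized with `json.dumps` and sent as the body of the load request for the model.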

Client Libraries and Examples

Ubuntu 22.04 builds of the client libraries and examples are included in this release in the attached v2.49.0_ubuntu22.04.clients.tar.gz file. The SDK is also available as an Ubuntu 22.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.49.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.49.0-win.zip. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.18.1. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2024.0.0.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

Known Issues

In our internal testing, we observed large latency in retrieving inference results from HTTP client. We recommend using gRPC to circumvent this issue.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

Python 3.10.11
CUDA 12.5.1
cuDNN 9.3.0.75
TensorRT 10.3.0.26

Jetson iGPU Support

A release of Triton for IGX is provided in the attached tar file: tritonserver2.49.0-igpu.tgz.

This release supports TensorFlow 2.16.1, TensorRT 10.3.0.26, ONNX Runtime 1.19.0, PyTorch 2.5.0a0+872d972, and Python 3.10, as well as ensembles.
ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.49.0-py3-none-manylinux2014_aarch64.whl[all]

Triton TRT-LLM Container Support Matrix

The Triton TensorRT-LLM container is built from the 24.07 image nvcr.io/nvidia/tritonserver:24.07-py3-min. Please refer to the support matrix for all dependency versions related to 24.07. However, the packages listed below have different versions than those specified in the support matrix.

| Dependency | Version |
| :------------: | :---------------: |
| TensorRT-LLM | 0.12.0 |
| TensorRT | 10.3.0.26 |

Published by pvijayakrish almost 2 years ago

server - Release 2.48.0 corresponding to NGC container 24.07

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

OpenAI-compatible embeddings and Hugging Face TEI re-ranker API-compatible rankings can now be profiled via GenAI-Perf.
GenAI-Perf can now receive multiple user-specified prompts via --input-file.
The request-rate for async requests has been updated in the OpenAI and HTTP clients to send requests at exactly that rate. Users submitting more requests than their models can handle can see increased latency.
- The stabilization metric for Perf Analyzer has been updated due to these changes, so if latency does not stabilize for async models, a warning will be printed but Perf Analyzer will still complete.
Perf Analyzer will now validate any user-supplied inputs and outputs, returning an error if the model does not contain them.
Triton now supports tracing custom activities in the backend. For more information please refer to the documentation.
Enhanced Failure Count Metrics to reflect failure reason of inference request.

Known Issues

When using Python models in decoupled mode, users need to ensure that the ResponseSender goes out of scope or is properly cleaned up before unloading the model to guarantee that the unloading process executes correctly.
Triton TensorRT-LLM Backend container image uses TensorRT-LLM version 0.11.0 and is built from nvcr.io/nvidia/tritonserver:24.05-py3-min. Please refer to the Triton TRT-LLM Container Support Matrix section below for more details.
The Triton Inference Server with vLLM backend, when using explicit model control mode, does not support running vLLM models with the default `distributed_executor_backend` and tensor parallelism sizes greater than 1. Attempting to load a vLLM model in explicit mode with tensor parallelism > 1 will result in failure at the initialize step: could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads. Please specify "distributed_executor_backend":"ray" in the model.json when deploying vLLM models with tensor parallelism > 1.
When loading models with file override, multiple model configuration files are not supported. Users must provide the model configuration by setting parameter "config" : "" instead of custom configuration file in the following format: "file:configs/.pbtxt" : "".
Perf Analyzer no longer supports --trace-file option.
TensorRT-LLM backend provides limited support of Triton extensions and features.
The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
The Java CAPI is known to have intermittent segfaults.
Restart support was temporarily removed for Python models.
Some systems which implement malloc() may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc and jemalloc are installed in the Triton container and can be used by specifying the library in LD_PRELOAD. NVIDIA recommends experimenting with both tcmalloc and jemalloc to determine which one works better for your use case.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names matches what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to pytorch/pytorch#66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices.
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
When cloud storage (AWS, GCS, AZURE) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model’s folder in the temporary directory, which is deleted upon server’s shutdown.
Python backend support for Windows is limited and does not currently support the following features:
- GPU tensors
- CPU and GPU-related metrics
- Custom execution environments
- The model load/unload APIs
Starting in 24.06, if you use Triton's iGPU container you might encounter this error message when loading TensorRT models built with the 24.06 TensorRT iGPU container: “Serialization (Serialization assertion stdVersionRead == serializationVersion failed.Version tag does not match. Note: Current Version: 236, Serialized Engine Version: 237)”. If this happens you can rebuild your iGPU models with the 24.04 TensorRT iGPU container and then run them in the Triton 24.06 iGPU container.

Client Libraries and Examples

Ubuntu 22.04 builds of the client libraries and examples are included in this release in the attached v2.48.0_ubuntu22.04.clients.tar.gz file. The SDK is also available as an Ubuntu 22.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.48.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.48.0-win.zip. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.18.1. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2024.0.0.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

Known Issues

In our internal testing, we observed large latency in retrieving inference results from HTTP client. We recommend using gRPC to circumvent this issue.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

Python 3.10.11
CUDA 12.5.1
cuDNN 9.2.1.18
TensorRT 10.2.0.19

Jetson iGPU Support

A release of Triton for IGX is provided in the attached tar file: tritonserver2.48.0-igpu.tgz.

This release supports TensorFlow 2.16.1, TensorRT 8.6.2.3, ONNX Runtime 1.18.0, PyTorch 2.4.0a0+3bcc3cd, and Python 3.10, as well as ensembles.
ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.48.0-py3-none-manylinux2014_aarch64.whl[all]

RHEL Compatibility Support

A release of Triton for RHEL 8 compatibility is provided as a zip file: tritonserver2.48.0-rhel8-compat.zip.

This release was compiled with AlmaLinux 8.9 to be compatible with RHEL 8 and above and is considered Early Access [EA]. See the included README.md for complete details about installation, verification, and support.
This release supports ONNXRuntime 1.18.1 as well as ensembles.
GCS storage and S3 storage are not supported; however, Azure storage is supported.

The zip file contains the Tritonserver executable and shared libraries to run the server. Triton server clients and examples are not included at this time.

Triton TRT-LLM Container Support Matrix

The Triton TensorRT-LLM container is built from the 24.05 image nvcr.io/nvidia/tritonserver:24.05-py3-min. Please refer to the support matrix for all dependency versions related to 24.05. However, the packages listed below have different versions than those specified in the support matrix.

| Dependency | Version |
| :------------: | :---------------: |
| TensorRT-LLM | 0.11.0 |
| TensorRT | 10.1.0.27 |

- Python
Published by krishung5 almost 2 years ago

server - Release 2.47.0 corresponding to NGC container 24.06

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

The TensorRT Backend now supports the BF16 datatype.
A new tutorial on auto-scaling and load balancing TensorRT-LLM model deployments with Triton Inference Server has been released: https://github.com/triton-inference-server/tutorials/tree/main/Deployment/Kubernetes/TensorRT-LLMAutoscalingandLoadBalancing
A compare subcommand has been added to GenAI-Perf to allow comparison across multiple runs.
Multi-LoRA and multi-model support in GenAI-Perf
Custom visualizations in GenAI-Perf
A fixed request count can now be requested from Perf Analyzer
Ensemble top-level response caching support in Perf Analyzer
Added --enable-peer-access to control whether Triton tries to enable GPU peer access on startup. Default is TRUE.
Python models in default mode may send their responses using the InferenceResponseSender, similarly to models in decoupled mode.
Addressed an issue where Triton would cease processing gRPC requests after receiving multiple cancellation requests.

Known Issues

Starting in 24.06, if you use Triton's iGPU container you might encounter this error message when loading TensorRT models built with the 24.06 TensorRT iGPU container: Serialization (Serialization assertion stdVersionRead == serializationVersion failed.Version tag does not match. Note: Current Version: 236, Serialized Engine Version: 237). If this happens you can rebuild your iGPU models with the 24.04 TensorRT iGPU container and then run them in the Triton 24.06 iGPU container.
Triton TensorRT-LLM Backend container image uses TensorRT-LLM version 0.9.0 and is built from nvcr.io/nvidia/pytorch:24.03-py3.
When using Python models in decoupled mode, users need to ensure that the ResponseSender goes out of scope or is properly cleaned up before unloading the model to guarantee that the unloading process executes correctly.
Restart support was temporarily removed for Python models.
TensorRT v10 does not support implicit batching. As a result, Triton no longer supports TensorRT models with implicit batch dimensions.
Since TensorRT v10 no longer supports implicit batch, Tritonserver will not be able to load existing TF-TRT models that use implicit batch. Therefore, we need to build TF-TRT models with dynamic batch support.
Multiple model configuration files are not supported by loading models with file override. Users still need to provide the model configuration by setting parameter "config" : "<JSON>" instead of custom configuration file "file:configs/<model-config-name>.pbtxt" : "<base64-encoded-file-content>"
Perf Analyzer no longer supports --trace-file option.
The TensorRT-LLM backend provides limited support of Triton extensions and features.
The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
The Java CAPI is known to have intermittent segfaults; we’re looking for a root cause.
Some systems which implement malloc() may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc and jemalloc are installed in the Triton container and can be used by specifying the library in LD_PRELOAD. We recommend experimenting with both tcmalloc and jemalloc to determine which one works better for your use case.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names match what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
When cloud storage (AWS, GCS, AZURE) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model’s folder in the temporary directory, which is deleted upon server’s shutdown.
Python backend support for Windows is limited and does not currently support the following features:
- GPU tensors
- CPU and GPU-related metrics
- Custom execution environments
- The model load/unload APIs
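The malloc mitigation in the known issues above comes down to setting LD_PRELOAD before the server starts. The sketch below builds such an environment in Python. The tcmalloc path is an assumption about where the library lives inside the container and should be verified on your image; env_with_preload is a hypothetical helper name.

```python
import os

# Assumed library location inside the Triton container; verify on your image.
TCMALLOC = "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"


def env_with_preload(library, base_env=None):
    """Return a copy of the environment with `library` prepended to LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # Keep any previously preloaded libraries after the new one.
    env["LD_PRELOAD"] = f"{library}:{existing}" if existing else library
    return env


env = env_with_preload(TCMALLOC, base_env={})
print(env["LD_PRELOAD"])
# To launch (not executed here):
# subprocess.run(["tritonserver", "--model-repository=/models"], env=env)
```

Swapping the path for the jemalloc library makes it straightforward to compare both allocators under the same workload.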

Client Libraries and Examples

Ubuntu 22.04 builds of the client libraries and examples are included in this release in the attached v2.47.0_ubuntu22.04.clients.tar.gz file. The SDK is also available as an Ubuntu 22.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.47.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.47.0-win.zip. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.18.0. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2024.0.0.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

Known Issues

In our internal testing, we observed large latency in retrieving inference results from HTTP client. We recommend using gRPC to circumvent this issue.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

Python 3.10.11
CUDA 12.5.0
cuDNN 9.1.0.70
TensorRT 10.0.1.6

Jetson iGPU Support

A release of Triton for IGX is provided in the attached tar file: tritonserver2.47.0-igpu.tgz.

This release supports TensorFlow 2.16.1, TensorRT 8.6.2.3, ONNX Runtime 1.18.0, PyTorch 2.4.0a0+f70bd71a48, and Python 3.10, as well as ensembles.
ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.47.0-py3-none-manylinux2014_aarch64.whl[all]

- Python
Published by mc-nv about 2 years ago

server - Release 2.46.0 corresponding to NGC container 24.05

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

Added a namespace label to metrics when the server is launched with --model-namespacing=true. The label can be used to distinguish metrics from two models with the same name that belong to different namespaces.
Response caching has been extended to top-level requests to ensemble models.
Improved the performance of Python HTTP Client library.
Model repository can now include multiple model configuration files for a given model. The specific model configuration to use can be selected when launching the server with the --model-config-name option.
The INTER_OP_THREAD_COUNT and INTRA_OP_THREAD_COUNT parameters can now be set in config.pbtxt for the PyTorch Backend to control thread counts in PyTorch model execution.
FIL backend is now included in Triton’s ARM-SBSA container image.
Triton’s vLLM Backend now supports deployment of models with multiple LoRA adapters. See this tutorial to learn more.
Triton logging format has been modified. See logging format section for more details.
GenAI-Perf added a new compare subcommand to enable generating visual comparisons of different profile runs
GenAI-Perf can now accept an input file containing a single prompt string to populate input generation.
Refer to the Frameworks Support Matrix for container image versions on which the inference server container is based.
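As an illustration of the PyTorch thread-count parameters mentioned above, the sketch below renders the corresponding config.pbtxt entries from Python. The thread counts are arbitrary example values and render_parameter is a hypothetical helper; the layout follows the usual parameters key/string_value form of the model configuration, so check the PyTorch backend documentation for the authoritative syntax.

```python
def render_parameter(key, value):
    """Render one `parameters` entry in config.pbtxt text format."""
    return (
        "parameters: {\n"
        f'  key: "{key}"\n'
        f'  value: {{ string_value: "{value}" }}\n'
        "}"
    )


# Thread counts below are arbitrary example values, not recommendations.
fragment = "\n".join(
    [
        render_parameter("INTER_OP_THREAD_COUNT", 1),
        render_parameter("INTRA_OP_THREAD_COUNT", 4),
    ]
)
print(fragment)
```

The printed fragment is meant to be appended to an existing config.pbtxt for a PyTorch model.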

Known Issues

Triton TensorRT-LLM Backend container image uses TensorRT-LLM version 0.9.0 and is built from nvcr.io/nvidia/pytorch:24.03-py3.
When using Python models in decoupled mode, users need to ensure that the ResponseSender goes out of scope or is properly cleaned up before unloading the model to guarantee that the unloading process executes correctly.
TensorRT v10 does not support implicit batching. As a result, Triton no longer supports TensorRT models with implicit batch dimensions.
Since TensorRT v10 no longer supports implicit batch, Tritonserver will not be able to load existing TF-TRT models that use implicit batch. Therefore, we need to build TF-TRT models with dynamic batch support.
Multiple model configuration files are not supported by loading models with file override. Users still need to provide the model configuration by setting parameter "config" : "<JSON>" instead of custom configuration file "file:configs/<model-config-name>.pbtxt" : "<base64-encoded-file-content>"
Perf Analyzer no longer supports --trace-file option.
The TensorRT-LLM backend provides limited support of Triton extensions and features.
The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
The Java CAPI is known to have intermittent segfaults; we’re looking for a root cause.
Some systems which implement malloc() may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc and jemalloc are installed in the Triton container and can be used by specifying the library in LD_PRELOAD. We recommend experimenting with both tcmalloc and jemalloc to determine which one works better for your use case.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names match what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
When cloud storage (AWS, GCS, AZURE) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model’s folder in the temporary directory, which is deleted upon server’s shutdown.
Python backend support for Windows is limited and does not currently support the following features:
- GPU tensors
- CPU and GPU-related metrics
- Custom execution environments
- The model load/unload APIs
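For the file-override limitation listed above, the supported route is to pass the configuration inline through the "config" parameter of the model load request. The sketch below builds such a request body in Python without sending it. The model configuration shown is hypothetical, and build_load_request is an illustrative helper; the body is intended for the model repository load endpoint (POST v2/repository/models/<model-name>/load).

```python
import json


def build_load_request(model_config):
    """Build a load-request body that carries the model config inline."""
    # The "config" parameter holds the model configuration as a JSON string,
    # not as a nested JSON object.
    return {"parameters": {"config": json.dumps(model_config)}}


# Hypothetical model configuration, for illustration only.
config = {"name": "my_model", "backend": "onnxruntime", "max_batch_size": 8}
body = build_load_request(config)
print(json.dumps(body, indent=2))
```

Because the configuration travels as a string, the client serializes it once for the parameter value and again when encoding the request body.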

Client Libraries and Examples

Ubuntu 22.04 builds of the client libraries and examples are included in this release in the attached v2.46.0_ubuntu22.04.clients.tar.gz file. The SDK is also available as an Ubuntu 22.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.46.0-sdk-win.zip file.

Windows Support

[!NOTE] There is no Windows release for 24.05, the latest release is 24.03.

Jetson iGPU Support

[!NOTE] Jetpack v5.X.X refers to our Xavier series of Jetson devices. New feature support for these devices ended in our r23.06 release; however, due to a CVE patch, the latest release for this family of devices is included in this release. This family of devices is not compatible with our igpu container releases. Jetpack v6.X.X refers to our Orin series of Jetson devices. Triton is currently publishing monthly release containers for this family of devices, which can be found here with the suffix -igpu.

[!IMPORTANT] For Jetpack v5.1.2 running Triton 23.06 or older, an update has been posted on the 23.06 release page, tritonserver2.35.0-jetpack5.1.2-update-2.tgz, which fixes CVE-2023-31036. See our security bulletin for more details. This updated package also contains a boost filesystem shared library that Triton depends on, in the folder boost_filesystem. This shared library must be added to the dynamic loader path for proper operation.

A release of Triton for IGX is provided in the attached tar file: tritonserver2.46.0-igpu.tgz.

This release supports TensorFlow 2.15.0, TensorRT 8.6.2.3, ONNX Runtime 1.17.3, PyTorch 2.4.0a0+07cecf4, and Python 3.10, as well as ensembles.
ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.46.0-py3-none-manylinux2014_aarch64.whl[all]

- Python
Published by mc-nv about 2 years ago

server - Release 2.45.0 corresponding to NGC container 24.04

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

Beta support for AsyncIO in decoupled mode in Python backend.
Enhancements to server shutdown to take into account both HTTP live connections and inflight inferences.
Python backend shared memory region naming has been updated to use UUIDs. This allows multiple servers to run on the machine without requiring different shared memory region prefixes.
Support retrieving OpenTelemetry trace settings from the gRPC/HTTP endpoints.
Log file and trace file locations can no longer be updated using the gRPC/HTTP endpoints.
Added an iterative scheduling tutorial to demonstrate how to use iterative scheduling with a GPT2 model.
Trace settings API now returns trace_mode and trace_config information.
The TensorRT-LLM container now includes the tensorrt_llm Python package for creating engines.
Added Python Client API docs to the documentation website.
Added metric visualizations to GenAI-Perf.
Added support to Model Analyzer for profiling LLMs with GenAI-Perf
Added the ability to select an output token distribution in GenAI-Perf
Some arguments have been renamed in the latest version of GenAI-Perf.

Known Issues

The TensorRT-LLM container uses nvcr.io/nvidia/pytorch:24.02-py3 as the base image.
Perf Analyzer no longer supports --trace-file option.
The TensorRT-LLM backend provides limited support of Triton extensions and features.
The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
When using decoupled models, there is a possibility that response order as sent from the backend may not match with the order in which these responses are received by the streaming gRPC client. Note that this only applies to responses from different requests. Any responses corresponding to the same request will still be received in their expected order, relative to each other.
The Java CAPI is known to have intermittent segfaults; we’re looking for a root cause.
Some systems which implement malloc() may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc and jemalloc are installed in the Triton container and can be used by specifying the library in LD_PRELOAD. We recommend experimenting with both tcmalloc and jemalloc to determine which one works better for your use case.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names match what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
When cloud storage (AWS, GCS, AZURE) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model’s folder in the temporary directory, which is deleted upon server’s shutdown.
Python backend support for Windows is limited and does not currently support the following features:
- GPU tensors
- CPU and GPU-related metrics
- Custom execution environments
- The model load/unload APIs
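The decoupled-mode ordering issue above only affects responses that belong to different requests, so a client can stay robust by keying responses on the request they belong to instead of relying on global arrival order. The sketch below simulates this with plain tuples. The (request_id, payload) pairs stand in for streamed responses and are an assumption made for illustration; a real client would read the request ID carried by each response.

```python
from collections import defaultdict


def group_by_request(responses):
    """Collect (request_id, payload) pairs per request.

    Arrival order is preserved within each request, which is the only
    ordering the known issue says can be relied upon.
    """
    grouped = defaultdict(list)
    for request_id, payload in responses:
        grouped[request_id].append(payload)
    return dict(grouped)


# Simulated arrival order: responses from requests "a" and "b" interleave.
arrivals = [("b", "b0"), ("a", "a0"), ("a", "a1"), ("b", "b1")]
print(group_by_request(arrivals))
```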

Client Libraries and Examples

Ubuntu 22.04 builds of the client libraries and examples are included in this release in the attached v2.45.0_ubuntu22.04.clients.tar.gz file. The SDK is also available as an Ubuntu 22.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.45.0-sdk-win.zip file.

Windows Support

[!NOTE] There is no Windows release for 24.04, the latest release is 24.03.

Jetson iGPU Support

A release of Triton for IGX is provided in the attached tar file: tritonserver2.45.0-igpu.tgz.

This release supports TensorFlow 2.15.0, TensorRT 8.6.2.3, ONNX Runtime 1.17.3, PyTorch 2.3.0a0+6ddf5cf, and Python 3.10, as well as ensembles.
ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.45.0-py3-none-manylinux2014_aarch64.whl[all]

- Python
Published by mc-nv about 2 years ago

server - Release 2.44.0 corresponding to NGC container 24.03

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

OpenTelemetry context for a trace started on the triton server side is now accessible from the Python Backend.
Python Backend now supports correlation strings in BLS models.
Triton now case-insensitively matches HTTP headers when using the header forwarding feature.
Triton’s backend API now allows users to collect per-response metrics.
Triton now publishes request cancellations in the response statistics.
GenAI-Perf is a new tool that facilitates LLM benchmarking and is currently available as an alpha release.
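To illustrate what case-insensitive header matching means for the header forwarding feature above, the sketch below filters a set of headers against a pattern while ignoring case. The header names and the pattern are hypothetical, and the use of full matching is a simplifying assumption; Triton's own matching semantics may differ, so treat this as a model of the behavior and not as the server's implementation.

```python
import re


def forwarded_headers(headers, pattern):
    """Return the headers whose names match `pattern`, ignoring case."""
    matcher = re.compile(pattern, re.IGNORECASE)
    return {
        name: value
        for name, value in headers.items()
        if matcher.fullmatch(name)  # full match is an assumption of this sketch
    }


# Hypothetical request headers with mixed-case names.
headers = {"X-Tenant-Id": "42", "x-trace-id": "abc", "Accept": "*/*"}
print(forwarded_headers(headers, r"x-.*"))
```

With case-sensitive matching, the lowercase pattern would miss "X-Tenant-Id"; ignoring case selects both custom headers.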

Known Issues

There is a known issue with ONNX Runtime with the TensorRT Execution Provider which causes segmentation faults when attempting to load multiple instances of a model on the same GPU. This issue is being tracked here: https://github.com/microsoft/onnxruntime/issues/20089. As a workaround, users can load models serially and ensure only one model instance per GPU.
TensorRT-LLM backend is installed with Triton 24.01 base container due to incompatibility reasons.
The TensorRT-LLM backend provides limited support of Triton extensions and features.
The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
When using decoupled models, there is a possibility that response order as sent from the backend may not match with the order in which these responses are received by the streaming gRPC client. Note that this only applies to responses from different requests. Any responses corresponding to the same request will still be received in their expected order, relative to each other.
The Java CAPI is known to have intermittent segfaults; we’re looking for a root cause.
Some systems which implement malloc() may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc and jemalloc are installed in the Triton container and can be used by specifying the library in LD_PRELOAD. We recommend experimenting with both tcmalloc and jemalloc to determine which one works better for your use case.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names match what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
When cloud storage (AWS, GCS, AZURE) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model’s folder in the temporary directory, which is deleted upon server’s shutdown.
Python backend support for Windows is limited and does not currently support the following features:
- GPU tensors
- CPU and GPU-related metrics
- Custom execution environments
- The model load/unload APIs

Client Libraries and Examples

Ubuntu 22.04 builds of the client libraries and examples are included in this release in the attached v2.44.0_ubuntu22.04.clients.tar.gz file. The SDK is also available as an Ubuntu 22.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.44.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.44.0-win.zip. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.17.2. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2023.3.0.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

Python 3.8.10
CUDA 12.3.2
cuDNN 9.0.0.312
TensorRT 8.6.1.6

[!IMPORTANT] The 24.03 version of the ONNX Runtime Backend depends on cuDNN 9.0.0.312 while the TensorRT Backend depends on cuDNN 8.9.7.29. This requires the user to ensure the runtime PATH includes paths to the respective cuDNN DLLs for each of the backends to load correctly.
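
As a minimal sketch of the PATH requirement above, the following Python snippet builds a PATH value that lists both cuDNN directories ahead of the existing entries. The two directory names are hypothetical placeholders, not locations defined by this release; substitute the directories where the cuDNN 9.0.0.312 and 8.9.7.29 DLLs are actually installed.

```python
import os


def prepend_to_path(current_path, directories, sep=os.pathsep):
    """Return a PATH string with `directories` first, skipping duplicates."""
    existing = [p for p in current_path.split(sep) if p]
    merged = []
    for entry in list(directories) + existing:
        if entry not in merged:
            merged.append(entry)
    return sep.join(merged)


# Hypothetical install locations (assumptions for illustration only).
CUDNN_9_BIN = r"C:\cudnn-9.0.0.312\bin"
CUDNN_8_BIN = r"C:\cudnn-8.9.7.29\bin"

# ";" is passed explicitly so the example behaves the same on any OS.
new_path = prepend_to_path(r"C:\Windows\System32", [CUDNN_9_BIN, CUDNN_8_BIN], sep=";")
print(new_path)
```

Setting `os.environ["PATH"] = new_path` before starting the server process is one way to apply the result; a system-wide environment variable works as well.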

Jetson iGPU Support

A release of Triton for IGX is provided in the attached tar file: tritonserver2.44.0-igpu.tgz.

This release supports TensorFlow 2.15.0, TensorRT 8.6.2.3, ONNX Runtime 1.17.2, PyTorch 2.3.0a0+40ec155e58, and Python 3.10, as well as ensembles.
ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.44.0-py3-none-manylinux2014_aarch64.whl[all]

- Python
Published by mc-nv over 2 years ago

server - Release 2.43.0 corresponding to NGC container 24.02

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

Added base Python backend functionality for Windows.
Removed the Wait/Read(avg) and Overhead metrics for gRPC in the Trace Summary Tool to avoid displaying inaccurate readings.

OpenTelemetry trace mode switched to Batch Span Processor, which batches completed spans and sends them in bulk. This processor supports both size and time based batching. Size-based batching is controlled by 2 parameters: bsp_max_export_batch_size and bsp_max_queue_size, while time-based batching is controlled by bsp_schedule_delay.
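
To illustrate how the three parameters interact, here is a simplified Python model of size- and time-based batching. It is a toy simulation written for these notes, not the OpenTelemetry SDK or Triton implementation; the class name, the explicit `tick` call standing in for a background worker, and the millisecond unit for the delay are all assumptions.

```python
from collections import deque


class BatchSpanBuffer:
    """Toy model of a batch span processor (illustrative only)."""

    def __init__(self, bsp_max_queue_size, bsp_max_export_batch_size, bsp_schedule_delay_ms):
        self.max_queue_size = bsp_max_queue_size
        self.max_export_batch_size = bsp_max_export_batch_size
        self.schedule_delay_ms = bsp_schedule_delay_ms
        self.queue = deque()
        self.last_export_ms = 0
        self.exported = []  # each element is one exported batch
        self.dropped = 0

    def on_span_end(self, span):
        # Completed spans are queued; when the queue is full they are dropped.
        if len(self.queue) >= self.max_queue_size:
            self.dropped += 1
            return
        self.queue.append(span)

    def tick(self, now_ms):
        # Stand-in for the background worker: export when a full batch is
        # waiting (size-based) or when the delay has elapsed (time-based).
        size_ready = len(self.queue) >= self.max_export_batch_size
        time_ready = bool(self.queue) and now_ms - self.last_export_ms >= self.schedule_delay_ms
        if size_ready or time_ready:
            batch = []
            while self.queue and len(batch) < self.max_export_batch_size:
                batch.append(self.queue.popleft())
            self.exported.append(batch)
            self.last_export_ms = now_ms


buf = BatchSpanBuffer(bsp_max_queue_size=4, bsp_max_export_batch_size=2, bsp_schedule_delay_ms=5000)
for i in range(5):
    buf.on_span_end(f"span-{i}")  # the fifth span finds the queue full
buf.tick(now_ms=1)      # size-based export of two spans
buf.tick(now_ms=2)      # size-based export of the next two
buf.on_span_end("span-5")
buf.tick(now_ms=100)    # neither condition met, nothing exported
buf.tick(now_ms=5002)   # delay elapsed, partial batch exported
print(buf.exported, buf.dropped)
```

In this model, a larger queue size tolerates bursts between exports, while the schedule delay bounds how long a lone span waits before it is sent.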

Refer to the Frameworks Support Matrix for container image versions on which the inference server container is based.

Known Issues

The ONNX Runtime backend is not included with the 24.02 release due to incompatibility reasons. However, the iGPU and Windows build assets are shipped with the ONNX Runtime backend.
The TensorRT-LLM backend is installed with the Triton 24.01 base container due to incompatibility reasons.
The TensorRT-LLM backend provides limited support of Triton extensions and features.
The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
When using decoupled models, there is a possibility that response order as sent from the backend may not match with the order in which these responses are received by the streaming gRPC client. Note that this only applies to responses from different requests. Any responses corresponding to the same request will still be received in their expected order, relative to each other.
The Java CAPI is known to have intermittent segfaults; we are looking for a root cause.
Some systems which implement malloc() may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc and jemalloc are installed in the Triton container and can be used by specifying the library in LD_PRELOAD. We recommend experimenting with both tcmalloc and jemalloc to determine which one works better for your use case.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names match what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
When cloud storage (AWS, GCS, AZURE) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model’s folder in the temporary directory, which is deleted upon server’s shutdown.
Python backend support for Windows is limited and does not currently support the following features:
- GPU tensors
- CPU and GPU-related metrics
- Custom execution environments
- The model load/unload APIs
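
The malloc mitigation mentioned in the known issues above can be sketched in Python as follows. The helper builds an environment in which an alternative allocator is listed first in LD_PRELOAD. The library path and the model repository path are assumptions for illustration; check where tcmalloc or jemalloc is actually installed in your container.

```python
import os


def env_with_preload(library_path, base_env=None):
    """Return a copy of the environment with `library_path` first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = f"{library_path}:{existing}" if existing else library_path
    return env


# Assumed library location; verify it inside the container before use.
TCMALLOC = "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"

env = env_with_preload(TCMALLOC, base_env={"PATH": "/usr/bin"})
print(env["LD_PRELOAD"])

# Launching the server with this environment would look like this
# (not executed here; /models is a placeholder repository path):
# import subprocess
# subprocess.run(["tritonserver", "--model-repository=/models"], env=env)
```

Running the same workload once with each allocator and comparing resident memory over time is a reasonable way to follow the recommendation to try both.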

Client Libraries and Examples

Ubuntu 22.04 builds of the client libraries and examples are included in this release in the attached v2.43.0_ubuntu22.04.clients.tar.gz file. The SDK is also available as an Ubuntu 22.04-based NGC container. The SDK container includes the client libraries and examples, Performance Analyzer, and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.43.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.43.0-win.zip. This is a beta release, so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNXRuntime backend. The ONNX Runtime version is 1.16.3. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2023.3.0.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

Python 3.8.10
CUDA 12.3.2
cuDNN 8.9.7.29
TensorRT 8.6.1.6

Jetson iGPU Support

A release of Triton for IGX is provided in the attached tar file: tritonserver2.43.0-igpu.tgz.

This release supports TensorFlow 2.15.0, TensorRT 8.6.2.3, ONNX Runtime 1.16.3, PyTorch 2.3.0a0+ebedce2, and Python 3.10, as well as ensembles.
ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.43.0-py3-none-manylinux2014_aarch64.whl[all]

- Python
Published by mc-nv over 2 years ago

server - Release 2.42.0 corresponding to NGC container 24.01

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

Added the Triton Python API for in-process integration in Python environments.
Added a command-line option to retry loading a failed model up to a specified number of attempts.
Added support for Context Propagation in OpenTelemetry trace mode.
Added Triton pinned memory pool usage in reporting metrics.
Improved error responses from the HTTP endpoint: HTTP status codes other than 400 may now be returned to align with the error type.
Added experimental support for serving PyTorch 2.0 models.
The FasterTransformer backend has been deprecated as of 24.01 and will no longer be supported or released with this and future versions of Triton.
The Model Analyzer now correctly loads and optimizes ensemble models.
The Model Analyzer now handles the case of optimizing a model on a remote Triton server without requiring a local GPU.
Refer to the Frameworks Support Matrix for container image versions on which the inference server container is based.

Known Issues

The TensorRT-LLM backend provides limited support of Triton extensions and features.
The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
When using decoupled models, there is a possibility that response order as sent from the backend may not match with the order in which these responses are received by the streaming gRPC client. Note that this only applies to responses from different requests. Any responses corresponding to the same request will still be received in their expected order, relative to each other.
The Java CAPI is known to have intermittent segfaults; we are looking for a root cause.
Some systems which implement malloc() may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc and jemalloc are installed in the Triton container and can be used by specifying the library in LD_PRELOAD. We recommend experimenting with both tcmalloc and jemalloc to determine which one works better for your use case.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names match what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
When cloud storage (AWS, GCS, AZURE) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model’s folder in the temporary directory, which is deleted upon server’s shutdown.
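
The decoupled-model ordering issue noted above suggests that a streaming client should key responses on their request ID rather than rely on global arrival order. The Python sketch below uses simulated `(request_id, payload)` pairs; how the ID is read from a real response depends on the client library and is not shown here.

```python
from collections import defaultdict


def group_by_request(responses):
    """Group (request_id, payload) pairs by request, preserving arrival order."""
    grouped = defaultdict(list)
    for request_id, payload in responses:
        grouped[request_id].append(payload)
    return dict(grouped)


# Simulated arrival order: responses from requests "a" and "b" interleave,
# but the responses belonging to one request stay in order relative to
# each other, as the known issue describes.
arrived = [("a", 0), ("b", 0), ("b", 1), ("a", 1), ("a", 2)]
per_request = group_by_request(arrived)
print(per_request)
```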

Client Libraries and Examples

Ubuntu 22.04 builds of the client libraries and examples are included in this release in the attached v2.42.0_ubuntu22.04.clients.tar.gz file. The SDK is also available as an Ubuntu 22.04-based NGC container. The SDK container includes the client libraries and examples, Performance Analyzer, and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.42.0-sdk-win.zip file.

Windows Support

[!NOTE] There is no Windows release for 23.12; the latest release is 23.11.

Jetson iGPU Support

A release of Triton for IGX is provided in the attached tar file: tritonserver2.42.0-igpu.tgz.

This release supports TensorFlow 2.14.0, TensorRT 8.6.2.3, ONNX Runtime 1.16.3, PyTorch 2.2.0a0+81ea7a4, and Python 3.10, as well as ensembles.
ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.42.0-py3-none-manylinux2014_aarch64.whl[all]

- Python
Published by mc-nv over 2 years ago

server - Release 2.41.0 corresponding to NGC container 23.12

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

Added metrics support to TRTLLM backend when running within Triton.
The request ID is now included in OpenTelemetry traces.
For Jetson devices which support Jetpack 6.0 and above, Triton now publishes containers, based on the latest version of Jetpack, on NGC with the suffix -igpu. These containers are:
- XX.YY-py3-igpu - much like the XX.YY-py3 container, this contains tritonserver and all supported backends for Jetson devices.
- XX.YY-py3-sdk-igpu - much like the XX.YY-py3-sdk container, this contains the Tritonclient and Triton Tools supported on Jetson devices.
Refer to the 23.12 column of the Frameworks Support Matrix for container image versions on which the 23.12 inference server container is based.

Known Issues

Reuse-grpc-port and reuse-http-port are now properly parsed as booleans. 0 and 1 will continue to work as values. Any other integers will throw an error.
The TensorRT-LLM backend provides limited support of Triton extensions and features.
The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
When using decoupled models, there is a possibility that response order as sent from the backend may not match with the order in which these responses are received by the streaming gRPC client. Note that this only applies to responses from different requests. Any responses corresponding to the same request will still be received in their expected order, relative to each other.
The FasterTransformer backend is only officially supported for 22.12, though it can be built for Triton container versions up to 23.07.
The Java CAPI is known to have intermittent segfaults; we are looking for a root cause.
Some systems which implement malloc() may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc and jemalloc are installed in the Triton container and can be used by specifying the library in LD_PRELOAD. We recommend experimenting with both tcmalloc and jemalloc to determine which one works better for your use case.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names match what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
When cloud storage (AWS, GCS, AZURE) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model’s folder in the temporary directory, which is deleted upon server’s shutdown.
Model Analyzer is not able to analyze and optimize ensemble model configs due to a bug in the way composing models are loaded.
Model Analyzer does not work with SSL via gRPC.
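
The boolean parsing behavior described above for reuse-grpc-port and reuse-http-port can be illustrated with a small Python sketch. This is not Triton's parser. The source only states that 0 and 1 keep working and that other integers are rejected; accepting the words "true" and "false" here is an assumption.

```python
def parse_bool_option(value):
    """Parse a boolean option value; reject anything that is not boolean-like."""
    text = str(value).strip().lower()
    if text in ("1", "true"):   # "true" is an assumed accepted spelling
        return True
    if text in ("0", "false"):  # "false" is an assumed accepted spelling
        return False
    raise ValueError(f"invalid boolean option value: {value!r}")


print(parse_bool_option("1"), parse_bool_option("0"))

try:
    parse_bool_option("2")
except ValueError as err:
    print(err)
```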

Client Libraries and Examples

Ubuntu 22.04 builds of the client libraries and examples are included in this release in the attached v2.41.0_ubuntu22.04.clients.tar.gz file. The SDK is also available as an Ubuntu 22.04-based NGC container. The SDK container includes the client libraries and examples, Performance Analyzer, and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.41.0-sdk-win.zip file.

Windows Support

[!NOTE] There is no Windows release for 23.12; the latest release is 23.11.

Jetson iGPU Support

[!IMPORTANT] For Jetpack v5.1.2 running Triton 23.06 or older, an update has been posted on the 23.06 release page, tritonserver2.35.0-jetpack5.1.2-update-1.tgz, which fixes CVE-2023-31036. See our security bulletin for more details.

A release of Triton for IGX is provided in the attached tar file: tritonserver2.41.0-igpu.tgz.

This release supports TensorFlow 2.14.0, TensorRT 8.6.2.3, ONNX Runtime 1.16.3, PyTorch 2.2.0a0+81ea7a4, and Python 3.10, as well as ensembles.
ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.41.0-py3-none-manylinux2014_aarch64.whl[all]

- Python
Published by mc-nv over 2 years ago

server - Release 2.40.0 corresponding to NGC container 23.11

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

Starting with the 23.11 release, Triton containers supporting iGPU architectures are published, and run on Jetson devices. Please refer to the Frameworks Support Matrix for information regarding which iGPU hardware/software is supported by which container.
Implicit state management has been enhanced to support growing buffers and use a single buffer for both input and output states.
Sequence batcher has been enhanced to support iterative scheduling.
The backend API has been enhanced to support rescheduling a request. Currently, only the Python backend and custom C++ backends support request rescheduling.
TRT-LLM backend now supports request cancellation.
Configuration of a vLLM backend model can now be auto-completed by Triton. The user just needs to pass backend: "vllm" to leverage the auto-complete feature.
Python backend now supports parameters in BLS requests.
Python backend GPU tensor support has been improved to provide better performance.
A new tutorial demonstrating how to deploy LLaMa2 using TRT-LLM has been added.
The HTTP endpoint has been enhanced to support access restriction.
Secure Deployment Guide has been added to provide guidance on deploying Triton securely.
The client model loading API no longer allows uploading files outside the model repository.
DCGM version has been upgraded to 3.2.6.
The Kubernetes Deploy example now supports Kubernetes’ new StartupProbe to allow Triton pods time to finish startup before running health probes.
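
The restriction above on uploading files outside the model repository is, in essence, a path containment check. The Python sketch below shows one common way to express such a check. It is an illustration of the idea only, not Triton's implementation; the function name and example paths are hypothetical, and the check does not resolve symbolic links.

```python
import posixpath


def is_inside_repository(repo_root, requested_path):
    """Return True if `requested_path` stays within `repo_root` after normalization."""
    root = posixpath.normpath(repo_root)
    # join() discards the root if requested_path is absolute; normpath()
    # collapses "." and ".." segments so traversal attempts become visible.
    target = posixpath.normpath(posixpath.join(root, requested_path))
    return target == root or target.startswith(root + "/")


print(is_inside_repository("/models", "resnet/1/model.onnx"))
print(is_inside_repository("/models", "../etc/passwd"))
print(is_inside_repository("/models", "/etc/passwd"))
```

The trailing "/" in the prefix comparison matters: without it, a sibling directory such as `/models_backup` would be treated as inside `/models`.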

Known Issues

When using the generate streaming endpoint, Triton will segfault if the client closes the connection before all responses have been generated. The fix will be available in the next release.
Reuse-grpc-port and reuse-http-port are now properly parsed as booleans. 0 and 1 will continue to work as values. Any other integers will throw an error.
The TensorRT-LLM backend provides limited support of Triton extensions and features.
The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
When using decoupled models, there is a possibility that response order as sent from the backend may not match with the order in which these responses are received by the streaming gRPC client. Note that this only applies to responses from different requests. Any responses corresponding to the same request will still be received in their expected order, relative to each other.
The FasterTransformer backend is only officially supported for 22.12, though it can be built for Triton container versions up to 23.07.
The Java CAPI is known to have intermittent segfaults; we are looking for a root cause.
Some systems which implement malloc() may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc and jemalloc are installed in the Triton container and can be used by specifying the library in LD_PRELOAD. We recommend experimenting with both tcmalloc and jemalloc to determine which one works better for your use case.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names match what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
When cloud storage (AWS, GCS, AZURE) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model’s folder in the temporary directory, which is deleted upon server’s shutdown.

Client Libraries and Examples

Ubuntu 22.04 builds of the client libraries and examples are included in this release in the attached v2.40.0_ubuntu22.04.clients.tar.gz file. The SDK is also available as an Ubuntu 22.04-based NGC container. The SDK container includes the client libraries and examples, Performance Analyzer, and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.40.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.40.0-win.zip. This is a beta release, so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNXRuntime backend. The ONNXRuntime version is 1.16.3. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2023.0.0.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

CUDA 12.3.0
cuDNN 8.9.6.50
TensorRT 8.6.1.6

Jetson iGPU Support

A release of Triton for IGX is provided in the attached tar file: tritonserver2.40.0-igpu.tgz.

This release supports TensorFlow 2.14.0, TensorRT 8.6.2.3, ONNX Runtime 1.16.3, PyTorch 2.2.0a0+6a974be, and Python 3.10, as well as ensembles.
ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.40.0-py3-none-manylinux2014_aarch64.whl[all]

- Python
Published by mc-nv over 2 years ago

server - Release 2.39.0 corresponding to NGC container 23.10

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

Triton now supports the TensorRT-LLM backend. This backend uses NVIDIA TensorRT-LLM, which replaces the FasterTransformer backend. A new container with the TensorRT-LLM backend is available on NGC for 23.10.
Added support for handling client-side request cancellation in Triton server and backends. (server docs, client docs).
Triton can deploy supported models on the vLLM engine using the new vLLM backend. A new container with vLLM backend is available on NGC for 23.10.
Added Generate extension (beta) which provides better REST APIs for inference on Large Language Models.
Added new tutorials in the tutorial repo on how to run vLLM with the new REST API, how to run Llama2 with the TensorRT-LLM backend, and how to run HuggingFace models.
Added support for scalar I/O in the ONNX Runtime backend.
Added support for writing custom backends in Python, a.k.a. Python-based backends.
Refer to the 23.10 column of the Frameworks Support Matrix for container image versions on which the 23.10 inference server container is based.
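
As a rough illustration of the Generate extension listed above, the Python sketch below builds (but does not send) a request to the `generate` endpoint using only the standard library. The server address, the model name `my_llm`, and the input field names `text_input` and `max_tokens` are assumptions; the fields a real model accepts depend on its configuration.

```python
import json
import urllib.request


def build_generate_request(base_url, model_name, inputs, stream=False):
    """Build a POST request for the generate (or generate_stream) endpoint."""
    endpoint = "generate_stream" if stream else "generate"
    url = f"{base_url.rstrip('/')}/v2/models/{model_name}/{endpoint}"
    body = json.dumps(inputs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical server address, model name, and input fields.
request = build_generate_request(
    "http://localhost:8000", "my_llm", {"text_input": "Hello", "max_tokens": 16}
)
print(request.get_method(), request.full_url)

# Sending it requires a running server (not executed here):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```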

Known Issues

For its initial release, the TensorRT-LLM backend provides limited support of Triton extensions and features.
The TensorRT-LLM backend may core dump on server shutdown. This impacts server teardown only and will not impact inferencing.
When a model uses a backend which is not found, Triton would reference the missing backend as "backend_name/model.py" in the error message. This is already fixed for future releases.
When using decoupled models, there is a possibility that response order as sent from the backend may not match with the order in which these responses are received by the streaming gRPC client. Note that this only applies to responses from different requests. Any responses corresponding to the same request will still be received in their expected order, relative to each other.
The FasterTransformer backend is only officially supported for 22.12, though it can be built for Triton container versions up to 23.07.
The Java CAPI is known to have intermittent segfaults; we are looking for a root cause.
Some systems which implement malloc() may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc and jemalloc are installed in the Triton container and can be used by specifying the library in LD_PRELOAD. We recommend experimenting with both tcmalloc and jemalloc to determine which one works better for your use case.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names match what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
When cloud storage (AWS, GCS, AZURE) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model’s folder in the temporary directory, which is deleted upon server’s shutdown.
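
As a minimal sketch of the malloc mitigation described in the list above, the helper below builds an environment with an allocator library placed first in LD_PRELOAD before the server is launched. The library path and the model repository path are assumptions for illustration only; check the actual locations inside your Triton container.

```python
import os


def with_preload(library_path, env=None):
    """Return a copy of env with library_path placed first in LD_PRELOAD."""
    env = dict(os.environ if env is None else env)
    existing = env.get("LD_PRELOAD", "")
    # LD_PRELOAD accepts a colon-separated list; keep anything already set.
    env["LD_PRELOAD"] = f"{library_path}:{existing}" if existing else library_path
    return env


# Assumed library location; verify it inside your Triton container.
TCMALLOC = "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"

launch_env = with_preload(TCMALLOC, env={"PATH": "/usr/bin"})

# To start the server with this environment (repository path is a placeholder):
# import subprocess
# subprocess.run(["tritonserver", "--model-repository=/models"], env=launch_env)
```

Swapping the path for a jemalloc library gives the second configuration to compare against.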

Client Libraries and Examples

Ubuntu 22.04 builds of the client libraries and examples are included in this release in the attached v2.39.0_ubuntu22.04.clients.tar.gz file. The SDK is also available as an Ubuntu 22.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.39.0-sdk-win.zip file.

Windows Support

Note: There is no Windows release for 23.10; the latest release is 23.09.

Jetson Jetpack Support

Note: There is no Jetpack release for 23.08; the latest release is 23.06.

Published by mc-nv over 2 years ago

server - Release 2.38.0 corresponding to NGC container 23.09

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

Triton now has Python bindings for the C API. Please refer to this PR for usage.
Triton now forwards request parameters to each of the composing models of an ensemble model.
The Filesystem API now supports named temporary cache directories when downloading models using the repository agent.
Added the number of requests currently in the queue to the metrics API. Documentation can be found here.
Python backend models can now respond with error codes in addition to error messages.
TensorRT backend now supports TensorRT version compatibility across models generated with the same major version of TensorRT. Use the --backend-config=tensorrt,--version-compatible=true flag to enable this feature.
Triton’s backend API now supports accessing the inference response outputs by name or by index. See the new API here.
The Python backend now supports loading PyTorch models directly. This feature is experimental and should be treated as Beta.
Fixed an issue where if the user didn't call SetResponseReleaseCallback, canceling a new request could cancel the old response factory as well. Now when canceling a request which is being re-used, a new response factory is created for each inference.
Refer to the 23.09 column of the Frameworks Support Matrix for container image versions on which the 23.09 inference server container is based.
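
To illustrate how the new queue metric from the list above might be consumed, here is a minimal sketch that extracts per-model pending request counts from Prometheus-format metrics text. The metric name `nv_inference_pending_request_count`, the label names, and the sample text are assumptions for illustration; confirm them against the metrics documentation for your release.

```python
import re

# One Prometheus sample line: name, optional {labels}, then a value.
SAMPLE_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def pending_by_model(text, metric="nv_inference_pending_request_count"):
    """Map each model label to the value reported for the given metric."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        result[labels.get("model", "")] = float(match.group("value"))
    return result


# Hypothetical metrics text, shaped like a Prometheus scrape.
SAMPLE = """
# TYPE nv_inference_pending_request_count gauge
nv_inference_pending_request_count{model="resnet",version="1"} 3
nv_inference_pending_request_count{model="bert",version="1"} 0
nv_inference_count{model="resnet",version="1"} 10
"""

pending = pending_by_model(SAMPLE)
```

In a real deployment the text would come from the server's metrics endpoint rather than a string literal.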

Known Issues

When using decoupled models, there is a possibility that response order as sent from the backend may not match with the order in which these responses are received by the streaming gRPC client. Note that this only applies to responses from different requests. Any responses corresponding to the same request will still be received in their expected order, relative to each other.
The FasterTransformer backend is only officially supported for 22.12, though it can be built for Triton container versions up to 23.07.
The Java CAPI is known to have intermittent segfaults; we’re looking for a root cause.
Some systems which implement malloc() may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc and jemalloc are installed in the Triton container and can be used by specifying the library in LD_PRELOAD. We recommend experimenting with both tcmalloc and jemalloc to determine which one works better for your use case.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names matches what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
When cloud storage (AWS, GCS, AZURE) is used as a model repository and a model has multiple versions, Triton creates an extra local copy of the cloud model’s folder in the temporary directory, which is deleted upon server’s shutdown.
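
The decoupled-model ordering issue in the list above only affects responses from different requests. A client that needs a per-request view can group streamed responses by request ID, as in the sketch below. The `(request_id, payload)` pairs and the arrival sequence are hypothetical stand-ins for whatever the streaming callback actually delivers.

```python
from collections import defaultdict


def group_by_request(stream):
    """Group (request_id, payload) pairs by request, keeping arrival order."""
    grouped = defaultdict(list)
    for request_id, payload in stream:
        grouped[request_id].append(payload)
    return dict(grouped)


# Hypothetical arrival order: requests "a" and "b" are interleaved, but the
# responses belonging to each request still arrive in their own order.
arrivals = [("b", 0), ("a", 0), ("a", 1), ("b", 1), ("a", 2)]
per_request = group_by_request(arrivals)
```

Because responses for the same request keep their relative order, each grouped list is already in sequence and needs no sorting.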

Client Libraries and Examples

Ubuntu 22.04 builds of the client libraries and examples are included in this release in the attached v2.38.0_ubuntu2204.clients.tar.gz file. The SDK is also available as an Ubuntu 22.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.38.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.37.0-win.zip. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNXRuntime backend. The ONNXRuntime version is 1.15.1. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2023.0.0.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

CUDA 12.2.0
cuDNN 8.9.4.25
TensorRT 8.6.1.6

Jetson Jetpack Support

Note: There is no Jetpack release for 23.08; the latest release is 23.06.

Published by mc-nv almost 3 years ago

server - Release 2.37.0 corresponding to NGC container 23.08

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

Triton can load model instances in parallel for supporting backends. See TRITONBACKEND_BackendAttributeSetParallelModelInstanceLoading for more details. As of 23.08, only python and onnxruntime backends support loading model instances in parallel.
Python backend models can capture trace for composing child models when executing BLS requests.
Triton OpenTelemetry Tracing exposes resource settings which can be used to configure the service name and version.
Python backend supports directly loading and serving PyTorch models with torch.compile().
Exposed preserve_ordering field to oldest strategy sequence batcher. The default behavior of the oldest strategy sequence batcher to preserve response order across the independent requests belonging to different sequences is changed from True to False. Note: This setting does not impact order of responses within a sequence.
Refer to the 23.08 column of the Frameworks Support Matrix for container image versions on which the 23.08 inference server container is based.

Known Issues

Triton uses an OpenTelemetry CPP library version that can cause Triton to crash when OpenTelemetry’s exporter times out.
When using decoupled models, there is a possibility that response order as sent from the backend may not match with the order in which these responses are received by the streaming gRPC client.
The "fastertransformer_backend" is only officially supported for 22.12, though it can be built for Triton container versions up to 23.07.
The Java CAPI is known to have intermittent segfaults; we’re looking for a root cause.
Some systems which implement malloc() may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. tcmalloc and jemalloc are installed in the Triton container and can be used by specifying the library in LD_PRELOAD. We recommend experimenting with both tcmalloc and jemalloc to determine which one works better for your use case.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names matches what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.

Client Libraries and Examples

Ubuntu 22.04 builds of the client libraries and examples are included in this release in the attached v2.37.0_ubuntu2204.clients.tar.gz file. The SDK is also available as an Ubuntu 22.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.37.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.37.0-win.zip. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNXRuntime backend. The ONNXRuntime version is 1.15.1. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2023.0.0.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

CUDA 12.2.0
cuDNN 8.9.4.25
TensorRT 8.6.1.6

Jetson Jetpack Support

Note: There is no Jetpack release for 23.08; the latest release is 23.06.

Published by tanmayv25 almost 3 years ago

server - Release 2.36.0 corresponding to NGC container 23.07

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

"pytorch_backend" supports implicit state management.
"python_backend" supports direct serving of TensorFlow SavedModel.
"python_backend" supports unpacked Conda execution environment.
"python_backend" added the model loading APIs for BLS usage.
Triton OpenTelemetry trace mode supports ensemble model tracing.
Triton Python client supports DLPack tensors in CUDA shared memory utilities.
Triton supports the S3 model repository that contains more than 1000 files.
Added Java binding of the Triton in-process C++ API.
Refer to the 23.07 column of the Frameworks Support Matrix for container image versions on which the 23.07 inference server container is based.

Known Issues

The "fastertransformer_backend" build only works with Triton 23.04 and older releases.
Some systems which implement malloc() may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. tcmalloc and jemalloc are installed in the Triton container and can be used by specifying the library in LD_PRELOAD. We recommend experimenting with both tcmalloc and jemalloc to determine which one works better for your use case.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names matches what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.

Client Libraries and Examples

Ubuntu 22.04 builds of the client libraries and examples are included in this release in the attached v2.36.0_ubuntu2204.clients.tar.gz file. The SDK is also available as an Ubuntu 22.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.36.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.36.0-win.zip. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNXRuntime backend. The ONNXRuntime version is 1.15.1. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2023.0.0.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

CUDA 12.1.1
cuDNN 8.9.3.28
TensorRT 8.6.1.6

Jetson Jetpack Support

Note: There is no Jetpack release for 23.07; the latest release is 23.06.

Published by GuanLuo almost 3 years ago

server - Release 2.35.0 corresponding to NGC container 23.06

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

New Features and Improvements

Support for KIND_MODEL instance type has been extended to the PyTorch backend.
The gRPC clients can now indicate whether they want to receive the flags associated with each response. This can help the clients to programmatically determine when all the responses for a given request have been received on the client side for decoupled models.
Added beta support for using Redis as a cache for inference requests.
The statistics extension now includes the memory usage of the loaded models. This statistic is currently implemented only for the TensorRT and ONNXRuntime backends.
Added support for batch inputs in ragged batching for PyTorch backend.
Added serial sequences mode for Perf Analyzer.
Refer to the 23.06 column of the Frameworks Support Matrix for container image versions on which the 23.06 inference server container is based.
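
The response flags mentioned in the list above let a client tell when a decoupled request has finished. The sketch below shows the bookkeeping only, assuming each streamed response has already been reduced to a hypothetical `(request_id, is_final)` pair; how the flag is read from an actual gRPC response is not shown here.

```python
def completed_requests(responses):
    """Return IDs of requests whose final response has been seen.

    responses is an iterable of (request_id, is_final) pairs in arrival
    order; IDs are returned in the order their final response arrived.
    """
    done = []
    for request_id, is_final in responses:
        if is_final and request_id not in done:
            done.append(request_id)
    return done


# Hypothetical stream: request "a" has finished, request "b" is still open.
arrivals = [("a", False), ("b", False), ("a", True), ("b", False)]
finished = completed_requests(arrivals)
```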

Known Issues

The FasterTransformer backend build only works with Triton 23.04 and older releases.
TensorFlow backend no longer supports TensorFlow version 1.
OpenVINO 2022.1 is used in the OpenVINO backend and the OpenVINO execution provider for the Onnxruntime Backend. OpenVINO 2022.1 is not officially supported on Ubuntu 22.04 and should be treated as beta.
Some systems which implement malloc() may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc and jemalloc are installed in the Triton container and can be used by specifying the library in LD_PRELOAD. We recommend experimenting with both tcmalloc and jemalloc to determine which one works better for your use case.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names matches what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to pytorch/pytorch#66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.

Client Libraries and Examples

Ubuntu 22.04 builds of the client libraries and examples are included in this release in the attached v2.35.0_ubuntu2204.clients.tar.gz file. The SDK is also available as an Ubuntu 22.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.35.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.35.0-win.zip. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNXRuntime backend. The ONNXRuntime version is 1.15.0. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.4.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

CUDA 12.1.1
cuDNN 8.9.2.26
TensorRT 8.6.1.6

Jetson Jetpack Support

A release of Triton for JetPack is provided in the attached tar file: tritonserver2.35.0-jetpack5.1.2.tgz.

This release supports TensorFlow 2.12.0, TensorRT 8.5.2.2, Onnx Runtime 1.15.0, PyTorch 2.1.0a0+41361538, Python 3.8, as well as ensembles.
ONNXRuntime backend does not support the OpenVino and TensorRT execution providers. The CUDA execution provider is in Beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.35.0-py3-none-manylinux2014_aarch64.whl[all]

Published by Tabrizian about 3 years ago

server - Release 2.34.0 corresponding to NGC container 23.05

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New in 2.34.0

Python backend supports Custom Metrics allowing users to define and report counters and gauges similar to the C API.
Python Triton Client defines the Triton Client Plugin API allowing users to register custom plugins to add or modify request headers. This feature is in beta and is subject to change in future releases.
Improved performance of model instance creation/removal. When the model instance group is the only model configuration change, Triton will update the model with the number of instances needed rather than reloading the model. This feature is limited to non-sequence models only. Read more about this feature here in bullet point four.
Added new command line option --metrics-address=<address> allowing the metrics server to bind to a different address than the default 0.0.0.0.
Reduced the default number of model load threads from 2*(number of CPU cores) to 4. This eliminates Triton hitting resource limits on systems with large CPU core counts. Use the --model-load-thread-count command line option to change this default.
Added support for DLPack Python specification in Python backend.
Refer to the 23.05 column of the Frameworks Support Matrix for container image versions on which the 23.05 inference server container is based.
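
As a rough illustration of the model load thread change and the new command line options in the list above, the sketch below compares the old and new defaults and assembles a server command line. The helper names, the repository path, and the chosen values are hypothetical; only the `--model-load-thread-count` and `--metrics-address` options come from these notes.

```python
NEW_DEFAULT_LOAD_THREADS = 4  # default stated for 2.34.0


def old_default_load_threads(cpu_cores):
    """Previous default: two load threads per CPU core."""
    return 2 * cpu_cores


def server_args(model_repository, load_threads=None, metrics_address=None):
    """Build a tritonserver argument list, adding only the options that are set."""
    args = ["tritonserver", f"--model-repository={model_repository}"]
    if load_threads is not None:
        args.append(f"--model-load-thread-count={load_threads}")
    if metrics_address is not None:
        args.append(f"--metrics-address={metrics_address}")
    return args


# On a hypothetical 64-core host the old default was 128 threads versus 4 now.
old_threads = old_default_load_threads(64)

# Raise the thread count and bind metrics to localhost only.
args = server_args("/models", load_threads=8, metrics_address="127.0.0.1")
```

Leaving both keyword arguments unset keeps the server on its defaults.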

Known Issues

TensorFlow backend no longer supports TensorFlow version 1.
OpenVINO 2022.1 is used in the OpenVINO backend and the OpenVINO execution provider for the Onnxruntime Backend. OpenVINO 2022.1 is not officially supported on Ubuntu 22.04 and should be treated as beta.
Some systems which implement malloc() may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc and jemalloc are installed in the Triton container and can be used by specifying the library in LD_PRELOAD. We recommend experimenting with both tcmalloc and jemalloc to determine which one works better for your use case.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names matches what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273.
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.34.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.34.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.34.0-win.zip. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.15.0. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.4.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

CUDA 12.1.1
cuDNN 8.9.1.23
TensorRT 8.6.1.6

Jetson Jetpack Support

A release of Triton for JetPack is provided in the attached tar file: tritonserver2.34.0-jetpack5.1.tgz.

This release supports TensorFlow 2.12.0, TensorRT 8.5.2.2, Onnx Runtime 1.15.0, PyTorch 2.0.0a0+8aa34602, Python 3.8, as well as ensembles.
Onnx Runtime backend does not support the OpenVino and TensorRT execution providers. The CUDA execution provider is in Beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.34.0-py3-none-manylinux2014_aarch64.whl[all]

Published by mc-nv about 3 years ago

server - Release 2.33.0 corresponding to NGC container 23.04

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New in 2.33.0

Triton can now load models concurrently reducing the server start-up times.
Sequence batcher with direct scheduling strategy now includes experimental support for schedule policy.
Triton’s ragged batching support has been extended to PyTorch backend.
Triton can now forward HTTP/GRPC headers as inference request parameters to the backend.
Triton Python backend’s business logic scripting now allows developers to select a specific device to receive output tensors from a BLS call.
Triton latency metrics can now be obtained as configurable quantiles over a sliding time window using experimental metrics summary support.
Users can now restrict the access of the protocols on a given Triton endpoint.
Triton now provides limited support for tracing inference requests using OpenTelemetry Trace APIs.
Model Analyzer now supports BLS Models.
Refer to the 23.04 column of the Frameworks Support Matrix for container image versions on which the 23.04 inference server container is based.

Known Issues

TensorFlow backend no longer supports TensorFlow version 1.
Triton Inferentia guide is out of date. Some users have reported issues with running Triton on AWS Inferentia instances.
Some systems which implement malloc() may not release memory back to the operating system right away causing a false memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc is installed in the Triton container and can be used by specifying the library in LD_PRELOAD.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names matches what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273.
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA.

The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.

Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU.

Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.

Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
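
The malloc and auto-complete workarounds above can be combined when launching the server. The following is a minimal sketch: the tcmalloc library path is an assumption and differs between images and architectures, and /models is a hypothetical model repository. Only LD_PRELOAD and --disable-auto-complete-config come from the notes above.

```python
import os

# Assumption: where tcmalloc lives inside the container; verify on your image.
TCMALLOC_PATH = "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"


def triton_launch(model_repository, base_env=None):
    """Build argv and environment for a tritonserver launch.

    tcmalloc is prepended to any existing LD_PRELOAD value, and
    auto-complete is disabled, so every model needs a full configuration.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD")
    env["LD_PRELOAD"] = f"{TCMALLOC_PATH}:{existing}" if existing else TCMALLOC_PATH
    argv = [
        "tritonserver",
        f"--model-repository={model_repository}",
        "--disable-auto-complete-config",
    ]
    return argv, env


argv, env = triton_launch("/models", base_env={})
# To actually start the server:
#   import subprocess; subprocess.run(argv, env=env, check=True)
```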

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.33.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04-based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.33.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.33.0-win.zip. This is a beta release, so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.14.1. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.4.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

CUDA 11.8.0
cuDNN 8.8.1.3
TensorRT 8.5.3.1

Jetson Jetpack Support

Note: To build the Jetson target from source, refer to the "r23.04-jetson" branch of "python_backend".

A release of Triton for JetPack is provided in the attached tar file: tritonserver2.33.0-jetpack5.1.tgz.

This release supports TensorFlow 2.12.0, TensorRT 8.5.2.2, ONNX Runtime 1.14.1, PyTorch 2.0.0a0+8aa34602, and Python 3.8, as well as ensembles.
The ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.33.0-py3-none-manylinux2014_aarch64.whl[all]

Published by mc-nv about 3 years ago

server - Release 2.32.0 corresponding to NGC container 23.03

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New in 2.32.0

Added the Parameters Extension, which allows an inference request to provide custom parameters that cannot be provided as inputs. These parameters can be used in the Python backend as described here.
Added support for models that use the decoupled API for Business Logic Scripting (BLS) in the Python backend. Examples can be found here.
The same model name can be used across different repositories if the --model-namespacing flag is set.
Triton’s Response Cache feature has been converted internally to a shared library implementation of the new TRITONCACHE APIs, similar to how backends and repo agents are used today. The default cache implementation is local_cache, which is equivalent to the fixed-size in-memory buffer implementation used before. The --response-cache-byte-size flag will continue to function in the same way, but the --cache-config flag will be the preferred method of cache configuration moving forward. For more information, see the cache documentation here.
Triton’s trace tool now supports tracing for request_id.
Refer to the 23.03 column of the Frameworks Support Matrix for container image versions on which the 23.03 inference server container is based.
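
To make the Parameters Extension item above more concrete, here is a minimal sketch of an HTTP inference request body carrying custom parameters alongside the inputs. The input tensor name, the parameter keys, and the restriction to string, integer and boolean values are assumptions for illustration; consult the extension documentation for the exact rules, including any reserved parameter names.

```python
import json


def build_infer_request(inputs, parameters):
    """Return a request body with a top-level "parameters" object.

    Assumption: custom parameter values are limited to str, int and bool.
    bool is a subclass of int in Python, so the check below covers it.
    """
    for key, value in parameters.items():
        if not isinstance(value, (str, int)):
            raise TypeError(f"unsupported type for parameter {key!r}: {type(value).__name__}")
    return {"inputs": inputs, "parameters": dict(parameters)}


body = build_infer_request(
    # Hypothetical input tensor.
    inputs=[{"name": "INPUT0", "shape": [1, 2], "datatype": "FP32", "data": [1.0, 2.0]}],
    # Hypothetical custom parameters that are not model inputs.
    parameters={"tenant": "team-a", "max_tokens": 16, "debug": False},
)
payload = json.dumps(body)  # would be POSTed to the model's infer endpoint
```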

Known Issues

Support for TensorFlow 1 will be removed starting with 23.04.
The Triton Inferentia guide is out of date. Some users have reported issues with running Triton on AWS Inferentia instances.
Some malloc() implementations may not release memory back to the operating system right away, which can look like a memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc is installed in the Triton container and can be used by specifying the library in LD_PRELOAD.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names match what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Triton Client pip wheels for Arm SBSA are not available from PyPI, and pip will install an incorrect Jetson version of the Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.32.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04-based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.32.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.32.0-win.zip. This is a beta release, so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.14.1. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.4.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

CUDA 11.8.0
cuDNN 8.8.1.3
TensorRT 8.5.3.1

Jetson Jetpack Support

A release of Triton for JetPack is provided in the attached tar file: tritonserver2.32.0-jetpack5.1.tgz.

This release supports TensorFlow 2.11.0, TensorFlow 1.15.5, TensorRT 8.5.2.2, ONNX Runtime 1.14.1, PyTorch 2.0.0, and Python 3.8, as well as ensembles.
The ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.32.0-py3-none-manylinux2014_aarch64.whl[all]

Published by mc-nv over 3 years ago

server - Release 2.31.0 corresponding to NGC container 23.02

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New in 2.31.0

Support for ensemble models in Model Analyzer.
Support for the GRPC Standard Health Check Protocol.
Fixed intermittent hangs during model loading for Python backend.
Refer to the 23.02 column of the Frameworks Support Matrix for container image versions on which the 23.02 inference server container is based.

Known Issues

In some rare cases Triton might overwrite input tensors while they are still in use, which leads to corrupt input data being used for inference with TensorRT models. If you encounter accuracy issues with your TensorRT model, you can work around the issue by enabling the output_copy_stream option in your model's configuration.
Some malloc() implementations may not release memory back to the operating system right away, which can look like a memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc is installed in the Triton container and can be used by specifying the library in LD_PRELOAD.
When using a custom operator for the PyTorch backend, the operator may not be loaded due to undefined Python library symbols. This can be worked around by specifying the Python library in LD_PRELOAD.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names match what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Perf Analyzer stability criteria have been changed, which may result in instability being reported for scenarios that were previously considered stable. This change has been made to improve the accuracy of Perf Analyzer results. If Perf Analyzer reports instability, it can be resolved by increasing --measurement-interval in time windows mode or --measurement-request-count in count windows mode.
Triton Client pip wheels for Arm SBSA are not available from PyPI, and pip will install an incorrect Jetson version of the Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
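
For the Perf Analyzer stability issue above, the sketch below assembles a command line that enlarges the measurement window in either mode. The model name and the numeric values are hypothetical, and --measurement-mode with the values time_windows and count_windows should be confirmed against perf_analyzer --help for your release. The two window flags are the ones named in the note above.

```python
def perf_analyzer_argv(model, mode, window):
    """Build a perf_analyzer command line with a larger measurement window.

    mode "time_windows":  `window` is the measurement interval in milliseconds.
    mode "count_windows": `window` is the number of requests per measurement.
    """
    if mode == "time_windows":
        window_args = ["--measurement-interval", str(window)]
    elif mode == "count_windows":
        window_args = ["--measurement-request-count", str(window)]
    else:
        raise ValueError(f"unknown measurement mode: {mode!r}")
    return ["perf_analyzer", "-m", model, "--measurement-mode", mode] + window_args


# Hypothetical model name and window sizes.
time_cmd = perf_analyzer_argv("my_model", "time_windows", 10000)
count_cmd = perf_analyzer_argv("my_model", "count_windows", 200)
# To run one of them: import subprocess; subprocess.run(time_cmd, check=True)
```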

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.31.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04-based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.31.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.31.0-win.zip. This is a beta release, so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.13.1. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.4.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

CUDA 11.8.0
cuDNN 8.7.0.84
TensorRT 8.5.1.7

Jetson Jetpack Support

A release of Triton for JetPack is provided in the attached tar file: tritonserver2.31.0-jetpack5.1.tgz.

This release supports TensorFlow 2.11.0, TensorFlow 1.15.5, TensorRT 8.5.2.2, ONNX Runtime 1.13.1, PyTorch 1.14.0, and Python 3.8, as well as ensembles.
The ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.31.0-py3-none-manylinux2014_aarch64.whl[all]

Published by mc-nv over 3 years ago

server - Release 2.30.0 corresponding to NGC container 23.01

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New in 2.30.0

The dynamic batcher now accepts user-defined batching constraints, allowing users to specify custom batching strategies.
Relaxed Python client gRPC version requirement.
Refer to the 23.01 column of the Frameworks Support Matrix for container image versions on which the 23.01 inference server container is based.

Known Issues

In some rare cases Triton might overwrite input tensors while they are still in use, which leads to corrupt input data being used for inference with TensorRT models. If you encounter accuracy issues with your TensorRT model, you can work around the issue by enabling the output_copy_stream option in your model's configuration.
Some malloc() implementations may not release memory back to the operating system right away, which can look like a memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc is installed in the Triton container and can be used by specifying the library in LD_PRELOAD.
When using a custom operator for the PyTorch backend, the operator may not be loaded due to undefined Python library symbols. This can be worked around by specifying the Python library in LD_PRELOAD.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names match what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Perf Analyzer stability criteria have been changed, which may result in instability being reported for scenarios that were previously considered stable. This change has been made to improve the accuracy of Perf Analyzer results. If Perf Analyzer reports instability, it can be resolved by increasing --measurement-interval in time windows mode or --measurement-request-count in count windows mode.
Triton Client pip wheels for Arm SBSA are not available from PyPI, and pip will install an incorrect Jetson version of the Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
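
The TensorRT workaround above is a model configuration change. As a sketch, and assuming the option referred to is the output_copy_stream field under the CUDA optimization settings of config.pbtxt, the helper below renders the fragment that would be appended to a model's configuration. The helper name is hypothetical, and the field location should be verified against the model configuration documentation for your release.

```python
def cuda_output_copy_stream_block(enabled=True):
    """Render a config.pbtxt fragment that toggles output_copy_stream.

    Assumption: the field lives under optimization { cuda { ... } }.
    """
    flag = "true" if enabled else "false"
    return (
        "optimization {\n"
        "  cuda {\n"
        f"    output_copy_stream: {flag}\n"
        "  }\n"
        "}\n"
    )


fragment = cuda_output_copy_stream_block()
# The fragment could then be appended to <model_dir>/config.pbtxt, provided
# the file does not already contain an optimization block.
```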

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.30.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04-based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.30.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.30.0-win.zip. This is a beta release, so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.13.1. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.4.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

CUDA 11.8.0
cuDNN 8.7.0.84
TensorRT 8.5.1.7

Jetson Jetpack Support

A release of Triton for JetPack is provided in the attached tar file: tritonserver2.30.0-jetpack5.1.tgz.

This release supports TensorFlow 2.11.0, TensorFlow 1.15.5, TensorRT 8.5.2.1, ONNX Runtime 1.13.1, PyTorch 1.14.0, and Python 3.8, as well as ensembles.
The ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in Beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.30.0-py3-none-manylinux2014_aarch64.whl[all]

Published by mc-nv over 3 years ago

server - Release 2.29.0 corresponding to NGC container 22.12

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New in 2.29.0

Improvements to container and non-container builds on Windows.
Refer to the 22.12 column of the Frameworks Support Matrix for container image versions on which the 22.12 inference server container is based.

Known Issues

In some rare cases Triton might overwrite input tensors while they are still in use, which leads to corrupt input data being used for inference with TensorRT models. If you encounter accuracy issues with your TensorRT model, you can work around the issue by enabling the output_copy_stream option in your model's configuration.
Some malloc() implementations may not release memory back to the operating system right away, which can look like a memory leak. This can be mitigated by using a different malloc implementation. Tcmalloc is installed in the Triton container and can be used by specifying the library in LD_PRELOAD.
When using a custom operator for the PyTorch backend, the operator may not be loaded due to undefined Python library symbols. This can be worked around by specifying the Python library in LD_PRELOAD.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names match what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Perf Analyzer stability criteria have been changed, which may result in instability being reported for scenarios that were previously considered stable. This change has been made to improve the accuracy of Perf Analyzer results. If Perf Analyzer reports instability, it can be resolved by increasing --measurement-interval in time windows mode or --measurement-request-count in count windows mode.
Triton Client pip wheels for Arm SBSA are not available from PyPI, and pip will install an incorrect Jetson version of the Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.29.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04-based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.29.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.29.0-win.zip. This is a beta release, so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.13.1. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.4.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

CUDA 11.8.0
cuDNN 8.7.0.84
TensorRT 8.5.1.7

Jetson Jetpack Support

NOTE: There is no JetPack release for 22.12; the latest release is 22.10.

Published by mc-nv over 3 years ago

server - Release 2.28.0 corresponding to NGC container 22.11

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New in 2.28.0

Support for new TensorRT 8.5 features, including:
- UINT8 I/O
- "Data dependent dynamic shapes" operators (i.e. ONNX NMS and NonZero operations)
Support for execution environment paths outside the model directory. This can be done via the EXECUTION_ENV_PATH parameter in config.pbtxt. Refer to the Python backend README for known limitations.
Refer to the 22.11 column of the Frameworks Support Matrix for container image versions on which the 22.11 inference server container is based.
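
As an illustration of the EXECUTION_ENV_PATH item above, the sketch below renders the parameters entry that would go into a Python model's config.pbtxt. The archive path and the helper name are hypothetical, and the key/string_value layout is an assumption based on how config.pbtxt parameters are commonly written. The Python backend README remains the reference for the exact syntax and limitations.

```python
def execution_env_parameter(env_path):
    """Render a config.pbtxt parameters entry for EXECUTION_ENV_PATH.

    Assumption: parameters take a key plus a value with a string_value field.
    """
    return (
        "parameters: {\n"
        '  key: "EXECUTION_ENV_PATH",\n'
        f'  value: {{string_value: "{env_path}"}}\n'
        "}\n"
    )


# Hypothetical packed environment stored outside the model directory.
entry = execution_env_parameter("/opt/envs/python_env.tar.gz")
```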

Known Issues

In some rare cases Triton might overwrite input tensors while they are still in use, which leads to corrupt input data being used for inference with TensorRT models. If you encounter accuracy issues with your TensorRT model, you can work around the issue by enabling the output_copy_stream option in your model's configuration.
Triton's TensorRT support depends on CUDA event synchronization. In some rare cases the events may be triggered earlier than expected, causing Triton to overwrite input tensors while they are still in use and leading to corrupt input data being used for inference. If you encounter accuracy issues with your TensorRT model, you can work around the issue by enabling the output_copy_stream option in your model's configuration.
When using a custom operator for the PyTorch backend, the operator may not be loaded due to undefined Python library symbols. This can be worked around by specifying the Python library in LD_PRELOAD.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names match what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Perf Analyzer stability criteria have been changed, which may result in instability being reported for scenarios that were previously considered stable. This change has been made to improve the accuracy of Perf Analyzer results. If Perf Analyzer reports instability, it can be resolved by increasing --measurement-interval in time windows mode or --measurement-request-count in count windows mode.
Triton Client pip wheels for Arm SBSA are not available from PyPI, and pip will install an incorrect Jetson version of the Triton Client library for Arm SBSA. The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.28.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04-based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.28.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.28.0-win.zip. This is a beta release, so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.13.1. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.4.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

CUDA 11.8.0
cuDNN 8.7.0.80
TensorRT 8.5.1.7

Jetson Jetpack Support

NOTE: There is no JetPack release for 22.11; the latest release is 22.10.

Published by mc-nv over 3 years ago

server - Release 2.27.0 corresponding to NGC container 22.10

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New in 2.27.0

Added an example to demonstrate the use of JAX in python models.
Improved and enhanced Server Wrapper API to include missing features such as decoupled model and tracing support.
Multiple concurrent models can be profiled and analyzed by Model Analyzer. See Multi-Model Search Mode for additional details.

Known Issues

Triton's TensorRT support depends on CUDA event synchronization. In some rare cases the events may be triggered earlier than expected, causing Triton to overwrite input tensors while they are still in use and leading to corrupt input data being used for inference. If you encounter accuracy issues with your TensorRT model, you can work around the issue by enabling the output_copy_stream option in your model's configuration.
When using a custom operator for the PyTorch backend, the operator may not be loaded due to undefined Python library symbols. This can be worked around by specifying the Python library in LD_PRELOAD.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names matches what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
The Perf Analyzer stability criteria have been changed, which may result in instability being reported for scenarios that were previously considered stable. This change has been made to improve the accuracy of Perf Analyzer results. If you observe this message, it can be resolved by increasing --measurement-interval in the time windows mode or --measurement-request-count in the count windows mode.
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA.

The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.

Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU.

Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.

Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
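
The TensorRT workaround above amounts to a one-line change in the model's config.pbtxt. The following is a minimal sketch, assuming the option is spelled `output_copy_stream` and lives under `optimization { cuda { ... } }` in the model configuration; the helper name and the sample model are hypothetical, and the function only edits text rather than parsing the protobuf.

```python
def enable_output_copy_stream(config_text: str) -> str:
    """Return config text with the assumed output_copy_stream setting appended."""
    if "output_copy_stream" in config_text:
        # Already configured; leave the text untouched.
        return config_text
    if "optimization" in config_text:
        # An existing optimization block must be merged by hand, because a
        # second top-level block would not be valid for a singular field.
        raise ValueError("merge output_copy_stream into the existing optimization block")
    fragment = "optimization { cuda { output_copy_stream: true } }\n"
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + fragment


# Hypothetical TensorRT model configuration used only for illustration.
base_config = 'name: "my_trt_model"\nplatform: "tensorrt_plan"\nmax_batch_size: 8\n'
patched_config = enable_output_copy_stream(base_config)
print(patched_config)
```

Check the result against the model configuration documentation for your release before deploying it.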

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.27.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.27.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.27.0-win.zip. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.13.1. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.4.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

CUDA 11.8.0
cuDNN 8.6.0.163
TensorRT 8.5.0.12

Jetson Jetpack Support

A release of Triton for JetPack is provided in the attached tar file: tritonserver2.27.0-jetpack5.0.2.tgz.

This release supports TensorFlow 2.10.0, TensorFlow 1.15.5, TensorRT 8.4.1.5, ONNX Runtime 1.13.1, PyTorch 1.13.0, and Python 3.8, as well as ensembles.
The ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.27.0-py3-none-manylinux2014_aarch64.whl[all]

Published by mc-nv over 3 years ago

server - Release 2.26.0 corresponding to NGC container 22.09

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New in 2.26.0

Added a developer tools GitHub repository that provides a simplified interface for users to interact with the Triton Core shared library. These developer tools are in beta and are subject to change.
Added CPU metrics reporting in Triton’s Prometheus metrics endpoint.
Added logging protocol extension for users to change logging configuration dynamically.
Users can specify custom plugins to be loaded for the TensorRT backend through a command line option, in addition to LD_PRELOAD.
Enabled auto-completion for OpenVINO backend.
Enabled Python backend to log messages through Triton’s logger.
Refer to the 22.09 column of the Frameworks Support Matrix for container image versions on which the 22.09 inference server container is based.
Added quick search algorithm to Model Analyzer to drastically reduce search time.
Added GPU metrics gathering to Perf Analyzer, which is also used by Model Analyzer to improve accuracy of those metrics.
NGC container release 22.09 supports CUDA compute capability 6.0 and later. This corresponds to GPUs in the NVIDIA Pascal, NVIDIA Volta™, NVIDIA Turing™, NVIDIA Ampere architecture, and NVIDIA Hopper™ architecture families.

Known Issues

In certain rare cases with specific backends, Triton server may crash with a segmentation fault when exiting. Preliminary analysis shows that there might be a race condition in the cleanup of backend/model/instance state objects. The exact root cause is still unknown.
Triton's TensorRT support depends on CUDA event synchronization. In some rare cases the events may be triggered earlier than expected, causing Triton to overwrite input tensors while they are still in use and leading to corrupt input data being used for inference. If you encounter accuracy issues with your TensorRT model, you can work around the issue by enabling the output_copy_stream option in your model's configuration.
When using a custom operator for the PyTorch backend, the operator may not be loaded due to undefined Python library symbols. This can be worked around by specifying the Python library in LD_PRELOAD.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names matches what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
The Perf Analyzer stability criteria have been changed, which may result in instability being reported for scenarios that were previously considered stable. This change has been made to improve the accuracy of Perf Analyzer results. If you observe this message, it can be resolved by increasing --measurement-interval in the time windows mode or --measurement-request-count in the count windows mode.
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA.
The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU.
Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
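
The Perf Analyzer remedy above comes down to raising one of two command line values, depending on the measurement mode. The sketch below only builds the argument list and does not run anything; the helper name and model name are hypothetical, and the use of `-m` and `--measurement-mode` alongside the two options named above is an assumption about the Perf Analyzer command line that should be checked against `perf_analyzer --help`.

```python
from typing import List, Optional


def perf_analyzer_args(
    model_name: str,
    measurement_interval_ms: Optional[int] = None,
    measurement_request_count: Optional[int] = None,
) -> List[str]:
    """Build a perf_analyzer command with a larger measurement window."""
    # Exactly one of the two knobs applies, depending on the measurement mode.
    if (measurement_interval_ms is None) == (measurement_request_count is None):
        raise ValueError("set exactly one of the two measurement options")
    args = ["perf_analyzer", "-m", model_name]
    if measurement_interval_ms is not None:
        args += ["--measurement-mode", "time_windows",
                 "--measurement-interval", str(measurement_interval_ms)]
    else:
        args += ["--measurement-mode", "count_windows",
                 "--measurement-request-count", str(measurement_request_count)]
    return args


# Hypothetical model name; the values are examples, not recommendations.
time_args = perf_analyzer_args("my_model", measurement_interval_ms=10000)
count_args = perf_analyzer_args("my_model", measurement_request_count=200)
print(" ".join(time_args))
print(" ".join(count_args))
```

The resulting list can be passed to `subprocess.run` on a machine where Perf Analyzer is installed.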

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.26.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.26.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.26.0-win.zip. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.12.1. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.4.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

CUDA 11.8.0
cuDNN 8.6.0.163
TensorRT 8.5.0.12

Jetson Jetpack Support

A release of Triton for JetPack is provided in the attached tar file: tritonserver2.26.0-jetpack5.0.2.tgz.

This release supports TensorFlow 2.9.1, TensorFlow 1.15.5, TensorRT 8.4.1.5, ONNX Runtime 1.12.0, PyTorch 1.13.0, and Python 3.8, as well as ensembles.
The ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.26.0-py3-none-manylinux2014_aarch64.whl[all]

Published by mc-nv almost 4 years ago

server - Release 2.25.0 corresponding to NGC container 22.08

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New in 2.25.0

New support for multiple cloud credentials has been enabled. This feature is in beta and is subject to change.
Models using custom backends that implement auto-complete configuration can be loaded without an explicit config.pbtxt file if they are named in the form <model_name>.<backend_name>.
Users can specify a maximum memory limit when loading models onto the GPU with the new --model-load-gpu-limit tritonserver option and the TRITONSERVER_ServerOptionsSetModelLoadDeviceLimit C API function.
Added new documentation, Performance Tuning, with a step-by-step guide to optimizing models for production.
From this release onwards Triton will default to TensorFlow version 2.X. TensorFlow version 1.X can still be manually specified via backend config.
PyTorch backend has improved performance by using a separate CUDA Stream for each model instance when the instance kind is GPU.
Refer to the 22.08 column of the Frameworks Support Matrix for container image versions on which the 22.08 inference server container is based.
Model Analyzer's profile subcommand now analyzes the results after Profile is completed. Usage of the Analyze subcommand is deprecated. See Model Analyzer's documentation for further details.
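
To illustrate the <model_name>.<backend_name> naming form mentioned above, here is a small sketch that splits a name into its two parts. It is not Triton's own parsing logic: the helper is hypothetical, and splitting on the last dot is an assumption made for the example, since the notes do not say how names containing several dots are handled.

```python
from typing import Optional, Tuple


def split_model_name(name: str) -> Tuple[str, Optional[str]]:
    """Split '<model_name>.<backend_name>' into its parts (illustrative only)."""
    model_name, sep, backend_name = name.rpartition(".")
    # No dot, or an empty part on either side, means no backend suffix.
    if not sep or not model_name or not backend_name:
        return name, None
    return model_name, backend_name


# Hypothetical names used only for illustration.
for candidate in ("resnet50.mybackend", "plain_model", "a.b.mybackend"):
    print(candidate, "->", split_model_name(candidate))
```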

Known Issues

There is no JetPack release for 22.08; the latest release is 22.07.
Auto-complete may cause an increase in server start time. To avoid a start time increase, users can provide the full model configuration and launch the server with --disable-auto-complete-config.
When auto-completing some model configs, backends may generate a model config even though there is not enough metadata (e.g., GraphDef models for the TensorFlow backend). The user will see the model load successfully but fail at inference. In this case the user should provide the full model configuration for these models or use the --disable-auto-complete-config CLI option to show which models fail to load.
Auto-complete does not support PyTorch models due to lack of metadata in the model. It can only verify that the number of inputs and the input names matches what is specified in the model configuration. There is no model metadata about the number of outputs and datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273
Auto-complete is not supported in the OpenVINO backend.
The Perf Analyzer stability criteria have been changed, which may result in instability being reported for scenarios that were previously considered stable. This change has been made to improve the accuracy of Perf Analyzer results. If you observe this message, it can be resolved by increasing --measurement-interval in the time windows mode or --measurement-request-count in the count windows mode.
Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton Client library for Arm SBSA.

The correct client wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.

Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU.

Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.

Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
Model Analyzer reported values for GPU utilization and GPU power are known to be inaccurate and generally lower than reality.
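
Several of the auto-complete issues above are avoided by providing full model configurations and launching with --disable-auto-complete-config. The sketch below shows one way to prepare for that: it lists model directories that lack a config.pbtxt and builds the launch arguments. The helper names and the sample repository are hypothetical, and the check only looks for the file; it does not validate its contents.

```python
import tempfile
from pathlib import Path
from typing import List


def models_missing_config(model_repository: Path) -> List[str]:
    """Return names of model directories that have no config.pbtxt."""
    return sorted(
        entry.name
        for entry in model_repository.iterdir()
        if entry.is_dir() and not (entry / "config.pbtxt").is_file()
    )


def tritonserver_args(model_repository: Path) -> List[str]:
    """Build a tritonserver command line with auto-complete disabled."""
    return [
        "tritonserver",
        f"--model-repository={model_repository}",
        "--disable-auto-complete-config",
    ]


# Demo on a throwaway repository with hypothetical model names.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "with_config").mkdir()
    (repo / "with_config" / "config.pbtxt").write_text('name: "with_config"\n')
    (repo / "without_config").mkdir()
    missing = models_missing_config(repo)
    launch_args = tritonserver_args(repo)

print("models needing a full configuration:", missing)
print(" ".join(launch_args))
```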

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.25.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.25.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.25.0-win.zip. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.12.0. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.4.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

CUDA 11.7.1
cuDNN 8.4.1.5
TensorRT 8.4.1.5

Jetson Jetpack Support

NOTE: There is no JetPack release for 22.08; the latest release is 22.07.

Published by mc-nv almost 4 years ago

server - Release 2.24.0 corresponding to NGC container 22.07

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New in 2.24.0

Auto-complete is enabled by default. The --strict-model-config option has been soft deprecated; use the new --disable-auto-complete-config CLI option instead.
New example backend demonstrating Business Logic Scripting in C++.
Users can provide values for "init_ops" in TensorFlow 1.x GraphDef models through a JSON file.
New asyncio compatible API to the Python GRPC/HTTP APIs.
Added thread pool to reduce service downtime for concurrently loading models. The thread pool size is configurable with the new --model-load-thread-count tritonserver option. You can find more information here.
Model Analyzer no longer requires a config.pbtxt file for models that can be auto-completed in Triton.
Refer to the 22.07 column of the Frameworks Support Matrix for container image versions on which the 22.07 inference server container is based.
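
For launch scripts that still pass the soft-deprecated --strict-model-config option, the switch to --disable-auto-complete-config can be sketched as an argument rewrite. The helper below is hypothetical and rests on two assumptions that are not stated in these notes: that a true value for the old option is equivalent to the new flag, and that the boolean spellings shown are the ones the server accepts.

```python
from typing import List

TRUE_VALUES = {"true", "1", "on"}    # assumed spellings of "true"
FALSE_VALUES = {"false", "0", "off"}  # assumed spellings of "false"


def migrate_strict_model_config(args: List[str]) -> List[str]:
    """Rewrite --strict-model-config into the newer CLI form."""
    migrated: List[str] = []
    i = 0
    while i < len(args):
        arg = args[i]
        if arg == "--strict-model-config" and i + 1 < len(args):
            value, step = args[i + 1], 2       # "--flag value" form
        elif arg.startswith("--strict-model-config="):
            value, step = arg.split("=", 1)[1], 1  # "--flag=value" form
        else:
            migrated.append(arg)               # unrelated argument, keep as-is
            i += 1
            continue
        if value.lower() in TRUE_VALUES:
            # Strict configuration meant "do not auto-complete".
            migrated.append("--disable-auto-complete-config")
        elif value.lower() not in FALSE_VALUES:
            raise ValueError(f"unrecognized boolean value: {value!r}")
        # A false value is simply dropped: auto-complete is now the default.
        i += step
    return migrated


old_args = ["tritonserver", "--model-repository=/models", "--strict-model-config=true"]
new_args = migrate_strict_model_config(old_args)
print(" ".join(new_args))
```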

Known Issues

JetPack release will be published later in the month in order to align with JetPack SDK public availability.
Auto-complete could cause an increase in server start time. To avoid a start time increase, users should provide the full model configuration.
When auto-completing some model configs, backends may generate a model config even though there is not enough metadata (e.g., GraphDef models for the TensorFlow backend). The user will see the model load successfully but fail at inference. In this case the user should provide the full model configuration for these models or use the --disable-auto-complete-config CLI option to show which models fail to load.
Auto-complete does not support PyTorch models because the model does not carry enough metadata. It can only verify that the number of inputs is correct and that the input names match what is specified in the model configuration. There is no information about the number of outputs or their datatypes. Related PyTorch bug: https://github.com/pytorch/pytorch/issues/38273.
Running inference on multiple TensorRT model instances in Triton may fail with signal(6). The issue is expected to be fixed in a future release. Details can be found at https://github.com/triton-inference-server/server/issues/4566.
The Perf Analyzer stability criteria have been changed, which may result in instability being reported for scenarios that were previously considered stable. This change has been made to improve the accuracy of Perf Analyzer results. If you observe this message, it can be resolved by increasing --measurement-interval in the time windows mode or --measurement-request-count in the count windows mode.
Unlike previously noted, 22.07 is the last release that defaults to TensorFlow version 1. From 22.08 onwards Triton will change the default TensorFlow version to 2.X.
Triton PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton for Arm SBSA.

The correct wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.

Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU.

Refer to issue https://github.com/pytorch/pytorch/issues/66930 for more information.

Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
Starting from 22.02, the Triton container, which uses the 22.02 or above PyTorch container, will report an error during model loading in the PyTorch backend when using scripted models that were exported in the legacy format (using our 19.09 or previous PyTorch NGC containers corresponding to PyTorch 1.2.0 or previous releases).

To load the model successfully in Triton, you need to export the model again by using a recent version of PyTorch.

Model Analyzer reported values for GPU utilization and GPU power are known to be inaccurate and generally lower than reality.

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.24.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.24.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.24.0-win.zip. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.12.0. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.4.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

CUDA 11.7.1
cuDNN 8.4.1.5
TensorRT 8.4.1.5

Jetson Jetpack Support

A release of Triton for JetPack is provided in the attached tar file: tritonserver2.24.0-jetpack5.0.2.tgz.

This release supports TensorFlow 2.9.1, TensorFlow 1.15.5, TensorRT 8.4.1.5, ONNX Runtime 1.12.0, PyTorch 1.13.0, and Python 3.8, as well as ensembles.
The ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.24.0-py3-none-manylinux2014_aarch64.whl[all]

Published by mc-nv almost 4 years ago

server - Release 2.23.0 corresponding to NGC container 22.06

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New in 2.23.0

Auto-generated model configuration enables dynamic batching in supported models by default.
Python backend models now support auto-generated model configuration.
Decoupled API support in Python Backend model is out of beta.
Updated I/O tensors naming convention for serving TorchScript models via PyTorch backend.
Improvements to Perf Analyzer stability and profiling logic.
Refer to the 22.06 column of the Frameworks Support Matrix for container image versions on which the 22.06 inference server container is based.

Known Issues

The Perf Analyzer stability criteria have been changed, which may result in instability being reported for scenarios that were previously considered stable. This change has been made to improve the accuracy of Perf Analyzer results. If you observe this message, it can be resolved by increasing --measurement-interval in the time windows mode or --measurement-request-count in the count windows mode.
22.06 is the last release that defaults to TensorFlow version 1. From 22.07 onwards Triton will change the default TensorFlow version to 2.X.
Triton PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton for Arm SBSA.

The correct wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.

Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU.

Refer to issue pytorch#66930 for more information.

Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
Running a PyTorch TorchScript model using the PyTorch backend with multiple instances of the model configured can lead to a slowdown in model execution due to the following PyTorch issue: pytorch#27902.
Starting from 22.02, the Triton container, which uses the 22.02 or above PyTorch container, will report an error during model loading in the PyTorch backend when using scripted models that were exported in the legacy format (using our 19.09 or previous PyTorch NGC containers corresponding to PyTorch 1.2.0 or previous releases).

To load the model successfully in Triton, you need to export the model again by using a recent version of PyTorch.

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.23.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.23.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.23.0-win.zip. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.10.0. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.4.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

CUDA 11.5.0
cuDNN 8.3.2.44
TensorRT 8.2.2.1

Jetson Jetpack Support

A release of Triton for JetPack 5.0 Developer Preview is provided in the attached tar file: tritonserver2.23.0-jetpack5.0.tgz.

This release supports TensorFlow 2.8.0, TensorFlow 1.15.5, TensorRT 8.4.0.9, ONNX Runtime 1.10.0, PyTorch 1.12.0, and Python 3.8, as well as ensembles.
The ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.23.0-py3-none-manylinux2014_aarch64.whl[all]

Published by mc-nv about 4 years ago

server - Release 2.22.0 corresponding to NGC container 22.05

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New in 2.22.0

Triton In-Process API is available in Java.
Python backend supports the decoupled API as a beta release.
You may load models with file content provided during the Triton Server API invocation.
Triton supports BF16 data type.
PyTorch backend supports 1-dimensional String I/O.
Explicit model control mode supports loading all models at startup.
You may specify customized GRPC channel settings in the GRPC client library.
Triton In-Process API supports dynamic model repository registration.
Improved the build pipeline in build.py, which now generates build scripts that can be used to examine the pipeline.
ONNX Runtime backend updated to ONNX Runtime version 1.11.1 in both Ubuntu and Windows versions of Triton.
Refer to the 22.05 column of the Frameworks Support Matrix for container image versions on which the 22.05 inference server container is based.
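
As an illustration of the explicit model control item above, the following sketch wraps a server launch that loads every model in the repository at startup. It assumes the feature is exposed through the --model-control-mode=explicit and --load-model flags of the tritonserver executable; the repository path and the function name start_triton_explicit_all are hypothetical placeholders, not part of the release.

```shell
# Hypothetical repository path; adjust to your deployment.
MODEL_REPO="${MODEL_REPO:-/path/to/model_repo}"

start_triton_explicit_all() {
    # '*' is quoted so the shell passes it to tritonserver unexpanded.
    tritonserver --model-repository="$MODEL_REPO" \
        --model-control-mode=explicit \
        --load-model='*'
}
```

Calling start_triton_explicit_all starts the server; individual models can still be loaded and unloaded afterwards through the model control API.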

Known Issues

Triton PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton for Arm SBSA.

The correct wheel file can be pulled directly from the Arm SBSA SDK image and manually installed.

Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU.

Refer to pytorch/pytorch#66930 for more information.

Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics might not work if the host machine is running a separate DCGM agent on bare-metal or in a container.
Running a PyTorch TorchScript model using the PyTorch backend, where multiple instances of a model are configured can lead to a slowdown in model execution due to the following PyTorch issue: pytorch/pytorch#27902.
Starting in 22.02, the Triton container, which uses the 22.05 PyTorch container, will report an error during model loading in the PyTorch backend when using scripted models that were exported in the legacy format (using our 19.09 or previous PyTorch NGC containers corresponding to PyTorch 1.2.0 or previous releases).

To load the model successfully in Triton, you need to export the model again by using a recent version of PyTorch.

A protobuf Python package version that satisfies protobuf>=3.5.0,<3.20 must be installed before installing the Triton ARM SBSA wheels or any tritonclient version of 2.22.0 or earlier. tritonclient versions 2.22.3 or newer for Jetson, x86, and Windows work normally.
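
A minimal sketch of the protobuf pin described above, assuming pip is available as python3 -m pip. The variable PROTOBUF_SPEC and the function install_pinned_protobuf are hypothetical names introduced for illustration; only the version range comes from the known issue.

```shell
# Version range taken from the known issue above. The single quotes matter:
# unquoted, the shell would parse '<' and '>' as redirections.
PROTOBUF_SPEC='protobuf>=3.5.0,<3.20'

install_pinned_protobuf() {
    # Installs a protobuf release inside the supported range (needs network access).
    python3 -m pip install "$PROTOBUF_SPEC"
}
```

Run install_pinned_protobuf first, then install the tritonclient wheel as usual.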

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.22.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04-based NGC container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.22.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.22.0-win.zip. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.10.0. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.4.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

CUDA 11.5.0
cuDNN 8.3.2.44
TensorRT 8.2.2.1

Jetson Jetpack Support

A release of Triton for JetPack 5.0 Developer Preview is provided in the attached tar file: tritonserver2.22.0-jetpack5.0.tgz.

This release supports TensorFlow 2.8.0, TensorFlow 1.15.5, TensorRT 8.4.0.9, ONNX Runtime 1.10.0, PyTorch 1.12.0, and Python 3.8, as well as ensembles.
The ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.22.0-py3-none-manylinux2014_aarch64.whl[all]

Published by mc-nv about 4 years ago

server - Release 2.21.0 corresponding to NGC container 22.04

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 2.21.0

Users can now specify a customized temp directory with the --tmp-dir argument to build.py during the container build.
Users can now send a raw binary request to eliminate the need for the specification of inference header.
Ensembles now recognize optional inputs.
Users can now add custom metrics to the existing Triton metrics endpoint in their custom backends and applications using the Triton C API. Documentation can be found here.
Official support for multiple cloud repositories, from the same or from different cloud storage providers. For example, a single instance of Triton can load models from two S3 buckets, two GCS buckets, and two Azure Storage containers.
ONNX Runtime backend now uses execution providers when available when autocomplete is enabled. This fixes the old behavior where it would always use the CPU execution provider.
The build.py and compose.py scripts now support the PyTorch and TensorFlow 1 backends for CPU-only builds.
Refer to the 22.04 column of the Frameworks Support Matrix for container image versions on which the 22.04 inference server container is based.

Known Issues

Triton PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton for ARM SBSA.

The correct wheel file can be pulled directly from the ARM SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. See https://github.com/pytorch/pytorch/issues/66930.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30)
Triton metrics may not work if the host machine is running a separate DCGM agent, either on bare-metal or in a container
Running a PyTorch TorchScript model using the PyTorch backend, where multiple instances of a model are configured can lead to a slowdown in model execution due to the following PyTorch issue: https://github.com/pytorch/pytorch/issues/27902
Starting in 22.02, the Triton container (which uses the 22.04 PyTorch container) will report an error during model loading in the PyTorch backend when using scripted models that were exported in the legacy format (using our 19.09 or previous PyTorch NGC containers corresponding to PyTorch 1.2.0 or previous releases). You will need to re-export the model using a recent version of PyTorch to be able to load the model successfully in Triton.
To best ensure the security and reliability of our RPM and Debian package repositories, NVIDIA is updating and rotating the signing keys used by the apt, dnf/yum, and zypper package managers beginning April 27, 2022. Triton r22.04 and prior release branches have not updated these repository signing keys. As a result, users should expect package management errors when attempting to access or install packages from CUDA repositories. Please follow these recommendations to mitigate the issue. Please update your branches prior to building to include the updated signing key(s). These changes are captured in this commit.

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.21.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04-based NGC container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.21.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.21.0-win.zip. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.10.0. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.4.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

CUDA 11.5.0
cuDNN 8.3.2.44
TensorRT 8.2.2.1

Jetson Jetpack Support

A release of Triton for JetPack 5.0 Developer Preview is provided in the attached tar file: tritonserver2.21.0-jetpack5.0.tgz.

This release supports TensorFlow 2.8.0, TensorFlow 1.15.5, TensorRT 8.4.0.9, ONNX Runtime 1.10.0, PyTorch 1.12.0, and Python 3.8, as well as ensembles.
The ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.21.0-py3-none-manylinux2014_aarch64.whl[all]

Published by mc-nv about 4 years ago

server - Release 2.20.0 corresponding to NGC container 22.03

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 2.20.0

Models can now load from a serialized model_config message with the Triton Server API.
ONNX Runtime, TensorRT, and TensorFlow backends now support server-side, multi-dimensional ragged batching.
Cache miss statistics have been added to the Prometheus metrics.
Trace settings can be configured with the Triton Server Trace Protocol.

Known Issues

Triton PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton for ARM SBSA. The correct wheel file can be pulled directly from the ARM SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. See https://github.com/pytorch/pytorch/issues/66930.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30)
Triton metrics may not work if the host machine is running a separate DCGM agent, either on bare-metal or in a container
Running a PyTorch TorchScript model using the PyTorch backend, where multiple instances of a model are configured can lead to a slowdown in model execution due to the following PyTorch issue: https://github.com/pytorch/pytorch/issues/27902
Starting in 22.03, the Triton container (which uses the 22.03 PyTorch container) will report an error during model loading in the PyTorch backend when using scripted models that were exported in the legacy format (using our 19.09 or previous PyTorch NGC containers corresponding to PyTorch 1.2.0 or previous releases). You will need to re-export the model using a recent version of PyTorch to be able to load the model successfully in Triton.

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.20.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04-based NGC container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.20.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.20.0-win.zip. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.10.0. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.4.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

CUDA 11.5.0
cuDNN 8.3.2.44
TensorRT 8.2.2.1

Jetson Jetpack Support

NOTE: The Jetson tritonserver2.20.0-jetpack5.0.tgz file will be added at a later date.

A release of Triton for JetPack 5.0 EA will be provided in the attached tar file: tritonserver2.20.0-jetpack5.0.tgz. It will be uploaded once JetPack 5.0 is released publicly.

This release supports TensorFlow 2.8.0, TensorFlow 1.15.5, TensorRT 8.4.0.6, ONNX Runtime 1.10.0, PyTorch 1.12.0, and Python 3.8, as well as ensembles.
The ONNX Runtime backend does not support the OpenVINO and TensorRT execution providers. The CUDA execution provider is in beta.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.20.0-py3-none-manylinux2014_aarch64.whl[all]

Published by mc-nv over 4 years ago

server - Release 2.19.0 corresponding to NGC container 22.02

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 2.19.0

Enabled full-fledged SSL support in HTTP C++ client library.
New cache metrics added to Prometheus metrics.
PyTorch Backend now supports passing inputs in the form of a dictionary of tensors.

Known Issues

Starting in 22.02, the Triton container (which uses the 22.02 PyTorch container) will report an error in the PyTorch backend when using scripted models that were exported in the legacy format (using our 19.09 or previous PyTorch NGC containers corresponding to PyTorch 1.2.0 or previous releases). To avoid this error, you will need to re-export the model using a recent version of PyTorch to be able to load the model successfully in Triton.
The addition of cache metrics may affect third-party tools or calculations of inference and compute latencies for models with caching enabled if they do not account for cache-hit requests, which do not require inference.
Triton PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton for ARM SBSA. The correct wheel file can be pulled directly from the ARM SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. See https://github.com/pytorch/pytorch/issues/66930.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30)
Triton metrics may not work if the host machine is running a separate DCGM agent, either on bare-metal or in a container
Running a PyTorch TorchScript model using the PyTorch backend, where multiple instances of a model are configured can lead to a slowdown in model execution due to the following PyTorch issue: https://github.com/pytorch/pytorch/issues/27902

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.19.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04-based NGC container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.19.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.19.0-win.zip. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.10.0. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.2.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

CUDA 11.5.0
cuDNN 8.3.2.44
TensorRT 8.2.2.1

Jetson Jetpack Support

A release of Triton for JetPack 4.6.1 is provided in the attached tar file: tritonserver2.19.0-jetpack4.6.1.tgz.

This release supports TensorFlow 2.7.0, TensorFlow 1.15.5, TensorRT 8.2.1.8, ONNX Runtime 1.10.0, PyTorch 1.11.0, and Python 3.6, as well as ensembles.
The ONNX Runtime backend does not support the OpenVINO execution provider. The TensorRT execution provider, however, is supported.
System shared memory is supported on Jetson. CUDA shared memory is not supported.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples. For more information on how to install and use Triton on JetPack refer to jetson.md.

The wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.19.0-py3-none-manylinux2014_aarch64.whl[all]

Published by dzier over 4 years ago

server - Release 2.18.0 corresponding to NGC container 22.01

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 2.18.0

Triton CPU-only build now supports TensorFlow2 backend for Linux x86.
Implicit state management can be used for ONNX Runtime and TensorRT backends.
State initialization from a constant is now supported in Implicit State management.
PyTorch and TensorFlow models now support batching on Inferentia.
PyTorch and Python backends are now supported on Jetson.
ARM Support has been added for the Performance Analyzer and Model Analyzer.

Known Issues

Triton PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton for ARM SBSA. The correct wheel file can be pulled directly from the ARM SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. See https://github.com/pytorch/pytorch/issues/66930.
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30)
Triton metrics may not work if the host machine is running a separate DCGM agent, either on bare-metal or in a container
Running a PyTorch TorchScript model using the PyTorch backend, where multiple instances of a model are configured can lead to a slowdown in model execution due to the following PyTorch issue: https://github.com/pytorch/pytorch/issues/27902

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.18.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04-based NGC container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.18.0-sdk-win.zip file.

Windows Support

A beta release of Triton for Windows is provided in the attached file: tritonserver2.18.0-win.zip. This is a beta release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.10.0. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.2.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

CUDA 11.5.0
cuDNN 8.3.2.44
TensorRT 8.2.2.1

Jetson Jetpack Support

NOTE: Jetson release of Triton is skipped for 2.18.0 (22.01) and the next release will be 2.19.0 (22.02).

Published by dzier over 4 years ago

server - Release 2.17.0 corresponding to NGC container 21.12

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 2.17.0

Improved Inferentia support to use Neuron Runtime 2.x and multiple instances.
Models from MLflow can now be deployed to Triton with the MLflow plugin.
The preview release of TorchTRT models is now supported. PyTorch models optimized using TensorRT can now be loaded into Triton in the same way as regular PyTorch (TorchScript) models.
At the end of each Model Analyzer phase, an example command line will be printed to run the next phase.
ONNX Runtime backend updated to ONNX Runtime version 1.10.0 in both Ubuntu and Windows versions of Triton.

Known Issues

There was a bug in the GRPC protobuf implementation that was resolved by https://github.com/triton-inference-server/common/pull/34. If the client code uses the 'bytecontents' field, the code must be updated to instead use 'bytescontents'.
Triton PIP wheels for ARM SBSA are not available from PyPI and pip will install an incorrect Jetson version of Triton for ARM SBSA. The correct wheel file can be pulled directly from the ARM SBSA SDK image and manually installed.
Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. See https://github.com/pytorch/pytorch/issues/66930.
Triton’s TensorRT support depends on the input-consumed feature of TensorRT. In some rare cases using TensorRT 8.0 and earlier versions, the input-consumed event fires earlier than expected, causing Triton to overwrite input tensors while they are still in use and leading to corrupt input data being used for inference. This situation occurs when the inputs feed directly into a TensorRT layer that is optimized into a ForeignNode in the builder log. If you encounter accuracy issues with your TensorRT model, you can work around the issue by enabling the output_copy_stream option in your model’s configuration (https://github.com/triton-inference-server/common/blob/main/protobuf/model_config.proto#L816).
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30)
Triton metrics may not work if the host machine is running a separate DCGM agent, either on bare-metal or in a container
Running a PyTorch TorchScript model using the PyTorch backend, where multiple instances of a model are configured can lead to a slowdown in model execution due to the following PyTorch issue: https://github.com/pytorch/pytorch/issues/27902
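
The output_copy_stream workaround mentioned in the TensorRT issue above can be sketched as follows. This assumes the option lives under the optimization.cuda section of the model configuration, as laid out in model_config.proto; the model directory name my_trt_model and the MODEL_DIR variable are hypothetical placeholders.

```shell
# Hypothetical model directory; replace with a real model in your repository.
MODEL_DIR="${MODEL_DIR:-./model_repository/my_trt_model}"
mkdir -p "$MODEL_DIR"

# Append the CUDA optimization setting to the model configuration.
cat >> "$MODEL_DIR/config.pbtxt" <<'EOF'
optimization {
  cuda {
    output_copy_stream: true
  }
}
EOF
```

If config.pbtxt already contains an optimization block, merge the cuda setting into that block by hand instead of appending a second one.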

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.17.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04-based NGC container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.17.0-sdk-win.zip file.

Windows Support

An alpha release of Triton for Windows is provided in the attached file: tritonserver2.17.0-win.zip. This is an alpha release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are now supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.10.0. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.2.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

To use the Windows version of Triton, you must install all the necessary dependencies on your Windows system. These dependencies are available in the Dockerfile.win10.min. The Dockerfile includes the following CUDA-related components:

NVIDIA Driver release 470 or later.
CUDA 11.4.2
cuDNN 8.2.4.15
TensorRT 8.0.3.4

Jetson Jetpack Support

A release of Triton for JetPack 4.6 (https://developer.nvidia.com/embedded/jetpack) is provided in the attached tar file: tritonserver2.17.0-jetpack4.6.tgz.

This release supports TensorFlow 2.6.0, TensorFlow 1.15.5, TensorRT 8.0.1.6, and ONNX Runtime 1.10.0, as well as ensembles.
For the ONNX Runtime backend, the OpenVINO execution provider is not supported, but the TensorRT execution provider is supported.
System shared memory is supported on Jetson.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples.

Installation and Usage

The following dependencies must be installed before building / running Triton.

apt-get update && \
    apt-get install -y --no-install-recommends \
        software-properties-common \
        autoconf \
        automake \
        build-essential \
        cmake \
        git \
        libb64-dev \
        libre2-dev \
        libssl-dev \
        libtool \
        libboost-dev \
        libcurl4-openssl-dev \
        libopenblas-dev \
        rapidjson-dev \
        patchelf \
        zlib1g-dev

Note: When building Triton on Jetson, you will require a newer version of cmake. We recommend using cmake 3.21.0. Below is a script to upgrade your cmake version to 3.21.0. You can use cmake 3.18.4 if you are not enabling OnnxRuntime support.

apt remove cmake
wget https://cmake.org/files/v3.21/cmake-3.21.0.tar.gz
tar -xf cmake-3.21.0.tar.gz
(cd cmake-3.21.0 && ./configure && make install)

Note: Seeing a core dump when using numpy 1.19.5 on Jetson is a known issue. We recommend using numpy version 1.19.4 or earlier to work around this issue.

To build / run the Triton client libraries and examples on Jetson, the following dependencies must be installed.

apt-get install -y --no-install-recommends \
        curl \
        pkg-config \
        python3 \
        python3-pip \
        python3-dev

pip3 install --upgrade wheel setuptools cython && \
pip3 install --upgrade grpcio-tools numpy==1.19.4 future attrdict

Note: OpenCV 4.1.1 is installed as a part of JetPack 4.6. It is one of the dependencies for the client build.

The Python wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.17.0-py3-none-manylinux2014_aarch64.whl[all]

On Jetson, the backend directory needs to be explicitly set with the --backend-directory flag. Triton also defaults to using TensorFlow 1.x and a version string is required to specify TensorFlow 2.x.

  tritonserver --model-repository=/path/to/model_repo --backend-directory=/path/to/tritonserver/backends \
         --backend-config=tensorflow,version=2

Published by dzier over 4 years ago

server - Release 2.16.0 corresponding to NGC container 21.11

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 2.16.0

Added support for LightGBM models with categorical features in FIL backend.
Added Jetson examples in documentation.
Completed proof of concept of Inferentia support.
Added ARM Support for Model Analyzer.

Known Issues

Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. See https://github.com/pytorch/pytorch/issues/66930.
Triton’s TensorRT support depends on the input-consumed feature of TensorRT. In some rare cases using TensorRT 8.0 and earlier versions, the input-consumed event fires earlier than expected, causing Triton to overwrite input tensors while they are still in use and leading to corrupt input data being used for inference. This situation occurs when the inputs feed directly into a TensorRT layer that is optimized into a ForeignNode in the builder log. If you encounter accuracy issues with your TensorRT model, you can work around the issue by enabling the output_copy_stream option in your model’s configuration (https://github.com/triton-inference-server/common/blob/main/protobuf/model_config.proto#L816).
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics may not work if the host machine is running a separate DCGM agent, either on bare metal or in a container.
There is a known issue in TensorRT 8.0 regarding accuracy for a certain case of int8 inferencing on A40 and similar GPUs. The version of TF-TRT in TF2 21.09 includes a feature that works around this issue, but TF1 21.08 does not include that feature and therefore Triton users may experience the accuracy drop for a small subset of model/data type/batch size combinations on A40 when TF-TRT is used through the TF1 backend. This will be fixed in the next version of TensorRT.
Running a PyTorch TorchScript model using the PyTorch backend, where multiple instances of a model are configured can lead to a slowdown in model execution due to the following PyTorch issue: https://github.com/pytorch/pytorch/issues/27902
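
The output_copy_stream workaround described in the TensorRT issue above would be set in the model's config.pbtxt. The fragment below is a minimal sketch, assuming the field sits in the cuda section of the optimization policy as defined in the linked model_config.proto; verify the field against the proto for your release before relying on it.

```
# config.pbtxt fragment (sketch): ask Triton to copy outputs on a CUDA
# stream separate from the inference stream.
optimization {
  cuda {
    output_copy_stream: true
  }
}
```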

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.16.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04-based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer, and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.16.0-sdk-win.zip file.

Windows Support

An alpha release of Triton for Windows is provided in the attached file: tritonserver2.16.0-win.zip. This is an alpha release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are now supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.9. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.2.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

The following components are required for this release and must be installed on the Windows system:

NVIDIA Driver release 470 or later.
CUDA 11.4.2
cuDNN 8.2.4.15
TensorRT 8.0.3.4

Jetson Jetpack Support

A release of Triton for JetPack 4.6 (https://developer.nvidia.com/embedded/jetpack) is provided in the attached tar file: tritonserver2.16.0-jetpack4.6.tgz.

This release supports TensorFlow 2.6.0, TensorFlow 1.15.5, TensorRT 8.0.1.6, and ONNX Runtime 1.8.1, as well as ensembles.
For the ONNX Runtime backend, the OpenVINO execution provider is not supported, but the TensorRT execution provider is supported.
System shared memory is supported on Jetson.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples.

Installation and Usage

The following dependencies must be installed before building / running Triton.

apt-get update && \
    apt-get install -y --no-install-recommends \
        software-properties-common \
        autoconf \
        automake \
        build-essential \
        cmake \
        git \
        libb64-dev \
        libre2-dev \
        libssl-dev \
        libtool \
        libboost-dev \
        libcurl4-openssl-dev \
        libopenblas-dev \
        rapidjson-dev \
        patchelf \
        zlib1g-dev

Note: When building Triton on Jetson, you will require a newer version of cmake. We recommend using cmake 3.21.0. Below is a script to upgrade your cmake version to 3.21.0. You can use cmake 3.18.4 if you are not enabling OnnxRuntime support.

apt remove cmake
wget https://cmake.org/files/v3.21/cmake-3.21.0.tar.gz
tar -xf cmake-3.21.0.tar.gz
(cd cmake-3.21.0 && ./configure && make install)

Note: Seeing a core dump when using numpy 1.19.5 on Jetson is a known issue. We recommend using numpy version 1.19.4 or earlier to work around this issue.

To build / run the Triton client libraries and examples on Jetson, the following dependencies must be installed.

apt-get install -y --no-install-recommends \
        curl \
        pkg-config \
        python3 \
        python3-pip \
        python3-dev

pip3 install --upgrade wheel setuptools cython && \
pip3 install --upgrade grpcio-tools numpy==1.19.4 future attrdict

Note: OpenCV 4.1.1 is installed as a part of JetPack 4.6. It is one of the dependencies for the client build.

The Python wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.16.0-py3-none-manylinux2014_aarch64.whl[all]

On Jetson, the backend directory needs to be explicitly set with the --backend-directory flag. Triton also defaults to using TensorFlow 1.x and a version string is required to specify TensorFlow 2.x.

  tritonserver --model-repository=/path/to/model_repo --backend-directory=/path/to/tritonserver/backends \
         --backend-config=tensorflow,version=2

Published by dzier over 4 years ago

server - Release 2.15.0 corresponding to NGC container 21.10

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 2.15.0

Rate limiter is now available and manages the rate at which requests are scheduled on model instances by Triton.
A beta version of Triton is available for ARM SBSA.
Windows Triton build now supports HTTP protocol.
Triton added support for caching responses to inference requests.
Sequence IDs can now accept strings.
Container composer tool can generate CPU-only Triton containers.

Known Issues

Traced models in PyTorch seem to create overflows when int8 tensor values are transformed to int32 on the GPU. See https://github.com/pytorch/pytorch/issues/66930.
Triton’s TensorRT support depends on the input-consumed feature of TensorRT. In some rare cases using TensorRT 8.0 and earlier versions, the input-consumed event fires earlier than expected, causing Triton to overwrite input tensors while they are still in use and leading to corrupt input data being used for inference. This situation occurs when the inputs feed directly into a TensorRT layer that is optimized into a ForeignNode in the builder log. If you encounter accuracy issues with your TensorRT model, you can work around the issue by enabling the output_copy_stream option in your model’s configuration (https://github.com/triton-inference-server/common/blob/main/protobuf/model_config.proto#L816).
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics may not work if the host machine is running a separate DCGM agent, either on bare metal or in a container.
There is a known issue in TensorRT 8.0 regarding accuracy for a certain case of int8 inferencing on A40 and similar GPUs. The version of TF-TRT in TF2 21.09 includes a feature that works around this issue, but TF1 21.08 does not include that feature and therefore Triton users may experience the accuracy drop for a small subset of model/data type/batch size combinations on A40 when TF-TRT is used through the TF1 backend. This will be fixed in the next version of TensorRT.
Running a PyTorch TorchScript model using the PyTorch backend, where multiple instances of a model are configured can lead to a slowdown in model execution due to the following PyTorch issue: https://github.com/pytorch/pytorch/issues/27902

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.15.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04-based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer, and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.15.0-sdk-win.zip file.

Windows Support

An alpha release of Triton for Windows is provided in the attached file: tritonserver2.15.0-win.zip. This is an alpha release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

HTTP/REST and GRPC endpoints are now supported.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.9. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.2.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

The following components are required for this release and must be installed on the Windows system:

NVIDIA Driver release 470 or later.
CUDA 11.4.2
cuDNN 8.2.4.15
TensorRT 8.0.3.4

Jetson Jetpack Support

A release of Triton for JetPack 4.6 (https://developer.nvidia.com/embedded/jetpack) is provided in the attached tar file: tritonserver2.15.0-jetpack4.6.tgz.

This release supports TensorFlow 2.6.0, TensorFlow 1.15.5, TensorRT 8.0.1.6, and ONNX Runtime 1.8.1, as well as ensembles.
For the ONNX Runtime backend, the OpenVINO execution provider is not supported, but the TensorRT execution provider is supported.
System shared memory is supported on Jetson.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.
For this release, TF1 corresponds to the version from the 21.10 NGC TF container, but TF2 corresponds to the version from the 21.09 NGC container.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples.

Installation and Usage

The following dependencies must be installed before building / running Triton.

apt-get update && \
    apt-get install -y --no-install-recommends \
        software-properties-common \
        autoconf \
        automake \
        build-essential \
        cmake \
        git \
        libb64-dev \
        libre2-dev \
        libssl-dev \
        libtool \
        libboost-dev \
        libcurl4-openssl-dev \
        libopenblas-dev \
        rapidjson-dev \
        patchelf \
        zlib1g-dev

Note: When building Triton on Jetson, you will require a newer version of cmake. We recommend using cmake 3.21.0. Below is a script to upgrade your cmake version to 3.21.0. You can use cmake 3.18.4 if you are not enabling OnnxRuntime support.

apt remove cmake
wget https://cmake.org/files/v3.21/cmake-3.21.0.tar.gz
tar -xf cmake-3.21.0.tar.gz
(cd cmake-3.21.0 && ./configure && make install)

Note: Seeing a core dump when using numpy 1.19.5 on Jetson is a known issue. We recommend using numpy version 1.19.4 or earlier to work around this issue.

To build / run the Triton client libraries and examples on Jetson, the following dependencies must be installed.

apt-get install -y --no-install-recommends \
        curl \
        pkg-config \
        python3 \
        python3-pip \
        python3-dev

pip3 install --upgrade wheel setuptools cython && \
pip3 install --upgrade grpcio-tools numpy==1.19.4 future attrdict

Note: OpenCV 4.1.1 is installed as a part of JetPack 4.6. It is one of the dependencies for the client build.

The Python wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.15.0-py3-none-manylinux2014_aarch64.whl[all]

On Jetson, the backend directory needs to be explicitly set with the --backend-directory flag. Triton also defaults to using TensorFlow 1.x and a version string is required to specify TensorFlow 2.x.

  tritonserver --model-repository=/path/to/model_repo --backend-directory=/path/to/tritonserver/backends \
         --backend-config=tensorflow,version=2

Published by dzier over 4 years ago

server - Release 2.14.0 corresponding to NGC container 21.09

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 2.14.0

Full-featured, Beta version of Business Logic Scripting (BLS) released.
Beta version of a basic Java client released. See https://github.com/triton-inference-server/client/tree/r21.09/src/java for a list of supported features.
A stack trace is now printed when Triton crashes to aid in debugging.
The Triton Client SDK wheel file is now available directly from PyPI for both Ubuntu and Windows.
The TensorRT backend is now an optional part of Triton just like all the other backends. The compose utility can be used to create a Triton container that does not contain the TensorRT backend.
Model Analyzer can profile with perf_analyzer's C-API.
Model Analyzer can use the CUDA Device Index in addition to the GPU UUID in the --gpus flag.

Known Issues

Triton’s TensorRT support depends on the input-consumed feature of TensorRT. In some rare cases using TensorRT 8.0 and earlier versions, the input-consumed event fires earlier than expected, causing Triton to overwrite input tensors while they are still in use and leading to corrupt input data being used for inference. This situation occurs when the inputs feed directly into a TensorRT layer that is optimized into a ForeignNode in the builder log. If you encounter accuracy issues with your TensorRT model, you can work around the issue by enabling the output_copy_stream option in your model’s configuration (https://github.com/triton-inference-server/common/blob/main/protobuf/model_config.proto#L816).
Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).
Triton metrics may not work if the host machine is running a separate DCGM agent, either on bare metal or in a container.
There is a known issue in TensorRT 8.0 regarding accuracy for a certain case of int8 inferencing on A40 and similar GPUs. The version of TF-TRT in TF2 21.09 includes a feature that works around this issue, but TF1 21.08 does not include that feature and therefore Triton users may experience the accuracy drop for a small subset of model/data type/batch size combinations on A40 when TF-TRT is used through the TF1 backend. This will be fixed in the next version of TensorRT.
Running a PyTorch TorchScript model using the PyTorch backend, where multiple instances of a model are configured can lead to a slowdown in model execution due to the following PyTorch issue: https://github.com/pytorch/pytorch/issues/27902

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.14.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04-based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer, and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.14.0-sdk-win.zip file.

Windows Support

An alpha release of Triton for Windows is provided in the attached file: tritonserver2.14.0-win.zip. This is an alpha release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

TensorRT models are supported. The TensorRT version is 8.0.1.6.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.8.1. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.2.
Only the GRPC endpoint is supported; HTTP/REST is not supported.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

The following components are required for this release and must be installed on the Windows system:

NVIDIA Driver release 470 or later.
CUDA 11.4.1
cuDNN 8.2.2.26
TensorRT 8.0.1.6

Jetson Jetpack Support

A release of Triton for JetPack 4.6 (https://developer.nvidia.com/embedded/jetpack) is provided in the attached tar file: tritonserver2.14.0-jetpack4.6.tgz.

This release supports TensorFlow 2.6.0, TensorFlow 1.15.5, TensorRT 8.0.1.6, and ONNX Runtime 1.8.1, as well as ensembles.
For the ONNX Runtime backend, the OpenVINO execution provider is not supported, but the TensorRT execution provider is supported.
System shared memory is supported on Jetson.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples.

Installation and Usage

The following dependencies must be installed before building / running Triton.

apt-get update && \
    apt-get install -y --no-install-recommends \
        software-properties-common \
        autoconf \
        automake \
        build-essential \
        cmake \
        git \
        libb64-dev \
        libre2-dev \
        libssl-dev \
        libtool \
        libboost-dev \
        libcurl4-openssl-dev \
        libopenblas-dev \
        rapidjson-dev \
        patchelf \
        zlib1g-dev

Note: When building Triton on Jetson, you will require a newer version of cmake. We recommend using cmake 3.21.0. Below is a script to upgrade your cmake version to 3.21.0. You can use cmake 3.18.4 if you are not enabling OnnxRuntime support.

apt remove cmake
wget https://cmake.org/files/v3.21/cmake-3.21.0.tar.gz
tar -xf cmake-3.21.0.tar.gz
(cd cmake-3.21.0 && ./configure && make install)

Note: Seeing a core dump when using numpy 1.19.5 on Jetson is a known issue. We recommend using numpy version 1.19.4 or earlier to work around this issue.

To run the clients, the following dependencies must be installed.

apt-get install -y --no-install-recommends \
        curl \
        libopencv-dev=3.2.0+dfsg-4ubuntu0.1 \
        libopencv-core-dev=3.2.0+dfsg-4ubuntu0.1 \
        pkg-config \
        python3 \
        python3-pip \
        python3-dev

pip3 install --upgrade wheel setuptools cython && \
pip3 install --upgrade grpcio-tools numpy==1.19.4 future attrdict

The Python wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.14.0-py3-none-linux_aarch64.whl[all]

On Jetson, the backend directory needs to be explicitly set with the --backend-directory flag. Triton also defaults to using TensorFlow 1.x and a version string is required to specify TensorFlow 2.x.

  tritonserver --model-repository=/path/to/model_repo --backend-directory=/path/to/tritonserver/backends \
         --backend-config=tensorflow,version=2

Published by dzier almost 5 years ago

server - Release 2.13.0 corresponding to NGC container 21.08

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 2.13.0

Initial Beta release for Business Logic Scripting, a new set of utility functions that allow the execution of inference requests on other models being served by Triton as part of executing a Python model.
Released a new Container Composition Utility, which can be used to create custom Triton containers with specific backends and repository agents.
Starting in 21.08, Triton will release two new containers on NGC.
- nvcr.io/nvidia/tritonserver:21.08-tf-python-py3 - GPU enabled Triton server with only the TensorFlow 2.x and Python backends.
- nvcr.io/nvidia/tritonserver:21.08-pyt-python-py3 - GPU enabled Triton server with only the PyTorch and Python backends.
Added Model Analyzer support for models with custom operations.

Known Issues

Loading models in ONNX Runtime on the Windows build of Triton may be slow due to the JIT compiler being invoked for newer CUDA architectures. For more information, refer to https://github.com/triton-inference-server/onnxruntime_backend/issues/58/
There is a known issue in TensorRT 8.0 regarding accuracy for a certain case of int8 inferencing on A40 and similar GPUs. The version of TF-TRT in TF2 21.08 includes a feature that works around this issue, but TF1 21.08 does not include that feature and therefore Triton users may experience the accuracy drop for a small subset of model/data type/batch size combinations on A40 when TF-TRT is used through the TF1 backend. This will be fixed in the next version of TensorRT.
Running a PyTorch TorchScript model using the PyTorch backend, where multiple instances of a model are configured can lead to a slowdown in model execution due to the following PyTorch issue: https://github.com/pytorch/pytorch/issues/27902
There are backwards-incompatible changes in the example Python client shared-memory support library when that library is used for tensors of type BYTES. The utils.serialize_byte_tensor() and utils.deserialize_byte_tensor() functions now return np.object_ numpy arrays where previously they returned np.bytes_ numpy arrays. Code depending on np.bytes_ must be updated. This change was necessary because the np.bytes_ type removes all trailing zeros from each array element and so binary sequences ending in zero(s) could not be represented with the old behavior. Correct usage of the Python client shared-memory support library is shown in https://github.com/triton-inference-server/server/blob/r21.03/src/clients/python/examples/simple_http_shm_string_client.py.

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.13.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04-based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer, and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.13.0-sdk-win.zip file.

Windows Support

An alpha release of Triton for Windows is provided in the attached file: tritonserver2.13.0-win.zip. This is an alpha release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

TensorRT models are supported. The TensorRT version is 8.0.1.6.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.8.1. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.2.
Only the GRPC endpoint is supported; HTTP/REST is not supported.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

The following components are required for this release and must be installed on the Windows system:

NVIDIA Driver release 470 or later.
CUDA 11.4.0
cuDNN 8.2.2.26
TensorRT 8.0.1.6

Jetson Jetpack Support

A release of Triton for JetPack 4.6 (https://developer.nvidia.com/embedded/jetpack) is provided in the attached file: tritonserver2.13.0-jetpack4.6.tgz.

This release supports TensorFlow 2.5.0, TensorFlow 1.15.5, TensorRT 8.0.1, and ONNX Runtime 1.8.1, as well as ensembles.
For the ONNX Runtime backend, the OpenVINO execution provider is not supported, but the TensorRT execution provider is supported.
System shared memory is supported on Jetson.
GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples.

Installation and Usage

The following dependencies must be installed before running Triton.

apt-get update && \
    apt-get install -y --no-install-recommends \
        software-properties-common \
        autoconf \
        automake \
        build-essential \
        cmake \
        git \
        libb64-dev \
        libre2-dev \
        libssl-dev \
        libtool \
        libboost-dev \
        libcurl4-openssl-dev \
        libopenblas-dev \
        rapidjson-dev \
        patchelf \
        zlib1g-dev

Note: When building Triton on Jetson, you will require a newer version of cmake. We recommend using cmake 3.18.4. Below is a script to upgrade your cmake version to 3.18.4.

apt remove cmake
wget https://cmake.org/files/v3.18/cmake-3.18.4.tar.gz
tar -xf cmake-3.18.4.tar.gz
(cd cmake-3.18.4 && ./configure && sudo make install)

Note: Seeing a core dump when using numpy 1.19.5 on Jetson is a known issue. We recommend using numpy version 1.19.4 or earlier to work around this issue.

To run the clients, the following dependencies must be installed.

apt-get install -y --no-install-recommends \
        curl \
        libopencv-dev=3.2.0+dfsg-4ubuntu0.1 \
        libopencv-core-dev=3.2.0+dfsg-4ubuntu0.1 \
        pkg-config \
        python3 \
        python3-pip \
        python3-dev

pip3 install --upgrade wheel setuptools cython && \
pip3 install --upgrade grpcio-tools numpy==1.19.4 future attrdict

The Python wheel for the Python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.13.0-py3-none-linux_aarch64.whl[all]

On Jetson, the backend directory needs to be explicitly set with the --backend-directory flag. Triton also defaults to using TensorFlow 1.x and a version string is required to specify TensorFlow 2.x.

  tritonserver --model-repository=/path/to/model_repo --backend-directory=/path/to/tritonserver/backends \
         --backend-config=tensorflow,version=2

Published by dzier almost 5 years ago

server - Release 2.12.0 corresponding to NGC container 21.07

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 2.12.0

Added support for CPU in RAPIDS FIL Backend.
Inference requests using the C API are now allowed to provide multiple copies of an input tensor in different memories. Triton will choose the most performant copy to use depending on where the inference request is executed.
For ONNX models using TensorRT acceleration, the tensorrt_accelerator option in the model configuration can now specify precision and workspace size. See https://github.com/triton-inference-server/server/blob/main/docs/optimization.md#onnx-with-tensorrt-optimization
Model Analyzer added an offline mode, which prioritizes throughput over latency for offline inferencing scenarios. A new set of reports and graphs are created to better analyze the offline use case.
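
For the TensorRT acceleration change listed above, a model configuration along the following lines is what the linked optimization guide describes. Treat it as a sketch: the parameter keys precision_mode and max_workspace_size_bytes and the example values are assumptions based on that guide and should be confirmed there for the release in use.

```
# config.pbtxt fragment (sketch) for an ONNX model: run through the
# TensorRT accelerator with FP16 precision and a 1 GiB workspace.
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      parameters { key: "precision_mode" value: "FP16" }
      parameters { key: "max_workspace_size_bytes" value: "1073741824" }
    } ]
  }
}
```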

Known Issues

The 21.07 release includes libsystemd and libudev versions that have a known vulnerability that was discovered late in our QA process. See CVE-2021-33910 for details. This will be fixed in the next release.
ONNX Runtime TRT support was removed due to incompatibility with TensorRT 8.0.
There is a known issue in TensorRT 8.0 regarding accuracy for a certain case of int8 inferencing on A40 and similar GPUs. The version of TF-TRT in TF2 21.07 includes a feature that works around this issue, but TF1 21.07 does not include that feature and therefore Triton users may experience the accuracy drop for a small subset of model/data type/batch size combinations on A40 when TF-TRT is used through the TF1 backend. This will be fixed in the next version of TensorRT.
Running a PyTorch TorchScript model using the PyTorch backend, where multiple instances of a model are configured can lead to a slowdown in model execution due to the following PyTorch issue: https://github.com/pytorch/pytorch/issues/27902
There are backwards-incompatible changes in the example Python client shared-memory support library when that library is used for tensors of type BYTES. The utils.serialize_byte_tensor() and utils.deserialize_byte_tensor() functions now return np.object_ numpy arrays where previously they returned np.bytes_ numpy arrays. Code depending on np.bytes_ must be updated. This change was necessary because the np.bytes_ type removes all trailing zeros from each array element and so binary sequences ending in zero(s) could not be represented with the old behavior. Correct usage of the Python client shared-memory support library is shown in https://github.com/triton-inference-server/server/blob/r21.03/src/clients/python/examples/simple_http_shm_string_client.py.

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.12.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.12.0-sdk-win.zip file.

Windows Support

An alpha release of Triton for Windows is provided in the attached file: tritonserver2.12.0-win.zip. This is an alpha release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

TensorRT models are supported. The TensorRT version is 7.2.2.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.8.0. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.2.
Only the GRPC endpoint is supported; HTTP/REST is not supported.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

The following components are required for this release and must be installed on the Windows system:

NVIDIA Driver release 455 or later.
CUDA 11.1.1
cuDNN 8.0.5
TensorRT 7.2.2

Jetson Jetpack Support

A release of Triton for JetPack 4.6 (https://developer.nvidia.com/embedded/jetpack) is provided in the attached file: tritonserver2.12.0-jetpack4.6.tgz. This release supports TensorFlow 2.5.0, TensorFlow 1.15.5, TensorRT 8.0.1, and ONNX Runtime 1.8.0, as well as ensembles. For the ONNX Runtime backend, the TensorRT and OpenVINO execution providers are not supported. System shared memory is supported on Jetson. GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples.

Installation and Usage

The following dependencies must be installed before running Triton.

apt-get update && \
    apt-get install -y --no-install-recommends \
        software-properties-common \
        autoconf \
        automake \
        build-essential \
        cmake \
        git \
        libb64-dev \
        libre2-dev \
        libssl-dev \
        libtool \
        libboost-dev \
        libcurl4-openssl-dev \
        libopenblas-dev \
        rapidjson-dev \
        patchelf \
        zlib1g-dev

Note: When building Triton on Jetson, you will require a newer version of cmake. We recommend using cmake 3.18.4. Below is a script to upgrade your cmake version to 3.18.4.

apt remove cmake
wget https://cmake.org/files/v3.18/cmake-3.18.4.tar.gz
tar -xf cmake-3.18.4.tar.gz
(cd cmake-3.18.4 && ./configure && sudo make install)

Note: Seeing a core dump when using numpy 1.19.5 on Jetson is a known issue. We recommend using numpy version 1.19.4 or earlier to work around this issue.

To run the clients the following dependencies must be installed.

apt-get install -y --no-install-recommends \
        curl \
        libopencv-dev=3.2.0+dfsg-4ubuntu0.1 \
        libopencv-core-dev=3.2.0+dfsg-4ubuntu0.1 \
        pkg-config \
        python3 \
        python3-pip \
        python3-dev

pip3 install --upgrade wheel setuptools cython && \
pip3 install --upgrade grpcio-tools numpy==1.19.4 future attrdict

The Python wheel for the python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.12.0-py3-none-linux_aarch64.whl[all]

On Jetson, the backend directory needs to be explicitly set with the --backend-directory flag. Triton also defaults to using TensorFlow 1.x and a version string is required to specify TensorFlow 2.x.

  tritonserver --model-repository=/path/to/model_repo --backend-directory=/path/to/tritonserver/backends \
         --backend-config=tensorflow,version=2

- Python
Published by dzier almost 5 years ago

server - Release 2.11.0 corresponding to NGC container 21.06

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 2.11.0

The Forest Inference Library (FIL) backend is added to Triton. The FIL backend allows forest models trained by several popular machine learning frameworks (including XGBoost, LightGBM, Scikit-Learn, and cuML) to be deployed in Triton.
The Windows version of Triton now includes the OpenVINO backend.
The Performance Analyzer (perf_analyzer) now supports testing against the Triton C API.
The Python backend now allows the use of conda to create a unique execution environment for your Python model. See https://github.com/triton-inference-server/python_backend#using-custom-python-execution-environments.
Python models that crash or exit unexpectedly are now automatically restarted by Triton.
Model repositories in S3 storage can now be accessed using HTTPS protocol. See https://github.com/triton-inference-server/server/blob/main/docs/model_repository.md#s3 for more information.
Triton now collects GPU metrics for MIG partitions.
Passive model instances can now be specified in the model configuration. A passive model instance will be loaded and initialized by Triton, but no inference requests will be sent to the instance. Passive instances are typically used by a custom backend that uses its own mechanisms to distribute work to the passive instances. See the ModelInstanceGroup section of model_config.proto for the setting.
NVDLA support is added to the TensorRT backend.
ONNX Runtime version updated to 1.8.0.
Windows build documentation simplified and improved.
Improved detailed and summary reports in Model Analyzer.
Added an offline mode to Model Analyzer.
The DALI backend now accepts GPU inputs.
The DALI backend added support for dynamic batching and ragged inputs.
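
The passive model instance item above can be pictured as a model configuration fragment. The sketch below only renders an instance_group block as text. render_instance_group is a hypothetical helper, and the field name passive is an assumption based on the ModelInstanceGroup message in model_config.proto, so check it against the proto shipped with your release.

```python
def render_instance_group(count, kind, passive=False):
    """Render an instance_group block for config.pbtxt (hypothetical helper)."""
    lines = [
        "instance_group [",
        "  {",
        f"    count: {count}",
        f"    kind: {kind}",
    ]
    if passive:
        # Assumed field name: Triton loads the instance but sends it no requests.
        lines.append("    passive: true")
    lines += ["  }", "]"]
    return "\n".join(lines)


fragment = render_instance_group(2, "KIND_GPU", passive=True)
print(fragment)
```

The rendered text would be pasted into the model's config.pbtxt. A backend that distributes its own work is then responsible for using the passive instances.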

Known Issues

There are backwards-incompatible changes in the example Python client shared-memory support library when that library is used for tensors of type BYTES. The utils.serialize_byte_tensor() and utils.deserialize_bytes_tensor() functions now return np.object_ numpy arrays where previously they returned np.bytes_ numpy arrays. Code depending on np.bytes_ must be updated. This change was necessary because the np.bytes_ type removes all trailing zeros from each array element, so binary sequences ending in zero(s) could not be represented with the old behavior. Correct usage of the Python client shared-memory support library is shown in https://github.com/triton-inference-server/server/blob/r21.03/src/clients/python/examples/simple_http_shm_string_client.py.
The 21.06 release of Triton was built against the wrong commit of the FIL backend code, causing an incompatible version of RAPIDS to be used instead of the intended RAPIDS 21.06 stable release. This issue is fixed in the new 21.06.1 container released on NGC. Although the Triton server itself and other integrated backends will work, the FIL backend will not work in the 21.06 Triton container.

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.11.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.11.0-sdk-win.zip file.

Windows Support

An alpha release of Triton for Windows is provided in the attached file: tritonserver2.11.0-win.zip. This is an alpha release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

TensorRT models are supported. The TensorRT version is 7.2.2.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.8.0. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
OpenVINO models are supported. The OpenVINO version is 2021.2.
Only the GRPC endpoint is supported; HTTP/REST is not supported.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

The following components are required for this release and must be installed on the Windows system:

NVIDIA Driver release 455 or later.
CUDA 11.1.1
cuDNN 8.0.5
TensorRT 7.2.2

Jetson Jetpack Support

A release of Triton for JetPack 4.5 (https://developer.nvidia.com/embedded/jetpack) is provided in the attached file: tritonserver2.11.0-jetpack4.5.tgz. This release supports TensorFlow 2.4.0, TensorFlow 1.15.5, TensorRT 7.1, and ONNX Runtime 1.8.0, as well as ensembles. For the ONNX Runtime backend, the TensorRT execution provider is supported but the OpenVINO execution provider is not supported. System shared memory is supported on Jetson. GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples.

Installation and Usage

The following dependencies must be installed before running Triton.

apt-get update && \
    apt-get install -y --no-install-recommends \
        software-properties-common \
        autoconf \
        automake \
        build-essential \
        cmake \
        git \
        libb64-dev \
        libre2-dev \
        libssl-dev \
        libtool \
        libboost-dev \
        libcurl4-openssl-dev \
        libopenblas-dev \
        rapidjson-dev \
        patchelf \
        zlib1g-dev

To run the clients the following dependencies must be installed.

apt-get install -y --no-install-recommends \
        curl \
        libopencv-dev=3.2.0+dfsg-4ubuntu0.1 \
        libopencv-core-dev=3.2.0+dfsg-4ubuntu0.1 \
        pkg-config \
        python3 \
        python3-pip \
        python3-dev

pip3 install --upgrade wheel setuptools cython && \
pip3 install --upgrade grpcio-tools numpy future attrdict

The Python wheel for the python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.11.0-py3-none-linux_aarch64.whl[all]

On Jetson, the backend directory needs to be explicitly set with the --backend-directory flag. Triton also defaults to using TensorFlow 1.x and a version string is required to specify TensorFlow 2.x.

  tritonserver --model-repository=/path/to/model_repo --backend-directory=/path/to/tritonserver/backends \
         --backend-config=tensorflow,version=2

- Python
Published by dzier about 5 years ago

server - Release 2.10.0 corresponding to NGC container 21.05

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 2.10.0

Triton on Jetson now supports ONNX via the ONNX Runtime backend.
The Triton server and HTTP clients (Python and C++) now support compression.
Ragged batching is now supported for ONNX models.
The Triton clients have moved to a separate repo: https://github.com/triton-inference-server/client
Trace now correctly reports all timestamps for all backends.
NVTX annotations are fixed.
The legacy custom backend support is removed. All custom backends must be implemented using the TRITONBACKEND API described here: https://github.com/triton-inference-server/backend
Added CLI subcommands in Model Analyzer for profile, analyze, and report. See CLI documentation for usage instructions.
- This is a breaking change and requires updating Model Analyzer config files and CLI flags. See Configuring Model Analyzer and Quick Start for more information.
Model Analyzer can create a detailed report of any specific model configuration with the report subcommand.
CPU only mode is now supported in Model Analyzer.
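
The HTTP compression item above can be sketched with the standard library alone. The example builds, but does not send, a gzip-compressed inference request. The server address, the model name my_model, and the input name INPUT0 are hypothetical. The sketch also assumes that the server honors the standard Content-Encoding and Accept-Encoding headers. In practice the Triton HTTP client libraries handle compression for you.

```python
import gzip
import json
from urllib.request import Request


def build_compressed_infer_request(base_url, model_name, payload):
    """Build a POST request whose JSON body is gzip-compressed."""
    body = gzip.compress(json.dumps(payload).encode("utf-8"))
    headers = {
        "Content-Type": "application/json",
        "Content-Encoding": "gzip",  # the request body is compressed
        "Accept-Encoding": "gzip",   # ask for a compressed response
    }
    url = f"{base_url}/v2/models/{model_name}/infer"
    return Request(url, data=body, headers=headers, method="POST")


# Hypothetical model with a single FP32 input of shape [1, 4].
payload = {
    "inputs": [
        {
            "name": "INPUT0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [1.0, 2.0, 3.0, 4.0],
        }
    ]
}

request = build_compressed_infer_request("http://localhost:8000", "my_model", payload)
print(request.get_method(), request.full_url)
print(len(request.data), "compressed bytes")
```

Compression trades CPU time for bandwidth, so it tends to pay off for large tensors sent over slow links rather than for small requests on a local network.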

Known Issues

There are backwards-incompatible changes in the example Python client shared-memory support library when that library is used for tensors of type BYTES. The utils.serialize_byte_tensor() and utils.deserialize_bytes_tensor() functions now return np.object_ numpy arrays where previously they returned np.bytes_ numpy arrays. Code depending on np.bytes_ must be updated. This change was necessary because the np.bytes_ type removes all trailing zeros from each array element, so binary sequences ending in zero(s) could not be represented with the old behavior. Correct usage of the Python client shared-memory support library is shown in https://github.com/triton-inference-server/server/blob/r21.03/src/clients/python/examples/simple_http_shm_string_client.py.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
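
The GKE guidance above can be turned into a quick pre-deployment check. This is a minimal sketch under stated assumptions: the affected range is inferred from the note as anything newer than 1.13.x and older than 1.14.6, version strings look like "1.14.3" or "1.14.3-gke.1", and both function names are hypothetical.

```python
def parse_version(text):
    """Turn '1.14.3' or '1.14.3-gke.1' into a (major, minor, patch) tuple."""
    core = text.split("-", 1)[0]  # drop any '-gke.N' style suffix
    parts = [int(piece) for piece in core.split(".")]
    while len(parts) < 3:
        parts.append(0)  # treat '1.14' as '1.14.0'
    return tuple(parts[:3])


def gke_version_is_safe(text):
    """Return True for GKE 1.13 or earlier, or GKE 1.14.6 or later."""
    version = parse_version(text)
    return version[:2] <= (1, 13) or version >= (1, 14, 6)


for candidate in ["1.13.11", "1.14.3", "1.14.6", "1.15.0-gke.2"]:
    status = "ok" if gke_version_is_safe(candidate) else "affected"
    print(candidate, status)
```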

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.10.0_ubuntu2004.clients.tar.gz file. The SDK is also available as an Ubuntu 20.04 based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer and Model Analyzer. Some components are also available in the tritonclient pip package. See Getting the Client Libraries for more information on each of these options.

For Windows, the client libraries and some examples are available in the attached tritonserver2.10.0-sdk-win.zip file.

Windows Support

An alpha release of Triton for Windows is provided in the attached file: tritonserver2.10.0-win.zip. This is an alpha release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

TensorRT models are supported. The TensorRT version is 7.2.2.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.7.1. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
Only the GRPC endpoint is supported; HTTP/REST is not supported.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

The following components are required for this release and must be installed on the Windows system:

NVIDIA Driver release 455 or later.
CUDA 11.1.1
cuDNN 8.0.5
TensorRT 7.2.2

Jetson Jetpack Support

A release of Triton for JetPack 4.5 (https://developer.nvidia.com/embedded/jetpack) is provided in the attached file: tritonserver2.10.0-jetpack4.5.tgz. This release supports TensorFlow 2.4.0, TensorFlow 1.15.5, TensorRT 7.1, and ONNX Runtime 1.7.1, as well as ensembles. For the ONNX Runtime backend, the TensorRT execution provider is supported but the OpenVINO execution provider is not supported. System shared memory is supported on Jetson. GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples.

Installation and Usage

The following dependencies must be installed before running Triton.

apt-get update && \
    apt-get install -y --no-install-recommends \
        software-properties-common \
        autoconf \
        automake \
        build-essential \
        cmake \
        git \
        libb64-dev \
        libre2-dev \
        libssl-dev \
        libtool \
        libboost-dev \
        libcurl4-openssl-dev \
        libopenblas-dev \
        rapidjson-dev \
        patchelf \
        zlib1g-dev

To run the clients the following dependencies must be installed.

apt-get install -y --no-install-recommends \
        curl \
        libopencv-dev=3.2.0+dfsg-4ubuntu0.1 \
        libopencv-core-dev=3.2.0+dfsg-4ubuntu0.1 \
        pkg-config \
        python3 \
        python3-pip \
        python3-dev

pip3 install --upgrade wheel setuptools cython && \
pip3 install --upgrade grpcio-tools numpy future attrdict

The Python wheel for the python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.10.0-py3-none-linux_aarch64.whl[all]

On Jetson, the backend directory needs to be explicitly set with the --backend-directory flag. Triton also defaults to using TensorFlow 1.x and a version string is required to specify TensorFlow 2.x.

  tritonserver --model-repository=/path/to/model_repo --backend-directory=/path/to/tritonserver/backends \
         --backend-config=tensorflow,version=2

- Python
Published by dzier about 5 years ago

server - Release 2.9.0 corresponding to NGC container 21.04

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 2.9.0

Python backend performance has been increased significantly.
ONNX Runtime updated to version 1.7.1.
Triton Server is now available as a GKE Marketplace Application, see https://github.com/triton-inference-server/server/tree/master/deploy/gke-marketplace-app.
The GRPC client libraries now allow compression to be enabled.
Ragged batching is now supported for TensorFlow models.
For TensorFlow models represented with SavedModel format, it is now possible to choose which graph and signature_def to load. See https://github.com/triton-inference-server/tensorflow_backend/tree/r21.04#parameters.
A Helm Chart example is added for AWS. See https://github.com/triton-inference-server/server/tree/master/deploy/aws.
The Model Control API is enhanced to provide an option when unloading an ensemble model. The option allows all contained models to be unloaded as part of unloading the ensemble. See https://github.com/triton-inference-server/server/blob/master/docs/protocol/extension_model_repository.md#model-repository-extension.
Model reloading using the Model Control API previously resulted in the model being unavailable for a short period of time. This is now fixed so that the model remains available during reloading.
Latency statistics and metrics for TensorRT models are fixed. Previously the sum of the "compute input", "compute infer" and "compute output" times accurately indicated the entire compute time but the total time could be incorrectly attributed across the three components. This incorrect attribution is now fixed and all values are now accurate.
Error reporting is improved for the Azure, S3 and GCS cloud file system support.
Fix trace support for ensembles. The models contained within an ensemble are now traced correctly.
Model Analyzer improvements
- Summary report now includes GPU Power usage
- Model Analyzer will find the top N model configurations across multiple models.
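
As an illustration of the ensemble unload option in the Model Control API item above, the sketch below builds, but does not send, a repository unload request. It assumes the option is exposed as a boolean parameter named unload_dependents, as we understand the model repository extension document linked above. The server address and the model name my_ensemble are hypothetical.

```python
import json
from urllib.request import Request


def build_unload_request(base_url, model_name, unload_dependents=False):
    """Build a POST request that asks the server to unload a model."""
    # Assumed parameter name: when true, models contained in the ensemble
    # are unloaded together with the ensemble itself.
    body = {"parameters": {"unload_dependents": unload_dependents}}
    return Request(
        f"{base_url}/v2/repository/models/{model_name}/unload",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_unload_request(
    "http://localhost:8000", "my_ensemble", unload_dependents=True
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would send it to a running server started with explicit model control enabled.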

Known Issues

There are backwards-incompatible changes in the example Python client shared-memory support library when that library is used for tensors of type BYTES. The utils.serialize_byte_tensor() and utils.deserialize_bytes_tensor() functions now return np.object_ numpy arrays where previously they returned np.bytes_ numpy arrays. Code depending on np.bytes_ must be updated. This change was necessary because the np.bytes_ type removes all trailing zeros from each array element, so binary sequences ending in zero(s) could not be represented with the old behavior. Correct usage of the Python client shared-memory support library is shown in https://github.com/triton-inference-server/server/blob/r21.03/src/clients/python/examples/simple_http_shm_string_client.py.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.9.0_ubuntu2004.clients.tar.gz file. See Getting the Client Libraries for more information on the client libraries and examples. The client SDK is also available as an NGC Container.

Windows Support

An alpha release of Triton for Windows is provided in the attached file: tritonserver2.8.0-win.zip. This is an alpha release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

TensorRT models are supported. The TensorRT version is 7.2.2.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.6.0. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
Only the GRPC endpoint is supported; HTTP/REST is not supported.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

The following components are required for this release and must be installed on the Windows system:

NVIDIA Driver release 455 or later.
CUDA 11.1.1
cuDNN 8.0.5
TensorRT 7.2.2

Jetson Jetpack Support

A release of Triton for JetPack 4.5 (https://developer.nvidia.com/embedded/jetpack) is provided in the attached file: tritonserver2.9.0-jetpack4.5.tgz. This release supports the TensorFlow 2.4.0, TensorFlow 1.15.5, TensorRT 7.1, and Custom backends as well as ensembles. System shared memory is supported on Jetson. GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples.

Installation and Usage

The following dependencies must be installed before running Triton.

apt-get update && \
    apt-get install -y --no-install-recommends \
        software-properties-common \
        autoconf \
        automake \
        build-essential \
        cmake \
        git \
        libb64-dev \
        libre2-dev \
        libssl-dev \
        libtool \
        libboost-dev \
        libcurl4-openssl-dev \
        rapidjson-dev \
        patchelf \
        zlib1g-dev

To run the clients the following dependencies must be installed.

apt-get install -y --no-install-recommends \
        curl \
        libopencv-dev=3.2.0+dfsg-4ubuntu0.1 \
        libopencv-core-dev=3.2.0+dfsg-4ubuntu0.1 \
        pkg-config \
        python3 \
        python3-pip \
        python3-dev

pip3 install --upgrade wheel setuptools cython && \
pip3 install --upgrade grpcio-tools numpy future

The Python wheel for the python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.9.0-py3-none-linux_aarch64.whl[all]

On Jetson, the backend directory needs to be explicitly set with the --backend-directory flag. Triton also defaults to using TensorFlow 1.x and a version string is required to specify TensorFlow 2.x.

  tritonserver --model-repository=/path/to/model_repo --backend-directory=/path/to/tritonserver/backends \
         --backend-config=tensorflow,version=2

- Python
Published by dzier about 5 years ago

server - Release 2.8.0 corresponding to NGC container 21.03

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 2.8.0

The Windows build now supports ONNX models.
Repository agent is a new extensibility C API added to Triton that allows implementation of custom authentication, decryption, conversion, or similar operations when a model is loaded. See https://github.com/triton-inference-server/server/blob/master/docs/repository_agents.md
An OpenVINO backend is added to Triton to enable the execution of OpenVINO models on CPUs. https://github.com/triton-inference-server/openvino_backend
The PyTorch backend is now maintained in its own repository: https://github.com/triton-inference-server/pytorch_backend
The ONNX Runtime backend is now maintained in its own repository: https://github.com/triton-inference-server/onnxruntime_backend
The Jetson release of Triton now supports the system shared-memory protocol between clients and the Triton server.
SSL/TLS Mutual Authentication support is added to the GRPC client library.
A new Model Configuration option, "gather_kernel_buffer_threshold", can be specified to instruct Triton to use a CUDA kernel to gather input buffers onto the GPU. Using this option can improve inference performance for some models.
The Python client libraries have been improved to more efficiently create numpy arrays for input and output tensors.
The client libraries examples have been improved to more clearly describe how string and byte-blob tensors are supported by the Python Client API. https://github.com/triton-inference-server/server/blob/master/docs/client_examples.md
Ubuntu 20.04 with February 2021 updates.

Known Issues

There are backwards-incompatible changes in the example Python client shared-memory support library when that library is used for tensors of type BYTES. The utils.serialize_byte_tensor() and utils.deserialize_bytes_tensor() functions now return np.object_ numpy arrays where previously they returned np.bytes_ numpy arrays. Code depending on np.bytes_ must be updated. This change was necessary because the np.bytes_ type removes all trailing zeros from each array element, so binary sequences ending in zero(s) could not be represented with the old behavior. Correct usage of the Python client shared-memory support library is shown in https://github.com/triton-inference-server/server/blob/r21.03/src/clients/python/examples/simple_http_shm_string_client.py.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.8.0_ubuntu2004.clients.tar.gz file. See Getting the Client Libraries for more information on the client libraries and examples. The client SDK is also available as an NGC Container.

Windows Support

An alpha release of Triton for Windows is provided in the attached file: tritonserver2.8.0-win.zip. This is an alpha release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

TensorRT models are supported. The TensorRT version is 7.2.2.
ONNX models are supported by the ONNX Runtime backend. The ONNX Runtime version is 1.6.0. The CPU, CUDA, and TensorRT execution providers are supported. The OpenVINO execution provider is not supported.
Only the GRPC endpoint is supported; HTTP/REST is not supported.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

The following components are required for this release and must be installed on the Windows system:

NVIDIA Driver release 455 or later.
CUDA 11.1.1
cuDNN 8.0.5
TensorRT 7.2.2

Jetson Jetpack Support

A release of Triton for JetPack 4.5 (https://developer.nvidia.com/embedded/jetpack) is provided in the attached file: tritonserver2.8.0-jetpack4.5.tgz. This release supports the TensorFlow 2.4.0, TensorFlow 1.15.5, TensorRT 7.1, and Custom backends as well as ensembles. System shared memory is supported on Jetson. GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples.

Installation and Usage

The following dependencies must be installed before running Triton.

apt-get update && \
    apt-get install -y --no-install-recommends \
        software-properties-common \
        autoconf \
        automake \
        build-essential \
        cmake \
        git \
        libb64-dev \
        libre2-dev \
        libssl-dev \
        libtool \
        libboost-dev \
        libcurl4-openssl-dev \
        rapidjson-dev \
        patchelf \
        zlib1g-dev

To run the clients the following dependencies must be installed.

apt-get install -y --no-install-recommends \
        curl \
        libopencv-dev=3.2.0+dfsg-4ubuntu0.1 \
        libopencv-core-dev=3.2.0+dfsg-4ubuntu0.1 \
        pkg-config \
        python3 \
        python3-pip \
        python3-dev

pip3 install --upgrade wheel setuptools cython && \
pip3 install --upgrade grpcio-tools numpy future

The Python wheel for the python client library is present in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.8.0-py3-none-linux_aarch64.whl[all]

On Jetson, the backend directory needs to be explicitly set with the --backend-directory flag. Triton also defaults to using TensorFlow 1.x and a version string is required to specify TensorFlow 2.x.

  tritonserver --model-repository=/path/to/model_repo --backend-directory=/path/to/tritonserver/backends \
         --backend-config=tensorflow,version=2

- Python
Published by deadeyegoodwin over 5 years ago

server - Release 2.7.0 corresponding to NGC container 21.02

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 2.7.0

Fix bug in TensorRT backend that could, in rare cases, lead to corruption of output tensors.
Fix performance issue in the HTTP/REST client that occurred when the client does not explicitly request specific outputs. In this case all outputs are now returned as binary data where previously they were returned as JSON.
Add an example Java and Scala client based on GRPC-generated API.
Extend perf_analyzer to be able to work with TFServing and TorchServe.
The legacy custom backend API is deprecated and will be removed in a future release. The Triton Backend API should be used as the API for custom backends. The Triton Backend API remains fully supported and that support will continue indefinitely.
Model Analyzer parameters and test model configurations can be specified with a YAML configuration file.
Model Analyzer will report performance metrics for end-to-end latency and CPU memory usage.
Refer to the 21.02 column of the Frameworks Support Matrix for container image versions that the 21.02 inference server container is based on.
Ubuntu 20.04 with January 2021 updates.
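
The HTTP/REST client fix listed above concerns how outputs are encoded when none are requested explicitly. A client that cares about the encoding can state it in the request. The following is a rough sketch of the JSON part of such a request, assuming the per-output `binary_data` parameter and the request-level `binary_data_output` parameter from Triton's binary tensor data extension. The function name and the tensor names `INPUT0` and `OUTPUT0` are hypothetical, and the sketch only builds the body; it does not send it.

```python
import json


def build_infer_request(input_name, values, output_names=None, binary=True):
    """Build the JSON body of a v2 HTTP/REST inference request (sketch).

    If output_names is given, each output is requested explicitly with a
    per-output "binary_data" parameter. If it is None, no outputs are listed
    and a request-level "binary_data_output" parameter states the preference.
    """
    body = {
        "inputs": [
            {
                "name": input_name,
                "shape": [1, len(values)],
                "datatype": "FP32",
                "data": list(values),
            }
        ]
    }
    if output_names is None:
        body["parameters"] = {"binary_data_output": binary}
    else:
        body["outputs"] = [
            {"name": name, "parameters": {"binary_data": binary}}
            for name in output_names
        ]
    return body


if __name__ == "__main__":
    request = build_infer_request("INPUT0", [1.0, 2.0], output_names=["OUTPUT0"])
    print(json.dumps(request, indent=2))
```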

Known Issues

TensorRT reformat-free I/O is not supported.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
Observed memory leak in gRPC client library. Suggested workaround: restart client processes periodically or minimize creation of new InferenceServerGrpcClient objects. For more details on the issue in gRPC, please reference: https://github.com/triton-inference-server/server/issues/2517. The memory leak is fixed on master branch by https://github.com/triton-inference-server/server/pull/2533 and the fix will be included in the 21.03 release. If required, the change can be applied to the 21.02 branch and the client library can be rebuilt: https://github.com/triton-inference-server/server/blob/master/docs/client_libraries.md.
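
One way to follow the suggestion to minimize creation of new client objects is to create a client once per server address and reuse it. The sketch below is illustrative only: `ClientCache` is a hypothetical helper, and `factory` stands in for whatever constructs the real gRPC client in the application.

```python
import threading


class ClientCache:
    """Reuse one client object per server URL instead of creating new ones.

    `factory` is any callable that takes a URL and returns a client; in a real
    application it would wrap the gRPC client constructor (an assumption here).
    """

    def __init__(self, factory):
        self._factory = factory
        self._clients = {}
        self._lock = threading.Lock()

    def get(self, url):
        # Create the client on first use, then hand back the same object.
        with self._lock:
            if url not in self._clients:
                self._clients[url] = self._factory(url)
            return self._clients[url]
```

Callers ask the cache for a client on every request, so the number of client objects created stays bounded by the number of distinct server addresses.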

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.7.0_ubuntu2004.clients.tar.gz file. See Getting the Client Libraries for more information on the client libraries and examples. The client SDK is also available as an NGC Container.

Windows Support

An alpha release of Triton for Windows is provided in the attached file: tritonserver2.7.0-win.zip. This is an alpha release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

Only TensorRT models are supported. The TensorRT version is 7.2.2.
Only the GRPC endpoint is supported; HTTP/REST is not supported.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

The following components are required for this release and must be installed on the Windows system:

NVIDIA Driver release 455 or later.
CUDA 11.1.1
cuDNN 8.0.5
TensorRT 7.2.2

Jetson Jetpack Support

A release of Triton for JetPack 4.5 (https://developer.nvidia.com/embedded/jetpack) is provided in the attached file: tritonserver2.7.0-jetpack4.5.tgz. This release supports the TensorFlow 2.4.0, TensorFlow 1.15.5, TensorRT 7.1, and Custom backends as well as ensembles. System shared memory is supported on Jetson. GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples.

Installation and Usage

The following dependencies must be installed before running Triton.

apt-get update && \
    apt-get install -y --no-install-recommends \
        software-properties-common \
        autoconf \
        automake \
        build-essential \
        cmake \
        git \
        libb64-dev \
        libre2-dev \
        libssl-dev \
        libtool \
        libboost-dev \
        libcurl4-openssl-dev \
        rapidjson-dev \
        patchelf \
        zlib1g-dev

To run the clients the following dependencies must be installed.

apt-get install -y --no-install-recommends \
        curl \
        libopencv-dev=3.2.0+dfsg-4ubuntu0.1 \
        libopencv-core-dev=3.2.0+dfsg-4ubuntu0.1 \
        pkg-config \
        python3 \
        python3-pip \
        python3-dev

pip3 install --upgrade wheel setuptools cython && \
pip3 install --upgrade grpcio-tools numpy future

The Python wheel for the Python client library is included in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.7.0-py3-none-linux_aarch64.whl[all]

On Jetson, the backend directory must be set explicitly with the --backend-directory flag. Triton also defaults to TensorFlow 1.x, so a version string is required to select TensorFlow 2.x.

  tritonserver --model-repository=/path/to/model_repo --backend-directory=/path/to/tritonserver/backends \
         --backend-config=tensorflow,version=2

- Python
Published by dzier over 5 years ago

server - Release 2.6.0 corresponding to NGC container 20.12

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 2.6.0

An alpha release of Triton for Windows is included in this release. See below for more details.
Due to interactions with Ubuntu 20.04, the ONNX Runtime's OpenVINO execution provider is disabled in this release. OpenVINO support will be re-enabled in a subsequent release.
The Triton *-py3-clientsdk container has been renamed to *-py3-sdk and now contains the Model Analyzer as well as the client libraries and examples.
The PyTorch backend has been moved to a separate repository: https://github.com/triton-inference-server/pytorch_backend. As a result, it is now easy to add or remove it from Triton without requiring a rebuild: https://github.com/triton-inference-server/server/blob/master/docs/compose.md.
Initial release of the Model Analyzer tool in the Triton SDK container and the PIP package, nvidia-triton-model-analyzer, in the NVIDIA Py Index.
Refer to the 20.12 column of the Frameworks Support Matrix for container image versions that the 20.12 inference server container is based on.
Ubuntu 20.04 with September 2020 updates.

Known Issues

TensorRT reformat-free I/O is not supported.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
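
The GKE version guidance above can be turned into a quick pre-deployment check. This is a sketch under stated assumptions: the function names are hypothetical, version strings are assumed to look like `1.14.6` or `v1.14.10-gke.17`, and every version between the two recommended ranges is treated as potentially affected, which is a conservative reading of the note.

```python
def parse_version(text):
    """Turn '1.14.6' or 'v1.14.10-gke.17' into a tuple of ints, e.g. (1, 14, 6)."""
    core = text.lstrip("v").split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def gke_version_ok(text):
    """Return True if the version is in a range the note recommends.

    Recommended: 1.13 or earlier, or 1.14.6 or later. Everything in between
    is treated as potentially affected.
    """
    version = parse_version(text)
    return version[:2] <= (1, 13) or version >= (1, 14, 6)


if __name__ == "__main__":
    for candidate in ("1.13.12", "1.14.3", "1.14.6", "v1.14.10-gke.17"):
        print(candidate, gke_version_ok(candidate))
```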

Client Libraries and Examples

Ubuntu 20.04 builds of the client libraries and examples are included in this release in the attached v2.6.0_ubuntu2004.clients.tar.gz file. See Getting the Client Libraries for more information on the client libraries and examples. The client SDK is also available as an NGC Container.

Windows Support

An alpha release of Triton for Windows is provided in the attached file: tritonserver2.6.0-win.zip. This is an alpha release so functionality is limited and performance is not optimized. Additional features and improved performance will be provided in future releases. Specifically in this release:

Only TensorRT models are supported. The TensorRT version is 7.2.2.
Only the GRPC endpoint is supported; HTTP/REST is not supported.
Prometheus metrics endpoint is not supported.
System and CUDA shared memory are not supported.

The following components are required for this release and must be installed on the Windows system:

NVIDIA Driver release 455 or later.
CUDA 11.1.1
cuDNN 8.0.5
TensorRT 7.2.2

Jetson Jetpack Support

A release of Triton for JetPack 4.4 (https://developer.nvidia.com/embedded/jetpack) is provided in the attached file: tritonserver2.6.0-jetpack4.4.tgz. This release supports the TensorFlow 2.3.1, TensorFlow 1.15.4, TensorRT 7.1, and Custom backends as well as ensembles. GPU metrics, GCS storage, S3 storage and Azure storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples.

Installation and Usage

The following dependencies must be installed before running Triton.

apt-get update && \
    apt-get install -y --no-install-recommends \
        software-properties-common \
        autoconf \
        automake \
        build-essential \
        cmake \
        git \
        libb64-dev \
        libre2-dev \
        libssl-dev \
        libtool \
        libboost-dev \
        libcurl4-openssl-dev \
        rapidjson-dev \
        patchelf \
        zlib1g-dev

To run the clients the following dependencies must be installed.

apt-get install -y --no-install-recommends \
        curl \
        libopencv-dev=3.2.0+dfsg-4ubuntu0.1 \
        libopencv-core-dev=3.2.0+dfsg-4ubuntu0.1 \
        pkg-config \
        python3 \
        python3-pip \
        python3-dev

python3 -m pip install --upgrade wheel setuptools
python3 -m pip install --upgrade grpcio-tools numpy pillow

The Python wheel for the Python client library is included in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.6.0-py3-none-linux_aarch64.whl[all]

On Jetson, the backend directory must be set explicitly with the --backend-directory flag. Triton also defaults to TensorFlow 1.x, so a version string is required to select TensorFlow 2.x.

  tritonserver --model-repository=/path/to/model_repo --backend-directory=/path/to/tritonserver/backends \
         --backend-config=tensorflow,version=2

- Python
Published by dzier over 5 years ago

server - Release 2.5.0 corresponding to NGC container 20.11

NVIDIA Triton Inference Server

The NVIDIA Triton Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 2.5.0

ONNX Runtime backend updated to use ONNX Runtime 1.5.3.
The PyTorch backend is moved to a dedicated repo triton-inference-server/pytorch_backend.
The Caffe2 backend is removed. Caffe2 models are no longer supported.
Fix handling of failed model reloads. If a model reload fails, the currently loaded version of the model will remain loaded and its availability will be uninterrupted.
Triton Model Analyzer is released in the Triton SDK container and as a PIP package available in NVIDIA PyIndex.
Refer to the 20.11 column of the Frameworks Support Matrix for container image versions that the 20.11 inference server container is based on.
Ubuntu 18.04 with September 2020 updates.
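
The model reload fix above describes a keep-the-old-version-on-failure policy. The following toy sketch shows that policy in isolation; `ModelSlot` and `loader` are hypothetical names, and this is an illustration of the behavior, not Triton's implementation.

```python
class ModelSlot:
    """Holds the currently served model; a toy stand-in, not Triton's code."""

    def __init__(self, model=None):
        self.model = model

    def reload(self, loader):
        """Try to load a replacement; keep the current model if loading fails.

        Returns True when the swap happened, False when the old model was kept.
        """
        try:
            candidate = loader()  # may raise if the new files are broken
        except Exception:
            return False  # the previous model stays loaded and available
        self.model = candidate  # swap only after a successful load
        return True
```

The key design point is that the replacement is fully loaded before the current model is touched, so a failed load never leaves the slot empty.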

Known Issues

TensorRT reformat-free I/O is not supported.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v2.5.0_ubuntu1804.clients.tar.gz file. See Getting the Client Libraries for more information on the client libraries and examples. The client SDK is also available as an NGC Container.

Jetson Jetpack Support

A release of Triton for JetPack 4.4 (https://developer.nvidia.com/embedded/jetpack) is provided in the attached file: 2.5.0-jetpack4.4-1795341.tgz. This release supports the TensorFlow 2.3.1, TensorFlow 1.15.4, TensorRT 7.1, and Custom backends as well as ensembles. GPU metrics, GCS storage and S3 storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples.

Installation and Usage

The following dependencies must be installed before running Triton.

apt-get update && \
    apt-get install -y --no-install-recommends \
        software-properties-common \
        autoconf \
        automake \
        build-essential \
        cmake \
        git \
        libb64-dev \
        libre2-dev \
        libssl-dev \
        libtool \
        libboost-dev \
        libcurl4-openssl-dev \
        rapidjson-dev \
        patchelf \
        zlib1g-dev

To run the clients the following dependencies must be installed.

apt-get install -y --no-install-recommends \
        curl \
        libopencv-dev=3.2.0+dfsg-4ubuntu0.1 \
        libopencv-core-dev=3.2.0+dfsg-4ubuntu0.1 \
        pkg-config \
        python3 \
        python3-pip \
        python3-dev

python3 -m pip install --upgrade wheel setuptools
python3 -m pip install --upgrade grpcio-tools numpy pillow

The Python wheel for the Python client library is included in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.5.0-py3-none-linux_aarch64.whl[all]

On Jetson, the backend directory must be set explicitly with the --backend-directory flag. Triton also defaults to TensorFlow 1.x, so a version string is required to select TensorFlow 2.x.

  tritonserver --model-repository=/path/to/model_repo --backend-directory=/path/to/tritonserver/backends \
         --backend-config=tensorflow,version=2

- Python
Published by dzier over 5 years ago

server - Release 2.4.0 corresponding to NGC container 20.10

Triton Inference Server

The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 2.4.0

A new Python backend allows Python code to run as a model within Triton. See https://github.com/triton-inference-server/python_backend.
A new DALI backend allows running pre-processing and augmentation pipelines within Triton. See https://github.com/triton-inference-server/dali_backend.
The perf_client application is renamed to perf_analyzer; functionality remains the same.
A new Model Analyzer project is started with the goal of providing analysis and guidance on how to best optimize single or multiple models within Triton. The initial release analyzes GPU memory usage. See https://github.com/triton-inference-server/model_analyzer.
Triton documentation now resides on GitHub and is reachable from https://github.com/triton-inference-server/server/blob/master/README.md.
Build process for Triton has changed, see https://github.com/triton-inference-server/server/blob/master/docs/build.md.
Triton backends are moving to separate repositories. In this release the TensorFlow, ONNX Runtime, Python and DALI backends are moved, see https://github.com/triton-inference-server/backend#where-can-i-find-all-the-backends-that-are-available-for-triton.
Refer to the 20.10 column of the Frameworks Support Matrix for container image versions that the 20.10 inference server container is based on.
Ubuntu 18.04 with September 2020 updates.

Known Issues

TensorRT reformat-free I/O is not supported.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v2.4.0_ubuntu1804.clients.tar.gz file. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC Container.

Jetson Jetpack Support

A release of Triton for the Developer Preview of JetPack 4.4 (https://developer.nvidia.com/embedded/jetpack) is provided in the attached file: v2.4.0-jetpack4.4-1718105.tgz. This release supports the TensorFlow 2.3.1, TensorFlow 1.15.4, TensorRT 7.1, and Custom backends as well as ensembles. GPU metrics, GCS storage and S3 storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples.

Installation and Usage

The following dependencies must be installed before running Triton.

apt-get update && \
    apt-get install -y --no-install-recommends \
        software-properties-common \
        autoconf \
        automake \
        build-essential \
        cmake \
        git \
        libb64-dev \
        libre2-dev \
        libssl-dev \
        libtool \
        libboost-dev \
        libcurl4-openssl-dev \
        rapidjson-dev \
        patchelf \
        zlib1g-dev

To run the clients the following dependencies must be installed.

apt-get install -y --no-install-recommends \
        curl \
        libopencv-dev=3.2.0+dfsg-4ubuntu0.1 \
        libopencv-core-dev=3.2.0+dfsg-4ubuntu0.1 \
        pkg-config \
        python3 \
        python3-pip \
        python3-dev

python3 -m pip install --upgrade wheel setuptools
python3 -m pip install --upgrade grpcio-tools numpy pillow

The Python wheel for the Python client library is included in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tritonclient-2.4.0-py3-none-linux_aarch64.whl[all]

On Jetson, the backend directory must be set explicitly with the --backend-directory flag. Triton also defaults to TensorFlow 1.x, so a version string is required to select TensorFlow 2.x.

  tritonserver --model-repository=/path/to/model_repo --backend-directory=/path/to/tritonserver/backends \
         --backend-config=tensorflow,version=2

- Python
Published by dzier over 5 years ago

server - Release 2.3.0 corresponding to NGC container 20.09

NVIDIA Triton Inference Server

The NVIDIA Triton Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 2.3.0

Python Client library is now a pip package available from the NVIDIA pypi index. See Python client documentation for more information.
The custom backend API, custom.h and associated custom backend SDK are no longer provided as part of the Triton release. Existing custom backends will continue to work with Triton and older releases of the SDK can still be used to create "legacy" custom backends. However, all users are strongly encouraged to move to the new Triton backend API.
Fix a performance issue with the HTTP/REST protocol and the Python client library that caused reduced performance when outputs were not requested explicitly in an inference request.
Fix some bugs in reporting of statistics for ensemble models.
GRPC updated to version 1.25.0.

Known Issues

The KFServing HTTP/REST and GRPC protocols and corresponding V2 experimental Python and C++ clients are beta quality and are likely to change. Specifically:
- The data returned by the statistics API will be changing to include additional information.
- The data returned by the repository index API will be changing to include additional information.
The new C API specified in tritonserver.h is beta quality and is likely to change.
TensorRT reformat-free I/O is not supported.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v2.3.0_ubuntu1804.clients.tar.gz file. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC Container.

Jetson Jetpack Support

An experimental release of Triton for the Developer Preview of JetPack 4.4 is available as part of the 20.06 release. See 20.06 release for more information.

- Python
Published by dzier almost 6 years ago

server - Release 2.2.0 corresponding to NGC container 20.08

NVIDIA Triton Inference Server

The NVIDIA Triton Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 2.2.0

TensorFlow 2.x is now supported in addition to TensorFlow 1.x. See the Frameworks Support Matrix for the supported TensorFlow versions. The version of TensorFlow used can be selected when launching Triton with the --backend-config=tensorflow,version=<version> flag. Set <version> to 1 or 2 to select TensorFlow1 or TensorFlow2 respectively. By default TensorFlow 1 is used.
Add inference request timeout option to Python and C++ client libraries.
GRPC inference protocol updated to fix performance regression.
Explicit major/minor versioning added to TRITONSERVER and TRITONBACKEND APIs.
New CMake option TRITON_CLIENT_SKIP_EXAMPLES to disable building the client examples.

Known Issues

The KFServing HTTP/REST and GRPC protocols and corresponding V2 experimental Python and C++ clients are beta quality and are likely to change. Specifically:
- The data returned by the statistics API will be changing to include additional information.
- The data returned by the repository index API will be changing to include additional information.
The new C API specified in tritonserver.h is beta quality and is likely to change.
TensorRT reformat-free I/O is not supported.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v2.2.0_ubuntu1804.clients.tar.gz file. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC Container.

Custom Backend SDK

Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v2.2.0_ubuntu1804.custombackend.tar.gz file. See the documentation section 'Building a Custom Backend' for more information on using these files.

Jetson Jetpack Support

An experimental release of Triton for the Developer Preview of JetPack 4.4 is available as part of the 20.06 release. See 20.06 release for more information.

- Python
Published by dzier almost 6 years ago

server - Release 2.1.0 corresponding to NGC container 20.07

NVIDIA Triton Inference Server

The NVIDIA Triton Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 2.1.0

Added TensorFlow optimization option that enables automatic FP16 optimization of the model.
The PyTorch backend now includes support for TorchVision operations.

Known Issues

The KFServing HTTP/REST and GRPC protocols and corresponding V2 experimental Python and C++ clients are beta quality and are likely to change. Specifically:
- The data returned by the statistics API will be changing to include additional information.
- The data returned by the repository index API will be changing to include additional information.
The new C API specified in tritonserver.h is beta quality and is likely to change.
TensorRT reformat-free I/O is not supported.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v2.1.0_ubuntu1804.clients.tar.gz file. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC Container.

Custom Backend SDK

Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v2.1.0_ubuntu1804.custombackend.tar.gz file. See the documentation section 'Building a Custom Backend' for more information on using these files.

Jetson Jetpack Support

An experimental release of Triton for the Developer Preview of JetPack 4.4 is available as part of the 20.06 release. See 20.06 release for more information.

- Python
Published by dzier almost 6 years ago

server - Release 1.15.0 corresponding to NGC container 20.07

NVIDIA Triton Inference Server

The NVIDIA Triton Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 1.15.0

Support for the legacy V1 HTTP/REST, GRPC and corresponding client libraries is released on GitHub branch r20.07-v1 and as NGC container 20.07-v1-py3.

Known Issues

TensorRT reformat-free I/O is not supported.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.15.0_ubuntu1804.clients.tar.gz file. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC Container.

Custom Backend SDK

Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.15.0_ubuntu1804.custombackend.tar.gz file. See the documentation section 'Building a Custom Backend' for more information on using these files.

- Python
Published by dzier almost 6 years ago

server - Release 2.0.0 corresponding to NGC container 20.06

NVIDIA Triton Inference Server

The NVIDIA Triton Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 2.0.0

Updates for KFServing HTTP/REST and GRPC protocols and corresponding Python and C++ client libraries.
Migration from Triton V1 to Triton V2 requires significant changes; see the “Backwards Compatibility” and “Roadmap” sections of the GitHub README for more information.

Known Issues

The KFServing HTTP/REST and GRPC protocols and corresponding V2 experimental Python and C++ clients are beta quality and are likely to change. Specifically:
- The data returned by the statistics API will be changing to include additional information.
- The data returned by the repository index API will be changing to include additional information.
The new C API specified in tritonserver.h is beta quality and is likely to change.
TensorRT reformat-free I/O is not supported.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v2.0.0_ubuntu1804.clients.tar.gz file. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC Container.

Custom Backend SDK

Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v2.0.0_ubuntu1804.custombackend.tar.gz file. See the documentation section 'Building a Custom Backend' for more information on using these files.

Jetson Jetpack Support

A release of Triton for the Developer Preview of JetPack 4.4 (https://developer.nvidia.com/embedded/jetpack) is provided in the attached file: v2.0.0-jetpack4.4ga.tgz. This experimental release supports the TensorFlow (1.15.2), TensorRT (7.1) and Custom backends as well as ensembles. GPU metrics, GCS storage and S3 storage are not supported.

The tar file contains the Triton server executable and shared libraries and also the C++ and Python client libraries and examples.

Installation and Usage

The following dependencies must be installed before running Triton.

apt-get update && \
    apt-get install -y --no-install-recommends \
        software-properties-common \
        autoconf \
        automake \
        build-essential \
        cmake \
        git \
        libb64-dev \
        libgoogle-glog0v5 \
        libre2-dev \
        libssl-dev \
        libtool \
        libboost-dev \
        libcurl4-openssl-dev \
        rapidjson-dev \
        patchelf \
        zlib1g-dev

Additionally, to run the clients the following dependencies must be installed.

apt-get install -y --no-install-recommends \
        curl \
        libopencv-dev=3.2.0+dfsg-4ubuntu0.1 \
        libopencv-core-dev=3.2.0+dfsg-4ubuntu0.1 \
        pkg-config \
        python3 \
        python3-pip \
        python3-dev

python3 -m pip install --upgrade wheel setuptools
python3 -m pip install --upgrade grpcio-tools numpy pillow

The Python wheel for the Python client library is included in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/triton*.whl

- Python
Published by dzier about 6 years ago

server - Release 1.14.0 corresponding to NGC container 20.06

NVIDIA Triton Inference Server

The NVIDIA Triton Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 1.14.0

Support for the legacy V1 HTTP/REST, GRPC and corresponding client libraries is released on GitHub branch r20.06-v1 and as NGC container 20.06-v1-py3.

Known Issues

TensorRT reformat-free I/O is not supported.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
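
The GKE version guidance above can be expressed as a small check. This is a minimal Python sketch, not part of Triton: the function name gke_version_is_affected is hypothetical, and the affected range (1.14.0 through 1.14.5) is an interpretation of "1.13 or earlier" and "1.14.6 or later". It also assumes a simple major.minor.patch string, optionally followed by a "-" suffix.

```python
def gke_version_is_affected(version: str) -> bool:
    """Return True if `version` lies between the two ranges the notes recommend.

    Assumption: the affected range is 1.14.0 through 1.14.5, inferred from
    the advice to use 1.13 or earlier, or 1.14.6 or later.
    """
    # Drop any suffix after "-" and keep only the numeric components.
    core = version.split("-", 1)[0]
    parts = [int(p) for p in core.split(".")]
    # Pad to (major, minor, patch) so "1.14" is read as 1.14.0.
    while len(parts) < 3:
        parts.append(0)
    major, minor, patch = parts[:3]
    return (major, minor) == (1, 14) and patch < 6


for candidate in ("1.13.11", "1.14.3", "1.14.6"):
    print(candidate, "affected" if gke_version_is_affected(candidate) else "ok")
```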

Client Libraries and Examples

Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.14.0_ubuntu1804.clients.tar.gz file. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC Container.

Custom Backend SDK

Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.14.0_ubuntu1804.custombackend.tar.gz file. See the documentation section 'Building a Custom Backend' for more information on using these files.

- Python
Published by dzier about 6 years ago

server - Release 1.13.0 corresponding to NGC container 20.03.1

NVIDIA Triton Inference Server

The NVIDIA Triton Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 1.13.0

Updates for KFServing HTTP/REST and GRPC protocols and corresponding Python and C++ client libraries. See the Roadmap section of the README for more information.
Update GRPC version to 1.24.0.
Several issues with S3 storage were resolved.
Fix the last_inference_timestamp value to correctly show the time when inference last occurred for each model.
The Caffe2 backend is deprecated. Support for Caffe2 models will be removed in a future release.

Known Issues

The KFServing HTTP/REST and GRPC protocols and corresponding V2 experimental Python and C++ clients are beta quality and are likely to change. Specifically:
- The data returned by the statistics API will be changing to include additional information.
- The data returned by the repository index API will be changing to include additional information.
The new C API specified in tritonserver.h is beta quality and is likely to change.
When using the experimental V2 HTTP/REST C++ client, classification results are not supported for output tensors. This issue will be fixed in the next release.
When using the experimental V2 perf_client_v2, for high concurrency values perf_client_v2 may not be able to achieve throughput as high as V1 perf_client. This will be fixed in the next release.
TensorRT reformat-free I/O is not supported.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.13.0_ubuntu1804.clients.tar.gz file. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC Container.

Custom Backend SDK

Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.13.0_ubuntu1804.custombackend.tar.gz file. See the documentation section 'Building a Custom Backend' for more information on using these files.

Jetson Jetpack Support

An experimental release of Triton for the Developer Preview of JetPack 4.4 is available as part of the 20.03 release. See the 20.03 release for more information.

- Python
Published by dzier about 6 years ago

server - Release 1.12.0 corresponding to NGC container 20.03

NVIDIA Triton Inference Server

The NVIDIA Triton Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 1.12.0

Add queuing policies for the dynamic batching scheduler. These policies are specified in the model configuration and allow each model to set maximum queue size, timeouts, and priority levels for inference requests.
Support for large ONNX models where weights are stored in separate files.
Allow ONNX Runtime optimization level to be configured via the model configuration optimization setting.
Experimental Python client and server support for community standard GRPC inferencing API.
Add --min-supported-compute-capability flag to allow Triton Server to use older, unsupported GPUs.
Fix perf_client shared memory support. In some cases the shared-memory option did not work correctly due to the input and output tensor names. This issue is now resolved.
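
To illustrate how the three queuing-policy settings mentioned above (maximum queue size, timeouts, and priority levels) can interact, here is a minimal Python sketch. It is not the Triton scheduler and does not use the model configuration field names: the QueuePolicy and PolicyQueue classes are hypothetical, and the choice that a lower priority number is served first is an assumption of this sketch.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass
class QueuePolicy:
    """Hypothetical per-model limits mirroring the three settings above."""
    max_queue_size: int = 4        # 0 means "unbounded" in this sketch
    timeout_us: int = 1_000_000    # how long a request may wait in the queue


@dataclass
class PolicyQueue:
    policy: QueuePolicy
    _heap: list = field(default_factory=list)
    _counter: itertools.count = field(default_factory=itertools.count)

    def enqueue(self, request, priority: int, now_us: int) -> bool:
        """Queue a request; return False if the queue is already full."""
        limit = self.policy.max_queue_size
        if limit and len(self._heap) >= limit:
            return False
        # Lower priority number is served first (assumption); the counter
        # keeps FIFO order among requests that share a priority level.
        heapq.heappush(self._heap, (priority, next(self._counter), now_us, request))
        return True

    def dequeue(self, now_us: int):
        """Return the next unexpired request, or None if none remain."""
        while self._heap:
            _, _, queued_at, request = heapq.heappop(self._heap)
            if now_us - queued_at <= self.policy.timeout_us:
                return request
            # Expired requests are silently dropped in this sketch.
        return None
```

A queue built with QueuePolicy(max_queue_size=2) refuses a third request, and a request that has waited longer than timeout_us is skipped when dequeue is called.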

Known Issues

TensorRT reformat-free I/O is not supported.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.12.0_ubuntu1804.clients.tar.gz file. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC Container.

Custom Backend SDK

Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.12.0_ubuntu1804.custombackend.tar.gz file. See the documentation section 'Building a Custom Backend' for more information on using these files.

Jetson Jetpack Support

An experimental release of Triton for the Developer Preview of JetPack 4.4 (https://developer.nvidia.com/embedded/jetpack) is provided in the attached file: v1.12.0-jetpack4.4dp.tgz. This experimental release supports the TensorFlow (1.15.2), TensorRT (7.1) and Custom backends as well as ensembles. GPU metrics, GCS storage and S3 storage are not supported.

The tar file contains the Triton executable and shared libraries and also the C++ and Python client libraries and examples.

Installation and Usage

The following dependencies must be installed before running Triton.

apt-get update && \
    apt-get install -y --no-install-recommends \
        software-properties-common \
        autoconf \
        automake \
        build-essential \
        cmake \
        git \
        libgoogle-glog0v5 \
        libre2-dev \
        libssl-dev \
        libtool \
        libboost-dev \
        libcurl4-openssl-dev \
        zlib1g-dev

Additionally, to run the clients the following dependencies must be installed.

apt-get install -y --no-install-recommends \
        curl \
        libopencv-dev=3.2.0+dfsg-4ubuntu0.1 \
        libopencv-core-dev=3.2.0+dfsg-4ubuntu0.1 \
        pkg-config \
        python3 \
        python3-pip \
        python3-dev

python3 -m pip install --upgrade wheel setuptools
python3 -m pip install --upgrade grpcio-tools numpy pillow

The Python wheel for the Python client library is included in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tensorrtserver-*.whl

- Python
Published by dzier over 6 years ago

server - Release 1.11.0 corresponding to NGC container 20.02

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.11.0

The TensorRT backend is improved to have significantly better performance. Improvements include reducing thread contention, using pinned memory for faster CPU<->GPU transfers, and increasing compute and memory copy overlap on GPUs.
Reduce memory usage of TensorRT models in many cases by sharing weights across multiple model instances.
Boolean data-type and shape tensors are now supported for TensorRT models.
A new model configuration option allows the dynamic batcher to create “ragged” batches for custom backend models. A ragged batch is a batch where one or more of the input/output tensors have different shapes in different batch entries.
Local S3 storage endpoints are now supported for model repositories. A local S3 endpoint is specified as 's3://host:port/path/to/repository'.
The Helm chart showing an example Kubernetes deployment is updated to include Prometheus and Grafana support so that inference server metrics can be collected and visualized.
The inference server container no longer sets LD_LIBRARY_PATH; instead, the server uses RUNPATH to locate its shared libraries.
Python 2 is end-of-life so all support has been removed. Python 3 is still supported.
Ubuntu 18.04 with January 2020 updates
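
The local S3 endpoint form 's3://host:port/path/to/repository' mentioned above can be taken apart with the Python standard library. This sketch only splits the string; how the server itself tells a local endpoint from a bucket name is not described in these notes. The function name parse_s3_repository and the example host and port are hypothetical.

```python
from urllib.parse import urlsplit


def parse_s3_repository(url: str) -> dict:
    """Split an s3:// repository URL into host, port, and path components."""
    parts = urlsplit(url)
    if parts.scheme != "s3":
        raise ValueError(f"not an s3:// URL: {url!r}")
    return {
        "host": parts.hostname,
        "port": parts.port,              # None when no port is given
        "path": parts.path.lstrip("/"),
    }


# Hypothetical local endpoint, for illustration only.
print(parse_s3_repository("s3://localhost:9000/path/to/repository"))
```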

Known Issues

TensorRT reformat-free I/O is not supported.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.11.0_ubuntu1604.clients.tar.gz and v1.11.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC Container.

Custom Backend SDK

Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.11.0_ubuntu1604.custombackend.tar.gz and v1.11.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.

- Python
Published by dzier over 6 years ago

server - Release 1.10.0 corresponding to NGC container 20.01

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.10.0

Server status can be requested in JSON format using the HTTP/REST API. Use endpoint /api/status?format=json.
The dynamic batcher now has an option to preserve the ordering of batched requests when there are multiple model instances. See model_config.proto for more information.
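
As a rough illustration of requesting the status in JSON format, the sketch below builds the /api/status?format=json URL and fetches it with the Python standard library. The base URL http://localhost:8000 is an assumption used only for the example, and the helper names status_url and fetch_status are hypothetical. The layout of the returned JSON is not described in these notes, so the sketch does not interpret it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def status_url(base_url: str) -> str:
    """Build the status endpoint URL with JSON output requested."""
    return f"{base_url.rstrip('/')}/api/status?{urlencode({'format': 'json'})}"


def fetch_status(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch and decode the server status; requires a running server."""
    with urlopen(status_url(base_url), timeout=timeout) as response:
        return json.load(response)


# Only the URL is built here; fetch_status() needs a reachable server.
print(status_url("http://localhost:8000"))
```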

Known Issues

TensorRT reformat-free I/O is not supported.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.10.0_ubuntu1604.clients.tar.gz and v1.10.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC Container.

Custom Backend SDK

Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.10.0_ubuntu1604.custombackend.tar.gz and v1.10.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.

- Python
Published by dzier over 6 years ago

server - Release 1.9.0, corresponding to NGC container 19.12

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.9.0

The model configuration now includes a model warmup option. This option provides the ability to tune and optimize the model before inference requests are received, avoiding initial inference delays. This option is especially useful for frameworks like TensorFlow that perform network optimization in response to the initial inference requests. Models can be warmed-up with one or more synthetic or realistic workloads before they become ready in the server.
An enhanced sequence batcher now has multiple scheduling strategies. A new Oldest strategy integrates with the dynamic batcher to enable improved inference performance for models that don’t require all inference requests in a sequence to be routed to the same batch slot.
The perf_client now has an option to generate requests using a realistic Poisson distribution or a user-provided distribution.
A new repository API (available in the shared library API, HTTP, and GRPC) returns an index of all models available in the model repositories visible to the server. This index can be used to see what models are available for loading onto the server.
The server status returned by the server status API now includes the timestamp of the last inference request received for each model.
Inference server tracing capabilities are now documented in the Optimization section of the User Guide. Tracing support is enhanced to provide trace for ensembles and the contained models.
A community-contributed Dockerfile is now available to build the TensorRT Inference Server clients on CentOS.
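
To show what generating requests with a Poisson distribution can look like, here is a small Python sketch. It is not perf_client's implementation: the function name poisson_request_times is hypothetical, and the sketch relies only on the standard fact that gaps between events of a Poisson process are exponentially distributed.

```python
import random


def poisson_request_times(rate_per_sec: float, count: int, seed: int = 0) -> list:
    """Return `count` request send times (in seconds) for a Poisson arrival process."""
    rng = random.Random(seed)
    times, now = [], 0.0
    for _ in range(count):
        # Inter-arrival gaps of a Poisson process are exponentially distributed.
        now += rng.expovariate(rate_per_sec)
        times.append(now)
    return times


schedule = poisson_request_times(rate_per_sec=100.0, count=1000)
mean_gap = schedule[-1] / len(schedule)
print(f"mean gap between requests: {mean_gap * 1000:.2f} ms")
```

At 100 requests per second the mean gap should be close to 10 ms, while individual gaps vary, which is what distinguishes this load pattern from evenly spaced requests.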

Known Issues

The beta of the custom backend API version 2 has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
- The signature of the CustomGetNextInputV2Fn_t function adds the memory_type_id argument.
- The signature of the CustomGetOutputV2Fn_t function adds the memory_type_id argument.
The beta of the inference server library API has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
- The signature and operation of the TRTSERVER_ResponseAllocatorAllocFn_t function has changed. See src/core/trtserver.h for a description of the new behavior.
- The signature of the TRTSERVER_InferenceRequestProviderSetInputData function adds the memory_type_id argument.
- The signature of the TRTSERVER_InferenceResponseOutputData function adds the memory_type_id argument.
TensorRT reformat-free I/O is not supported.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.9.0_ubuntu1604.clients.tar.gz and v1.9.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC Container.

Custom Backend SDK

Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.9.0_ubuntu1604.custombackend.tar.gz and v1.9.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.

- Python
Published by dzier over 6 years ago

server - Release 1.8.0, corresponding to NGC container 19.11

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.8.0

Shared-memory support is expanded to include CUDA shared memory.
Improve efficiency of pinned memory used for ensemble models.
The perf_client application has been improved with easier-to-use command-line arguments (while maintaining compatibility with existing arguments).
Support for string tensors added to perf_client.
Documentation contains a new “Optimization” section discussing some common optimization strategies and how to use perf_client to explore these strategies.

Deprecated Features

The asynchronous inference API has been modified in the C++ and Python client libraries.
- In the C++ library:
    - The non-callback version of the AsyncRun function was removed.
    - The GetReadyAsyncRequest function was removed.
    - The signature of the GetAsyncRunResults function was changed to remove the is_ready and wait arguments.
- In the Python library:
    - The non-callback version of the async_run function was removed.
    - The get_ready_async_request function was removed.
    - The signature of the get_async_run_results function was changed to remove the wait argument.

Known Issues

The beta of the custom backend API version 2 has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
- The signature of the CustomGetNextInputV2Fn_t function adds the memory_type_id argument.
- The signature of the CustomGetOutputV2Fn_t function adds the memory_type_id argument.
The beta of the inference server library API has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
- The signature and operation of the TRTSERVER_ResponseAllocatorAllocFn_t function has changed. See src/core/trtserver.h for a description of the new behavior.
- The signature of the TRTSERVER_InferenceRequestProviderSetInputData function adds the memory_type_id argument.
- The signature of the TRTSERVER_InferenceResponseOutputData function adds the memory_type_id argument.
TensorRT reformat-free I/O is not supported.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.8.0_ubuntu1604.clients.tar.gz and v1.8.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC Container.

Custom Backend SDK

Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.8.0_ubuntu1604.custombackend.tar.gz and v1.8.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.

- Python
Published by dzier over 6 years ago

server - Release 1.7.0, corresponding to NGC container 19.10

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.7.0

A Client SDK container is now provided on NGC in addition to the inference server container. The client SDK container includes the client libraries and examples.
TensorRT optimization may now be enabled for any TensorFlow model by enabling the feature in the optimization section of the model configuration.
The ONNX Runtime backend now includes the TensorRT and OpenVINO execution providers. These providers are enabled in the optimization section of the model configuration.
Automatic configuration generation (--strict-model-config=false) now works correctly for TensorRT models with variable-sized inputs and/or outputs.
Multiple model repositories may now be specified on the command line. Optional command-line options can be used to explicitly load specific models from each repository.
Ensemble models are now pruned dynamically so that only models needed to calculate the requested outputs are executed.
The example clients now include a simple Go example that uses the GRPC API.
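
The idea behind dynamic ensemble pruning, running only the models needed for the requested outputs, can be sketched as a backward walk over the tensor graph. This is an illustration of the concept, not the server's scheduler: the function name models_needed, the data layout, and the example ensemble are all hypothetical.

```python
def models_needed(steps: dict, requested_outputs: set) -> set:
    """Return the models that must run to produce `requested_outputs`.

    `steps` maps a model name to a pair (input_tensors, output_tensors).
    """
    # Index each tensor by the model that produces it.
    producer = {
        tensor: model
        for model, (_, outputs) in steps.items()
        for tensor in outputs
    }
    needed, pending = set(), list(requested_outputs)
    while pending:
        tensor = pending.pop()
        model = producer.get(tensor)     # None for the ensemble's own inputs
        if model is None or model in needed:
            continue
        needed.add(model)
        pending.extend(steps[model][0])  # walk back through that model's inputs
    return needed


# Hypothetical ensemble: one preprocessing step feeding two independent models.
steps = {
    "preprocess": (["raw_image"], ["pixels"]),
    "classifier": (["pixels"], ["label"]),
    "detector": (["pixels"], ["boxes"]),
}
print(sorted(models_needed(steps, {"label"})))
```

When only "label" is requested, the "detector" step is never visited, so it would not be executed.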

Known Issues

In TensorRT 6.0.1, reformat-free I/O is not supported.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.6.0_ubuntu1604.clients.tar.gz and v1.6.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC Container.

Custom Backend SDK

Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.6.0_ubuntu1604.custombackend.tar.gz and v1.6.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.

- Python
Published by dzier over 6 years ago

server - Release 1.6.0, corresponding to NGC container 19.09

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.6.0

Added TensorRT 6 support, which includes support for TensorRT dynamic shapes.
Shared memory support is added as an alpha feature in this release. This support allows input and output tensors to be communicated via shared memory instead of over the network. Currently only system (CPU) shared memory is supported.
Amazon S3 is now supported as a remote file system for model repositories. Use the s3:// prefix on model repository paths to reference S3 locations.
The inference server library API is available as a beta in this release. The library API allows you to link against libtrtserver.so so that you can include all the inference server functionality directly in your application.
GRPC endpoint performance improvement. The inference server’s GRPC endpoint now uses significantly less memory while delivering higher performance.
The ensemble scheduler is now more flexible in allowing batching and non-batching models to be composed together in an ensemble.
The ensemble scheduler will now keep tensors in GPU memory between models when possible. Doing so significantly increases performance of some ensembles by avoiding copies to and from system memory.
The performance client, perf_client, now supports models with variable-sized input tensors.

Known Issues

The ONNX Runtime backend could not be updated to the 0.5.0 release due to multiple performance and correctness issues with that release.
In TensorRT 6:
- Reformat-free I/O is not supported.
- Only models that have a single optimization profile are currently supported.
Google Kubernetes Engine (GKE) version 1.14 contains a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version to avoid this issue.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.6.0_ubuntu1604.clients.tar.gz and v1.6.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.

Custom Backend SDK

Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.6.0_ubuntu1604.custombackend.tar.gz and v1.6.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.

- Python
Published by dzier almost 7 years ago

server - Release 1.5.0, corresponding to NGC container 19.08

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.5.0

Added a new execution mode that allows the inference server to start without loading any models from the model repository. Model loading and unloading are then controlled by a new GRPC/HTTP model control API.
Added a new instance-group mode that allows TensorFlow models that explicitly distribute inferencing across multiple GPUs to run in that manner in the inference server.
Improved input/output tensor reshape to allow variable-sized dimensions in tensors being reshaped.
Added a C++ wrapper around the custom backend C API to simplify the creation of custom backends. This wrapper is included in the custom backend SDK.
Improved the accuracy of the compute statistic reported for inference requests. Previously the compute statistic included some additional time beyond the actual compute time.
The performance client, perf_client, now reports more information for ensemble models, including statistics for all contained models and the entire ensemble.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.5.0_ubuntu1604.clients.tar.gz and v1.5.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.

Custom Backend SDK

Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.5.0_ubuntu1604.custombackend.tar.gz and v1.5.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.

- Python
Published by dzier almost 7 years ago

server - Release 1.4.0, corresponding to NGC container 19.07

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.4.0

Added libtorch as a new backend. PyTorch models manually decorated or automatically traced to produce TorchScript can now be run directly by the inference server.
Build system converted from bazel to CMake. The new CMake-based build system is more transparent, portable and modular.
To simplify the creation of custom backends, a Custom Backend SDK and improved documentation are now available.
Improved AsyncRun API in C++ and Python client libraries.
perf_client can now use user-supplied input data (previously perf_client could only use random or zero input data).
perf_client now reports latency at multiple confidence percentiles (p50, p90, p95, p99) as well as a user-supplied percentile that is also used to stabilize latency results.
Improvements to automatic model configuration creation (--strict-model-config=false).
C++ and Python client libraries now allow additional HTTP headers to be specified when using the HTTP protocol.
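
For readers unfamiliar with latency percentiles such as p50, p90, p95 and p99, the sketch below computes them from a list of latency samples. The notes do not say which percentile method perf_client uses; this example uses the nearest-rank definition as an assumption, and the function name percentile and the synthetic samples are hypothetical.

```python
import math


def percentile(latencies: list, pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not latencies:
        raise ValueError("no latency samples")
    ordered = sorted(latencies)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


samples_ms = list(range(1, 101))  # synthetic latencies: 1 ms through 100 ms
for pct in (50, 90, 95, 99):
    print(f"p{pct}: {percentile(samples_ms, pct)} ms")
```

A high percentile such as p99 reflects the slowest requests, which an average can hide.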

Known Issues

Google Cloud Storage (GCS) support has been restored in this release.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.4.0_ubuntu1604.clients.tar.gz and v1.4.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.

Custom Backend SDK

Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.4.0_ubuntu1604.custombackend.tar.gz and v1.4.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.

- Python
Published by deadeyegoodwin almost 7 years ago

server - Release 1.3.0, corresponding to NGC container 19.06

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.3.0

The ONNX Runtime (github.com/Microsoft/onnxruntime) is now integrated into the inference server. ONNX models can now be used directly in a model repository.
The HTTP health port may be specified independently of the inference and status HTTP port with the --http-health-port flag.
Fixed bug in perf_client that caused high CPU usage that could lower the measured inference/sec in some cases.

Known Issues

Google Cloud Storage (GCS) support is not available in the 19.06 release. Support for GCS is available on the master branch and will be re-enabled in the 19.07 release.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.3.0_ubuntu1604.clients.tar.gz and v1.3.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.

- Python
Published by deadeyegoodwin about 7 years ago

server - Release 1.2.0, corresponding to NGC container 19.05

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.2.0

Ensembling is now available. An ensemble represents a pipeline of one or more models and the connection of input and output tensors between those models. A single inference request to an ensemble will trigger the execution of the entire pipeline.
Added Helm chart that deploys a single TensorRT Inference Server into a Kubernetes cluster.
The client Makefile now supports building for both Ubuntu 16.04 and Ubuntu 18.04. The Python wheel produced from the build is now compatible with both Python 2 and Python 3.
The perf_client application now has a --percentile flag that can be used to report latencies instead of reporting average latency (which remains the default). For example, using --percentile=99 causes perf_client to report the 99th percentile latency.
The perf_client application now has a -z option to use zero-valued input tensors instead of random values.
Improved error reporting of incorrect input/output tensor names for TensorRT models.
Added --allow-gpu-metrics option to enable/disable reporting of GPU metrics.
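The difference between average and percentile latency reporting, as selected by the perf_client --percentile flag above, can be illustrated with a small sketch. The release notes do not say how perf_client computes the percentile, so the nearest-rank method, the function name percentile_latency, and the sample values below are assumptions chosen only for illustration.

```python
import math
from statistics import mean


def percentile_latency(latencies_us, percentile):
    """Return the nearest-rank percentile of a list of latency samples.

    Assumption: nearest-rank is used here for simplicity; perf_client may
    compute percentiles differently.
    """
    if not latencies_us:
        raise ValueError("no latency samples")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(latencies_us)
    # Smallest rank such that at least `percentile` percent of samples
    # are less than or equal to the returned value.
    rank = math.ceil(percentile / 100 * len(ordered))
    return ordered[rank - 1]


# Hypothetical latency samples in microseconds, with one slow outlier.
samples_us = [900, 950, 1000, 1050, 1100, 1150, 1200, 1250, 1300, 9000]

average_us = mean(samples_us)                 # pulled upward by the outlier
p50_us = percentile_latency(samples_us, 50)   # typical request
p99_us = percentile_latency(samples_us, 99)   # tail latency
```

With a single slow request in the sample, the average sits well above the median while the 99th percentile exposes the outlier directly, which is why a percentile is often the more useful number when tuning for tail latency.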

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.2.0ubuntu1604.clients.tar.gz and v1.2.0ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.

Published by deadeyegoodwin about 7 years ago

server - Release 1.1.0, corresponding to NGC container 19.04

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.1.0

Client libraries and examples now build with a separate Makefile (a Dockerfile is also included for convenience).
Input or output tensors with variable-size dimensions (indicated by -1 in the model configuration) can now represent tensors where the variable dimension has value 0 (zero).
Zero-sized input and output tensors are now supported for batching models. This enables the inference server to support models that require inputs and outputs that have shape [ batch-size ].
TensorFlow custom operations (C++) can now be built into the inference server. An example and documentation are included in this release.
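The variable-size dimension rule described above (a -1 in the model configuration matches any size, now including 0) can be sketched as a small shape check. The helper names shape_matches and element_count are hypothetical and this is not the server's validation code, only a minimal illustration of the rule.

```python
def shape_matches(config_dims, actual_shape):
    """Check a concrete tensor shape against configured dims.

    A configured value of -1 marks a variable-size dimension and accepts
    any non-negative size, including 0. Any other value must match exactly.
    """
    if len(config_dims) != len(actual_shape):
        return False
    for expected, actual in zip(config_dims, actual_shape):
        if actual < 0:
            return False  # a concrete shape cannot have a negative size
        if expected != -1 and expected != actual:
            return False
    return True


def element_count(shape):
    """Number of elements in a tensor of the given concrete shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count
```

For example, configured dims of [-1, 4] accept the shapes [3, 4] and [0, 4] but reject [3, 5]; a shape with a zero dimension simply holds zero elements.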

Client Libraries and Examples

An Ubuntu 16.04 build of the client libraries and examples is included in this release in the attached v1.1.0.clients.tar.gz. See the documentation section 'Building the Client Libraries and Examples' for more information on using this file.

Published by deadeyegoodwin about 7 years ago

server - Release 1.0.0, corresponding to NGC container 19.03

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.0.0

1.0.0 is the first GA, non-beta, release of TensorRT Inference Server. See the README for information on backwards-compatibility guarantees for this and future releases.
Added support for stateful models and backends that require multiple inference requests be routed to the same model instance/batch slot. The new sequence batcher provides scheduling and batching capabilities for this class of models.
Added GRPC streaming protocol support for inference requests.
The HTTP front-end is now asynchronous to enable lower-latency and higher-throughput handling of inference requests.
Enhanced perf_client to support stateful models and backends.
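The routing requirement behind the sequence batcher above, that every request in a sequence reaches the same batch slot, can be sketched with a toy slot table. This is a simplified illustration and not the server's scheduler: the class name SequenceSlots, the correlation_id key, and the start and end flags are assumptions made for the example, and a full slot table raises an error here where a real scheduler would more likely queue the new sequence.

```python
class SequenceSlots:
    """Toy slot table that pins each sequence to one batch slot."""

    def __init__(self, slot_count):
        self._free = list(range(slot_count))  # slots not held by a sequence
        self._assigned = {}                   # correlation_id -> slot

    def route(self, correlation_id, start=False, end=False):
        """Return the slot for a request, claiming or releasing it as needed."""
        if start:
            if correlation_id in self._assigned:
                raise ValueError("sequence already started")
            if not self._free:
                raise RuntimeError("no free slot for a new sequence")
            self._assigned[correlation_id] = self._free.pop(0)
        elif correlation_id not in self._assigned:
            raise KeyError("request for a sequence that was never started")

        slot = self._assigned[correlation_id]
        if end:
            # The last request of a sequence frees its slot for reuse.
            del self._assigned[correlation_id]
            self._free.append(slot)
        return slot


slots = SequenceSlots(slot_count=2)
first_a = slots.route("a", start=True)   # sequence "a" claims a slot
first_b = slots.route("b", start=True)   # sequence "b" claims the other
next_a = slots.route("a")                # later request follows "a"
last_a = slots.route("a", end=True)      # final request releases the slot
first_c = slots.route("c", start=True)   # "c" reuses the released slot
```

Because the slot is looked up by sequence rather than chosen per request, any state the model keeps for that slot stays attached to the same sequence until it ends.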

Client Libraries and Examples

An Ubuntu 16.04 build of the client libraries and examples is included in this release in the attached v1.0.0.clients.tar.gz. See the documentation section 'Building the Client Libraries and Examples' for more information on using this file.

Published by deadeyegoodwin over 7 years ago

server - Release 0.11.0 beta, corresponding to NGC container 19.02

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 0.11.0 Beta

Variable-size input and output tensor support. Models that support variable-size input tensors and produce variable-size output tensors are now supported in the model configuration by using a dimension size of -1 for those dimensions that can take on any size.
String datatype support. For TensorFlow models and custom backends, input and output tensors can contain strings.
Improved support for non-GPU systems. The inference server will run correctly on systems that do not contain GPUs and that do not have nvidia-docker or CUDA installed.

Client Libraries and Examples

An Ubuntu 16.04 build of the client libraries and examples is included in this release in the attached v0.11.0.clients.tar.gz. See the documentation section 'Building the Client Libraries and Examples' for more information on using this file.

Published by deadeyegoodwin over 7 years ago

server - Release 0.10.0 beta, corresponding to NGC container 19.01

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server (TRTIS) provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 0.10.0 Beta

Custom backend support. TRTIS allows individual models to be implemented with custom backends instead of by a deep-learning framework. With a custom backend a model can implement any logic desired, while still benefiting from the GPU support, concurrent execution, dynamic batching and other features provided by TRTIS.

Published by deadeyegoodwin over 7 years ago

server - Release 0.9.0 beta, corresponding to NGC container 18.12

NVIDIA TensorRT Inference Server 0.9.0 Beta

The NVIDIA TensorRT Inference Server (TRTIS) provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New in 0.9.0 Beta

TRTIS now monitors the model repository for any change and dynamically reloads the model when necessary, without requiring a server restart. It is now possible to add and remove model versions, add/remove entire models, modify the model configuration, and modify the model labels while the server is running.
Added a model priority parameter to the model configuration. Currently, the model priority controls the CPU thread priority when executing the model and, for TensorRT models, also controls the CUDA stream priority.
Fixed a bug in the GRPC API: changed the model version parameter from string to int. This change is not backwards compatible.
Added --strict-model-config=false option to allow some model configuration properties to be derived automatically. For some model types, this removes the need to specify the config.pbtxt file.
Improved performance from an asynchronous GRPC frontend.
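The model repository monitoring described above amounts to noticing which files were added, removed, or modified between two points in time. The release notes do not describe how TRTIS detects changes, so the sketch below is only one plausible approach under stated assumptions: it compares snapshots of file modification time and size, and the function names snapshot and diff_snapshots are hypothetical.

```python
import os


def snapshot(repository_path):
    """Map each file under the repository to its (mtime_ns, size) pair."""
    state = {}
    for root, _dirs, files in os.walk(repository_path):
        for name in files:
            path = os.path.join(root, name)
            info = os.stat(path)
            relative = os.path.relpath(path, repository_path)
            state[relative] = (info.st_mtime_ns, info.st_size)
    return state


def diff_snapshots(old, new):
    """Return sorted lists of (added, removed, modified) relative paths."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    modified = sorted(
        path for path in set(old) & set(new) if old[path] != new[path]
    )
    return added, removed, modified
```

A polling loop would take a snapshot at a fixed interval, diff it against the previous one, and reload only the models whose directories contain a changed path.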

Published by deadeyegoodwin over 7 years ago

server - Release 0.8.0 beta, corresponding to NGC container 18.11

Published by deadeyegoodwin over 7 years ago
