aks-infiniband-install

Example prototype for installing Infiniband on an AKS cluster

https://github.com/converged-computing/aks-infiniband-install

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.8%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: converged-computing
  • License: mit
  • Language: Shell
  • Default Branch: main
  • Size: 36.1 KB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created over 1 year ago · Last pushed 11 months ago
Metadata Files
Readme License Citation

README.md

AKS Infiniband Installer

This repository is a short series of steps for getting Infiniband working on AKS. The image built here installs the drivers on the nodes, and the Mellanox/k8s-rdma-shared-dev-plugin then provides a device plugin that exposes Infiniband to the pods. The directories are organized by OS and driver version, since both matter. If you need to install to Usernetes on a node that already runs the driver, jump down to Install Usernetes.

1. Build Image

In practice, we built only the latest (22.04) drivers and got them to work on older devices by customizing flags. The other directories (aside from ubuntu22.04) are provided as examples, but we have not used them.

  • ubuntu22.04: will build for MLNX_OFED_LINUX-24.04-0.7.0.0-ubuntu22.04-x86_64 (ConnectX-4 and ConnectX-5)

You'll need the driver that matches your node version. NVIDIA no longer allows the newer files to be fetched with wget or curl, so you'll need to agree to their license agreement, download the ISO from this page, and put it in the folder you want to build from. Note that we follow the instructions here to install it with the daemonset. Then update the following in the Dockerfile and the manifest that uses it:

  1. The base image to use (e.g., ubuntu:22.04)
  2. The COPY directive to copy the ISO into the directory
  3. The driver-installation.yaml or driver-installation-with-gpu.yaml that references it

Building the image:

```bash
docker build -t ghcr.io/converged-computing/aks-infiniband-install:ubuntu-22.04 ubuntu22.04
```
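Because the ISO has to be downloaded by hand, a small pre-build check can catch a missing file before `docker build` fails at the COPY step. The following is a minimal sketch, not part of the repository: the function name `find_ofed_iso` is hypothetical, and the `MLNX_OFED_LINUX-*.iso` pattern is an assumption based on the driver name listed above.

```bash
#!/usr/bin/env bash
# Sketch: verify that an OFED ISO is present in the build directory.
# The file pattern is an assumption; adjust it to the file you downloaded.

find_ofed_iso() {
    local build_dir="$1"
    local iso
    for iso in "$build_dir"/MLNX_OFED_LINUX-*.iso; do
        # With no match the glob stays literal, so test that the file exists.
        if [ -f "$iso" ]; then
            echo "$iso"
            return 0
        fi
    done
    echo "No MLNX_OFED_LINUX ISO found in $build_dir" >&2
    return 1
}

# Example usage (not run here):
#   find_ofed_iso ubuntu22.04 && docker build -t <image> ubuntu22.04
```

The function prints the first matching ISO path and returns non-zero when none is found, so it can gate the build command with `&&`.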

2. Cluster Setup

When you create your cluster, you need to do the following.

```bash
# Enable Infiniband for your AKS cluster
az feature register --name AKSInfinibandSupport --namespace Microsoft.ContainerService

# Check the status
az feature list -o table --query "[?contains(name, 'Microsoft.ContainerService/AKSInfinibandSupport')].{Name:name,State:properties.state}"

# Register when ready
az provider register --namespace Microsoft.ContainerService
```
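Feature registration can take a while, so it may help to poll the state instead of re-running the status command by hand. This is a minimal sketch under stated assumptions: the function name `wait_for_feature`, the attempt count, and the polling interval are all illustrative, and it uses `az feature show` with a `--query` on `properties.state` rather than the `az feature list` query above.

```bash
#!/usr/bin/env bash
# Sketch: poll the feature flag until it reports "Registered".
# Attempt count and interval are illustrative defaults.

wait_for_feature() {
    local name="${1:-AKSInfinibandSupport}"
    local namespace="${2:-Microsoft.ContainerService}"
    local attempts="${3:-60}"
    local interval="${4:-30}"
    local state
    local i=1

    while [ "$i" -le "$attempts" ]; do
        state=$(az feature show --name "$name" --namespace "$namespace" \
            --query properties.state -o tsv)
        echo "attempt $i: $state" >&2
        if [ "$state" = "Registered" ]; then
            return 0
        fi
        sleep "$interval"
        i=$((i + 1))
    done
    echo "feature $name not registered after $attempts attempts" >&2
    return 1
}

# Example usage (not run here):
#   wait_for_feature && az provider register --namespace Microsoft.ContainerService
```

Progress messages go to stderr so the return code alone decides whether the provider registration runs.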

Some additional notes: you need an AKS node pool with RDMA-capable SKUs.
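As a rough illustration of adding such a node pool, the sketch below wraps `az aks nodepool add`. Everything passed to it is a placeholder: the function name `add_rdma_nodepool`, the resource group, the cluster name, and the default node count are hypothetical, and the VM size in the usage line is only an example of an RDMA-capable size, so check which sizes are available in your subscription and region.

```bash
#!/usr/bin/env bash
# Sketch: add a node pool with a caller-supplied VM size.
# All argument values are placeholders chosen by the caller.

add_rdma_nodepool() {
    if [ "$#" -lt 4 ]; then
        echo "usage: add_rdma_nodepool RESOURCE_GROUP CLUSTER POOL VM_SIZE [COUNT]" >&2
        return 2
    fi
    local resource_group="$1"
    local cluster="$2"
    local pool="$3"
    local vm_size="$4"
    local count="${5:-2}"

    az aks nodepool add \
        --resource-group "$resource_group" \
        --cluster-name "$cluster" \
        --name "$pool" \
        --node-vm-size "$vm_size" \
        --node-count "$count"
}

# Example usage (not run here, names are hypothetical):
#   add_rdma_nodepool my-rg my-cluster userpool Standard_HB120rs_v3 2
```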

3. Node Init

Note that if you shell into a node at this point (install kubectl node-shell), install ibverbs-utils, and run ibv_devices, the device list will be empty. Let's install Infiniband next, using a container that is also built with the ubuntu 22.04 drivers. I was originally looking at https://github.com/Mellanox/ib-kubernetes but switched to the approach we have here. Let's first install the drivers:

```bash
# Regular without gpu (ubuntu 22.04 full)
kubectl apply -f ./driver-installation.yaml

# GPU (without doing a host check)
kubectl apply -f ./driver-installation-with-gpu.yaml
```

When they are done, you can check that the install was successful. The check isn't perfect, but it works: we want to see that the ib0 device is up.

```bash
for pod in $(kubectl get pods -o json | jq -r '.items[].metadata.name')
do
   kubectl exec -it "$pod" -- nsenter -t 1 -m /usr/sbin/ip link | grep 'ib0:'
done
```

The number of matching lines should equal the number of nodes. Then apply the daemonset to make Infiniband available to pods:
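The comparison against the node count can also be scripted. The sketch below is one possible way to do it, assuming (as the loop above does) that the installer pods are the only pods in the current namespace. The function names `count_ib0_pods` and `check_ib0_on_all_nodes` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: count pods whose host reports an ib0 link and compare the
# result with the number of nodes. Assumes one installer pod per node
# and no other pods in the current namespace.

count_ib0_pods() {
    local pod
    local count=0
    for pod in $(kubectl get pods -o name); do
        if kubectl exec "$pod" -- nsenter -t 1 -m /usr/sbin/ip link | grep -q 'ib0:'; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

check_ib0_on_all_nodes() {
    local nodes up
    nodes=$(kubectl get nodes -o name | grep -c .)
    up=$(count_ib0_pods)
    echo "ib0 found on $up of $nodes nodes"
    [ "$up" -eq "$nodes" ]
}

# Example usage (not run here):
#   check_ib0_on_all_nodes || echo "some nodes are missing ib0"
```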

```bash
kubectl apply -k ./daemonset/
```
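Once the daemonset is running, the nodes should advertise an RDMA resource that pods can request. A quick way to look for it is sketched below. It assumes the plugin is configured with the `rdma/` resource prefix, which may not match the configuration in ./daemonset/. The function name `rdma_resources` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: show node resource lines that mention an rdma/ resource.
# The "rdma/" prefix is an assumption about the plugin configuration.

rdma_resources() {
    local lines
    lines=$(kubectl describe nodes | grep 'rdma/') || {
        echo "no rdma/ resources advertised yet" >&2
        return 1
    }
    echo "$lines"
}

# Example usage (not run here):
#   rdma_resources
```

If nothing is printed, the plugin pods may still be starting, or the resource prefix may differ from the one assumed here.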

Here is a quick test:

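```bash
# First node (the second command runs inside the node shell)
kubectl node-shell aks-userpool-14173555-vmss000000
ibv_rc_pingpong

# Second node (the second command runs inside the node shell)
kubectl node-shell aks-userpool-14173555-vmss000001
ibv_rc_pingpong aks-userpool-14173555-vmss000000
```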

You can get a test environment in test. I have found the UCX perftest useful. We will add examples with HPC applications (or a link to a repository with them) if requested.

Install Usernetes

For Usernetes, you should be running on host machines that already have the drivers. Given that setup, you should see /dev/infiniband already in the container. We don't need to install drivers, but we do (and should) install the RDMA shared device plugin daemonset so that Kubernetes handles the devices properly.
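Before applying the daemonset, it can be worth confirming that the device directory really is visible. A minimal sketch, with a hypothetical function name and an optional directory argument so it can be pointed elsewhere for testing:

```bash
#!/usr/bin/env bash
# Sketch: succeed only if the Infiniband device directory exists and
# is not empty, and list its entries.

has_infiniband_devices() {
    local dev_dir="${1:-/dev/infiniband}"
    if [ -d "$dev_dir" ] && [ -n "$(ls -A "$dev_dir")" ]; then
        ls "$dev_dir"
        return 0
    fi
    echo "no devices under $dev_dir" >&2
    return 1
}

# Example usage inside the container (not run here):
#   has_infiniband_devices || echo "host drivers are not visible"
```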

```bash
kubectl apply -k ./daemonset-usernetes/

# Check
kubectl logs -n kube-system rdma-shared-dp-ds-hghzj
```
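The pod name above ends in a generated suffix, so it will differ on your cluster. The sketch below looks the pods up by the `rdma-shared-dp-ds` prefix taken from that name and prints the logs of each match. The function name `rdma_plugin_logs` and its defaults are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: print logs for every pod whose name starts with a prefix.
# Defaults mirror the namespace and pod name prefix shown above.

rdma_plugin_logs() {
    local namespace="${1:-kube-system}"
    local prefix="${2:-rdma-shared-dp-ds}"
    local pod
    local status=1
    for pod in $(kubectl get pods -n "$namespace" -o name | grep "^pod/$prefix"); do
        echo "== $pod =="
        kubectl logs -n "$namespace" "$pod"
        status=0
    done
    # Non-zero when no pod matched the prefix.
    return "$status"
}

# Example usage (not run here):
#   rdma_plugin_logs
```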

License

HPCIC DevTools is distributed under the terms of the MIT license. All new contributions must be made under this license.

See LICENSE, COPYRIGHT, and NOTICE for details.

SPDX-License-Identifier: (MIT)

LLNL-CODE-842614

Owner

  • Name: Converged Computing
  • Login: converged-computing
  • Kind: organization

The best of cloud and high performance computing: technology and community combined.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: Sochat
  given-names: Vanessa
  orcid: https://orcid.org/0000-0002-4387-3819
title: AKS Infiniband Install Daemonset
version: 0.1.0
doi: 10.5281/zenodo.15253171
date-released: 2025-04-20
url: "https://github.com/converged-computing/aks-infiniband-install"

GitHub Events

Total
  • Release event: 1
  • Member event: 1
  • Push event: 3
  • Create event: 1
Last Year
  • Release event: 1
  • Member event: 1
  • Push event: 3
  • Create event: 1

Dependencies

ubuntu20.04/Dockerfile docker
  • ubuntu 20.04 build
ubuntu22.04/Dockerfile docker
  • ubuntu 22.04 build