https://github.com/awslabs/aws-virtual-gpu-device-plugin

The AWS virtual GPU device plugin lets you use smaller virtual GPUs for your machine learning inference workloads.


Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.2%) to scientific vocabulary

Keywords

gpu kubernetes nvidia
Last synced: 5 months ago

Repository


Basic Info
Statistics
  • Stars: 204
  • Watchers: 49
  • Forks: 31
  • Open Issues: 15
  • Releases: 2
Archived
Topics
gpu kubernetes nvidia
Created almost 6 years ago · Last pushed about 2 years ago
Metadata Files
Readme Changelog Contributing License Code of conduct

README.md

Virtual GPU device plugin for Kubernetes

The virtual device plugin for Kubernetes is a DaemonSet that allows you to automatically:
  • Expose an arbitrary number of virtual GPUs on the GPU nodes of your cluster.
  • Run ML serving containers backed by accelerators with low latency and low cost in your Kubernetes cluster.
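As a rough illustration of what "exposing an arbitrary number of virtual GPUs" means for a device plugin, the sketch below expands each physical GPU into a fixed number of virtual device IDs, which the kubelet would then count as allocatable `k8s.amazonaws.com/vgpu` units. The helper name `expandVirtualDevices`, the `<gpu>-vgpu<n>` ID format, and the per-GPU count are hypothetical assumptions, not the plugin's actual naming scheme.

```go
package main

import "fmt"

// expandVirtualDevices is a hypothetical helper: it turns each physical GPU ID
// into vgpusPerGPU virtual device IDs. The "<gpu>-vgpu<n>" naming is an
// assumption for illustration only.
func expandVirtualDevices(physicalGPUs []string, vgpusPerGPU int) []string {
	if vgpusPerGPU <= 0 {
		return nil // nothing to advertise
	}
	ids := make([]string, 0, len(physicalGPUs)*vgpusPerGPU)
	for _, gpu := range physicalGPUs {
		for i := 0; i < vgpusPerGPU; i++ {
			ids = append(ids, fmt.Sprintf("%s-vgpu%d", gpu, i))
		}
	}
	return ids
}

func main() {
	// Two physical GPUs, each split into 4 virtual GPUs (assumed values).
	devices := expandVirtualDevices([]string{"GPU-0", "GPU-1"}, 4)
	fmt.Println(len(devices))           // 8 virtual devices
	fmt.Println(devices[0], devices[7]) // GPU-0-vgpu0 GPU-1-vgpu3
}
```

The point of the design is that Kubernetes only schedules whole device units, so splitting one GPU into many IDs is what lets several pods share it.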

This repository contains the AWS virtual GPU implementation of the Kubernetes device plugin.

Prerequisites

The prerequisites for running the virtual device plugin are:
  • NVIDIA drivers ~= 361.93
  • nvidia-docker version > 2.0 (see how to install it and its prerequisites)
  • Docker configured with nvidia as the default runtime
  • Kubernetes version >= 1.10
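The Kubernetes requirement is a numeric version comparison, which is easy to get wrong if versions are compared as strings ("1.9" sorts after "1.10" lexically). A minimal sketch of such a check, assuming a `GitVersion`-style string such as `v1.16.0`; the helper `meetsMinVersion` is hypothetical and is not part of the plugin.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// meetsMinVersion (hypothetical) reports whether a version string such as
// "v1.16.0" is at least major.minor. Components are compared as integers,
// and anything after the minor component (patch, "-eks-..." suffixes) is ignored.
func meetsMinVersion(version string, major, minor int) (bool, error) {
	parts := strings.SplitN(strings.TrimPrefix(version, "v"), ".", 3)
	if len(parts) < 2 {
		return false, fmt.Errorf("unexpected version %q", version)
	}
	maj, err := strconv.Atoi(parts[0])
	if err != nil {
		return false, err
	}
	mn, err := strconv.Atoi(parts[1])
	if err != nil {
		return false, err
	}
	return maj > major || (maj == major && mn >= minor), nil
}

func main() {
	for _, v := range []string{"v1.16.0", "v1.9.11"} {
		ok, err := meetsMinVersion(v, 1, 10)
		fmt.Println(v, ok, err)
	}
}
```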

Limitations

  • This solution is built on top of Volta Multi-Process Service (MPS). You can only use it on instance types with Tesla V100 or newer GPUs (currently only Amazon EC2 P3 and Amazon EC2 G4 instances).
  • By default, the virtual GPU device plugin sets the GPU compute mode to EXCLUSIVE_PROCESS. This means the GPU is assigned to the MPS process, and individual process threads can submit work to the GPU concurrently through the MPS server. The GPU cannot be used for any other purpose.
  • If a workload requests more than 1 k8s.amazonaws.com/vgpu, the virtual GPU device plugin only works on single-GPU instances such as p3.2xlarge.
  • The virtual GPU device plugin cannot be used together with the NVIDIA device plugin. You can label nodes and use a node selector to install the virtual GPU device plugin only on the labeled nodes.
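Two of the limitations above (no coexistence with the NVIDIA device plugin, and multi-vGPU requests only on single-GPU instances) can be expressed as a simple pre-flight check. This is a hypothetical sketch: `checkVGPURequest` and its inputs are illustrative and are not part of the plugin, which does not expose such a function.

```go
package main

import "fmt"

// checkVGPURequest is a hypothetical pre-flight check encoding two of the
// documented limitations. It returns a list of human-readable problems.
func checkVGPURequest(requested, physicalGPUsOnNode int, nvidiaPluginOnNode bool) []string {
	var problems []string
	if nvidiaPluginOnNode {
		problems = append(problems, "the NVIDIA device plugin is also installed on this node")
	}
	if requested > 1 && physicalGPUsOnNode > 1 {
		problems = append(problems, fmt.Sprintf(
			"requesting %d vGPUs is only supported on single-GPU instances; node has %d GPUs",
			requested, physicalGPUsOnNode))
	}
	return problems
}

func main() {
	// 2 vGPUs on a node with 4 physical GPUs (for example p3.8xlarge).
	for _, p := range checkVGPURequest(2, 4, false) {
		fmt.Println(p)
	}
}
```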

High Level Design

[Diagram: device-plugin]

Quick Start

Label GPU node groups

```bash
kubectl label node <your_k8s_node_name> k8s.amazonaws.com/accelerator=vgpu
```

Enabling virtual GPU Support in Kubernetes

Update the node selector label in the manifest file to match the labels of your GPU node group, then apply it to Kubernetes.

```shell
$ kubectl create -f https://raw.githubusercontent.com/awslabs/aws-virtual-gpu-device-plugin/v0.1.1/manifests/device-plugin.yml
```
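The node label applied earlier only has an effect if the DaemonSet's node selector matches it. Kubernetes `nodeSelector` semantics are strict equality: a node qualifies only if it carries every selector key with exactly the same value. The sketch below mirrors that rule; it assumes the manifest selects on `k8s.amazonaws.com/accelerator=vgpu`, and `matchesNodeSelector` is an illustrative helper, not the scheduler's code.

```go
package main

import "fmt"

// matchesNodeSelector mirrors spec.nodeSelector semantics: every key in the
// selector must exist on the node with an identical value.
func matchesNodeSelector(nodeLabels, selector map[string]string) bool {
	for k, v := range selector {
		if got, ok := nodeLabels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func main() {
	// Assumed selector, matching the label from the kubectl label step.
	selector := map[string]string{"k8s.amazonaws.com/accelerator": "vgpu"}
	labeled := map[string]string{"k8s.amazonaws.com/accelerator": "vgpu", "kubernetes.io/os": "linux"}
	unlabeled := map[string]string{"kubernetes.io/os": "linux"}
	fmt.Println(matchesNodeSelector(labeled, selector))   // true
	fmt.Println(matchesNodeSelector(unlabeled, selector)) // false
}
```

Because unlabeled nodes never match, this is also how the plugin can be kept off nodes that run the NVIDIA device plugin.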

Running GPU Jobs

Virtual NVIDIA GPUs can now be consumed via container level resource requirements using the resource name k8s.amazonaws.com/vgpu:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: resnet-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: resnet-server
  template:
    metadata:
      labels:
        app: resnet-server
    spec:
      # hostIPC is required for MPS communication
      hostIPC: true
      containers:
      - name: resnet-container
        image: seedjeffwan/tensorflow-serving-gpu:resnet
        args:
        # Set the limit based on the vGPU amount so the tf-serving process does not occupy all of the GPU memory
        - --per_process_gpu_memory_fraction=0.2
        env:
        - name: MODEL_NAME
          value: resnet
        ports:
        - containerPort: 8501
        # Use the virtual GPU resource here
        resources:
          limits:
            k8s.amazonaws.com/vgpu: 1
        volumeMounts:
        - name: nvidia-mps
          mountPath: /tmp/nvidia-mps
      volumes:
      - name: nvidia-mps
        hostPath:
          path: /tmp/nvidia-mps
```
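The manifest's comment asks you to size TensorFlow Serving's `--per_process_gpu_memory_fraction` to the vGPU request. A minimal sketch of that arithmetic, assuming each physical GPU is split into equally sized vGPUs so the fraction is `requested / vgpusPerGPU`; the helper `memoryFractionFlag` and the vGPUs-per-GPU value are hypothetical and not a documented plugin setting. Under that assumption, 5 vGPUs per GPU and a request of 1 gives the 0.2 used in the example.

```go
package main

import (
	"fmt"
	"strconv"
)

// memoryFractionFlag (hypothetical) derives a TensorFlow Serving memory
// fraction flag from the requested vGPUs and an assumed vGPUs-per-GPU split.
func memoryFractionFlag(requestedVGPUs, vgpusPerGPU int) (string, error) {
	if requestedVGPUs <= 0 || vgpusPerGPU <= 0 || requestedVGPUs > vgpusPerGPU {
		return "", fmt.Errorf("invalid request: %d of %d vGPUs", requestedVGPUs, vgpusPerGPU)
	}
	fraction := float64(requestedVGPUs) / float64(vgpusPerGPU)
	// 'f' with precision -1 prints the shortest exact representation, e.g. "0.2".
	return "--per_process_gpu_memory_fraction=" + strconv.FormatFloat(fraction, 'f', -1, 64), nil
}

func main() {
	flag, err := memoryFractionFlag(1, 5)
	if err != nil {
		panic(err)
	}
	fmt.Println(flag) // --per_process_gpu_memory_fraction=0.2
}
```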

WARNING: If you don't request GPUs when using the device plugin, all the GPUs on the machine will be exposed inside your container.

Check the full example here

Development

Please check Development for more details.

Credits

The project idea comes from @RenaudWasTaken's comment in kubernetes/kubernetes#52757 and from Alibaba's solution by @cheyang, described in "GPU Sharing Scheduler Extender Now Supports Fine-Grained Kubernetes Clusters".


License

This project is licensed under the Apache-2.0 License.

Owner

  • Name: Amazon Web Services - Labs
  • Login: awslabs
  • Kind: organization
  • Location: Seattle, WA

AWS Labs

GitHub Events

Total
  • Watch event: 4
  • Fork event: 1
Last Year
  • Watch event: 4
  • Fork event: 1

Issues and Pull Requests

Last synced: almost 2 years ago

All Time
  • Total issues: 20
  • Total pull requests: 14
  • Average time to close issues: 3 months
  • Average time to close pull requests: 3 months
  • Total issue authors: 19
  • Total pull request authors: 7
  • Average comments per issue: 1.15
  • Average comments per pull request: 0.43
  • Merged pull requests: 10
  • Bot issues: 0
  • Bot pull requests: 3
Past Year
  • Issues: 1
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 1
Top Authors
Issue Authors
  • walkley (2)
  • FanniSun (1)
  • t-ibayashi-safie (1)
  • vikranthkeerthipati (1)
  • parth-chudasama (1)
  • jaggerwang (1)
  • Narsil (1)
  • amybachir (1)
  • Apokleos (1)
  • stephanrb3 (1)
  • nneram (1)
  • valafon (1)
  • cyyeh (1)
  • stevensu1977 (1)
  • josephlee518 (1)
Pull Request Authors
  • Jeffwan (5)
  • dependabot[bot] (3)
  • walkley (2)
  • parisnakitakejser (2)
  • Wei-1 (1)
  • hemandee (1)
  • linjungz (1)
Top Labels
Issue Labels
Pull Request Labels
dependencies (3)

Packages

  • Total packages: 1
  • Total downloads: unknown
  • Total docker downloads: 469,718
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 2
proxy.golang.org: github.com/awslabs/aws-virtual-gpu-device-plugin
  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Docker Downloads: 469,718
Rankings
Docker downloads count: 0.5%
Stargazers count: 4.0%
Forks count: 4.6%
Average: 5.3%
Dependent packages count: 8.2%
Dependent repos count: 9.3%
Last synced: 6 months ago

Dependencies

go.mod go
  • github.com/NVIDIA/gpu-monitoring-tools v0.0.0-20191011002627-7a750c7e4f8b
  • github.com/fsnotify/fsnotify v1.4.7
  • github.com/gogo/protobuf v1.3.0
  • github.com/golang/protobuf v1.3.2
  • golang.org/x/net v0.0.0-20190812203447-cdfb69ac37fc
  • google.golang.org/genproto v0.0.0-20190926190326-7ee9db18f195
  • google.golang.org/grpc v1.24.0
  • k8s.io/api v0.0.0
  • k8s.io/api=>k8s.io/api v0.0.0-20190819141258-3544db3b9e44
  • k8s.io/apiextensions-apiserver=>k8s.io/apiextensions-apiserver v0.0.0-20190819143637-0dbe462fe92d
  • k8s.io/apimachinery v0.0.0
  • k8s.io/apimachinery=>k8s.io/apimachinery v0.0.0-20190817020851-f2f3a405f61d
  • k8s.io/apiserver=>k8s.io/apiserver v0.0.0-20190819142446-92cc630367d0
  • k8s.io/cli-runtime=>k8s.io/cli-runtime v0.0.0-20190819144027-541433d7ce35
  • k8s.io/client-go v0.0.0
  • k8s.io/client-go=>k8s.io/client-go v0.0.0-20190819141724-e14f31a72a77
  • k8s.io/cloud-provider=>k8s.io/cloud-provider v0.0.0-20190819145148-d91c85d212d5
  • k8s.io/cluster-bootstrap=>k8s.io/cluster-bootstrap v0.0.0-20190819145008-029dd04813af
  • k8s.io/code-generator=>k8s.io/code-generator v0.0.0-20190612205613-18da4a14b22b
  • k8s.io/component-base=>k8s.io/component-base v0.0.0-20190819141909-f0f7c184477d
  • k8s.io/cri-api=>k8s.io/cri-api v0.0.0-20190817025403-3ae76f584e79
  • k8s.io/csi-translation-lib=>k8s.io/csi-translation-lib v0.0.0-20190819145328-4831a4ced492
  • k8s.io/kube-aggregator=>k8s.io/kube-aggregator v0.0.0-20190819142756-13daafd3604f
  • k8s.io/kube-controller-manager=>k8s.io/kube-controller-manager v0.0.0-20190819144832-f53437941eef
  • k8s.io/kube-proxy=>k8s.io/kube-proxy v0.0.0-20190819144346-2e47de1df0f0
  • k8s.io/kube-scheduler=>k8s.io/kube-scheduler v0.0.0-20190819144657-d1a724e0828e
  • k8s.io/kubectl=>k8s.io/kubectl v0.0.0-20190602132728-7075c07e78bf
  • k8s.io/kubelet=>k8s.io/kubelet v0.0.0-20190819144524-827174bad5e8
  • k8s.io/kubernetes v1.16.0
  • k8s.io/legacy-cloud-providers=>k8s.io/legacy-cloud-providers v0.0.0-20190819145509-592c9a46fd00
  • k8s.io/metrics=>k8s.io/metrics v0.0.0-20190819143841-305e1cef1ab1
  • k8s.io/node-api=>k8s.io/node-api v0.0.0-20190819145652-b61681edbd0a
  • k8s.io/sample-apiserver=>k8s.io/sample-apiserver v0.0.0-20190819143045-c84c31c165c4
  • k8s.io/sample-cli-plugin=>k8s.io/sample-cli-plugin v0.0.0-20190819144209-f9ca4b649af0
  • k8s.io/sample-controller=>k8s.io/sample-controller v0.0.0-20190819143301-7c475f5e1313
  • k8s.io/utils=>k8s.io/utils v0.0.0-20190221042446-c2654d5206da
go.sum go
  • 552 dependencies
Dockerfile docker
  • amazonlinux latest build
  • golang 1.13 build
examples/Dockerfile docker
  • tensorflow/serving 1.15.0-gpu build