Share a GPU between pods on AWS EKS

k8s-device-plugin

11 2,353 9.5 Go

NVIDIA device plugin for Kubernetes

If you have ever tried to use GPU-based instances with AWS ECS, or with EKS using the default NVIDIA plugin, you know that tasks/pods cannot share the same GPU on an instance. If you want to add more replicas to your service (for redundancy or load balancing), you need one GPU for each replica.

aws-eks-share-gpu

1 8 0.0 HCL

How to share the same GPU between pods on AWS EKS

This project (available here) uses the k8s device plugin described by this AWS blog post to make GPU-based nodes publish the amount of GPU resource they have available. Instead of the amount of VRAM available or some abstract metric, this plugin advertises the number of pods/processes that can be connected to the GPU. This is controlled by what NVIDIA calls Multi-Process Service (MPS).
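To make the idea of "GPU slots" concrete, here is a minimal sketch in Python that builds a Deployment manifest in which each pod asks for one slot instead of a whole GPU. Everything here is illustrative: the resource name `k8s.amazonaws.com/vgpu` is an assumption about what the device plugin advertises (check the plugin's own manifests), and the names `build_deployment`, `replicas_that_fit`, `slots_per_gpu`, and the image URL are hypothetical.

```python
import json

# Assumption: the extended resource name advertised by the device plugin.
# Verify it against the plugin's manifests before relying on it.
VGPU_RESOURCE = "k8s.amazonaws.com/vgpu"


def build_deployment(name, image, replicas, vgpu_per_pod=1):
    """Return a Deployment manifest whose pods each request GPU slots."""
    labels = {"app": name}
    container = {
        "name": name,
        "image": image,
        # Extended resources are requested under limits; Kubernetes uses
        # the limit as the request when only the limit is specified.
        "resources": {"limits": {VGPU_RESOURCE: str(vgpu_per_pod)}},
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


def replicas_that_fit(slots_per_gpu, gpus_per_node, vgpu_per_pod=1):
    """Number of pods one node can host, given the slots it advertises."""
    return (slots_per_gpu * gpus_per_node) // vgpu_per_pod


if __name__ == "__main__":
    # Hypothetical service name and image, for illustration only.
    manifest = build_deployment(
        "inference", "example.com/inference:latest", replicas=3
    )
    print(json.dumps(manifest, indent=2))
```

Because the manifest is printed as JSON, the output could be piped to `kubectl apply -f -`. The helper `replicas_that_fit` only does the scheduling arithmetic: the scheduler places pods on a node until the advertised slot count is exhausted, regardless of how much VRAM each process actually uses.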

aws-virtual-gpu-device-plugin

3 132 0.0 Jupyter Notebook

Discontinued. The AWS virtual GPU device plugin provides the capability to use smaller virtual GPUs for your machine learning inference workloads.

asdf-hashicorp

6 222 4.6 Shell

HashiCorp plugin for the asdf version manager

asdf-tflint

2 3 3.2 Shell

An asdf plugin for installing terraform-linters/tflint.

asdf-awscli

2 49 4.3 Shell

aws-ami-gpu-monitoring

1 3 3.2 HCL

This project contains the code necessary to build an AWS AMI with monitoring capabilities of GPU usage (among other metrics) using CloudWatch.

From that repo, the only change is the base AMI: in this case, an AMI tailored for accelerated hardware on EKS was used. The list of compatible AMIs for EKS can be found at this link, which AWS updates regularly. The AWS AMI also ships with the SSM agent, so nothing needs to change in that regard.

containers-roadmap

80 5,137 2.0 Shell

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).

k2tf

4 1,141 2.7 Go

Kubernetes YAML to Terraform HCL converter

Pro tip: If you want to convert k8s YAML files to .tf, you can use k2tf (repo), which converts the resource types in the YAML to their appropriate counterparts in the k8s provider for Terraform. To install it, just:

terraform-provider-kubernetes

6 1,540 9.0 Go

Terraform Kubernetes provider

After the resources are provisioned, you might want to run terraform apply -refresh-only to refresh your local state, since the creation of some resources changes the state of others within AWS. Also, state differences on metadata.resource_version of k8s resources almost always show up after an apply. This seems to be related to this issue.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
