Share a GPU between pods on AWS EKS

This page summarizes the projects mentioned and recommended in the original post on dev.to

Our great sponsors
  • WorkOS - The modern identity platform for B2B SaaS
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • SaaSHub - Software Alternatives and Reviews
  • k8s-device-plugin

    NVIDIA device plugin for Kubernetes

  • If you ever tried to use GPU-based instances with AWS ECS, or on EKS using the default Nvidia plugin, you would know that it's not possible to make a task/pod shared the same GPU on an instance. If you want to add more replicas to your service (for redundancy or load balancing), you would need one GPU for each replica.

  • aws-eks-share-gpu

    How to share the same GPU between pods on AWS EKS

  • This project (available here) uses the k8s device plugin described by this AWS blog post to make GPU-based nodes publish the amount of GPU resource they have available. Instead of the amount of VRAM available or some abstract metric, this plugin advertises the amount of pods/processes that can be connected to the GPU. This is controlled by what is called by NVIDIA as Multi-Process Service (MPS).

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
  • aws-virtual-gpu-device-plugin

    Discontinued AWS virtual gpu device plugin provides capability to use smaller virtual gpus for your machine learning inference workloads

  • This project (available here) uses the k8s device plugin described by this AWS blog post to make GPU-based nodes publish the amount of GPU resource they have available. Instead of the amount of VRAM available or some abstract metric, this plugin advertises the amount of pods/processes that can be connected to the GPU. This is controlled by what is called by NVIDIA as Multi-Process Service (MPS).

  • asdf-hashicorp

    HashiCorp plugin for the asdf version manager

  • asdf-tflint

    An asdf plugin for installing terraform-linters/tflint.

  • asdf-awscli

  • aws-ami-gpu-monitoring

    This project contains the code necessary to build an AWS AMI with monitoring capabilities of GPU usage (among other metrics) using CloudWatch.

  • From that repo, the only thing changed is the base AMI, which in this case an AMI tailored for accelerated hardware on EKS was used. The list of compatible AMIs for EKS can be obtained in this link updated regularly by AWS. Also, the AMI from AWS comes with SSM agent in it, so no need to change anything regarding that.

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • containers-roadmap

    This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).

  • From that repo, the only thing changed is the base AMI, which in this case an AMI tailored for accelerated hardware on EKS was used. The list of compatible AMIs for EKS can be obtained in this link updated regularly by AWS. Also, the AMI from AWS comes with SSM agent in it, so no need to change anything regarding that.

  • k2tf

    Kubernetes YAML to Terraform HCL converter

  • Pro tip: If you want to convert k8s yaml files to .tf, you can use k2tf (repo) that is able to convert the resource types of the yaml top their appropriated counterparts of the k8s provider for terraform. To install it, just:

  • terraform-provider-kubernetes

    Terraform Kubernetes provider

  • After the resources be provisioned, you might want to run terraform apply -refresh-only to refresh your local state as the creation of some resource change the state of others within AWS. Also, state differences on metadata.resource_version of k8s resources almost always show up after an apply. This seems to be related to this issue.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts