Reduce cost by 3x in the cloud and improve GPU usage in shared clusters with AdaptDL for PyTorch

This page summarizes the projects mentioned and recommended in the original post on /r/u_Henry-GO.

  • adaptdl

    Resource-adaptive cluster scheduler for deep learning training.

  • AdaptDL monitors training-job performance in real time and elastically rescales resources (GPUs, compute instances) while jobs are running. For each training job, AdaptDL automatically tunes the batch size, learning rate, and gradient accumulation. In the cloud (e.g. AWS), AdaptDL can also auto-scale the number of provisioned Spot Instances. We've seen shared-cluster training jobs at Petuum and our partners complete 2–3x faster on average, and at 3x lower cost on AWS using Spot Instances.

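To illustrate the kind of bookkeeping involved when a job is rescaled mid-training, here is a minimal, hypothetical sketch. It is not AdaptDL's API or its actual tuning algorithm. It shows how a scheduler might split a target global batch across however many GPUs a job currently holds, fall back to gradient accumulation when each GPU's share exceeds a memory cap, and rescale the learning rate. The helper `plan_batch`, the `max_local_bsz` cap, and the linear learning-rate scaling rule are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class BatchPlan:
    local_batch_size: int    # samples per GPU per micro-batch
    accumulation_steps: int  # micro-batches accumulated before each optimizer step
    global_batch_size: int   # effective samples per optimizer step across all GPUs
    learning_rate: float


def plan_batch(target_global_bsz: int, num_replicas: int, max_local_bsz: int,
               base_lr: float, base_global_bsz: int) -> BatchPlan:
    """Hypothetical helper: split a target global batch across the current replicas."""
    if min(target_global_bsz, num_replicas, max_local_bsz, base_global_bsz) <= 0:
        raise ValueError("batch sizes and replica count must be positive")
    # Samples each replica must contribute per optimizer step (rounded up).
    per_replica = math.ceil(target_global_bsz / num_replicas)
    # Fewest micro-batches that keep each one within the per-GPU memory cap.
    accum = math.ceil(per_replica / max_local_bsz)
    local = math.ceil(per_replica / accum)
    global_bsz = local * accum * num_replicas
    # Linear LR scaling is an assumption here; AdaptDL's own rule may differ.
    lr = base_lr * global_bsz / base_global_bsz
    return BatchPlan(local, accum, global_bsz, lr)


if __name__ == "__main__":
    # Simulate the job being rescaled from 4 to 8 to 3 GPUs while it runs.
    for gpus in (4, 8, 3):
        print(gpus, plan_batch(1024, gpus, max_local_bsz=128,
                               base_lr=0.1, base_global_bsz=256))
```

When the job shrinks from 4 GPUs to 3, the sketch keeps the effective batch near 1024 by switching to three 114-sample micro-batches per GPU; the small overshoot (1026) comes from rounding up so that no samples are dropped.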

