[D] managing compute for long running ML training jobs

This page summarizes the projects mentioned and recommended in the original post on /r/MachineLearning

  • determined

    Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.

  • These are some of the problems we are trying to solve with the Determined training platform. Determined can be run with or without k8s - the k8s version inherits some of the scheduling problems of k8s, but the non-k8s version uses a custom gang scheduler designed for large-scale ML training. Determined offers a priority scheduler that lets smaller jobs run while still letting you schedule a large distributed job whenever you need to, by giving that job a higher priority.
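
To make the gang-plus-priority idea in the comment above concrete, here is a toy sketch in Python. It is not Determined's scheduler or API: the `Job` class, the `schedule` function, and the "lower number runs first" convention are all hypothetical names and choices made for illustration, and preemption of already-running jobs is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int      # GPUs the job needs, all at the same time (gang constraint)
    priority: int  # illustrative convention only: lower number is scheduled first


def schedule(pending, free_gpus):
    """Return the names of the jobs to start now, given `free_gpus` idle GPUs.

    Jobs are visited in priority order. A job starts only if its whole gang
    fits in the remaining GPUs; otherwise it is skipped, which lets smaller
    jobs further down the queue use the leftover capacity.
    """
    started = []
    # sorted() is stable, so jobs with equal priority keep submission order.
    for job in sorted(pending, key=lambda j: j.priority):
        if job.gpus <= free_gpus:
            started.append(job.name)
            free_gpus -= job.gpus
    return started


pending = [
    Job("small-a", gpus=1, priority=50),
    Job("big-distributed", gpus=8, priority=10),
    Job("small-b", gpus=2, priority=50),
]
print(schedule(pending, free_gpus=8))  # ['big-distributed']
print(schedule(pending, free_gpus=4))  # ['small-a', 'small-b']
```

With all 8 GPUs idle, the high-priority distributed job takes the whole cluster; with only 4 idle, it cannot be placed as a gang, so the two small jobs run instead. A real scheduler would also have to decide whether to preempt running work to make room for the large job, which this sketch does not attempt.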

  • goofys

    a high-performance, POSIX-ish Amazon S3 file system written in Go

  • We don't specifically address the dataset issue (we let you bring your data wherever it is). What is your scale (dataset size, number of GPUs, file size)? I second the FSx for Lustre recommendation, assuming you have pretty large scale (the smallest FSx cluster you can create is decently large). It can also be reasonable to load your data directly from cloud storage as you need it. Petastorm is a really good option to look into, but it is fairly heavyweight (you need to transform your data into Parquet first using the Petastorm utility). You can also mount cloud storage buckets as FUSE filesystems, which can be pretty convenient - Goofys is a good, high-performance option.
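
Once a bucket is mounted as a FUSE filesystem, training code can treat it like any other directory. The sketch below uses only the Python standard library and assumes the bucket is already mounted; the function names, the `*.bin` file pattern, and the mount path in the usage note are hypothetical. Reading in large sequential chunks is an assumption about what suits object-store-backed mounts, not a measured recommendation.

```python
from pathlib import Path
from typing import Iterator


def iter_shards(mount_point: str, pattern: str = "*.bin") -> Iterator[Path]:
    """Yield data files under the mounted bucket in a stable, sorted order."""
    # rglob walks subdirectories too; sorting keeps the order reproducible.
    yield from sorted(Path(mount_point).rglob(pattern))


def read_chunks(path: Path, chunk_size: int = 1 << 20) -> Iterator[bytes]:
    """Stream one file front to back in fixed-size chunks (1 MiB by default)."""
    with open(path, "rb") as f:
        # f.read returns b"" at end of file, which ends the loop.
        while chunk := f.read(chunk_size):
            yield chunk
```

A training loop could then iterate `for shard in iter_shards("/mnt/training-data")` (a made-up mount path) and feed `read_chunks(shard)` into its own decoding step, without any cloud-specific client code.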


