[D] managing compute for long running ML training jobs

determined

10 2,861 9.9 Go

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.

These are some of the problems we are trying to solve with the Determined training platform. Determined can run with or without k8s. The k8s version inherits some of the scheduling problems of k8s, while the non-k8s version uses a custom gang scheduler designed for large-scale ML training. Determined also offers a priority scheduler: smaller jobs can run in the meantime, and you can still schedule a large distributed job whenever you need to by giving it a higher priority.
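To make the priority plus gang scheduling idea concrete, here is a minimal toy sketch in Python. It is not Determined's scheduler or API. The `Job` class, the `schedule` function, and the "lower number means more important" convention are all assumptions made for illustration. A job starts only if all of its slots can be granted at once, and a more important job may preempt less important running jobs to get them.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    slots: int      # GPUs the job needs, all at once (gang constraint)
    priority: int   # assumed convention: lower number = more important


def schedule(pending, running, total_slots):
    """One scheduling pass. Returns (jobs to start, jobs to preempt)."""
    to_start, to_preempt = [], []
    active = list(running)
    free = total_slots - sum(j.slots for j in active)

    # Consider the most important pending jobs first.
    for job in sorted(pending, key=lambda j: j.priority):
        # Only strictly less important running jobs may be preempted,
        # least important first.
        victims = sorted(
            (r for r in active if r.priority > job.priority),
            key=lambda r: r.priority,
            reverse=True,
        )
        chosen, reclaimable = [], free
        for victim in victims:
            if reclaimable >= job.slots:
                break
            chosen.append(victim)
            reclaimable += victim.slots

        if reclaimable < job.slots:
            continue  # never start a job with a partial allocation

        for victim in chosen:
            active.remove(victim)
            to_preempt.append(victim)
        free = reclaimable - job.slots
        active.append(job)
        to_start.append(job)

    return to_start, to_preempt
```

With two small jobs holding 2 GPUs each on an 8-GPU cluster, submitting an 8-GPU job at a more important priority makes `schedule` return that job to start and both small jobs to preempt. A real scheduler would also checkpoint preempted jobs and requeue them, which this sketch leaves out.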

goofys

16 5,037 0.0 Go

a high-performance, POSIX-ish Amazon S3 file system written in Go

We don't specifically address the dataset issue (we let you bring your data wherever it is). What is your scale (dataset size, number of GPUs, file size)? I second the FSx for Lustre recommendation, assuming you have pretty large scale (the smallest FSx cluster you can create is decently large). It can also be reasonable to load your data directly from cloud storage as you need it. Petastorm is a really good option to look into, but it is fairly heavyweight (you need to transform your data into Parquet first using the Petastorm utility). You can also mount cloud storage buckets as FUSE filesystems, which can be pretty convenient. Goofys is a good, high-performance option.
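As a rough illustration of "load your data directly from cloud storage as you need it", here is a small Python sketch of an on-demand dataset with an in-memory LRU cache. The `LazyObjectDataset` class and its `fetch` callable are hypothetical names, not part of any library mentioned above. In practice `fetch` might wrap an object-store GET or read a file from a FUSE mount, but that part is left to the caller.

```python
from collections import OrderedDict
from typing import Callable, Iterable


class LazyObjectDataset:
    """Fetch samples by key on first access and keep a small LRU cache."""

    def __init__(self, keys: Iterable[str],
                 fetch: Callable[[str], bytes],
                 cache_size: int = 128):
        self.keys = list(keys)
        self.fetch = fetch            # hypothetical: caller-supplied download function
        self.cache_size = cache_size
        self._cache = OrderedDict()   # key -> bytes, oldest entry first

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, index):
        key = self.keys[index]
        if key in self._cache:
            # Cache hit: mark as most recently used.
            self._cache.move_to_end(key)
            return self._cache[key]
        # Cache miss: download now, then evict the least recently used entry if full.
        data = self.fetch(key)
        self._cache[key] = data
        if len(self._cache) > self.cache_size:
            self._cache.popitem(last=False)
        return data
```

Whether this beats a shared filesystem depends on the scale questions above: many small files tend to favor caching or repacking, while a few large files stream reasonably well straight from object storage.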

Related posts

Migrating instance to AWS GovCloud
1 project | /r/aws | 1 Nov 2022

How do you manage large training datasets?
1 project | /r/computervision | 2 Jun 2022

Mount S3 Objects to Kubernetes Pods
2 projects | dev.to | 31 Jan 2022

How does HDFS relate to Spark - Noob Question Reality Check
1 project | /r/apachespark | 16 Dec 2021

AWS Developer Forums: S3 Block Devices
1 project | news.ycombinator.com | 30 Dec 2020

This page summarizes the projects mentioned and recommended in the original post on /r/MachineLearning.
Tags: s3-bucket, Deep Learning, S3, Machine Learning, Posix
Post date: 21 Jun 2021
