-
determined
Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. It works with PyTorch and TensorFlow.
These are some of the problems we are trying to solve with the Determined training platform. Determined can be run with or without k8s - the k8s version inherits some of the scheduling problems of k8s, but the non-k8s version uses a custom gang scheduler designed for large-scale ML training. Determined also offers a priority scheduler that lets smaller jobs run while still letting you schedule a large distributed job whenever you need to, by giving it a higher priority.
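To make the gang-plus-priority idea concrete, here is a toy sketch in Python. It is not Determined's scheduler: the `Job` class, the `schedule` function, and the "lower number = more important" convention are all assumptions made up for illustration, and preemption of already-running jobs is not modeled. It only shows the all-or-nothing slot rule and how a high-priority large job gets placed ahead of smaller ones.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    slots: int     # GPUs the job needs, all at once (gang scheduling)
    priority: int  # assumption for this sketch: lower number = more important


def schedule(jobs, total_slots):
    """Return the names of the jobs that get to run.

    Gang rule: a job receives all of its slots or none of them.
    Jobs are considered from most to least important, so a large
    high-priority job claims its slots first and smaller jobs take
    whatever is left over.
    """
    free = total_slots
    running = []
    for job in sorted(jobs, key=lambda j: j.priority):
        if job.slots <= free:
            free -= job.slots
            running.append(job.name)
    return running


jobs = [
    Job("small-a", slots=1, priority=50),
    Job("small-b", slots=2, priority=50),
    Job("big-distributed", slots=8, priority=10),
]

# With only the small jobs queued, both fit on an 8-GPU cluster.
print(schedule(jobs[:2], total_slots=8))  # ['small-a', 'small-b']

# Once the high-priority 8-GPU job is queued, it takes the whole cluster.
print(schedule(jobs, total_slots=8))      # ['big-distributed']
```

On a 10-GPU cluster the same queue would run `big-distributed` plus `small-a`, since one small job still fits in the leftover slots.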
We don't specifically address the dataset issue (we let you bring your data wherever it is). What is your scale (dataset size, number of GPUs, file size)? I second the FSx for Lustre recommendation, assuming you have pretty large scale (the smallest FSx cluster you can create is decently large). It can also be reasonable to load your data directly from cloud storage as you need it. Petastorm is a really good option to look into, but it is fairly heavyweight (you need to transform your data into Parquet first using the Petastorm utility). You can also mount cloud storage buckets as FUSE filesystems, which can be pretty convenient - Goofys is a good, high-performance option.
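Part of the convenience of a FUSE mount is that the training code needs nothing bucket-specific: the bucket looks like a directory, so plain file I/O works. Below is a minimal sketch of reading samples lazily from such a directory. The `iter_samples` helper, the `.bin` suffix, and the `/mnt/my-bucket` mount point mentioned in the comment are hypothetical; a temporary directory stands in for the mount so the example runs anywhere.

```python
import os
import tempfile


def iter_samples(root, suffix=".bin"):
    """Lazily yield (relative_path, bytes) for each matching file under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for filename in sorted(filenames):
            if filename.endswith(suffix):
                path = os.path.join(dirpath, filename)
                with open(path, "rb") as f:
                    yield os.path.relpath(path, root), f.read()


# Stand-in for a mounted bucket. In practice root would be the mount
# point, for example a hypothetical /mnt/my-bucket.
with tempfile.TemporaryDirectory() as root:
    for i in range(3):
        with open(os.path.join(root, f"shard-{i}.bin"), "wb") as f:
            f.write(bytes([i]) * 4)

    for name, payload in iter_samples(root):
        print(name, len(payload))
```

Because the generator opens one file at a time, data is only pulled from the underlying storage as the training loop asks for it, which matches the "load your data as you need it" approach.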