Show HN: SpotML – Managed ML Training on Cheap AWS/GCP Spot Instances

nimbo

5 123 8.8 Python

(Discontinued) Run compute jobs on AWS as if you were running them locally.

Seems like Nimbo (https://nimbo.sh) is under a Business Source License (https://github.com/nimbo-sh/nimbo/blob/master/LICENSE), so you might want to check with them about the licensing terms for a startup that uses their code and/or docs in "production".
Otherwise, this idea is interesting and probably generalizable to other applications. Maybe I'm missing it, but what are the advantages of your service over existing solutions such as Nimbo and Spotty? FWIW, it might be worth adding this to your website.
Good luck!

Ray

43 31,179 10.0 Python

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

Neat. Congratulations on the launch!
Apart from the fact that it can deploy to both GCP and AWS, what does it do differently from AWS Batch [0]?
When we had a similar problem, we ran jobs on spot instances with AWS Batch and it worked nicely enough.
Some suggestions (for a later date):
1. Add built-in support for Ray [1] (you'd then essentially be competing with Anyscale, which is a VC-funded startup, just to contrast it with another comment on this thread) and dbt [2].
2. Support deploying coin miners (might be good for widening the product's reach, and for standing it up against the likes of ConsenSys).
3. Get in front of the many cost-optimisation consultants out there, like the Duckbill Group.
If I may, where are you building this product from? And how many are on the team?
Thanks.
[0] https://aws.amazon.com/batch/use-cases/
[1] https://ray.io/
[2] https://getdbt.com/

dbt-spark

7 364 8.6 Python

dbt-spark contains all of the code enabling dbt to work with Apache Spark and Databricks

criu-image-streamer

1 84 0.0 Rust

Enables streaming of images to and from CRIU during checkpoint/restore with low overhead

Cool, yeah, that makes total sense for ML, where you just need to run over epochs; it's less clear for other workloads.
After looking around, I'm thinking more about CRIU/Docker suspend. The Google stars aligned and I found this https://github.com/checkpoint-restore/criu-image-streamer + https://linuxplumbersconf.org/event/7/contributions/641/atta... which actually seems perfect. I wonder how fast it is.
(Or, hacking on the checkpoint idea: have a daemon periodically 'checkpoint' other programs, so that even if a checkpoint is too slow to finish within 60 seconds, you can revert to the last one. Maybe even an rsync-like application that only sends the changes.)
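To make the "revert to the last checkpoint" idea concrete for the ML case, here is a minimal sketch in plain Python. It is not how SpotML, Nimbo, or CRIU work; the names `save_checkpoint`, `load_checkpoint`, and `train`, the JSON state format, and the `interrupt_at` parameter are all made up for illustration. The assumption is that the job's state is small enough to write out after every epoch, and that a restarted instance sees the same checkpoint file.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so an interruption never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    # os.replace is an atomic rename when source and target share a filesystem.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"epoch": 0, "total": 0}
    with open(path) as handle:
        return json.load(handle)


def train(path, num_epochs, interrupt_at=None):
    """Toy training loop: resume from the checkpoint, save after every epoch.

    interrupt_at (hypothetical) simulates the instance being reclaimed
    just before that epoch would run.
    """
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], num_epochs):
        if interrupt_at is not None and epoch == interrupt_at:
            raise RuntimeError("simulated spot interruption")
        state["total"] += epoch  # stand-in for one epoch of real work
        state["epoch"] = epoch + 1
        save_checkpoint(path, state)
    return state
```

Calling `train(path, 5, interrupt_at=3)` fails partway through, and a later `train(path, 5)` picks up at epoch 3 instead of starting over. The write-to-temp-then-rename step is the part that matters: if the instance dies mid-write, the previous checkpoint is still intact.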

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Ray: Unified framework for scaling AI and Python applications

1 project | news.ycombinator.com | 3 May 2024
Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring Custom Models

1 project | news.ycombinator.com | 11 Aug 2023
Ray – an open source project for scaling AI workloads

1 project | news.ycombinator.com | 11 Aug 2023
Methods to keep agents inside grid world.

1 project | /r/reinforcementlearning | 30 Jun 2023
Is dynamic action masking possible in Rllib?

1 project | /r/reinforcementlearning | 23 Jan 2023

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Concurrency and Parallelism Ray Distributed Parallel Machine Learning
Post date: 3 Oct 2021
