Neptune.ai, which promises to streamline your workflows and make collaboration a breeze.
That said, I personally use Kubeflow hosted on a local bare-metal Kubernetes cluster (8 nodes, 4 GPUs), but a lot of it is a bear to install correctly in a multi-machine environment (specifically, this issue is still open, and exposing the built-in dashboards outside of the cluster is a problem). Also, because it's a Google product, it's very clearly intended to run in the cloud, with self-hosting very much an afterthought.
If you're not concerned about self-hosting, WandB is one of the more fully featured training monitoring tools. I've used it in the past without any issues, but the lack of data and training privacy and the lack of self-hosting options make it a hard no for anything that isn't academic. Polyaxon is an alternative, but rewriting all your variable logging to conform to its requirements makes it very difficult to switch to in the middle of a project, so you have to commit to it from the get-go.
I have an old labmate who uses a similar setup with MLflow and can endorse it.
Check out Aim: https://github.com/aimhubio/aim