-
You can check Exo out:
https://github.com/exo-explore/exo
It's a project designed to run a large model in a distributed manner. I need a GPU to run my own machine learning research pet project (mostly evolutionary neural network models for now), which is a bit different from inference needs. Training is a different story again.
But yeah, I agree. I think machine learning should become more distributed in the future.
-
Ah, this is awesome! I currently run k3s on a decently specced NixOS rig. I tried getting k3s to recognize my Nvidia GPU, even following the short guide in nixpkgs[0] for getting GPUs working in k3s, but without success.
For now I’m just using Docker’s Nvidia container runtime for containers that need GPU acceleration.
I'll likely spend more time digging into your findings, hoping it leads me to a solution for my setup!
[0] https://github.com/NixOS/nixpkgs/blob/master/pkgs/applicatio...
-
There's a bug in k8s-device-plugin that stops the plugin from even launching, as I mentioned in the article:
https://github.com/NVIDIA/k8s-device-plugin/issues/1182
And I opened a PR to fix it here:
https://github.com/NVIDIA/k8s-device-plugin/pull/1183
I'm not sure whether this bug is specific to NixOS, since its library paths and other quirks differ from those of the major Linux distros.
Another major problem was that "default_runtime_name" in the containerd config didn't work as expected. I had to create a RuntimeClass and assign it to the pod to make it pick up the Nvidia runtime.
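For anyone who wants to try the same workaround, here is a rough sketch of the two objects involved, written as a small Python script that prints them as JSON. It is not copied from my cluster: the handler name "nvidia" assumes that is what the Nvidia runtime is called in your containerd config, and the pod name and image are hypothetical placeholders.

```python
import json


def runtime_class(name="nvidia", handler="nvidia"):
    """Build a RuntimeClass manifest.

    `handler` must match the runtime name configured in containerd;
    "nvidia" is an assumption here, not a value taken from the article.
    """
    return {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        "handler": handler,
    }


def gpu_pod(pod_name, image, runtime_class_name="nvidia", gpus=1):
    """Build a Pod manifest that opts into the RuntimeClass explicitly."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            # This field selects the runtime, instead of relying on
            # default_runtime_name in the containerd config.
            "runtimeClassName": runtime_class_name,
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # Resource name exposed by the NVIDIA device plugin.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Pod name and image are hypothetical placeholders.
    manifests = {
        "apiVersion": "v1",
        "kind": "List",
        "items": [
            runtime_class(),
            gpu_pod("gpu-test", "example.invalid/cuda-test:latest"),
        ],
    }
    print(json.dumps(manifests, indent=2))
```

The output can be piped into `kubectl apply -f -`, since kubectl accepts JSON as well as YAML. The important part is that `runtimeClassName` on the pod matches the RuntimeClass name, and that the RuntimeClass handler matches the runtime name in containerd.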
Other than that, I haven't tried k3s; what I'm running is a full-blown Kubernetes cluster. I'd guess the two behave similarly.
There's no guarantee, but if you find any hints as to why your Nvidia plugin won't work, I might be able to help, since I skipped some minor issues I ran into when writing the article. If it turns out to be one of those, I can share how I solved it.