| | nos | serve |
|---|---|---|
| Mentions | 19 | 11 |
| Stars | 572 | 3,961 |
| Stars growth (MoM) | 1.4% | 0.8% |
| Activity | 5.1 | 9.5 |
| Latest commit | 10 days ago | 5 days ago |
| Language | Go | Java |
| License | Apache License 2.0 | Apache License 2.0 |
Stars - the number of stars a project has on GitHub. Growth - month-over-month growth in stars.
Activity - a relative number indicating how actively a project is being developed; recent commits have higher weight than older ones. For example, an activity of 9.0 indicates that a project is among the top 10% of the most actively developed projects we are tracking.
nos
-
Plug-and-play modules to optimize the performance of your AI systems
Some of the available modules include:
Speedster: Automatically apply the best set of SOTA optimization techniques to achieve the maximum inference speed-up on your hardware. https://github.com/nebuly-ai/nebullvm/blob/main/apps/acceler...
Nos: Automatically maximize the utilization of GPU resources in a Kubernetes cluster through real-time dynamic partitioning and elastic quotas. https://github.com/nebuly-ai/nos
ChatLLaMA: Build a faster and cheaper ChatGPT-like training process based on the LLaMA architecture. https://github.com/nebuly-ai/nebullvm/tree/main/apps/acceler...
OpenAlphaTensor: Increase the computational performance of an AI model with custom-generated matrix multiplication algorithms fine-tuned for your specific hardware. https://github.com/nebuly-ai/nebullvm/tree/main/apps/acceler...
Forward-Forward: A method for training deep neural networks that replaces backpropagation's forward and backward passes with two forward passes. https://github.com/nebuly-ai/nebullvm/tree/main/apps/acceler...
-
Nos – Open-Source to Maximize GPU Utilization in Kubernetes
Hi HN! I’m Michele Zanotti and today I’m releasing nos, an open-source module to efficiently run GPU workloads on Kubernetes!
Nos is meant to increase GPU utilization and cut infrastructure and operational costs by providing two main features:
1. Dynamic GPU Partitioning: you can think of this as a cluster autoscaler for GPUs. Instead of scaling up the number of nodes and GPUs, it dynamically partitions them into smaller "GPU slices". This ensures that each workload uses only the GPU resources it actually needs, freeing spare GPU capacity for other workloads. To partition GPUs, nos leverages Nvidia's MPS and MIG [1,2], finally making them dynamic (see the example Pod spec below the repo link).
2. Elastic Resource Quota management: it increases the number of Pods that can run on the cluster by letting teams (namespaces) borrow quotas of reserved resources from other teams, as long as those teams are not using them.
https://github.com/nebuly-ai/nos
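To make the first feature concrete, here is a minimal sketch of a Pod requesting a MIG slice instead of a whole GPU. The resource name follows the Nvidia device plugin's MIG naming (here a 1g.10gb slice of an A100 80GB) and depends on your GPU model; the image name is illustrative. Check the nos docs for the exact setup.

```yaml
# Sketch: a Pod that asks for a fraction of a GPU (one MIG slice).
# With dynamic partitioning, the requested profile is created on demand.
apiVersion: v1
kind: Pod
metadata:
  name: notebook
spec:
  containers:
    - name: notebook
      image: my-notebook-image   # illustrative
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1   # one 1g.10gb MIG slice, not a whole GPU
```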
Let me know your thoughts on the project in the comments. And don't forget to leave a star on GitHub if you like the project :)
Nos addresses some key challenges of Kubernetes that stem from the fact that Kubernetes was not designed for GPU and AI/machine learning workloads. In Kubernetes, GPUs are managed with the Nvidia k8s Device Plugin [3], which has a few major downsides. First, it requires allocating an integer number of GPUs per workload, so workloads cannot request just a fraction of a GPU. Second, when GPU sharing is enabled, either with time-slicing or MIG, the device plugin advertises to Kubernetes a fixed set of GPU resources that does not dynamically adapt to the requests of the Pods at any given time.
This often leads to underutilized GPUs and pending Pods, and to cluster admins spending a lot of time looking for workarounds to make the best use of their GPUs.
For example, consider a company with a k8s cluster of 20 GPUs, where 3 of these GPUs have been reserved for the data science team using Resource Quota objects. In most cases, the workloads of data scientists (notebooks, scripts, etc.) require far less memory/compute than an entire GPU, yet Kubernetes forces each container to consume a whole one. And when the team occasionally needs to run a heavy workload, it may want to use as many resources as possible; however, the Resource Quota on their namespace constrains the team to at most the 3 GPUs reserved for it, even if the rest of the company's cluster is full of unused GPUs!
Instead, with nos the data science team would use Dynamic GPU Partitioning to request GPU slices, so that many workloads can share the same GPU. Elastic Resource Quotas would also allow the team to consume more than the 3 reserved GPUs, borrowing quotas from other teams that are not using them (see the sketch below). To recap: the team can launch more Pods, and the company likely needs fewer nodes. All this with minimal effort from the cluster admin, who only has to set up nos.
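A sketch of an elastic quota for this scenario, guaranteeing the data science team 3 GPUs while letting it borrow up to 10 when other teams' quotas sit idle. The field names follow the upstream Kubernetes scheduler-plugins ElasticQuota CRD that this design builds on; check the nos docs for the exact apiVersion and schema.

```yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1   # nos ships its own CRD; schema per its docs
kind: ElasticQuota
metadata:
  name: data-science-quota
  namespace: data-science
spec:
  min:
    nvidia.com/gpu: 3    # guaranteed resources
  max:
    nvidia.com/gpu: 10   # ceiling when borrowing idle quota from other teams
```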
Let me know what you think of nos, feedback would be very helpful! :) And please leave a star on GitHub if you like this open-source project: https://github.com/nebuly-ai/nos
Here are some other links that may be useful
- Tutorial on how to use Dynamic GPU Partitioning with Nvidia MIG https://towardsdatascience.com/dynamic-mig-partitioning-in-k...
- Introducing Nos – Open-Source to Maximize GPU Utilization in Kubernetes (more in the comments)
- New Open-Source to Maximize GPU Utilization in Kubernetes
- Show HN: Nos – Open-Source to Maximize GPU Utilization in Kubernetes
- An open-source library to train deep learning models faster
serve
-
Show HN: Llama2 Embeddings FastAPI Server
What's wrong with just using TorchServe [1]? We've been using it to serve embedding models in production.
[1] https://pytorch.org/serve/
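For context, once a model is registered, querying TorchServe is a plain HTTP call to its inference API (default port 8080). A minimal sketch; the model name "my_embedder" and payload are illustrative:

```python
import requests

# Call the TorchServe inference API: POST /predictions/{model_name}
resp = requests.post(
    "http://localhost:8080/predictions/my_embedder",  # illustrative model name
    json={"text": "hello world"},
)
resp.raise_for_status()
print(resp.json())  # whatever the model's handler returns, e.g. an embedding
```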
-
How to leverage a local LLM for a client?
Looks like you are already up to speed loading LLaMA models, which is great. Assuming this is a Hugging Face PyTorch checkpoint, I think it should be possible to spin up a TorchServe instance, which has built-in support for API access and HF Transformers. Since scale and latency aren't a big concern for you, this should be a good enough start (see the handler sketch below).
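For a HF checkpoint, the main piece of work is a custom handler. A minimal sketch, assuming a causal LM whose files are packaged into the model archive; the class name and generation settings are illustrative:

```python
# handler.py - sketch of a custom TorchServe handler for a HF checkpoint
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from ts.torch_handler.base_handler import BaseHandler

class HFTextHandler(BaseHandler):
    def initialize(self, context):
        # TorchServe unpacks the .mar archive into model_dir
        model_dir = context.system_properties.get("model_dir")
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.model = AutoModelForCausalLM.from_pretrained(model_dir)
        self.model.eval()
        self.initialized = True

    def handle(self, data, context):
        # each item in `data` is one request; the payload sits under "data" or "body"
        raw = data[0].get("data") or data[0].get("body")
        text = raw.decode("utf-8") if isinstance(raw, (bytes, bytearray)) else raw
        inputs = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = self.model.generate(**inputs, max_new_tokens=64)
        # return one response per request
        return [self.tokenizer.decode(out[0], skip_special_tokens=True)]
```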
- Is there a course that teaches you how to make an API with a trained model?
-
PyTorch eating memory on every API call
You could split the service in two: Flask for the web part and a separate service to serve the model (a sketch of the split follows below). I haven't used it, but there is https://pytorch.org/serve/
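A minimal sketch of that split, assuming a TorchServe instance already serving a model named "my_model" on its default port: the Flask process stays small and stateless, while model memory lives in the one long-lived serving process instead of growing per request.

```python
# app.py - Flask front end that forwards inference to a separate model service
import requests
from flask import Flask, request, jsonify

app = Flask(__name__)
TORCHSERVE_URL = "http://localhost:8080/predictions/my_model"  # illustrative

@app.route("/predict", methods=["POST"])
def predict():
    # forward the raw request body to the model service untouched
    resp = requests.post(
        TORCHSERVE_URL,
        data=request.get_data(),
        headers={"Content-Type": request.content_type},
    )
    return jsonify(resp.json()), resp.status_code

if __name__ == "__main__":
    app.run(port=5000)
```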
-
Google Kubernetes Engine: Unable to access ports exposed on external IP
I'm attempting to set up inference for a TorchServe container, and it's really tough to figure out what's preventing me from connecting to the ports I'm trying to expose. I'm using Google Kubernetes Engine and Helm, tweaking one of the tutorials at [torchserve](github.com/pytorch/serve). Specifically, it's the GKE tutorial [here](https://github.com/pytorch/serve/tree/master/kubernetes).
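One common culprit on GKE: a Service of type LoadBalancer provisions the external IP and firewall rules for you, whereas NodePort requires adding a matching VPC firewall rule yourself. A minimal Service sketch exposing TorchServe's default ports (8080 inference, 8081 management); the name and selector are illustrative and must match the Helm chart's labels:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: torchserve
spec:
  type: LoadBalancer      # GKE provisions an external IP and firewall rules
  selector:
    app: torchserve       # must match the pod labels from the chart
  ports:
    - name: inference
      port: 8080
      targetPort: 8080
    - name: management
      port: 8081
      targetPort: 8081
```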
-
BetterTransformer: PyTorch-native free-lunch speedups for Transformer-based models
I made a Space to showcase the speedups we can get in an end-to-end case, using TorchServe to deploy the model on a cloud instance (AWS EC2 g4dn, with one T4 GPU): https://huggingface.co/spaces/fxmarty/bettertransformer-demo
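For reference, the conversion itself is a one-liner through Hugging Face Optimum; a minimal sketch (checkpoint name illustrative, requires the optimum package):

```python
from transformers import AutoModel
from optimum.bettertransformer import BetterTransformer

model = AutoModel.from_pretrained("bert-base-uncased")  # illustrative checkpoint
# swap supported layers for PyTorch-native fused (BetterTransformer) kernels
model = BetterTransformer.transform(model)
```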
-
[D] How to get the fastest PyTorch inference and what is the "best" model serving framework?
For 2), I am aware of a few options. Triton Inference Server is an obvious one, as is the 'transformer-deploy' version from LDS. My only reservation here is that they require model compilation or are architecture-specific. I am aware of others like Bento, Ray Serve, and TorchServe. Ideally I would have something that allows any PyTorch model to be used without the extra compilation effort (or at least makes it optional), and that has some conveniences: ease of use, easy to deploy, easy to host multiple models, and able to perform dynamic batching. Anyway, I am really interested to hear people's experience here, as I know there are now quite a few options! Any help is appreciated!

Disclaimer: I have no affiliation with, and am not connected in any way to, the libraries or companies listed here. These are just the ones I know of. Thanks in advance.
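On the dynamic batching point: TorchServe, for one, configures batching per model at registration time through its management API (default port 8081). A minimal sketch, with an illustrative archive name and parameter values:

```python
import requests

# Register a model with server-side dynamic batching enabled
resp = requests.post(
    "http://localhost:8081/models",
    params={
        "url": "my_model.mar",     # archive already in the model store
        "initial_workers": 1,
        "batch_size": 8,           # max requests grouped into one batch
        "max_batch_delay": 50,     # ms to wait while filling a batch
    },
)
print(resp.status_code, resp.text)
```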
-
How to integrate a deep learning model into a Django webapp?
If you built the model using PyTorch or TensorFlow, I'd suggest using TorchServe or TF Serving to serve the model as its own "microservice," then querying it from your Django app (see the sketch below). Among other things, it will make retraining and updating your model a lot easier.
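A sketch of the querying side from Django, assuming the model microservice is a TorchServe deployment reachable at a hypothetical "model-service" hostname serving a model named "my_model":

```python
# views.py - Django view that delegates inference to the model microservice
import requests
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

MODEL_URL = "http://model-service:8080/predictions/my_model"  # illustrative

@csrf_exempt
def predict(request):
    resp = requests.post(
        MODEL_URL,
        data=request.body,
        headers={"Content-Type": request.content_type or "application/json"},
    )
    return JsonResponse(resp.json(), status=resp.status_code, safe=False)
```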
- Choose JavaScript 🧠
-
Popular Machine Learning Deployment Tools