serving vs flake

| | serving | flake |
|---|---|---|
| Mentions | 12 | 5 |
| Stars | 6,071 | 593 |
| Growth | 0.2% | 10.1% |
| Activity | 9.8 | 4.4 |
| Latest commit | 3 days ago | 3 days ago |
| Language | C++ | Nix |
| License | Apache License 2.0 | GNU Affero General Public License v3.0 |
Stars - the number of stars a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
serving
-
Llama.cpp: Full CUDA GPU Acceleration
Yet another tedious battle: Python vs. the C++/C stack.
This project gained popularity due to the high demand for running large models with 1B+ parameters, like `llama`. Python dominates the interface and training ecosystem, but before llama.cpp, non-ML professionals showed little interest in a fast C++ inference library. Existing solutions such as tensorflow-serving [1], written in C++, were already fast and GPU-capable, but llama.cpp took the initiative to optimize for CPU and trim unnecessary code, essentially code-golfing and sacrificing some algorithmic correctness for performance, a trade-off that "ML research" generally frowns on.
NOTE: In my opinion, the true pioneer was DarkNet, which implemented the YOLO model series and significantly outperformed its peers [2], using basically the same trick as llama.cpp.
[1] https://github.com/tensorflow/serving
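The correctness-for-speed trade-off described above comes largely from low-bit weight quantization. A minimal pure-Python sketch of symmetric 8-bit quantization illustrates the idea; note that llama.cpp's real formats (Q4_0, Q8_0, ...) are block-wise and considerably more involved, so this is only an illustration, not its actual code:

```python
# Illustrative symmetric 8-bit quantization: the kind of speed/accuracy
# trade-off discussed above. Not llama.cpp's actual implementation.

def quantize_q8(weights):
    """Map floats to int8 values using a single per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_q8(q, scale):
    """Recover approximate floats from the int8 values."""
    return [x * scale for x in q]

weights = [0.1, -0.52, 0.3, 1.0]
q, scale = quantize_q8(weights)
restored = dequantize_q8(q, scale)
# `restored` is close to, but not exactly, `weights`: the rounding error
# is the "sacrificed correctness" traded for smaller, faster integer math.
```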
-
[D] How do OpenAI and other companies manage to have real-time inference on model with billions of parameters over an API?
I mean, probably - it's written in C++ https://github.com/tensorflow/serving
-
Should I wait for the M2 Macbook Pro?
We’re looking into that solution at the moment. The issue I’m referring to is related to this: https://github.com/tensorflow/serving/issues/1948. We’ll know soon whether the plug-in approach works for our use case, but we haven’t started implementing it yet.
- TF Serving has been unavailable for 9 days so far due to outdated GPG key
- TF Serving has been unavailable for 8 days
-
Would you use maturin for ML model serving?
Which ML framework do you use? TensorFlow has https://github.com/tensorflow/serving. You could also use the Rust bindings to load a saved model and expose it through one of the Rust HTTP servers. It doesn't matter that you trained your model in Python, as long as you export it as a SavedModel.
-
Is LaMDA Sentient? – An Interview [pdf]
Most likely it's a model server running something like https://github.com/tensorflow/serving and if there isn't a lot of load, the resource could kill some of its tasks. I wouldn't imagine it's sitting around pondering deep thoughts.
-
Ask HN: How to deploy a TensorFlow model for access through an HTTP endpoint?
https://github.com/tensorflow/serving
https://thenewstack.io/tutorial-deploying-tensorflow-models-...
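Fleshing out the answer above: TF Serving's standard deployment is the official Docker image, which exposes a REST API on port 8501. The snippet below constructs the URL and JSON payload that the predict endpoint expects; the model name, mount path, and input values are placeholders for illustration:

```python
import json

# Typical deployment (run outside Python; paths and model name are
# placeholders):
#   docker run -p 8501:8501 \
#     --mount type=bind,source=/tmp/my_model,target=/models/my_model \
#     -e MODEL_NAME=my_model tensorflow/serving

MODEL_NAME = "my_model"  # hypothetical
url = f"http://localhost:8501/v1/models/{MODEL_NAME}:predict"

# TF Serving's REST predict API expects {"instances": [...]} (row format).
payload = json.dumps({"instances": [[1.0, 2.0, 3.0]]})

# POST `payload` to `url` with any HTTP client, e.g.:
#   curl -d '{"instances": [[1.0, 2.0, 3.0]]}' <url>
```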
-
Popular Machine Learning Deployment Tools
GitHub
-
If data science uses a lot of computational power, then why is Python the most used programming language?
You serve models via https://www.tensorflow.org/tfx/guide/serving which is written entirely in C++ (https://github.com/tensorflow/serving/tree/master/tensorflow_serving/model_servers), with no Python on the serving path or in the shipped product.
flake
- Running AI Models on NixOS
- Nixified.Ai Release 2
-
Llama.cpp: Full CUDA GPU Acceleration
> Ideally, there's Nix (and poetry2nix) that could take care of everything, but only a few folks write Flakes for their projects.
Relevant to "AI, Python, setting up is hard ... nix", there's stuff like:
https://github.com/nixified-ai/flake
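For readers unfamiliar with flakes: a `flake.nix` pins its inputs to exact revisions and declares its outputs declaratively, which is what makes "setting up is hard" environments reproducible. A minimal sketch of a flake consuming nixified-ai as an input; the package attribute names below are assumptions, so check the repository for the real ones:

```nix
{
  description = "Sketch of a flake consuming nixified-ai (attribute names are assumptions)";

  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    nixified-ai.url = "github:nixified-ai/flake";
  };

  outputs = { self, nixpkgs, nixified-ai }: {
    # Re-export a package from the input; the exact attribute under
    # `packages.<system>` is hypothetical here.
    packages.x86_64-linux.default =
      nixified-ai.packages.x86_64-linux.default;
  };
}
```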
-
Can you substitute conda with Nix for Data Science and ML/AI?
However, I would reach out to the Nixified.ai folks about it, because I can see that the invoke.ai build script mentions pytorch and several other hard-to-install packages (albeit not detectron).
- A Nix flake for many AI projects
What are some alternatives?
server - The Triton Inference Server provides an optimized cloud and edge inferencing solution.
nonguix - Nonguix mirror – pull requests ignored, please use upstream for that
MNN - MNN is a blazing fast, lightweight deep learning framework, battle-tested by business-critical use cases in Alibaba
guix-nonfree - Unofficial collection of packages that are not going to be accepted into Guix
flashlight - A C++ standalone library for machine learning
lit-llama - Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. Apache 2.0-licensed.
XLA.jl - Julia on TPUs
llama_cpp.rb - llama_cpp provides Ruby bindings for llama.cpp
oneflow - OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.
glow - Compiler for Neural Network hardware accelerators
TokenHawk - WebGPU LLM inference tuned by hand [Moved to: https://github.com/kayvr/token-hawk]