Serving Alternatives
Similar projects and alternatives to serving
- server: The Triton Inference Server provides an optimized cloud and edge inferencing solution. (by triton-inference-server)
- MNN: A blazing-fast, lightweight deep learning framework, battle-tested by business-critical use cases at Alibaba.
- oneflow: A deep learning framework designed to be user-friendly, scalable, and efficient.
- exllama: A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
- mlc-llm: Enable everyone to develop, optimize, and deploy AI models natively on everyone's devices.
- lit-llama: An implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4-bit quantization, LoRA and LLaMA-Adapter fine-tuning, and pre-training. Apache 2.0-licensed.
-
maturin
Build and publish crates with pyo3, rust-cpython and cffi bindings as well as rust binaries as python packages
- pinferencia: Python + Inference. A model deployment library in Python; the simplest model inference server ever.
serving reviews and mentions
- Llama.cpp: Full CUDA GPU Acceleration
Yet another tedious battle: Python vs. the C/C++ stack.
This project gained popularity due to the high demand for running large models with 1B+ parameters, like `llama`. Python dominates the interface and training ecosystem, but prior to llama.cpp, non-ML professionals showed little interest in a fast C++ inference library. While existing solutions like tensorflow-serving [1], written in C++, were sufficiently fast and had GPU support, llama.cpp took the initiative to optimize for CPU and trim unnecessary code, essentially code-golfing and sacrificing some algorithmic correctness for improved performance, an approach that isn't favored by ML research.
NOTE: In my opinion, a true pioneer was DarkNet, which implemented the YOLO model series and significantly outperformed its peers [2], using basically the same trick as llama.cpp.
- Would you use maturin for ML model serving?
Which ML framework do you use? TensorFlow has https://github.com/tensorflow/serving. You could also use the Rust bindings to load a SavedModel and expose it through one of the Rust HTTP servers. It doesn't matter that you trained your model in Python, as long as you export it as a SavedModel, as in the sketch below.
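For illustration, a minimal sketch of that export step, assuming a TensorFlow/Keras model; the architecture and export path here are placeholder assumptions, not anything from the original answer:

```python
# Minimal sketch: export a trained Keras model in the SavedModel format,
# which both TensorFlow Serving and the Rust `tensorflow` crate can load.
# The architecture and export path below are placeholder assumptions.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])
# ... train the model here ...

# TF Serving expects a numeric version subdirectory (e.g. .../1) so it can
# hot-swap newer versions as they appear.
tf.saved_model.save(model, "/tmp/my_model/1")
```

On the Rust side, the `tensorflow` crate's `SavedModelBundle::load` can then load the same directory, with no Python needed at serving time.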
- Popular Machine Learning Deployment Tools (GitHub)
- If data science uses a lot of computational power, then why is Python the most used programming language?
You serve models via https://www.tensorflow.org/tfx/guide/serving, which is written entirely in C++ (https://github.com/tensorflow/serving/tree/master/tensorflow_serving/model_servers); there is no Python on the serving path or in the shipped product.
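For context, a minimal client-side sketch against TF Serving's documented REST API; the model name, host, and input shape are assumptions, and only the client is Python here; the server itself is the C++ model server:

```python
# Minimal sketch: call TensorFlow Serving's REST predict endpoint.
# Assumes a model named "my_model" served on localhost:8501 (TF Serving's
# default REST port); the input shape must match the model's signature.
import json
import urllib.request

payload = {"instances": [[1.0, 2.0, 3.0, 4.0]]}
req = urllib.request.Request(
    "http://localhost:8501/v1/models/my_model:predict",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["predictions"])
```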
- Exposing TensorFlow Serving's gRPC Endpoints on Amazon EKS
gRPC only connects to a host and port, but we can use whatever service route we want. Above, I use the path we configured in our k8s Ingress object, /service1, and overwrite the base configuration provided by TensorFlow Serving. When we call the tfserving_metadata function above, we specify /service1 as an argument.
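For context, this is roughly what a model-metadata call looks like with the standard tensorflow-serving-api package; the host, model name, and timeout are assumptions, and this is not the article's exact tfserving_metadata helper:

```python
# Minimal sketch of a GetModelMetadata call against TF Serving's gRPC API,
# using the tensorflow-serving-api package. Host, model name, and timeout
# are assumptions, not values from the article.
import grpc
from tensorflow_serving.apis import get_model_metadata_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

channel = grpc.insecure_channel("my-eks-ingress.example.com:80")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = get_model_metadata_pb2.GetModelMetadataRequest()
request.model_spec.name = "service1"            # model name configured in TF Serving
request.metadata_field.append("signature_def")  # the metadata field TF Serving exposes

response = stub.GetModelMetadata(request, timeout=5.0)
print(response)
```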
Stats
tensorflow/serving is an open-source project licensed under the Apache License 2.0, an OSI-approved license.
The primary programming language of serving is C++.