deepsparse vs server

deepsparse

Sparsity-aware deep learning inference runtime for CPUs (by neuralmagic)

ML, Machinelearning, Pytorch, Onnx, deepsparse-engine, Inference, Computer Vision, object-detection, pruning, quantization, pretrained-models, NLP, auto-ml, cpus, sparsification, cpu-inference-api

Source Code

neuralmagic.com

server

The Triton Inference Server provides an optimized cloud and edge inferencing solution. (by triton-inference-server)

Inference, GPU, Machine Learning, Deep Learning, Cloud, datacenter, Edge

Source Code

docs.nvidia.com

deepsparse	Project	server
21	Mentions	24
2,878	Stars	7,356
1.5%	Growth	2.7%
9.5	Activity	9.5
about 5 hours ago	Latest Commit	3 days ago
Python	Language	Python
GNU General Public License v3.0 or later	License	BSD 3-clause "New" or "Revised" License

The number of mentions indicates the total number of mentions that we've tracked plus the number of user suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
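
As a rough illustration of the growth figure described above, month-over-month growth can be computed from two star counts taken a month apart. This is a minimal sketch: the function name and the sample counts are hypothetical, and the site's exact formula is not stated here.

```python
def month_over_month_growth(stars_prev: int, stars_now: int) -> float:
    """Return the percentage change in stars between two monthly snapshots.

    Assumption: growth is the relative change against the earlier snapshot.
    """
    if stars_prev <= 0:
        raise ValueError("stars_prev must be positive")
    return (stars_now - stars_prev) / stars_prev * 100.0


if __name__ == "__main__":
    # Hypothetical snapshots, not taken from the table above.
    print(f"{month_over_month_growth(1000, 1015):.1f}%")
```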

deepsparse

Posts with mentions or reviews of deepsparse. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-11-23.

Fast Llama 2 on CPUs with Sparse Fine-Tuning and DeepSparse
1 project | news.ycombinator.com | 23 Nov 2023

Interesting company. Yannic Kilcher interviewed Nir Shavit last year and they went into some depth: https://www.youtube.com/watch?v=0PAiQ1jTN5k DeepSparse is on GitHub: https://github.com/neuralmagic/deepsparse
The future of quantization techniques in deep learning.
1 project | /r/deeplearning | 14 Jun 2023

sparsity https://github.com/neuralmagic/deepsparse
[D] How to get the fastest PyTorch inference and what is the "best" model serving framework?
8 projects | /r/MachineLearning | 28 Oct 2022

For 1), what is the easiest way to speed up inference (assume only PyTorch and primarily GPU but also some CPU)? I have been using ONNX and Torchscript but there is a bit of a learning curve and sometimes it can be tricky to get the model to actually work. Is there anything else worth trying? I am enthused by things like TorchDynamo (although I have not tested it extensively) due to its apparent ease of use. I also saw the post yesterday about Kernl using (OpenAI) Triton kernels to speed up transformer models which also looks interesting. Are things like SageMaker Neo or NeuralMagic worth trying? My only reservation with some of these is they still seem to be pretty model/architecture specific. I am a little reluctant to put much time into these unless I know others have had some success first.
[D] Most efficient open source language model ?
1 project | /r/MachineLearning | 23 Oct 2022

You should look into deepsparse. They are working on delivering GPU-level performance on consumer CPUs, with some great results: https://github.com/neuralmagic/deepsparse. There is a great interview with the founder, Nir Shavit, here: https://piped.kavin.rocks/watch?v=0PAiQ1jTN5k
[R] New sparsity research (oBERT) enabled 175X increase in CPU performance for MLPerf submission
2 projects | /r/MachineLearning | 10 Sep 2022

Utilizing the oBERT research we published at Neural Magic and some further iteration, we’ve enabled an increase in NLP performance of 175X while retaining 99% accuracy on the question-answering task in MLPerf. A combination of distillation, layer dropping, quantization, and unstructured pruning with oBERT enabled these large performance gains through the DeepSparse Engine. All of our contributions and research are open-sourced or free to use. Read through the oBERT paper on arxiv, try out the research in SparseML, and dive into the writeup to learn more about how we achieved these impressive results and utilize them for your own use cases!
An open-source library for optimizing deep learning inference. (1) You select the target optimization, (2) nebullvm searches for the best optimization techniques for your model-hardware configuration, and then (3) serves an optimized model that runs much faster in inference
10 projects | /r/learnmachinelearning | 26 Jul 2022

Open-source projects leveraged by nebullvm include OpenVINO, TensorRT, Intel Neural Compressor, SparseML and DeepSparse, Apache TVM, ONNX Runtime, TFlite and XLA. A huge thank you to the open-source community for developing and maintaining these amazing projects.
[R] BERT-Large: Prune Once for DistilBERT Inference Performance
2 projects | /r/MachineLearning | 16 Jul 2022

BERT-Large (345 million parameters) is now faster than the much smaller DistilBERT (66 million parameters) all while retaining the accuracy of the much larger BERT-Large model! We made this possible with Intel Labs by applying cutting-edge sparsification and quantization research from their Prune Once For All paper and utilizing it in the DeepSparse engine. It makes BERT-Large 12x smaller while delivering 8x latency speedup on commodity CPUs. We open-sourced the research in SparseML; run through the overview here and give it a try!
[R] How well do sparse ImageNet models transfer? Prune once and deploy anywhere for inference performance speedups! (arxiv link in comments)
2 projects | /r/MachineLearning | 26 Jun 2022

And benchmark/deploy with 8X better performance in DeepSparse!
Sparseserver.ui – test the performance of Sparse Transformers
1 project | news.ycombinator.com | 19 Apr 2022
[P] SparseServer.UI : A UI to test performance of Sparse Transformers
2 projects | /r/MachineLearning | 19 Apr 2022

Hi _Arsenie, this runs the deepsparse.server command for multiple models. By the way, we recently updated the READMEs for the DeepSparse Engine: https://github.com/neuralmagic/deepsparse

server

Posts with mentions or reviews of server. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2024-01-08.

FLaNK Weekly 08 Jan 2024
41 projects | dev.to | 8 Jan 2024
Is there any open source app to load a model and expose API like OpenAI?
5 projects | /r/LocalLLaMA | 9 Dec 2023
"A matching Triton is not available"
1 project | /r/StableDiffusion | 15 Oct 2023
best way to serve llama V2 (llama.cpp VS triton VS HF text generation inference)
3 projects | /r/LocalLLaMA | 25 Sep 2023

I am wondering what is the best / most cost-efficient way to serve llama V2:
- llama.cpp (is it production ready or just for playing around?)
- Triton inference server
- HF text generation inference
Triton Inference Server - Backend
2 projects | /r/learnmachinelearning | 13 Jun 2023
Single RTX 3080 or two RTX 3060s for deep learning inference?
1 project | /r/computervision | 12 Apr 2023

For inference of CNNs, memory should really not be an issue. If it is, it is a software engineering problem, not a hardware issue. FP16 or Int8 for weights is fine, and weight size won't increase due to the high resolution. During inference, memory used for hidden layer tensors can be reused as soon as the last consumer layer has been processed. You are likely using something designed for training to run inference, and that blows up the memory requirement. Or, if you are using TensorRT or something like that, you need to be careful that every task does not load its own copy of the library code into the GPU. Maybe look at https://github.com/triton-inference-server/server
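
The point made in the comment above, that weight memory depends on precision and not on input resolution while activation memory does grow with resolution, can be sketched with some simple arithmetic. The helper names and the layer shape below are hypothetical and chosen only for illustration; real frameworks add their own overhead on top of these raw sizes.

```python
# Bytes needed to store one value at each precision.
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2, "int8": 1}


def conv2d_param_count(in_channels, out_channels, kernel_size, bias=True):
    """Parameters of a square 2D convolution; independent of image size."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


def weight_memory_bytes(num_params, dtype):
    """Raw storage for the weights at the given precision."""
    return num_params * BYTES_PER_VALUE[dtype]


def activation_memory_bytes(channels, height, width, dtype):
    """Raw storage for one output feature map; grows with resolution."""
    return channels * height * width * BYTES_PER_VALUE[dtype]


if __name__ == "__main__":
    # Hypothetical layer: 64 -> 128 channels with a 3x3 kernel.
    params = conv2d_param_count(64, 128, 3)
    for dtype in ("fp32", "fp16", "int8"):
        print(dtype, "weights:", weight_memory_bytes(params, dtype), "bytes")
    # The same layer at two input resolutions: only activations change.
    for side in (224, 1024):
        print(side, "activations:",
              activation_memory_bytes(128, side, side, "fp16"), "bytes")
```

Because a hidden tensor can be freed once its last consumer has run, peak activation memory is typically bounded by the few largest tensors that are live at the same time, not by the sum over all layers.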
Machine Learning Inference Server in Rust?
4 projects | /r/rust | 21 Mar 2023

I am looking for something like [Triton Inference Server](https://github.com/triton-inference-server/server) or [TFX Serving](https://www.tensorflow.org/tfx/guide/serving), but in Rust. I came across [Orkhon](https://github.com/vertexclique/orkhon), which seems to be dormant, and a bunch of examples from [Awesome-Rust-MachineLearning](https://github.com/vaaaaanquish/Awesome-Rust-MachineLearning).
Multi-model serving options
3 projects | /r/mlops | 12 Feb 2023

You've already mentioned Seldon Core which is well worth looking at but if you're just after the raw multi-model serving aspect rather than a fully-fledged deployment framework you should maybe take a look at the individual inference servers: Triton Inference Server and MLServer both support multi-model serving for a wide variety of frameworks (and custom python models). MLServer might be a better option as it has an MLFlow runtime but only you will be able to decide that. There also might be other inference servers that do MMS that I'm not aware of.
I mean,.. we COULD just make our own lol
4 projects | /r/replika | 12 Feb 2023

[1] https://docs.nvidia.com/launchpad/ai/chatbot/latest/chatbot-triton-overview.html
[2] https://github.com/triton-inference-server/server
[3] https://neptune.ai/blog/deploying-ml-models-on-gpu-with-kyle-morris
[4] https://thechief.io/c/editorial/comparison-cloud-gpu-providers/
[5] https://geekflare.com/best-cloud-gpu-platforms/
Why TensorFlow for Python is dying a slow death
4 projects | news.ycombinator.com | 15 Jan 2023

"TensorFlow has the better deployment infrastructure"
Tensorflow Serving is nice in that it's so tightly integrated with Tensorflow. As usual, that goes both ways. It's so tightly coupled to Tensorflow that if the mlops side of the solution is using Tensorflow Serving, you're going to get "trapped" in the Tensorflow ecosystem (essentially).
For pytorch models (and just about anything else) I've been really enjoying Nvidia Triton Server[0]. Of course it further entrenches Nvidia and CUDA in the space (although you can execute models CPU only) but for a deployment today and the foreseeable future you're almost certainly going to be using a CUDA stack anyway.
Triton Server is very impressive and I'm always surprised to see how relatively niche it is.
[0] - https://github.com/triton-inference-server/server

What are some alternatives?

When comparing deepsparse and server you can also consider the following projects:

NudeNet - Neural Nets for Nudity Detection and Censoring

DeepSpeed - DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

yolov5 - YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite

onnx-tensorrt - ONNX-TensorRT: TensorRT backend for ONNX

openvino - OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference

ROCm - AMD ROCm™ Software - GitHub Home [Moved to: https://github.com/ROCm/ROCm]

model-optimization - A toolkit to optimize ML models for deployment for Keras and TensorFlow, including quantization and pruning.

pinferencia - Python + Inference - Model Deployment library in Python. Simplest model inference server ever.

sparseml - Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models

Triton - Triton is a dynamic binary analysis library. Build your own program analysis tools, automate your reverse engineering, perform software verification or just emulate code.

tvm - Open deep learning compiler stack for cpu, gpu and specialized accelerators

Megatron-LM - Ongoing research training transformer models at scale
