deepsparse vs tvm

| | deepsparse | tvm |
| --- | --- | --- |
| Mentions | 21 | 15 |
| Stars | 2,878 | 11,186 |
| Growth | 1.5% | 1.3% |
| Activity | 9.5 | 9.9 |
| Last commit | about 6 hours ago | 4 days ago |
| Language | Python | Python |
| License | GNU General Public License v3.0 or later | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
deepsparse
-
Fast Llama 2 on CPUs with Sparse Fine-Tuning and DeepSparse
Interesting company. Yannic Kilcher interviewed Nir Shavit last year and they went into some depth: https://www.youtube.com/watch?v=0PAiQ1jTN5k DeepSparse is on GitHub: https://github.com/neuralmagic/deepsparse
-
The future of quantization techniques in deep learning.
sparsity https://github.com/neuralmagic/deepsparse
-
[D] How to get the fastest PyTorch inference and what is the "best" model serving framework?
For 1), what is the easiest way to speed up inference (assume only PyTorch, primarily GPU but also some CPU)? I have been using ONNX and TorchScript, but there is a bit of a learning curve and sometimes it can be tricky to get the model to actually work. Is there anything else worth trying?

I am enthused by things like TorchDynamo (although I have not tested it extensively) due to its apparent ease of use. I also saw the post yesterday about Kernl using (OpenAI) Triton kernels to speed up transformer models, which also looks interesting.

Are things like SageMaker Neo or Neural Magic worth trying? My only reservation with some of these is they still seem to be pretty model/architecture specific. I am a little reluctant to put much time into them unless I know others have had some success first.
-
[D] Most efficient open source language model ?
You should look into deepsparse, they are working on delivering GPU level performance on consumer CPUs with some great results: https://github.com/neuralmagic/deepsparse. There is a great interview with the founder, Nir Shavit here: https://piped.kavin.rocks/watch?v=0PAiQ1jTN5k
-
[R] New sparsity research (oBERT) enabled 175X increase in CPU performance for MLPerf submission
Utilizing the oBERT research we published at Neural Magic and some further iteration, we’ve enabled an increase in NLP performance of 175X while retaining 99% accuracy on the question-answering task in MLPerf. A combination of distillation, layer dropping, quantization, and unstructured pruning with oBERT enabled these large performance gains through the DeepSparse Engine. All of our contributions and research are open-sourced or free to use. Read through the oBERT paper on arxiv, try out the research in SparseML, and dive into the writeup to learn more about how we achieved these impressive results and utilize them for your own use cases!
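The unstructured pruning in that recipe can be sketched with a toy magnitude-pruning routine (illustrative only; oBERT itself uses a second-order, curvature-aware criterion rather than plain magnitudes, and the function name here is made up):

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    if not weights or sparsity <= 0:
        return list(weights)
    k = int(len(weights) * sparsity)  # number of weights to zero out
    threshold = sorted(abs(w) for w in weights)[k - 1] if k else None
    return [0.0 if k and abs(w) <= threshold else w for w in weights]

pruned = magnitude_prune([0.1, -0.5, 0.05, 2.0, -0.01, 0.3], 0.5)
print(pruned)  # [0.0, -0.5, 0.0, 2.0, 0.0, 0.3]
```

The resulting zeros are what a sparsity-aware runtime like DeepSparse can skip at inference time, which is where the speedups come from.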
-
An open-source library for optimizing deep learning inference. (1) You select the target optimization, (2) nebullvm searches for the best optimization techniques for your model-hardware configuration, and then (3) serves an optimized model that runs much faster in inference
Open-source projects leveraged by nebullvm include OpenVINO, TensorRT, Intel Neural Compressor, SparseML and DeepSparse, Apache TVM, ONNX Runtime, TFlite and XLA. A huge thank you to the open-source community for developing and maintaining these amazing projects.
-
[R] BERT-Large: Prune Once for DistilBERT Inference Performance
BERT-Large (345 million parameters) is now faster than the much smaller DistilBERT (66 million parameters) all while retaining the accuracy of the much larger BERT-Large model! We made this possible with Intel Labs by applying cutting-edge sparsification and quantization research from their Prune Once For All paper and utilizing it in the DeepSparse engine. It makes BERT-Large 12x smaller while delivering 8x latency speedup on commodity CPUs. We open-sourced the research in SparseML; run through the overview here and give it a try!
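The quantization half of that combination can be sketched as a toy symmetric int8 quantizer (an illustration of the general idea, not the actual SparseML implementation): each float32 weight is replaced by one signed byte plus a shared scale, which is where roughly 4x of the size reduction comes from.

```python
def quantize_int8(xs):
    """Symmetric int8 quantization: one shared float scale, one byte per value."""
    scale = max((abs(x) for x in xs), default=0.0) / 127.0 or 1.0
    return [round(x / scale) for x in xs], scale

def dequantize(q, scale):
    """Recover approximate float values from the int8 codes."""
    return [v * scale for v in q]

q, scale = quantize_int8([0.5, -1.0, 0.25])
print(q)  # each value now fits in a signed byte
print(dequantize(q, scale))
```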
-
[R] How well do sparse ImageNet models transfer? Prune once and deploy anywhere for inference performance speedups! (arxiv link in comments)
And benchmark/deploy with 8X better performance in DeepSparse!
- Sparseserver.ui – test the performance of Sparse Transformers
-
[P] SparseServer.UI : A UI to test performance of Sparse Transformers
Hi _Arsenie, this runs the deepsparse.server command for multiple models. And by the way, we recently updated the READMEs for the DeepSparse Engine: https://github.com/neuralmagic/deepsparse
tvm
-
Making AMD GPUs competitive for LLM inference
Yes, this is coming! Myself and others at OctoML and in the TVM community are actively working on multi-gpu support in the compiler and runtime. Here are some of the merged and active PRs on the multi-GPU (multi-device) roadmap:
Support in TVM’s graph IR (Relax) - https://github.com/apache/tvm/pull/15447
-
VSL; Vlang's Scientific Library
Would it make sense to add backend support for OpenXLA, Apache TVM, Jittor, or something similar, to get GPU, TPU, and other accelerator support for free?
- Apache TVM
-
MLC LLM - "MLC LLM is a universal solution that allows any language model to be deployed natively on a diverse set of hardware backends and native applications, plus a productive framework for everyone to further optimize model performance for their own use cases."
I have tried the iPhone app. It's fast. They're using Apache TVM, which should allow better use of native accelerators on different devices: Metal on Apple, and Vulkan or CUDA or whatever elsewhere, instead of just running the thing on the CPU like llama.cpp.
-
ONNX Runtime merges WebGPU back end
I was going to answer the same. I find the approach of machine learning compilers that directly compile models to host and device code better than having to bring a huge runtime. There are exciting projects in this area like TVM Unity [1], IREE [2], or torch.export [3]
[1] https://github.com/apache/tvm/tree/unity
[2] https://github.com/openxla/iree
[3] https://pytorch.org/get-started/pytorch-2.0/#inference-and-e...
-
Esp32 tensorflow lite
Apache TVM home page: https://tvm.apache.org/
-
Decompiling x86 Deep Neural Network Executables
It's pretty clear it's referring to the output of Apache TVM and Meta's Glow.
-
Run Stable Diffusion on Your M1 Mac’s GPU
As mentioned in sibling comments, Torch is indeed the glue in this implementation. Other glues are TVM[0] and ONNX[1]
These just cover the neural net though, and there is lots of surrounding code and pre-/post-processing that isn't covered by these systems.
For models on Replicate, we use Docker, packaged with Cog for this stuff.[2] Unfortunately Docker doesn't run natively on Mac, so if we want to use the Mac's GPU, we can't use Docker.
I wish there was a good container system for Mac. Even better if it were something that spanned both Mac and Linux. (Not as far-fetched as it seems... I used to work at Docker and spent a bit of time looking into this...)
[0] https://tvm.apache.org/
-
How to get started with machine learning.
Or use TVM: the idea is to compile your model into code that you can load at runtime. Like onnxruntime, it only does DNN inference, so you still need your own domain-specific code around it.
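The compile-ahead-of-time, load-at-runtime workflow described above can be illustrated in miniature with plain Python (a toy sketch of the concept, not TVM's actual API; the "compiler", layer names, and module name are all made up): a tiny model description is lowered to generated source code, written out as a module, and loaded at inference time.

```python
import importlib.util
import os
import tempfile

def compile_model(layers, path):
    """Toy 'compiler': lower a layer list to Python source written to `path`."""
    lines = ["def run(x):"]
    for kind, arg in layers:
        if kind == "relu":
            lines.append("    x = [v if v > 0 else 0.0 for v in x]")
        elif kind == "scale":
            lines.append(f"    x = [v * {arg} for v in x]")
    lines.append("    return x")
    with open(path, "w") as f:
        f.write("\n".join(lines))

with tempfile.TemporaryDirectory() as d:
    mod_path = os.path.join(d, "compiled_model.py")
    compile_model([("relu", None), ("scale", 2.0)], mod_path)
    # Runtime side: load the generated artifact and run inference.
    spec = importlib.util.spec_from_file_location("compiled_model", mod_path)
    model = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(model)
    result = model.run([-1.0, 0.5, 2.0])
    print(result)  # [0.0, 1.0, 4.0]
```

A real compiler like TVM does the analogous thing with native host and device code instead of Python source, which is why the deployed artifact needs no framework at runtime.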
-
An open-source library for optimizing deep learning inference. (1) You select the target optimization, (2) nebullvm searches for the best optimization techniques for your model-hardware configuration, and then (3) serves an optimized model that runs much faster in inference
Open-source projects leveraged by nebullvm include OpenVINO, TensorRT, Intel Neural Compressor, SparseML and DeepSparse, Apache TVM, ONNX Runtime, TFlite and XLA. A huge thank you to the open-source community for developing and maintaining these amazing projects.
What are some alternatives?
NudeNet - Neural Nets for Nudity Detection and Censoring
TensorRT - NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
yolov5 - YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
mlc-llm - Enable everyone to develop, optimize and deploy AI models natively on everyone's devices.
openvino - OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
onnxruntime - ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
model-optimization - A toolkit to optimize ML models for deployment for Keras and TensorFlow, including quantization and pruning.
stable-diffusion - This version of CompVis/stable-diffusion features an interactive command-line script that combines text2img and img2img functionality in a "dream bot" style interface, a WebGUI, and multiple features and other enhancements. [Moved to: https://github.com/invoke-ai/InvokeAI]
sparseml - Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models
nebuly - The user analytics platform for LLMs
PINTO_model_zoo - A repository for storing models that have been inter-converted between various frameworks. Supported frameworks are TensorFlow, PyTorch, ONNX, OpenVINO, TFJS, TFTRT, TensorFlowLite (Float32/16/INT8), EdgeTPU, CoreML.