DeepSpeed vs TensorRT
| | DeepSpeed | TensorRT |
|---|---|---|
| Mentions | 41 | 17 |
| Stars | 25,088 | 7,232 |
| Growth | 61.0% | 5.3% |
| Activity | 9.6 | 9.3 |
| Last commit | 2 days ago | 3 days ago |
| Language | Python | C++ |
| License | Apache License 2.0 | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
DeepSpeed
- Using --deepspeed requires lots of manual tweaking
  Filed a discussion item on the DeepSpeed project: https://github.com/microsoft/DeepSpeed/discussions/3531
  Solution: I don't know; this is where I am stuck. https://github.com/microsoft/DeepSpeed/issues/1037 suggests that I just need to 'apt install libaio-dev', but I've done that and it doesn't help. (A minimal offload config that exercises this code path is sketched after these mentions.)
- Whether ML computation engineering expertise will remain valuable is the question.
  There could be a spectrum of this expertise, for instance https://github.com/NVIDIA/FasterTransformer and https://github.com/microsoft/DeepSpeed.
- FLiPN-FLaNK Stack Weekly for 17 April 2023
- DeepSpeed Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-Like Models
- DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-Like Models
- 12-Apr-2023 AI Summary
  DeepSpeed Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales (https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat)
- Microsoft DeepSpeed
- Apple: Transformer architecture optimized for Apple Silicon
  I'm following this closely, together with other efforts like GPTQ quantization and Microsoft's DeepSpeed, all of which are bringing down the hardware requirements of these advanced AI models.
- Facebook LLaMA is being openly distributed via torrents
  https://github.com/microsoft/DeepSpeed
  Anything that could bring this to a 10GB 3080 or a 24GB 3090 without 60s/it per token? (A hedged DeepSpeed inference sketch follows these mentions.)
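
The libaio question above only bites when DeepSpeed's async I/O path is actually used, i.e. when parameters or optimizer state are offloaded to NVMe. Below is a minimal sketch of such a ZeRO-3 config; the toy model, learning rate and nvme_path are placeholders, and `ds_report` can be run separately to check whether the async_io op is usable on a given machine. This is a sketch under those assumptions, not a fix for the linked issue.

```python
# Minimal ZeRO-3 NVMe-offload sketch (placeholder model, lr and nvme_path).
# This is the DeepSpeed config path that relies on the libaio-backed async_io op.
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # placeholder model

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
    "aio": {"block_size": 1048576, "queue_depth": 8, "overlap_events": True},
}

# deepspeed.initialize builds the engine (and the offload machinery) from the dict above.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```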
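
On the question of squeezing inference onto a 10GB or 24GB consumer card, DeepSpeed's inference engine is the usual entry point. The sketch below only shows fp16 weights with kernel injection on a placeholder checkpoint; actually fitting a LLaMA-class model into 10GB would additionally require quantization and/or offload, which this does not show.

```python
# Hedged sketch: DeepSpeed inference with fp16 weights and kernel injection.
# The checkpoint name is a placeholder, not taken from the post above.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "facebook/opt-1.3b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

# Wrap the model with DeepSpeed's inference engine (single GPU).
engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("DeepSpeed makes inference", return_tensors="pt").to("cuda")
outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```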
TensorRT
- A1111 just added support for TensorRT for webui as an extension!
- WIP - TensorRT accelerated stable diffusion img2img from mobile camera over WebRTC + Whisper speech-to-text. Interdimensional cable is here! Code: https://github.com/venetanji/videosd
  It uses the Nvidia demo code from https://github.com/NVIDIA/TensorRT/tree/main/demo/Diffusion
- [P] Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl
  The traditional way to deploy a model is to export it to ONNX and then to the TensorRT plan format. Each step requires its own tooling and its own mental model, and may raise issues. The most annoying part is that you need Microsoft or Nvidia support to get the best performance, and sometimes model support takes time. For instance, T5, a model released in 2019, is still not correctly supported on TensorRT; in particular, the K/V cache is missing (it will come soon according to the TensorRT maintainers, but I wrote the very same thing almost a year ago and then 4 months ago, so… I don't know). (A sketch of the ONNX-to-plan step appears after these mentions.)
- Speeding up T5
  I've tried to speed it up with TensorRT, following this example: https://github.com/NVIDIA/TensorRT/blob/main/demo/HuggingFace/notebooks/t5.ipynb - it does give a considerable speedup for batch-size=1, but it does not work with bigger batch sizes, which makes it useless, since I can simply increase the batch size of the HuggingFace model instead. (The sketch after these mentions includes a dynamic-shape profile for exactly this case.)
- An open-source library for optimizing deep learning inference. (1) You select the target optimization, (2) nebullvm searches for the best optimization techniques for your model-hardware configuration, and then (3) serves an optimized model that runs much faster at inference time.
  Open-source projects leveraged by nebullvm include OpenVINO, TensorRT, Intel Neural Compressor, SparseML and DeepSparse, Apache TVM, ONNX Runtime, TFLite and XLA. A huge thank you to the open-source community for developing and maintaining these amazing projects.
- I was looking for some good open-source quantization libraries that could actually be applied in production (edge or cloud, CPU/GPU). Do you know if I am missing any good libraries?
  Nvidia Quantization | Quantization with TensorRT
- Can you run a quantized model on GPU?
  You might want to try Nvidia's quantization toolkit for PyTorch: https://github.com/NVIDIA/TensorRT/tree/main/tools/pytorch-quantization (a minimal calibration sketch appears after these mentions).
- [P] What we learned by making T5-large 2X faster than Pytorch (and any autoregressive transformer)
  Nvidia's TensorRT demo heavily optimizes the computation graph (through aggressive kernel fusions), making T5 inference very fast (they report a 10X speedup on small-T5). The trick is that it doesn't use any K/V cache, so it's very fast on short sequences and small models, as it avoids many memory-bound operations by redoing the full computation again and again... but as several users have already found (1, 2, 3, 4, ...), this approach doesn't scale when the computation intensity increases, i.e., when base or large models are used instead of a small one, when generation is done on a moderately long sequence of a few hundred tokens, or if beam search is used instead of greedy search. The graph above shows the same behavior with ONNX Runtime.
- I made a TensorRT example. I hope this will help beginners. And I also have a question about TensorRT best practice.
- [P] [D] I made a TensorRT example. I hope this will help beginners. And I also have a question about TensorRT best practice.
  I used trtexec from the Nvidia NGC image. It makes converting a model to TensorRT very easy and simple.
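
The Kernl and "Speeding up T5" mentions above both revolve around the ONNX-to-TensorRT-plan step. Below is a rough sketch of that step using the TensorRT Python API (roughly the 8.x interface); the file names, the input tensor name ("input_ids") and the shape ranges are assumptions for illustration, not details from those posts. The optimization profile is what lets a single engine serve batch sizes larger than 1, provided the ONNX export declared those axes as dynamic.

```python
# Hedged sketch: build a TensorRT engine ("plan") from an ONNX file with FP16
# enabled and a dynamic-shape profile. File names and shapes are placeholders.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # exported beforehand, e.g. via torch.onnx.export
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # use half-precision kernels where supported

# Dynamic-shape profile: (batch, sequence) ranges the engine must accept.
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", (1, 1), (4, 128), (16, 512))
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)
if engine_bytes is None:
    raise RuntimeError("engine build failed")
with open("model.plan", "wb") as f:
    f.write(engine_bytes)
```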
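
For the pytorch-quantization toolkit linked in the "quantized model on GPU" mention, the usual flow is: patch torch.nn layers with quantized equivalents, run a few calibration batches, then load the collected ranges. The sketch below follows that flow as I understand it from NVIDIA's examples; the ResNet model and the random calibration data are placeholders.

```python
# Hedged sketch of post-training calibration with NVIDIA's pytorch-quantization
# toolkit. Model and calibration data are placeholders.
import torch
import torchvision
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

quant_modules.initialize()  # monkey-patch nn.Conv2d, nn.Linear, ... with quantized versions
model = torchvision.models.resnet18(weights=None).cuda().eval()

calib_loader = [torch.randn(8, 3, 224, 224) for _ in range(4)]  # placeholder data

# Switch quantizers to calibration mode and feed a few batches.
for module in model.modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.disable_quant()
        module.enable_calib()

with torch.no_grad():
    for batch in calib_loader:
        model(batch.cuda())

# Load the collected amax ranges and switch back to quantized execution.
for module in model.modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.load_calib_amax()
        module.enable_quant()
        module.disable_calib()

# From here the model can be fine-tuned (QAT) or exported to ONNX and built
# into a TensorRT engine as in the sketch above.
```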
What are some alternatives?
ColossalAI - Making large AI models cheaper, faster and more accessible
fairscale - PyTorch extensions for high performance and large scale training.
onnx-tensorrt - ONNX-TensorRT: TensorRT backend for ONNX
Megatron-LM - Ongoing research training transformer models at scale
fairseq - Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
FasterTransformer - Transformer related optimization, including BERT, GPT
mesh-transformer-jax - Model parallel transformers in JAX and Haiku
llama - Inference code for LLaMA models
gpt-neox - An implementation of model parallel autoregressive transformers on GPUs, based on the DeepSpeed library.
server - The Triton Inference Server provides an optimized cloud and edge inferencing solution.
openvino - OpenVINO™ Toolkit repository
tensorrtx - Implementation of popular deep learning networks with TensorRT network definition API