stable-fast
gpt-fast
| | stable-fast | gpt-fast |
|---|---|---|
| Mentions | 11 | 8 |
| Stars | 973 | 5,179 |
| Growth | - | 4.0% |
| Activity | 9.4 | 8.3 |
| Last commit | 11 days ago | 4 days ago |
| Language | Python | Python |
| License | MIT License | BSD 3-clause "New" or "Revised" License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
stable-fast
- Has anyone managed to get TensorRT working in ComfyUI on Windows?
Download the stable-fast binary wheel built for your system from https://github.com/chengzeyi/stable-fast/releases and install it, for example: pip install stable_fast-0.0.13.post3+torch210cu118-cp310-cp310-win_amd64.whl
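Picking the right wheel comes down to reading the tags in its filename. The sketch below splits the filename quoted above using the standard wheel naming layout (name, version, Python tag, ABI tag, platform tag). The helper name `parse_wheel_name` is hypothetical, and reading the `torch210cu118` suffix as the PyTorch and CUDA builds is an assumption based on the name alone, not something the project's documentation is quoted as confirming here.

```python
import re


def parse_wheel_name(filename: str) -> dict:
    """Split a wheel filename into its standard dash-separated fields."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(".whl")].split("-")
    # Layout: name-version[-build]-python_tag-abi_tag-platform_tag
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel name layout: {filename}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    # Anything after "+" in the version is the local version label.
    public_version, _, local_version = parts[1].partition("+")
    info = {
        "name": parts[0],
        "version": public_version,
        "local_version": local_version,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }
    # Assumption: the local label encodes the PyTorch and CUDA builds.
    match = re.fullmatch(r"torch(\d+)cu(\d+)", local_version)
    if match:
        info["torch_build"], info["cuda_build"] = match.groups()
    return info


wheel = "stable_fast-0.0.13.post3+torch210cu118-cp310-cp310-win_amd64.whl"
for key, value in parse_wheel_name(wheel).items():
    print(f"{key}: {value}")
```

Here `cp310` should match the CPython version in use (3.10) and `win_amd64` the platform, so a mismatch in either is a likely reason for pip to reject the file.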
- Optimum-NVIDIA - 28x faster inference in just 1 line of code !?
- stable-fast for SD inference: Faster than AITemplate, On par with TensorRT
- [N] stable-fast for SD inference: Faster than AITemplate, On par with TensorRT
- Stable-fast for SD inference: Faster than AITemplate, On par with TensorRT
- SDXL Turbo: A Real-Time Text-to-Image Generation Model
SDXL and ControlNet are already optimized, if that's what you mean: https://github.com/chengzeyi/stable-fast
(Note the links to various SD compilers).
But the whole field is moving so fast that people aren't even adopting the compilers at large.
- Getting sub 100ms refresh rate on LCMs
> already compiling
Hmm, well if you mean torch.compile, y'all should still check out stable-fast, which is claiming ~16ms/iter on a 4090:
https://github.com/chengzeyi/stable-fast#rtx-4090-512x512-ba...
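To relate the ~16 ms/iter figure quoted above to the sub-100 ms target in the thread title, here is a back-of-the-envelope sketch. The 4-step count is an assumption (LCMs are usually run with only a handful of steps), and the estimate ignores text-encoder, VAE, and host overhead, so it is a rough lower bound and not a benchmark.

```python
def denoise_latency_ms(ms_per_iter: float, steps: int) -> float:
    """Total time spent in the denoising loop, in milliseconds."""
    return ms_per_iter * steps


def iters_per_second(ms_per_iter: float) -> float:
    """Convert a per-iteration time into iterations per second."""
    return 1000.0 / ms_per_iter


MS_PER_ITER = 16.0  # figure claimed in the comment above
LCM_STEPS = 4       # assumption: a typical low step count for an LCM

latency = denoise_latency_ms(MS_PER_ITER, LCM_STEPS)
print(f"{iters_per_second(MS_PER_ITER):.1f} it/s, "
      f"{latency:.0f} ms per {LCM_STEPS}-step image")
# prints "62.5 it/s, 64 ms per 4-step image"
```

Under these assumptions the denoising loop alone fits inside a 100 ms budget, leaving roughly 36 ms for everything else in the pipeline.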
- Generate images fast with SD 1.5 while typing on Gradio
Now combine this with an optimized SD implementation, like:
https://github.com/chengzeyi/stable-fast
Or AITemplate, and you are at 15 FPS on a larger consumer GPU, or about 10 FPS with a ControlNet, which you can use for some motion consistency.
- S-LoRA: Serving Concurrent LoRA Adapters
Since I am sending you down the rabbit hole anyway, you should check out sfast:
https://github.com/chengzeyi/stable-fast
It's the most promising "fast" and flexible Stable Diffusion implementation, akin to this paper or vLLM, that I know of. It doesn't have as many caveats as other implementations, like AITemplate (which is basically Turing+ and Linux only) or torch.compile (basically no support for changing inputs/LoRAs).
- 🚀Announcing stable-fast v0.0.5: Speed Optimization for SDXL, Dynamic CUDA Graph
About two weeks ago, I released the stable-fast project, a lightweight inference performance optimization framework for HuggingFace Diffusers. It provides the best performance while keeping the compilation dynamic and flexible, and it supports ControlNet and LoRA seamlessly.
gpt-fast
- [D] GPT-Fast performance on larger batch sizes
I'm toying around with gpt-fast (https://github.com/pytorch-labs/gpt-fast) and was wondering if anyone has run experiments with batch size > 1?
- Optimum-NVIDIA - 28x faster inference in just 1 line of code !?
- GPT-Fast: Simple and efficient GPT inference in <1000 LOC of Python
- GPT-Fast: A fast and hackable implementation of transformer inference in <1000 lines of native PyTorch with support for quantization, speculative decoding, TP, Nvidia/AMD support, and more!
And check out the code here: https://github.com/pytorch-labs/gpt-fast
- 80% faster, 50% less memory, 0% loss of accuracy Llama finetuning
How does this compare to the PyTorch Labs optimizations for SAM and Llama 2?
https://github.com/pytorch-labs/segment-anything-fast
https://github.com/pytorch-labs/gpt-fast
- Fast and hackable PyTorch native transformer inference
- Accelerating Generative AI with PyTorch II: GPT, Fast
I'm wondering if gpt-fast has a version that can be run from Windows Command Prompt or PowerShell?
https://github.com/pytorch-labs/gpt-fast/issues/45
What are some alternatives?
Fooocus - Focus on prompting and generating
unsloth - Finetune Llama 3, Mistral & Gemma LLMs 2-5x faster with 80% less memory
TensorRT-LLM - TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
hyperlearn - 2-2000x faster ML algos, 50% less memory usage, works on all hardware - new and old.
optimum-nvidia
segment-anything-fast - A batched offline inference oriented version of segment-anything