accelerate
llamazoo
accelerate | llamazoo | |
---|---|---|
18 | 2 | |
6,996 | 57 | |
2.9% | - | |
9.7 | 10.0 | |
1 day ago | 5 months ago | |
Python | C | |
Apache License 2.0 | GNU General Public License v3.0 or later |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
accelerate
-
Can we discuss MLOps, Deployment, Optimizations, and Speed?
accelerate is a best-in-class lib for deploying models, especially across multi-gpu and multi-node.
-
Code Llama - The Hugging Face Edition
In the coming days, we'll work on sharing scripts to train models, optimizations for on-device inference, even nicer demos (and for more powerful models), and more. Feel free to like our GitHub repos (transformers, peft, accelerate). Enjoy!
-
What are the current fastest multi-gpu inference frameworks?
So I rent a cloud server today to try out some of the recent LLMs like falcon and vicuna. I started with huggingface's generate API using accelerate. It got about 2 instances/s with 8 A100 40GB GPUs which I think is a bit slow. I was using batch size = 1 since I do not know how to do multi-batch inference using the .generate API. I did torch.compile + bf16 already. Do we have an even faster multi-gpu inference framework? I have 8 GPUs so I was thinking about MUCH faster speed like ~10 or 20 instances per second (or is it possible at all? I am pretty new to this field).
-
Looking at lefnire's suggestion of splitting huggingface batches per gradient_accumulation_steps
Looking through https://github.com/huggingface/accelerate/tree/main/src/accelerate/utils/ I think it might be feasible, but will require some modifications to:
-
Have to abandon my (almost) finished LLaMA-API-Inference server. If anybody finds it useful and wants to continue, the repo is yours. :)
As /u/RabbitHole32 already mentioned, the speed increase stems from a patch which modifies, how a certain, large tensor is distributed between the GPU's. The patch was created by /u/emvw7yf. Here you can find the respective GitHub issue: https://github.com/huggingface/accelerate/issues/1394
-
Help please! SD installation broken
::pip install git+https://github.com/huggingface/accelerate
-
Batch Controlnet
pip install controlnet_aux pip install diffusers transformers git+https://github.com/huggingface/accelerate.git
-
[D] Large Language Models feasible to run on 32GB RAM / 8 GB VRAM / 24GB VRAM
Try to use both GPUs with this one: https://github.com/huggingface/accelerate https://huggingface.co/docs/accelerate/usage_guides/big_modeling https://huggingface.co/blog/accelerate-large-models Maybe it will help (the last link is clearer IMHO).
-
Fine Tuning Stable Diffusion with Dreambooth from Within My Python Code
I read through this page on accelerate, but it's not clear to me how the arguments such as instance_prompt gets passed in.
-
What does ACCELERATE do in AUTOMATIC1111?
To activate it you have to uncomment webui-user.sh line 44 and adding set ACCELERATE="True" to webui-user.bat. It seems to use huggingface/accelerate (Microsoft DeepSpeed, ZeRO paper) ACCELERATE
llamazoo
-
Gotzmann LLM Score
Parameters for all tests (done with llama.cpp thru LLaMAZoo):
-
Have to abandon my (almost) finished LLaMA-API-Inference server. If anybody finds it useful and wants to continue, the repo is yours. :)
Looks like exactly same idea I'm doing right now with LLaMAZoo: https://github.com/gotzmann/llamazoo
What are some alternatives?
DeepSpeed - DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
llama.py - Python bindings to llama.cpp
bitsandbytes - Accessible large language models via k-bit quantization for PyTorch.
TALIS - Simple and fast server for GPTQ-quantized LLaMA inference
FlexGen - Running large language models like OPT-175B/GPT-3 on a single GPU. Focusing on high-throughput generation. [Moved to: https://github.com/FMInference/FlexGen]
llama-go - Port of Facebook's LLaMA (Large Language Model Meta AI) in Golang with embedded C/C++
horovod - Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
llama.go - llama.go is like llama.cpp in pure Golang!
ChatGLM-6B - ChatGLM-6B: An Open Bilingual Dialogue Language Model | 开源双语对话语言模型
InternGPT - InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (支持DragGAN、ChatGPT、ImageBind、SAM的在线Demo系统)
unsloth - Finetune Llama 3, Mistral & Gemma LLMs 2-5x faster with 80% less memory
InternChat - InternGPT / InternChat allows you to interact with ChatGPT by clicking, dragging and drawing using a pointing device. [Moved to: https://github.com/OpenGVLab/InternGPT]