DeepSpeed
text-generation-webui
| | DeepSpeed | text-generation-webui |
|---|---|---|
| Mentions | 46 | 829 |
| Stars | 28,628 | 24,345 |
| Growth | 2.4% | - |
| Activity | 8.8 | 9.9 |
| Latest commit | 1 day ago | 3 days ago |
| Language | Python | Python |
| License | Apache License 2.0 | GNU Affero General Public License v3.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
DeepSpeed
-
A comprehensive guide to running Llama 2 locally
While on the surface, a 192GB Mac Studio seems like a great deal (it's not much more than a 48GB A6000!), there are several reasons why this might not be a good idea:
* I assume most people have never used llama.cpp Metal w/ large models. It will drop to CPU speeds whenever the context window is full: https://github.com/ggerganov/llama.cpp/issues/1730#issuecomm... - sure, this might be fixed in the future, but it's been an issue since Metal support was added, and it's a significant problem if you are actually trying to use it for inference. With 192GB of memory, you could probably run larger models w/o quantization, but I've never seen anyone post benchmarks of their experiences. Note that at that point, the limited memory bandwidth will be a big factor.
* If you are planning on using Apple Silicon for ML/training, I'd also be wary. There are multi-year-old open bugs in PyTorch[1], and most major LLM libs like deepspeed, bitsandbytes, etc. don't have Apple Silicon support[2][3]. (A minimal device-fallback sketch follows the links below.)
You can see similar patterns w/ Stable Diffusion support [4][5] - support lagging by months, lots of problems and poor performance with inference, much less fine-tuning. You can apply this to basically any ML application you want (srt, tts, video, etc.).
Macs are fine to poke around with, but if you actually plan to do more than run a small LLM and say "neat", especially for a business, a Mac is a bad recommendation for anyone getting started w/ ML workloads. (In general, for anyone getting started, unless you're just burning budget, renting cloud GPU is going to be the best cost/perf, although on-prem/local obviously has other advantages.)
[1] https://github.com/pytorch/pytorch/issues?q=is%3Aissue+is%3A...
[2] https://github.com/microsoft/DeepSpeed/issues/1580
[3] https://github.com/TimDettmers/bitsandbytes/issues/485
[4] https://github.com/AUTOMATIC1111/stable-diffusion-webui/disc...
[5] https://forums.macrumors.com/threads/ai-generated-art-stable...
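For the PyTorch-on-Apple-Silicon point above, here is a minimal sketch of the usual device-selection fallback, assuming a recent PyTorch build with the MPS backend; it is illustrative only and says nothing about any specific open bug:

```python
import torch

# Minimal sketch: pick the best available device on a Mac, falling back to CPU.
# Assumes a PyTorch build with the MPS (Metal) backend; ops that MPS does not
# implement still need PYTORCH_ENABLE_MPS_FALLBACK=1 or an explicit CPU path.
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

x = torch.randn(4, 4, device=device)
print(device, x.sum().item())
```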
-
Microsoft Research proposes new framework, LongMem, allowing for unlimited context length along with reduced GPU memory usage and faster inference speed. Code will be open-sourced
And https://github.com/microsoft/deepspeed
-
April 2023
DeepSpeed Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales (https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat)
-
Using --deepspeed requires lots of manual tweaking
Filed a discussion item on the deepspeed project: https://github.com/microsoft/DeepSpeed/discussions/3531
Solution: I don't know; this is where I am stuck. https://github.com/microsoft/DeepSpeed/issues/1037 suggests that I just need to 'apt install libaio-dev', but I've done that and it doesn't help.
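One hedged way to narrow this down (a sketch, not a confirmed fix): DeepSpeed exposes its op builders from Python, and the async_io builder will report whether it thinks it can compile against libaio; the bundled `ds_report` command prints a similar compatibility table.

```python
# Sketch: ask DeepSpeed whether the async_io extension looks buildable.
# Assumes DeepSpeed is installed; the op needs the libaio headers
# (e.g. from libaio-dev on Debian/Ubuntu) visible at compile time.
from deepspeed.ops.op_builder import AsyncIOBuilder

print("async_io compatible:", AsyncIOBuilder().is_compatible())
```

If this prints False even after installing libaio-dev, the headers are probably not where the builder looks for them, which would match the symptom in the discussion above.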
-
Whether ML computation engineering expertise will be valuable is the question.
There could be a spectrum to this expertise; see, for instance, https://github.com/NVIDIA/FasterTransformer and https://github.com/microsoft/DeepSpeed
- FLiPN-FLaNK Stack Weekly for 17 April 2023
- DeepSpeed Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-Like Models
- DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-Like Models
-
12-Apr-2023 AI Summary
DeepSpeed Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales (https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat)
text-generation-webui
-
Need help setting up an LXC container for LLM use via API on a local server
It looks like oobabooga can also do API access, but I'm hesitant to use something geared more for roleplaying, as many of the options don't really apply to my use case.
-
Best software web-/GUI?
Right now I really only know about Ooba and koboldcpp for running and using models. I feel like they work really well when you want to tinker with the models, but if you want to actually use them, for example as a replacement for ChatGPT, they fall behind.
-
DevToys–A Swiss army knife for developers
I like this one: https://github.com/oobabooga/text-generation-webui
An LLM is the Swiss army knife for everyone.
-
Show HN: Beating GPT-4 on HumanEval with a fine-tuned CodeLlama-34B
Not hard with text-generation-webui! https://github.com/oobabooga/text-generation-webui
I personally use llama.cpp as the backend since I run CPU-only, but another may be better suited for GPU usage. Then it's as simple as downloading the model and placing it in the models directory.
- Meta: Code Llama, an AI Tool for Coding
-
GPT-3.5 Turbo fine-tuning and API updates
It depends on your needs. For instance, do you want to host an API, or do you want a front end like ChatGPT? Chances are, text-generation-webui [1] should get you pretty close to hosting it yourself. You simply clone the repo, download the model from Hugging Face using the included helper (download-model.py), and fire up the server with server.py. You can connect to it by SSH port tunneling on port 7860 (there are other ways like Ngrok, but SSH tunneling is the easiest and most secure).
As for hosting, I found that runpod [2] has been the cheapest (not affiliated, just a user). All the other services tend to cost more once you include bandwidth and storage. There are some tutorials online [3], but a lot of them use the quantized version. You should be able to fit the original 70B with "load_in_8bit" on one A100 80GB.
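For the "load_in_8bit" point, this is roughly what that looks like with plain transformers and bitsandbytes; the checkpoint name and the claim that it fits in 80GB are assumptions carried over from the comment, not guarantees:

```python
# Sketch: load a Llama-2 70B checkpoint in 8-bit so it fits on a single 80GB GPU.
# Assumes transformers, accelerate and bitsandbytes are installed and that the
# weights are already downloaded (e.g. via text-generation-webui's download-model.py).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-chat-hf"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # bitsandbytes int8 quantization
    device_map="auto",   # let accelerate place layers on the GPU
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```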
-
Show HN: LlamaGPT – Self-hosted, offline, private AI chatbot, powered by Llama 2
I like this for turn by turn conversations: https://huggingface.co/NousResearch/Nous-Hermes-Llama2-13b
this for zero shot instructions: https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-...
easiest way would be https://github.com/oobabooga/text-generation-webui
A slightly more complex way, which I use, is a stack with the llama.cpp server, an OpenAI adapter, and BetterGPT as the frontend, using the OpenAI adapter as the custom endpoint. BetterGPT's UX beats oobabooga's by a long way (and ChatGPT's in certain aspects).
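The "OpenAI adapter" piece of that stack boils down to pointing an OpenAI client at a local base URL. Here is a minimal sketch with the openai>=1.0 Python client; the port, path, and model name are assumptions about whatever adapter is running locally:

```python
# Sketch: talk to a local llama.cpp server through an OpenAI-compatible adapter.
# The base_url, api_key and model name below are placeholders for a local setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="local-llama",  # whatever model name the adapter exposes
    messages=[{"role": "user", "content": "Summarize DeepSpeed in one sentence."}],
)
print(resp.choices[0].message.content)
```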
-
What's the best (and cheap) way to try out all the new LLMs on cloud services.
Here are the two major GUIs people use here to run LLMs: https://github.com/oobabooga/text-generation-webui/ and https://github.com/LostRuins/koboldcpp/
-
Llama 2 with GeForce 1080 8GB
I'm using Debian Linux with TGW and also have a GTX 1080 8 GB. I am able to offload all 35 layers to the GPU when loading the q4 (4-bit) version of the Luna-AI-Llama2-Uncensored-GGML model using llama.cpp, getting 25 to 30 tokens a second. I will note that I am only using it for a Discord bot, and it does what I need. Just wanted to post that it can be done.
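Roughly what that offload looks like through llama-cpp-python, as a sketch; the model path is a placeholder and the 35-layer figure is specific to that card and model:

```python
# Sketch: offload 35 layers of a 4-bit Llama 2 model to an 8GB GPU via
# llama-cpp-python. Requires a build compiled with GPU (e.g. cuBLAS) support.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/luna-ai-llama2-uncensored.q4_0.bin",  # assumed local path
    n_gpu_layers=35,  # layers pushed onto the GTX 1080
    n_ctx=2048,
)

out = llm("Q: What can an 8GB GPU run locally? A:", max_tokens=64)
print(out["choices"][0]["text"])
```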
-
Why host your own LLM?
My recommendation for your PDF project is to run https://github.com/oobabooga/text-generation-webui.
What are some alternatives?
ColossalAI - Making large AI models cheaper, faster and more accessible
KoboldAI
TavernAI - Atmospheric adventure chat for AI language models (KoboldAI, NovelAI, Pygmalion, OpenAI chatgpt, gpt-4)
llama.cpp - Port of Facebook's LLaMA model in C/C++
fairscale - PyTorch extensions for high performance and large scale training.
Megatron-LM - Ongoing research training transformer models at scale
gpt4all - gpt4all: open-source LLM chatbots that you can run anywhere
TensorRT - NVIDIA® TensorRT™, an SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications.
langchain - ⚡ Building applications with LLMs through composability ⚡ [Moved to: https://github.com/langchain-ai/langchain]
KoboldAI-Client
alpaca-lora - Instruct-tune LLaMA on consumer hardware
llama - Inference code for LLaMA models