| | minimal-llama | petals |
|---|---|---|
| Mentions | 4 | 98 |
| Stars | 456 | 8,684 |
| Growth | - | 1.5% |
| Activity | 8.5 | 8.3 |
| Latest commit | 7 months ago | 5 days ago |
| Language | Python | Python |
| License | - | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
minimal-llama
- Show HN: Finetune LLaMA-7B on commodity GPUs using your own text
- Visual ChatGPT
I can't edit my comment now, but it's 30B that needs 18GB of VRAM.
LLaMA-13B, which is at the level of GPT-3 175B, needs only 10GB of VRAM with GPTQ 4-bit quantization.
> do you think there's anything left to trim? like weight pruning, or LoRA, or I dunno, some kind of Huffman coding scheme that lets you mix 4-bit, 2-bit and 1-bit quantizations?
Absolutely. The GPTQ paper claims negligible output quality loss with 3-bit quantization. The GPTQ-for-LLaMA repo supports 3-bit quantization and inference. So this extra 25% savings is already possible.
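To put the 25% figure in context, here is a back-of-the-envelope sketch of weight memory at different bit widths (my own illustration, weights only; activations and the KV cache add to this):

```python
# Rough weight-memory math: bytes = params * bits / 8.
def weight_gib(params_billion: float, bits: int) -> float:
    return params_billion * 1e9 * bits / 8 / 1024**3

for bits in (16, 8, 4, 3):
    print(f"LLaMA-13B weights at {bits}-bit: {weight_gib(13, bits):.1f} GiB")
# 3-bit weights are 3/4 the size of 4-bit weights, i.e. the extra 25% savings.
```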
As of right now, GPTQ-for-LLaMa uses a VRAM-hungry attention method. Flash attention will reduce the requirements for 7B to 4GB and could possibly fit 30B with a 2048-token context window into 16GB, all before stacking 3-bit quantization.
Pruning is a possibility but I'm not aware of anyone working on it yet.
LoRA has already been implemented. See https://github.com/zphang/minimal-llama#peft-fine-tuning-wit...
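For reference, a minimal sketch of that PEFT-style setup (frozen 8-bit base weights plus trainable LoRA adapters); the checkpoint name and hyperparameters below are illustrative assumptions, not the repo's exact configuration:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Assumed checkpoint; swap in whatever LLaMA weights you have access to.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",
    load_in_8bit=True,    # base weights stay frozen in 8-bit
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,                                  # rank of the adapter matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # LLaMA attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapters are trainable
```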
petals
- Mistral Large
So how long until we can do an open-source Mistral Large?
We could make a start on Petals [0] or some other open-source distributed training network, perhaps?
[0] https://petals.dev/
- Distributed Inference and Fine-Tuning of Large Language Models over the Internet
You can check out their project at https://github.com/bigscience-workshop/petals
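For a sense of the API, a sketch of client-side inference following the Petals README; the model name here is an example and may not match what the public swarm currently hosts:

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "petals-team/StableBeluga2"  # example; check the repo for hosted models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Distributed inference means", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=16)  # blocks run on remote peers
print(tokenizer.decode(outputs[0]))
```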
- Make no mistake—AI is owned by Big Tech
- Would you donate computation and storage to help build an open source LLM?
- Run 70B LLM Inference on a Single 4GB GPU with This New Technique
There is already an implementation along the same lines using a torrent-style architecture.
https://petals.dev/
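The torrent analogy: each peer hosts a slice of the model's transformer blocks, and clients route requests through the swarm. Joining as a server is a single command per the Petals README (the model name is again an example):

```
python -m petals.cli.run_server petals-team/StableBeluga2
```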
- Run LLMs in BitTorrent style
Check it out at https://petals.dev (there is a chatbot demo as well).
- Is distributed computing dying, or just fading into the background?
- Ask HN: Are there any projects currently exploring distributed AI training?
https://github.com/bigscience-workshop/petals
- Mistral 7B: The Complete Guide to the Best 7B Model
https://github.com/bigscience-workshop/petals
Inference only: https://lite.koboldai.net/
- Run LLMs at home, BitTorrent-style
What are some alternatives?
FlexGen - Running large language models on a single GPU for throughput-oriented scenarios.
text-generation-webui - A Gradio web UI for Large Language Models. Supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), Llama models.
visual-chatgpt - Official repo for the paper: Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models [Moved to: https://github.com/microsoft/TaskMatrix]
llama - Inference code for Llama models
whisper.cpp - Port of OpenAI's Whisper model in C/C++
alpaca-lora - Instruct-tune LLaMA on consumer hardware
simple-llm-finetuner - Simple UI for LLM Model Finetuning
GLM-130B - GLM-130B: An Open Bilingual Pre-Trained Model (ICLR 2023)
Auto-GPT - An experimental open-source attempt to make GPT-4 fully autonomous. [Moved to: https://github.com/Significant-Gravitas/Auto-GPT]
GPTQ-for-LLaMa - 4 bits quantization of LLaMA using GPTQ
Open-Assistant - OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.