minimal-llama VS GPTQ-for-LLaMa

Compare minimal-llama vs GPTQ-for-LLaMa and see how they differ.

                 minimal-llama     GPTQ-for-LLaMa
Mentions         4                 75
Stars            456               2,916
Growth           -                 -
Activity         8.5               8.6
Latest commit    7 months ago      9 months ago
Language         Python            Python
License          -                 Apache License 2.0
Mentions - the total number of mentions we've tracked plus the number of user-suggested alternatives.
Stars - the number of stars a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed; recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is among the top 10% of the most actively developed projects that we are tracking.

minimal-llama

Posts with mentions or reviews of minimal-llama. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-03-21.
  • Show HN: Finetune LLaMA-7B on commodity GPUs using your own text
    16 projects | news.ycombinator.com | 21 Mar 2023
  • Visual ChatGPT
    8 projects | news.ycombinator.com | 9 Mar 2023
    I can't edit my comment now, but it's 30B that needs 18GB of VRAM.

    LLaMA-13B, which is roughly at GPT-3 175B level, only needs 10GB of VRAM with GPTQ 4-bit quantization.

    >do you think there's anything left to trim? like weight pruning, or LoRA, or I dunno, some kind of Huffman coding scheme that lets you mix 4-bit, 2-bit and 1-bit quantizations?

    Absolutely. The GPTQ paper claims negligible output quality loss with 3-bit quantization. The GPTQ-for-LLaMa repo supports 3-bit quantization and inference, so that extra 25% savings is already possible.

    As of right now, GPTQ-for-LLaMa is using a VRAM-hungry attention method. Flash attention will reduce the requirement for 7B to 4GB and possibly fit 30B with a 2048-token context window into 16GB, all before stacking 3-bit quantization.

    Pruning is a possibility but I'm not aware of anyone working on it yet.

    LoRA has already been implemented. See https://github.com/zphang/minimal-llama#peft-fine-tuning-wit... (two short sketches follow below).
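
To put the VRAM figures quoted in the post above in perspective, here is a rough back-of-the-envelope estimate of weight memory at different bit widths. The ~20% overhead factor is an assumption, and real usage also depends on context length, activations, and the attention implementation, so treat this as a sketch rather than a measurement.

    # Rough weight-memory estimate for LLaMA checkpoints at various bit widths.
    # The 20% overhead factor is an assumption; the KV cache and activations
    # (which grow with context length) are not included.

    def weight_memory_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
        bytes_total = params_billion * 1e9 * bits_per_weight / 8
        return bytes_total * overhead / 1e9

    for size_b in (7, 13, 30):
        for bits in (16, 4, 3):
            print(f"LLaMA-{size_b}B @ {bits}-bit ~ {weight_memory_gb(size_b, bits):.1f} GB of weights")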
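
The LoRA route mentioned at the end of that post is what minimal-llama wires up through Hugging Face PEFT. The snippet below is a minimal sketch of that general approach, not the repo's exact training script; the checkpoint path, target modules, and hyperparameters are illustrative placeholders.

    # Minimal LoRA setup with Hugging Face PEFT (a sketch, not minimal-llama's exact script).
    # The checkpoint path and hyperparameters are placeholders; device_map="auto" needs accelerate.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    model_name = "path/to/llama-7b-hf"  # placeholder: a LLaMA checkpoint converted to HF format
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )

    lora_config = LoraConfig(
        r=8,                                  # rank of the low-rank update matrices
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],  # attention projections commonly adapted for LLaMA
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # only the small LoRA matrices are trainable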

GPTQ-for-LLaMa

Posts with mentions or reviews of GPTQ-for-LLaMa. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-07-10.
  • [P] Early in 2023 I put in a lot of work on a new machine learning project. Now I'm not sure what to do with it.
    1 project | /r/MachineLearning | 3 Dec 2023
    First I want to make it clear this is not a self-promotion post. I hope many machine learning people come at me with questions or comments about this project. A little background about myself: I did work on the 4-bit quantization of LLaMA using GPTQ (https://github.com/qwopqwop200/GPTQ-for-LLaMa). I've been studying AI in-depth for many years now.
  • GPT-4 Details Leaked
    3 projects | news.ycombinator.com | 10 Jul 2023
    Deploying the 60B version is a challenge, though, and you might need to apply 4-bit quantization with something like https://github.com/PanQiWei/AutoGPTQ or https://github.com/qwopqwop200/GPTQ-for-LLaMa . Then you can improve the inference speed by using https://github.com/turboderp/exllama . (A minimal AutoGPTQ loading sketch follows the post list below.)

    If you prefer to use an "instruct" model à la ChatGPT (i.e. that does not need few-shot learning to output good results) you can use something like this: https://huggingface.co/TheBloke/Wizard-Vicuna-30B-Uncensored...

  • Rambling
    1 project | /r/PygmalionAI | 30 Jun 2023
    I use gptq-for-llama - from this https://github.com/qwopqwop200/GPTQ-for-LLaMa and Pygmalion 7B.
  • Now that ExLlama is out with reduced VRAM usage, are there any GPTQ models bigger than 7b which can fit onto an 8GB card?
    2 projects | /r/LocalLLaMA | 29 Jun 2023
    exllama is an optimized implementation of GPTQ-for-LLaMa, allowing you to run 4-bit quantized language models with GPU at great speeds.
  • GGML – AI at the Edge
    11 projects | news.ycombinator.com | 6 Jun 2023
    With a single NVIDIA 3090 and the fastest inference branch of GPTQ-for-LLAMA https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/fastest-i..., I get a healthy 10-15 tokens per second on the 30B models. IMO GGML is great (And I totally use it) but it's still not as fast as running the models on GPU for now.
  • New quantization method AWQ outperforms GPTQ in 4-bit and 3-bit with 1.45x speedup and works with multimodal LLMs
    4 projects | /r/LocalLLaMA | 2 Jun 2023
    And exactly what Triton version are they comparing against? I just tried the latest version of this, and on my 4090/12900K I get 77 tokens per second for Llama 7B-128g. My own GPTQ CUDA implementation gets 151 tokens/second on the same model, same hardware. That makes it 96% faster, whereas AWQ is only 79% faster. For 30B-128g I'm currently only getting a 110% speedup over Triton compared to their 178%, but it still seems a little disingenuous to compare against their own CUDA implementation only, when they're trying to present the quantization method as being faster for inference.
  • Introducing Basaran: self-hosted open-source alternative to the OpenAI text completion API
    9 projects | /r/LocalLLaMA | 1 Jun 2023
    Thanks for the explanation. I think some repos, like text-generation-webui, used GPTQ-for-LLaMa (I don't know if it's this repo or another one); anyway, most repos that I saw use external things (like GPTQ-for-LLaMa).
  • How to use AMD GPU?
    4 projects | /r/LocalLLaMA | 1 Jun 2023
    cd ../..
    git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b triton
    cd GPTQ-for-LLaMa
    pip install -r requirements.txt
    mkdir -p ../text-generation-webui/repositories
    ln -s ../../GPTQ-for-LLaMa ../text-generation-webui/repositories/GPTQ-for-LLaMa
  • Help needed with installing quant_cuda for the WebUI
    2 projects | /r/LocalLLaMA | 31 May 2023
    cd repositories
    git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
    pip install -r requirements.txt
  • The installed version of bitsandbytes was compiled without GPU support
    2 projects | /r/Oobabooga | 29 May 2023
    # To use the GPTQ models I need to install GPTQ-for-LLaMa and the monkey patch
    mkdir repositories
    cd repositories
    git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b triton
    cd GPTQ-for-LLaMa
    pip install ninja
    pip install -r requirements.txt
    cd
    cd text-generation-webui
    # download random model
    python download-model.py xxx/yyy
    # try to start the gui
    python server.py
    # It returns this warning but it runs
    bin /home/gm/miniconda3/envs/chat/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so
    /home/gm/miniconda3/envs/chat/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
    warn("The installed version of bitsandbytes was compiled without GPU support. "
    /home/gm/miniconda3/envs/chat/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
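
Several of the posts above reach for AutoGPTQ or GPTQ-for-LLaMa to run pre-quantized 4-bit checkpoints. The snippet below is a minimal loading sketch with AutoGPTQ, assuming an already-quantized model directory; the path and generation settings are placeholders, and argument names can vary between AutoGPTQ releases.

    # Load and run a pre-quantized 4-bit GPTQ checkpoint with AutoGPTQ (sketch).
    # The model directory is a placeholder; details may differ across AutoGPTQ versions.
    from transformers import AutoTokenizer
    from auto_gptq import AutoGPTQForCausalLM

    model_dir = "path/to/llama-13b-4bit-128g"  # placeholder: directory of a GPTQ-quantized model
    tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
    model = AutoGPTQForCausalLM.from_quantized(model_dir, device="cuda:0", use_safetensors=True)

    prompt = "The main benefit of 4-bit quantization is"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))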

What are some alternatives?

When comparing minimal-llama and GPTQ-for-LLaMa you can also consider the following projects:

FlexGen - Running large language models on a single GPU for throughput-oriented scenarios.

llama.cpp - LLM inference in C/C++

visual-chatgpt - Official repo for the paper: Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models [Moved to: https://github.com/microsoft/TaskMatrix]

bitsandbytes - Accessible large language models via k-bit quantization for PyTorch.

whisper.cpp - Port of OpenAI's Whisper model in C/C++

text-generation-webui - A Gradio web UI for Large Language Models. Supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), Llama models.

simple-llm-finetuner - Simple UI for LLM Model Finetuning

qlora - QLoRA: Efficient Finetuning of Quantized LLMs

alpaca-lora - Instruct-tune LLaMA on consumer hardware

private-gpt - Interact with your documents using the power of GPT, 100% privately, no data leaks

stable-diffusion-webui-docker - Easy Docker setup for Stable Diffusion with user-friendly UI