gpt-llama.cpp VS GPTQ-for-LLaMa

Compare gpt-llama.cpp vs GPTQ-for-LLaMa and see what are their differences.

gpt-llama.cpp

A llama.cpp drop-in replacement for OpenAI's GPT endpoints, allowing GPT-powered apps to run off local llama.cpp models instead of OpenAI. (by keldenl)

GPTQ-for-LLaMa

4 bits quantization of LLaMA using GPTQ (by qwopqwop200)
SurveyJS - Open-Source JSON Form Builder to Create Dynamic Forms Right in Your App
With SurveyJS form UI libraries, you can build and style forms in a fully-integrated drag & drop form builder, render them in your JS app, and store form submission data in any backend, inc. PHP, ASP.NET Core, and Node.js.
surveyjs.io
featured
InfluxDB - Power Real-Time Data Analytics at Scale
Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
www.influxdata.com
featured
gpt-llama.cpp GPTQ-for-LLaMa
12 75
587 2,924
- -
8.2 8.6
11 months ago 9 months ago
JavaScript Python
MIT License Apache License 2.0
The number of mentions indicates the total number of mentions that we've tracked plus the number of user suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

gpt-llama.cpp

Posts with mentions or reviews of gpt-llama.cpp. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-06-22.

GPTQ-for-LLaMa

Posts with mentions or reviews of GPTQ-for-LLaMa. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-07-10.
  • [P] Early in 2023 I put in a lot of work on a new machine learning project. Now I'm not sure what to do with it.
    1 project | /r/MachineLearning | 3 Dec 2023
    First I want to make it clear this is not a self promotion post. I hope many machine learning people come at me with questions or comments about this project. A little background about myself. I did work on the 4 bits quantization of LLaMA using GPTQ. (https://github.com/qwopqwop200/GPTQ-for-LLaMa). I've been studying AI in-depth for many years now.
  • GPT-4 Details Leaked
    3 projects | news.ycombinator.com | 10 Jul 2023
    Deploying the 60B version is a challenge though and you might need to apply 4-bit quantization with something like https://github.com/PanQiWei/AutoGPTQ or https://github.com/qwopqwop200/GPTQ-for-LLaMa . Then you can improve the inference speed by using https://github.com/turboderp/exllama .

    If you prefer to use an "instruct" model à la ChatGPT (i.e. that does not need few-shot learning to output good results) you can use something like this: https://huggingface.co/TheBloke/Wizard-Vicuna-30B-Uncensored...

  • Rambling
    1 project | /r/PygmalionAI | 30 Jun 2023
    I use gptq-for-llama - from this https://github.com/qwopqwop200/GPTQ-for-LLaMa and Pygmalion 7B.
  • Now that ExLlama is out with reduced VRAM usage, are there any GPTQ models bigger than 7b which can fit onto an 8GB card?
    2 projects | /r/LocalLLaMA | 29 Jun 2023
    exllama is an optimized implementation of GPTQ-for-LLaMa, allowing you to run 4-bit quantized language models with GPU at great speeds.
  • GGML – AI at the Edge
    11 projects | news.ycombinator.com | 6 Jun 2023
    With a single NVIDIA 3090 and the fastest inference branch of GPTQ-for-LLAMA https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/fastest-i..., I get a healthy 10-15 tokens per second on the 30B models. IMO GGML is great (And I totally use it) but it's still not as fast as running the models on GPU for now.
  • New quantization method AWQ outperforms GPTQ in 4-bit and 3-bit with 1.45x speedup and works with multimodal LLMs
    4 projects | /r/LocalLLaMA | 2 Jun 2023
    And exactly what Triton version are they comparing against? I just tried the latest version of this, and on my 4090/12900K I get 77 tokens per second for Llama 7B-128g. My own GPTQ CUDA implementation gets 151 tokens/second on the same model, same hardware. That makes it 96% faster, whereas AWQ is only 79% faster. For 30B-128g I'm currently only getting a 110% speedup over Triton compared to their 178%, but it still seems a little disingenuous to compare against their own CUDA implementation only, when they're trying to present the quantization method as being faster for inference.
  • Introducing Basaran: self-hosted open-source alternative to the OpenAI text completion API
    9 projects | /r/LocalLLaMA | 1 Jun 2023
    Thanks for the explanation. I think some repos, like text generation webui used gptq for llama (I don't know if it's this repo or another one), anyway most repo that I saw use external things (like gptq for llama)
  • How to use AMD GPU?
    4 projects | /r/LocalLLaMA | 1 Jun 2023
    cd ../.. git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b triton cd GPTQ-for-LLaMa pip install -r requirements.txt mkdir -p ../text-generation-webui/repositories ln -s ../../GPTQ-for-LLaMa ../text-generation-webui/repositories/GPTQ-for-LLaMa
  • Help needed with installing quant_cuda for the WebUI
    2 projects | /r/LocalLLaMA | 31 May 2023
    cd repositories git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa pip install -r requirements.txt
  • The installed version of bitsandbytes was compiled without GPU support
    2 projects | /r/Oobabooga | 29 May 2023
    # To use the GPTQ models I need to Install GPTQ-for-LLaMa and the monkey patch mkdir repositories cd repositories git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b triton cd GPTQ-for-LLaMa pip install ninja pip install -r requirements.txt cd cd text-generation-webui # download random model python download-model.py xxx/yyy # try to start the gui python server.py # It returns this warning but it runs bin /home/gm/miniconda3/envs/chat/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so /home/gm/miniconda3/envs/chat/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable. warn("The installed version of bitsandbytes was compiled without GPU support. " /home/gm/miniconda3/envs/chat/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32

What are some alternatives?

When comparing gpt-llama.cpp and GPTQ-for-LLaMa you can also consider the following projects:

llama_index - LlamaIndex is a data framework for your LLM applications

llama.cpp - LLM inference in C/C++

Auto-LLM-Local - Created my own python script similar to AutoGPT where you supply a local llm model like alpaca13b (The main one I use), and the script can access the supplied tools to achieve your objective. Code fully works as far as I can tell. Takes me 5 minutes per chain on my slow laptop.

bitsandbytes - Accessible large language models via k-bit quantization for PyTorch.

long_term_memory - A gradio web UI for running Large Language Models like GPT-J 6B, OPT, GALACTICA, LLaMA, and Pygmalion.

text-generation-webui - A Gradio web UI for Large Language Models. Supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), Llama models.

langchain - ⚡ Building applications with LLMs through composability ⚡ [Moved to: https://github.com/langchain-ai/langchain]

qlora - QLoRA: Efficient Finetuning of Quantized LLMs

semantic-kernel - Integrate cutting-edge LLM technology quickly and easily into your apps

private-gpt - Interact with your documents using the power of GPT, 100% privately, no data leaks

langchain - 🦜🔗 Build context-aware reasoning applications

stable-diffusion-webui-docker - Easy Docker setup for Stable Diffusion with user-friendly UI