AutoGPTQ
gpt-llama.cpp
Metric | AutoGPTQ | gpt-llama.cpp
---|---|---
Mentions | 19 | 12
Stars | 3,806 | 587
Growth | 5.0% | -
Activity | 9.3 | 8.2
Last commit | 4 days ago | 11 months ago
Language | Python | JavaScript
License | MIT License | MIT License
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
AutoGPTQ
- Setting up LLAMA2 70B Chat locally
- Experience of setting up LLAMA 2 70B Chat locally
-
GPT-4 Details Leaked
Deploying the 60B version is a challenge though and you might need to apply 4-bit quantization with something like https://github.com/PanQiWei/AutoGPTQ or https://github.com/qwopqwop200/GPTQ-for-LLaMa . Then you can improve the inference speed by using https://github.com/turboderp/exllama .
If you prefer to use an "instruct" model à la ChatGPT (i.e. that does not need few-shot learning to output good results) you can use something like this: https://huggingface.co/TheBloke/Wizard-Vicuna-30B-Uncensored...
-
Loader Types
AutoGPTQ: an attempt at standardizing GPTQ-for-LLaMa and turning it into a library that is easier to install and use, and that supports more models. https://github.com/PanQiWei/AutoGPTQ
- WizardLM-33B-V1.0-Uncensored
-
Any help converting an interesting .bin model to 4 bit 128g GPTQ? Bloke?
Just use the script: https://github.com/PanQiWei/AutoGPTQ/blob/main/examples/quantization/quant_with_alpaca.py
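To make "4 bit 128g" concrete, here is a minimal sketch of group-wise 4-bit quantization: the weights are split into groups of 128, and each group stores its own scale and offset. This is plain round-to-nearest, not the GPTQ algorithm itself (GPTQ additionally compensates for rounding error using calibration data), and all function names are hypothetical, not part of AutoGPTQ.

```python
from typing import List, Tuple

Group = Tuple[List[int], float, float]  # (integer codes, scale, offset)


def quantize_group(values: List[float], bits: int = 4) -> Group:
    """Round-to-nearest quantization of one group of weights (illustrative only)."""
    levels = (1 << bits) - 1              # 15 distinct steps for 4-bit codes
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def quantize_weights(weights: List[float], bits: int = 4,
                     group_size: int = 128) -> List[Group]:
    """Split the weights into groups; each group gets its own scale and offset."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def dequantize_weights(groups: List[Group]) -> List[float]:
    """Reconstruct approximate weights from the stored codes, scales and offsets."""
    restored: List[float] = []
    for codes, scale, lo in groups:
        restored.extend(code * scale + lo for code in codes)
    return restored
```

Smaller groups track the local weight range more closely at the cost of storing more scales and offsets, which is the trade-off the "128g" suffix refers to.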
-
LLM.int8(): 8-Bit Matrix Multiplication for Transformers at Scale
In the wild, people tend to use GPTQ quantization for pure GPU inference: https://github.com/PanQiWei/AutoGPTQ
And ggml's quant for CPU inference with some offload, which just got updated to a more GPTQ-like method days ago: https://github.com/ggerganov/llama.cpp/pull/1684
Some other runtimes like Apache TVM also have their own quant implementations: https://github.com/mlc-ai/mlc-llm
For training, 4-bit bitsandbytes is SOTA, as far as I know.
TBH I'm not sure why this November paper is being linked. Few are running 8 bit models when they could fit a better 3-5 bit model in the same memory pool.
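The memory argument in the comment above can be checked with some rough arithmetic. The sketch below counts only the bytes needed for the weights; it ignores per-group scales, activations, and the KV cache, so real footprints will be somewhat larger. The function names are hypothetical.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB (no quantization metadata)."""
    return n_params * bits_per_weight / 8 / 2**30


def max_params_for_budget(budget_bytes: float, bits_per_weight: float) -> float:
    """How many parameters fit in a given byte budget at a given bit width."""
    return budget_bytes * 8 / bits_per_weight


# Byte budget that holds a 13B-parameter model at 8 bits per weight.
budget_bytes = 13e9 * 8 / 8

for bits in (8, 5, 4, 3):
    params = max_params_for_budget(budget_bytes, bits)
    size = weight_memory_gib(params, bits)
    print(f"{bits}-bit: about {params / 1e9:.1f}B parameters in {size:.1f} GiB")
```

Under these assumptions, halving the bit width doubles the number of parameters that fit in the same memory, which is why a larger 3-5 bit model is often preferred over a smaller 8-bit one.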
-
Introducing Basaran: self-hosted open-source alternative to the OpenAI text completion API
Instead of integrating GPTQ-for-LLaMa, use AutoGPTQ.
- AutoGPTQ - An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm
gpt-llama.cpp
-
Attempt to run Llama on a remote server with chatbot-ui
Hi! I really like https://github.com/keldenl/gpt-llama.cpp, which helps to run https://github.com/mckaywrigley/chatbot-ui against a local model. I am running it with Wizard 7B or 13B locally and it works fine, but when I tried to deploy it to a remote server I got an error.
-
Introducing Basaran: self-hosted open-source alternative to the OpenAI text completion API
Sounds like you’re asking for exactly this? https://github.com/keldenl/gpt-llama.cpp
- LLaMA and AutoAPI?
-
New big update to GPTNicheFinder: better trends analysis and scoring system, cleaned up UI and verbose in the terminal for people who want to see what is going on and to verify the results
I salute you, good sir. This is an amazing idea. I don't have time, but it would be interesting to use this wrapper, https://github.com/keldenl/gpt-llama.cpp, which simulates a GPT endpoint for a local LLaMA model, so basically we could have an amazing tool that is completely free to use. If somebody tests it, please let me know in a reply to my comment!
-
I build an AI powered writing tools, an AI co-author
I would gladly buy your product to run with a local model, like Vicuna ggml. Also see https://github.com/keldenl/gpt-llama.cpp/
-
Serge... Just works
Possible through fastllama in Python, or through gpt-llama.cpp, an API wrapper around llama.cpp.
-
Embeddings?
https://github.com/keldenl/gpt-llama.cpp supports embeddings, and it even takes in OpenAI-style requests and returns OpenAI-compatible responses!
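As a rough illustration of what an OpenAI-style embeddings request looks like, the sketch below builds one with Python's standard library and parses a response of the usual OpenAI shape. The `/v1/embeddings` path and the `model`/`input` fields follow the OpenAI API; the base URL, the model name, and the `api_key` placeholder are assumptions for illustration and should be checked against the gpt-llama.cpp documentation.

```python
import json
import urllib.request


def build_embeddings_request(base_url: str, model: str, text: str,
                             api_key: str = "placeholder") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style embeddings request.

    base_url, model and api_key are hypothetical values supplied by the caller.
    """
    payload = {"model": model, "input": text}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def parse_embedding(response_body: bytes) -> list:
    """Extract the first embedding vector from an OpenAI-compatible response body."""
    body = json.loads(response_body)
    return body["data"][0]["embedding"]
```

With a compatible server running, the request could be sent via `urllib.request.urlopen(request)` and the raw response bytes passed to `parse_embedding`.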
-
I built a completely Local AutoGPT with the help of GPT-llama running Vicuna-13B
https://github.com/keldenl/gpt-llama.cpp
- I build a completely Local and portable AutoGPT with the help of gpt-llama, running on Vicuna-13b
-
Adding Long-Term Memory to Custom LLMs: Let's Tame Vicuna Together!
There's a (kind of) working Auto-GPT solution that uses Vicuna https://github.com/keldenl/gpt-llama.cpp/blob/master/docs/Auto-GPT-setup-guide.md
What are some alternatives?
exllama - A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
llama_index - LlamaIndex is a data framework for your LLM applications
llama.cpp - LLM inference in C/C++
Auto-LLM-Local - Created my own Python script similar to AutoGPT where you supply a local LLM like alpaca13b (the main one I use), and the script can access the supplied tools to achieve your objective. Code fully works as far as I can tell. Takes me 5 minutes per chain on my slow laptop.
text-generation-webui - A Gradio web UI for Large Language Models. Supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), Llama models.
long_term_memory - A gradio web UI for running Large Language Models like GPT-J 6B, OPT, GALACTICA, LLaMA, and Pygmalion.
basaran - Basaran is an open-source alternative to the OpenAI text completion API. It provides a compatible streaming API for your Hugging Face Transformers-based text generation models.
langchain - ⚡ Building applications with LLMs through composability ⚡ [Moved to: https://github.com/langchain-ai/langchain]
GPTQ-for-LLaMa - 4 bits quantization of LLaMA using GPTQ
semantic-kernel - Integrate cutting-edge LLM technology quickly and easily into your apps
self-refine - LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively.
langchain - 🦜🔗 Build context-aware reasoning applications