AutoGPTQ vs self-refine

| | AutoGPTQ | self-refine |
|---|---|---|
| Mentions | 19 | 8 |
| Stars | 3,806 | 488 |
| Growth | 5.0% | - |
| Activity | 9.3 | 7.9 |
| Latest commit | 4 days ago | 4 months ago |
| Language | Python | Python |
| License | MIT License | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
AutoGPTQ
- Setting up LLAMA2 70B Chat locally
- Experience of setting up LLAMA 2 70B Chat locally
-
GPT-4 Details Leaked
Deploying the 60B version is a challenge though and you might need to apply 4-bit quantization with something like https://github.com/PanQiWei/AutoGPTQ or https://github.com/qwopqwop200/GPTQ-for-LLaMa . Then you can improve the inference speed by using https://github.com/turboderp/exllama .
If you prefer to use an "instruct" model à la ChatGPT (i.e. that does not need few-shot learning to output good results) you can use something like this: https://huggingface.co/TheBloke/Wizard-Vicuna-30B-Uncensored...
-
Loader Types
AutoGPTQ: an attempt at standardizing GPTQ-for-LLaMa and turning it into a library that is easier to install and use, and that supports more models. https://github.com/PanQiWei/AutoGPTQ
- WizardLM-33B-V1.0-Uncensored
-
Any help converting an interesting .bin model to 4 bit 128g GPTQ? Bloke?
Just use the script: https://github.com/PanQiWei/AutoGPTQ/blob/main/examples/quantization/quant_with_alpaca.py
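The "4 bit 128g" in the question refers to group-wise quantization: weights are split into groups of 128, and each group gets its own scale. Below is an illustrative NumPy sketch of that idea only (the function names are made up, and real GPTQ does more than this: it uses second-order information to compensate for rounding error, rather than plain round-to-nearest):

```python
import numpy as np

def quantize_4bit_groupwise(w, group_size=128):
    """Round-to-nearest 4-bit quantization with one scale per group of weights.

    Illustrative only: GPTQ proper additionally corrects rounding error
    using approximate Hessian information.
    """
    w = w.reshape(-1, group_size)                      # one row per group
    scale = np.abs(w).max(axis=1, keepdims=True) / 7   # map each group into int4 range
    scale[scale == 0] = 1.0                            # avoid division by zero
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float weights from int4 codes and group scales."""
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_4bit_groupwise(w, group_size=128)
w_hat = dequantize(q, scale)
err = np.abs(w - w_hat).max()  # maximum reconstruction error, bounded by half a scale step
```

Storage-wise this is why group size matters: smaller groups mean more scales (overhead) but finer-grained quantization, hence lower error.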
-
LLM.int8(): 8-Bit Matrix Multiplication for Transformers at Scale
In the wild, people tend to use GPTQ quantization for pure GPU inference: https://github.com/PanQiWei/AutoGPTQ
And ggml's quant for CPU inference with some offload, which just got updated to a more GPTQ-like method days ago: https://github.com/ggerganov/llama.cpp/pull/1684
Some other runtimes like Apache TVM also have their own quant implementations: https://github.com/mlc-ai/mlc-llm
For training, 4-bit bitsandbytes is SOTA, as far as I know.
TBH I'm not sure why this November paper is being linked. Few are running 8 bit models when they could fit a better 3-5 bit model in the same memory pool.
-
Introducing Basaran: self-hosted open-source alternative to the OpenAI text completion API
Instead of integrating GPTQ-for-LLaMa, use AutoGPTQ.
- AutoGPTQ - An easy-to-use LLMs quantization package with user-friendly apis, based on GPTQ algorithm
self-refine
- Self-Refine: Iterative Refinement with Self-Feedback
-
ChemCrow: Augmenting large-language models with chemistry tools
>the system's operation is well understood
That's like saying human behavior is well understood because we know how neurons communicate signals. It's too low level to be useful, hence psychology.
>They don't have the ability to reason or reflect.
Yes they do
https://selfrefine.info/
https://arxiv.org/abs/2303.11366
-
Generative Agents: Interactive Simulacra of Human Behavior
Literally you've been told it doesn't need that much handholding.
https://selfrefine.info/
-
Large Language Models Are Human-Level Prompt Engineers
For those curious about self-refining systems: https://selfrefine.info/ (our recent work).
-
This AI Paper Introduces SELF-REFINE: A Framework For Improving Initial Outputs From LLMs Through Iterative Feedback And Refinement
Project: https://selfrefine.info/
-
Self-Refine: Iterative Refinement with Self-Feedback - a novel approach that allows LLMs to iteratively refine outputs and incorporate feedback along multiple dimensions to improve performance on diverse tasks
Found relevant code at https://github.com/madaan/self-refine + all code implementations here
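The core of Self-Refine is a generate / critique / revise loop in which a single LLM plays all three roles via different prompts. A minimal control-flow sketch, with a hypothetical `llm` callable standing in for a real model (the prompt wording and stop condition here are illustrative, not the paper's exact prompts):

```python
def self_refine(llm, task, max_iters=4):
    """Iteratively improve an answer using the model's own feedback."""
    output = llm(f"Task: {task}\nAnswer:")
    for _ in range(max_iters):
        feedback = llm(f"Task: {task}\nAnswer: {output}\nCritique:")
        if "looks good" in feedback.lower():  # model judges its own output acceptable
            break
        output = llm(
            f"Task: {task}\nAnswer: {output}\n"
            f"Critique: {feedback}\nImproved answer:"
        )
    return output

# Toy stand-in "LLM" that fixes one flaw per round, just to show the loop runs.
def toy_llm(prompt):
    if prompt.endswith("Critique:"):
        return "add units" if "42" in prompt and "km" not in prompt else "looks good"
    if "Improved answer:" in prompt:
        return "42 km"
    return "42"
```

The `max_iters` cap matters in practice: without it, a model that never approves its own output would loop forever.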
What are some alternatives?
exllama - A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
prompt-lib - A set of utilities for running few-shot prompting experiments on large-language models
llama.cpp - LLM inference in C/C++
temporal-graph-gen - Pre-trained models for our work on Temporal Graph Generation
text-generation-webui - A Gradio web UI for Large Language Models. Supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), Llama models.
basaran - Basaran is an open-source alternative to the OpenAI text completion API. It provides a compatible streaming API for your Hugging Face Transformers-based text generation models.
GPTQ-for-LLaMa - 4 bits quantization of LLaMA using GPTQ
ray-llm - RayLLM - LLMs on Ray
transformers - 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
catai - UI for 🦙 model. Run AI assistant locally ✨
gptq-cuda-api
course - The Hugging Face course on Transformers