petals
🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
Tim Dettmers is such a star. He's probably done more to make low-resource LLMs usable than anyone else.
First bitsandbytes[1] and now this.
This is off-topic, but are there any communities or congregations (that aren't reddit) based around locally hosted LLMs? I'm asking because while I see a bunch of projects for exposing GGML/LLaMA to OpenAI compatible interfaces, some UIs, etc, I can't really find a good community or resources for the concept in general.
I'm working on a front-end for LLMs in general. I've already re-implemented a working version of OpenAI's code interpreter "plugin" within the UI, plus support for the wealth of third-party OpenAI plugins (I've been testing with the first diagram plugin I found, and it works well). I'm planning to open-source it once my breaking changes slow down.
This field moves very fast. I'm looking for feedback (and, essentially, testers/testing data) on what people want, and for prompts/chat logs/guidance templates (https://github.com/microsoft/guidance) for tasks they expect to "just work" with natural language.
Instead of being limited by ChatGPT Plus's monetization (and its cap on messages every four hours) for extensibility within a chat interface, I want to open it up and free it, with a Bring-Your-Own-LLM setup.
Hold on. I need someone to explain something to me.
The colab notebook shows an example of loading the vanilla, unquantized model "decapoda-research/llama-7b-hf", using the flag "load_in_4bit" to load it in 4-bit.
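For reference, the call presumably looks something like the sketch below (my assumption of what the notebook does, not a copy of it; it requires transformers >= 4.30 with bitsandbytes and accelerate installed, plus a CUDA GPU). The load is wrapped in a function here so nothing heavy runs on import:

```python
def load_llama_4bit(model_id: str = "decapoda-research/llama-7b-hf"):
    """Sketch: load a full-precision checkpoint, quantizing to 4-bit on the fly.

    Assumes transformers >= 4.30, bitsandbytes, and accelerate are installed
    and a CUDA GPU is available; the import is deferred so merely defining
    this function costs nothing.
    """
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        load_in_4bit=True,   # quantize weights to 4-bit while loading
        device_map="auto",   # spread layers across available GPUs/CPU
    )
```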
When... when did this become possible? My understanding, from playing with these models daily for the past few months, is that quantization of LLaMA-based models is done via this: https://github.com/qwopqwop200/GPTQ-for-LLaMa
And performing the quantization step is memory and time expensive. Which is why some kind people with large resources are performing the quantization, and then uploading those quantized models, such as this one: https://huggingface.co/TheBloke/wizard-vicuna-13B-GPTQ
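Back-of-the-envelope numbers make the appeal concrete (my own arithmetic, not from either repo): weights alone at 4-bit are a quarter the size of fp16, which is why people want someone else to do the expensive quantization step and just download the result.

```python
def weight_gib(n_params: float, bits: int) -> float:
    """Approximate weight memory in GiB for n_params parameters at a given bit width."""
    return n_params * bits / 8 / 2**30

# LLaMA-7B weights alone (ignoring activations and quantization overhead):
fp16 = weight_gib(7e9, 16)  # roughly 13 GiB
int4 = weight_gib(7e9, 4)   # roughly 3.3 GiB
```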
But now I'm seeing that, as of recently, the transformers library is capable of loading models in 4bits simply by passing this flag?
Is this a free lunch? Is GPTQ-for-LLaMA no longer needed? Or is this still not as good, in terms of inference quality, as the GPTQ-quantized models?
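My understanding of the difference, for what it's worth: the bitsandbytes 4-bit path is data-free, essentially rounding each small block of weights against that block's largest magnitude, so it can happen at load time; GPTQ instead solves a per-layer error-minimization problem against calibration data, which is the slow, memory-hungry step people pre-compute and upload. A toy illustration of the data-free idea (my own simplified sketch, not the actual NF4 code, which uses a non-uniform code book):

```python
def quantize_block(xs, bits=4):
    """Toy blockwise absmax quantization: scale by the block's max magnitude,
    then round each value onto a signed integer grid (no calibration data)."""
    levels = 2 ** (bits - 1) - 1          # 7 signed levels for 4-bit
    absmax = max(abs(x) for x in xs)
    scale = absmax / levels if absmax else 1.0
    return [round(x / scale) for x in xs], scale

def dequantize_block(qs, scale):
    """Lossy reconstruction: multiply each integer back by the block's scale."""
    return [q * scale for q in qs]

q, s = quantize_block([1.0, -0.5, 0.25])
approx = dequantize_block(q, s)           # close to, not equal to, the input
```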
You might want to try some of the Discord channels connected to some of the repos, e.g. GPT4All (https://github.com/nomic-ai/gpt4all); scroll down the README for the Discord link.