QLoRA: Efficient Finetuning of Quantized LLMs

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • qlora

    QLoRA: Efficient Finetuning of Quantized LLMs

  • bitsandbytes

    Accessible large language models via k-bit quantization for PyTorch.
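
    For a sense of what that looks like in practice, here is a minimal sketch of the transformers integration; the model id is an arbitrary placeholder, not something from the thread:

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      model_id = "facebook/opt-1.3b"  # placeholder; any causal LM on the Hub works

      tokenizer = AutoTokenizer.from_pretrained(model_id)
      model = AutoModelForCausalLM.from_pretrained(
          model_id,
          load_in_8bit=True,   # weights quantized to int8 via bitsandbytes at load time
          device_map="auto",   # spread layers across available GPUs/CPU
      )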

  • Tim Dettmers is such a star. He's probably done more to make low-resource LLMs usable than anyone else.

    First bitsandbytes[1] and now this.

    [1] https://github.com/TimDettmers/bitsandbytes

  • guidance

    Discontinued: A guidance language for controlling large language models. [Moved to: https://github.com/guidance-ai/guidance] (by microsoft)

  • This is off-topic, but are there any communities or congregations (that aren't reddit) based around locally hosted LLMs? I ask because while I see a bunch of projects for exposing GGML/LLaMA through OpenAI-compatible interfaces, some UIs, etc., I can't really find a good community or resources for the concept in general.

    I'm working on a front-end for LLMs in general. I've already re-implemented a working version of OpenAI's code interpreter "plugin" within the UI, plus support for the wealth of third-party OpenAI plugins (I've been testing with the first diagram plugin I found, and it works well). I'm planning to open-source it once my breaking changes slow down.

    This field moves very fast. I'm looking for feedback (and, essentially, testers/testing data) on what people want, and for prompts/chat logs/guidance templates (https://github.com/microsoft/guidance) for tasks they expect to "just work" with natural language (see the template sketch after this comment).

    Instead of being limited by ChatGPT Plus's monetization (and its cap on messages every four hours) for extensibility within a chat interface, I want to open it up and free it with a Bring-Your-Own-LLM setup.
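
    For anyone unfamiliar with guidance templates, here is a minimal sketch against the old microsoft/guidance API (the one the linked repo used before the move); the prompt and variable names are made up:

      import guidance

      # point guidance at a backing model; local Transformers models work too
      guidance.llm = guidance.llms.OpenAI("text-davinci-003")

      # a template is plain text with {{variables}} plus {{gen}} slots the LLM fills in
      program = guidance("Q: What is the capital of {{country}}? A: {{gen 'answer' max_tokens=16}}")

      print(program(country="France")["answer"])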

  • GPTQ-for-LLaMa

    4-bit quantization of LLaMA using GPTQ

  • Hold on. I need someone to explain something to me.

    The Colab notebook shows an example of loading the vanilla, unquantized model "decapoda-research/llama-7b-hf" with the flag "load_in_4bit" to load it in 4-bit.

    When... when did this become possible? My understanding, from playing with these models daily for the past few months, is that quantization of LLaMA-based models is done via this: https://github.com/qwopqwop200/GPTQ-for-LLaMa

    And performing the quantization step is expensive in both memory and time, which is why some kind people with large resources perform the quantization and then upload the quantized models, such as this one: https://huggingface.co/TheBloke/wizard-vicuna-13B-GPTQ

    But now I'm seeing that, as of recently, the transformers library can load models in 4-bit simply by passing this flag?

    Is this a free lunch? Is GPTQ-for-LLaMa no longer needed? Or is this still not as good, in terms of inference quality, as the GPTQ-quantized models?
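
    For reference, the flag the commenter mentions goes through bitsandbytes, not GPTQ. A minimal sketch of the QLoRA-style 4-bit load (the model id is taken from the comment; the config values are the standard NF4 settings):

      import torch
      from transformers import AutoModelForCausalLM, BitsAndBytesConfig

      # NF4 is the 4-bit data type introduced with QLoRA; quantization happens
      # on the fly at load time, so there is no separate calibration step
      bnb_config = BitsAndBytesConfig(
          load_in_4bit=True,
          bnb_4bit_quant_type="nf4",
          bnb_4bit_compute_dtype=torch.bfloat16,
      )

      model = AutoModelForCausalLM.from_pretrained(
          "decapoda-research/llama-7b-hf",
          quantization_config=bnb_config,
          device_map="auto",
      )

    That difference is also the answer to the cost question: GPTQ quantizes against a calibration dataset, which is the expensive step, while bitsandbytes quantizes weights on the fly with no calibration data, so the two can differ in inference quality.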

  • gpt4all

    gpt4all: run open-source LLMs anywhere
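
    A minimal sketch of the gpt4all Python bindings (the model filename is illustrative; the file is downloaded on first use):

      from gpt4all import GPT4All

      model = GPT4All("orca-mini-3b.ggmlv3.q4_0.bin")  # illustrative model name
      print(model.generate("Why run an LLM locally?", max_tokens=64))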

  • You might want to try some of the Discord channels connected to some of the repos, e.g. GPT4All (https://github.com/nomic-ai/gpt4all); scroll down in the README for the Discord link.

  • petals

    🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
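
    A minimal sketch of the Petals client; the class and model names follow the project's README and should be treated as illustrative:

      from transformers import AutoTokenizer
      from petals import AutoDistributedModelForCausalLM

      # transformer blocks are served by volunteers across the swarm;
      # only the embeddings and sampling run locally
      model_name = "bigscience/bloom-petals"
      tokenizer = AutoTokenizer.from_pretrained(model_name)
      model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

      inputs = tokenizer("Distributed inference means", return_tensors="pt")["input_ids"]
      outputs = model.generate(inputs, max_new_tokens=8)
      print(tokenizer.decode(outputs[0]))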
