petals
🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
Tim Dettmers is such a star. He's probably done more to make low-resource LLMs usable than anyone else.
First bitsandbytes[1] and now this.
This is off-topic, but are there any communities or congregations (that aren't reddit) based around locally hosted LLMs? I'm asking because while I see a bunch of projects for exposing GGML/LLaMA to OpenAI compatible interfaces, some UIs, etc, I can't really find a good community or resources for the concept in general.
I'm working on a front-end for LLMs in general. I've already re-implemented a working version of OpenAI's code interpreter "plugin" within the UI, plus support for the wealth of third-party OpenAI plugins (I've been testing with the first diagram plugin I found, and it works well). I'm planning to open-source it once my breaking changes slow down.
This field moves very fast. I'm looking for feedback (and, essentially, testers/testing data) on what people want, and for prompts/chat logs/guidance templates (https://github.com/microsoft/guidance) for tasks they expect to "just work" with natural language.
Instead of being limited by ChatGPT Plus's monetization (and its cap on messages every four hours) for extensibility within a chat interface, I want to open it up and free it, with a Bring-Your-Own-LLM setup.
Hold on. I need someone to explain something to me.
The colab notebook shows an example of loading the vanilla, unquantized model "decapoda-research/llama-7b-hf", using the flag "load_in_4bit" to load it in 4-bit.
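For reference, the call presumably looks something like the sketch below (my assumption of what the notebook does, not a copy of it; it requires transformers >= 4.30 with bitsandbytes and accelerate installed, plus a CUDA GPU). The load is wrapped in a function here so nothing heavy runs on import:

```python
def load_llama_4bit(model_id: str = "decapoda-research/llama-7b-hf"):
    """Sketch: load a full-precision checkpoint, quantizing to 4-bit on the fly.

    Assumes transformers >= 4.30, bitsandbytes, and accelerate are installed
    and a CUDA GPU is available; the import is deferred so merely defining
    this function costs nothing.
    """
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        load_in_4bit=True,   # quantize weights to 4-bit while loading
        device_map="auto",   # spread layers across available GPUs/CPU
    )
```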
When... when did this become possible? My understanding, from playing with these models daily for the past few months, is that quantization of LLaMA-based models is done via this: https://github.com/qwopqwop200/GPTQ-for-LLaMa
And performing the quantization step is memory and time expensive. Which is why some kind people with large resources are performing the quantization, and then uploading those quantized models, such as this one: https://huggingface.co/TheBloke/wizard-vicuna-13B-GPTQ
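Back-of-the-envelope numbers make the appeal concrete (my own arithmetic, not from either repo): weights alone at 4-bit are a quarter the size of fp16, which is why people want someone else to do the expensive quantization step and just download the result.

```python
def weight_gib(n_params: float, bits: int) -> float:
    """Approximate weight memory in GiB for n_params parameters at a given bit width."""
    return n_params * bits / 8 / 2**30

# LLaMA-7B weights alone (ignoring activations and quantization overhead):
fp16 = weight_gib(7e9, 16)  # roughly 13 GiB
int4 = weight_gib(7e9, 4)   # roughly 3.3 GiB
```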
But now I'm seeing that, as of recently, the transformers library is capable of loading models in 4bits simply by passing this flag?
Is this a free lunch? Is GPTQ-for-LLaMA no longer needed? Or is this still not as good, in terms of inference quality, as the GPTQ-quantized models?
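My understanding of the difference, for what it's worth: the bitsandbytes 4-bit path is data-free, essentially rounding each small block of weights against that block's largest magnitude, so it can happen at load time; GPTQ instead solves a per-layer error-minimization problem against calibration data, which is the slow, memory-hungry step people pre-compute and upload. A toy illustration of the data-free idea (my own simplified sketch, not the actual NF4 code, which uses a non-uniform code book):

```python
def quantize_block(xs, bits=4):
    """Toy blockwise absmax quantization: scale by the block's max magnitude,
    then round each value onto a signed integer grid (no calibration data)."""
    levels = 2 ** (bits - 1) - 1          # 7 signed levels for 4-bit
    absmax = max(abs(x) for x in xs)
    scale = absmax / levels if absmax else 1.0
    return [round(x / scale) for x in xs], scale

def dequantize_block(qs, scale):
    """Lossy reconstruction: multiply each integer back by the block's scale."""
    return [q * scale for q in qs]

q, s = quantize_block([1.0, -0.5, 0.25])
approx = dequantize_block(q, s)           # close to, not equal to, the input
```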
You might want to try some of the Discord channels connected to some of the repos, e.g. GPT4All (https://github.com/nomic-ai/gpt4all); scroll down the README for the Discord link.