llm_finetuning
Convenient wrapper for fine-tuning and inference of Large Language Models (LLMs) with several quantization techniques (GPTQ, bitsandbytes)
exllama
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
I also created a short summary at https://github.com/taprosoft/llm_finetuning/blob/main/benchmark/README.md comparing the performance of popular quantization techniques. GPTQ seems to hold a clear speed advantage over 4-bit quantization from bitsandbytes.
Following up on u/tloen's popular alpaca-lora work, I wrapped the setup of alpaca_lora_4bit to add support for GPTQ training in the form of installable pip packages. You can perform training and inference with multiple quantization methods to compare the results.
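To make the speed/accuracy trade-off concrete, here is a toy sketch of blockwise absmax 4-bit quantization in plain NumPy. The function names and block size are illustrative assumptions, and real GPTQ/bitsandbytes kernels are considerably more sophisticated; this only shows the basic round-trip.

```python
import numpy as np

def quantize_4bit(weights, block_size=64):
    """Toy blockwise absmax 4-bit quantization (illustrative only --
    not the actual GPTQ or bitsandbytes algorithm)."""
    w = weights.reshape(-1, block_size)
    # One scale per block; signed int4 usable range here is -7..7.
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_4bit(q, scales, shape):
    """Recover approximate float weights from int4 codes and scales."""
    return (q.astype(np.float32) * scales).reshape(shape)

w = np.random.randn(4, 64).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s, w.shape)
err = np.abs(w - w_hat).max()  # worst-case error is about scale / 2 per block
```

The per-block scale is what keeps the error bounded: each value is off by at most half a quantization step of its own block.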
For the inference step, this repo can help you use ExLlama to run inference on an evaluation dataset for the best throughput.
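Since the repo's exact evaluation entry point isn't shown here, a minimal backend-agnostic tokens-per-second harness might look like the sketch below; `generate_fn` and `dummy_generate` are hypothetical stand-ins, not the repo's or ExLlama's API.

```python
import time

def measure_throughput(generate_fn, prompts):
    """Generic tokens/sec harness. `generate_fn` is a stand-in for
    whatever backend (ExLlama, bitsandbytes, ...) you want to compare;
    it should return the number of tokens produced for one prompt."""
    start = time.perf_counter()
    total_tokens = sum(generate_fn(p) for p in prompts)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed

# Dummy backend standing in for a real quantized model:
def dummy_generate(prompt):
    return len(prompt.split())  # pretend one token per input word

tps = measure_throughput(dummy_generate, ["hello world", "a b c"])
```

Timing the same prompt set against each backend with a harness like this is one simple way to reproduce the kind of comparison in the benchmark README.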