text-generation-webui
A Gradio web UI for Large Language Models. Supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), Llama models.
Some results here: https://github.com/ggerganov/llama.cpp/discussions/406
tl;dr quantizing the 13B model gives up about 30% of the improvement you get from moving from 7B to 13B - so quantized 13B is still much better than unquantized 7B. Similar results for the larger models.
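To make that arithmetic concrete, here is a minimal sketch; the perplexity values are invented placeholders for illustration, not numbers from the linked discussion:

```python
# Rough arithmetic behind the "gives up ~30% of the improvement" claim.
# The perplexity values below are illustrative placeholders, NOT
# measurements from the linked thread.
ppl_7b_f16 = 6.0     # hypothetical 7B full-precision perplexity
ppl_13b_f16 = 5.0    # hypothetical 13B full-precision perplexity

gain = ppl_7b_f16 - ppl_13b_f16           # improvement from 7B -> 13B
ppl_13b_q4 = ppl_13b_f16 + 0.30 * gain    # quantization gives back ~30% of it

print(ppl_13b_q4)  # 5.3 -- still well below the 7B baseline of 6.0
```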
I wonder where such a difference between llama.cpp and the [1] repo comes from. The F16 perplexity difference is 0.3 on the 7B model, which is not insignificant. The ggml quirks definitely need to be fixed.
[1] https://github.com/qwopqwop200/GPTQ-for-LLaMa
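For context on why a 0.3 gap could come from implementation details alone: perplexity is just the exponential of the mean per-token negative log-likelihood, so anything that shifts the NLL slightly (tokenization, context stride, F16 vs. F32 accumulation) shifts the reported number. A minimal sketch of the definition:

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token).

    Small implementation differences (tokenizer, context window,
    eval stride, accumulation precision) all move this number,
    which may account for part of the gap between repos.
    """
    return math.exp(sum(token_nlls) / len(token_nlls))

# Example: a mean NLL of ~1.7 nats/token gives perplexity ~5.5.
print(perplexity([1.6, 1.8, 1.7, 1.7]))
```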
Define "comprehensive"?
There are some benchmarks here: https://www.reddit.com/r/LocalLLaMA/comments/1248183/i_am_cu... and here: https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-i...
Check out the original GPTQ paper, which has some benchmarks: https://arxiv.org/pdf/2210.17323.pdf, and this paper, which also has benchmarks and explains how they determined that 4-bit quantization is optimal compared to 3-bit: https://arxiv.org/pdf/2212.09720.pdf
I also think the discussion of that second paper here is interesting, though it doesn't have its own benchmarks: https://github.com/oobabooga/text-generation-webui/issues/17...
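For anyone who wants the core idea in code: here is a minimal sketch of plain round-to-nearest 4-bit quantization, the naive baseline those papers measure GPTQ against. GPTQ's error-correcting weight updates are not shown, and real implementations use per-group scales rather than one scale per tensor.

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Symmetric round-to-nearest quantization: the simple baseline
    the papers above compare against. GPTQ itself minimizes layer-wise
    reconstruction error and is more involved than this sketch."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for signed 4-bit
    scale = np.abs(w).max() / qmax             # one scale per tensor here
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_rtn(w, bits=4)
print(np.abs(w - dequantize(q, scale)).max())  # worst-case quantization error
```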
I'd assume that a 33B model should fit with this (the only repo I know of that implements SparseGPT and GPTQ for LLaMA): https://github.com/lachlansneff/sparsellama
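Rough sizing behind that assumption (a back-of-the-envelope sketch; the 50% sparsity level is just an example, and real sparse formats carry index overhead):

```python
def approx_weight_gb(n_params_billion, bits, sparsity=0.0):
    """Back-of-the-envelope weight footprint. Ignores quantization
    scales/zero-points, sparse-index overhead, KV cache, and activations."""
    dense_gb = n_params_billion * bits / 8   # 1e9 params * (bits/8) bytes / 1e9
    return dense_gb * (1.0 - sparsity)

print(approx_weight_gb(33, 16))               # ~66 GB at F16
print(approx_weight_gb(33, 4))                # ~16.5 GB at 4-bit
print(approx_weight_gb(33, 4, sparsity=0.5))  # ~8.25 GB at 4-bit + 50% sparse
```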
I'm curious whether someone will have to port these enhancements elsewhere, e.g. https://github.com/rustformers/llama-rs