70B Llama 2 at 35 tokens/second on 4090

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • exllamav2

    A fast inference library for running LLMs locally on modern consumer-class GPUs

  • Can anyone provide any additional details on the EXL2[0]/GPTQ[1] quantisation, which seems to be the main reason for a speedup in this model?

    I had a quick look at the paper which is _reasonably_ clear, but if anyone else has any other sources that are easy to understand, or a quick explanation to give more insight into it, I'd appreciate it.

    [0] https://github.com/turboderp/exllamav2#exl2-quantization

    [1] https://arxiv.org/abs/2210.17323
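    For a rough intuition, here is a toy sketch of the core GPTQ idea: quantize one weight at a time and fold each rounding error back into the not-yet-quantized weights, using curvature (a Hessian built from calibration inputs) so the layer's outputs stay close to the original. This is a simplified illustration, not code from exllamav2 or the GPTQ repo; the data, grid scale, and dampening value are made up, and the paper's lazy batched updates are omitted.

```python
# Toy GPTQ-style quantization of a single weight row (illustration only).
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, scale):
    """Round to a uniform grid -- a stand-in for low-bit quantization."""
    return np.round(x / scale) * scale

def rtn(w, scale):
    """Baseline: plain round-to-nearest, no error compensation."""
    return quantize(w, scale)

def gptq_like(w, X, scale, damp=0.01):
    """Quantize w left to right, pushing each rounding error onto the remaining weights."""
    H = X.T @ X                                          # proxy Hessian from calibration inputs
    H += damp * np.mean(np.diag(H)) * np.eye(len(w))     # dampening, as in the GPTQ paper
    U = np.linalg.cholesky(np.linalg.inv(H)).T           # upper Cholesky factor of H^-1 (Algorithm 1)
    w = w.copy()
    q = np.zeros_like(w)
    for i in range(len(w)):
        q[i] = quantize(w[i], scale)
        err = (w[i] - q[i]) / U[i, i]
        w[i:] -= err * U[i, i:]                          # spread the error onto later weights
    return q

# Tiny fake layer: 64 calibration rows, 16 input features, one output neuron.
X = rng.normal(size=(64, 16))
w = rng.normal(size=16)
scale = 0.25                                             # coarse grid, roughly "low-bit"

for name, wq in [("RTN", rtn(w, scale)), ("GPTQ-like", gptq_like(w, X, scale))]:
    print(f"{name:10s} output error: {np.linalg.norm(X @ w - X @ wq):.4f}")
```

    At the same bit width, the compensated pass typically leaves a lower layer-output error than plain rounding; per the exllamav2 README linked above, EXL2 builds on that GPTQ-style optimization but additionally mixes quantization levels to hit an average bitrate.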

  • llama.cpp

    LLM inference in C/C++

    If it's better than the equivalent 30B model, then it's still a huge achievement.

    Llama.cpp's Q2_K quant is 2.5625 bpw with perplexity just barely better than the next step down: https://github.com/ggerganov/llama.cpp/pull/1684

    But subjectively, the Q2 quant "feels" worse than its high wikitext perplexity would suggest.

    That's apples to oranges, and this quantization is totally different from Q2_K, but I just hope the quality hit in practice isn't so bad.
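    As a quick back-of-the-envelope check (my arithmetic, not figures from the thread), here is the weight-only memory for a 70B-parameter model at a few bits-per-weight settings; it suggests why the usable range on a single 24 GB 4090 sits around 2.5 bpw once the KV cache and activations are added on top.

```python
# Weight-only footprint of a 70B-parameter model at various bits per weight.
# Ignores KV cache, activations, and per-group scales/zero-points, so real
# usage is somewhat higher.
PARAMS = 70e9
GIB = 1024**3

for bpw in (16.0, 8.0, 4.0, 3.75, 2.5625, 2.0):
    gib = PARAMS * bpw / 8 / GIB
    print(f"{bpw:7.4f} bpw -> {gib:6.1f} GiB")

# 2.5625 bpw (llama.cpp Q2_K) works out to roughly 20.9 GiB of weights,
# which is about the most that fits alongside a KV cache on a 24 GB card.
```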

  • gptq

    Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".


  • OmniQuant

    [ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.

    I think OmniQuant is notable because it shifts the bend of the curve to 3-bit. While < 3-bit still ramps up, it stays usable and doesn't go asymptotic: https://github.com/OpenGVLab/OmniQuant/blob/main/imgs/weight...

    What EXL2 seems to bring to the table is that you can target an arbitrary quantization bit width (e.g., if you're a bit short on VRAM, you don't need to go from 4->3 or 3->2, but can specify, say, 3.75 bpw). You have some control with other schemes by setting group size, or with k-quants, but EXL2 definitely allows finer-grained control. I haven't gotten a chance to sit down with EXL2 yet, but if no one else does it, it's on my todo list to run 1:1 perplexity and standard benchmark evals on all the various new quantization methods, just as a matter of curiosity.
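    To make the "arbitrary average bpw" point concrete, here is a small sketch of how a fractional target like 3.75 bpw can be met by mixing standard bit widths across layers. The layer names, sizes, and sensitivity scores below are invented for illustration; exllamav2's actual quantizer chooses settings by measuring quantization error on calibration data rather than from hand-assigned scores.

```python
# Hypothetical greedy allocator: hit a fractional average bits-per-weight by
# mixing integer bit widths across layers (illustration of the EXL2 idea,
# not exllamav2's actual algorithm).
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    n_params: int       # number of weights in the layer
    sensitivity: float  # assumed: higher = hurts more when quantized coarsely

MENU = [2, 3, 4, 5, 6, 8]   # candidate bit widths

def allocate_bits(layers, target_bpw):
    total_params = sum(l.n_params for l in layers)
    budget = target_bpw * total_params              # total bits we may spend
    bits = {l.name: MENU[0] for l in layers}        # start everything at the cheapest width
    spent = sum(bits[l.name] * l.n_params for l in layers)
    while True:
        # Pick the upgrade with the best sensitivity-per-extra-bit that still fits.
        best = None
        for l in layers:
            i = MENU.index(bits[l.name])
            if i + 1 == len(MENU):
                continue                            # already at the widest setting
            extra = (MENU[i + 1] - MENU[i]) * l.n_params
            if spent + extra > budget:
                continue                            # would blow the bpw budget
            score = l.sensitivity / extra
            if best is None or score > best[0]:
                best = (score, l, extra, MENU[i + 1])
        if best is None:
            break
        _, layer, extra, new_bits = best
        bits[layer.name] = new_bits
        spent += extra
    return bits, spent / total_params               # approaches target_bpw from below

layers = [
    Layer("attn_q", 64_000_000, 1.0),
    Layer("attn_k", 64_000_000, 1.5),
    Layer("attn_v", 64_000_000, 3.0),
    Layer("mlp_up", 176_000_000, 1.2),
    Layer("mlp_down", 176_000_000, 2.0),
]

bits, achieved = allocate_bits(layers, target_bpw=3.75)
print(bits, f"-> average {achieved:.2f} bpw")
```

    Group size and k-quant mixes give similar knobs at coarser steps; letting each layer (or, as EXL2 does, slices within a layer) take a different width is what makes fractional averages like 3.75 bpw reachable.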

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
