70B Llama 2 at 35 tokens/second on 4090

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • exllamav2

    A fast inference library for running LLMs locally on modern consumer-class GPUs

  • Can anyone provide any additional details on the EXL2[0]/GPTQ[1] quantisation, which seems to be the main reason for a speedup in this model?

    I had a quick look at the paper which is _reasonably_ clear, but if anyone else has any other sources that are easy to understand, or a quick explanation to give more insight into it, I'd appreciate it.

    [0] https://github.com/turboderp/exllamav2#exl2-quantization

    [1] https://arxiv.org/abs/2210.17323
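    For a rough intuition, here is a toy sketch of the core GPTQ idea: quantize one weight at a time and fold each rounding error back into the not-yet-quantized weights, using curvature (a Hessian built from calibration inputs) so the layer's outputs stay close to the original. This is a simplified illustration, not code from exllamav2 or the GPTQ repo; the data, grid scale, and dampening value are made up, and the paper's lazy batched updates are omitted.

```python
# Toy GPTQ-style quantization of a single weight row (illustration only).
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, scale):
    """Round to a uniform grid -- a stand-in for low-bit quantization."""
    return np.round(x / scale) * scale

def rtn(w, scale):
    """Baseline: plain round-to-nearest, no error compensation."""
    return quantize(w, scale)

def gptq_like(w, X, scale, damp=0.01):
    """Quantize w left to right, pushing each rounding error onto the remaining weights."""
    H = X.T @ X                                          # proxy Hessian from calibration inputs
    H += damp * np.mean(np.diag(H)) * np.eye(len(w))     # dampening, as in the GPTQ paper
    U = np.linalg.cholesky(np.linalg.inv(H)).T           # upper Cholesky factor of H^-1 (Algorithm 1)
    w = w.copy()
    q = np.zeros_like(w)
    for i in range(len(w)):
        q[i] = quantize(w[i], scale)
        err = (w[i] - q[i]) / U[i, i]
        w[i:] -= err * U[i, i:]                          # spread the error onto later weights
    return q

# Tiny fake layer: 64 calibration rows, 16 input features, one output neuron.
X = rng.normal(size=(64, 16))
w = rng.normal(size=16)
scale = 0.25                                             # coarse grid, roughly "low-bit"

for name, wq in [("RTN", rtn(w, scale)), ("GPTQ-like", gptq_like(w, X, scale))]:
    print(f"{name:10s} output error: {np.linalg.norm(X @ w - X @ wq):.4f}")
```

    At the same bit width, the compensated pass typically leaves a lower layer-output error than plain rounding; per the exllamav2 README linked above, EXL2 builds on that GPTQ-style optimization but additionally mixes quantization levels to hit an average bitrate.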

  • llama.cpp

    LLM inference in C/C++

    If it's better than the equivalent 30B model, then it's still a huge achievement.

    Llama.cpp's Q2_K quant is 2.5625 bpw with perplexity just barely better than the next step down: https://github.com/ggerganov/llama.cpp/pull/1684

    But subjectively, the Q2 quant "feels" worse than its high wikitext perplexity would suggest.

    That's apples to oranges, and this quantization is totally different from Q2_K, but I just hope the quality hit in practice isn't so bad.
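    As a quick back-of-the-envelope check (my arithmetic, not figures from the thread), here is the weight-only memory for a 70B-parameter model at a few bits-per-weight settings; it suggests why the usable range on a single 24 GB 4090 sits around 2.5 bpw once the KV cache and activations are added on top.

```python
# Weight-only footprint of a 70B-parameter model at various bits per weight.
# Ignores KV cache, activations, and per-group scales/zero-points, so real
# usage is somewhat higher.
PARAMS = 70e9
GIB = 1024**3

for bpw in (16.0, 8.0, 4.0, 3.75, 2.5625, 2.0):
    gib = PARAMS * bpw / 8 / GIB
    print(f"{bpw:7.4f} bpw -> {gib:6.1f} GiB")

# 2.5625 bpw (llama.cpp Q2_K) works out to roughly 20.9 GiB of weights,
# which is about the most that fits alongside a KV cache on a 24 GB card.
```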

  • gptq

    Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".


  • OmniQuant

    [ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.

    I think OmniQuant is notable because it shifts the bend of the curve to 3-bit. While < 3-bit still ramps up, it stays usable and doesn't go asymptotic: https://github.com/OpenGVLab/OmniQuant/blob/main/imgs/weight...

    What EXL2 seems to bring to the table is that you can target an arbitrary quantization bit width (e.g., if you're a bit short on VRAM, you don't need to go from 4->3 or 3->2, but can specify, say, 3.75 bpw). You have some control with other schemes by setting group size, or with k-quants, but EXL2 definitely allows finer-grained control. I haven't gotten a chance to sit down with EXL2 yet, but if no one else does it, it's on my todo list to run 1:1 perplexity and standard benchmark evals on all the various new quantization methods, just as a matter of curiosity.
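    To make the "arbitrary average bpw" point concrete, here is a small sketch of how a fractional target like 3.75 bpw can be met by mixing standard bit widths across layers. The layer names, sizes, and sensitivity scores below are invented for illustration; exllamav2's actual quantizer chooses settings by measuring quantization error on calibration data rather than from hand-assigned scores.

```python
# Hypothetical greedy allocator: hit a fractional average bits-per-weight by
# mixing integer bit widths across layers (illustration of the EXL2 idea,
# not exllamav2's actual algorithm).
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    n_params: int       # number of weights in the layer
    sensitivity: float  # assumed: higher = hurts more when quantized coarsely

MENU = [2, 3, 4, 5, 6, 8]   # candidate bit widths

def allocate_bits(layers, target_bpw):
    total_params = sum(l.n_params for l in layers)
    budget = target_bpw * total_params              # total bits we may spend
    bits = {l.name: MENU[0] for l in layers}        # start everything at the cheapest width
    spent = sum(bits[l.name] * l.n_params for l in layers)
    while True:
        # Pick the upgrade with the best sensitivity-per-extra-bit that still fits.
        best = None
        for l in layers:
            i = MENU.index(bits[l.name])
            if i + 1 == len(MENU):
                continue                            # already at the widest setting
            extra = (MENU[i + 1] - MENU[i]) * l.n_params
            if spent + extra > budget:
                continue                            # would blow the bpw budget
            score = l.sensitivity / extra
            if best is None or score > best[0]:
                best = (score, l, extra, MENU[i + 1])
        if best is None:
            break
        _, layer, extra, new_bits = best
        bits[layer.name] = new_bits
        spent += extra
    return bits, spent / total_params               # approaches target_bpw from below

layers = [
    Layer("attn_q", 64_000_000, 1.0),
    Layer("attn_k", 64_000_000, 1.5),
    Layer("attn_v", 64_000_000, 3.0),
    Layer("mlp_up", 176_000_000, 1.2),
    Layer("mlp_down", 176_000_000, 2.0),
]

bits, achieved = allocate_bits(layers, target_bpw=3.75)
print(bits, f"-> average {achieved:.2f} bpw")
```

    Group size and k-quant mixes give similar knobs at coarser steps; letting each layer (or, as EXL2 does, slices within a layer) take a different width is what makes fractional averages like 3.75 bpw reachable.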

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
