Llama Is Expensive

exllama

64 2,631 9.0 Python

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

> We serve Llama on 2 80-GB A100 GPUs, as that is the minimum required to fit Llama in memory (with 16-bit precision)
Well, there is your problem.
LLaMA quantized to 4 bits fits in 40GB. It also gets similar throughput when split across two consumer GPUs, which likely means better throughput on a single 40GB A100 (or a cheaper 48GB Pro GPU).
https://github.com/turboderp/exllama#dual-gpu-results
Also, I'm not sure which model was tested, but Llama 70B chat should perform better than the base model if the prompting syntax is right. That syntax was only recently reverse engineered from Meta's demo implementation.
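
To make the memory claim above concrete, here is a rough back-of-the-envelope sketch. It assumes a 70B-parameter model (the comment does not confirm which model was tested) and counts weights only, ignoring the KV cache, activations, and the per-group scale overhead that real 4-bit formats add. The function names are illustrative and not part of exllama.

```python
import math


def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def gpus_needed(memory_gb: float, gpu_memory_gb: float) -> int:
    """Smallest number of identical GPUs whose combined memory holds the weights."""
    return math.ceil(memory_gb / gpu_memory_gb)


N_PARAMS = 70e9  # assumption: the 70B Llama variant

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 16-bit precision
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit quantization

print(f"16-bit: {fp16_gb:.0f} GB -> {gpus_needed(fp16_gb, 80)} x 80GB GPU(s)")
print(f" 4-bit: {int4_gb:.0f} GB -> {gpus_needed(int4_gb, 40)} x 40GB GPU(s)")
```

Under these assumptions the 16-bit weights need about 140 GB, hence two 80GB A100s, while the 4-bit weights need about 35 GB and fit on one 40GB card with little headroom left for context.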


This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com. Post date: 20 Jul 2023
