And exactly what Triton version are they comparing against? I just tried the latest version, and on my 4090/12900K I get 77 tokens per second for Llama 7B-128g. My own GPTQ CUDA implementation gets 151 tokens/second on the same model and same hardware. That makes it 96% faster, whereas AWQ is only 79% faster. For 30B-128g I'm currently only getting a 110% speedup over Triton compared to their 178%, but it still seems a little disingenuous to compare their own CUDA implementation against the Triton kernel only, when they're trying to present the quantization method as being faster for inference.
GitHub: https://github.com/mit-han-lab/llm-awq
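For reference, the speedup percentages in the comment above are plain throughput ratios; a minimal sketch using the numbers quoted (77 and 151 tokens/second are the commenter's measurements, not official benchmarks):

```python
def speedup_pct(new_tps: float, baseline_tps: float) -> float:
    """Percent throughput improvement of new_tps over a baseline, in tokens/sec."""
    return (new_tps / baseline_tps - 1.0) * 100.0

# Llama 7B-128g on a 4090/12900K, per the comment:
# Triton GPTQ baseline: 77 tok/s, custom CUDA GPTQ: 151 tok/s
print(f"CUDA GPTQ vs Triton: {speedup_pct(151, 77):.0f}% faster")  # ~96% faster
```

The same formula applied to AWQ's reported numbers yields the 79% and 178% figures being compared.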
Summary of the study by Claude-100k if anyone is interested: