llama.cpp
exllamav2
| llama.cpp | exllamav2 | |
|---|---|---|
| 1,032 | 19 | |
| 115,929 | 4,550 | |
| 7.4% | 1.1% | |
| 10.0 | 6.7 | |
| 3 days ago | 3 months ago | |
| C++ | Python | |
| MIT License | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
llama.cpp
-
How to Setup a Local Coding Agent on macOS
> The benchmark prompt was:
> Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.
> Each benchmark generated about 128 tokens.
Generating 128 tokens is probably not enough for good benchmark results. MTP speedup depends on how often the predicted tokens are accepted. In my experience, the very early output has a higher acceptance rate, so short testing can give false positive speedups.
Also llama.cpp includes a tool specifically for benchmarking:
https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
-
Doubling Qwen3.6-27B on One RTX 3090: ollama llama.cpp + MTP, Lever by Lever (35.7 80.2 tok/s)
In my build, MTP came from mainline llama.cpp, not ik_llama. ik_llama got me to ~47 (engine + quant), but I couldn't get MTP running there ā my build rejected the -mtp flags and ignored the model's nextn tensors. Mainline llama.cpp added MTP fairly recently (PR #22673, merged 2026-05-16), and that's where it worked for me. (There may well be an ik_llama path I missed ā this is just what got it going on my box.)
- New `llama.cpp` Updates, AI Agents for Any LLM, and Quantized Vector Index for Local Inference
- Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency
-
Introducing LlamaStash: a zero-overhead, terminal-native llama.cpp launcher
That script grew up. Today I'm releasing LlamaStash, the first public release of a fast, cross-platform, terminal-native launcher for llama.cpp with zero overhead.
-
How fast is LlamaStash? Overhead, throughput, and a fair comparison with Ollama and LM Studio
LlamaStash spawns the unmodified upstream llama-server. So three different questions follow from that, and there is a benchmark suite for each.
-
A 10 year old Xeon is all you need (for 26B-A4B MTP Drafters without GPU)
llama.cpp includes a benchmarking tool called llama-bench https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
ik_llama includes llama-sweep-bench https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples...
When comparing hardware, the output of these tools is very helpful to let others put it into context. The post says the output is "reading speed" but knowing the prefill and token generation speeds would be a lot more helpful.
-
Racket v9.2 is now available
lol the same way we implement all of the reduced precision fp8, fp4 types today: by storing them in the corresponding uint:
https://github.com/ggml-org/llama.cpp/discussions/15095
- Run Gemma-4 E2B-it with llama.cpp on Raspberry Pi4
-
Gemma 4 dense by default: why your local agent doesn't want the MoE
# Build llama.cpp with Metal backend git clone https://github.com/ggml-org/llama.cpp cd llama.cpp && cmake -B build -DGGML_METAL=ON && cmake --build build -j # Community-quantized GGUFs (Google ships safetensors; unsloth ships GGUF) huggingface-cli download unsloth/gemma-4-31B-it-GGUF \ gemma-4-31B-it-Q4_K_M.gguf --local-dir . huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \ gemma-4-26B-A4B-it-Q4_K_M.gguf --local-dir . # Benchmark: 200 generations of 512 tokens, log per-call timing ./build/bin/llama-bench -m gemma-4-31B-it-Q4_K_M.gguf -n 512 -r 200 -o json > dense.json ./build/bin/llama-bench -m gemma-4-26B-A4B-it-Q4_K_M.gguf -n 512 -r 200 -o json > moe.json
exllamav2
- Codestral Mamba
-
Why I made TabbyAPI
Today, Iām going to focus on my most popular project, TabbyAPI. TabbyAPI is a python based FastAPI server that allows users to interact with Large Language Models (or LLMs) using the ExllamaV2 library and adheres to the OpenAI API specification.
- Running Llama3 Locally
-
Mixture-of-Depths: Dynamically allocating compute in transformers
There are already some implementations out there which attempt to accomplish this!
Here's an example: https://github.com/silphendio/sliced_llama
A gist pertaining to said example: https://gist.github.com/silphendio/535cd9c1821aa1290aa10d587...
Here's a discussion about integrating this capability with ExLlama: https://github.com/turboderp/exllamav2/pull/275
And same as above but for llama.cpp: https://github.com/ggerganov/llama.cpp/issues/4718#issuecomm...
-
What do you use to run your models?
Sorry, I'm somewhat familiar with this term (I've seen it as a model loader in Oobabooga), but still not following the correlation here. Are you saying I should instead be using this project in lieu of llama.cpp? Or are you saying that there is, perhaps, an exllamav2 "extension" or similar within llama.cpp that I can use?
-
I just started having problems with the colab again. I get errors and it just stops. Help?
EDIT: I reported the bug to the exllamav2 Github. It's actually already fixed, just not on any current built release.
-
Yi-34B-200K works on a single 3090 with 47K context/4bpw
install exllamav2 from git with pip install git+https://github.com/turboderp/exllamav2.git. Make sure you have flash attention 2 as well.
-
Tested: ExllamaV2's max context on 24gb with 70B low-bpw & speculative sampling performance
Recent releases for exllamav2 brings working fp8 cache support, which I've been very excited to test. This feature doubles the maximum context length you can run with your model, without any visible downsides.
-
Show HN: Phind Model beats GPT-4 at coding, with GPT-3.5 speed and 16k context
Without batching, I was actually thinking that's kind of modest.
ExllamaV2 will get 48 tokens/s on a 4090, which is much slower/cheaper than an H100:
https://github.com/turboderp/exllamav2#performance
I didn't test codellama, but the 3090 TI figures are in the ballpark of my generation speed on a 3090.
-
Guide for Llama2 70b model merging and exllama2 quantization
First, you need the convert.py script from turboderp's Exllama2 repo. You can read all about the convert.py arguments here.
What are some alternatives?
koboldcpp - Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
ollama - Get up and running with Kimi-K2.6, GLM-5.1, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.
unsloth - Unsloth Studio is a web UI for training and running open models like Gemma 4, Qwen3.6, DeepSeek, gpt-oss locally.
tabbyAPI - The official API server for Exllama. OAI compatible, lightweight, and fast.
mlc-llm - Universal LLM Deployment Engine with ML Compilation