llama.cpp
exllama
| | llama.cpp | exllama |
|---|---|---|
| Mentions | 1,032 | 66 |
| Stars | 115,929 | 2,914 |
| Growth | 7.4% | 0.0% |
| Activity | 10.0 | 9.0 |
| Last commit | 3 days ago | over 2 years ago |
| Language | C++ | Python |
| License | MIT License | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
llama.cpp
-
How to Setup a Local Coding Agent on macOS
> The benchmark prompt was:
> Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.
> Each benchmark generated about 128 tokens.
Generating 128 tokens is probably not enough for good benchmark results. MTP speedup depends on how often the predicted tokens are accepted. In my experience, the very early output has a higher acceptance rate, so short tests can overstate the speedup.
Also, llama.cpp includes a tool specifically for benchmarking:
https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
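To illustrate why the acceptance rate dominates the measured speedup, here is a minimal sketch. It assumes `k` tokens are drafted per step and that each is accepted independently with probability `p`, stopping at the first rejection. That independence assumption and the function name are illustrative only; this is not how llama.cpp measures or reports anything.

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens emitted per verification step.

    Assumption: k drafted tokens, each accepted independently with
    probability p, stopping at the first rejection. The verifying model
    always contributes one token itself, so the result is at least 1.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    # 1 + p + p^2 + ... + p^k
    return sum(p ** i for i in range(k + 1))


if __name__ == "__main__":
    # Hypothetical acceptance rates, not measurements.
    for p in (0.9, 0.7, 0.5):
        print(f"p={p}: {expected_tokens_per_step(p, 3):.2f} tokens/step")
```

If the first 128 tokens happen to have a higher `p` than the rest of a long generation, a short benchmark reports more tokens per step than a realistic workload would see.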
-
Doubling Qwen3.6-27B on One RTX 3090: ollama → llama.cpp + MTP, Lever by Lever (35.7 → 80.2 tok/s)
In my build, MTP came from mainline llama.cpp, not ik_llama. ik_llama got me to ~47 (engine + quant), but I couldn't get MTP running there — my build rejected the -mtp flags and ignored the model's nextn tensors. Mainline llama.cpp added MTP fairly recently (PR #22673, merged 2026-05-16), and that's where it worked for me. (There may well be an ik_llama path I missed — this is just what got it going on my box.)
- New `llama.cpp` Updates, AI Agents for Any LLM, and Quantized Vector Index for Local Inference
- Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency
-
Introducing LlamaStash: a zero-overhead, terminal-native llama.cpp launcher
That script grew up. Today I'm releasing LlamaStash, the first public release of a fast, cross-platform, terminal-native launcher for llama.cpp with zero overhead.
-
How fast is LlamaStash? Overhead, throughput, and a fair comparison with Ollama and LM Studio
LlamaStash spawns the unmodified upstream llama-server. So three different questions follow from that, and there is a benchmark suite for each.
-
A 10 year old Xeon is all you need (for 26B-A4B MTP Drafters without GPU)
llama.cpp includes a benchmarking tool called llama-bench https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
ik_llama includes llama-sweep-bench https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples...
When comparing hardware, the output of these tools is very helpful to let others put it into context. The post says the output is "reading speed" but knowing the prefill and token generation speeds would be a lot more helpful.
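A rough sketch of why the two numbers matter separately: under the simplifying assumption that a request costs prompt processing time plus generation time, the same hardware can feel very different depending on prompt length. The function name and the example figures are hypothetical, not benchmark results.

```python
def request_seconds(prompt_tokens: int, gen_tokens: int,
                    prefill_tps: float, gen_tps: float) -> float:
    """Estimate wall-clock time for one request.

    Assumption: total time = prompt_tokens / prefill speed
                           + gen_tokens / generation speed,
    ignoring model load time, sampling overhead, and cache reuse.
    """
    if prefill_tps <= 0 or gen_tps <= 0:
        raise ValueError("speeds must be positive")
    return prompt_tokens / prefill_tps + gen_tokens / gen_tps


if __name__ == "__main__":
    # Hypothetical speeds: same generation rate, different prompt sizes.
    print(request_seconds(200, 300, prefill_tps=50.0, gen_tps=10.0))
    print(request_seconds(8000, 300, prefill_tps=50.0, gen_tps=10.0))
```

With a long prompt, prefill speed rather than generation speed decides how long the user waits, which a single "reading speed" figure cannot show.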
-
Racket v9.2 is now available
lol the same way we implement all of the reduced precision fp8, fp4 types today: by storing them in the corresponding uint:
https://github.com/ggml-org/llama.cpp/discussions/15095
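As a sketch of what "storing them in the corresponding uint" can look like, the snippet below decodes one byte as an FP8 value in the E4M3 layout (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7, no infinities). The choice of E4M3 and the function name are assumptions made for illustration; this is not code from llama.cpp or from the linked discussion.

```python
import math


def decode_fp8_e4m3(byte: int) -> float:
    """Interpret an 8-bit unsigned integer as an FP8 E4M3 value."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in 0..255")
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan  # the only NaN pattern (per sign) in this layout
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 1 - bias
        return sign * (mantissa / 8.0) * 2.0 ** -6
    # Normal: implicit leading 1, exponent biased by 7
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


if __name__ == "__main__":
    for raw in (0x00, 0x01, 0x38, 0x7E, 0xB8):
        print(f"0x{raw:02X} -> {decode_fp8_e4m3(raw)}")
```

The storage type is just a `uint8`; all floating-point meaning lives in the decode (and matching encode) routine.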
- Run Gemma-4 E2B-it with llama.cpp on Raspberry Pi4
-
Gemma 4 dense by default: why your local agent doesn't want the MoE

    # Build llama.cpp with Metal backend
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp && cmake -B build -DGGML_METAL=ON && cmake --build build -j

    # Community-quantized GGUFs (Google ships safetensors; unsloth ships GGUF)
    huggingface-cli download unsloth/gemma-4-31B-it-GGUF \
      gemma-4-31B-it-Q4_K_M.gguf --local-dir .
    huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \
      gemma-4-26B-A4B-it-Q4_K_M.gguf --local-dir .

    # Benchmark: 200 generations of 512 tokens, log per-call timing
    ./build/bin/llama-bench -m gemma-4-31B-it-Q4_K_M.gguf -n 512 -r 200 -o json > dense.json
    ./build/bin/llama-bench -m gemma-4-26B-A4B-it-Q4_K_M.gguf -n 512 -r 200 -o json > moe.json
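One possible way to compare the two result files afterwards is sketched below. It assumes each file holds a JSON array of result objects and that each object carries an average tokens-per-second field named `avg_ts`. That field name is an assumption about the llama-bench JSON output, so check it against your own files. The helper names are made up for illustration.

```python
import json
from pathlib import Path


def load_rows(path: str) -> list:
    """Load a benchmark result file, assumed to be a JSON array of objects."""
    return json.loads(Path(path).read_text())


def mean_tokens_per_second(rows: list, field: str = "avg_ts") -> float:
    """Average the throughput field across all result rows.

    Assumption: `field` names the tokens-per-second value in each row;
    pass a different name if your output uses one.
    """
    values = [float(row[field]) for row in rows if field in row]
    if not values:
        raise ValueError(f"no rows contain the field {field!r}")
    return sum(values) / len(values)


def compare(dense_path: str, moe_path: str) -> float:
    """Return MoE throughput divided by dense throughput."""
    dense = mean_tokens_per_second(load_rows(dense_path))
    moe = mean_tokens_per_second(load_rows(moe_path))
    return moe / dense

# Example, once both files exist:
#   print(compare("dense.json", "moe.json"))
```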
exllama
-
Qwen2.5-VL-32B: Smarter and Lighter
Is that a problem? According to this, the GPUs don’t communicate that much once the weights are loaded: https://github.com/turboderp/exllama/discussions/16#discussi...
Intra-GPU memory bandwidth is very important, but I've seen lots of people use just x4 lanes, and they didn't complain much.
- Is LMDeploy the Ultimate Solution? Why It Outshines VLLM, TRT-LLM, TGI, and MLC
-
Any way to optimally use GPU for faster llama calls?
Not using exllama seems like a tremendous waste.
- ExLlama: Memory efficient way to run Llama
- Ask HN: Cheapest hardware to run Llama 2 70B
-
Llama Is Expensive
> We serve Llama on 2 80-GB A100 GPUs, as that is the minimum required to fit Llama in memory (with 16-bit precision)
Well there is your problem.
LLaMA quantized to 4 bits fits in 40GB. And it gets similar throughput split between dual consumer GPUs, which likely means better throughput on a single 40GB A100 (or a cheaper 48GB Pro GPU)
https://github.com/turboderp/exllama#dual-gpu-results
Also, I'm not sure which model was tested, but Llama 70B chat should have better performance than the base model if the prompting syntax is right. That was only reverse engineered from the Meta demo implementation recently.
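The memory figures in this comment follow from simple arithmetic, sketched here under the assumption that the model in question has 70 billion parameters. The estimate covers weights only and ignores the KV cache, activations, and any per-block quantization metadata, so real usage is higher. The function name is illustrative.

```python
def weight_gigabytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Weights only: excludes KV cache, activations, and quantization
    scales or other metadata.
    """
    if n_params <= 0 or bits_per_weight <= 0:
        raise ValueError("arguments must be positive")
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n = 70e9  # assumed parameter count
    print(f"16-bit: {weight_gigabytes(n, 16):.0f} GB")
    print(f" 4-bit: {weight_gigabytes(n, 4):.0f} GB")
```

At 16 bits the weights alone exceed a single 80 GB card, while at 4 bits they come in under 40 GB, which matches the claim in the comment.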
-
Accessing Llama 2 from the command-line with the LLM-replicate plugin
For those getting started, the easiest one click installer I've used is Nomic.ai's gpt4all: https://gpt4all.io/
This runs with a simple GUI on Windows/Mac/Linux, leverages a fork of llama.cpp on the backend and supports GPU acceleration, and LLaMA, Falcon, MPT, and GPT-J models. It also has API/CLI bindings.
I just saw a slick new tool, https://ollama.ai/, that installs llama2-7b with a single `ollama run llama2` command. It has a very simple one-click installer for Apple Silicon Macs (you need to build from source for anything else at the moment). It looks like it only supports Llama models out of the box, and it also seems to use llama.cpp (via a Go adapter) on the backend. It seemed to be CPU-only on my MBA, but I didn't poke at it much and it's brand new, so we'll see.
For anyone on HN, they should probably be looking at https://github.com/ggerganov/llama.cpp and https://github.com/ggerganov/ggml directly. If you have a high-end Nvidia consumer card (3090/4090) I'd highly recommend looking into https://github.com/turboderp/exllama
For those generally confused, the r/LocalLLaMA wiki is a good place to start: https://www.reddit.com/r/LocalLLaMA/wiki/guide/
I've also been porting my own notes into a single location that tracks models, evals, and has guides focused on local models: https://llm-tracker.info/
-
GPT-4 Details Leaked
Deploying the 60B version is a challenge though and you might need to apply 4-bit quantization with something like https://github.com/PanQiWei/AutoGPTQ or https://github.com/qwopqwop200/GPTQ-for-LLaMa . Then you can improve the inference speed by using https://github.com/turboderp/exllama .
If you prefer to use an "instruct" model à la ChatGPT (i.e. that does not need few-shot learning to output good results) you can use something like this: https://huggingface.co/TheBloke/Wizard-Vicuna-30B-Uncensored...
-
Multi-GPU questions
Exllama for example uses buffers on each card that reduce the amount of VRAM available for model and context, see here. https://github.com/turboderp/exllama/issues/121
-
A simple repo for fine-tuning LLMs with both GPTQ and bitsandbytes quantization. Also supports ExLlama for inference for the best speed.
For the inference step, this repo can help you use ExLlama to run inference on an evaluation dataset for the best throughput.
What are some alternatives?
koboldcpp - Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
textgen - Open-source desktop app for local LLMs. Text, vision, tool-calling, OpenAI/Anthropic-compatible API. 100% private.
unsloth - Unsloth Studio is a web UI for training and running open models like Gemma 4, Qwen3.6, DeepSeek, gpt-oss locally.
mlc-llm - Universal LLM Deployment Engine with ML Compilation
ollama - Get up and running with Kimi-K2.6, GLM-5.1, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.