-
llama.cpp
Discontinued LLM inference in C/C++ [Moved to: https://github.com/ggml-org/llama.cpp] (by ggerganov)
Compilation is cleaner on dense models. llama.cpp, MLX, and vLLM all support both dense and MoE architectures, but the dense path has had more attention, with fewer corner cases where expert routing, GQA, and KV layout interact. If you've ever had a custom kernel mis-handle expert dispatch, you know.
-
    # Build llama.cpp with Metal backend
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp && cmake -B build -DGGML_METAL=ON && cmake --build build -j

    # Community-quantized GGUFs (Google ships safetensors; unsloth ships GGUF)
    huggingface-cli download unsloth/gemma-4-31B-it-GGUF \
      gemma-4-31B-it-Q4_K_M.gguf --local-dir .
    huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \
      gemma-4-26B-A4B-it-Q4_K_M.gguf --local-dir .

    # Benchmark: 200 generations of 512 tokens, log per-call timing
    ./build/bin/llama-bench -m gemma-4-31B-it-Q4_K_M.gguf -n 512 -r 200 -o json > dense.json
    ./build/bin/llama-bench -m gemma-4-26B-A4B-it-Q4_K_M.gguf -n 512 -r 200 -o json > moe.json
-
The 26B-A4B suffix on the MoE is Google's naming for "26B total params, ~4B activated per token", which is the very expectation-vs-distribution gap discussed above. llama-bench reports per-call timings in its JSON output; pipe it through jq to extract the per-call latencies and compute your own quantiles. For a richer view, MLX-LM gives cleaner kernel timing on Apple Silicon and lets you verify the dispatch overhead directly.
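If you'd rather not do the quantile arithmetic in jq, here is a minimal Python sketch of the same step. It assumes the llama-bench JSON output is an array of test records, that generation records have `n_gen > 0`, and that per-repetition timings sit in a `samples_ns` list; check those field names against the output of your own build, since they may differ between versions. The function names are illustrative, not part of any tool.

```python
import json


def per_call_seconds(bench_json, field="samples_ns"):
    """Collect per-repetition timings (seconds) from llama-bench JSON text.

    Assumption: the text is a JSON array of records, generation records
    have n_gen > 0, and `field` holds a list of nanosecond samples.
    """
    samples = []
    for rec in json.loads(bench_json):
        if rec.get("n_gen", 0) > 0:  # skip prompt-processing records
            samples.extend(ns / 1e9 for ns in rec.get(field, []))
    return samples


def latency_quantiles(samples, probs=(0.5, 0.9, 0.99)):
    """Return {p: latency}, interpolating linearly between order statistics."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no samples to summarize")
    result = {}
    for p in probs:
        pos = p * (len(ordered) - 1)        # fractional index into sorted data
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        result[p] = ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)
    return result
```

To use it, read `dense.json` and `moe.json`, pass each file's text to `per_call_seconds`, and compare the resulting `latency_quantiles` side by side. The median shows the typical call; the p90 and p99 show how wide the tail is, which is the part an average hides.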