Gemma 4 dense by default: why your local agent doesn't want the MoE

This page summarizes the projects mentioned and recommended in the original post on dev.to.

  1. llama.cpp

    LLM inference in C/C++ (by ggerganov). Discontinued; moved to https://github.com/ggml-org/llama.cpp

    Compilation is cleaner on dense models. llama.cpp, MLX, and vLLM all support both dense and MoE architectures, but the dense path has had more attention, so it has fewer corner cases in the interactions between expert routing, GQA, and KV layout. If you've ever had a custom kernel mishandle expert dispatch, you know.

  2. SaaSHub

    SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

  3. mlx

    MLX: An array framework for Apple silicon

  4. vllm

    A high-throughput and memory-efficient inference and serving engine for LLMs

  5. llama.cpp

    LLM inference in C/C++

    # Build llama.cpp with Metal backend
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp && cmake -B build -DGGML_METAL=ON && cmake --build build -j

    # Community-quantized GGUFs (Google ships safetensors; unsloth ships GGUF)
    huggingface-cli download unsloth/gemma-4-31B-it-GGUF \
      gemma-4-31B-it-Q4_K_M.gguf --local-dir .
    huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \
      gemma-4-26B-A4B-it-Q4_K_M.gguf --local-dir .

    # Benchmark: 200 generations of 512 tokens, log per-call timing
    ./build/bin/llama-bench -m gemma-4-31B-it-Q4_K_M.gguf -n 512 -r 200 -o json > dense.json
    ./build/bin/llama-bench -m gemma-4-26B-A4B-it-Q4_K_M.gguf -n 512 -r 200 -o json > moe.json

  6. mlx-examples

    Examples in the MLX framework

    The 26B-A4B suffix on the MoE model is Google's naming for "26B total params, ~4B activated per token", which is exactly the expectation-vs-distribution gap discussed in the original post. llama-bench reports per-call timings; pipe its JSON output through jq to extract the per-call latencies and compute your own quantiles. For a richer view, MLX-LM gives cleaner kernel timing on Apple silicon and lets you verify the dispatch overhead directly.
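
As a rough illustration of the "compute your own quantiles" step, here is a minimal Python sketch that stands in for the jq pipeline. It assumes each entry in the llama-bench JSON output carries a `samples_ns` list of per-repetition timings in nanoseconds, along with `n_prompt` and `n_gen` fields; check this against the output of your own build before relying on it. The helper names `quantile` and `latency_quantiles` are hypothetical, and the nearest-rank method is just one simple choice of quantile definition.

```python
import json
import math
import sys


def quantile(sorted_values, q):
    """Nearest-rank quantile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(q * len(sorted_values)))
    return sorted_values[rank - 1]


def latency_quantiles(bench_json, qs=(0.5, 0.95, 0.99)):
    """Return {test label: {q: latency in ms}} for each benchmark entry.

    Assumption: every entry has a "samples_ns" list holding one wall-clock
    time per repetition, in nanoseconds. Entries without it are skipped.
    """
    results = {}
    for entry in json.loads(bench_json):
        samples = sorted(entry.get("samples_ns", []))
        if not samples:
            continue
        # Label entries by prompt and generation token counts, e.g. "pp0/tg512".
        label = f"pp{entry.get('n_prompt', 0)}/tg{entry.get('n_gen', 0)}"
        results[label] = {q: quantile(samples, q) / 1e6 for q in qs}
    return results


if __name__ == "__main__":
    # Usage: python quantiles.py dense.json moe.json
    for path in sys.argv[1:]:
        with open(path) as f:
            print(path, latency_quantiles(f.read()))
```

Comparing the p50 against the p95 and p99 for `dense.json` and `moe.json` side by side is one way to see whether the MoE's tail latencies are wider than its median would suggest.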
