llama.cpp
whisper.cpp
| | llama.cpp | whisper.cpp |
|---|---|---|
| Mentions | 1,032 | 204 |
| Stars | 115,929 | 50,618 |
| Growth | 7.4% | 2.9% |
| Activity | 10.0 | 9.9 |
| Last commit | 3 days ago | 4 days ago |
| Language | C++ | C++ |
| License | MIT License | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
llama.cpp
-
How to Setup a Local Coding Agent on macOS
> The benchmark prompt was:
> Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.
> Each benchmark generated about 128 tokens.
Generating 128 tokens is probably not enough for good benchmark results. MTP speedup depends on how often the predicted tokens are accepted. In my experience, the very early output has a higher acceptance rate, so short tests can overstate the speedup.
Also, llama.cpp includes a tool specifically for benchmarking:
https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
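A rough sketch of why the acceptance rate matters so much here. It uses the usual expected-length formula for speculative decoding, under the simplifying assumption that each drafted token is accepted independently with probability `p`. The function name, the draft length of 3, and both acceptance rates are illustrative assumptions, not measurements from llama.cpp.

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability p, and that the target model always contributes one more
    token (the correction or the bonus token).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if p == 1.0:
        return float(k + 1)
    return (1.0 - p ** (k + 1)) / (1.0 - p)


# Hypothetical acceptance rates: high early in the output, lower later.
early = expected_tokens_per_step(0.9, 3)
late = expected_tokens_per_step(0.6, 3)
print(f"early: {early:.2f} tokens/step, late: {late:.2f} tokens/step")
```

If a short run only samples the high-acceptance part of the output, the tokens-per-step figure it reports will be higher than what a longer generation would sustain.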
-
Doubling Qwen3.6-27B on One RTX 3090: ollama → llama.cpp + MTP, Lever by Lever (35.7 → 80.2 tok/s)
In my build, MTP came from mainline llama.cpp, not ik_llama. ik_llama got me to ~47 (engine + quant), but I couldn't get MTP running there — my build rejected the -mtp flags and ignored the model's nextn tensors. Mainline llama.cpp added MTP fairly recently (PR #22673, merged 2026-05-16), and that's where it worked for me. (There may well be an ik_llama path I missed — this is just what got it going on my box.)
- New `llama.cpp` Updates, AI Agents for Any LLM, and Quantized Vector Index for Local Inference
- Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency
-
Introducing LlamaStash: a zero-overhead, terminal-native llama.cpp launcher
That script grew up. Today I'm releasing LlamaStash, the first public release of a fast, cross-platform, terminal-native launcher for llama.cpp with zero overhead.
-
How fast is LlamaStash? Overhead, throughput, and a fair comparison with Ollama and LM Studio
LlamaStash spawns the unmodified upstream llama-server. So three different questions follow from that, and there is a benchmark suite for each.
-
A 10 year old Xeon is all you need (for 26B-A4B MTP Drafters without GPU)
llama.cpp includes a benchmarking tool called llama-bench https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
ik_llama includes llama-sweep-bench https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples...
When comparing hardware, the output of these tools is very helpful to let others put it into context. The post says the output is "reading speed" but knowing the prefill and token generation speeds would be a lot more helpful.
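As a small illustration of the two numbers asked for above, the sketch below splits one run into prefill and generation throughput. The `RunTiming` type, its field names, and the timings are hypothetical; they are not the output format of llama-bench or llama-sweep-bench.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    prompt_tokens: int      # tokens processed during prefill
    prefill_seconds: float  # wall time spent on the prompt
    generated_tokens: int   # tokens produced after the prompt
    decode_seconds: float   # wall time spent generating


def throughput(run: RunTiming) -> tuple[float, float]:
    """Return (prefill tok/s, generation tok/s) for one run."""
    if run.prefill_seconds <= 0 or run.decode_seconds <= 0:
        raise ValueError("timings must be positive")
    return (run.prompt_tokens / run.prefill_seconds,
            run.generated_tokens / run.decode_seconds)


# Made-up numbers, only to show the two rates are reported separately.
pp, tg = throughput(RunTiming(512, 4.0, 128, 16.0))
print(f"prefill: {pp:.1f} tok/s, generation: {tg:.1f} tok/s")
```

A single blended "reading speed" hides the fact that these two rates can differ by an order of magnitude on the same machine.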
-
Racket v9.2 is now available
lol the same way we implement all of the reduced precision fp8, fp4 types today: by storing them in the corresponding uint:
https://github.com/ggml-org/llama.cpp/discussions/15095
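A minimal sketch of the "store it in the corresponding uint" idea: an 8-bit float kept in a plain unsigned byte and decoded on demand. The E4M3 layout used here (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities) is an assumption picked for illustration; it is not taken from the linked discussion, and `decode_e4m3` is a hypothetical helper name.

```python
import math


def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 value stored in an unsigned 8-bit integer.

    Layout assumed here: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa
    bits, no infinities, and one NaN pattern per sign (all ones).
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in 0..255")
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 1 - bias.
        return sign * (mantissa / 8.0) * 2.0 ** -6
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


for raw in (0x38, 0xB8, 0x7E, 0x01):
    print(f"0x{raw:02X} -> {decode_e4m3(raw)}")
```

The storage type never needs to know it holds a float; only the decode and encode routines interpret the bits.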
- Run Gemma-4 E2B-it with llama.cpp on Raspberry Pi4
-
Gemma 4 dense by default: why your local agent doesn't want the MoE
    # Build llama.cpp with Metal backend
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp && cmake -B build -DGGML_METAL=ON && cmake --build build -j

    # Community-quantized GGUFs (Google ships safetensors; unsloth ships GGUF)
    huggingface-cli download unsloth/gemma-4-31B-it-GGUF \
      gemma-4-31B-it-Q4_K_M.gguf --local-dir .
    huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \
      gemma-4-26B-A4B-it-Q4_K_M.gguf --local-dir .

    # Benchmark: 200 generations of 512 tokens, log per-call timing
    ./build/bin/llama-bench -m gemma-4-31B-it-Q4_K_M.gguf -n 512 -r 200 -o json > dense.json
    ./build/bin/llama-bench -m gemma-4-26B-A4B-it-Q4_K_M.gguf -n 512 -r 200 -o json > moe.json
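The two JSON files written by the commands above could then be compared with something like the sketch below. It assumes each file holds a list of records and that throughput sits in a field called `avg_ts`; that field name is an assumption about llama-bench's JSON output, so check your own file before relying on it. The helper names and the inline numbers are made up.

```python
import json
from statistics import mean


def mean_throughput(records: list[dict], key: str = "avg_ts") -> float:
    """Average a throughput field over benchmark records.

    `key` defaults to "avg_ts", which is an assumption about the field
    name in llama-bench's JSON output; inspect your own file to confirm.
    """
    values = [float(r[key]) for r in records if key in r]
    if not values:
        raise KeyError(f"no record contains {key!r}")
    return mean(values)


def load_records(path: str) -> list[dict]:
    """Load a JSON file that holds a list of benchmark records."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Inline stand-in data with made-up numbers; replace with
# load_records("dense.json") and load_records("moe.json").
dense = [{"avg_ts": 10.0}, {"avg_ts": 12.0}]
moe = [{"avg_ts": 20.0}, {"avg_ts": 24.0}]
ratio = mean_throughput(moe) / mean_throughput(dense)
print(f"MoE / dense throughput ratio: {ratio:.2f}")
```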
whisper.cpp
-
I added voice messages to my self-hosted AI agent, for free
whisper.cpp is Georgi Gerganov's port of OpenAI's Whisper model to plain C++. It runs on CPU, no GPU needed, and the smallest model (ggml-tiny.bin) is only about 75MB.
-
Claude Code + Telegram: How to Supercharge Your AI Assistant with Voice, Threading & More
That's it. The plugin handles format conversion, chunking, and transcription automatically. For fully offline transcription, install whisper.cpp instead.
-
I Built a Voice Assistant That Runs Entirely in Your Browser
The first issue is that running Whisper eats up some serious computational resources, so it's not ideal to have it running constantly. In the Whisper WASM port, there is an example of a wake word detection implementation command.wasm, which partially addresses this with VAD, only running transcription once voice activity is detected.
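The gating idea can be sketched as follows: a cheap voice-activity check runs on every frame, and the expensive transcription call runs only on the segments that pass it. This is a plain energy threshold written for illustration, not the VAD used by command.wasm; `transcribe` is a placeholder, and the threshold and hangover values are arbitrary assumptions.

```python
import math
from typing import Iterable, Iterator, List, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def voiced_segments(frames: Iterable[Sequence[float]],
                    threshold: float = 0.02,
                    hangover: int = 2) -> Iterator[List[float]]:
    """Yield runs of samples whose frame energy exceeds `threshold`.

    A segment stays open for `hangover` quiet frames so short pauses
    do not split an utterance.
    """
    segment: List[float] = []
    quiet = 0
    for frame in frames:
        if rms(frame) >= threshold:
            segment.extend(frame)
            quiet = 0
        elif segment:
            quiet += 1
            segment.extend(frame)
            if quiet >= hangover:
                yield segment
                segment, quiet = [], 0
    if segment:
        yield segment


def transcribe(samples: List[float]) -> str:
    """Placeholder for the expensive speech-to-text call."""
    return f"<{len(samples)} samples>"


silence = [0.0] * 4
speech = [0.5, -0.5, 0.5, -0.5]
stream = [silence, speech, speech, silence, silence, silence]
results = [transcribe(seg) for seg in voiced_segments(stream)]
print(results)
```

Leading and trailing silence never reaches `transcribe`, which is where the compute saving comes from.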
-
Show HN: OWhisper – Ollama for realtime speech-to-text
Thank you for taking the time to build something and share it. However, what is the advantage of using this over whisper.cpp stream, which can also do real-time conversion?
https://github.com/ggml-org/whisper.cpp/tree/master/examples...
-
Kitten TTS: 25MB CPU-Only, Open-Source Voice Model
Whisper and the many variants. Here's a good implementation.
https://github.com/ggml-org/whisper.cpp
-
Ask HN: What API or software are people using for transcription?
Whisper large v3 from openai, but we host it ourselves on Modal.com. It's easy, fast, no rate limits, and cheap as well.
If you want to run it locally, I'd still go with whisper, then I'd look at something like whisper.cpp https://github.com/ggml-org/whisper.cpp. Runs quite well.
- Whispercpp – Local, Fast, and Private Audio Transcription for Ruby
-
Build Your Own Siri. Locally. On-Device. No Cloud
not the gp but found this https://github.com/ggml-org/whisper.cpp/blob/master/models/c...
-
Run LLMs on Apple Neural Engine (ANE)
Actually that's a really good question, I hadn't considered that the comparison here is just CPU vs using Metal (CPU+GPU).
To answer the question though - I think this would be used for cases where you are building an app that wants to utilize a small AI model while at the same time having the GPU free to do graphics related things, which I'm guessing is why Apple stuck these into their hardware in the first place.
Here is an interesting comparison between the two from a whisper.cpp thread - ignoring startup times - the CPU+ANE seems about on par with CPU+GPU: https://github.com/ggml-org/whisper.cpp/pull/566#issuecommen...
-
Building a personal, private AI computer on a budget
A great thread with the type of info you're looking for lives here: https://github.com/ggerganov/whisper.cpp/issues/89
But you can likely find similar threads for the llama.cpp benchmark here: https://github.com/ggerganov/llama.cpp/tree/master/examples/...
These are good examples because the llama.cpp and whisper.cpp benchmarks take full advantage of the Apple hardware but also take full advantage of non-Apple hardware with GPU support, AVX support etc.
It’s been true for a while now that the memory bandwidth of modern Apple systems, in tandem with the neural cores and GPU, has made them very competitive with Nvidia for local inference and even training.
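A back-of-the-envelope sketch of why memory bandwidth matters for local inference: if every generated token has to read all model weights once, bandwidth divided by model size gives a rough ceiling on single-stream generation speed. That assumption ignores compute, cache effects, and batching, and the 400 GB/s and 20 GB figures are hypothetical, not specs of any particular machine or model.

```python
def decode_tokens_per_second_bound(model_bytes: float,
                                   bandwidth_bytes_per_s: float) -> float:
    """Rough ceiling on single-stream generation speed.

    Assumes every generated token reads all model weights from memory
    once and that nothing else (compute, cache traffic) is the limit.
    """
    if model_bytes <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("inputs must be positive")
    return bandwidth_bytes_per_s / model_bytes


# Hypothetical machine and model: 400 GB/s memory, 20 GB of weights.
bound = decode_tokens_per_second_bound(20e9, 400e9)
print(f"at most about {bound:.0f} tok/s")
```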
What are some alternatives?
koboldcpp - Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
bark - 🔊 Text-Prompted Generative Audio Model
unsloth - Unsloth Studio is a web UI for training and running open models like Gemma 4, Qwen3.6, DeepSeek, gpt-oss locally.
faster-whisper - Faster Whisper transcription with CTranslate2
mlc-llm - Universal LLM Deployment Engine with ML Compilation
whisperX - WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)