llama.cpp
onnxruntime
| | llama.cpp | onnxruntime |
|---|---|---|
| Mentions | 1,032 | 78 |
| Stars | 115,929 | 20,798 |
| Growth | 7.4% | 1.9% |
| Activity | 10.0 | 10.0 |
| Last commit | 3 days ago | 2 days ago |
| Language | C++ | C++ |
| License | MIT License | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
llama.cpp
-
How to Setup a Local Coding Agent on macOS
> The benchmark prompt was:
> Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.
> Each benchmark generated about 128 tokens.
Generating 128 tokens is probably not enough for good benchmark results. MTP speedup depends on how often the predicted tokens are accepted. In my experience, the very early output has a higher acceptance rate, so short testing can give false positive speedups.
Also llama.cpp includes a tool specifically for benchmarking:
https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
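To make the point about short runs concrete, here is a rough sketch of how a higher acceptance rate early in the output can inflate a measured MTP gain. It assumes a simplified model that is not taken from the post: each drafted token is accepted independently with a fixed probability, and the metric is tokens emitted per target-model pass (which ignores drafting overhead, so it is only a proxy for wall-clock speedup). The acceptance rates, draft length, and the 128-token "early" window are hypothetical values chosen for illustration.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model pass, assuming each drafted
    token is accepted independently with probability accept_rate."""
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        return draft_len + 1.0
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def average_tokens_per_step(total_tokens: int, early_tokens: int,
                            early_rate: float, late_rate: float,
                            draft_len: int) -> float:
    """Average tokens per pass over a run whose first early_tokens tokens
    see early_rate and whose remaining tokens see late_rate."""
    early = min(total_tokens, early_tokens)
    late = total_tokens - early
    steps = (early / expected_tokens_per_step(early_rate, draft_len)
             + late / expected_tokens_per_step(late_rate, draft_len))
    return total_tokens / steps


# Hypothetical numbers: acceptance drops from 0.8 to 0.6 after 128 tokens.
short_run = average_tokens_per_step(128, 128, 0.8, 0.6, draft_len=2)
long_run = average_tokens_per_step(2048, 128, 0.8, 0.6, draft_len=2)
print(f"128-token run:  {short_run:.2f} tokens per pass")
print(f"2048-token run: {long_run:.2f} tokens per pass")
```

Under these assumed rates the 128-token run only ever sees the favourable early regime, so it reports a higher figure than the longer run does.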
-
Doubling Qwen3.6-27B on One RTX 3090: ollama → llama.cpp + MTP, Lever by Lever (35.7 → 80.2 tok/s)
In my build, MTP came from mainline llama.cpp, not ik_llama. ik_llama got me to ~47 (engine + quant), but I couldn't get MTP running there — my build rejected the -mtp flags and ignored the model's nextn tensors. Mainline llama.cpp added MTP fairly recently (PR #22673, merged 2026-05-16), and that's where it worked for me. (There may well be an ik_llama path I missed — this is just what got it going on my box.)
- New `llama.cpp` Updates, AI Agents for Any LLM, and Quantized Vector Index for Local Inference
- Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency
-
Introducing LlamaStash: a zero-overhead, terminal-native llama.cpp launcher
That script grew up. Today I'm releasing LlamaStash, the first public release of a fast, cross-platform, terminal-native launcher for llama.cpp with zero overhead.
-
How fast is LlamaStash? Overhead, throughput, and a fair comparison with Ollama and LM Studio
LlamaStash spawns the unmodified upstream llama-server. So three different questions follow from that, and there is a benchmark suite for each.
-
A 10 year old Xeon is all you need (for 26B-A4B MTP Drafters without GPU)
llama.cpp includes a benchmarking tool called llama-bench https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
ik_llama includes llama-sweep-bench https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples...
When comparing hardware, the output of these tools is very helpful to let others put it into context. The post says the output is "reading speed" but knowing the prefill and token generation speeds would be a lot more helpful.
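As a small illustration of why separate prefill and generation figures say more than one blended number, the sketch below computes all three from raw timings. The `RunTiming` type and the two sets of timings are hypothetical and made up for the example; they are not measurements from the post or output from either benchmarking tool.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    """Hypothetical timing record for one inference run."""
    prompt_tokens: int
    prefill_seconds: float
    generated_tokens: int
    generation_seconds: float


def prefill_rate(t: RunTiming) -> float:
    """Prompt-processing throughput in tokens per second."""
    return t.prompt_tokens / t.prefill_seconds


def generation_rate(t: RunTiming) -> float:
    """Token-generation throughput in tokens per second."""
    return t.generated_tokens / t.generation_seconds


def blended_rate(t: RunTiming) -> float:
    """All tokens over all time: a single number that hides the split."""
    total_tokens = t.prompt_tokens + t.generated_tokens
    return total_tokens / (t.prefill_seconds + t.generation_seconds)


# Two made-up machines with identical blended throughput.
machine_a = RunTiming(1000, 10.0, 200, 20.0)
machine_b = RunTiming(1000, 20.0, 200, 10.0)

for name, timing in (("A", machine_a), ("B", machine_b)):
    print(name, prefill_rate(timing), generation_rate(timing),
          blended_rate(timing))
```

Both made-up machines land on the same blended figure, yet one is twice as fast at prefill and the other twice as fast at generation, which is the context a single "reading speed" number cannot convey.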
-
Racket v9.2 is now available
lol the same way we implement all of the reduced precision fp8, fp4 types today: by storing them in the corresponding uint:
https://github.com/ggml-org/llama.cpp/discussions/15095
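A minimal sketch of what "storing them in the corresponding uint" can look like: the 8-bit float lives in an ordinary byte and is only interpreted as a float when decoded. This assumes the E4M3 layout (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, a single NaN pattern per sign); the linked discussion may concern a different format, and `decode_e4m3` is a name made up for this example.

```python
import math


def decode_e4m3(byte: int) -> float:
    """Interpret one byte as an FP8 E4M3 value (assumed layout, see above)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in 0..255")
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan  # the only NaN encoding; E4M3 has no infinities
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent of -6
        return sign * (mantissa / 8.0) * 2.0 ** -6
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


# The storage is plain unsigned bytes; the float meaning is applied on read.
packed = bytes([0x38, 0xC0, 0x7E, 0x01])
values = [decode_e4m3(b) for b in packed]
print(values)
```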
- Run Gemma-4 E2B-it with llama.cpp on Raspberry Pi4
-
Gemma 4 dense by default: why your local agent doesn't want the MoE
    # Build llama.cpp with Metal backend
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp && cmake -B build -DGGML_METAL=ON && cmake --build build -j

    # Community-quantized GGUFs (Google ships safetensors; unsloth ships GGUF)
    huggingface-cli download unsloth/gemma-4-31B-it-GGUF \
      gemma-4-31B-it-Q4_K_M.gguf --local-dir .
    huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \
      gemma-4-26B-A4B-it-Q4_K_M.gguf --local-dir .

    # Benchmark: 200 generations of 512 tokens, log per-call timing
    ./build/bin/llama-bench -m gemma-4-31B-it-Q4_K_M.gguf -n 512 -r 200 -o json > dense.json
    ./build/bin/llama-bench -m gemma-4-26B-A4B-it-Q4_K_M.gguf -n 512 -r 200 -o json > moe.json
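One possible way to compare the two result files afterwards is sketched below. It rests on assumptions about the JSON that should be checked against the actual output: that each file holds a JSON array with one object per test, and that each object carries `n_prompt`, `n_gen`, and an average tokens-per-second field called `avg_ts`. The helper names are made up for this example.

```python
import json


def load_results(path: str) -> list:
    """Load one llama-bench JSON file (assumed to be an array of objects)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def generation_tokens_per_second(results: list, field: str = "avg_ts") -> float:
    """Mean throughput over generation-only rows.

    Assumption: generation tests have n_gen > 0 and n_prompt == 0, and the
    throughput lives in `field`. Prompt-processing rows are skipped so the
    two kinds of test are not averaged together.
    """
    rates = [float(row[field]) for row in results
             if row.get("n_gen", 0) > 0 and row.get("n_prompt", 0) == 0]
    if not rates:
        raise ValueError("no generation-only rows found")
    return sum(rates) / len(rates)


def compare(dense: list, moe: list) -> dict:
    """Summarise dense vs MoE generation throughput and their ratio."""
    dense_rate = generation_tokens_per_second(dense)
    moe_rate = generation_tokens_per_second(moe)
    return {"dense": dense_rate, "moe": moe_rate,
            "moe_over_dense": moe_rate / dense_rate}


if __name__ == "__main__":
    print(compare(load_results("dense.json"), load_results("moe.json")))
```

Filtering on `n_gen` matters because a run without an explicit prompt-size flag may also emit a prompt-processing row, and mixing that into the average would distort the generation figure.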
onnxruntime
-
I Built a Neural Network Engine in C# That Runs in Your Browser - No ONNX Runtime, No JavaScript Bridge, No Native Binaries
Half of the answer is technical curiosity. The other half is that the current ML-in-the-browser landscape is dominated by ONNX Runtime Web, which has a fundamental WebGPU device-sharing bug that's been ignored for six months (microsoft/onnxruntime#26107). That bug is a wall for anyone trying to ship more than one model in a single browser session. It pushed me from "I wonder if I could do this" to "I'm doing this."
-
They Don't Have the Money (And Neither Do You): The Coming Era of Small Models
The techniques are well-understood. Google's Vertex AI, Amazon SageMaker, Microsoft's ONNX Runtime, and OpenAI's fine-tuning API all support distillation workflows now. It's become a fundamental technique, not an experimental one.
-
I Stopped Paying for Subtitle Services After Running Whisper in a Browser Tab
Then I thought: OpenAI released Whisper as open source in 2022. It's been ported to ONNX, to Core ML, to WebAssembly. If someone can run Stable Diffusion in a browser tab, running a speech-to-text model shouldn't be harder.
-
I Ran a Neural Network in a Browser Tab to Split a Song into Stems
The Audio Stem Splitter on Kitmul loads an ONNX-exported version of the Demucs model and runs inference entirely via ONNX Runtime Web backed by WebAssembly. No server. No upload. The audio bytes never leave your machine.
-
How I started Inference4j - an open source ML project in java with the help of Claude
I stumbled upon ONNX runtime and was immediately hooked. A portable binary execution of a self-contained Neural Network Model.
-
Show HN: TTSLab – A voice AI agent and TTS lab running in the browser via WebGPU
3. Did the Voice Agent work for you? That's the most experimental part.
Built on top of ONNX Runtime Web (https://onnxruntime.ai) and Transformers.js — huge thanks to those communities for making in-browser ML inference possible.
-
15 years later, Microsoft morged my diagram
Seems like this is going to be the year of AI slop being released everywhere by Microsoft. Just wish they'd put as much effort into a post-mortem for this one as they're doing for a diagram on a blog post https://github.com/microsoft/onnxruntime/issues/27263#issuec...
-
GSoC 2026 Predictions: 30 NEW AI/ML/Security Organizations You Should Start Contributing to NOW!
Main: https://github.com/microsoft/onnxruntime ⭐ 15k+
-
ONNX Runtime and CoreML May Silently Convert Your Model to FP16
Thank you for reading.
Also, generally I think CoreML isn't the best. The best solution for ORT would probably be to introduce a pure MPS provider (https://github.com/microsoft/onnxruntime/issues/21271), but given they've already bought into CoreML, the effort may not be worth the reward for the core team. Which is fair enough, as it's a pretty mammoth task.
-
Accelerating LLM Inference: How C++, ONNX, and llama.cpp Power Efficient AI
ONNX Runtime GitHub
What are some alternatives?
koboldcpp - Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
tvm - Open Machine Learning Compiler Framework
unsloth - Unsloth Studio is a web UI for training and running open models like Gemma 4, Qwen3.6, DeepSeek, gpt-oss locally.
onnx - Open standard for machine learning interoperability
mlc-llm - Universal LLM Deployment Engine with ML Compilation
onnx-tensorrt - ONNX-TensorRT: TensorRT backend for ONNX