ik_llama.cpp Alternatives
Similar projects and alternatives to ik_llama.cpp
-
ollama
Get up and running with Kimi-K2.6, GLM-5.1, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.
-
sglang
SGLang is a high-performance serving framework for large language models and multimodal models.
-
clawdis
Discontinued. Your own personal AI assistant. Any OS. Any platform. [Moved to: https://github.com/clawdbot/clawdbot]
-
knn
POC for turning a Google corpus into an explorable knowledge graph using pure k-NN. (by sixthextinction)
ik_llama.cpp discussion
ik_llama.cpp reviews and mentions
-
A 10 year old Xeon is all you need (for 26B-A4B MTP Drafters without GPU)
llama.cpp includes a benchmarking tool called llama-bench https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
ik_llama includes llama-sweep-bench https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples...
When comparing hardware, the output of these tools helps others put the numbers into context. The post describes its result as "reading speed", but separate prefill and token generation speeds would be far more helpful.
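To illustrate why the commenter asks for both figures, here is a minimal Python sketch. It does not parse real llama-bench or llama-sweep-bench output. The `RunTiming` fields, the helper names, and the timing numbers are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    """Timings for one inference run (hypothetical field names)."""
    prompt_tokens: int         # tokens processed during prefill
    prefill_seconds: float     # wall time spent on prefill
    generated_tokens: int      # tokens produced during generation
    generation_seconds: float  # wall time spent generating


def throughput(run: RunTiming) -> tuple[float, float]:
    """Return (prefill tok/s, generation tok/s) as two separate figures."""
    prefill = run.prompt_tokens / run.prefill_seconds
    generation = run.generated_tokens / run.generation_seconds
    return prefill, generation


def combined(run: RunTiming) -> float:
    """Return one blended tok/s figure over the whole run."""
    total_tokens = run.prompt_tokens + run.generated_tokens
    total_seconds = run.prefill_seconds + run.generation_seconds
    return total_tokens / total_seconds


# Made-up example: 512 prompt tokens in 4 s, then 128 generated tokens in 16 s.
run = RunTiming(prompt_tokens=512, prefill_seconds=4.0,
                generated_tokens=128, generation_seconds=16.0)
prefill, generation = throughput(run)
print(f"prefill:    {prefill:.1f} tok/s")        # 128.0
print(f"generation: {generation:.1f} tok/s")     # 8.0
print(f"combined:   {combined(run):.1f} tok/s")  # 32.0
```

In this invented run the blended figure (32 tok/s) matches neither phase, which is why a single "speed" number is hard to compare across machines.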
- ik_llama.cpp – llama.cpp fork with better CPU performance
-
Turning Google into an Explorable Knowledge Graph Using Pure k-NN
Human learning is all about building connections in your head. Last week, for example, I read an arXiv paper on quantization, which prompted me to do some Google-fu for an FP16 vs INT8 comparison on NVIDIA's forums, and then run a site:github.com search for a llama.cpp fork with optimized kernels to try it myself. This takes time. Google, or an LLM, can't make these mental hops for you.
-
Something is afoot in the land of Qwen
I'm getting ~27 tok/s on my 5060 Ti 16GB (just bought for ~€600) paired with an old 8th-gen Core i7 (2017, with DDR4 memory). I'm using the iGPU for graphics, so the card is free for inference. I'm running Qwen3.5-27B-UD-IQ3_XXS with almost 72k context, and I went with ik_llama.cpp [0]. Qwen3.5 is very promising; I'm definitely going to use it.
I also tried ZSE [1] but ran into some issues. I was hoping I could squeeze in 4-bit or 5-bit quants.
[0] https://github.com/ikawrakow/ik_llama.cpp
[1] https://news.ycombinator.com/item?id=47160526
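For a rough sense of why 4-bit or 5-bit quants are a tight fit on a 16 GB card, here is a back-of-the-envelope sketch in Python. It assumes exactly 27e9 parameters (read off the model name), nominal bit widths with no per-block metadata, and a 16 GiB card. Real quant formats and mixed-precision "UD" files will come out somewhat different, and the KV cache for a long context needs memory on top of the weights.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB at a uniform nominal bit width."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 27e9   # assumption: taken from the "27B" in the model name
VRAM_GIB = 16.0   # assumption: the card's memory treated as 16 GiB

for bits in (3.0, 4.0, 5.0):
    size = weight_gib(N_PARAMS, bits)
    headroom = VRAM_GIB - size  # what is left for KV cache, buffers, etc.
    print(f"{bits:.0f}-bit: ~{size:.1f} GiB weights, ~{headroom:.1f} GiB headroom")
```

Under these assumptions the weights alone take roughly 9.4, 12.6, and 15.7 GiB, so headroom for context shrinks quickly as the bit width goes up.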
- A 30B Qwen Model Walks into a Raspberry Pi and Runs in Real Time
-
I Want Everything Local – Building My Offline AI Workspace
Take a look at ik_llama.cpp: https://github.com/ikawrakow/ik_llama.cpp
CPU performance is much better than mainline llama.cpp, and more quantization types are available.
- GitHub deletes popular llama.cpp fork without explanation
- Basic Facts about GPUs
-
Ollama's new engine for multimodal models
There's also some interpersonal conflict in llama.cpp that's hampering other bug fixes: https://github.com/ikawrakow/ik_llama.cpp/pull/400
- CPU beating GPU in token generation speed
Stats
ikawrakow/ik_llama.cpp is an open source project licensed under the MIT License, which is an OSI-approved license.
The primary programming language of ik_llama.cpp is C++.