llama.cpp
textgen
| llama.cpp | textgen | |
|---|---|---|
| 1,032 | 887 | |
| 115,929 | 47,283 | |
| 7.4% | 0.8% | |
| 10.0 | 9.8 | |
| 3 days ago | 12 days ago | |
| C++ | Python | |
| MIT License | GNU Affero General Public License v3.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
llama.cpp
-
How to Setup a Local Coding Agent on macOS
> The benchmark prompt was:
> Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.
> Each benchmark generated about 128 tokens.
Generating 128 tokens is probably not enough for good benchmark results. MTP speedup depends on how often the predicted tokens are accepted. In my experience, the very early output has a higher acceptance rate, so short testing can give false positive speedups.
Also llama.cpp includes a tool specifically for benchmarking:
https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
-
Doubling Qwen3.6-27B on One RTX 3090: ollama llama.cpp + MTP, Lever by Lever (35.7 80.2 tok/s)
In my build, MTP came from mainline llama.cpp, not ik_llama. ik_llama got me to ~47 (engine + quant), but I couldn't get MTP running there — my build rejected the -mtp flags and ignored the model's nextn tensors. Mainline llama.cpp added MTP fairly recently (PR #22673, merged 2026-05-16), and that's where it worked for me. (There may well be an ik_llama path I missed — this is just what got it going on my box.)
- New `llama.cpp` Updates, AI Agents for Any LLM, and Quantized Vector Index for Local Inference
- Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency
-
Introducing LlamaStash: a zero-overhead, terminal-native llama.cpp launcher
That script grew up. Today I'm releasing LlamaStash, the first public release of a fast, cross-platform, terminal-native launcher for llama.cpp with zero overhead.
-
How fast is LlamaStash? Overhead, throughput, and a fair comparison with Ollama and LM Studio
LlamaStash spawns the unmodified upstream llama-server. So three different questions follow from that, and there is a benchmark suite for each.
-
A 10 year old Xeon is all you need (for 26B-A4B MTP Drafters without GPU)
llama.cpp includes a benchmarking tool called llama-bench https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
ik_llama includes llama-sweep-bench https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples...
When comparing hardware, the output of these tools is very helpful to let others put it into context. The post says the output is "reading speed" but knowing the prefill and token generation speeds would be a lot more helpful.
-
Racket v9.2 is now available
lol the same way we implement all of the reduced precision fp8, fp4 types today: by storing them in the corresponding uint:
https://github.com/ggml-org/llama.cpp/discussions/15095
- Run Gemma-4 E2B-it with llama.cpp on Raspberry Pi4
-
Gemma 4 dense by default: why your local agent doesn't want the MoE
# Build llama.cpp with Metal backend git clone https://github.com/ggml-org/llama.cpp cd llama.cpp && cmake -B build -DGGML_METAL=ON && cmake --build build -j # Community-quantized GGUFs (Google ships safetensors; unsloth ships GGUF) huggingface-cli download unsloth/gemma-4-31B-it-GGUF \ gemma-4-31B-it-Q4_K_M.gguf --local-dir . huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \ gemma-4-26B-A4B-it-Q4_K_M.gguf --local-dir . # Benchmark: 200 generations of 512 tokens, log per-call timing ./build/bin/llama-bench -m gemma-4-31B-it-Q4_K_M.gguf -n 512 -r 200 -o json > dense.json ./build/bin/llama-bench -m gemma-4-26B-A4B-it-Q4_K_M.gguf -n 512 -r 200 -o json > moe.json
textgen
-
Is there an IDE that can use the local open-source model?
I've been looking for this too. It seems to me as though all the ide's are trying to sell the llms as a service or trying to lock you in by downloading llms through their ide. I have been downloading llm's from huggingface as gguf files and would like to use those downloads (and running them through https://github.com/oobabooga/text-generation-webui). It is possible to run those llms as a local api using something like llama-cpp-python and would prefer to use something like that method. Zed (https://zed.dev/), which is now available on windows might be able to do it, but i'd rather use something (foss) that doesn't have a pricing model (the development focus will always be upon those who pay). tbh i'm getting a bit sick of changing ide's, as their support changes, and really would prefer not to use (microsoft) visual studio code which seems to be cornering the market. Starting to think i'm going to try to learn emacs, with https://github.com/karthink/gptel looking as if it would meet my needs.
-
When you're asking AI chatbots for answers, they're data-mining you
There are also things like Oobabooga's text-generation-webui[0] which can present a similar interface to ChatGPT for local models.
I've had great success in running Qwen3-8B-GGUF[1] on my RTX 2070 SUPER (8GB VRAM) using Oobabooga (everyone just calls it via the author's name, it's much catchier) so this is definitely doable on consumer hardware. Specifically, I run the Q4_K_M model as Oobabooga loads all of its layers into the GPU by default, making it nice and snappy. (Testing has shown that I can actually load up to the Q6_K model before some layers have to be loaded into the CPU, but I have to manually specify that all those layers should be loaded into the GPU, as opposed to leaving it auto-determined.)
It does obviously hallucinate more often than ChatGPT does, so care should be taken. That said, it's really nice to have something local.
There's a subreddit for running text gen models locally that people might be interested in: https:// www.reddit.com/r/LocalLLaMA
[0] https://github.com/oobabooga/text-generation-webui
[1] https://huggingface.co/Qwen/Qwen3-8B-GGUF
-
How to Install NVIDIA AceReason-Nemotron-14B Locally?
git clone https://github.com/oobabooga/text-generation-webui cd text-generation-webui
-
1,156 Questions Censored by DeepSeek
total time = 392339.02 ms / 2221 tokens
And my exact command was:
llama-server --model DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf --temp 0.6 -c 9000 --min-p 0.1 --top-k 0 --top-p 1 --timeout 3600 --slot-save-path ~/llama_kv_path --port 8117 -ctk q8_0
(IIRC slot save path argument does absolutely nothing unless and is superfluous, but I have been pasting a similar command around and been too lazy to remove it). -ctk q8_0 reduces memory use a bit for context.
I think my 256gb is right at the limit of spilling a bit into swap, so I'm pushing the limits :)
To explain to anyone not aware of llama-server: it exposes (a somewhat) OpenAI-compatible API and then you can use it with any software that speaks that. llama-server itself also has a UI, but I haven't used it.
I had some SSH tunnels set up to use the server interface with https://github.com/oobabooga/text-generation-webui where I hacked an "OpenAI" client to it (that UI doesn't have it natively). The only reason I use the oobabooga UI is out of habit so I don't recommend this setup to others.
-
DeepSeek-R1 with Dynamic 1.58-bit Quantization
Can't this kind of repetition be dealt with at the decoder level, like for any models? (see DRY decoder for instance: https://github.com/oobabooga/text-generation-webui/pull/5677)
-
I Run LLMs Locally
Still nothing better than oobabooga (https://github.com/oobabooga/text-generation-webui) in terms of maximalism/"Pro"/"Prosumer" LLM UI/UX ALA Blender, Photoshop, Final Cut Pro, etc.
Embarrassing and any VCs reading this can contact me to talk about how to fix that. lm-studio is today the closest competition (but not close enough) and Adobe or Microsoft could do it if they fired their current folks which prevent this from happening.
If you're not using Oobabooga, you're likely not playing with the settings on models, and if you're not playing with your models settings, you're hardly even scratching the surface on its total capabilities.
-
Yi-Coder: A Small but Mighty LLM for Code
I understand your situation. It sounds super simple to me now but I remember having to spend at least a week trying to get the concepts and figuring out what prerequisite knowledge I would need between a continium of just using chatgpt and learning relevant vector math etc. It is much closer to the chatgpt side fortunately. I don't like ollama per se (because i can't reuse its models due to it compressing them in its own format) but it's still a very good place to start. Any interface that lets you download models as gguf from huggingface will do just fine. Don't be turned off by the roleplaying/waifu sounding frontend names. They are all fine. This is what I mostly prefer: https://github.com/oobabooga/text-generation-webui
- XTC: An LLM sampler that boosts creativity, breaks writing clichés
-
Codestral Mamba
Why do people recommend this instead of the much better oobabooga text-gen-webui?
https://github.com/oobabooga/text-generation-webui
It's like you hate settings, features, and access to many backends!
-
Why I made TabbyAPI
The issue is running the model. Exl2 is part of the ExllamaV2 library, but to run a model, a user needs an API server. The only option out there was using text-generation-webui (TGW), a program that bundled every loader out there into a Gradio webui. Gradio is a common “building-block” UI framework for python development and is often used for AI applications. This setup was good for a while, until it wasn’t.
What are some alternatives?
koboldcpp - Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
unsloth - Unsloth Studio is a web UI for training and running open models like Gemma 4, Qwen3.6, DeepSeek, gpt-oss locally.
SillyTavern - LLM Frontend for Power Users.
mlc-llm - Universal LLM Deployment Engine with ML Compilation
ollama - Get up and running with Kimi-K2.6, GLM-5.1, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.