| | llama.cpp | outlines |
|---|---|---|
| | 1,032 | 50 |
| Stars | 115,929 | 13,948 |
| Growth | 7.4% | 1.4% |
| Activity | 10.0 | 9.3 |
| | 3 days ago | 27 days ago |
| Language | C++ | Python |
| License | MIT License | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
llama.cpp
-
How to Setup a Local Coding Agent on macOS
> The benchmark prompt was:
> Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.
> Each benchmark generated about 128 tokens.
Generating 128 tokens is probably not enough for good benchmark results. MTP speedup depends on how often the predicted tokens are accepted. In my experience, the very early output has a higher acceptance rate, so short testing can give false positive speedups.
Also llama.cpp includes a tool specifically for benchmarking:
https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
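To make the acceptance-rate point concrete, here is a rough sketch in Python. It assumes each drafted token is accepted independently with probability `alpha` and that `k` tokens are drafted per step, which is the usual simplification for speculative decoding and not something llama.cpp reports in this form. The function name and the example rates are hypothetical.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step.

    Assumption: each of the k drafted tokens is accepted independently
    with probability alpha, and the target model always contributes one
    token of its own, so the total is 1 + alpha + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return sum(alpha ** i for i in range(k + 1))


if __name__ == "__main__":
    k = 3  # hypothetical number of drafted tokens per step
    # Hypothetical rates: high acceptance early on, lower over a long run.
    for label, alpha in [("first ~128 tokens", 0.9), ("long run", 0.6)]:
        tokens = expected_tokens_per_step(alpha, k)
        print(f"{label}: about {tokens:.2f} tokens per step")
```

Under these made-up rates, the short window suggests roughly 3.4 tokens per step while the long run gives roughly 2.2. Both figures are upper bounds on the speedup, since the cost of drafting is ignored.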
-
Doubling Qwen3.6-27B on One RTX 3090: ollama → llama.cpp + MTP, Lever by Lever (35.7 → 80.2 tok/s)
In my build, MTP came from mainline llama.cpp, not ik_llama. ik_llama got me to ~47 (engine + quant), but I couldn't get MTP running there — my build rejected the -mtp flags and ignored the model's nextn tensors. Mainline llama.cpp added MTP fairly recently (PR #22673, merged 2026-05-16), and that's where it worked for me. (There may well be an ik_llama path I missed — this is just what got it going on my box.)
- New `llama.cpp` Updates, AI Agents for Any LLM, and Quantized Vector Index for Local Inference
- Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency
-
Introducing LlamaStash: a zero-overhead, terminal-native llama.cpp launcher
That script grew up. Today I'm releasing LlamaStash, the first public release of a fast, cross-platform, terminal-native launcher for llama.cpp with zero overhead.
-
How fast is LlamaStash? Overhead, throughput, and a fair comparison with Ollama and LM Studio
LlamaStash spawns the unmodified upstream llama-server. So three different questions follow from that, and there is a benchmark suite for each.
-
A 10 year old Xeon is all you need (for 26B-A4B MTP Drafters without GPU)
llama.cpp includes a benchmarking tool called llama-bench https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
ik_llama includes llama-sweep-bench https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples...
When comparing hardware, the output of these tools is very helpful to let others put it into context. The post says the output is "reading speed" but knowing the prefill and token generation speeds would be a lot more helpful.
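As a small illustration of why the two speeds are worth reporting separately, the sketch below computes prefill and generation throughput from token counts and timings. The `RunTiming` type and all numbers are hypothetical; in practice the values would come from a tool such as llama-bench.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    prompt_tokens: int         # tokens processed during prefill
    prompt_seconds: float      # wall time spent on prefill
    generated_tokens: int      # tokens produced during generation
    generation_seconds: float  # wall time spent generating


def throughput(t: RunTiming) -> dict[str, float]:
    """Return prefill, generation, and blended tokens per second."""
    total_tokens = t.prompt_tokens + t.generated_tokens
    total_seconds = t.prompt_seconds + t.generation_seconds
    return {
        "prefill_tok_s": t.prompt_tokens / t.prompt_seconds,
        "generation_tok_s": t.generated_tokens / t.generation_seconds,
        "blended_tok_s": total_tokens / total_seconds,
    }


if __name__ == "__main__":
    # Hypothetical run: 2048 prompt tokens in 16 s, 256 generated in 32 s.
    print(throughput(RunTiming(2048, 16.0, 256, 32.0)))
```

With these made-up numbers, prefill runs at 128 tok/s and generation at 8 tok/s, yet the blended figure is 48 tok/s, which describes neither phase.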
-
Racket v9.2 is now available
lol the same way we implement all of the reduced precision fp8, fp4 types today: by storing them in the corresponding uint:
https://github.com/ggml-org/llama.cpp/discussions/15095
- Run Gemma-4 E2B-it with llama.cpp on Raspberry Pi4
-
Gemma 4 dense by default: why your local agent doesn't want the MoE
    # Build llama.cpp with Metal backend
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp && cmake -B build -DGGML_METAL=ON && cmake --build build -j

    # Community-quantized GGUFs (Google ships safetensors; unsloth ships GGUF)
    huggingface-cli download unsloth/gemma-4-31B-it-GGUF \
      gemma-4-31B-it-Q4_K_M.gguf --local-dir .
    huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \
      gemma-4-26B-A4B-it-Q4_K_M.gguf --local-dir .

    # Benchmark: 200 generations of 512 tokens, log per-call timing
    ./build/bin/llama-bench -m gemma-4-31B-it-Q4_K_M.gguf -n 512 -r 200 -o json > dense.json
    ./build/bin/llama-bench -m gemma-4-26B-A4B-it-Q4_K_M.gguf -n 512 -r 200 -o json > moe.json
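One possible way to compare the resulting `dense.json` and `moe.json` files is sketched below in Python. It assumes that llama-bench's JSON output is a list of records and that each record carries `n_gen` and `avg_ts` (average tokens per second) fields; check this against the output of your own build before relying on it. The helper names are hypothetical.

```python
import json
from pathlib import Path


def generation_speed(records: list[dict]) -> float:
    """Average tokens/s of the first record that generated tokens.

    Assumes each record has `n_gen` and `avg_ts` fields (an assumption
    about the llama-bench JSON layout, not a documented guarantee here).
    """
    for record in records:
        if record.get("n_gen", 0) > 0:
            return float(record["avg_ts"])
    raise ValueError("no generation record found")


def moe_speedup(dense: list[dict], moe: list[dict]) -> float:
    """Ratio of MoE generation speed to dense generation speed."""
    return generation_speed(moe) / generation_speed(dense)


if __name__ == "__main__":
    dense_path, moe_path = Path("dense.json"), Path("moe.json")
    if dense_path.exists() and moe_path.exists():
        dense = json.loads(dense_path.read_text())
        moe = json.loads(moe_path.read_text())
        print(f"MoE / dense generation speed: {moe_speedup(dense, moe):.2f}x")
```

Filtering on `n_gen` matters because a run without an explicit prompt setting may also emit a prompt-processing record, which should not be mixed into the generation comparison.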
outlines
-
The Chomsky Objection the AI Industry Has Been Quietly Working Around
The Python library Outlines, maintained by .txt, is the most widely-used implementation. The minimal usage is short:
-
A 70ms Local NLI Judge Hits 0.596 Pearson r With Groq Llama 3.3 70B on DSPy Reward Scoring
Same minimal-first methodology will be applied to outlines, marvin, and llama_index — one paired comparison, no holes, real numbers. A PR at stanfordnlp/dspy referencing this work is open; find it linked from the semantix-ai repo.
-
Top 5 Structured Output Libraries for LLMs in 2026
TL;DR: For most projects, start with Instructor (Python) or Zod + zodResponseFormat (TypeScript) -- they're the fastest path to reliable structured output from any cloud LLM. If you run local models and need guaranteed schema compliance with zero retries, go with Outlines. For Python agent workflows with tool calling, PydanticAI is the best bet.
- LLM Structured Outputs Handbook
-
PyCon 2025 Agentic App with Use of Pydantic AI
`outlines` (https://github.com/dottxt-ai/outlines) is very good and supported by vLLM as a backend structured output provider (https://docs.vllm.ai/en/v0.8.2/features/structured_outputs.h...) for both local and remote LLMs. vLLM is probably the best open source tooling for the inference side right now.
-
GPT-5 for Developers
I assume they're doing "Structured Generation" or "Guided generation", which has been possible for a while if you control the LLM itself e.g. running an OSS model, e.g. [0][1]. It's cool to see a major API provider offer it, though.
The basic idea is: at each auto-regressive step (each token generation), instead of letting the model generate a probability distribution over "all tokens in the entire vocab it's ever seen" (the default), only allow the model to generate a probability distribution over "this specific set of tokens I provide". And that set can change from one sampling step to the next, according to a given grammar. E.g. if you're using a JSON grammar, and you've just generated a `{`, you can offer the model only the tokens that are valid JSON immediately after a `{`, etc.
[0] https://github.com/dottxt-ai/outlines
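The masking idea described in this comment can be sketched in a few lines of Python. This is a toy illustration, not the Outlines API: the vocabulary, the transition table standing in for a grammar, and the `toy_logits` "model" are all hypothetical.

```python
import math

VOCAB = ["{", "}", '"key"', ":", "1", "true", "hello"]

# Toy grammar: which tokens may follow the previous token (None = start).
TRANSITIONS = {
    None: {"{"},
    "{": {'"key"', "}"},
    '"key"': {":"},
    ":": {"1", "true"},
    "1": {"}"},
    "true": {"}"},
    "}": set(),  # nothing may follow: generation is complete
}


def masked_softmax(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Softmax over the allowed tokens only; all others get probability 0."""
    top = max(logits[tok] for tok in allowed)
    weights = {
        tok: math.exp(logits[tok] - top) if tok in allowed else 0.0
        for tok in logits
    }
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


def toy_logits(generated: list[str]) -> dict[str, float]:
    """Stand-in for a model: fixed scores that strongly prefer 'hello'."""
    scores = {"hello": 5.0, '"key"': 2.0, "{": 1.0, "}": 0.5,
              "1": 0.3, "true": 0.2, ":": 0.1}
    return {tok: scores[tok] for tok in VOCAB}


def constrained_greedy(logit_fn, max_steps: int = 10) -> list[str]:
    """Greedy decoding where the grammar masks the distribution each step."""
    generated: list[str] = []
    for _ in range(max_steps):
        last = generated[-1] if generated else None
        allowed = TRANSITIONS[last]
        if not allowed:
            break
        probs = masked_softmax(logit_fn(generated), allowed)
        generated.append(max(probs, key=probs.get))
    return generated


if __name__ == "__main__":
    print("".join(constrained_greedy(toy_logits)))  # prints {"key":1}
```

Even though the stand-in model always scores `hello` highest, that token never appears, because the mask sets its probability to zero at every step. Real implementations operate on tokenizer vocabularies and compiled grammars rather than a hand-written table.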
-
Taming LLMs: How to Get Structured Output Every Time (Even for Big Responses)
To get started, install Outlines (pip install outlines) and experiment with the examples above. If you’re working on a project that needs structured LLM output, give Outlines a try and check out the official docs for more details. Want to dive deeper into FSM logic or specific integrations? Let me know in the comments!
- Outlines – Structured Outputs from LLMs
-
Albert Einstein's Theory of Relativity in Words of Four Letters or Less
Another beam could be "This way is solv-". With "ing" as the obvious next token.
It will select the best beam for the output.
[1] https://github.com/dottxt-ai/outlines
-
Universal Personal Assistant with LLMs
LLM answer quality depends directly on the prompts it is given, so effective prompt engineering is necessary. The number of prompt-management platforms and libraries has grown manifold. Some tools now incorporate tweaks specific to the most recent commercial models, enabling prompts that carry model-specific formulations. Example libraries are dspy, LMQL, Outlines, and Prompttools.
What are some alternatives?
koboldcpp - Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
guidance - A guidance language for controlling large language models.
unsloth - Unsloth Studio is a web UI for training and running open models like Gemma 4, Qwen3.6, DeepSeek, gpt-oss locally.
langroid - Harness LLMs with Multi-Agent Programming
mlc-llm - Universal LLM Deployment Engine with ML Compilation
jsonformer - A Bulletproof Way to Generate Structured JSON from Language Models