llama.cpp
unsloth
| | llama.cpp | unsloth |
|---|---|---|
| Mentions | 1,032 | 34 |
| Stars | 115,929 | 66,397 |
| Growth | 7.4% | 5.3% |
| Activity | 10.0 | 10.0 |
| Latest commit | 3 days ago | about 11 hours ago |
| Language | C++ | Python |
| License | MIT License | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
llama.cpp
-
How to Setup a Local Coding Agent on macOS
> The benchmark prompt was:
> Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.
> Each benchmark generated about 128 tokens.
Generating 128 tokens is probably not enough for good benchmark results. MTP speedup depends on how often the predicted tokens are accepted. In my experience, the very early output has a higher acceptance rate, so short testing can give false positive speedups.
Also llama.cpp includes a tool specifically for benchmarking:
https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
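The dependence on acceptance rate can be made concrete with a rough model. The sketch below assumes each drafted token is accepted independently with probability `p` and that `k` tokens are drafted per step. The function name and the independence assumption are illustrative only; this is not how llama.cpp measures or reports anything. Under that assumption, the expected number of tokens emitted per target-model pass is (1 - p^(k+1)) / (1 - p).

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumption (hypothetical model): each of the k drafted tokens is
    accepted independently with probability p, acceptance stops at the
    first rejection, and the target model always contributes one token
    of its own after the accepted prefix.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if p == 1.0:
        # Every draft token is accepted, plus the target model's own token.
        return float(k + 1)
    # Geometric series: 1 + p + p^2 + ... + p^k
    return (1.0 - p ** (k + 1)) / (1.0 - p)


if __name__ == "__main__":
    for p in (0.6, 0.8, 0.9):
        tokens = expected_tokens_per_step(p, 3)
        print(f"acceptance={p:.1f}  tokens/step={tokens:.2f}")
```

This ignores the cost of producing the drafts, so it is an upper bound on the speedup rather than a prediction. It does show why a run that only samples the high-acceptance early output can overstate the gain.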
-
Doubling Qwen3.6-27B on One RTX 3090: ollama → llama.cpp + MTP, Lever by Lever (35.7 → 80.2 tok/s)
In my build, MTP came from mainline llama.cpp, not ik_llama. ik_llama got me to ~47 (engine + quant), but I couldn't get MTP running there — my build rejected the -mtp flags and ignored the model's nextn tensors. Mainline llama.cpp added MTP fairly recently (PR #22673, merged 2026-05-16), and that's where it worked for me. (There may well be an ik_llama path I missed — this is just what got it going on my box.)
- New `llama.cpp` Updates, AI Agents for Any LLM, and Quantized Vector Index for Local Inference
- Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency
-
Introducing LlamaStash: a zero-overhead, terminal-native llama.cpp launcher
That script grew up. Today I'm releasing LlamaStash, the first public release of a fast, cross-platform, terminal-native launcher for llama.cpp with zero overhead.
-
How fast is LlamaStash? Overhead, throughput, and a fair comparison with Ollama and LM Studio
LlamaStash spawns the unmodified upstream llama-server. So three different questions follow from that, and there is a benchmark suite for each.
-
A 10 year old Xeon is all you need (for 26B-A4B MTP Drafters without GPU)
llama.cpp includes a benchmarking tool called llama-bench https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
ik_llama includes llama-sweep-bench https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples...
When comparing hardware, the output of these tools is very helpful to let others put it into context. The post says the output is "reading speed" but knowing the prefill and token generation speeds would be a lot more helpful.
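To illustrate why the two numbers are worth reporting separately, here is a minimal sketch that derives prefill and generation throughput from raw timings. The `RunTiming` class and the figures in the usage example are hypothetical and are not output from llama-bench or llama-sweep-bench.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    """Timings for one inference run (hypothetical container)."""
    prompt_tokens: int
    prefill_seconds: float
    generated_tokens: int
    generation_seconds: float

    def prefill_tps(self) -> float:
        # Prompt-processing speed: prompt tokens per second.
        return self.prompt_tokens / self.prefill_seconds

    def generation_tps(self) -> float:
        # Decode speed: generated tokens per second.
        return self.generated_tokens / self.generation_seconds

    def combined_tps(self) -> float:
        # A single blended figure; it hides how the time splits between phases.
        total_tokens = self.prompt_tokens + self.generated_tokens
        total_seconds = self.prefill_seconds + self.generation_seconds
        return total_tokens / total_seconds


if __name__ == "__main__":
    # Made-up numbers, chosen only to show how far the figures can diverge.
    run = RunTiming(prompt_tokens=2048, prefill_seconds=16.0,
                    generated_tokens=256, generation_seconds=32.0)
    print(f"prefill:    {run.prefill_tps():.1f} tok/s")
    print(f"generation: {run.generation_tps():.1f} tok/s")
    print(f"combined:   {run.combined_tps():.1f} tok/s")
```

With these made-up numbers the blended figure lands between the two phase speeds and describes neither, which is why a single "reading speed" is hard to compare across machines.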
-
Racket v9.2 is now available
lol the same way we implement all of the reduced precision fp8, fp4 types today: by storing them in the corresponding uint:
https://github.com/ggml-org/llama.cpp/discussions/15095
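As a sketch of what "storing them in the corresponding uint" can look like, the snippet below decodes a single byte as an FP8 value. It assumes the E4M3 layout (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7, no infinities); the comment does not say which format is meant, so both the layout and the function name are illustrative assumptions rather than llama.cpp's code.

```python
import math


def decode_fp8_e4m3(byte: int) -> float:
    """Interpret an 8-bit unsigned integer as an FP8 E4M3 value.

    Assumed layout (illustrative): bit 7 = sign, bits 6..3 = exponent
    with bias 7, bits 2..0 = mantissa. The all-ones exponent with an
    all-ones mantissa encodes NaN; there are no infinities.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must fit in 8 bits")
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return math.nan
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 1 - bias = -6.
        return sign * (mantissa / 8.0) * 2.0 ** -6
    # Normal: implicit leading 1.
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


if __name__ == "__main__":
    for raw in (0x38, 0xC0, 0x7E, 0x01):
        print(f"0x{raw:02X} -> {decode_fp8_e4m3(raw)}")
```

The storage type is an ordinary unsigned byte; only the decoding step gives the bits a floating-point meaning.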
- Run Gemma-4 E2B-it with llama.cpp on Raspberry Pi4
-
Gemma 4 dense by default: why your local agent doesn't want the MoE
    # Build llama.cpp with Metal backend
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp && cmake -B build -DGGML_METAL=ON && cmake --build build -j

    # Community-quantized GGUFs (Google ships safetensors; unsloth ships GGUF)
    huggingface-cli download unsloth/gemma-4-31B-it-GGUF \
      gemma-4-31B-it-Q4_K_M.gguf --local-dir .
    huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \
      gemma-4-26B-A4B-it-Q4_K_M.gguf --local-dir .

    # Benchmark: 200 generations of 512 tokens, log per-call timing
    ./build/bin/llama-bench -m gemma-4-31B-it-Q4_K_M.gguf -n 512 -r 200 -o json > dense.json
    ./build/bin/llama-bench -m gemma-4-26B-A4B-it-Q4_K_M.gguf -n 512 -r 200 -o json > moe.json
unsloth
-
I Trained an LLM on 75K of My Own Messages So It Would Stop Writing Like a Chatbot
Training: unsloth + trl (SFTTrainer). Unsloth handles the 4-bit quantization and gradient checkpointing; trl handles the training loop.
-
How to Fine-Tune Llama 3.1 8B for Under $5 Using QLoRA in 2026 – A Practical Guide
    pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
    pip install --no-deps xformers trl peft accelerate bitsandbytes
- Show HN: Unsloth Studio
-
An Experiment in Voice: What Happens When AI Learns to Write Like You
Here's the thing that made this experiment actually feasible: Unsloth.
-
I Built an Automated SLM Fine-Tuning Engine with Python and Unsloth 🚀
For the actual training, I rely on Unsloth. It makes fine-tuning 2x faster and uses 70% less memory.
- Unsloth – Train LLMs 2x faster with 70% less VRAM
-
AI Infrastructure on Consumer Hardware
Unsloth - Efficient fine-tuning
- Why ML Needs a New Programming Language
-
Qwen3-235B-A22B-Thinking-2507
Oh, for finetuning: we do have some code for MoE finetuning for Qwen at https://github.com/unslothai/unsloth, but we haven't announced it yet!
-
When Fine-Tuning Makes Sense: A Developer's Guide
Lots of tools for each of those separately (RAG and fine-tuning). We're working on combining them but it's not ready yet.
You don't need a big GPU cluster. Fine-tuning is quite accessible via both APIs and local tools. Some suggestions:
- getkiln.ai (biased, my tool): lets you try all of the below, and compare/eval the resulting models
- API based tuning for closed models: OpenAI, Google Gemini
- API based tuning for open models: Together.ai, Fireworks.ai
- Local tuning for open models: https://unsloth.ai (can be run on Google Colab instances if you don't have local Nvidia GPUs).
What are some alternatives?
koboldcpp - Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
axolotl - Go ahead and axolotl questions
mlc-llm - Universal LLM Deployment Engine with ML Compilation
DeepSpeed - DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
onnxruntime - ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
accelerate - 🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support