llama.cpp
llama
| | llama.cpp | llama |
|---|---|---|
| Mentions | 1,032 | 190 |
| Stars | 115,929 | 59,363 |
| Growth | 7.4% | 0.0% |
| Activity | 10.0 | 4.7 |
| Last commit | 3 days ago | over 1 year ago |
| Language | C++ | Python |
| License | MIT License | GNU General Public License v3.0 or later |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
llama.cpp
-
How to Setup a Local Coding Agent on macOS
> The benchmark prompt was:
> Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.
> Each benchmark generated about 128 tokens.
Generating 128 tokens is probably not enough for a good benchmark result. MTP speedup depends on how often the predicted tokens are accepted. In my experience, the very early output has a higher acceptance rate, so short tests can show misleadingly high speedups.
Also, llama.cpp includes a tool specifically for benchmarking:
https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
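To see why the acceptance rate matters, here is a rough sketch in Python. It assumes MTP behaves like a speculative draft of `draft_len` tokens per verification pass, with each drafted token accepted independently with probability `accept_rate`. Both names and all numbers are illustrative assumptions, not measurements from llama.cpp.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Simplified model (an assumption): each drafted token is accepted
    independently with probability accept_rate, and the pass always
    yields at least one token from the main model itself.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


# Hypothetical acceptance rates: higher early in the output, lower later.
early = expected_tokens_per_step(accept_rate=0.8, draft_len=3)
steady = expected_tokens_per_step(accept_rate=0.6, draft_len=3)
print(f"early: {early:.2f} tokens/pass, steady state: {steady:.2f} tokens/pass")
```

Under these assumptions, a run that only covers the high-acceptance early tokens would report more tokens per pass than the model sustains over a long generation.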
-
Doubling Qwen3.6-27B on One RTX 3090: ollama → llama.cpp + MTP, Lever by Lever (35.7 → 80.2 tok/s)
In my build, MTP came from mainline llama.cpp, not ik_llama. ik_llama got me to ~47 (engine + quant), but I couldn't get MTP running there — my build rejected the -mtp flags and ignored the model's nextn tensors. Mainline llama.cpp added MTP fairly recently (PR #22673, merged 2026-05-16), and that's where it worked for me. (There may well be an ik_llama path I missed — this is just what got it going on my box.)
- New `llama.cpp` Updates, AI Agents for Any LLM, and Quantized Vector Index for Local Inference
- Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency
-
Introducing LlamaStash: a zero-overhead, terminal-native llama.cpp launcher
That script grew up. Today I'm releasing LlamaStash, the first public release of a fast, cross-platform, terminal-native launcher for llama.cpp with zero overhead.
-
How fast is LlamaStash? Overhead, throughput, and a fair comparison with Ollama and LM Studio
LlamaStash spawns the unmodified upstream llama-server. So three different questions follow from that, and there is a benchmark suite for each.
-
A 10 year old Xeon is all you need (for 26B-A4B MTP Drafters without GPU)
llama.cpp includes a benchmarking tool called llama-bench https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
ik_llama includes llama-sweep-bench https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples...
When comparing hardware, the output of these tools is very helpful for letting others put the results into context. The post says the output is "reading speed", but knowing the prefill and token generation speeds separately would be a lot more helpful.
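A small Python sketch of why the two speeds are worth reporting separately. The `RunTiming` type and the timings below are hypothetical, chosen only to show how a single blended figure can hide a slow generation phase.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    """Timings for one inference run (hypothetical structure)."""
    prompt_tokens: int
    prompt_seconds: float
    generated_tokens: int
    generation_seconds: float


def throughput(t: RunTiming) -> tuple[float, float, float]:
    """Return (prefill tok/s, generation tok/s, blended tok/s)."""
    prefill = t.prompt_tokens / t.prompt_seconds
    generation = t.generated_tokens / t.generation_seconds
    blended = (t.prompt_tokens + t.generated_tokens) / (
        t.prompt_seconds + t.generation_seconds
    )
    return prefill, generation, blended


# Made-up numbers: a long prompt processed quickly, then slow generation.
run = RunTiming(prompt_tokens=512, prompt_seconds=4.0,
                generated_tokens=128, generation_seconds=16.0)
prefill, generation, blended = throughput(run)
print(f"prefill {prefill:.1f} tok/s, generation {generation:.1f} tok/s, "
      f"blended {blended:.1f} tok/s")
```

With these invented timings the blended figure is four times the generation speed, which is the number most readers actually feel when the model is writing.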
-
Racket v9.2 is now available
lol the same way we implement all of the reduced precision fp8, fp4 types today: by storing them in the corresponding uint:
https://github.com/ggml-org/llama.cpp/discussions/15095
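As an illustration of "storing them in the corresponding uint", here is a minimal Python sketch that decodes 4-bit floats packed two per byte. It assumes the E2M1 layout (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit) and a low-nibble-first packing order; both are assumptions made for the example, not a description of ggml's actual code or of any block scaling it applies.

```python
def decode_fp4_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 value held in the low 4 bits of an int."""
    if not 0 <= nibble <= 0xF:
        raise ValueError("nibble must be in 0..15")
    sign = -1.0 if nibble & 0b1000 else 1.0
    exponent = (nibble >> 1) & 0b11
    mantissa = nibble & 0b1
    if exponent == 0:
        # Subnormal: no implicit leading 1, so the value is 0 or 0.5.
        magnitude = 0.5 * mantissa
    else:
        # Normal: implicit leading 1, exponent bias of 1.
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


def unpack_fp4_pair(byte: int) -> tuple[float, float]:
    """Split a uint8 into two FP4 values (low nibble first, by assumption)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in 0..255")
    return decode_fp4_e2m1(byte & 0x0F), decode_fp4_e2m1(byte >> 4)


# All eight non-negative codes, in bit-pattern order.
print([decode_fp4_e2m1(n) for n in range(8)])
```

The container is an ordinary unsigned byte; only the decode step gives the bits a floating-point meaning.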
- Run Gemma-4 E2B-it with llama.cpp on Raspberry Pi4
-
Gemma 4 dense by default: why your local agent doesn't want the MoE
    # Build llama.cpp with Metal backend
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp && cmake -B build -DGGML_METAL=ON && cmake --build build -j

    # Community-quantized GGUFs (Google ships safetensors; unsloth ships GGUF)
    huggingface-cli download unsloth/gemma-4-31B-it-GGUF \
      gemma-4-31B-it-Q4_K_M.gguf --local-dir .
    huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \
      gemma-4-26B-A4B-it-Q4_K_M.gguf --local-dir .

    # Benchmark: 200 generations of 512 tokens, log per-call timing
    ./build/bin/llama-bench -m gemma-4-31B-it-Q4_K_M.gguf -n 512 -r 200 -o json > dense.json
    ./build/bin/llama-bench -m gemma-4-26B-A4B-it-Q4_K_M.gguf -n 512 -r 200 -o json > moe.json
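One way to compare the two result files, sketched in Python. It assumes each file holds a JSON array of result objects with `n_prompt`, `n_gen`, and `avg_ts` (average tokens per second) fields; check those names against the output of your own llama-bench build before relying on them. The sample numbers are invented.

```python
import json
from statistics import mean


def load_results(path: str) -> list[dict]:
    """Read a llama-bench JSON result file (assumed to be a JSON array)."""
    with open(path) as f:
        return json.load(f)


def generation_speed(results: list[dict]) -> float:
    """Mean tokens/s over generation-only entries.

    Field names (n_prompt, n_gen, avg_ts) are assumptions about the
    llama-bench JSON schema; adjust them if your build differs.
    """
    speeds = [
        r["avg_ts"]
        for r in results
        if r.get("n_gen", 0) > 0 and r.get("n_prompt", 0) == 0
    ]
    if not speeds:
        raise ValueError("no generation-only entries found")
    return mean(speeds)


# Invented sample data standing in for dense.json and moe.json.
dense_sample = [
    {"n_prompt": 512, "n_gen": 0, "avg_ts": 400.0},
    {"n_prompt": 0, "n_gen": 512, "avg_ts": 20.0},
]
moe_sample = [
    {"n_prompt": 512, "n_gen": 0, "avg_ts": 900.0},
    {"n_prompt": 0, "n_gen": 512, "avg_ts": 50.0},
]
ratio = generation_speed(moe_sample) / generation_speed(dense_sample)
print(f"MoE / dense generation speed (sample data): {ratio:.2f}x")
```

With real files, replace the samples with `load_results("dense.json")` and `load_results("moe.json")`.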
llama
-
Meta's Llama 3.1 can recall 42 percent of the first Harry Potter book
What are your thoughts on the origin of the LLaMA leak? It's interesting that the training data was torrented, and so was the leak. Perhaps we will never know? For the OSINT folks, not a lot to go on, or maybe a lot, depending?
https://en.wikipedia.org/wiki/Llama_(language_model)#Leak
https://archived.moe/g/thread/91848262#p91850335
https://github.com/meta-llama/llama/pull/73/files
-
🚀 25+ Open Source AI APIs, Models & Tools (with GitHub Repo Links)
Llama 2-Chat
-
Getting Forked by Microsoft
preceding calendar month, you must request a license from Meta...
ref: https://github.com/meta-llama/llama/blob/main/LICENSE
But again, not open source...
-
You Wouldn't Download an AI
IANAL, but this is not true: it would be a piece of the software. If there is a copyright on the app itself, it would extend to the model. Even models have licenses; for example, LLAMA is released under this license [1]
[1] https://github.com/meta-llama/llama/blob/main/LICENSE
-
LM Studio 0.3.0
Hello Hacker News, Yagil here- founder and original creator of LM Studio (now built by a team of 6!). I had the initial idea to build LM Studio after seeing the OG LLaMa weights ‘leak’ (https://github.com/meta-llama/llama/pull/73/files) and then later trying to run some TheBloke quants during the heady early days of ggerganov/llama.cpp. In my notes LM Studio was first “Napster for LLMs” which evolved later to “GarageBand for LLMs”.
What LM Studio is today is an IDE / explorer for local LLMs, with a focus on format universality (e.g. GGUF) and data portability (you can go to file explorer and edit everything). The main aim is to give you an accessible way to work with LLMs and make them useful for your purposes.
Folks point out that the product is not open source. However I think we facilitate distribution and usage of openly available AI and empower many people to partake in it, while protecting (in my mind) the business viability of the company. LM Studio is free for personal experimentation and we ask businesses to get in touch to buy a business license.
At the end of the day LM Studio is intended to be an easy yet powerful tool for doing things with AI without giving up personal sovereignty over your data. Our computers are super capable machines, and everything that can happen locally w/o the internet, should. The app has no telemetry whatsoever (you’re welcome to monitor network connections yourself) and it can operate offline after you download or sideload some models.
0.3.0 is a huge release for us. We added (naïve) RAG, internationalization, UI themes, and set up foundations for major releases to come.
- Open Source AI Is the Path Forward
-
Mark Zuckerberg: Llama 3, $10B Models, Caesar Augustus, Bioweapons [video]
derivative works thereof).”
https://github.com/meta-llama/llama/blob/b8348da38fde8644ef0...
Also, even if you did use Llama for something, they could unilaterally pull the rug on you once you reached 700 million users, AND anyone who thinks Meta broke their copyright loses their license. (Checking if you are still getting screwed is against the rules.)
Therefore, Zuckerberg is accountable for explicitly anticompetitive conduct. I assumed an MMA fighter would appreciate the value of competition; go figure.
-
Hello OLMo: A Open LLM
One thing I wanted to add and call attention to is the importance of licensing in open models. This is often overlooked when we blindly accept the vague branding of models as “open”, but I am noticing that many open weight models are actually using encumbered proprietary licenses rather than standard open source licenses that are OSI approved (https://opensource.org/licenses). As an example, Databricks’s DBRX model has a proprietary license that forces adherence to their highly restrictive Acceptable Use Policy by referencing a live website hosting their AUP (https://github.com/databricks/dbrx/blob/main/LICENSE), which means as they change their AUP, you may be further restricted in the future. Meta’s Llama is similar (https://github.com/meta-llama/llama/blob/main/LICENSE ). I’m not sure who can depend on these models given this flaw.
-
Reaching LLaMA2 Performance with 0.1M Dollars
It looks like Llama 2 7B took 184,320 A100-80GB GPU-hours to train[1]. This one says it used a 96×H100 GPU cluster for 2 weeks, which comes to 32,256 GPU-hours. That's 17.5% of the number of hours, but H100s are faster than A100s [2] and FP16/bfloat16 performance is ~3x better.
If they had tried to replicate Llama 2 identically with their hardware setup, it would have cost a little less than twice what their MoE model did.
[1] https://github.com/meta-llama/llama/blob/main/MODEL_CARD.md#...
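The arithmetic behind that estimate, as a short Python sketch. The 3x H100-over-A100 factor is the comment's own rough figure and is treated here as an assumption; the variable names are illustrative.

```python
LLAMA2_7B_A100_HOURS = 184_320   # A100-80GB GPU-hours quoted in the comment
H100_COUNT = 96                  # size of the H100 cluster
TRAINING_DAYS = 14               # "2 weeks"
H100_SPEEDUP_OVER_A100 = 3.0     # rough "~3x" figure, an assumption

# GPU-hours actually spent on the H100 cluster.
h100_hours = H100_COUNT * TRAINING_DAYS * 24

# Share of Llama 2 7B's GPU-hours, ignoring the hardware difference.
fraction_of_llama2_hours = h100_hours / LLAMA2_7B_A100_HOURS

# Llama 2 7B's budget converted to H100-hours, then compared to this run.
llama2_in_h100_hours = LLAMA2_7B_A100_HOURS / H100_SPEEDUP_OVER_A100
cost_ratio = llama2_in_h100_hours / h100_hours

print(f"H100 GPU-hours used: {h100_hours:,}")
print(f"Share of Llama 2 7B GPU-hours: {fraction_of_llama2_hours:.1%}")
print(f"Llama 2 7B replication vs. this run: {cost_ratio:.2f}x")
```

The ratio comes out just under 2, which matches the "a little less than twice" estimate, provided the 3x speedup holds for this workload.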
-
DBRX: A New Open LLM
Ironically, the LLaMA license text [1] that this is lifted verbatim from is itself copyrighted [2] and doesn't grant you permission to copy it or make changes like s/meta/dbrx/g lol.
[1] https://github.com/meta-llama/llama/blob/main/LICENSE#L65
What are some alternatives?
koboldcpp - Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
transformers - 🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
unsloth - Unsloth Studio is a web UI for training and running open models like Gemma 4, Qwen3.6, DeepSeek, gpt-oss locally.
ollama - Get up and running with Kimi-K2.6, GLM-5.1, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.
mlc-llm - Universal LLM Deployment Engine with ML Compilation
llmx - An API for Chat Fine-Tuned Large Language Models (llm)