Llama.cpp Alternatives
Similar projects and alternatives to llama.cpp
- text-generation-webui
A Gradio web UI for Large Language Models with support for multiple inference backends.
- FastChat
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
- guidance
Discontinued: A guidance language for controlling large language models. [Moved to: https://github.com/guidance-ai/guidance] (by Microsoft)
- LocalAI
The free, open-source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. Features: text, audio, video and image generation, voice cloning, and distributed P2P inference.
- exllama
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
- llm
Discontinued [Unmaintained, see README]: An ecosystem of Rust libraries for working with large language models.
llama.cpp discussion
llama.cpp reviews and mentions
- 2024 In Review
As local runners I used llama.cpp and https://github.com/ml-explore/mlx – the latter uses Mac resources better (checked with macmon), but support for new models tends to arrive a bit later. Some people use ollama, but I didn't understand why, considering that its client is incompatible with the OpenAI client API and its performance is slower.
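For what it's worth, llama.cpp's bundled llama-server exposes an OpenAI-compatible endpoint, so the standard openai Python client can talk to it directly. A minimal sketch, assuming a server is already running locally; the port, API key, and model name below are placeholders:

```python
from openai import OpenAI

# Point the standard OpenAI client at a locally running llama.cpp server
# (e.g. started with: llama-server -m model.gguf --port 8080).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="local",  # the server accepts an arbitrary model name
    messages=[{"role": "user", "content": "Summarize 2024 in one sentence."}],
)
print(resp.choices[0].message.content)
```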
- Phi-4: Microsoft's Newest Small Language Model Specializing in Complex Reasoning
Compare performance on various Macs here as it gets updated:
https://github.com/ggerganov/llama.cpp/discussions/4167
On my machine, Llama 3.3 70B runs at ~7 text-generation tokens per second on a MacBook Pro (Max chip, 128GB), while generating GPT-4-feeling text with more in-depth responses and fewer bullet points. Llama 3.3 70B also doesn't fight the system prompt; it leans in.
Consider e.g. LM Studio (0.3.5 or newer) for a Metal (MLX)-centered UI, and include MLX in your search term when downloading models.
Also, do not scrimp on storage. At 60-100 GB per model, a day of experimentation can easily fill 2.5 TB of model cache. And remember to exclude that path from your Time Machine backups.
- Llama.cpp Now Supports Qwen2-VL (Vision Language Model)
- Structured Outputs with Ollama
If anyone needs more powerful output constraints, llama.cpp supports GBNF grammars:
https://github.com/ggerganov/llama.cpp/blob/master/grammars/...
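As an illustration, here is a minimal sketch using the llama-cpp-python bindings; the model path is a placeholder and the grammar is just a toy yes/no constraint:

```python
from llama_cpp import Llama, LlamaGrammar  # llama-cpp-python bindings

# A tiny GBNF grammar that only allows a "yes" or "no" completion.
YES_NO_GBNF = r'''
root ::= "yes" | "no"
'''

# model_path is a placeholder; point it at any local GGUF file.
llm = Llama(model_path="./models/model.gguf", n_ctx=2048)

grammar = LlamaGrammar.from_string(YES_NO_GBNF)
out = llm("Is the sky blue? Answer yes or no: ", grammar=grammar, max_tokens=4)
print(out["choices"][0]["text"])
```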
- Bringing K/V context quantisation to Ollama
Which has been part of llama.cpp since Dec 7, 2023 [2].
So... while this is great, somehow I'm left feeling vaguely put off by the comms around what is really "we finally support some config flag from llama.cpp that's been there for quite a long time".
Am I missing some context here?
[1] - https://github.com/ollama/ollama/commit/1bdab9fdb19f8a8c73ed...
[2] - since https://github.com/ggerganov/llama.cpp/commit/bcc0eb4591bec5...
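For context on why the flag matters, here is a back-of-the-envelope sketch of K/V cache sizing. The dimensions are illustrative (Llama-3-8B-like), and q8_0 is approximated as one byte per element, ignoring block overhead:

```python
# Back-of-the-envelope K/V cache sizing with illustrative model dimensions.
n_layers, n_kv_heads, head_dim = 32, 8, 128
n_ctx = 8192  # context length

def kv_cache_bytes(bytes_per_elem: float) -> float:
    # K and V each store n_kv_heads * head_dim values per token per layer.
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

f16 = kv_cache_bytes(2.0)  # f16 cache: 2 bytes per element
q8  = kv_cache_bytes(1.0)  # q8_0: ~1 byte per element (block overhead ignored)

print(f"f16 cache:  {f16 / 2**30:.2f} GiB")  # ~1.00 GiB
print(f"q8_0 cache: {q8 / 2**30:.2f} GiB")   # ~0.50 GiB
```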
- Llama.cpp guide – Running LLMs locally on any hardware, from scratch
Re the temperature config option: I've found it useful for generating something akin to a sampling-based confidence score for chat completions (e.g., set the temperature a bit high, run the model a few times, and calculate the distribution of responses). Otherwise I haven't figured out a good way to get confidence scores in llama.cpp (I've been tracking this GitHub issue about getting log_probs: https://github.com/ggerganov/llama.cpp/issues/6423)
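A rough sketch of that sampling-based approach, assuming a local llama.cpp server with the OpenAI-compatible endpoint; the port, model name, and temperature value are all placeholders:

```python
from collections import Counter
from openai import OpenAI

# Query a local llama.cpp server several times at elevated temperature and
# treat the agreement rate among answers as a crude confidence score.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

def sampled_confidence(question: str, n: int = 10) -> Counter:
    answers = Counter()
    for _ in range(n):
        resp = client.chat.completions.create(
            model="local",
            temperature=1.2,  # deliberately a bit high
            max_tokens=8,
            messages=[{"role": "user",
                       "content": f"{question} Answer with a single word."}],
        )
        answers[resp.choices[0].message.content.strip().lower()] += 1
    return answers

counts = sampled_confidence("Is 17 a prime number?")
top, freq = counts.most_common(1)[0]
print(f"answer={top!r} agreement={freq / sum(counts.values()):.0%}")
```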
- Writing an AnythingLLM Custom Agent Skill to Trigger Make.com Webhooks
Recently I've been experimenting with running a local llama.cpp server and looking for third-party applications to connect to it. There have been a lot of popular solutions for running models downloaded from Hugging Face locally, but many of them want to import the model themselves using the llama.cpp or Ollama libraries instead of connecting to an external provider. I'm more interested in using this technology as a server, so that I start it once and let clients connect to it without each application needing to go through its own long, resource-intensive start-up process. One of the applications I've found great for chatting with in my setup is AnythingLLM.
- 6 Easy Ways to Run LLM Locally + Alpha
- OpenCoder: Open-Source LLM for Coding
I am happily using qwen2.5-coder-7b-instruct-q3_k_m.gguf with a context size of 32768 on an RTX 3060 Mobile with 6GB VRAM using llama.cpp [2]. With 16GB VRAM, you could use qwen2.5-7b-instruct-q8_0.gguf which is basically indistinguishable from the fp16 variant.
[1] https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF
[2] https://github.com/ggerganov/llama.cpp
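For reference, a minimal sketch of that kind of setup via the llama-cpp-python bindings; the model path and n_gpu_layers value are assumptions you would tune to fit your own VRAM:

```python
from llama_cpp import Llama

# Run a quantized Qwen2.5 coder GGUF with a large context and partial GPU
# offload. Reduce n_gpu_layers until the model fits your card's VRAM.
llm = Llama(
    model_path="./models/qwen2.5-coder-7b-instruct-q3_k_m.gguf",
    n_ctx=32768,      # context size used in the comment above
    n_gpu_layers=20,  # offload only part of the model on a 6GB card
)

out = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "Write a Python function that reverses a string."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```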
- Quantized Llama models with increased speed and a reduced memory footprint
I’ve only toyed with them a bit, and had a similar experience - but did find I got better output by forcing them to adhere to a fixed grammar: https://github.com/ggerganov/llama.cpp/tree/master/grammars
Stats
ggerganov/llama.cpp is an open-source project licensed under the MIT License, an OSI-approved license.
The primary programming language of llama.cpp is C++.