llama.cpp

LLM inference in C/C++ (by ggerganov)

Llama.cpp Alternatives

Similar projects and alternatives to llama.cpp

  1. text-generation-webui

    A Gradio web UI for Large Language Models with support for multiple inference backends.

  2. SaaSHub

    SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

  3. ollama

    Get up and running with Llama 3.3, Phi 4, Gemma 2, and other large language models.

  4. whisper.cpp

    Port of OpenAI's Whisper model in C/C++

  5. llama

    187 llama.cpp VS llama

    Inference code for Llama models

  6. koboldcpp

    Run GGUF models easily with a KoboldAI UI. One File. Zero Install.

  7. gpt4all

    GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.

  8. stanford_alpaca

    Code and documentation to train Stanford's Alpaca models, and generate the data.

  9. alpaca-lora

    107 llama.cpp VS alpaca-lora

    Instruct-tune LLaMA on consumer hardware

  10. mlc-llm

    90 llama.cpp VS mlc-llm

    Universal LLM Deployment Engine with ML Compilation

  11. alpaca.cpp

Discontinued. Locally run an Instruction-Tuned Chat-Style LLM

  12. FastChat

    86 llama.cpp VS FastChat

    An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

  13. guidance

    90 llama.cpp VS guidance

Discontinued. A guidance language for controlling large language models. [Moved to: https://github.com/guidance-ai/guidance] (by microsoft)

  14. LocalAI

    The free, open-source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many other model architectures. Features: text, audio, video and image generation, voice cloning, distributed and P2P inference.

  15. GPTQ-for-LLaMa

    4 bits quantization of LLaMA using GPTQ

  16. ggml

    73 llama.cpp VS ggml

    Tensor library for machine learning

  17. exllama

    65 llama.cpp VS exllama

    A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

  18. llamafile

    Distribute and run LLMs with a single file.

  19. llama-cpp-python

    Python bindings for llama.cpp

  20. outlines

    41 llama.cpp VS outlines

    Structured Text Generation

  21. llm

    41 llama.cpp VS llm

Discontinued. [Unmaintained, see README] An ecosystem of Rust libraries for working with large language models

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number generally means a better llama.cpp alternative or higher similarity.


llama.cpp reviews and mentions

Posts with mentions or reviews of llama.cpp. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2025-01-01.
  • 2024 In Review
    16 projects | dev.to | 1 Jan 2025
    As local runners, I used llama.cpp and https://github.com/ml-explore/mlx – the second one uses Mac resources better (checked through macmon), but new models come out a bit more slowly on it. Some people use ollama, but I didn't understand why, considering that its client API is incompatible with the OpenAI client API and its performance is slower.
  • Phi-4: Microsoft's Newest Small Language Model Specializing in Complex Reasoning
    3 projects | news.ycombinator.com | 15 Dec 2024
    Compare performance on various Macs here as it gets updated:

    https://github.com/ggerganov/llama.cpp/discussions/4167

    On my machine, Llama 3.3 70B runs at ~7 text-generation tokens per second on a 128 GB MacBook Pro (Max chip), while generating GPT-4-feeling text with more in-depth responses and fewer bullet points. Llama 3.3 70B also doesn't fight the system prompt; it leans in.

    Consider, e.g., LM Studio (0.3.5 or newer) for a Metal (MLX)-centered UI, and include MLX in your search term when downloading models.

    Also, do not scrimp on storage. At 60 GB - 100 GB per model, it takes a day of experimentation to use 2.5 TB of storage in your model cache. And remember to exclude that path from your Time Machine backups.

  • Llama.cpp Now Supports Qwen2-VL (Vision Language Model)
    1 project | news.ycombinator.com | 14 Dec 2024
  • Structured Outputs with Ollama
    7 projects | news.ycombinator.com | 6 Dec 2024
    If anyone needs more powerful output constraints, llama.cpp supports GBNF grammars (sketched below):

    https://github.com/ggerganov/llama.cpp/blob/master/grammars/...
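
    A minimal example of the idea, assuming the llama-cpp-python bindings (listed among the alternatives above); the model path, prompt and grammar here are placeholders, not taken from the linked page:

        # Sketch: constrain output with a GBNF grammar via llama-cpp-python.
        # The grammar below only lets the model answer "yes" or "no".
        from llama_cpp import Llama, LlamaGrammar

        grammar = LlamaGrammar.from_string(r'root ::= "yes" | "no"')

        llm = Llama(model_path="model.gguf", n_ctx=2048, verbose=False)  # placeholder model
        out = llm("Is the sky blue? Answer yes or no: ", grammar=grammar, max_tokens=4)
        print(out["choices"][0]["text"])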

  • Bringing K/V context quantisation to Ollama
    9 projects | news.ycombinator.com | 4 Dec 2024
    Which has been part of llama.cpp since Dec 7, 2023 [2].

    So... mmmm... while this is great, somehow I'm left feeling kind of vaguely put-off by the comms around what is really 'we finally support some config flag from llama.cpp that's been there for really quite a long time'.

    Am I missing some context here?

    [1] - https://github.com/ollama/ollama/commit/1bdab9fdb19f8a8c73ed...

    [2] - since https://github.com/ggerganov/llama.cpp/commit/bcc0eb4591bec5...
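
    For reference, a hedged sketch of what the setting in question looks like on the llama.cpp side, assuming the llama-cpp-python bindings expose the K/V cache types (the model path is a placeholder; on the llama.cpp CLI the rough equivalents are --cache-type-k / --cache-type-v):

        # Sketch: quantised K/V cache via llama-cpp-python (assumed API).
        # type_k/type_v take ggml type ids; 8 corresponds to GGML_TYPE_Q8_0.
        # A quantised V cache generally requires flash attention.
        from llama_cpp import Llama

        llm = Llama(
            model_path="model.gguf",  # placeholder
            n_ctx=8192,
            flash_attn=True,
            type_k=8,  # K cache quantised to Q8_0
            type_v=8,  # V cache quantised to Q8_0
        )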

  • Llama.cpp guide – Running LLMs locally on any hardware, from scratch
    5 projects | news.ycombinator.com | 29 Nov 2024
    Re the Temperature config option: I've found it useful for generating something akin to a sampling-based confidence score for chat completions (e.g., set the temperature a bit high, run the model a few times and calculate the distribution of responses; see the sketch below). Otherwise I haven't figured out a good way to get confidence scores in llama.cpp (I've been tracking this GitHub issue requesting log_probs: https://github.com/ggerganov/llama.cpp/issues/6423).
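
    A rough sketch of that sampling-based approach, assuming the llama-cpp-python bindings (model path and prompt are placeholders):

        # Sketch: estimate a crude confidence score by sampling the same prompt
        # several times at a higher temperature and tallying the answers.
        from collections import Counter
        from llama_cpp import Llama

        llm = Llama(model_path="model.gguf", n_ctx=2048, verbose=False)  # placeholder

        prompt = "Answer with one word. Is 7919 a prime number? "
        samples = [
            llm(prompt, temperature=1.2, max_tokens=4)["choices"][0]["text"].strip()
            for _ in range(10)
        ]

        dist = Counter(samples)
        answer, count = dist.most_common(1)[0]
        print(dist)                           # distribution of responses
        print(answer, count / len(samples))   # majority answer and its share
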
  • Writing an AnythingLLM Custom Agent Skill to Trigger Make.com Webhooks
    5 projects | dev.to | 27 Nov 2024
    Recently I've been experimenting with running a local llama.cpp server and looking for third-party applications to connect to it (see the sketch after this paragraph). It seems like there have been a lot of popular solutions for running models downloaded from Hugging Face locally, but many of them want to import the model themselves using the llama.cpp or Ollama libraries instead of connecting to an external provider. I'm more interested in using this technology as a server, so that I start it once and let clients connect to it without needing to shut down and go through each application's long, resource-intensive start-up process. One of the applications I've found great for chatting with in my setup is AnythingLLM.
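
    A minimal sketch of that server-plus-client setup, assuming llama.cpp's built-in OpenAI-compatible server and the openai Python client (the port, model name and prompt are placeholders):

        # Sketch: start the server once, e.g.  llama-server -m model.gguf --port 8080
        # then let any OpenAI-compatible client connect to it.
        from openai import OpenAI

        client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

        resp = client.chat.completions.create(
            model="local-model",  # arbitrary; the server serves whatever model it loaded
            messages=[{"role": "user", "content": "Say hello in one sentence."}],
        )
        print(resp.choices[0].message.content)
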
  • 6 Easy Ways to Run LLM Locally + Alpha
    8 projects | dev.to | 11 Nov 2024
  • OpenCoder: Open-Source LLM for Coding
    3 projects | news.ycombinator.com | 9 Nov 2024
    I am happily using qwen2.5-coder-7b-instruct-q3_k_m.gguf with a context size of 32768 on an RTX 3060 Mobile with 6GB VRAM using llama.cpp [2]. With 16GB VRAM, you could use qwen2.5-7b-instruct-q8_0.gguf which is basically indistinguishable from the fp16 variant.

    [1] https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF

    [2] https://github.com/ggerganov/llama.cpp
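
    A hedged sketch of that kind of setup via the llama-cpp-python bindings (the layer count is a placeholder to tune to your VRAM; the file name is the one from the comment above):

        # Sketch: large context on limited VRAM by offloading only some layers.
        from llama_cpp import Llama

        llm = Llama(
            model_path="qwen2.5-coder-7b-instruct-q3_k_m.gguf",
            n_ctx=32768,      # context size mentioned above
            n_gpu_layers=24,  # partial offload; -1 offloads every layer if it fits
        )
        out = llm("Write a Python function that reverses a string.", max_tokens=128)
        print(out["choices"][0]["text"])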

  • Quantized Llama models with increased speed and a reduced memory footprint
    7 projects | news.ycombinator.com | 24 Oct 2024
    I’ve only toyed with them a bit, and had a similar experience - but did find I got better output by forcing them to adhere to a fixed grammar: https://github.com/ggerganov/llama.cpp/tree/master/grammars

Stats

Basic llama.cpp repo stats
Mentions: 842
Stars: 70,872
Activity: 10.0
Last commit: 2 days ago

