Llama.cpp: Full CUDA GPU Acceleration

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • llama.cpp

    LLM inference in C/C++

  • > In comparison I could just type `git clone https://github.com/ggerganov/llama.cpp` and `make`. And it worked.

    You're comparing a single, well-managed project that has put effort into user onboarding against every project in a different language, and proclaiming that an entire language/ecosystem is crap.

    The only real takeaway is that many projects, independent of language, put far too little effort into onboarding users.
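
    For reference, a minimal sketch of the build steps being praised here, plus the CUDA-enabled variant this thread is about (flag names reflect the Makefile options around the time of the post and may have changed since; check the repository README):

    ```sh
    # Clone and build llama.cpp; a plain `make` produces a CPU-only binary.
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    make

    # CUDA-accelerated build via cuBLAS (requires the CUDA toolkit installed).
    make clean
    make LLAMA_CUBLAS=1
    ```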

  • exllama

    A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

  • Not a 30 series, but on my 4090 I'm getting 32.0 tokens/s on a 13b q4_0 model (uses about 10GiB of VRAM) w/ full context (`-ngl 99 -n 2048 --ignore-eos` to force all layers on GPU memory and to do a full 2048 context). As a point of reference, currently exllama [1] runs a 4-bit GPTQ of the same 13b model at 83.5 tokens/s.

    [1] https://github.com/turboderp/exllama
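
    For context, an invocation along those lines might look like the sketch below (the model path is a placeholder, and `main` is the example binary that llama.cpp builds of that era produced; exact names are assumptions):

    ```sh
    # -ngl 99 offloads (effectively) all layers to the GPU; -n 2048 plus
    # --ignore-eos forces a full 2048-token generation for benchmarking.
    ./main -m models/13b/ggml-model-q4_0.bin \
           -ngl 99 -n 2048 --ignore-eos \
           -p "Once upon a time"
    ```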

  • lit-llama

    Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. Apache 2.0-licensed.

  • mlc-llm

    Enable everyone to develop, optimize and deploy AI models natively on everyone's devices.

  • With any luck, projects like MLC will help close the gap:

    https://github.com/mlc-ai/mlc-llm

  • TokenHawk

    Discontinued WebGPU LLM inference tuned by hand [Moved to: https://github.com/kayvr/token-hawk]

  • whisper.cpp

    Port of OpenAI's Whisper model in C/C++

  • flake

    A Nix flake for many AI projects

  • > Ideally, there's Nix (and poetry2nix) that could take care of everything, but only a few folks write Flakes for their projects.

    Relevant to "AI, Python, setting up is hard ... nix", there's stuff like:

    https://github.com/nixified-ai/flake

  • llama.cpp

    Port of Facebook's LLaMA model in C/C++ (by SlyEcho)

  • llama.cpp can be run with a speedup on AMD GPUs when compiled with `LLAMA_CLBLAST=1`, and there is also a HIPified fork [1] being worked on by a community contributor. The other week I was poking at how hard it would be to get an AMD card running with acceleration on Linux, and was pleasantly surprised that it wasn't too bad: https://mostlyobvious.org/?link=/Reference%2FSoftware%2FGene...

    That being said, it's important to note that ROCm is Linux only. Not only that, but ROCm's GPU support has actually been decreasing over the past few years. The current list: https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h... Previously (2022): https://docs.amd.com/bundle/Hardware_and_Software_Reference_...

    The ELI5 is that a few years back, AMD split their graphics (RDNA) and compute (CDNA) architectures, which Nvidia does too, but notably (and this is something Nvidia definitely doesn't do, and a key to their success IMO) AMD also decided they would simply not support any CUDA-parity compute features on Windows or on their non-"compute" cards. In practice, this means that community/open-source developers will never have AMD hardware to tinker with, port to, or develop on, while on Nvidia you can start with the GTX/RTX card in your laptop and use the same code all the way up to an H100 or DGX.

    llama.cpp is a super-high-profile project with almost 200 contributors now, but AFAIK no contributors from AMD. If AMD doesn't have the manpower, IMO they should simply be sending out free hardware to top open-source project/library developers (and on the software side, their #1 priority should be making sure every single current GPU they sell is at least "enabled", if not "supported", in ROCm, on both Linux and Windows).

    [1] https://github.com/SlyEcho/llama.cpp/tree/hipblas
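
    For concreteness, a rough sketch of the two AMD routes mentioned above (the CLBlast flag is quoted from the comment; the hipBLAS flag and the fork's branch layout are assumptions and may differ):

    ```sh
    # Route 1: OpenCL acceleration via CLBlast (works across most AMD cards).
    make clean
    make LLAMA_CLBLAST=1

    # Route 2: the community HIP/ROCm fork [1] (Linux only, ROCm-supported GPUs).
    git clone -b hipblas https://github.com/SlyEcho/llama.cpp llama.cpp-hip
    cd llama.cpp-hip
    make LLAMA_HIPBLAS=1
    ```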

  • llama_cpp.rb

    llama_cpp provides Ruby bindings for llama.cpp

  • Python sits in the C-glue segment of programming languages (where Perl, PHP, Ruby, and Node are also notable members). Being a glue language means having APIs to a lot of external toolchains written not only in C/C++ but in many other compiled languages, plus external APIs and system resources. Conda, virtualenv, etc. are godsend tools for making it all work, or even better, for freezing things once they all work, without resorting to Docker, VMs, or shell scripts. It's meant for application and DevOps people who need to slap together, e.g., ML, NumPy, Elasticsearch, AWS APIs, and REST endpoints and Get $hit Done.

    It's annoying to see these "glueys" compared to the compiled-binary segment where the heavy lifting is done. Python and the others exist to latch on and assimilate. Resistance is futile:

    https://pypi.org/project/pyllamacpp/

    https://www.npmjs.com/package/llama-node

    https://packagist.org/packages/kambo/llama-cpp-php

    https://github.com/yoshoku/llama_cpp.rb

  • serving

    A flexible, high-performance serving system for machine learning models

  • Yet another TEDIOUS BATTLE: Python vs. C++/C stack.

    This project gained popularity due to the HIGH DEMAND for running large models with 1B+ parameters, like `llama`. Python dominates the interface and training ecosystem, but prior to llama.cpp, non-ML professionals showed little interest in a fast C++ interface library. While existing solutions like tensorflow-serving [1] in C++ were sufficiently fast with GPU support, llama.cpp took the initiative to optimize for CPU and trim unnecessary code, essentially code-golfing and sacrificing some algorithm correctness for improved performance, which isn't favored by "ML research".

    NOTE: In my opinion, a true pioneer was DarkNet, which implemented the YOLO model series and significantly outperformed others [2]. Basically the same trick as llama.cpp.

    [1] https://github.com/tensorflow/serving

  • darknet

    Convolutional Neural Networks

