LLaMA Now Goes Faster on CPUs

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • llama.cpp

    LLM inference in C/C++

  • Yes, this is really a phenomenal effort, and it's what open source is about: bringing improvements to so many use cases, so that Intel and AMD chip users can take full advantage of their hardware's high-performance capabilities, making even old parts competitive.

    Two PRs have been raised for merging into llama.cpp:

    https://github.com/ggerganov/llama.cpp/pull/6414

    https://github.com/ggerganov/llama.cpp/pull/6412

    Hopefully these can be accepted without drama, as the many downstream projects that depend on llama.cpp will also benefit.

    Though of course everyone should also look directly at releases from llamafile: https://github.com/Mozilla-Ocho/llamafile.

  • llamafile

    Distribute and run LLMs with a single file.

  • While I did not succeed in making the matmul code from https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafil... work in isolation, I compared Eigen, OpenBLAS, and MKL: https://gist.github.com/Dobiasd/e664c681c4a7933ef5d2df7caa87...

    In this (very primitive!) benchmark, MKL was a bit better than Eigen (~10%) on my machine (i5-6600).

    Since the article https://justine.lol/matmul/ compared the new kernels with MKL, we can (by transitivity) compare the new kernels with Eigen this way, at least very roughly, for this one use case.
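
    As a hedged aside, here is a minimal sketch of this kind of single-precision GEMM timing with Eigen; the matrix size, iteration count, and build line are illustrative assumptions, not the gist's actual parameters:

    ```cpp
    // Toy GEMM benchmark sketch (illustrative parameters, not the gist's).
    // Hypothetical build: g++ -O3 -march=native -I/path/to/eigen bench.cc
    #include <Eigen/Dense>
    #include <chrono>
    #include <cstdio>

    int main() {
        const int n = 1024;      // assumed problem size
        const int iters = 10;    // assumed repetition count
        Eigen::MatrixXf a = Eigen::MatrixXf::Random(n, n);
        Eigen::MatrixXf b = Eigen::MatrixXf::Random(n, n);
        Eigen::MatrixXf c(n, n);

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; ++i)
            c.noalias() = a * b;  // GEMM; noalias() avoids a temporary
        auto t1 = std::chrono::steady_clock::now();

        double sec = std::chrono::duration<double>(t1 - t0).count() / iters;
        // Multiplying two n x n matrices costs ~2*n^3 floating-point ops.
        std::printf("%.2f GFLOP/s\n", 2.0 * n * n * n / sec / 1e9);
    }
    ```

    The same timing loop, retargeted at OpenBLAS's or MKL's cblas_sgemm, is how such libraries are typically compared.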

  • whisper.cpp

    Port of OpenAI's Whisper model in C/C++

  • ollama

    Get up and running with Llama 3, Mistral, Gemma, and other large language models.

  • disaster

    Disassemble C/C++ code under cursor in Emacs

  • This is probably what they are referring to: https://github.com/jart/disaster

  • OpenBLAS

    OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.

  • The Fortran implementation is just a reference implementation. The goal of reference BLAS [0] is to provide relatively simple and easy to understand implementations which demonstrate the interface and are intended to give correct results to test against. Perhaps an exceptional Fortran compiler which doesn't yet exist could generate code which rivals hand (or automatically) tuned optimized BLAS libraries like OpenBLAS [1], MKL [2], ATLAS [3], and those based on BLIS [4], but in practice this is not observed.

    Justine observed that the threading model LLaMA uses makes it impractical to integrate one of these optimized BLAS libraries, so she wrote her own hand-tuned implementations following the same principles they use (a toy sketch of those principles follows the links below).

    [0] https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprogra...

    [1] https://github.com/OpenMathLib/OpenBLAS

    [2] https://www.intel.com/content/www/us/en/developer/tools/onea...

    [3] https://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Alg...

    [4] https://en.wikipedia.org/wiki/BLIS_(software)
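
    To make those principles concrete, here is a toy illustration (not code from llamafile or any of the libraries above) of the core idea behind optimized GEMM kernels: computing the output in small register-sized tiles so each loaded element is reused many times, where the reference three-loop form rereads memory constantly:

    ```cpp
    #include <cstddef>

    // Reference-style SGEMM: C = A*B, row-major n x n matrices.
    // Correct but memory-bound, like the reference BLAS.
    void sgemm_naive(const float* a, const float* b, float* c, int n) {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (int k = 0; k < n; ++k)
                    acc += a[i * n + k] * b[k * n + j];
                c[i * n + j] = acc;
            }
    }

    // Blocked variant: 4x4 output tiles keep 16 accumulators live,
    // giving the compiler material to vectorize and cutting memory
    // loads per FLOP roughly 4x. Assumes n is a multiple of 4 for
    // brevity; real kernels handle the edges.
    void sgemm_tiled(const float* a, const float* b, float* c, int n) {
        for (int i = 0; i < n; i += 4)
            for (int j = 0; j < n; j += 4) {
                float acc[4][4] = {};
                for (int k = 0; k < n; ++k)
                    for (int di = 0; di < 4; ++di)
                        for (int dj = 0; dj < 4; ++dj)
                            acc[di][dj] += a[(i + di) * n + k]
                                         * b[k * n + (j + dj)];
                for (int di = 0; di < 4; ++di)
                    for (int dj = 0; dj < 4; ++dj)
                        c[(i + di) * n + (j + dj)] = acc[di][dj];
            }
    }
    ```

    Real kernels layer SIMD intrinsics, cache-level blocking, packing, and multithreading on top of this, which is where the libraries above spend their effort.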

  • alpaca.cpp

    Discontinued: Locally run an Instruction-Tuned Chat-Style LLM

  • Where's the 30B-in-6GB claim? Ctrl-F "GB" in your GH link finds [0], which is neither by jart nor by ggerganov but by another user, who is promptly told to look at [1], where Justine denies that claim.

      [0] https://github.com/antimatter15/alpaca.cpp/issues/182

  • gemma.cpp

    Lightweight, standalone C++ inference engine for Google's Gemma models.

  • For C++, also check out our https://github.com/google/gemma.cpp/blob/main/gemma.cc, which has direct calls to MatVec.
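
    For readers unfamiliar with the term, here is a scalar sketch of what a matrix-vector product looks like; the name MatVecSketch and its signature are assumptions for illustration, not gemma.cpp's actual MatVec API, which is SIMD-vectorized:

    ```cpp
    #include <cstddef>

    // out[r] = sum over c of mat[r*cols + c] * vec[c], for a row-major
    // rows x cols matrix. Illustrative only; not gemma.cpp's MatVec.
    void MatVecSketch(const float* mat, const float* vec,
                      float* out, std::size_t rows, std::size_t cols) {
        for (std::size_t r = 0; r < rows; ++r) {
            const float* row = mat + r * cols;
            float acc = 0.0f;
            for (std::size_t c = 0; c < cols; ++c)
                acc += row[c] * vec[c];
            out[r] = acc;
        }
    }
    ```

    In transformer inference at batch size 1, most of the work reduces to exactly such matrix-vector products, which is why an engine can profitably call MatVec directly instead of a general GEMM.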

  • BigDL

    Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max). A PyTorch LLM library that seamlessly integrates with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, etc.

  • Any performance benchmarks against Intel's IPEX-LLM [0] or others?

    [0] - https://github.com/intel-analytics/ipex-llm
