LLaMA Now Goes Faster on CPUs

llama.cpp

Mentions: 769 | Stars: 56,891 | Activity: 10.0 | Language: C++

LLM inference in C/C++

Yes, this is really a phenomenal effort! And it is what open source is about: bringing improvements to so many use cases, so that users of Intel and AMD chips can take advantage of their high-performance capabilities, making even old parts competitive.
There are two PRs raised to merge to llama.cpp:
https://github.com/ggerganov/llama.cpp/pull/6414
https://github.com/ggerganov/llama.cpp/pull/6412
Hopefully these can be accepted without drama, as there are many downstream projects that depend on llama.cpp and will also benefit.
Though of course everyone should also look directly at releases from llamafile https://github.com/mozilla-Ocho/llamafile.

llamafile

Mentions: 34 | Stars: 13,765 | Activity: 9.6 | Language: C++

Distribute and run LLMs with a single file.

While I did not succeed in making the matmul code from https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafil... work in isolation, I compared eigen, openblas, and mkl: https://gist.github.com/Dobiasd/e664c681c4a7933ef5d2df7caa87...
In this (very primitive!) benchmark, MKL was a bit better than eigen (~10%) on my machine (i5-6600).
Since the article https://justine.lol/matmul/ compared the new kernels with MKL, we can (by transitivity) compare the new kernels with Eigen this way, at least very roughly for this one use case.
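
To make the kind of comparison described above concrete, here is a minimal timing harness in C++. It is only a sketch and is not the code from the linked gist: `matmul_naive`, `best_seconds`, and `gflops` are hypothetical names chosen for illustration, and the naive loop merely stands in for whichever library call (Eigen, OpenBLAS, MKL) is being measured.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Naive row-major matmul: C (m x n) = A (m x k) * B (k x n).
// Hypothetical stand-in for the library call under test.
void matmul_naive(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C,
                  std::size_t m, std::size_t n, std::size_t k) {
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = acc;
        }
    }
}

// Runs `f` `reps` times and returns the fastest wall-clock time in seconds.
// Taking the minimum reduces noise from other processes on the machine.
template <typename F>
double best_seconds(F&& f, int reps) {
    double best = 1e300;
    for (int r = 0; r < reps; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        f();
        const auto t1 = std::chrono::steady_clock::now();
        const double s = std::chrono::duration<double>(t1 - t0).count();
        if (s < best) best = s;
    }
    return best;
}

// A matmul performs 2*m*n*k floating-point operations
// (one multiply and one add per term of each dot product).
double gflops(std::size_t m, std::size_t n, std::size_t k, double seconds) {
    return 2.0 * double(m) * double(n) * double(k) / seconds / 1e9;
}
```

Timing two implementations on the same sizes with `best_seconds` and dividing their `gflops` values gives a speed ratio. Chaining ratios measured on different machines or matrix shapes, as the transitivity argument does, is only a rough guide.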

whisper.cpp

Mentions: 187 | Stars: 31,174 | Activity: 9.8 | Language: C

Port of OpenAI's Whisper model in C/C++

ollama

Mentions: 196 | Stars: 58,943 | Activity: 9.9 | Language: Go

Get up and running with Llama 3, Mistral, Gemma, and other large language models.

disaster

Mentions: 2 | Stars: 278 | Activity: 2.9 | Language: Emacs Lisp

Disassemble C/C++ code under cursor in Emacs

This is probably what they are referring to: https://github.com/jart/disaster

OpenBLAS

Mentions: 22 | Stars: 5,974 | Activity: 9.8 | Language: C

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.

The Fortran implementation is just a reference implementation. The goal of reference BLAS [0] is to provide relatively simple and easy to understand implementations which demonstrate the interface and are intended to give correct results to test against. Perhaps an exceptional Fortran compiler which doesn't yet exist could generate code which rivals hand (or automatically) tuned optimized BLAS libraries like OpenBLAS [1], MKL [2], ATLAS [3], and those based on BLIS [4], but in practice this is not observed.
Justine observed that the threading model for LLaMA makes it impractical to integrate one of these optimized BLAS libraries, so she wrote her own hand-tuned implementations following the same principles they use.
[0] https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprogra...
[1] https://github.com/OpenMathLib/OpenBLAS
[2] https://www.intel.com/content/www/us/en/developer/tools/onea...
[3] https://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Alg...
[4] https://en.wikipedia.org/wiki/BLIS_(software)

alpaca.cpp

Mentions: 94 | Stars: 9,878 | Activity: 9.4 | Language: C

Discontinued. Locally run an Instruction-Tuned Chat-Style LLM

Where's the 30B-in-6GB claim? Searching (Ctrl+F) for "GB" in your GH link finds [0], which is neither by jart nor by ggerganov but by another user, who promptly gets told to look at [1], where Justine denies that claim.
  [0] https://github.com/antimatter15/alpaca.cpp/issues/182

gemma.cpp

Mentions: 8 | Stars: 5,476 | Activity: 9.0 | Language: C++

lightweight, standalone C++ inference engine for Google's Gemma models.

For C++, also check out our https://github.com/google/gemma.cpp/blob/main/gemma.cc, which has direct calls to MatVec.

BigDL

Mentions: 5 | Stars: 5,957 | Activity: 9.9 | Language: Python

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max). A PyTorch LLM library that seamlessly integrates with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, etc.

Any performance benchmarks against Intel's 'IPEX-LLM' [0] or others?
[0] - https://github.com/intel-analytics/ipex-llm

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
