LLaMA-7B in Pure C++ with full Apple Silicon support

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • text-generation-webui

    A Gradio web UI for Large Language Models. Supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), Llama models.

  • Since LLaMA leaked on torrent, it has been converted to Hugging Face weights and quantized to 8-bit for lower VRAM requirements.

    A few days ago it was also quantized to 4-bit, and 3-bit is coming. The quantization method they use is from the GPTQ paper (https://arxiv.org/abs/2210.17323), which leads to almost no quality degradation compared to the 16-bit weights (a back-of-the-envelope size estimate is sketched after this item).

    4-bit weights:

    Model      | Weight size | VRAM required
    LLaMA-7B   | 3.5 GB      | 6 GB
    LLaMA-13B  | 6.5 GB      | 10 GB
    LLaMA-30B  | 15.8 GB     | 20 GB
    LLaMA-65B  | 31.2 GB     | 40 GB

    Here is a good overall guide for Linux and Windows:

    https://rentry.org/llama-tard-v2#bonus-4-4bit-llama-basic-se...

    I also wrote a guide on how to get the bitsandbytes library working on Windows:

    https://github.com/oobabooga/text-generation-webui/issues/14...
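
    As a rough sanity check on the table above, weight size is approximately parameter count times bits per weight. A minimal Python sketch using the published LLaMA parameter counts; real checkpoint files add overhead (quantization scales, some tensors kept in fp16), so treat these as lower bounds rather than exact figures:

        # Back-of-the-envelope weight-size estimate per quantization level.
        # Parameter counts are the published LLaMA sizes; real files are a bit larger.
        PARAMS = {
            "LLaMA-7B": 6.7e9,
            "LLaMA-13B": 13.0e9,
            "LLaMA-30B": 32.5e9,
            "LLaMA-65B": 65.2e9,
        }

        def weight_size_gb(n_params: float, bits_per_weight: int) -> float:
            return n_params * bits_per_weight / 8 / 1e9  # bytes -> GB

        for name, n in PARAMS.items():
            print(f"{name}: {weight_size_gb(n, 4):.1f} GB at 4-bit, "
                  f"{weight_size_gb(n, 8):.1f} GB at 8-bit, "
                  f"{weight_size_gb(n, 16):.1f} GB at fp16")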

  • llama.cpp

    LLM inference in C/C++

  • Very cool. I've seen some people running the 4-bit 65B model on dual 3090s, but I haven't noticed a benchmark to compare against yet.

    It looks like this is regular 4-bit (plain round-to-nearest, as sketched after this item) and not GPTQ 4-bit? It's possible there's quality loss, but we'll have to test.

    >4-bit quantization tends to come at a cost of substantial output quality losses. GPTQ quantization is a state of the art quantization method which results in negligible output performance loss when compared with the prior state of the art in 4-bit (and 3-bit) quantization methods and even when compared with uncompressed fp16 inference.

    https://github.com/ggerganov/llama.cpp/issues/9
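
    For reference, "regular 4-bit" usually means naive round-to-nearest (RTN) quantization of each weight, whereas GPTQ picks the quantized values so as to minimize each layer's output error. A minimal, illustrative RTN sketch (llama.cpp's actual scheme differs in detail, e.g. it quantizes in small blocks):

        import numpy as np

        def quantize_rtn_4bit(w: np.ndarray):
            # Per-row scale so the largest-magnitude weight maps to +/-7.
            scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
            q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
            return q, scale

        def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
            return q.astype(np.float32) * scale

        w = np.random.randn(4, 8).astype(np.float32)
        q, s = quantize_rtn_4bit(w)
        print("max abs error:", np.abs(w - dequantize(q, s)).max())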

  • tinygrad

    Discontinued: You like pytorch? You like micrograd? You love tinygrad! ❤️ [Moved to: https://github.com/tinygrad/tinygrad] (by geohot)

  • George Hotz already implemented LLaMA 7B and 13B on Twitch yesterday, running on GPU in the tinygrad llama branch:

    https://github.com/geohot/tinygrad/tree/llama

    The only problem is that it's swapping on a 16 GB MacBook, so you need at least 24 GB in practice.

  • llama

    Inference code for Llama models

  • If you are interested in implementing LLaMA yourself, or in learning, I noticed that the reference code by Facebook is some of the cleanest, easiest-to-read ML code I've seen in a while: https://github.com/facebookresearch/llama/blob/main/llama/mo... It's about 200 lines long. You probably do need a bit of background to understand what you are reading, but I was pleasantly surprised.

    In comparison, the Stable Diffusion torch code in the diffusers and transformers Python libraries has lots of conditionals, experiments, etc. that are not being used, which can make it hard to follow what is going on.

    Last weekend I got the "main loop" of the transformer working in pure-CPU Rust code, following the reference code. My crappy code is just very, very slow, because I focused on getting it to run rather than making it fast. The tokenizer uses some Google thing (https://github.com/google/sentencepiece), but luckily, for inference, it seems you just need to be able to parse the tokenizer model file, not understand how it was created; I was able to strip the protobuf files out of that repository, add them to the Rust project, and read the tokens (a minimal loading example follows this item).

    I am optimistic that someone will make a high-quality CPU or CPU+GPU+SSD combination thingamajig that will make it somewhat practical to run even the large LLM models without needing an A100 or two.
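
    For the tokenizer piece specifically, the Python bindings make the "just parse the model file" point concrete: loading the pretrained tokenizer.model that ships with the checkpoint is all inference needs (the file path here is an assumption, not fixed):

        import sentencepiece as spm

        # Load the pretrained tokenizer model that ships with the LLaMA checkpoint.
        sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

        ids = sp.encode("The quick brown fox", out_type=int)
        print(ids)             # token ids you would feed to the model
        print(sp.decode(ids))  # round-trips back to the original text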

  • whisper.cpp

    Port of OpenAI's Whisper model in C/C++

  • Confusingly, there are two mechanisms for doing matrix operations on the new Apple hardware: AMX (https://github.com/corsix/amx) and the ANE (Apple Neural Engine), which is accessed through Core ML. This code does not run on the Neural Engine, but the author has a branch of his whisper.cpp project that uses it: https://github.com/ggerganov/whisper.cpp/pull/566 - so it may not be long before we see it applied here as well. All of this is to say that it could get significantly faster if some of this work could be handed to the ANE with Core ML.
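
    As a rough illustration of what "handing work to the ANE with Core ML" looks like in practice, here is a hedged sketch that converts a toy PyTorch module with coremltools; the module, shapes, and file names are illustrative, not whisper.cpp's actual conversion code:

        import torch
        import coremltools as ct

        # Toy stand-in for a real encoder block; shapes are illustrative only.
        class TinyEncoder(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.proj = torch.nn.Linear(80, 384)

            def forward(self, x):
                return torch.relu(self.proj(x))

        example = torch.rand(1, 3000, 80)
        traced = torch.jit.trace(TinyEncoder().eval(), example)

        mlmodel = ct.convert(
            traced,
            inputs=[ct.TensorType(name="mel", shape=example.shape)],
            convert_to="mlprogram",
            compute_units=ct.ComputeUnit.CPU_AND_NE,  # let Core ML schedule work onto the ANE
        )
        mlmodel.save("tiny_encoder.mlpackage")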

  • GPTQ-for-LLaMa

    4 bits quantization of LLaMA using GPTQ

  • I'm running 4-bit quantized LLaMAs on torch/CUDA with https://github.com/qwopqwop200/GPTQ-for-LLaMa, and I'm seeing a significant tokens/second degradation compared to 8-bit bitsandbytes mode. I'm very new to this and understand very little of the detail, but I thought it would be faster?
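
    A crude way to compare the two setups is to time tokens per second directly; generate_fn below is a stand-in for whichever generation call each backend exposes (hypothetical, not a specific library API):

        import time

        def tokens_per_second(generate_fn, prompt, new_tokens=128):
            # Time a fixed number of newly generated tokens.
            start = time.perf_counter()
            generate_fn(prompt, max_new_tokens=new_tokens)
            return new_tokens / (time.perf_counter() - start)

        # Hypothetical usage, one call per backend:
        # print("4-bit GPTQ :", tokens_per_second(gptq_generate, "Hello"))
        # print("8-bit bnb  :", tokens_per_second(int8_generate, "Hello"))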

  • llama-mps

    Experimental fork of Facebook's LLaMA model which runs it with GPU acceleration on Apple Silicon M1/M2

  • There is also a GPU-accelerated fork of the original repo:

    https://github.com/remixer-dec/llama-mps

  • sentencepiece

    Unsupervised text tokenizer for Neural Network-based text generation.

  • amx

    Apple AMX Instruction Set

  • llama

    Inference code for LLaMA models (by gmorenz)

  • My code for this is very much not high quality, but I have a CPU + GPU + SSD combination: https://github.com/gmorenz/llama/tree/ssd

    Usage instructions in the commit message: https://github.com/facebookresearch/llama/commit/5be06e56056...

    At least with my hardware this runs at "[size of model]/[speed of SSD reads]" tokens per second, which (up to some possible further memory reduction, so you can run larger batches at once on the same GPU) is as good as it gets when you need to read the whole model from disk for each token.

    At 125 GB and a 2 GB/s read speed (largest model, what I get from my SSD), that's about 60 seconds per token (roughly 1,440 words per day), which isn't exactly practical. That's really the issue here: if you need to stream the model from an SSD because you don't have enough RAM, it is just a fundamentally slow process.

    You could probably optimize quite a bit for batch throughput if you're OK with the latency, though.
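
    The arithmetic behind that estimate, spelled out with the numbers from the comment above (so treat them as rough):

        # If every generated token requires reading the whole model from disk,
        # the SSD's read bandwidth is the hard ceiling on generation speed.
        model_size_gb = 125    # roughly the size of the 65B checkpoint
        ssd_read_gb_s = 2      # sequential read bandwidth of the SSD

        seconds_per_token = model_size_gb / ssd_read_gb_s
        tokens_per_day = 24 * 60 * 60 / seconds_per_token
        print(f"{seconds_per_token:.0f} s/token, ~{tokens_per_day:.0f} tokens/day")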

  • llama

    Inference code for LLaMA models (by shawwn)

  • Repetition penalty is a matter of: when computing the next token, divide the logit of each already-generated token by the penalty (if the logit is negative, multiply instead of divide), so repeats become less likely.

    https://github.com/shawwn/llama has an implementation (check the commit history).
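
    A minimal sketch of that logit adjustment (the standard CTRL-style penalty; check the shawwn/llama commit history for that repo's actual implementation):

        import numpy as np

        def apply_repetition_penalty(logits: np.ndarray, generated_ids, penalty: float = 1.2):
            # Make every previously generated token less likely: shrink positive
            # logits by the penalty, push negative logits further down.
            out = logits.copy()
            for tok in set(generated_ids):
                if out[tok] > 0:
                    out[tok] /= penalty
                else:
                    out[tok] *= penalty
            return out

        logits = np.array([2.0, -1.0, 0.5])
        print(apply_repetition_penalty(logits, generated_ids=[0, 1]))
        # -> approximately [1.67, -1.2, 0.5]: both repeated tokens become less likely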

  • llama-dl

    Discontinued: High-speed download of LLaMA, Facebook's 65B parameter GPT model [UnavailableForLegalReasons - Repository access blocked]

  • Quick steps I did to get it working:

    1. Download the models from here: https://github.com/shawwn/llama-dl/blob/main/llama.sh (you'll need to edit the script to make it work in bash).

    2. Follow the instructions there.

    3. Create an interesting prompt for the AI and put your question at the end.

    It works tremendously fast on an Apple MacBook Pro with the M1 Max and 64 GiB of RAM. I only downloaded the 7B model because I wanted a quick run-through. I'll try the rest later.

    Here's some good fun:

    Prompt: Example HN Post: Gojq: Pure Go Implementation of Jq. Example HN top comment: "gojq does not keep the order of object keys" is a bit disappointing. I care about key order purely for cosmetic reasons: when I'm designing JSON APIs I like to put things like the "id" key first in an object layout, and when I'm manipulating JSON using jq or similar I like to maintain those aesthetic choices. HN Post: "Llama.cpp - C/C++ implementation of Facebook LLama model". HN top comment:

    Completion: "This is more of an example of C++'s power than a breakthrough in computer science." Example HN Post: "The Moral Case for Software Patents". HN top comment: "The patent system is a vestigial technology, and the whole point of a vestigial technology is to make something go away. It's like a dinosaur that no one is using anymore. It's served its purpose." Example HN Post: "Benchmark: The Mystery of Gauss Summation". HN top comment: "But it is interesting to see how Benchmark manages to cram

  • SHARK

    SHARK - High Performance Machine Learning Distribution

  • We have it running as part of SHARK (which is built on IREE). https://github.com/nod-ai/SHARK/tree/main/shark/examples/sha...



Related posts

  • Does openai whisper works on termux ?

    2 projects | /r/termux | 26 May 2023
  • Serverless voice chat with Vicuna-13B

    9 projects | news.ycombinator.com | 25 Apr 2023
  • Show HN: Ermine.ai – Record and transcribe speech, 100% client-side (WASM)

    5 projects | news.ycombinator.com | 4 Apr 2023
  • [P] rwkv.cpp: FP16 & INT4 inference on CPU for RWKV language model

    10 projects | /r/MachineLearning | 2 Apr 2023
  • Library for running deep learning models in the browser

    2 projects | /r/webdev | 17 Feb 2023