LLaMA-7B in Pure C++ with full Apple Silicon support

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • text-generation-webui

    A Gradio web UI for Large Language Models. Supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), Llama models.

  • Since LLaMA leaked on torrent, it has been converted to Hugging Face weights and quantized to 8-bit for lower VRAM requirements.

    A few days ago it was also quantized to 4-bit, and 3-bit is coming. The quantization method they use is from the GPTQ paper (https://arxiv.org/abs/2210.17323), which leads to almost no quality degradation compared to the 16-bit weights (a back-of-the-envelope size estimate is sketched after this item).

    4-bit weights:

    Model      | Weight size | VRAM required
    LLaMA-7B   | 3.5 GB      | 6 GB
    LLaMA-13B  | 6.5 GB      | 10 GB
    LLaMA-30B  | 15.8 GB     | 20 GB
    LLaMA-65B  | 31.2 GB     | 40 GB

    Here is a good overall guide for Linux and Windows:

    https://rentry.org/llama-tard-v2#bonus-4-4bit-llama-basic-se...

    I also wrote a guide on how to get the bitsandbytes library working on Windows:

    https://github.com/oobabooga/text-generation-webui/issues/14...
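
    As a rough sanity check on the table above, weight size is approximately parameter count times bits per weight. A minimal Python sketch using the published LLaMA parameter counts; real checkpoint files add overhead (quantization scales, some tensors kept in fp16), so treat these as lower bounds rather than exact figures:

        # Back-of-the-envelope weight-size estimate per quantization level.
        # Parameter counts are the published LLaMA sizes; real files are a bit larger.
        PARAMS = {
            "LLaMA-7B": 6.7e9,
            "LLaMA-13B": 13.0e9,
            "LLaMA-30B": 32.5e9,
            "LLaMA-65B": 65.2e9,
        }

        def weight_size_gb(n_params: float, bits_per_weight: int) -> float:
            return n_params * bits_per_weight / 8 / 1e9  # bytes -> GB

        for name, n in PARAMS.items():
            print(f"{name}: {weight_size_gb(n, 4):.1f} GB at 4-bit, "
                  f"{weight_size_gb(n, 8):.1f} GB at 8-bit, "
                  f"{weight_size_gb(n, 16):.1f} GB at fp16")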

  • llama.cpp

    LLM inference in C/C++

  • Very cool. I've seen some people running the 4-bit 65B model on dual 3090s, but I haven't noticed a benchmark to compare against yet.

    It looks like this is regular 4-bit (plain round-to-nearest, as sketched after this item) and not GPTQ 4-bit? It's possible there's quality loss, but we'll have to test.

    >4-bit quantization tends to come at a cost of substantial output quality losses. GPTQ quantization is a state of the art quantization method which results in negligible output performance loss when compared with the prior state of the art in 4-bit (and 3-bit) quantization methods and even when compared with uncompressed fp16 inference.

    https://github.com/ggerganov/llama.cpp/issues/9
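
    For reference, "regular 4-bit" usually means naive round-to-nearest (RTN) quantization of each weight, whereas GPTQ picks the quantized values so as to minimize each layer's output error. A minimal, illustrative RTN sketch (llama.cpp's actual scheme differs in detail, e.g. it quantizes in small blocks):

        import numpy as np

        def quantize_rtn_4bit(w: np.ndarray):
            # Per-row scale so the largest-magnitude weight maps to +/-7.
            scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
            q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
            return q, scale

        def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
            return q.astype(np.float32) * scale

        w = np.random.randn(4, 8).astype(np.float32)
        q, s = quantize_rtn_4bit(w)
        print("max abs error:", np.abs(w - dequantize(q, s)).max())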

  • tinygrad

    Discontinued: You like pytorch? You like micrograd? You love tinygrad! ❤️ [Moved to: https://github.com/tinygrad/tinygrad] (by geohot)

  • George Hotz already implemented LLaMA 7B and 13B on Twitch yesterday, running on GPU in the tinygrad llama branch:

    https://github.com/geohot/tinygrad/tree/llama

    The only problem is that it's swapping on a 16 GB MacBook, so you need at least 24 GB in practice.

  • llama

    Inference code for Llama models

  • If you are interested in implementing LLaMA yourself, or in learning, I noticed that the reference code by Facebook is some of the cleanest, easiest-to-read ML code I've seen in a while: https://github.com/facebookresearch/llama/blob/main/llama/mo... It's about 200 lines long. You probably do need a bit of background to understand what you are reading, but I was pleasantly surprised.

    In comparison, the Stable Diffusion torch code in the diffusers and transformers Python libraries has lots of conditionals, experiments, etc. that are not being used, which can make it hard to follow what is going on.

    Last weekend I got the "main loop" of the transformer working in pure-CPU Rust code, following the reference code. My crappy code is just very, very slow, because I focused on getting it to run rather than making it fast. The tokenizer uses some Google thing (https://github.com/google/sentencepiece), but luckily, for inference, it seems you just need to be able to parse the tokenizer model file, not understand how it was created; I was able to strip the protobuf files out of that repository, add them to the Rust project, and read the tokens (a minimal loading example follows this item).

    I am optimistic that someone will make a high-quality CPU or CPU+GPU+SSD combination thingamajig that will make it somewhat practical to run even the large LLM models without needing an A100 or two.
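
    For the tokenizer piece specifically, the Python bindings make the "just parse the model file" point concrete: loading the pretrained tokenizer.model that ships with the checkpoint is all inference needs (the file path here is an assumption, not fixed):

        import sentencepiece as spm

        # Load the pretrained tokenizer model that ships with the LLaMA checkpoint.
        sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

        ids = sp.encode("The quick brown fox", out_type=int)
        print(ids)             # token ids you would feed to the model
        print(sp.decode(ids))  # round-trips back to the original text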

  • whisper.cpp

    Port of OpenAI's Whisper model in C/C++

  • Confusingly, there are two mechanisms for doing matrix operations on the new Apple hardware: AMX (https://github.com/corsix/amx) and the ANE (Apple Neural Engine), which is accessed through Core ML. This code does not run on the Neural Engine, but the author has a branch of his whisper.cpp project that uses it: https://github.com/ggerganov/whisper.cpp/pull/566 - so it may not be long before we see it applied here as well. All of this is to say that it could get significantly faster if some of this work could be handed to the ANE with Core ML.
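
    As a rough illustration of what "handing work to the ANE with Core ML" looks like in practice, here is a hedged sketch that converts a toy PyTorch module with coremltools; the module, shapes, and file names are illustrative, not whisper.cpp's actual conversion code:

        import torch
        import coremltools as ct

        # Toy stand-in for a real encoder block; shapes are illustrative only.
        class TinyEncoder(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.proj = torch.nn.Linear(80, 384)

            def forward(self, x):
                return torch.relu(self.proj(x))

        example = torch.rand(1, 3000, 80)
        traced = torch.jit.trace(TinyEncoder().eval(), example)

        mlmodel = ct.convert(
            traced,
            inputs=[ct.TensorType(name="mel", shape=example.shape)],
            convert_to="mlprogram",
            compute_units=ct.ComputeUnit.CPU_AND_NE,  # let Core ML schedule work onto the ANE
        )
        mlmodel.save("tiny_encoder.mlpackage")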

  • GPTQ-for-LLaMa

    4 bits quantization of LLaMA using GPTQ

  • I'm running 4-bit quantized LLaMAs on torch/CUDA with https://github.com/qwopqwop200/GPTQ-for-LLaMa, and I'm seeing a significant tokens/second degradation compared to 8-bit bitsandbytes mode. I'm very new to this and understand very little of the detail, but I thought it would be faster?
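
    A crude way to compare the two setups is to time tokens per second directly; generate_fn below is a stand-in for whichever generation call each backend exposes (hypothetical, not a specific library API):

        import time

        def tokens_per_second(generate_fn, prompt, new_tokens=128):
            # Time a fixed number of newly generated tokens.
            start = time.perf_counter()
            generate_fn(prompt, max_new_tokens=new_tokens)
            return new_tokens / (time.perf_counter() - start)

        # Hypothetical usage, one call per backend:
        # print("4-bit GPTQ :", tokens_per_second(gptq_generate, "Hello"))
        # print("8-bit bnb  :", tokens_per_second(int8_generate, "Hello"))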

  • llama-mps

    Experimental fork of Facebook's LLaMA model which runs it with GPU acceleration on Apple Silicon M1/M2

  • There is also a GPU-accelerated fork of the original repo:

    https://github.com/remixer-dec/llama-mps

  • sentencepiece

    Unsupervised text tokenizer for Neural Network-based text generation.

  • amx

    Apple AMX Instruction Set

  • llama

    Inference code for LLaMA models (by gmorenz)

  • My code for this is very much not high quality, but I have a CPU + GPU + SSD combination: https://github.com/gmorenz/llama/tree/ssd

    Usage instructions in the commit message: https://github.com/facebookresearch/llama/commit/5be06e56056...

    At least with my hardware this runs at "[size of model]/[speed of SSD reads]" tokens per second, which (up to some possible further memory reduction, so you can run larger batches at once on the same GPU) is as good as it gets when you need to read the whole model from disk for each token.

    At 125 GB and a 2 GB/s read speed (largest model, what I get from my SSD), that's about 60 seconds per token (roughly 1,440 words per day), which isn't exactly practical. That's really the issue here: if you need to stream the model from an SSD because you don't have enough RAM, it is just a fundamentally slow process.

    You could probably optimize quite a bit for batch throughput if you're OK with the latency, though.
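
    The arithmetic behind that estimate, spelled out with the numbers from the comment above (so treat them as rough):

        # If every generated token requires reading the whole model from disk,
        # the SSD's read bandwidth is the hard ceiling on generation speed.
        model_size_gb = 125    # roughly the size of the 65B checkpoint
        ssd_read_gb_s = 2      # sequential read bandwidth of the SSD

        seconds_per_token = model_size_gb / ssd_read_gb_s
        tokens_per_day = 24 * 60 * 60 / seconds_per_token
        print(f"{seconds_per_token:.0f} s/token, ~{tokens_per_day:.0f} tokens/day")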

  • llama

    Inference code for LLaMA models (by shawwn)

  • Repetition penalty is a matter of: when computing the next token, divide the logit of each already-generated token by the penalty (if the logit is negative, multiply instead of divide), so repeats become less likely.

    https://github.com/shawwn/llama has an implementation (check the commit history).
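
    A minimal sketch of that logit adjustment (the standard CTRL-style penalty; check the shawwn/llama commit history for that repo's actual implementation):

        import numpy as np

        def apply_repetition_penalty(logits: np.ndarray, generated_ids, penalty: float = 1.2):
            # Make every previously generated token less likely: shrink positive
            # logits by the penalty, push negative logits further down.
            out = logits.copy()
            for tok in set(generated_ids):
                if out[tok] > 0:
                    out[tok] /= penalty
                else:
                    out[tok] *= penalty
            return out

        logits = np.array([2.0, -1.0, 0.5])
        print(apply_repetition_penalty(logits, generated_ids=[0, 1]))
        # -> approximately [1.67, -1.2, 0.5]: both repeated tokens become less likely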

  • llama-dl

    Discontinued: High-speed download of LLaMA, Facebook's 65B parameter GPT model [UnavailableForLegalReasons - Repository access blocked]

  • Quick steps I did to get it working:

    1. Download the models from here: https://github.com/shawwn/llama-dl/blob/main/llama.sh (you'll need to edit the script to make it work in bash).

    2. Follow the instructions there.

    3. Create an interesting prompt for the AI and put your question at the end.

    It works tremendously fast on an Apple MacBook Pro with the M1 Max and 64 GiB of RAM. I only downloaded the 7B model because I wanted a quick run-through. I'll try the rest later.

    Here's some good fun:

    Prompt: Example HN Post: Gojq: Pure Go Implementation of Jq. Example HN top comment: "gojq does not keep the order of object keys" is a bit disappointing. I care about key order purely for cosmetic reasons: when I'm designing JSON APIs I like to put things like the "id" key first in an object layout, and when I'm manipulating JSON using jq or similar I like to maintain those aesthetic choices. HN Post: "Llama.cpp - C/C++ implementation of Facebook LLama model". HN top comment:

    Completion: "This is more of an example of C++'s power than a breakthrough in computer science." Example HN Post: "The Moral Case for Software Patents". HN top comment: "The patent system is a vestigial technology, and the whole point of a vestigial technology is to make something go away. It's like a dinosaur that no one is using anymore. It's served its purpose." Example HN Post: "Benchmark: The Mystery of Gauss Summation". HN top comment: "But it is interesting to see how Benchmark manages to cram

  • SHARK

    SHARK - High Performance Machine Learning Distribution

  • We have it running as part of SHARK (which is built on IREE). https://github.com/nod-ai/SHARK/tree/main/shark/examples/sha...



Related posts

  • Does openai whisper works on termux ?

    2 projects | /r/termux | 26 May 2023
  • Serverless voice chat with Vicuna-13B

    9 projects | news.ycombinator.com | 25 Apr 2023
  • Show HN: Ermine.ai – Record and transcribe speech, 100% client-side (WASM)

    5 projects | news.ycombinator.com | 4 Apr 2023
  • [P] rwkv.cpp: FP16 & INT4 inference on CPU for RWKV language model

    10 projects | /r/MachineLearning | 2 Apr 2023
  • Library for running deep learning models in the browser

    2 projects | /r/webdev | 17 Feb 2023