-
text-generation-webui
A Gradio web UI for Large Language Models. Supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), Llama models.
-
tinygrad
Discontinued: You like pytorch? You like micrograd? You love tinygrad! ❤️ [Moved to: https://github.com/tinygrad/tinygrad] (by geohot)
-
llama-mps
Experimental fork of Facebook's LLaMA model that runs with GPU acceleration on Apple Silicon M1/M2
-
llama-dl
Discontinued: High-speed download of LLaMA, Facebook's 65B parameter GPT model [UnavailableForLegalReasons - Repository access blocked]
Since LLaMA leaked on torrent, it has been converted to Hugging Face weights and quantized to 8-bit for lower VRAM requirements.
A few days ago it was also quantized to 4-bit, and 3-bit is coming. The quantization method used is from the GPTQ paper ( https://arxiv.org/abs/2210.17323 ), which leads to almost no quality degradation compared to the 16-bit weights.
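For intuition, the simple baseline that GPTQ improves on is plain round-to-nearest (RTN) quantization. Here's a minimal RTN sketch (this is not GPTQ itself, which instead minimizes per-layer output error using second-order information):

```python
import numpy as np

def quantize_rtn_4bit(w):
    """Naive symmetric round-to-nearest 4-bit quantization (not GPTQ)."""
    scale = np.abs(w).max() / 7  # signed 4-bit range is [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(16).astype(np.float32)
q, s = quantize_rtn_4bit(w)
w_hat = dequantize(q, s)
# Per-weight error is bounded by scale/2; GPTQ's insight is that
# minimizing this is the wrong objective - layer *output* error matters.
err = np.abs(w - w_hat).max()
```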
4-bit weights:
Model, weight size, VRAM required
LLaMA-7B, 3.5GB, 6GB
LLaMA-13B, 6.5GB, 10GB
LLaMA-30B, 15.8GB, 20GB
LLaMA-65B, 31.2GB, 40GB
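As a sanity check on the weight sizes above: 4-bit weights work out to roughly half a byte per parameter (the real files are somewhat larger because of quantization scales and layers kept in fp16):

```python
# Rough weight-file size at 4 bits (0.5 bytes) per parameter.
# Nominal parameter counts; actual checkpoints differ slightly.
params_billion = {"LLaMA-7B": 7, "LLaMA-13B": 13, "LLaMA-30B": 32.5, "LLaMA-65B": 65.2}

for name, b in params_billion.items():
    gib = b * 1e9 * 0.5 / 2**30  # 4 bits/param, in GiB
    print(f"{name}: ~{gib:.1f} GiB of weights")
```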
Here is a good overall guide for Linux and Windows:
https://rentry.org/llama-tard-v2#bonus-4-4bit-llama-basic-se...
I also wrote a guide on how to get the bitsandbytes library working on Windows:
https://github.com/oobabooga/text-generation-webui/issues/14...
Very cool. I've seen some people running 4-bit 65B on dual 3090s, but I haven't seen a benchmark to compare yet.
It looks like this is regular 4-bit rather than GPTQ 4-bit? There may be quality loss, but we'll have to test.
>4-bit quantization tends to come at a cost of substantial output quality losses. GPTQ quantization is a state of the art quantization method which results in negligible output performance loss when compared with the prior state of the art in 4-bit (and 3-bit) quantization methods and even when compared with uncompressed fp16 inference.
https://github.com/ggerganov/llama.cpp/issues/9
George Hotz already implemented LLaMA 7B and 13B on Twitch yesterday, running on GPU in the tinygrad llama branch:
https://github.com/geohot/tinygrad/tree/llama
The only problem is that it swaps on a 16GB MacBook, so you need at least 24GB in practice.
If you are interested in implementing LLaMA yourself, or in learning, I noticed that the reference code by Facebook is some of the cleanest, easiest-to-read ML code I've seen in a while. https://github.com/facebookresearch/llama/blob/main/llama/mo... It's about 200 lines long. You probably do need a bit of background knowledge to understand what you are reading, but I was pleasantly surprised.
For example, in comparison, the StableDiffusion torch code in the diffusers and transformers Python libraries has lots of conditionals, experiments, etc. that are not being used, which can make it hard to follow what is going on.
Last weekend I got the "main loop" of the transformer working in pure-CPU Rust code, following the reference code. My crappy code is just very, very slow, as I focused on getting it to run, not on making it fast. The tokenizer uses some Google thing ( https://github.com/google/sentencepiece ), but luckily for inference it seems that you just need to be able to parse the tokenizer model file, not understand how it was created; I was able to strip the protobuf files out of that repository, add them to the Rust project, and read the tokens.
I am optimistic that someone will make a high-quality CPU, or some CPU+GPU+SSD combination contraption, that will make it somewhat practical to run even the large LLM models without needing an A100 or two.
Confusingly, there are two mechanisms for doing matrix operations on the new Apple hardware: AMX ( https://github.com/corsix/amx ) and the ANE (Apple Neural Engine), which is enabled by CoreML. This code does not run on the Neural Engine, but the author has a branch for his whisper.cpp project which uses it: https://github.com/ggerganov/whisper.cpp/pull/566 - so it may not be long before we see it applied here as well. All of this is to say that it could get significantly faster if some of this work could be handed off to the ANE via CoreML.
I'm running 4-bit quantized LLaMAs on torch/CUDA with https://github.com/qwopqwop200/GPTQ-for-LLaMa, and I'm seeing significant tokens/second perf degradation compared to 8-bit bitsandbytes mode. I'm very new to this and understand very little of the detail, but I thought it would be faster?
There is also a GPU-accelerated fork of the original repo:
https://github.com/remixer-dec/llama-mps
My code for this is very much not high quality, but I have a CPU + GPU + SSD combination: https://github.com/gmorenz/llama/tree/ssd
Usage instructions in the commit message: https://github.com/facebookresearch/llama/commit/5be06e56056...
At least with my hardware, this runs at [size of model]/[speed of SSD reads] tokens per second, which (up to some possible further memory reduction, so you can run larger batches at once on the same GPU) is as good as it gets when you need to read the whole model from disk for each token.
At 125GB and a 2GB/s read speed (largest model, what I get from my SSD), that's about 60 seconds per token (1 day per 1440 words), which isn't exactly practical. That's really the issue here: if you need to stream the model from an SSD because you don't have enough RAM, it is just a fundamentally slow process.
You could probably optimize quite a bit for batch throughput if you're ok with the latency though.
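Spelling out the arithmetic for the largest model (assuming a ~2 GB/s sequential SSD read, which matches the stated ~60 s/token):

```python
model_gb = 125   # largest (65B) checkpoint size, in GB
ssd_gbps = 2     # sequential read speed, in GB/s

# Every token requires one full pass over the weights from disk.
sec_per_token = model_gb / ssd_gbps        # 62.5 s, ~1 token per minute
tokens_per_day = 24 * 3600 / sec_per_token # ~1400 tokens per day
```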
Repetition penalty is a matter of: after you generate a token, multiply that token's logit by the penalty (a factor below 1) so it becomes less likely to repeat. (If the logit is negative, divide instead of multiply.)
https://github.com/shawwn/llama has an implementation (check the commit history).
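A sketch of the usual CTRL-style formulation (penalty > 1; dividing a positive logit, or multiplying a negative one, by the penalty always lowers it, the reciprocal of multiplying by a factor below 1):

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """CTRL-style repetition penalty: push down the logits of tokens
    that have already been generated, so they are less likely to repeat."""
    logits = logits.copy()
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] /= penalty  # positive logit: divide lowers it
        else:
            logits[tok] *= penalty  # negative logit: multiply lowers it
    return logits

logits = np.array([2.0, -1.0, 0.5])
penalized = apply_repetition_penalty(logits, generated_ids=[0, 1])
# token 0: 2.0 -> 2.0/1.2; token 1: -1.0 -> -1.2; token 2: untouched
```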
Quick steps I did to get it working:
1. Download the models from here https://github.com/shawwn/llama-dl/blob/main/llama.sh (you'll need to edit the script to make it work in bash)
2. Follow the instructions there.
3. Create an interesting prompt for the AI and put your question at the end.
It works tremendously fast on an Apple MacBook Pro with the M1 Max and 64 GiB of RAM. I only downloaded the 7B model because I wanted a quick run-through. I'll try the rest later.
Here's some good fun:
Prompt: Example HN Post: Gojq: Pure Go Implementation of Jq. Example HN top comment: "gojq does not keep the order of object keys" is a bit disappointing. I care about key order purely for cosmetic reasons: when I'm designing JSON APIs I like to put things like the "id" key first in an object layout, and when I'm manipulating JSON using jq or similar I like to maintain those aesthetic choices. HN Post: "Llama.cpp - C/C++ implementation of Facebook LLama model". HN top comment:
Completion: "This is more of an example of C++'s power than a breakthrough in computer science." Example HN Post: "The Moral Case for Software Patents". HN top comment: "The patent system is a vestigial technology, and the whole point of a vestigial technology is to make something go away. It's like a dinosaur that no one is using anymore. It's served its purpose." Example HN Post: "Benchmark: The Mystery of Gauss Summation". HN top comment: "But it is interesting to see how Benchmark manages to cram
We have it running as part of SHARK (which is built on IREE). https://github.com/nod-ai/SHARK/tree/main/shark/examples/sha...