A comprehensive guide to running Llama 2 locally

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • llama.cpp

    LLM inference in C/C++

  • While on the surface, a 192GB Mac Studio seems like a great deal (it's not much more than a 48GB A6000!), there are several reasons why this might not be a good idea:

    * I assume most people have never used llama.cpp Metal w/ large models. It will drop to CPU speeds whenever the context window is full: https://github.com/ggerganov/llama.cpp/issues/1730#issuecomm... - this might be fixed in the future, but it's been an issue since Metal support was added, and it's a significant problem if you're actually trying to use it for inference. With 192GB of memory you could probably run larger models w/o quantization, but I've never seen anyone post benchmarks of their experience, and at that point the limited memory bandwidth will be a big factor.

    * If you are planning on using Apple Silicon for ML/training, I'd also be wary. There are multi-year-old open bugs in PyTorch [1], and most major LLM libraries like DeepSpeed, bitsandbytes, etc. don't have Apple Silicon support [2][3].

    You can see similar patterns w/ Stable Diffusion support [4][5]: support lagging by months, lots of problems and poor performance with inference, let alone fine-tuning. The same applies to basically any ML application you want (STT, TTS, video, etc.).

    Macs are fine to poke around with, but if you actually plan to do more than run a small LLM and say "neat", especially for a business, recommending a Mac for anyone getting started w/ ML workloads is a bad take. (In general, for anyone getting started, unless you're just burning budget, renting cloud GPU is going to be the best cost/perf, although on-prem/local obviously has other advantages.)

    [1] https://github.com/pytorch/pytorch/issues?q=is%3Aissue+is%3A...

    [2] https://github.com/microsoft/DeepSpeed/issues/1580

    [3] https://github.com/TimDettmers/bitsandbytes/issues/485

    [4] https://github.com/AUTOMATIC1111/stable-diffusion-webui/disc...

    [5] https://forums.macrumors.com/threads/ai-generated-art-stable...

  • text-generation-webui

    A Gradio web UI for Large Language Models. Supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), Llama models.

    The bash script downloads llama.cpp, a project which lets you run LLaMA-based language models on your CPU, and then downloads the 13-billion-parameter GGML version of LLaMA 2. The GGML format is what works with llama.cpp and runs inference on the CPU. There are ways to run models on your GPU instead, but whether that's worth it depends on your setup.

    I would highly recommend looking into the text-generation-webui project (https://github.com/oobabooga/text-generation-webui). It has a one-click installer and very comprehensive guides for getting models running locally, as well as pointers on where to find models. The project also has an "api" command-line flag that lets you use it the way you might currently use a web-based service.
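
    If you'd rather drive llama.cpp from Python than from the bash script, the llama-cpp-python bindings expose the same CPU inference. A minimal sketch, assuming you've already downloaded a quantized model file (note that newer llama.cpp builds expect the GGUF format rather than the older GGML) and that the model path below is a placeholder:

      # Minimal CPU inference via the llama-cpp-python bindings
      # (pip install llama-cpp-python). The model path is a placeholder;
      # point it at whichever quantized file you actually downloaded.
      from llama_cpp import Llama

      llm = Llama(
          model_path="./models/llama-2-13b-chat.Q4_K_M.gguf",  # hypothetical path
          n_ctx=2048,      # context window size
          n_threads=8,     # CPU threads used for inference
      )

      output = llm(
          "Q: Name three ways to run Llama 2 locally. A:",
          max_tokens=128,
          stop=["Q:"],     # stop before the model invents a new question
      )
      print(output["choices"][0]["text"])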

  • ollama

    Get up and running with Llama 3, Mistral, Gemma, and other large language models.

  • > adjust your paths as necessary. It has a tendency to talk to itself.

    This is always a fun surprise. What I've seen help, especially with chat models, is to use a prompt template. Some tools (e.g. https://ollama.ai/ – disclosure: I'm a maintainer) use a default, model-specific prompt template when you run the model. This is easier for users since they can just type their chat messages and get answers. Every model is trained differently, so the template matters.

    But with llama.cpp you'd need to wrap your prompt text in the right template yourself. For Llama 2, the Facebook developers' generation code wraps the system prompt and user prompts in specific tags (<<SYS>>{system prompt}<</SYS>> and [INST]{user prompt}[/INST], respectively): https://github.com/facebookresearch/llama/blob/main/llama/ge....

    Worth noting that customizing prompt templates can be fun – you don't have to use them – I've had a model generate a conversation between a few characters that talk to each other for example – it's pretty entertaining!
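
    For reference, this is roughly what wrapping a prompt in the Llama 2 chat template looks like. The tag constants mirror the ones in the generation code linked above, but treat the exact whitespace as an approximation:

      # Rough sketch of the Llama 2 chat template described above. The tag
      # strings mirror Meta's generation code; exact whitespace handling may
      # differ slightly from what a given runtime expects.
      B_INST, E_INST = "[INST]", "[/INST]"
      B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

      def build_llama2_prompt(system_prompt: str, user_prompt: str) -> str:
          """Wrap a system prompt and a single user message in Llama 2 chat tags."""
          return f"{B_INST} {B_SYS}{system_prompt}{E_SYS}{user_prompt} {E_INST}"

      print(build_llama2_prompt(
          "You are a helpful assistant.",
          "How do I run Llama 2 locally?",
      ))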

  • llama

    Inference code for Llama models

  • text-generation-inference

    Large Language Model Text Generation Inference

  • MLC LLM (iOS/Android)

    Which is not really comprehensive... If you have a Linux machine with GPUs, I'd just use Hugging Face's text-generation-inference (https://github.com/huggingface/text-generation-inference). And I'm sure there are other things that could be covered.
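
    If you do go the text-generation-inference route, the server exposes a simple HTTP API. A sketch, assuming a TGI instance is already serving a Llama 2 model and was mapped to localhost:8080 when the container was launched:

      # Calling a running text-generation-inference server over HTTP.
      # Assumes the server is reachable at localhost:8080; adjust to match
      # however you launched the container.
      import requests

      resp = requests.post(
          "http://localhost:8080/generate",
          json={
              "inputs": "What can I use to run Llama 2 locally?",
              "parameters": {"max_new_tokens": 64, "temperature": 0.7},
          },
          timeout=60,
      )
      resp.raise_for_status()
      print(resp.json()["generated_text"])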

  • llm

    Access large language models from the command-line (by simonw)

  • My LLM command line tool can do that - it logs everything to a SQLite database and has an option to continue a conversation: https://llm.datasette.io
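
    The same continue-a-conversation flow is also available from LLM's Python API. A sketch, assuming a suitable model plugin is installed; the model name below is a placeholder:

      # Continuing a conversation with the llm library's Python API.
      # The model name is a placeholder: substitute whatever model you have
      # installed (e.g. via one of llm's local-model plugins).
      import llm

      model = llm.get_model("llama-2-7b-chat")  # hypothetical model id
      conversation = model.conversation()

      first = conversation.prompt("Give me three facts about llamas.")
      print(first.text())

      # Follow-up prompts in the same conversation keep the earlier context.
      followup = conversation.prompt("Now summarize those facts in one sentence.")
      print(followup.text())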

  • blog

    Public repo for HF blog posts (by huggingface)

  • If you just want to do inference/mess around with the model and have a 16GB GPU, then this[0] is enough to paste into a notebook. You need to have access to the HF models though.

    0. https://github.com/huggingface/blog/blob/main/llama2.md#usin...
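
    In the same spirit, a hedged sketch of what "paste into a notebook" inference with transformers looks like on a ~16GB GPU; it assumes you've been granted access to the gated meta-llama weights and are logged in to the Hugging Face Hub:

      # Running Llama 2 7B chat with Hugging Face transformers on a single
      # ~16GB GPU. Assumes access to the gated meta-llama weights and a prior
      # `huggingface-cli login`.
      import torch
      from transformers import AutoTokenizer, pipeline

      model_id = "meta-llama/Llama-2-7b-chat-hf"
      tokenizer = AutoTokenizer.from_pretrained(model_id)

      pipe = pipeline(
          "text-generation",
          model=model_id,
          tokenizer=tokenizer,
          torch_dtype=torch.float16,  # half precision keeps the 7B model under ~14GB
          device_map="auto",
      )

      out = pipe(
          "What are good ways to run Llama 2 locally?",
          max_new_tokens=128,
          do_sample=True,
          temperature=0.7,
      )
      print(out[0]["generated_text"])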

  • llama

    Inference code for LLaMA models on CPU and Mac M1/M2 GPU (by krychu)

  • Self-plug. Here’s a fork of the original llama 2 code adapted to run on the CPU or MPS (M1/M2 GPU) if available:

    https://github.com/krychu/llama

    It runs with the original weights and gets you to ~4 tokens/sec on an M1 MacBook Pro with the 7B model.
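
    The general pattern such a port relies on is straightforward device selection in PyTorch: prefer the MPS backend on Apple Silicon when it's available and fall back to the CPU otherwise. A sketch (not taken from the fork itself):

      # Device selection for Apple Silicon: use the MPS backend when PyTorch
      # exposes it, otherwise fall back to CUDA or the CPU. Illustrative only,
      # not code from the krychu/llama fork.
      import torch

      if torch.backends.mps.is_available():
          device = torch.device("mps")
      elif torch.cuda.is_available():
          device = torch.device("cuda")
      else:
          device = torch.device("cpu")

      x = torch.randn(4, 4, device=device)
      print(f"Running on {device}: sum = {x.sum().item():.3f}")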

  • Pytorch

    Tensors and Dynamic neural networks in Python with strong GPU acceleration

  • DeepSpeed

    DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

  • bitsandbytes

    Accessible large language models via k-bit quantization for PyTorch.

  • stable-diffusion-webui

    Stable Diffusion web UI

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

Related posts