A comprehensive guide to running Llama 2 locally

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • llama.cpp

    LLM inference in C/C++

  • While on the surface, a 192GB Mac Studio seems like a great deal (it's not much more than a 48GB A6000!), there are several reasons why this might not be a good idea:

    * I assume most people have never used llama.cpp Metal w/ large models. It will drop to CPU speeds whenever the context window is full: https://github.com/ggerganov/llama.cpp/issues/1730#issuecomm... - this might be fixed in the future, but it's been an issue since Metal support was added, and it's a significant problem if you're actually trying to use it for inference. With 192GB of memory you could probably run larger models w/o quantization, but I've never seen anyone post benchmarks of their experience, and at that point the limited memory bandwidth will be a big factor.

    * If you are planning on using Apple Silicon for ML/training, I'd also be wary. There are multi-year-old open bugs in PyTorch [1], and most major LLM libraries like DeepSpeed, bitsandbytes, etc. don't have Apple Silicon support [2][3].

    You can see similar patterns w/ Stable Diffusion support [4][5]: support lagging by months, lots of problems and poor performance with inference, let alone fine-tuning. The same applies to basically any ML application you want (STT, TTS, video, etc.).

    Macs are fine to poke around with, but if you actually plan to do more than run a small LLM and say "neat", especially for a business, recommending a Mac for anyone getting started w/ ML workloads is a bad take. (In general, for anyone getting started, unless you're just burning budget, renting cloud GPU is going to be the best cost/perf, although on-prem/local obviously has other advantages.)

    [1] https://github.com/pytorch/pytorch/issues?q=is%3Aissue+is%3A...

    [2] https://github.com/microsoft/DeepSpeed/issues/1580

    [3] https://github.com/TimDettmers/bitsandbytes/issues/485

    [4] https://github.com/AUTOMATIC1111/stable-diffusion-webui/disc...

    [5] https://forums.macrumors.com/threads/ai-generated-art-stable...

  • text-generation-webui

    A Gradio web UI for Large Language Models. Supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), Llama models.

    The bash script downloads llama.cpp, a project which lets you run LLaMA-based language models on your CPU, and then downloads the 13-billion-parameter GGML version of LLaMA 2. The GGML format is what works with llama.cpp and runs inference on the CPU. There are ways to run models on your GPU instead, but whether that's worth it depends on your setup.

    I would highly recommend looking into the text-generation-webui project (https://github.com/oobabooga/text-generation-webui). It has a one-click installer and very comprehensive guides for getting models running locally, as well as pointers on where to find models. The project also has an "api" command-line flag that lets you use it the way you might currently use a web-based service.
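
    If you'd rather drive llama.cpp from Python than from the bash script, the llama-cpp-python bindings expose the same CPU inference. A minimal sketch, assuming you've already downloaded a quantized model file (note that newer llama.cpp builds expect the GGUF format rather than the older GGML) and that the model path below is a placeholder:

      # Minimal CPU inference via the llama-cpp-python bindings
      # (pip install llama-cpp-python). The model path is a placeholder;
      # point it at whichever quantized file you actually downloaded.
      from llama_cpp import Llama

      llm = Llama(
          model_path="./models/llama-2-13b-chat.Q4_K_M.gguf",  # hypothetical path
          n_ctx=2048,      # context window size
          n_threads=8,     # CPU threads used for inference
      )

      output = llm(
          "Q: Name three ways to run Llama 2 locally. A:",
          max_tokens=128,
          stop=["Q:"],     # stop before the model invents a new question
      )
      print(output["choices"][0]["text"])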

  • ollama

    Get up and running with Llama 3, Mistral, Gemma, and other large language models.

  • > adjust your paths as necessary. It has a tendency to talk to itself.

    This is always a fun surprise. What I've seen help, especially with chat models, is to use a prompt template. Some tools (e.g. https://ollama.ai/ – disclosure: I'm a maintainer) use a default, model-specific prompt template when you run the model. This is easier for users since they can just type their chat messages and get answers. Every model is trained differently, so the template matters.

    But with llama.cpp you'd need to wrap your prompt text in the right template yourself. For Llama 2, the Facebook developers' generation code wraps the system prompt and user prompts in specific tags (<<SYS>>{system prompt}<</SYS>> and [INST]{user prompt}[/INST], respectively): https://github.com/facebookresearch/llama/blob/main/llama/ge....

    Worth noting that customizing prompt templates can be fun – you don't have to use them – I've had a model generate a conversation between a few characters that talk to each other for example – it's pretty entertaining!
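
    For reference, this is roughly what wrapping a prompt in the Llama 2 chat template looks like. The tag constants mirror the ones in the generation code linked above, but treat the exact whitespace as an approximation:

      # Rough sketch of the Llama 2 chat template described above. The tag
      # strings mirror Meta's generation code; exact whitespace handling may
      # differ slightly from what a given runtime expects.
      B_INST, E_INST = "[INST]", "[/INST]"
      B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

      def build_llama2_prompt(system_prompt: str, user_prompt: str) -> str:
          """Wrap a system prompt and a single user message in Llama 2 chat tags."""
          return f"{B_INST} {B_SYS}{system_prompt}{E_SYS}{user_prompt} {E_INST}"

      print(build_llama2_prompt(
          "You are a helpful assistant.",
          "How do I run Llama 2 locally?",
      ))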

  • llama

    Inference code for Llama models

  • text-generation-inference

    Large Language Model Text Generation Inference

  • MLC LLM (iOS/Android)

    Which is not really comprehensive... If you have a Linux machine with GPUs, I'd just use Hugging Face's text-generation-inference (https://github.com/huggingface/text-generation-inference). And I'm sure there are other things that could be covered.
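
    If you do go the text-generation-inference route, the server exposes a simple HTTP API. A sketch, assuming a TGI instance is already serving a Llama 2 model and was mapped to localhost:8080 when the container was launched:

      # Calling a running text-generation-inference server over HTTP.
      # Assumes the server is reachable at localhost:8080; adjust to match
      # however you launched the container.
      import requests

      resp = requests.post(
          "http://localhost:8080/generate",
          json={
              "inputs": "What can I use to run Llama 2 locally?",
              "parameters": {"max_new_tokens": 64, "temperature": 0.7},
          },
          timeout=60,
      )
      resp.raise_for_status()
      print(resp.json()["generated_text"])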

  • llm

    Access large language models from the command-line (by simonw)

  • My LLM command line tool can do that - it logs everything to a SQLite database and has an option to continue a conversation: https://llm.datasette.io
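
    The same continue-a-conversation flow is also available from LLM's Python API. A sketch, assuming a suitable model plugin is installed; the model name below is a placeholder:

      # Continuing a conversation with the llm library's Python API.
      # The model name is a placeholder: substitute whatever model you have
      # installed (e.g. via one of llm's local-model plugins).
      import llm

      model = llm.get_model("llama-2-7b-chat")  # hypothetical model id
      conversation = model.conversation()

      first = conversation.prompt("Give me three facts about llamas.")
      print(first.text())

      # Follow-up prompts in the same conversation keep the earlier context.
      followup = conversation.prompt("Now summarize those facts in one sentence.")
      print(followup.text())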

  • blog

    Public repo for HF blog posts (by huggingface)

  • If you just want to do inference/mess around with the model and have a 16GB GPU, then this[0] is enough to paste into a notebook. You need to have access to the HF models though.

    0. https://github.com/huggingface/blog/blob/main/llama2.md#usin...
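
    In the same spirit, a hedged sketch of what "paste into a notebook" inference with transformers looks like on a ~16GB GPU; it assumes you've been granted access to the gated meta-llama weights and are logged in to the Hugging Face Hub:

      # Running Llama 2 7B chat with Hugging Face transformers on a single
      # ~16GB GPU. Assumes access to the gated meta-llama weights and a prior
      # `huggingface-cli login`.
      import torch
      from transformers import AutoTokenizer, pipeline

      model_id = "meta-llama/Llama-2-7b-chat-hf"
      tokenizer = AutoTokenizer.from_pretrained(model_id)

      pipe = pipeline(
          "text-generation",
          model=model_id,
          tokenizer=tokenizer,
          torch_dtype=torch.float16,  # half precision keeps the 7B model under ~14GB
          device_map="auto",
      )

      out = pipe(
          "What are good ways to run Llama 2 locally?",
          max_new_tokens=128,
          do_sample=True,
          temperature=0.7,
      )
      print(out[0]["generated_text"])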

  • llama

    Inference code for LLaMA models on CPU and Mac M1/M2 GPU (by krychu)

  • Self-plug. Here’s a fork of the original llama 2 code adapted to run on the CPU or MPS (M1/M2 GPU) if available:

    https://github.com/krychu/llama

    It runs with the original weights and gets you to ~4 tokens/sec on an M1 MacBook Pro with the 7B model.
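
    The general pattern such a port relies on is straightforward device selection in PyTorch: prefer the MPS backend on Apple Silicon when it's available and fall back to the CPU otherwise. A sketch (not taken from the fork itself):

      # Device selection for Apple Silicon: use the MPS backend when PyTorch
      # exposes it, otherwise fall back to CUDA or the CPU. Illustrative only,
      # not code from the krychu/llama fork.
      import torch

      if torch.backends.mps.is_available():
          device = torch.device("mps")
      elif torch.cuda.is_available():
          device = torch.device("cuda")
      else:
          device = torch.device("cpu")

      x = torch.randn(4, 4, device=device)
      print(f"Running on {device}: sum = {x.sum().item():.3f}")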

  • Pytorch

    Tensors and Dynamic neural networks in Python with strong GPU acceleration

  • DeepSpeed

    DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

  • bitsandbytes

    Accessible large language models via k-bit quantization for PyTorch.

  • stable-diffusion-webui

    Stable Diffusion web UI

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

Related posts