Running large language models like ChatGPT on a single GPU

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • FlexGen

    Discontinued: Running large language models like OPT-175B/GPT-3 on a single GPU, focusing on high-throughput generation. [Moved to: https://github.com/FMInference/FlexGen] (by Ying1123)

  • ggml

    Tensor library for machine learning

  • I don't know about these large models, but I saw a random HN comment earlier in a different thread where someone showed a GPT-J model running on CPU only: https://github.com/ggerganov/ggml

    I tested it on my Linux machine and my MacBook Air (M1), and it generates tokens at a reasonable speed using the CPU only. I noticed it doesn't quite use all of my available CPU cores, so it may be leaving some performance on the table; I'm not sure, though.

    GPT-J 6B is nowhere near as large as the OPT-175B in the post, but it gave me the sense that CPU-only inference may not be totally hopeless even for large models, if only we had some high-quality software to do it.
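
    For context on what ggml is doing under the hood: it builds a static compute graph over tensors allocated from one pre-sized memory pool, then evaluates the graph on the CPU across a configurable number of threads. Below is a minimal sketch against the C API roughly as it looked in early 2023; the library moves quickly, so treat the exact names and signatures as illustrative rather than current.

    ```c
    // Minimal sketch of ggml's compute-graph API, roughly as of early 2023
    // (the API has changed since; names and signatures are illustrative).
    #include <stdio.h>
    #include "ggml.h"

    int main(void) {
        // Every tensor lives in one user-sized memory pool; no per-tensor malloc.
        struct ggml_init_params params = {
            .mem_size   = 16 * 1024 * 1024, // 16 MiB pool, ample for this toy graph
            .mem_buffer = NULL,             // NULL => ggml allocates the pool itself
        };
        struct ggml_context * ctx = ggml_init(params);

        // Describe f = a*x + b as a graph; nothing is computed yet.
        struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
        struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
        struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
        struct ggml_tensor * f = ggml_add(ctx, ggml_mul(ctx, a, x), b);

        struct ggml_cgraph gf = ggml_build_forward(f);
        gf.n_threads = 4; // CPU threads for evaluation (cf. the core-usage remark above)

        // Fill in the inputs, then evaluate the whole graph on the CPU.
        ggml_set_f32(x, 2.0f);
        ggml_set_f32(a, 3.0f);
        ggml_set_f32(b, 4.0f);
        ggml_graph_compute(ctx, &gf);

        printf("f = %f\n", ggml_get_f32_1d(f, 0)); // 3*2 + 4 = 10
        ggml_free(ctx);
        return 0;
    }
    ```

    The repo's gpt-j example follows the same pattern at scale: the transformer layers are assembled into one graph, the weights are loaded from a converted model file, and the thread count is exposed as a parameter.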

  • rust-bert

    Rust native ready-to-use NLP pipelines and transformer-based models (BERT, DistilBERT, GPT2,...)

  • Give this a look: https://github.com/guillaume-be/rust-bert

    If you have PyTorch configured correctly, this should "just work" for a lot of the smaller models. It won't be a 1:1 ChatGPT replacement, but you can build some pretty cool stuff with it.

    > it's basically Python or bust in this space

    More or less, but that doesn't have to be a bad thing. If you're on Apple Silicon, you have plenty of performance headroom to deploy Python code for this. I've gotten this library to work on systems with as little as 2 GB of memory, so outside of ultra-low-end use cases, you should be fine.
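
    For a concrete starting point, here is roughly what a rust-bert text-generation pipeline looks like (a GPT-2-class model by default). This is a sketch based on the library's documented pipeline API; exact signatures vary across rust-bert versions, and it assumes rust-bert and anyhow in Cargo.toml plus a libtorch installation that the tch crate can find.

    ```rust
    use rust_bert::pipelines::text_generation::TextGenerationModel;

    fn main() -> anyhow::Result<()> {
        // Downloads pretrained weights on first run; needs a working libtorch setup.
        let model = TextGenerationModel::new(Default::default())?;

        // Generate continuations for a batch of prompts.
        let output = model.generate(&["The dog", "The cat was"], None);
        for sentence in output {
            println!("{sentence}");
        }
        Ok(())
    }
    ```

    The other pipelines (question answering, summarization, sentiment analysis) under rust_bert::pipelines follow the same construct-then-call shape.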

  • CTranslate2

    Fast inference engine for Transformer models

  • Open-Assistant

    OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.

  • stable-horde-notebook

    A Jupyter notebook for Stable Horde, for use in Google Colab, etc.

  • https://github.com/aqualxx/stable-horde-notebook

    My only problem with Stable Horde is that their anti-CSAM measure involves checking the prompt for words like "small", meaning I can't use an NSFW-capable model with certain prompts ("holding a very small bag", etc.). That, and seeing great things in the image ratings and being unable to reproduce them, because it doesn't provide the prompt.

