Llama-CPU: Fork of Facebook's LLaMA model to run on CPU

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • llama

    Inference code for Llama models

  • Reading the patch: https://github.com/facebookresearch/llama/compare/main...mar...

Looks like this is just tweaking some defaults and commenting out some code that enables CUDA. It also switches to something called Gloo, which I'm not familiar with; it seems to be an alternate backend.
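
    As an illustration of what that kind of change involves (a generic sketch, not code taken from the patch): PyTorch's distributed setup for LLaMA normally initialises the NCCL backend, which requires CUDA, while Gloo is a collective-communication backend that also runs on CPU. Switching backends and keeping tensors on the CPU looks roughly like this:

        import os
        import torch
        import torch.distributed as dist

        def setup_cpu_inference(rank: int = 0, world_size: int = 1) -> torch.device:
            # Single-machine rendezvous settings (illustrative defaults).
            os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
            os.environ.setdefault("MASTER_PORT", "29500")

            # Gloo works on CPU; NCCL (the usual choice for multi-GPU runs) requires CUDA.
            dist.init_process_group("gloo", rank=rank, world_size=world_size)

            # Keep weights and activations on the CPU instead of calling .cuda(), and use
            # fp32 since half-precision matmuls are poorly supported on most CPUs.
            torch.set_default_dtype(torch.float32)
            return torch.device("cpu")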

  • llama-cpu

    Fork of Facebook's LLaMA model to run on CPU

  • llama-mps

    Experimental fork of Facebook's LLaMA model which runs it with GPU acceleration on Apple Silicon M1/M2

  • llama

    Inference code for LLaMA models (by gmorenz)

  • I don't know about this fork specifically, but in general yes absolutely.

    Even without enough RAM, you can stream model weights from disk and run at [model size ÷ disk read speed] seconds per token.

    I'm doing that on a small GPU with this code, but it should be easy to get it working with the CPU doing the compute instead (and at least with my disk/CPU, I'm not sure it would even be slower; disk reads would probably still be the bottleneck).

    https://github.com/gmorenz/llama/tree/ssd
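
    To put rough numbers on that estimate: a 7B-parameter model in fp16 is about 13-14 GB, so a disk sustaining ~2 GB/s of reads works out to roughly 6-7 seconds per token. A minimal sketch of the idea, loading one transformer block's weights from disk right before applying it (the per-layer checkpoint files and module list are assumptions for illustration, not gmorenz's actual code):

        import torch

        def forward_streaming(h, blocks, layer_files):
            """Forward pass that streams per-layer weights from disk.

            blocks: transformer-block modules allocated on the CPU; their
                    weights are overwritten on every call.
            layer_files: one saved state_dict per block, e.g. "layer_00.pt", ...
            """
            for block, path in zip(blocks, layer_files):
                # The disk read dominates: every generated token re-reads the whole
                # model, so time per token ≈ model size / disk read speed.
                state = torch.load(path, map_location="cpu")
                block.load_state_dict(state)
                h = block(h)
                del state  # weights can be dropped immediately; only activations stay resident
            return h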

  • KoboldAI-Client

  • I have been using similar LLMs to help draft fictional stories. The community fine-tuned models are geared towards SFW or NSFW story completion.

    See https://github.com/KoboldAI/KoboldAI-Client to read more about the currently popular models.

    https://koboldai.net/ is a way to run some of these models in the "cloud". There's no account required, and the prompts are run on other people's hardware, with priority weighting based on how much compute you have used or donated. There's an anonymous API key, and there's no expectation that the output can't be logged.

    The models that run on local hardware are very basic in the quality of their output. Here's an example of a 6B model's output when used to try to emulate ChatGPT: https://mobile.twitter.com/Knaikk/status/1629711223863345154 (the model was fine-tuned on story completion, so it's not meaningfully comparable).

  • transformers

    🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.

  • I tried mark's OMP_NUM_THREADS suggestion (https://news.ycombinator.com/item?id=35018559) and did not see any obvious change that would make it parallel. Given that the Hugging Face patch (https://github.com/huggingface/transformers/pull/21955), once it gets in, should allow streaming from RAM to the GPU, it felt not worth the effort to keep working on the CPU version: even a ~30x speedup from using all the cores would still take around a minute to run the 7B model.
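
    For reference, the thread-count tuning being discussed comes down to environment variables and PyTorch settings; something along these lines is the usual way to try it (the values below are illustrative, not recommendations from the thread):

        import os

        # Must be set before PyTorch creates its thread pools to reliably take effect.
        os.environ["OMP_NUM_THREADS"] = "16"   # e.g. the number of physical cores

        import torch

        torch.set_num_threads(16)         # intra-op parallelism (matmuls, etc.)
        torch.set_num_interop_threads(2)  # parallelism across independent ops; set before first use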
