Facebook LLAMA is being openly distributed via torrents

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • llama

    Inference code for LLaMA models

    > so does this mean you got it working on one GPU with an NVLink to a 2nd, or is it really running on all 4 A40s?

    It's sharded across all 4 GPUs (as per the README here: https://github.com/facebookresearch/llama). I'd wait a few weeks to a month for people to settle on a solution for running the model; right now people are just throwing PyTorch code at the wall and seeing what sticks.
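
    For a sense of what "sharded" means here, below is a toy sketch (not the repo's actual code) of placing whole layers on different GPUs so that no single card has to hold all of the weights. The reference repo instead splits tensors within each layer using fairscale's model-parallel layers; the layer sizes below are made up for illustration.

        import torch
        import torch.nn as nn

        # Toy sharding: each layer lives on a different GPU and the activations
        # hop between devices during the forward pass. LLaMA's reference code
        # uses fairscale tensor parallelism instead, but the memory effect is
        # the same idea: no single GPU holds the full set of weights.
        class TwoGPUModel(nn.Module):
            def __init__(self):
                super().__init__()
                self.first = nn.Linear(4096, 4096).to("cuda:0")
                self.second = nn.Linear(4096, 4096).to("cuda:1")

            def forward(self, x):
                x = self.first(x.to("cuda:0"))
                return self.second(x.to("cuda:1"))

        model = TwoGPUModel()
        print(model(torch.randn(2, 4096)).shape)  # torch.Size([2, 4096])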

  • petals

    🌸 Run 100B+ language models at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading

    True, and that's why there is a project that is using volunteered, distributed GPUs to run BLOOM/BLOOMZ: https://github.com/bigscience-workshop/petals, http://chat.petals.ml.
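
    For reference, the client side of Petals is only a few lines of Python. The sketch below follows what the project's README showed around this time; the class and checkpoint names (DistributedBloomForCausalLM, "bigscience/bloom-petals") are assumptions that may have changed since, so check the repo before relying on them.

        from transformers import BloomTokenizerFast
        from petals import DistributedBloomForCausalLM

        # Names taken from the Petals README of the time; treat as assumptions.
        MODEL_NAME = "bigscience/bloom-petals"
        tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
        model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)

        # Each forward pass is served by volunteer GPUs holding different blocks.
        input_ids = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
        output_ids = model.generate(input_ids, max_new_tokens=5)
        print(tokenizer.decode(output_ids[0]))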

  • FlexGen

    Running large language models on a single GPU for throughput-oriented scenarios.

    In principle you can run it on just about any hardware with enough storage space; it's just a question of how fast it will run. This README has some benchmarks with a similar set of models (and the code even supports swapping data out to disk if needed): https://github.com/FMInference/FlexGen

    And here are some benchmarks running OPT-175B purely on (a very beefy) CPU machine. Note that the biggest LLaMA model is only 65.2B parameters: https://github.com/FMInference/FlexGen/issues/24
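
    The core trick behind running big models on modest GPUs is offloading: keep the weights in CPU RAM (or on disk) and stream each layer onto the GPU only while it is needed. The snippet below is not FlexGen's API, just a plain-PyTorch illustration of that idea with made-up layer sizes.

        import torch
        import torch.nn as nn

        # Offloading in miniature: weights stay in CPU RAM and each layer is
        # copied to the GPU only for the moment it runs, trading PCIe transfer
        # time for a much smaller VRAM footprint. FlexGen layers smarter
        # policies, large batches, compression, and optional disk offload on
        # top of this basic idea.
        layers = [nn.Linear(4096, 4096) for _ in range(8)]  # held on CPU

        def offloaded_forward(x: torch.Tensor) -> torch.Tensor:
            x = x.to("cuda")
            for layer in layers:
                layer.to("cuda")   # stream this layer's weights in
                x = layer(x)
                layer.to("cpu")    # release VRAM before the next layer
            return x

        print(offloaded_forward(torch.randn(1, 4096)).shape)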

  • fickling

    A Python pickling decompiler and static analyzer

    You're right! You should probably use Trail of Bits' Fickling tool to investigate: https://github.com/trailofbits/fickling
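
    Fickling decompiles and statically analyzes the pickle program embedded in a checkpoint. To see why that matters, here is a stdlib-only sketch (not fickling's API) that walks a pickle opcode stream and flags anything that imports or calls objects, which is exactly the mechanism a malicious checkpoint would abuse.

        import pickletools

        # Opcodes that import or invoke arbitrary Python objects.
        SUSPICIOUS = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ"}

        def scan_pickle(path):
            """List suspicious opcodes in a raw pickle stream.

            Note: modern .pt/.bin checkpoints are zip archives; the pickle
            stream to scan lives inside them as data.pkl.
            """
            findings = []
            with open(path, "rb") as f:
                for opcode, arg, pos in pickletools.genops(f):
                    if opcode.name in SUSPICIOUS:
                        findings.append(f"{pos}: {opcode.name} {arg!r}")
            return findings

        for finding in scan_pickle("data.pkl"):  # placeholder path
            print(finding)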

  • DeepSpeed

    DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

    - https://github.com/microsoft/DeepSpeed

    Anything that could bring this to a 10 GB 3080 or a 24 GB 3090 without taking ~60 s per token?
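
    For anyone wondering what "use DeepSpeed for inference" looks like in code, here is a rough sketch of the init_inference call as documented around this time, applied to a smaller Hugging Face model as a stand-in (converted LLaMA weights were not yet loadable this way). Keyword names have shifted between DeepSpeed releases, so treat this as the shape of the call rather than a pinned API.

        import torch
        import deepspeed
        from transformers import AutoModelForCausalLM, AutoTokenizer

        # Stand-in model; swap in converted LLaMA weights once a loader exists.
        name = "facebook/opt-6.7b"
        tokenizer = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

        # DeepSpeed-Inference: fused kernels plus optional tensor parallelism.
        engine = deepspeed.init_inference(
            model,
            mp_size=1,                        # tensor-parallel degree (1 GPU)
            dtype=torch.float16,
            replace_with_kernel_inject=True,  # inject optimized CUDA kernels
        )

        inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
        out = engine.module.generate(**inputs, max_new_tokens=20)
        print(tokenizer.decode(out[0]))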

  • safer_unpickle

    A few months ago I made a small library to sanitize pytorch checkpoints, here it is: https://github.com/kir-gadjello/safer_unpickle

    The usage boils down to a few lines; see the repo's README for the exact call.
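
    As a rough idea of what the sanitizing involves (this is not safer_unpickle's actual API), torch.load accepts a pickle_module argument, so you can hand it an allow-list Unpickler that refuses to import anything a plain tensor checkpoint should not need. The checkpoint path and allow-list below are placeholders.

        import pickle
        import torch

        # Illustrative allow-list; a real tool is stricter and more precise
        # about which classes a checkpoint legitimately references.
        ALLOWED_PREFIXES = ("collections", "torch", "numpy")

        class RestrictedUnpickler(pickle.Unpickler):
            def find_class(self, module, name):
                if not module.startswith(ALLOWED_PREFIXES):
                    raise pickle.UnpicklingError(f"blocked: {module}.{name}")
                return super().find_class(module, name)

        class RestrictedPickleModule:
            # torch.load(pickle_module=...) only needs these two attributes.
            Unpickler = RestrictedUnpickler
            load = staticmethod(lambda f, **kw: RestrictedUnpickler(f, **kw).load())

        state_dict = torch.load("checkpoint.pt", map_location="cpu",
                                pickle_module=RestrictedPickleModule)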

  • llama-cpu

    Fork of Facebook's LLaMA model to run on CPU

    I was able to run 7B on a CPU, inferring several words per second: https://github.com/markasoftware/llama-cpu
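
    A back-of-envelope aside (not from the linked repo) on why 7B is the natural CPU target: before activations and the KV cache, the weights alone need the following amounts of memory, which is comfortable for workstation RAM but tight or impossible on most consumer GPUs.

        # RAM needed just to hold 7e9 parameters (ignores activations and KV cache).
        params = 7e9
        for dtype, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
            print(f"{dtype}: ~{params * bytes_per_param / 2**30:.0f} GiB")
        # fp32: ~26 GiB, fp16/bf16: ~13 GiB, int8: ~7 GiB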

  • text-generation-webui

    A gradio web UI for running Large Language Models like LLaMA, llama.cpp, GPT-J, Pythia, OPT, and GALACTICA.

    You can do even better! You can run the second-smallest one, LLaMA-13B (better than GPT-3 175B), on 24 GB of VRAM: https://github.com/oobabooga/text-generation-webui/issues/14...
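
    13B does not fit in 24 GB at fp16 (the weights alone are roughly 26 GB), so a common route at the time was 8-bit loading via bitsandbytes once the weights were converted to the Hugging Face format. A hedged sketch, where the model path is a local placeholder rather than a downloadable checkpoint:

        from transformers import AutoModelForCausalLM, AutoTokenizer

        path = "./llama-13b-hf"  # placeholder: locally converted weights
        tokenizer = AutoTokenizer.from_pretrained(path)
        model = AutoModelForCausalLM.from_pretrained(
            path,
            device_map="auto",   # let accelerate place layers on the GPU
            load_in_8bit=True,   # bitsandbytes int8 weights: ~13 GB for 13B
        )

        inputs = tokenizer("Large language models are", return_tensors="pt").to(model.device)
        print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))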
