> so does this mean you got it working on one GPU with an NVLink to a 2nd, or is it really running on all 4 A40s?
It's sharded across all 4 GPUs (per the README here: https://github.com/facebookresearch/llama). I'd wait a few weeks to a month for people to settle on a solution for running the model; right now everyone is just throwing PyTorch code at the wall and seeing what sticks.
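The reference code gets there with fairscale's model-parallel layers: every weight matrix is split across the ranks, so no single GPU ever holds the full model. A minimal, inference-only sketch of the column-parallel idea (the class here is illustrative, and it assumes `torch.distributed` was already initialized by a launcher such as torchrun):

```python
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    """Each rank stores a vertical shard of the weight; outputs are
    all-gathered so callers see the full-width result."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.world = dist.get_world_size()
        assert out_features % self.world == 0, "out_features must split evenly"
        shard = out_features // self.world
        # Only 1/world of the parameters live on this GPU.
        self.weight = torch.nn.Parameter(torch.empty(shard, in_features))
        torch.nn.init.normal_(self.weight, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y_local = x @ self.weight.t()  # this rank's slice of the output
        shards = [torch.empty_like(y_local) for _ in range(self.world)]
        dist.all_gather(shards, y_local)  # collect slices from the other GPUs
        return torch.cat(shards, dim=-1)
```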
True, and that's why there's already a project using volunteered, distributed GPUs to run BLOOM/BLOOMZ: https://github.com/bigscience-workshop/petals, http://chat.petals.ml.
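For flavor, the petals client looked roughly like this around that time; the class and checkpoint names below are my recollection of its README, so treat them as assumptions:

```python
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM

# The prompt and embeddings stay local; each transformer block is served
# by some volunteer's GPU over the network.
MODEL = "bigscience/bloom-petals"  # assumed checkpoint name
tokenizer = BloomTokenizerFast.from_pretrained(MODEL)
model = DistributedBloomForCausalLM.from_pretrained(MODEL)

inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))
```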
In principle you can run it on just about any hardware with enough storage space; it's just a question of how fast it will run. This README has benchmarks for a similar set of models (and the code can even swap data out to disk if needed): https://github.com/FMInference/FlexGen
And here are benchmarks running OPT-175B purely on a (very beefy) CPU machine. Note that the biggest LLaMA model is only 65.2B: https://github.com/FMInference/FlexGen/issues/24
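If you just want to try the spill-to-slower-memory idea without FlexGen itself, the Hugging Face loader exposes the same trade-off. A hedged sketch (the model id is a stand-in, and it requires the accelerate package):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" fills the GPU first, then CPU RAM, then the offload
# folder on disk -- the same speed-for-capacity trade-off the FlexGen
# benchmarks measure.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",      # stand-in model id
    device_map="auto",
    offload_folder="offload",   # layers that fit nowhere else go here
)
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
```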
You're right! You should probably use Trail of Bits' Fickling tool to investigate: https://github.com/trailofbits/fickling
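Fickling does proper static analysis of the pickle program; for a quick manual look, the standard library can at least disassemble the opcode stream. A sketch assuming a newer zip-format torch checkpoint (the path is a stand-in):

```python
import pickletools
import zipfile

# Newer torch checkpoints are zip archives whose pickle lives at
# <prefix>/data.pkl. Suspicious GLOBAL/REDUCE opcodes (os.system,
# builtins.eval, ...) stand out in the disassembly.
with zipfile.ZipFile("model.ckpt") as zf:  # stand-in path
    pkl = next(n for n in zf.namelist() if n.endswith("data.pkl"))
    pickletools.dis(zf.read(pkl))
```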
- https://github.com/microsoft/DeepSpeed
Anything that could bring this to a 10 GB 3080 or a 24 GB 3090 without taking ~60 s per token?
A few months ago I made a small library to sanitize PyTorch checkpoints; here it is: https://github.com/kir-gadjello/safer_unpickle
The usage boils down to wrapping the load in a restricted unpickler.
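A minimal sketch of that idea using only the standard library (names are illustrative, not necessarily safer_unpickle's actual API):

```python
import pickle

# Refuse any global that isn't explicitly allow-listed, so a malicious
# checkpoint can't resolve os.system or the like during load.
ALLOWED = {
    ("collections", "OrderedDict"),
    ("torch._utils", "_rebuild_tensor_v2"),  # what tensors deserialize through
}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")

def safe_load(path):
    with open(path, "rb") as f:
        return RestrictedUnpickler(f).load()
```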
I was able to run 7B on a CPU, inferring several words per second: https://github.com/markasoftware/llama-cpu
You can do even better! You can run the second-smallest one, LLaMA-13B (which beats GPT-3 175B), on 24 GB of VRAM: https://github.com/oobabooga/text-generation-webui/issues/14...
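One way that fits: int8 quantization roughly halves the ~26 GB that 13B fp16 weights need, squeezing under a 3090's 24 GB. A hedged sketch with transformers plus bitsandbytes (the model id is a stand-in):

```python
from transformers import AutoModelForCausalLM

# load_in_8bit quantizes the weights via bitsandbytes as they load,
# cutting 13B from ~26 GB in fp16 to roughly half that.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-13b",  # stand-in model id
    device_map="auto",
    load_in_8bit=True,       # requires the bitsandbytes package
)
```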