> so does this mean you got it working on one GPU with an NVLink to a 2nd, or is it really running on all 4 A40s?
It's sharded across all 4 GPUs (as per the readme here: https://github.com/facebookresearch/llama). I'd wait a few weeks to a month for people to settle on a solution for running the model; right now people are just throwing PyTorch code at the wall and seeing what sticks.
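For reference, the readme there launches one process per GPU with torchrun and a model-parallel (MP) value matching the checkpoint, so each rank only ever loads its own slice of the weights. A rough sketch of what each rank ends up doing, assuming the 30B/MP=4 shard layout of the released checkpoints (this is illustrative, not code from the repo):

```python
# Minimal sketch of sharded, model-parallel loading: one process per GPU,
# each rank loading only its own checkpoint shard. Assumes MP=4 shards named
# consolidated.00.pth ... consolidated.03.pth, launched roughly like
#   torchrun --nproc_per_node 4 this_script.py   (hypothetical script name)
import os
import torch
import torch.distributed as dist

def load_my_shard(ckpt_dir: str) -> dict:
    dist.init_process_group("nccl")                   # torchrun provides the rendezvous env vars
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    shard_path = os.path.join(ckpt_dir, f"consolidated.{local_rank:02d}.pth")
    # Each rank holds only ~1/4 of the weights, which is how 4 A40s fit the model.
    return torch.load(shard_path, map_location=f"cuda:{local_rank}")

if __name__ == "__main__":
    shard = load_my_shard("/path/to/30B")             # hypothetical checkpoint directory
    print(f"rank {dist.get_rank()} loaded {len(shard)} tensors")
```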
-
petals
🌸 Run 100B+ language models at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
True, and that's why there is a project that is using volunteered, distributed GPUs to run BLOOM/BLOOMZ: https://github.com/bigscience-workshop/petals, http://chat.petals.ml.
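From memory of the petals README at the time, the client exposes a transformers-style interface, so usage looks roughly like the sketch below; the class and checkpoint names are recalled and may have changed, so treat them as assumptions.

```python
# Rough sketch of running BLOOM over the Petals swarm. The class/model names
# below are from memory of the petals README and may differ in current releases.
from transformers import AutoTokenizer
from petals import DistributedBloomForCausalLM  # assumed client class name

MODEL_NAME = "bigscience/bloom-petals"  # assumed swarm checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)

# Only the embeddings and prompts live locally; the transformer blocks are
# executed remotely on volunteer GPUs, BitTorrent-style.
inputs = tokenizer("A cat in French is", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))
```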
-
In principle you can run it on just about any hardware with enough storage space; it's just a question of how fast it will run. This readme has some benchmarks with a similar set of models (and the code even supports swapping data out to disk if needed): https://github.com/FMInference/FlexGen
And here are some benchmarks running OPT-175B purely on a (very beefy) CPU machine. Note that the biggest LLaMA model is only 65.2B: https://github.com/FMInference/FlexGen/issues/24
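To put those sizes in perspective, here is a quick back-of-the-envelope calculation of the weight-only memory footprint at common precisions (parameter counts are the published LLaMA sizes; activations and KV cache come on top):

```python
# Rough weight-only memory footprint for the LLaMA model sizes.
# Real usage adds activations and the KV cache on top of these numbers.
sizes_billions = {"7B": 6.7, "13B": 13.0, "33B": 32.5, "65B": 65.2}
bytes_per_param = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

for name, billions in sizes_billions.items():
    footprint = ", ".join(
        f"{prec}: ~{billions * b:.0f} GB" for prec, b in bytes_per_param.items()
    )
    print(f"LLaMA-{name}: {footprint}")

# e.g. LLaMA-65B needs ~130 GB in fp16, which is why FlexGen-style offloading
# or a multi-GPU box is needed, while 7B fits in roughly 13 GB.
```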
-
You're right! You should probably use Trail of Bits' Fickling tool to investigate: https://github.com/trailofbits/fickling
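If you just want a quick look without installing anything, the standard library can already list which globals a pickle would import; this is a minimal stdlib sketch of the same idea Fickling automates (not Fickling's own API):

```python
# Statically list the globals a pickle stream would import, without executing it.
# A benign checkpoint should mostly pull in torch/collections classes; anything
# like "os.system" or "builtins.eval" is a red flag. Stdlib-only sketch.
import io
import pickletools
import sys
import zipfile

def iter_global_targets(stream):
    for opcode, arg, pos in pickletools.genops(stream):
        if opcode.name == "GLOBAL":
            yield arg.replace(" ", ".")          # arg is "module name"
        elif opcode.name == "STACK_GLOBAL":
            yield f"<STACK_GLOBAL at byte {pos}; inspect manually>"

def scan(path):
    if zipfile.is_zipfile(path):
        # Modern torch.save() output is a zip archive whose pickle lives in */data.pkl.
        with zipfile.ZipFile(path) as zf:
            for name in zf.namelist():
                if name.endswith("data.pkl"):
                    yield from iter_global_targets(io.BytesIO(zf.read(name)))
    else:
        with open(path, "rb") as f:
            yield from iter_global_targets(f)

if __name__ == "__main__":
    for target in scan(sys.argv[1]):
        print(target)
```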
-
DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
- https://github.com/microsoft/DeepSpeed
Anything that could bring this to a 10GB 3080 or a 24GB 3090 without taking 60s per token?
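The knob DeepSpeed itself exposes for this is ZeRO stage 3 with parameter offload to CPU (or NVMe). A rough sketch of that part of the config is below; the values are illustrative, and whether it beats 60s per token on a 10GB card is an open question.

```python
# Rough ZeRO stage-3 CPU-offload config, DeepSpeed's mechanism for fitting
# models that don't fit in VRAM. Values are illustrative, not tuned.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        # "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},  # for even bigger models
    },
    "train_micro_batch_size_per_gpu": 1,
}
# This dict would be passed to deepspeed.initialize(model=model, config=ds_config);
# generation then streams parameters in from host RAM, trading speed for VRAM.
```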
-
A few months ago I made a small library to sanitize PyTorch checkpoints; here it is: https://github.com/kir-gadjello/safer_unpickle
The usage boils down to
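As a general illustration of what such a sanitizer does under the hood, here is the restricted-Unpickler pattern from the pickle docs (a stdlib-only sketch, not safer_unpickle's actual API):

```python
# Stdlib sketch of the restricted-unpickler idea: only allow an explicit
# whitelist of (module, name) pairs, so a malicious checkpoint can't smuggle
# in os.system / eval via __reduce__. Not the safer_unpickle API.
import io
import pickle

ALLOWED = {
    ("collections", "OrderedDict"),
    ("torch._utils", "_rebuild_tensor_v2"),
    ("torch", "FloatStorage"),
    # ...extend with the torch classes your checkpoints actually need
}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if (module, name) not in ALLOWED:
            raise pickle.UnpicklingError(f"blocked global: {module}.{name}")
        return super().find_class(module, name)

def restricted_loads(data: bytes):
    """Unpickle bytes, refusing any global outside the whitelist."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```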
-
I was able to run 7B on a CPU, inferring several words per second: https://github.com/markasoftware/llama-cpu
-
text-generation-webui
A gradio web UI for running Large Language Models like LLaMA, llama.cpp, GPT-J, Pythia, OPT, and GALACTICA.
You can do even better! You can run the second-smallest one, LLaMA-13B (which benchmarks better than GPT-3 175B), on 24GB of VRAM: https://github.com/oobabooga/text-generation-webui/issues/14...
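For context, the trick that makes 13B fit in 24GB is 8-bit loading via the bitsandbytes integration in transformers. A rough sketch, assuming you already have HF-format LLaMA-13B weights at a hypothetical local path:

```python
# Rough sketch of loading a 13B model in 8-bit so the weights (~13 GB) fit in
# a 24 GB card. Assumes converted Hugging Face-format LLaMA weights at a
# hypothetical local path; requires the bitsandbytes package.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/models/llama-13b-hf"  # hypothetical path to converted weights

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,    # bitsandbytes int8 weights, roughly half the VRAM of fp16
    device_map="auto",    # let accelerate place layers on the available GPU
)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```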
Related posts
- Using --deepspeed requires lots of manual tweaking
- DeepSpeed Hybrid Engine for reinforcement learning with human feedback (RLHF)
- I'm Stephen Gou, Manager of ML / Founding Engineer at Cohere. Our team specializes in developing large language models. Previously at Uber ATG on perception models for self-driving cars. AMA!
- Microsoft AI Open-Sources DeepSpeed Chat: An End-To-End RLHF Pipeline To Train ChatGPT-like Models
- DeepSpeed Chat: Easy, fast and affordable RLHF training of ChatGPT-like models