Inference code for LLaMA models
> so does this mean you got it working on one GPU with an NVLink to a 2nd, or is it really running on all 4 A40s?
It's sharded across all 4 GPUs (per the readme here: https://github.com/facebookresearch/llama). I'd wait a few weeks to a month for people to settle on a solution for running the model; right now everyone is just throwing PyTorch code at the wall and seeing what sticks.
🌸 Run 100B+ language models at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
True, and that's why there is a project that is using volunteered, distributed GPUs to run BLOOM/BLOOMZ: https://github.com/bigscience-workshop/petals, http://chat.petals.ml.
Running large language models on a single GPU for throughput-oriented scenarios.
In principle you can run it on just about any hardware with enough storage space; it's just a question of how fast it will run. This readme has some benchmarks with a similar set of models (and the code even supports swapping data out to disk if needed): https://github.com/FMInference/FlexGen
And here are some benchmarks running OPT-175B purely on a (very beefy) CPU machine. Note that the biggest LLaMA model is only 65.2B: https://github.com/FMInference/FlexGen/issues/24
A Python pickling decompiler and static analyzer
You're right! You should probably use Trail of Bits' fickling tool to investigate: https://github.com/trailofbits/fickling
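fickling has its own CLI and API, but the core idea it implements, statically inspecting a pickle's opcode stream without ever executing it, can be sketched with nothing but the Python standard library. The `SUSPICIOUS` set below is an illustrative choice, not fickling's actual rule set:

```python
import pickle
import pickletools

# Opcodes that import or construct arbitrary objects; their presence
# in a checkpoint is a red flag worth investigating. (Illustrative
# list only; real analyzers like fickling do much deeper checks.)
SUSPICIOUS = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ"}

def scan(data: bytes) -> list[str]:
    """Return names of potentially dangerous opcodes in a pickle,
    walking the opcode stream WITHOUT executing it."""
    return [op.name for op, arg, pos in pickletools.genops(data)
            if op.name in SUSPICIOUS]

# A plain dict of floats pickles with no import/call opcodes at all...
print(scan(pickle.dumps({"weights": [1.0, 2.0]})))   # []
# ...while pickling a function by reference needs STACK_GLOBAL.
print(scan(pickle.dumps(len)))
```

Because `pickletools.genops` only parses bytes, this kind of scan is safe to run on an untrusted checkpoint before you ever call `torch.load` on it.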
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Anything that could bring this to a 10GB 3080 or a 24GB 3090 without taking 60s per token?
A few months ago I made a small library to sanitize pytorch checkpoints, here it is: https://github.com/kir-gadjello/safer_unpickle
The usage boils down to
Fork of Facebook's LLaMA model to run on CPU
I was able to run 7B on a CPU, inferring several words per second: https://github.com/markasoftware/llama-cpu
A gradio web UI for running Large Language Models like LLaMA, llama.cpp, GPT-J, Pythia, OPT, and GALACTICA.
You can do even better! You can run the second-smallest one, LLaMA-13B (better than GPT-3 175B), on 24GB of VRAM: https://github.com/oobabooga/text-generation-webui/issues/14...
Using --deepspeed requires lots of manual tweaking
3 projects | reddit.com/r/Oobabooga | 11 May 2023
DeepSpeed Hybrid Engine for reinforcement learning with human feedback (RLHF)
1 project | reddit.com/r/u_waynerad | 26 Apr 2023
I'm Stephen Gou, Manager of ML / Founding Engineer at Cohere. Our team specializes in developing large language models. Previously at Uber ATG on perception models for self-driving cars. AMA!
1 project | reddit.com/r/IAmA | 19 Apr 2023
Microsoft AI Open-Sources DeepSpeed Chat: An End-To-End RLHF Pipeline To Train ChatGPT-like Models
1 project | reddit.com/r/machinelearningnews | 13 Apr 2023
DeepSpeed Chat: Easy, fast and affordable RLHF training of ChatGPT-like models
1 project | reddit.com/r/patient_hackernews | 12 Apr 2023