llama
ChatGLM-6B
| | llama | ChatGLM-6B |
|---|---|---|
| Mentions | 3 | 17 |
| Stars | 35 | 39,231 |
| Growth | - | 3.1% |
| Activity | 1.6 | 8.4 |
| Last commit | about 1 year ago | 2 months ago |
| Language | Python | |
| License | GNU General Public License v3.0 only | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
llama
- Alpaca: An Instruct-Tuned LLaMA 7B. Responses on par with text-davinci-003. Demo up
> All the magic of "7B LLaMA running on a potato" seems to involve lowering precision down to f16 and then further quantizing to int4.
LLaMA weights are fp16 to start with, so no lowering is necessary to get there.
You can stream weights from RAM to the GPU pretty efficiently. If you have >= 32 GB RAM and >= 2 GB VRAM, my code here should work for you: https://github.com/gmorenz/llama/tree/gpu_offload
There's probably a cleaner version of it somewhere else. Really you should only need >= 16 GB RAM, but the (Meta-provided) code that loads the initial weights unnecessarily makes two copies of the weights in RAM simultaneously.
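A minimal sketch of the layer-offload idea behind that fork (not the fork's actual code): keep the whole model in CPU RAM and move one transformer layer at a time onto the GPU for its forward pass, so peak VRAM use is roughly one layer plus activations. Assumes PyTorch and an available CUDA device; the layer list here is a toy stand-in.

```python
import torch
import torch.nn as nn

def offloaded_forward(layers, hidden, device="cuda"):
    """Run a stack of layers kept in CPU RAM, streaming one layer
    at a time to the GPU so only ~one layer of weights sits in VRAM."""
    for layer in layers:
        layer.to(device)          # copy this layer's weights into VRAM
        hidden = layer(hidden)    # compute on the GPU
        layer.to("cpu")           # move the weights back out of VRAM
        torch.cuda.empty_cache()  # release the freed memory to the driver
    return hidden

# Toy usage: 8 small linear "layers" standing in for transformer blocks.
layers = [nn.Linear(1024, 1024) for _ in range(8)]
hidden = torch.randn(1, 1024, device="cuda")
out = offloaded_forward(layers, hidden)
```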
- LLaMA-7B in Pure C++ with full Apple Silicon support
My code for this is very much not high quality, but I have a CPU + GPU + SSD combination: https://github.com/gmorenz/llama/tree/ssd
Usage instructions in the commit message: https://github.com/facebookresearch/llama/commit/5be06e56056...
At least with my hardware this runs at [size of model] / [speed of SSD reads] seconds per token, which (up to some possible further memory reduction so you can run larger batches at once on the same GPU) is as good as it gets when you need to read the whole model from disk for each token.
At 125 GB and a 2 GB/s read speed (the largest model, and what I get from my SSD), that's about 60 seconds per token, or roughly 1,440 tokens per day, which isn't exactly practical. That's really the issue here: if you have to stream the model from an SSD because you don't have enough RAM, it is just a fundamentally slow process.
You could probably optimize quite a bit for batch throughput if you're ok with the latency though.
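The arithmetic behind those numbers, spelled out (the 125 GB figure is the comment's, roughly the largest LLaMA checkpoint in fp16):

```python
# One full pass over the weights per generated token, so token latency is
# bounded by how fast the weights can be read back from the SSD.
model_size_gb = 125        # largest LLaMA checkpoint, per the comment above
ssd_read_gb_per_s = 2      # sequential read speed of the SSD

seconds_per_token = model_size_gb / ssd_read_gb_per_s   # ~62 s per token
tokens_per_day = 24 * 60 * 60 / seconds_per_token       # ~1,400 tokens per day
print(f"~{seconds_per_token:.0f} s/token, ~{tokens_per_day:.0f} tokens/day")
```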
- Llama-CPU: Fork of Facebook's LLaMA model to run on CPU
I don't know about this fork specifically, but in general yes absolutely.
Even without enough RAM, you can stream model weights from disk and run at [size of model] / [disk read speed] seconds per token.
I'm doing that on a small GPU with this code, but it should be easy to get it working with the CPU doing the compute instead (and at least with my disk and CPU, I'm not sure it would even be slower; disk reads would probably still be the bottleneck):
https://github.com/gmorenz/llama/tree/ssd
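A simplified illustration of the disk-streaming idea (not that fork's implementation): memory-map a raw fp16 weight shard so the OS pages it in from the SSD only as the matmul touches it, instead of loading everything into RAM first. The file name and shapes below are placeholders.

```python
import numpy as np

# Memory-map a (placeholder) raw fp16 weight shard; nothing is read from the
# SSD until the matrix multiply below actually touches those pages.
weights = np.memmap("weight_shard.fp16.bin", dtype=np.float16,
                    mode="r", shape=(4096, 4096))

activations = np.random.randn(1, 4096).astype(np.float16)
out = activations @ weights   # disk reads happen here, page by page
```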
ChatGLM-6B
- What are the current fastest multi-GPU inference frameworks?
ChatGLM seems to be pretty popular but I've never used this before.
- A CEO is spending more than $2,000 a month on ChatGPT Plus accounts for all of his employees, and he says it's saving 'hours' of time
There are also locally hosted options that approach the effectiveness of ChatGPT. ChatGLM-6B, for example, was specifically built so that it can run on a single consumer-grade GPU.
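A minimal sketch of running it locally via Hugging Face transformers, following the usage shown in the ChatGLM-6B README (the 4-bit quantize call is what brings GPU memory needs down to roughly 6 GB; the exact call order can differ between releases):

```python
from transformers import AutoTokenizer, AutoModel

# trust_remote_code pulls in the model's own modelling code from the repo.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)

# 4-bit quantization keeps GPU memory use around 6 GB on a consumer card.
model = model.quantize(4).half().cuda().eval()

response, history = model.chat(tokenizer, "Hello! Briefly introduce yourself.", history=[])
print(response)
```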
- Open Source Chinese LLMs
- ChatGLM-6B: run locally on consumer graphics card (6GB of GPU memory required)
- Ask HN: Open source LLM for commercial use?
- Coding LLaMA Model?
A link for y'all. Definitely gonna try to mess around with this!
- On GPT, AI, and some socioeconomic questions about the future: asking for everyone's thoughts
- FLiPN-FLaNK Stack Weekly for 20 March 2023
- ChatGLM-6B - an open source 6.2 billion parameter English/Chinese bilingual LLM trained on 1T tokens, supplemented by supervised fine-tuning, feedback bootstrap, and Reinforcement Learning from Human Feedback. Runs on consumer grade GPUs
- ChatGLM: Open bilingual language model based on General Language Model framework
What are some alternatives?
llama.cpp - LLM inference in C/C++
llama-mps - Experimental fork of Facebook's LLaMA model which runs it with GPU acceleration on Apple Silicon M1/M2
alpaca.cpp - Locally run an Instruction-Tuned Chat-Style LLM
stanford_alpaca - Code and documentation to train Stanford's Alpaca models, and generate the data.
tinygrad - You like pytorch? You like micrograd? You love tinygrad! ❤️ [Moved to: https://github.com/tinygrad/tinygrad]
Open-Assistant - OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
llama - Inference code for Llama models
datagen - Generate authentic looking mock data based on a SQL, JSON or Avro schema and produce to Kafka in JSON or Avro format.
KoboldAI-Client
basaran - Basaran is an open-source alternative to the OpenAI text completion API. It provides a compatible streaming API for your Hugging Face Transformers-based text generation models.