OpenLLaMA: a permissively licensed open-source reproduction of Meta AI's LLaMA 7B, trained on the RedPajama dataset.
Koboldcpp [1], which builds on llama.cpp and adds a GUI, is a great way to run these models. Most people aren't running these models at full weight; GGML quantization is recommended for CPU+GPU, or GPTQ if you have enough GPU VRAM.
GGML 13B models at 4-bit (Q4_0) take somewhere around 9 GB of RAM, and Q5_K_M takes about 11 GB. GPU offloading support has also been added; I've been using 22 layers on my laptop's RTX 2070 Max-Q with 8 GB VRAM. I get around 2-3 tokens per second with 13B models. In my experience, running 13B models is worth the extra time it takes to generate a response compared to 7B models. GPTQ is faster, but I can't fit a quantized 13B model in VRAM, so I don't use it.
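Those RAM figures line up with a simple back-of-envelope estimate: effective bits per weight times parameter count, plus a fixed allowance for the KV cache and runtime buffers. The numbers below are rough assumptions, not format specs (Q4_0 stores 4-bit weights plus a per-block scale, roughly 4.5 effective bits per weight; Q5_K_M lands around 5.5; the overhead constant is a guess):

```python
def quantized_ram_gb(n_params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 1.5) -> float:
    """Rough RAM estimate for a quantized GGML model.

    overhead_gb is a guessed allowance for the KV cache and runtime
    buffers; real usage also depends on context length.
    """
    weights_gb = n_params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

# 13B at ~4.5 effective bits/weight (Q4_0) -> ~8.8 GB, close to the ~9 GB observed
print(f"{quantized_ram_gb(13, 4.5):.1f} GB")
# 13B at ~5.5 effective bits/weight (Q5_K_M) -> ~10.4 GB, close to the ~11 GB observed
print(f"{quantized_ram_gb(13, 5.5):.1f} GB")
```

The same formula explains why 7B models fit comfortably on machines where 13B models are tight.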
TheBloke [2] has been quantizing models and uploading them to HF, and will probably upload a quantized version of this one soon. His Discord server also has good guides to help you get going; it's linked in the model card of most of his models.
[1] https://github.com/LostRuins/koboldcpp
[2] https://huggingface.co/TheBloke
There are many UIs for running locally, but the easiest is koboldcpp:
https://github.com/LostRuins/koboldcpp
It's descended from the roleplaying community, but works fine (and performantly) for question answering and such.
You will need to download the model from HF and quantize it yourself: https://github.com/ggerganov/llama.cpp#prepare-data--run
There is the Language Model Evaluation Harness project, which evaluates LLMs on over 200 tasks. HuggingFace has a leaderboard tracking performance on a subset of these tasks.
https://github.com/EleutherAI/lm-evaluation-harness
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...
For some discussion on how to have the LLaMa tokenizer (properly) handle repeating spaces, please see this discussion: https://github.com/openlm-research/open_llama/issues/40
https://www.runpod.io/console/templates
This is the readme for the one I mentioned: https://github.com/TheBlokeAI/dockerLLM/blob/main/README_Run...
> can I use Colab/Huggingface GPUs?
You use these templates on the RunPod platform itself. There's no free-tier equivalent like you have with Colab/HF, but currently you can rent an RTX 4090 for $0.69/hr, so it's pretty affordable.
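For a sense of what that hourly rate means per token, here is a quick cost calculation. Only the $0.69/hr rate comes from the comment above; the 40 tokens/second throughput is an assumed ballpark for a quantized 13B model on a 4090, not a benchmark:

```python
def cost_per_million_tokens(rate_per_hour: float, tokens_per_second: float) -> float:
    # seconds to generate 1M tokens, converted to hours, times the hourly rate
    hours = 1_000_000 / tokens_per_second / 3600
    return hours * rate_per_hour

# RTX 4090 rented at $0.69/hr; 40 tok/s is an assumption
print(f"${cost_per_million_tokens(0.69, 40):.2f} per 1M tokens")
```

Even if the real throughput is half that, the cost stays in the single-digit dollars per million tokens.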