It might be worth trying out 2- and 3-bit quantization on llama.cpp. It's currently sitting in an unmerged PR, but it works. I doubt you'll be getting 5+ tokens/second, though. link
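The core idea behind low-bit quantization can be sketched in a few lines of Python. This is a toy blockwise 2-bit scheme for illustration only; the names and block size are made up here, and it is not llama.cpp's actual k-quant format from that PR:

```python
import numpy as np

def quantize_2bit(weights, block_size=16):
    """Toy blockwise 2-bit quantization: each block stores a float scale,
    a float minimum, and 2-bit indices (0..3) per weight. Illustrative
    only -- not llama.cpp's real k-quant layout."""
    w = weights.reshape(-1, block_size)
    w_min = w.min(axis=1, keepdims=True)
    scale = (w.max(axis=1, keepdims=True) - w_min) / 3  # 2 bits -> 4 levels
    scale[scale == 0] = 1.0  # avoid divide-by-zero on constant blocks
    idx = np.clip(np.round((w - w_min) / scale), 0, 3).astype(np.uint8)
    return idx, scale, w_min

def dequantize_2bit(idx, scale, w_min, shape):
    """Reconstruct approximate weights from indices, scales, and minima."""
    return (idx * scale + w_min).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)
idx, scale, w_min = quantize_2bit(w)
w_hat = dequantize_2bit(idx, scale, w_min, w.shape)
# Rounding to 4 levels bounds the per-weight error by half a step (scale/2)
print(float(np.abs(w - w_hat).max()))
```

With only four levels per block the reconstruction is coarse, which is why quality drops faster at 2 bits than at 4, and why the speed gain from smaller weights doesn't guarantee many tokens per second on its own.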
My GPTQ-for-LLaMa folder under repositories says it's pointed at https://github.com/oobabooga/GPTQ-for-LLaMa.git. I've run through the instructions and also applied the monkey patch to train and apply a 4-bit LoRA, which may come into play. No idea.
NOTE:
The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives.
Hence, a higher number means a more popular project.
Related posts
- Show HN: Free GitHub Copilot CLI with your own model or API
- Einsum in 40 Lines of Python
- Show HN: Cognita – open-source RAG framework for modular applications
- Show HN: Data Bonsai: a Python package to clean your data with LLMs
- Ask HN: Seeking On-Premises Website Examples for Uptime Comparison Experiment