exllama
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
I made a fork of alpaca_lora_4bit that contains the whole project plus some notes. Aside from a small hack to read plaintext training data and to raise the configured sequence length beyond the default of 2048, the only real change from the main repo is a horribly messy attention patch that awkwardly bodges a pre-allocated K/V cache scheme into the HF Llama implementation.
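The idea behind a pre-allocated K/V cache is to reserve the full-length key/value tensors up front and write each step's entries in place, instead of concatenating onto a growing cache every token the way the stock HF implementation does. The class and shapes below are an illustrative sketch of that scheme, not code lifted from the actual patch:

```python
import torch

class PreallocatedKVCache:
    """Fixed-size K/V cache: allocate once, write new entries in place.

    This avoids the per-step torch.cat() of the stock HF path, which
    reallocates (and briefly duplicates) the whole cache on every token.
    Names and shapes here are assumptions for illustration only.
    """

    def __init__(self, batch, n_heads, max_seq_len, head_dim,
                 dtype=torch.float16, device="cpu"):
        shape = (batch, n_heads, max_seq_len, head_dim)
        self.k = torch.zeros(shape, dtype=dtype, device=device)
        self.v = torch.zeros(shape, dtype=dtype, device=device)
        self.seq_len = 0  # number of positions filled so far

    def append(self, new_k, new_v):
        """Write new_k/new_v of shape (batch, n_heads, t, head_dim)
        at the current offset, then return views over the valid prefix."""
        t = new_k.shape[2]
        self.k[:, :, self.seq_len:self.seq_len + t] = new_k
        self.v[:, :, self.seq_len:self.seq_len + t] = new_v
        self.seq_len += t
        # Views, not copies: attention reads only the filled prefix.
        return self.k[:, :, :self.seq_len], self.v[:, :, :self.seq_len]
```

During generation, each decoding step appends one token's K/V and attends over the returned views; memory usage is fixed at `max_seq_len` from the start rather than growing (and fragmenting) as the sequence lengthens.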
The README.md has some details about what I did and how it went, but it ends on a list of next steps that I've yet to get to, because I want to work some more on this other project first. The reason is that the Transformers library is just too limiting to work with: it's very poorly suited to these kinds of experiments. You end up patching functionality in and out, instantiating models in weird and hacky ways only to overwrite their weights afterwards, shuffling layers around, wondering where all your VRAM went, and so on. I hope this new project can serve as a better platform for experimenting with LoRAs, among other things, and then I'll get back to the long-range adapter. I still haven't concluded that it can't work, just that it takes more than ten hours of training on an A100, and since I pay for that by the hour I want to make it count. ;)
There's https://github.com/saharNooby/rwkv.cpp which seems to work, and might be compatible with text-generation-webui.