The next best thing, if you really need two different models at once, is to run one on the GPU and the other on the CPU: for example, one model with exllama on the GPU and the other with llama.cpp in CPU-only mode.
Pre-prompt injection. Basically, you add context to the input before passing it to the LLM. There is a character_bias extension for text-generation-webui that does this.
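A minimal sketch of the idea, in plain Python. The function name and the bias string are made up for illustration; they are not part of text-generation-webui or the character_bias extension, which work along similar lines.

```python
def inject_pre_prompt(user_input: str, context: str) -> str:
    """Prepend a fixed context string to the user's input before it is
    sent to the LLM, so the model sees the injected context first."""
    return f"{context}\n\n{user_input}"

# Hypothetical example: bias the model toward a persona, similar in
# spirit to what the character_bias extension does.
bias = "You are a cheerful pirate. Stay in character."
prompt = inject_pre_prompt("How do I sort a list in Python?", bias)
print(prompt)
```

The LLM never sees the raw user input alone; it always receives the injected context first, which steers tone or behavior without changing the model itself.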
And finally, the folks from KoboldAI do some interesting things with pseudocode and soft prompts that might also be relevant.
Character Card specification v2 https://github.com/malfoyslastname/character-card-spec-v2