Running Multiple AI Models Sequentially for a Conversation on a Single GPU

This page summarizes the projects mentioned and recommended in the original post on /r/LocalLLaMA

  • exllama

    A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

  • The next best thing, if you really need two different models at once, is to run one on the GPU and the other on the CPU: for example, one model with exllama on the GPU and the other with llama.cpp in CPU mode. A minimal sketch of this split is shown below.
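
    A sketch of that split, using llama-cpp-python for both models for simplicity (the post pairs exllama with llama.cpp; here llama.cpp with full GPU offload stands in for the GPU side, and the model paths are placeholders):

    ```python
    # Two quantized models share one machine: one fully offloaded to the
    # GPU, the other kept entirely on the CPU. They then take turns in a
    # single conversation transcript.
    from llama_cpp import Llama

    gpu_model = Llama(model_path="models/model-a.Q4_K_M.gguf",
                      n_gpu_layers=-1, verbose=False)  # all layers on GPU
    cpu_model = Llama(model_path="models/model-b.Q4_K_M.gguf",
                      n_gpu_layers=0, verbose=False)   # CPU only

    transcript = "A conversation between Alice and Bob.\n"
    speakers = [("Alice", gpu_model), ("Bob", cpu_model)]

    for turn in range(4):
        name, model = speakers[turn % 2]
        out = model(transcript + f"{name}:", max_tokens=64, stop=["\n"])
        transcript += f"{name}: {out['choices'][0]['text'].strip()}\n"

    print(transcript)
    ```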

  • text-generation-webui

    A Gradio web UI for Large Language Models. Supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), Llama models.

  • Pre-prompt injection. Basically, you add context to the input before you pass it to the LLM. There is a character_bias extension for text-generation-webui that does this. A toy illustration follows below.
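
    A framework-free illustration of the idea (the function name and persona text here are invented; the character_bias extension does the equivalent inside the web UI's prompt pipeline):

    ```python
    # Pre-prompt injection: prepend context so the model sees it before
    # the user's actual input.
    def inject_pre_prompt(user_input: str, persona: str) -> str:
        return (f"[Context: {persona}]\n"
                f"User: {user_input}\n"
                f"Assistant:")

    prompt = inject_pre_prompt(
        "What's your favorite food?",
        "You are Alice, a cheerful medieval innkeeper.",
    )
    print(prompt)  # this string is what gets sent to the model
    ```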

  • KoboldAI-Client

    A browser-based front-end for AI-assisted writing with multiple local and remote AI models.

  • And finally, the folks from KoboldAI do some interesting things with pseudocode and soft prompts that might also be relevant.

  • character-card-spec-v2

    An updated specification for AI character cards: https://github.com/malfoyslastname/character-card-spec-v2. A minimal example card is sketched below.
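
    A minimal card in the v2 shape (only a few of the spec's fields are shown, the values are invented for illustration, and the full field list lives in the repository above):

    ```python
    # Character Card v2 moves the card's fields under a "data" key and
    # adds explicit "spec"/"spec_version" markers.
    import json

    card = {
        "spec": "chara_card_v2",
        "spec_version": "2.0",
        "data": {
            "name": "Alice",
            "description": "A cheerful medieval innkeeper.",
            "personality": "warm, talkative, curious",
            "scenario": "A traveler has just arrived at Alice's inn.",
            "first_mes": "Welcome, traveler! What can I get you?",
            "mes_example": "",
        },
    }

    print(json.dumps(card, indent=2))
    ```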
