Most efficient way to set up API serving of custom LLMs?

This page summarizes the projects mentioned and recommended in the original post on /r/LocalLLaMA

  • llama-cpp-python

    Python bindings for llama.cpp

    I already have a Discord bot set up to interface with OpenAI's API, which a small Discord server uses. I'm looking to give the bot access to custom models like Vicuna or any of the LLaMA variants that have come out (up to 30B, potentially even 65B). The most obvious solution would be setting up something like https://github.com/abetlen/llama-cpp-python on a cloud instance and serving it from FastAPI, as sketched below.
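A minimal sketch of that approach, wrapping llama-cpp-python in a single FastAPI endpoint. The model path, context size, and generation parameters below are placeholders, not recommendations:

```python
# Sketch: serving a local GGUF model over HTTP with FastAPI.
# Model path and parameters are placeholders.
from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()

# Load the model once at startup; n_ctx is the context window size.
llm = Llama(model_path="./models/vicuna-13b.Q4_K_M.gguf", n_ctx=2048)

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7

@app.post("/complete")
def complete(req: CompletionRequest):
    # llama_cpp returns an OpenAI-style dict with a "choices" list.
    out = llm(req.prompt, max_tokens=req.max_tokens, temperature=req.temperature)
    return {"text": out["choices"][0]["text"]}
```

Assuming the file is saved as server.py, run it with `uvicorn server:app` and POST a JSON body with a `prompt` field. Note that llama-cpp-python also ships a built-in OpenAI-compatible server (`python -m llama_cpp.server --model <path>`), which may be simpler than hand-rolling a FastAPI app.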

  • text-generation-webui

    A Gradio web UI for Large Language Models. Supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), Llama models.

    You can find more about the extension here: https://github.com/oobabooga/text-generation-webui/tree/main/extensions/openai
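Because the extension mimics the OpenAI API, a bot that already talks to OpenAI mostly just needs its base URL redirected. Here is a hedged sketch using the legacy `openai` Python client; the port and model name are assumptions, so check the extension's actual configuration:

```python
# Sketch: reusing an existing OpenAI client against the local
# text-generation-webui openai extension instead of api.openai.com.
import openai

openai.api_base = "http://localhost:5000/v1"  # assumed extension port; verify yours
openai.api_key = "sk-dummy"                   # typically not validated locally

response = openai.ChatCompletion.create(
    model="local-model",  # placeholder; the extension serves whatever model is loaded
    messages=[{"role": "user", "content": "Hello from the Discord bot!"}],
)
print(response["choices"][0]["message"]["content"])
```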

  • gpt-discord-bot

    Example Discord bot written in Python that uses the completions API to have conversations with the `text-davinci-003` model, and the moderations API to filter the messages.

    And here's a Discord bot that currently works with it, which you may be able to learn from: https://github.com/openai/gpt-discord-bot
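For reference, a compressed sketch of the pattern that bot demonstrates: run each incoming message through the moderations API, then answer with a completions call against `text-davinci-003`. The tokens are placeholders and the synchronous API calls are a simplification; the actual repo adds conversation threading and async I/O:

```python
# Sketch of the gpt-discord-bot pattern: moderate, then complete.
# Tokens are placeholders; calls are synchronous for brevity.
import discord
import openai

openai.api_key = "sk-..."  # placeholder

intents = discord.Intents.default()
intents.message_content = True  # needed to read message text
client = discord.Client(intents=intents)

@client.event
async def on_message(message: discord.Message):
    if message.author.bot:
        return  # ignore other bots (and ourselves)
    # Filter input with the moderations API before generating a reply.
    moderation = openai.Moderation.create(input=message.content)
    if moderation["results"][0]["flagged"]:
        return
    completion = openai.Completion.create(
        model="text-davinci-003",
        prompt=message.content,
        max_tokens=256,
    )
    await message.reply(completion["choices"][0]["text"])

client.run("DISCORD_BOT_TOKEN")  # placeholder
```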

