What are the current fastest multi-GPU inference frameworks?

This page summarizes the projects mentioned and recommended in the original post on /r/LocalLLaMA

  • accelerate

    🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support

  • I rented a cloud server today to try out some recent LLMs such as Falcon and Vicuna. I started with Hugging Face's generate API using accelerate. It got about 2 instances/s with 8 A100 40GB GPUs, which I think is a bit slow. I was using batch size = 1 since I do not know how to do batched inference with the .generate API. I had already applied torch.compile + bf16. Is there an even faster multi-GPU inference framework? I have 8 GPUs, so I was hoping for MUCH faster speeds, like ~10 or 20 instances per second (or is that possible at all? I am pretty new to this field).

  • FastChat

    An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

  • Vicuna has FastChat; not sure how flexible it is to configure, though.

  • ChatGLM-6B

    ChatGLM-6B: An Open Bilingual Dialogue Language Model | 开源双语对话语言模型

  • ChatGLM seems to be pretty popular, but I've never used it before.
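
The original post mentions running the .generate API with batch size = 1 because batched inference was unclear. Below is a minimal sketch of one way batching with .generate could look, not a definitive recipe. The helper names `chunked` and `generate_batched`, the placeholder model name "gpt2", and the default batch size are all assumptions made for illustration. Note that `device_map="auto"` (which relies on accelerate) spreads model layers across the available GPUs; it does not by itself run independent copies of the model in parallel.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def generate_batched(
    prompts: Sequence[str],
    model_name: str = "gpt2",  # placeholder model, an assumption for this sketch
    batch_size: int = 8,       # hypothetical default; tune to fit GPU memory
    max_new_tokens: int = 32,
) -> List[str]:
    """Run .generate on several prompts at once instead of one at a time."""
    # Imported here so the batching helper above works without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Decoder-only models should be padded on the left for generation.
    tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto",  # requires accelerate; shards layers across GPUs
    )
    model.eval()

    completions: List[str] = []
    for batch in chunked(prompts, batch_size):
        # Pad the batch to a common length and move it to the model's device.
        enc = tokenizer(batch, return_tensors="pt", padding=True).to(model.device)
        with torch.no_grad():
            out = model.generate(
                **enc,
                max_new_tokens=max_new_tokens,
                pad_token_id=tokenizer.pad_token_id,
            )
        # Keep only the newly generated tokens, dropping the padded prompt.
        new_tokens = out[:, enc["input_ids"].shape[1]:]
        completions.extend(tokenizer.batch_decode(new_tokens, skip_special_tokens=True))
    return completions
```

Calling `generate_batched(["Hello", "What is a GPU?"], batch_size=2)` would send both prompts through a single .generate call. Whether this reaches the throughput hoped for in the post depends on the model, sequence lengths, and memory, so any speedup should be measured rather than assumed.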
