vLLM: 24x faster LLM serving than HuggingFace Transformers

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  1. flash-attention

    Fast and memory-efficient exact attention

    I wonder how this compares to Flash Attention (https://github.com/HazyResearch/flash-attention), which is the other "memory aware" attention project I'm aware of.

    I guess Flash Attention is more about utilizing GPU SRAM effectively, whereas this is more about using OS/CPU-style memory management better?
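    The distinction the commenter is guessing at matches how vLLM's PagedAttention is usually described: the KV cache is split into fixed-size blocks and each sequence gets a block table, much like an OS page table, so memory is allocated on demand instead of reserved for the maximum length. A minimal bookkeeping sketch of that idea (illustrative names only, not vLLM's actual API):

    ```python
    class PagedKVCache:
        """Toy block-table allocator illustrating paged KV caching.

        Real vLLM stores key/value tensors in fixed-size GPU blocks; this
        sketch only tracks which blocks each sequence owns.
        """

        def __init__(self, num_blocks: int, block_size: int):
            self.block_size = block_size              # tokens per block
            self.free_blocks = list(range(num_blocks))
            self.block_tables = {}                    # seq_id -> [block ids]
            self.seq_lens = {}                        # seq_id -> tokens stored

        def append_token(self, seq_id: str) -> int:
            """Reserve space for one token, allocating a new block only when
            the sequence's last block is full. Returns the block id used."""
            table = self.block_tables.setdefault(seq_id, [])
            length = self.seq_lens.get(seq_id, 0)
            if length % self.block_size == 0:         # last block full, or none yet
                if not self.free_blocks:
                    raise MemoryError("KV cache exhausted")
                table.append(self.free_blocks.pop())
            self.seq_lens[seq_id] = length + 1
            return table[-1]

        def free(self, seq_id: str) -> None:
            """Return a finished sequence's blocks to the free pool."""
            self.free_blocks.extend(self.block_tables.pop(seq_id, []))
            self.seq_lens.pop(seq_id, None)


    cache = PagedKVCache(num_blocks=8, block_size=4)
    for _ in range(6):                            # 6 tokens -> ceil(6/4) = 2 blocks
        cache.append_token("req-0")
    print(len(cache.block_tables["req-0"]))       # 2
    cache.free("req-0")
    print(len(cache.free_blocks))                 # 8
    ```

    FlashAttention, by contrast, keeps the attention computation itself inside GPU SRAM via tiling; the two techniques address different layers and are often used together.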

  2. willow

    Open source, local, and self-hosted voice assistant, competitive with Amazon Echo/Google Home

  3. willow-inference-server

    Open source, local, and self-hosted highly optimized language inference server supporting ASR/STT, TTS, and LLM over WebRTC, REST, and WS

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives, so a higher number means a more popular project.

Suggest a related project

Related posts

  • Show HN: Willow Inference Server: Optimized ASR/TTS/LLM for Willow/WebRTC/REST

    3 projects | news.ycombinator.com | 23 May 2023
  • Distil-Whisper: distilled version of Whisper that is 6 times faster, 49% smaller

    14 projects | news.ycombinator.com | 31 Oct 2023
  • Whisper.api: An open source, self-hosted speech-to-text with fast transcription

    5 projects | news.ycombinator.com | 22 Aug 2023
  • [D] What is the most efficient version of OpenAI Whisper?

    7 projects | /r/MachineLearning | 12 Jul 2023
  • Show HN: Project S.A.T.U.R.D.A.Y – open-source, self hosted, J.A.R.V.I.S

    7 projects | news.ycombinator.com | 2 Jul 2023