vLLM: 24x faster LLM serving than HuggingFace Transformers

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • flash-attention

    Fast and memory-efficient exact attention

  • I wonder how this compares to Flash Attention (https://github.com/HazyResearch/flash-attention), which is the other "memory-aware" attention project I'm aware of.

    I guess Flash Attention is more about using GPU SRAM efficiently, whereas this is more about making better use of OS/CPU memory?
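
    For context on the comparison, here is a hedged sketch of vLLM's offline batched-inference API (assumptions: vLLM is installed and facebook/opt-125m is used as a stand-in model; prompts and sampling settings are placeholders). vLLM attributes most of its serving-throughput gains to PagedAttention's paging of the KV cache in GPU memory, whereas Flash Attention restructures the attention computation itself to stay in GPU SRAM.

    # Hedged sketch of vLLM's offline batched-inference API.
    # Assumptions: vLLM installed; facebook/opt-125m used as a stand-in model.
    from vllm import LLM, SamplingParams

    prompts = [
        "The capital of France is",
        "In one sentence, paged KV caching means",
    ]
    params = SamplingParams(temperature=0.8, max_tokens=64)

    # vLLM batches the prompts and pages the KV cache internally (PagedAttention),
    # which is where the headline throughput numbers mostly come from.
    llm = LLM(model="facebook/opt-125m")
    for out in llm.generate(prompts, params):
        print(out.prompt, "->", out.outputs[0].text)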

  • willow

    Open source, local, and self-hosted Amazon Echo/Google Home competitive Voice Assistant alternative

  • We run into this constantly with Willow[0] and the Willow Inference Server[1]. There seems to be a large gap in understanding among many users, who find it difficult to accept a fundamental reality: GPUs are so physically different and so much better suited to many/most ML tasks that all the CPU tricks in the world cannot bring a CPU even close to GPU performance (while maintaining quality/functionality) for many tasks. I find this interesting because everyone seems to take it as obvious that integrated and discrete graphics aren't even close for gaming. Ditto for these tasks.

    With the Willow Inference Server I'm constantly telling people: a six-year-old $100 Tesla P4/GTX 1070 walks all over even the best CPUs in the world for our primary task of speech-to-text/ASR, at dramatically lower cost and power usage. Seriously: a GTX 1070 is at least 5x faster than a Threadripper 5955WX. Our goal is to provide an open-source equivalent of the commercial voice assistant user experience, and that is, and will remain, fundamentally impossible on CPU for the foreseeable future.

    Slight tangent, but there are users in the space who seem to be under the impression that they can use a Raspberry Pi for voice assistant/speech recognition. It's not even close to a fair fight: with the same implementation and settings, a GTX 1070 is roughly 90x (nearly two orders of magnitude) faster[2] than a Raspberry Pi. Yes, all-in, a machine with a GTX 1070 uses an order of magnitude more power than a Raspberry Pi (roughly 3 W vs 30 W), but even in the countries with the most expensive electricity in the world that works out to a $2-$3/mo difference - which I feel, at least, is a reasonable trade-off given the dramatic difference in usability (the Raspberry Pi is essentially useless - waiting 10-30 seconds for a response makes pulling your phone out faster). A rough timing sketch for reproducing the GPU-vs-CPU gap follows the links below.

    [0] - https://github.com/toverainc/willow

    [1] - https://github.com/toverainc/willow-inference-server

    [2] - https://github.com/toverainc/willow-inference-server/tree/wi...
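
    As promised above, a rough way to see the GPU-vs-CPU gap for Whisper-style ASR yourself, sketched with faster-whisper (an assumption for illustration, not necessarily Willow's exact pipeline; the model size, audio file, and devices are placeholders):

    # Rough timing sketch: the same transcription on GPU vs CPU with faster-whisper.
    # Assumptions: faster-whisper installed, a CUDA GPU present, "sample.wav" is any short speech clip.
    import time
    from faster_whisper import WhisperModel

    def bench(device, compute_type, audio="sample.wav"):
        # Model loading happens before the timer so only transcription is measured.
        model = WhisperModel("base", device=device, compute_type=compute_type)
        start = time.perf_counter()
        segments, _info = model.transcribe(audio)
        " ".join(s.text for s in segments)  # segments is lazy; consuming it runs the actual decode
        return time.perf_counter() - start

    gpu_s = bench("cuda", "float16")
    cpu_s = bench("cpu", "int8")
    print(f"GPU: {gpu_s:.2f}s  CPU: {cpu_s:.2f}s  speedup: {cpu_s / gpu_s:.1f}x")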

  • willow-inference-server

    Open source, local, and self-hosted highly optimized language inference server supporting ASR/STT, TTS, and LLM across WebRTC, REST, and WS

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
