Show HN: Willow Inference Server: Optimized ASR/TTS/LLM for Willow/WebRTC/REST

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • willow-inference-server

    Open source, local, and self-hosted highly optimized language inference server supporting ASR/STT, TTS, and LLM across WebRTC, REST, and WS

  • Thanks!

    Yes, "realtime multiple" is audio/speech length divided by actual inference time.

    You got it! The demo video shows the slowest response times because it uses the highest quality/accuracy settings available with Whisper (large-v2, beam size 5). For comparison, Willow devices use medium with beam size 1 by default, and those responses come in at 500 milliseconds or less (again depending on speech length) across a wide variety of new and old CUDA hardware. Some sample benchmarks here[0].

    Applications using WIS (including Willow) can provide model settings on a per-request basis to balance quality vs latency depending on the task.

    [0] - https://github.com/toverainc/willow-inference-server#benchma...
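    The "realtime multiple" arithmetic above can be sketched as follows (the helper name is illustrative, not part of WIS):

    ```python
    def realtime_multiple(audio_seconds: float, inference_seconds: float) -> float:
        """Return how many times faster than real time the inference ran:
        audio/speech length divided by actual inference time."""
        if inference_seconds <= 0:
            raise ValueError("inference time must be positive")
        return audio_seconds / inference_seconds

    # A 4-second utterance transcribed in 0.25 s runs at 16x realtime.
    print(realtime_multiple(4.0, 0.25))  # -> 16.0
    ```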

  • willow

    Open source, local, and self-hosted Amazon Echo/Google Home competitive Voice Assistant alternative

  • Hey HN!

    Willow Inference Server (WIS) is a focused and highly optimized language inference server implementation. Our goal is to "automagically" enable performant, cost-effective self-hosting of released state-of-the-art/best-of-breed models for speech and language tasks:

    Primarily targeting CUDA (works on CPU too) with support for low-end (cheap) devices such as the Tesla P4, GTX 1060, and up. Don't worry - it screams on an RTX 4090 too! (See benchmarks on GitHub.)

    Memory optimized - all three default Whisper models (base, medium, large-v2) loaded simultaneously with TTS support inside 6 GB of VRAM. LLM support defaults to int4 quantization (conversion scripts included). ASR/STT + TTS + Vicuna 13B require roughly 18 GB VRAM. Less for 7B, of course!
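    Using only the figures quoted above (roughly 6 GB for the Whisper + TTS stack, roughly 18 GB with Vicuna 13B added, so about 12 GB for the quantized 13B model itself), a rough capacity check might look like this. The numbers and the `fits` function are illustrative, not a WIS API:

    ```python
    # Approximate VRAM budgets (GB) from the figures quoted above; the Vicuna
    # figure is inferred as 18 GB total minus the ~6 GB speech stack.
    VRAM_GB = {
        "whisper_all_plus_tts": 6.0,   # base + medium + large-v2 + TTS
        "vicuna_13b_int4": 12.0,       # inferred from the 18 GB total
    }

    def fits(gpu_vram_gb: float, components: list[str]) -> bool:
        """Check whether the selected components fit in the given GPU's VRAM."""
        return sum(VRAM_GB[c] for c in components) <= gpu_vram_gb

    # A 24 GB RTX 4090 fits the full stack; an 8 GB card fits speech only.
    print(fits(24.0, ["whisper_all_plus_tts", "vicuna_13b_int4"]))  # True
    print(fits(8.0, ["whisper_all_plus_tts", "vicuna_13b_int4"]))   # False
    ```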

    ASR. Heavy emphasis - Whisper optimized for very high quality, as-close-to-real-time-as-possible speech recognition via a variety of means (Willow, WebRTC, POSTing a file, integration with devices and client applications, etc.). Results in hundreds of milliseconds or less for most intended speech tasks. See the YouTube WebRTC demo[1].
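    Since applications can pass model settings per request, a REST call might be assembled along these lines. The `/api/willow` path and the `model`/`beam_size` parameter names are assumptions for illustration; check the WIS API documentation for the real ones:

    ```python
    from urllib.parse import urlencode

    def build_asr_request(base_url: str, model: str = "medium",
                          beam_size: int = 1) -> str:
        """Build the URL for a hypothetical WIS ASR endpoint, choosing the
        quality/latency trade-off per request via model and beam size."""
        query = urlencode({"model": model, "beam_size": beam_size})
        return f"{base_url}/api/willow?{query}"

    # Highest quality (slowest), as in the demo video: large-v2 with beam 5.
    url = build_asr_request("https://wis.example.com", model="large-v2",
                            beam_size=5)
    print(url)
    ```

    The audio itself would then be POSTed to that URL.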

    TTS. Primarily provided for assistant tasks (like Willow!) and visually impaired users.

    LLM. Optionally pass input through a provided/configured LLM for question answering, chatbot, and assistant tasks. Currently supports LLaMA derivatives with a strong preference for Vicuna (I like 13B). Built-in support for quantization to int4 to conserve GPU memory.
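    To see why int4 quantization matters, compare raw weight storage for a 13B-parameter model. This is back-of-the-envelope arithmetic that ignores activations, KV cache, and quantization overhead:

    ```python
    PARAMS_13B = 13_000_000_000

    def weight_gb(params: int, bits_per_weight: int) -> float:
        """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
        return params * bits_per_weight / 8 / 1e9

    fp16 = weight_gb(PARAMS_13B, 16)  # ~26 GB: too big for most consumer GPUs
    int4 = weight_gb(PARAMS_13B, 4)   # ~6.5 GB: leaves room for ASR/TTS
    print(fp16, int4)  # -> 26.0 6.5
    ```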

    Support for a variety of transports. REST, WebRTC, and WebSockets (primarily for LLM).

    Performance and memory optimized. Leverages CTranslate2 for Whisper support and AutoGPTQ for LLMs.

    Willow support. WIS powers the Tovera-hosted best-effort example server that Willow users enjoy.

    Support for WebRTC - stream audio in real-time from browsers or WebRTC applications to optimize quality and response time. Heavily optimized for long-running sessions using WebRTC audio track management. Leave your session open for days at a time and have self-hosted ASR transcription within hundreds of milliseconds while conserving network bandwidth and CPU!

    Support for custom TTS voices. From relatively short audio recordings, WIS can create and manage custom TTS voices. See the API documentation for more information.

    Much like the release of Willow[0] last week, this is an early release, but we had a great response from HN and are looking forward to hearing what everyone thinks!

    [0] - https://github.com/toverainc/willow

    [1] - https://www.youtube.com/watch?v=PxCO5eONqSQ



Related posts

  • VLLM: 24x faster LLM serving than HuggingFace Transformers

    3 projects | news.ycombinator.com | 20 Jun 2023
  • Distil-Whisper: distilled version of Whisper that is 6 times faster, 49% smaller

    14 projects | news.ycombinator.com | 31 Oct 2023
  • Whisper.api: An open source, self-hosted speech-to-text with fast transcription

    5 projects | news.ycombinator.com | 22 Aug 2023
  • [D] What is the most efficient version of OpenAI Whisper?

    7 projects | /r/MachineLearning | 12 Jul 2023
  • Show HN: Project S.A.T.U.R.D.A.Y – open-source, self hosted, J.A.R.V.I.S

    7 projects | news.ycombinator.com | 2 Jul 2023