Show HN: Willow Inference Server: Optimized ASR/TTS/LLM for Willow/WebRTC/REST

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • willow-inference-server

    Open source, local, and self-hosted highly optimized language inference server supporting ASR/STT, TTS, and LLM across WebRTC, REST, and WS

  • Thanks!

    Yes, "realtime multiple" is audio/speech length divided by actual inference time.

    You got it! The demo video shows the slowest response times because it uses the highest quality/accuracy settings available with Whisper (large-v2, beam size 5). For comparison, Willow devices use medium with beam size 1 by default, and those responses come in at 500 milliseconds or less (again depending on speech length) across a wide variety of new and old CUDA hardware. Some sample benchmarks here[0].

    Applications using WIS (including Willow) can provide model settings on a per-request basis to balance quality vs latency depending on the task.

    [0] - https://github.com/toverainc/willow-inference-server#benchma...
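    The "realtime multiple" arithmetic above can be sketched as follows (the helper name is illustrative, not part of WIS):

    ```python
    def realtime_multiple(audio_seconds: float, inference_seconds: float) -> float:
        """Return how many times faster than real time the inference ran:
        audio/speech length divided by actual inference time."""
        if inference_seconds <= 0:
            raise ValueError("inference time must be positive")
        return audio_seconds / inference_seconds

    # A 4-second utterance transcribed in 0.25 s runs at 16x realtime.
    print(realtime_multiple(4.0, 0.25))  # -> 16.0
    ```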

  • willow

    Open source, local, and self-hosted Amazon Echo/Google Home competitive Voice Assistant alternative

  • Hey HN!

    Willow Inference Server (WIS) is a focused and highly optimized language inference server implementation. Our goal is to "automagically" enable performant, cost-effective self-hosting of released state-of-the-art/best-of-breed models for speech and language tasks:

    Primarily targeting CUDA (works on CPU too) with support for low-end (cheap) devices such as the Tesla P4, GTX 1060, and up. Don't worry - it screams on an RTX 4090 too! (See benchmarks on GitHub.)

    Memory optimized - all three default Whisper models (base, medium, large-v2) loaded simultaneously with TTS support inside 6 GB of VRAM. LLM support defaults to int4 quantization (conversion scripts included). ASR/STT + TTS + Vicuna 13B require roughly 18 GB VRAM. Less for 7B, of course!
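    Using only the figures quoted above (roughly 6 GB for the Whisper + TTS stack, roughly 18 GB with Vicuna 13B added, so about 12 GB for the quantized 13B model itself), a rough capacity check might look like this. The numbers and the `fits` function are illustrative, not a WIS API:

    ```python
    # Approximate VRAM budgets (GB) from the figures quoted above; the Vicuna
    # figure is inferred as 18 GB total minus the ~6 GB speech stack.
    VRAM_GB = {
        "whisper_all_plus_tts": 6.0,   # base + medium + large-v2 + TTS
        "vicuna_13b_int4": 12.0,       # inferred from the 18 GB total
    }

    def fits(gpu_vram_gb: float, components: list[str]) -> bool:
        """Check whether the selected components fit in the given GPU's VRAM."""
        return sum(VRAM_GB[c] for c in components) <= gpu_vram_gb

    # A 24 GB RTX 4090 fits the full stack; an 8 GB card fits speech only.
    print(fits(24.0, ["whisper_all_plus_tts", "vicuna_13b_int4"]))  # True
    print(fits(8.0, ["whisper_all_plus_tts", "vicuna_13b_int4"]))   # False
    ```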

    ASR. Heavy emphasis - Whisper optimized for very high quality, as-close-to-real-time-as-possible speech recognition via a variety of means (Willow, WebRTC, POSTing a file, integration with devices and client applications, etc.). Results in hundreds of milliseconds or less for most intended speech tasks. See the YouTube WebRTC demo[1].
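    Since applications can pass model settings per request, a REST call might be assembled along these lines. The `/api/willow` path and the `model`/`beam_size` parameter names are assumptions for illustration; check the WIS API documentation for the real ones:

    ```python
    from urllib.parse import urlencode

    def build_asr_request(base_url: str, model: str = "medium",
                          beam_size: int = 1) -> str:
        """Build the URL for a hypothetical WIS ASR endpoint, choosing the
        quality/latency trade-off per request via model and beam size."""
        query = urlencode({"model": model, "beam_size": beam_size})
        return f"{base_url}/api/willow?{query}"

    # Highest quality (slowest), as in the demo video: large-v2 with beam 5.
    url = build_asr_request("https://wis.example.com", model="large-v2",
                            beam_size=5)
    print(url)
    ```

    The audio itself would then be POSTed to that URL.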

    TTS. Primarily provided for assistant tasks (like Willow!) and visually impaired users.

    LLM. Optionally pass input through a provided/configured LLM for question answering, chatbot, and assistant tasks. Currently supports LLaMA derivatives with a strong preference for Vicuna (I like 13B). Built-in support for quantization to int4 to conserve GPU memory.
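    To see why int4 quantization matters, compare raw weight storage for a 13B-parameter model. This is back-of-the-envelope arithmetic that ignores activations, KV cache, and quantization overhead:

    ```python
    PARAMS_13B = 13_000_000_000

    def weight_gb(params: int, bits_per_weight: int) -> float:
        """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
        return params * bits_per_weight / 8 / 1e9

    fp16 = weight_gb(PARAMS_13B, 16)  # ~26 GB: too big for most consumer GPUs
    int4 = weight_gb(PARAMS_13B, 4)   # ~6.5 GB: leaves room for ASR/TTS
    print(fp16, int4)  # -> 26.0 6.5
    ```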

    Support for a variety of transports. REST, WebRTC, and WebSockets (primarily for LLM).

    Performance and memory optimized. Leverages CTranslate2 for Whisper support and AutoGPTQ for LLMs.

    Willow support. WIS powers the Tovera-hosted best-effort example server that Willow users enjoy.

    Support for WebRTC - stream audio in real-time from browsers or WebRTC applications to optimize quality and response time. Heavily optimized for long-running sessions using WebRTC audio track management. Leave your session open for days at a time and have self-hosted ASR transcription within hundreds of milliseconds while conserving network bandwidth and CPU!

    Support for custom TTS voices. From relatively short audio recordings, WIS can create and manage custom TTS voices. See the API documentation for more information.

    Much like the release of Willow[0] last week, this is an early release, but we had a great response from HN and are looking forward to hearing what everyone thinks!

    [0] - https://github.com/toverainc/willow

    [1] - https://www.youtube.com/watch?v=PxCO5eONqSQ



Related posts

  • VLLM: 24x faster LLM serving than HuggingFace Transformers

    3 projects | news.ycombinator.com | 20 Jun 2023
  • Distil-Whisper: distilled version of Whisper that is 6 times faster, 49% smaller

    14 projects | news.ycombinator.com | 31 Oct 2023
  • Whisper.api: An open source, self-hosted speech-to-text with fast transcription

    5 projects | news.ycombinator.com | 22 Aug 2023
  • [D] What is the most efficient version of OpenAI Whisper?

    7 projects | /r/MachineLearning | 12 Jul 2023
  • Show HN: Project S.A.T.U.R.D.A.Y – open-source, self hosted, J.A.R.V.I.S

    7 projects | news.ycombinator.com | 2 Jul 2023