willow VS willow-inference-server

Compare willow vs willow-inference-server and see what their differences are.

willow

Open source, local, and self-hosted Amazon Echo/Google Home competitive Voice Assistant alternative (by toverainc)

willow-inference-server

Open source, local, and self-hosted highly optimized language inference server supporting ASR/STT, TTS, and LLM across WebRTC, REST, and WS (by toverainc)
                willow               willow-inference-server
Mentions        37                   7
Stars           2,365                320
Growth          1.7%                 6.9%
Activity        9.6                  8.3
Latest commit   2 months ago         14 days ago
Language        C                    Python
License         Apache License 2.0   Apache License 2.0
The number of mentions indicates the total number of mentions that we've tracked plus the number of user-suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

willow

Posts with mentions or reviews of willow. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2024-04-23.
  • ESPHome
    11 projects | news.ycombinator.com | 23 Apr 2024
    Fair points, but with all due respect this completely misses the point and context. My comment was a reply to a new user interested in esphome on a post about esphome.

    You're talking about CircuitPython, 35KB web replies, PSRAM, UF2 bootloader, etc. These are comparatively very advanced topics and you didn't mention esphome once.

    The comfort and familiarity of Amazon for what is already a new, intimidating, and challenging subject is of immeasurable value for a novice. They can click those links, fill a cart, and have stuff show up tomorrow with all of the usual ease, friendliness, and reliability of Amazon. If they get frustrated or it doesn't work out they can shove it in the box and get a full refund Amazon-style.

    You're suggesting wandering all over the internet, ordering stuff from China, multiple vendors, etc while describing a bunch of things that frankly just won't matter to them. I say this as someone who has been an esphome and home assistant user since day one. The approach I described has never failed or remotely bothered me and over the past ~decade I've seen it suggested to new users successfully time and time again.

    In terms of PSRAM, to my knowledge the only thing it is utilized for in the esphome ecosystem is higher resolution displays and more advanced voice assistant scenarios that almost always require the -S3 anyway and are very advanced, challenging use cases. I'm very familiar with displays, voice, the S3, and PSRAM but more on that in a second...

    > live with one less LX7 core and no Bluetooth

    I'm the founder of Willow[0] and when comparing Willow to esphome the most frequent request we get is supporting bluetooth functionality i.e. esphome bluetooth proxy[1]. This is an extremely popular use case in the esphome/home assistant community. Not having bluetooth while losing a core and paying more is a bigger issue than pin spacing.

    It's also a pretty obscure board and while not a big deal to you and me, if you look around at docs, guides, etc you'll see the cheap-o boards from Amazon are by far the most popular and common (unsurprisingly). Another plus for a new user.

    Speaking of Willow (and back to PSRAM again) even the voice assistant satellite functionality of Home Assistant doesn't fundamentally require it - the most popular device doesn't have it either[2].

    Very valuable comment with a lot of interesting information, just doesn't apply to context.

    [0] - https://heywillow.io/

    [1] - https://esphome.io/components/bluetooth_proxy.html

    [2] - https://www.home-assistant.io/voice_control/thirteen-usd-voi...

  • Should I Open Source my Company?
    5 projects | news.ycombinator.com | 22 Jan 2024
    > - People might criticize my messy/bad/unfinished code

    As someone who has created and maintained open source projects (most recently Willow[0]) for two decades I get a kick out of this.

    Of course when interacting with users and feedback I keep it polite but in my head I'm thinking "You like to talk. I actually DID this. Shut up or submit a PR".

    Surprise surprise they almost never do.

    [0] - https://heywillow.io/

  • Jarvis: A Voice Virtual Assistant in Python (OpenAI, ElevenLabs, Deepgram)
    7 projects | news.ycombinator.com | 18 Dec 2023
    Also check out Willow - https://heywillow.io

    It doesn’t synthesize voice back (yet) but it's open source, runs fully offline on ESP32-based hardware, and works with Home Assistant!

  • Any “Google Home” type solutions that work offline?
    1 project | /r/smarthome | 5 Dec 2023
    Look into https://heywillow.io/ - still early in the project but they are getting good results.
  • Open Source Smart Device
    1 project | /r/TheAmpHour | 10 Nov 2023
  • Home Assistant 2023.11
    11 projects | news.ycombinator.com | 2 Nov 2023
    Very nice!

    Would you be interested in integrating with my project Willow[0]?

    Willow supports Home Assistant, OpenHAB, and generic REST+MQTT endpoints today. With Home Assistant and OpenHAB we benefit from their specific API support for providing speech to text output and processing through things like the HA Assist Pipelines[1].

    From our standpoint we handle wake word, VAD+AEC+BSS, STT, TTS, user feedback, etc. All we really do is send the speech transcript to the Willow command endpoint (like HA) and speak+display the execution result. Other than all of the wild speech stuff and our obsession with speed and accuracy Willow is really quite "dumb" - think of it as a voice terminal.

    OpenHAB has something similar but it's significantly more limited.

    [0] - https://heywillow.io

    [1] - https://developers.home-assistant.io/docs/voice/pipelines/
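
    The transcript-forwarding flow described above maps fairly directly onto Home Assistant's conversation/Assist REST API. Here is a minimal sketch of that hand-off in Python; the URL and token are placeholders and this is an illustration, not Willow's actual client code:

```python
import requests

# Placeholders - substitute your own Home Assistant URL and long-lived access token.
HA_URL = "http://homeassistant.local:8123"
HA_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"

def send_transcript(text: str) -> str:
    """Send a speech transcript to Home Assistant's conversation API and
    return the reply text a satellite device would speak/display."""
    resp = requests.post(
        f"{HA_URL}/api/conversation/process",
        headers={"Authorization": f"Bearer {HA_TOKEN}"},
        json={"text": text, "language": "en"},
        timeout=10,
    )
    resp.raise_for_status()
    # The Assist pipeline returns the spoken reply under response -> speech -> plain.
    return resp.json()["response"]["speech"]["plain"]["speech"]

if __name__ == "__main__":
    print(send_transcript("turn on the kitchen lights"))
```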

  • Distil-Whisper: distilled version of Whisper that is 6 times faster, 49% smaller
    14 projects | news.ycombinator.com | 31 Oct 2023
    I'm the founder of Willow[0] (we use ctranslate2 as well) and I will be looking at this as soon as these models are released tomorrow. HF claims they're drop-in compatible but we won't know for sure until someone looks at it.

    [0] - https://heywillow.io/

  • What's New in Python 3.12
    5 projects | news.ycombinator.com | 18 Oct 2023
    Shameless self-plug but with my project Willow[0] we have a management server implementation to deal with multiple devices, etc. We have a new feature called "Willow One Wake" that takes the incoming audio amplitude when wake word is detected and uses our Willow Application Server (Python) to only activate wake on the device closest to the person speaking. Old and tired compared to the commercial stuff but a first in the open source space.

    The asyncio improvements in Python 3.12 especially (plus perf generally) have been instrumental in enabling real world use of this. With Python 3.12 asyncio, uvloop, and FastAPI it works remarkably well[1]. As the demo video shows not only does it not delay responsiveness, it has granularity down to inches.

    [0] - https://heywillow.io/

    [1] - https://youtu.be/qlhSEeWJ4gs
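
    The "One Wake" mechanism described above is essentially a short aggregation window over wake events reported by each device. Below is a minimal asyncio/FastAPI sketch of the idea; the endpoint name, payload, and 200 ms window are hypothetical and this is not the actual Willow Application Server implementation:

```python
import asyncio
import time

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
WINDOW = 0.2  # hypothetical: how long to wait for other devices to report the same wake

class WakeEvent(BaseModel):
    device_id: str
    amplitude: float  # wake-word audio amplitude reported by the device

# One "round" of wake events: devices hearing the same utterance report within WINDOW.
_round: dict[str, float] = {}
_round_started = 0.0
_lock = asyncio.Lock()

@app.post("/wake")
async def wake(event: WakeEvent) -> dict:
    """Collect wake events briefly, then activate only the loudest (closest) device."""
    global _round_started
    async with _lock:
        now = time.monotonic()
        if not _round or now - _round_started > WINDOW:
            _round.clear()            # previous round is stale; start a new one
            _round_started = now
        _round[event.device_id] = max(_round.get(event.device_id, 0.0), event.amplitude)
        deadline = _round_started + WINDOW
    # Hold the request open until the window closes so a later-but-louder device can win.
    await asyncio.sleep(max(0.0, deadline - time.monotonic()))
    async with _lock:
        winner = max(_round, key=_round.get)
    return {"activate": event.device_id == winner}
```

    In this sketch a device posts its wake amplitude to /wake and only streams audio if "activate" comes back true; everything else stays identical to the single-device flow.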

  • Show HN: Willow: the fastest and most private open source voice assistant
    1 project | news.ycombinator.com | 12 Oct 2023
  • A Raspberry Pi 5 is better than two Pi 4S
    3 projects | news.ycombinator.com | 8 Oct 2023
    For most people with self-hosting tasks amd64 is back as the way to go.

    As you say, there are a ton of "minipcs" on the market that directly compete with the Raspberry Pi on cost and power usage. They're typically slightly larger but the expansion options (bring your own RAM/storage) plus real I/O (with real PCIe), disk, etc IMO significantly outweigh this. They're also typically more performant, and while aarch64 platform support is increasing dramatically there are still occasions where a project, Docker container, etc doesn't support it.

    Taking it a step further, there are a TON of decommissioned/recycled corporate/enterprise SFF desktops on the market. They don't compete in terms of size (13" x 15" or so) but they can actually get close in power usage. Many of them have multiple SATA ports, real NVMe, multiple real half-height PCIe slots, significantly better USB and PCIe bandwidth, etc.

    With my project Willow and Willow Inference Server[0] we're trying to drive this approach in the self-hosting community with an initial emphasis on Home Assistant. They're generally sick of Raspberry Pi supply shortages, very limited performance, poor I/O, flaky SD cards, etc. The Raspberry Pi is still pretty popular for "my first Home Assistant" but generally once people get bitten by the self-hosting bug they end up looking more like a homelab very quickly.

    For Willow particularly we emphasize use of GPUs because a voice assistant can't be waiting > 10 seconds to do speech recognition and speech synthesis. There are approaches out there trying to kind of get something working using Whisper tiny but in our ample internal testing and community feedback we feel that Whisper small is the bare minimum for voice assistant tasks, with many users going all out and using Whisper large-v2 at beam size 5. With GPU it's still so fast it doesn't really matter.

    The Raspberry Pi is especially poorly suited for this use case (and even amd64). We have some benchmarks here[1]. TLDR a ~seven year old Tesla P4 (single slot, slot power only, half-height, used for $70) does speech recognition 87x faster, with the multiple increasing for more complex models and longer speech segments. A 3.8 second voice command takes 586ms on the Tesla P4 and 51 seconds on the Raspberry Pi 4. Even with the Pi 5 being twice as fast that's still 25 seconds, which is completely unusable. Not fair to compare GPU to Raspberry Pi but consider the economics and practicality...

    You can get an SFF desktop and Tesla P4 from eBay for $200 shipped to your door. It will idle (with GPU and models loaded) at ~30 watts. The CPU, RAM, disk (NVMe), I/O, etc will walk all over a Raspberry Pi anything. Add the GPU and obviously it's not even close - you end up with a machine that can easily do 10x-100x what a Raspberry Pi can do for 2x the cost and power usage. You can even throw a 2.5Gb Ethernet card in another slot for $20.

    Even factoring in power usage (10-15W vs 30W, 2-3x) the cost difference comes down to nearly nothing and for many users this configuration is essentially future-proof for anything they may want to do for many years (my system with everything running maxes out around 50% of one core). Many have also gradually grown their self-hosted setup over the years, ending up with three or more Raspberry Pis for different tasks (PiHole, Home Assistant, Plex, etc). At this point the SFF configuration starts to pull far ahead in every way including power usage.

    Users were initially very skeptical of GPU use, likely from taking their experience in the desktop market and assuming things like "300 watt power usage with a huge > $500 card". Now they love having a GPU around for Willow and miscellaneous other CUDA tasks like encoding/decoding/transcoding with Plex/Jellyfin, accelerated Frigate, and all kinds of other applications. Willow Inference Server (depending on configuration) uses somewhere between 1-4GB of VRAM, so with an 8GB VRAM card that leaves plenty of room for additional tasks. We even have users who started with the Tesla P4 and then got the LLM bug and figured out how to get an RTX 3090 working with their setup.

    [0] - https://heywillow.io/

    [1] - https://heywillow.io/components/willow-inference-server/#ben...

willow-inference-server

Posts with mentions or reviews of willow-inference-server. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2024-01-27.
  • Brave Leo now uses Mixtral 8x7B as default
    7 projects | news.ycombinator.com | 27 Jan 2024
    I think this perspective comes from a lack of historical experience and hands-on experience overall.

    Nvidia more broadly has very impressive support for their GPUs, but if you look at the support lifecycles for their Jetson hardware over time it's significantly worse. I encourage you to look at what those support lifecycles have looked like, with the most "egregious" example being the dropping of support for the Jetson Nano within, from what I recall, a couple of years.

    Another consideration - Jetson is optimized for power efficiency/form-factor and on a per $ basis CUDA performance is terrible. The power efficiency and form-factor come at significant cost. See this discussion from one of my projects[0]. I evaluated the use of WIS on an Orin that I have and from what I can recall it was significantly slower than a GTX 1070 which is... Unimpressive.

    In the end what do I care what people use, I'm offering the perspective and experience of someone who has actually used the Jetson line for many years and frequently struggled with all of these issues and more.

    [0] - https://github.com/toverainc/willow-inference-server/discuss...

  • Whisper.api: An open source, self-hosted speech-to-text with fast transcription
    5 projects | news.ycombinator.com | 22 Aug 2023
    ctranslate2 is incredible, I don’t know why it doesn’t get more attention.

    We use it for our Willow Inference Server, which has an API that can be used directly like the OP's project and supports all Whisper models, TTS, etc:

    https://github.com/toverainc/willow-inference-server

    The benchmarks are pretty incredible (largely thanks to ctranslate2).
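
    For a sense of how little code a ctranslate2-backed Whisper setup takes, here is a minimal sketch using faster-whisper (a popular ctranslate2 wrapper) rather than WIS itself; the model size, device, and audio file name are placeholders:

```python
from faster_whisper import WhisperModel

# "small" is roughly the quality floor described elsewhere on this page for voice
# assistant use; swap device="cpu", compute_type="int8" if you have no CUDA GPU.
model = WhisperModel("small", device="cuda", compute_type="float16")

segments, info = model.transcribe("command.wav", beam_size=1, language="en")
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```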

  • Show HN: Project S.A.T.U.R.D.A.Y – open-source, self hosted, J.A.R.V.I.S
    7 projects | news.ycombinator.com | 2 Jul 2023
    Nice! I'm the creator of Willow[0] (which has been mentioned here).

    First of all, we love seeing efforts like this and we'd love to work together with other open source voice user interface projects! There's plenty of work to do in the space...

    I have roughly two decades of experience with voice and one thing to keep in mind is how latency sensitive voice tasks are. Generally speaking when it comes to conversational audio people have very high expectations regarding interactivity. For example, in the VoIP world we know that conversation between people starts getting annoying at around 300ms of latency. Higher latencies for voice assistant tasks are more-or-less "tolerated" but latency still needs to be extremely low. Alexa/Echo (with all of its problems) is at least a decent benchmark for what people expect for interactivity and all things considered it does pretty well.

    I know you're early (we are too!) but in your demo I counted roughly six seconds of latency between the initial hello and response (and nearly 10 for "tell me a joke"). In terms of conversational voice this feels like an eternity. Again, no shade at all (believe me I understand more than most) but just something I thought I'd add from my decades of experience with humans and voice. This is why we have such heavy emphasis on reducing latency as much as possible.

    For an idea of just how much we emphasize this you can try our WebRTC demo[1] which can do end-to-end (from click stop record in browser to ASR response) in a few hundred milliseconds (with Whisper large-v2 and beam size 5 - medium/1 is a fraction of that) including internet latency (it's hosted in Chicago, FYI).

    Running locally with WIS and Willow we see less than 500ms from end of speech (on device VAD) to command execution completion and TTS response with platforms like Home Assistant. Granted this is with GPU so you could call it cheating but a $100 six year old Nvidia Pascal series GPU runs circles around the fastest CPUs for these tasks (STT and TTS - see benchmarks here[2]). Again, kind of cheating but my RTX 3090 at home drops this down to around 200ms - roughly half of that time is Home Assistant. It's my (somewhat controversial) personal opinion that GPUs are more-or-less a requirement (today) for Alexa/Echo competitive responsiveness.

    Speaking of latency, I've been noticing a trend with Willow users regarding LLMs - they are very neat, cool, and interesting (our inference server[3] supports LLamA based LLMs) but they really aren't the right tool for these kinds of tasks. They have very high memory requirements (relatively speaking), require a lot of compute, and are very slow (again, relatively speaking). They also don't natively support the kinds of API call/response you need for most voice tasks. There are efforts out there to support this with LLMs but frankly I find the overall approach very strange. It seems that LLMs have sucked a lot of oxygen out of the room and people have forgotten (or never heard of) "good old fashioned" NLU/NLP approaches.

    Have you considered an NLU/NLP engine like Rasa[4]? This is the approach we will be taking to implement this kind of functionality in a flexible and assistant platform/integration agnostic way. By the time you stack up VAD, STT, understanding user intent (while allowing flexible grammar), calling an API, execution, and TTS response latency starts to add up very, very quickly.

    As one example, for "tell me a joke" Alexa does this in a few hundred milliseconds and I guarantee they're not using an LLM for this task - you can have a couple of hundred jokes to randomly select from with pre-generated TTS responses cached (as one path). Again, this is the approach we are taking to "catch up" with Alexa for all kinds of things from jokes to creating calendar entries, etc. Of course you can still have a catch-all to hand off to LLM for "conversation" but I'm not sure users actually want this for voice.

    I may be misunderstanding your goals but just a few things I thought I would mention.

    [0] - https://github.com/toverainc/willow

    [1] - https://wisng.tovera.io/rtc/

    [2] - https://github.com/toverainc/willow-inference-server/tree/wi...

    [3] - https://github.com/toverainc/willow-inference-server

    [4] - https://rasa.com/
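
    The "tell me a joke without an LLM" point above comes down to simple intent matching plus cached responses. A toy sketch of that path in Python follows; the intents, patterns, and responses are invented for illustration and this is neither Rasa nor Willow's implementation:

```python
import random
import re
from datetime import datetime

# Pre-written responses; in a real system the TTS audio for these could be rendered
# once and cached, so answering costs effectively zero inference time.
JOKES = [
    "Why did the speaker go to therapy? It had too many issues to unbox.",
    "I told my smart home a joke. It asked me to rephrase the intent.",
]

# A tiny hand-written "grammar": intent name -> trigger patterns.
INTENTS = {
    "tell_joke": [r"\btell (me )?a joke\b", r"\bmake me laugh\b"],
    "get_time": [r"\bwhat time is it\b", r"\bcurrent time\b"],
}

def match_intent(transcript: str) -> str | None:
    """Return the first intent whose pattern matches the transcript, else None."""
    text = transcript.lower()
    for intent, patterns in INTENTS.items():
        if any(re.search(p, text) for p in patterns):
            return intent
    return None

def handle(transcript: str) -> str:
    intent = match_intent(transcript)
    if intent == "tell_joke":
        return random.choice(JOKES)  # cached response, no model inference needed
    if intent == "get_time":
        return datetime.now().strftime("It's %H:%M.")
    return "Sorry, I didn't catch that."  # a catch-all LLM hand-off could go here

print(handle("Hey, tell me a joke"))
```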

  • VLLM: 24x faster LLM serving than HuggingFace Transformers
    3 projects | news.ycombinator.com | 20 Jun 2023
    We run into this constantly with Willow[0] and the Willow Inference Server[1]. There seems to be a large gap in understanding with many users. They seem to find it difficult to understand a fundamental reality: GPUs are so physically different and better suited to many/most ML tasks that all the CPU tricks in the world cannot bring CPUs even close to the performance of GPUs (while maintaining quality/functionality). I find this interesting because everyone seems to take it as obvious that integrated graphics vs discrete graphics for gaming aren't even close. Ditto for these tasks.

    With Willow Inference Server I'm constantly telling people: a six year old $100 Tesla P4/GTX 1070 walks all over even the best CPUs in the world for our primary task of speech to text/ASR - at dramatically lower cost and power usage. Seriously - a GTX 1070 is at least 5x faster than a Threadripper 5955WX. Our goal is to provide an open-source commercial voice assistant equivalent user experience and that is and will be fundamentally impossible for the foreseeable future on CPU.

    Slight tangent but there are users in the space who seem to be under the impression that they can use their Raspberry Pi for voice assistant/speech recognition. It's not even close to a fair fight but with the same implementation and settings a GTX 1070 is roughly 90x (nearly two orders of magnitude) faster[2] than a Raspberry Pi... Yes, all-in a machine with a GTX 1070 uses an order of magnitude more power (3W vs 30W) than a Raspberry Pi, but the power cost even in countries with the most expensive power in the world results in a $2-$3/mo cost difference - which I feel, at least, is a reasonable trade-off considering the dramatic difference in usability (the Raspberry Pi is essentially useless - waiting 10-30 seconds for a response makes pulling your phone out faster).

    [0] - https://github.com/toverainc/willow

    [1] - https://github.com/toverainc/willow-inference-server

    [2] - https://github.com/toverainc/willow-inference-server/tree/wi...

  • GGML – AI at the Edge
    11 projects | news.ycombinator.com | 6 Jun 2023
    Shameless plug, I'm the founder of Willow[0].

    In short you can:

    1) Run a local Willow Inference Server[1]. Supports CPU or CUDA, just about the fastest implementation of Whisper out there for "real time" speech.

    2) Run local command detection on device. We pull your Home Assistant entities on setup and define basic grammar for them, but any English commands (up to 400) that can be processed by Home Assistant are recognized directly on the $50 ESP BOX device and sent to Home Assistant (or openHAB, or a REST endpoint, etc) for processing.

    Whether WIS or local our performance target is 500ms from end of speech to command executed.

    [0] - https://github.com/toverainc/willow

    [1] - https://github.com/toverainc/willow-inference-server
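
    The on-device grammar described in point 2 above is generated from your Home Assistant entities at setup time. A rough Python sketch of that generation step follows; the URL, token, and domain filter are placeholders, and the real Willow build targets the ESP BOX's on-device recognizer rather than printing strings:

```python
import requests

HA_URL = "http://homeassistant.local:8123"    # placeholder
HA_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"     # placeholder
MAX_COMMANDS = 400                            # on-device recognizer limit mentioned above

def build_commands() -> list[str]:
    """Fetch Home Assistant entities and emit simple on/off commands for them."""
    states = requests.get(
        f"{HA_URL}/api/states",
        headers={"Authorization": f"Bearer {HA_TOKEN}"},
        timeout=10,
    ).json()
    commands: list[str] = []
    for state in states:
        domain = state["entity_id"].split(".")[0]
        name = state["attributes"].get("friendly_name")
        if not name or domain not in ("light", "switch", "fan"):
            continue
        commands += [f"turn on {name.lower()}", f"turn off {name.lower()}"]
    return commands[:MAX_COMMANDS]

if __name__ == "__main__":
    for cmd in build_commands():
        print(cmd)
```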

  • Show HN: Willow Inference Server: Optimized ASR/TTS/LLM for Willow/WebRTC/REST
    3 projects | news.ycombinator.com | 23 May 2023
    Thanks!

    Yes, "realtime multiple" is audio/speech length divided by actual inference time.

    You got it! The demo video is showing the slowest response times because it is using the highest quality/accuracy settings available with Whisper (large-v2, beam 5). For comparison, Willow devices use medium with beam size 1 by default, and those responses are measured in the 500 milliseconds or less range (again depending on speech length) across a wide variety of new and old CUDA hardware. Some sample benchmarks here[0].

    Applications using WIS (including Willow) can provide model settings on a per-request basis to balance quality vs latency depending on the task.

    [0] - https://github.com/toverainc/willow-inference-server#benchma...
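
    To make the "realtime multiple" definition above concrete, here is a quick worked example in Python using numbers quoted earlier on this page (a 3.8 second command transcribed in 586 ms on a Tesla P4):

```python
def realtime_multiple(audio_seconds: float, inference_seconds: float) -> float:
    """Speech length divided by the time it took to transcribe it."""
    return audio_seconds / inference_seconds

# 3.8 s command, 0.586 s inference (the Tesla P4 figure quoted above) -> ~6.5x realtime
print(f"{realtime_multiple(3.8, 0.586):.1f}x realtime")
```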

What are some alternatives?

When comparing willow and willow-inference-server you can also consider the following projects:

piper - A fast, local neural text to speech system

whisper-realtime - Whisper runs in realtime on a laptop GPU (8GB)

esp-box - The ESP-BOX is a new generation AIoT development platform released by Espressif Systems.

whisper.api - This project provides an API with user level access support to transcribe speech to text using a finetuned and processed Whisper ASR model.

mycroft-core - Mycroft Core, the Mycroft Artificial Intelligence platform.

wscribe-editor - web based editor for subtitles and transcripts

rhasspy3 - An open source voice assistant toolkit for many human languages

MeZO - [NeurIPS 2023] MeZO: Fine-Tuning Language Models with Just Forward Passes. https://arxiv.org/abs/2305.17333

vllm - A high-throughput and memory-efficient inference and serving engine for LLMs

ggml - Tensor library for machine learning

CocoaLumberjack - A fast & simple, yet powerful & flexible logging framework for macOS, iOS, tvOS and watchOS

TTS - 🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production