willow-inference-server

Open source, local, and self-hosted highly optimized language inference server supporting ASR/STT, TTS, and LLM across WebRTC, REST, and WS (by toverainc)


willow-inference-server reviews and mentions

Posts with mentions or reviews of willow-inference-server. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2024-01-27.
  • Brave Leo now uses Mixtral 8x7B as default
    7 projects | news.ycombinator.com | 27 Jan 2024
    I think this perspective comes from a lack of historical experience and hands-on experience overall.

    Nvidia more broadly has very impressive support for their GPUs, but if you look at the support lifecycles for their Jetson hardware over time, it's significantly worse. I encourage you to look at what those support lifecycles have looked like, with the most "egregious" example being the dropping of support for the Jetson Nano within, from what I recall, a couple of years.

    Another consideration - Jetson is optimized for power efficiency/form factor, and on a per-dollar basis its CUDA performance is terrible. The power efficiency and form factor come at significant cost. See this discussion from one of my projects[0]. I evaluated the use of WIS on an Orin that I have and, from what I can recall, it was significantly slower than a GTX 1070, which is... unimpressive.

    In the end, what do I care what people use? I'm offering the perspective and experience of someone who has actually used the Jetson line for many years and frequently struggled with all of these issues and more.

    [0] - https://github.com/toverainc/willow-inference-server/discuss...

  • Whisper.api: An open source, self-hosted speech-to-text with fast transcription
    5 projects | news.ycombinator.com | 22 Aug 2023
    ctranslate2 is incredible; I don't know why it doesn't get more attention.

    We use it for our Willow Inference Server, which has an API that can be used directly like the OP's project and supports all Whisper models, TTS, etc.:

    https://github.com/toverainc/willow-inference-server

    The benchmarks are pretty incredible (largely thanks to ctranslate2).
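
    For anyone curious what ctranslate2-backed Whisper inference looks like in practice, here's a minimal sketch using the faster-whisper wrapper (which is built on ctranslate2). This is not WIS code - WIS wires ctranslate2 into its own REST/WebRTC API - just an illustration of the underlying inference path, with the model/beam settings discussed in these threads:

    ```python
    # Illustrative sketch only, not Willow Inference Server code.
    # faster-whisper runs Whisper models on the ctranslate2 backend.
    from faster_whisper import WhisperModel

    # "large-v2" with beam_size=5 is the highest-quality setting mentioned here;
    # "medium" with beam_size=1 is the lower-latency default on Willow devices.
    model = WhisperModel("large-v2", device="cuda", compute_type="float16")

    segments, info = model.transcribe("speech.wav", beam_size=5)
    print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
    for segment in segments:
        print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
    ```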

  • Show HN: Project S.A.T.U.R.D.A.Y – open-source, self hosted, J.A.R.V.I.S
    7 projects | news.ycombinator.com | 2 Jul 2023
    Nice! I'm the creator of Willow[0] (which has been mentioned here).

    First of all, we love seeing efforts like this and we'd love to work together with other open source voice user interface projects! There's plenty of work to do in the space...

    I have roughly two decades of experience with voice, and one thing to keep in mind is how latency-sensitive voice tasks are. Generally speaking, when it comes to conversational audio, people have very high expectations regarding interactivity. For example, in the VoIP world we know that conversation between people starts getting annoying at around 300ms of latency. Higher latencies for voice assistant tasks are more-or-less "tolerated" but latency still needs to be extremely low. Alexa/Echo (with all of its problems) is at least a decent benchmark for what people expect for interactivity, and all things considered it does pretty well.

    I know you're early (we are too!) but in your demo I counted roughly six seconds of latency between the initial hello and response (and nearly 10 for "tell me a joke"). In terms of conversational voice this feels like an eternity. Again, no shade at all (believe me I understand more than most) but just something I thought I'd add from my decades of experience with humans and voice. This is why we have such heavy emphasis on reducing latency as much as possible.

    For an idea of just how much we emphasize this, you can try our WebRTC demo[1], which can do end-to-end (from clicking stop record in the browser to ASR response) in a few hundred milliseconds (with Whisper large-v2 and beam size 5 - medium/1 is a fraction of that), including internet latency (it's hosted in Chicago, FYI).

    Running locally with WIS and Willow we see less than 500ms from end of speech (on-device VAD) to command execution completion and TTS response with platforms like Home Assistant. Granted, this is with a GPU, so you could call it cheating, but a $100 six-year-old Nvidia Pascal-series GPU runs circles around the fastest CPUs for these tasks (STT and TTS - see benchmarks here[2]). Again, kind of cheating, but my RTX 3090 at home drops this down to around 200ms - roughly half of that time is Home Assistant. It's my (somewhat controversial) personal opinion that GPUs are more-or-less a requirement (today) for Alexa/Echo-competitive responsiveness.

    Speaking of latency, I've been noticing a trend with Willow users regarding LLMs - they are very neat, cool, and interesting (our inference server[3] supports LLaMA-based LLMs), but they really aren't the right tool for these kinds of tasks. They have very high memory requirements (relatively speaking), require a lot of compute, and are very slow (again, relatively speaking). They also don't natively support the kind of API call/response you need for most voice tasks. There are efforts out there to support this with LLMs, but frankly I find the overall approach very strange. It seems that LLMs have sucked a lot of oxygen out of the room and people have forgotten (or never heard of) "good old fashioned" NLU/NLP approaches.

    Have you considered an NLU/NLP engine like Rasa[4]? This is the approach we will be taking to implement this kind of functionality in a flexible, assistant-platform/integration-agnostic way. By the time you stack up VAD, STT, understanding user intent (while allowing flexible grammar), calling an API, execution, and the TTS response, latency starts to add up very, very quickly.

    As one example, for "tell me a joke" Alexa does this in a few hundred milliseconds and I guarantee they're not using an LLM for this task - you can have a couple of hundred jokes to randomly select from with pre-generated TTS responses cached (as one path). Again, this is the approach we are taking to "catch up" with Alexa for all kinds of things from jokes to creating calendar entries, etc. Of course you can still have a catch-all to hand off to LLM for "conversation" but I'm not sure users actually want this for voice.
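
    As a rough illustration of that "canned response" path (hypothetical names and file layout, not Willow's actual implementation):

    ```python
    # Toy sketch of the pre-generated TTS cache idea described above: match a
    # simple intent, pick a pre-written joke, and return audio that was rendered
    # ahead of time so nothing heavy runs at request time.
    import random
    from pathlib import Path
    from typing import Optional

    JOKE_CACHE = Path("tts_cache/jokes")  # assumed layout: jokes/0.wav, 1.wav, ...

    def handle_utterance(text: str) -> Optional[bytes]:
        """Return cached TTS audio for recognized intents, or None to fall
        through (e.g. to an NLU engine or an LLM) for anything not handled here."""
        normalized = text.lower().strip()
        if "joke" in normalized:  # toy intent match; a real system would use NLU (e.g. Rasa)
            clips = sorted(JOKE_CACHE.glob("*.wav"))
            if clips:
                return random.choice(clips).read_bytes()
        return None
    ```

    The cached path returns in whatever time a dictionary lookup and a file read take, which is how the sub-second responses for this kind of request are achievable without any model in the loop.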

    I may be misunderstanding your goals but just a few things I thought I would mention.

    [0] - https://github.com/toverainc/willow

    [1] - https://wisng.tovera.io/rtc/

    [2] - https://github.com/toverainc/willow-inference-server/tree/wi...

    [3] - https://github.com/toverainc/willow-inference-server

    [4] - https://rasa.com/

  • VLLM: 24x faster LLM serving than HuggingFace Transformers
    3 projects | news.ycombinator.com | 20 Jun 2023
    We run into this constantly with Willow[0] and the Willow Inference Server[1]. There seems to be a large gap in understanding with many users. They seem to find it difficult to understand a fundamental reality: GPUs are so physically different and better suited to many/most ML tasks that all the CPU tricks in the world cannot bring CPUs even close to the performance of GPUs (while maintaining quality/functionality). I find this interesting because everyone seems to take it as obvious that integrated graphics vs discrete graphics for gaming aren't even close. Ditto for these tasks.

    With Willow Inference Server I'm constantly telling people: a six-year-old $100 Tesla P4/GTX 1070 walks all over even the best CPUs in the world for our primary task of speech to text/ASR - at dramatically lower cost and power usage. Seriously - a GTX 1070 is at least 5x faster than a Threadripper 5955WX. Our goal is to provide an open-source user experience equivalent to commercial voice assistants, and that is and will be fundamentally impossible for the foreseeable future on CPU.

    Slight tangent, but there are users in the space who seem to be under the impression that they can use their Raspberry Pi for voice assistant/speech recognition. It's not even close to a fair fight, but with the same implementation and settings a GTX 1070 is roughly 90x (nearly two orders of magnitude) faster[2] than a Raspberry Pi... Yes, all-in, a machine with a GTX 1070 uses an order of magnitude more power (roughly 3W vs 30W) than a Raspberry Pi, but even in countries with the most expensive power in the world that works out to a $2-$3/mo cost difference - which I feel, at least, is a reasonable trade-off considering the dramatic difference in usability (a Raspberry Pi is essentially useless - waiting 10-30 seconds for a response makes pulling your phone out faster).
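
    For what it's worth, here is the back-of-the-envelope math behind that cost estimate; the wattages and electricity rate are assumptions for illustration, and the result scales linearly with your local $/kWh:

    ```python
    # Back-of-the-envelope power-cost comparison (assumed figures, not measurements).
    pi_watts = 3          # rough Raspberry Pi draw
    gpu_box_watts = 30    # rough draw of a machine with a GTX 1070
    rate_per_kwh = 0.15   # assumed electricity rate in $/kWh

    hours_per_month = 24 * 30
    extra_kwh = (gpu_box_watts - pi_watts) * hours_per_month / 1000
    print(f"Extra energy: {extra_kwh:.1f} kWh/month")        # ~19.4 kWh/month
    print(f"Extra cost:   ${extra_kwh * rate_per_kwh:.2f}/month")  # ~$2.92/month at $0.15/kWh
    ```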

    [0] - https://github.com/toverainc/willow

    [1] - https://github.com/toverainc/willow-inference-server

    [2] - https://github.com/toverainc/willow-inference-server/tree/wi...

  • GGML – AI at the Edge
    11 projects | news.ycombinator.com | 6 Jun 2023
    Shameless plug, I'm the founder of Willow[0].

    In short you can:

    1) Run a local Willow Inference Server[1]. It supports CPU or CUDA and is just about the fastest implementation of Whisper out there for "real time" speech.

    2) Run local command detection on device. We pull your Home Assistant entities on setup and define basic grammar for them; any English commands (up to 400) that Home Assistant can process are recognized directly on the $50 ESP BOX device and sent to Home Assistant (or openHAB, or a REST endpoint, etc.) for processing (see the sketch after this list for the general idea).

    Whether using WIS or local command detection, our performance target is 500ms from end of speech to command executed.
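
    As a rough sketch of the idea in 2) - pulling Home Assistant entities and deriving simple English command phrases from them - something like the following works against Home Assistant's documented REST API. This is not Willow's actual grammar format or code, just an illustration of the approach:

    ```python
    # Pull entities from Home Assistant and build toy "turn on/off <name>" commands.
    # /api/states with a long-lived access token is Home Assistant's documented API;
    # everything else here (URL, token, command phrasing) is a placeholder.
    import requests

    HA_URL = "http://homeassistant.local:8123"   # assumed Home Assistant address
    TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"       # created in the HA user profile

    def build_commands(max_commands: int = 400) -> list:
        resp = requests.get(
            f"{HA_URL}/api/states",
            headers={"Authorization": f"Bearer {TOKEN}"},
            timeout=10,
        )
        resp.raise_for_status()
        commands = []
        for state in resp.json():
            name = state.get("attributes", {}).get("friendly_name")
            domain = state["entity_id"].split(".", 1)[0]
            if name and domain in ("light", "switch"):
                commands.append(f"turn on {name.lower()}")
                commands.append(f"turn off {name.lower()}")
        return commands[:max_commands]  # on-device recognition is capped at ~400 commands

    if __name__ == "__main__":
        for cmd in build_commands()[:10]:
            print(cmd)
    ```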

    [0] - https://github.com/toverainc/willow

    [1] - https://github.com/toverainc/willow-inference-server

  • Show HN: Willow Inference Server: Optimized ASR/TTS/LLM for Willow/WebRTC/REST
    3 projects | news.ycombinator.com | 23 May 2023
    Thanks!

    Yes, "realtime multiple" is audio/speech length divided by actual inference time.

    You got it! The demo video is showing the slowest response times because it is using the highest quality/accuracy settings available with Whisper (large-v2, beam 5). Willow devices use medium/beam 1 by default for comparison, and those responses are measured in the 500 milliseconds or less range (again, depending on speech length) across a wide variety of new and old CUDA hardware. Some sample benchmarks here[0].

    Applications using WIS (including Willow) can provide model settings on a per-request basis to balance quality vs latency depending on the task.

    [0] - https://github.com/toverainc/willow-inference-server#benchma...

Stats

Basic willow-inference-server repo stats
Mentions: 7
Stars: 320
Activity: 8.3
Last commit: 16 days ago
