willow-inference-server

Open source, local, and self-hosted highly optimized language inference server supporting ASR/STT, TTS, and LLM across WebRTC, REST, and WS (by toverainc)


willow-inference-server reviews and mentions

Posts with mentions or reviews of willow-inference-server. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2024-01-27.
  • Brave Leo now uses Mixtral 8x7B as default
    7 projects | news.ycombinator.com | 27 Jan 2024
    I think this perspective comes from a lack of historical experience and hands-on experience overall.

    Nvidia more broadly has very impressive support for their GPUs, but if you look at the support lifecycles for their Jetson hardware over time, it's significantly worse. I encourage you to look at what those support lifecycles have looked like, with the most "egregious" example being the dropping of support for the Jetson Nano within, from what I recall, a couple of years.

    Another consideration - Jetson is optimized for power efficiency/form factor, and on a per-dollar basis its CUDA performance is terrible. The power efficiency and form factor come at significant cost. See this discussion from one of my projects[0]. I evaluated the use of WIS on an Orin that I have and, from what I can recall, it was significantly slower than a GTX 1070, which is... unimpressive.

    In the end, what do I care what people use? I'm offering the perspective and experience of someone who has actually used the Jetson line for many years and frequently struggled with all of these issues and more.

    [0] - https://github.com/toverainc/willow-inference-server/discuss...

  • Whisper.api: An open source, self-hosted speech-to-text with fast transcription
    5 projects | news.ycombinator.com | 22 Aug 2023
    ctranslate2 is incredible; I don't know why it doesn't get more attention.

    We use it for our Willow Inference Server, which has an API that can be used directly like the OP's project and supports all Whisper models, TTS, etc.:

    https://github.com/toverainc/willow-inference-server

    The benchmarks are pretty incredible (largely thanks to ctranslate2).
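
    For anyone curious what ctranslate2-backed Whisper inference looks like in practice, here's a minimal sketch using the faster-whisper wrapper (which is built on ctranslate2). This is not WIS code - WIS wires ctranslate2 into its own REST/WebRTC API - just an illustration of the underlying inference path, with the model/beam settings discussed in these threads:

    ```python
    # Illustrative sketch only, not Willow Inference Server code.
    # faster-whisper runs Whisper models on the ctranslate2 backend.
    from faster_whisper import WhisperModel

    # "large-v2" with beam_size=5 is the highest-quality setting mentioned here;
    # "medium" with beam_size=1 is the lower-latency default on Willow devices.
    model = WhisperModel("large-v2", device="cuda", compute_type="float16")

    segments, info = model.transcribe("speech.wav", beam_size=5)
    print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
    for segment in segments:
        print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
    ```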

  • Show HN: Project S.A.T.U.R.D.A.Y – open-source, self hosted, J.A.R.V.I.S
    7 projects | news.ycombinator.com | 2 Jul 2023
    Nice! I'm the creator of Willow[0] (which has been mentioned here).

    First of all, we love seeing efforts like this and we'd love to work together with other open source voice user interface projects! There's plenty of work to do in the space...

    I have roughly two decades of experience with voice, and one thing to keep in mind is how latency-sensitive voice tasks are. Generally speaking, when it comes to conversational audio, people have very high expectations regarding interactivity. For example, in the VoIP world we know that conversation between people starts getting annoying at around 300ms of latency. Higher latencies for voice assistant tasks are more-or-less "tolerated" but latency still needs to be extremely low. Alexa/Echo (with all of its problems) is at least a decent benchmark for what people expect for interactivity, and all things considered it does pretty well.

    I know you're early (we are too!) but in your demo I counted roughly six seconds of latency between the initial hello and response (and nearly 10 for "tell me a joke"). In terms of conversational voice this feels like an eternity. Again, no shade at all (believe me I understand more than most) but just something I thought I'd add from my decades of experience with humans and voice. This is why we have such heavy emphasis on reducing latency as much as possible.

    For an idea of just how much we emphasize this, you can try our WebRTC demo[1], which can do end-to-end (from clicking stop record in the browser to ASR response) in a few hundred milliseconds (with Whisper large-v2 and beam size 5 - medium/1 is a fraction of that), including internet latency (it's hosted in Chicago, FYI).

    Running locally with WIS and Willow we see less than 500ms from end of speech (on-device VAD) to command execution completion and TTS response with platforms like Home Assistant. Granted, this is with a GPU, so you could call it cheating, but a $100 six-year-old Nvidia Pascal-series GPU runs circles around the fastest CPUs for these tasks (STT and TTS - see benchmarks here[2]). Again, kind of cheating, but my RTX 3090 at home drops this down to around 200ms - roughly half of that time is Home Assistant. It's my (somewhat controversial) personal opinion that GPUs are more-or-less a requirement (today) for Alexa/Echo-competitive responsiveness.

    Speaking of latency, I've been noticing a trend with Willow users regarding LLMs - they are very neat, cool, and interesting (our inference server[3] supports LLaMA-based LLMs), but they really aren't the right tool for these kinds of tasks. They have very high memory requirements (relatively speaking), require a lot of compute, and are very slow (again, relatively speaking). They also don't natively support the kind of API call/response you need for most voice tasks. There are efforts out there to support this with LLMs, but frankly I find the overall approach very strange. It seems that LLMs have sucked a lot of oxygen out of the room and people have forgotten (or never heard of) "good old fashioned" NLU/NLP approaches.

    Have you considered an NLU/NLP engine like Rasa[4]? This is the approach we will be taking to implement this kind of functionality in a flexible, assistant-platform/integration-agnostic way. By the time you stack up VAD, STT, understanding user intent (while allowing flexible grammar), calling an API, execution, and the TTS response, latency starts to add up very, very quickly.

    As one example, for "tell me a joke" Alexa does this in a few hundred milliseconds and I guarantee they're not using an LLM for this task - you can have a couple of hundred jokes to randomly select from with pre-generated TTS responses cached (as one path). Again, this is the approach we are taking to "catch up" with Alexa for all kinds of things from jokes to creating calendar entries, etc. Of course you can still have a catch-all to hand off to LLM for "conversation" but I'm not sure users actually want this for voice.
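
    As a rough illustration of that "canned response" path (hypothetical names and file layout, not Willow's actual implementation):

    ```python
    # Toy sketch of the pre-generated TTS cache idea described above: match a
    # simple intent, pick a pre-written joke, and return audio that was rendered
    # ahead of time so nothing heavy runs at request time.
    import random
    from pathlib import Path
    from typing import Optional

    JOKE_CACHE = Path("tts_cache/jokes")  # assumed layout: jokes/0.wav, 1.wav, ...

    def handle_utterance(text: str) -> Optional[bytes]:
        """Return cached TTS audio for recognized intents, or None to fall
        through (e.g. to an NLU engine or an LLM) for anything not handled here."""
        normalized = text.lower().strip()
        if "joke" in normalized:  # toy intent match; a real system would use NLU (e.g. Rasa)
            clips = sorted(JOKE_CACHE.glob("*.wav"))
            if clips:
                return random.choice(clips).read_bytes()
        return None
    ```

    The cached path returns in whatever time a dictionary lookup and a file read take, which is how the sub-second responses for this kind of request are achievable without any model in the loop.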

    I may be misunderstanding your goals but just a few things I thought I would mention.

    [0] - https://github.com/toverainc/willow

    [1] - https://wisng.tovera.io/rtc/

    [2] - https://github.com/toverainc/willow-inference-server/tree/wi...

    [3] - https://github.com/toverainc/willow-inference-server

    [4] - https://rasa.com/

  • VLLM: 24x faster LLM serving than HuggingFace Transformers
    3 projects | news.ycombinator.com | 20 Jun 2023
    We run into this constantly with Willow[0] and the Willow Inference Server[1]. There seems to be a large gap in understanding with many users. They seem to find it difficult to understand a fundamental reality: GPUs are so physically different and better suited to many/most ML tasks that all the CPU tricks in the world cannot bring CPUs even close to the performance of GPUs (while maintaining quality/functionality). I find this interesting because everyone seems to take it as obvious that integrated graphics vs discrete graphics for gaming aren't even close. Ditto for these tasks.

    With Willow Inference Server I'm constantly telling people: a six-year-old $100 Tesla P4/GTX 1070 walks all over even the best CPUs in the world for our primary task of speech to text/ASR - at dramatically lower cost and power usage. Seriously - a GTX 1070 is at least 5x faster than a Threadripper 5955WX. Our goal is to provide an open-source user experience equivalent to commercial voice assistants, and that is and will be fundamentally impossible for the foreseeable future on CPU.

    Slight tangent, but there are users in the space who seem to be under the impression that they can use their Raspberry Pi for voice assistant/speech recognition. It's not even close to a fair fight, but with the same implementation and settings a GTX 1070 is roughly 90x (nearly two orders of magnitude) faster[2] than a Raspberry Pi... Yes, all-in, a machine with a GTX 1070 uses an order of magnitude more power (roughly 3W vs 30W) than a Raspberry Pi, but even in countries with the most expensive power in the world that works out to a $2-$3/mo cost difference - which I feel, at least, is a reasonable trade-off considering the dramatic difference in usability (a Raspberry Pi is essentially useless - waiting 10-30 seconds for a response makes pulling your phone out faster).
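
    For what it's worth, here is the back-of-the-envelope math behind that cost estimate; the wattages and electricity rate are assumptions for illustration, and the result scales linearly with your local $/kWh:

    ```python
    # Back-of-the-envelope power-cost comparison (assumed figures, not measurements).
    pi_watts = 3          # rough Raspberry Pi draw
    gpu_box_watts = 30    # rough draw of a machine with a GTX 1070
    rate_per_kwh = 0.15   # assumed electricity rate in $/kWh

    hours_per_month = 24 * 30
    extra_kwh = (gpu_box_watts - pi_watts) * hours_per_month / 1000
    print(f"Extra energy: {extra_kwh:.1f} kWh/month")        # ~19.4 kWh/month
    print(f"Extra cost:   ${extra_kwh * rate_per_kwh:.2f}/month")  # ~$2.92/month at $0.15/kWh
    ```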

    [0] - https://github.com/toverainc/willow

    [1] - https://github.com/toverainc/willow-inference-server

    [2] - https://github.com/toverainc/willow-inference-server/tree/wi...

  • GGML – AI at the Edge
    11 projects | news.ycombinator.com | 6 Jun 2023
    Shameless plug, I'm the founder of Willow[0].

    In short you can:

    1) Run a local Willow Inference Server[1]. It supports CPU or CUDA and is just about the fastest implementation of Whisper out there for "real time" speech.

    2) Run local command detection on device. We pull your Home Assistant entities on setup and define basic grammar for them; any English commands (up to 400) that Home Assistant can process are recognized directly on the $50 ESP BOX device and sent to Home Assistant (or openHAB, or a REST endpoint, etc.) for processing (see the sketch after this list for the general idea).

    Whether using WIS or local command detection, our performance target is 500ms from end of speech to command executed.
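
    As a rough sketch of the idea in 2) - pulling Home Assistant entities and deriving simple English command phrases from them - something like the following works against Home Assistant's documented REST API. This is not Willow's actual grammar format or code, just an illustration of the approach:

    ```python
    # Pull entities from Home Assistant and build toy "turn on/off <name>" commands.
    # /api/states with a long-lived access token is Home Assistant's documented API;
    # everything else here (URL, token, command phrasing) is a placeholder.
    import requests

    HA_URL = "http://homeassistant.local:8123"   # assumed Home Assistant address
    TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"       # created in the HA user profile

    def build_commands(max_commands: int = 400) -> list:
        resp = requests.get(
            f"{HA_URL}/api/states",
            headers={"Authorization": f"Bearer {TOKEN}"},
            timeout=10,
        )
        resp.raise_for_status()
        commands = []
        for state in resp.json():
            name = state.get("attributes", {}).get("friendly_name")
            domain = state["entity_id"].split(".", 1)[0]
            if name and domain in ("light", "switch"):
                commands.append(f"turn on {name.lower()}")
                commands.append(f"turn off {name.lower()}")
        return commands[:max_commands]  # on-device recognition is capped at ~400 commands

    if __name__ == "__main__":
        for cmd in build_commands()[:10]:
            print(cmd)
    ```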

    [0] - https://github.com/toverainc/willow

    [1] - https://github.com/toverainc/willow-inference-server

  • Show HN: Willow Inference Server: Optimized ASR/TTS/LLM for Willow/WebRTC/REST
    3 projects | news.ycombinator.com | 23 May 2023
    Thanks!

    Yes, "realtime multiple" is audio/speech length divided by actual inference time.

    You got it! The demo video is showing the slowest response times because it is using the highest quality/accuracy settings available with Whisper (large-v2, beam 5). Willow devices use medium/beam 1 by default for comparison, and those responses are measured in the 500 milliseconds or less range (again, depending on speech length) across a wide variety of new and old CUDA hardware. Some sample benchmarks here[0].

    Applications using WIS (including Willow) can provide model settings on a per-request basis to balance quality vs latency depending on the task.

    [0] - https://github.com/toverainc/willow-inference-server#benchma...

Stats

Basic willow-inference-server repo stats
Mentions: 7
Stars: 320
Activity: 8.3
Last commit: 16 days ago
