Show HN: Project S.A.T.U.R.D.A.Y – open-source, self hosted, J.A.R.V.I.S

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • S.A.T.U.R.D.A.Y

    A toolbox for working with WebRTC, Audio and AI

  • whisper.cpp

    Port of OpenAI's Whisper model in C/C++

  • Welcome to Project S.A.T.U.R.D.A.Y. This is a project that allows anyone to easily build their own self-hosted, J.A.R.V.I.S-like voice assistant. In my mind, vocal computing is the future of human-computer interaction, and by open-sourcing this code I hope to speed us along that path.

    I have had a blast working on this so far and I'm excited to continue to build with it. It uses whisper.cpp [1], Coqui TTS [2] and OpenAI [3] to do speech-to-text, text-to-text and text-to-speech inference, all 100% locally (except for text-to-text). In the future I plan to swap out OpenAI for llama.cpp [4]. It is built on top of WebRTC as the media transmission layer, which allows this technology to be deployed anywhere, since it does not rely on any native or third-party APIs. (A rough sketch of this pipeline is included after this post.)

    The purpose of this project is to be a toolbox for vocal computing. It provides high-level abstractions for dealing with speech-to-text, text-to-text and text-to-speech tasks. The tools remain decoupled from the underlying AI models, allowing for quick and easy upgrades when new technology is released. The main demo for this project is a J.A.R.V.I.S-like assistant; however, it is meant to be used for a wide variety of use cases.

    In the coming months I plan to continue to build (hopefully with some of you) on top of this project in order to refine the abstraction level and better understand the kinds of tools required. I hope to build a community of like-minded individuals who want to see J.A.R.V.I.S finally come to life! If you are interested in vocal computing come join the Discord server and build with us! Hope to see you there :)

    [1] whisper.cpp: https://github.com/ggerganov/whisper.cpp
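
    A minimal, hypothetical sketch of that decoupled pipeline (this is not S.A.T.U.R.D.A.Y's actual API; the class names, the whisper.cpp CLI flags and the swappable-backend layout are illustrative assumptions):

    import subprocess
    from dataclasses import dataclass
    from typing import Protocol

    class SpeechToText(Protocol):
        def transcribe(self, wav_path: str) -> str: ...

    class TextToText(Protocol):
        def reply(self, prompt: str) -> str: ...

    class TextToSpeech(Protocol):
        def synthesize(self, text: str, out_path: str) -> None: ...

    @dataclass
    class Assistant:
        stt: SpeechToText   # e.g. a whisper.cpp wrapper
        ttt: TextToText     # e.g. OpenAI today, llama.cpp later
        tts: TextToSpeech   # e.g. a Coqui TTS wrapper

        def handle_utterance(self, wav_in: str, wav_out: str) -> str:
            text = self.stt.transcribe(wav_in)     # speech-to-text
            answer = self.ttt.reply(text)          # text-to-text
            self.tts.synthesize(answer, wav_out)   # text-to-speech
            return answer

    class WhisperCppSTT:
        """Wraps the whisper.cpp CLI; flags may vary between versions."""
        def __init__(self, binary: str, model: str):
            self.binary, self.model = binary, model

        def transcribe(self, wav_path: str) -> str:
            result = subprocess.run(
                [self.binary, "-m", self.model, "-f", wav_path, "-nt"],
                capture_output=True, text=True, check=True,
            )
            return result.stdout.strip()

    In this layout, swapping OpenAI for llama.cpp (or Coqui for another TTS engine) only means providing a different implementation behind the same interface.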

  • TTS

    πŸΈπŸ’¬ - a deep learning toolkit for Text-to-Speech, battle-tested in research and production

  • llama.cpp

    LLM inference in C/C++

  • excalidraw

    Virtual whiteboard for sketching hand-drawn like diagrams

  • willow

    Open source, local, and self-hosted Amazon Echo/Google Home competitive Voice Assistant alternative

  • Nice! I'm the creator of Willow[0] (which has been mentioned here).

    First of all, we love seeing efforts like this and we'd love to work together with other open source voice user interface projects! There's plenty of work to do in the space...

    I have roughly two decades of experience with voice, and one thing to keep in mind is how latency-sensitive voice tasks are. Generally speaking, when it comes to conversational audio, people have very high expectations regarding interactivity. For example, in the VoIP world we know that conversation between people starts getting annoying at around 300ms of latency. Higher latencies for voice assistant tasks are more-or-less "tolerated", but latency still needs to be extremely low. Alexa/Echo (with all of its problems) is at least a decent benchmark for what people expect for interactivity, and all things considered it does pretty well.

    I know you're early (we are too!), but in your demo I counted roughly six seconds of latency between the initial hello and the response (and nearly ten for "tell me a joke"). In terms of conversational voice this feels like an eternity. Again, no shade at all (believe me, I understand more than most), but it's something I thought I'd add from my decades of experience with humans and voice. This is why we place such heavy emphasis on reducing latency as much as possible.

    For an idea of just how much we emphasize this, you can try our WebRTC demo[1], which can do end-to-end (from clicking "stop record" in the browser to ASR response) in a few hundred milliseconds (with Whisper large-v2 and beam size 5 - medium/1 is a fraction of that), including internet latency (it's hosted in Chicago, FYI).

    Running locally with WIS and Willow we see less than 500ms from end of speech (on-device VAD) to command execution completion and TTS response with platforms like Home Assistant. Granted, this is with a GPU, so you could call it cheating, but a $100 six-year-old Nvidia Pascal-series GPU runs circles around the fastest CPUs for these tasks (STT and TTS - see benchmarks here[2]). Again, kind of cheating, but my RTX 3090 at home drops this down to around 200ms - roughly half of that time is Home Assistant. It's my (somewhat controversial) personal opinion that GPUs are more-or-less a requirement (today) for Alexa/Echo-competitive responsiveness.

    Speaking of latency, I've been noticing a trend with Willow users regarding LLMs - they are very neat, cool, and interesting (our inference server[3] supports LLaMA-based LLMs), but they really aren't the right tool for these kinds of tasks. They have very high memory requirements (relatively speaking), require a lot of compute, and are very slow (again, relatively speaking). They also don't natively support the kinds of API call/response you need for most voice tasks. There are efforts out there to support this with LLMs, but frankly I find the overall approach very strange. It seems that LLMs have sucked a lot of the oxygen out of the room and people have forgotten (or never heard of) "good old fashioned" NLU/NLP approaches.

    Have you considered an NLU/NLP engine like Rasa[4]? This is the approach we will be taking to implement this kind of functionality in a flexible way that stays agnostic to the assistant platform/integration. By the time you stack up VAD, STT, understanding user intent (while allowing flexible grammar), calling an API, execution, and the TTS response, latency starts to add up very, very quickly.

    As one example, for "tell me a joke" Alexa does this in a few hundred milliseconds, and I guarantee they're not using an LLM for this task - you can have a couple of hundred jokes to randomly select from, with pre-generated TTS responses cached (as one path; see the sketch after this comment). Again, this is the approach we are taking to "catch up" with Alexa for all kinds of things, from jokes to creating calendar entries, etc. Of course you can still have a catch-all to hand off to an LLM for "conversation", but I'm not sure users actually want this for voice.

    I may be misunderstanding your goals but just a few things I thought I would mention.

    [0] - https://github.com/toverainc/willow

    [1] - https://wisng.tovera.io/rtc/

    [2] - https://github.com/toverainc/willow-inference-server/tree/wi...

    [3] - https://github.com/toverainc/willow-inference-server

    [4] - https://rasa.com/
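
    A toy illustration of that cached fast path, under stated assumptions: the regex matcher below stands in for a real NLU engine such as Rasa, and the intent names and clip file names are hypothetical. It shows why the fast path avoids model inference entirely; it is not Willow's implementation.

    import random
    import re

    # Intents mapped to pre-synthesized TTS clips, generated offline once.
    CACHED_RESPONSES = {
        "tell_joke": ["joke_001.wav", "joke_002.wav", "joke_003.wav"],
    }

    # Minimal pattern matcher; a real system would use an NLU engine
    # (e.g. Rasa) for flexible grammar, entities, and confidence scores.
    INTENT_PATTERNS = {
        "tell_joke": re.compile(r"\b(tell|say)\b.*\bjoke\b", re.IGNORECASE),
    }

    def route(transcript):
        """Return (intent, cached_audio_path); fall back to the LLM path."""
        for intent, pattern in INTENT_PATTERNS.items():
            if pattern.search(transcript):
                # Cache hit: no model inference at all, just pick a clip.
                return intent, random.choice(CACHED_RESPONSES[intent])
        return "llm_fallback", None  # hand off to the slower LLM/chat path

    print(route("hey, tell me a joke"))    # e.g. ('tell_joke', 'joke_002.wav')
    print(route("what's the weather?"))    # ('llm_fallback', None)

    With this shape, the LLM only runs for the catch-all "conversation" path, which keeps common commands inside the latency budget described above.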

  • willow-inference-server

    Open source, local, and self-hosted highly optimized language inference server supporting ASR/STT, TTS, and LLM across WebRTC, REST, and WS




Related posts

  • VLLM: 24x faster LLM serving than HuggingFace Transformers

    3 projects | news.ycombinator.com | 20 Jun 2023
  • Show HN: Willow Inference Server: Optimized ASR/TTS/LLM for Willow/WebRTC/REST

    3 projects | news.ycombinator.com | 23 May 2023
  • Distil-Whisper: distilled version of Whisper that is 6 times faster, 49% smaller

    14 projects | news.ycombinator.com | 31 Oct 2023
  • Whisper.api: An open source, self-hosted speech-to-text with fast transcription

    5 projects | news.ycombinator.com | 22 Aug 2023
  • [D] What is the most efficient version of OpenAI Whisper?

    7 projects | /r/MachineLearning | 12 Jul 2023