Top 23 speech-to-text Open-Source Projects
- DeepSpeech: an open-source embedded (offline, on-device) speech-to-text engine that can run in real time on devices ranging from a Raspberry Pi 4 to high-power GPU servers.
- SpeechRecognition: speech recognition module for Python, supporting several engines and APIs, online and offline.
- vosk-api: offline speech recognition API for Android, iOS, Raspberry Pi, and servers, with bindings for Python, Java, C#, and Node.
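A minimal file-transcription sketch against the Vosk Python bindings (hedged: untested here; it assumes a 16-bit mono WAV and a model directory downloaded separately from the Vosk site — the `model_dir` default is a placeholder):

```python
import json
import wave

def transcribe_wav(path, model_dir="model"):
    # Deferred import so the sketch loads even without vosk installed.
    from vosk import Model, KaldiRecognizer

    with wave.open(path, "rb") as wf:
        rec = KaldiRecognizer(Model(model_dir), wf.getframerate())
        pieces = []
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            # AcceptWaveform returns True at utterance boundaries.
            if rec.AcceptWaveform(data):
                pieces.append(json.loads(rec.Result())["text"])
        pieces.append(json.loads(rec.FinalResult())["text"])
    return " ".join(p for p in pieces if p)
```

The same recognizer loop works on a live microphone stream by feeding it chunks from an audio callback instead of `readframes`.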
- pyvideotrans: translate a video from one language to another and add dubbing.
- silero-models: Silero Models — pre-trained speech-to-text, text-to-speech, and text-enhancement models made embarrassingly simple.
- willow: an open-source, local, self-hosted voice assistant alternative to Amazon Echo and Google Home.
- STT: 🐸STT, the deep learning toolkit for speech-to-text. Training and deploying STT models has never been so easy.
- whisper-timestamped: multilingual automatic speech recognition with word-level timestamps and confidence scores.
- awesome-whisper: 🔊 awesome list for Whisper, an open-source AI-powered speech recognition system developed by OpenAI.
Project mention: Show HN: I created automatic subtitling app to boost short videos | news.ycombinator.com | 2024-04-09
whisper.cpp [1] has a karaoke example that uses ffmpeg's drawtext filter to display rudimentary karaoke-style captions. It also supports diarisation. Perhaps it could be a starting point for a script that does what you need.
--
1: https://github.com/ggerganov/whisper.cpp/blob/master/README....
It's indeed suspicious. You're sending your voice samples, your various service accounts, your location, and more private data to a proprietary black box in some public cloud. Sorry, but this is a privacy nightmare. To be trustworthy it should be open source and self-hosted, like Mycroft (https://mycroft.ai) or Leon (https://getleon.ai).
Project mention: Amazon plans to charge for Alexa in June–unless internal conflict delays revamp | news.ycombinator.com | 2024-01-20
Yeah, Whisper is the closest thing we have, but even it requires more processing power than most of these edge devices have in order to feel smooth. I've started a voice interface project on a Raspberry Pi 4, and it takes about 3 seconds to produce a result. That's impressive, but not fast enough to compete with Alexa.
From what I gather a Pi 5 can do it in 1.5 seconds, which is closer, so I suspect it's only a matter of time before we do have fully local STT running directly on speakers.
> Probably anathema to the space, but if the devices leaned into the ~five tasks people use them for (timers, weather, todo list?) could probably tighten up the AI models to be more accurate and/or resource efficient.
Yes, this is the approach taken by a lot of streaming STT systems, like Kaldi [0]. Rather than use a fully capable model, you train a specialized one that knows what kinds of things people are likely to say to it.
[0] http://kaldi-asr.org/
Project mention: Easy video transcription and subtitling with Whisper, FFmpeg, and Python | news.ycombinator.com | 2024-04-06
It uses this, which does support diarization: https://github.com/m-bain/whisperX
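The whisperX pipeline is roughly transcribe → align → assign speakers. A hedged sketch of that flow, based on the project's README as I recall it (exact signatures may have drifted, and the diarization step needs a Hugging Face token for the underlying pyannote models):

```python
def diarized_segments(audio_path, hf_token, device="cpu"):
    # Deferred import so the sketch loads even without whisperx installed.
    import whisperx

    # 1. Transcribe with a (faster-whisper-backed) Whisper model.
    model = whisperx.load_model("small", device)
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio)

    # 2. Refine word timestamps with an alignment model for the detected language.
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device)
    result = whisperx.align(result["segments"], align_model, metadata,
                            audio, device)

    # 3. Run diarization and attach speaker labels to each word/segment.
    diarize = whisperx.DiarizationPipeline(use_auth_token=hf_token,
                                           device=device)
    result = whisperx.assign_word_speakers(diarize(audio), result)
    return result["segments"]
```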
For our real-time STT needs, we'll employ a fantastic library called faster-whisper.
Start and Stop Listening Example
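A sketch of what start/stop listening might look like around faster-whisper. Hedged: the `Listener` class and its method names are invented for illustration; only `WhisperModel` and its `transcribe` method come from the faster-whisper README:

```python
import threading

class Listener:
    """Toggleable wrapper around faster-whisper transcription (illustrative only)."""

    def __init__(self, model_size="base"):
        self.model_size = model_size
        self._listening = threading.Event()

    def start(self):
        self._listening.set()

    def stop(self):
        self._listening.clear()

    @property
    def listening(self):
        return self._listening.is_set()

    def transcribe(self, wav_path):
        # Ignore audio while stopped.
        if not self.listening:
            return ""
        # Deferred import so the sketch loads without faster-whisper installed.
        from faster_whisper import WhisperModel
        model = WhisperModel(self.model_size)
        # transcribe() returns a lazy generator of segments plus info metadata.
        segments, _info = model.transcribe(wav_path)
        return " ".join(seg.text.strip() for seg in segments)
```

In a real application you would load the model once in `__init__` rather than per call, and feed it chunks from a microphone stream instead of a file path.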
Project mention: SpeechBrain 1.0: A free and open-source AI toolkit for all things speech | news.ycombinator.com | 2024-02-28
Project mention: Weird A.I. Yankovic, a cursed deep dive into the world of voice cloning | news.ycombinator.com | 2023-10-02
I doubt it's currently actually "the best open source text to speech", but the answer I came up with when throwing a couple of hours at the problem some months ago was "Silero" [0, 1].
Following the "standalone" guide [2], it was pretty trivial to make the model render my sample text in about 100 English "voices" (many of which were similar to each other, and in varying quality). Sampling those, I got about 10 that were pretty "good". And maybe 6 that were the "best ones" (pretty natural, not annoying to listen to).
IIRC the license was free for noncommercial use only. I'm not sure exactly how open source they are, but it was simple to install the dependencies and write the basic Python to try it out; I had to write a for loop to try all the voices as I wanted. I ended up using something else for the project for other reasons, but this could still be a fairly good backup option for some use cases, IMO.
[0] https://github.com/snakers4/silero-models#text-to-speech
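The voice-sampling loop described above might look roughly like this. Hedged: an untested sketch based on Silero's standalone torch.hub usage; the `v3_en` package name, the `model.speakers` attribute, and the `apply_tts` signature are recollections, not verified against the current README:

```python
def render_all_voices(text, out_dir="voices"):
    # Requires torch and torchaudio; model weights are fetched via torch.hub.
    import os
    import torch
    import torchaudio

    os.makedirs(out_dir, exist_ok=True)
    model, _example_text = torch.hub.load(
        "snakers4/silero-models", "silero_tts",
        language="en", speaker="v3_en")  # one package bundling ~100+ English voices
    for voice in model.speakers:         # loop over every bundled voice
        audio = model.apply_tts(text=text, speaker=voice, sample_rate=48000)
        # apply_tts returns a 1-D waveform tensor; add a channel dim to save it.
        torchaudio.save(os.path.join(out_dir, f"{voice}.wav"),
                        audio.unsqueeze(0), 48000)
```

Listening through the rendered files in `out_dir` is then a quick way to shortlist the handful of voices worth keeping, as the commenter did.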
Fair points but with all due respect completely misses the point and context. My comment was a reply to a new user interested in esphome on a post about esphome.
You're talking about CircuitPython, 35KB web replies, PSRAM, UF2 bootloader, etc. These are comparatively very advanced topics and you didn't mention esphome once.
The comfort and familiarity of Amazon for what is already a new, intimidating, and challenging subject is of immeasurable value for a novice. They can click those links, fill a cart, and have stuff show up tomorrow with all of the usual ease, friendliness, and reliability of Amazon. If they get frustrated or it doesn't work out they can shove it in the box and get a full refund Amazon-style.
You're suggesting wandering all over the internet, ordering stuff from China, multiple vendors, etc while describing a bunch of things that frankly just won't matter to them. I say this as someone who has been an esphome and home assistant user since day one. The approach I described has never failed or remotely bothered me and over the past ~decade I've seen it suggested to new users successfully time and time again.
In terms of PSRAM, to my knowledge the only things it is utilized for in the esphome ecosystem are higher-resolution displays and more advanced voice assistant scenarios that almost always require the -S3 anyway and are very advanced, challenging use cases. I'm very familiar with displays, voice, the S3, and PSRAM, but more on that in a second...
> live with one less LX7 core and no Bluetooth
I'm the founder of Willow[0] and when comparing Willow to esphome the most frequent request we get is supporting bluetooth functionality i.e. esphome bluetooth proxy[1]. This is an extremely popular use case in the esphome/home assistant community. Not having bluetooth while losing a core and paying more is a bigger issue than pin spacing.
It's also a pretty obscure board, and while that's not a big deal to you and me, if you look around at docs, guides, etc. you'll see the cheap-o boards from Amazon are by far the most popular and common (unsurprisingly). Another plus for a new user.
Speaking of Willow (and back to PSRAM again) even the voice assistant satellite functionality of Home Assistant doesn't fundamentally require it - the most popular device doesn't have it either[2].
Very valuable comment with a lot of interesting information, just doesn't apply to context.
[0] - https://heywillow.io/
[1] - https://esphome.io/components/bluetooth_proxy.html
[2] - https://www.home-assistant.io/voice_control/thirteen-usd-voi...
Project mention: Rest in Peas: The Unrecognized Death of Speech Recognition (2010) | news.ycombinator.com | 2023-05-04
What has happened since then? I know Common Voice has come and gone: https://en.wikipedia.org/wiki/Common_Voice https://github.com/coqui-ai/STT
And I've seen some neural approaches too
No idea where to look for comparisons though.
https://github.com/MahmoudAshraf97/whisper-diarization
This project has been all right for transcribing audio with speaker diarization, if a bit finicky. The OpenAI model is better than other paid products (Descript, Riverside), so I'm looking forward to trying MacWhisper.
Project mention: How I converted a podcast into a knowledge base using Orama search and OpenAI whisper and Astro | dev.to | 2023-05-23
Project mention: Show HN: AI Dub Tool I Made to Watch Foreign Language Videos with My 7-Year-Old | news.ycombinator.com | 2024-02-28
Yes. But Whisper's word-level timings are actually quite inaccurate out of the box. There are some Python libraries that mitigate that. I tested several of them; whisper-timestamped seems to be the best one. [0]
[0] https://github.com/linto-ai/whisper-timestamped
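For reference, pulling those word-level timings out of whisper-timestamped looks roughly like this (a hedged sketch based on the project README; `flatten_words` is a helper invented here, and the result-dict keys are assumptions about the library's output format):

```python
def flatten_words(result):
    """Flatten a whisper-timestamped result dict into (word, start, end, confidence) tuples."""
    return [(w["text"], w["start"], w["end"], w["confidence"])
            for seg in result["segments"]
            for w in seg["words"]]

def word_timestamps(wav_path, model_name="small"):
    # Deferred import so the sketch loads without whisper-timestamped installed.
    import whisper_timestamped as whisper

    audio = whisper.load_audio(wav_path)
    model = whisper.load_model(model_name)
    return flatten_words(whisper.transcribe(model, audio))
```

The per-word confidence values are handy for flagging words to review before burning subtitles into a dub.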
speech-to-text related posts
- VOSK Offline Speech Recognition API
- Show HN: I created automatic subtitling app to boost short videos
- Easy video transcription and subtitling with Whisper, FFmpeg, and Python
- SOTA ASR Tooling: Long-Form Transcription
- Deploying whisperX on AWS SageMaker as Asynchronous Endpoint
- LLMs on your local Computer (Part 1)
- SpeechBrain 1.0: A free and open-source AI toolkit for all things speech
Index
What are some of the best open-source speech-to-text projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | whisper.cpp | 30,942 |
2 | DeepSpeech | 24,278 |
3 | Leon | 14,539 |
4 | Kaldi Speech Recognition Toolkit | 13,706 |
5 | whisperX | 8,965 |
6 | faster-whisper | 8,723 |
7 | SpeechRecognition | 8,040 |
8 | speechbrain | 7,869 |
9 | vosk-api | 7,025 |
10 | annyang | 6,547 |
11 | pyvideotrans | 5,556 |
12 | silero-models | 4,534 |
13 | lingvo | 2,780 |
14 | willow | 2,361 |
15 | STT | 2,131 |
16 | whisper-diarization | 1,985 |
17 | kalliope | 1,696 |
18 | soloud | 1,644 |
19 | whisper-asr-webservice | 1,617 |
20 | whisper-timestamped | 1,501 |
21 | Dragonfire | 1,372 |
22 | dc_tts | 1,150 |
23 | awesome-whisper | 989 |