SaaSHub helps you find the best software and product alternatives Learn more →
Top 23 speech-recognition Open-Source Projects
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
-
DeepSpeech
DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
-
DeepLearningExamples
State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.
-
deep-learning-drizzle
Drench yourself in Deep Learning, Reinforcement Learning, Machine Learning, Computer Vision, and NLP by learning from these exciting lectures!!
-
WorkOS
The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
-
PaddleSpeech
Easy-to-use Speech Toolkit including Self-Supervised Learning model, SOTA/Streaming ASR with punctuation, Streaming TTS with text frontend, Speaker Verification System, End-to-End Speech Translation and Keyword Spotting. Won NAACL2022 Best Demo Award.
-
SpeechRecognition
Speech recognition module for Python, supporting several engines and APIs, online and offline.
-
vosk-api
Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
-
silero-models
Silero Models: pre-trained speech-to-text, text-to-speech and text-enhancement models made embarrassingly simple
-
FunASR
A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models. |语音识别工具包,包含丰富的性能优越的开源预训练模型,支持语音识别、语音端点检测、文本后处理等,具备服务部署能力。
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
Project mention: Maxtext: A simple, performant and scalable Jax LLM | news.ycombinator.com | 2024-04-23Is t5x an encoder/decoder architecture?
Some more general options.
The Flax ecosystem
https://github.com/google/flax?tab=readme-ov-file
or dm-haiku
https://github.com/google-deepmind/dm-haiku
were some of the best developed communities in the Jax AI field
Perhaps the “trax” repo? https://github.com/google/trax
Some HF examples https://github.com/huggingface/transformers/tree/main/exampl...
Sadly it seems much of the work is proprietary these days, but one example could be Grok-1, if you customize the details. https://github.com/xai-org/grok-1/blob/main/run.py
Project mention: Show HN: I created automatic subtitling app to boost short videos | news.ycombinator.com | 2024-04-09whisper.cpp [1] has a karaoke example that uses ffmpeg's drawtext filter to display rudimentary karaoke-like captions. It also supports diarisation. Perhaps it could be a starting point to create a better script that does what you need.
--
1: https://github.com/ggerganov/whisper.cpp/blob/master/README....
It's indeed suspicious. You're sending your voice samples, your various services accounts, your location and more private data to some proprietary black box in some public cloud. Sorry, but this is a privacy nightmare. It should be open source and self-hosted like Mycroft (https://mycroft.ai) or Leon (https://getleon.ai) to be trustworthy.
Project mention: Amazon plans to charge for Alexa in June–unless internal conflict delays revamp | news.ycombinator.com | 2024-01-20Yeah, whisper is the closest thing we have, but even it requires more processing power than is present in most of these edge devices in order to feel smooth. I've started a voice interface project on a Raspberry Pi 4, and it takes about 3 seconds to produce a result. That's impressive, but not fast enough for Alexa.
From what I gather a Pi 5 can do it in 1.5 seconds, which is closer, so I suspect it's only a matter of time before we do have fully local STT running directly on speakers.
> Probably anathema to the space, but if the devices leaned into the ~five tasks people use them for (timers, weather, todo list?) could probably tighten up the AI models to be more accurate and/or resource efficient.
Yes, this is the approach taken by a lot of streaming STT systems, like Kaldi [0]. Rather than use a fully capable model, you train a specialized one that knows what kinds of things people are likely to say to it.
[0] http://kaldi-asr.org/
PaddlePaddle/PaddleSpeech
Project mention: Easy video transcription and subtitling with Whisper, FFmpeg, and Python | news.ycombinator.com | 2024-04-06It uses this, which does support diarization: https://github.com/m-bain/whisperX
For our real-time STT needs, we'll employ a fantastic library called faster-whisper.
Start and Stop Listening Example
Project mention: WhisperSpeech – An Open Source text-to-speech system built by inverting Whisper | news.ycombinator.com | 2024-01-17You might check out this list from espnet. They list the different corpuses they use to train their models sorted by language and task (ASR, TTS etc):
https://github.com/espnet/espnet/blob/master/egs2/README.md
Project mention: SpeechBrain 1.0: A free and open-source AI toolkit for all things speech | news.ycombinator.com | 2024-02-28
Project mention: Weird A.I. Yankovic, a cursed deep dive into the world of voice cloning | news.ycombinator.com | 2023-10-02I doubt it's currently actually "the best open source text to speech", but the answer I came up with when throwing a couple of hours at the problem some months ago was "Silero" [0, 1].
Following the "standalone" guide [2], it was pretty trivial to make the model render my sample text in about 100 English "voices" (many of which were similar to each other, and in varying quality). Sampling those, I got about 10 that were pretty "good". And maybe 6 that were the "best ones" (pretty natural, not annoying to listen to).
IIRC the license was free for noncommercial use only. I'm not sure exactly "how open source" they are, but it was simple to install the dependencies and write the basic Python to try it out; I had to write a for loop to try all the voices like I wanted. I ended using something else for the project for other reasons, but this could still be fairly good backup option for some use cases IMO.
[0] https://github.com/snakers4/silero-models#text-to-speech
Project mention: [D] What is the most efficient version of OpenAI Whisper? | /r/MachineLearning | 2023-07-12Whisper JAX: https://github.com/sanchit-gandhi/whisper-jax. Builds on the Hugging Face implementation. Written in JAX (instead of PyTorch), where you get a 10x or more speed-up if you run it on TPU v4 hardware (I've gotten up to 15x with large batch sizes for super long audio files). Overall, 70-100x faster than OpenAI if you run it on TPU v4
Project mention: [Discussion] Looking for an Open-Source Speech to Text model (english) that captures filler words, pauses and also records timestamps for each word. | /r/LocalLLaMA | 2023-07-07
wenet-e2e/wenet
Project mention: FunASR: Fundamental End-to-End Speech Recognition Toolkit | news.ycombinator.com | 2024-01-13
speech-recognition related posts
- VOSK Offline Speech Recognition API
- Easy video transcription and subtitling with Whisper, FFmpeg, and Python
- SOTA ASR Tooling: Long-Form Transcription
- Deploying whisperX on AWS SageMaker as Asynchronous Endpoint
- SpeechBrain 1.0: A free and open-source AI toolkit for all things speech
- Show HN: AI Dub Tool I Made to Watch Foreign Language Videos with My 7-Year-Old
- Speech-to-Text Benchmark
-
A note from our sponsor - SaaSHub
www.saashub.com | 26 Apr 2024
Index
What are some of the best open-source speech-recognition projects? This list will help you:
Project | Stars | |
---|---|---|
1 | transformers | 125,021 |
2 | whisper.cpp | 30,942 |
3 | DeepSpeech | 24,278 |
4 | Leon | 14,539 |
5 | Kaldi Speech Recognition Toolkit | 13,706 |
6 | DeepLearningExamples | 12,607 |
7 | deep-learning-drizzle | 11,749 |
8 | PaddleSpeech | 10,120 |
9 | whisperX | 8,965 |
10 | faster-whisper | 8,723 |
11 | SpeechRecognition | 8,040 |
12 | espnet | 7,872 |
13 | speechbrain | 7,869 |
14 | vosk-api | 7,025 |
15 | annyang | 6,547 |
16 | wav2letter | 6,331 |
17 | openvino | 5,911 |
18 | silero-models | 4,534 |
19 | whisper-jax | 4,077 |
20 | pocketsphinx | 3,736 |
21 | wenet | 3,691 |
22 | Porcupine | 3,424 |
23 | FunASR | 3,299 |
Sponsored