Top 23 Python Speech Projects
-
Project mention: My Journey to a reliable and enjoyable locally hosted voice assistant | news.ycombinator.com | 2026-03-16
actually the hardest part of a locally hosted voice assistant isn't the llm. it's making the tts tolerable to actually talk to every day.
the core issue is prosody: kokoro and piper are trained on read speech, but conversational responses have shorter breath groups and different stress patterns on function words. that's why numbers, addresses, and hedged phrases sound off even when everything else works.
the fix is training data composition. conversational and read speech have different prosody distributions and models don't generalize across them. for self-hosted, coqui xtts-v2 [1] is worth trying if you want more natural english output than kokoro.
btw i'm lily, cofounder of rime [2]. we're solving this for business voice agents at scale, not really the personal home assistant use case, but the underlying problem is the same.
[1] https://github.com/coqui-ai/TTS
-
VoxCPM
VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning
Project mention: Rust RAG, Tokenizer-Free TTS (VoxCPM2), & Project NOMAD: Local AI & Offline Deployments | dev.to | 2026-05-30
Source: https://github.com/OpenBMB/VoxCPM
-
Look at what arrived between mid-2023 and mid-2025. Gandhi et al.'s Distil-Whisper (2023) used large-scale pseudo-labelling to distil large-v2 into a 756M-parameter student that runs roughly 6× faster while staying within 1% WER of the teacher on out-of-distribution audio. Georgi Gerganov's whisper.cpp made CPU-only and mobile inference a default rather than a party trick; a base.en checkpoint transcribes in real time on an M1 without touching a GPU. Max Bain's WhisperX added forced alignment and diarization on top, so word-level timestamps and speaker labels stopped being a premium-tier differentiator.
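For readers unfamiliar with the metric behind the "1% WER" figure, here is a minimal sketch of word error rate as it is usually defined: word-level edit distance divided by the number of reference words. The function name is illustrative, and this is not the Distil-Whisper evaluation code, which would also normalize text before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("knight takes e4", "night takes e4"))
```

A "within 1% WER" claim then means the student's score on a test set is at most one absolute percentage point above the teacher's.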
-
datasets
🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools
Project mention: GSoC 2026 Predictions: 30 NEW AI/ML/Security Organizations You Should Start Contributing to NOW! | dev.to | 2026-02-06
-
Project mention: Show HN: Apple's Sharp Running in the Browser via ONNX Runtime Web | news.ycombinator.com | 2026-05-03
So I use a VAD onnx (Silero [1]) to automatically detect when someone is talking, and then it sends the audio into one of the voice recognition libraries.
I originally tried to get away with just Whisper Tiny in the chess game [2], but it performs worse on the kinds of short phrases (knight E4, c takes d5, etc) used to dictate chess notation. Even with hotword-based phrasing and corrections, I found its accuracy on brief inputs noticeably poorer. So I switched over to Sherpa [3] trained on gigaspeech. It’s significantly more accurate, but it also comes with a correspondingly larger memory footprint.
Ideally, I would have used just one engine, but I needed a fallback for iOS devices (especially older ones) which can easily OOM.
[1] - https://github.com/snakers4/silero-vad
[2] - https://shahkur.specr.net
[3] - https://github.com/k2-fsa/sherpa-onnx
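The pipeline described in that comment (a VAD gate in front of a recognizer, with a lighter engine as a fallback on memory-constrained devices) can be sketched roughly as follows. Everything here is an assumption for illustration: the energy threshold is a crude stand-in for Silero VAD, the recognizers are hypothetical callables rather than Sherpa or Whisper bindings, and the memory figures are made up.

```python
from typing import Callable, Iterable, List, Sequence

Frame = Sequence[float]                    # short chunk of PCM samples in [-1, 1]
Recognizer = Callable[[List[float]], str]  # hypothetical ASR engine interface


def is_speech(frame: Frame, threshold: float = 0.02) -> bool:
    """Energy-threshold stand-in for a real VAD model such as Silero."""
    if not frame:
        return False
    rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
    return rms >= threshold


def speech_segments(frames: Iterable[Frame], hangover: int = 2) -> List[List[float]]:
    """Group speech frames into segments; a segment closes once more than
    `hangover` consecutive silent frames have been seen."""
    segments: List[List[float]] = []
    current: List[float] = []
    silent_run = 0
    for frame in frames:
        if is_speech(frame):
            current.extend(frame)
            silent_run = 0
        elif current:
            silent_run += 1
            if silent_run > hangover:
                segments.append(current)
                current, silent_run = [], 0
    if current:
        segments.append(current)
    return segments


def choose_recognizer(memory_budget_mb: float, primary: Recognizer,
                      fallback: Recognizer, primary_needs_mb: float) -> Recognizer:
    """Pick the larger engine only when the device can afford it."""
    return primary if memory_budget_mb >= primary_needs_mb else fallback


def transcribe(frames: Iterable[Frame], recognizer: Recognizer) -> List[str]:
    """Run the chosen engine on each detected speech segment only."""
    return [recognizer(segment) for segment in speech_segments(frames)]
```

Choosing the engine up front, rather than catching a failure mid-session, reflects the point in the comment that an out-of-memory condition on older iOS devices is something to avoid rather than recover from.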
-
Project mention: Show HN: Background noise removal in multimedia with a single command | news.ycombinator.com | 2025-10-06
-
Project mention: How GPU-Powered Coding Agents Can Assist in Development of GPU-Accelerated Software | dev.to | 2026-02-28
Imagine owning a massive Plex media library with hundreds of foreign-language films and TV shows. You want subtitles for everything, but manually sourcing them is a nightmare — mismatched timings, missing translations, incomplete coverage. Tools like Bazarr exist specifically to automate subtitle management for Plex and Sonarr/Radarr libraries, and they ship with built-in integration for whisper-asr-webservice — a self-hosted REST API that wraps OpenAI's Whisper speech recognition model. Point Bazarr at a whisper-asr-webservice endpoint, and it will automatically transcribe and generate subtitles for every piece of media in your library, in any language Whisper supports.
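To make concrete what a client such as Bazarr sends to the service, here is a rough sketch that builds (but does not send) a subtitle request using only the standard library. It assumes the service exposes a POST `/asr` endpoint that takes a multipart `audio_file` field plus `task`, `language`, and `output` query parameters, which is my reading of the project's README; check the version you deploy. The host, port, and function name are placeholders.

```python
import uuid
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request


def build_asr_request(base_url: str, audio_bytes: bytes, filename: str, *,
                      task: str = "transcribe",
                      language: Optional[str] = None,
                      output: str = "srt") -> Request:
    """Build a multipart POST for a whisper-asr-webservice style endpoint.

    Endpoint path, field name, and query parameters are assumptions.
    """
    params = {"task": task, "output": output}
    if language:
        params["language"] = language

    # Hand-rolled multipart/form-data body with a single file part.
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        (f'Content-Disposition: form-data; name="audio_file"; '
         f'filename="{filename}"\r\n').encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        audio_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return Request(
        url=f"{base_url.rstrip('/')}/asr?{urlencode(params)}",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_asr_request("http://localhost:9000", b"\x00\x01", "episode.wav",
                            language="de")
    print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it; with `output="srt"` the response body would be the subtitle text, if the service behaves as assumed.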
-
whisper-timestamped
Multilingual Automatic Speech Recognition with word-level timestamps and confidence
There is also: https://github.com/linto-ai/whisper-timestamped
It doesn't use an extra model (so it supports every language that works with Whisper out of the box and uses less memory); it works by applying Dynamic Time Warping to cross-attention weights.
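As a rough illustration of that idea, the sketch below runs plain Dynamic Time Warping over a small tokens × frames attention matrix and reads off the first frame assigned to each token. It is not whisper-timestamped's code: the cost `1 - weight` assumes weights already scaled to [0, 1], the 20 ms frame duration is my understanding of Whisper's encoder output rate, and real implementations select and filter attention heads first.

```python
from typing import List, Tuple


def dtw_path(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Cheapest monotonic path from (0, 0) to the bottom-right cell."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    acc[0][0] = cost[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1][j] if i > 0 else inf,                 # next token, same frame
                acc[i][j - 1] if j > 0 else inf,                 # same token, next frame
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,   # both advance
            )
            acc[i][j] = cost[i][j] + best_prev

    # Backtrack from the end, always stepping to the cheapest predecessor.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return path


def token_start_times(attention: List[List[float]],
                      seconds_per_frame: float = 0.02) -> List[float]:
    """Start time of each token: the first frame the DTW path assigns to it."""
    cost = [[1.0 - w for w in row] for row in attention]
    first_frame = {}
    for token_idx, frame_idx in dtw_path(cost):
        first_frame.setdefault(token_idx, frame_idx)
    return [first_frame[t] * seconds_per_frame for t in range(len(attention))]
```

Because the path can only move forward in both tokens and frames, the resulting timestamps are guaranteed to be in order, which is what makes attention weights usable as an aligner without a second model.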
-
aeneas
aeneas is a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment)
-
Project mention: IMS Toucan – Text-to-Speech for over 7000 Languages | news.ycombinator.com | 2026-01-02
-
openai-edge-tts
Free, high-quality text-to-speech API endpoint to replace OpenAI, Azure, or ElevenLabs
-
StreamSpeech
StreamSpeech is an “All in One” seamless model for offline and simultaneous speech recognition, speech translation and speech synthesis.
Python Speech related posts
-
A beginner's guide to the Whisperx-A40-Large model by Victor-Upmeet on Replicate
-
AI Twin — Voice Cloning with Text-to-Speech
-
Making AI Models Faster, Cheaper, and Greener — Here’s How
-
2025 Voice AI Guide: How to Make Your Own Real-Time Voice Agent (Part-1)
-
Ask HN: What Speaker Diarization tools should I look into?
-
Training with Big Data on Any Cloud
-
Show HN: Mikey – No bot meeting notetaker for Windows
-
Index
What are some of the best open-source Speech projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | TTS | 45,294 |
| 2 | MockingBird | 36,904 |
| 3 | VoxCPM | 28,778 |
| 4 | whisperX | 22,445 |
| 5 | datasets | 21,620 |
| 6 | AudioGPT | 10,205 |
| 7 | silero-vad | 9,308 |
| 8 | modelscope | 8,963 |
| 9 | EmotiVoice | 8,479 |
| 10 | speech-to-speech | 4,873 |
| 11 | ultravox | 4,449 |
| 12 | metavoice-src | 4,194 |
| 13 | DeepFilterNet | 4,056 |
| 14 | whisper-asr-webservice | 3,281 |
| 15 | lingvo | 2,864 |
| 16 | whisper-timestamped | 2,818 |
| 17 | aeneas | 2,811 |
| 18 | gTTS | 2,620 |
| 19 | IMS-Toucan | 2,203 |
| 20 | openai-edge-tts | 1,928 |
| 21 | SALMONN | 1,449 |
| 22 | voicefixer | 1,327 |
| 23 | StreamSpeech | 1,270 |