Top 23 Python speech-to-text Projects
-
faster-whisper
Project mention: Play 3.0 mini – A lightweight, reliable, cost-efficient Multilingual TTS model | news.ycombinator.com | 2024-10-14
Hi, I don't know what's SOTA, but I got good results with these (open source, on-device) :
https://github.com/SYSTRAN/faster-whisper (speech-to-text)
-
whisperX
Yes it's still relevant but I prefer WhisperX for some tasks: https://github.com/m-bain/whisperX
-
pyvideotrans
Translate a video from one language to another and add dubbing. Also supports speech-recognition transcription, speech synthesis, and subtitle translation.
Project mention: Pyvideotrans – AI Video Translation and Voiceover Tool | news.ycombinator.com | 2024-08-14
-
speechbrain
Simple Diarizer is a speaker diarization library that utilizes pretrained models from SpeechBrain. To get started with simple_diarizer, follow these steps:
-
SpeechRecognition
Speech recognition module for Python, supporting several engines and APIs, online and offline.
-
RealtimeSTT
A robust, efficient, low-latency speech-to-text library with advanced voice activity detection, wake word activation and instant transcription.
-
SenseVoice
Project mention: Omni SenseVoice: High-Speed Speech Recognition with Words Timestamps | news.ycombinator.com | 2024-10-12
I mean they make a bold statement up top just to paddle back a little bit further down with: "[…] In terms of Chinese and Cantonese recognition, the SenseVoice-Small model has advantages."
It feels dishonest to me.
[0] https://github.com/FunAudioLLM/SenseVoice?tab=readme-ov-file...
-
voice-pro
Comprehensive Gradio WebUI for audio processing, powered by Whisper engines (Whisper, Faster-Whisper, Whisper-Timestamped). Features Voice Changer (RVC), zero-shot Voice Cloning (E2, F5-TTS), YouTube downloading, vocal isolation (UVR5), Text-to-Speech (Edge-TTS), and multi-language translation. Perfect for content creators and developers.
Project mention: Voice-Pro: Ultimate AI Voice Conversion and Multilingual Translation Tool 🔊 | dev.to | 2025-02-10
GitHub: https://github.com/abus-aikorea/voice-pro
-
LLaMA-Omni
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
Project mention: Hertz-dev, the first open-source base model for conversational audio | news.ycombinator.com | 2024-11-03
- [LLaMA-Omni](https://github.com/ictnlp/LLaMA-Omni) is a speech-language model built on Llama-3.1-8B-Instruct and trained using just 4 GPUs, offering low-latency, high-quality speech interactions and simultaneous generation of text and speech responses
- [moshi](https://github.com/kyutai-labs/moshi) a speech-text foundation model that supports low-latency high-quality speech interactions and simultaneous generation of text responses, using Mimi, a SOTA streaming neural audio codec
- [Mini-Omni](https://github.com/gpt-omni/mini-omni) a multimodal LLM based on Qwen2 offering real-time end-to-end speech input and streaming audio output conversational capabilities
- [Aria](https://github.com/rhymes-ai/Aria) is a lightweight, multimodal native MoE model with 25B parameters and 3.9B activated parameters per token, offering state-of-the-art performance in multimodal, language, and coding tasks, with a long multimodal context window of 64K tokens and efficient encoding of visual input for fast inference and low fine-tuning cost
- [Ichigo](https://github.com/homebrewltd/ichigo) an open research project extending a text-based LLM to have native listening ability, using an early fusion technique, with improved multiturn capabilities and refusal to process inaudible queries
-
whisper-timestamped
Multilingual Automatic Speech Recognition with word-level timestamps and confidence
Project mention: Show HN: AI Dub Tool I Made to Watch Foreign Language Videos with My 7-Year-Old | news.ycombinator.com | 2024-02-28
Yes. But Whisper's word-level timings are actually quite inaccurate out of the box. There are some Python libraries that mitigate that. I tested several of them. whisper-timestamped seems to be the best one. [0]
[0] https://github.com/linto-ai/whisper-timestamped
-
whisper-standalone-win
Whisper & Faster-Whisper standalone executables for those who don't want to bother with Python.
On Windows I use whisper-standalone-win: https://github.com/Purfview/whisper-standalone-win
It has a few customization features that are nice: https://github.com/Purfview/whisper-standalone-win/discussio...
Works miles better than plain faster-whisper, in my experience. Not sure if there's wildcard support but that's easily scripted.
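The wildcard scripting mentioned in the comment above might look something like the following minimal Python sketch. It expands a glob pattern and builds one command per matching file. The executable name `transcriber.exe` is a hypothetical placeholder, not the real binary name, and no command-line flags are assumed; substitute whatever the installed tool actually expects.

```python
import glob
import subprocess
from pathlib import Path


def build_commands(pattern, executable="transcriber.exe"):
    """Return one command (as an argument list) per file matching `pattern`.

    `executable` is a hypothetical placeholder; point it at whatever
    transcription binary is actually installed.
    """
    files = sorted(glob.glob(pattern))
    return [[executable, str(Path(f))] for f in files]


def run_all(pattern, executable="transcriber.exe"):
    """Run the executable once per matching file, stopping on the first failure."""
    for cmd in build_commands(pattern, executable):
        subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes it easy to print and inspect the commands before anything is run.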
-
StreamSpeech
StreamSpeech is an “All in One” seamless model for offline and simultaneous speech recognition, speech translation and speech synthesis.
Has anyone had any luck with an offline, free, open-source real-time speech-to-speech translation app on under-powered devices (i.e., older smart phones)?
* https://github.com/ictnlp/StreamSpeech
* https://github.com/k2-fsa/sherpa-onnx
* https://github.com/openai/whisper
I'm looking for a simple app that can listen for English, translate into Korean (and other languages), then perform speech synthesis on the translation. Basically, a Babelfish that doesn't stick in the ear. Although real-time would be great, a 3- to 5-second delay is manageable.
RTranslator is awkward (couldn't get it to perform speech-to-speech using a single phone). 3PO sprouts errors like dandelions and requires an online connection.
Any suggestions?
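The pipeline described in the question above (recognize speech, translate the text, synthesize the result) can be sketched as three pluggable stages. This is only an illustrative outline: the function and parameter names (`speech_to_speech`, `recognize`, `translate`, `synthesize`, `budget_s`) are assumptions made for this sketch, and the stages in the demo are trivial stand-ins rather than calls into any of the projects linked above.

```python
import time
from dataclasses import dataclass


@dataclass
class Result:
    source_text: str
    translated_text: str
    audio: bytes
    seconds: float
    within_budget: bool


def speech_to_speech(audio_in, recognize, translate, synthesize, budget_s=5.0):
    """Chain three stages and report whether the total time fits the delay budget.

    Each stage is a plain callable supplied by the caller, so any
    recognition, translation, or synthesis backend can be swapped in.
    """
    start = time.perf_counter()
    text = recognize(audio_in)          # speech -> source-language text
    translated = translate(text)        # source text -> target-language text
    audio_out = synthesize(translated)  # target text -> audio bytes
    elapsed = time.perf_counter() - start
    return Result(text, translated, audio_out, elapsed, elapsed <= budget_s)


if __name__ == "__main__":
    # Stand-in stages for demonstration only; they do no real processing.
    result = speech_to_speech(
        b"raw-audio",
        recognize=lambda audio: "hello",
        translate=lambda text: text.upper(),
        synthesize=lambda text: text.encode("utf-8"),
    )
    print(result.translated_text, result.within_budget)
```

Timing the whole chain rather than each stage reflects the stated tolerance of a 3- to 5-second total delay.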
-
whisper-ctranslate2
Whisper command-line client, based on CTranslate2, that is compatible with the original OpenAI client.
-
whisper-playground
Build real-time speech2text web apps using OpenAI's Whisper https://openai.com/blog/whisper/
-
june
Local voice chatbot for engaging conversations, powered by Ollama, Hugging Face Transformers, and Coqui TTS Toolkit (by mezbaul-h)
Project mention: Show HN: Local Voice Assistant Using Ollama, Transformers and Coqui TTS Toolkit | news.ycombinator.com | 2024-06-20
Python speech-to-text related posts
-
Ask HN: Is Whisper Still Relevant?
-
Transcriber AI – Free, end-to-end machine based transcription with speaker id
-
Ask HN: Real-time speech-to-speech translation
-
RealTime STT
-
WhisperX: Precise ASR with Word-Level Timestamps and Diarization
-
WhisperX: Precise ASR with Word-Level Timestamps and Speaker Diarization
-
Whisper-WebUI
Index
What are some of the best open-source speech-to-text projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | faster-whisper | 14,037 |
| 2 | whisperX | 13,857 |
| 3 | pyvideotrans | 11,741 |
| 4 | speechbrain | 9,329 |
| 5 | SpeechRecognition | 8,587 |
| 6 | RealtimeSTT | 5,822 |
| 7 | SenseVoice | 4,439 |
| 8 | voice-pro | 3,173 |
| 9 | lingvo | 2,824 |
| 10 | LLaMA-Omni | 2,795 |
| 11 | whisper-asr-webservice | 2,297 |
| 12 | whisper-timestamped | 2,228 |
| 13 | kalliope | 1,718 |
| 14 | whisper-standalone-win | 1,601 |
| 15 | Dragonfire | 1,383 |
| 16 | dc_tts | 1,159 |
| 17 | quillman | 1,085 |
| 18 | StreamSpeech | 1,028 |
| 19 | whisper-ctranslate2 | 973 |
| 20 | AI-Waifu-Vtuber | 900 |
| 21 | nonoCAPTCHA | 896 |
| 22 | whisper-playground | 808 |
| 23 | june | 748 |