Top 23 Speech Open-Source Projects

MockingBird

9 33,904 5.8 Python

🚀 AI voice cloning: clone a voice in 5 seconds and generate arbitrary speech in real-time

TTS

232 29,631 9.4 Python

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production

Project mention: Ask HN: Open-source, local Text-to-Speech (TTS) generators | news.ycombinator.com | 2024-05-07

I just noticed that https://coqui.ai/ is "Shutting down".
I'm building a web app (React / Django) which takes a list of affirmations & goals (in Markdown files), puts them into a database (SQLite), and uses voice synthesis to create voice audio files of the phrases. These are combined with a relaxed backing track (ffmpeg), made into playlists of 10-20 phrases (randomly sampled, or according to a theme: "mind" "body" "soul") and then played automatically in the morning & evening (cron). This allows you to persistently hear & vocalize your own goals & good vibes over time.
I had been planning to use Coqui TTS as the local text-to-speech engine, but with this cancellation, I'd love to hear from the community what is a great open-source, local text-to-speech engine?
Generally, I learn both the highest quality commercially available technology (example: ElevenLabs), and also the best open-source equivalent. Would love to hear suggestions & perspectives on this. What voice synth tools are you investing your time into learning & building with?
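
The pipeline described in this comment (sample 10-20 phrases, optionally by theme, then mix each voice file over a backing track with ffmpeg) could be sketched roughly as follows. This is only an illustration under assumptions: the phrase data, the file names, and the `sample_playlist` and `build_mix_command` helpers are hypothetical, and the TTS step is left out because the comment does not settle on an engine. The ffmpeg command is built but not executed.

```python
import random

# Hypothetical phrase store: theme -> phrases (the comment keeps these in SQLite).
PHRASES = {
    "mind": [f"mind affirmation {i}" for i in range(30)],
    "body": [f"body affirmation {i}" for i in range(30)],
    "soul": [f"soul affirmation {i}" for i in range(30)],
}


def sample_playlist(theme=None, rng=None):
    """Pick 10-20 distinct phrases, from one theme or from all themes."""
    rng = rng or random.Random()
    if theme is None:
        pool = [p for phrases in PHRASES.values() for p in phrases]
    else:
        pool = PHRASES[theme]
    size = rng.randint(10, min(20, len(pool)))
    return rng.sample(pool, size)


def build_mix_command(voice_path, backing_path, out_path):
    """Build (not run) an ffmpeg command that mixes a voice clip over a backing track."""
    return [
        "ffmpeg", "-y",
        "-i", voice_path,
        "-i", backing_path,
        # amix mixes both inputs; duration=first stops when the voice clip ends.
        "-filter_complex", "[0:a][1:a]amix=inputs=2:duration=first[out]",
        "-map", "[out]",
        out_path,
    ]


playlist = sample_playlist("mind", random.Random(0))
command = build_mix_command("phrase_001.wav", "backing.mp3", "mixed_001.mp3")
```

Passing `command` to `subprocess.run(command, check=True)` would perform the mix, and a cron entry could then trigger playback in the morning and evening as the comment describes.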

datasets

15 18,480 9.5 Python

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Project mention: 🐍🐍 23 issues to grow yourself as an exceptional open-source Python expert 🧑‍💻 🥇 | dev.to | 2023-10-19

Kaldi Speech Recognition Toolkit

22 13,768 6.7 Shell

kaldi-asr/kaldi is the official location of the Kaldi project.

Project mention: Amazon plans to charge for Alexa in June–unless internal conflict delays revamp | news.ycombinator.com | 2024-01-20

Yeah, whisper is the closest thing we have, but even it requires more processing power than is present in most of these edge devices in order to feel smooth. I've started a voice interface project on a Raspberry Pi 4, and it takes about 3 seconds to produce a result. That's impressive, but not fast enough for Alexa.
From what I gather a Pi 5 can do it in 1.5 seconds, which is closer, so I suspect it's only a matter of time before we do have fully local STT running directly on speakers.
> Probably anathema to the space, but if the devices leaned into the ~five tasks people use them for (timers, weather, todo list?) could probably tighten up the AI models to be more accurate and/or resource efficient.
Yes, this is the approach taken by a lot of streaming STT systems, like Kaldi [0]. Rather than use a fully capable model, you train a specialized one that knows what kinds of things people are likely to say to it.
[0] http://kaldi-asr.org/
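
The idea in this comment, restricting recognition to the handful of things people actually say, can be illustrated with a toy post-processing step. This is not how Kaldi works internally (there the restriction is part of the model and decoding setup); the sketch below only snaps an already-recognized, possibly noisy transcript to the closest entry in a small command list. The command list and the `snap_to_command` helper are hypothetical.

```python
import difflib

# Hypothetical command set for a timer/weather/todo style device.
COMMANDS = [
    "set a timer for five minutes",
    "what is the weather today",
    "add milk to my todo list",
    "stop the timer",
]


def snap_to_command(transcript, cutoff=0.6):
    """Return the closest allowed command, or None if nothing is close enough."""
    matches = difflib.get_close_matches(
        transcript.lower(), COMMANDS, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


# A slightly misheard transcript still resolves to a known command.
recognized = snap_to_command("set a time for five minute")
```

Because the output space is tiny, small recognition errors can be absorbed cheaply; anything that is not close to a known command is rejected instead of being guessed at.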

Grounded-Segment-Anything

11 13,615 8.0 Jupyter Notebook

Grounded-SAM: Marrying Grounding-DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment and Generate Anything

Project mention: Tooling for bulk image data set manipulation? | /r/computervision | 2023-06-27

AudioGPT

4 9,788 3.7 Python

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

whisperX

24 9,173 8.4 Python

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

Project mention: Easy video transcription and subtitling with Whisper, FFmpeg, and Python | news.ycombinator.com | 2024-04-06

It uses this, which does support diarization: https://github.com/m-bain/whisperX

TTS

62 8,845 0.0 Jupyter Notebook

🤖 💬 Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts) (by mozilla)

Project mention: Coqui.ai Is Shutting Down | news.ycombinator.com | 2024-01-03

Coqui-ai was a commercial continuation of Mozilla TTS and STT (https://github.com/mozilla/TTS).
At the time (2018-ish), it was really impressive for on-device voice synthesis (with a quality approaching the Google and Azure cloud-based voice synthesis options) and open source, so a lot of people in the FOSS community were hoping it could be used for a privacy-respecting home assistant, Linux speech synthesis that doesn't suck, etc.
After Mozilla abandoned the project, Coqui continued development and had some really impressive one-shot voice cloning, but pivoted to marketing speech synthesis for game developers. They were probably having trouble monetizing it, and it doesn't surprise me that they shut down.
An equivalent project that's still in active development and doing really well is Piper TTS (https://github.com/rhasspy/piper).

annyang

2 6,547 0.0 JavaScript

💬 Speech recognition for your site

EmotiVoice

5 6,330 8.9 Python

EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine

Project mention: FLaNK Stack Weekly 12 February 2024 | dev.to | 2024-02-12

modelscope

3 6,099 9.6 Python

ModelScope: bring the notion of Model-as-a-Service to life.

Project mention: FLaNK Stack Weekly for 20 June 2023 | dev.to | 2023-06-20

Model as a Service https://github.com/modelscope/modelscope

silero-models

32 4,584 4.7 Jupyter Notebook

Silero Models: pre-trained speech-to-text, text-to-speech and text-enhancement models made embarrassingly simple

Project mention: Weird A.I. Yankovic, a cursed deep dive into the world of voice cloning | news.ycombinator.com | 2023-10-02

I doubt it's currently actually "the best open source text to speech", but the answer I came up with when throwing a couple of hours at the problem some months ago was "Silero" [0, 1].
Following the "standalone" guide [2], it was pretty trivial to make the model render my sample text in about 100 English "voices" (many of which were similar to each other, and in varying quality). Sampling those, I got about 10 that were pretty "good". And maybe 6 that were the "best ones" (pretty natural, not annoying to listen to).
IIRC the license was free for noncommercial use only. I'm not sure exactly "how open source" they are, but it was simple to install the dependencies and write the basic Python to try it out; I had to write a for loop to try all the voices like I wanted. I ended up using something else for the project for other reasons, but this could still be a fairly good backup option for some use cases IMO.
  [0] https://github.com/snakers4/silero-models#text-to-speech
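
The "for loop to try all the voices" mentioned in this comment might look something like the sketch below. It deliberately does not call Silero: `render_placeholder` is a hypothetical stand-in that returns silence, and the voice names are invented. With a real engine, `render` would wrap that engine's own synthesis call and the voice list would come from the loaded model.

```python
import tempfile
import wave
from pathlib import Path


def render_placeholder(text, voice, sample_rate=16000):
    """Hypothetical stand-in for a TTS call: returns 0.1 s of 16-bit silence."""
    return b"\x00\x00" * (sample_rate // 10)


def render_all_voices(text, voices, out_dir, render=render_placeholder,
                      sample_rate=16000):
    """Write one mono WAV per voice so the results can be compared by ear."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for voice in voices:
        pcm = render(text, voice, sample_rate)
        path = out_dir / f"{voice}.wav"
        with wave.open(str(path), "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)  # 16-bit samples
            wav.setframerate(sample_rate)
            wav.writeframes(pcm)
        paths.append(path)
    return paths


# Demo run in a temporary directory with two invented voice names.
with tempfile.TemporaryDirectory() as tmp:
    written = [p.name for p in render_all_voices("Hello there.", ["voice_0", "voice_1"], tmp)]
```

Writing one file per voice makes it easy to shortlist the few that sound natural, which is the sampling process the comment describes.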

lingvo

1 2,777 8.5 Python

Lingvo

aeneas

4 2,379 0.0 Python

aeneas is a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment)

gTTS

3 2,149 7.3 Python

Python library and CLI tool to interface with Google Translate's text-to-speech API

Project mention: Using Groq to Build a Real-Time Language Translation App | dev.to | 2024-04-05

For our real-time TTS needs, we'll employ the fantastic library called gTTS.

whisper-diarization

5 2,066 6.8 Jupyter Notebook

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper

Project mention: MacWhisper: Transcribe audio files on your Mac | news.ycombinator.com | 2023-08-23

https://github.com/MahmoudAshraf97/whisper-diarization
This project has been alright for transcribing audio with speaker diarization. A bit finicky. The OpenAI model is better than other paid products (Descript, Riverside) so I’m looking forward to trying MacWhisper.

Amazing-Python-Scripts

2 2,024 9.7 Jupyter Notebook

🚀 Curated collection of Amazing Python scripts from Basic to Advanced with automation task scripts.

Project mention: Top 10 GitHub Repositories for Python and Java Developers | dev.to | 2024-05-03

7. Avinashkranjan/Amazing-Python-Scripts - A collection of innovative Python scripts for tasks like web scraping and automation can be found in this repository. It's a great source of inspiration for creating projects. https://github.com/avinashkranjan/Amazing-Python-Scripts

DeepFilterNet

10 1,952 8.9 Python

Noise suppression using deep filtering

Project mention: Anyone know of a good TTS pipeline for raw speech data? | /r/AudioAI | 2023-10-03

You mean remove background noise and transcribe? Then you can use DeepFilterNet to remove noise, and Whisper to transcribe.
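
The two-step approach suggested here (denoise first, then transcribe) could be wired together as in the sketch below. It assumes the `deepfilternet` and `openai-whisper` packages are installed; the calls inside the two default helpers follow those projects' README examples as best understood and should be checked against the installed versions. The helper names, the `_clean.wav` naming, and the `base` model choice are illustrative assumptions. The third-party imports are done lazily so the pipeline function can also be exercised with stand-ins.

```python
from pathlib import Path


def denoise_with_deepfilternet(in_path, out_path):
    """Denoise a file with DeepFilterNet's Python API (third-party, lazy import)."""
    from df.enhance import enhance, init_df, load_audio, save_audio

    model, df_state, _ = init_df()
    audio, _ = load_audio(in_path, sr=df_state.sr())
    save_audio(out_path, enhance(model, df_state, audio), df_state.sr())
    return out_path


def transcribe_with_whisper(path, model_name="base"):
    """Transcribe a file with the openai-whisper package (third-party, lazy import)."""
    import whisper

    return whisper.load_model(model_name).transcribe(path)["text"]


def clean_and_transcribe(in_path, denoise=denoise_with_deepfilternet,
                         transcribe=transcribe_with_whisper):
    """Denoise first, then transcribe the cleaned file."""
    source = Path(in_path)
    cleaned_path = str(source.with_name(source.stem + "_clean.wav"))
    cleaned = denoise(in_path, cleaned_path)
    return transcribe(cleaned)
```

Keeping the two steps as swappable callables means either tool can be replaced, for example by a different denoiser, without touching the pipeline itself.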

julius

1 1,778 2.7 C

Open-Source Large Vocabulary Continuous Speech Recognition Engine (by julius-speech)

soloud

4 1,655 0.0 C

Free, easy, portable audio engine for games

whisper-timestamped

2 1,547 8.1 Python

Multilingual Automatic Speech Recognition with word-level timestamps and confidence

Project mention: Show HN: AI Dub Tool I Made to Watch Foreign Language Videos with My 7-Year-Old | news.ycombinator.com | 2024-02-28

Yes. But Whisper's word-level timings are actually quite inaccurate out of the box. There are some Python libraries that mitigate that. I tested several of them. whisper-timestamped seems to be the best one. [0]
[0] https://github.com/linto-ai/whisper-timestamped
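
Word-level timings matter for subtitling and dubbing because cues are built directly from them. As a rough sketch, assuming each word arrives as a dict with `text`, `start`, and `end` keys (seconds), timed words can be grouped into short SRT cues like this. The `words_to_srt` helper, the four-words-per-cue limit, and the sample data are illustrative only.

```python
def srt_time(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, max_words=4):
    """Group timed words into short subtitle cues and render them as SRT text."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w["text"] for w in chunk)
        start = srt_time(chunk[0]["start"])
        end = srt_time(chunk[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Invented sample data in the assumed shape.
words = [
    {"text": "Hello", "start": 0.0, "end": 0.42},
    {"text": "world", "start": 0.5, "end": 1.0},
]
subtitles = words_to_srt(words)
```

If the underlying word timings drift, every cue boundary drifts with them, which is why the comment's point about timestamp accuracy is significant for this kind of tool.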

ai-audio-startups

1 1,454 6.7

Community list of startups working with AI in audio and music technology

praat

4 1,386 9.4 C++

Praat: Doing Phonetics By Computer

Project mention: Praat: Doing Phonetics by Computer | news.ycombinator.com | 2024-04-18

This brings back memories.
I worked my way through some of its source code many years ago during my post-graduate studies and it was very _strange_. I see it is now on GitHub [0].
They used C macros to implement object oriented programming, with symbols like `me` and `my` and `thee` scattered throughout the source code. It seems the code has been converted to C++ (IIRC it used to be in C), but I still see the `my` keyword in there.
They have their own BASIC-like scripting language. The weirdest property for me was that it allowed for whitespace in the identifiers. Just look at the example in [1]: The `Create simple Matrix` is actually a function in the scripting language that constructs a matrix object. The function name corresponds to a menu item and IIRC they used some more preprocessor magic to reuse the same code for the menus on the GUI and the functions in the scripting language.
I don't think you're supposed to write the scripts by hand. Rather it recorded your actions as you worked your way through the GUI and then you could export and modify those recordings as scripts.
They also implemented their own cross platform GUI toolkit rather than using one of the existing cross-platform GUI toolkits, so it works on Windows, Linux (or any X Windows I believe) and MacOS.
[0]: https://github.com/praat/praat


NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Speech related posts

Praat: Doing Phonetics by Computer

1 project | news.ycombinator.com | 18 Apr 2024

Easy video transcription and subtitling with Whisper, FFmpeg, and Python

1 project | news.ycombinator.com | 6 Apr 2024

Using Groq to Build a Real-Time Language Translation App

3 projects | dev.to | 5 Apr 2024

OpenAI deems its voice cloning tool too risky for general release

1 project | news.ycombinator.com | 31 Mar 2024

SOTA ASR Tooling: Long-Form Transcription

1 project | news.ycombinator.com | 31 Mar 2024

Deploying whisperX on AWS SageMaker as Asynchronous Endpoint

2 projects | dev.to | 31 Mar 2024

Show HN: AI Dub Tool I Made to Watch Foreign Language Videos with My 7-Year-Old

1 project | news.ycombinator.com | 28 Feb 2024

Index

What are some of the best open-source Speech projects? This list will help you:

#	Project	Stars
1	MockingBird	33,904
2	TTS	29,631
3	datasets	18,480
4	Kaldi Speech Recognition Toolkit	13,768
5	Grounded-Segment-Anything	13,615
6	AudioGPT	9,788
7	whisperX	9,173
8	TTS	8,845
9	annyang	6,547
10	EmotiVoice	6,330
11	modelscope	6,099
12	silero-models	4,584
13	lingvo	2,777
14	aeneas	2,379
15	gTTS	2,149
16	whisper-diarization	2,066
17	Amazing-Python-Scripts	2,024
18	DeepFilterNet	1,952
19	julius	1,778
20	soloud	1,655
21	whisper-timestamped	1,547
22	ai-audio-startups	1,454
23	praat	1,386
