WhisperSpeech

Mentions: 5 | Stars: 3,329 | Activity: 9.2 | Language: Jupyter Notebook

An Open Source text-to-speech system built by inverting Whisper.

EmotiVoice

Mentions: 5 | Stars: 6,303 | Activity: 8.9 | Language: Python

EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine

Interested to see how it performs for Mandarin Chinese speech synthesis, especially with prosody and emotion. The highest quality open source model I've seen so far is EmotiVoice[0], which I've made a CLI wrapper around to generate audio for flashcards.[1] For EmotiVoice, you can apparently also clone your own voice with a GPU, but I have not tested this.[2]
[0] https://github.com/netease-youdao/EmotiVoice
[1] https://github.com/siraben/emotivoice-cli
[2] https://github.com/netease-youdao/EmotiVoice/wiki/Voice-Clon...

emotivoice-cli

Mentions: 1 | Stars: 5 | Activity: 6.2 | Language: JavaScript

CLI wrapper around Emotivoice TTS Synthesis

captioner

Mentions: 1 | Stars: 4 | Activity: 10.0 | Language: TypeScript

Generate subtitles of videos in the browser (by Rodeoclash)

I'd be interested if you ever dig anything up for this. I hacked together a kind of crude tool to snapshot audio and translate / caption it on the fly:
https://captioner.richardson.co.nz/
I would very much like to improve on this, but live translation / captioning still has some way to go.
Source was here: https://github.com/Rodeoclash/captioner

piper

Mentions: 33 | Stars: 3,902 | Activity: 8.9 | Language: C++

A fast, local neural text to speech system (by rhasspy)

If you're not already aware, the primary developer of Mimic 3 (and its non-Mimic predecessor Larynx) continued TTS-related development with Larynx and the renamed project Piper: https://github.com/rhasspy/piper
Last year, Piper development was supported by Nabu Casa for their "Year of Voice" project for Home Assistant, and it sounds like Mike Hansen is going to continue working on it with their support this year.

WhisperLive

Mentions: 4 | Stars: 1,180 | Activity: 9.4 | Language: Python

A nearly-live implementation of OpenAI's Whisper.

Check out WhisperLive: https://github.com/collabora/WhisperLive
If you're grappling with the slow march from cool tech demos to real-world language model apps, you might want to check out WhisperLive. It's an open-source project that uses Whisper models for live transcription. Think real-time, on-the-fly translated captions for global meetups. It's a neat example of practical, user-focused tech in action. The details are on their GitHub page.

whisper

Mentions: 343 | Stars: 60,303 | Activity: 6.4 | Language: Python

Robust Speech Recognition via Large-Scale Weak Supervision

There is a plot of language performance on their repo: https://github.com/openai/whisper
I am not aware of a multi-lingual leaderboard for speech recognition models.

LiteratureForEyesAndEars

Mentions: 1 | Stars: 0 | Activity: 4.7 | Language: Python

I have forced alignments, too.
E.g. for The True Story of Ah Q: https://github.com/Yorwba/LiteratureForEyesAndEars/tree/mast...
.align.json is my homegrown alignment format, .srt files are standard subtitles, and .txt is the text. Note that in some places I have [[original text||what it is pronounced as]] annotations to make the forced alignment work better (e.g. the "." in LibriVox.org, pronounced as 點 "diǎn" in Mandarin). Oh, and cmn-Hans is the same thing transliterated into Simplified Chinese.
The corresponding LibriVox URL is predictably https://librivox.org/the-true-story-of-ah-q-by-xun-lu/
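
For readers who want to work with the [[original text||what it is pronounced as]] annotations described above, here is a minimal sketch of how they could be split into a display version and a spoken version. It is not the repository's own code: the function names `display_text` and `spoken_text` are hypothetical, and it assumes annotations are never nested and never contain "||" or "]]" inside either half.

```python
import re

# Matches one [[original text||what it is pronounced as]] annotation.
# Assumption: no nesting, and neither half contains "||" or "]]".
ANNOTATION = re.compile(r"\[\[(.*?)\|\|(.*?)\]\]")


def display_text(annotated: str) -> str:
    """Replace each annotation with its original text (what a reader sees)."""
    return ANNOTATION.sub(lambda m: m.group(1), annotated)


def spoken_text(annotated: str) -> str:
    """Replace each annotation with its pronunciation (what the audio says)."""
    return ANNOTATION.sub(lambda m: m.group(2), annotated)


if __name__ == "__main__":
    line = "LibriVox[[.||點]]org"
    print(display_text(line))
    print(spoken_text(line))
```

The spoken version would be the one fed to a forced aligner, while the display version would be the one shown in subtitles.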

espnet

Mentions: 15 | Stars: 7,872 | Activity: 10.0 | Language: Python

End-to-End Speech Processing Toolkit

You might check out this list from espnet. They list the different corpora they use to train their models, sorted by language and task (ASR, TTS, etc.):
https://github.com/espnet/espnet/blob/master/egs2/README.md

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
