WhisperSpeech – An Open Source text-to-speech system built by inverting Whisper

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • WhisperSpeech

    An Open Source text-to-speech system built by inverting Whisper.

  • EmotiVoice

    EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine

  • Interested to see how it performs for Mandarin Chinese speech synthesis, especially with prosody and emotion. The highest quality open source model I've seen so far is EmotiVoice[0], which I've made a CLI wrapper around to generate audio for flashcards.[1] For EmotiVoice, you can apparently also clone your own voice with a GPU, but I have not tested this.[2]

    [0] https://github.com/netease-youdao/EmotiVoice

    [1] https://github.com/siraben/emotivoice-cli

    [2] https://github.com/netease-youdao/EmotiVoice/wiki/Voice-Clon...

  • emotivoice-cli

    CLI wrapper around Emotivoice TTS Synthesis

  • captioner

    Generate subtitles of videos in the browser (by Rodeoclash)

  • I'd be interested if you ever dig anything up for this. I hacked together a kind of crude tool to snapshot audio and translate / caption it on the fly:

    https://captioner.richardson.co.nz/

    I would very much like to improve on it, but live translation / captioning in this space still needs more work.

    Source was here: https://github.com/Rodeoclash/captioner

  • piper

    A fast, local neural text to speech system (by rhasspy)

  • If you're not already aware, the primary developer of Mimic 3 (and its non-Mimic predecessor Larynx) has continued TTS development with Piper, the renamed successor to Larynx: https://github.com/rhasspy/piper

    Last year Piper development was supported by Nabu Casa for their "Year of Voice" project for Home Assistant and it sounds like Mike Hansen is going to continue on it with their support this year.

  • WhisperLive

    A nearly-live implementation of OpenAI's Whisper.

  • Check out WhisperLive: https://github.com/collabora/WhisperLive

    If you're grappling with the slow march from cool tech demos to real-world language model apps, you might want to check out WhisperLive. It's an open-source project that uses Whisper models for live transcription — think real-time, on-the-fly translated captions for global meetups. It's a neat example of practical, user-focused tech in action. The details are on their GitHub page.

  • whisper

    Robust Speech Recognition via Large-Scale Weak Supervision

  • There is a plot of language performance on their repo: https://github.com/openai/whisper

    I am not aware of a multi-lingual leaderboard for speech recognition models.

  • I have forced alignments, too.

    E.g. for The True Story of Ah Q: https://github.com/Yorwba/LiteratureForEyesAndEars/tree/mast... — .align.json is my homegrown alignment format, .srt files are standard subtitles, and .txt is the text. Note that in some places I use [[original text||what it is pronounced as]] annotations to make the forced alignment work better (e.g. the "." in LibriVox.org, pronounced as 點 "diǎn" in Mandarin). Also, cmn-Hans is the same text transliterated into Simplified Chinese.

    The corresponding LibriVox URL is predictably https://librivox.org/the-true-story-of-ah-q-by-xun-lu/
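    The [[original text||pronunciation]] annotation scheme described above can be sketched as a small preprocessor. This is hypothetical code: only the double-bracket syntax and the LibriVox.org / 點 example come from the comment; the function names are illustrative, not from the repo.

    ```python
    import re

    # Matches [[original||pronounced]] annotations; both groups are
    # non-greedy so adjacent annotations on one line don't merge.
    ANNOTATION = re.compile(r"\[\[(.*?)\|\|(.*?)\]\]")

    def display_text(annotated: str) -> str:
        """Keep the original written form, dropping pronunciation hints."""
        return ANNOTATION.sub(r"\1", annotated)

    def alignment_text(annotated: str) -> str:
        """Keep the pronounced form, for feeding to the forced aligner."""
        return ANNOTATION.sub(r"\2", annotated)

    sample = "LibriVox[[.||點]]org"
    print(display_text(sample))    # LibriVox.org
    print(alignment_text(sample))  # LibriVox點org
    ```

    Splitting the two views this way lets the aligner see text that matches what the narrator actually says, while the display text stays faithful to the source.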

  • espnet

    End-to-End Speech Processing Toolkit

  • You might check out this list from espnet. They list the different corpuses they use to train their models sorted by language and task (ASR, TTS etc):

    https://github.com/espnet/espnet/blob/master/egs2/README.md

NOTE: The mention count for each project reflects how often it was referenced in common posts, plus user-suggested alternatives; a higher count indicates a more popular project.
