NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • tortoise-tts

    A multi-voice TTS system trained with an emphasis on quality

  • > I wonder if a style-transfer style algorithm could be used to map the intent of a sentence to a simulated voice.

    There's definitely research and proprietary software that lets a person speaking in a desired manner use their voice to control the expression of the generated speech.

    Here's a related issue on an Open Source text-to-speech project which I only learned of today: https://github.com/neonbjb/tortoise-tts/issues/34#issue-1229...

    > I tend to view most of these things through the perspective of what would help mod-makers for video games

    Yeah, I think there's some really cool potential for indie creatives to have access to (even lower quality) voice simulation--for use in everything from the initial writing process (I find it quite interesting how engaging it is to hear one's words if that's going to be the final form--and even synthesis artifacts can prompt an emotion or thought to develop); to placeholder audio; and, even final audio in some cases.

    > (and I suspect various open source voice sample sets would become pretty popular).

    That's definitely a powerful enabler for Free/Open Source speech systems. There's a list of current data sets for speech at the "Open Speech and Language Resources" site: https://openslr.org/resources.php

    Encouraging people to provide their voice for Public Domain/Open Source use does come with some ethical aspects that I think people need to be made aware of so they can make informed decisions about it.

    Given your interest in this topic you might be interested in this (rough) tool I finally released last week: https://rancidbacon.itch.io/dialogue-tool-for-larynx-text-to...

  • larynx

    [Discontinued] End-to-end text to speech system using gruut and onnx

  • I imagine that our concept of what a villain sounds like tends to be extremely personally biased but here's a couple of options [Advisory: Contains threatening language.]:

    * http://www.sndup.net/p33q

    * http://www.sndup.net/sppn

    I created these samples in a relatively short time using the Free/Open Source (which I think is an important factor for indies) text-to-speech project Larynx and a narrative editor I finally released the other weekend:

    * https://github.com/rhasspy/larynx/

    * https://rancidbacon.itch.io/dialogue-tool-for-larynx-text-to...

    Now, I would really like to link you directly to audio of the next two but considering it's currently in beta behind an (automated response) email address, I think that may not be appropriate, so, instead...

    * Visit & get access to the beta here: https://mycroft.ai/blog/mimic-3-preview/

    * Copy & paste this SSML into the form: https://pastebin.com/Bwd7LCbj

    It's definitely a noticeable step up again in quality.

    There's an alternate pair of voices if you move the "_" from one "name" attribute to the other in each "voice" element.

    I intentionally didn't edit the text to remove some of the artifacts both to give a realistic impression of the current state & because sometimes they add interesting texture. :)

    Note the beta voices are "low" quality.
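
For readers unfamiliar with SSML, here is a minimal sketch of what a two-voice document with "voice" elements and "name" attributes might look like, built with Python's standard library. It does not reproduce the pastebin linked above: the voice names `example_voice_a` and `example_voice_b` and the sample lines are placeholders chosen for illustration, and the helper `build_ssml` is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_ssml(lines):
    """Build an SSML string from (voice_name, text) pairs.

    Each pair becomes one <voice name="..."> element holding one <s> sentence.
    """
    speak = ET.Element("speak")
    for voice_name, text in lines:
        voice = ET.SubElement(speak, "voice", {"name": voice_name})
        sentence = ET.SubElement(voice, "s")
        sentence.text = text
    return ET.tostring(speak, encoding="unicode")


# Placeholder voice names (assumptions): substitute names your TTS system provides.
dialogue = [
    ("example_voice_a", "You should not have come here."),
    ("example_voice_b", "And yet, here I am."),
]

ssml = build_ssml(dialogue)
print(ssml)
```

Swapping which voice speaks which line only requires changing the "name" attribute on each "voice" element, which is the kind of edit the comment describes.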

  • TTS

    πŸΈπŸ’¬ - a deep learning toolkit for Text-to-Speech, battle-tested in research and production

  • I agree so much that I've started learning ML to make a decent open source, multilingual TTS that works on smartphones.

    But really, the situation is pretty good, with a lot of code and datasets available as open source. Notably, if you're not constrained to smartphones and the like, you can run quite a number of modern models on your computer; see for instance https://github.com/coqui-ai/TTS/ (which itself contains many different models).

    The work that needs to be done is """just""" to turn those models into something suitable for smartphones (which will most likely include re-training), and to plug them back into Android's TTS API.

  • opentts

    Open Text to Speech Server

  • If you've not already encountered them I'd definitely encourage you to check out these Free/Open Source projects too:

    * Larynx: https://github.com/rhasspy/larynx/

    * OpenTTS: https://github.com/synesthesiam/opentts

    * Likely Mimic3 in the near future: https://mycroft.ai/blog/mimic-3-preview/

    Larynx in particular focuses on "faster than real-time" synthesis, while OpenTTS is an attempt to package all Free/Open Source text-to-speech systems and provide a common REST API to them, so the FLOSS ecosystem can build on previous work supported by short-lived business interests rather than start from scratch every time.

    AIUI the developer of the first two projects now works for Mycroft AI & is involved in the development of Mimic3 which seems very promising given how much of an impact on quality his solo work has had in just the past couple of years or so.
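
To make the "common REST API" idea concrete, here is a minimal sketch that only builds a request URL for a locally running OpenTTS server; it makes no network call. The base URL `http://localhost:5500`, the `/api/tts` path, the `voice` and `text` query parameters, and the `espeak:en` voice identifier are assumptions based on my recollection of the OpenTTS README, so check them against the project's documentation. The helper name `opentts_url` is hypothetical.

```python
from urllib.parse import urlencode


def opentts_url(text, voice, base_url="http://localhost:5500"):
    """Return a GET URL asking an OpenTTS-style server to synthesize `text`.

    Assumption: the server exposes /api/tts with `voice` and `text`
    query parameters. Verify against the OpenTTS documentation.
    """
    query = urlencode({"voice": voice, "text": text})
    return f"{base_url}/api/tts?{query}"


# "espeak:en" is an assumed voice identifier in "system:voice" form.
url = opentts_url("Hello world", "espeak:en")
print(url)
```

Because every backend sits behind the same endpoint shape, switching TTS systems would only mean changing the `voice` argument, which is the appeal of the packaging approach described above.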

  • Thorsten-Voice

    Thorsten-Voice: A free to use, offline working, high quality German TTS voice should be available for every project without any license struggling.

  • For German users, I can recommend taking a look at

    https://www.thorsten-voice.de/

    https://github.com/thorstenMueller/Thorsten-Voice

    where someone contributed a huge set of his voice samples and a tutorial / script collection to build a pretty decent TTS model LOCALLY.

    Quality-wise it is not that good, but it's free and pretty easy to follow for a tech enthusiast.

  • TensorFlowTTS

    😝 TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for Tensorflow 2 (supported including English, French, Korean, Chinese, German and Easy to adapt for other languages)

  • I had a lot of success using [FastSpeech2 + MB MelGAN via TensorFlowTTS](https://github.com/TensorSpeech/TensorFlowTTS). There are demos for [iOS](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/ex...) and [Android](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/ex...) which will allow you to run pretty convincing, modern TTS models with only a few hundred milliseconds of processing latency.

  • vosk-api

    Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node

  • In case it's of interest, when I last explored this topic in terms of the Free/Open Source ecosystem I was very impressed with how well VOSK-API performed: https://github.com/alphacep/vosk-api

    Here's another project that builds on top of VOSK to provide a tighter integration with Linux: https://github.com/ideasman42/nerd-dictation

  • nerd-dictation

    Simple, hackable offline speech to text - using the VOSK-API.

