Silero V3: fast high-quality text-to-speech in 20 languages with 173 voices

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • silero-models

    Silero Models: pre-trained speech-to-text, text-to-speech and text-enhancement models made embarrassingly simple

  • Can you elaborate further? I am not familiar with the field, but their benchmarks here seem to show quality similar to Google: https://github.com/snakers4/silero-models/wiki/Quality-Bench...

    The only trick I can see being played is that Google was benchmarked in September 2020, so it has likely improved since then and they don't want to show that. Is CommonVoice a better standard to use when comparing these tools?

  • Voice-Cloning-App

    Discontinued. A Python/PyTorch app for easily synthesising human voices

  • Nice to see this here - Silero is also the engine that powers the "dataset builder" for Voice-Cloning-App (https://github.com/BenAAndrew/Voice-Cloning-App), a GUI TTS system that modifies Tacotron2 slightly.

    Just sharing the links in case others are new to the space and keen to tinker on some solid open-source offerings.

  • ttsprech

    Simple text2speech for the command line

  • Source is pretty much "embarrassingly simple":

       https://github.com/Grumbel/silero-test/blob/master/silero-test

  • razdel

    Rule-based token and sentence segmentation for the Russian language

  • Also, we have currently abandoned batching, so GPUs are not really required at all.

    > the quality (as in: what I'm hearing, not a formally measured metric) is good but (YMMV) not as good as turtle.

    I believe the compute required during training and inference … may differ by 3 or 4 orders of magnitude (!).

    Also note that some speakers and languages just sound better due to the high quality of the source material and the amount of work and polish invested.

    > it breaks with strange error messages if the text you feed it is too long

    Well, there should be a warning somewhere, but it works with text no longer than 512-1024 characters.

    > there is mention of "a model for text repunctuation and recapitalization", which I wonder if it could be used to break a very long text (eg a book) into pieces that can be digested by the tts engine

    This model only restores some punctuation marks and capital letters.

    There are libraries like razdel for this - https://github.com/natasha/razdel
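
As a rough illustration of the splitting idea discussed above, here is a minimal sketch that packs whole sentences into chunks that stay under a character limit, so that each chunk could be fed to a TTS engine separately. The function names `split_sentences` and `chunk_text` and the default limit of 512 characters are assumptions made for this example; they are not part of Silero or razdel. The regex splitter is only a naive stand-in for a proper sentence segmenter such as razdel.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (hypothetical stand-in for a real segmenter).

    Splits on whitespace that follows '.', '!' or '?'. A rule-based
    library such as razdel would handle abbreviations and edge cases better.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_text(text, max_chars=512):
    """Pack whole sentences into chunks of at most max_chars characters.

    The 512 default is an assumption taken from the lower bound
    mentioned in the thread, not a documented limit.
    """
    chunks = []
    current = ""
    for sentence in split_sentences(text):
        # A single sentence longer than the limit is hard-wrapped.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The sentence does not fit; close the current chunk first.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk could then be synthesised on its own and the resulting audio concatenated. Swapping `split_sentences` for a real segmenter should not require changes to `chunk_text`.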
