OpenVoice: Versatile Instant Voice Cloning

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • OpenVoice

    Instant voice cloning by MyShell.

  • Thanks for the link to the repo. It's very useful.

    As for the checkpoint, I'm not allergic and I don't do security theater:

    https://github.com/myshell-ai/OpenVoice?tab=readme-ov-file#i... links to

    https://myshell-public-repo-hosting.s3.amazonaws.com/checkpo...

  • awesome-ml

    Curated list of useful LLM / analytics / data science resources

  • This area is hardly new. Look at how old some of the projects are:

    https://github.com/underlines/awesome-ml/blob/master/audio-a...

    The thing that has changed is the complexity of running it. For fun, I trained my wife's voice and my own; it took 15 minutes of audio and 40 minutes of training on my 3080.

    Now it takes 2 minutes.

  • tortoise-tts

    A multi-voice TTS system trained with an emphasis on quality

  • I use Tortoise TTS. It's slow, a little clunky, and sometimes the output gets downright weird. But it's the best quality-oriented TTS I've found that I can run locally.

    https://github.com/neonbjb/tortoise-tts

  • piper

    A fast, local neural text to speech system (by rhasspy)

  • There isn't an ElevenLabs app like that, but I think that's the most expedient method, by far.

    (details and warning: in-depth, opinionated take, written almost for my own benefit, I've done a lot of work near here recently but haven't had to organize my thoughts until now)

    Why? Local inference is hard. You need two things: a clips -> voice model (which we have here, but it's bleeding edge), and a text + voice -> speech model.

    Text + voice to speech, locally, has excellent prior art for me, in the form of a Raspberry Pi-friendly ONNX inference library called [Piper](https://github.com/rhasspy/piper). I should just be able to copy that: about an afternoon of work!

    Except...when these models are trained, they encode plaintext to model input using a library called eSpeak. eSpeak is basically f(plaintext) => ints representing phonemes. It's a C library, written in a style I haven't seen in a while, and it depends on other C libraries. So I'd end up needing to port something like 20K lines of C to Dart...or I could use WASM, but over the last year I lost the ability to reason through how to get WASM running in Dart, both native and web.

    It's a really annoying technical problem: the speech models all use this eSpeak C library to turn plaintext => model input (tokenized phonemes).
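
To make the shape of that pipeline concrete, here is a minimal Python sketch of the phonemize-then-tokenize step these models expect. The tiny lexicon and phoneme-id table below are invented purely for illustration; a real pipeline would get the phoneme string from eSpeak/espeak-ng (directly or via a wrapper) instead of a hand-written lookup, and would use the id table shipped with the model.

```python
# Toy stand-in for eSpeak: plaintext -> phoneme symbols -> int ids.
# Real systems call into the eSpeak C library for the first step; this
# two-word lexicon is purely illustrative.
TOY_LEXICON = {
    "hello": ["h", "ə", "l", "oʊ"],
    "world": ["w", "ɜː", "l", "d"],
}

# Phoneme-to-id table of the kind a TTS model ships with (invented ids).
PHONEME_IDS = {p: i for i, p in enumerate(
    ["_", "h", "ə", "l", "oʊ", "w", "ɜː", "d", " "]
)}

def phonemize(text: str) -> list[str]:
    """Plaintext -> phoneme symbols (the part eSpeak actually does)."""
    phones: list[str] = []
    for word in text.lower().split():
        phones.extend(TOY_LEXICON.get(word, []))
        phones.append(" ")  # word boundary
    return phones[:-1] if phones else phones

def to_ids(phones: list[str]) -> list[int]:
    """Phoneme symbols -> int ids, i.e. the model's actual input tensor."""
    return [PHONEME_IDS[p] for p in phones]

ids = to_ids(phonemize("hello world"))
print(ids)  # a short list of small ints, ready to feed a TTS model
```

The hard part being ported or wrapped is `phonemize`: in practice it is 20K lines of C handling many languages, stress marks, and edge cases, which is why swapping it out is not an afternoon of work.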

    Re: ElevenLabs

    I had looked into the API months ago and vaguely remembered it was _very_ complete.

    I spent the last hour or two playing with it, and reconfirmed that. They have enough API surface that you could build an API that took voice recordings, created a voice, and then did POSTs / socket connection to get audio data from that voice at will.

    The only issue IMHO is pricing: $0.18 per 1,000 characters. :/ But this is something I feel very comfortable saying wouldn't be _that_ much work to build and open source with a "bring your own API key" type thing. I had forgotten about ElevenLabs until your post, which made me realize there's an actually meaningful and quite moving use case for it.
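
A sketch of the kind of "bring your own API key" wrapper described above, using only the standard library. The endpoint path, header name, and request body reflect my reading of ElevenLabs' public text-to-speech API and may have changed; treat them as assumptions to verify against the current docs. The cost estimate uses the $0.18 per 1,000 characters rate quoted above.

```python
import json
from urllib import request

API_BASE = "https://api.elevenlabs.io/v1"  # assumed base URL; verify against docs
PRICE_PER_1K_CHARS = 0.18  # the $0.18 / 1,000 characters quoted above

def estimate_cost(text: str) -> float:
    """Estimated spend for synthesizing `text` at the quoted rate."""
    return len(text) * PRICE_PER_1K_CHARS / 1000

def build_tts_request(voice_id: str, text: str, api_key: str) -> request.Request:
    """Build (but don't send) a text-to-speech POST for a cloned voice.

    The path and `xi-api-key` header follow ElevenLabs' API as I understood
    it; double-check both before relying on this.
    """
    body = json.dumps({"text": text}).encode("utf-8")
    return request.Request(
        url=f"{API_BASE}/text-to-speech/{voice_id}",
        data=body,
        headers={
            "xi-api-key": api_key,  # user-supplied: "bring your own API key"
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Usage: build the request, check the cost, then send it yourself with
# urllib.request.urlopen(req) to get audio bytes back.
req = build_tts_request("your-voice-id", "Hello there.", "sk-...")
print(req.full_url, estimate_cost("Hello there."))
```

The actual app around this is mostly plumbing: record clips, POST them to create a voice, then stream synthesized audio from requests like the one built here, which is why the commenter estimates it wouldn't be much work.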

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
