common-voice vs TTS

| | common-voice | TTS |
|---|---|---|
| Mentions | 67 | 242 |
| Stars | 3,380 | 41,267 |
| Growth | 0.2% | 2.3% |
| Activity | 10.0 | 8.1 |
| Last Commit | 4 days ago | 11 months ago |
| Language | TypeScript | Python |
| License | Mozilla Public License 2.0 | Mozilla Public License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
common-voice
- A CC-By Open-Source TTS Model with Voice Cloning
Yeah there's no chance Mozilla would do anything like this:
https://commonvoice.mozilla.org/
- OpenAI's Whisper is another case study in Colonisation
Mozilla's Common Voice Project (https://commonvoice.mozilla.org/) is creating an open dataset for many minority languages to make it easier to support them in STT systems. If you speak one of these languages, please consider donating a few minutes of your voice.
- Mozilla Launching a Public Voice Dataset
- Common Voice
> it was not at all obvious to me there was some way of speeding up getting a language in the first place.
Yeah, that's the biggest failing of Common Voice in my opinion. Getting a new language up to speed could be much improved by simply adding a few links to documentation, but even the existing links are broken, which I reported in March 2022... https://github.com/common-voice/common-voice/issues/3637
> I have no interest in wasting time contributing to a UI translation I actively don't want to be subjected to
Translating the UI may still help you get other people to record, even if you don't want to use it yourself.
> I'll see if I can submit some sentences at least
If you want to go faster, there's also a project to extract sentences from Wikipedia etc. in small doses that Mozilla's lawyers and Wikimedia's lawyers have agreed are fair use. I think you'd only need to define how Norwegian BokmƄl separates sentences (e.g. split after a period, but not when the period ends a common abbreviation like "etc.").
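That boundary rule could be prototyped in a few lines. The sketch below is only illustrative (the abbreviation list and function name are made up, not part of the Common Voice sentence extractor):

```python
import re

# Illustrative, incomplete set of BokmƄl abbreviations whose trailing
# period should NOT be treated as a sentence boundary.
ABBREVIATIONS = {"bl.a", "ca", "dvs", "etc", "f.eks", "jf", "m.m", "nr", "osv"}

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace,
    unless the final period belongs to a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.end()].strip()
        words = candidate.rstrip(".!?").split()
        last_word = words[-1].lower() if words else ""
        if last_word in ABBREVIATIONS:
            continue  # e.g. "f.eks." mid-sentence: not a boundary
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Vi trenger setninger, f.eks. fra Wikipedia. Dette er setning to."))
# -> ['Vi trenger setninger, f.eks. fra Wikipedia.', 'Dette er setning to.']
```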
- Practice speaking and listening of your target language on Common Voice
- Web Speech API is (still) broken on Linux circa 2023
There is a lot of TTS and STT development going on (https://github.com/mozilla/TTS; https://github.com/mozilla/DeepSpeech; https://github.com/common-voice/common-voice). That is the only way they work: contributions from the wild.
- How do I get audio data from native speakers for Anki?
- Web Speech API is not available in the Quest browser
Since you're interested in STT and TTS, let me just plug Mozilla's Common Voice, a way for everyone to contribute to an open-source dataset for STT. You can record yourself or verify other people's recordings!
- Mozilla Common Voice - Korean Language is live - Help Build a Korean Corpus for Training AI/Navi/etc
[Common Voice email](mailto:[email protected]) || Common Voice || Korean Language Homepage || FAQs || Speaking Aloud and Reviewing Recordings || Sentence Collector || NVidia/NeMo
- Ask HN: Open-source video transcribing software?
How can it be used for transcription?
On their website I only see an interface for either uploading audio or submitting transcriptions:
https://commonvoice.mozilla.org/es
The Github repo they mention (https://github.com/common-voice/common-voice) seems to be just that sample collection software. I do not see where I can download the software to transcribe audio.
TTS
- Build Your Own Clone: Best Open-Source AI Tools
2. Coqui AI
- Real-time Voice Chat at ~500ms Latency
That is probably the reason you can't find that much.
https://coqui.ai/
- Show HN: Voice-Pro - AI Voice Cloning Magic: Transform Any Voice in 15 Seconds
It's really easy for a technical person to do as well.
I use Coqui TTS[0] as part of my home automation. After getting the idea from HeyWillow[1], I wrote a small Python script that lets me upload a voice clip for it to clone, plus a small shim that sends the output to a Home Assistant media player instead of their standard output device. I run the TTS container on a VM with a Tesla P4 (~£100 to buy) and get about 1x-2x (roughly the same time it'd take to say it) using the large model.
Just for a giggle, I uploaded a few 3-5 second clips of myself speaking and cloned my voice, then sent a command to our living room media player to call my wife into the room; from another room, she was 100% convinced it was me speaking words I'd never spoken.
I tried playing with a variety of sentences for a few hours and overall, it sounded almost exactly like me, to me, with the exception of some "attitude" and "intonation" I know I wouldn't use in my speech. I didn't notice much of an improvement using much longer clips; the short ones were "good enough".
Tangentially, it really bugs me that most phone providers in the UK now insist you record a "personal greeting" before they'll let you check your voicemail box. I just record silence, because the last thing I want/need is a voicemail greeting in my voice confirming to some randomer I didn't want calling me who I am and that my number is active, even more so knowing how I can
[0] https://github.com/coqui-ai/TTS
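For anyone curious what that kind of setup looks like in code, here is a minimal sketch of the clone-and-speak step using the Coqui TTS Python API (model name, file paths, and the phrase are illustrative; the Home Assistant shim from the comment above is left out):

```python
# Minimal sketch: clone a voice from a short reference clip and synthesize
# a phrase with Coqui TTS (https://github.com/coqui-ai/TTS).
from TTS.api import TTS

# XTTS v2 supports zero-shot voice cloning from a few seconds of audio.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")  # or .to("cpu")

tts.tts_to_file(
    text="Dinner is ready, could you come to the living room?",
    speaker_wav="my_voice_clip.wav",   # 3-5 second reference recording
    language="en",
    file_path="announcement.wav",      # push this to a media player afterwards
)
```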
- Show HN: Offline audiobook from any format with one CLI command
For anyone who is interested, CoquiTTS (formerly Mozilla TTS) was great, but the project isn't maintained anymore (although there's been some confusion about whether or not it's active; see https://github.com/coqui-ai/TTS/issues/4022).
Looks like there's an effort to keep an actively maintained fork here, though: https://github.com/idiap/coqui-ai-TTS
- Ask HN: What is the state of OSS voice cloning?
I am super impressed by the quality of voice cloning offered by Eleven Labs and Play.ai. I feel like I frequently see impressive OSS demos on social media, but last weekend I took a few popular ones for a spin and the quality wasn't even close to the proprietary models.
https://github.com/coqui-ai/tts
- AIM Weekly 17 June 2024
- Coqui.ai TTS: A Deep Learning Toolkit for Text-to-Speech
The license is the MPL, which allows commercial use?
https://github.com/coqui-ai/TTS/blob/dev/LICENSE.txt
- Show HN: Pi-C.A.R.D, a Raspberry Pi Voice Assistant
When I did a similar thing (but with less LLM), I liked https://github.com/coqui-ai/TTS, but back then I needed to cut out the conversion step from tensor to a list of numbers to make it work really nicely.
- Ask HN: Open-source, local Text-to-Speech (TTS) generators
I just noticed that https://coqui.ai/ is "Shutting down".
I'm building a web app (React / Django) which takes a list of affirmations & goals (in Markdown files), puts them into a database (SQLite), and uses voice synthesis to create audio files of the phrases. These are combined with a relaxed backing track (ffmpeg), made into playlists of 10-20 phrases (randomly sampled, or according to a theme: "mind", "body", "soul"), and then played automatically in the morning & evening (cron). This allows you to persistently hear & vocalize your own goals & good vibes over time.
I had been planning to use Coqui TTS as the local text-to-speech engine, but with this cancellation, I'd love to hear from the community what is a great open-source, local text-to-speech engine?
Generally, I like to learn both the highest-quality commercially available technology (for example, ElevenLabs) and the best open-source equivalent. Would love to hear suggestions & perspectives on this. What voice synth tools are you investing your time into learning & building with?
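As an illustration of the synthesize-and-mix step in the pipeline described above, a rough sketch might look like the following (it assumes Coqui TTS as the engine and an ffmpeg binary on PATH; the model name, filenames, and helper function are made up for the example):

```python
# Sketch of the "phrase -> speech -> mix over backing track" step.
import subprocess
from TTS.api import TTS

tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")  # any local model would do

def render_affirmation(text: str, out_path: str, backing_track: str = "relaxed.mp3") -> None:
    """Synthesize one phrase and lay it over a quieter backing track."""
    tts.tts_to_file(text=text, file_path="phrase.wav")
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", "phrase.wav",
            "-i", backing_track,
            # Lower the music, mix both inputs, stop when the voice ends.
            "-filter_complex",
            "[1:a]volume=0.3[bg];[0:a][bg]amix=inputs=2:duration=first",
            out_path,
        ],
        check=True,
    )

render_affirmation("I make steady progress toward my goals.", "affirmation_01.mp3")
```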
- OpenAI deems its voice cloning tool too risky for general release
lol this marketing technique is getting very old. https://github.com/coqui-ai/TTS is already amazing and open source.
What are some alternatives?
vosk-server - WebSocket, gRPC and WebRTC speech recognition server based on Vosk and Kaldi libraries
piper - A fast, local neural text to speech system
forced-alignment-tools - A collection of links and notes on forced alignment tools
tortoise-tts - A multi-voice TTS system trained with an emphasis on quality
DeepSpeech - DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
silero-models - Silero Models: pre-trained speech-to-text, text-to-speech and text-enhancement models made embarrassingly simple
