Yeah, I was really impressed with the project when I encountered it last year when trying out a bunch of FLOSS Speech-To-Text options.
It was significantly better than the other FLOSS options I looked at--both in how easy it was to get going initially & in the quality of the speech-to-text results.
I tested it with a lightly modified version of this example script: https://github.com/alphacep/vosk-api/blob/master/python/exam...
What I found particularly interesting was that when you have the "partial" recognition output shown in real-time, you get to see how--at the end of a sentence--it may revise a word from earlier in the sentence in the final recognition output, based on (I guess) the additional context of the full sentence.
(I just did a quick test again, with the installs from my testing last year, using an internal laptop microphone, & the test script recognized a significant chunk of my speech--using a headset definitely improves things though. In the same environment, a test with `mic_vad_streaming` (from `DeepSpeech-examples-r0.9` with `deepspeech-0.9.0-models.pbmm`) failed to recognize any words at all.)
Vosk-api isn't an STT engine itself; it is built on the Kaldi speech recognition toolkit (https://github.com/kaldi-asr/kaldi) and nicely implements and packages an API for Kaldi chain/LF-MMI models.
Yes!
The project is called Larynx, and it is amazing: https://github.com/rhasspy/larynx/
I waxed lyrical about it recently in this thread about private alternatives to Alexa: https://news.ycombinator.com/item?id=29562526
I can only vouch for the quality/variety in English, but its documentation does note support for 50 voices across 9 languages, including all the first group of languages you mentioned, and also Russian. (I've "played" with all those languages to test them but can't really vouch for how a native speaker/listener might find it. :D )
It is miles ahead of any of the other Free/Open Source TTS solutions I've tried, including the ones you mentioned.
(It's still synthesized speech, but the output quality is so good and it's still extremely early days for the project.)
And there's a range of options in accent & gender--which are in general sorely lacking in other FLOSS TTS options. (In terms of licensing, some voices are licensed more freely than others but the majority are without significant restriction.)
I like Larynx so much that I've been working on an editor for it to assist in "auditioning" & recording speech in a narrative context, e.g. game/film pre-viz.
I just checked and apparently they are already aware: https://github.com/Chocobozzz/PeerTube/issues/3325#issuecomm... :)
(Tho I'll admit I have no idea what "bluffing" means in that context. :D )
Was just about to mention this repo to the OP but suspect I found it from your site in the first place: https://github.com/benob/recasepunc :D
Punctuation/capitalization will make a massive difference to practical use! Look forward to it.