Home Assistant’s Year of the Voice – Chapter 2

ovos-core

13 101 9.1 Python

OpenVoiceOS Core, the FOSS Artificial Intelligence platform.

I used to work for Mycroft, so I'm hoping to eventually create an image that's compatible with Home Assistant pipelines.
For now, though, you may want to check out OVOS: https://openvoiceos.com/

larynx

18 788 0.0 Python

Discontinued. End-to-end text-to-speech system using gruut and onnx

The most exciting thing about Home Assistant's "Year of the Voice", for me, is that it is apparently enabling/supporting @synesthesiam's continued phenomenal contributions to the FLOSS off-line voice synthesis space.
The quality, variety & diversity of voices that synesthesiam's "Larynx" TTS project (https://github.com/rhasspy/larynx/) made available completely transformed the Free/Open Source Text To Speech landscape.
In addition, "OpenTTS" (https://github.com/synesthesiam/opentts) provided a common API for interacting with multiple FLOSS TTS projects, which showed great promise for actually enabling "standing on the shoulders of" rather than re-inventing the same basic functionality every time.
The new "Piper" TTS project mentioned in the article is the apparent successor to Larynx and, along with the accompanying LibriTTS/LibriVox-based voice models, brings to FLOSS TTS something it's never had before:
* Too many voices! :)
Seriously, the current LibriTTS voice model version has 900+ voices (of varying quality levels). How do you even navigate that many?![0]
And that's not even considering the even higher quality single speaker models based on other audio recording sources.
Offline TTS, while immensely valuable for individuals, doesn't seem to be an attractive domain for most commercial entities due to the lack of lock-in/telemetry opportunities, so I was concerned that we might end up missing out on further valuable contributions from synesthesiam's specialised skills & experience due to financial realities & the human need for food. :)
I'm glad we instead get to see what happens next.
[0] See my follow-up comment about this.
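To make the "common API" idea above a little more concrete, here is a minimal sketch of building a request URL for an OpenTTS-style HTTP server. The `/api/tts` path, the `voice` and `text` query parameters and the port 5500 are assumptions based on recollection of the OpenTTS README, and the voice id is purely hypothetical, so check the project's documentation before relying on any of it.

```python
from urllib.parse import urlencode


def build_tts_url(base_url: str, voice: str, text: str) -> str:
    """Build a GET URL for a text-to-speech request.

    Assumption: the server exposes an /api/tts endpoint that takes
    'voice' and 'text' query parameters (not confirmed by the comment).
    """
    query = urlencode({"voice": voice, "text": text})
    return f"{base_url.rstrip('/')}/api/tts?{query}"


# Hypothetical server address and voice id, for illustration only.
url = build_tts_url(
    "http://localhost:5500",
    "larynx:example-voice",
    "Hello from the board room.",
)
print(url)
```

The point of a shared endpoint like this is that swapping the underlying TTS engine would, ideally, only mean changing the voice id rather than the calling code.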

opentts

10 822 1.3 Python

Open Text to Speech Server

larynx-dialogue

1 - -

My interest in offline TTS is actually entirely unrelated to the automation space: I'm interested in Text to Speech for creative pursuits, such as video game voice dialogue and animated videos.
This is one of the reasons why the range & quantity of available voices is particularly important to me.
After all, you can't really have a scene set in a board room with nine characters[3] if you've only got three voices to go around. :)
I've actually been spending time this week on updating my "Dialogue Tool"[1] application (originally created to work with Larynx to help with narrative dialogue workflows such as voice "auditioning", intelligent caching & multiple voice recordings) to work with Piper.
Which is where I ran into the question of how to navigate/curate a collection of 900+ voices.
The main approaches I'm using so far are:
(1) Random luck--just audition a bunch of different voices with your sample dialogue & see what you like.
(2) Curation/sorting based on quality-related meta-data from the original dataset.
(3) Generating a different dialogue line for each voice that includes their speaker number for identification purposes that also (hopefully) isn't tedious to listen to for 900+ voices. :)
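Approach (3) could be sketched roughly as follows. This is not the Dialogue Tool's actual implementation: the templates, the function names and the assumption that speakers are numbered from 0 are all made up for illustration.

```python
# Hypothetical line templates; rotating them keeps a long listening
# session from being the same sentence repeated 900+ times.
TEMPLATES = [
    "Speaker {n} here, and I would like to audition for the part.",
    "This is voice number {n}. Thank you for listening.",
    "You are hearing speaker {n}. What do you think so far?",
    "Number {n} reporting in. The meeting can begin.",
]


def identification_line(speaker_id: int) -> str:
    """Return a line that states the speaker number, cycling templates."""
    template = TEMPLATES[speaker_id % len(TEMPLATES)]
    return template.format(n=speaker_id)


def lines_for_speakers(speaker_count: int) -> dict[int, str]:
    """Map each speaker id (assumed to start at 0) to its own line."""
    return {sid: identification_line(sid) for sid in range(speaker_count)}


lines = lines_for_speakers(900)
print(lines[17])
```

Each generated line would then be synthesised with the matching speaker, so that the audio itself tells you which voice you are hearing.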
I haven't quite finished/uploaded results from (3) yet but example output based on approaches (3) & (2) can be heard here: https://rancidbacon.gitlab.io/piper-tts-demos/
The recording has two sets of 10 voices which had the lowest Word Error Rate scores in the original dataset--which doesn't mean the resulting voice model is necessarily good but is at least a starting point for exploring.
I'd also like to explore more analysis-based approaches for grouping/curation (e.g. vocal characteristics such as "softer", "lower", "older") but as I'm not getting paid for this[2], that's likely a longer term thing.
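A shortlist like that can be produced with a simple sort over per-speaker metadata. The sketch below assumes each speaker has a single Word Error Rate value; the record type, field names and numbers are invented for illustration and do not come from the actual dataset.

```python
from dataclasses import dataclass


@dataclass
class SpeakerStats:
    speaker_id: int
    word_error_rate: float  # lower means fewer transcription errors


def lowest_wer(speakers: list[SpeakerStats], count: int = 10) -> list[SpeakerStats]:
    """Return the `count` speakers with the lowest Word Error Rate."""
    return sorted(speakers, key=lambda s: s.word_error_rate)[:count]


# Made-up example values, not real dataset figures.
speakers = [
    SpeakerStats(101, 0.12),
    SpeakerStats(102, 0.03),
    SpeakerStats(103, 0.08),
    SpeakerStats(104, 0.05),
]

shortlist = lowest_wer(speakers, count=2)
print([s.speaker_id for s in shortlist])  # [102, 104]
```

As noted above, a low score on the source recordings only suggests which voices to audition first; it says nothing definitive about how the trained voice sounds.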
A different approach which I've previously found really interesting is to use voices as a prompt for writing narrative dialogue. It really helps to hear the dialogue as you write it and the nuances of different voices can help spur ideas for where a conversation goes next...
[1] See: https://rancidbacon.itch.io/dialogue-tool-for-larynx-text-to... & https://gitlab.com/RancidBacon/larynx-dialogue/-/tree/featur...
[2] Am currently available/open to be though. :D
[3] Will try to upload some example audio of this scene because I found it pretty funny. :)

coral-pi-rest-server

44 66 0.0 Jupyter Notebook

Perform inferencing of tensorflow-lite models on an RPi with acceleration from Coral USB stick

> On a Raspberry Pi 4, Piper can generate 2 seconds of audio with only 1 second of processing time.
> On a Raspberry Pi 4, voice commands can take around 7 seconds to process with about 200 MB of RAM used.
Have you looked into supporting something like the Coral Accelerator[0], which can drastically speed up machine learning inference on a Raspberry Pi?
[0]: https://coral.ai/products/accelerator
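For context, the first quoted figure can be expressed as a real-time factor (processing time divided by audio duration). The sketch below just does that arithmetic; the function name is made up, and the idea that processing time scales linearly with audio length is an assumption, not something the quote states.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of generated audio (below 1.0 is faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Quoted figure: 2 seconds of audio from 1 second of processing.
rtf = real_time_factor(processing_seconds=1.0, audio_seconds=2.0)
print(rtf)  # 0.5

# Rough estimate for a 10 second utterance, assuming linear scaling.
print(rtf * 10.0)  # 5.0
```

Any accelerator would be judged by how much it lowers this ratio for the model in question.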

watchtower

215 16,889 8.2 Go

A process for automating Docker container base image updates.

I do this as well and leverage Watchtower [0] to upgrade the container for me and keep the image cache flushed. I just track the `stable` tag and make sure to keep an eye on changes. I only check for updates once a week and it has worked out great.
[0] https://containrrr.dev/watchtower/

piper

33 3,902 8.9 C++

A fast, local neural text to speech system (by rhasspy)

> 900+ voices
Where can I find all these voices? https://github.com/rhasspy/piper/releases/tag/v0.0.2 lists "only" ~50 files.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
