But yeah, I set up a new environment in Anaconda and followed the instructions on their GitHub page to install it. I then used the "medium" model to transcribe a recent 20-minute video (日本人の英語の発音の特徴!アメリカでどう思われてるの — "Features of Japanese English pronunciation! What do Americans think of it?") from the Kevin's English Room YouTube channel, downloaded with yt-dlp. It's easy to check the transcription quality because the video has hard-coded Japanese subtitles, as most Japanese videos on YouTube do. Transcribing took about 11 minutes on a 2080 Ti, so roughly 2x real-time speed. And I'd say the result is significantly better than YouTube's default auto-transcription, especially when people are speaking in multiple languages (Pastebin link):
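For anyone wanting to try the same workflow, a minimal sketch of the two commands involved — the video URL is a placeholder, and the output filename is an assumption:

```shell
# Download just the audio track with yt-dlp (URL is a placeholder)
yt-dlp -x --audio-format mp3 -o audio.mp3 "https://www.youtube.com/watch?v=VIDEO_ID"

# Transcribe with the "medium" Whisper model; language is auto-detected
whisper audio.mp3 --model medium
```

Whisper writes the transcript in several formats (.txt, .srt, .vtt, etc.) next to the input file by default.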
And while googling this, I stumbled upon a discussion on the Whisper GitHub repository which seems to suggest that the issue is Whisper's rather poor built-in VAD (Voice Activity Detection), and that it can be worked around by running a separate VAD (like silero-vad) first. This might be something I want to add to my WebUI in the future.
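To make the VAD idea concrete: a VAD scans the audio and flags which frames contain speech, so silent stretches can be cut out before transcription. silero-vad does this with a trained neural network; the following is only a toy energy-based sketch of the same interface, with a made-up threshold, not how silero-vad actually works:

```python
import numpy as np

def energy_vad(samples, sample_rate=16000, frame_ms=30, threshold=0.02):
    """Toy VAD: flag each frame whose RMS energy exceeds a threshold.
    Real VADs like silero-vad use a trained model instead of raw energy."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    flags = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        flags.append(bool(rms > threshold))
    return flags

# Synthetic test signal: 0.5 s silence, 0.5 s of a 440 Hz tone, 0.5 s silence
sr = 16000
t = np.linspace(0, 0.5, sr // 2, endpoint=False)
tone = 0.1 * np.sin(2 * np.pi * 440 * t)
silence = np.zeros(sr // 2)
signal = np.concatenate([silence, tone, silence])

flags = energy_vad(signal, sr)  # True only for frames inside the tone
```

Segments where the flags are True would then be concatenated and fed to Whisper, which avoids the model hallucinating text during long silences.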