Yeah, it does seem fast enough for real-time... but I don't know enough about the underpinnings of the model to say. Some phrases get transcribed almost instantaneously, and then there are big unexpected pauses, even though the sample audio has a consistent stream of words. I don't know if that has anything to do with Whisper being designed to do phrase-level tokenization (i.e. you can't get word-by-word timestamp data).
I posted a command-line example here (it uses yt-dlp, a fork of youtube-dl, to extract the audio from an example online video):
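For readers without the linked post, here is a rough sketch of what such a pipeline might look like. It is not the poster's actual example. It assumes `yt-dlp`, `ffmpeg`, and the `whisper` command from the openai-whisper package are installed; the URL, the `sample` file stem, and the `build_commands` helper are hypothetical placeholders. The script only builds and prints the two commands, it does not run them.

```python
import shlex


def build_commands(video_url, audio_stem="sample", model="base"):
    """Return the yt-dlp and whisper invocations as argument lists.

    Hypothetical helper for illustration only; nothing is executed here.
    """
    audio_file = f"{audio_stem}.mp3"
    # yt-dlp: -x extracts the audio track, --audio-format converts it
    # (needs ffmpeg), and -o sets the output filename template.
    download = [
        "yt-dlp", "-x", "--audio-format", "mp3",
        "-o", f"{audio_stem}.%(ext)s", video_url,
    ]
    # whisper CLI: the audio file to transcribe plus the model size to load.
    transcribe = ["whisper", audio_file, "--model", model]
    return download, transcribe


if __name__ == "__main__":
    # Placeholder URL: substitute a real video before running the commands.
    for cmd in build_commands("https://example.com/watch?v=VIDEO_ID"):
        print(shlex.join(cmd))
```

Printing the commands rather than running them keeps the sketch safe to try; paste the output into a shell once the URL points at a real video.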
Did you use venv, or did you work around this another way?