I'll be shipping Distil-Whisper to whisper-turbo tomorrow! https://github.com/FL33TW00D/whisper-turbo
Should make running in the browser feasible even for underpowered devices:
Just a point of clarification - faster-whisper references it but ctranslate2[0] is what's really doing the magic here.
CTranslate2 is a sleeper powerhouse project that enables a lot. It should be front and center and get the credit it deserves.
[0] - https://github.com/OpenNMT/CTranslate2
How much faster, in real wall-clock time, is this on batched data than https://github.com/m-bain/whisperX ?
I have something pretty rudimentary here: https://github.com/Ono-Sendai/project-2501
I'm the founder of Willow[0] (we use ctranslate2 as well) and I'll be looking at this as soon as the models are released tomorrow. HF claims they're drop-in compatible, but we won't know for sure until someone tries it.
[0] - https://heywillow.io/
There's also OpenWakeWord[0]. The models are readily available in tflite and ONNX formats and are impressively "light" in terms of compute requirements and performance.
It should be possible.
[0] - https://github.com/dscripka/openWakeWord
Oh yes, that's absolutely true - faster is better for everyone. It's just that this particular breakpoint would put realtime transcription on a $17 device with an amazing support ecosystem. It's wild.
That being said, even with this distillation there's still the fact that Whisper isn't really designed for streaming. It's fairly simplistic and always deals with 30-second windows. I was expecting there to be some sort of useful transform you could apply to the model to avoid quite so much reprocessing per frame, but other than https://github.com/mit-han-lab/streaming-llm (which I'm not even sure directly helps) I haven't noticed anything out there.
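To make the reprocessing cost concrete: a naive streaming wrapper slides Whisper's fixed 30-second window forward by a small hop and re-transcribes each window, so most samples get re-encoded every step. A minimal sketch of that bookkeeping (the hop size and helper names here are illustrative, not from any real wrapper):

```python
# Whisper consumes fixed 30 s windows at 16 kHz; a naive streaming
# wrapper slides that window forward by a hop and re-transcribes,
# so most samples are reprocessed on every step.
WINDOW_S = 30
HOP_S = 5            # hypothetical hop; smaller hop = lower latency, more rework
SAMPLE_RATE = 16_000

def sliding_windows(n_samples, window_s=WINDOW_S, hop_s=HOP_S, sr=SAMPLE_RATE):
    """Yield (start, end) sample ranges a streaming wrapper would feed Whisper."""
    window, hop = window_s * sr, hop_s * sr
    start = 0
    while start + window <= n_samples:
        yield (start, start + window)
        start += hop

def reprocessing_ratio(window_s=WINDOW_S, hop_s=HOP_S):
    """Fraction of each window already seen in the previous one."""
    return (window_s - hop_s) / window_s
```

With a 5 s hop, 25 of every 30 seconds are re-encoded on each step, which is exactly the overhead a streaming-aware transform would need to eliminate.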
That's the implication. If the distil models are in the same format as the original OpenAI models, then they can be converted for faster-whisper use per the conversion instructions at https://github.com/guillaumekln/faster-whisper/
So then we'll see whether we get the 6x model speedup on top of the stated 4x faster-whisper code speedup.
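If the drop-in compatibility claim holds, the conversion should follow the standard CTranslate2 recipe. A command sketch, untested against the distil checkpoints; the HF repo id `distil-whisper/distil-large-v2` is an assumption here:

```shell
# Sketch only: convert a distil-whisper checkpoint for faster-whisper,
# assuming the models really are drop-in Whisper-format (unverified).
pip install ctranslate2 transformers

# ct2-transformers-converter ships with the ctranslate2 package; the
# repo id below is an assumed distil-whisper checkpoint name.
ct2-transformers-converter \
    --model distil-whisper/distil-large-v2 \
    --output_dir distil-large-v2-ct2 \
    --quantization float16
```

After that, faster-whisper should load it like any other converted model, e.g. `WhisperModel("distil-large-v2-ct2")`, and we can measure whether the two speedups actually stack.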
Fortunately, yes. Recently I've been playing with github.com/rpdrewes/whisper-websocket-server, which uses Kõnele as a frontend on Android, if you really care about performance.
Though if you're looking for a standalone app, you can give https://github.com/alex-vt/WhisperInput a go and run it right on your phone :]
For now they both run regular OpenAI Whisper (tiny.en), but as you can see there's tons of improvement potential with faster-whisper and now distil-whisper :D