It seems like OpenAI are finally living up to their name for once with this release? Anything I'm missing?
From what I can gather:
1. Includes model weights. I can't find the URL, but they reference them enough and have a CLI tool, so I presume I just haven't found them yet.
2. Includes code: https://github.com/openai/whisper
3. Released under MIT License: https://github.com/openai/whisper/blob/main/LICENSE
What "script" are you using for doing txt2img? The watermark function is automatically called when you use the CLI in two places, https://github.com/CompVis/stable-diffusion/blob/69ae4b35e0a... and https://github.com/CompVis/stable-diffusion/blob/69ae4b35e0a...
Trivial to remove, I'll give you that. But AFAIK, the original repository and its forks apply the watermark automatically unless you've removed it yourself.
The authors do explicitly state that they're trying to do a lot of fancy new stuff here, like being multilingual, rather than just chasing accuracy.
[1] https://github.com/syhw/wer_are_we
I'm giving up for the night, but https://github.com/Smaug123/whisper/pull/1/files at least contains the setup instructions that may help others get to this point.
As far as TTS goes, Mycroft.ai[0] has released a decent offline one.
[0] https://mycroft.ai/
I tried running it in realtime with live audio input (kind of).
You can find the python script in this repo: https://github.com/tobiashuttinger/openai-whisper-realtime
One thing they don't touch much on is the STT, as they use models from third parties. You could definitely do something that utilizes this model and then feeds the tokens to some of their parsing code. I've been working on something similar to this, but burned out around adding the STT portion [0].
[0]: https://github.com/Sheepybloke2-0/trashbot - It was called trashbot because the final implementation was going to look like Oscar the Grouch in a trash can, displaying the reminders.
Haven’t tried it yet but love the concept!
Have you thought of using VAD (voice activity detection) for breaks? Back in my day (a long time ago) the webrtc VAD stuff was considered decent:
https://github.com/wiseman/py-webrtcvad
Model isn’t optimized for this use but I like where you’re headed!
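For the breaks idea, even a crude energy gate shows the shape of it. A pure-stdlib sketch (this is NOT py-webrtcvad's actual API; it's a simple RMS threshold over 10 ms frames of 16-bit PCM, and the `500.0` threshold is an arbitrary illustration value):

```python
import math
import struct

FRAME_SAMPLES = 160  # 10 ms at 16 kHz, a typical VAD frame size

def frame_rms(frame: bytes) -> float:
    """Root-mean-square energy of a frame of little-endian 16-bit PCM."""
    samples = struct.unpack("<%dh" % (len(frame) // 2), frame)
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Treat any frame whose RMS exceeds the threshold as speech."""
    return frame_rms(frame) > threshold

# Synthetic demo: 10 ms of silence vs. 10 ms of a loud 440 Hz tone.
silence = struct.pack("<%dh" % FRAME_SAMPLES, *([0] * FRAME_SAMPLES))
tone = struct.pack(
    "<%dh" % FRAME_SAMPLES,
    *(int(10000 * math.sin(2 * math.pi * 440 * i / 16000))
      for i in range(FRAME_SAMPLES)),
)
print(is_speech(silence), is_speech(tone))  # → False True
```

A real VAD like webrtcvad is far more robust than an energy gate (it won't fire on keyboard noise), but segmenting the mic stream on silence like this is the usual way to decide where one utterance ends and the next begins.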
It understands my Swedish attempts at English really well with the medium.en model. (Although, it gives me a funny warning: `UserWarning: medium.en is an English-only model but received 'English'; using English instead.` I guess it doesn't want to be told to use English when that's all it can do.)
However, it runs very slowly. It uses the CPU on my MacBook, presumably because it doesn't have an Nvidia card.
Googling around, I found [plaidML](https://github.com/plaidml/plaidml), a project promising to run ML on many different GPU architectures. Does anyone know whether it's possible to plug the two together somehow? I'm not an ML researcher and don't really understand the technical details of the domain, but I can read and write Python in domains I do understand, so I could do some glue work if required.
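Whisper runs on PyTorch, so a first sanity check is just asking PyTorch what accelerators it can see. A small sketch (whether this particular model actually runs well on a given backend, e.g. Apple's MPS, is a separate question I can't vouch for):

```python
import torch

# Report which accelerator backends this PyTorch install can use.
print("CUDA available:", torch.cuda.is_available())
print(
    "MPS (Apple GPU) available:",
    getattr(torch.backends, "mps", None) is not None
    and torch.backends.mps.is_available(),  # mps backend exists in torch >= 1.12
)
```

If both come back False, everything falls back to the CPU, which would match the slowness you're seeing.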
Perhaps this could be adapted? https://github.com/mozilla/DeepSpeech-examples/blob/r0.9/mic...