By the way, there is also another project called whisper.cpp:
https://github.com/ggerganov/whisper.cpp
It uses 8x less memory than the Python implementation for the tiny model. It is worth keeping an eye on, since Python bindings are planned on the roadmap:
https://github.com/ggerganov/whisper.cpp#bindings
I was working on this yesterday. It seems that the most common approach with Whisper is simply to break the audio into chunks and transcribe each one separately. This works but, as you'd expect, it sometimes has trouble at the chunk edges, where a word can be cut in half.
You could do better by overlapping the segments, except then stitching the transcriptions together becomes an issue, since Whisper doesn't provide reliable per-token timestamps [0] and the output for the shared part of two overlapping segments isn't necessarily the same.
Some more useful discussion here [1].
[0] https://github.com/openai/whisper/discussions/332
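For anyone who wants to try the overlapping-chunk idea, here is a rough sketch of just the splitting step. It only computes sample ranges; it does not call Whisper and does not solve the stitching problem mentioned above. The function name chunk_bounds and the 30 s / 5 s values in the usage note are my own illustrative choices, not something taken from the Whisper codebase.

```python
def chunk_bounds(total_samples, chunk_samples, overlap_samples):
    """Return (start, end) sample ranges covering the audio with overlap.

    Hypothetical helper for illustration: each chunk is at most
    `chunk_samples` long and shares `overlap_samples` with the next one.
    """
    if chunk_samples <= 0:
        raise ValueError("chunk_samples must be positive")
    if not 0 <= overlap_samples < chunk_samples:
        raise ValueError("overlap_samples must be in [0, chunk_samples)")

    step = chunk_samples - overlap_samples  # how far each new chunk advances
    bounds = []
    start = 0
    while start < total_samples:
        end = min(start + chunk_samples, total_samples)
        bounds.append((start, end))
        if end == total_samples:
            break  # the last chunk reached the end of the audio
        start += step
    return bounds
```

With audio resampled to 16 kHz, something like chunk_bounds(len(samples), 30 * 16000, 5 * 16000) would give 30-second chunks that overlap by 5 seconds. Each range could then be sliced out and transcribed separately; deciding which of the two transcriptions to keep for the shared 5 seconds is the hard part and is left open here.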
People interested in this might also be interested in transcribe-anything [1].
It automates fetching the video and uses Whisper to generate .srt, .vtt and .txt files.
[1] https://github.com/zackees/transcribe-anything