whisper.cpp vs whisperX

|  | whisper.cpp | whisperX |
|---|---|---|
| Mentions | 199 | 34 |
| Stars | 41,464 | 16,683 |
| Growth | 2.7% | 4.2% |
| Activity | 9.9 | 8.8 |
| Latest commit | 4 days ago | 15 days ago |
| Language | C++ | Python |
| License | MIT License | BSD 2-clause "Simplified" License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
whisper.cpp
- Ask HN: What API or software are people using for transcription?
Whisper large v3 from OpenAI, but we host it ourselves on Modal.com. It's easy, fast, no rate limits, and cheap as well.
If you want to run it locally, I'd still go with whisper, then I'd look at something like whisper.cpp https://github.com/ggml-org/whisper.cpp. Runs quite well.
- Whispercpp – Local, Fast, and Private Audio Transcription for Ruby
- Build Your Own Siri. Locally. On-Device. No Cloud
Not the GP, but I found this: https://github.com/ggml-org/whisper.cpp/blob/master/models/c...
- Run LLMs on Apple Neural Engine (ANE)
Actually that's a really good question, I hadn't considered that the comparison here is just CPU vs using Metal (CPU+GPU).
To answer the question though - I think this would be used for cases where you are building an app that wants to utilize a small AI model while at the same time having the GPU free to do graphics related things, which I'm guessing is why Apple stuck these into their hardware in the first place.
Here is an interesting comparison between the two from a whisper.cpp thread - ignoring startup times - the CPU+ANE seems about on par with CPU+GPU: https://github.com/ggml-org/whisper.cpp/pull/566#issuecommen...
- Building a personal, private AI computer on a budget
A great thread with the type of info you're looking for lives here: https://github.com/ggerganov/whisper.cpp/issues/89
But you can likely find similar threads for the llama.cpp benchmark here: https://github.com/ggerganov/llama.cpp/tree/master/examples/...
These are good examples because the llama.cpp and whisper.cpp benchmarks take full advantage of the Apple hardware but also take full advantage of non-Apple hardware with GPU support, AVX support etc.
It’s been true for a while now that the memory bandwidth of modern Apple systems, in tandem with the neural cores and GPU, has made them very competitive with Nvidia for local inference and even training.
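As a minimal sketch of scripting the whisper.cpp benchmark mentioned above from Python (the binary and model paths are assumptions; recent builds name the tool whisper-bench, older ones just bench, so adjust for your build):

```python
import subprocess

# Assumed paths: the tool from whisper.cpp's bench example, plus a model
# fetched with its download script (e.g. models/download-ggml-model.sh).
BENCH = "./build/bin/whisper-bench"
MODEL = "models/ggml-base.en.bin"

for threads in (1, 4, 8):
    # -m selects the ggml model, -t the number of CPU threads to time.
    result = subprocess.run(
        [BENCH, "-m", MODEL, "-t", str(threads)],
        capture_output=True, text=True, check=True,
    )
    print(f"--- {threads} thread(s) ---")
    print(result.stdout)
```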
- Whisper.cpp: Looking for Maintainers
- Show HN: Galene-stt: automatic captioning for the Galene videoconferencing system
- Show HN: Transcribe YouTube Videos
Not as convenient, but you could also have the user manually install the model, like whisper does.
Just forward the error message output by whisper, or even make a more user-friendly error message with instructions on how/where to download the models.
Whisper does provide a simple bash script to download models: https://github.com/ggerganov/whisper.cpp/blob/master/models/...
(As a Windows user, I can run bash scripts via Git Bash for Windows[1])
[1]: https://git-scm.com/download/win
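A minimal sketch of that friendlier-error approach, assuming whisper.cpp's usual model layout and its bundled download script (treat both paths as conventions, not guarantees):

```python
import os
import sys

MODEL = "models/ggml-base.en.bin"
DOWNLOAD_SCRIPT = "models/download-ggml-model.sh"  # ships with whisper.cpp

if not os.path.exists(MODEL):
    # Replace the raw "failed to load model" output with instructions.
    print(f"Model not found at {MODEL}.", file=sys.stderr)
    print(f"Download it first, e.g.: bash {DOWNLOAD_SCRIPT} base.en", file=sys.stderr)
    sys.exit(1)

# ...otherwise proceed to transcribe with the model...
```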
- OTranscribe: A free and open tool for transcribing audio interviews
- Show HN: I created an automatic subtitling app to boost short videos
whisper.cpp [1] has a karaoke example that uses ffmpeg's drawtext filter to display rudimentary karaoke-like captions. It also supports diarisation. Perhaps it could be a starting point to create a better script that does what you need.
--
1: https://github.com/ggerganov/whisper.cpp/blob/master/README....
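For reference, a rough sketch of driving that karaoke flow from Python, assuming whisper.cpp's -owts output option (which emits a helper script that calls ffmpeg's drawtext filter); binary names and flags vary across versions:

```python
import subprocess

# Assumed paths: the whisper.cpp CLI and a downloaded ggml model.
WHISPER_CLI = "./build/bin/whisper-cli"
MODEL = "models/ggml-base.en.bin"
AUDIO = "samples/jfk.wav"  # whisper.cpp expects 16 kHz mono WAV

# -owts writes AUDIO + ".wts": a shell script that renders karaoke-style
# captions over the audio using ffmpeg's drawtext filter.
subprocess.run([WHISPER_CLI, "-m", MODEL, "-f", AUDIO, "-owts"], check=True)

# Running the generated script (ffmpeg required) produces the captioned video.
subprocess.run(["bash", AUDIO + ".wts"], check=True)
```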
whisperX
- Ask HN: What API or software are people using for transcription?
I use whisperfile[1] directly. The whisper-large-v3 model seems good with non-English transcription, which is my main use-case.
I am also eyeing whisperX[2], because I want to play some more with speaker diarization.
Your use-case seems to be batch transcription, so I'd suggest you go ahead and just use whisperfile, it should work well on an M4 mini, and it also has an HTTP API if you just start it without arguments.
If you want more interactivity, I have been using Vibe[3] as an open-source replacement of SuperWhisper[4], but VoiceInk from a sibling comment seems better.
Aside: It seems that so many of the mentioned projects use whisper at their core that it would be interesting to explicitly mark the projects that don't, so we can have a real, fundamental comparison.
[1] https://huggingface.co/Mozilla/whisperfile
[2] https://github.com/m-bain/whisperX
[3] https://github.com/thewh1teagle/vibe/
[4] https://superwhisper.com/
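To illustrate the HTTP mode: whisperfile wraps whisper.cpp's server, so a sketch along these lines should work (the port, /inference endpoint, and form fields follow that server's docs, but treat them as assumptions):

```python
import requests

# Assumes a whisperfile/whisper.cpp server already running locally,
# e.g. started without arguments so it listens on 127.0.0.1:8080.
with open("meeting.wav", "rb") as audio:
    resp = requests.post(
        "http://127.0.0.1:8080/inference",
        files={"file": audio},
        data={"temperature": "0.0", "response_format": "json"},
        timeout=600,
    )
resp.raise_for_status()
print(resp.json()["text"])
```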
- Ask HN: Is Whisper Still Relevant?
Yes, it's still relevant, but I prefer WhisperX for some tasks: https://github.com/m-bain/whisperX
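For a sense of what WhisperX adds over plain Whisper, a minimal sketch following the transcribe-then-align flow in its README (model names and arguments may have changed since):

```python
import whisperx

device = "cuda"  # "cpu" also works, just slower
audio = whisperx.load_audio("audio.mp3")

# 1. Batched transcription via the faster-whisper backend.
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Forced alignment for word-level timestamps.
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```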
- Show HN: Mikey – No bot meeting notetaker for Windows
https://github.com/m-bain/whisperX looks promising - I'm hacking away on an always-on transcriber for my notes for later search&recall. It has support for diarization (the speaker detection you're looking for).
I'm currently hacking away on a mix of https://github.com/speaches-ai/speaches + https://github.com/ufal/whisper_streaming though - mostly because my laptop doesn't have a decent GPU, I stream the audio to a home server instead.
But overall it's pretty simple to do after you wrangle the Python dependencies - all you need is a sink for the text files (for example, create a new file for every Teams meeting, but that's another story...)
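The text-file sink really can be that simple; here is an illustrative sketch that just appends each finalized segment to a per-session file (all names are made up):

```python
from datetime import datetime
from pathlib import Path

NOTES_DIR = Path.home() / "transcripts"

def new_session_file() -> Path:
    """One file per meeting/session, named by its start time."""
    NOTES_DIR.mkdir(parents=True, exist_ok=True)
    return NOTES_DIR / f"{datetime.now():%Y-%m-%d_%H-%M-%S}.txt"

def write_segment(session_file: Path, text: str) -> None:
    # Called once per finalized segment from the streaming transcriber.
    with session_file.open("a", encoding="utf-8") as f:
        f.write(text.strip() + "\n")
```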
- VLC tops 6B downloads, previews AI-generated subtitles
You don't need to wait, you can use: https://github.com/m-bain/whisperX right now for STT with timestamps and diarization.
- Transcriber AI – Free, end-to-end machine-based transcription with speaker ID
I use whisper and pyannote (https://github.com/m-bain/whisperX), but it is a pain to run locally - I run it on a 4080. This seems to be actually trying to identify the speakers. Not sure what they are doing for that.
- Supercharge Your AI Skills: 5 Open Source Repositories You Can't Afford to Miss
3. WhisperX
- Show HN: Offline audiobook from any format with one CLI command
> And do you know a good speech to text model?
OpenAI's Whisper; the code and model are available, and multiple projects have built on it. You could try this wrapper: https://github.com/m-bain/whisperX -- or for short utterances on a smartphone https://github.com/futo-org/whisper-acft
- WhisperX: Precise ASR with Word-Level Timestamps and Diarization
- WhisperX: Precise ASR with Word-Level Timestamps and Speaker Diarization
- Text-to-Speech with Speaker Diarization
What are some alternatives?
bark - 🔊 Text-Prompted Generative Audio Model
faster-whisper - Faster Whisper transcription with CTranslate2
whisper - Robust Speech Recognition via Large-Scale Weak Supervision
openai-whisper-cpu - Improving transcription performance of OpenAI Whisper for CPU based deployment