whisper.cpp vs whisper

| | whisper.cpp | whisper |
|---|---|---|
| Mentions | 199 | 373 |
| Stars | 41,464 | 84,697 |
| Growth | 2.7% | 2.7% |
| Activity | 9.9 | 7.4 |
| Latest commit | 4 days ago | 21 days ago |
| Language | C++ | Python |
| License | MIT License | MIT License |
Stars: the number of stars a project has on GitHub. Growth: month-over-month growth in stars.
Activity: a relative number indicating how actively a project is being developed; recent commits have higher weight than older ones. For example, an activity of 9.0 indicates a project is among the top 10% of the most actively developed projects tracked.
Posts mentioning whisper.cpp
- Ask HN: What API or software are people using for transcription?
Whisper large-v3 from OpenAI, but we host it ourselves on Modal.com. It's easy, fast, has no rate limits, and is cheap as well.
If you want to run it locally, I'd still go with Whisper; then I'd look at something like whisper.cpp (https://github.com/ggml-org/whisper.cpp). It runs quite well.
- Whispercpp – Local, Fast, and Private Audio Transcription for Ruby
- Build Your Own Siri. Locally. On-Device. No Cloud
Not the GP, but I found this: https://github.com/ggml-org/whisper.cpp/blob/master/models/c...
- Run LLMs on Apple Neural Engine (ANE)
Actually that's a really good question, I hadn't considered that the comparison here is just CPU vs using Metal (CPU+GPU).
To answer the question, though: I think this would be used where you're building an app that wants to run a small AI model while keeping the GPU free for graphics work, which I'm guessing is why Apple put these into their hardware in the first place.
Here is an interesting comparison between the two from a whisper.cpp thread. Ignoring startup times, CPU+ANE seems about on par with CPU+GPU: https://github.com/ggml-org/whisper.cpp/pull/566#issuecommen...
- Building a personal, private AI computer on a budget
A great thread with the type of info you're looking for lives here: https://github.com/ggerganov/whisper.cpp/issues/89
But you can likely find similar threads for the llama.cpp benchmark here: https://github.com/ggerganov/llama.cpp/tree/master/examples/...
These are good examples because the llama.cpp and whisper.cpp benchmarks take full advantage of the Apple hardware but also take full advantage of non-Apple hardware with GPU support, AVX support etc.
It’s been true for a while now that the memory bandwidth of modern Apple systems, in tandem with the neural cores and GPU, has made them very competitive with Nvidia for local inference and even training.
- Whisper.cpp: Looking for Maintainers
- Show HN: Galene-stt: automatic captioning for the Galene videoconferencing system
- Show HN: Transcribe YouTube Videos
Not as convenient, but you could also have the user manually install the model, like whisper does.
Just forward the error message output by whisper, or even make a more user-friendly error message with instructions on how/where to download the models.
Whisper does provide a simple bash script to download models: https://github.com/ggerganov/whisper.cpp/blob/master/models/...
(As a Windows user, I can run bash scripts via Git Bash for Windows[1])
[1]: https://git-scm.com/download/win
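A minimal Python sketch of that friendly-error idea. The ggml model path and the helper-script name follow whisper.cpp's conventions, but treat the exact values here as assumptions:

```python
# Sketch: fail early with download instructions if the model is missing.
# Path and script name mirror whisper.cpp conventions (assumed).
import sys
from pathlib import Path

MODEL = Path("models/ggml-base.en.bin")

if not MODEL.exists():
    sys.exit(
        f"Model not found at {MODEL}.\n"
        "Download it first, e.g. with whisper.cpp's helper script:\n"
        "  ./models/download-ggml-model.sh base.en"
    )
```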
- OTranscribe: A free and open tool for transcribing audio interviews
- Show HN: I created automatic subtitling app to boost short videos
whisper.cpp [1] has a karaoke example that uses ffmpeg's drawtext filter to display rudimentary karaoke-like captions. It also supports diarisation. Perhaps it could be a starting point to create a better script that does what you need.
--
1: https://github.com/ggerganov/whisper.cpp/blob/master/README....
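For reference, the mechanism that karaoke example builds on is ffmpeg's drawtext filter. A rough, hypothetical invocation from Python (file names and styling are made up for illustration; the real example times one of these per word):

```python
# Burn a static caption into a video with ffmpeg's drawtext filter.
import subprocess

subprocess.run([
    "ffmpeg", "-i", "clip.mp4",
    "-vf", "drawtext=text='hello world':fontsize=48:"
           "fontcolor=white:x=(w-text_w)/2:y=h-100",
    "-codec:a", "copy",   # keep the audio stream untouched
    "captioned.mp4",
], check=True)
```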
Posts mentioning whisper
- Summarization experiments with Hugging Face Transformers - part 1
Now, I can't copy the YouTube video descriptions verbatim for SEO purposes: search engines don't like duplicate content. The obvious solution is to do some kind of summarization. First of all, I always use whisper and proofread the SRT file containing the recognized audio before uploading it to YouTube. I then copy the SRT file verbatim into an LLM chat, in chunks, and ask ChatGPT to generate a summary once I say so. This last step is tedious, and you always rely on a SaaS. Some alternatives include using local AI models via Ollama, or more specialized software that can run different tasks besides text generation.
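As a sketch of that local alternative, here is roughly how the summarization step could run against Ollama's HTTP API instead of a SaaS chat (the model name, file path, and prompt are assumptions):

```python
# Summarize an SRT transcript locally via Ollama's /api/generate endpoint.
import requests

def summarize(text: str, model: str = "llama3") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": f"Summarize this video transcript:\n\n{text}",
            "stream": False,   # return one JSON object, not a stream
        },
    )
    resp.raise_for_status()
    return resp.json()["response"]

srt_text = open("talk.srt", encoding="utf-8").read()
print(summarize(srt_text))
```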
- Show HN: TokenDagger – A tokenizer 2-4x faster than OpenAI's Tiktoken
> has a great package ecosystem
So great there are 8 of them. 800% better than all the rest!
> If you think Python is a bad language for AI integrations, try writing one in a compiled language.
I'll take this challenge, all day, every day, so long as I and the hypothetical 'move fast and break things' developer are held to the same "must run in prod" and "must be understandable by some other human" qualifiers.
What type is `array`? Don't worry your pretty head about it, feed it whatever type you want and let Sentry's TypeError sort it out <https://github.com/openai/whisper/blob/v20250625/whisper/aud...> Oh, sorry, and you wanted to know what `pad_or_trim` returns? Well that's just, like, your opinion man
- DeepSpeech Is Discontinued
It seems that the team that used to work on DeepSpeech then worked on Coqui STT (https://github.com/coqui-ai/STT) and now recommends using OpenAI Whisper (https://github.com/openai/whisper).
- OpenAI Charges by the Minute, So Make the Minutes Shorter
It's a very simple change in a vanilla python implementation. The encoder is a set of attention blocks, and the length of the attention can be changed without changing the calculation at all.
Here (https://github.com/openai/whisper/blob/main/whisper/model.py...) is the relevant code in the whisper repo. You'd just need to change the for loop to an enumerate and subsample the context along its length at the point you want. I believe it would be:
```python
for i, block in enumerate(self.blocks):
```
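Fleshed out, the modification might look like this hedged sketch of the loop in whisper's `AudioEncoder.forward` (the block index and stride here are arbitrary choices, not values from the repo):

```python
# Inside AudioEncoder.forward: subsample the audio context partway
# through the encoder stack. Attention blocks are length-agnostic, so
# the later blocks run unchanged on the shorter sequence.
SUBSAMPLE_AT = 16  # which block to subsample after (arbitrary)

for i, block in enumerate(self.blocks):
    x = block(x)
    if i == SUBSAMPLE_AT:
        x = x[:, ::2, :]  # keep every other position along the length
```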
- Show HN: WhisperBuddy, Privacy-first AI-transcription app built after my layoff
I was laid off recently and, instead of looking for another job right away, I decided to build something I always wanted: a transcription tool that respects user privacy.
Existing transcription tools often require internet connectivity, send your private audio to cloud servers, or lock you into monthly subscriptions. I wanted something different—so I built WhisperBuddy, a privacy-first AI transcription app that runs entirely on your machine.
WhisperBuddy uses OpenAI’s Whisper model (see: https://github.com/openai/whisper). While the model may not always deliver the absolute best accuracy, it provides solid performance and, most importantly, 100% privacy—no audio ever leaves your device.
I optimized the app for local performance and usability, and I’m eager to hear feedback from the community.
Thanks for checking it out!
- Auto-Generating Clips for Social Media from Live Streams with the Strands Agents SDK
To accomplish this task, I decided to try out the new Strands Agents SDK. It's a fairly new framework for building agents with a simple way to define tools the agent can use when responding to prompts. For this solution, we'll need FFmpeg and Whisper installed on the machine where the agent runs. I'll be working locally, but this could easily be converted to a server-based solution using FastAPI or another web framework and deployed to the cloud in a Docker/Podman container.
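As an illustration of that tool-definition style, a hypothetical Strands tool wrapping the local Whisper CLI might look like this sketch (the decorator usage follows the SDK's documented pattern, but verify against the Strands docs; the tool body and file handling are assumptions):

```python
# A hypothetical agent tool that shells out to the whisper CLI.
import os
import subprocess

from strands import Agent, tool

@tool
def transcribe(path: str) -> str:
    """Transcribe a media file to text with the local whisper CLI."""
    subprocess.run(
        ["whisper", path, "--model", "small", "--output_format", "txt"],
        check=True,
    )
    # whisper writes <basename>.txt into the working directory
    base = os.path.splitext(os.path.basename(path))[0]
    with open(base + ".txt", encoding="utf-8") as f:
        return f.read()

agent = Agent(tools=[transcribe])
agent("Pick the most quotable 30 seconds from stream.mp4")
```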
- From Voice to Text: Exploring Speech-to-Text Tools and APIs for Developers
Link: Whisper on GitHub (https://github.com/openai/whisper)
- 15 AI tools that almost replace a full dev team but please don’t fire us yet
Whisper: OpenAI’s speech-to-text.
- The ultimate open source stack for building AI agents
Start by hooking up speech-to-text (STT) using something like OpenAI’s Whisper if you’re going open source, or Deepgram if you want a super-accurate plug-and-play API.
- How to create video transcription with ffmpeg and whisper
```bash
# Install Homebrew if you don't have it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install ffmpeg
brew install ffmpeg

# Install Python (if needed)
brew install python

# Install Whisper
pip3 install --upgrade pip
pip3 install git+https://github.com/openai/whisper.git
```
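With everything installed, a minimal transcription run could look like the following sketch (the file name and model size are placeholders; `transcribe` shells out to ffmpeg to decode the input, which is why it accepts video files directly):

```python
# Transcribe a video file with the openai-whisper Python API.
import whisper

model = whisper.load_model("small")      # downloads weights on first use
result = model.transcribe("video.mp4")   # ffmpeg decodes the audio track
print(result["text"])
```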
What are some alternatives?
bark - 🔊 Text-Prompted Generative Audio Model
vosk-api - Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
faster-whisper - Faster Whisper transcription with CTranslate2
silero-vad - Silero VAD: pre-trained enterprise-grade Voice Activity Detector
whisperX - WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)