Show HN: Port of OpenAI's Whisper model in C/C++

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • whisper.cpp

    Port of OpenAI's Whisper model in C/C++

  • Hi HN,

    OpenAI recently released a model for automatic speech recognition called Whisper [0]. I decided to reimplement the inference of the model from scratch using C/C++. To achieve this, I implemented a minimalistic tensor library in C and ported the high-level architecture of the model to C++. The entire implementation is less than 8000 lines of code, contained in just 2 source files, without any third-party dependencies. The GitHub project is here:

    https://github.com/ggerganov/whisper.cpp
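
    For a flavor of the tensor library, here is a minimal sketch of building and evaluating a computation graph with the ggml C API, based on my reading of ggml.h in the repository (treat the exact names and struct fields as approximate; the header is authoritative):

      #include "ggml.h"

      int main(void) {
          // Allocate one fixed-size arena that holds all tensors and graph metadata.
          struct ggml_init_params params = {
              .mem_size   = 16*1024*1024,
              .mem_buffer = NULL,
          };
          struct ggml_context * ctx = ggml_init(params);

          // Declare two 4x4 FP32 matrices and their matrix product.
          struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 4);
          struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 4);
          struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);

          // Build the forward graph that ends at c and evaluate it on the CPU.
          struct ggml_cgraph gf = ggml_build_forward(c);
          ggml_graph_compute(ctx, &gf);

          ggml_free(ctx);
          return 0;
      }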

    With this implementation, I can very easily build and run the model with “make base.en”. It also allows me to run it on a wide range of devices. For example, I have provided examples of running the model on an iPhone, a Raspberry Pi 4, and even in a web page via WebAssembly!
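
    Beyond the command-line tool, the library exposes a C API. The following is a rough usage sketch based on the whisper.h header as of this post (some names, e.g. whisper_init, have been revised upstream since, so check the header for the current interface; audio decoding is omitted):

      #include <stdio.h>
      #include "whisper.h"

      int main(void) {
          // Load a ggml model file converted from the original Whisper weights.
          struct whisper_context * ctx = whisper_init("models/ggml-base.en.bin");
          if (ctx == NULL) return 1;

          // Greedy sampling with default parameters.
          struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

          // pcmf32 must hold 16 kHz mono float PCM; a real program would decode a WAV file here.
          float pcmf32[16000] = {0};
          const int n_samples = 16000;

          if (whisper_full(ctx, wparams, pcmf32, n_samples) == 0) {
              const int n = whisper_full_n_segments(ctx);
              for (int i = 0; i < n; i++) {
                  printf("%s\n", whisper_full_get_segment_text(ctx, i));
              }
          }

          whisper_free(ctx);
          return 0;
      }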

    The implementation runs fully on the CPU and makes use of FP16 arithmetic, AVX intrinsics on x86 architectures, and NEON + the Accelerate framework on Apple Silicon. The latter is especially efficient: I observe that inference is about 2-3 times faster compared to the current PyTorch implementation provided by OpenAI when running it on my MacBook M1 Pro. The WASM port utilizes 128-bit SIMD intrinsics - a feature supported in some modern web browsers [1].
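
    As an illustration of the kind of kernel this enables, here is a hypothetical dot product over FP16 weights using AVX2 + FMA + F16C intrinsics. This is a sketch in the spirit of the implementation, not the actual whisper.cpp code; n is assumed to be a multiple of 8. Build with -mavx2 -mfma -mf16c:

      #include <immintrin.h>

      // Weights stored as FP16 (raw 16-bit values), activations as FP32, accumulation in FP32.
      float dot_f16_f32(const unsigned short * x_f16, const float * y, int n) {
          __m256 acc = _mm256_setzero_ps();
          for (int i = 0; i < n; i += 8) {
              // Convert 8 half-precision values to single precision, then fused multiply-add.
              __m256 xv = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(x_f16 + i)));
              __m256 yv = _mm256_loadu_ps(y + i);
              acc = _mm256_fmadd_ps(xv, yv, acc);
          }
          // Horizontal sum of the 8 accumulator lanes.
          __m128 lo = _mm256_castps256_ps128(acc);
          __m128 hi = _mm256_extractf128_ps(acc, 1);
          __m128 s  = _mm_add_ps(lo, hi);
          s = _mm_hadd_ps(s, s);
          s = _mm_hadd_ps(s, s);
          return _mm_cvtss_f32(s);
      }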

    I am very happy with the performance that I observe on Apple Silicon devices. I didn’t expect that the Accelerate framework [2] (i.e. CBLAS) offers such a dramatic performance boost for matrix multiplications so I was very pleasantly surprised! To enable the framework in your C/C++ projects, all you have to do is add `-framework Accelerate` to your clang command-line flags.
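
    For example, a single-precision matrix multiply through Accelerate's CBLAS interface looks like this (C = A*B with row-major MxK and KxN inputs; build with `clang matmul.c -framework Accelerate`):

      #include <Accelerate/Accelerate.h>

      // C (MxN) = A (MxK) * B (KxN), all row-major single precision.
      void matmul(const float * A, const float * B, float * C, int M, int N, int K) {
          cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                      M, N, K,
                      1.0f, A, K,
                      B, N,
                      0.0f, C, N);
      }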

    This entire exercise of implementing the Whisper model was very interesting to me and helped me understand a lot about how the transformer architecture works. I also got a lot of positive feedback from people finding and using my project. We brainstormed a lot of interesting tools that could potentially be built with this library (such as a speech-to-text plugin for Vim, an RPi4 voice assistant, a WASM chat bot, etc.). If interested, check out the “Examples” section and the “Show and tell” discussions for some ideas!

    Would love to know what you think about this project and about your experience with using the Accelerate framework in any of your projects.

  • whisper

    Robust Speech Recognition via Large-Scale Weak Supervision

  • Enzyme

    High-performance automatic differentiation of LLVM and MLIR. (by EnzymeAD)

  • https://ispc.github.io/ispc.html

    For auto-differentiation, when I need performance or memory efficiency I currently use Tapenade ( http://tapenade.inria.fr:8080/tapenade/index.jsp ) and/or manually written gradients when I need to fuse some kernels, but Enzyme ( https://enzyme.mit.edu/ ) is also very promising (a minimal example follows below).

    MPI for parallelization across machines.
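
    For reference, the canonical Enzyme example from its documentation differentiates a plain C function at the LLVM level; the body of __enzyme_autodiff is synthesized by the Enzyme compiler plugin:

      double square(double x) { return x * x; }

      // Declared, not defined: the Enzyme LLVM plugin generates the gradient code.
      extern double __enzyme_autodiff(void *, double);

      double dsquare(double x) {
          // Returns d(square)/dx evaluated at x, i.e. 2*x.
          return __enzyme_autodiff((void *) square, x);
      }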

  • amx

    Apple AMX Instruction Set

  • You are correct in that those are the four compute options.

    My understanding is that the AMX is more tightly coupled to the CPU, ultimately being accessible via an instruction set (https://github.com/corsix/amx), and it is useful if you need to do matrix multiplications interleaved with other CPU tasks. A common example would be a VIO loop or something where you want that data in the CPU caches.

    The GPU and Neural Engine are not like that: they take some time to set up and initialize. They can also parallelize tasks to a much higher degree. The GPU is the more generalizable of the two, because you can write compute shaders to do anything in parallel, but it uses a lot of resources. I'll have to check out the PR to see how exactly the MPS shaders match up with the task at hand, because you could also consider writing Metal compute shaders by hand (see the sketch after this comment).

    I know the least about the ANE, but it has specific hardware for running ML models, and you have to process the weights ahead of time to make sure they are in the right format. It can run ML models very efficiently and is the most battery friendly.
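
    To make the hand-written compute shader option concrete, here is a hypothetical naive matrix-multiply kernel in the Metal Shading Language (a C++-based language). One thread computes one element of C; the buffer bindings are illustrative, not taken from any real PR:

      #include <metal_stdlib>
      using namespace metal;

      // C (MxN) = A (MxK) * B (KxN); dispatch one thread per output element.
      kernel void matmul_naive(device const float * A [[buffer(0)]],
                               device const float * B [[buffer(1)]],
                               device       float * C [[buffer(2)]],
                               constant     uint  & K [[buffer(3)]],
                               constant     uint  & N [[buffer(4)]],
                               uint2 gid [[thread_position_in_grid]])
      {
          float acc = 0.0f;
          for (uint k = 0; k < K; ++k) {
              acc += A[gid.y * K + k] * B[k * N + gid.x];
          }
          C[gid.y * N + gid.x] = acc;
      }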

  • Halide

    a language for fast, portable data-parallel computation

  • I suggest looking into Halide, as it will make trying different paths much easier (https://halide-lang.org/). A small sketch follows below.

    I haven't looked at your code closely, so I can't say with certainty that it would be the right fit, but it's worth a look.
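
    As a taste of what that looks like, here is a hypothetical Halide pipeline for a matrix multiply. Halide's point is that the algorithm and the schedule (parallelization, vectorization, tiling) are specified separately; the schedule shown is just one possible choice:

      #include "Halide.h"
      using namespace Halide;

      int main() {
          const int K = 512;                          // reduction extent
          ImageParam A(Float(32), 2), B(Float(32), 2);

          Var i("i"), j("j");
          RDom k(0, K);

          // Algorithm: C(i, j) = sum over k of A(i, k) * B(k, j).
          Func C("C");
          C(i, j)  = 0.0f;
          C(i, j) += A(i, k) * B(k, j);

          // Schedule: parallelize across rows, vectorize along columns.
          // Trying a different path is a one-line change here.
          C.parallel(j).vectorize(i, 8);
          C.update().parallel(j).vectorize(i, 8);

          C.compile_jit();
          return 0;
      }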

  • whisper.cpp

    Port of OpenAI's Whisper model in C/C++ (by Const-me)

  • Feel free to merge my fork, which is about 20% faster on my computer (Ryzen 7 5700G CPU, medium.en model): https://github.com/Const-me/whisper.cpp It also contains VS2022 projects for building on Windows; your CMake project produces builds with AVX disabled, even though AVX is critical for performance.

    Also, I didn’t really understand your multithreading code in the ggml_graph_compute function, but that custom thread pool implementation looks suspicious to me: just too many atomics. It might be possible to improve it a lot with a better multithreading strategy.
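
    One conventional alternative to busy-waiting on atomics is a pool in which idle workers sleep on a condition variable. The sketch below is purely illustrative of that strategy; it is not whisper.cpp's code, and a real graph scheduler would also need work batching and completion synchronization:

      #include <condition_variable>
      #include <functional>
      #include <mutex>
      #include <queue>
      #include <thread>
      #include <vector>

      class ThreadPool {
      public:
          explicit ThreadPool(size_t n_threads) {
              for (size_t i = 0; i < n_threads; ++i) {
                  workers.emplace_back([this] {
                      for (;;) {
                          std::function<void()> task;
                          {
                              std::unique_lock<std::mutex> lock(m);
                              // Sleep until there is work or the pool is shutting down.
                              cv.wait(lock, [this] { return stop || !tasks.empty(); });
                              if (stop && tasks.empty()) return;
                              task = std::move(tasks.front());
                              tasks.pop();
                          }
                          task(); // run outside the lock
                      }
                  });
              }
          }

          ~ThreadPool() {
              {
                  std::lock_guard<std::mutex> lock(m);
                  stop = true;
              }
              cv.notify_all();
              for (auto & w : workers) w.join();
          }

          void enqueue(std::function<void()> task) {
              {
                  std::lock_guard<std::mutex> lock(m);
                  tasks.push(std::move(task));
              }
              cv.notify_one();
          }

      private:
          std::vector<std::thread> workers;
          std::queue<std::function<void()>> tasks;
          std::mutex m;
          std::condition_variable cv;
          bool stop = false;
      };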
