I made a free transcription service powered by Whisper AI

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • whisper

    Robust Speech Recognition via Large-Scale Weak Supervision
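
    For context, transcription with the upstream Python package is only a few lines. A minimal sketch, assuming `openai-whisper` and ffmpeg are installed; the audio filename is a placeholder:

        import whisper

        # Load one of the pretrained checkpoints (tiny/base/small/medium/large).
        model = whisper.load_model("base")

        # Whisper decodes the file via ffmpeg and returns text plus timestamped segments.
        result = model.transcribe("audio.wav")
        print(result["text"])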

  • It's not as if people aren't trying to do that: https://github.com/openai/whisper/discussions/264

    I tried out this notebook about a month ago, and it was rough. After spending an evening improving it, I got everything "working", but pyannote was not reliable. I tried it against an hour-ish audio sample, and I found no way to tune pyannote to keep track of ~10 speakers over the course of that audio. It would identify some of the earlier speakers, but then it felt like it lost attention and would just start labeling every new speaker as the same speaker. There is an option to force the minimum number of speakers higher, and that just caused it to split some of the earlier speakers into multiple labels. It did nothing to address the latter half of the audio.

    So, sure, someone should continue working on putting the pieces together, but I think pyannote itself needs some improvement.

    Beyond that, I think using separate models for transcription and diarization ends up being really clunky. If you have a podcast-like environment where people get excited and start talking over each other, then even if pyannote correctly identifies all of the speakers during the overlapping segments and when they spoke (impressively, I have seen pyannote do exactly that), Whisper cannot be used to separate the speakers. You end up with either duplicate transcripts attributed to everyone involved, or something worse.
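
    As a concrete illustration of the knob described above, here is a minimal sketch of that pipeline, assuming pyannote.audio's pretrained speaker-diarization pipeline and a Hugging Face access token (the token and filename are placeholders):

        from pyannote.audio import Pipeline

        pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization",
            use_auth_token="HF_TOKEN",  # placeholder Hugging Face token
        )

        # min_speakers is the "force the minimum number of speakers" option;
        # it constrains the clustering step, which is why raising it can split
        # early speakers into multiple labels without fixing drift later on.
        diarization = pipeline("hour_long_sample.wav", min_speakers=10)

        for turn, _, speaker in diarization.itertracks(yield_label=True):
            print(f"{turn.start:7.1f}s - {turn.end:7.1f}s  {speaker}")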

  • whisper.cpp

    Port of OpenAI's Whisper model in C/C++

  • This is a cool project. I’ve been very happy with Whisper as an alternative to Otter; it works better and solves real problems for me.

    I feel compelled to point out whisper.cpp. It may be a cheaper option for the author, but it is relevant for others regardless.

    I was running Whisper on a GTX 1070 to get decent performance; it was terribly slow on an M1 Mac. whisper.cpp delivers performance comparable to the 1070 while running on the M1’s CPU. It is easy to build and run, and well documented (see the sketch below).

    https://github.com/ggerganov/whisper.cpp

    I hope this doesn’t come off the wrong way, I love this project and I’m glad to see the technology democratized. Easily accessible high-quality transcription will be a game changer for many people and organizations.
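
    The sketch below shows one way to drive whisper.cpp from a script. It assumes the repo has been cloned and built per its README (`make`, plus the model download script); the binary name, flags, and paths reflect the project's documented CLI but should be verified against the repo.

        import subprocess

        # whisper.cpp expects 16 kHz mono WAV input.
        result = subprocess.run(
            [
                "./main",
                "-m", "models/ggml-base.en.bin",  # ggml model from the download script
                "-f", "samples/audio.wav",        # placeholder input file
                "-t", "8",                        # thread count; tune for your CPU
            ],
            capture_output=True,
            text=True,
            check=True,
        )
        print(result.stdout)  # timestamped transcript emitted by whisper.cpp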

  • generate-subtitles

    Generate transcripts for audio and video content with a user-friendly UI, powered by OpenAI's Whisper, with automatic translation and automatic video downloads via yt-dlp integration
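
    Under the hood, this kind of service pairs a downloader with Whisper. A rough sketch of that flow (not generate-subtitles' actual code), assuming the `yt-dlp` and `openai-whisper` Python packages, with a placeholder URL:

        import yt_dlp
        import whisper

        url = "https://example.com/watch?v=placeholder"  # hypothetical video URL

        # Fetch the best available audio track to a predictable filename.
        opts = {"format": "bestaudio", "outtmpl": "download.%(ext)s"}
        with yt_dlp.YoutubeDL(opts) as ydl:
            info = ydl.extract_info(url, download=True)
            audio_path = ydl.prepare_filename(info)

        # task="translate" would produce an English translation instead.
        result = whisper.load_model("base").transcribe(audio_path, task="transcribe")
        for seg in result["segments"]:
            print(f"{seg['start']:.2f} --> {seg['end']:.2f}: {seg['text'].strip()}")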

  • I'm just running this off of a 2x RTX A6000 server on Vast.ai at the moment, at about $1.30/h, with nginx on another server reverse-proxying to the Vast instance.

    Open an issue on the GitHub repo and we can collab for sure: https://github.com/mayeaux/generate-subtitles/issues

  • pyannote-audio

    Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

  • Free startup idea: Use Whisper with pyannote-audio[0]’s speaker diarization. Upload a recording, get back a multi-speaker annotated transcription.

    Make a JSON API and I’ll be your first customer.

    [0] https://github.com/pyannote/pyannote-audio
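
    A rough sketch of what that service's core could look like: run Whisper and pyannote-audio separately, then attribute each transcript segment to the speaker whose turn overlaps it most. The overlap heuristic is an assumption for illustration, not an established recipe (and, per the comment further up, overlapping speech remains a weak point):

        import json

        import whisper
        from pyannote.audio import Pipeline

        audio = "upload.wav"  # placeholder for the uploaded recording

        segments = whisper.load_model("base").transcribe(audio)["segments"]
        diarization = Pipeline.from_pretrained(
            "pyannote/speaker-diarization",
            use_auth_token="HF_TOKEN",  # placeholder token
        )(audio)

        def speaker_for(start, end):
            # Label the segment with the speaker whose turn overlaps it most.
            best, best_overlap = "UNKNOWN", 0.0
            for turn, _, label in diarization.itertracks(yield_label=True):
                overlap = min(end, turn.end) - max(start, turn.start)
                if overlap > best_overlap:
                    best, best_overlap = label, overlap
            return best

        annotated = [
            {
                "start": seg["start"],
                "end": seg["end"],
                "speaker": speaker_for(seg["start"], seg["end"]),
                "text": seg["text"].strip(),
            }
            for seg in segments
        ]
        print(json.dumps(annotated, indent=2))  # the JSON an API could return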

  • whisper-asr-webservice

    OpenAI Whisper ASR Webservice API

  • I think there's been talk of doing speaker diarization with whisper-asr-webservice[0], which is also written in Python and should be able to make use of goodies such as pyannote-audio, py-webrtcvad, etc.

    Whisper is great, but at the point where we're kludging various things together it starts to make more sense to use something like NVIDIA NeMo[1], which was built with all of this in mind and more.

    [0] - https://github.com/ahmetoner/whisper-asr-webservice

    [1] - https://github.com/NVIDIA/NeMo
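
    For reference, the webservice exposes a simple HTTP endpoint. A minimal client sketch, assuming the service is running locally on its default port 9000; the endpoint and field names follow its README, but verify against the repo:

        import requests

        with open("audio.wav", "rb") as f:  # placeholder audio file
            resp = requests.post(
                "http://localhost:9000/asr",
                params={"task": "transcribe", "language": "en", "output": "json"},
                files={"audio_file": f},
            )
        resp.raise_for_status()
        print(resp.json())  # transcript plus per-segment timestamps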

  • NeMo

    A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
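
    A minimal NeMo ASR sketch, assuming the `nemo_toolkit[asr]` package; the model name and the exact `transcribe()` signature vary across NeMo releases, so treat this as an outline rather than a pinned recipe:

        import nemo.collections.asr as nemo_asr

        # Pull a pretrained English ASR model from NVIDIA's catalog.
        asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_conformer_ctc_large")

        # Transcribe a local file (placeholder path).
        transcripts = asr_model.transcribe(["audio.wav"])
        print(transcripts[0])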


