Whisper – open source speech recognition by OpenAI

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • whisper

    Robust Speech Recognition via Large-Scale Weak Supervision

  • It seems like OpenAI are finally living up to their name for once with this release? Anything I'm missing?

    From what I can gather:

    1. Includes model weights. I can't find the URL, but they reference them enough and have a CLI tool, so I presume I just haven't found them yet.

    2. Includes code: https://github.com/openai/whisper

    3. Released under MIT License: https://github.com/openai/whisper/blob/main/LICENSE
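
    For reference, the basic Python usage documented in that repo's README is roughly the following (the model name and audio path are just placeholders):

        import whisper

        # Loads the weights, downloading and caching them on first use.
        model = whisper.load_model("base")

        # Transcribe an audio file; the result includes the detected language,
        # timestamped segments, and the full text.
        result = model.transcribe("audio.mp3")
        print(result["text"])

    The bundled CLI does the same thing, e.g. `whisper audio.mp3 --model base`.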

  • stable-diffusion

    A latent text-to-image diffusion model

  • What "script" are you using for doing txt2img? The watermark function is automatically called when you use the CLI in two places, https://github.com/CompVis/stable-diffusion/blob/69ae4b35e0a... and https://github.com/CompVis/stable-diffusion/blob/69ae4b35e0a...

    Trivial to remove, I'll give you that. But AFAIK, the original repository and its forks apply the watermark automatically unless you've removed it yourself.
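
    For context, the watermarking helper those two call sites use looks roughly like the sketch below (reconstructed from memory, using the invisible-watermark package; the exact watermark string in the repo may differ):

        import cv2
        import numpy as np
        from PIL import Image
        from imwatermark import WatermarkEncoder  # the "invisible-watermark" package

        # Embed a fixed byte string into the generated image as a DWT+DCT watermark.
        wm_encoder = WatermarkEncoder()
        wm_encoder.set_watermark('bytes', "StableDiffusionV1".encode('utf-8'))

        def put_watermark(img: Image.Image) -> Image.Image:
            bgr = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
            bgr = wm_encoder.encode(bgr, 'dwtDct')
            return Image.fromarray(bgr[:, :, ::-1])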

  • wer_are_we

    Attempt at tracking the state of the art and recent results (bibliography) on speech recognition.

  • The authors do explicitly state that they're trying to do a lot of fancy new stuff here, like being multilingual, rather than just pursuing accuracy.

    [1] https://github.com/syhw/wer_are_we

  • whisper

  • I'm giving up for the night, but https://github.com/Smaug123/whisper/pull/1/files at least contains the setup instructions that may help others get to this point.

  • mycroft-core

    Mycroft Core, the Mycroft Artificial Intelligence platform.

  • As far as TTS goes, Mycroft.ai[0] has released a decent offline one.

    [0] https://mycroft.ai/

  • NeMo

    A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

  • vosk-api

    Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node

  • dragonfly

    Speech recognition framework allowing powerful Python-based scripting and extension of Dragon NaturallySpeaking (DNS), Windows Speech Recognition (WSR), Kaldi and CMU Pocket Sphinx (by dictation-toolbox)

  • openai-whisper-realtime

    A quick experiment to achieve almost realtime transcription using Whisper.

  • I tried running it in realtime with live audio input (kind of).

    You can find the Python script in this repo: https://github.com/tobiashuttinger/openai-whisper-realtime
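
    The general idea is to capture microphone audio into a buffer and periodically run the model over it. A minimal sketch of that approach, assuming the sounddevice package for capture (not necessarily what the repo above uses, and with no locking for brevity):

        import numpy as np
        import sounddevice as sd
        import whisper

        SAMPLE_RATE = 16000   # Whisper expects 16 kHz mono audio
        CHUNK_SECONDS = 5     # how much audio to accumulate per transcription

        model = whisper.load_model("base")
        buffer = np.zeros(0, dtype=np.float32)

        def callback(indata, frames, time, status):
            # Append each captured block of float32 samples to the buffer.
            global buffer
            buffer = np.concatenate([buffer, indata[:, 0]])

        with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                            dtype="float32", callback=callback):
            while True:
                if len(buffer) >= SAMPLE_RATE * CHUNK_SECONDS:
                    chunk, buffer = buffer, np.zeros(0, dtype=np.float32)
                    # transcribe() accepts a float32 NumPy array directly.
                    print(model.transcribe(chunk, fp16=False)["text"])
                sd.sleep(100)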

  • trashbot

    Trashbot helper AI assistant

  • One thing they don't touch much on is the STT, as they use models from third parties. You could definitely do something that utilizes this model and then feeds the tokens to some of their parsing code. I've been working on something similar to this, but burned out around adding the STT portion [0].

    [0]: https://github.com/Sheepybloke2-0/trashbot - It was called trashbot because the final implementation was going to look like Oscar the Grouch in a trash can displaying the reminders.
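
    The wiring described there could look something like the following sketch (parse_command stands in for the assistant's own intent-parsing code and is purely hypothetical):

        import whisper

        def parse_command(text: str) -> str:
            # Hypothetical stand-in for the assistant's existing parsing code.
            return "remind" if "remind" in text.lower() else "unknown"

        model = whisper.load_model("base")

        def handle_utterance(audio_path: str) -> str:
            # Speech-to-text with Whisper, then hand the plain text to the parser.
            text = model.transcribe(audio_path)["text"]
            return parse_command(text)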

  • py-webrtcvad

    Python interface to the WebRTC Voice Activity Detector

  • Haven’t tried it yet but love the concept!

    Have you thought of using VAD (voice activity detection) for breaks? Back in my day (a long time ago) the WebRTC VAD stuff was considered decent:

    https://github.com/wiseman/py-webrtcvad

    Model isn’t optimized for this use but I like where you’re headed!
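
    For reference, using it to find pauses is roughly this (a sketch assuming 16 kHz, 16-bit mono PCM input; the library also accepts 8, 32, and 48 kHz):

        import webrtcvad

        SAMPLE_RATE = 16000        # webrtcvad supports 8/16/32/48 kHz
        FRAME_MS = 30              # frames must be 10, 20, or 30 ms long
        FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit samples

        vad = webrtcvad.Vad(2)     # aggressiveness from 0 (least) to 3 (most)

        def speech_flags(pcm: bytes):
            # Yield (is_speech, frame) for consecutive 30 ms frames; a run of
            # non-speech frames marks a natural place to cut a chunk.
            for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
                frame = pcm[i:i + FRAME_BYTES]
                yield vad.is_speech(frame, SAMPLE_RATE), frame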

  • plaidml

    PlaidML is a framework for making deep learning work everywhere.

  • It understands my Swedish attempts at English really well with the medium.en model. (Although it gives me a funny warning: `UserWarning: medium.en is an English-only model but received 'English'; using English instead.` I guess it doesn't want to be told to use English when that's all it can do.)

    However, it runs very slowly. It uses the CPU on my MacBook, presumably because it doesn't have an Nvidia card.

    Googling around, I found [plaidML](https://github.com/plaidml/plaidml), a project promising to run ML on many different GPU architectures. Does anyone know whether it is possible to plug them together somehow? I am not an ML researcher and don't really understand the technical details of the domain, but I can understand and write Python code in domains I do understand, so I could do some glue work if required.
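
    Not a PlaidML answer (as far as I know, PlaidML plugs into Keras-style frontends rather than PyTorch, which Whisper is built on), but on the PyTorch side device selection looks roughly like this; whether the "mps" path actually works for Whisper depends on the PyTorch and Whisper versions:

        import torch
        import whisper

        # Pick the best available PyTorch device: CUDA GPU, Apple's Metal
        # backend ("mps") if present, otherwise CPU.
        if torch.cuda.is_available():
            device = "cuda"
        elif getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
            device = "mps"
        else:
            device = "cpu"

        model = whisper.load_model("medium.en", device=device)
        # fp16 only helps on GPU; on CPU it just produces a warning.
        result = model.transcribe("audio.mp3", fp16=(device != "cpu"))
        print(result["text"])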

  • DeepSpeech-examples

    Examples of how to use or integrate DeepSpeech

  • Perhaps this could be adapted? https://github.com/mozilla/DeepSpeech-examples/blob/r0.9/mic...

