Speech

Top 23 Speech Open-Source Projects

  • MockingBird

    🚀 Clone a voice in 5 seconds to generate arbitrary speech in real-time

  • TTS

    🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production

  • Project mention: Ask HN: Open-source, local Text-to-Speech (TTS) generators | news.ycombinator.com | 2024-05-07

    I just noticed that https://coqui.ai/ is "Shutting down".

    I'm building a web app (React / Django) which takes a list of affirmations & goals (in Markdown files), puts them into a database (SQLite), and uses voice synthesis to create voice audio files of the phrases. These are combined with a relaxed backing track (ffmpeg), made into playlists of 10-20 phrases (randomly sampled, or according to a theme: "mind" "body" "soul") and then play automatically in the morning & evening (cron). This allows you to persistently hear & vocalize your own goals & good vibes over time.

    I had been planning to use Coqui TTS as the local text-to-speech engine, but with this cancellation, I'd love to hear from the community what is a great open-source, local text-to-speech engine?

    Generally, I learn both the highest quality commercially available technology (example: ElevenLabs), and also the best open-source equivalent. Would love to hear suggestions & perspectives on this. What voice synth tools are you investing your time into learning & building with?
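
The playlist step described in this post (10-20 phrases, sampled at random or restricted to a theme) can be sketched in a few lines of Python. This is only an illustration, not the poster's code: the function name `build_playlist`, the dict-of-lists phrase store, and the sample phrases are all assumptions, and the TTS, ffmpeg, and cron steps are left out.

```python
import random


def build_playlist(phrases_by_theme, theme=None, min_len=10, max_len=20, rng=None):
    """Pick a playlist of distinct phrases.

    phrases_by_theme: hypothetical mapping of theme name -> list of phrases
    (the post stores these in SQLite; a dict stands in for that here).
    theme: restrict sampling to one theme, or None to sample across all themes.
    """
    rng = rng or random.Random()
    if theme is not None:
        pool = list(phrases_by_theme.get(theme, []))
    else:
        pool = [p for phrases in phrases_by_theme.values() for p in phrases]
    if not pool:
        return []
    # Aim for 10-20 phrases, but never ask for more than the pool holds.
    size = min(len(pool), rng.randint(min_len, max_len))
    return rng.sample(pool, size)


if __name__ == "__main__":
    # Made-up placeholder data for the three themes named in the post.
    demo = {
        theme: [f"{theme} affirmation {i}" for i in range(1, 31)]
        for theme in ("mind", "body", "soul")
    }
    for line in build_playlist(demo, theme="mind"):
        print(line)
```

Passing a seeded `random.Random` as `rng` makes a playlist reproducible, which helps when the audio for a given morning or evening run needs to be regenerated.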

  • datasets

    🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

  • Project mention: 23 issues to grow yourself as an exceptional open-source Python expert 🧑‍💻 🥇 | dev.to | 2023-10-19
  • Kaldi Speech Recognition Toolkit

    kaldi-asr/kaldi is the official location of the Kaldi project.

  • Project mention: Amazon plans to charge for Alexa in June–unless internal conflict delays revamp | news.ycombinator.com | 2024-01-20

    Yeah, whisper is the closest thing we have, but even it requires more processing power than is present in most of these edge devices in order to feel smooth. I've started a voice interface project on a Raspberry Pi 4, and it takes about 3 seconds to produce a result. That's impressive, but not fast enough for Alexa.

    From what I gather a Pi 5 can do it in 1.5 seconds, which is closer, so I suspect it's only a matter of time before we do have fully local STT running directly on speakers.

    > Probably anathema to the space, but if the devices leaned into the ~five tasks people use them for (timers, weather, todo list?) could probably tighten up the AI models to be more accurate and/or resource efficient.

    Yes, this is the approach taken by a lot of streaming STT systems, like Kaldi [0]. Rather than use a fully capable model, you train a specialized one that knows what kinds of things people are likely to say to it.

    [0] http://kaldi-asr.org/

  • Grounded-Segment-Anything

    Grounded-SAM: Marrying Grounding-DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect , Segment and Generate Anything

  • Project mention: Tooling for bulk image data set manipulation? | /r/computervision | 2023-06-27
  • AudioGPT

    AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

  • whisperX

    WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

  • Project mention: Easy video transcription and subtitling with Whisper, FFmpeg, and Python | news.ycombinator.com | 2024-04-06

    It uses this, which does support diarization: https://github.com/m-bain/whisperX

  • TTS

    🤖 💬 Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts) (by mozilla)

  • Project mention: Coqui.ai Is Shutting Down | news.ycombinator.com | 2024-01-03

    Coqui-ai was a commercial continuation of Mozilla TTS and STT (https://github.com/mozilla/TTS).

    At the time (2018-ish), it was really impressive for on-device voice synthesis (with a quality approaching the Google and Azure cloud-based voice synthesis options) and open source, so a lot of people in the FOSS community were hoping it could be used for a privacy-respecting home assistant, Linux speech synthesis that doesn't suck, etc.

    After Mozilla abandoned the project, Coqui continued development and had some really impressive one-shot voice cloning, but pivoted to marketing speech synthesis for game developers. They were probably having trouble monetizing it, and it doesn't surprise me that they shut down.

    An equivalent project that's still in active development and doing really well is Piper TTS (https://github.com/rhasspy/piper).

  • annyang

    💬 Speech recognition for your site

  • EmotiVoice

    EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine

  • Project mention: FLaNK Stack Weekly 12 February 2024 | dev.to | 2024-02-12
  • modelscope

    ModelScope: bring the notion of Model-as-a-Service to life.

  • Project mention: FLaNK Stack Weekly for 20 June 2023 | dev.to | 2023-06-20

    Model as a Service https://github.com/modelscope/modelscope

  • silero-models

    Silero Models: pre-trained speech-to-text, text-to-speech and text-enhancement models made embarrassingly simple

  • Project mention: Weird A.I. Yankovic, a cursed deep dive into the world of voice cloning | news.ycombinator.com | 2023-10-02

    I doubt it's currently actually "the best open source text to speech", but the answer I came up with when throwing a couple of hours at the problem some months ago was "Silero" [0, 1].

    Following the "standalone" guide [2], it was pretty trivial to make the model render my sample text in about 100 English "voices" (many of which were similar to each other, and in varying quality). Sampling those, I got about 10 that were pretty "good". And maybe 6 that were the "best ones" (pretty natural, not annoying to listen to).

    IIRC the license was free for noncommercial use only. I'm not sure exactly "how open source" they are, but it was simple to install the dependencies and write the basic Python to try it out; I had to write a for loop to try all the voices like I wanted. I ended up using something else for the project for other reasons, but this could still be a fairly good backup option for some use cases IMO.

    [0] https://github.com/snakers4/silero-models#text-to-speech

  • lingvo

    Lingvo

  • aeneas

    aeneas is a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment)

  • gTTS

    Python library and CLI tool to interface with Google Translate's text-to-speech API

  • Project mention: Using Groq to Build a Real-Time Language Translation App | dev.to | 2024-04-05

    For our real-time TTS needs, we'll employ the fantastic library called gTTS.

  • whisper-diarization

    Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper

  • Project mention: MacWhisper: Transcribe audio files on your Mac | news.ycombinator.com | 2023-08-23

    https://github.com/MahmoudAshraf97/whisper-diarization

    This project has been alright for transcribing audio with speaker diarization. A bit finicky. The OpenAI model is better than other paid products (Descript, Riverside) so I'm looking forward to trying MacWhisper.

  • Amazing-Python-Scripts

    🚀 Curated collection of Amazing Python scripts from Basics to Advanced with automation task scripts.

  • Project mention: Top 10 GitHub Repositories for Python and Java Developers | dev.to | 2024-05-03

    7. Avinashkranjan/Amazing-Python-Scripts - A collection of innovative Python scripts for tasks like web scraping and automation can be found in this repository. It's a great source of inspiration for creating projects. https://github.com/avinashkranjan/Amazing-Python-Scripts

  • DeepFilterNet

    Noise suppression using deep filtering

  • Project mention: Anyone know of a good TTS pipeline for raw speech data? | /r/AudioAI | 2023-10-03

    You mean remove background noise and transcribe? Then you can use DeepFilterNet to remove noise, and Whisper to transcribe.

  • julius

    Open-Source Large Vocabulary Continuous Speech Recognition Engine (by julius-speech)

  • soloud

    Free, easy, portable audio engine for games

  • whisper-timestamped

    Multilingual Automatic Speech Recognition with word-level timestamps and confidence

  • Project mention: Show HN: AI Dub Tool I Made to Watch Foreign Language Videos with My 7-Year-Old | news.ycombinator.com | 2024-02-28

    Yes. But Whisper's word-level timings are actually quite inaccurate out of the box. There are some Python libraries that mitigate that. I tested several of them. whisper-timestamped seems to be the best one. [0]

    [0] https://github.com/linto-ai/whisper-timestamped
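
Word-level timings matter for subtitling and dubbing because every cue needs its own start and end time. Below is a minimal sketch of grouping timed words into SRT cues. It assumes each word is a dict with `text`, `start`, and `end` keys (times in seconds). That shape, the helper names, and the `max_words` / `max_gap` thresholds are illustrative assumptions, not the documented API of any library mentioned here.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7, max_gap=0.6):
    """Group timed words into subtitle cues and render them as SRT text.

    A new cue starts when the current one already holds `max_words` words
    or when the pause before the next word exceeds `max_gap` seconds.
    """
    cues, current = [], []
    for word in words:
        if current:
            pause = word["start"] - current[-1]["end"]
            if len(current) >= max_words or pause > max_gap:
                cues.append(current)
                current = []
        current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = format_timestamp(cue[0]["start"])
        end = format_timestamp(cue[-1]["end"])
        text = " ".join(w["text"].strip() for w in cue)
        blocks.append(f"{index}\n{start} --> {end}\n{text}")
    return "\n\n".join(blocks) + ("\n" if blocks else "")
```

The cue boundaries are only as good as the word timings fed in, which is why the accuracy problem raised in the comment above matters for this kind of tool.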

  • ai-audio-startups

    Community list of startups working with AI in audio and music technology

  • praat

    Praat: Doing Phonetics By Computer

  • Project mention: Praat: Doing Phonetics by Computer | news.ycombinator.com | 2024-04-18

    This brings back memories.

    I worked my way through some of its source code many years ago during my post-graduate studies and it was very _strange_. I see it is now on GitHub [0].

    They used C macros to implement object oriented programming, with symbols like `me` and `my` and `thee` scattered throughout the source code. It seems the code has been converted to C++ (IIRC it used to be in C), but I still see the `my` keyword in there.

    They have their own BASIC-like scripting language. The weirdest property for me was that it allowed for whitespace in the identifiers. Just look at the example in [1]: The `Create simple Matrix` is actually a function in the scripting language that constructs a matrix object. The function name corresponds to a menu item and IIRC they used some more preprocessor magic to reuse the same code for the menus on the GUI and the functions in the scripting language.

    I don't think you're supposed to write the scripts by hand. Rather it recorded your actions as you worked your way through the GUI and then you could export and modify those recordings as scripts.

    They also implemented their own cross-platform GUI toolkit rather than using one of the existing cross-platform GUI toolkits, so it works on Windows, Linux (or any X Windows I believe) and macOS.

    [0]: https://github.com/praat/praat

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Speech related posts

  • Praat: Doing Phonetics by Computer

    1 project | news.ycombinator.com | 18 Apr 2024
  • Easy video transcription and subtitling with Whisper, FFmpeg, and Python

    1 project | news.ycombinator.com | 6 Apr 2024
  • Using Groq to Build a Real-Time Language Translation App

    3 projects | dev.to | 5 Apr 2024
  • OpenAI deems its voice cloning tool too risky for general release

    1 project | news.ycombinator.com | 31 Mar 2024
  • SOTA ASR Tooling: Long-Form Transcription

    1 project | news.ycombinator.com | 31 Mar 2024
  • Deploying whisperX on AWS SageMaker as Asynchronous Endpoint

    2 projects | dev.to | 31 Mar 2024
  • Show HN: AI Dub Tool I Made to Watch Foreign Language Videos with My 7-Year-Old

    1 project | news.ycombinator.com | 28 Feb 2024

Index

What are some of the best open-source Speech projects? This list will help you:

 #  Project                            Stars
 1  MockingBird                       33,904
 2  TTS                               29,631
 3  datasets                          18,480
 4  Kaldi Speech Recognition Toolkit  13,768
 5  Grounded-Segment-Anything         13,615
 6  AudioGPT                           9,788
 7  whisperX                           9,173
 8  TTS                                8,845
 9  annyang                            6,547
10  EmotiVoice                         6,330
11  modelscope                         6,099
12  silero-models                      4,584
13  lingvo                             2,777
14  aeneas                             2,379
15  gTTS                               2,149
16  whisper-diarization                2,066
17  Amazing-Python-Scripts             2,024
18  DeepFilterNet                      1,952
19  julius                             1,778
20  soloud                             1,655
21  whisper-timestamped                1,547
22  ai-audio-startups                  1,454
23  praat                              1,386
