Top 23 Speech Open-Source Projects
- **MockingBird**: Clone a voice in 5 seconds to generate arbitrary speech in real-time
- **datasets**: The largest hub of ready-to-use datasets for ML models, with fast, easy-to-use, and efficient data manipulation tools
- **Grounded-Segment-Anything**: Grounded-SAM: marrying Grounding DINO with Segment Anything, Stable Diffusion & Recognize Anything to automatically detect, segment, and generate anything
- **TTS**: Deep learning for text-to-speech (discussion forum: https://discourse.mozilla.org/c/tts) (by Mozilla)
- **silero-models**: Pre-trained speech-to-text, text-to-speech, and text-enhancement models made embarrassingly simple
- **aeneas**: A Python/C library and a set of tools to automagically synchronize audio and text (a.k.a. forced alignment)
- **Amazing-Python-Scripts**: Curated collection of Python scripts, from basics to advanced, including automation task scripts
- **whisper-timestamped**: Multilingual automatic speech recognition with word-level timestamps and confidence scores
Project mention: Ask HN: Open-source, local Text-to-Speech (TTS) generators | news.ycombinator.com | 2024-05-07

I just noticed that https://coqui.ai/ is "Shutting down".
I'm building a web app (React / Django) which takes a list of affirmations & goals (in Markdown files), puts them into a database (SQLite), and uses voice synthesis to create voice audio files of the phrases. These are combined with a relaxed backing track (ffmpeg), made into playlists of 10-20 phrases (randomly sampled, or according to a theme: "mind", "body", "soul"), and then played automatically in the morning & evening (cron). This allows you to persistently hear & vocalize your own goals & good vibes over time.
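The playlist-sampling step described above could be sketched roughly as follows. This is a minimal illustration, not the poster's actual code: the record shape, function name, and theme tags are all my own assumptions, and in the real app the phrases would come from the SQLite database rather than a literal list.

```python
import random

# Hypothetical affirmation records; in the described app these would be
# loaded from the SQLite database populated from the Markdown files.
AFFIRMATIONS = [
    {"text": "My mind is clear and focused.", "theme": "mind"},
    {"text": "My body is strong and rested.", "theme": "body"},
    {"text": "I am at peace with today.", "theme": "soul"},
    {"text": "I learn something new every day.", "theme": "mind"},
]

def build_playlist(phrases, theme=None, size=3):
    """Pick `size` phrases at random, optionally restricted to one theme."""
    pool = [p for p in phrases if theme is None or p["theme"] == theme]
    return random.sample(pool, min(size, len(pool)))

# A themed playlist of two phrases, as in the "mind" / "body" / "soul" idea.
playlist = build_playlist(AFFIRMATIONS, theme="mind", size=2)
```

Each selected phrase would then be synthesized and mixed over the backing track with ffmpeg, and cron would trigger the morning and evening playback.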
I had been planning to use Coqui TTS as the local text-to-speech engine, but with this cancellation, I'd love to hear from the community what is a great open-source, local text-to-speech engine?
Generally, I learn both the highest quality commercially available technology (example: ElevenLabs), and also the best open-source equivalent. Would love to hear suggestions & perspectives on this. What voice synth tools are you investing your time into learning & building with?
Project mention: 23 issues to grow yourself as an exceptional open-source Python expert | dev.to | 2023-10-19
Project mention: Amazon plans to charge for Alexa in June, unless internal conflict delays revamp | news.ycombinator.com | 2024-01-20

Yeah, Whisper is the closest thing we have, but even it requires more processing power than is present in most of these edge devices in order to feel smooth. I've started a voice interface project on a Raspberry Pi 4, and it takes about 3 seconds to produce a result. That's impressive, but not fast enough for Alexa.
From what I gather a Pi 5 can do it in 1.5 seconds, which is closer, so I suspect it's only a matter of time before we do have fully local STT running directly on speakers.
> Probably anathema to the space, but if the devices leaned into the ~five tasks people use them for (timers, weather, todo list?) could probably tighten up the AI models to be more accurate and/or resource efficient.
Yes, this is the approach taken by a lot of streaming STT systems, like Kaldi [0]. Rather than use a fully capable model, you train a specialized one that knows what kinds of things people are likely to say to it.
[0] http://kaldi-asr.org/
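The idea of constraining recognition to a handful of household commands can be illustrated independently of Kaldi (whose real decoding graphs and grammars are far more sophisticated) with a simple fuzzy match of a transcript against a small command list. Everything here, command set and function name included, is an illustrative assumption:

```python
import difflib

# A deliberately tiny command vocabulary, in the spirit of the
# "~five tasks people actually use these devices for" suggestion.
COMMANDS = [
    "set a timer",
    "what is the weather",
    "add milk to my todo list",
]

def match_command(transcript, cutoff=0.6):
    """Map a (possibly noisy) transcript to the closest known command,
    or return None if nothing is similar enough."""
    hits = difflib.get_close_matches(
        transcript.lower(), COMMANDS, n=1, cutoff=cutoff
    )
    return hits[0] if hits else None

# A slightly garbled transcript still snaps to the intended command.
best = match_command("set a timerr")
```

A constrained vocabulary like this is also why specialized models can be smaller and faster than general-purpose ones: the search space of plausible utterances is tiny.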
Project mention: Easy video transcription and subtitling with Whisper, FFmpeg, and Python | news.ycombinator.com | 2024-04-06

It uses this, which does support diarization: https://github.com/m-bain/whisperX
Coqui-ai was a commercial continuation of Mozilla TTS and STT (https://github.com/mozilla/TTS).
At the time (2018-ish), it was really impressive for on-device voice synthesis (with a quality approaching the Google and Azure cloud-based voice synthesis options) and open source, so a lot of people in the FOSS community were hoping it could be used for a privacy-respecting home assistant, Linux speech synthesis that doesn't suck, etc.
After Mozilla abandoned the project, Coqui continued development and had some really impressive one-shot voice cloning, but pivoted to marketing speech synthesis for game developers. They were probably having trouble monetizing it, and it doesn't surprise me that they shut down.
An equivalent project that's still in active development and doing really well is Piper TTS (https://github.com/rhasspy/piper).
Model as a Service https://github.com/modelscope/modelscope
Project mention: Weird A.I. Yankovic, a cursed deep dive into the world of voice cloning | news.ycombinator.com | 2023-10-02

I doubt it's currently actually "the best open source text to speech", but the answer I came up with when throwing a couple of hours at the problem some months ago was "Silero" [0, 1].
Following the "standalone" guide [2], it was pretty trivial to make the model render my sample text in about 100 English "voices" (many of which were similar to each other, and in varying quality). Sampling those, I got about 10 that were pretty "good". And maybe 6 that were the "best ones" (pretty natural, not annoying to listen to).
IIRC the license was free for noncommercial use only. I'm not sure exactly "how open source" they are, but it was simple to install the dependencies and write the basic Python to try it out; I had to write a for loop to try all the voices like I wanted. I ended up using something else for the project for other reasons, but this could still be a fairly good backup option for some use cases IMO.
[0] https://github.com/snakers4/silero-models#text-to-speech
For our real-time TTS needs, we'll employ the fantastic library called gTTS.
https://github.com/MahmoudAshraf97/whisper-diarization
This project has been alright for transcribing audio with speaker diarization. A bit finicky. The OpenAI model is better than other paid products (Descript, Riverside), so I'm looking forward to trying MacWhisper.
Avinashkranjan/Amazing-Python-Scripts - A collection of innovative Python scripts for tasks like web scraping and automation can be found in this repository. It's a great source of inspiration for creating projects. https://github.com/avinashkranjan/Amazing-Python-Scripts
You mean remove background noise and transcribe? Then you can use DeepFilterNet to remove noise, and Whisper to transcribe.
Project mention: Show HN: AI Dub Tool I Made to Watch Foreign Language Videos with My 7-Year-Old | news.ycombinator.com | 2024-02-28

Yes. But Whisper's word-level timings are actually quite inaccurate out of the box. There are some Python libraries that mitigate that. I tested several of them; whisper-timestamped seems to be the best one. [0]
[0] https://github.com/linto-ai/whisper-timestamped
This brings back memories.
I worked my way through some of its source code many years ago during my post-graduate studies and it was very _strange_. I see it is now on GitHub [0].
They used C macros to implement object oriented programming, with symbols like `me` and `my` and `thee` scattered throughout the source code. It seems the code has been converted to C++ (IIRC it used to be in C), but I still see the `my` keyword in there.
They have their own BASIC-like scripting language. The weirdest property for me was that it allowed for whitespace in the identifiers. Just look at the example in [1]: The `Create simple Matrix` is actually a function in the scripting language that constructs a matrix object. The function name corresponds to a menu item and IIRC they used some more preprocessor magic to reuse the same code for the menus on the GUI and the functions in the scripting language.
I don't think you're supposed to write the scripts by hand. Rather it recorded your actions as you worked your way through the GUI and then you could export and modify those recordings as scripts.
They also implemented their own cross-platform GUI toolkit rather than using an existing one, so it works on Windows, Linux (or any X Window System, I believe), and macOS.
[0]: https://github.com/praat/praat
Speech related posts
- Praat: Doing Phonetics by Computer
- Easy video transcription and subtitling with Whisper, FFmpeg, and Python
- Using Groq to Build a Real-Time Language Translation App
- OpenAI deems its voice cloning tool too risky for general release
- SOTA ASR Tooling: Long-Form Transcription
- Deploying whisperX on AWS SageMaker as Asynchronous Endpoint
- Show HN: AI Dub Tool I Made to Watch Foreign Language Videos with My 7-Year-Old
Index
What are some of the best open-source Speech projects? This list will help you:
# | Project | Stars
---|---|---
1 | MockingBird | 33,904 |
2 | TTS | 29,631 |
3 | datasets | 18,480 |
4 | Kaldi Speech Recognition Toolkit | 13,768 |
5 | Grounded-Segment-Anything | 13,615 |
6 | AudioGPT | 9,788 |
7 | whisperX | 9,173 |
8 | TTS | 8,845 |
9 | annyang | 6,547 |
10 | EmotiVoice | 6,330 |
11 | modelscope | 6,099 |
12 | silero-models | 4,584 |
13 | lingvo | 2,777 |
14 | aeneas | 2,379 |
15 | gTTS | 2,149 |
16 | whisper-diarization | 2,066 |
17 | Amazing-Python-Scripts | 2,024 |
18 | DeepFilterNet | 1,952 |
19 | julius | 1,778 |
20 | soloud | 1,655 |
21 | whisper-timestamped | 1,547 |
22 | ai-audio-startups | 1,454 |
23 | praat | 1,386 |