NeMo vs pyannote-audio

NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech) (by NVIDIA)

Source Code

docs.nvidia.com

pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding (by pyannote)

Pytorch speech-processing speaker-diarization speech-activity-detection speaker-change-detection speaker-embedding voice-activity-detection pretrained-models overlapped-speech-detection speaker-recognition speaker-verification

Source Code

pyannote.github.io

Project          NeMo                  pyannote-audio
Mentions         29                    15
Stars            10,084                5,027
Growth           7.1%                  7.0%
Activity         9.8                   8.6
Latest Commit    2 days ago            5 days ago
Language         Python                Jupyter Notebook
License          Apache License 2.0    MIT License

Mentions - the total number of mentions that we've tracked plus the number of user-suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity - a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is among the top 10% of the most actively developed projects that we are tracking.

NeMo

Posts with mentions or reviews of NeMo. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-07-06.

[P] Making a TTS voice, HK-47 from Kotor using Tortoise (Ideally WaveRNN)
2 projects | /r/MachineLearning | 6 Jul 2023

I haven't tested WaveRNN, but of the open-source models I know, the best is FastPitch. It's also easy to use; here is the tutorial for voice cloning.
[N] Huggingface/nvidia release open source GPT-2B trained on 1.1T tokens
1 project | /r/MachineLearning | 2 May 2023
[D] What is the best open source text to speech model?
15 projects | /r/MachineLearning | 13 Apr 2023
[D] JAX vs PyTorch in 2023
5 projects | /r/MachineLearning | 9 Mar 2023

Nowadays... bigger repos like https://github.com/NVIDIA/NeMo are all PyTorch, and lots of work published by Meta and Microsoft is all torch too. I check new work on GitHub all the time and I haven't seen a TensorFlow repo in years except one.
[D] What's stopping you from working on speech and voice?
7 projects | /r/MachineLearning | 30 Jan 2023

- https://github.com/NVIDIA/NeMo
Can I use PyTorch to build a fast capitalization recoverer?
1 project | /r/pytorch | 21 Nov 2022

Can’t you use the NeMo model and just strip the punctuation from the output again if you don’t want it? You can also fine-tune the model for capitalization only if you look at the examples: https://github.com/NVIDIA/NeMo/blob/stable/tutorials/nlp/Punctuation_and_Capitalization.ipynb The capitalization and punctuation are annotated separately (U indicates that the word should be upper cased, and O indicates no capitalization). The model seems to be a token-level classifier, not seq-to-seq, so there should also be a way to get just the capitalization part, but you would have to look into the model as it’s not shown in the examples.
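
The U/O labelling described in this comment can be illustrated with a minimal sketch. It does not call NeMo; the labels are hand-written stand-ins for what a token-level classifier might emit, the function name `apply_capitalization` is hypothetical, and reading "U" as "capitalize the first letter" is an assumption.

```python
def apply_capitalization(words, labels):
    """Restore capitalization from per-word labels ("U" or "O")."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    restored = []
    for word, label in zip(words, labels):
        if label == "U":
            # Assumption: "U" means the first letter is upper-cased.
            restored.append(word[:1].upper() + word[1:])
        elif label == "O":
            # "O" leaves the word as it is.
            restored.append(word)
        else:
            raise ValueError(f"unexpected label: {label!r}")
    return " ".join(restored)


words = "can i use pytorch for this".split()
labels = ["U", "U", "O", "U", "O", "O"]  # hypothetical classifier output
print(apply_capitalization(words, labels))
```

Because the labels are per word, the capitalization decision can be applied on its own, without touching any punctuation labels the model may also produce.
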
I made a free transcription service powered by Whisper AI
8 projects | news.ycombinator.com | 18 Nov 2022

I think there's been talk to do speaker diarization with whisper-asr-webservice[0] which is also written in python and should be able to make use of goodies such as pyannote-audio, py-webrtcvad, etc.
Whisper is great, but at the point where we get to kludging various things together, it starts to make more sense to use something like Nvidia NeMo[1], which was built with all of this in mind and more.
[0] - https://github.com/ahmetoner/whisper-asr-webservice
[1] - https://github.com/NVIDIA/NeMo
Mozilla Common Voice - Korean Language is live - Help Build a Korean Corpus for Training AI/Navi/etc
2 projects | /r/Korean | 4 Nov 2022

Common Voice email || Common Voice || Korean Language Homepage || FAQs || Speaking Aloud and Reviewing Recordings || Sentence Collector || NVidia/NeMo
Whisper – open source speech recognition by OpenAI
22 projects | news.ycombinator.com | 21 Sep 2022
Using Edge Biometrics For Better AI Security System Development
3 projects | dev.to | 31 Aug 2022

The final security layer was added with speech-to-text anti-spoofing built on QuartzNet from the NeMo framework. This model provides a decent-quality user experience and is suitable for real-time scenarios. Measuring how close what the person says is to what the system expects requires calculating the Levenshtein distance between the two.
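
As a rough illustration of the Levenshtein check mentioned in this post, the sketch below compares an expected phrase with a transcribed one. It is plain Python and not the post author's implementation; the helper `phrase_matches` and its `max_distance` threshold are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def phrase_matches(expected: str, heard: str, max_distance: int = 2) -> bool:
    # Hypothetical acceptance rule: tolerate a few transcription errors.
    return levenshtein(expected.lower(), heard.lower()) <= max_distance


print(levenshtein("kitten", "sitting"))  # 3
print(phrase_matches("open the door", "open the dor"))
```

A small tolerance is useful here because a speech-to-text model can mis-transcribe a character or two even when the speaker says the right phrase.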

pyannote-audio

Posts with mentions or reviews of pyannote-audio. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-10-02.

Open Source Libraries
25 projects | /r/AudioAI | 2 Oct 2023

pyannote/pyannote-audio
AI Transcribing tool for video with two voices?
1 project | /r/ChatGPT | 22 Jun 2023

Open source. I've found this to be pretty nice; it is just a wrapper around some Hugging Face models: https://github.com/pyannote/pyannote-audio
Show HN: PodText.ai – Search anything said on a podcast, Highlight text to play
4 projects | news.ycombinator.com | 9 Feb 2023

(not the creator, but I've built something similar for personal use)
This is a great library for determining which speaker is speaking during each time in an audio file (this is called speaker diarization); I imagine they used it or something like it. Works really well out of the box!
https://github.com/pyannote/pyannote-audio
I wanted to use OpenAI's Whisper speech-to-text on my Mac without installing stuff in the Terminal so I made MacWhisper, a free Mac app to transcribe audio and video files for easy transcription and subtitle generation. Would love to hear some feedback on it!
2 projects | /r/apple | 1 Feb 2023

Do you think pyannote could be implemented in the Pro version of the app to support diarization?
I won several speaker diarization challenges with pyannote.audio
1 project | news.ycombinator.com | 2 Dec 2022
I made a free transcription service powered by Whisper AI
8 projects | news.ycombinator.com | 18 Nov 2022

Free startup idea: Use Whisper with pyannote-audio[0]’s speaker diarization. Upload a recording, get back a multi-speaker annotated transcription.
Make a JSON API and I’ll be your first customer.
[0] https://github.com/pyannote/pyannote-audio
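
The idea in this comment, combining a transcript with speaker turns, can be sketched without either library. The example below assumes the transcript is already available as (start, end, text) tuples and the diarization as (start, end, speaker) tuples; both inputs are made-up data, and `label_segments` is a hypothetical helper that assigns each transcript segment to the speaker whose turn overlaps it most.

```python
import json


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(transcript, turns):
    """Attach to each transcript segment the speaker with the largest overlap."""
    labeled = []
    for start, end, text in transcript:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            shared = overlap(start, end, t_start, t_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append(
            {"start": start, "end": end, "speaker": best_speaker, "text": text}
        )
    return labeled


# Made-up inputs standing in for real transcription and diarization output.
transcript = [(0.0, 2.0, "Hi, thanks for joining."), (2.0, 5.0, "Happy to be here.")]
turns = [(0.0, 2.2, "SPEAKER_00"), (2.2, 5.0, "SPEAKER_01")]

print(json.dumps(label_segments(transcript, turns), indent=2))
```

Segments that overlap no speaker turn keep a speaker of `None`, which a real service would need to handle explicitly.
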
Can Whisper differentiate between different voices?
1 project | /r/OpenAI | 16 Nov 2022

Whisper can’t, but pyannote-audio can. I’ve seen a couple of prototypes out there which link the two together.
[D] Is there a way to distinguish different human voices from 1 audio file ?
2 projects | /r/MachineLearning | 3 Oct 2022

You can use the pyannote Python library. It will identify the different speakers in the audio and create small audio files for those speakers.
Post-Game Analysis: Destiny & Alex VS Andrew & Zen Shapiro
1 project | /r/Destiny | 3 Sep 2022
A quick and dirty tool for automatically analyzing speaking time in online debates (Effortpost)
1 project | /r/Destiny | 1 Sep 2022

This Colab notebook is basically a standard template (with small changes) provided by pyannote-audio, the library implementing the speaker diarization functionality we need. (template)
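
Once diarization has produced speaker turns, the speaking-time analysis described in this post reduces to summing durations per speaker. The sketch below is not the notebook's code; it assumes turns are given as (start, end, speaker) tuples in seconds, and the sample data is invented.

```python
from collections import defaultdict


def speaking_time(turns):
    """Total seconds spoken per speaker."""
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)


def speaking_share(turns):
    """Fraction of the total speaking time taken by each speaker."""
    totals = speaking_time(turns)
    grand_total = sum(totals.values())
    if not grand_total:
        return {}
    return {speaker: t / grand_total for speaker, t in totals.items()}


# Invented turns standing in for real diarization output.
turns = [(0, 10, "A"), (10, 15, "B"), (15, 20, "A")]
print(speaking_time(turns))
print(speaking_share(turns))
```

Overlapping speech is counted once per speaker here, so the totals can add up to more than the length of the recording.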

What are some alternatives?

When comparing NeMo and pyannote-audio you can also consider the following projects:

DeepSpeech - DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.

speechbrain - A PyTorch-based Speech Toolkit

whisper - Robust Speech Recognition via Large-Scale Weak Supervision

Resemblyzer - A python package to analyze and compare voices with deep learning

espnet - End-to-End Speech Processing Toolkit

Kaldi Speech Recognition Toolkit - kaldi-asr/kaldi is the official location of the Kaldi project.

Real-Time-Voice-Cloning - Clone a voice in 5 seconds to generate arbitrary speech in real-time

inaSpeechSegmenter - CNN-based audio segmentation toolkit. Detects speech, music, noise and speaker gender. Designed for large-scale gender equality studies based on speech time per gender.

TTS - 🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production

uis-rnn - This is the library for the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, corresponding to the paper Fully Supervised Speaker Diarization.

STT - 🐸STT - The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.

segmentation_models.pytorch - Segmentation models with pretrained backbones. PyTorch.
