whisperX vs ControlNet


Project	whisperX	ControlNet
Mentions	24	127
Stars	9,064	27,964
Growth	-	-
Activity	8.4	4.1
Latest Commit	6 days ago	2 months ago
Language	Python	Python
License	BSD 4-Clause "Original" or "Old" License	Apache License 2.0

The number of mentions indicates the total number of mentions that we've tracked plus the number of user suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
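
The percentile reading of the activity score can be illustrated with a small sketch. The formula is an assumption made for illustration only, since the page does not publish how the score is computed: it maps a project's standing among tracked projects to a 0-10 scale, so that 9.0 means "more active than 90% of them". The function name and sample values are hypothetical.

```python
def activity_score(commit_weight, all_weights):
    """Hypothetical 0-10 score: share of tracked projects that are less active."""
    less_active = sum(1 for w in all_weights if w < commit_weight)
    return round(10 * less_active / len(all_weights), 1)


# Ten hypothetical projects with recency-weighted commit totals 1..10.
weights = list(range(1, 11))
print(activity_score(10, weights))  # → 9.0 (more active than 9 of the 10)
```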

whisperX

Posts with mentions or reviews of whisperX. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2024-03-31.

Easy video transcription and subtitling with Whisper, FFmpeg, and Python
1 project | news.ycombinator.com | 6 Apr 2024

It uses this, which does support diarization: https://github.com/m-bain/whisperX
SOTA ASR Tooling: Long-Form Transcription
1 project | news.ycombinator.com | 31 Mar 2024

The author compared various Whisper implementations:
"We found that WhisperX is the best framework for transcribing long audio files efficiently and accurately. It’s much better than using the standard openai-whisper library."
https://github.com/m-bain/whisperX
Deploying whisperX on AWS SageMaker as Asynchronous Endpoint
2 projects | dev.to | 31 Mar 2024

import os

# Directory and file paths
dir_path = './models-v1'
inference_file_path = os.path.join(dir_path, 'code/inference.py')
requirements_file_path = os.path.join(dir_path, 'code/requirements.txt')

# Create the directory structure
os.makedirs(os.path.dirname(inference_file_path), exist_ok=True)

# Inference.py content
inference_content = '''# inference.py
import io
import json
import logging
import os
import tempfile
import time

import boto3
import torch
import whisperx

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

s3 = boto3.client('s3')


def model_fn(model_dir, context=None):
    """
    Load and return the WhisperX model necessary for audio transcription.
    """
    print("Entering model_fn")

    logging.info("Loading WhisperX model")
    model = whisperx.load_model(
        whisper_arch=f"{model_dir}/guillaumekln/faster-whisper-large-v2",
        device=DEVICE,
        language="en",
        compute_type="float16",
        vad_options={'model_fp': f"{model_dir}/whisperx/vad/pytorch_model.bin"})
    print("Loaded WhisperX model")

    print("Exiting model_fn with model loaded")
    return {'model': model}


def input_fn(request_body, request_content_type):
    """
    Process and load audio from S3, given the request body containing S3 bucket and key.
    """
    print("Entering input_fn")

    if request_content_type != 'application/json':
        raise ValueError("Invalid content type. Must be application/json")

    request = json.loads(request_body)
    s3_bucket = request['s3bucket']
    s3_key = request['s3key']

    # Download the file from S3
    temp_file = tempfile.NamedTemporaryFile(delete=False)
    s3.download_file(Bucket=s3_bucket, Key=s3_key, Filename=temp_file.name)
    print(f"Downloaded audio from S3: {s3_bucket}/{s3_key}")

    print("Exiting input_fn")
    return temp_file.name


def predict_fn(input_data, model, context=None):
    """
    Perform transcription on the provided audio file and delete the file afterwards.
    """
    print("Entering predict_fn")
    start_time = time.time()

    whisperx_model = model['model']

    logging.info("Loading audio")
    audio = whisperx.load_audio(input_data)

    logging.info("Transcribing audio")
    transcription_result = whisperx_model.transcribe(audio, batch_size=16)

    try:
        os.remove(input_data)  # input_data contains the path to the temp file
        print(f"Temporary file {input_data} deleted.")
    except OSError as e:
        print(f"Error: {input_data} : {e.strerror}")

    end_time = time.time()
    elapsed_time = end_time - start_time
    logging.info(f"Transcription took {int(elapsed_time)} seconds")

    print(f"Exiting predict_fn, processing took {int(elapsed_time)} seconds")
    return transcription_result


def output_fn(prediction, accept, context=None):
    """
    Prepare the prediction result for the response.
    """
    print("Entering output_fn")

    if accept != "application/json":
        raise ValueError("Accept header must be application/json")

    response_body = json.dumps(prediction)
    print("Exiting output_fn with response prepared")
    return response_body, accept
'''

# Write the inference.py file
with open(inference_file_path, 'w') as file:
    file.write(inference_content)

# Requirements.txt content
requirements_content = '''speechbrain==0.5.16
faster-whisper==0.7.1
git+https://github.com/m-bain/whisperx.git@1b092de19a1878a8f138f665b1467ca21b076e7e
ffmpeg-python
'''

# Write the requirements.txt file
with open(requirements_file_path, 'w') as file:
    file.write(requirements_content)
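
The input_fn handler in the script above expects an application/json body with the keys s3bucket and s3key. A minimal sketch of that request contract follows, using only the standard library. The helper names, bucket name, and object key are hypothetical, and parse_request only mirrors the validation step, not the S3 download.

```python
import json


def build_request(bucket, key):
    """Serialize the JSON body that input_fn parses (keys taken from the handler)."""
    return json.dumps({"s3bucket": bucket, "s3key": key})


def parse_request(request_body, request_content_type):
    """Mirror input_fn's validation, without touching S3."""
    if request_content_type != "application/json":
        raise ValueError("Invalid content type. Must be application/json")
    request = json.loads(request_body)
    return request["s3bucket"], request["s3key"]


# Hypothetical bucket and key, for illustration only.
body = build_request("my-audio-bucket", "recordings/episode-01.wav")
print(parse_request(body, "application/json"))
```

With an asynchronous endpoint, a body like this would typically be uploaded to S3 first and referenced by its location when the endpoint is invoked, rather than sent inline.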
OpenVoice: Versatile Instant Voice Cloning
7 projects | news.ycombinator.com | 29 Mar 2024

Whisper doesn't, but WhisperX <https://github.com/m-bain/whisperX/> does. I am using it right now and it's perfectly serviceable.
For reference, I'm transcribing research-related podcasts, meaning speech doesn't overlap a lot, which would be a problem for WhisperX from what I understand. There are also a lot of accents, which strain Whisper (though it's also doing well), but surely help WhisperX. It did have issues figuring out the number of speakers on its own, but that wasn't a problem for my use case.
FLaNK 15 Jan 2024
21 projects | dev.to | 15 Jan 2024
Subtitle is now open-source
3 projects | news.ycombinator.com | 24 Nov 2023

I've had good results with whisperx when I needed to generate captions. https://github.com/m-bain/whisperX
There is currently a problem with diarization, but otherwise, it is SOTA.
Insanely Fast Whisper: Transcribe 300 minutes of audio in less than 98 seconds
8 projects | news.ycombinator.com | 14 Nov 2023

https://github.com/m-bain/whisperX/issues/569
WhisperX with the new model. It's not fast.
Distil-Whisper: distilled version of Whisper that is 6 times faster, 49% smaller
14 projects | news.ycombinator.com | 31 Oct 2023

How much faster in real wall-clock time is this in batched data than https://github.com/m-bain/whisperX ?
whisper self hosted what's the most cost-efficient way
1 project | /r/selfhosted | 17 Oct 2023

Check out whisperx
Whisper Turbo: transcribe 20x faster than realtime using Rust and WebGPU
3 projects | news.ycombinator.com | 12 Sep 2023

Neat to see a new implementation, although I'll note that for those looking for a drop-in replacement for the whisper library, I believe that both faster-whisper https://github.com/guillaumekln/faster-whisper and https://github.com/m-bain/whisperX are easier (PyTorch-based, doesn't require a web browser), and a lot faster (WhisperX is up to 70X realtime).
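
The speed claims quoted in these posts ("70X realtime", "300 minutes of audio in less than 98 seconds") are all the same ratio: seconds of audio transcribed per second of wall-clock time. A minimal sketch of that arithmetic, with hypothetical helper names:

```python
def realtime_factor(audio_seconds, processing_seconds):
    """Seconds of audio transcribed per second of wall-clock time."""
    return audio_seconds / processing_seconds


def processing_time(audio_seconds, factor):
    """Wall-clock seconds needed at a given realtime factor."""
    return audio_seconds / factor


# 300 minutes of audio in 98 seconds (the Insanely Fast Whisper headline).
print(round(realtime_factor(300 * 60, 98), 1))  # → 183.7

# One hour of audio at 70x realtime (the WhisperX figure quoted above).
print(round(processing_time(3600, 70), 1))  # → 51.4
```

Both headline numbers are bounds ("less than", "up to"), so real measurements on other hardware or audio may differ.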

ControlNet

Posts with mentions or reviews of ControlNet. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-12-05.

With the recent developments, It looks like AI art is finally beginning to evolve in the right direction
5 projects | /r/aiwars | 5 Dec 2023

It's all possible. Have a look at Automatic1111's Web UI, ControlNet, and OpenPose. If you don't have a dedicated GPU with at least 8GB of VRAM, or at least 16GB of RAM to use the CPU, you can also use Stable Horde to run the web UI over a peer-to-peer connection. You'll only use a fraction of your resources, but you'll still be able to use local AI models with all the bells and whistles that you won't get from "state-of-the-art" paid services.
AI "Artists" Are Lazy, and the Ultimate Goal of AI Image Generation (hint: its sloth)
2 projects | /r/ArtistHate | 25 Nov 2023

Next up is ControlNet. ControlNet, as lllyasviel, the creator of ControlNet, describes it, "lets us control diffusion models!" ControlNet is a neural network structure to control diffusion models by adding extra conditions [8]. There is more to it than what I described, but the big takeaway is that ControlNet takes a preprocessed image that you provide (or that is generated) and uses it to constrain what the sampler produces from noise, giving you a bit more control over the output. ControlNet is typically used for character or scene "artwork", which previously would have been a challenge with prompting alone (at least with this current architecture).
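
To make "preprocessed image" concrete, here is a minimal sketch of one kind of preprocessing: turning a grayscale image into an edge map. It is not ControlNet's own preprocessor. A plain gradient threshold stands in for a real edge detector such as Canny, and the function name and threshold are hypothetical.

```python
def edge_map(image, threshold=50):
    """Mark pixels whose brightness differs sharply from the right or lower neighbour."""
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # At the border, compare the pixel with itself (gradient of zero).
            right = image[y][x + 1] if x + 1 < width else image[y][x]
            below = image[y + 1][x] if y + 1 < height else image[y][x]
            gradient = abs(right - image[y][x]) + abs(below - image[y][x])
            edges[y][x] = 255 if gradient >= threshold else 0
    return edges


# A 4x4 grayscale image: dark left half, bright right half.
image = [[0, 0, 200, 200] for _ in range(4)]
for row in edge_map(image):
    print(row)  # each row prints [0, 255, 0, 0]
```

An edge map like this is the sort of conditioning input that would be passed alongside the text prompt, so the generated image follows the outlines while the prompt decides the content.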
Making a ControlNet inpaint for sdxl
3 projects | /r/StableDiffusion | 27 Oct 2023
[P5V6P2] Mother and Daughter (by azfumi)
2 projects | /r/HonzukiNoGekokujou | 12 Jul 2023

For your first part of the comment, I can simply refer you to technologies like ControlNet, LoRA and prompt embedding: https://github.com/lllyasviel/ControlNet https://github.com/microsoft/LoRA
Calling yourself an AI artist is almost exactly the same as calling yourself a cook for heating readymade meals in a microwave
1 project | /r/Showerthoughts | 8 Jul 2023
Why is the AI not listening to my prompts?
1 project | /r/StableDiffusion | 3 Jul 2023

Here you can see what every ControlNet preprocessor and model does, to give you an idea of how to use
Can't get img2img working well
1 project | /r/StableDiffusion | 30 Jun 2023

Ya, it takes a while to really start getting comfortable with the wonkiness. If you are trying to do something specific, look for a LoRA, but in general I'd recommend you get ControlNet so you can feed it a reference image. Another simple trick is to edit the image a bit in GIMP or a photo editor to get the color scheme you like and then feed it back to img2img at low denoising (0.1-0.2) to refine it. You can also add a garishly bad cartoon drawing or photoshop in assets, and img2img will usually make something of them and blend them into your image. I find this easier than using img2img scribble.
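
Why low denoising only "refines" the image can be seen from how many sampler steps actually run. The sketch below is an assumption modelled on how the Hugging Face diffusers img2img pipeline schedules steps, where the denoising strength scales the step count. Other UIs may schedule differently, and the function name is hypothetical.

```python
def img2img_steps(num_inference_steps, strength):
    """Steps actually run when starting from a partially noised input image."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


for strength in (0.1, 0.2, 1.0):
    print(strength, img2img_steps(20, strength))
# prints:
# 0.1 2
# 0.2 4
# 1.0 20
```

At 0.1-0.2 only a few of the 20 steps run, so the input's composition and colors survive and only fine detail is redrawn. At 1.0 the input is fully replaced by noise.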
ControlNet on A1111 seems to have been broken in the new update
1 project | /r/ControlNet | 25 Jun 2023
Can anyone help me install SD and ControlNet on my Mac pro M1?
5 projects | /r/StableDiffusion | 25 Jun 2023

If there are no errors, go to the "Extensions" tab, then "Install from URL". There, enter "https://github.com/lllyasviel/ControlNet" then click "Install".
According to the poll on the recent thread, /r/dalle2 community decided to keep the subreddit restricted on Reddit.
2 projects | /r/dalle2 | 22 Jun 2023

This is a good place to start reading. Given the open-source nature of SD, there are setups of various difficulty available. A1111 is the "standard" people enjoy because it's easy to plug in new stuff (ControlNet, new models, etc.), but it's not inherently easy to set up and get going. There is an installer for it, but I haven't tried it.

What are some alternatives?

When comparing whisperX and ControlNet you can also consider the following projects:

whisper.cpp - Port of OpenAI's Whisper model in C/C++

InvokeAI - InvokeAI is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry leading WebUI, supports terminal use through a CLI, and serves as the foundation for multiple commercial products.

whisper - Robust Speech Recognition via Large-Scale Weak Supervision

lora - Using Low-rank adaptation to quickly fine-tune diffusion models.

faster-whisper - Faster Whisper transcription with CTranslate2

LoRA - Code for loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models"

insanely-fast-whisper - Incredibly fast Whisper-large-v3

sd-webui-controlnet - WebUI extension for ControlNet

openai-whisper-cpu - Improving transcription performance of OpenAI Whisper for CPU based deployment

stable-diffusion-webui-prompt-travel - Travel between prompts in the latent space to make pseudo-animation, extension script for AUTOMATIC1111/stable-diffusion-webui.

subgen - Autogenerate subtitles using OpenAI Whisper Model via Jellyfin, Plex, Emby, Tautulli, or Bazarr

stable-diffusion-webui - Stable Diffusion web UI
