I'm honestly surprised that they trained a StyleGAN. Recently, the Imagen architecture has been shown to be both simpler in structure and easier to train, and it produces good results faster. Combined with the "Elucidating" paper by NVIDIA's Tero Karras, you can train a 256px Imagen to tolerable quality within an hour on an RTX 3090.
Here's a PyTorch implementation by the LAION people:
https://github.com/lucidrains/imagen-pytorch
And here are two images I sampled after training it for a few hours (about 2 hours for the base model plus 4 hours for the upscaler):
https://imgur.com/a/46EZsJo
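For reference, the noise-level schedule from the "Elucidating" (EDM) paper by Karras et al. is simple to write down. Here's a minimal NumPy sketch; the default `sigma_min`/`sigma_max`/`rho` values are the ones the paper recommends for image models:

```python
import numpy as np

def karras_sigmas(n: int, sigma_min: float = 0.002,
                  sigma_max: float = 80.0, rho: float = 7.0) -> np.ndarray:
    """Sampling noise levels from Karras et al. (2022): sigma^(1/rho)
    is linearly interpolated between sigma_max and sigma_min."""
    ramp = np.linspace(0.0, 1.0, n)
    inv_rho = 1.0 / rho
    sigmas = (sigma_max**inv_rho
              + ramp * (sigma_min**inv_rho - sigma_max**inv_rho)) ** rho
    return sigmas

sigmas = karras_sigmas(18)
print(sigmas[0], sigmas[-1])  # starts at sigma_max, ends at sigma_min
```

The `rho = 7` warping concentrates most of the sampling steps at low noise levels, which is a big part of why so few steps suffice for good samples.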
> The denoising part of a denoising autoencoder refers to the noise applied to its input
Agreed: it converts a noisy image into a denoised one. But the odd thing is that when you feed a noisy image into a StyleGAN2 encoder, you get latents that the decoder turns into a denoised image. So in practice, you can take a trained StyleGAN2 encoder/decoder pair and use it as if it were a denoiser.
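The effect is easy to illustrate with a toy linear "encoder/decoder" (this is only a sketch of the principle, not StyleGAN itself): if the data lives on a low-dimensional manifold and the latent space matches that dimensionality, the encode/decode round trip projects away the off-manifold part of the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "images": points on a 2-D subspace embedded in 10-D space.
basis = np.linalg.qr(rng.normal(size=(10, 2)))[0]   # orthonormal 10x2
clean = rng.normal(size=(1000, 2)) @ basis.T        # clean data
noisy = clean + 0.3 * rng.normal(size=clean.shape)  # noisy inputs

# Linear encoder/decoder: top-2 principal components of the clean data.
_, _, vt = np.linalg.svd(clean, full_matrices=False)
encode = lambda x: x @ vt[:2].T   # image -> latent
decode = lambda z: z @ vt[:2]     # latent -> image

denoised = decode(encode(noisy))  # round trip through the latent space

err_noisy = np.mean((noisy - clean) ** 2)
err_denoised = np.mean((denoised - clean) ** 2)
print(err_denoised < err_noisy)   # the round trip reduces the error
```

Only the noise component lying inside the 2-D latent subspace survives the round trip, so the reconstruction error drops even though nothing was ever trained as a denoiser.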
> These differences lead to learned distributions in the latent space that are entirely different
I also agree there. A denoising autoencoder and a GAN are trained differently, which leads to different learned distributions being sampled when generating images. But the architectures are still very similar, so the limits of what can be learned should be the same.
> Beyond that the comparison just doesn't work, yes there are two networks but the discriminator doesn't play the role of the AE's encoder at all
Yes, the discriminator in a GAN doesn't act as an encoder. But if you look at how StyleGAN 1/2 is used in practice, people combine it with a so-called "projection", which is effectively an encoder that converts images to latents. So people end up with a pipeline of "image-to-latent encoder" plus "latent-to-image decoder".
That whole pipeline is very similar to an autoencoder. For example, here's an NVIDIA paper that round-trips from image to latent to image with StyleGAN: https://arxiv.org/abs/1912.04958. My interpretation of what they did in that paper is that they effectively trained a StyleGAN-like model with the image-space L2 loss typically used for training a denoising autoencoder.
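The core of such a "projection" is just optimizing a latent to minimize the L2 distance between the generated image and a target. Here's a toy sketch of that loop; the linear map `G` is a stand-in for a real StyleGAN generator, and the learning rate and iteration count are arbitrary choices for this toy problem:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for a generator G: latent (8-D) -> "image" (64-D).
G = rng.normal(size=(64, 8))
generate = lambda w: G @ w

target = generate(rng.normal(size=8))  # an image known to lie in G's range

# "Projection": gradient descent on the L2 loss || G(w) - target ||^2.
w = np.zeros(8)
lr = 0.004
for _ in range(500):
    residual = generate(w) - target
    grad = 2 * G.T @ residual          # gradient of the L2 loss w.r.t. w
    w -= lr * grad

reconstruction = generate(w)           # image -> latent -> image round trip
print(np.max(np.abs(reconstruction - target)))  # small after convergence
```

With a real StyleGAN you would replace the analytic gradient with autograd through the network (and typically a perceptual loss on top of L2), but the structure of the loop is the same: the optimization is the "encoder", the generator is the "decoder".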