
From Chaos to Control: VAE-Encoded Latent Spaces as GAN Inputs

Dec 31, 2024
GAN, VAE, Latent Spaces, Image Transformation

Imagine trying to paint a masterpiece by randomly throwing paint at the canvas. That’s somewhat similar to how traditional Generative Adversarial Networks (GANs) work — they start with random noise to create images. But what if instead of random splatter, we gave the artist a structured sketch to work from? This is exactly what happens when we use Variational Autoencoder (VAE) latent spaces to drive GAN generation.

The Traditional Approach vs The New Standard

Traditional GANs work like an artist blindfolded, creating images from pure randomness. They take random numbers as input and somehow need to transform this chaos into meaningful images. While this approach has produced good results, it’s like trying to find your way through a maze in the dark.

This is where VAEs come in: neural networks that are very good at understanding and compressing images into their essential features. A VAE has two main components: an encoder and a decoder. The encoder compresses images into a compact representation (the latent space), while the decoder reconstructs images from this representation. When we use the encoder’s output to drive a GAN instead of random noise, it’s like giving our artist a reference sketch instead of a blindfold. The GAN now works with meaningful features rather than pure randomness.

The Framework:

(E): Encoder of the VAE, (G): Generator of the GAN, (D): Discriminator of the GAN

VAE Encoder:

The encoder of the VAE maps input data x to a latent space z, modeled as a Gaussian distribution:

z ~ q(z|x)

q(z|x) = N(μ(x), σ²(x))

E(x) = μ(x) + σ(x)ε

Where:

  • q(z|x) represents the conditional probability of latent vector z given input x
  • N denotes a Normal distribution
  • μ(x) is the mean vector output by the encoder
  • σ²(x) is the variance vector output by the encoder
  • E is the encoder function

We use the reparameterization trick to keep the sampling step differentiable for backpropagation:

z = μ(x) + σ(x)ε; where ε ~ N(0, I)
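
Here’s a minimal PyTorch sketch of such an encoder (the layer sizes and architecture are illustrative assumptions, not the exact model from our experiment):

import torch
import torch.nn as nn

class Encoder(nn.Module):
    """VAE encoder: maps an input image x to μ(x) and log σ²(x)."""
    def __init__(self, in_dim=784, hidden_dim=400, latent_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)      # μ(x)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)  # log σ²(x)

    def forward(self, x):
        h = self.backbone(x)
        return self.fc_mu(h), self.fc_logvar(h)

def reparameterize(mu, logvar):
    """z = μ + σ·ε with ε ~ N(0, I); keeps sampling differentiable."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps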

The GAN Generator:

The generator in this hybrid model maps the latent vector z from the VAE to the data space:

G: z → y

y = G(z), where z = μ(x) + σ(x)ε and G is the generator function

NOTE: Here, the input to the Generator is no longer random noise (z ~ N(0, I)), but rather the latent vector produced by the encoder.

The generator learns a mapping that transforms the feature-rich latent vector z into outputs resembling the target domain (e.g., sketches or paintings).
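
A matching generator sketch under the same assumptions (a real image model would typically use transposed convolutions rather than linear layers; Encoder and reparameterize come from the sketch above):

class Generator(nn.Module):
    """GAN generator: maps a VAE latent vector z to an output image y."""
    def __init__(self, latent_dim=64, hidden_dim=400, out_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim), nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, z):
        return self.net(z)

# y = G(E(x)): encode, reparameterize, then generate
encoder, generator = Encoder(), Generator()
x = torch.randn(16, 784)        # a dummy batch of flattened images
mu, logvar = encoder(x)
z = reparameterize(mu, logvar)  # z = μ(x) + σ(x)ε, not random noise
y = generator(z)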

The GAN Discriminator:

The discriminator evaluates the realism of the generated data G(z) compared to real data y_real. Its loss is:

Lᴅ = 𝔼[log D(y_real)] + 𝔼[log(1 − D(G(z)))]

Substituting z = E(x):

Lᴅ = 𝔼[log D(y_real)] + 𝔼[log(1 − D(G(E(x))))]
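
In code, the discriminator and its loss might look like this; minimizing the BCE below is equivalent to the discriminator maximizing Lᴅ (the architecture is again an illustrative assumption):

class Discriminator(nn.Module):
    """Outputs the probability that an image is real."""
    def __init__(self, in_dim=784, hidden_dim=400):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, 1), nn.Sigmoid(),
        )

    def forward(self, y):
        return self.net(y)

bce = nn.BCELoss()

def discriminator_loss(D, y_real, y_fake):
    """Lᴅ in BCE form: push D(y_real) toward 1 and D(G(E(x))) toward 0."""
    pred_real = D(y_real)
    pred_fake = D(y_fake.detach())  # detach: don't update the generator here
    return (bce(pred_real, torch.ones_like(pred_real))
            + bce(pred_fake, torch.zeros_like(pred_fake)))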

The Hybrid Loss Function:

(a) VAE Loss:

Lᵥₐₑ = KL(q(z|x)||p(z)) − 𝔼[log p(x|z)]

where the reconstruction term 𝔼[log p(x|z)] ensures faithful reconstruction of x, and the KL divergence regularizes q(z|x) to be close to the standard Gaussian prior p(z).

(b) GAN Loss:

L𝒢ₐₙ = 𝔼[log D(y_real)] + 𝔼[log(1 − D(G(z)))]

So, the total hybrid loss is:

Lₜₒₜₐₗ = Lᵥₐₑ + λL𝒢ₐₙ

where λ balances reconstruction fidelity and adversarial realism.
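
Putting the pieces together, here is one way to compute the total hybrid loss for the encoder/generator update (the MSE reconstruction stand-in and the λ default are our assumptions; y_real is the paired ground-truth image in the target domain, and bce, Encoder, reparameterize come from the sketches above):

import torch.nn.functional as F

def hybrid_loss(encoder, generator, D, x, y_real, lam=0.1):
    """L_total = L_VAE + λ·L_GAN for one batch."""
    mu, logvar = encoder(x)
    z = reparameterize(mu, logvar)
    y_fake = generator(z)

    # L_VAE: reconstruction (a stand-in for -E[log p(x|z)]) + KL(q(z|x) || N(0, I))
    recon = F.mse_loss(y_fake, y_real)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

    # L_GAN (generator side): try to make D label the fakes as real
    pred_fake = D(y_fake)
    adv = bce(pred_fake, torch.ones_like(pred_fake))

    return recon + kl + lam * adv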

Advantages of Using Latent Space as GAN Input

1. Structured Latent Representations

The VAE encoder provides feature-rich latent vectors z, capturing semantic information (e.g., edges, textures).

This reduces the generator’s burden, leading to:

  • Faster convergence.
  • More realistic outputs.

2. Diversity and Control

Stochastic sampling in q(z|x) enables diverse outputs for the same input: small variations in ε produce different results for the same x.

∀ z₁, z₂ ~ q(z|x): G(z₁) ≠ G(z₂) where z₁ ≠ z₂

This ensures the generator can create varied and style-rich outputs, e.g., from a single image:

x(pencil_sketch) = G(z(pencil))

x(painting) = G(z(painting))

where both z(pencil) and z(painting) are from the same image but encoded differently
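
Concretely, drawing several ε samples for one image and decoding each gives distinct stylized outputs (continuing the sketches above):

# One input image, several draws of ε, several distinct outputs
mu, logvar = encoder(x)
variants = [generator(reparameterize(mu, logvar)) for _ in range(4)]
# Each reparameterize call draws a fresh ε ~ N(0, I),
# so variants holds four different G(z) for the same x.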

3. Improved Stability

The KL divergence regularizes q(z|x), ensuring a smooth latent space:

Lᵣₑ = L𝒢ₐₙ + βKL(q(z|x) || p(z)), where β controls the regularization strength

Smoothness constraint in latent space: ‖∇zG(z₁) − ∇zG(z₂)‖ ≤ L‖z₁ − z₂‖ for Lipschitz constant L

Mode collapse prevention constraints: 𝔼[KL(q(z|x) || p(z))] ≤ ε and 𝔼[‖G(z₁) − G(z₂)‖] ≥ δ for z₁, z₂ ~ q(z|x), z₁ ≠ z₂

This stabilizes GAN training and prevents mode collapse.

Full regularized objective: min_G max_D [𝔼_x~p_data[log D(x)] + 𝔼_z~q(z|x)[log(1 − D(G(z)))] + βKL(q(z|x) || p(z))]
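
One illustrative way to monitor the mode-collapse constraint during training is to estimate 𝔼[‖G(z₁) − G(z₂)‖] directly (this monitoring helper is our own sketch, not part of the framework itself):

def output_diversity(encoder, generator, x):
    """Estimates E[||G(z1) - G(z2)||] for z1, z2 ~ q(z|x)."""
    mu, logvar = encoder(x)
    y1 = generator(reparameterize(mu, logvar))  # two independent ε draws
    y2 = generator(reparameterize(mu, logvar))
    return (y1 - y2).norm(dim=1).mean()

# A value collapsing toward 0 signals the generator is ignoring ε (mode collapse).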

4. Domain Specific Transformations

The discriminator enforces domain-specific realism:

  • Photo-to-Sketch: Sharp edges, grayscale tones.
  • Photo-to-Painting: Smooth gradients, artistic textures.

Conclusion

Using the VAE’s latent space as input for the GAN’s generator offers a powerful framework for image-to-image transformations. The structured latent representations enhance diversity, control, and stability, while the GAN’s adversarial training ensures high-quality outputs.

My friends and I experimented with this framework, turning normal photos into pencil sketches, and here is the code:

GitHub repository: https://github.com/arjuuuuunnnnn/The-Art-Of-Transformation