Imagine trying to paint a masterpiece by randomly throwing paint at the canvas. That’s somewhat similar to how traditional Generative Adversarial Networks (GANs) work — they start with random noise to create images. But what if instead of random splatter, we gave the artist a structured sketch to work from? This is exactly what happens when we use Variational Autoencoder (VAE) latent spaces to drive GAN generation.
Traditional GANs work like a blindfolded artist, creating images from pure randomness. They take random numbers as input and must somehow transform this chaos into meaningful images. While this approach has produced good results, it’s like trying to find your way through a maze in the dark.
VAEs, by contrast, are neural networks that excel at understanding and compressing images into their essential features. A VAE has two main components: an encoder and a decoder. The encoder compresses images into a compact representation (the latent space), while the decoder reconstructs images from that representation. When we use the output of a VAE’s encoder (a latent vector) to drive a GAN instead of random noise, it’s like giving our artist a reference sketch instead of a blindfold. The GAN now works with meaningful features rather than pure randomness.
(E): Encoder of the VAE, (G): Generator of the GAN, (D): Discriminator of the GAN
The encoder of the VAE maps input data x to a latent space z, modeled as a Gaussian distribution:
z ~ q(z|x)
q(z|x) = N(μ(x), σ²(x))
E(x) = μ(x) + σ(x)ε
where μ(x) and σ(x) are the mean and standard deviation predicted by the encoder, and ε ~ N(0, I) is standard Gaussian noise.
The reparameterization trick keeps this sampling step differentiable, so gradients can flow back through it during backpropagation:
z = μ(x) + σ(x)ε; where ε ~ N(0, I)
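A minimal sketch of the reparameterization trick in PyTorch. The μ and log-variance tensors here are placeholders standing in for encoder outputs; the point is only that gradients flow through the sampled z:

```python
import torch

torch.manual_seed(0)

# Placeholder encoder outputs for a batch of 4 inputs, latent dimension 8:
# mu and log-variance (predicting log σ² is the usual numerically stable choice).
mu = torch.zeros(4, 8, requires_grad=True)
log_var = torch.zeros(4, 8, requires_grad=True)

# Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
# All randomness lives in eps, so z is differentiable w.r.t. mu and sigma.
eps = torch.randn_like(mu)
sigma = torch.exp(0.5 * log_var)
z = mu + sigma * eps

# Gradients propagate back through the sample to the encoder outputs.
z.sum().backward()
print(mu.grad.shape)  # torch.Size([4, 8])
```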
The generator in this hybrid model maps the latent vector from the VAE to the data space:
G: z → y
y = G(z), where z = μ(x) + σ(x)ε and G is the generator function
NOTE: Here, the input to the generator is no longer random noise (z ~ N(0, I)), but the latent vector produced by the encoder.
The generator learns a mapping that transforms the feature-rich latent vector z into outputs resembling the target domain (e.g., sketches or paintings).
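As a sketch of how the pieces fit together, here are tiny linear stand-ins for E and G (purely illustrative, not the real convolutional architectures), wired so that y = G(E(x)):

```python
import torch
import torch.nn as nn

LATENT = 16

class Encoder(nn.Module):
    """Maps a flattened image x to (mu, log_var) of q(z|x)."""
    def __init__(self, in_dim=64):
        super().__init__()
        self.mu = nn.Linear(in_dim, LATENT)
        self.log_var = nn.Linear(in_dim, LATENT)
    def forward(self, x):
        return self.mu(x), self.log_var(x)

class Generator(nn.Module):
    """Maps a latent vector z to the target domain y (e.g. a sketch)."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.net = nn.Linear(LATENT, out_dim)
    def forward(self, z):
        return torch.tanh(self.net(z))

E, G = Encoder(), Generator()
x = torch.randn(4, 64)                                    # batch of flattened "images"
mu, log_var = E(x)
z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # sample from q(z|x)
y = G(z)                                                  # y = G(E(x)), not G(noise)
print(y.shape)  # torch.Size([4, 64])
```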
The discriminator evaluates the realism of the generated data compared to real data y_real. Its loss is:
Lᴅ = 𝔼[log D(y_real)] + 𝔼[log(1 − D(G(z)))]
Substituting z = E(x):
Lᴅ = 𝔼[log D(y_real)] + 𝔼[log(1 − D(G(E(x))))]
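In practice the discriminator maximizes Lᴅ, which is equivalent to minimizing binary cross-entropy with real labeled 1 and fake labeled 0. A sketch with placeholder scores standing in for D’s outputs:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Placeholder discriminator scores in (0, 1) for a batch of 4.
d_real = torch.sigmoid(torch.randn(4, 1))   # D(y_real)
d_fake = torch.sigmoid(torch.randn(4, 1))   # D(G(E(x)))

# Negated Lᴅ = E[log D(y_real)] + E[log(1 - D(G(E(x))))],
# written as a quantity to minimize.
loss_d = -(torch.log(d_real).mean() + torch.log(1 - d_fake).mean())

# Equivalent BCE formulation used in most implementations:
loss_bce = (F.binary_cross_entropy(d_real, torch.ones_like(d_real)) +
            F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
print(torch.allclose(loss_d, loss_bce))  # True
```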
(a) VAE Loss:
Lᵥₐₑ = −𝔼[log p(x|z)] + KL(q(z|x) || p(z))
where the reconstruction loss ensures the decoded output stays faithful to the input x, and the KL divergence regularizes q(z|x) to be close to a standard Gaussian prior p(z) = N(0, I).
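For the Gaussian choices above, the KL term has a well-known closed form. A sketch with placeholder tensors (MSE as the common stand-in for −log p(x|z) under a Gaussian likelihood):

```python
import torch

torch.manual_seed(0)

x = torch.randn(4, 64)          # input batch
x_hat = torch.randn(4, 64)      # decoder reconstruction (placeholder)
mu = torch.randn(4, 16)         # encoder outputs (placeholders)
log_var = torch.randn(4, 16)

# Reconstruction term: -E[log p(x|z)] reduces to MSE for a Gaussian likelihood.
recon = ((x_hat - x) ** 2).sum(dim=1).mean()

# KL(N(mu, sigma^2) || N(0, I)) in closed form, averaged over the batch.
kl = (-0.5 * (1 + log_var - mu ** 2 - log_var.exp()).sum(dim=1)).mean()

l_vae = recon + kl
print(l_vae.item())
```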
(b) GAN Loss:
L𝒢ₐₙ = 𝔼[log D(y_real)] + 𝔼[log(1 − D(G(z)))]
So, the total hybrid loss is:
Lₜₒₜₐₗ = Lᵥₐₑ + λL𝒢ₐₙ
where λ balances reconstruction fidelity and adversarial realism.
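Combining the two terms is a one-liner; the λ value below is just an illustrative choice, not a recommendation from the original work:

```python
import torch

# Placeholder loss values standing in for the terms derived above.
l_vae = torch.tensor(0.8)   # reconstruction + KL
l_gan = torch.tensor(0.5)   # adversarial (generator-side) loss
lam = 0.1                   # lambda: weight on adversarial realism

# Small lambda favors faithful reconstruction; large lambda favors realism.
l_total = l_vae + lam * l_gan
print(l_total.item())  # ≈ 0.85
```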
1. Structured Latent Representations
The VAE encoder provides feature-rich latent vectors z, capturing semantic information (e.g., edges, textures).
This reduces the generator’s burden, leading to faster convergence and more coherent outputs.
2. Diversity and Control
Stochastic sampling of z ~ q(z|x) enables diverse outputs for the same input: small variations in ε produce different latent vectors, and hence different outputs.
∀ z₁, z₂ ~ q(z|x) with z₁ ≠ z₂: G(z₁) ≠ G(z₂)
This lets the generator create varied and style-rich outputs, for example from a single image:
x(pencil_sketch) = G(z(pencil))
x(painting) = G(z(painting))
where both z(pencil) and z(painting) are from the same image but encoded differently
3. Improved Stability
The KL divergence regularizes q(z|x), ensuring a smooth latent space:
L_reg = L𝒢ₐₙ + βKL(q(z|x) || p(z)), where β controls the regularization strength
Smoothness constraint in latent space (Lipschitz continuity of G): ‖G(z₁) − G(z₂)‖ ≤ L‖z₁ − z₂‖ for some Lipschitz constant L
Mode collapse prevention constraints:
𝔼[KL(q(z|x) || p(z))] ≤ ε (the latent distribution stays close to the prior)
𝔼[‖G(z₁) − G(z₂)‖] ≥ δ for z₁, z₂ ~ q(z|x), z₁ ≠ z₂ (distinct latents map to distinct outputs)
This stabilizes GAN training and prevents mode collapse.
Full regularized objective:
min_G max_D 𝔼_x~p_data[log D(x)] + 𝔼_z~q(z|x)[log(1 − D(G(z)))] + βKL(q(z|x) || p(z))
4. Domain Specific Transformations
The discriminator enforces domain-specific realism: generated outputs must be indistinguishable from real examples of the target domain (e.g., real pencil sketches).
Using the VAE’s latent space as input for the GAN’s generator offers a powerful framework for image-to-image transformations. The structured latent representations enhance diversity, control, and stability, while the GAN’s adversarial training ensures high-quality outputs.
My friends and I experimented with this framework, converting ordinary photos into pencil sketches, and here is the code:
GitHub repository https://github.com/arjuuuuunnnnn/The-Art-Of-Transformation