Understanding latent diffusion, VAE, U-Net, and control mechanisms
Stable Diffusion transformed AI image generation from academic curiosity to mainstream creative tool. But how does it actually work? This article explains the underlying technology: the variational autoencoder that compresses images, the U-Net that performs the diffusion process, and the control mechanisms that enable precise control over the generated output.
Diffusion models generate images by learning to reverse a gradual noising process. The key insight: if you can learn to denoise, you can generate.
In the forward process, a clean image gradually becomes pure noise through T timesteps. At each step, a small amount of Gaussian noise is added:
q(xₜ | xₜ₋₁) = N(xₜ; √(1-βₜ)xₜ₋₁, βₜI)
Where:
βₜ = noise schedule (increases from ~0.0001 to ~0.02)
T = total timesteps (typically 1000)
Because sums of independent Gaussians are themselves Gaussian, you can sample xₜ directly at any timestep t without iterating:
xₜ = √(ᾱₜ) × x₀ + √(1-ᾱₜ) × ε
Where:
ᾱₜ = ∏ₛ₌₁ᵗ (1-βₛ), the cumulative product of (1-β) up to step t
ε = random noise
x₀ = original clean image
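The closed-form jump above is easy to verify numerically. A minimal numpy sketch, assuming the linear β schedule and T = 1000 from the text (the shapes and seed are illustrative):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)          # ᾱ_t = ∏ (1 - β_s)

def q_sample(x0, t, eps):
    """Jump straight to timestep t: x_t = √ᾱ_t·x0 + √(1-ᾱ_t)·ε."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 64, 4))   # stand-in for a clean latent
eps = rng.standard_normal(x0.shape)
x_mid = q_sample(x0, 500, eps)          # halfway: a mix of signal and noise
x_end = q_sample(x0, T - 1, eps)        # ᾱ_T ≈ 0, so this is almost pure noise
```

At t = T-1 the signal coefficient √ᾱ is on the order of 0.006, which is why training can treat x_T as pure Gaussian noise.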
The model learns to reverse this process. Given a noisy image at timestep t, it predicts what the image looked like at timestep t-1:
pθ(xₜ₋₁ | xₜ) = N(xₜ₋₁; μθ(xₜ, t), Σθ(xₜ, t))
The model predicts either:
- The noise εθ(xₜ, t)
- The original image x₀ prediction
- The velocity v = √(ᾱₜ)ε - √(1-ᾱₜ)x₀ (v-prediction, used by some newer models)
Starting from random noise x_T, the model iteratively denoises to produce x₀.
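The iterative denoising loop can be sketched in a few lines of numpy. This is DDPM ancestral sampling with a toy stand-in for the U-Net; the "oracle" predictor below simply pretends the clean image is all zeros, so the loop collapses to zero instead of producing an image, but the update rule is the real one:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def predict_noise(x_t, t):
    # Toy oracle standing in for the trained U-Net: it assumes x0 = 0,
    # so all of x_t is noise: ε = x_t / √(1-ᾱ_t).
    return x_t / np.sqrt(1.0 - alpha_bar[t])

def ddpm_sample(shape, rng):
    """Ancestral sampling: start from pure noise, denoise step by step."""
    x = rng.standard_normal(shape)                  # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        eps = predict_noise(x, t)
        mean = (x - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise        # no noise on the final step
    return x

rng = np.random.default_rng(0)
sample = ddpm_sample((8, 8, 4), rng)                # collapses to ~0 with the toy oracle
```

With a real model, `predict_noise` is the U-Net, and samplers like DDIM or DPM-Solver replace this loop with larger, fewer steps.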
The key innovation in Stable Diffusion is operating in latent space rather than pixel space. Instead of denoising 512×512×3 = 786,432 pixel values, it denoises 64×64×4 = 16,384 latent variables—a 48× reduction.
Pixel Space: 512 × 512 × 3 = 786,432 dimensions
Latent Space: 64 × 64 × 4 = 16,384 dimensions
Compression: 48× smaller
The VAE compresses images to latent space and decompresses back. It consists of:
Image (H, W, 3) → Encoder → Mean and LogVar → Sample → Latent (H/8, W/8, 4) → Decoder → Reconstructed Image
The VAE is trained to minimize reconstruction loss and a KL divergence term ensuring the latent space is well-structured.
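The two ingredients mentioned above, sampling from the predicted distribution and the KL term, can be sketched in numpy. The shapes match one 64×64×4 latent; the reconstruction loss (SD's VAE also adds a perceptual term) would be computed separately and added:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend encoder outputs: per-element mean and log-variance of the latent.
mean = rng.standard_normal((64, 64, 4)) * 0.1
logvar = rng.standard_normal((64, 64, 4)) * 0.1

# Reparameterization trick: z = μ + σ·ε keeps the sample differentiable in μ, σ.
eps = rng.standard_normal(mean.shape)
z = mean + np.exp(0.5 * logvar) * eps

# KL divergence of N(μ, σ²) from the N(0, I) prior, summed over latent elements.
# This is the term that keeps the latent space well-structured.
kl = -0.5 * np.sum(1 + logvar - mean**2 - np.exp(logvar))
```

The KL term is zero exactly when μ = 0 and σ = 1, i.e. when the latent distribution matches the prior.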
The U-Net is the heart of the diffusion model. It takes noisy latents and a timestep embedding, and predicts the noise (or velocity).
Input: Noisy latent (64×64×4) + Timestep embedding
↓
[Encoder path - downsampling]
Conv → Conv → Downsample → Conv → Conv → Downsample → ...
↓
[Bottleneck]
Attention + FeedForward blocks
↓
[Decoder path - upsampling]
... → Upsample → Concat(skip) → Conv → Conv → Upsample → ...
↓
Output: Predicted noise (64×64×4)
The key innovation is skip connections that preserve spatial information from encoder to decoder, enabling precise pixel-level predictions.
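The role of the skip connections can be shown with a shapes-only numpy sketch. Convolutions are omitted; in a real U-Net a convolution after each concatenation fuses the channels back down:

```python
import numpy as np

def downsample(x):
    # 2× average pooling over the spatial dims of an (H, W, C) array.
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample(x):
    # 2× nearest-neighbour upsampling.
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.random.default_rng(0).standard_normal((64, 64, 4))

# Encoder path: keep each resolution's features for the skip connection.
skip1 = x                       # 64×64 features
d1 = downsample(x)              # 32×32
skip2 = d1
d2 = downsample(d1)             # 16×16 bottleneck

# Decoder path: upsample, then concatenate the matching encoder features
# along channels so fine spatial detail survives the bottleneck.
u1 = np.concatenate([upsample(d2), skip2], axis=-1)   # 32×32, 8 channels
u2 = np.concatenate([upsample(u1), skip1], axis=-1)   # 64×64, 12 channels
```

Without the concatenations, everything the decoder produces would have to pass through the 16×16 bottleneck, losing exactly the pixel-level detail that denoising needs.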
Stable Diffusion uses text prompts to guide image generation. This requires converting text to a format the U-Net can use.
Stable Diffusion uses CLIP (Contrastive Language-Image Pre-training) text encoders:
Prompt: "A photo of a cat"
↓ Tokenize → [32, 1024, 3256, ...] (token IDs)
↓ CLIP Text Encoder → [77, 768] (77 tokens × 768 dimensions)
↓ Cross-attention conditioning → U-Net
The CLIP text encoder is frozen during SD training—Stable Diffusion doesn't train its own text encoder. The choice of encoder matters: models trained against larger encoders gain richer language understanding (SD 2.x uses OpenCLIP-H; SDXL adds CLIP-G). An encoder cannot simply be swapped after the fact, though, because the U-Net's cross-attention layers are trained against one specific embedding space.
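The cross-attention conditioning in the last step is straightforward to sketch. In the toy numpy version below, the projection matrices are random (in SD they are learned) and the head dimension is illustrative; what matters is the shape flow: every spatial position of the latent queries all 77 text tokens:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 64                                   # attention head dimension (illustrative)

img = rng.standard_normal((4096, d))     # flattened 64×64 latent features → queries
txt = rng.standard_normal((77, d))       # text token embeddings → keys and values

# Learned projections in SD; random matrices here just to show the shapes.
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

Q, K, V = img @ Wq, txt @ Wk, txt @ Wv
attn = softmax(Q @ K.T / np.sqrt(d))     # (4096, 77): each pixel attends to tokens
out = attn @ V                           # text information injected per spatial site
```

This is why prompt edits have spatially local effects: each latent position weights the tokens independently.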
Classifier-free guidance (CFG) amplifies the conditioning effect without requiring a separate classifier:
ε̃θ(xₜ, c) = εθ(xₜ, ∅) + w × (εθ(xₜ, c) - εθ(xₜ, ∅))
Where:
ε̃θ(xₜ, c) = guided prediction actually used for denoising
εθ(xₜ, ∅) = unconditional prediction (no text)
εθ(xₜ, c) = conditional prediction (with text)
w = guidance scale (typically 7-12)
Higher guidance scales improve prompt adherence but can reduce diversity and introduce artifacts (oversaturated colors, compressed compositions).
| CFG Scale | Prompt Adherence | Image Diversity | Artifact Risk |
|---|---|---|---|
| 2-4 | Low | High | Very low |
| 7-8 | Good | Moderate | Low |
| 12-15 | Very high | Low | Moderate |
| 20+ | Saturated | Very low | High |
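The guidance formula is a one-liner; the cost is that each denoising step needs two U-Net evaluations (in practice batched into one forward pass). A minimal sketch with random arrays standing in for the two predictions:

```python
import numpy as np

def cfg(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate past the conditional prediction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_u = rng.standard_normal((64, 64, 4))   # U-Net output given the empty prompt
eps_c = rng.standard_normal((64, 64, 4))   # U-Net output given the text prompt

guided = cfg(eps_u, eps_c, w=7.5)
```

Note the boundary cases: w = 0 ignores the prompt entirely, w = 1 is plain conditional generation, and w > 1 amplifies the direction the prompt pushes in, which is where both the improved adherence and the artifacts in the table above come from.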
ControlNet extends Stable Diffusion to accept additional conditioning inputs: edge maps, depth maps, keypoints, scribbles, and more.
ControlNet works by creating trainable copies of the U-Net encoder layers, with the original frozen:
Input: Noisy latent + Control map (edge, depth, pose, etc.)
↓
ControlNet Encoder (trainable):
Zero-initialized conv → Encoder copy → Feature maps
↓
Stable Diffusion U-Net (frozen):
Feature maps added via skip connections
↓
Output: Denoised latent
The zero initialization ensures ControlNet starts by contributing nothing, then gradually learns meaningful conditioning. This prevents the model from forgetting the pretrained SD weights.
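The zero-initialization argument can be verified with a toy numpy model. `frozen_block` stands in for an SD encoder block; the final projection (the "zero conv") starts at zero, so the combined output is exactly the frozen model's output at initialization:

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_block(x, W):
    # Stand-in for a frozen SD encoder block.
    return np.tanh(x @ W)

d = 16
W_frozen = rng.standard_normal((d, d)) * 0.1
W_copy = W_frozen.copy()                 # ControlNet begins as a copy of the block
W_zero = np.zeros((d, d))                # zero-initialized projection ("zero conv")

x = rng.standard_normal((10, d))         # latent features
ctrl = rng.standard_normal((10, d))      # control-map features (edges, depth, ...)

base_out = frozen_block(x, W_frozen)
control_out = frozen_block(x + ctrl, W_copy) @ W_zero   # all zeros at init
out = base_out + control_out             # identical to the frozen output at init
```

As training moves `W_zero` away from zero, the control branch's influence grows smoothly from nothing, which is what protects the pretrained weights.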
Just as LoRA adapts language models efficiently, it adapts image generators:
Base Stable Diffusion:
UNet + Text Encoder (frozen)
LoRA adaptation:
Add trainable rank decomposition matrices to attention layers
Only ~1-10 MB per LoRA vs ~2-5 GB for full model
Training:
LoRA on specific style → Can generate that style on demand
LoRA on specific concept → Consistent character/object appearance
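The rank decomposition itself is small enough to show directly. A numpy sketch, assuming a 768-dimensional attention weight and rank 8 (both illustrative; `alpha` is the usual LoRA scaling hyperparameter):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 8                            # attention dim and LoRA rank (illustrative)

W = rng.standard_normal((d, d)) * 0.02   # frozen pretrained attention weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-initialized
alpha = 8.0                              # LoRA scaling factor

W_adapted = W + (alpha / r) * (B @ A)    # identical to W until B is trained

full = W.size                            # 589,824 parameters in the full matrix
lora = A.size + B.size                   # 12,288 trainable parameters
```

Only A and B are trained and shipped, which is where the megabytes-versus-gigabytes file sizes above come from; here the per-matrix reduction is 48×.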
LoRA adoption exploded in the image generation community: adapters are small enough to share easily, cheap enough to train on consumer hardware, and multiple LoRAs can be combined at inference time.
Stable Diffusion XL (SDXL) brought substantial improvements over SD 1.5:
| Feature | SD 1.5 / 2.1 | SDXL |
|---|---|---|
| Base resolution | 512×512 | 1024×1024 |
| Latent channels | 4 | 4 (improved) |
| Text encoders | 1× CLIP-L (768d) | 2× CLIP (CLIP-L + CLIP-G) |
| U-Net parameters | ~860M | ~2.6B |
| VAE quality | Moderate | Significantly improved |
| Native refinement | No | Yes (base + refiner) |
SDXL's improvement in text rendering is particularly notable—previous models struggled to render legible text; SDXL handles simple text reasonably well.
Standard Stable Diffusion requires 20-50 denoising steps. New techniques reduce this dramatically:
Latent Consistency Models (LCM) distill the diffusion process into 4-8 steps by training the model to directly predict the solution of the underlying ODE:
Standard SD: 20-50 steps, ~3-8 seconds
LCM: 4-8 steps, ~1-2 seconds
LCM-LoRA: applies LCM acceleration to a compatible base model as a LoRA, no full retraining needed
Adversarial diffusion distillation (ADD) uses GAN-style training to achieve single-step generation with acceptable quality. Not perfect, but useful for rapid prototyping.
Stable Diffusion's power comes from the elegant combination of three components: the VAE for efficient latent space operations, the U-Net for learned denoising, and CLIP for language grounding. ControlNet extends this to arbitrary conditioning, while LoRA enables lightweight customization.
Understanding these principles helps practitioners troubleshoot generation issues, choose appropriate parameters, and design effective custom training regimens. The field continues to evolve rapidly—each generation brings improved quality, faster inference, and new control capabilities.