6 May 2026
Diffusion Models Explained: The Math Behind Stable Diffusion
How denoising diffusion probabilistic models learn to generate images by reversing a gradual noising process — explained from the ground up.
The Core Idea
Diffusion models belong to the family of latent variable generative models. Their insight is elegant: instead of learning to generate data directly, learn to denoise it.
The training process corrupts data samples by progressively adding Gaussian noise over T steps until the data is indistinguishable from pure noise. The model then learns to reverse this process — predicting and removing the noise step by step.
The Forward Process
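Given a clean data sample $x_0$, the forward process defines a Markov chain that gradually adds noise:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$
where $\{\beta_t\}_{t=1}^{T}$ is a fixed noise schedule. A useful property: we can sample $x_t$ directly from $x_0$ in closed form. Let $\alpha_t = 1-\beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t}\alpha_s$:
$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\right)$$
Or equivalently via the reparameterisation trick:
$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
As $t \to T$, $\bar\alpha_t \to 0$, so $x_T$ is approximately distributed as $\mathcal{N}(0, I)$.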
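The closed-form jump from the clean sample to any noise level can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names `make_linear_schedule` and `q_sample` are made up for this example, the schedule values are the linear DDPM ones quoted in the Noise Schedules section, and timesteps are treated as 1-indexed.

```python
import numpy as np

def make_linear_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Return (betas, alpha_bars) for a linear schedule (values assumed from DDPM)."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)  # alpha_bar_t = prod_{s<=t} (1 - beta_s)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in a single step; t is 1-indexed."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps

rng = np.random.default_rng(0)
betas, alpha_bars = make_linear_schedule()
x0 = np.ones((4, 8, 8))                        # toy "image" of all ones
x_mid, _ = q_sample(x0, 500, alpha_bars, rng)  # partly noised
x_T, _ = q_sample(x0, 1000, alpha_bars, rng)   # close to pure noise
```

Because `alpha_bars` shrinks monotonically towards zero, `x_T` carries almost no trace of `x0`, which is the property the last sentence above relies on.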
The Reverse Process
The reverse process learns to denoise step by step:
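$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
Rather than outputting the mean directly, the network $\epsilon_\theta(x_t, t)$ is trained to predict the noise $\epsilon$ that was added; the mean then follows as $\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right)$. The training objective simplifies to:
$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\right]$$
This is just a denoising regression problem: predict the noise, minimise MSE.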
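To make the objective concrete, here is a rough sketch of a one-batch Monte Carlo estimate of the loss. It assumes uniform sampling of the timestep and a linear schedule; `simple_loss` and `zero_model` are hypothetical names, and `zero_model` is only a stand-in for a trained network (it always predicts zero noise).

```python
import numpy as np

def simple_loss(eps_model, x0, alpha_bars, rng):
    """One-batch estimate of L_simple; eps_model(x_t, t) is any noise predictor."""
    batch = x0.shape[0]
    T = len(alpha_bars)
    t = rng.integers(1, T + 1, size=batch)   # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0.shape)      # the noise the model must recover
    # Reshape alpha_bar_t so it broadcasts over the non-batch axes.
    a_bar = alpha_bars[t - 1].reshape(batch, *([1] * (x0.ndim - 1)))
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)

def zero_model(x_t, t):
    """Placeholder predictor (not a real network): always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0 = rng.standard_normal((64, 3, 8, 8))
loss = simple_loss(zero_model, x0, alpha_bars, rng)
```

A predictor that ignores its input scores roughly 1 here, the variance of the noise, so that is the baseline a real network has to beat.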
The Network Architecture: U-Net
The denoising network is a U-Net — an encoder-decoder architecture with skip connections between corresponding encoder and decoder feature maps.
Input x_t + timestep embedding
↓
[Conv] → [ResBlock] → [Attention] → [Downsample]
↓
[Conv] → [ResBlock] → [Attention] → [Downsample]
↓
Middle Block (ResBlock + Attention)
↓
[Upsample] → [ResBlock] → [Attention]
↓
[Upsample] → [ResBlock] → [Attention]
↓
Output (predicted noise ε)
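The timestep $t$ is encoded as a sinusoidal embedding (similar to Transformer positional encodings) and injected into each ResBlock, either by adding a projection of it to the feature maps or via FiLM-style scale-and-shift conditioning, depending on the implementation.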
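The embedding and its injection can be sketched as follows. The frequency layout is one common convention rather than the only one, the function names are illustrative, and the projection matrices `w_scale` and `w_shift` are hypothetical learned parameters (set to small random values here just to show the shapes).

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of integer timesteps; dim is assumed to be even."""
    t = np.asarray(t, dtype=np.float64).reshape(-1)
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to about 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)

def film(h, emb, w_scale, w_shift):
    """FiLM-style conditioning: per-channel scale and shift derived from emb."""
    scale = emb @ w_scale   # (batch, channels)
    shift = emb @ w_shift   # (batch, channels)
    return h * (1.0 + scale[:, :, None, None]) + shift[:, :, None, None]

rng = np.random.default_rng(0)
emb = timestep_embedding([0, 10, 999], dim=128)   # (3, 128)
h = rng.standard_normal((3, 16, 4, 4))            # (batch, channels, H, W)
w_scale = 0.01 * rng.standard_normal((128, 16))
w_shift = 0.01 * rng.standard_normal((128, 16))
h_cond = film(h, emb, w_scale, w_shift)
```

Writing the scale as `1 + scale` means zero-initialised projections leave the features unchanged, a common design choice for conditioning layers.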
Noise Schedules
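The schedule $\{\beta_t\}$ controls how quickly noise accumulates. Common choices:
- Linear (DDPM): $\beta_t$ increases linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$
- Cosine (improved DDPM): $\bar\alpha_t = f(t)/f(0)$ with $f(t) = \cos^2\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)$, where $s$ is a small offset. It destroys information more slowly than the linear schedule, which adds noise faster than necessary
- Flow matching (used in Stable Diffusion 3, Flux): strictly an alternative training formulation rather than a $\beta_t$ schedule; it follows straight paths between data and noise, which allows faster sampling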
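The first two schedules in the list can be compared numerically. The sketch below assumes $T = 1000$; the offset `s=0.008` and the clipping value `max_beta=0.999` are the values I recall from the improved DDPM paper and should be treated as assumptions. Function names are illustrative.

```python
import numpy as np

def linear_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for the linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bars(T=1000, s=0.008):
    """alpha_bar_t = f(t) / f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f[1:] / f[0]

def betas_from_alpha_bars(alpha_bars, max_beta=0.999):
    """Recover beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped near t = T."""
    prev = np.concatenate([[1.0], alpha_bars[:-1]])
    return np.clip(1.0 - alpha_bars / prev, 0.0, max_beta)

lin_ab = linear_alpha_bars()
cos_ab = cosine_alpha_bars()
cos_betas = betas_from_alpha_bars(cos_ab)

# Fraction of signal variance remaining halfway through the forward process.
print(f"linear alpha_bar at T/2: {lin_ab[499]:.3f}")
print(f"cosine alpha_bar at T/2: {cos_ab[499]:.3f}")
```

Halfway through, the cosine schedule retains noticeably more of the original signal than the linear one, which is the sense in which it noises more gradually.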
DDIM: Faster Sampling
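Standard DDPM requires $T = 1000$ denoising steps to generate one image. DDIM (Denoising Diffusion Implicit Models) redefines the forward process as non-Markovian while keeping the same marginals $q(x_t \mid x_0)$, so a model trained with the DDPM objective can be sampled along a short subsequence of timesteps, enabling generation in 20–50 steps with comparable quality.
The DDIM update step, in its deterministic form:
$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\underbrace{\left(\frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta}{\sqrt{\bar\alpha_t}}\right)}_{\text{predicted } x_0} + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon_\theta$$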
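The update can be written as a short function and looped over a subsequence of timesteps. This sketch covers only the deterministic variant (no added noise), and it uses an oracle noise predictor built from a known `x0_true` purely as a sanity check; a real sampler would call a trained network instead. All names are illustrative.

```python
import numpy as np

def ddim_step(x_t, eps_pred, a_bar_t, a_bar_prev):
    """One deterministic DDIM update from noise level a_bar_t to a_bar_prev."""
    x0_pred = (x_t - np.sqrt(1.0 - a_bar_t) * eps_pred) / np.sqrt(a_bar_t)
    return np.sqrt(a_bar_prev) * x0_pred + np.sqrt(1.0 - a_bar_prev) * eps_pred

def ddim_sample(eps_model, x_T, alpha_bars, num_steps=20):
    """Run DDIM on num_steps evenly spaced timesteps from T down to 1."""
    T = len(alpha_bars)
    timesteps = np.linspace(T, 1, num_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(timesteps):
        a_bar_t = alpha_bars[t - 1]
        # After the last step we land on the clean sample (alpha_bar = 1).
        a_bar_prev = alpha_bars[timesteps[i + 1] - 1] if i + 1 < num_steps else 1.0
        x = ddim_step(x, eps_model(x, t), a_bar_t, a_bar_prev)
    return x

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
rng = np.random.default_rng(0)
x0_true = rng.standard_normal((2, 4, 4))
eps_true = rng.standard_normal(x0_true.shape)
x_T = np.sqrt(alpha_bars[-1]) * x0_true + np.sqrt(1.0 - alpha_bars[-1]) * eps_true

def oracle_eps(x_t, t):
    """Hypothetical perfect predictor: solves the forward equation for eps."""
    a_bar = alpha_bars[t - 1]
    return (x_t - np.sqrt(a_bar) * x0_true) / np.sqrt(1.0 - a_bar)

x0_rec = ddim_sample(oracle_eps, x_T, alpha_bars, num_steps=20)
```

With a perfect predictor the 20-step trajectory lands back on `x0_true`, which shows why skipping timesteps is harmless when the noise estimate is accurate.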
Latent Diffusion Models
Stable Diffusion operates in latent space, not pixel space. A variational autoencoder (VAE) first compresses the image:
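$$z = \mathcal{E}(x), \qquad \hat{x} = \mathcal{D}(z)$$
The diffusion process runs on the latent $z$, typically $64 \times 64 \times 4$ for a $512 \times 512$ RGB image. The denoiser therefore operates on 48× fewer values than pixel-space diffusion would ($512 \cdot 512 \cdot 3$ versus $64 \cdot 64 \cdot 4$), which cuts the compute per step substantially.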
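The factor of 48 can be checked with a quick count of values. The only assumption is a 3-channel RGB input, since the text gives just the spatial size of the image.

```python
# Values the denoiser would process per image in each space.
pixel_values = 512 * 512 * 3   # RGB image (3 channels assumed)
latent_values = 64 * 64 * 4    # VAE latent with 4 channels

ratio = pixel_values / latent_values
print(pixel_values, latent_values, ratio)  # 786432 16384 48.0
```

This counts tensor elements only; the actual compute saving depends on the architecture, so the ratio is best read as an order-of-magnitude guide.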
Classifier-Free Guidance
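To steer generation toward a text prompt $c$, classifier-free guidance combines the conditional and unconditional noise predictions:
$$\tilde\epsilon_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \varnothing) + w\left[\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\right]$$
Setting $w = 1$ recovers the plain conditional prediction, while $w > 1$ extrapolates past it, away from the unconditional one. The guidance scale $w$ controls the trade-off between sample quality and prompt adherence (higher $w$) and diversity (lower $w$). Typical values are 7–15.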
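In code, guidance is a single line applied to two forward passes of the same network. The sketch below uses random arrays in place of real network outputs, and the function name `classifier_free_guidance` is illustrative.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, w):
    """Combine unconditional and conditional noise predictions with scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((1, 4, 64, 64))  # stand-in for eps(x_t, t, null)
eps_cond = rng.standard_normal((1, 4, 64, 64))    # stand-in for eps(x_t, t, c)

eps_w0 = classifier_free_guidance(eps_uncond, eps_cond, 0.0)   # ignores the prompt
eps_w1 = classifier_free_guidance(eps_uncond, eps_cond, 1.0)   # plain conditional
eps_w7 = classifier_free_guidance(eps_uncond, eps_cond, 7.5)   # extrapolated
```

In practice the two predictions are often computed in one batched call by stacking the conditional and empty-prompt inputs, which roughly doubles the cost of each sampling step compared with unguided sampling.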
Key Takeaways
- Diffusion models frame generation as iterative denoising — a simple regression objective
- The forward process analytically defines $x_t$ from $x_0$ in one step, enabling efficient training
- DDIM reduces inference steps from 1000 to ~20–50 with no retraining
- Latent diffusion (Stable Diffusion) moves the diffusion process into a compressed VAE latent space for tractable high-resolution generation
- Classifier-free guidance is the key lever for controlling output fidelity and prompt adherence