6 May 2026

Diffusion Models Explained: The Math Behind Stable Diffusion

How denoising diffusion probabilistic models learn to generate images by reversing a gradual noising process — explained from the ground up.

Generative AI, Diffusion Models, Deep Learning, Computer Vision

The Core Idea

Diffusion models belong to the family of latent variable generative models. Their insight is elegant: instead of learning to generate data directly, learn to denoise it.

The training process corrupts data samples by progressively adding Gaussian noise over $T$ steps until the data is indistinguishable from pure noise. The model then learns to reverse this process, predicting and removing the noise step by step.

The Forward Process

Given a clean data sample $x_0$, the forward process defines a Markov chain that gradually adds noise:

$$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t I)$$

where $\{\beta_t\}_{t=1}^T$ is a fixed noise schedule. A useful property: we can sample $x_t$ directly from $x_0$ in closed form. Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$. Then:

$$q(x_t \mid x_0) = \mathcal{N}(x_t;\, \sqrt{\bar{\alpha}_t}\, x_0,\, (1 - \bar{\alpha}_t) I)$$

Or equivalently via the reparameterisation trick:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
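The closed form can be checked by unrolling the chain by one extra step and merging the two independent noise terms. A short sketch of that step, where $\epsilon_t$, $\epsilon_{t-1}$ and $\bar{\epsilon}$ are standard normal variables named here only for illustration:

$$
\begin{aligned}
x_t &= \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon_t \\
    &= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\, x_{t-2} + \sqrt{1-\alpha_{t-1}}\, \epsilon_{t-1}\right) + \sqrt{1-\alpha_t}\, \epsilon_t \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \bar{\epsilon}
\end{aligned}
$$

The last line holds because independent zero-mean Gaussians add their variances: $\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1 - \alpha_t \alpha_{t-1}$. Repeating the same merge down to $x_0$ turns the product of $\alpha$ terms into $\bar{\alpha}_t$.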

As $t \to T$, $\bar{\alpha}_t \to 0$ (for a schedule that adds enough total noise), so the distribution of $x_T$ is approximately $\mathcal{N}(0, I)$.
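A minimal sketch of closed-form forward sampling, using only the Python standard library so it runs without dependencies. The linear schedule endpoints (`1e-4` to `0.02`), the three-value toy sample, and the helper names `linear_alpha_bars` and `q_sample` are assumptions for illustration, not something the text above specifies.

```python
import math
import random

def linear_alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_1..alpha_bar_T for a linear beta schedule (T >= 2)."""
    alpha_bars, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, alpha_bar_t, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; returns (x_t, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    scale_signal = math.sqrt(alpha_bar_t)
    scale_noise = math.sqrt(1.0 - alpha_bar_t)
    x_t = [scale_signal * x + scale_noise * e for x, e in zip(x0, eps)]
    return x_t, eps

rng = random.Random(0)
alpha_bars = linear_alpha_bars(T=1000)
x0 = [1.0, -0.5, 0.25]                           # toy "image" with three values
x_early, _ = q_sample(x0, alpha_bars[9], rng)    # t = 10: still close to x0
x_late, _ = q_sample(x0, alpha_bars[-1], rng)    # t = T: almost pure noise
```

Because `q_sample` needs only $\bar{\alpha}_t$, training can jump to any timestep without simulating the intermediate steps of the chain.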

The Reverse Process

The reverse process learns to denoise step by step:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t))$$

The network $\epsilon_\theta(x_t, t)$ is trained to predict the noise $\epsilon$ that was added. The training objective simplifies to:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$

This is just a denoising regression problem — predict the noise, minimise MSE.
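To make the objective concrete, here is a rough sketch of one Monte Carlo sample of $\mathcal{L}_{\text{simple}}$ for a single data point. The `zero_model` stand-in for $\epsilon_\theta$, the three-step toy schedule, and the function names are hypothetical. The sketch averages the squared error over dimensions, which differs from the squared norm only by a constant factor.

```python
import math
import random

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

def simple_loss(model, x0, alpha_bars, rng):
    """One Monte Carlo sample of L_simple for a single data point x0."""
    t = rng.randrange(1, len(alpha_bars) + 1)    # t ~ Uniform{1, ..., T}
    eps = [rng.gauss(0.0, 1.0) for _ in x0]      # eps ~ N(0, I)
    a_bar = alpha_bars[t - 1]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
           for x, e in zip(x0, eps)]
    # Squared error between true and predicted noise, averaged over dimensions.
    return mse(eps, model(x_t, t))

def zero_model(x_t, t):
    """Hypothetical stand-in for eps_theta: always predicts zero noise."""
    return [0.0 for _ in x_t]

rng = random.Random(0)
alpha_bars = [0.9, 0.5, 0.1]                     # toy cumulative schedule, T = 3
loss = simple_loss(zero_model, [1.0, -1.0], alpha_bars, rng)
```

In a real training loop `model` would be the U-Net, the loss would be averaged over a batch, and its gradient would update the network weights.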

The Network Architecture: U-Net

The denoising network is a U-Net — an encoder-decoder architecture with skip connections between corresponding encoder and decoder feature maps.

Input x_t + timestep embedding

  [Conv] → [ResBlock] → [Attention] → [Downsample]
         ↓                                  ↓
  [Conv] → [ResBlock] → [Attention] → [Downsample]

      Middle Block (ResBlock + Attention)

  [Upsample] → [ResBlock] → [Attention]

  [Upsample] → [ResBlock] → [Attention]

    Output (predicted noise ε)

The timestep $t$ is encoded as a sinusoidal embedding (similar to Transformer positional encodings) and injected into each ResBlock via FiLM conditioning, that is, a per-channel scale and shift computed from the embedding.
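A small sketch of the sinusoidal timestep embedding follows. The layout (all sines followed by all cosines) and `max_period=10000` mirror common practice but are assumptions here, as is the function name `timestep_embedding`.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep t into a vector of even length dim."""
    half = dim // 2
    # Geometrically spaced frequencies starting at 1 and decaying towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = timestep_embedding(t=10, dim=8)
```

Each timestep maps to a distinct pattern of fast and slow oscillations, which gives the network a smooth signal for how much noise to expect in its input.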

Noise Schedules

The schedule $\{\beta_t\}$ controls how quickly noise accumulates. Common choices:
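Two schedules that appear often in the literature are the linear schedule used in the original DDPM work and the cosine schedule proposed for Improved DDPM. The sketch below implements both, assuming the commonly quoted constants ($\beta$ from `1e-4` to `0.02`, offset `s = 0.008`, clipping at `0.999`). None of these values come from the text above.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Betas increasing linearly from beta_start to beta_end (requires T >= 2)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * i for i in range(T)]

def cosine_betas(T, s=0.008, max_beta=0.999):
    """Betas derived from a squared-cosine curve for alpha_bar."""
    def alpha_bar(u):                    # u = t / T in [0, 1]
        return math.cos((u + s) / (1 + s) * math.pi / 2) ** 2
    return [min(1.0 - alpha_bar((i + 1) / T) / alpha_bar(i / T), max_beta)
            for i in range(T)]

def cumulative_alpha_bars(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

T = 1000
linear_ab = cumulative_alpha_bars(linear_betas(T))
cosine_ab = cumulative_alpha_bars(cosine_betas(T))
# Halfway through, the cosine schedule has kept more signal than the linear one.
halfway = (linear_ab[T // 2], cosine_ab[T // 2])
```

Comparing the two $\bar{\alpha}_t$ curves shows the practical difference: the linear schedule destroys most of the signal well before the final step, while the cosine schedule spreads the corruption more evenly across timesteps.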

DDIM: Faster Sampling

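Standard DDPM typically uses $T = 1000$ denoising steps to generate one image. DDIM (Denoising Diffusion Implicit Models) generalises the diffusion to a non-Markovian process with the same marginals $q(x_t \mid x_0)$, so the same trained network can be sampled on a short subsequence of timesteps, enabling generation in 20–50 steps with comparable quality.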

The DDIM update step:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, \underbrace{\frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta}{\sqrt{\bar{\alpha}_t}}}_{\text{predicted } x_0} + \sqrt{1 - \bar{\alpha}_{t-1}}\,\epsilon_\theta$$
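The update above is the deterministic form of DDIM (no fresh noise is injected). A minimal sketch of one such step on plain Python lists follows. The function name `ddim_step` and the toy numbers are illustrative assumptions, and a true noise vector stands in for the network output $\epsilon_\theta$.

```python
import math

def ddim_step(x_t, eps_pred, a_bar_t, a_bar_prev):
    """Deterministic DDIM update from noise level a_bar_t to a_bar_prev."""
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = [(x - math.sqrt(1.0 - a_bar_t) * e) / math.sqrt(a_bar_t)
               for x, e in zip(x_t, eps_pred)]
    # Re-noise that estimate to the lower noise level along the same noise direction.
    return [math.sqrt(a_bar_prev) * x0 + math.sqrt(1.0 - a_bar_prev) * e
            for x0, e in zip(x0_pred, eps_pred)]

# Sanity check with a perfect noise prediction: the step lands on the
# forward-process sample at the earlier, less noisy level.
x0, eps = [1.0, -0.5], [0.3, -1.2]
a_bar_t, a_bar_prev = 0.2, 0.6
x_t = [math.sqrt(a_bar_t) * x + math.sqrt(1.0 - a_bar_t) * e for x, e in zip(x0, eps)]
x_prev = ddim_step(x_t, eps, a_bar_t, a_bar_prev)
```

Nothing in `ddim_step` requires the two noise levels to be adjacent timesteps, which is what allows sampling on a sparse subsequence of the original schedule.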

Latent Diffusion Models

Stable Diffusion operates in latent space, not pixel space. A variational autoencoder (VAE) first compresses the image:

$$z = \mathcal{E}(x), \quad \hat{x} = \mathcal{D}(z)$$

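The diffusion process runs on the latent $z$, typically $64 \times 64 \times 4$ for a $512 \times 512$ image. Counting three colour channels, that is $512 \cdot 512 \cdot 3 = 786{,}432$ pixel values against $64 \cdot 64 \cdot 4 = 16{,}384$ latent values, so the denoiser works on 48× fewer values than pixel-space diffusion would.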

Classifier-Free Guidance

To steer generation toward a text prompt $c$, classifier-free guidance combines the conditional and unconditional noise predictions:

$$\tilde{\epsilon}_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \emptyset) + w\,[\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset)]$$

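Here $\emptyset$ denotes the empty prompt. The guidance scale $w$ controls the trade-off between sample quality and prompt adherence (higher $w$) and diversity (lower $w$): $w = 1$ recovers the plain conditional prediction, and $w > 1$ extrapolates beyond it. Typical values are 7–15.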
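The guidance formula is a one-line combination of two network outputs. A small sketch follows, with toy vectors in place of real predictions and `classifier_free_guidance` as a hypothetical helper name:

```python
def classifier_free_guidance(eps_uncond, eps_cond, w):
    """Combine unconditional and conditional noise predictions with guidance scale w."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]
no_guidance = classifier_free_guidance(eps_uncond, eps_cond, w=0.0)  # equals eps_uncond
plain_cond = classifier_free_guidance(eps_uncond, eps_cond, w=1.0)   # equals eps_cond
guided = classifier_free_guidance(eps_uncond, eps_cond, w=7.5)       # pushed past eps_cond
```

In practice this costs two forward passes per sampling step, one with the prompt and one with the empty prompt, which are often run together as a single batch.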

Key Takeaways