Diffusion Models Background and Code

We start by summarizing diffusion models and the guidance mechanisms used to control the generation process.

Section 1: Background


1.1. Diffusion Models

Diffusion Models (DMs) (Ho et al., 2020) are trained by gradually adding noise to data through a forward diffusion process and learning a reverse denoising process that reconstructs the original sample. At inference time, new samples are generated by running the learned reverse process starting from pure noise.

The forward process corrupts a data sample \(x_0\) through iterative noise addition controlled by a variance schedule \(\beta = \{\beta_t\}_{t=1}^T\):

\[q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \beta_t I)\]

Defining \(\alpha_t = 1 - \beta_t\) and \(\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i\), \(x_t\) can be sampled directly from \(x_0\) via the marginal distribution:

\[q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, (1 - \bar{\alpha}_t) I)\]

Equivalently, by the reparameterization trick, \(x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon\) with \(\epsilon \sim \mathcal{N}(0, I)\); the code in Section 2 uses exactly this form.

The reverse process, parameterized by a neural network \(p_\theta\), learns to reconstruct the data sample by predicting the denoised mean and variance at each step:

\[p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))\]

Latent Diffusion Models (LDMs) (Rombach et al., 2021) operate in a compressed latent space rather than the high-dimensional data space, which improves efficiency. The data \(x_0\) is encoded as a latent \(z_0\) by an autoencoder, and the diffusion process is applied to this latent representation, producing noised latents \(z_t\).

LDMs are trained to minimize the noise-prediction objective:

\[L = \mathbb{E}_{z_0, t, \epsilon \sim \mathcal{N}(0, I)} \left[ \| \epsilon - \epsilon_\theta(z_t, t) \|_2^2 \right]\]

where \(\epsilon\) is random Gaussian noise and \(\epsilon_\theta(z_t, t)\) is the model's noise prediction at time \(t\).
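
To make the objective concrete, here is a minimal sketch of one training step. It assumes a generic noise-prediction `model`, a latent batch `z0`, and precomputed 1-D tensors `sqrt_alphas_cumprod` and `sqrt_one_minus_alphas_cumprod` on the same device (names chosen to match the code in Section 2; everything else is illustrative rather than the exact DDPM-IP training loop):

import torch as th

def training_step(model, z0, sqrt_alphas_cumprod, sqrt_one_minus_alphas_cumprod, T):
    # Sample a random timestep for each example and fresh Gaussian noise.
    t = th.randint(0, T, (z0.shape[0],), device=z0.device)
    eps = th.randn_like(z0)

    # Closed-form forward process: z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps.
    a = sqrt_alphas_cumprod[t].view(-1, 1, 1, 1)
    s = sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)
    z_t = a * z0 + s * eps

    # Unweighted noise-prediction loss.
    eps_pred = model(z_t, t)
    return th.mean((eps - eps_pred) ** 2)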

1.2. Guidance

In the denoising process, different conditional inputs \(c\) (text, image, depth, mask, etc.) can be supplied to control the generation, so the denoising model instead predicts \(\epsilon_\theta(z_t, t, c)\).

In classifier guidance (Dhariwal & Nichol, 2021), the diffusion score is modified at sampling time to include the gradient of a separately trained classifier \(p_\phi(c | z_t)\):

\[\tilde{\epsilon}_\theta(z_t, t, c) = \epsilon_\theta(z_t, t) - w \sqrt{1 - \bar{\alpha}_t} \, \nabla_{z_t} \log p_\phi(c | z_t)\]

where \(w\) is a guidance strength parameter that controls the influence of the classifier.
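
A minimal sketch of this modification might look as follows, assuming a noise-prediction `model`, a noise-aware `classifier` returning class logits, and a 1-D tensor `sqrt_one_minus_alphas_cumprod`; the function and its interface are illustrative, not part of the DDPM-IP codebase:

import torch as th

def classifier_guided_eps(model, classifier, z_t, t, c, w, sqrt_one_minus_alphas_cumprod):
    # Unconditional noise prediction from the diffusion model.
    eps = model(z_t, t)

    # Gradient of the classifier's log-probability of class c w.r.t. the noisy input.
    with th.enable_grad():
        z_in = z_t.detach().requires_grad_(True)
        log_probs = th.log_softmax(classifier(z_in, t), dim=-1)
        selected = log_probs[th.arange(len(c)), c].sum()
        grad = th.autograd.grad(selected, z_in)[0]

    # Shift the predicted noise against the gradient, scaled by sqrt(1 - abar_t).
    scale = sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)
    return eps - w * scale * grad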

In classifier-free guidance (Ho & Salimans, 2022), a single neural network parameterizes both the unconditional denoising diffusion model \(p_\theta(z)\) and the conditional denoising diffusion model \(p_\theta(z | c)\). During training, the conditioning input \(c\) is randomly replaced with a null token \(\Phi\) with some probability \(p_{\text{uncond}}\), set as a hyperparameter.

During sampling, a linear combination of the conditional and unconditional score estimates is used:

\[\tilde{\epsilon}_\theta(z_t, t, c) = (1 + w) \, \epsilon_\theta(z_t, t, c) - w \, \epsilon_\theta(z_t, t, \Phi)\]
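
A minimal classifier-free guidance step could then be sketched as below; evaluating both branches in one doubled batch is a common convention, and `null_c` stands for a batch of null tokens \(\Phi\) (all names here are illustrative):

import torch as th

def cfg_eps(model, z_t, t, c, null_c, w):
    # Evaluate the conditional and unconditional branches in one batched forward pass.
    z_in = th.cat([z_t, z_t], dim=0)
    t_in = th.cat([t, t], dim=0)
    c_in = th.cat([c, null_c], dim=0)
    eps_cond, eps_uncond = model(z_in, t_in, c_in).chunk(2, dim=0)

    # Linear combination: (1 + w) * conditional - w * unconditional.
    return (1 + w) * eps_cond - w * eps_uncond

Batching the two branches keeps the cost at a single forward pass over a doubled batch instead of two separate passes.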

Section 2: Code


We will use the codebase of DDPM-IP (Ning et al., 2023) to walk through the implementation of these steps.

2.1. Forward Diffusion (Noising Process)
Step 1: Noise Addition

The forward diffusion process incrementally adds Gaussian noise to the data:

\[q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \cdot x_{t-1}, \beta_t \cdot I)\]
Direct Sampling of \(x_t\)

The noise is accumulated across timesteps, allowing \(x_t\) to be sampled directly from \(x_0\):

\[q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \cdot x_0, (1 - \bar{\alpha}_t) \cdot I)\]
Code Implementation:
# beta schedule -> alpha_t = 1 - beta_t; cumulative products give alpha_bar_t
alphas = 1.0 - betas
self.alphas_cumprod = np.cumprod(alphas, axis=0)
self.sqrt_alphas_cumprod = np.sqrt(self.alphas_cumprod)                  # sqrt(alpha_bar_t)
self.sqrt_one_minus_alphas_cumprod = np.sqrt(1.0 - self.alphas_cumprod)  # sqrt(1 - alpha_bar_t)

def q_sample(self, x_start, t, noise=None):
    # Closed-form sampling of x_t from x_0:
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    if noise is None:
        noise = th.randn_like(x_start)
    return (
        _extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start
        + _extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape) * noise
    )
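
For illustration, noising a batch with `q_sample` might look like this, assuming `diffusion` is a `GaussianDiffusion`-style object built from the schedule above (the `num_timesteps` attribute and the tensor shapes are assumptions):

x0 = th.randn(4, 3, 32, 32)                       # a batch of (normalized) images
t = th.randint(0, diffusion.num_timesteps, (4,))  # one random timestep per image
x_t = diffusion.q_sample(x0, t)                   # the batch noised to those timesteps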
2.2. Reverse Diffusion (Denoising Process)
Step 2: Predicted Noise \(\epsilon_{\theta}(x_t, t)\)

The model directly predicts the noise that was added to produce \(x_t\). This prediction is denoted by \(\epsilon_\theta(x_t, t)\).

\[\epsilon_\theta(x_t, t) = f_\theta(x_t, t)\]
Code Implementation:
# Network forward pass: predicts the noise (and, when learned, the variance) from x_t and t
model_output = model(x_t, self._scale_timesteps(t), **model_kwargs)
Step 3: Predicted \(x_0\)

Using the predicted noise \(\epsilon_\theta(x_t, t)\), the noiseless image \(x_{\text{pred}_0}\) (the predicted \(x_0\)) is reconstructed. This step inverts the closed-form noising equation using the diffusion schedule.

\[x_{\text{pred}_0} = \frac{x_t - \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}\]
Code Implementation:
def _predict_xstart_from_eps(self, x_t, t, eps):
    # Rearranged form of the equation above, where
    # sqrt_recip_alphas_cumprod   = sqrt(1 / alpha_bar_t)
    # sqrt_recipm1_alphas_cumprod = sqrt(1 / alpha_bar_t - 1)
    return (
        _extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t
        - _extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape) * eps
    )
Step 4: Predicted Mean \(\mu_\theta(x_t, t)\)

The mean of the reverse process, \(\mu_\theta(x_t, t)\), is computed from the predicted \(x_0\) using the forward-process posterior \(q(x_{t-1} | x_t, x_0)\):

\[\mu_\theta(x_t, t) = \frac{\sqrt{\bar{\alpha}_{t-1}} \, \beta_t}{1 - \bar{\alpha}_t} \, x_{\text{pred}_0} + \frac{\sqrt{\alpha_t} \, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \, x_t\]
Code Implementation:
def q_posterior_mean_variance(self, x_start, x_t, t):
    # Mean and variance of the posterior q(x_{t-1} | x_t, x_0), where
    # posterior_mean_coef1 = sqrt(alpha_bar_{t-1}) * beta_t / (1 - alpha_bar_t)
    # posterior_mean_coef2 = sqrt(alpha_t) * (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t)
    posterior_mean = (
        _extract_into_tensor(self.posterior_mean_coef1, t, x_t.shape) * x_start
        + _extract_into_tensor(self.posterior_mean_coef2, t, x_t.shape) * x_t
    )
    posterior_variance = _extract_into_tensor(self.posterior_variance, t, x_t.shape)
    return posterior_mean, posterior_variance
Step 5: Adding Stochastic Noise

Finally, stochastic noise is added to the predicted mean \(\mu_\theta(x_t, t)\) to sample \(x_{t-1}\). This accounts for the uncertainty in the reverse process, ensuring the model generates diverse outputs.

\[x_{t-1} = \mu_\theta(x_t, t) + \sqrt{\Sigma_\theta(x_t, t)} \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]
Code Implementation:
# nonzero_mask is 0 when t == 0 (no noise is added at the final step) and 1 otherwise
sample = out["mean"] + nonzero_mask * th.exp(0.5 * out["log_variance"]) * noise
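
Putting the steps together, a minimal ancestral sampling loop could look like the sketch below. It assumes the `GaussianDiffusion`-style object shown above, with `p_mean_variance` returning the dictionary used in the line above; the loop mirrors the usual DDPM sampler rather than the exact DDPM-IP code:

def p_sample_loop_sketch(diffusion, model, shape):
    # Start from pure Gaussian noise and denoise step by step (Steps 2-5).
    x = th.randn(*shape)
    for i in reversed(range(diffusion.num_timesteps)):
        t = th.full((shape[0],), i, dtype=th.long)
        out = diffusion.p_mean_variance(model, x, t)        # predicted mean and (log-)variance
        nonzero_mask = (t != 0).float().view(-1, 1, 1, 1)   # no noise at the final step
        x = out["mean"] + nonzero_mask * th.exp(0.5 * out["log_variance"]) * th.randn_like(x)
    return x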

That briefly covers diffusion models; for more details, please read (Chan, 2024).

References:

2024

  1. Tutorial on Diffusion Models for Imaging and Vision
    Stanley H Chan
    arXiv preprint arXiv:2403.18103, 2024

2023

  1. Input Perturbation Reduces Exposure Bias in Diffusion Models
    Mang Ning, Enver Sangineto, Angelo Porrello, and 2 more authors
    In International Conference on Machine Learning, 2023

2022

  1. Classifier-Free Diffusion Guidance
    Jonathan Ho, and Tim Salimans
    arXiv preprint arXiv:2207.12598, 2022

2021

  1. Diffusion Models Beat GANs on Image Synthesis
    Prafulla Dhariwal, and Alexander Nichol
    In Advances in Neural Information Processing Systems, 2021
  2. High-Resolution Image Synthesis with Latent Diffusion Models
    Robin Rombach, Andreas Blattmann, Dominik Lorenz, and 2 more authors
    arXiv preprint arXiv:2112.10752, 2021

2020

  1. Denoising Diffusion Probabilistic Models
    Jonathan Ho, Ajay Jain, and Pieter Abbeel
    In Advances in Neural Information Processing Systems, 2020





