Diffusion Models Background and Code
We start by summarizing diffusion models and the guidance techniques used to control the generation process.
Section 1: Background
1.1. Diffusion Models
Diffusion Models (DMs) (Ho et al., 2020) generate samples by gradually corrupting data with noise in a forward diffusion process and learning a reverse denoising process that reconstructs the original sample.
The forward process corrupts a data sample \(x_0\) through iterative noise addition controlled by a variance schedule \(\beta = \{\beta_t\}_{t=1}^T\):
\[q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \beta_t I)\]With \(\alpha_t = 1 - \beta_t\) and \(\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i\), \(x_t\) can be sampled directly from \(x_0\) using the marginal distribution:
\[q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, (1 - \bar{\alpha}_t) I)\]The reverse process, parameterized by a neural network \(p_\theta\), learns to reconstruct the data sample by predicting the denoised mean and variance at each step:
\[p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))\]
Latent Diffusion Models (LDMs) (Rombach et al., 2021) operate in a compressed latent space rather than in the high-dimensional data space, improving efficiency. The data \(x_0\) is encoded as \(z_0\) by an autoencoder, and the diffusion process is applied to this latent representation.
LDMs are trained to minimize the objective:
\[L = \mathbb{E}_{z_0, t, \epsilon \sim \mathcal{N}(0, I)} \left[ \| \epsilon - \epsilon_\theta(z_t, t) \|_2^2 \right]\]where \(\epsilon\) is random Gaussian noise, \(z_t\) is the noised latent at timestep \(t\), and \(\epsilon_\theta\) is the model's predicted noise.
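As a rough illustration of this objective, a minimal training-step sketch following the DDPM parameterization (the names model, z_0, and the precomputed schedule tensors are assumptions for illustration, not the exact API of any particular codebase):
import torch

def training_step(model, z_0, T, sqrt_alphas_cumprod, sqrt_one_minus_alphas_cumprod):
    # z_0: latent produced by the autoencoder's encoder, shape (B, C, H, W)
    # sqrt_alphas_cumprod, sqrt_one_minus_alphas_cumprod: 1D tensors of length T
    B = z_0.shape[0]
    t = torch.randint(0, T, (B,), device=z_0.device)      # random timestep per sample
    eps = torch.randn_like(z_0)                            # epsilon ~ N(0, I)
    # z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps
    a = sqrt_alphas_cumprod[t].view(B, 1, 1, 1)
    b = sqrt_one_minus_alphas_cumprod[t].view(B, 1, 1, 1)
    z_t = a * z_0 + b * eps
    eps_pred = model(z_t, t)                               # epsilon_theta(z_t, t)
    return torch.nn.functional.mse_loss(eps_pred, eps)     # || eps - eps_theta ||^2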
1.2. Guidance
During the denoising process, different conditioning inputs \(c\) (text, image, depth, mask, etc.) can be provided to control the generation, so the denoising model predicts \(\epsilon_\theta(z_t, t, c)\).
At sampling time, the diffusion score is modified to include the gradient of the log-likelihood of a separately trained classifier \(p_\phi(c | z_t)\) (Dhariwal & Nichol, 2021):
\[\tilde{\epsilon}_\theta(z_t, t, c) = \epsilon_\theta(z_t, t) - \sqrt{1 - \bar{\alpha}_t} \, w \, \nabla_{z_t} \log p_\phi(c | z_t)\]where \(w\) is a guidance strength parameter that controls the influence of the classifier.
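For illustration, the classifier term could be computed with autograd along these lines (classifier here is an assumed noise-aware classifier, not part of the diffusion codebase discussed later):
import torch

def classifier_log_prob_grad(classifier, z_t, t, c):
    # gradient of log p_phi(c | z_t) with respect to the noisy latent z_t
    z = z_t.detach().requires_grad_(True)
    logits = classifier(z, t)                              # classifier evaluated on the noisy latent
    log_probs = torch.log_softmax(logits, dim=-1)
    selected = log_probs[torch.arange(len(c)), c]          # log p_phi(c | z_t) per sample
    return torch.autograd.grad(selected.sum(), z)[0]

# guided score: eps_theta(z_t, t) - sqrt(1 - alpha_bar_t) * w * grad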
In classifier-free guidance (Ho & Salimans, 2022), a single neural network is used to parameterize both the unconditional denoising diffusion model \(p_\theta(z)\) and the conditional denoising diffusion model \(p_\theta(z | c)\). During training, the conditioning \(c\) is randomly replaced with a null token \(\varnothing\) with some probability \(p_{\text{uncond}}\), set as a hyperparameter.
During sampling, a linear combination of conditional and unconditional score estimates is used:
\[\tilde{\epsilon}_\theta(z_t, t, c) = (1 + w) \, \epsilon_\theta(z_t, t, c) - w \, \epsilon_\theta(z_t, t)\]
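A minimal sketch of this combination at a single sampling step (the model signature and null_token are assumptions; many implementations batch the two forward passes together):
def cfg_epsilon(model, z_t, t, c, null_token, w):
    eps_cond = model(z_t, t, c)             # epsilon_theta(z_t, t, c)
    eps_uncond = model(z_t, t, null_token)  # epsilon_theta(z_t, t)
    # (1 + w) * conditional - w * unconditional
    return (1 + w) * eps_cond - w * eps_uncond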
Section 2: Code
We will use the DDPM-IP codebase (Ning et al., 2023) to walk through this section.
2.1. Forward Diffusion (Noising Process)
Step 1: Noise Addition
The forward diffusion process incrementally adds Gaussian noise to the data:
\[q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \cdot x_{t-1}, \beta_t \cdot I)\]
Direct Sampling of \(x_t\)
The noise is accumulated across timesteps, allowing \(x_t\) to be sampled directly from \(x_0\):
\[q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \cdot x_0, (1 - \bar{\alpha}_t) \cdot I)\]
Code Implementation:
# Precomputed schedule terms (np = numpy, th = torch in the codebase)
alphas = 1.0 - betas                                                      # alpha_t = 1 - beta_t
self.alphas_cumprod = np.cumprod(alphas, axis=0)                          # alpha_bar_t
self.sqrt_alphas_cumprod = np.sqrt(self.alphas_cumprod)                   # sqrt(alpha_bar_t)
self.sqrt_one_minus_alphas_cumprod = np.sqrt(1.0 - self.alphas_cumprod)   # sqrt(1 - alpha_bar_t)

def q_sample(self, x_start, t, noise=None):
    # Sample x_t ~ q(x_t | x_0) in closed form
    if noise is None:
        noise = th.randn_like(x_start)
    return (
        _extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start
        + _extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape) * noise
    )
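A hypothetical usage of this helper to noise a batch of images (diffusion stands for the GaussianDiffusion object; the data loader and shapes are illustrative):
x_0 = next(iter(data_loader))                        # clean images, e.g. shape (B, 3, 64, 64)
t = th.randint(0, diffusion.num_timesteps, (x_0.shape[0],), device=x_0.device)
x_t = diffusion.q_sample(x_0, t)                     # noised images at random timesteps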
2.2. Reverse Diffusion (Denoising Process)
Step 2: Predicted Noise \(\epsilon_{\theta}(x_t, t)\)
The model directly predicts the noise that was added to \(x_t\); this prediction is denoted \(\epsilon_\theta(x_t, t)\).
\[\epsilon_\theta(x_t, t) = f_\theta(x_t, t)\]where \(f_\theta\) is the denoising network.
Code Implementation:
model_output = model(x_t, self._scale_timesteps(t), **model_kwargs)  # epsilon_theta(x_t, t): the network's noise prediction
Step 3: Predicted \(x_0\)
Using the predicted noise \(\epsilon_\theta(x_t, t)\), the noiseless image \(x_{\text{pred}_0}\) (predicted \(x_0\)) is reconstructed. This step removes the noise from \(x_t\) using the diffusion schedule.
\[x_{\text{pred}_0} = \frac{x_t - \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}\]
Code Implementation:
def _predict_xstart_from_eps(self, x_t, t, eps):
    # x_pred_0 = x_t / sqrt(alpha_bar_t) - sqrt(1 / alpha_bar_t - 1) * eps
    return (
        _extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t
        - _extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape) * eps
    )
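This is the formula above rearranged: sqrt_recip_alphas_cumprod stores \(\sqrt{1/\bar{\alpha}_t}\) and sqrt_recipm1_alphas_cumprod stores \(\sqrt{1/\bar{\alpha}_t - 1}\). In the codebase these buffers are precomputed roughly as:
self.sqrt_recip_alphas_cumprod = np.sqrt(1.0 / self.alphas_cumprod)       # sqrt(1 / alpha_bar_t)
self.sqrt_recipm1_alphas_cumprod = np.sqrt(1.0 / self.alphas_cumprod - 1) # sqrt(1 / alpha_bar_t - 1)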
Step 4: Predicted Mean \(\mu_\theta(x_t, t)\)
The mean of the reverse process \(\mu_\theta(x_t, t)\) is obtained by plugging the predicted \(x_0\) into the posterior mean of \(q(x_{t-1} | x_t, x_0)\). This is the key step in approximating the denoising process.
\[\mu_\theta(x_t, t) = \frac{\sqrt{\bar{\alpha}_{t-1}} \, \beta_t}{1 - \bar{\alpha}_t} \, x_{\text{pred}_0} + \frac{\sqrt{\alpha_t} \, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \, x_t\]
Code Implementation:
def q_posterior_mean_variance(self, x_start, x_t, t):
    # Mean and variance of q(x_{t-1} | x_t, x_0), evaluated at x_start = predicted x_0
    posterior_mean = (
        _extract_into_tensor(self.posterior_mean_coef1, t, x_t.shape) * x_start
        + _extract_into_tensor(self.posterior_mean_coef2, t, x_t.shape) * x_t
    )
    posterior_variance = _extract_into_tensor(self.posterior_variance, t, x_t.shape)
    return posterior_mean, posterior_variance
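The two coefficients correspond directly to the fractions in the equation above; in the codebase they are precomputed along the lines of:
self.posterior_mean_coef1 = betas * np.sqrt(self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod)
self.posterior_mean_coef2 = (1.0 - self.alphas_cumprod_prev) * np.sqrt(alphas) / (1.0 - self.alphas_cumprod)
self.posterior_variance = betas * (1.0 - self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod)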
Step 5: Adding Stochastic Noise
Finally, stochastic noise is added to the predicted mean \(\mu_\theta(x_t, t)\) to sample \(x_{t-1}\). This accounts for the uncertainty in the reverse process, ensuring the model generates diverse outputs.
\[x_{t-1} = \mu_\theta(x_t, t) + \sqrt{\Sigma_\theta(x_t, t)} \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]
Code Implementation:
sample = out["mean"] + nonzero_mask * th.exp(0.5 * out["log_variance"]) * noise
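Here nonzero_mask zeroes out the noise at the final step \(t = 0\) so that the last sample is deterministic; roughly, the surrounding sampling step looks like:
noise = th.randn_like(x_t)
# 1 where t != 0, 0 at the final step, broadcast over the image dimensions
nonzero_mask = (t != 0).float().view(-1, *([1] * (len(x_t.shape) - 1)))
sample = out["mean"] + nonzero_mask * th.exp(0.5 * out["log_variance"]) * noise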
That wraps up our brief overview of diffusion models; for more details, please read (Chan, 2024).
References:
2024
- Chan, S. H. Tutorial on Diffusion Models for Imaging and Vision. arXiv preprint arXiv:2403.18103, 2024.
2023
- Ning, M., Sangineto, E., Porrello, A., Calderara, S., & Cucchiara, R. Input Perturbation Reduces Exposure Bias in Diffusion Models. In International Conference on Machine Learning, 2023.
2022
- Ho, J., & Salimans, T. Classifier-Free Diffusion Guidance. 2022.
2021
- Dhariwal, P., & Nichol, A. Diffusion Models Beat GANs on Image Synthesis. In Advances in Neural Information Processing Systems, 2021.
- Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. 2021.
2020
- Ho, J., Jain, A., & Abbeel, P. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems, 2020.