Derivations

This is some notes I took while learning diffusion models. Need to clean it up & turn it into a proper blog post.

\newcommand{\d}[1]{\mathrm{d}#1}

q(x0)q(x_0) is our data-distribution. It is a datapoint, x0x_0, before any forward process diffusion. pθp_\theta is what we're trying to learn

KL(q(x0)pθ)=Eq(x0)[logq(x0)logpθ]=Eq(x0)[logq(x0)]+Eq(x0)[logpθ]KL(q(x_0) | p_\theta) = \mathbb{E}_{q(x_0)}\left[\log q(x_0) - \log p_\theta\right] = \mathbb{E}_{q(x_0)}[\log q(x_0)] + \mathbb{E}_{q(x_0)}\left[-\log p_\theta\right]

We can't optimize the 1st part of the final expression, so all that's left is to optimize Eq(x0)[logpθ]\mathbb{E}_{q(x_0)}[-\log p_\theta]. Eq(x0)\mathbb{E}_{q(x_0)} is expectation under the data distribution. This is equivalent to sampling from the data distribution.

Expectation definition:

\displaylinesE[x]=xxp(x)\dxE[g(x)]=xg(x)p(x)\dx\displaylines{ \mathbb{E}[x] = \int_x x \cdot p(x) \d x\\ \mathbb{E}[g(x)] = \int_x g(x) \cdot p(x) \d x }

We currently have: pθ(x0)p_\theta(x_0). But the reverse diffusion process depends on latents x1x_1 through xTx_T. We need to pull these into our expression. To do that, we use marginalization.

What's marginalization?

  • Let AA be a random variable that represents the probability of it raining tomorrow.
  • Let BB be a random variable that represents some other event. Maybe: Tom catches Jerry today. (Note how the 2 events don't necessarily have to be related). The probability of it raining tomorrow can also be expressed as "probability of it raining tomorrow & probability that Tom catches Jerry today" + "probability of it raining tomorrow & probability that Tom does not catch Jerry today "

In this case, both AA and BB can take on 2 possible values.

P(A=a)=bP(A=a,B=b)P(A=a) = \sum_b P(A=a, B=b)

If this was continuous, we would have:

P(A=a)=P(A=a,B=b)\dbP(A = a) = \int P(A=a, B=b) \d b

Side note, in this case, AA and BB are independent, so we can do:

P(A=a)=bP(A=a,B=b)=bP(A=a)P(B=b)=P(A=a)bP(B=b)=P(A=a)1=P(A=a)\begin{align*} P(A = a) \\ &= \sum_b P(A=a,B=b) \\ &= \sum_b P(A=a) \cdot P(B=b) \\ &= P(A=a) \cdot \sum_b P(B=b) \\ &= P(A=a) \cdot 1 \\ &= P(A = a) \end{align*}

We now reach out 1st excerpt from the paper!

pθ(x0)pθ(x0:T)\dx1:Tp_\theta(\textbf{x}_0) \coloneqq \int p_\theta(\textbf{x}_{0:T})\d\textbf{x}_{1:T}

We needed to "pull in" all the latent variables that x0\textbf{x}_0 depends on. To do this, we marginalized x0\textbf{x}_0 with respect to all the latent variables in the reverse diffusion process.

Note:

\displaylinespθ(x0:T)=pθ(x0,x1,...,xT)\dx1:T=\dx1...\dxT\displaylines{ p_\theta(x_{0:T}) = p_\theta(x_0, x_1, ..., x_T)\\ \d{x}_{1:T} = \d{x}_1 ... \d{x}_T }

Aside: Forward Diffusion

For the forward diffusion process:

q(xtxt1)N(xt;1βtxt1,βtI)q(\textbf{x}_t|\textbf{x}_{t-1}) \coloneqq \mathcal{N}(\textbf{x}_t; \sqrt{1 - \beta_t}\textbf{x}_{t-1}, \beta_t\textbf{I})

We scale the mean by 1βt\sqrt{1-\beta_t} in order to prevent the values/variance from exploding. TODO: Be more precise here. Write some tests & verify what is happening.

I originally thought it was b/c we want to scale the mean down to 0. Indeed, if we put in data composed of all 1s into the forward diffusion process, the end result would have a mean of 0 and a stdev of 1.

But, that isn't the goal of scaling the mean. In diffusion models, we assume our input has a mean of 0. Because if our forward diffusion process shifts the mean to 0, then our neural net must learn to shift the mean back. That's probably bad since our neural net uses the same weights for each reverse diffusion step.

Aside: Training-Inference Compute Asymetry

Diffusion models allow you to train with much less compute, while still utilizing a ton of compute during sampling time.

Aside: Functional Form

Both processes have the same functional form when βt\beta_t are small

This is saying "we need a lot of timesteps in the forward/reverse diffusion processes" in order for both forward/reverse processes to be gaussian.

Example: Imagine your dataset consists of cats & dogs. If you do a single reverse step to go from noise to "full cat/dog" then your reverse step can't produce a gaussian distribution of outcomes. How do you fit a gaussian to cats & dogs? You would need a multimodal distribution instead.

ELBO

Eq(x0)[logpθ(x0)]=1.Eq(x0)[logpθ(x0:T)\dx1:T]=2.Eq(x0)[logq(x1:Tx0)q(x1:Tx0)pθ(x0:T)\dx1:T]=3.Eq(x0)[logEq(x1:Tx0)[pθ(x0:T)q(x1:Tx0)]]4.Eq(x0)[Eq(x1:Tx0)[logpθ(x0:T)q(x1:Tx0)]]\Large \begin{align*} \mathbb{E}_{q(x_0)}\left[-\log p_\theta(x_0)\right]\\ &=^{1.} \mathbb{E}_{q(x_0)}\left[-\log \int p_\theta(x_{0:T}) \d{x}_{1:T}\right]\\ &=^{2.} \mathbb{E}_{q(x_0)}\left[-\log \int \frac{q(x_{1:T}|x_0)}{q(x_{1:T}|x_0)}p_\theta(x_{0:T}) \d{x}_{1:T}\right]\\ &=^{3.} \mathbb{E}_{q(x_0)}\left[-\log \mathbb{E}_{q(x_{1:T|x_0})}\left[\frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right] \right]\\ &\leq^{4.} \mathbb{E}_{q(x_0)}\left[ \mathbb{E}_{q(x_{1:T}|x_0)}\left[- \log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right] \right]\\ \end{align*}
  1. Marginalization to pull-in the latent variables
  2. Bring in the forward diffusion process
  3. Integral => Expectation
  4. E[logX]logE[X]\mathbb{E}[\log X] \leq \log \mathbb{E}[X] using Jensen's Inequality

Jensen's Inequality

How do we prove E[logX]<logE[X]\mathbb{E}[\log X] < \log \mathbb{E}[X]?

Proof By Example

average(log(1), log(2), log(3)) 0.59\simeq 0.59 log(average(1, 2, 3)) 0.69\simeq 0.69

Picture Proof

Desmos ![[Pasted image 20220812143558.png]]


Eq(x0)[Eq(x1:Tx0)[logpθ(x0:T)q(x1:Tx0)]]=1.Eq(x0)[Eq(x1:Tx0)[logp(xT)t1logpθ(xt1xt)q(xtxt1)]]\Large \begin{align*} \mathbb{E}_{q(x_0)}\left[ \mathbb{E}_{q(x_{1:T}|x_0)}\left[- \log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right] \right]\\ =^{1.} \mathbb{E}_{q(x_0)}\left[\mathbb{E}_{q(x_{1:T}|x_0)}\left[-\log p(x_T) - \sum_{t\geq1} \log \frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1})}\right]\right]\\ \end{align*}
  1. Use the fact that pθp_\theta and qq are defined as products + log rules.

A note about p(xT)p(x_T):

Technically, it should be pθ(xT)p_\theta(x_T), but we assume that after enough forward diffusion steps, p(xT)p(x_T) is identical to the normal distribution (hence the equation earlier in the paper: p(xT)=N(xT;0,I)p(x_T) = \mathcal{N}(x_T;\textbf{0},\textbf{I})).

Concretely, to calculate logp(xT)\log p(x_T) we could:

  1. Take an image from our data set, x0x_0
  2. Run forward diffusion for T steps to get TT.
  3. Calculate the probability of xTx_T appearing from the normal distribution, N\mathcal{N}.
    • TODO: I'm guessing you would do this by calculating the probability of the normal distribution discretizing to the given image?

Rewriting ELBO

For simplicity, we'll write:

Eq(x0)[Eq(x1:Tx0)[logp(xT)t1logpθ(xt1xt)q(xtxt1)]]\mathbb{E}_{q(x_0)}\left[\mathbb{E}_{q(x_{1:T}|x_0)}\left[-\log p(x_T) - \sum_{t\geq1} \log \frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1})}\right]\right]\\

as

Eq[logp(xT)t1logpθ(xt1xt)q(xtxt1)]\mathbb{E}_{q}\left[-\log p(x_T) - \sum_{t\geq1} \log \frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1})}\right]\\

Now we'll simplify it:

Eq[logp(xT)t1logpθ(xt1xt)q(xtxt1)]=1.Eq[logp(xT)t2logpθ(xt1xt)q(xtxt1)logpθ(x0x1)q(x1x0)]=2.Eq[logp(xT)t2[logpθ(xt1xt)q(xt1xt)q(xt1)q(xt)]logpθ(x0x1)q(x1x0)]=3.Eq[logp(xT)t2log[pθ(xt1xt)q(xt1xt,x0)q(xt1x0)q(xtx0)]logpθ(x0x1)q(x1x0)]=4.Eq[logp(xT)t2logpθ(xt1xt)q(xt1xt,x0)t2logq(xt1x0)q(xtx0)logpθ(x0x1)q(x1x0)]=5.Eq[logp(xT)t2logpθ(xt1xt)q(xt1xt,x0)logq(x1x0)q(xTx0)logpθ(x0x1)q(x1x0)]=6.Eq[logp(xT)q(xTx0)t2logpθ(xt1xt)q(xt1xt,x0)logpθ(x0x1)]=7.Eq[DKL(q(xTx0)p(xT)+t2DKL(q(xt1xt,x0)pθ(xt1xt))logpθ(x0x1)]\large \begin{align*} &\mathbb{E}_{q}\left[-\log p(x_T) - \sum_{t\geq1} \log \frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1})}\right]\\ &=^{1.} \mathbb{E}_{q}\left[-\log p(x_T) - \sum_{t \geq 2} \log \frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1})} - \log\frac{p_\theta(x_0|x_1)}{q(x_1|x_0)} \right]\\ &=^{2.} \mathbb{E}_{q}\left[-\log p(x_T) - \sum_{t \geq 2} \left[ \log \frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_t)} \cdot \frac{q(x_{t-1})}{q(x_t)} \right] - \log\frac{p_\theta(x_0|x_1)}{q(x_1|x_0)} \right]\\ &=^{3.} \mathbb{E}_{q}\left[-\log p(x_T) - \sum_{t \geq 2} \log \left[ \frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_t,x_0)} \cdot \frac{q(x_{t-1}|x_0)}{q(x_t|x_0)} \right] - \log\frac{p_\theta(x_0|x_1)}{q(x_1|x_0)} \right]\\ &=^{4.} \mathbb{E}_{q}\left[-\log p(x_T) - \sum_{t \geq 2} \log \frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_t,x_0)} - \sum_{t\geq 2} \log \frac{q(x_{t-1}|x_0)}{q(x_t|x_0)} - \log\frac{p_\theta(x_0|x_1)}{q(x_1|x_0)} \right]\\ &=^{5.} \mathbb{E}_{q}\left[-\log p(x_T) - \sum_{t \geq 2} \log \frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_t,x_0)} - \log \frac{q(x_1|x_0)}{q(x_T|x_0)} - \log\frac{p_\theta(x_0|x_1)}{q(x_1|x_0)} \right]\\ &=^{6.} \mathbb{E}_q\left[-\log \frac{p(x_T)}{q(x_T|x_0)} - \sum_{t \geq 2} \log \frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_t,x_0)} - \log p_\theta(x_0|x_1) \right] \\ &=^{7.} \mathbb{E}_q\left[D_{KL}(q(x_T|x_0)\Vert p(x_T) + \sum_{t \geq 2} D_{KL}(q(x_{t-1}|x_t,x_0) \Vert p_\theta(x_{t-1}|x_t)) - \log p_\theta(x_0|x_1)\right] \end{align*}
  1. Extract the 1st term out of the summation. This is necessary because when we apply Bayes' Theorem to qq, we will get a non-sensical result if we also apply it to the 1st term in the summation.
  2. Apply Bayes' rule to q(xtxt1)q(x_t|x_{t-1})
  3. We need to condition the reverse conditional probability, qq, on x0x_0. Why? q(xt1xt)q(x_{t-1}|x_t) needs to give the probability distribution of xt1x_{t-1}s given xtx_t, but this might be extremely difficult if, e.g, xtx_t has a lot of noise (which it will near the end of the diffusion process). If we know the original image, x0x_0, this process becomes easy. This also makes the reverse conditional probability tractable. I.e., we can compute it.
  4. Log rules.
  5. Expand the 2nd summation, apply log rules to get the log of a cumulative product, cancel terms.
  6. Log rules
  7. Definition of KL-divergence

Reparameterization

We can do the forward process in 1-step. Need help simplifying an iterative diffusion process to a 1-step process

q(xtxt1)=N(xt;1βt,βtI)=1.1βtxt1+βtϵ=2.atxt1+1atϵ=3.αt(αt1xt2+1αt1ϵ)+1αtϵ=4.αtαt1xt2+αt1αt1ϵ+1αtϵ=5.αtαt1xt2+1αtαt1ϵ\begin{align*} q(x_t|x_{t-1}) = \mathcal{N}(x_t;\sqrt{1 - \beta_t}, \beta_tI)\\ &=^{1.} \sqrt{1-\beta_t}x_{t-1} + \sqrt{\beta_t}\epsilon\\ &=^{2.} \sqrt{a_t}x_{t-1} + \sqrt{1-a_t}\epsilon\\ &=^{3.} \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}x_{t-2} + \sqrt{1 - \alpha_{t-1}}\epsilon\right) + \sqrt{1 - \alpha_t}\epsilon\\ &=^{4.} \sqrt{\alpha_t\alpha_{t-1}}x_{t-2} + \sqrt{\alpha_t}\sqrt{1 - \alpha_{t-1}}\epsilon + \sqrt{1 - \alpha_t}\epsilon\\ &=^{5.} \sqrt{\alpha_t\alpha_{t-1}}x_{t-2} + \sqrt{1 - \alpha_t\alpha_{t-1}}\epsilon \end{align*}
  1. We go from βt\beta_t to βt\sqrt{\beta_t} because βtI\beta_tI is a covariance matrix. Specifically, it's the diagonal covariance matrix, so it only has covariance values of the form COV(X,X)=VAR(X)=σ2\text{COV}(X, X) = \text{VAR}(X) = \sigma^2
  2. Substitute βt\beta_t with ata_t
  3. Substitute using xt1=αt1xt2+1αt1ϵx_{t-1} = \sqrt{\alpha_{t-1}}x_{t-2} + \sqrt{1 - \alpha_{t-1}}\epsilon
  4. Algebra
  5. ϵ\epsilon is sampled from the normal distribution ϵ=N(0,I)\epsilon = \mathcal{N}(0, I) Thus, the last 2 terms are gaussians with mean 0 and standard deviation at1at1\sqrt{a_t}\sqrt{1-a_{t-1}} and 1at\sqrt{1-a_t}. VAR(X+Y)=VAR(X)+VAR(Y)\text{VAR}(X + Y) = \text{VAR}(X) + \text{VAR}(Y) if XX and YY are independent. Thus, the sum of those two terms is a gaussian with mean 0 and variance αt(1αt1)+(1αt)=1αtαt1\alpha_t(1 - \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t\alpha_{t-1}

Deriving The Posterior

A lot of the math here comes from the excellent blog post What are Diffusion Models?

The reverse conditional probability is:

q(xt1xt,x0)=N(xt1;μ~(xt,x0),β~tI)q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \color{blue}{\tilde{\boldsymbol{\mu}}}(\mathbf{x}_t, \mathbf{x}_0), \color{red}{\tilde{\beta}_t} \mathbf{I})

We need to calculate μ~\tilde\mu and β~\tilde\beta.

q(xt1xt,x0)=1.q(xtxt1,x0)q(xt1x0)q(xtx0)2.exp(12((xtαtxt1)2βt+(xt1αˉt1x0)21αˉt1(xtαˉtx0)21αˉt))=exp(12(xt22αtxtxt1+αtxt12βt+xt122αˉt1x0xt1+αˉt1x021αˉt1(xtαˉtx0)21αˉt))=exp(12((αtβt+11αˉt1)xt12(2αtβtxt+2αˉt11αˉt1x0)xt1+C(xt,x0)))\newcommand{\ncolor}{\color{white}} \begin{aligned} q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) &=^{1.} q(\mathbf{x}_t \vert \mathbf{x}_{t-1}, \mathbf{x}_0) \frac{ q(\mathbf{x}_{t-1} \vert \mathbf{x}_0) }{ q(\mathbf{x}_t \vert \mathbf{x}_0) } \\ &\propto^{2.} \exp \Big(-\frac{1}{2} \big(\frac{(\mathbf{x}_t - \sqrt{\alpha_t} \mathbf{x}_{t-1})^2}{\beta_t} + \frac{(\mathbf{x}_{t-1} - \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0)^2}{1-\bar{\alpha}_{t-1}} - \frac{(\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \mathbf{x}_0)^2}{1-\bar{\alpha}_t} \big) \Big) \\ &= \exp \Big(-\frac{1}{2} \big(\frac{\mathbf{x}_t^2 - 2\sqrt{\alpha_t} \mathbf{x}_t \color{blue}{\mathbf{x}_{t-1}} \ncolor{+ \alpha_t} \color{red}{\mathbf{x}_{t-1}^2} }{\beta_t} + \frac{ \color{red}{\mathbf{x}_{t-1}^2} \ncolor{- 2 \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0} \color{blue}{\mathbf{x}_{t-1}} \ncolor{+ \bar{\alpha}_{t-1} \mathbf{x}_0^2} }{1-\bar{\alpha}_{t-1}} - \frac{(\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \mathbf{x}_0)^2}{1-\bar{\alpha}_t} \big) \Big) \\ &= \exp\Big( -\frac{1}{2} \big( \color{red}{(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}})} \mathbf{x}_{t-1}^2 - \color{blue}{(\frac{2\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{2\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0)} \mathbf{x}_{t-1} \ncolor{ + C(\mathbf{x}_t, \mathbf{x}_0) \big) \Big)} \end{aligned}
  1. Bayes' Rule
  2. This is the "proportional to" symbol. We're using these facts:
    1. The PDF of a gaussian is 12σ2πe(xμ)22σ2\frac{1}{\sqrt{2\sigma^2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
    2. eaebec=ea+bce^a \cdot \frac{e^b}{e^c} = e^{a+b-c}
      • q(xtxt1,x0)=N(xt;1βtxt1,βtI)q(x_t|x_{t-1},x_0) = \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_tI) (definition of qq)
        • μ=1βtxt1=atxt1\mu = \sqrt{1-\beta_t}x_{t-1} = \sqrt{a_t}x_{t-1}
        • σ=βt\sigma = \sqrt{\beta_t}
      • q(xtx0)=N(xt;aˉtx0,(1aˉt)I)q(x_{t}|x_0) = \mathcal{N}(x_{t}; \sqrt{\bar{a}_t}x_0, (1-\bar{a}_t)I) (reparameterization trick)
        • μ=aˉtx0\mu = \sqrt{\bar{a}_t}x_0
        • σ=1aˉt\sigma = \sqrt{1-\bar{a}_t}
      • Same idea for q(xt1x0)q(x_{t-1}|x_0)

Let's examine part of the PDF of the gaussian more closely.

(xμ)22σ2=x22xμ+μ22σ2=12σ2x2+xμσ2μ22σ2=12(x2σ22xμσ2+μ2σ2)\begin{align*} -\frac{(x-\mu)^2}{2\sigma^2}\\ &= -\frac{x^2 - 2x\mu + \mu^2}{2\sigma^2}\\ &= -\frac{1}{2\sigma^2}x^2 + \frac{x\mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2}\\ &= -\frac{1}{2}\left(\frac{x^2}{\sigma^2} - \frac{2x\mu}{\sigma^2} + \frac{\mu^2}{\sigma^2} \right) \end{align*}

From this, we know:

x2σ2=redσ2=x2red\begin{align} \frac{x^2}{\sigma^2} &= \color{red}\text{red}\\ \sigma^2 &= \frac{x^2}{\color{red}\text{red}} \end{align}
2xμσ2=blueμ=σ2blue2x\begin{align} \frac{2x\mu}{\sigma^2} = \color{blue}\text{blue}\\ \mu = \frac{\sigma^2\color{blue}\text{blue}}{2x} \end{align}

One last note, xx is actually xt1x_{t-1} since that is the output of q(xt1xt,x0)=N(xt1;μ~(xt,x0),β~tI)q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \color{blue}{\tilde{\boldsymbol{\mu}}}(\mathbf{x}_t, \mathbf{x}_0), \color{red}{\tilde{\beta}_t} \mathbf{I}). Now we can derive the mean and variance of q(xt1xt,x0)q(x_{t-1}|x_t,x_0):

β~t=1/(αtβt+11αˉt1)=1/(αtαˉt+βtβt(1αˉt1))=1αˉt11αˉtβtμ~t(xt,x0)=(αtβtxt+αˉt11αˉt1x0)/(αtβt+11αˉt1)=(αtβtxt+αˉt11αˉt1x0)1αˉt11αˉtβt=αt(1αˉt1)1αˉtxt+αˉt1βt1αˉtx0\begin{aligned} \tilde{\beta}_t &= 1/(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}) = 1/(\frac{\alpha_t - \bar{\alpha}_t + \beta_t}{\beta_t(1 - \bar{\alpha}_{t-1})}) = \color{green}{\frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t} \\ \tilde{\boldsymbol{\mu}}_t (\mathbf{x}_t, \mathbf{x}_0) &= (\frac{\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1} }}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0)/(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}) \\ &= (\frac{\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1} }}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0) \color{green}{\frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t} \\ &= \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0\\ \end{aligned}

Simplifying The Posterior Mean

From reparameterization we know:

xt=aˉtx0+1aˉtϵx0=1aˉt(xt1aˉtϵ)\begin{align*} x_t &= \sqrt{\bar{a}_t}x_0 + \sqrt{1 - \bar{a}_t}\epsilon\\ x_0 &= \frac{1}{\sqrt{\bar{a}_t}}\left(x_t - \sqrt{1-\bar{a}_t}\epsilon\right) \end{align*}
Aside: Implementation

We can do a bit more algebra to derive the terms used in the OpenAI codebase:

x0=1aˉtxt1aˉtaˉtϵx0=1aˉtxt1aˉt1ϵ\begin{align*} x_0 &= \frac{1}{\sqrt{\bar{a}_t}}x_t - \frac{\sqrt{1- \bar{a}_t}}{\sqrt{\bar{a}_t}}\cdot \epsilon\\ x_0 &= \frac{1}{\sqrt{\bar{a}_t}}x_t - \sqrt{\frac{1}{\bar{a}_t} - 1}\cdot \epsilon \end{align*}

We can use our definition of x0x_0 and substitute:

μ~t=αt(1αˉt1)1αˉtxt+αˉt1βt1αˉt1αˉt(xt1αˉtϵ)=1αt(xtβt1αˉtϵ)\begin{aligned} \tilde{\boldsymbol{\mu}}_t &= \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\mathbf{\epsilon}) \\ &= \color{cyan}{\frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon \Big)} \end{aligned}

I could not figure out the algebra for this step. But, I did verify the results numerically. See these tests.

Closed Form KL-Divergence

We want to calculate the KL-divergence efficiently. Normally, we would need to Monte Carlo estimation. But, since Lt1L_{t-1} is comparing 2 gaussians, we can use the closed form of the KL Divergence:

DKL(p,q)=logσ2σ1+σ12+(μ1μ2)22σ2212D_{KL}(p, q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}

The authors ignore the 1st & last terms since they don't learn the variance for the reverse diffusion process.

  • Question: Why bother including σt\sigma_t in the denominator given that we're not learning? (My guess: resembles denoised score matching)

Extensions

DDIM

DDIM allows deterministic sampling without modifying the training process. Let's analyze the OpenAI code for DDIM sampling:

    def ddim_sample(
        self,
        model,
        x,
        t,
        clip_denoised=True,
        denoised_fn=None,
        model_kwargs=None,
        eta=0.0,
    ):
        """
        Sample x_{t-1} from the model using DDIM.
        Same usage as p_sample().
        """
        out = self.p_mean_variance(
            model,
            x,
            t,
            clip_denoised=clip_denoised,
            denoised_fn=denoised_fn,
            model_kwargs=model_kwargs,
        )
        # Usually our model outputs epsilon, but we re-derive it
        # in case we used x_start or x_prev prediction.
        eps = self._predict_eps_from_xstart(x, t, out["pred_xstart"])
        alpha_bar = _extract_into_tensor(self.alphas_cumprod, t, x.shape)
        alpha_bar_prev = _extract_into_tensor(self.alphas_cumprod_prev, t, x.shape)
		# ===MARKER 1===
        sigma = (
            eta
            * th.sqrt((1 - alpha_bar_prev) / (1 - alpha_bar))
            * th.sqrt(1 - alpha_bar / alpha_bar_prev)
        )
        # Equation 12.
        noise = th.randn_like(x)
		# ===MARKER 2===
        mean_pred = (
            out["pred_xstart"] * th.sqrt(alpha_bar_prev)
            + th.sqrt(1 - alpha_bar_prev - sigma ** 2) * eps
        )
        nonzero_mask = (
            (t != 0).float().view(-1, *([1] * (len(x.shape) - 1)))
        )  # no noise when t == 0
        sample = mean_pred + nonzero_mask * sigma * noise
        return {"sample": sample, "pred_xstart": out["pred_xstart"]}

Marker 1

σ=eta1aˉt11aˉt1aˉtaˉt1=1.eta1aˉt11aˉt1(1βt)=eta1aˉt11aˉtβt=etaβt~\begin{align*} \sigma = \text{eta} \cdot \sqrt{\frac{1 - \bar{a}_{t-1}}{1 - \bar{a}_t}} \cdot \sqrt{1 - \frac{\bar{a}_t}{\bar{a}_{t - 1}}}\\ &=^{1.} \text{eta} \cdot \sqrt{\frac{1 - \bar{a}_{t-1}}{1 - \bar{a}_t}} \cdot \sqrt{1 - (1 - \beta_t)}\\ &= \text{eta} \cdot \sqrt{\frac{1 - \bar{a}_{t-1}}{1 - \bar{a}_t}} \cdot \sqrt{\beta_t}\\ &= \text{eta} \cdot \sqrt{\tilde{\beta_t}} \end{align*}
  1. Utilize the facts that A. aˉt\bar{a}_t is a product B. at1βta_t \coloneqq 1 - \beta_t

TODO: Why isn't the OpenAI code just using square root of βt\beta_t directly? My guess, float64/float32 numeric stuff.

Marker 2

mean_pred = (
    out["pred_xstart"] * th.sqrt(alpha_bar_prev)
    + th.sqrt(1 - alpha_bar_prev - sigma ** 2) * eps
)
# Skip a few lines
sample = mean_pred + nonzero_mask * sigma * noise
xt1=aˉt1x0+1aˉt1σt2ϵθ+σtϵx_{t-1} = \sqrt{\bar{a}_{t-1}}x_0 + \sqrt{1 - \bar{a}_{t-1} - \sigma_t^2}\epsilon_\theta + \sigma_t\epsilon

Where ϵθ\epsilon_\theta is the random noise predicted by the network and ϵ\epsilon is random noise we generate. Note, when eta=0\text{eta} = 0 , then this simplifies to:

xt1=aˉt1x0+1aˉt1ϵθx_{t-1} = \sqrt{\bar{a}_{t-1}}x_0 + \sqrt{1 - \bar{a}_{t-1}}\epsilon_\theta

which is a deterministic sampling process! In essence, instead of sampling from the predicted posterior mean/variance, we just use the reparameterization trick and stop 1-step earlier.