Evidence Lower Bound

In variational Bayesian methods, the evidence lower bound (often abbreviated ELBO, also sometimes called the variational lower bound or negative variational free energy) is a useful lower bound on the log-likelihood of some observed data. The ELBO is useful because it provides a guarantee on the worst case for the log-likelihood of some distribution (e.g. p(X)) which models a set of data. The actual log-likelihood may be higher (indicating an even better fit to the distribution) because the ELBO includes a Kullback-Leibler divergence (KL divergence) term which decreases the ELBO when an internal component of the model is inaccurate despite a good fit of the model overall. Thus improving the ELBO score indicates either improving the likelihood of the model p(X) or the fit of a component internal to the model, or both, and the ELBO score makes a good loss function, e.g., for training a deep neural network to improve both the model overall and the internal component. (The internal component is q_\phi(\cdot \mid x), defined in detail later in this article.)


Definition

Let X and Z be random variables, jointly distributed with distribution p_\theta. For example, p_\theta(X) is the marginal distribution of X, and p_\theta(Z \mid X) is the conditional distribution of Z given X. Then, for a sample x \sim p_\text{data}, and any distribution q_\phi, the ELBO is defined as

L(\phi, \theta; x) := \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[ \ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \right].

The ELBO can equivalently be written as

L(\phi, \theta; x) = \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[ \ln p_\theta(x, z) \right] + H[q_\phi(\cdot \mid x)]
= \ln p_\theta(x) - D_{\mathrm{KL}}( q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x) ).

In the first line, H[q_\phi(\cdot \mid x)] is the entropy of q_\phi, which relates the ELBO to the Helmholtz free energy. In the second line, \ln p_\theta(x) is called the ''evidence'' for x, and D_{\mathrm{KL}}( q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x) ) is the Kullback-Leibler divergence between q_\phi and p_\theta. Since the Kullback-Leibler divergence is non-negative, L(\phi, \theta; x) forms a lower bound on the evidence (the ''ELBO inequality''):

\ln p_\theta(x) \ge \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[ \ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \right].
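The ELBO inequality can be illustrated numerically. The sketch below uses a toy linear-Gaussian model (z \sim \mathcal N(0,1), x \mid z \sim \mathcal N(z,1)), chosen here only because its evidence \ln p_\theta(x) and posterior \mathcal N(x/2, 1/2) have closed forms; the model and the particular choices of q_\phi are illustrative assumptions, not part of the definition above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: z ~ N(0,1), x|z ~ N(z,1).
# Then p(x) = N(x; 0, 2) and the posterior is p(z|x) = N(z; x/2, 1/2),
# so the evidence ln p(x) is available exactly for comparison.
def log_normal(v, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

def elbo(x, q_mean, q_var, n_samples=200_000):
    """Monte Carlo estimate of E_{z~q}[ln p(x,z) - ln q(z|x)]."""
    z = rng.normal(q_mean, np.sqrt(q_var), size=n_samples)
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
    log_q = log_normal(z, q_mean, q_var)
    return np.mean(log_joint - log_q)

x = 1.3
evidence = log_normal(x, 0.0, 2.0)          # exact ln p(x)
loose = elbo(x, q_mean=0.0, q_var=1.0)      # q far from the posterior
tight = elbo(x, q_mean=x / 2, q_var=0.5)    # q equal to the posterior

print(loose <= evidence)        # ELBO inequality holds
print(abs(tight - evidence))    # gap vanishes when q matches p(z|x)
```

When q_\phi equals the exact posterior, the integrand \ln\frac{p_\theta(x,z)}{q_\phi(z\mid x)} is constant in z (equal to \ln p_\theta(x)), so the bound is tight; with a mismatched q_\phi the gap equals the KL divergence term above.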


Motivation


Variational Bayesian inference

Suppose we have an observable random variable X, and we want to find its true distribution p^*. This would allow us to generate data by sampling, and estimate probabilities of future events. In general, it is impossible to find p^* exactly, forcing us to search for a good approximation. That is, we define a sufficiently large parametric family \{p_\theta\}_{\theta \in \Theta} of distributions, then solve for \min_\theta L(p_\theta, p^*) for some loss function L. One possible way to solve this is by considering small variations from p_\theta to p_{\theta + \delta\theta}, and solving for L(p_\theta, p^*) - L(p_{\theta+\delta\theta}, p^*) = 0. This is a problem in the calculus of variations, thus it is called the variational method.

Since there are not many explicitly parametrized distribution families (all the classical distribution families, such as the normal distribution, the Gumbel distribution, etc., are far too simplistic to model the true distribution), we consider ''implicitly parametrized'' probability distributions:

* First, define a simple distribution p(z) over a latent random variable Z. Usually a normal distribution or a uniform distribution suffices.
* Next, define a family of complicated functions f_\theta (such as a deep neural network) parametrized by \theta.
* Finally, define a way to convert any f_\theta(z) into a distribution (in general simple too, but unrelated to p(z)) over the observable random variable X. For example, let f_\theta(z) = (f_1(z), f_2(z)) have two outputs, then we can define the corresponding distribution over X to be the normal distribution \mathcal N(f_1(z), e^{f_2(z)}).

This defines a family of joint distributions p_\theta over (X, Z). It is very easy to sample (x, z) \sim p_\theta: simply sample z \sim p, then compute f_\theta(z), and finally sample x \sim p_\theta(\cdot \mid z) using f_\theta(z). In other words, we have a generative model for both the observable and the latent. Now, we consider a distribution p_\theta good if it is a close approximation of p^*:

p_\theta(X) \approx p^*(X)

Since the distribution on the right side is over X only, the distribution on the left side must marginalize the latent variable Z away.
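The ancestral sampling procedure just described can be sketched as follows. The particular f_\theta below is an arbitrary toy function standing in for a trained network, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical decoder f_theta: a tiny fixed nonlinearity standing in for a
# trained deep network, mapping a latent z to (mean, log-variance) of x.
def f_theta(z):
    return np.tanh(z), -1.0 + 0.1 * z**2   # (f1(z), f2(z))

# Ancestral sampling from the joint p_theta(x, z):
z = rng.normal(0.0, 1.0, size=5)                # z ~ p(z), a standard normal prior
mean, log_var = f_theta(z)                      # compute f_theta(z)
x = rng.normal(mean, np.sqrt(np.exp(log_var)))  # x ~ N(f1(z), e^{f2(z)})
print(list(zip(x, z)))                          # joint samples (x, z) ~ p_theta
```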
In general, it is impossible to perform the integral p_\theta(x) = \int p_\theta(x \mid z) p(z) \, dz, forcing us to perform another approximation. Since p_\theta(x) = \frac{p_\theta(x \mid z) p(z)}{p_\theta(z \mid x)} (Bayes' Rule), it suffices to find a good approximation of p_\theta(z \mid x). So define another distribution family q_\phi(z \mid x) and use it to approximate p_\theta(z \mid x). This is a discriminative model for the latent.

In Bayesian language, X is the observed evidence, and Z is the latent/unobserved. The distribution p over Z is the ''prior distribution'' over Z, p_\theta(x \mid z) is the likelihood function, and p_\theta(z \mid x) is the ''posterior distribution'' over Z.

Given an observation x, we can ''infer'' what z likely gave rise to x by computing p_\theta(z \mid x). The usual Bayesian method is to estimate the integral p_\theta(x) = \int p_\theta(x \mid z) p(z) \, dz, then compute p_\theta(z \mid x) = \frac{p_\theta(x \mid z) p(z)}{p_\theta(x)} by Bayes' rule. This is expensive to perform in general, but if we can simply find a good approximation q_\phi(z \mid x) \approx p_\theta(z \mid x) for most x, z, then we can infer z from x cheaply. Thus, the search for a good q_\phi is also called amortized inference.

All in all, we have found a problem of variational Bayesian inference.


Deriving the ELBO

A basic result in variational inference is that minimizing the Kullback–Leibler divergence (KL-divergence) is equivalent to maximizing the log-likelihood:

\mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)] = -H(p^*) - D_{\mathrm{KL}}(p^*(x) \,\|\, p_\theta(x))

where H(p^*) = -\mathbb{E}_{x \sim p^*}[\ln p^*(x)] is the entropy of the true distribution. So if we can maximize \mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)], we can minimize D_{\mathrm{KL}}(p^*(x) \,\|\, p_\theta(x)), and consequently find an accurate approximation p_\theta \approx p^*.

To maximize \mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)], we simply sample many x_i \sim p^*(x), i.e. estimate the expectation by Monte Carlo sampling:

N \max_\theta \mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)] \approx \max_\theta \sum_i \ln p_\theta(x_i)

where N is the number of samples drawn from the true distribution. (This approximation can be seen as overfitting.)

In order to maximize \sum_i \ln p_\theta(x_i), it is necessary to find \ln p_\theta(x):

\ln p_\theta(x) = \ln \int p_\theta(x \mid z) p(z) \, dz

This usually has no closed form and must be estimated. The usual way to estimate integrals is Monte Carlo integration with importance sampling:

\int p_\theta(x \mid z) p(z) \, dz = \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[ \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \right]

where q_\phi(z \mid x) is a sampling distribution over z that we use to perform the Monte Carlo integration. So we see that if we sample z \sim q_\phi(\cdot \mid x), then \frac{p_\theta(x, z)}{q_\phi(z \mid x)} is an unbiased estimator of p_\theta(x). Unfortunately, this does not give us an unbiased estimator of \ln p_\theta(x), because \ln is nonlinear. Indeed, by Jensen's inequality we have

\ln p_\theta(x) = \ln \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[ \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \right] \geq \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[ \ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \right]

In fact, all the obvious estimators of \ln p_\theta(x) are biased downwards, because no matter how many samples z_i \sim q_\phi(\cdot \mid x) we take, we have by Jensen's inequality:

\mathbb{E}_{z_i \sim q_\phi(\cdot \mid x)}\left[ \ln \left( \frac{1}{N} \sum_i \frac{p_\theta(x, z_i)}{q_\phi(z_i \mid x)} \right) \right] \leq \ln \mathbb{E}_{z_i \sim q_\phi(\cdot \mid x)}\left[ \frac{1}{N} \sum_i \frac{p_\theta(x, z_i)}{q_\phi(z_i \mid x)} \right] = \ln p_\theta(x)

Subtracting the right side, we see that the problem comes down to a biased estimator of zero:

\mathbb{E}_{z_i \sim q_\phi(\cdot \mid x)}\left[ \ln \left( \frac{1}{N} \sum_i \frac{p_\theta(z_i \mid x)}{q_\phi(z_i \mid x)} \right) \right] \leq 0

At this point, we could branch off towards the development of an importance-weighted autoencoder, but we will instead continue with the simplest case, N = 1:

\ln p_\theta(x) = \ln \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[ \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \right] \geq \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[ \ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \right]

The tightness of the inequality has a closed form:

\ln p_\theta(x) - \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[ \ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \right] = D_{\mathrm{KL}}(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x)) \geq 0

We have thus obtained the ELBO function:

L(\phi, \theta; x) := \ln p_\theta(x) - D_{\mathrm{KL}}(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x))
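The downward bias just derived can be checked numerically. The sketch below (an illustrative assumption: the toy Gaussian model z \sim \mathcal N(0,1), x \mid z \sim \mathcal N(z,1), with proposal q equal to the prior) estimates \mathbb{E}[\ln(\frac 1N \sum_i w_i)] for N = 1 and N = 10 and compares it with the exact \ln p_\theta(x); the bias shrinks as N grows, which is the observation behind the importance-weighted autoencoder.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model: z ~ N(0,1), x|z ~ N(z,1), so ln p(x) = ln N(x; 0, 2) exactly.
def log_normal(v, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

def log_estimate(x, n, trials=20_000):
    """Estimate E[ ln( (1/n) sum_i p(x,z_i)/q(z_i|x) ) ] with proposal q = prior."""
    z = rng.normal(0.0, 1.0, size=(trials, n))
    # log importance weights ln( p(z) p(x|z) / q(z) ); prior and q cancel here
    log_w = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) - log_normal(z, 0.0, 1.0)
    # stable log-mean-exp over the n importance samples, averaged over trials
    m = log_w.max(axis=1, keepdims=True)
    est = m.squeeze() + np.log(np.exp(log_w - m).mean(axis=1))
    return est.mean()

x = 1.3
evidence = log_normal(x, 0.0, 2.0)
e1, e10 = log_estimate(x, 1), log_estimate(x, 10)
print(e1 <= e10 <= evidence)   # bias shrinks with n, estimate stays below ln p(x)
```

For N = 1 the expected gap is exactly the KL divergence D_{\mathrm{KL}}(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x)) derived above.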


Maximizing the ELBO

For fixed x, the optimization \max_{\phi, \theta} L(\phi, \theta; x) simultaneously attempts to maximize \ln p_\theta(x) and minimize D_{\mathrm{KL}}(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x)). If the parametrizations for p_\theta and q_\phi are flexible enough, we would obtain some \hat\phi, \hat\theta, such that we have simultaneously

\ln p_{\hat\theta}(x) \approx \max_\theta \ln p_\theta(x); \quad q_{\hat\phi}(\cdot \mid x) \approx p_{\hat\theta}(\cdot \mid x)

Since

\mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)] = -H(p^*) - D_{\mathrm{KL}}(p^*(x) \,\|\, p_\theta(x))

we have

\ln p_{\hat\theta}(x) \approx \max_\theta -H(p^*) - D_{\mathrm{KL}}(p^*(x) \,\|\, p_\theta(x))

and so

\hat\theta \approx \arg\min_\theta D_{\mathrm{KL}}(p^*(x) \,\|\, p_\theta(x))

In other words, maximizing the ELBO would simultaneously allow us to obtain an accurate generative model p_{\hat\theta} \approx p^* and an accurate discriminative model q_{\hat\phi}(\cdot \mid x) \approx p_{\hat\theta}(\cdot \mid x).
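In a setting where everything is Gaussian, maximizing the ELBO can be carried out explicitly. The sketch below (illustrative, not a general recipe) runs gradient ascent on the closed-form ELBO of the toy model z \sim \mathcal N(0,1), x \mid z \sim \mathcal N(z,1) with q_\phi = \mathcal N(\mu, \sigma^2); the optimum recovers the exact posterior \mathcal N(x/2, 1/2), as the argument above predicts.

```python
# Toy model z ~ N(0,1), x|z ~ N(z,1); the true posterior is N(x/2, 1/2).
# For a Gaussian q_phi = N(mu, sigma^2) the ELBO is, up to constants,
#   -(mu^2 + sigma^2)/2 - ((x - mu)^2 + sigma^2)/2 + ln(sigma),
# so its gradients in mu and sigma are available in closed form.
x = 1.3
mu, sigma = 0.0, 1.0
lr = 0.05
for _ in range(500):
    grad_mu = x - 2 * mu                 # d ELBO / d mu
    grad_sigma = 1 / sigma - 2 * sigma   # d ELBO / d sigma
    mu += lr * grad_mu
    sigma += lr * grad_sigma

print(mu, sigma**2)   # converges to the posterior mean x/2 and variance 1/2
```

In practice p_\theta and q_\phi are neural networks rather than fixed Gaussians, and the expectation in the ELBO is estimated by sampling, but the optimization principle is the same.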


Main forms

The ELBO has many possible expressions, each with some different emphasis.

:\mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[ \ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \right] = \int q_\phi(z \mid x) \ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \, dz

This form shows that if we sample z \sim q_\phi(\cdot \mid x), then \ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)} is an unbiased estimator of the ELBO.

:\ln p_\theta(x) - D_{\mathrm{KL}}(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x))

This form shows that the ELBO is a lower bound on the evidence \ln p_\theta(x), and that maximizing the ELBO with respect to \phi is equivalent to minimizing the KL-divergence from p_\theta(\cdot \mid x) to q_\phi(\cdot \mid x).

:\mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[ \ln p_\theta(x \mid z) \right] - D_{\mathrm{KL}}(q_\phi(\cdot \mid x) \,\|\, p)

This form shows that maximizing the ELBO simultaneously attempts to keep q_\phi(\cdot \mid x) close to p and concentrate q_\phi(\cdot \mid x) on those z that maximize \ln p_\theta(x \mid z). That is, the approximate posterior q_\phi(\cdot \mid x) balances between staying close to the prior p and moving towards the maximum likelihood \arg\max_z \ln p_\theta(x \mid z).


Data-processing inequality

Suppose we take N independent samples from p^*, and collect them in the dataset D = \{x_1, \dots, x_N\}, then we have the empirical distribution q_D(x) = \frac 1N \sum_i \delta_{x_i}.

Fitting p_\theta(x) to q_D(x) can be done, as usual, by maximizing the loglikelihood \ln p_\theta(D):

D_{\mathrm{KL}}(q_D(x) \,\|\, p_\theta(x)) = -\frac 1N \sum_i \ln p_\theta(x_i) - H(q_D) = -\frac 1N \ln p_\theta(D) - H(q_D)

Now, by the ELBO inequality, we can bound \ln p_\theta(D), and thus

D_{\mathrm{KL}}(q_D(x) \,\|\, p_\theta(x)) \leq -\frac 1N L(\phi, \theta; D) - H(q_D)

The right-hand side simplifies to a KL-divergence, and so we get

D_{\mathrm{KL}}(q_D(x) \,\|\, p_\theta(x)) \leq -\frac 1N \sum_i L(\phi, \theta; x_i) - H(q_D) = D_{\mathrm{KL}}(q_{D, \phi}(x, z) \,\|\, p_\theta(x, z))

where q_{D, \phi}(x, z) := q_D(x) \, q_\phi(z \mid x). This result can be interpreted as a special case of the data processing inequality.

In this interpretation, maximizing L(\phi, \theta; D) = \sum_i L(\phi, \theta; x_i) is minimizing D_{\mathrm{KL}}(q_{D, \phi}(x, z) \,\|\, p_\theta(x, z)), which upper-bounds the real quantity of interest D_{\mathrm{KL}}(q_D(x) \,\|\, p_\theta(x)) via the data-processing inequality. That is, we append a latent space to the observable space, paying the price of a weaker inequality for the sake of more computationally efficient minimization of the KL-divergence.

