Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning. They are typically used in complex statistical models consisting of observed variables (usually termed "data") as well as unknown parameters and latent variables, with various sorts of relationships among the three types of random variables, as might be described by a graphical model. As typical in Bayesian inference, the parameters and latent variables are grouped together as "unobserved variables". Variational Bayesian methods are primarily used for two purposes:
#To provide an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables.
#To derive a lower bound for the marginal likelihood (sometimes called the ''evidence'') of the observed data (i.e. the marginal probability of the data given the model, with marginalization performed over unobserved variables). This is typically used for performing model selection, the general idea being that a higher marginal likelihood for a given model indicates a better fit of the data by that model and hence a greater probability that the model in question was the one that generated the data. (See also the Bayes factor article.)
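As a toy illustration of the second purpose (with priors and data invented purely for this sketch, not taken from the article), the following compares the marginal likelihoods of two conjugate coin-flip models. For these models the evidence happens to be available in closed form, so no lower bound is needed; the point is only how a higher evidence favours one model over another.
<syntaxhighlight lang="python">
import math

# Hypothetical comparison: two Bayesian models of the same coin-flip data,
# ranked by their marginal likelihoods (evidence).
# Model A: Beta(1, 1) prior on the heads probability (uniform).
# Model B: Beta(20, 20) prior (strongly favours a fair coin).
# For a Beta(a, b) prior and k heads in n flips, the marginal likelihood is
#   C(n, k) * B(k + a, n - k + b) / B(a, b),  with B the Beta function.

def log_beta(x, y):
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)

def log_marginal_likelihood(k, n, a, b):
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + log_beta(k + a, n - k + b) - log_beta(a, b)

k, n = 9, 10  # nine heads in ten flips
print("log evidence, model A:", log_marginal_likelihood(k, n, 1, 1))
print("log evidence, model B:", log_marginal_likelihood(k, n, 20, 20))
# Model A has the higher evidence here, so it is favoured for this lopsided data.
</syntaxhighlight>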
In the former purpose (that of approximating a posterior probability), variational Bayes is an alternative to Monte Carlo sampling methods—particularly, Markov chain Monte Carlo methods such as Gibbs sampling—for taking a fully Bayesian approach to statistical inference over complex distributions that are difficult to evaluate directly or sample. In particular, whereas Monte Carlo techniques provide a numerical approximation to the exact posterior using a set of samples, variational Bayes provides a locally-optimal, exact analytical solution to an approximation of the posterior.
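A deliberately oversimplified, hypothetical sketch of this contrast follows; the target posterior, its parameters, and the moment-matching step are all inventions of the illustration, not the article's method.
<syntaxhighlight lang="python">
import random, statistics

# Pretend the exact posterior over a single real-valued unknown is Gamma(shape=3, scale=2).

# Monte Carlo flavour: characterise the posterior by a bag of random draws
# (a real MCMC method would construct such draws without direct access to the posterior).
random.seed(0)
draws = [random.gammavariate(3.0, 2.0) for _ in range(10_000)]
print("sample-based posterior mean:", round(statistics.fmean(draws), 3))

# Variational flavour: commit to a simple parametric family (here a Gaussian)
# and report closed-form parameters for its best member.  This sketch just
# moment-matches; variational Bayes proper picks the member minimising a
# divergence, as described in the sections below.
q_mean = 3.0 * 2.0                 # Gamma mean = shape * scale
q_sd = (3.0 * 2.0 ** 2) ** 0.5     # Gamma variance = shape * scale^2
print("variational-style summary: Normal(mean=%.1f, sd=%.2f)" % (q_mean, q_sd))
</syntaxhighlight>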
Variational Bayes can be seen as an extension of the expectation-maximization (EM) algorithm from maximum a posteriori estimation (MAP estimation) of the single most probable value of each parameter to fully Bayesian estimation which computes (an approximation to) the entire posterior distribution of the parameters and latent variables. As in EM, it finds a set of optimal parameter values, and it has the same alternating structure as does EM, based on a set of interlocked (mutually dependent) equations that cannot be solved analytically.
For many applications, variational Bayes produces solutions of comparable accuracy to Gibbs sampling at greater speed. However, deriving the set of equations used to update the parameters iteratively often requires a large amount of work compared with deriving the comparable Gibbs sampling equations. This is the case even for many models that are conceptually quite simple, as is demonstrated below in the case of a basic non-hierarchical model with only two parameters and no latent variables.
Mathematical derivation
Problem
In variational inference, the posterior distribution over a set of unobserved variables <math>\mathbf{Z} = \{Z_1, \dots, Z_n\}</math> given some data <math>\mathbf{X}</math> is approximated by a so-called variational distribution, <math>Q(\mathbf{Z})</math>:
:<math>Q(\mathbf{Z}) \approx P(\mathbf{Z}\mid \mathbf{X}).</math>
The distribution <math>Q(\mathbf{Z})</math> is restricted to belong to a family of distributions of simpler form than <math>P(\mathbf{Z}\mid \mathbf{X})</math> (e.g. a family of Gaussian distributions), selected with the intention of making <math>Q(\mathbf{Z})</math> similar to the true posterior, <math>P(\mathbf{Z}\mid \mathbf{X})</math>.
The similarity (or dissimilarity) is measured in terms of a dissimilarity function <math>d(Q; P)</math>, and hence inference is performed by selecting the distribution <math>Q(\mathbf{Z})</math> that minimizes <math>d(Q; P)</math>.
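For instance (an illustrative restatement of the Gaussian example just mentioned, with the parameterisation an assumption of this sketch rather than part of the article), the family might consist of all Gaussians over a single real-valued unobserved variable,
:<math>Q(Z) = \mathcal{N}(Z \mid \mu, \sigma^2),</math>
so that inference reduces to choosing the variational parameters <math>\mu</math> and <math>\sigma^2</math> that minimize <math>d(Q; P)</math>.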
KL divergence
The most common type of variational Bayes uses the
Kullback–Leibler divergence (KL-divergence) of ''Q'' from ''P'' as the choice of dissimilarity function. This choice makes this minimization tractable. The KL-divergence is defined as
:<math>D_{\mathrm{KL}}(Q \parallel P) \triangleq \sum_{\mathbf{Z}} Q(\mathbf{Z}) \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z}\mid \mathbf{X})}.</math>
Note that ''Q'' and ''P'' are reversed from what one might expect. This use of reversed KL-divergence is conceptually similar to the
expectation-maximization algorithm. (Using the KL-divergence in the other way produces the
expectation propagation algorithm.)
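A minimal numeric sketch of this definition (with two invented three-state distributions standing in for ''Q'' and the true posterior), showing that the divergence is asymmetric, so the direction in which it is taken matters:
<syntaxhighlight lang="python">
import math

def kl(q, p):
    """KL-divergence sum_z q(z) * log(q(z) / p(z)) for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

P = [0.45, 0.45, 0.10]   # stand-in for the true posterior P(Z | X)
Q = [0.80, 0.15, 0.05]   # stand-in for a candidate approximation Q(Z)

print("D(Q || P) =", round(kl(Q, P), 4))   # the 'reversed' direction used by variational Bayes
print("D(P || Q) =", round(kl(P, Q), 4))   # the other direction (as in expectation propagation)
</syntaxhighlight>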
Intractability
Variational techniques are typically used to form an approximation for:
:<math>P(\mathbf{Z}\mid \mathbf{X}) = \frac{P(\mathbf{X}\mid \mathbf{Z})\,P(\mathbf{Z})}{P(\mathbf{X})} = \frac{P(\mathbf{X}\mid \mathbf{Z})\,P(\mathbf{Z})}{\int_{\mathbf{Z}} P(\mathbf{X},\mathbf{Z})\,d\mathbf{Z}}</math>
The marginalization over <math>\mathbf{Z}</math> to calculate <math>P(\mathbf{X})</math> in the denominator is typically intractable, because, for example, the search space of <math>\mathbf{Z}</math> is combinatorially large. Therefore, we seek an approximation, using <math>Q(\mathbf{Z}) \approx P(\mathbf{Z}\mid \mathbf{X})</math>.
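A toy example of why this denominator becomes intractable (the model below, with binary latent variables and a Gaussian likelihood, is invented purely for illustration): the sum defining <math>P(\mathbf{X})</math> has <math>2^n</math> terms, which is only feasible to enumerate for small <math>n</math>.
<syntaxhighlight lang="python">
import itertools, math

# Invented toy model: n binary latent variables z_1..z_n, each with prior
# P(z_i = 1) = 0.3, and one observed scalar x whose Gaussian likelihood depends
# only on how many latent variables are "on".  Computing the evidence P(X)
# exactly means summing the joint P(X, Z) over all 2**n configurations of Z.
n = 15
prior_on = 0.3
x_observed = 7.0

def joint(z):
    k = sum(z)
    prior = prior_on ** k * (1 - prior_on) ** (n - k)
    likelihood = math.exp(-0.5 * (x_observed - k) ** 2) / math.sqrt(2 * math.pi)
    return prior * likelihood

evidence = sum(joint(z) for z in itertools.product((0, 1), repeat=n))
print("terms in the sum:", 2 ** n)          # already 32768 for n = 15; 2**n in general
print("P(X) by brute force:", evidence)
</syntaxhighlight>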
Evidence lower bound
Given that <math>P(\mathbf{Z}\mid \mathbf{X}) = \frac{P(\mathbf{Z},\mathbf{X})}{P(\mathbf{X})}</math>, the KL-divergence above can also be written as
:<math>D_{\mathrm{KL}}(Q\parallel P) = \sum_{\mathbf{Z}} Q(\mathbf{Z})\left[\log\frac{Q(\mathbf{Z})}{P(\mathbf{Z},\mathbf{X})} + \log P(\mathbf{X})\right]</math>
Because <math>P(\mathbf{X})</math> is a constant with respect to <math>\mathbf{Z}</math> and because <math>Q(\mathbf{Z})</math> is a distribution, we have
:<math>D_{\mathrm{KL}}(Q\parallel P) = \sum_{\mathbf{Z}} Q(\mathbf{Z})\log\frac{Q(\mathbf{Z})}{P(\mathbf{Z},\mathbf{X})} + \log P(\mathbf{X})</math>
which, according to the definition of expected value (for a discrete random variable), can be written as follows
:<math>D_{\mathrm{KL}}(Q\parallel P) = \operatorname{E}_{Q}\!\left[\log\frac{Q(\mathbf{Z})}{P(\mathbf{Z},\mathbf{X})}\right] + \log P(\mathbf{X})</math>
which can be rearranged to become
:<math>\log P(\mathbf{X}) = D_{\mathrm{KL}}(Q\parallel P) - \operatorname{E}_{Q}\!\left[\log\frac{Q(\mathbf{Z})}{P(\mathbf{Z},\mathbf{X})}\right] = D_{\mathrm{KL}}(Q\parallel P) + \mathcal{L}(Q)</math>
As the ''log-evidence'' <math>\log P(\mathbf{X})</math> is fixed with respect to <math>Q</math>, maximizing the final term <math>\mathcal{L}(Q)</math> minimizes the KL divergence of <math>Q</math> from <math>P</math>. By appropriate choice of <math>Q</math>, <math>\mathcal{L}(Q)</math> becomes tractable to compute and to maximize. Hence we have both an analytical approximation <math>Q</math> for the posterior <math>P(\mathbf{Z}\mid \mathbf{X})</math>, and a lower bound <math>\mathcal{L}(Q)</math> for the log-evidence <math>\log P(\mathbf{X})</math> (since the KL-divergence is non-negative).
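The rearranged identity above is straightforward to verify numerically; the following is a minimal sketch with an invented three-state joint distribution and an arbitrary choice of <math>Q</math>:
<syntaxhighlight lang="python">
import math

# Invented joint distribution P(Z, X) for the observed X, over three latent states.
# The entries need not sum to 1, since other (unobserved) values of X carry the rest.
joint = [0.10, 0.25, 0.15]                     # P(Z = i, X) for i = 0, 1, 2
evidence = sum(joint)                          # P(X)
posterior = [j / evidence for j in joint]      # P(Z | X)

Q = [0.5, 0.3, 0.2]                            # an arbitrary variational distribution

kl = sum(q * math.log(q / p) for q, p in zip(Q, posterior))    # D_KL(Q || P)
elbo = sum(q * math.log(j / q) for q, j in zip(Q, joint))      # L(Q)

print("log P(X) :", math.log(evidence))
print("KL + L(Q):", kl + elbo)                 # equal up to floating-point error
</syntaxhighlight>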
The lower bound <math>\mathcal{L}(Q)</math> is known as the (negative) variational free energy in analogy with thermodynamic free energy because it can also be expressed as a negative energy <math>\operatorname{E}_{Q}[\log P(\mathbf{Z},\mathbf{X})]</math> plus the entropy of <math>Q</math>.
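Continuing the invented numbers from the previous sketch, the same bound indeed decomposes into the expected log-joint ("negative energy") term plus the entropy of <math>Q</math>:
<syntaxhighlight lang="python">
import math

# Same invented numbers as in the previous sketch.
joint = [0.10, 0.25, 0.15]   # P(Z = i, X)
Q = [0.5, 0.3, 0.2]

expected_log_joint = sum(q * math.log(j) for q, j in zip(Q, joint))   # E_Q[log P(Z, X)]
entropy_Q = -sum(q * math.log(q) for q in Q)                          # H(Q)
elbo = sum(q * math.log(j / q) for q, j in zip(Q, joint))             # L(Q)

print(elbo)
print(expected_log_joint + entropy_Q)   # identical: L(Q) = E_Q[log P(Z, X)] + H(Q)
</syntaxhighlight>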