Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning. They are typically used in complex statistical models consisting of observed variables (usually termed "data") as well as unknown parameters and latent variables, with various sorts of relationships among the three types of random variables, as might be described by a graphical model. As typical in Bayesian inference, the parameters and latent variables are grouped together as "unobserved variables". Variational Bayesian methods are primarily used for two purposes:
#To provide an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables.
#To derive a lower bound for the marginal likelihood (sometimes called the ''evidence'') of the observed data (i.e. the marginal probability of the data given the model, with marginalization performed over unobserved variables). This is typically used for performing model selection, the general idea being that a higher marginal likelihood for a given model indicates a better fit of the data by that model and hence a greater probability that the model in question was the one that generated the data. (See also the Bayes factor article.)
In the former purpose (that of approximating a posterior probability), variational Bayes is an alternative to Monte Carlo sampling methods—particularly, Markov chain Monte Carlo methods such as Gibbs sampling—for taking a fully Bayesian approach to statistical inference over complex distributions that are difficult to evaluate directly or sample from. In particular, whereas Monte Carlo techniques provide a numerical approximation to the exact posterior using a set of samples, variational Bayes provides a locally optimal, exact analytical solution to an approximation of the posterior.
Variational Bayes can be seen as an extension of the expectation-maximization (EM) algorithm from maximum a posteriori estimation (MAP estimation) of the single most probable value of each parameter to fully Bayesian estimation which computes (an approximation to) the entire posterior distribution of the parameters and latent variables. As in EM, it finds a set of optimal parameter values, and it has the same alternating structure as does EM, based on a set of interlocked (mutually dependent) equations that cannot be solved analytically.
For many applications, variational Bayes produces solutions of comparable accuracy to Gibbs sampling at greater speed. However, deriving the set of equations used to update the parameters iteratively often requires a large amount of work compared with deriving the comparable Gibbs sampling equations. This is the case even for many models that are conceptually quite simple, as is demonstrated below in the case of a basic non-hierarchical model with only two parameters and no latent variables.
Mathematical derivation
Problem
In variational inference, the posterior distribution over a set of unobserved variables Z = {Z_1, ..., Z_n} given some data X is approximated by a so-called variational distribution, Q(Z):
: Q(\mathbf{Z}) \approx P(\mathbf{Z} \mid \mathbf{X})
The distribution Q(Z) is restricted to belong to a family of distributions of simpler form than P(Z | X) (e.g. a family of Gaussian distributions), selected with the intention of making Q(Z) similar to the true posterior, P(Z | X). The similarity (or dissimilarity) is measured in terms of a dissimilarity function d(Q; P), and hence inference is performed by selecting the distribution Q(Z) that minimizes d(Q; P).
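The idea of restricting Q to a simple family can be illustrated with a small numerical sketch. Everything here (the bimodal target posterior, the Gaussian variational family, the grid search over its parameters) is an assumed toy setup for illustration, not part of the derivation itself: the dissimilarity minimized is the reversed KL divergence discussed below, evaluated on a discretized grid.

```python
# Sketch (assumed toy setup): pick the member of a 1-D Gaussian family
# that is closest to a known bimodal "true posterior" by minimizing the
# reversed KL divergence D_KL(Q || P) on a discretized grid.
import numpy as np

z = np.linspace(-6, 6, 2001)
dz = z[1] - z[0]

# A bimodal stand-in for the true posterior P(z | x), normalized on the grid.
p = np.exp(-0.5 * (z - 1.5) ** 2) + 0.6 * np.exp(-0.5 * (z + 1.5) ** 2)
p /= p.sum() * dz

def gaussian(z, mu, sigma):
    """A member Q(z; mu, sigma) of the variational family, grid-normalized."""
    q = np.exp(-0.5 * ((z - mu) / sigma) ** 2)
    return q / (q.sum() * dz)

def reverse_kl(q, p):
    """D_KL(Q || P) on the grid; terms with Q(z) = 0 contribute nothing."""
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask])) * dz

# Crude grid search over the variational parameters (mu, sigma).
best = min((reverse_kl(gaussian(z, m, s), p), m, s)
           for m in np.linspace(-3, 3, 61)
           for s in np.linspace(0.3, 3, 28))
print("best (KL, mu, sigma):", best)
```

Because the reversed KL penalizes putting Q mass where P is small, the fitted Gaussian tends to lock onto a single mode of the bimodal target rather than spreading across both.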
KL divergence
The most common type of variational Bayes uses the Kullback–Leibler divergence (KL-divergence) of ''Q'' from ''P'' as the choice of dissimilarity function. This choice makes the minimization tractable. The KL-divergence is defined as
: D_{\mathrm{KL}}(Q \parallel P) = \sum_{\mathbf{Z}} Q(\mathbf{Z}) \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z} \mid \mathbf{X})}
Note that ''Q'' and ''P'' are reversed from what one might expect. This use of reversed KL-divergence is conceptually similar to the expectation-maximization algorithm. (Using the KL-divergence in the other way produces the expectation propagation algorithm.)
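The asymmetry of the KL-divergence, and hence the difference between the reversed form used here and the forward form, can be seen directly on a pair of discrete distributions. The arrays `q` and `p` below are hypothetical three-state examples, not taken from the article:

```python
# Sketch (hypothetical distributions): the reversed KL divergence
# D_KL(Q || P) used in variational Bayes, computed for discrete
# distributions, and compared with the forward direction D_KL(P || Q).
import numpy as np

def kl_divergence(q, p):
    """D_KL(Q || P) = sum_z Q(z) log(Q(z) / P(z)); assumes matching supports."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(q * np.log(q / p)))

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])

print(kl_divergence(q, p))  # reversed direction, as in variational Bayes
print(kl_divergence(p, q))  # forward direction: a different number
```

Both values are non-negative and differ in general, which is why the choice of direction matters: minimizing the reversed form is what makes the variational optimization tractable.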
Intractability
Variational techniques are typically used to form an approximation for:
: P(\mathbf{Z} \mid \mathbf{X}) = \frac{P(\mathbf{X} \mid \mathbf{Z})\, P(\mathbf{Z})}{P(\mathbf{X})} = \frac{P(\mathbf{X} \mid \mathbf{Z})\, P(\mathbf{Z})}{\int_{\mathbf{Z}} P(\mathbf{X}, \mathbf{Z}) \, d\mathbf{Z}}
The marginalization over Z to calculate P(X) in the denominator is typically intractable, because, for example, the search space of Z is combinatorially large. Therefore, we seek an approximation, using Q(Z) ≈ P(Z | X).
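The combinatorial blow-up of the denominator can be made concrete with a toy model. The joint distribution, weights, and observed value below are all assumptions for illustration: with n binary latent variables, the exact evidence P(X) requires summing the joint over all 2**n configurations of Z, which is feasible only for small n.

```python
# Sketch (hypothetical toy model): exact marginalization over n binary
# latent variables needs 2**n terms, illustrating why P(X) in the
# denominator of Bayes' rule is typically intractable.
import itertools
import numpy as np

rng = np.random.default_rng(0)

def joint(x, z, w):
    """Toy unnormalized joint P(X = x, Z = z); purely illustrative."""
    return np.exp(x * np.dot(w, z) - np.sum(z))

n = 10                      # number of binary latent variables
w = rng.normal(size=n)      # arbitrary model weights (assumption)
x = 1.0                     # a single observed value

# Exact marginalization: 2**n terms -- manageable here, hopeless at n = 100.
evidence = sum(joint(x, np.array(z), w)
               for z in itertools.product([0, 1], repeat=n))
print(f"{2 ** n} terms summed; P(X) is proportional to {evidence:.4f}")
```

Doubling n squares the number of terms, so already at a few dozen binary latent variables the exact sum is out of reach and an approximation Q(Z) is needed instead.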
Evidence lower bound
Given that P(Z | X) = P(Z, X) / P(X), the KL-divergence above can also be written as
: D_{\mathrm{KL}}(Q \parallel P) = \sum_{\mathbf{Z}} Q(\mathbf{Z}) \left[ \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z}, \mathbf{X})} + \log P(\mathbf{X}) \right]
Because log P(X) is a constant with respect to Z, and \sum_{\mathbf{Z}} Q(\mathbf{Z}) = 1 because Q(Z) is a distribution, we have
: D_{\mathrm{KL}}(Q \parallel P) = \sum_{\mathbf{Z}} Q(\mathbf{Z}) \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z}, \mathbf{X})} + \log P(\mathbf{X})
which, according to the definition of expected value (for a discrete random variable), can be written as follows
: D_{\mathrm{KL}}(Q \parallel P) = \mathbb{E}_{Q} \left[ \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z}, \mathbf{X})} \right] + \log P(\mathbf{X})
which can be rearranged to become
: \log P(\mathbf{X}) = D_{\mathrm{KL}}(Q \parallel P) + \underbrace{\mathbb{E}_{Q} \left[ \log \frac{P(\mathbf{Z}, \mathbf{X})}{Q(\mathbf{Z})} \right]}_{\mathcal{L}(Q)}
As the ''log-evidence'' log P(X) is fixed with respect to Q, maximizing the final term L(Q) minimizes the KL divergence of Q from P. By appropriate choice of Q, L(Q) becomes tractable to compute and to maximize. Hence we have both an analytical approximation Q for the posterior P(Z | X), and a lower bound L(Q) for the log-evidence log P(X) (since the KL-divergence is non-negative).
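The identity log P(X) = KL-divergence + lower bound can be verified numerically on a small discrete model, which also confirms that the bound never exceeds the log-evidence. The five-state joint distribution below is an arbitrary assumed example:

```python
# Sketch (toy discrete model, assumed for illustration): check that
# log P(X) = D_KL(Q || P(.|X)) + L(Q) holds exactly for any Q, where
# L(Q) = E_Q[log P(Z, X)] - E_Q[log Q(Z)] is the evidence lower bound.
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary joint P(Z, X = x) over 5 latent states.
joint = rng.random(5)                 # P(Z = z, X = x) for each z
evidence = joint.sum()                # P(X = x) = sum_z P(Z = z, X = x)
posterior = joint / evidence          # P(Z | X = x) via Bayes' rule

q = rng.random(5)
q /= q.sum()                          # an arbitrary variational distribution

elbo = np.sum(q * np.log(joint / q))  # E_Q[log P(Z, X)] - E_Q[log Q(Z)]
kl = np.sum(q * np.log(q / posterior))

print(np.log(evidence), kl + elbo)    # the identity: both sides coincide
```

Since the KL term is non-negative, dropping it shows elbo <= log(evidence), i.e. L(Q) really is a lower bound on the log-evidence, with equality exactly when Q equals the true posterior.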
The lower bound L(Q) is known as the (negative) variational free energy in analogy with thermodynamic free energy, because it can also be expressed as a negative energy plus the entropy of Q.
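Concretely, writing the energy as E(Z) = -log P(Z, X) and the entropy of Q as H(Q) = -E_Q[log Q(Z)] (standard definitions, introduced here only to make the analogy explicit), the bound decomposes as:

```latex
\mathcal{L}(Q)
  = \mathbb{E}_{Q}\left[ \log P(\mathbf{Z}, \mathbf{X}) \right]
    - \mathbb{E}_{Q}\left[ \log Q(\mathbf{Z}) \right]
  = -\,\mathbb{E}_{Q}\left[ E(\mathbf{Z}) \right] + H(Q)
```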