
In machine learning, a variational autoencoder (VAE) is an artificial neural network
architecture introduced by Diederik P. Kingma and
Max Welling. It is part of the families of
probabilistic graphical models and
variational Bayesian methods.
In addition to being seen as an
autoencoder neural network architecture, variational autoencoders can also be studied within the mathematical formulation of
variational Bayesian methods, connecting a neural encoder network to its decoder through a probabilistic
latent space (for example, as a
multivariate Gaussian distribution) that corresponds to the parameters of a variational distribution.
Thus, the encoder maps each point (such as an image) from a large complex dataset into a distribution within the latent space, rather than to a single point in that space. The decoder has the opposite function, which is to map from the latent space to the input space, again according to a distribution (although in practice, noise is rarely added during the decoding stage). By mapping a point to a distribution instead of a single point, the network can avoid overfitting the training data. Both networks are typically trained together using the
reparameterization trick, although the variance of the noise model can be learned separately.
Although this type of model was initially designed for
unsupervised learning
Unsupervised learning is a framework in machine learning where, in contrast to supervised learning, algorithms learn patterns exclusively from unlabeled data. Other frameworks in the spectrum of supervisions include weak- or semi-supervision, wh ...
, its effectiveness has been proven for
semi-supervised learning and
supervised learning
In machine learning, supervised learning (SL) is a paradigm where a Statistical model, model is trained using input objects (e.g. a vector of predictor variables) and desired output values (also known as a ''supervisory signal''), which are often ...
.
Overview of architecture and operation
A variational autoencoder is a generative model with a prior p_\theta(z) and noise distribution p_\theta(x|z), respectively. Usually such models are trained using the
expectation-maximization meta-algorithm (e.g.
probabilistic PCA, (spike & slab) sparse coding). Such a scheme optimizes a lower bound of the data likelihood, which is usually computationally intractable, and in doing so requires the discovery of q-distributions, or variational
posteriors. These q-distributions are normally parameterized for each individual data point in a separate optimization process. However, variational autoencoders use a neural network as an amortized approach to jointly optimize across data points. In that way, the same parameters are reused for multiple data points, which can result in massive memory savings. The first neural network takes as input the data points themselves, and outputs parameters for the variational distribution. As it maps from a known input space to the low-dimensional latent space, it is called the encoder.
The decoder is the second neural network of this model. It is a function that maps from the latent space to the input space, e.g. as the means of the noise distribution. It is possible to use another neural network that maps to the variance; however, this can be omitted for simplicity. In such a case, the variance can be optimized with gradient descent.
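In practice, the encoder and decoder described above are usually two small neural networks trained jointly. The following is a minimal sketch of that idea, not taken from the original papers: the PyTorch framework, the single hidden layer, the layer sizes, and the names Encoder and Decoder are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a data point x to the parameters (mean, log-variance) of the variational distribution."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.hidden = nn.Linear(x_dim, h_dim)
        self.mean = nn.Linear(h_dim, z_dim)      # mean of q(z|x)
        self.log_var = nn.Linear(h_dim, z_dim)   # log of the (diagonal) variance of q(z|x)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.mean(h), self.log_var(h)

class Decoder(nn.Module):
    """Maps a latent code z back to the input space, here to the mean of the noise distribution."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.hidden = nn.Linear(z_dim, h_dim)
        self.out = nn.Linear(h_dim, x_dim)

    def forward(self, z):
        h = torch.relu(self.hidden(z))
        return torch.sigmoid(self.out(h))        # e.g. pixel intensities in [0, 1]
```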
To optimize this model, one needs to know two terms: the "reconstruction error", and the Kullback–Leibler divergence (KL-D). Both terms are derived from the free energy expression of the probabilistic model, and therefore differ depending on the noise distribution and the assumed prior of the data, here referred to as the p-distribution. For example, a standard VAE task such as ImageNet is typically assumed to have Gaussian-distributed noise; however, tasks such as binarized MNIST require a Bernoulli noise. The KL-D from the free energy expression maximizes the probability mass of the q-distribution that overlaps with the p-distribution, which unfortunately can result in mode-seeking behaviour. The "reconstruction" term is the remainder of the free energy expression, and requires a sampling approximation to compute its expectation value.
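The dependence of the reconstruction term on the assumed noise distribution can be made concrete in code. The following hedged sketch continues the hypothetical PyTorch modules above; the function name reconstruction_loss and the fixed unit variance in the Gaussian case are assumptions for illustration only.

```python
import torch.nn.functional as F

def reconstruction_loss(x_hat, x, noise_model="gaussian"):
    """Negative log-likelihood of x under the decoder's noise distribution, up to constants."""
    if noise_model == "gaussian":
        # Gaussian noise with fixed unit variance reduces to a squared error.
        return 0.5 * F.mse_loss(x_hat, x, reduction="sum")
    if noise_model == "bernoulli":
        # Bernoulli noise (e.g. binarized MNIST) gives a binary cross-entropy.
        return F.binary_cross_entropy(x_hat, x, reduction="sum")
    raise ValueError(f"unknown noise model: {noise_model}")
```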
More recent approaches replace Kullback–Leibler divergence (KL-D) with various statistical distances, see "Statistical distance VAE variants" below.
Formulation
From the point of view of probabilistic modeling, one wants to maximize the likelihood of the data x by their chosen parameterized probability distribution p_\theta(x) = p(x|\theta). This distribution is usually chosen to be a Gaussian N(x|\mu, \sigma) which is parameterized by \mu and \sigma respectively, and as a member of the exponential family it is easy to work with as a noise distribution. Simple distributions are easy enough to maximize; however, distributions where a prior is assumed over the latents z result in intractable integrals. Let us find p_\theta(x) via marginalizing over z.
: p_\theta(x) = \int_z p_\theta(x, z) \, dz ,
where p_\theta(x, z) represents the joint distribution under p_\theta of the observable data x and its latent representation or encoding z. According to the chain rule, the equation can be rewritten as
: p_\theta(x) = \int_z p_\theta(x|z) \, p_\theta(z) \, dz
In the vanilla variational autoencoder, z is usually taken to be a finite-dimensional vector of real numbers, and p_\theta(x|z) to be a Gaussian distribution. Then p_\theta(x) is a mixture of Gaussian distributions.
It is now possible to define the set of the relationships between the input data and its latent representation as
* Prior p_\theta(z)
* Likelihood p_\theta(x|z)
* Posterior p_\theta(z|x)
Unfortunately, the computation of p_\theta(z|x) is expensive and in most cases intractable. To speed up the calculus to make it feasible, it is necessary to introduce a further function to approximate the posterior distribution as
: q_\phi(z|x) \approx p_\theta(z|x)
with \phi defined as the set of real values that parametrize q. This is sometimes called ''amortized inference'', since by "investing" in finding a good q_\phi, one can later infer z from x quickly without doing any integrals.
In this way, the problem is to find a good probabilistic autoencoder, in which the conditional likelihood distribution p_\theta(x|z) is computed by the ''probabilistic decoder'', and the approximated posterior distribution q_\phi(z|x) is computed by the ''probabilistic encoder''.
Parametrize the encoder as E_\phi, and the decoder as D_\theta.
Evidence lower bound (ELBO)
Like many deep learning approaches that use gradient-based optimization, VAEs require a differentiable loss function to update the network weights through backpropagation.
For variational autoencoders, the idea is to jointly optimize the generative model parameters \theta to reduce the reconstruction error between the input and the output, and \phi to make q_\phi(z|x) as close as possible to p_\theta(z|x). As reconstruction loss, mean squared error and cross entropy are often used.
As distance loss between the two distributions the Kullback–Leibler divergence D_{KL}(q_\phi(z|x) \parallel p_\theta(z|x)) is a good choice to squeeze q_\phi(z|x) under p_\theta(z|x).
The distance loss just defined is expanded as
: D_{KL}(q_\phi(z|x) \parallel p_\theta(z|x)) = \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{q_\phi(z|x)}{p_\theta(z|x)}\right] = \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{q_\phi(z|x)\, p_\theta(x)}{p_\theta(x, z)}\right] = \ln p_\theta(x) + \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{q_\phi(z|x)}{p_\theta(x, z)}\right]
Now define the evidence lower bound (ELBO):
: L_{\theta,\phi}(x) := \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \ln p_\theta(x) - D_{KL}(q_\phi(\cdot|x) \parallel p_\theta(\cdot|x))
Maximizing the ELBO
: \theta^*, \phi^* = \underset{\theta,\phi}{\operatorname{arg\,max}}\, L_{\theta,\phi}(x)
is equivalent to simultaneously maximizing \ln p_\theta(x) and minimizing D_{KL}(q_\phi(z|x) \parallel p_\theta(z|x)). That is, maximizing the log-likelihood of the observed data, and minimizing the divergence of the approximate posterior q_\phi(\cdot|x) from the exact posterior p_\theta(\cdot|x).
The form given is not very convenient for maximization, but the following, equivalent form, is:
: L_{\theta,\phi}(x) = \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln p_\theta(x|z)\right] - D_{KL}(q_\phi(\cdot|x) \parallel p_\theta(\cdot))
where \ln p_\theta(x|z) is implemented as -\frac{1}{2}\|x - D_\theta(z)\|_2^2, since that is, up to an additive constant, what x|z \sim \mathcal{N}(D_\theta(z), I) yields. That is, we model the distribution of x conditional on z to be a Gaussian distribution centered on D_\theta(z). The distributions of q_\phi(z|x) and p_\theta(z) are often also chosen to be Gaussians as z|x \sim \mathcal{N}(E_\phi(x), \sigma_\phi(x)^2 I) and z \sim \mathcal{N}(0, I), with which we obtain, by the formula for the KL divergence of Gaussians:
: L_{\theta,\phi}(x) = -\frac{1}{2}\mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\|x - D_\theta(z)\|_2^2\right] - \frac{1}{2}\left(N\sigma_\phi(x)^2 + \|E_\phi(x)\|_2^2 - 2N\ln \sigma_\phi(x)\right) + \text{Const}
Here N is the dimension of z. For a more detailed derivation and more interpretations of ELBO and its maximization, see the main article on the evidence lower bound.
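The closed-form expression above translates almost directly into code. The following hedged sketch, which continues the hypothetical Encoder, Decoder, and reconstruction_loss defined earlier, returns the negative ELBO for a diagonal-Gaussian encoder and a standard-normal prior; parameterizing the encoder variance as a learned per-dimension log-variance is an assumption of this sketch.

```python
def negative_elbo(x, encoder, decoder, noise_model="gaussian"):
    """Single-sample Monte-Carlo estimate of -ELBO = reconstruction term + KL term."""
    mu, log_var = encoder(x)
    # Reparameterized sample z ~ N(mu, diag(exp(log_var))); see the next section.
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
    x_hat = decoder(z)
    recon = reconstruction_loss(x_hat, x, noise_model)
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.
    kl = 0.5 * torch.sum(torch.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return recon + kl   # minimizing this maximizes the ELBO
```

In practice this quantity is averaged over a minibatch and minimized with a stochastic gradient optimizer.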
Reparameterization
To efficiently search for
: \theta^*, \phi^* = \underset{\theta,\phi}{\operatorname{arg\,max}}\, L_{\theta,\phi}(x)
the typical method is gradient ascent.
It is straightforward to find
: \nabla_\theta \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\nabla_\theta \ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right]
However,
: \nabla_\phi \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right]
does not allow one to put the \nabla_\phi inside the expectation, since \phi appears in the probability distribution itself. The reparameterization trick (also known as stochastic backpropagation) bypasses this difficulty.
The most important example is when z \sim q_\phi(\cdot|x) is normally distributed, as \mathcal{N}(\mu_\phi(x), \Sigma_\phi(x)).
This can be reparametrized by letting \epsilon \sim \mathcal{N}(0, I) be a "standard random number generator", and constructing z as z = \mu_\phi(x) + L_\phi(x)\epsilon. Here, L_\phi(x) is obtained by the Cholesky decomposition:
: \Sigma_\phi(x) = L_\phi(x) L_\phi(x)^T
Then we have
: \nabla_\phi \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[\nabla_\phi \ln \frac{p_\theta(x, \mu_\phi(x) + L_\phi(x)\epsilon)}{q_\phi(\mu_\phi(x) + L_\phi(x)\epsilon \mid x)}\right]
and so we obtain an unbiased estimator of the gradient, allowing stochastic gradient descent.
Since we reparametrized z, we need to find q_\phi(z|x). Let q_0 be the probability density function for \epsilon, then
: \ln q_\phi(z|x) = \ln q_0(\epsilon) - \ln\left|\det\left(\partial_\epsilon z\right)\right|
where \partial_\epsilon z is the Jacobian matrix of z with respect to \epsilon. Since z = \mu_\phi(x) + L_\phi(x)\epsilon, this is
: \ln q_\phi(z|x) = -\frac{1}{2}\|\epsilon\|^2 - \ln\left|\det L_\phi(x)\right| - \frac{n}{2}\ln(2\pi)
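A hedged sketch of the reparameterized sampling step follows, using the full-covariance form with the Cholesky factor L_\phi(x) described above; the function name and the unbatched shapes are illustrative assumptions (the diagonal-covariance special case used earlier simply replaces L with the elementwise standard deviation).

```python
def reparameterize(mu, cov):
    """Sample z ~ N(mu, cov) as z = mu + L @ eps, with eps ~ N(0, I) and cov = L @ L.T.

    Assumes mu has shape (d,) and cov has shape (d, d). Because the randomness enters
    only through eps, gradients with respect to the parameters that produced mu and cov
    flow through this deterministic transform, enabling ordinary backpropagation.
    """
    L = torch.linalg.cholesky(cov)       # lower-triangular Cholesky factor of the covariance
    eps = torch.randn(mu.shape[-1])      # the "standard random number generator"
    return mu + L @ eps
```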
Variations
Many applications and extensions of variational autoencoders have been used to adapt the architecture to other domains and improve its performance.
β-VAE is an implementation with a weighted Kullback–Leibler divergence term to automatically discover and interpret factorised latent representations. With this implementation, it is possible to force manifold disentanglement for β values greater than one. This architecture can discover disentangled latent factors without supervision.
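In the notation of the ELBO section above, the β-VAE objective reweights the divergence term by the factor β, with β = 1 recovering the standard VAE:
: L_{\theta,\phi}^{\beta}(x) = \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln p_\theta(x|z)\right] - \beta\, D_{KL}(q_\phi(\cdot|x) \parallel p_\theta(\cdot))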
The conditional VAE (CVAE) inserts label information in the latent space to force a deterministic constrained representation of the learned data.
Some structures directly deal with the quality of the generated samples or implement more than one latent space to further improve the representation learning.
Some architectures mix VAE and generative adversarial networks to obtain hybrid models.
It is not necessary to use gradients to update the encoder. In fact, the encoder is not necessary for the generative model.
Statistical distance VAE variants
After the initial work of Diederik P. Kingma and Max Welling, several procedures were proposed to formulate the operation of the VAE in a more abstract way. In these approaches the loss function is composed of two parts:
* the usual reconstruction error part, which seeks to ensure that the encoder-then-decoder mapping x \mapsto D_\theta(E_\phi(x)) is as close to the identity map as possible; the sampling is done at run time from the empirical distribution \mathbb{P}^{real} of objects available (e.g., for MNIST or ImageNet this will be the empirical probability law of all images in the dataset). This gives the term \mathbb{E}_{x \sim \mathbb{P}^{real}}\left[\|x - D_\theta(E_\phi(x))\|_2^2\right].
* a variational part, which ensures that, when the empirical distribution \mathbb{P}^{real} is passed through the encoder E_\phi, we recover the target distribution \mu(dz), usually taken to be a multivariate normal distribution. We will denote E_\phi \sharp \mathbb{P}^{real} this pushforward measure, which in practice is just the empirical distribution obtained by passing all dataset objects through the encoder E_\phi. In order to make sure that E_\phi \sharp \mathbb{P}^{real} is close to the target \mu(dz), a statistical distance d is invoked and the term d\left(\mu(dz), E_\phi \sharp \mathbb{P}^{real}\right)^2 is added to the loss.
We obtain the final formula for the loss:
L_{\theta,\phi} = \mathbb{E}_{x \sim \mathbb{P}^{real}}\left[\|x - D_\theta(E_\phi(x))\|_2^2\right] + d\left(\mu(dz), E_\phi \sharp \mathbb{P}^{real}\right)^2
The statistical distance d requires special properties: for instance, it has to possess a formula as an expectation, because the loss function will need to be optimized by stochastic optimization algorithms. Several distances can be chosen, and this gave rise to several flavors of VAEs:
* the sliced Wasserstein distance used by S Kolouri, et al. in their VAE
* the energy distance implemented in the Radon Sobolev Variational Auto-Encoder
* the Maximum Mean Discrepancy distance used in the MMD-VAE
* the Wasserstein distance used in the WAEs
* kernel-based distances used in the Kernelized Variational Autoencoder (K-VAE)
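As a concrete illustration of this family, the following hedged sketch estimates the distance term with a kernel maximum mean discrepancy between the encoded codes and samples from the target \mu(dz), roughly in the spirit of the MMD-VAE; the Gaussian kernel, its bandwidth, the deterministic use of the encoder's mean head, and the reuse of the earlier hypothetical Encoder/Decoder modules are assumptions of the sketch, not the published algorithms.

```python
def gaussian_kernel(a, b, bandwidth=1.0):
    """k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2)) for all pairs of rows."""
    sq_dists = torch.cdist(a, b) ** 2
    return torch.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_squared(z_encoded, z_target, bandwidth=1.0):
    """Plug-in estimate of the squared maximum mean discrepancy between two samples."""
    return (gaussian_kernel(z_encoded, z_encoded, bandwidth).mean()
            + gaussian_kernel(z_target, z_target, bandwidth).mean()
            - 2.0 * gaussian_kernel(z_encoded, z_target, bandwidth).mean())

def statistical_distance_loss(x, encoder, decoder):
    """Reconstruction error plus d(mu(dz), E_phi # P_real)^2, with d estimated by MMD."""
    z, _ = encoder(x)                          # deterministic codes E_phi(x); the variance head is unused here
    x_hat = decoder(z)
    recon = ((x - x_hat) ** 2).sum(dim=1).mean()
    d2 = mmd_squared(z, torch.randn_like(z))   # target mu(dz) taken as a standard normal
    return recon + d2
```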
See also
* Autoencoder
* Artificial neural network
* Deep learning
* Generative adversarial network
* Representation learning
* Sparse dictionary learning
* Data augmentation
* Backpropagation