In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling, belonging to the families of probabilistic graphical models and variational Bayesian methods. Variational autoencoders are often associated with the autoencoder model because of their architectural affinity, but they differ significantly in goal and mathematical formulation. Variational autoencoders are probabilistic generative models that require neural networks as only a part of their overall structure, as e.g. in VQ-VAE. The neural network components are typically referred to as the encoder and decoder for the first and second component respectively. The encoder maps the input variable to a latent space that corresponds to the parameters of a variational distribution. In this way, the encoder can produce multiple different samples that all come from the same distribution. The decoder has the opposite function, which is to map from the latent space to the input space, in order to produce or generate data points. Both networks are typically trained together with the usage of the reparameterization trick, although the variance of the noise model can be learned separately. Although this type of model was initially designed for unsupervised learning, its effectiveness has also been proven for semi-supervised learning and supervised learning.


Overview of architecture and operation

A variational autoencoder is a generative model with a prior and a noise distribution. Usually such models are trained using the expectation-maximization meta-algorithm (e.g. probabilistic PCA, (spike & slab) sparse coding). Such a scheme optimizes a lower bound of the data likelihood, which is usually intractable, and in doing so requires the discovery of q-distributions, or variational posteriors. These q-distributions are normally parameterized for each individual data point in a separate optimization process. Variational autoencoders instead use a neural network as an amortized approach to jointly optimize across data points. This neural network takes as input the data points themselves, and outputs parameters for the variational distribution. As it maps from a known input space to the low-dimensional latent space, it is called the encoder.

The decoder is the second neural network of this model. It is a function that maps from the latent space to the input space, e.g. as the means of the noise distribution. It is possible to use another neural network that maps to the variance, although this can be omitted for simplicity; in such a case, the variance can be optimized with gradient descent.

To optimize this model, one needs to know two terms: the "reconstruction error", and the Kullback–Leibler divergence (KL-D). Both terms are derived from the free energy expression of the probabilistic model, and therefore differ depending on the noise distribution and the assumed prior of the data. The KL-D from the free energy expression maximizes the probability mass of the q-distribution that overlaps with the p-distribution, which unfortunately can result in mode-seeking behaviour. The "reconstruction" term is the remainder of the free energy expression, and requires a sampling approximation to compute its expectation value.
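As a concrete illustration of the encoder–decoder pair described above, the sketch below shows one common way to write it in PyTorch. The layer sizes, the diagonal-Gaussian posterior, the sigmoid output, and all names (VAE, enc_mu, etc.) are illustrative assumptions, not part of the original description.

import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal sketch: Gaussian encoder q_phi(z|x) and a decoder giving the mean of p_theta(x|z)."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        # Encoder: maps x to the parameters (mean, log-variance) of the variational distribution.
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        # Decoder: maps a latent sample z back to the input space (here a Bernoulli mean in [0, 1],
        # an illustrative choice of noise model).
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def encode(self, x):
        h = self.enc(x)
        return self.enc_mu(h), self.enc_logvar(h)

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps with eps ~ N(0, I): the reparameterization trick.
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * logvar) * eps

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.dec(z), mu, logvar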


Formulation

From the point of view of probabilistic modelling, one wants to maximize the likelihood of the data x under a chosen parameterized probability distribution p_\theta(x) = p(x|\theta). This distribution is usually chosen to be a Gaussian N(x|\mu,\sigma) which is parameterized by \mu and \sigma respectively, and as a member of the exponential family it is easy to work with as a noise distribution. Simple distributions are easy enough to maximize; however, distributions where a prior is assumed over the latents z result in intractable integrals. Let us find p_\theta(x) via marginalizing over z:

: p_\theta(x) = \int_z p_\theta(x, z) \, dz,

where p_\theta(x, z) represents the joint distribution under p_\theta of the observable data x and its latent representation or encoding z. According to the chain rule, the equation can be rewritten as

: p_\theta(x) = \int_z p_\theta(x|z) p_\theta(z) \, dz

In the vanilla variational autoencoder, z is usually taken to be a finite-dimensional vector of real numbers, and p_\theta(x|z) to be a Gaussian distribution. Then p_\theta(x) is a mixture of Gaussian distributions. It is now possible to define the set of relationships between the input data and its latent representation as

* Prior p_\theta(z)
* Likelihood p_\theta(x|z)
* Posterior p_\theta(z|x)

Unfortunately, the computation of p_\theta(x) is expensive and in most cases intractable. To speed up the calculation and make it feasible, it is necessary to introduce a further function to approximate the posterior distribution as

:q_\phi(z|x) \approx p_\theta(z|x)

with \phi defined as the set of real values that parametrize q. This is sometimes called ''amortized inference'', since by "investing" in finding a good q_\phi, one can later infer z from x quickly without doing any integrals. In this way, the problem is that of finding a good probabilistic autoencoder, in which the conditional likelihood distribution p_\theta(x|z) is computed by the ''probabilistic decoder'', and the approximated posterior distribution q_\phi(z|x) is computed by the ''probabilistic encoder''.
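To make the marginalization concrete, the toy sketch below estimates p_\theta(x) by Monte Carlo, sampling z from the prior and averaging the Gaussian likelihood p_\theta(x|z). The linear decoder, the fixed noise scale, and the dimensions are illustrative assumptions; the point is only that the marginal is an average over many prior samples, which is exactly why it is expensive to evaluate directly.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy decoder: the mean of p_theta(x|z) is a fixed linear map (assumption).
W = rng.normal(size=(2, 3))          # x_dim = 2, z_dim = 3
sigma_x = 0.5                        # fixed observation noise

def log_gaussian(x, mean, sigma):
    # Log-density of an isotropic Gaussian N(x | mean, sigma^2 I).
    d = x.shape[-1]
    return (-0.5 * np.sum((x - mean) ** 2, axis=-1) / sigma**2
            - 0.5 * d * np.log(2 * np.pi * sigma**2))

def log_marginal_likelihood(x, n_samples=100_000):
    # p_theta(x) = E_{z ~ p(z)}[ p_theta(x|z) ], approximated by sampling the prior.
    z = rng.standard_normal((n_samples, 3))          # z ~ N(0, I)
    log_px_given_z = log_gaussian(x, z @ W.T, sigma_x)
    # log-mean-exp for numerical stability
    m = log_px_given_z.max()
    return m + np.log(np.mean(np.exp(log_px_given_z - m)))

x = np.array([0.3, -1.2])
print(log_marginal_likelihood(x))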


Evidence lower bound (ELBO)

As in every deep learning problem, it is necessary to define a differentiable loss function in order to update the network weights through backpropagation. For variational autoencoders, the idea is to jointly optimize the generative model parameters \theta to reduce the reconstruction error between the input and the output, and \phi to make q_\phi(z|x) as close as possible to p_\theta(z|x). As reconstruction loss, mean squared error and cross entropy are often used.

As distance loss between the two distributions the reverse Kullback–Leibler divergence D_{KL}(q_\phi(z|x)\parallel p_\theta(z|x)) is a good choice to squeeze q_\phi(z|x) under p_\theta(z|x). The distance loss just defined is expanded as

: \begin{align} D_{KL}(q_\phi(z|x)\parallel p_\theta(z|x)) &= \mathbb E_{z\sim q_\phi(\cdot|x)} \left[\ln \frac{q_\phi(z|x)}{p_\theta(z|x)}\right]\\ &= \mathbb E_{z\sim q_\phi(\cdot|x)} \left[\ln \frac{q_\phi(z|x) p_\theta(x)}{p_\theta(x, z)}\right]\\ &= \ln p_\theta(x) + \mathbb E_{z\sim q_\phi(\cdot|x)} \left[\ln \frac{q_\phi(z|x)}{p_\theta(x, z)}\right] \end{align}

Now define the evidence lower bound (ELBO):

:L_{\theta,\phi}(x) := \mathbb E_{z\sim q_\phi(\cdot|x)} \left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \ln p_\theta(x) - D_{KL}(q_\phi(\cdot|x)\parallel p_\theta(\cdot|x))

Maximizing the ELBO

:\theta^*,\phi^* = \underset{\theta,\phi}{\operatorname{argmax}} \, L_{\theta,\phi}(x)

is equivalent to simultaneously maximizing \ln p_\theta(x) and minimizing D_{KL}(q_\phi(z|x)\parallel p_\theta(z|x)). That is, maximizing the log-likelihood of the observed data, and minimizing the divergence of the approximate posterior q_\phi(\cdot|x) from the exact posterior p_\theta(\cdot|x). For a more detailed derivation and more interpretations of ELBO and its maximization, see its main page.
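In practice the negative ELBO is implemented as a reconstruction term plus a KL term. The sketch below assumes a Bernoulli decoder (binary cross entropy reconstruction) and a diagonal-Gaussian encoder with a standard-normal prior, for which the KL divergence has a closed form; these modelling choices and the function name are illustrative assumptions rather than the only possible form of the loss.

import torch
import torch.nn.functional as F

def negative_elbo(x, x_recon, mu, logvar):
    """Sketch of a VAE loss: Bernoulli decoder, diagonal-Gaussian encoder, N(0, I) prior.
    x and x_recon are assumed to lie in [0, 1] (e.g. x_recon produced by a sigmoid)."""
    # Reconstruction term: expected negative log-likelihood, here binary cross entropy.
    recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
    # KL(q_phi(z|x) || N(0, I)) in closed form for a diagonal Gaussian.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # Minimizing this sum is equivalent to maximizing the ELBO.
    return recon + kl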


Reparameterization

To efficiently search for

:\theta^*,\phi^* = \underset{\theta,\phi}{\operatorname{argmax}} \, L_{\theta,\phi}(x)

the typical method is gradient descent. It is straightforward to find

:\nabla_\theta \mathbb E_{z\sim q_\phi(\cdot|x)} \left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \mathbb E_{z\sim q_\phi(\cdot|x)} \left[\nabla_\theta \ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right]

However,

:\nabla_\phi \mathbb E_{z\sim q_\phi(\cdot|x)} \left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right]

does not allow one to put the \nabla_\phi inside the expectation, since \phi appears in the probability distribution itself. The reparameterization trick (also known as stochastic backpropagation) bypasses this difficulty.

The most important example is when z \sim q_\phi(\cdot|x) is normally distributed, as \mathcal N(\mu_\phi(x), \Sigma_\phi(x)). This can be reparametrized by letting \boldsymbol\epsilon \sim \mathcal N(0, \boldsymbol I) be a "standard random number generator", and constructing z as z = \mu_\phi(x) + L_\phi(x)\epsilon. Here, L_\phi(x) is obtained by the Cholesky decomposition:

:\Sigma_\phi(x) = L_\phi(x) L_\phi(x)^T

Then we have

:\nabla_\phi \mathbb E_{z\sim q_\phi(\cdot|x)} \left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \mathbb E_{\epsilon\sim \mathcal N(0, \boldsymbol I)} \left[\nabla_\phi \ln \frac{p_\theta(x, \mu_\phi(x) + L_\phi(x)\epsilon)}{q_\phi(\mu_\phi(x) + L_\phi(x)\epsilon | x)}\right]

and so we obtain an unbiased estimator of the gradient, allowing stochastic gradient descent.

Since we reparametrized z, we need to find q_\phi(z|x). Let q_0 be the probability density function for \epsilon; then

:\ln q_\phi(z|x) = \ln q_0(\epsilon) - \ln |\det(\partial_\epsilon z)|

where \partial_\epsilon z is the Jacobian matrix of z with respect to \epsilon. Since z = \mu_\phi(x) + L_\phi(x)\epsilon, this is

:\ln q_\phi(z|x) = -\frac 12 \|\epsilon\|^2 - \ln |\det L_\phi(x)| - \frac n2 \ln(2\pi)
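The following toy check (illustrative, not from the article; the scalar values of mu and log_sigma are arbitrary) shows the trick in the diagonal-Gaussian case: because the randomness enters only through \epsilon, gradients of a Monte Carlo expectation can be taken with respect to the variational parameters.

import torch

# Variational parameters of a one-dimensional q_phi: mean and log standard deviation.
mu = torch.tensor([0.5], requires_grad=True)
log_sigma = torch.tensor([-1.0], requires_grad=True)

# "Standard random number generator": eps ~ N(0, I), independent of mu and log_sigma.
eps = torch.randn(10_000, 1)
# Reparameterized sample, z = mu_phi(x) + L_phi(x) * eps in the diagonal case.
z = mu + torch.exp(log_sigma) * eps

# Monte Carlo estimate of E[z^2]; it is differentiable in mu and log_sigma because
# the randomness sits in eps, not in the distribution's own parameters.
loss = (z ** 2).mean()
loss.backward()
print(mu.grad, log_sigma.grad)   # approx 2*mu and 2*sigma^2 respectively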


Variations

Many variational autoencoder applications and extensions have been used to adapt the architecture to other domains and improve its performance. \beta-VAE is an implementation with a weighted Kullback–Leibler divergence term to automatically discover and interpret factorised latent representations. With this implementation, it is possible to force manifold disentanglement for \beta values greater than one. This architecture can discover disentangled latent factors without supervision. The conditional VAE (CVAE) inserts label information in the latent space to force a deterministic constrained representation of the learned data. Some structures directly deal with the quality of the generated samples or implement more than one latent space to further improve the representation learning. Some architectures mix VAE and generative adversarial networks to obtain hybrid models.
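As a sketch of how the \beta weighting enters, the loss below is the standard VAE objective with the KL term scaled by \beta. The Bernoulli reconstruction term, the function name, and the default \beta value are illustrative assumptions.

import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Sketch of the beta-VAE objective: the usual reconstruction term plus beta times the
    KL term; beta > 1 puts extra pressure on the latent code, encouraging disentanglement."""
    recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl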


See also

* Autoencoder
* Artificial neural network
* Deep learning
* Generative adversarial network
* Representation learning
* Sparse dictionary learning
* Data augmentation
* Backpropagation

