An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning). The encoding is validated and refined by attempting to regenerate the input from the encoding. The autoencoder learns a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore insignificant data (“noise”).
Variants exist, aiming to force the learned representations to assume useful properties. Examples are regularized autoencoders (''Sparse'', ''Denoising'' and ''Contractive''), which are effective in learning representations for subsequent classification tasks, and ''Variational'' autoencoders, with applications as generative models. Autoencoders are applied to many problems, including facial recognition, feature detection, anomaly detection and acquiring the meaning of words. Autoencoders are also generative models which can randomly generate new data that is similar to the input data (training data).
Mathematical principles
Definition
An autoencoder is defined by the following components:
* Two sets: the space of decoded messages \mathcal X; the space of encoded messages \mathcal Z. Almost always, both \mathcal X and \mathcal Z are Euclidean spaces, that is, \mathcal X = \mathbb R^m, \mathcal Z = \mathbb R^n for some m, n.
* Two parametrized families of functions: the encoder family E_\phi: \mathcal X \to \mathcal Z, parametrized by \phi; the decoder family D_\theta: \mathcal Z \to \mathcal X, parametrized by \theta.
For any x \in \mathcal X, we usually write z = E_\phi(x), and refer to it as the code, the latent variable, latent representation, latent vector, etc. Conversely, for any z \in \mathcal Z, we usually write x' = D_\theta(z), and refer to it as the (decoded) message.
Usually, both the encoder and the decoder are defined as multilayer perceptrons. For example, a one-layer-MLP encoder E_\phi is:
:E_\phi(x) = \sigma(Wx + b)
where \sigma is an element-wise activation function such as a sigmoid function or a rectified linear unit, W is a matrix called "weight", and b is a vector called "bias".
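As a rough illustration, a one-layer encoder of the form sigma(W x + b) can be sketched in plain Python. The function names `sigmoid` and `encode`, and the toy values of `W`, `b` and `x`, are assumptions made for this example only; a practical implementation would normally use a numerical library.

```python
import math

def sigmoid(t):
    """Logistic sigmoid of a single number."""
    return 1.0 / (1.0 + math.exp(-t))

def encode(W, b, x):
    """One-layer MLP encoder: sigma(W x + b), with sigma applied element-wise.

    W is a list of n rows (each of length m), b has length n, x has length m.
    """
    return [
        sigmoid(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]

# Hypothetical toy parameters: a 3-dimensional message mapped to a 2-dimensional code.
W = [[0.5, -0.25, 0.0],
     [0.0, 1.0, -1.0]]
b = [0.0, 0.5]
x = [1.0, 2.0, 3.0]

z = encode(W, b, x)
print(len(z), z)
```

Here the code z has fewer entries than the message x, which is the undercomplete setting; a decoder of the same shape would map z back to three dimensions.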
Training an autoencoder
An autoencoder, by itself, is simply a tuple of two functions. To judge its ''quality'', we need a ''task''. A task is defined by a reference probability distribution \mu_{ref} over \mathcal X, and a "reconstruction quality" function d: \mathcal X \times \mathcal X \to [0, \infty], such that d(x, x') measures how much x' differs from x. With those, the loss function for the autoencoder is
:L(\theta, \phi) = \mathbb E_{x \sim \mu_{ref}}[d(x, D_\theta(E_\phi(x)))]
The optimal autoencoder for the given task (\mu_{ref}, d) is then \arg\min_{\theta, \phi} L(\theta, \phi). The search for the optimal autoencoder can be carried out by any mathematical optimization technique, but usually by gradient descent. This search process is referred to as "training the autoencoder".
In most situations, the reference distribution is just the empirical distribution given by a dataset \{x_1, ..., x_N\} \subset \mathcal X, so that
:\mu_{ref} = \frac{1}{N}\sum_{i=1}^N \delta_{x_i}
where \delta_{x_i} is the Dirac measure, and the quality function is just L2 loss: d(x, x') = \|x - x'\|_2^2. Then the problem of searching for the optimal autoencoder is just a least-squares optimization:
:\min_{\theta, \phi} L(\theta, \phi), \text{ where } L(\theta, \phi) = \frac{1}{N}\sum_{i=1}^N \|x_i - D_\theta(E_\phi(x_i))\|_2^2
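The least-squares objective can be made concrete with a deliberately tiny model. The sketch below assumes a linear autoencoder with a one-dimensional code (encoder z = e . x, decoder x' = z * d) and trains it by full-batch gradient descent; the names `loss` and `gradient_step`, the toy dataset, the learning rate and the step count are all choices made for this illustration, not part of the article.

```python
def loss(e, d, data):
    """Mean squared L2 reconstruction error over the dataset."""
    total = 0.0
    for x in data:
        z = sum(ei * xi for ei, xi in zip(e, x))            # code z = e . x
        total += sum((xi - z * di) ** 2 for xi, di in zip(x, d))
    return total / len(data)

def gradient_step(e, d, data, lr):
    """One step of full-batch gradient descent on the loss above."""
    n = len(data)
    grad_e = [0.0] * len(e)
    grad_d = [0.0] * len(d)
    for x in data:
        z = sum(ei * xi for ei, xi in zip(e, x))
        r = [xi - z * di for xi, di in zip(x, d)]           # residual x - x'
        r_dot_d = sum(ri * di for ri, di in zip(r, d))
        for j in range(len(x)):
            grad_e[j] -= 2.0 * r_dot_d * x[j] / n           # dL/de_j
            grad_d[j] -= 2.0 * r[j] * z / n                 # dL/dd_j
    e = [ej - lr * gj for ej, gj in zip(e, grad_e)]
    d = [dj - lr * gj for dj, gj in zip(d, grad_d)]
    return e, d

# Toy dataset: 2-dimensional messages that all lie on one line,
# so a 1-dimensional code is enough to reconstruct them.
data = [(t, 2.0 * t) for t in (-1.0, -0.5, 0.5, 1.0)]
e, d = [0.1, 0.1], [0.1, 0.1]

initial_loss = loss(e, d, data)
for _ in range(3000):
    e, d = gradient_step(e, d, data, lr=0.02)
final_loss = loss(e, d, data)
print(initial_loss, final_loss)
```

Because the data are exactly one-dimensional, the reconstruction error can be driven close to zero; with real data the minimum is generally positive.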
Interpretation
An autoencoder has two main parts: an encoder that maps the message to a code, and a decoder that reconstructs the message from the code. An optimal autoencoder would perform as close to perfect reconstruction as possible, with "close to perfect" defined by the reconstruction quality function d.
The simplest way to perform the copying task perfectly would be to duplicate the signal. To suppress this behavior, the code space \mathcal Z usually has fewer dimensions than the message space \mathcal X.
Such an autoencoder is called ''undercomplete''. It can be interpreted as compressing the message, or reducing its dimensionality.
At the limit of an ideal undercomplete autoencoder, every possible code z in the code space is used to encode a message x that really appears in the distribution \mu_{ref}, and the decoder is also perfect: D_\theta(E_\phi(x)) = x. This ideal autoencoder can then be used to generate messages indistinguishable from real messages, by feeding its decoder arbitrary code z and obtaining D_\theta(z), which is a message that really appears in the distribution \mu_{ref}.
If the code space \mathcal Z has dimension larger than (''overcomplete''), or equal to, the message space \mathcal X, or the hidden units are given enough capacity, an autoencoder can learn the identity function and become useless. However, experimental results found that overcomplete autoencoders might still learn useful features.
In the ideal setting, the code dimension and the model capacity could be set on the basis of the complexity of the data distribution to be modeled. A standard way to do so is to add modifications to the basic autoencoder, to be detailed below.
History
The autoencoder has also been called the autoassociator, or Diabolo network. Its first applications date to the 1980s. Its most traditional application was dimensionality reduction or feature learning, but the concept became widely used for learning generative models of data.[Generating Faces with Torch, Boesen A., Larsen L. and Sonderby S.K., 2015] Some of the most powerful AIs in the 2010s involved autoencoders stacked inside deep neural networks.
Variations
Regularized autoencoders
Various techniques exist to prevent autoencoders from learning the identity function and to improve their ability to capture important information and learn richer representations.
Sparse autoencoder (SAE)
Inspired by the sparse coding hypothesis in neuroscience, sparse autoencoders are variants of autoencoders, such that the codes E_\phi(x) for messages tend to be ''sparse codes'', that is, E_\phi(x) is close to zero in most entries. Sparse autoencoders may include more (rather than fewer) hidden units than inputs, but only a small number of the hidden units are allowed to be active at the same time. Encouraging sparsity improves performance on classification tasks.
There are two main ways to enforce sparsity. One way is to simply clamp all but the highest-k activations of the latent code to zero. This is the k-sparse autoencoder.
The k-sparse autoencoder inserts the following "k-sparse function" in the latent layer of a standard autoencoder:
:f_k(x_1, ..., x_n) = (x_1 b_1, ..., x_n b_n)
where b_i = 1 if |x_i| ranks in the top k, and 0 otherwise.
Backpropagating through f_k is simple: set gradient to 0 for b_i = 0 entries, and keep gradient for b_i=1 entries. This is essentially a generalized ReLU function.
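A minimal sketch of the k-sparse function and its backward pass, written in plain Python for illustration. The names `k_sparse` and `k_sparse_backward` are hypothetical, and breaking ties by position is an arbitrary choice made here, not something the text specifies.

```python
def k_sparse(x, k):
    """Keep the k entries of x with the largest absolute value; zero the rest.

    Returns (output, mask), where mask[i] plays the role of b_i.
    """
    # Indices sorted by decreasing |x_i|; ties are broken by position.
    top = sorted(range(len(x)), key=lambda i: -abs(x[i]))[:k]
    mask = [0.0] * len(x)
    for i in top:
        mask[i] = 1.0
    return [xi * bi for xi, bi in zip(x, mask)], mask

def k_sparse_backward(upstream_grad, mask):
    """Gradient w.r.t. the input: kept where b_i = 1, zeroed where b_i = 0."""
    return [g * bi for g, bi in zip(upstream_grad, mask)]

codes = [0.1, -2.0, 0.5, 1.5, -0.3]
out, mask = k_sparse(codes, k=2)
grad_in = k_sparse_backward([1.0] * len(codes), mask)
print(out, mask, grad_in)
```

Only the two entries with the largest magnitude (-2.0 and 1.5) survive, and only those positions receive gradient.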
The other way is a relaxed version of the k-sparse autoencoder. Instead of forcing sparsity, we add a sparsity regularization loss, then optimize for
:\min_{\theta, \phi} L(\theta, \phi) + \lambda L_{sparse}(\theta, \phi)
where \lambda > 0 measures how much sparsity we want to enforce.
Let the autoencoder architecture have K layers. To define a sparsity regularization loss, we need a "desired" sparsity \hat\rho_k for each layer, a weight w_k for how much to enforce each sparsity, and a function s: [0, 1] \times [0, 1] \to [0, \infty] to measure how much two sparsities differ.
For each input x, let the actual sparsity of activation in each layer k be
:\rho_k(x) = \frac{1}{n} \sum_{i=1}^n a_{k,i}(x)
where a_{k,i}(x) is the activation in the i-th neuron of the k-th layer upon input x.
The sparsity loss upon input x for one layer is s(\hat\rho_k, \rho_k(x)), and the sparsity regularization loss for the entire autoencoder is the expected weighted sum of sparsity losses:
:L_{sparse}(\theta, \phi) = \mathbb E_{x \sim \mu_{ref}}\left[\sum_{k=1}^K w_k s(\hat\rho_k, \rho_k(x))\right]
Typically, the function s is either the Kullback-Leibler (KL) divergence, as[Ng, A. (2011). Sparse autoencoder. ''CS294A Lecture notes'', ''72''(2011), 1-19.]
:s(\rho, \hat\rho) = KL(\rho \| \hat\rho) = \rho \log \frac{\rho}{\hat\rho} + (1 - \rho)\log \frac{1 - \rho}{1 - \hat\rho}
or the L1 loss, as s(\rho, \hat\rho) = |\rho - \hat\rho|, or the L2 loss, as s(\rho, \hat\rho) = |\rho - \hat\rho|^2.
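The three choices of s can be compared on a single layer with a short sketch. The function names, the toy activation values and the target sparsity of 0.05 are assumptions made for this example; the first argument of each penalty is the desired sparsity and the second the measured one, matching s(\hat\rho_k, \rho_k(x)) above.

```python
import math

def kl_sparsity(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat).

    Both arguments must lie strictly between 0 and 1.
    """
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))

def l1_sparsity(rho, rho_hat):
    return abs(rho - rho_hat)

def l2_sparsity(rho, rho_hat):
    return abs(rho - rho_hat) ** 2

def mean_activation(activations):
    """Average activation of one layer on one input."""
    return sum(activations) / len(activations)

# Hypothetical activations of one layer (e.g. sigmoid outputs) and a target sparsity.
layer = [0.9, 0.05, 0.1, 0.02, 0.03]
desired = 0.05
actual = mean_activation(layer)

for s in (kl_sparsity, l1_sparsity, l2_sparsity):
    print(s.__name__, s(desired, actual))
```

All three penalties vanish when the measured sparsity equals the desired one and grow as the two move apart; the KL form additionally requires both values to stay strictly inside (0, 1).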
Alternatively, the sparsity regularization loss may be defined without reference to any "desired sparsity", but simply force as much sparsity as possible. In this case, one can define the sparsity regularization loss as
:L_{sparse}(\theta, \phi) = \mathbb E_{x \sim \mu_{ref}}\left[\sum_{k=1}^K w_k \|h_k\|\right]
where h_k is the activation vector in the k-th layer of the autoencoder. The norm \|\cdot\| is usually the L1 norm (giving the L1 sparse autoencoder) or the L2 norm (giving the L2 sparse autoencoder).
Denoising autoencoder (DAE)
Denoising autoencoders (DAE) try to achieve a ''good'' representation by changing the ''reconstruction criterion''.
A DAE is defined by adding a noise process to the standard autoencoder. A noise process is defined by a probability distribution \mu_T over functions T:\mathcal X \to \mathcal X. That is, the function T takes a message x\in \mathcal X, and corrupts it to a noisy version T(x). The function T is selected randomly, with a probability distribution \mu_T.
Given a task (\mu_{ref}, d), the problem of training a DAE is the optimization problem:
:\min_{\theta, \phi} L(\theta, \phi) = \mathbb E_{x \sim \mu_{ref}, T \sim \mu_T}[d(x, (D_\theta \circ E_\phi \circ T)(x))]
That is, the optimal DAE should take any noisy message and attempt to recover the original message without noise, thus the name "denoising".
Usually, the noise process T is applied only during training and testing, not during downstream use.
The use of DAE depends on two assumptions:
* There exist representations to the messages that are relatively stable and robust to the type of noise we are likely to encounter;
* The said representations capture structures in the input distribution that are useful for our purposes.
Example noise processes include:
* additive isotropic Gaussian noise,
* masking noise (a fraction of the input is randomly chosen and set to 0),
* salt-and-pepper noise (a fraction of the input is randomly chosen and randomly set to its minimum or maximum value).
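The three noise processes listed above can be sketched as follows. The function names, the choice of 0 and 1 as the minimum and maximum values, and the rounding of the corrupted fraction to a whole number of entries are assumptions made for this illustration. During training, the corrupted message would be fed to the encoder while the clean message remains the reconstruction target.

```python
import random

def gaussian_noise(x, sigma, rng):
    """Additive isotropic Gaussian noise with standard deviation sigma."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]

def masking_noise(x, fraction, rng):
    """Set a randomly chosen fraction of the entries to 0."""
    chosen = set(rng.sample(range(len(x)), round(fraction * len(x))))
    return [0.0 if i in chosen else xi for i, xi in enumerate(x)]

def salt_and_pepper_noise(x, fraction, rng, lo=0.0, hi=1.0):
    """Set a randomly chosen fraction of the entries to lo or hi at random."""
    chosen = set(rng.sample(range(len(x)), round(fraction * len(x))))
    return [rng.choice((lo, hi)) if i in chosen else xi
            for i, xi in enumerate(x)]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
x = [0.5] * 10                    # a toy message with entries in [0, 1]

noisy_gauss = gaussian_noise(x, sigma=0.1, rng=rng)
noisy_mask = masking_noise(x, fraction=0.3, rng=rng)
noisy_sp = salt_and_pepper_noise(x, fraction=0.3, rng=rng)
print(noisy_gauss, noisy_mask, noisy_sp, sep="\n")
```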
Contractive autoencoder (CAE)
A contractive autoencoder adds the contractive regularization loss to the standard autoencoder loss:
:\min_{\theta, \phi} L(\theta, \phi) + \lambda L_{cont}(\theta, \phi)
where \lambda > 0 measures how much contractive-ness we want to enforce. The contractive regularization loss itself is defined as the expected squared Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input:
:L_{cont}(\theta, \phi) = \mathbb E_{x \sim \mu_{ref}} \|\nabla_x E_\phi(x)\|_F^2
To understand what L_{cont} measures, note the fact that, to first order in \delta x,
:\|E_\phi(x + \delta x) - E_\phi(x)\|_2 \leq \|\nabla_x E_\phi(x)\|_F \|\delta x\|_2
for any message x\in \mathcal X, and small variation \delta x in it. Thus, if \|\nabla_x E_\phi(x)\|_F^2 is small, it means that a small neighborhood of the message maps to a small neighborhood of its code. This is a desired property, as it means small variation in the message leads to small, perhaps even zero, variation in its code, like how two pictures may look the same even if they are not exactly the same.
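For a one-layer sigmoid encoder sigma(W x + b), the Jacobian has entries h_i (1 - h_i) W_ij, where h is the code, so the squared Frobenius norm has a closed form. The sketch below assumes that specific encoder; the names `encode` and `contractive_penalty` and the toy parameters are illustrative only, and deeper encoders would need automatic differentiation instead.

```python
import math

def encode(W, b, x):
    """One-layer sigmoid encoder: sigma(W x + b)."""
    return [1.0 / (1.0 + math.exp(-(sum(w * xj for w, xj in zip(row, x)) + bi)))
            for row, bi in zip(W, b)]

def contractive_penalty(W, b, x):
    """Squared Frobenius norm of the Jacobian of encode(W, b, .) at x.

    Uses J_ij = h_i * (1 - h_i) * W_ij, which holds for the sigmoid encoder above.
    """
    h = encode(W, b, x)
    return sum((hi * (1.0 - hi)) ** 2 * sum(w * w for w in row)
               for hi, row in zip(h, W))

# Hypothetical toy parameters.
W = [[1.0, 0.0],
     [0.0, 2.0]]
b = [0.0, 0.0]
x = [0.0, 0.0]

penalty = contractive_penalty(W, b, x)
print(penalty)
```

Averaging this quantity over the training messages and scaling it by \lambda gives the regularization term that is added to the reconstruction loss.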
The DAE can be understood as an infinitesimal limit of CAE: in the limit of small Gaussian input noise, DAEs make the reconstruction function resist small but finite-sized input perturbations, while CAEs make the extracted features resist infinitesimal input perturbations.
Minimal description length autoencoder
Concrete autoencoder
The concrete autoencoder is designed for discrete feature selection. A concrete autoencoder forces the latent space to consist only of a user-specified number of features. The concrete autoencoder uses a continuous relaxation of the categorical distribution to allow gradients to pass through the feature selector layer, which makes it possible to use standard backpropagation to learn an optimal subset of input features that minimizes reconstruction loss.
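One common continuous relaxation of a categorical choice is the Concrete (Gumbel-softmax) distribution: Gumbel noise is added to per-feature logits and a temperature-scaled softmax is applied. The sketch below assumes that relaxation and shows a single selector unit; the names `concrete_sample` and `select_feature`, the logits and the temperature are hypothetical, and the exact parametrization used by any particular concrete autoencoder implementation may differ.

```python
import math
import random

def concrete_sample(logits, temperature, rng):
    """Draw relaxed one-hot weights: softmax((logits + Gumbel noise) / temperature)."""
    # Gumbel(0, 1) noise is -log(-log(U)) for U uniform on (0, 1);
    # U is kept away from 0 and 1 so both logarithms are finite.
    gumbel = [-math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12)))
              for _ in logits]
    scores = [(l + g) / temperature for l, g in zip(logits, gumbel)]
    m = max(scores)                          # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_feature(x, weights):
    """Soft feature selection: a weighted combination of the input features."""
    return sum(w * xi for w, xi in zip(weights, x))

rng = random.Random(0)
x = [3.0, -1.0, 7.0]
logits = [0.0, 0.0, 50.0]                    # strongly favours the third feature

weights = concrete_sample(logits, temperature=0.1, rng=rng)
selected = select_feature(x, weights)
print(weights, selected)
```

At a low temperature the weights are nearly one-hot, so the unit effectively copies one input feature, while the softmax keeps the operation differentiable with respect to the logits.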
Variational autoencoder (VAE)
Variational autoencoders (VAEs) belong to the families of variational Bayesian methods. Despite the architectural similarities with basic autoencoders, VAEs are architectures with different goals and with a completely different mathematical formulation. The latent space is in this case composed of a mixture of distributions instead of a fixed vector.
Given an input dataset x characterized by an unknown probability function P(x) and a multivariate latent encoding vector z, the objective is to model the data as a distribution p_\theta(x), with \theta defined as the set of the network parameters so that p_\theta(x) = \int_{z} p_\theta(x,z) dz.
Advantages of depth
Autoencoders are often trained with a single layer encoder and a single layer decoder, but using many-layered (deep) encoders and decoders offers many advantages.
* Depth can exponentially reduce the computational cost of representing some functions.
* Depth can exponentially decrease the amount of training data needed to learn some functions.
* Experimentally, deep autoencoders yield better compression compared to shallow or linear autoencoders.
Training
Geoffrey Hinton developed the deep belief network technique for training many-layered deep autoencoders. His method involves treating each neighbouring set of two layers as a restricted Boltzmann machine so that pretraining approximates a good solution, then using backpropagation to fine-tune the results.
Researchers have debated whether joint training (i.e. training the whole architecture together with a single global reconstruction objective to optimize) would be better for deep auto-encoders. A 2015 study showed that joint training learns better data models along with more representative features for classification as compared to the layerwise method. However, their experiments showed that the success of joint training depends heavily on the regularization strategies adopted.
Applications
The two main applications of autoencoders are dimensionality reduction and information retrieval, but modern variations have been applied to other tasks.
Dimensionality reduction

Dimensionality reduction was one of the first deep learning applications.
For Hinton's 2006 study, he pretrained a multi-layer autoencoder with a stack of RBMs and then used their weights to initialize a deep autoencoder with gradually smaller hidden layers until hitting a bottleneck of 30 neurons. The resulting 30 dimensions of the code yielded a smaller reconstruction error compared to the first 30 components of a principal component analysis (PCA), and learned a representation that was qualitatively easier to interpret, clearly separating data clusters.
Representing data in a lower-dimensional space can improve performance on tasks such as classification. Indeed, the hallmark of dimensionality reduction is to place semantically related examples near each other.
Principal component analysis
If linear activations are used, or only a single sigmoid hidden layer, then the optimal solution to an autoencoder is strongly related to principal component analysis (PCA). The weights of an autoencoder with a single hidden layer of size p (where p is less than the size of the input) span the same vector subspace as the one spanned by the first p principal components, and the output of the autoencoder is an orthogonal projection onto this subspace. The autoencoder weights are not equal to the principal components, and are generally not orthogonal, yet the principal components may be recovered from them using the singular value decomposition.
However, the potential of autoencoders resides in their non-linearity, allowing the model to learn more powerful generalizations compared to PCA, and to reconstruct the input with significantly lower information loss.
Information retrieval
Information retrieval benefits particularly from dimensionality reduction in that search can become more efficient in certain kinds of low dimensional spaces. Autoencoders were indeed applied to semantic hashing, proposed by Salakhutdinov and Hinton in 2007. By training the algorithm to produce a low-dimensional binary code, all database entries could be stored in a hash table mapping binary code vectors to entries. This table would then support information retrieval by returning all entries with the same binary code as the query, or slightly less similar entries by flipping some bits from the query encoding.
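The lookup side of semantic hashing can be sketched with an ordinary dictionary. The binary codes below are hard-coded stand-ins for the output of a trained encoder, and the names `build_table` and `lookup`, the entry ids and the `max_flips` parameter are hypothetical.

```python
from itertools import combinations

def build_table(codes):
    """Map each binary code (a tuple of 0/1) to the entry ids that share it."""
    table = {}
    for entry_id, code in codes.items():
        table.setdefault(code, []).append(entry_id)
    return table

def lookup(table, query, max_flips=1):
    """Return entries whose code is within max_flips bit flips of the query."""
    results = []
    for r in range(max_flips + 1):
        for positions in combinations(range(len(query)), r):
            probe = list(query)
            for p in positions:
                probe[p] = 1 - probe[p]      # flip the chosen bits
            results.extend(table.get(tuple(probe), []))
    return results

# Hypothetical 4-bit codes for three database entries.
codes = {"doc_a": (1, 0, 1, 1), "doc_b": (1, 0, 1, 0), "doc_c": (0, 1, 0, 0)}
table = build_table(codes)

exact = lookup(table, (1, 0, 1, 1), max_flips=0)
near = lookup(table, (1, 0, 1, 1), max_flips=1)
print(exact, near)
```

Exact matches are returned first, followed by entries at increasing Hamming distance; the number of probes grows quickly with `max_flips`, which is why short codes are preferred.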
Anomaly detection
Another application for autoencoders is anomaly detection.[An, J., & Cho, S. (2015). Variational Autoencoder based Anomaly Detection using Reconstruction Probability. ''Special Lecture on IE'', ''2'', 1-18.] By learning to replicate the most salient features in the training data under some of the constraints described previously, the model is encouraged to learn to precisely reproduce the most frequently observed characteristics. When facing anomalies, the model's reconstruction performance should worsen. In most cases, only data with normal instances are used to train the autoencoder; in others, the frequency of anomalies is small compared to the observation set so that its contribution to the learned representation could be ignored. After training, the autoencoder will accurately reconstruct "normal" data, while failing to do so with unfamiliar anomalous data. Reconstruction error (the error between the original data and its low dimensional reconstruction) is used as an anomaly score to detect anomalies.
Recent literature has however shown that certain autoencoding models can, counterintuitively, be very good at reconstructing anomalous examples and consequently not able to reliably perform anomaly detection.
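Scoring by reconstruction error can be illustrated with a stand-in for a trained autoencoder. In the sketch below, `reconstruct` is an orthogonal projection onto a fixed line, which plays the role of decoder(encoder(x)); that stand-in, the toy data, and the rule of taking the largest score seen on normal data as the threshold are all assumptions made for the example.

```python
def reconstruct(x, direction):
    """Stand-in for decoder(encoder(x)): orthogonal projection onto a line.

    `direction` must be a unit vector; a trained autoencoder would replace this.
    """
    z = sum(xi * di for xi, di in zip(x, direction))     # 1-dimensional code
    return [z * di for di in direction]

def anomaly_score(x, direction):
    """Squared L2 reconstruction error, used as the anomaly score."""
    x_rec = reconstruct(x, direction)
    return sum((xi - ri) ** 2 for xi, ri in zip(x, x_rec))

direction = (0.6, 0.8)                                   # unit vector
normal = [(0.6, 0.8), (1.2, 1.7), (-0.3, -0.4), (3.0, 3.9)]

# Assumed thresholding rule: the largest score observed on normal data.
threshold = max(anomaly_score(x, direction) for x in normal)

candidate = (2.0, -1.0)
score = anomaly_score(candidate, direction)
is_anomaly = score > threshold
print(threshold, score, is_anomaly)
```

Points close to the line are reconstructed almost exactly and score near zero, while the candidate lies far from it and receives a much larger score. In practice the threshold is usually chosen from a validation set or a quantile of the normal scores.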
Image processing
The characteristics of autoencoders are useful in image processing.
One example can be found in lossy image compression, where autoencoders outperformed other approaches and proved competitive against JPEG 2000.
Another useful application of autoencoders in image preprocessing is image denoising.
Autoencoders found use in more demanding contexts such as medical imaging where they have been used for image denoising as well as super-resolution. In image-assisted diagnosis, experiments have applied autoencoders for breast cancer detection and for modelling the relation between the cognitive decline of Alzheimer's disease and the latent features of an autoencoder trained with MRI.
Drug discovery
In 2019 molecules generated with variational autoencoders were validated experimentally in mice.
Popularity prediction
Recently, a stacked autoencoder framework produced promising results in predicting popularity of social media posts, which is helpful for online advertising strategies.
Machine translation
Autoencoders have been applied to machine translation, which is usually referred to as neural machine translation (NMT). Unlike traditional autoencoders, the output does not match the input: it is in another language. In NMT, texts are treated as sequences to be encoded into the learning procedure, while on the decoder side sequences in the target language(s) are generated. Language-specific autoencoders incorporate further linguistic features into the learning procedure, such as Chinese decomposition features. Machine translation is now rarely done with autoencoders; transformer networks are used instead.
See also
* Representation learning
* Sparse dictionary learning
* Deep learning