Vanishing Gradient Problem

In machine learning, the vanishing gradient problem is the problem of greatly diverging gradient magnitudes between earlier and later layers encountered when training neural networks with backpropagation. In such methods, neural network weights are updated in proportion to the partial derivative of the loss function with respect to each weight. As the number of forward propagation steps in a network increases, for instance due to greater network depth, the gradients of earlier weights are calculated with increasingly many multiplications. These multiplications shrink the gradient magnitude, so the gradients of earlier weights become exponentially smaller than the gradients of later weights. This difference in gradient magnitude can introduce instability into the training process, slow it, or halt it entirely. For instance, consider the hyperbolic tangent activation function: its derivative lies in the range (0, 1], and the product of repeated multiplication with such derivatives decreases exponentially. The inverse problem, in which weight gradients at earlier layers become exponentially larger, is called the exploding gradient problem.

Backpropagation allowed researchers to train supervised deep artificial neural networks from scratch, initially with little success. Hochreiter's diplom thesis of 1991 formally identified the reason for this failure as the "vanishing gradient problem", which affects not only many-layered feedforward networks but also recurrent networks. The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time step of an input sequence processed by the network (the combination of unfolding and backpropagation is termed backpropagation through time).
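
As a small numerical illustration of this shrinkage (a sketch with arbitrary sample pre-activations, not taken from the article), multiplying a chain of hyperbolic-tangent derivatives drives the product toward zero:

```python
import numpy as np

# Minimal sketch: repeatedly multiply by tanh derivatives, which lie in (0, 1].
# The pre-activation values below are arbitrary illustrative samples.
rng = np.random.default_rng(0)
pre_activations = rng.normal(loc=0.0, scale=1.5, size=30)

product = 1.0
for depth, z in enumerate(pre_activations, start=1):
    local_grad = 1.0 - np.tanh(z) ** 2   # derivative of tanh at z
    product *= local_grad
    if depth % 10 == 0:
        print(f"after {depth:2d} factors, accumulated gradient factor ~ {product:.3e}")
```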


Prototypical models

This section is based on the paper ''On the difficulty of training Recurrent Neural Networks'' by Pascanu, Mikolov, and Bengio.


Recurrent network model

A generic recurrent network has hidden states h_1, h_2, ..., inputs u_1, u_2, ..., and outputs x_1, x_2, .... Let it be parametrized by \theta, so that the system evolves as

(h_t, x_t) = F(h_{t-1}, u_t, \theta)

Often, the output x_t is a function of h_t, as some x_t = G(h_t). The vanishing gradient problem already presents itself clearly when x_t = h_t, so we simplify our notation to the special case with

x_t = F(x_{t-1}, u_t, \theta)

Now, take its differential:

\begin{aligned}
dx_t &= \nabla_\theta F(x_{t-1}, u_t, \theta)\, d\theta + \nabla_x F(x_{t-1}, u_t, \theta)\, dx_{t-1} \\
&= \nabla_\theta F(x_{t-1}, u_t, \theta)\, d\theta + \nabla_x F(x_{t-1}, u_t, \theta)\left(\nabla_\theta F(x_{t-2}, u_{t-1}, \theta)\, d\theta + \nabla_x F(x_{t-2}, u_{t-1}, \theta)\, dx_{t-2}\right) \\
&= \cdots \\
&= \left(\nabla_\theta F(x_{t-1}, u_t, \theta) + \nabla_x F(x_{t-1}, u_t, \theta)\, \nabla_\theta F(x_{t-2}, u_{t-1}, \theta) + \cdots\right) d\theta
\end{aligned}

Training the network requires us to define a loss function to be minimized. Let it be L(x_T, u_1, ..., u_T); then minimizing it by gradient descent gives

\Delta\theta = -\eta \cdot \left[\nabla_x L(x_T)\left(\nabla_\theta F(x_{T-1}, u_T, \theta) + \nabla_x F(x_{T-1}, u_T, \theta)\, \nabla_\theta F(x_{T-2}, u_{T-1}, \theta) + \cdots\right)\right]^T

where \eta is the learning rate. The vanishing/exploding gradient problem appears because there are repeated multiplications of the form

\nabla_x F(x_{t-1}, u_t, \theta)\, \nabla_x F(x_{t-2}, u_{t-1}, \theta)\, \nabla_x F(x_{t-3}, u_{t-2}, \theta) \cdots
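
To see these repeated multiplications concretely, the sketch below accumulates the product of step Jacobians \nabla_x F for an assumed concrete recurrence x_t = tanh(W x_{t-1} + V u_t); this particular F, the sizes, and the weight scales are illustrative choices, not from the paper:

```python
import numpy as np

# Sketch of the repeated Jacobian products nabla_x F(x_{t-1}, u_t, theta) ...
# for an assumed recurrence x_t = tanh(W x_{t-1} + V u_t).
rng = np.random.default_rng(1)
n = 20
W = rng.normal(scale=0.3 / np.sqrt(n), size=(n, n))   # recurrent weights (arbitrary scale)
V = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))   # input weights

x = np.zeros(n)
jacobian_product = np.eye(n)
for t in range(1, 51):
    u_t = rng.normal(size=n)
    pre = W @ x + V @ u_t
    x = np.tanh(pre)
    # Jacobian of one step w.r.t. the previous state: diag(1 - tanh^2(pre)) @ W
    step_jacobian = np.diag(1.0 - np.tanh(pre) ** 2) @ W
    jacobian_product = step_jacobian @ jacobian_product
    if t % 10 == 0:
        norm = np.linalg.norm(jacobian_product, 2)
        print(f"t = {t:2d}: ||d x_t / d x_0|| ~ {norm:.3e}")
```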


Example: recurrent network with sigmoid activation

For a concrete example, consider a typical recurrent network defined by

x_t = F(x_{t-1}, u_t, \theta) = W_{rec}\, \sigma(x_{t-1}) + W_{in}\, u_t + b

where \theta = (W_{rec}, W_{in}) is the network parameter, \sigma is the sigmoid activation function, applied to each vector coordinate separately, and b is the bias vector. Then \nabla_x F(x_{t-1}, u_t, \theta) = W_{rec}\, \operatorname{diag}(\sigma'(x_{t-1})), and so

\nabla_x F(x_{t-1}, u_t, \theta)\, \nabla_x F(x_{t-2}, u_{t-1}, \theta) \cdots \nabla_x F(x_{t-k}, u_{t-k+1}, \theta)
= W_{rec}\, \operatorname{diag}(\sigma'(x_{t-1}))\; W_{rec}\, \operatorname{diag}(\sigma'(x_{t-2})) \cdots W_{rec}\, \operatorname{diag}(\sigma'(x_{t-k}))

Since |\sigma'| \leq 1, the operator norm of the above multiplication is bounded above by \|W_{rec}\|^k. So if the spectral radius of W_{rec} is \gamma < 1, then at large k the above multiplication has operator norm bounded above by \gamma^k \to 0. This is the prototypical vanishing gradient problem.

The effect of a vanishing gradient is that the network cannot learn long-range effects. Recall the gradient expression derived above:

\nabla_\theta L = \nabla_x L(x_T, u_1, ..., u_T)\left(\nabla_\theta F(x_{T-1}, u_T, \theta) + \nabla_x F(x_{T-1}, u_T, \theta)\, \nabla_\theta F(x_{T-2}, u_{T-1}, \theta) + \cdots\right)

The components of \nabla_\theta F(x, u, \theta) are just components of \sigma(x) and u, so if u_T, u_{T-1}, ... are bounded, then \|\nabla_\theta F(x_{T-k-1}, u_{T-k}, \theta)\| is also bounded by some M > 0, and so the terms in \nabla_\theta L decay as M\gamma^k. This means that, effectively, \nabla_\theta L is affected only by the first O(\gamma^{-1}) terms in the sum. If \gamma \geq 1, the above analysis does not quite work. For the prototypical exploding gradient problem, the next model is clearer.
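
A numerical sketch of this bound (random W_{rec} with small spectral radius; the sizes, scales, and inputs are arbitrary choices) compares the operator norm of the k-step Jacobian product with \gamma^k:

```python
import numpy as np

# Sketch of the bound above: for x_t = W_rec sigma(x_{t-1}) + W_in u_t + b,
# each step Jacobian is W_rec @ diag(sigma'(x_{t-1})); the norm of the k-step
# product is compared with gamma^k, gamma = spectral radius of W_rec.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n = 10
W_rec = rng.normal(scale=0.25 / np.sqrt(n), size=(n, n))
W_in = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))
b = np.zeros(n)
gamma = np.max(np.abs(np.linalg.eigvals(W_rec)))      # spectral radius of W_rec

x = rng.normal(size=n)
product = np.eye(n)
for k in range(1, 31):
    # sigma'(x) = sigma(x) * (1 - sigma(x)), applied coordinate-wise
    step_jacobian = W_rec @ np.diag(sigmoid(x) * (1.0 - sigmoid(x)))
    product = step_jacobian @ product
    x = W_rec @ sigmoid(x) + W_in @ rng.normal(size=n) + b
    if k % 10 == 0:
        print(f"k = {k:2d}: ||product|| ~ {np.linalg.norm(product, 2):.3e}, "
              f"gamma^k ~ {gamma ** k:.3e}")
```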


Dynamical systems model

Following (Doya, 1993), consider this one-neuron recurrent network with sigmoid activation:

x_{t+1} = (1-\epsilon) x_t + \epsilon\, \sigma(w x_t + b) + \epsilon w' u_t

In the small \epsilon limit, the dynamics of the network become

\frac{dx(t)}{dt} = -x(t) + \sigma(w x(t) + b) + w' u(t)

Consider first the autonomous case, with u = 0. Set w = 5.0 and vary b in [-3, -2]. As b decreases, the system first has one stable point, then has two stable points and one unstable point, and finally has one stable point again. Explicitly, the fixed points lie on the curve (x, b) = \left(x,\ \ln\left(\frac{x}{1-x}\right) - 5x\right).

Now consider \frac{\partial x(T)}{\partial x(0)} and \frac{\partial x(T)}{\partial b}, where T is large enough that the system has settled into one of the stable points. If (x(0), b) puts the system very close to an unstable point, then a tiny variation in x(0) or b would make x(T) move from one stable point to the other. This makes \frac{\partial x(T)}{\partial x(0)} and \frac{\partial x(T)}{\partial b} both very large, a case of the exploding gradient. If (x(0), b) puts the system far from an unstable point, then a small variation in x(0) has no effect on x(T), making \frac{\partial x(T)}{\partial x(0)} = 0, a case of the vanishing gradient. Note that in this case, \frac{\partial x(T)}{\partial b} \approx \frac{\partial x(\infty)}{\partial b} = \left(\frac{1}{x(1-x)} - 5\right)^{-1} neither decays to zero nor blows up to infinity. Indeed, it is the only well-behaved gradient, which explains why early research focused on learning or designing recurrent network systems that could perform long-range computations (such as outputting the first input it sees at the very end of an episode) by shaping their stable attractors. For the general case, the intuition still holds (see Figures 3, 4, and 5 of the paper).
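
A small simulation makes the two regimes concrete. The sketch below uses simple Euler integration with arbitrary step size and horizon, and picks b = -2.5, which lies inside the bistable range so that x = 0.5 is an unstable fixed point; it estimates \partial x(T)/\partial x(0) by finite differences:

```python
import numpy as np

def simulate(x0, b, w=5.0, dt=0.01, T=50.0):
    """Euler-integrate dx/dt = -x + sigma(w*x + b) with u = 0 (a rough sketch)."""
    x = x0
    for _ in range(int(T / dt)):
        x += dt * (-x + 1.0 / (1.0 + np.exp(-(w * x + b))))
    return x

b = -2.5          # inside the bistable range; x = 0.5 is then an unstable fixed point
delta = 1e-4      # small perturbation of the initial state

for x0, label in [(0.5, "near the unstable point"), (0.9, "far from it")]:
    sensitivity = (simulate(x0 + delta, b) - simulate(x0, b)) / delta
    print(f"x(0) = {x0}: dx(T)/dx(0) ~ {sensitivity:.3e}  ({label})")
```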


Geometric model

Continue using the above one-neuron network, fixing w = 5, x(0) = 0.5, u(t) = 0, and consider a loss function defined by L(x(T)) = (0.855 - x(T))^2. This produces a rather pathological loss landscape: as b approaches -2.5 from above, the loss approaches zero, but as soon as b crosses -2.5, the attractor basin changes and the loss jumps to 0.50. Consequently, attempting to train b by gradient descent would "hit a wall in the loss landscape" and cause an exploding gradient. A slightly more complex situation is plotted in Figure 6 of the paper.
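
The jump can be reproduced numerically with the same kind of Euler integration (step size and horizon are arbitrary choices), sweeping b across -2.5 and reporting the loss:

```python
import numpy as np

def x_final(b, x0=0.5, w=5.0, dt=0.01, T=50.0):
    """Euler-integrate dx/dt = -x + sigma(w*x + b) with u = 0 (a rough sketch)."""
    x = x0
    for _ in range(int(T / dt)):
        x += dt * (-x + 1.0 / (1.0 + np.exp(-(w * x + b))))
    return x

# Sweep b across the critical value and report the loss (0.855 - x(T))^2.
for b in [-2.40, -2.45, -2.49, -2.51, -2.55, -2.60]:
    loss = (0.855 - x_final(b)) ** 2
    print(f"b = {b:+.2f}: loss ~ {loss:.3f}")
```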


Solutions

To overcome this problem, several methods were proposed.


RNN

For recurrent neural networks, the long short-term memory (LSTM) network was designed to solve the problem (Hochreiter & Schmidhuber, 1997). For the exploding gradient problem, Pascanu et al. (2012) recommended gradient clipping: dividing the gradient vector g by \|g\|/g_{max} if \|g\| > g_{max}. This restricts the gradient vectors to a ball of radius g_{max}.
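
A minimal sketch of this clipping rule (the threshold value here is arbitrary):

```python
import numpy as np

# Sketch of gradient clipping by norm: if ||g|| exceeds a chosen threshold g_max,
# rescale g so that its norm is exactly g_max.
def clip_gradient(g, g_max=1.0):
    norm = np.linalg.norm(g)
    if norm > g_max:
        g = g * (g_max / norm)     # equivalently, divide g by ||g|| / g_max
    return g

g = np.array([3.0, 4.0])           # ||g|| = 5
print(clip_gradient(g, g_max=1.0)) # -> [0.6 0.8], norm 1
```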


Batch normalization

Batch normalization is a standard method for solving both the exploding and the vanishing gradient problems.
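
The text does not spell out the mechanism; as a minimal, hedged sketch, batch normalization re-centers and re-scales each feature over the mini-batch and then applies a learnable scale and shift (running statistics for inference are omitted):

```python
import numpy as np

# Minimal training-time batch normalization for one layer.
def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta              # learnable scale and shift

batch = np.random.default_rng(3).normal(loc=5.0, scale=10.0, size=(64, 8))
out = batch_norm(batch, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1 per feature
```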


Multi-level hierarchy

A multi-level hierarchy of networks (Schmidhuber, 1992) is pre-trained one level at a time through unsupervised learning and fine-tuned through backpropagation. J. Schmidhuber, "Learning complex, extended sequences using the principle of history compression," ''Neural Computation'', 4, pp. 234–242, 1992. Here each level learns a compressed representation of the observations that is fed to the next level.


Deep belief network

Similar ideas have been used in feed-forward neural networks for unsupervised pre-training to structure a neural network, making it first learn generally useful feature detectors. Then the network is trained further by supervised backpropagation to classify labeled data. The deep belief network model by Hinton et al. (2006) involves learning the distribution of a high-level representation using successive layers of binary or real-valued latent variables. It uses a restricted Boltzmann machine to model each new layer of higher-level features. Each new layer guarantees an increase in the lower bound of the log likelihood of the data, thus improving the model, if trained properly. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top-level feature activations. Hinton reports that his models are effective feature extractors over high-dimensional, structured data.


Faster hardware

Hardware advances have meant that from 1991 to 2015, computer power (especially as delivered by GPUs) increased around a million-fold, making standard backpropagation feasible for networks several layers deeper than when the vanishing gradient problem was recognized. Schmidhuber notes that this "is basically what is winning many of the image recognition competitions now", but that it "does not really overcome the problem in a fundamental way", since the original models tackling the vanishing gradient problem by Hinton and others were trained on a Xeon processor, not GPUs.


Residual connection

Residual connections, or skip connections, refer to the architectural motif x \mapsto f(x) + x, where f is an arbitrary neural network module. This gives the gradient \nabla f + I, where the identity term does not suffer from the vanishing or exploding gradient. During backpropagation, part of the gradient flows through the residual connections. Concretely, let the neural network (without residual connections) be f_n \circ f_{n-1} \circ \cdots \circ f_1; then with residual connections, the gradient of the output with respect to the activations at layer l is I + \nabla f_{l+1} + \nabla f_{l+2}\, \nabla f_{l+1} + \cdots. The gradient thus does not vanish in arbitrarily deep networks. Feedforward networks with residual connections can be regarded as an ensemble of relatively shallow nets. In this perspective, they resolve the vanishing gradient problem by being equivalent to ensembles of many shallow networks, for which there is no vanishing gradient problem.
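
As a rough numerical sketch (the toy layers f(x) = tanh(Wx), the depth, and the weight scale are arbitrary choices, not from the text), one can compare the end-to-end Jacobian of a plain stack with that of a residual stack:

```python
import numpy as np

# Compare gradient flow through a deep plain stack versus a residual stack,
# where each toy module is f(x) = tanh(W x).
rng = np.random.default_rng(4)
n, depth = 20, 50
weights = [rng.normal(scale=0.2 / np.sqrt(n), size=(n, n)) for _ in range(depth)]

def jacobian_norm(residual):
    x = rng.normal(size=n)
    J = np.eye(n)
    for W in weights:
        pre = W @ x
        layer_jac = np.diag(1.0 - np.tanh(pre) ** 2) @ W   # Jacobian of f(x) = tanh(Wx)
        if residual:
            x = x + np.tanh(pre)             # x <- x + f(x)
            J = (np.eye(n) + layer_jac) @ J  # gradient picks up an identity term
        else:
            x = np.tanh(pre)                 # x <- f(x)
            J = layer_jac @ J
    return np.linalg.norm(J, 2)

print("plain stack   :", f"{jacobian_norm(residual=False):.3e}")  # vanishes
print("residual stack:", f"{jacobian_norm(residual=True):.3e}")   # stays away from zero
```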


Other activation functions

Rectifiers such as ReLU suffer less from the vanishing gradient problem, because they only saturate in one direction.
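
For a quick illustrative comparison (not from the article): the logistic sigmoid's derivative is at most 0.25 and decays for large inputs, whereas the ReLU derivative is exactly 1 everywhere on the positive side:

```python
import numpy as np

# Local gradients of sigmoid vs. ReLU at a few sample pre-activations.
z = np.array([-4.0, -1.0, 0.5, 2.0, 4.0])
sigmoid = 1.0 / (1.0 + np.exp(-z))
print("sigmoid'(z):", np.round(sigmoid * (1.0 - sigmoid), 4))  # at most 0.25
print("relu'(z):   ", (z > 0).astype(float))                   # 1 for all z > 0
```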


Weight initialization

Weight initialization is another approach that has been proposed to reduce the vanishing gradient problem in deep networks. Kumar suggested that the distribution of initial weights should vary according to the activation function used, and proposed to initialize the weights of networks with the logistic activation function using a Gaussian distribution with zero mean and a standard deviation of 3.6/sqrt(N), where N is the number of neurons in a layer. Recently, Yilmaz and Poli performed a theoretical analysis of how gradients are affected by the mean of the initial weights in deep neural networks using the logistic activation function, and found that gradients do not vanish if the ''mean'' of the initial weights is set according to the formula max(−1, −8/N). This simple strategy allows networks with 10 or 15 hidden layers to be trained very efficiently and effectively using standard backpropagation.
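
A hedged sketch of the two rules quoted above; the layer sizes are arbitrary, it is assumed here that N means the fan-in of the layer, and since Yilmaz and Poli's rule prescribes only the mean, the standard deviation used below is an illustrative choice:

```python
import numpy as np

def kumar_init(n_in, n_out, rng):
    # Gaussian with zero mean and standard deviation 3.6 / sqrt(N),
    # taking N to be the fan-in of the layer (an assumption).
    return rng.normal(loc=0.0, scale=3.6 / np.sqrt(n_in), size=(n_in, n_out))

def yilmaz_poli_init(n_in, n_out, rng):
    # Gaussian whose *mean* is max(-1, -8/N); the standard deviation here is an
    # illustrative choice, since only the mean is prescribed in the text.
    return rng.normal(loc=max(-1.0, -8.0 / n_in), scale=0.1, size=(n_in, n_out))

rng = np.random.default_rng(5)
W1 = kumar_init(256, 128, rng)
W2 = yilmaz_poli_init(256, 128, rng)
print(W1.std().round(3), W2.mean().round(3))
```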


Other

Behnke relied only on the sign of the gradient (Rprop) when training his Neural Abstraction Pyramid to solve problems like image reconstruction and face localization. Neural networks can also be optimized by using a universal search algorithm on the space of neural network weights, e.g., random guessing or, more systematically, a genetic algorithm. This approach is not based on gradients and avoids the vanishing gradient problem.
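
As a sketch of the sign-based idea (an iRprop^- style variant; the constants and the toy quadratic loss are illustrative choices, not Behnke's actual setup):

```python
import numpy as np

# Sign-based Rprop-style update: per-parameter step sizes grow while the gradient
# keeps its sign and shrink when it flips; only the sign of the gradient is used,
# so a tiny gradient magnitude cannot shrink the update.
def rprop_step(w, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    same_sign = grad * prev_grad
    step = np.where(same_sign > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(same_sign < 0, np.maximum(step * eta_minus, step_min), step)
    grad = np.where(same_sign < 0, 0.0, grad)   # skip the update after a sign change
    w = w - np.sign(grad) * step
    return w, grad, step

# Toy usage on the quadratic loss L(w) = 0.5 * ||w||^2, whose gradient is w.
w = np.array([5.0, -3.0])
prev_grad, step = np.zeros_like(w), np.full_like(w, 0.1)
for _ in range(30):
    grad = w                                    # gradient of the toy loss
    w, prev_grad, step = rprop_step(w, grad, prev_grad, step)
print(w.round(3))
```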


See also

* Spectral radius

