In machine learning, the delta rule is a gradient descent learning rule for updating the weights of the inputs to artificial neurons in a single-layer neural network. It is a special case of the more general backpropagation algorithm. For a neuron j with activation function g(x), the delta rule for neuron j's i-th weight w_{ji} is given by

:\Delta w_{ji} = \alpha (t_j - y_j)\, g'(h_j)\, x_i ,

where \alpha is a small constant called the learning rate, t_j is the target output, y_j is the actual output, x_i is the i-th input, and h_j = \sum_i x_i w_{ji} is the weighted sum of the neuron's inputs, so that y_j = g(h_j). The delta rule is commonly stated in simplified form for a neuron with a linear activation function as

:\Delta w_{ji} = \alpha (t_j - y_j)\, x_i .

While the delta rule is similar to the perceptron's update rule, the derivation is different. The perceptron uses the Heaviside step function as the activation function g(h), which means that g'(h) does not exist at zero and is equal to zero elsewhere, making a direct application of the delta rule impossible.
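As a concrete illustration, the following minimal sketch applies the update \Delta w_{ji} = \alpha (t_j - y_j) g'(h_j) x_i to a single neuron. It is written in Python with NumPy; the function names and the choice of a logistic sigmoid as the differentiable activation g are assumptions made for this example, not part of the rule itself.

```python
import numpy as np

def sigmoid(h):
    # Logistic activation g(h); used here only as an example of a differentiable g.
    return 1.0 / (1.0 + np.exp(-h))

def sigmoid_prime(h):
    # Derivative g'(h) of the logistic function.
    s = sigmoid(h)
    return s * (1.0 - s)

def delta_rule_update(w, x, t, alpha=0.1):
    """One delta-rule step for a single neuron: w_i <- w_i + alpha*(t - y)*g'(h)*x_i."""
    h = np.dot(x, w)       # h_j = sum_i x_i w_ji
    y = sigmoid(h)         # y_j = g(h_j)
    return w + alpha * (t - y) * sigmoid_prime(h) * x

# Illustrative usage: repeatedly nudge one neuron's output toward a target of 1.0.
w = np.zeros(3)
x = np.array([1.0, 0.5, -0.2])
for _ in range(100):
    w = delta_rule_update(w, x, t=1.0)
```

For a linear activation g(h) = h, the factor sigmoid_prime(h) would simply be replaced by 1, giving the simplified form stated above.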


Derivation of the delta rule

The delta rule is derived by attempting to minimize the error in the output of the neural network through gradient descent. The error for a neural network with j outputs can be measured as

:E = \sum_j \tfrac{1}{2} (t_j - y_j)^2 .

In this case, we wish to move through the "weight space" of the neuron (the space of all possible values of all of the neuron's weights) in proportion to the gradient of the error function with respect to each weight. In order to do that, we calculate the partial derivative of the error with respect to each weight. For the i-th weight, this derivative can be written as

:\frac{\partial E}{\partial w_{ji}} .

Because we are only concerning ourselves with the j-th neuron, we can substitute the error formula above while omitting the summation:

:\frac{\partial E}{\partial w_{ji}} = \frac{\partial \left( \tfrac{1}{2} (t_j - y_j)^2 \right)}{\partial w_{ji}} .

Next we use the chain rule to split this into two derivatives:

:= \frac{\partial \left( \tfrac{1}{2} (t_j - y_j)^2 \right)}{\partial y_j} \, \frac{\partial y_j}{\partial w_{ji}} .

To find the left derivative, we simply apply the chain rule:

:= -\left( t_j - y_j \right) \frac{\partial y_j}{\partial w_{ji}} .

To find the right derivative, we again apply the chain rule, this time differentiating with respect to the total input to j, h_j:

:= -\left( t_j - y_j \right) \frac{\partial y_j}{\partial h_j} \, \frac{\partial h_j}{\partial w_{ji}} .

Note that the output of the j-th neuron, y_j, is just the neuron's activation function g applied to the neuron's input h_j. We can therefore write the derivative of y_j with respect to h_j simply as g's first derivative:

:= -\left( t_j - y_j \right) g'(h_j) \frac{\partial h_j}{\partial w_{ji}} .

Next we rewrite h_j in the last term as the sum over all k weights of each weight w_{jk} times its corresponding input x_k:

:= -\left( t_j - y_j \right) g'(h_j) \frac{\partial \left( \sum_k x_k w_{jk} \right)}{\partial w_{ji}} .

Because we are only concerned with the i-th weight, the only term of the summation that is relevant is x_i w_{ji}. Clearly,

:\frac{\partial \left( x_i w_{ji} \right)}{\partial w_{ji}} = x_i ,

giving us our final equation for the gradient:

:\frac{\partial E}{\partial w_{ji}} = -\left( t_j - y_j \right) g'(h_j)\, x_i .

As noted above, gradient descent tells us that our change for each weight should be proportional to the gradient. Choosing a proportionality constant \alpha and eliminating the minus sign to enable us to move the weight in the negative direction of the gradient to minimize error, we arrive at our target equation:

:\Delta w_{ji} = \alpha (t_j - y_j)\, g'(h_j)\, x_i .
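The derived gradient can be sanity-checked numerically. The sketch below (Python with NumPy; the variable names and the logistic activation are again only illustrative assumptions) compares the analytic gradient -(t_j - y_j) g'(h_j) x_i with a central finite-difference approximation of \partial E / \partial w_{ji}:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def error(w, x, t):
    # E = 1/2 (t - y)^2 for a single neuron with y = g(x . w).
    y = sigmoid(np.dot(x, w))
    return 0.5 * (t - y) ** 2

def analytic_grad(w, x, t):
    # Gradient from the derivation: -(t - y) * g'(h) * x, with g'(h) = y(1 - y) for the sigmoid.
    h = np.dot(x, w)
    y = sigmoid(h)
    return -(t - y) * (y * (1.0 - y)) * x

def numeric_grad(w, x, t, eps=1e-6):
    # Central finite differences, perturbing one weight at a time.
    g = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        g[i] = (error(w_plus, x, t) - error(w_minus, x, t)) / (2 * eps)
    return g

w = np.array([0.2, -0.4, 0.1])
x = np.array([1.0, 0.5, -0.2])
t = 1.0
print(np.allclose(analytic_grad(w, x, t), numeric_grad(w, x, t), atol=1e-8))  # True
```

The two gradients agree to within the finite-difference error, which is the behavior the derivation above predicts.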


See also

* Stochastic gradient descent
* Backpropagation
* Rescorla–Wagner model – the origin of the delta rule

