In machine learning and statistics, the learning rate is a
tuning parameter in an
optimization algorithm that determines the step size at each iteration while moving toward a minimum of a
loss function. Since it influences to what extent newly acquired information overrides old information, it metaphorically represents the speed at which a machine learning model "learns". In the
adaptive control literature, the learning rate is commonly referred to as gain.
In setting a learning rate, there is a trade-off between the rate of convergence and
overshooting. While the
descent direction is usually determined from the
gradient of the loss function, the learning rate determines how big a step is taken in that direction. Too high a learning rate will make the learning jump over minima, but too low a learning rate will either take too long to converge or get stuck in an undesirable local minimum.
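As an illustration, the following minimal Python sketch (not from any particular library; the toy loss and the names <code>lr</code> and <code>grad</code> are chosen here for exposition) runs plain gradient descent on the loss <math>f(x) = x^2</math> and shows how the step size governs convergence.
<syntaxhighlight lang="python">
def grad(x):
    return 2.0 * x  # gradient of the toy loss f(x) = x^2

x = 5.0
lr = 0.1  # the learning rate: scales each step along the negative gradient
for step in range(50):
    x = x - lr * grad(x)

print(x)  # close to the minimum at 0; try lr = 1.1 to see divergence
</syntaxhighlight>
With lr = 0.1 each iteration multiplies x by 0.8 and the sequence converges; with lr = 1.1 each iteration multiplies x by -1.2 and the sequence diverges, illustrating the jumping-over-minima behaviour described above.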
In order to achieve faster convergence, prevent oscillations, and avoid getting stuck in undesirable local minima, the learning rate is often varied during training, either in accordance with a learning rate schedule or by using an adaptive learning rate.
The learning rate and its adjustments may also differ per parameter, in which case it is a
diagonal matrix
that can be interpreted as an approximation to the
inverse of the
Hessian matrix in
Newton's method
. The learning rate is related to the step length determined by inexact
line search in
quasi-Newton methods and related optimization algorithms.
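As a hypothetical illustration of a per-parameter learning rate acting as a diagonal preconditioner, the sketch below (the variable names and the toy quadratic loss are assumptions for exposition, not any library's API) picks each parameter's rate as the inverse of the corresponding diagonal entry of the Hessian, so a single step reaches the minimum, exactly as Newton's method would.
<syntaxhighlight lang="python">
import numpy as np

theta = np.array([5.0, 5.0])
# Toy quadratic loss f(theta) = 0.5 * (a1*t1^2 + a2*t2^2) with curvatures a = (1, 100).
a = np.array([1.0, 100.0])

def grad(theta):
    return a * theta  # gradient of the toy loss

# One learning rate per parameter, acting like a diagonal matrix on the
# gradient; chosen here as 1/curvature, i.e. the exact inverse of the
# (diagonal) Hessian, so one step lands on the minimum.
lr = 1.0 / a
theta = theta - lr * grad(theta)
print(theta)  # [0. 0.]
</syntaxhighlight>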
Learning rate schedule
The initial rate can be left as a system default or selected using a range of techniques. A learning rate schedule changes the learning rate during learning and is most often changed between epochs/iterations. This is mainly done with two parameters: decay and momentum. There are many different learning rate schedules, but the most common are time-based, step-based and exponential.
Decay serves to settle the learning in a good place and avoid oscillations, a situation that may arise when a constant learning rate that is too high makes the learning jump back and forth over a minimum; it is controlled by a hyperparameter.
Momentum is analogous to a ball rolling down a hill; we want the ball to settle at the lowest point of the hill (corresponding to the lowest error). Momentum both speeds up the learning (increasing the learning rate) when the error cost gradient is heading in the same direction for a long time and avoids local minima by 'rolling over' small bumps. Momentum is controlled by a hyperparameter analogous to a ball's mass, which must be chosen manually: too high and the ball will roll over minima which we wish to find; too low and it will not fulfil its purpose.
The formula for factoring in the momentum is more complex than for decay but is most often built into deep learning libraries such as
Keras.
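For concreteness, the following sketch implements one common formulation of the momentum update (<math>v \leftarrow \mu v - \eta \nabla f</math>, <math>\theta \leftarrow \theta + v</math>) on a toy loss; the names are illustrative rather than any library's API.
<syntaxhighlight lang="python">
def grad(x):
    return 2.0 * x  # gradient of the toy loss f(x) = x^2

x, v = 5.0, 0.0
lr, mu = 0.1, 0.9  # learning rate and momentum coefficient (the "mass")
for step in range(100):
    v = mu * v - lr * grad(x)  # accumulate a velocity from past gradients
    x = x + v                  # move by the velocity, not the raw gradient

print(x)  # approaches the minimum at 0
</syntaxhighlight>
Because the velocity accumulates gradients that point in a consistent direction, the effective step grows on long slopes and can carry the iterate over small bumps, matching the rolling-ball picture above.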
Time-based learning schedules alter the learning rate depending on the learning rate of the previous time iteration. Factoring in the decay, the mathematical formula for the learning rate is:
<math>\eta_{n+1} = \frac{\eta_n}{1 + dn}</math>
where <math>\eta_n</math> is the learning rate, <math>d</math> is a decay parameter and <math>n</math> is the iteration step.
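A minimal sketch of this schedule, with illustrative values for the initial rate and decay parameter:
<syntaxhighlight lang="python">
eta = 0.1  # initial learning rate (illustrative value)
d = 0.01   # decay parameter (illustrative value)
for n in range(5):
    print(n, eta)
    eta = eta / (1.0 + d * n)  # eta_{n+1} = eta_n / (1 + d*n)
</syntaxhighlight>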
Step-based learning schedules change the learning rate according to some predefined steps. The decay application formula is here defined as:
<math>\eta_n = \eta_0 d^{\left\lfloor \frac{1+n}{r} \right\rfloor}</math>
where <math>\eta_n</math> is the learning rate at iteration <math>n</math>, <math>\eta_0</math> is the initial learning rate, <math>d</math> is how much the learning rate should change at each drop (0.5 corresponds to a halving) and <math>r</math> corresponds to the ''drop rate'', or how often the rate should be dropped (10 corresponds to a drop every 10 iterations). The ''floor'' function (<math>\lfloor \cdot \rfloor</math>) here drops the value of its input to 0 for all values smaller than 1.
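A minimal sketch of this schedule, printing the rate at a few illustrative iterations:
<syntaxhighlight lang="python">
import math

eta0 = 0.1  # initial learning rate (illustrative value)
d = 0.5     # factor applied at each drop (0.5 halves the rate)
r = 10      # drop the rate every 10 iterations
for n in [0, 9, 10, 19, 20]:
    eta = eta0 * d ** math.floor((1 + n) / r)
    print(n, eta)  # 0.1 until n = 9, then 0.05, then 0.025 at n = 19, ...
</syntaxhighlight>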
Exponential learning schedules are similar to step-based, but instead of steps, a decreasing exponential function is used. The mathematical formula for factoring in the decay is:
<math>\eta_n = \eta_0 e^{-dn}</math>
where <math>d</math> is a decay parameter.
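A minimal sketch, again with illustrative values:
<syntaxhighlight lang="python">
import math

eta0 = 0.1  # initial learning rate (illustrative value)
d = 0.1     # decay parameter (illustrative value)
for n in range(0, 50, 10):
    print(n, eta0 * math.exp(-d * n))  # eta_n = eta_0 * exp(-d*n)
</syntaxhighlight>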
Adaptive learning rate
The issue with learning rate schedules is that they all depend on hyperparameters that must be manually chosen for each given learning session and may vary greatly depending on the problem at hand or the model used. To combat this, there are many different types of
adaptive gradient descent algorithms such as
Adagrad, Adadelta,
RMSprop, and
Adam, which are generally built into deep learning libraries such as
Keras.
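As one concrete example of the idea, the sketch below implements the basic AdaGrad update in plain NumPy (the names and toy loss are illustrative, not a library implementation): each parameter's effective step shrinks with its accumulated squared gradients, so parameters with large, frequent gradients automatically get smaller steps without a hand-tuned schedule.
<syntaxhighlight lang="python">
import numpy as np

def grad(theta):
    return 2.0 * theta  # gradient of the toy loss f(theta) = sum(theta^2)

theta = np.array([5.0, 0.5])
cache = np.zeros_like(theta)  # running sum of squared gradients
lr, eps = 1.0, 1e-8

for step in range(200):
    g = grad(theta)
    cache += g ** 2
    theta -= lr * g / (np.sqrt(cache) + eps)  # per-parameter adaptive step

print(theta)  # both coordinates approach 0 despite very different scales
</syntaxhighlight>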
See also
*
Hyperparameter (machine learning)
*
Hyperparameter optimization
*
Stochastic gradient descent
*
Variable metric methods
*
Overfitting
*
Backpropagation
*
AutoML
*
Model selection
*
Self-tuning
External links
*{{cite web |first=Nando |last=de Freitas |title=Optimization |work=Deep Learning Lecture 6 |location=University of Oxford |date=February 12, 2015 |url=https://www.youtube.com/watch?v=0qUAb94CpOw&list=PLE6Wd9FR--EfW8dtjAuPoTuPcqmOV53Fu&index=9 |via=YouTube}}
Machine learning
Model selection
Optimization algorithms and methods