In (unconstrained)

mathematical optimization Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criterion, from some set of available alternatives. It is generally divided into two subfi ...

, a backtracking line search is a

line search In optimization, the line search strategy is one of two basic iterative approaches to find a local minimum \mathbf^* of an objective function f:\mathbb R^n\to\mathbb R. The other approach is trust region. The line search approach first finds a ...

method to determine the amount to move along a given search direction. Its use requires that the

objective function In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cos ...

differentiable In mathematics, a differentiable function of one real variable is a function whose derivative exists at each point in its domain. In other words, the graph of a differentiable function has a non- vertical tangent line at each interior point i ...

and that its

gradient In vector calculus, the gradient of a scalar-valued differentiable function of several variables is the vector field (or vector-valued function) \nabla f whose value at a point p is the "direction and rate of fastest increase". If the gr ...

is known. The method involves starting with a relatively large estimate of the

step size In machine learning and statistics, the learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function. Since it influences to what extent newly ac ...

for movement along the line search direction, and iteratively shrinking the step size (i.e., "backtracking") until a decrease of the objective function is observed that adequately corresponds to the amount of decrease that is expected, based on the step size and the local gradient of the objective function. The stopping criterion is known as the Armijo–Goldstein condition. Backtracking line search is typically used for

gradient descent In mathematics, gradient descent (also often called steepest descent) is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of ...

(GD), but it can also be used in other contexts. For example, it can be used with

Newton's method In numerical analysis, Newton's method, also known as the Newton–Raphson method, named after Isaac Newton and Joseph Raphson, is a root-finding algorithm which produces successively better approximations to the roots (or zeroes) of a real ...

if the

Hessian matrix In mathematics, the Hessian matrix or Hessian is a square matrix of second-order partial derivatives of a scalar-valued function, or scalar field. It describes the local curvature of a function of many variables. The Hessian matrix was developed ...

positive definite In mathematics, positive definiteness is a property of any object to which a bilinear form or a sesquilinear form may be naturally associated, which is positive-definite. See, in particular: * Positive-definite bilinear form * Positive-definite ...

Motivation

Given a starting position

\mathbf

and a search direction

\mathbf

, the task of a line search is to determine a step size

\alpha > 0

that adequately reduces the objective function

f:\mathbb R^n\to\mathbb R

(assumed

C^1

i.e. continuously differentiable), i.e., to find a value of

\alpha

that reduces

f(\mathbf+\alpha\,\mathbf)

relative to

f(\mathbf)

. However, it is usually undesirable to devote substantial resources to finding a value of

\alpha

to precisely minimize

f

. This is because the computing resources needed to find a more precise minimum along one particular direction could instead be employed to identify a better search direction. Once an improved starting point has been identified by the line search, another subsequent line search will ordinarily be performed in a new direction. The goal, then, is just to identify a value of

\alpha

that provides a reasonable amount of improvement in the objective function, rather than to find the actual minimizing value of

\alpha

. The backtracking line search starts with a large estimate of

\alpha

and iteratively shrinks it. The shrinking continues until a value is found that is small enough to provide a decrease in the objective function that adequately matches the decrease that is expected to be achieved, based on the local function gradient

\nabla f(\mathbf)\,.

Define the local slope of the function of

\alpha

along the search direction

\mathbf

m = \nabla f(\mathbf)^\,\mathbf=\langle\nabla f(\mathbf),\mathbf\rangle

(where

\langle\cdot,\cdot\rangle

denotes the

dot product In mathematics, the dot product or scalar productThe term ''scalar product'' means literally "product with a scalar as a result". It is also used sometimes for other symmetric bilinear forms, for example in a pseudo-Euclidean space. is an alg ...

). It is assumed that

\mathbf

is a vector for which some local decrease is possible, i.e., it is assumed that

m < 0

. Based on a selected control parameter

c\,\in\,(0,1)

, the Armijo–Goldstein condition tests whether a step-wise movement from a current position

\mathbf

to a modified position

\mathbf+\alpha\,\mathbf

achieves an adequately corresponding decrease in the objective function. The condition is fulfilled, see , if

f(\mathbf+\alpha\,\mathbf)\leq f(\mathbf)+\alpha\,c\,m\,.

This condition, when used appropriately as part of a line search, can ensure that the step size is not excessively large. However, this condition is not sufficient on its own to ensure that the step size is nearly optimal, since any value of

\displaystyle \alpha

that is sufficiently small will satisfy the condition. Thus, the backtracking line search strategy starts with a relatively large step size, and repeatedly shrinks it by a factor

\tau\,\in\,(0,1)

until the Armijo–Goldstein condition is fulfilled. The search will terminate after a finite number of steps for any positive values of

c

and

\tau

that are less than 1. For example, Armijo used for both

c

and

\tau

in .

Algorithm

This condition is from . Starting with a maximum candidate step size value

\alpha_0>0\,

, using search control parameters

\tau\,\in\,(0,1)

and

c\,\in\,(0,1)

, the backtracking line search algorithm can be expressed as follows: # Set

t = -c\,m

and iteration counter

j\,=\,0

. # Until the condition is satisfied that

f(\mathbf)-f(\mathbf+\alpha_j\,\mathbf)\geq \alpha_j\,t,

repeatedly increment

j

and set

\alpha_j=\tau\,\alpha_\,.

# Return

\alpha_j

as the solution. In other words, reduce

\alpha_0

by a factor of

\tau\,

in each iteration until the Armijo–Goldstein condition is fulfilled.

Function minimization using backtracking line search in practice

In practice, the above algorithm is typically iterated to produce a sequence

\mathbf_n

n = 1, 2, ...

, to converge to a minimum, provided such a minimum exists and

\mathbf_n

is selected appropriately in each step. For gradient descent,

\mathbf_n

is selected as

-\nabla f(\mathbf_n)

. The value of

\alpha_j

for the

j

that fulfills the Armijo–Goldstein condition depends on

\mathbf

and

\mathbf

, and is thus denoted below by

\alpha (\mathbf,\mathbf)

. It also depends on

f

\alpha_0

\tau

and

c

of course, although these dependencies can be left implicit if they are assumed to be fixed with respect to the optimization problem. The detailed steps are thus, see , : # Choose an initial starting point

\mathbf_0

and set the iteration counter

n=0

. # Until some stopping condition is satisfied, choose a descent direction

\mathbf_n

, increment

n

, and update the position to

\mathbf_=\mathbf_n+\alpha (\mathbf_n,\mathbf_n)\,\mathbf_n

. # Return

\mathbf_

as the minimizing position and

f(\mathbf_n)

as the function minimum. To assure good behavior, it is necessary that some conditions must be satisfied by

\mathbf_n

. Roughly speaking

\mathbf_n

should not be too far away from

\nabla f(\mathbf_n)

. A precise version is as follows (see e.g. ). There are constants

C_1,C_2>0

so that the following two conditions are satisfied: # For all n,

\, \mathbf_n\, \geq C_1 \,\, \nabla f(\mathbf_n)\,

. Here,

\, y\,

is the

Euclidean norm Euclidean space is the fundamental space of geometry, intended to represent physical space. Originally, that is, in Euclid's ''Elements'', it was the three-dimensional space of Euclidean geometry, but in modern mathematics there are Euclidea ...

y

. (This assures that if

\mathbf_n=0

, then also

\nabla f(\mathbf_n)=0

. More generally, if

\lim _\mathbf_n=0

, then also

\lim _\nabla f(\mathbf_n)=0

.) A more strict version requires also the reverse inequality:

\, \mathbf_n\, \leq C_3 \,\, \nabla f(\mathbf_n)\,

for a positive constant

C_3>0

. # For all n,

\, \mathbf_n\, \,\, \nabla f(\mathbf_n)\,  \leq -C_2 \,\langle\mathbf_n,\nabla f(\mathbf_n)\rangle

. (This condition ensures that the directions of

\mathbf_n

and

-\nabla f(\mathbf_n)

are similar.)

Lower bound for learning rates

This addresses the question whether there is a systematic way to find a positive number

\beta (\mathbf,\mathbf)

- depending on the function f, the point

\mathbf

and the descent direction

\mathbf

- so that all

learning rate In machine learning and statistics, the learning rate is a Hyperparameter (machine learning), tuning parameter in an Mathematical optimization, optimization algorithm that determines the step size at each iteration while moving toward a minimum of ...

\alpha \leq \beta (\mathbf,\mathbf)

satisfy Armijo's condition. When

\mathbf=-\nabla f(\mathbf)

, we can choose

\beta (\mathbf,\mathbf)

in the order of

1/L(\mathbf)\,

, where

L(\mathbf)\,

is a local Lipschitz constant for the gradient

\nabla f\,

near the point

\mathbf

(see

Lipschitz continuity In mathematical analysis, Lipschitz continuity, named after German mathematician Rudolf Lipschitz, is a strong form of uniform continuity for functions. Intuitively, a Lipschitz continuous function is limited in how fast it can change: there ...

). If the function is

C^2

, then

L(\mathbf)\,

is close to the Hessian of the function at the point

\mathbf

. See for more detail.

Upper bound for learning rates

In the same situation where

\mathbf=-\nabla f(\mathbf)

, an interesting question is how large learning rates can be chosen in Armijo's condition (that is, when one has no limit on

\alpha_0

as defined in the section "Function minimization using backtracking line search in practice"), since larger learning rates when

\mathbf_n

is closer to the limit point (if exists) can make convergence faster. For example, in

Wolfe conditions In the unconstrained minimization problem, the Wolfe conditions are a set of inequalities for performing inexact line search, especially in quasi-Newton methods, first published by Philip Wolfe in 1969. In these methods the idea is to find ::\mi ...

, there is no mention of

\alpha_0

but another condition called curvature condition is introduced. An upper bound for learning rates is shown to exist if one wants the constructed sequence

\mathbf_n

converges to a

non-degenerate critical point In mathematics, the Hessian matrix or Hessian is a square matrix of second-order partial derivatives of a scalar-valued function, or scalar field. It describes the local curvature of a function of many variables. The Hessian matrix was developed ...

, see : The learning rates must be bounded from above roughly by

, , H, , \times , , H^, , ^2

. Here H is the Hessian of the function at the limit point,

H^

is its inverse, and

, , ., ,

is the norm of a linear operator. Thus, this result applies for example when one uses Backtracking line search for Morse functions. Note that in dimension 1,

H

is a number and hence this upper bound is of the same size as the lower bound in the section "Lower bound for learning rates". On the other hand, if the limit point is degenerate, then learning rates can be unbounded. For example, a modification of backtracking line search known as unbounded backtracking gradient descent (see ) allows the learning rate to be have the size

, , \nabla f(\mathbf_n), , ^

, where

1>\gamma >0

is a constant. Experiments with simple functions such as

f(x,y)=x^4+y^4

show that unbounded backtracking gradient descent converges much faster than the basic version described in the section "Function minimization using backtracking line search in practice".

Time efficiency

An argument against the use of Backtracking line search, in particular in Large scale optimisation, is that satisfying Armijo's condition is expensive. There is a way (so-called Two-way Backtracking) to go around, with good theoretical guarantees and has been tested with good results on

deep neural networks Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised. ...

, see . (There, one can find also good/stable implementations of Armijo's condition and its combination with some popular algorithms such as Momentum and NAG, on datasets such as Cifar10 and Cifar100.) One observes that if the sequence

\mathbf_

converges (as wished when one makes use of an iterative optimisation method), then the sequence of learning rates

\alpha_n

should vary little when n is large enough. Therefore, in the search for

\alpha_n

, if one always starts from

\alpha_0

, one would waste a lot of time if it turns out that the sequence

\alpha_n

stays far away from

\alpha_0

. Instead, one should search for

\alpha_n

by starting from

\alpha_

. The second observation is that

\alpha_n

could be larger than

\alpha_

, and hence one should allow to increase learning rate (and not just decrease as in the section Algorithm). Here is the detailed algorithm for Two-way Backtracking: At step n # Set

\gamma _0=\alpha _

. Set

t = -c\,m

and iteration counter

j\,=\,0

. # (Increase learning rate if Armijo's condition is satisfied.) If

f(\mathbf)-f(\mathbf+\gamma _j\,\mathbf)\geq \gamma_j\,t,

, then while this condition and the condition that

\gamma _j\leq \alpha _0

are satisfied, repeatedly set

\gamma_=\gamma_/\tau

and increase j. # (Otherwise, reduce the learning rate if Armijo's condition is not satisfied.) If in contrast

f(\mathbf)-f(\mathbf+\gamma _0\,\mathbf)< \gamma_j\,t,

, then until the condition is satisfied that

f(\mathbf)-f(\mathbf+\gamma_j\,\mathbf)\geq \gamma_j\,t,

repeatedly increment

j

and set

\gamma_j=\tau\,\gamma_\,.

# Return

\gamma_j

for the learning rate

\alpha _n

. (In one can find a description of an algorithm with 1), 3) and 4) above, which was not tested in deep neural networks before the cited paper.) One can save time further by a hybrid mixture between two-way backtracking and the basic standard gradient descent algorithm. This procedure also has good theoretical guarantee and good test performance. Roughly speaking, we run two-way backtracking a few times, then use the learning rate we get from then unchanged, except if the function value increases. Here is precisely how it is done. One choose in advance a number

N

, and a number

m\leq N

. # Set iteration counter j=0. # At the steps

jN+1,\ldots ,jN+m

, use Two-way Backtracking. # At each step k in the set

jN+m+1,\ldots ,jN+N-1

: Set

\alpha =\alpha _

. If

f(x_)-f(x_+\alpha p_)\geq 0

, then choose

\alpha _=\alpha _

and

x_k=x_+\alpha _p_

. (So, in this case, use the learning rate

\alpha _

unchanged.) Otherwise, if

f(x_)-f(x_+\alpha p_)< 0

, use Two-way Backtracking. Increase k by 1 and repeat. # Increase j by 1.

Theoretical guarantee (for gradient descent)

Compared with Wolfe's conditions, which is more complicated, Armijo's condition has a better theoretical guarantee. Indeed, so far backtracking line search and its modifications are the most theoretically guaranteed methods among all numerical optimization algorithms concerning convergence to critical points and avoidance of

saddle points In mathematics, a saddle point or minimax point is a point on the surface of the graph of a function where the slopes (derivatives) in orthogonal directions are all zero (a critical point), but which is not a local extremum of the function ...

, see below. Critical points are points where the gradient of the objective function is 0. Local minima are critical points, but there are critical points which are not local minima. An example is saddle points.

Saddle points In mathematics, a saddle point or minimax point is a point on the surface of the graph of a function where the slopes (derivatives) in orthogonal directions are all zero (a critical point), but which is not a local extremum of the function ...

are critical points, at which there are at least one direction where the function is (local) maximum. Therefore, these points are far from being local minima. For example, if a function has at least one saddle point, then it cannot be

convex Convex or convexity may refer to: Science and technology * Convex lens, in optics Mathematics * Convex set, containing the whole line segment that joins points ** Convex polygon, a polygon which encloses a convex set of points ** Convex polytop ...

. The relevance of saddle points to optimisation algorithms is that in large scale (i.e. high-dimensional) optimisation, one likely sees more saddle points than minima, see . Hence, a good optimisation algorithm should be able to avoid saddle points. In the setting of deep learning, saddle points are also prevalent, see . Thus, to apply in deep learning, one needs results for non-convex functions. For convergence to critical points: For example, if the cost function is a

real analytic function In mathematics, an analytic function is a function that is locally given by a convergent power series. There exist both real analytic functions and complex analytic functions. Functions of each type are infinitely differentiable, but complex ...

, then it is shown in that convergence is guaranteed. The main idea is to use

Łojasiewicz inequality In real algebraic geometry, the Łojasiewicz inequality, named after Stanisław Łojasiewicz, gives an upper bound for the distance of a point to the nearest zero of a given real analytic function. Specifically, let ƒ : ''U'' � ...

which is enjoyed by a real analytic function. For non-smooth functions satisfying

, the above convergence guarantee is extended, see . In , there is a proof that for every sequence constructed by backtracking line search, a cluster point (i.e. the

limit Limit or Limits may refer to: Arts and media * ''Limit'' (manga), a manga by Keiko Suenobu * ''Limit'' (film), a South Korean film * Limit (music), a way to characterize harmony * "Limit" (song), a 2016 single by Luna Sea * "Limits", a 2019 ...

of one

subsequence In mathematics, a subsequence of a given sequence is a sequence that can be derived from the given sequence by deleting some or no elements without changing the order of the remaining elements. For example, the sequence \langle A,B,D \rangle is ...

, if the subsequence converges) is a critical point. For the case of a function with at most countably many critical points (such as a

Morse function In mathematics, specifically in differential topology, Morse theory enables one to analyze the topology of a manifold by studying differentiable functions on that manifold. According to the basic insights of Marston Morse, a typical differentiab ...

) and

compact Compact as used in politics may refer broadly to a pact or treaty; in more specific cases it may refer to: * Interstate compact * Blood compact, an ancient ritual of the Philippines * Compact government, a type of colonial rule utilized in British ...

sublevels, as well as with Lipschitz continuous gradient where one uses standard GD with learning rate <1/L (see the section "Stochastic gradient descent"), then convergence is guaranteed, see for example Chapter 12 in . Here the assumption about compact sublevels is to make sure that one deals with compact sets of the Euclidean space only. In the general case, where

f

is only assumed to be

C^1

and have at most countably many critical points, convergence is guaranteed, see . In the same reference, similarly convergence is guaranteed for other modifications of Backtracking line search (such as Unbounded backtracking gradient descent mentioned in the section "Upper bound for learning rates"), and even if the function has uncountably many critical points still one can deduce some non-trivial facts about convergence behaviour. In the stochastic setting, under the same assumption that the gradient is Lipschitz continuous and one uses a more restrictive version (requiring in addition that the sum of learning rates is infinite and the sum of squares of learning rates is finite) of diminishing learning rate scheme (see section "Stochastic gradient descent") and moreover the function is strictly convex, then the convergence is established in the well-known result , see for generalisations to less restrictive versions of a diminishing learning rate scheme. None of these results (for non-convex functions) have been proven for any other optimization algorithm so far. For avoidance of saddle points: For example, if the gradient of the cost function is Lipschitz continuous and one chooses standard GD with learning rate <1/L, then with a random choice of initial point

\mathbf_0

(more precisely, outside a set of

Lebesgue measure In measure theory, a branch of mathematics, the Lebesgue measure, named after French mathematician Henri Lebesgue, is the standard way of assigning a measure to subsets of ''n''-dimensional Euclidean space. For ''n'' = 1, 2, or 3, it coincides ...

zero), the sequence constructed will not converge to a

non-degenerate In mathematics, specifically linear algebra, a degenerate bilinear form on a vector space ''V'' is a bilinear form such that the map from ''V'' to ''V''∗ (the dual space of ''V'' ) given by is not an isomorphism. An equivalent definit ...

saddle point (proven in ), and more generally it is also true that the sequence constructed will not converge to a degenerate saddle point (proven in ). Under the same assumption that the gradient is Lipschitz continuous and one uses a diminishing learning rate scheme (see the section "Stochastic gradient descent"), then avoidance of saddle points is established in .

A special case: (standard) stochastic gradient descent (SGD)

While it is trivial to mention, if the gradient of a cost function is Lipschitz continuous, with Lipschitz constant L, then with choosing learning rate to be constant and of the size

1/L

, one has a special case of backtracking line search (for gradient descent). This has been used at least in . This scheme however requires that one needs to have a good estimate for L, otherwise if learning rate is too big (relative to 1/L) then the scheme has no convergence guarantee. One can see what will go wrong if the cost function is a smoothing (near the point 0) of the function f(t)=, t, . Such a good estimate is, however, difficult and laborious in large dimensions. Also, if the gradient of the function is not globally Lipschitz continuous, then this scheme has no convergence guarantee. For example, this is similar to an exercise in , for the cost function

f(t)=, t, ^ \,

and for whatever constant learning rate one chooses, with a random initial point the sequence constructed by this special scheme does not converge to the global minimum 0. If one does not care about the condition that learning rate must be bounded by 1/L, then this special scheme has been used much older, at least since 1847 by

Cauchy Baron Augustin-Louis Cauchy (, ; ; 21 August 178923 May 1857) was a French mathematician, engineer, and physicist who made pioneering contributions to several branches of mathematics, including mathematical analysis and continuum mechanics. He ...

, which can be called standard GD (not to be confused with stochastic gradient descent, which is abbreviated herein as SGD). In the stochastic setting (such as in the mini-batch setting in deep learning), standard GD is called

stochastic gradient descent Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of ...

, or SGD. Even if the cost function has globally continuous gradient, good estimate of the Lipschitz constant for the cost functions in deep learning may not be feasible or desirable, given the very high dimensions of

. Hence, there is a technique of fine-tuning of learning rates in applying standard GD or SGD. One way is to choose many learning rates from a grid search, with the hope that some of the learning rates can give good results. (However, if the loss function does not have global Lipschitz continuous gradient, then the example with

f(t)=, t, ^ \,

above shows that grid search cannot help.) Another way is the so-called adaptive standard GD or SGD, some representatives are Adam, Adadelta, RMSProp and so on, see the article on

Stochastic gradient descent Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of ...

. In adaptive standard GD or SGD, learning rates are allowed to vary at each iterate step n, but in a different manner from Backtracking line search for gradient descent. Apparently, it would be more expensive to use Backtracking line search for gradient descent, since one needs to do a loop search until Armijo's condition is satisfied, while for adaptive standard GD or SGD no loop search is needed. Most of these adaptive standard GD or SGD do not have the descent property

f(x_)\leq f(x_n)

, for all n, as Backtracking line search for gradient descent. Only a few has this property, and which have good theoretical properties, but they turn out to be special cases of Backtracking line search or more generally Armijo's condition . The first one is when one chooses learning rate to be a constant <1/L, as mentioned above, if one can have a good estimate of L. The second is the so called diminshing learning rate, used in the well-known paper by , if again the function has globally Lipschitz continuous gradient (but the Lipschitz constant may be unknown) and the learning rates converge to 0.

Practicalities for using in the stochastic setting and implementation in deep neural networks

The algorithm described above is for the deterministic setting, which as reported above has good theoretical guarantees. When one uses Armijo's algorithm in real life settings where random features play important roles, there are practicalities one needs to take care about. This section describes some main points to be noted in the more theoretical setting of stochastic optimization and the more realistic setting of mini-batch in deep neural networks. 1) Stochastic optimization: The typical statement of this setting is as follows: Assume that we have a function

f(x,\xi )

depending on a random variable

\xi

. Now we want to find minimum of the expectation function

F(x):=E(f(x,\xi ))

, here

E(.)

computes the

expectation Expectation or Expectations may refer to: Science * Expectation (epistemic) * Expected value, in mathematical probability theory * Expectation value (quantum mechanics) * Expectation–maximization algorithm, in statistics Music * ''Expectation' ...

with respect to the random variable

\xi

. In general, it is very difficult to compute the expectation, in particular when working for example with deep neural networks (see the next part). Therefore, we can only do as follows: At step n: choose randomly

k_n

variables

\xi _,\xi _,\ldots ,\xi _

(with respect to the given distribution), and replace the (unknown) function

F(x)

by an approximation

F_n(x):=\frac\sum _^f(x,\xi _i)

. We need to modify Armijo's backtracking line search as follows: Then the descent direction is chosen as

p_n=-\nabla F_n(x)

and the Armijo's learning rate

\alpha (x_n,p_n)

is chosen relative to the function

F_n(x)

. It is noted that in general

k_n

is allowed to change. There are works which let

k_n

increase. The following example explains the motivation for such a scheme (also, see theoretical justification later). However, it will be explained in the section about implementing in deep neural networks that there are difficulties/not very good performance with implementing this scheme. Example 1: Consider the simple function

F(x)=x^2

, which is convex with a unique critical point

x=0

. Recall the

normal distribution In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is : f(x) = \frac e^ The parameter \mu i ...

N(1,\sigma ^2)

. Then

F(x)=E(\xi x^2)

, where

\xi

is a random variable with distribution

N(1,\sigma ^2)

. Then, one can check with Python, that if we choose

\sigma ^2=100

and

k_n=1

for all step n, and run Armijo's Backtracking line search with maximum learning rate

\alpha _0=1

, in this setting as described above, then we have divergence (not convergence as one would hope for this too simple convex function)! [What happens here is that, while small, there is a non-zero probability of the random value of

\xi

<0

, in which case what Armijo's backtracking line search will do is to push away from the point 0 by a large amount.] Another way is, originating from the classical work , in SGD to keep

k_n

constant (even can choose to be always 1), but let the learning rates

\delta _n

going to 0. [Even if the random value of

\xi

<0

, since the learning rate

\delta _n

is smaller and smaller, this scheme will push away from 0 less than backtracking.] While this scheme works well also for this example, again in the setting of deep neural networks the experimental results are not very good. It is worthy to note also that if one uses SGD and keeps

k_n

constant, but also learning rates constant, then one can observe divergence even for some values of the learning rates for which there is convergence guarantee for

F(x)

. So what mentions in Example 1 is not the fault of backtracking itself, but also for SGD, the most studied optimization algorithm. Example 1'. The same function F(x) and distribution

N(1,\sigma ^2)

. Choose

\sigma ^2=100

and

k_n=1

for all step n. Note that the function F(x) is

C^_L

with

L=2

. Hence, by classical theory, for learning rate

<1/(2L)=0.25

, SGD with constant learning rate will converge for F(x). However, one check with Python that for the associated stochastic optimization problem, even with the choice of a smaller learning rate 0.1, one observes divergence to infinity! On the other hand, with the learning rate 0.05 then one observes convergence to the minimum 0. Therefore, it seems that even for this simple function, the choice of (constant) learning rate must take into account the distribution, and not just that from the deterministic function F(x), for to obtain convergence for the stochastic optimization problem. [Note: if here one runs Armijo's Backtracking line search with maximum learning rate

\alpha _0=0.2>0.1

, then one observes convergence! Hence, still Armijo's Backtracking line search helps with choosing a good learning rate to use, compared to SGD with constant learning rates.] A typical theoretical result (in the literature, one may find weaker results than what stated here, either because requiring stronger assumptions or proving weaker conclusions) in this setting is as follows: Prototype theorem: Let

F(x)

be convex (or strongly convex) and in

C^_L

. Then both of the following schemes guarantee convergence: Scheme 1: Backtracking with increasing "mini-batch" sizes

k _n

. Scheme 2: SGD with learning rates decreasing to 0. Why these schemes yield not very good experimental results in deep neural networks can be because: - The implementation does not take into account the differences between stochastic optimization and deep neural networks (see more in the next section); and/or - The assumptions of these theoretical results are not satisfied (too strong to be satisfied) by the cost functions in deep neural networks. It is worthy to notice that current successful implementations of backtracking in deep neural networks use constant mini-batch sizes

k_n=constant

. However, the difference with Example 1 is that one does not choose k_n=1 since that is too small. Indeed, one has the following. Example 2. Let the function F(x) and the distribution

N(1,\sigma ^2)

be as in Example 1, where

\sigma ^2=100

. If one chooses

k_n=500

for all n, then one can check with Python that Armijo's backtracking line search with maximum learning rate

\alpha _0=1

in this stochastic setting converges! This can be explained that in this example, choosing

k_n=1

for all n is not good enough to approximate the function F(x), but choosing

k_n=500

for all n is a good approximation. The same happens with deep neural networks: one needs to choose a big enough mini-batch size depending on the specific question and setting (see next section). People who are familiar with undergraduate calculus can find an analogy in using Riemann sums to approximate an

integral In mathematics, an integral assigns numbers to functions in a way that describes displacement, area, volume, and other concepts that arise by combining infinitesimal data. The process of finding integrals is called integration. Along with ...

. 2) deep neural networks: While stochastic optimization is meant to be a theoretical model for what happens in training deep neural networks, in practice there are important differences in many aspects: assumptions, resources, ways to proceed and goals. Assumptions: in stochastic optimization, among many others, there is usually the requirement that the cost function is (strongly-)convex. This is hardly satisfied or verified in practice.

Summary

In summary, backtracking line search (and its modifications) is a method which is easy to implement, is applicable for very general functions, has very good theoretical guarantee (for both convergence to critical points and avoidance of saddle points) and works well in practice. Several other methods which have good theoretical guarantee, such as diminishing learning rates or standard GD with learning rate <1/L – both require the gradient of the objective function to be Lipschitz continuous, turn out to be a special case of Backtracking line search or satisfy Armijo's condition. Even though ''a priori'' one needs the cost function to be continuously differentiable to apply this method, in practice one can apply this method successfully also for functions which are continuously differentiable on a dense open subset such as

f(t)=, t,

f(t)=ReLu(t)=\max\

References

* * * * * * * * * * * * * * * {{cite journal , last1 = Truong , first1 = T. T. , first2=H.-T. , last2=Nguyen , year = 2020 , title = Backtracking Gradient Descent Method and Some Applications in Large Scale Optimisation. Part 2: Algorithms and Experiments , journal = Applied Mathematics & Optimization , date = 6 September 2020 , volume = 84 , issue = 3 , pages = 2557–2586 , doi=10.1007/s00245-020-09718-8, doi-access = free Mathematical optimization Optimization algorithms and methods

Motivation

Algorithm

Function minimization using backtracking line search in practice

Lower bound for learning rates

Upper bound for learning rates

Time efficiency

Theoretical guarantee (for gradient descent)

A special case: (standard) stochastic gradient descent (SGD)

Practicalities for using in the stochastic setting and implementation in deep neural networks

Summary

See also

References