In [[numerical optimization]], the nonlinear conjugate gradient method generalizes the [[conjugate gradient method]] to nonlinear optimization. For a quadratic function <math>f(x)</math>
::<math>f(x) = \|Ax - b\|^2,</math>
the minimum of <math>f</math> is obtained when the [[gradient]] is 0:
::<math>\nabla_x f = 2 A^T (Ax - b) = 0.</math>
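Writing out this stationarity condition makes the underlying linear system explicit: setting the gradient to zero is equivalent to the normal equations of the least-squares problem,
::<math>2 A^T (Ax - b) = 0 \quad\Longleftrightarrow\quad A^T A\, x = A^T b,</math>
which is the symmetric positive-(semi)definite system addressed by the linear conjugate gradient method.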
Whereas linear conjugate gradient seeks a solution to the linear equation <math>A^T A\, x = A^T b</math>, the nonlinear conjugate gradient method is generally used to find the [[local minimum]] of a nonlinear function using its [[gradient]] <math>\nabla_x f</math> alone. It works when the function is approximately quadratic near the minimum, which is the case when the function is twice differentiable at the minimum and the second derivative is non-singular there.
Given a function <math>f(x)</math> of <math>N</math> variables to minimize, its gradient <math>\nabla_x f</math> indicates the direction of maximum increase. One simply starts in the opposite ([[steepest descent]]) direction:
::<math>\Delta x_0 = -\nabla_x f(x_0)</math>
with an adjustable step length <math>\alpha</math> and performs a [[line search]] in this direction until it reaches the minimum of <math>f</math>:
::<math>\alpha_0 := \arg\min_\alpha f(x_0 + \alpha\, \Delta x_0),</math>
::<math>x_1 = x_0 + \alpha_0\, \Delta x_0.</math>
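In practice the exact minimization over <math>\alpha</math> is usually replaced by an approximate line search. The following is a minimal Python sketch of a backtracking (Armijo) line search; the function and gradient are passed in as callables, and the names and default constants are illustrative rather than taken from this article.

<syntaxhighlight lang="python">
import numpy as np

def backtracking_line_search(f, grad_f, x, direction, alpha0=1.0, rho=0.5, c=1e-4):
    """Approximate the exact line search arg min_a f(x + a*direction) with a
    backtracking (Armijo) rule.  Illustrative sketch; constants are typical defaults."""
    alpha = alpha0
    fx = f(x)
    slope = np.dot(grad_f(x), direction)   # directional derivative; negative for a descent direction
    while f(x + alpha * direction) > fx + c * alpha * slope and alpha > 1e-12:
        alpha *= rho                        # shrink the step until sufficient decrease holds
    return alpha
</syntaxhighlight>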
After this first iteration in the steepest direction <math>\Delta x_0</math>, the following steps constitute one iteration of moving along a subsequent conjugate direction <math>s_n</math>, where <math>s_0 = \Delta x_0</math>:
# Calculate the steepest direction: <math>\Delta x_n = -\nabla_x f(x_n)</math>,
# Compute <math>\beta_n</math> according to one of the formulas below,
# Update the conjugate direction: <math>s_n = \Delta x_n + \beta_n s_{n-1}</math>,
# Perform a line search: optimize <math>\alpha_n = \arg\min_\alpha f(x_n + \alpha s_n)</math>,
# Update the position: <math>x_{n+1} = x_n + \alpha_n s_n</math>.
With a pure quadratic function the minimum is reached within ''N'' iterations (excepting roundoff error), but a non-quadratic function makes slower progress. Subsequent search directions lose conjugacy, requiring the search direction to be reset to the steepest descent direction at least every ''N'' iterations, or sooner if progress stops. However, resetting every iteration turns the method into [[steepest descent]]. The algorithm stops when it finds the minimum, determined when no progress is made after a direction reset (i.e. in the steepest descent direction), or when some tolerance criterion is reached.
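A minimal Python sketch of the whole loop follows. It assumes the backtracking_line_search helper from the earlier sketch and uses the Polak–Ribière formula with <math>\max(0,\cdot)</math> truncation (one of the choices discussed below), which resets the direction to steepest descent automatically.

<syntaxhighlight lang="python">
import numpy as np

def nonlinear_cg(f, grad_f, x0, tol=1e-8, max_iter=1000):
    """Minimal sketch of the iteration described above, assuming the
    backtracking_line_search helper from the earlier sketch.  Uses the
    Polak-Ribiere beta truncated at zero, which resets the direction to
    steepest descent whenever beta would be negative."""
    x = np.asarray(x0, dtype=float)
    dx = -grad_f(x)            # initial steepest-descent direction
    s = dx                     # s_0 = Delta x_0
    for n in range(max_iter):
        alpha = backtracking_line_search(f, grad_f, x, s)
        x = x + alpha * s                      # update the position
        dx_new = -grad_f(x)                    # new steepest direction
        if np.linalg.norm(dx_new) < tol:       # gradient small enough: stop
            break
        beta = max(0.0, dx_new @ (dx_new - dx) / (dx @ dx))  # Polak-Ribiere, truncated
        s = dx_new + beta * s                  # update the conjugate direction
        dx = dx_new
    return x
</syntaxhighlight>

On an exactly quadratic objective this reduces to the linear conjugate gradient iteration, up to line-search accuracy.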
Within a linear approximation, the parameters <math>\alpha</math> and <math>\beta</math> are the same as in the linear conjugate gradient method but have been obtained with line searches. The conjugate gradient method can follow narrow (ill-conditioned) valleys, where the [[steepest descent]] method slows down and follows a criss-cross pattern.
Four of the best known formulas for <math>\beta_n</math> are named after their developers:
* Fletcher–Reeves:
::<math>\beta_n^{FR} = \frac{\Delta x_n^T\, \Delta x_n}{\Delta x_{n-1}^T\, \Delta x_{n-1}}</math>
* Polak–Ribière:
::<math>\beta_n^{PR} = \frac{\Delta x_n^T\, (\Delta x_n - \Delta x_{n-1})}{\Delta x_{n-1}^T\, \Delta x_{n-1}}</math>
* Hestenes–Stiefel:
::<math>\beta_n^{HS} = \frac{\Delta x_n^T\, (\Delta x_n - \Delta x_{n-1})}{-s_{n-1}^T\, (\Delta x_n - \Delta x_{n-1})}</math>
* Dai–Yuan:
::<math>\beta_n^{DY} = \frac{\Delta x_n^T\, \Delta x_n}{-s_{n-1}^T\, (\Delta x_n - \Delta x_{n-1})}</math>.
These formulas are equivalent for a quadratic function, but for nonlinear optimization the preferred formula is a matter of heuristics or taste. A popular choice is <math>\beta = \max\{0, \beta^{PR}\}</math>, which provides a direction reset automatically.
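For concreteness, the four updates can be written compactly in terms of the steepest directions <math>\Delta x = -\nabla_x f</math> and the previous search direction; the sketch below uses illustrative names and assumes the denominators are nonzero.

<syntaxhighlight lang="python">
import numpy as np

def beta_formulas(dx_new, dx_old, s_old):
    """Sketch of the four beta update rules listed above, in terms of the
    steepest directions dx = -grad f and the previous search direction s."""
    y = dx_new - dx_old                      # change in the steepest direction
    return {
        "FR": (dx_new @ dx_new) / (dx_old @ dx_old),
        "PR": (dx_new @ y) / (dx_old @ dx_old),
        "HS": (dx_new @ y) / (-(s_old @ y)),
        "DY": (dx_new @ dx_new) / (-(s_old @ y)),
    }
</syntaxhighlight>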
Algorithms based on [[Newton's method]] potentially converge much faster. There, both step direction and length are computed from the gradient as the solution of a linear system of equations, with the coefficient matrix being the exact [[Hessian matrix]] (for Newton's method proper) or an estimate thereof (in the [[quasi-Newton methods]], where the observed change in the gradient during the iterations is used to update the Hessian estimate). For high-dimensional problems, the exact computation of the Hessian is usually prohibitively expensive, and even its storage can be problematic, requiring <math>O(N^2)</math> memory (but see the limited-memory [[L-BFGS]] quasi-Newton method).
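For comparison, a single Newton step computes both direction and length from one linear solve against the Hessian. The sketch below assumes callables grad_f and hess_f that return the gradient and the full Hessian; both names are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def newton_step(grad_f, hess_f, x):
    """One Newton step: solve H(x) p = -grad f(x) for the step p.
    Storing the full N x N Hessian is exactly the O(N^2) memory cost noted above."""
    g = grad_f(x)
    H = hess_f(x)
    p = np.linalg.solve(H, -g)   # step direction and length from a single solve
    return x + p
</syntaxhighlight>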
The conjugate gradient method can also be derived using [[optimal control]] theory. In this accelerated optimization theory, the conjugate gradient method falls out as a nonlinear optimal feedback controller,
::<math>u = k(x, \dot x) := -\gamma_a \nabla_x f(x) - \gamma_b \dot x</math>
for the [[double integrator]] system,
::<math>\ddot x = u.</math>
The quantities <math>\gamma_a > 0</math> and <math>\gamma_b > 0</math> are variable feedback gains.
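A minimal discrete-time sketch of this interpretation simulates the double integrator under the feedback law with semi-implicit Euler; the constant gains and step size here are an illustrative simplification of the variable gains above, not the controller from the theory itself.

<syntaxhighlight lang="python">
import numpy as np

def accelerated_descent(grad_f, x0, gamma_a=1.0, gamma_b=2.0, dt=0.01, steps=10000):
    """Simulate the double integrator xdd = u with the feedback
    u = -gamma_a * grad f(x) - gamma_b * xdot (semi-implicit Euler).
    Constant gains are an illustrative simplification."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)                     # velocity xdot
    for _ in range(steps):
        u = -gamma_a * grad_f(x) - gamma_b * v
        v = v + dt * u                       # update velocity
        x = x + dt * v                       # update position
    return x
</syntaxhighlight>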
==See also==
* [[Broyden–Fletcher–Goldfarb–Shanno algorithm]]
* [[Conjugate gradient method]]
* [[L-BFGS]] (limited memory BFGS)
* [[Nelder–Mead method]]
* [[Wolfe conditions]]
==References==
{{Optimization algorithms}}

[[Category:Optimization algorithms and methods]]
[[Category:Gradient methods]]