Non-linear least squares

Non-linear least squares is the form of least squares analysis used to fit a set of ''m'' observations with a model that is non-linear in ''n'' unknown parameters (''m'' ≥ ''n''). It is used in some forms of nonlinear regression. The basis of the method is to approximate the model by a linear one and to refine the parameters by successive iterations. There are many similarities to linear least squares, but also some significant differences. In economic theory, the non-linear least squares method is applied in (i) the probit regression, (ii) threshold regression, (iii) smooth regression, (iv) logistic link regression, (v) Box–Cox transformed regressors (m(x,\theta_i) = \theta_1 + \theta_2 x^{\theta_3}).


Theory

Consider a set of m data points, (x_1, y_1), (x_2, y_2), \dots, (x_m, y_m), and a curve (model function) \hat{y} = f(x, \boldsymbol \beta), that in addition to the variable x also depends on n parameters, \boldsymbol \beta = (\beta_1, \beta_2, \dots, \beta_n), with m \ge n. It is desired to find the vector \boldsymbol \beta of parameters such that the curve fits best the given data in the least squares sense, that is, the sum of squares
S = \sum_{i=1}^{m} r_i^2
is minimized, where the residuals (in-sample prediction errors) are given by r_i = y_i - f(x_i, \boldsymbol \beta) for i = 1, 2, \dots, m.

The minimum value of S occurs when the gradient is zero. Since the model contains n parameters there are n gradient equations:
\frac{\partial S}{\partial \beta_j} = 2 \sum_i r_i \frac{\partial r_i}{\partial \beta_j} = 0 \quad (j = 1, \ldots, n).
In a nonlinear system, the derivatives \frac{\partial r_i}{\partial \beta_j} are functions of both the independent variable and the parameters, so in general these gradient equations do not have a closed solution. Instead, initial values must be chosen for the parameters. Then, the parameters are refined iteratively, that is, the values are obtained by successive approximation,
\beta_j \approx \beta_j^{k+1} = \beta_j^k + \Delta \beta_j.
Here, k is an iteration number and the vector of increments, \Delta \boldsymbol \beta, is known as the shift vector. At each iteration the model is linearized by approximation to a first-order Taylor polynomial expansion about \boldsymbol \beta^k:
f(x_i, \boldsymbol \beta) \approx f(x_i, \boldsymbol \beta^k) + \sum_j \frac{\partial f(x_i, \boldsymbol \beta^k)}{\partial \beta_j} \left(\beta_j - \beta_j^k \right) = f(x_i, \boldsymbol \beta^k) + \sum_j J_{ij} \, \Delta \beta_j.
The Jacobian matrix, \mathbf{J}, is a function of constants, the independent variable ''and'' the parameters, so it changes from one iteration to the next. Thus, in terms of the linearized model, \frac{\partial r_i}{\partial \beta_j} = -J_{ij} and the residuals are given by
\Delta y_i = y_i - f(x_i, \boldsymbol \beta^k),
r_i = y_i - f(x_i, \boldsymbol \beta) = \left(y_i - f(x_i, \boldsymbol \beta^k)\right) + \left(f(x_i, \boldsymbol \beta^k) - f(x_i, \boldsymbol \beta)\right) \approx \Delta y_i - \sum_{s=1}^{n} J_{is} \, \Delta \beta_s.
Substituting these expressions into the gradient equations, they become
-2 \sum_{i=1}^{m} J_{ij} \left( \Delta y_i - \sum_{s=1}^{n} J_{is} \, \Delta \beta_s \right) = 0,
which, on rearrangement, become n simultaneous linear equations, the normal equations
\sum_{i=1}^{m} \sum_{s=1}^{n} J_{ij} J_{is} \, \Delta \beta_s = \sum_{i=1}^{m} J_{ij} \, \Delta y_i \qquad (j = 1, \dots, n).
The normal equations are written in matrix notation as
\left(\mathbf{J}^\mathsf{T} \mathbf{J}\right) \Delta \boldsymbol \beta = \mathbf{J}^\mathsf{T} \, \Delta \mathbf{y}.
These equations form the basis for the Gauss–Newton algorithm for a non-linear least squares problem.

Note the sign convention in the definition of the Jacobian matrix in terms of the derivatives. Formulas linear in J may appear with a factor of −1 in other articles or in the literature.
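
As an illustration, the following minimal Python sketch (not part of the original article) forms the Jacobian and solves the normal equations for one shift vector, assuming a hypothetical exponential-decay model f(x, \boldsymbol\beta) = \beta_1 e^{-\beta_2 x} and synthetic data:

```python
import numpy as np

# Hypothetical model: f(x, beta) = beta1 * exp(-beta2 * x)
def model(x, beta):
    return beta[0] * np.exp(-beta[1] * x)

def jacobian(x, beta):
    # Analytical partial derivatives df/dbeta_j; one column per parameter
    return np.column_stack([
        np.exp(-beta[1] * x),                 # df/dbeta1
        -beta[0] * x * np.exp(-beta[1] * x),  # df/dbeta2
    ])

# Synthetic data and a starting guess beta^k
x = np.linspace(0.0, 5.0, 30)
y = 2.0 * np.exp(-1.3 * x) + 0.01 * np.random.default_rng(0).standard_normal(x.size)
beta_k = np.array([1.0, 1.0])

# One linearized step: solve the normal equations (J^T J) dbeta = J^T dy
J = jacobian(x, beta_k)
dy = y - model(x, beta_k)
dbeta = np.linalg.solve(J.T @ J, J.T @ dy)
beta_next = beta_k + dbeta
```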


Extension by weights

When the observations are not equally reliable, a weighted sum of squares may be minimized,
S = \sum_{i=1}^{m} W_{ii} r_i^2.
Each element of the diagonal weight matrix \mathbf{W} should, ideally, be equal to the reciprocal of the error variance of the measurement. The normal equations are then, more generally,
\left(\mathbf{J}^\mathsf{T} \mathbf{W} \mathbf{J}\right) \Delta \boldsymbol \beta = \mathbf{J}^\mathsf{T} \mathbf{W} \, \Delta \mathbf{y}.
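
A minimal sketch of the weighted shift computation, assuming per-point standard deviations \sigma_i are available so that W_{ii} = 1/\sigma_i^2:

```python
import numpy as np

# Weighted shift vector: given a Jacobian J (m x n), residual vector dy (m,)
# and per-point standard deviations sigma (m,), form W_ii = 1/sigma_i**2 and
# solve the weighted normal equations (J^T W J) dbeta = J^T W dy.
def weighted_shift(J, dy, sigma):
    w = 1.0 / sigma**2        # diagonal of W
    JtW = J.T * w             # J^T W, without building the full m x m matrix
    return np.linalg.solve(JtW @ J, JtW @ dy)
```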


Geometrical interpretation

In linear least squares the objective function, S, is a quadratic function of the parameters.
S = \sum_i W_{ii} \left(y_i - \sum_j X_{ij} \beta_j \right)^2
When there is only one parameter the graph of S with respect to that parameter will be a parabola. With two or more parameters the contours of S with respect to any pair of parameters will be concentric ellipses (assuming that the normal equations matrix \mathbf{X}^\mathsf{T} \mathbf{X} is positive definite). The minimum parameter values are to be found at the centre of the ellipses. The geometry of the general objective function can be described as an elliptical paraboloid. In NLLSQ the objective function is quadratic with respect to the parameters only in a region close to its minimum value, where the truncated Taylor series is a good approximation to the model.
S \approx \sum_i W_{ii} \left(y_i - \sum_j J_{ij} \beta_j \right)^2
The more the parameter values differ from their optimal values, the more the contours deviate from elliptical shape. A consequence of this is that initial parameter estimates should be as close as practicable to their (unknown!) optimal values. It also explains how divergence can come about, as the Gauss–Newton algorithm is convergent only when the objective function is approximately quadratic in the parameters.


Computation


Initial parameter estimates

Some problems of ill-conditioning and divergence can be corrected by finding initial parameter estimates that are near to the optimal values. A good way to do this is by computer simulation. Both the observed and calculated data are displayed on a screen. The parameters of the model are adjusted by hand until the agreement between observed and calculated data is reasonably good. Although this will be a subjective judgment, it is sufficient to find a good starting point for the non-linear refinement. Initial parameter estimates can also be created using transformations or linearizations. Better still, evolutionary algorithms such as the Stochastic Funnel Algorithm can lead to the convex basin of attraction that surrounds the optimal parameter estimates. Hybrid algorithms that use randomization and elitism, followed by Newton methods, have been shown to be useful and computationally efficient.


Solution

Any of the methods described below can be applied to find a solution.


Convergence criteria

The common-sense criterion for convergence is that the sum of squares does not decrease from one iteration to the next. However, this criterion is often difficult to implement in practice, for various reasons. A useful convergence criterion is
\left| \frac{S^k - S^{k+1}}{S^k} \right| < 0.0001.
The value 0.0001 is somewhat arbitrary and may need to be changed. In particular, it may need to be increased when experimental errors are large. An alternative criterion is
\left| \frac{\Delta \beta_j}{\beta_j} \right| < 0.001, \qquad j = 1, \dots, n.
Again, the numerical value is somewhat arbitrary; 0.001 is equivalent to specifying that each parameter should be refined to 0.1% precision. This is reasonable when it is less than the largest relative standard deviation on the parameters.
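
A sketch of how the two tests above might be combined inside an iteration loop (illustrative only; the variable names are hypothetical):

```python
import numpy as np

# Illustrative combination of the two convergence tests: S_prev and S_new are
# successive sums of squares, dbeta and beta the current shift and parameter
# vectors.
def converged(S_prev, S_new, dbeta, beta, tol_S=1e-4, tol_beta=1e-3):
    rel_S = abs((S_prev - S_new) / S_prev)
    rel_beta = np.max(np.abs(dbeta / beta))
    return rel_S < tol_S or rel_beta < tol_beta
```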


Calculation of the Jacobian by numerical approximation

There are models for which it is either very difficult or even impossible to derive analytical expressions for the elements of the Jacobian. Then the numerical approximation
\frac{\partial f(x_i, \boldsymbol \beta)}{\partial \beta_j} \approx \frac{f(x_i, \ldots, \beta_j + \delta \beta_j, \ldots) - f(x_i, \ldots, \beta_j, \ldots)}{\delta \beta_j}
is obtained by calculation of f(x_i, \boldsymbol \beta) for \beta_j and for \beta_j + \delta \beta_j. The size of the increment, \delta \beta_j, should be chosen so that the numerical derivative is not subject to approximation error by being too large, or to round-off error by being too small.
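
A minimal forward-difference sketch, assuming a generic callable model(x, beta):

```python
import numpy as np

# Forward-difference approximation of the Jacobian J_ij = df(x_i, beta)/dbeta_j.
# 'model' is any callable model(x, beta) -> fitted values. The step size is a
# compromise between truncation error (too large) and round-off error (too small).
def numerical_jacobian(model, x, beta, rel_step=1e-6):
    f0 = model(x, beta)
    J = np.empty((x.size, beta.size))
    for j in range(beta.size):
        step = rel_step * max(abs(beta[j]), 1.0)
        beta_step = np.array(beta, dtype=float)
        beta_step[j] += step
        J[:, j] = (model(x, beta_step) - f0) / step
    return J
```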


Parameter errors, confidence limits, residuals etc.

Some information is given in the corresponding section on the linear least squares page.


Multiple minima

Multiple minima can occur in a variety of circumstances, some of which are:
* A parameter is raised to a power of two or more. For example, when fitting data to a Lorentzian curve f(x_i, \boldsymbol \beta) = \frac{\alpha}{1 + \left(\frac{\gamma - x_i}{\beta}\right)^2}, where \alpha is the height, \gamma is the position and \beta is the half-width at half height, there are two solutions for the half-width, \hat \beta and -\hat \beta, which give the same optimal value for the objective function.
* Two parameters can be interchanged without changing the value of the model. A simple example is when the model contains the product of two parameters, since \alpha \beta will give the same value as \beta \alpha.
* A parameter is in a trigonometric function, such as \sin \beta, which has identical values at \hat \beta + 2n \pi. See the Levenberg–Marquardt algorithm for an example.
Not all multiple minima have equal values of the objective function. False minima, also known as local minima, occur when the objective function value is greater than its value at the so-called global minimum. To be certain that the minimum found is the global minimum, the refinement should be started with widely differing initial values of the parameters. When the same minimum is found regardless of starting point, it is likely to be the global minimum.

When multiple minima exist there is an important consequence: the objective function will have a maximum value somewhere between two minima. The normal equations matrix is not positive definite at a maximum in the objective function, as the gradient is zero and no unique direction of descent exists. Refinement from a point (a set of parameter values) close to a maximum will be ill-conditioned and should be avoided as a starting point. For example, when fitting a Lorentzian the normal equations matrix is not positive definite when the half-width of the band is zero.


Transformation to a linear model

A non-linear model can sometimes be transformed into a linear one. For example, when the model is a simple exponential function, f(x_i, \boldsymbol \beta) = \alpha e^{\beta x_i}, it can be transformed into a linear model by taking logarithms:
\log f(x_i, \boldsymbol \beta) = \log \alpha + \beta x_i.
Graphically this corresponds to working on a semi-log plot. The sum of squares becomes
S = \sum_i (\log y_i - \log \alpha - \beta x_i)^2.
This procedure should be avoided unless the errors are multiplicative and log-normally distributed, because it can give misleading results. This comes from the fact that whatever the experimental errors on y might be, the errors on \log y are different. Therefore, when the transformed sum of squares is minimized, different results will be obtained both for the parameter values and for their calculated standard deviations. However, with multiplicative errors that are log-normally distributed, this procedure gives unbiased and consistent parameter estimates.

Another example is furnished by Michaelis–Menten kinetics, used to determine two parameters V_\max and K_m:
v = \frac{V_\max [S]}{K_m + [S]}.
The Lineweaver–Burk plot
\frac{1}{v} = \frac{1}{V_\max} + \frac{K_m}{V_\max [S]}
of \frac{1}{v} against \frac{1}{[S]} is linear in the parameters \frac{1}{V_\max} and \frac{K_m}{V_\max}, but very sensitive to data error and strongly biased toward fitting the data in a particular range of the independent variable [S].
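
A minimal sketch of the log-transformation approach for the exponential example above, using illustrative data and assuming multiplicative, log-normal errors:

```python
import numpy as np

# Fitting f(x) = alpha * exp(beta * x) by taking logarithms: log y is linear in
# log(alpha) and beta, so ordinary linear least squares applies. As noted above,
# this is only appropriate when the errors on y are multiplicative and log-normal.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 1.1, 0.55, 0.31, 0.16])        # illustrative data

A = np.column_stack([np.ones_like(x), x])          # design matrix [1, x]
coef, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
alpha, beta = np.exp(coef[0]), coef[1]
```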


Algorithms


Gauss–Newton method

The normal equations
\left( \mathbf{J}^\mathsf{T} \mathbf{W} \mathbf{J} \right) \Delta \boldsymbol\beta = \left( \mathbf{J}^\mathsf{T} \mathbf{W} \right) \Delta \mathbf{y}
may be solved for \Delta \boldsymbol\beta by Cholesky decomposition, as described in linear least squares. The parameters are updated iteratively
\boldsymbol\beta^{k+1} = \boldsymbol\beta^k + \Delta \boldsymbol\beta,
where ''k'' is an iteration number. While this method may be adequate for simple models, it will fail if divergence occurs. Therefore, protection against divergence is essential.
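
A minimal sketch of such a loop, unweighted and without divergence protection, assuming the model and Jacobian callables from the earlier sketches and that SciPy is available for the Cholesky solve:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Unprotected Gauss-Newton loop solving the normal equations by Cholesky
# decomposition. 'model' and 'jacobian' are callables as in the earlier sketches;
# weights are omitted for brevity. A practical implementation needs protection
# against divergence (shift-cutting or a Marquardt parameter, described below).
def gauss_newton(model, jacobian, x, y, beta0, max_iter=50, tol=1e-4):
    beta = np.array(beta0, dtype=float)
    S_prev = np.inf
    for _ in range(max_iter):
        dy = y - model(x, beta)
        S = dy @ dy
        if abs(S_prev - S) < tol * S_prev:
            break
        J = jacobian(x, beta)
        c, low = cho_factor(J.T @ J)               # normal equations matrix
        beta = beta + cho_solve((c, low), J.T @ dy)
        S_prev = S
    return beta
```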


Shift-cutting

If divergence occurs, a simple expedient is to reduce the length of the shift vector, \Delta \boldsymbol\beta, by a fraction, ''f'':
\boldsymbol\beta^{k+1} = \boldsymbol\beta^k + f \, \Delta \boldsymbol\beta.
For example, the length of the shift vector may be successively halved until the new value of the objective function is less than its value at the last iteration. The fraction, ''f'', could be optimized by a line search (M. J. Box, D. Davies and W. H. Swann, ''Non-Linear Optimization Techniques'', Oliver & Boyd, 1969). As each trial value of ''f'' requires the objective function to be re-calculated, it is not worth optimizing its value too stringently.

When using shift-cutting, the direction of the shift vector remains unchanged. This limits the applicability of the method to situations where the direction of the shift vector is not very different from what it would be if the objective function were approximately quadratic in the parameters, \boldsymbol\beta^k.
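
A minimal step-halving sketch, assuming a generic callable sum_sq(beta) returning the objective function value:

```python
# Shift-cutting sketch: halve the fraction f of the shift vector until the sum
# of squares decreases or a minimum fraction is reached. 'sum_sq' is any
# callable returning S(beta); beta and dbeta come from the current iteration.
def cut_shift(sum_sq, beta, dbeta, min_frac=1.0 / 1024):
    S_old = sum_sq(beta)
    f = 1.0
    while f >= min_frac:
        beta_trial = beta + f * dbeta
        if sum_sq(beta_trial) < S_old:
            return beta_trial
        f /= 2.0
    return beta    # no acceptable fraction found; another strategy is needed
```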


Marquardt parameter

If divergence occurs and the direction of the shift vector is so far from its "ideal" direction that shift-cutting is not very effective, that is, the fraction, ''f'', required to avoid divergence is very small, the direction must be changed. This can be achieved by using the Marquardt parameter. In this method the normal equations are modified:
\left( \mathbf{J}^\mathsf{T} \mathbf{W} \mathbf{J} + \lambda \mathbf{I} \right) \Delta \boldsymbol \beta = \left( \mathbf{J}^\mathsf{T} \mathbf{W} \right) \Delta \mathbf{y},
where \lambda is the Marquardt parameter and \mathbf{I} is an identity matrix. Increasing the value of \lambda has the effect of changing both the direction and the length of the shift vector. The shift vector is rotated towards the direction of steepest descent when
\lambda \mathbf{I} \gg \mathbf{J}^\mathsf{T} \mathbf{W} \mathbf{J}, \qquad \Delta \boldsymbol\beta \approx \frac{1}{\lambda} \mathbf{J}^\mathsf{T} \mathbf{W} \, \Delta \mathbf{y};
\mathbf{J}^\mathsf{T} \mathbf{W} \, \Delta \mathbf{y} is the steepest descent vector. So, when \lambda becomes very large, the shift vector becomes a small fraction of the steepest descent vector.

Various strategies have been proposed for the determination of the Marquardt parameter. As with shift-cutting, it is wasteful to optimize this parameter too stringently. Rather, once a value has been found that brings about a reduction in the value of the objective function, that value of the parameter is carried to the next iteration, reduced if possible, or increased if need be. When reducing the value of the Marquardt parameter, there is a cut-off value below which it is safe to set it to zero, that is, to continue with the unmodified Gauss–Newton method. The cut-off value may be set equal to the smallest singular value of the Jacobian. A bound for this value is given by 1 / \operatorname{trace} \left(\mathbf{J}^\mathsf{T} \mathbf{W} \mathbf{J}\right)^{-1}, where \operatorname{trace} is the trace function.
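
A minimal Levenberg–Marquardt-style sketch of one damped step with a simple increase/decrease rule for \lambda, unweighted and using hypothetical model and Jacobian callables:

```python
import numpy as np

# Levenberg-Marquardt style step (a sketch, unweighted for brevity): the
# Marquardt parameter lam is increased until the modified normal equations give
# a shift that reduces S, and relaxed again after a successful step.
def marquardt_step(model, jacobian, x, y, beta, lam=1e-3):
    dy = y - model(x, beta)
    S_old = dy @ dy
    J = jacobian(x, beta)
    JtJ, Jtdy = J.T @ J, J.T @ dy
    for _ in range(50):                            # bounded number of damping attempts
        dbeta = np.linalg.solve(JtJ + lam * np.eye(beta.size), Jtdy)
        r = y - model(x, beta + dbeta)
        if r @ r < S_old:
            return beta + dbeta, max(lam / 10.0, 1e-12)   # accept, relax lam
        lam *= 10.0                                        # reject, damp harder
    return beta, lam                               # give up; caller decides what to do
```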


QR decomposition

The minimum in the sum of squares can be found by a method that does not involve forming the normal equations. The residuals with the linearized model can be written as
\mathbf{r} = \Delta \mathbf{y} - \mathbf{J} \, \Delta\boldsymbol\beta.
The Jacobian is subjected to an orthogonal decomposition; the QR decomposition will serve to illustrate the process.
\mathbf{J} = \mathbf{Q} \mathbf{R},
where \mathbf{Q} is an orthogonal m \times m matrix and \mathbf{R} is an m \times n matrix which is partitioned into an n \times n block, \mathbf{R}_n, and an (m-n) \times n zero block; \mathbf{R}_n is upper triangular.
\mathbf{R} = \begin{bmatrix} \mathbf{R}_n \\ \mathbf{0} \end{bmatrix}
The residual vector is left-multiplied by \mathbf{Q}^\mathsf{T}:
\mathbf{Q}^\mathsf{T} \mathbf{r} = \mathbf{Q}^\mathsf{T} \, \Delta \mathbf{y} - \mathbf{R} \, \Delta\boldsymbol\beta = \begin{bmatrix} \left(\mathbf{Q}^\mathsf{T} \, \Delta \mathbf{y} - \mathbf{R} \, \Delta\boldsymbol\beta \right)_n \\ \left(\mathbf{Q}^\mathsf{T} \, \Delta \mathbf{y} \right)_{m-n} \end{bmatrix}
This has no effect on the sum of squares, since S = \mathbf{r}^\mathsf{T} \mathbf{Q} \mathbf{Q}^\mathsf{T} \mathbf{r} = \mathbf{r}^\mathsf{T} \mathbf{r}, because \mathbf{Q} is orthogonal. The minimum value of ''S'' is attained when the upper block is zero. Therefore, the shift vector is found by solving
\mathbf{R}_n \, \Delta\boldsymbol\beta = \left(\mathbf{Q}^\mathsf{T} \, \Delta \mathbf{y} \right)_n.
These equations are easily solved as \mathbf{R}_n is upper triangular.
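
A minimal sketch of this computation using the thin (reduced) QR decomposition, assuming SciPy is available for the triangular solve:

```python
import numpy as np
from scipy.linalg import solve_triangular

# Shift vector from the thin QR decomposition J = Q R_n: solve the triangular
# system R_n dbeta = (Q^T dy)_n without forming the normal equations.
def qr_shift(J, dy):
    Q, R = np.linalg.qr(J, mode='reduced')   # Q: m x n, R: n x n upper triangular
    return solve_triangular(R, Q.T @ dy)
```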


Singular value decomposition

A variant of the method of orthogonal decomposition involves singular value decomposition, in which \mathbf{R} is diagonalized by further orthogonal transformations.
\mathbf{J} = \mathbf{U} \boldsymbol\Sigma \mathbf{V}^\mathsf{T}
where \mathbf{U} is orthogonal, \boldsymbol\Sigma is a diagonal matrix of singular values and \mathbf{V} is the orthogonal matrix of the eigenvectors of \mathbf{J}^\mathsf{T} \mathbf{J}, or equivalently the right singular vectors of \mathbf{J}. In this case the shift vector is given by
\Delta\boldsymbol\beta = \mathbf{V} \boldsymbol\Sigma^{-1} \left( \mathbf{U}^\mathsf{T} \, \Delta \mathbf{y} \right)_n.
The relative simplicity of this expression is very useful in theoretical analysis of non-linear least squares. The application of singular value decomposition is discussed in detail in Lawson and Hanson (C. L. Lawson and R. J. Hanson, ''Solving Least Squares Problems'', Prentice–Hall, 1974).
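
A minimal sketch of the SVD-based shift, with small singular values treated as zero (an illustrative choice, not prescribed by the text above):

```python
import numpy as np

# Shift vector from the thin SVD J = U Sigma V^T, i.e. dbeta = V Sigma^-1 U^T dy.
# Singular values below a relative threshold are treated as zero, since they
# indicate an ill-conditioned Jacobian.
def svd_shift(J, dy, rcond=1e-12):
    U, s, Vt = np.linalg.svd(J, full_matrices=False)
    s_inv = np.where(s > rcond * s.max(), 1.0 / s, 0.0)
    return Vt.T @ (s_inv * (U.T @ dy))
```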


Gradient methods

There are many examples in the scientific literature where different methods have been used for non-linear data-fitting problems.
* Inclusion of second derivatives in the Taylor series expansion of the model function. This is Newton's method in optimization.
f(x_i, \boldsymbol \beta) = f^k(x_i, \boldsymbol \beta) + \sum_j J_{ij} \, \Delta \beta_j + \frac{1}{2} \sum_j \sum_k \Delta\beta_j \, \Delta\beta_k \, H_{jk}, \qquad H_{jk} = \frac{\partial^2 f(x_i, \boldsymbol \beta)}{\partial \beta_j \, \partial \beta_k}.
The matrix H is known as the Hessian matrix. Although this model has better convergence properties near to the minimum, it is much worse when the parameters are far from their optimal values. Calculation of the Hessian adds to the complexity of the algorithm. This method is not in general use.
* Davidon–Fletcher–Powell method. This method, a form of pseudo-Newton method, is similar to the one above but calculates the Hessian by successive approximation, to avoid having to use analytical expressions for the second derivatives.
* Steepest descent (a minimal sketch follows this list). Although a reduction in the sum of squares is guaranteed when the shift vector points in the direction of steepest descent, this method often performs poorly. When the parameter values are far from optimal the direction of the steepest descent vector, which is normal (perpendicular) to the contours of the objective function, is very different from the direction of the Gauss–Newton vector. This makes divergence much more likely, especially as the minimum along the direction of steepest descent may correspond to a small fraction of the length of the steepest descent vector. When the contours of the objective function are very eccentric, due to there being high correlation between parameters, the steepest descent iterations, with shift-cutting, follow a slow, zig-zag trajectory towards the minimum.
* Conjugate gradient search. This is an improved method based on steepest descent with good theoretical convergence properties, although it can fail on finite-precision digital computers even when used on quadratic problems (M. J. D. Powell, ''Computer Journal'', 7 (1964), 155).
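
A minimal steepest-descent sketch for the least-squares objective, using the hypothetical model and Jacobian callables from the earlier sketches:

```python
import numpy as np

# Steepest-descent step for the least-squares objective: the descent direction
# is J^T dy (the negative gradient of S up to a factor of 2), combined with
# simple step halving. Convergence is slow when the contours of S are eccentric.
def steepest_descent_step(model, jacobian, x, y, beta, step=1.0):
    dy = y - model(x, beta)
    S_old = dy @ dy
    g = jacobian(x, beta).T @ dy              # steepest-descent direction
    while step > 1e-12:
        beta_trial = beta + step * g
        r = y - model(x, beta_trial)
        if r @ r < S_old:
            return beta_trial, step
        step /= 2.0
    return beta, step
```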


Direct search methods

Direct search methods depend on evaluations of the objective function at a variety of parameter values and do not use derivatives at all. They offer alternatives to the use of numerical derivatives in the Gauss–Newton method and in gradient methods.
* Alternating variable search. Each parameter is varied in turn by adding a fixed or variable increment to it and retaining the value that brings about a reduction in the sum of squares. The method is simple and effective when the parameters are not highly correlated. It has very poor convergence properties, but may be useful for finding initial parameter estimates.
* Nelder–Mead (simplex) search. A simplex in this context is a polytope of ''n'' + 1 vertices in ''n'' dimensions: a triangle on a plane, a tetrahedron in three-dimensional space, and so forth. Each vertex corresponds to a value of the objective function for a particular set of parameters. The shape and size of the simplex is adjusted by varying the parameters in such a way that the value of the objective function at the highest vertex always decreases. Although the sum of squares may initially decrease rapidly, it can converge to a nonstationary point on quasiconvex problems, by an example of M. J. D. Powell. (A short sketch follows below.)
More detailed descriptions of these, and other, methods are available in ''Numerical Recipes'', together with computer code in various languages.
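
As an illustration, a derivative-free Nelder–Mead fit of the hypothetical exponential model used in the earlier sketches, assuming SciPy is available:

```python
import numpy as np
from scipy.optimize import minimize

# Derivative-free Nelder-Mead minimization of the sum of squares for the
# hypothetical exponential model f(x, beta) = beta1 * exp(-beta2 * x).
x = np.linspace(0.0, 5.0, 30)
y = 2.0 * np.exp(-1.3 * x)                     # illustrative, noise-free data

def sum_sq(beta):
    return np.sum((y - beta[0] * np.exp(-beta[1] * x)) ** 2)

result = minimize(sum_sq, x0=[1.0, 1.0], method='Nelder-Mead')
beta_hat = result.x
```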


See also

* Least squares support vector machine
* Curve fitting
* Grey box model
* Nonlinear programming
* Nonlinear regression
* Optimization (mathematics)
* Levenberg–Marquardt algorithm

