Non-linear least squares is the form of least squares analysis used to fit a set of $m$ observations with a model that is non-linear in $n$ unknown parameters ($m \ge n$). It is used in some forms of nonlinear regression. The basis of the method is to approximate the model by a linear one and to refine the parameters by successive iterations. There are many similarities to linear least squares, but also some significant differences.
In economic theory, the non-linear least squares method is applied in (i) probit regression, (ii) threshold regression, (iii) smooth regression, (iv) logistic link regression, and (v) Box–Cox transformed regressors ($m(x, \theta_i) = \theta_1 + \theta_2 x^{\theta_3}$).
Theory
Consider a set of $m$ data points, $(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)$, and a curve (model function) $\hat{y} = f(x, \boldsymbol\beta)$ that in addition to the variable $x$ also depends on $n$ parameters, $\boldsymbol\beta = (\beta_1, \beta_2, \dots, \beta_n)$, with $m \ge n$. It is desired to find the vector $\boldsymbol\beta$ of parameters such that the curve fits best the given data in the least squares sense, that is, the sum of squares
$$S = \sum_{i=1}^{m} r_i^2$$
is minimized, where the residuals (in-sample prediction errors) $r_i$ are given by
$$r_i = y_i - f(x_i, \boldsymbol\beta)$$
for $i = 1, 2, \dots, m$.
The minimum value of $S$ occurs when the gradient is zero. Since the model contains $n$ parameters there are $n$ gradient equations:
$$\frac{\partial S}{\partial \beta_j} = 2 \sum_i r_i \frac{\partial r_i}{\partial \beta_j} = 0 \qquad (j = 1, \dots, n).$$
In a nonlinear system, the derivatives $\frac{\partial r_i}{\partial \beta_j}$ are functions of both the independent variable and the parameters, so in general these gradient equations do not have a closed solution. Instead, initial values must be chosen for the parameters. Then, the parameters are refined iteratively, that is, the values are obtained by successive approximation,
$$\beta_j \approx \beta_j^{k+1} = \beta_j^{k} + \Delta \beta_j.$$
Here, $k$ is an iteration number and the vector of increments, $\Delta \boldsymbol\beta$, is known as the shift vector. At each iteration the model is linearized by approximation to a first-order Taylor polynomial expansion about $\boldsymbol\beta^{k}$:
$$f(x_i, \boldsymbol\beta) \approx f(x_i, \boldsymbol\beta^{k}) + \sum_j \frac{\partial f(x_i, \boldsymbol\beta^{k})}{\partial \beta_j} \left(\beta_j - \beta_j^{k}\right) = f(x_i, \boldsymbol\beta^{k}) + \sum_j J_{ij}\, \Delta \beta_j.$$
The Jacobian matrix, $\mathbf{J}$, is a function of constants, the independent variable ''and'' the parameters, so it changes from one iteration to the next. Thus, in terms of the linearized model, $\frac{\partial r_i}{\partial \beta_j} = -J_{ij}$
and the residuals are given by
$$r_i = \Delta y_i - \sum_{s=1}^{n} J_{is}\, \Delta \beta_s, \qquad \Delta y_i = y_i - f(x_i, \boldsymbol\beta^{k}).$$
Substituting these expressions into the gradient equations, they become
$$-2 \sum_{i=1}^{m} J_{ij} \left(\Delta y_i - \sum_{s=1}^{n} J_{is}\, \Delta \beta_s\right) = 0,$$
which, on rearrangement, become $n$ simultaneous linear equations, the normal equations
$$\sum_{i=1}^{m} \sum_{s=1}^{n} J_{ij} J_{is}\, \Delta \beta_s = \sum_{i=1}^{m} J_{ij}\, \Delta y_i \qquad (j = 1, \dots, n).$$
The normal equations are written in matrix notation as
$$\left(\mathbf{J}^\mathsf{T} \mathbf{J}\right) \Delta \boldsymbol\beta = \mathbf{J}^\mathsf{T}\, \Delta \mathbf{y}.$$
These equations form the basis for the
Gauss–Newton algorithm
for a non-linear least squares problem.
Note the sign convention in the definition of the Jacobian matrix in terms of the derivatives: with $J_{ij} = \partial f(x_i, \boldsymbol\beta)/\partial \beta_j$, the residual derivatives are $\partial r_i/\partial \beta_j = -J_{ij}$. Formulas linear in $\mathbf{J}$ may appear with a factor of $-1$ in other articles or the literature.
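To make one iteration concrete, here is a minimal Python sketch of the Gauss–Newton update for an assumed exponential model $f(x, \boldsymbol\beta) = \beta_1 e^{\beta_2 x}$ with an analytical Jacobian; the model, data, and starting values are illustrative, not taken from the text above.

```python
import numpy as np

def model(x, beta):
    # Illustrative model: f(x, beta) = beta1 * exp(beta2 * x)
    return beta[0] * np.exp(beta[1] * x)

def jacobian(x, beta):
    # J_ij = d f(x_i, beta) / d beta_j, evaluated analytically
    return np.column_stack([np.exp(beta[1] * x),
                            beta[0] * x * np.exp(beta[1] * x)])

def gauss_newton(x, y, beta, n_iter=20):
    for _ in range(n_iter):
        J = jacobian(x, beta)
        dy = y - model(x, beta)        # Delta y_i = y_i - f(x_i, beta^k)
        # Normal equations: (J^T J) dbeta = J^T dy, solved for the shift vector
        dbeta = np.linalg.solve(J.T @ J, J.T @ dy)
        beta = beta + dbeta            # beta^{k+1} = beta^k + dbeta
    return beta

# Synthetic noisy data and a rough starting guess
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 25)
y = 2.0 * np.exp(1.5 * x) + 0.05 * rng.standard_normal(x.size)
print(gauss_newton(x, y, beta=np.array([1.0, 1.0])))   # approx. [2.0, 1.5]
```

In practice the shift vector is usually computed from a QR or Cholesky factorization rather than by forming $\mathbf{J}^\mathsf{T} \mathbf{J}$ explicitly, since forming the product squares the condition number.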
Extension by weights
When the observations are not equally reliable, a weighted sum of squares may be minimized,
$$S = \sum_{i=1}^{m} W_{ii}\, r_i^2.$$
Each element of the diagonal weight matrix $\mathbf{W}$ should, ideally, be equal to the reciprocal of the error variance of the measurement.
The normal equations are then, more generally,
$$\left(\mathbf{J}^\mathsf{T} \mathbf{W} \mathbf{J}\right) \Delta \boldsymbol\beta = \mathbf{J}^\mathsf{T} \mathbf{W}\, \Delta \mathbf{y}.$$
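Reusing the notation of the sketch above, the weighted shift vector might be computed as follows; `sigma`, the per-observation standard deviations, is an assumed input.

```python
import numpy as np

def weighted_shift(J, dy, sigma):
    # Diagonal weight matrix with W_ii = 1 / sigma_i^2,
    # the reciprocal of the error variance of each measurement
    W = np.diag(1.0 / sigma**2)
    # Solve (J^T W J) dbeta = J^T W dy for the shift vector
    return np.linalg.solve(J.T @ W @ J, J.T @ W @ dy)
```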
Geometrical interpretation
In linear least squares the objective function, $S$, is a quadratic function of the parameters.
When there is only one parameter the graph of $S$ with respect to that parameter will be a parabola. With two or more parameters the contours of $S$ with respect to any pair of parameters will be concentric ellipses (assuming that the normal equations matrix $\mathbf{J}^\mathsf{T} \mathbf{J}$ is positive definite). The minimum parameter values are to be found at the centre of the ellipses. The geometry of the general objective function can be described as an elliptical paraboloid.
In non-linear least squares (NLLSQ) the objective function is quadratic with respect to the parameters only in a region close to its minimum value, where the truncated Taylor series is a good approximation to the model.
The more the parameter values differ from their optimal values, the more the contours deviate from elliptical shape. A consequence of this is that initial parameter estimates should be as close as practicable to their (unknown!) optimal values. It also explains how divergence can come about, as the Gauss–Newton algorithm is convergent only when the objective function is approximately quadratic in the parameters.
Computation
Initial parameter estimates
Some problems of ill-conditioning and divergence can be corrected by finding initial parameter estimates that are near to the optimal values. A good way to do this is by computer simulation: both the observed and calculated data are displayed on a screen, and the parameters of the model are adjusted by hand until the agreement between observed and calculated data is reasonably good. Although this is a subjective judgment, it is sufficient to find a good starting point for the non-linear refinement. Initial parameter estimates can also be created using transformations or linearizations. Better still, evolutionary algorithms such as the Stochastic Funnel Algorithm can lead to the convex basin of attraction that surrounds the optimal parameter estimates. Hybrid algorithms that use randomization and elitism, followed by Newton methods, have been shown to be useful and computationally efficient.
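A crude but automatic alternative for generating a starting point is a coarse grid search over the sum of squares. The following is a minimal Python sketch; the exponential model and the grid ranges are illustrative assumptions, not part of the methods named above.

```python
import numpy as np
from itertools import product

def grid_search_start(x, y, model, grids):
    # Evaluate the sum of squares S at every candidate parameter
    # vector on a coarse grid and return the best as a starting point.
    best_beta, best_s = None, np.inf
    for beta in product(*grids):
        beta = np.asarray(beta, dtype=float)
        s = np.sum((y - model(x, beta)) ** 2)
        if s < best_s:
            best_beta, best_s = beta, s
    return best_beta

# Example with an illustrative exponential model y = b1 * exp(b2 * x)
model = lambda x, b: b[0] * np.exp(b[1] * x)
x = np.linspace(0.0, 1.0, 25)
y = 2.0 * np.exp(1.5 * x)
grids = [np.linspace(0.5, 5.0, 10), np.linspace(0.5, 3.0, 10)]
print(grid_search_start(x, y, model, grids))
```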
Solution
Any method among the ones described
below
can be applied to find a solution.
Convergence criteria
The common-sense criterion for convergence is that the sum of squares does not decrease from one iteration to the next. However, this criterion is often difficult to implement in practice, for various reasons. A useful convergence criterion is
$$\left| \frac{S^{k} - S^{k+1}}{S^{k}} \right| < 0.0001.$$
The value 0.0001 is somewhat arbitrary and may need to be changed. In particular, it may need to be increased when experimental errors are large. An alternative criterion is
$$\left| \frac{\Delta \beta_j}{\beta_j} \right| < 0.001 \qquad (j = 1, \dots, n).$$
Again, the numerical value is somewhat arbitrary; 0.001 is equivalent to specifying that each parameter should be refined to 0.1% precision. This is reasonable when it is less than the largest relative standard deviation on the parameters.
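Expressed in code, the two criteria might look like the following sketch; the tolerances mirror the illustrative values 0.0001 and 0.001 above, and combining the tests with `or` is one possible policy rather than something prescribed by the text.

```python
import numpy as np

def converged(s_prev, s_curr, beta, dbeta, tol_s=1e-4, tol_beta=1e-3):
    # Relative decrease of the sum of squares S between iterations
    s_ok = abs((s_prev - s_curr) / s_prev) < tol_s
    # Relative size of the shift vector, checked for every parameter
    beta_ok = np.all(np.abs(dbeta / beta) < tol_beta)
    return s_ok or beta_ok
```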
Calculation of the Jacobian by numerical approximation
There are models for which it is either very difficult or even impossible to derive analytical expressions for the elements of the Jacobian. Then, the numerical approximation
$$\frac{\partial f(x_i, \boldsymbol\beta)}{\partial \beta_j} \approx \frac{f(x_i, \boldsymbol\beta + \delta\beta_j \mathbf{e}_j) - f(x_i, \boldsymbol\beta)}{\delta\beta_j}$$
is obtained by calculation of $f(x_i, \boldsymbol\beta)$ for $\beta_j$ and for $\beta_j + \delta\beta_j$, where $\mathbf{e}_j$ is the unit vector in the $j$-th parameter direction. The size of the increment, $\delta\beta_j$, should be chosen so that the numerical derivative is not subject to approximation error by being too large, or to round-off error by being too small.
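A forward-difference sketch in Python; the relative step scaling is one common heuristic for balancing truncation error against round-off error, and `model`, `x`, and `beta` follow the conventions of the illustrative sketches above.

```python
import numpy as np

def numerical_jacobian(model, x, beta, delta=1e-6):
    # Forward-difference approximation of J_ij = d f(x_i, beta) / d beta_j
    f0 = model(x, beta)
    J = np.empty((x.size, beta.size))
    for j in range(beta.size):
        step = delta * max(abs(beta[j]), 1.0)   # scale step to the parameter
        beta_step = beta.copy()
        beta_step[j] += step
        J[:, j] = (model(x, beta_step) - f0) / step
    return J
```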
Parameter errors, confidence limits, residuals, etc.
Some information is given in
the corresponding section on the
linear least squares
page.
Multiple minima
Multiple minima can occur in a variety of circumstances, some of which are:
* A parameter is raised to a power of two or more. For example, when fitting data to a Lorentzian curve
$$f(x_i, \boldsymbol\beta) = \frac{\alpha}{1 + \left(\frac{\gamma - x_i}{\beta}\right)^2},$$
where $\alpha$ is the height, $\gamma$ is the position and $\beta$ is the half-width at half height, there are two solutions for the half-width, $\hat\beta$ and $-\hat\beta$, which give the same optimal value for the objective function.
* Two parameters can be interchanged without changing the value of the model. A simple example is when the model contains the product of two parameters, since $\alpha \beta$ will give the same value as $\beta \alpha$.
* A parameter is in a trigonometric function, such as $\sin \beta$, which has identical values at $\hat\beta + 2n\pi$. See the Levenberg–Marquardt algorithm for an example.
Not all multiple minima have equal values of the objective function. False minima, also known as local minima, occur when the objective function value is greater than its value at the so-called global minimum. To be certain that the minimum found is the global minimum, the refinement should be started with widely differing initial values of the parameters. When the same minimum is found regardless of starting point, it is likely to be the global minimum.
When multiple minima exist there is an important consequence: the objective function will have a maximum value somewhere between two minima. The normal equations matrix is not positive definite at a maximum in the objective function, as the gradient is zero and no unique direction of descent exists. Refinement from a point (a set of parameter values) close to a maximum will be ill-conditioned and should be avoided as a starting point. For example, when fitting a Lorentzian the normal equations matrix is not positive definite when the half-width of the band is zero.
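The restart strategy described above, refining from widely differing starting points and keeping the best result, can be automated. A minimal sketch, where `fit` stands for any refinement routine (for instance the Gauss–Newton sketch earlier); all names are illustrative.

```python
import numpy as np

def multi_start(x, y, model, fit, starts):
    # Refine from several widely differing starting points and keep
    # the parameter vector with the smallest sum of squares.
    best_beta, best_s = None, np.inf
    for beta0 in starts:
        beta = fit(x, y, np.asarray(beta0, dtype=float))
        s = np.sum((y - model(x, beta)) ** 2)
        if s < best_s:
            best_beta, best_s = beta, s
    return best_beta, best_s
```

If several starting points converge to the same parameter vector and sum of squares, that minimum is likely, though not guaranteed, to be the global one.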
Transformation to a linear model
A non-linear model can sometimes be transformed into a linear one. For example, when the model is a simple exponential function,
$$f(x_i, \boldsymbol\beta) = \alpha e^{\beta x_i},$$
it can be transformed into a linear model by taking logarithms:
$$\log f(x_i, \boldsymbol\beta) = \log \alpha + \beta x_i.$$
Graphically this corresponds to working on a semi-log plot. The sum of squares becomes
$$S = \sum_i \left(\log y_i - \log \alpha - \beta x_i\right)^2.$$
This procedure should be avoided unless the errors are multiplicative and log-normally distributed, because it can give misleading results. This comes from the fact that whatever the experimental errors on $y$ might be, the errors on $\log y$ are different. Therefore, when the transformed sum of squares is minimized, different results will be obtained both for the parameter values and their calculated standard deviations. However, with multiplicative errors that are log-normally distributed, this procedure gives unbiased and consistent parameter estimates.
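As a concrete sketch, the transformed fit reduces to ordinary linear least squares on $\log y$; the data here are illustrative and noise-free, so the caveat about error structure does not bite.

```python
import numpy as np

# Fit y = alpha * exp(beta * x) by linear least squares on log(y).
# Requires y > 0; see the discussion of error structure above.
x = np.linspace(0.0, 1.0, 25)
y = 2.0 * np.exp(1.5 * x)

# Linear model: log y = log(alpha) + beta * x
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
alpha, beta = np.exp(coef[0]), coef[1]
print(alpha, beta)   # recovers 2.0 and 1.5 on noise-free data
```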
Another example is furnished by Michaelis–Menten kinetics, used to determine two parameters $V_\max$ and $K_m$:
$$v = \frac{V_\max [S]}{K_m + [S]}.$$
The Lineweaver–Burk plot of $\frac{1}{v}$ against $\frac{1}{[S]}$,
$$\frac{1}{v} = \frac{1}{V_\max} + \frac{K_m}{V_\max} \cdot \frac{1}{[S]},$$
is linear in the parameters $\frac{1}{V_\max}$ and $\frac{K_m}{V_\max}$, but very sensitive to data error and strongly biased toward fitting the data in a particular range of the independent variable $[S]$.
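A minimal sketch of the double-reciprocal fit, on assumed noise-free data generated from $V_\max = 10$ and $K_m = 0.5$; with real, noisy data the caveats above apply, and a direct non-linear fit is preferable.

```python
import numpy as np

# Illustrative substrate concentrations and noise-free rates,
# generated from assumed true values V_max = 10 and K_m = 0.5.
S = np.array([0.1, 0.2, 0.5, 1.0, 2.0, 5.0])
v = 10.0 * S / (0.5 + S)

# Lineweaver-Burk: 1/v = 1/V_max + (K_m / V_max) * (1/S)
A = np.column_stack([np.ones_like(S), 1.0 / S])
(intercept, slope), *_ = np.linalg.lstsq(A, 1.0 / v, rcond=None)
V_max = 1.0 / intercept
K_m = slope * V_max
print(V_max, K_m)   # recovers 10.0 and 0.5 on noise-free data
```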