least squares

The method of least squares is a standard approach in regression analysis to approximate the solution of overdetermined systems (sets of equations in which there are more equations than unknowns) by minimizing the sum of the squares of the residuals (a residual being the difference between an observed value and the fitted value provided by a model) made in the results of each individual equation. The most important application is in data fitting. When the problem has substantial uncertainties in the independent variable (the ''x'' variable), simple regression and least-squares methods have problems; in such cases, the methodology required for fitting errors-in-variables models may be considered instead of that for least squares.

Least squares problems fall into two categories: linear or ordinary least squares and nonlinear least squares, depending on whether or not the residuals are linear in all unknowns. The linear least-squares problem occurs in statistical regression analysis; it has a closed-form solution. The nonlinear problem is usually solved by iterative refinement; at each iteration the system is approximated by a linear one, and thus the core calculation is similar in both cases.

Polynomial least squares describes the variance in a prediction of the dependent variable as a function of the independent variable and the deviations from the fitted curve. When the observations come from an exponential family with identity as its natural sufficient statistics and mild conditions are satisfied (e.g. for normal, exponential, Poisson and binomial distributions), standardized least-squares estimates and maximum-likelihood estimates are identical. The method of least squares can also be derived as a method of moments estimator.

The following discussion is mostly presented in terms of linear functions, but the use of least squares is valid and practical for more general families of functions. Also, by iteratively applying local quadratic approximation to the likelihood (through the Fisher information), the least-squares method may be used to fit a generalized linear model.

The least-squares method was officially discovered and published by Adrien-Marie Legendre (1805), though it is usually also co-credited to Carl Friedrich Gauss (1795), who contributed significant theoretical advances to the method and may have previously used it in his work.


History


Founding

The method of least squares grew out of the fields of astronomy and geodesy, as scientists and mathematicians sought to provide solutions to the challenges of navigating the Earth's oceans during the Age of Discovery. The accurate description of the behavior of celestial bodies was the key to enabling ships to sail in open seas, where sailors could no longer rely on land sightings for navigation.

The method was the culmination of several advances that took place during the course of the eighteenth century:
* The combination of different observations as being the best estimate of the true value; errors decrease with aggregation rather than increase, perhaps first expressed by Roger Cotes in 1722.
* The combination of different observations taken under the ''same'' conditions, as opposed to simply trying one's best to observe and record a single observation accurately. The approach was known as the method of averages. This approach was notably used by Tobias Mayer while studying the librations of the Moon in 1750, and by Pierre-Simon Laplace in his work in explaining the differences in motion of Jupiter and Saturn in 1788.
* The combination of different observations taken under ''different'' conditions. The method came to be known as the method of ''least absolute deviation''. It was notably performed by Roger Joseph Boscovich in his work on the shape of the Earth in 1757 and by Pierre-Simon Laplace for the same problem in 1799.
* The development of a criterion that can be evaluated to determine when the solution with the minimum error has been achieved. Laplace tried to specify a mathematical form of the probability density for the errors and define a method of estimation that minimizes the error of estimation. For this purpose, Laplace used a symmetric two-sided exponential distribution, now called the Laplace distribution, to model the error distribution, and used the sum of absolute deviations as the error of estimation. He felt these to be the simplest assumptions he could make, and he had hoped to obtain the arithmetic mean as the best estimate. Instead, his estimator was the posterior median.


The method

The first clear and concise exposition of the method of least squares was published by Legendre in 1805. The technique is described as an algebraic procedure for fitting linear equations to data, and Legendre demonstrates the new method by analyzing the same data as Laplace for the shape of the Earth. Within ten years after Legendre's publication, the method of least squares had been adopted as a standard tool in astronomy and geodesy in France, Italy, and Prussia, which constitutes an extraordinarily rapid acceptance of a scientific technique.

In 1809 Carl Friedrich Gauss published his method of calculating the orbits of celestial bodies. In that work he claimed to have been in possession of the method of least squares since 1795. This naturally led to a priority dispute with Legendre. However, to Gauss's credit, he went beyond Legendre and succeeded in connecting the method of least squares with the principles of probability and with the normal distribution. He had managed to complete Laplace's program of specifying a mathematical form of the probability density for the observations, depending on a finite number of unknown parameters, and define a method of estimation that minimizes the error of estimation. Gauss showed that the arithmetic mean is indeed the best estimate of the location parameter by changing both the probability density and the method of estimation. He then turned the problem around by asking what form the density should have and what method of estimation should be used to get the arithmetic mean as the estimate of the location parameter. In this attempt, he invented the normal distribution.

An early demonstration of the strength of Gauss's method came when it was used to predict the future location of the newly discovered asteroid Ceres. On 1 January 1801, the Italian astronomer Giuseppe Piazzi discovered Ceres and was able to track its path for 40 days before it was lost in the glare of the Sun. Based on these data, astronomers desired to determine the location of Ceres after it emerged from behind the Sun without solving Kepler's complicated nonlinear equations of planetary motion. The only predictions that successfully allowed Hungarian astronomer Franz Xaver von Zach to relocate Ceres were those performed by the 24-year-old Gauss using least-squares analysis.

In 1810, after reading Gauss's work, Laplace, after proving the central limit theorem, used it to give a large-sample justification for the method of least squares and the normal distribution. In 1822, Gauss was able to state that the least-squares approach to regression analysis is optimal in the sense that in a linear model where the errors have a mean of zero, are uncorrelated, and have equal variances, the best linear unbiased estimator of the coefficients is the least-squares estimator. This result is known as the Gauss–Markov theorem.

The idea of least-squares analysis was also independently formulated by the American Robert Adrain in 1808. In the next two centuries workers in the theory of errors and in statistics found many different ways of implementing least squares.


Problem statement

The objective consists of adjusting the parameters of a model function to best fit a data set. A simple data set consists of ''n'' points (data pairs) (x_i, y_i), ''i'' = 1, …, ''n'', where x_i is an independent variable and y_i is a dependent variable whose value is found by observation. The model function has the form f(x, \boldsymbol \beta), where ''m'' adjustable parameters are held in the vector \boldsymbol \beta. The goal is to find the parameter values for the model that "best" fits the data. The fit of a model to a data point is measured by its residual, defined as the difference between the observed value of the dependent variable and the value predicted by the model:
:r_i = y_i - f(x_i, \boldsymbol \beta).
The least-squares method finds the optimal parameter values by minimizing the sum of squared residuals, S:
:S = \sum_{i=1}^n r_i^2.
In the simplest case f(x_i, \boldsymbol \beta) = \beta and the result of the least-squares method is the arithmetic mean of the input data.

An example of a model in two dimensions is that of the straight line. Denoting the y-intercept as \beta_0 and the slope as \beta_1, the model function is given by f(x, \boldsymbol \beta) = \beta_0 + \beta_1 x. See linear least squares for a fully worked out example of this model.

A data point may consist of more than one independent variable. For example, when fitting a plane to a set of height measurements, the plane is a function of two independent variables, ''x'' and ''z'', say. In the most general case there may be one or more independent variables and one or more dependent variables at each data point.

To the right is a residual plot illustrating random fluctuations about r_i = 0, indicating that a linear model (Y_i = \alpha + \beta x_i + U_i) is appropriate. U_i is an independent, random variable. If the residual points had some sort of a shape and were not randomly fluctuating, a linear model would not be appropriate. For example, if the residual plot had a parabolic shape as seen to the right, a parabolic model (Y_i = \alpha + \beta x_i + \gamma x_i^2 + U_i) would be appropriate for the data. The residuals for a parabolic model can be calculated via r_i = y_i - \hat{\alpha} - \hat{\beta} x_i - \widehat{\gamma} x_i^2.
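The simplest case above (a constant model, whose least-squares solution is the arithmetic mean) can be checked numerically. The following is a minimal sketch; the data values and the helper name `sum_of_squares` are illustrative assumptions, not part of the original text.

```python
# Hypothetical repeated measurements of a single quantity (illustrative values only).
y = [2.0, 4.0, 9.0]


def sum_of_squares(beta, observations):
    """S(beta): sum of squared residuals for the constant model f(x, beta) = beta."""
    return sum((y_i - beta) ** 2 for y_i in observations)


mean = sum(y) / len(y)

# S evaluated at the mean and at two nearby candidate values of beta.
s_at_mean = sum_of_squares(mean, y)
s_below = sum_of_squares(mean - 0.5, y)
s_above = sum_of_squares(mean + 0.5, y)

print(mean, s_at_mean, s_below, s_above)
```

Moving the parameter away from the mean in either direction increases S, which is what the minimization property predicts for this data set.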


Limitations

This regression formulation considers only observational errors in the dependent variable (but the alternative total least squares regression can account for errors in both variables). There are two rather different contexts with different implications:
* Regression for prediction. Here a model is fitted to provide a prediction rule for application in a situation similar to that in which the fitting data were collected. The dependent variables corresponding to such future application would be subject to the same types of observation error as those in the data used for fitting. It is therefore logically consistent to use the least-squares prediction rule for such data.
* Regression for fitting a "true relationship". In standard regression analysis that leads to fitting by least squares there is an implicit assumption that errors in the independent variable are zero or strictly controlled so as to be negligible. When errors in the independent variable are non-negligible, models of measurement error can be used; such methods can lead to parameter estimates, hypothesis testing and confidence intervals that take into account the presence of observation errors in the independent variables. An alternative approach is to fit a model by total least squares; this can be viewed as taking a pragmatic approach to balancing the effects of the different sources of error in formulating an objective function for use in model-fitting.


Solving the least squares problem

The minimum of the sum of squares is found by setting the gradient to zero. Since the model contains ''m'' parameters, there are ''m'' gradient equations:
:\frac{\partial S}{\partial \beta_j} = 2\sum_i r_i \frac{\partial r_i}{\partial \beta_j} = 0,\ j=1,\ldots,m,
and since r_i = y_i - f(x_i, \boldsymbol \beta), the gradient equations become
:-2\sum_i r_i \frac{\partial f(x_i, \boldsymbol \beta)}{\partial \beta_j} = 0,\ j=1,\ldots,m.
The gradient equations apply to all least squares problems. Each particular problem requires particular expressions for the model and its partial derivatives.
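As an illustration of how a particular model supplies its own partial derivatives, the straight-line model f(x, \boldsymbol \beta) = \beta_0 + \beta_1 x from the problem statement has \partial f / \partial \beta_0 = 1 and \partial f / \partial \beta_1 = x. Substituting these into the general gradient equations gives the following pair, shown here as a worked sketch:

```latex
\begin{aligned}
\frac{\partial S}{\partial \beta_0} &= -2\sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right) = 0, \\
\frac{\partial S}{\partial \beta_1} &= -2\sum_{i=1}^{n} x_i \left(y_i - \beta_0 - \beta_1 x_i\right) = 0.
\end{aligned}
```

Both equations are linear in \beta_0 and \beta_1, so this model can be solved directly without iteration.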


Linear least squares

A regression model is a linear one when the model comprises a linear combination of the parameters, i.e.,
:f(x, \boldsymbol \beta) = \sum_{j=1}^m \beta_j \phi_j(x),
where the function \phi_j is a function of x. Letting X_{ij} = \phi_j(x_i) and putting the independent and dependent variables in matrices X and Y, respectively, we can compute the least squares in the following way. Note that D is the set of all data.
:L(D, \boldsymbol{\beta}) = \left\| Y - X\boldsymbol{\beta} \right\|^2 = (Y - X\boldsymbol{\beta})^\mathsf{T} (Y - X\boldsymbol{\beta}) = Y^\mathsf{T}Y - Y^\mathsf{T}X\boldsymbol{\beta} - \boldsymbol{\beta}^\mathsf{T}X^\mathsf{T}Y + \boldsymbol{\beta}^\mathsf{T}X^\mathsf{T}X\boldsymbol{\beta}
The gradient of the loss is:
:\frac{\partial L(D, \boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \frac{\partial \left(Y^\mathsf{T}Y - Y^\mathsf{T}X\boldsymbol{\beta} - \boldsymbol{\beta}^\mathsf{T}X^\mathsf{T}Y + \boldsymbol{\beta}^\mathsf{T}X^\mathsf{T}X\boldsymbol{\beta}\right)}{\partial \boldsymbol{\beta}} = -2X^\mathsf{T}Y + 2X^\mathsf{T}X\boldsymbol{\beta}
Setting the gradient of the loss to zero and solving for \boldsymbol{\beta}, we get:
:-2X^\mathsf{T}Y + 2X^\mathsf{T}X\boldsymbol{\beta} = 0 \Rightarrow X^\mathsf{T}Y = X^\mathsf{T}X\boldsymbol{\beta}
:\hat{\boldsymbol{\beta}} = \left(X^\mathsf{T}X\right)^{-1} X^\mathsf{T}Y
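The closed-form solution can be sketched for the straight-line case, where the basis functions are \phi_1(x) = 1 and \phi_2(x) = x, so that X^T X is a 2×2 matrix that can be inverted by hand. The function name `fit_line` and the data are illustrative assumptions; a production implementation would more likely rely on a linear-algebra library and a more numerically stable factorization than explicit inversion.

```python
def fit_line(xs, ys):
    """Solve the 2x2 normal equations X^T X beta = X^T Y for the basis (1, x).

    Returns (intercept, slope).
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))

    # X^T X = [[n, sx], [sx, sxx]] and X^T Y = [sy, sxy].
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular: need at least two distinct x values")

    # Apply the explicit inverse of the 2x2 matrix X^T X to X^T Y.
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope


# Hypothetical noisy observations scattered around a straight line.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# At the solution the gradient X^T r vanishes (up to rounding error).
gradient = (sum(residuals), sum(x * r for x, r in zip(xs, residuals)))
print(intercept, slope, gradient)
```

Checking that X^T r is numerically zero is a convenient sanity test, because it is exactly the gradient condition from which the solution was derived.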


Non-linear least squares

There is, in some cases, a closed-form solution to a non-linear least squares problem, but in general there is not. In the case of no closed-form solution, numerical algorithms are used to find the value of the parameters \beta that minimizes the objective. Most algorithms involve choosing initial values for the parameters. Then, the parameters are refined iteratively, that is, the values are obtained by successive approximation:
:{\beta_j}^{k+1} = {\beta_j}^k + \Delta \beta_j,
where a superscript ''k'' is an iteration number, and the vector of increments \Delta \beta_j is called the shift vector. In some commonly used algorithms, at each iteration the model may be linearized by approximation to a first-order Taylor series expansion about \boldsymbol \beta^k:
:\begin{align} f(x_i, \boldsymbol \beta) &= f^k(x_i, \boldsymbol \beta) + \sum_j \frac{\partial f(x_i, \boldsymbol \beta)}{\partial \beta_j} \left(\beta_j - {\beta_j}^k \right) \\ &= f^k(x_i, \boldsymbol \beta) + \sum_j J_{ij} \,\Delta\beta_j. \end{align}
The Jacobian J is a function of constants, the independent variable ''and'' the parameters, so it changes from one iteration to the next. The residuals are given by
:r_i = y_i - f^k(x_i, \boldsymbol \beta) - \sum_{k=1}^{m} J_{ik} \,\Delta\beta_k = \Delta y_i - \sum_{j=1}^{m} J_{ij} \,\Delta\beta_j.
To minimize the sum of squares of r_i, the gradient equation is set to zero and solved for \Delta \beta_j:
:-2\sum_{i=1}^n J_{ij} \left( \Delta y_i - \sum_{k=1}^m J_{ik} \, \Delta \beta_k \right) = 0,
which, on rearrangement, become ''m'' simultaneous linear equations, the normal equations:
:\sum_{i=1}^{n} \sum_{k=1}^m J_{ij} J_{ik} \, \Delta \beta_k = \sum_{i=1}^n J_{ij} \, \Delta y_i \qquad (j=1,\ldots,m).
The normal equations are written in matrix notation as
:\left(\mathbf{J}^\mathsf{T} \mathbf{J}\right) \Delta \boldsymbol \beta = \mathbf{J}^\mathsf{T} \Delta \mathbf{y}.
These are the defining equations of the Gauss–Newton algorithm.
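The iteration can be sketched for a deliberately small case: a one-parameter model f(x, \beta) = e^{\beta x}, for which J^T J and J^T \Delta y are scalars and the normal equations reduce to a single division. The model, the synthetic data, the starting value and the function name are all assumptions chosen for illustration; real problems with several parameters need a linear solver for the shift vector and often safeguards such as step damping.

```python
import math


def gauss_newton_exponential(xs, ys, beta0, max_iter=50, tol=1e-12):
    """Fit f(x, beta) = exp(beta * x) with plain Gauss-Newton steps (one parameter)."""
    beta = beta0
    for _ in range(max_iter):
        predictions = [math.exp(beta * x) for x in xs]
        # Jacobian entries J_i = d f(x_i, beta) / d beta = x_i * exp(beta * x_i).
        jac = [x * p for x, p in zip(xs, predictions)]
        # Delta y_i = y_i - f^k(x_i, beta): residuals at the current iterate.
        delta_y = [y - p for y, p in zip(ys, predictions)]

        jtj = sum(j * j for j in jac)
        jty = sum(j * d for j, d in zip(jac, delta_y))
        if jtj == 0:
            raise ValueError("J^T J is zero: the data do not constrain beta")

        shift = jty / jtj  # solves (J^T J) * shift = J^T * delta_y
        beta += shift
        if abs(shift) < tol:
            break
    return beta


# Synthetic, noise-free data generated with beta = 0.5 (an illustrative choice).
xs = [0.0, 1.0, 2.0]
ys = [math.exp(0.5 * x) for x in xs]

beta_hat = gauss_newton_exponential(xs, ys, beta0=0.3)
print(beta_hat)
```

Because the Jacobian is recomputed from the current parameter value on every pass, the linearized problem changes from one iteration to the next, as described above.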


Differences between linear and nonlinear least squares

* The model function, ''f'', in LLSQ (linear least squares) is a linear combination of parameters of the form f = X_{i1}\beta_1 + X_{i2}\beta_2 + \cdots. The model may represent a straight line, a parabola or any other linear combination of functions. In NLLSQ (nonlinear least squares) the parameters appear as functions, such as \beta^2, e^{\beta x} and so forth. If the derivatives \partial f / \partial \beta_j are either constant or depend only on the values of the independent variable, the model is linear in the parameters. Otherwise the model is nonlinear.
* Initial values for the parameters are needed to find the solution to an NLLSQ problem; LLSQ does not require them.
* Solution algorithms for NLLSQ often require that the Jacobian can be calculated, similar to LLSQ. Analytical expressions for the partial derivatives can be complicated. If analytical expressions are impossible to obtain, either the partial derivatives must be calculated by numerical approximation or an estimate must be made of the Jacobian, often via finite differences.
* Non-convergence (failure of the algorithm to find a minimum) is a common phenomenon in NLLSQ.
* In LLSQ the sum of squares is a globally convex function of the parameters, so non-convergence is not an issue.
* Solving NLLSQ is usually an iterative process which has to be terminated when a convergence criterion is satisfied. LLSQ solutions can be computed using direct methods, although problems with large numbers of parameters are typically solved with iterative methods, such as the Gauss–Seidel method.
* In LLSQ the solution is unique, but in NLLSQ there may be multiple minima in the sum of squares.
* Under the condition that the errors are uncorrelated with the predictor variables, LLSQ yields unbiased estimates, but even under that condition NLLSQ estimates are generally biased.
These differences must be considered whenever the solution to a nonlinear least squares problem is being sought.
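Where analytical partial derivatives are unavailable, the list above mentions numerical approximation by finite differences. A minimal sketch of a central-difference estimate of \partial f / \partial \beta_j follows; the model a e^{bx}, the step size `h` and the function names are illustrative assumptions, and the best step size in practice depends on the scale of the parameters.

```python
import math


def numerical_partial(f, x, beta, j, h=1e-6):
    """Central-difference estimate of the partial derivative of f(x, beta) w.r.t. beta[j]."""
    up = list(beta)
    down = list(beta)
    up[j] += h
    down[j] -= h
    return (f(x, up) - f(x, down)) / (2 * h)


def model(x, beta):
    """Hypothetical nonlinear model f(x, beta) = beta[0] * exp(beta[1] * x)."""
    return beta[0] * math.exp(beta[1] * x)


beta = [2.0, 0.5]
x = 2.0

approx = numerical_partial(model, x, beta, j=1)
# Analytical derivative with respect to beta[1], used here only as a cross-check.
exact = beta[0] * x * math.exp(beta[1] * x)
print(approx, exact)
```

Evaluating such an estimate for every data point and every parameter yields one approximate Jacobian, at the cost of two model evaluations per entry.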


Example

Consider a simple example drawn from physics. A spring should obey Hooke's law, which states that the extension of a spring is proportional to the force, ''F'', applied to it.
:y = f(F, k) = kF
constitutes the model, where ''F'' is the independent variable. In order to estimate the force constant, ''k'', we conduct a series of ''n'' measurements with different forces to produce a set of data, (F_i, y_i),\ i=1,\dots,n, where y_i is a measured spring extension. Each experimental observation will contain some error, \varepsilon, and so we may specify an empirical model for our observations,
:y_i = kF_i + \varepsilon_i.
There are many methods we might use to estimate the unknown parameter ''k''. Since the ''n'' equations in our data comprise an overdetermined system with one unknown and ''n'' equations, we estimate ''k'' using least squares. The sum of squares to be minimized is
:S = \sum_{i=1}^n (y_i - kF_i)^2.
The least squares estimate of the force constant, ''k'', is given by
:\hat k = \frac{\sum_i F_i y_i}{\sum_i F_i^2}.
We assume that applying force ''causes'' the spring to expand. After having derived the force constant by least squares fitting, we predict the extension from Hooke's law.
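The estimate of the force constant is a one-line computation. The sketch below uses made-up force and extension values purely for illustration; the function name `estimate_force_constant` is likewise an assumption.

```python
def estimate_force_constant(forces, extensions):
    """Least-squares estimate k_hat = sum(F_i * y_i) / sum(F_i ** 2) for y = k * F."""
    numerator = sum(f * y for f, y in zip(forces, extensions))
    denominator = sum(f * f for f in forces)
    if denominator == 0:
        raise ValueError("at least one non-zero force is required")
    return numerator / denominator


# Hypothetical measurements: applied forces and observed extensions.
forces = [1.0, 2.0, 3.0]
extensions = [0.5, 1.1, 1.4]

k_hat = estimate_force_constant(forces, extensions)

# Predicted extension for a new force, using the fitted Hooke's law model.
predicted_extension = k_hat * 4.0
print(k_hat, predicted_extension)
```

Setting dS/dk = -2 \sum_i F_i (y_i - k F_i) = 0 and solving for k gives the same formula, so the code is a direct transcription of the gradient condition for this model.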


Uncertainty quantification

In a least squares calculation with unit weights, or in linear regression, the variance on the ''j''th parameter, denoted \operatorname{var}(\hat{\beta}_j), is usually estimated with
:\operatorname{var}(\hat{\beta}_j) = \sigma^2 \left(\left[X^\mathsf{T}X\right]^{-1}\right)_{jj} \approx \hat{\sigma}^2 C_{jj},
:\hat{\sigma}^2 \approx \frac{S}{n-m},
:C = \left(X^\mathsf{T}X\right)^{-1},
where the true error variance \sigma^2 is replaced by an estimate, the reduced chi-squared statistic, based on the minimized value of the residual sum of squares (objective function), ''S''. The denominator, ''n'' − ''m'', is the statistical degrees of freedom; see effective degrees of freedom for generalizations. ''C'' is the covariance matrix.
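For a one-parameter model such as y = kF (so ''m'' = 1), X is the single column of F_i values, C reduces to the scalar 1 / \sum_i F_i^2, and the variance estimate can be computed in a few lines. The data below are hypothetical and the variable names are illustrative assumptions.

```python
# Hypothetical measurements for the one-parameter model y = k * F.
forces = [1.0, 2.0, 3.0]
extensions = [0.5, 1.1, 1.4]

n = len(forces)  # number of observations
m = 1            # number of fitted parameters

sum_ff = sum(f * f for f in forces)
k_hat = sum(f * y for f, y in zip(forces, extensions)) / sum_ff

# Minimized residual sum of squares S.
S = sum((y - k_hat * f) ** 2 for f, y in zip(forces, extensions))

# Estimated error variance: S divided by the degrees of freedom n - m.
sigma2_hat = S / (n - m)

# With a single column X = (F_1, ..., F_n), C = (X^T X)^{-1} = 1 / sum(F_i^2).
C = 1.0 / sum_ff

var_k_hat = sigma2_hat * C
std_err_k_hat = var_k_hat ** 0.5
print(k_hat, S, sigma2_hat, var_k_hat, std_err_k_hat)
```

With more than one parameter, C becomes a full matrix and the variance of each parameter is read from the corresponding diagonal entry C_{jj}.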


Statistical testing

If the probability distribution of the parameters is known or an asymptotic approximation is made, confidence limits can be found. Similarly, statistical tests on the residuals can be conducted if the probability distribution of the residuals is known or assumed. We can derive the probability distribution of any linear combination of the dependent variables if the probability distribution of experimental errors is known or assumed. Inference is easy when assuming that the errors follow a normal distribution, consequently implying that the parameter estimates and residuals will also be normally distributed conditional on the values of the independent variables.

It is necessary to make assumptions about the nature of the experimental errors to test the results statistically. A common assumption is that the errors belong to a normal distribution. The central limit theorem supports the idea that this is a good approximation in many cases.
* The Gauss–Markov theorem. In a linear model in which the errors have expectation zero conditional on the independent variables, are uncorrelated and have equal variances, the best linear unbiased estimator of any linear combination of the observations is its least-squares estimator. "Best" means that the least squares estimators of the parameters have minimum variance. The assumption of equal variance is valid when the errors all belong to the same distribution.
* If the errors belong to a normal distribution, the least-squares estimators are also the maximum likelihood estimators in a linear model.
However, suppose the errors are not normally distributed. In that case, a central limit theorem often nonetheless implies that the parameter estimates will be approximately normally distributed so long as the sample is reasonably large. For this reason, given the important property that the error mean is independent of the independent variables, the distribution of the error term is not an important issue in regression analysis. Specifically, it is not typically important whether the error term follows a normal distribution.


Weighted least squares

A special case of generalized least squares called weighted least squares occurs when all the off-diagonal entries of ''Ω'' (the correlation matrix of the residuals) are zero; the variances of the observations (along the covariance matrix diagonal) may still be unequal (heteroscedasticity). In simpler terms, heteroscedasticity is when the variance of Y_i depends on the value of x_i, which causes the residual plot to show a "fanning out" effect towards larger Y_i values, as seen in the residual plot to the right. Homoscedasticity, on the other hand, is the assumption that the variance of the error term U_i, and hence of Y_i given x_i, is the same for every observation.
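
The weighting idea can be sketched for a straight-line fit. This is a minimal illustration rather than a reference implementation: the function name `weighted_line_fit`, the sample data, and the choice of weights w_i = 1/sigma_i^2 (inverse of an assumed known error variance for each observation) are assumptions made for the example and do not come from the text above.

```python
def weighted_line_fit(x, y, w):
    """Fit y ~ a + b*x by minimizing sum(w_i * (y_i - a - b*x_i)**2)."""
    sw = sum(w)
    # Weighted means of x and y.
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    # Weighted (unnormalized) covariance of x, y and variance of x.
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b


# Hypothetical data: the last observation is assumed to be much noisier.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 12.0]
sigma = [0.1, 0.1, 0.2, 0.2, 2.0]        # assumed error standard deviations
w = [1.0 / s ** 2 for s in sigma]        # assumed weights: inverse variances

a_w, b_w = weighted_line_fit(x, y, w)                # weighted fit
a_o, b_o = weighted_line_fit(x, y, [1.0] * len(x))   # equal weights = ordinary fit
print("weighted slope:", b_w, "ordinary slope:", b_o)
```

With equal weights the function reduces to ordinary least squares; with unequal weights the noisy observation pulls the fitted line less.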


Relationship to principal components

The first principal component about the mean of a set of points can be represented by the line that most closely approaches the data points (as measured by squared distance of closest approach, i.e. perpendicular to the line). In contrast, linear least squares minimizes the distance in the y direction only. Thus, although the two use a similar error metric, linear least squares treats one dimension of the data preferentially, while principal component analysis (PCA) treats all dimensions equally.
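
The difference between the two criteria can be seen on a small two-dimensional data set. The sketch below is illustrative only: the helper names and the data are hypothetical, and the direction of the first principal component is obtained from the standard principal-axis angle of the 2×2 scatter matrix, assuming the resulting line is not vertical.

```python
import math


def _centered_sums(x, y):
    """Return sums of squares and cross-products about the mean."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxx, syy, sxy


def ols_slope(x, y):
    """Slope minimizing squared vertical (y-direction) distances."""
    sxx, _, sxy = _centered_sums(x, y)
    return sxy / sxx


def first_pc_slope(x, y):
    """Slope of the line minimizing squared perpendicular distances."""
    sxx, syy, sxy = _centered_sums(x, y)
    # Angle of the major axis of the scatter matrix [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return math.tan(theta)


# Hypothetical data with scatter in both coordinates.
x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 2.0, 1.0, 3.0]
print("least-squares slope:", ols_slope(x, y))
print("first principal component slope:", first_pc_slope(x, y))
```

When the points lie exactly on a line, both slopes coincide; with scatter, the least-squares slope is pulled towards zero relative to the principal-component slope because only vertical deviations are penalized.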


Relationship to measure theory

Notable statistician Sara van de Geer used empirical process theory and the Vapnik–Chervonenkis dimension to prove that a least-squares estimator can be interpreted as a measure on the space of square-integrable functions.


Regularization


Tikhonov regularization

In some contexts, a regularized version of the least squares solution may be preferable. Tikhonov regularization (or ridge regression) adds to the least squares formulation the constraint that \|\beta\|_2^2, the squared \ell_2-norm of the parameter vector, is not greater than a given value, leading to a constrained minimization problem. This is equivalent to the unconstrained minimization problem where the objective function is the residual sum of squares plus a penalty term \alpha\|\beta\|_2^2 and \alpha is a tuning parameter (this is the Lagrangian form of the constrained minimization problem). In a Bayesian context, this is equivalent to placing a zero-mean normally distributed prior on the parameter vector.
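
For a linear model, the penalized objective above has the closed-form minimizer (X^T X + \alpha I)^{-1} X^T y. The sketch below works this out for two predictors and no intercept, so that the 2×2 system can be solved by hand; the function name `ridge_2d` and the data are hypothetical and chosen only for illustration.

```python
def ridge_2d(X, y, alpha):
    """Minimize sum((y_i - b1*x_i1 - b2*x_i2)**2) + alpha*(b1**2 + b2**2)."""
    # Entries of the symmetric matrix X^T X + alpha*I.
    a11 = sum(r[0] * r[0] for r in X) + alpha
    a12 = sum(r[0] * r[1] for r in X)
    a22 = sum(r[1] * r[1] for r in X) + alpha
    # Entries of the vector X^T y.
    c1 = sum(r[0] * yi for r, yi in zip(X, y))
    c2 = sum(r[1] * yi for r, yi in zip(X, y))
    # Solve the 2x2 linear system with the explicit inverse.
    det = a11 * a22 - a12 * a12
    return ((a22 * c1 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det)


# Hypothetical data with two strongly correlated predictors.
X = [[1.0, 1.0], [1.0, 1.1], [1.0, 0.9], [2.0, 2.1]]
y = [2.0, 2.2, 1.9, 4.1]
for alpha in (0.0, 0.1, 1.0, 10.0):
    print(alpha, ridge_2d(X, y, alpha))
```

Setting `alpha` to zero recovers the ordinary least-squares solution (provided the two columns are not collinear); larger values shrink the coefficients towards zero.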


Lasso method

An alternative regularized version of least squares is Lasso (least absolute shrinkage and selection operator), which uses the constraint that \|\beta\|_1, the L1-norm of the parameter vector, is no greater than a given value. (As above, one can show using Lagrange multipliers that this is equivalent to an unconstrained minimization of the least-squares objective with \alpha\|\beta\|_1 added.) In a Bayesian context, this is equivalent to placing a zero-mean Laplace prior distribution on the parameter vector. The optimization problem may be solved using quadratic programming or more general convex optimization methods, as well as by specific algorithms such as the least angle regression algorithm.

One of the prime differences between Lasso and ridge regression is that in ridge regression, as the penalty is increased, all parameters are reduced while still remaining non-zero, while in Lasso, increasing the penalty will cause more and more of the parameters to be driven to zero. This is an advantage of Lasso over ridge regression, as driving parameters to zero deselects the features from the regression. Thus, Lasso automatically selects more relevant features and discards the others, whereas ridge regression never fully discards any features. Some feature selection techniques are developed based on the Lasso, including Bolasso, which bootstraps samples, and FeaLect, which analyzes the regression coefficients corresponding to different values of \alpha to score all the features.

The L1-regularized formulation is useful in some contexts due to its tendency to prefer solutions where more parameters are zero, which gives solutions that depend on fewer variables. For this reason, the Lasso and its variants are fundamental to the field of compressed sensing. An extension of this approach is elastic net regularization.
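
The way the L1 penalty drives parameters exactly to zero can be illustrated with coordinate descent and soft-thresholding. This is a swapped-in technique chosen for brevity, not one of the solvers named above, and the sketch makes several assumptions: the objective is the residual sum of squares plus \alpha\|\beta\|_1, there is no intercept, no column of the design matrix is all zeros, and a fixed number of sweeps stands in for a convergence check. The names `soft_threshold` and `lasso_cd` are hypothetical.

```python
def soft_threshold(rho, t):
    """Shrink rho towards zero by t, returning exactly 0.0 when |rho| <= t."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def lasso_cd(X, y, alpha, n_iter=200):
    """Minimize sum((y_i - x_i . beta)**2) + alpha * sum(|beta_j|)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of column j with the partial residual
            z = 0.0    # squared norm of column j (assumed non-zero)
            for i in range(n):
                partial = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            # Exact minimizer over beta_j with the other coordinates held fixed.
            beta[j] = soft_threshold(rho, alpha / 2.0) / z
    return beta


# Hypothetical data: the second predictor has only a weak effect.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.5]
for alpha in (0.0, 0.5, 2.0):
    print(alpha, lasso_cd(X, y, alpha))
```

As `alpha` grows, the weakly supported coefficient is set exactly to zero while the other is only shrunk, which is the feature-selection behaviour described above; a ridge penalty on the same data would shrink both coefficients without zeroing either.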


See also

* Least-squares adjustment
* Bayesian MMSE estimator
* Best linear unbiased estimator (BLUE)
* Best linear unbiased prediction (BLUP)
* Gauss–Markov theorem
* ''L''2 norm
* Least absolute deviations
* Least-squares spectral analysis
* Measurement uncertainty
* Orthogonal projection
* Proximal gradient methods for learning
* Quadratic loss function
* Root mean square
* Squared deviations from the mean

