Linear least squares (LLS) is the least squares approximation of linear functions to data. It is a set of formulations for solving statistical problems involved in linear regression, including variants for ordinary (unweighted), weighted, and generalized (correlated) residuals. Numerical methods for linear least squares include inverting the matrix of the normal equations and orthogonal decomposition methods.


Basic formulation

Consider the linear equation

A x = b,

where A \in \mathbb{R}^{m \times n} and b \in \mathbb{R}^m are given and x \in \mathbb{R}^n is the variable to be computed. When m > n, the system generally has no solution. For example, there is no value of x that satisfies
\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} x = \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix},
because the first two rows require that x = (1, 1), but then the third row is not satisfied. Thus, for m > n, the goal of solving the system exactly is typically replaced by finding the value of x that minimizes some error. There are many ways the error can be defined, but one of the most common is \|Ax - b\|^2. This produces a minimization problem, called a ''least squares problem'':

\min_{x} \|Ax - b\|^2.

The solution to the least squares problem is computed by solving the ''normal equation''

A^\top A x = A^\top b,

where A^\top denotes the transpose of A. Continuing the example above, with
A = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \quad \text{and} \quad b = \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix},
we find
A^\top A = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}
and
A^\top b = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}.
Solving the normal equation gives x = (1/3, 1/3).
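As an added illustration (not part of the original text), this example can be checked numerically; the sketch below solves the normal equation directly and also calls a library least-squares routine based on orthogonal decomposition.

```python
# Minimal sketch of the worked example A x = b, solved in the least-squares sense.
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 1.0, 0.0])

# Via the normal equation A^T A x = A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Via an orthogonal-decomposition-based solver (numerically preferable)
x_lstsq, residual, rank, sv = np.linalg.lstsq(A, b, rcond=None)

print(x_normal)  # [0.3333... 0.3333...], i.e. x = (1/3, 1/3)
print(x_lstsq)
```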


Formulations for linear regression

The three main linear least squares formulations are:

* ''Ordinary least squares'' (OLS) is the most common estimator. OLS estimates are commonly used to analyze both experimental and observational data. The OLS method minimizes the sum of squared residuals and leads to a closed-form expression for the estimated value of the unknown parameter vector ''β'':
\hat{\boldsymbol\beta} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1} \mathbf{X}^\mathsf{T} \mathbf{y},
where \mathbf{y} is a vector whose ''i''th element is the ''i''th observation of the dependent variable, and \mathbf{X} is the design matrix whose ''ij'' element is the ''i''th observation of the ''j''th independent variable. The estimator is unbiased and consistent if the errors have finite variance and are uncorrelated with the regressors:
\operatorname{E}[\,\mathbf{x}_i \varepsilon_i\,] = 0,
where \mathbf{x}_i is the transpose of row ''i'' of the matrix \mathbf{X}. It is also efficient under the assumption that the errors have finite variance and are homoscedastic, meaning that \operatorname{E}[\varepsilon_i^2 \mid \mathbf{x}_i] does not depend on ''i''. The condition that the errors are uncorrelated with the regressors will generally be satisfied in an experiment, but in the case of observational data, it is difficult to exclude the possibility of an omitted covariate ''z'' that is related to both the observed covariates and the response variable. The existence of such a covariate will generally lead to a correlation between the regressors and the response variable, and hence to an inconsistent estimator of ''β''. The condition of homoscedasticity can fail with either experimental or observational data. If the goal is either inference or predictive modeling, the performance of OLS estimates can be poor if multicollinearity is present, unless the sample size is large.

* ''Weighted least squares'' (WLS) is used when heteroscedasticity is present in the error terms of the model.

* ''Generalized least squares'' (GLS) is an extension of the OLS method that allows efficient estimation of ''β'' when heteroscedasticity, or correlations, or both are present among the error terms of the model, as long as the form of the heteroscedasticity and correlation is known independently of the data. To handle heteroscedasticity when the error terms are uncorrelated with each other, GLS minimizes a weighted analogue of the sum of squared residuals from OLS regression, where the weight for the ''i''th case is inversely proportional to var(''ε''_''i''). This special case of GLS is called "weighted least squares". The GLS solution to an estimation problem is
\hat{\boldsymbol\beta} = (\mathbf{X}^\mathsf{T} \boldsymbol\Omega^{-1} \mathbf{X})^{-1} \mathbf{X}^\mathsf{T} \boldsymbol\Omega^{-1} \mathbf{y},
where ''Ω'' is the covariance matrix of the errors. GLS can be viewed as applying a linear transformation to the data so that the assumptions of OLS are met for the transformed data. For GLS to be applied, the covariance structure of the errors must be known up to a multiplicative constant. A minimal numerical sketch comparing OLS and GLS follows this list.
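The closed-form OLS and GLS expressions above translate directly into code. The following NumPy sketch is an added illustration, not part of the original text: the simulated data, the diagonal covariance matrix, and all variable names are invented for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design matrix and response (invented for illustration)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([2.0, 0.5])
# Heteroscedastic errors: standard deviation grows with the regressor
sigma = 0.5 + np.abs(X[:, 1])
y = X @ beta_true + rng.normal(scale=sigma)

# OLS: beta_hat = (X^T X)^{-1} X^T y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# GLS: beta_hat = (X^T Omega^{-1} X)^{-1} X^T Omega^{-1} y,
# here with a diagonal Omega (this special case is weighted least squares)
Omega_inv = np.diag(1.0 / sigma**2)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

print("OLS:    ", beta_ols)
print("GLS/WLS:", beta_gls)
```

Both estimators are unbiased here; the GLS/WLS estimate typically has smaller variance because it downweights the noisier observations.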


Alternative formulations

Other formulations include:

* ''Iteratively reweighted least squares'' (IRLS) is used when heteroscedasticity, or correlations, or both are present among the error terms of the model, but where little is known about the covariance structure of the errors independently of the data. In the first iteration, OLS, or GLS with a provisional covariance structure, is carried out, and the residuals are obtained from the fit. Based on the residuals, an improved estimate of the covariance structure of the errors can usually be obtained. A subsequent GLS iteration is then performed using this estimate of the error structure to define the weights. The process can be iterated to convergence, but in many cases, only one iteration is sufficient to achieve an efficient estimate of ''β''. A minimal IRLS sketch is given after this list.

* ''Instrumental variables'' regression (IV) can be performed when the regressors are correlated with the errors. In this case, we need the existence of some auxiliary ''instrumental variables'' z_i such that \operatorname{E}[z_i \varepsilon_i] = 0. If \mathbf{Z} is the matrix of instruments, then the estimator can be given in closed form as
\hat{\boldsymbol\beta} = (\mathbf{X}^\mathsf{T}\mathbf{Z}(\mathbf{Z}^\mathsf{T}\mathbf{Z})^{-1}\mathbf{Z}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\mathbf{Z}(\mathbf{Z}^\mathsf{T}\mathbf{Z})^{-1}\mathbf{Z}^\mathsf{T}\mathbf{y}.
''Optimal instruments'' regression is an extension of classical IV regression to the situation where \operatorname{E}[\varepsilon_i \mid z_i] = 0.

* ''Total least squares'' (TLS) is an approach to least squares estimation of the linear regression model that treats the covariates and response variable in a more geometrically symmetric manner than OLS. It is one approach to handling the "errors in variables" problem, and is also sometimes used even when the covariates are assumed to be error-free.

* ''Linear Template Fit'' (LTF) combines a linear regression with (generalized) least squares in order to determine the best estimator. The Linear Template Fit addresses the frequent issue that the residuals cannot be expressed analytically or are too time-consuming to evaluate repeatedly, as is often the case in iterative minimization algorithms. In the Linear Template Fit, the residuals are estimated from the random variables and from a linear approximation of the underlying ''true'' model, while the true model needs to be provided for at least n + 1 (where n is the number of estimators) distinct reference values of ''β''. The true distribution is then approximated by a linear regression, and the best estimators are obtained in closed form as
\hat{\boldsymbol\beta} = ((\mathbf{Y}\mathbf{\tilde M})^\mathsf{T} \boldsymbol\Omega^{-1} \mathbf{Y}\mathbf{\tilde M})^{-1} (\mathbf{Y}\mathbf{\tilde M})^\mathsf{T} \boldsymbol\Omega^{-1} (\mathbf{d} - \mathbf{Y}\mathbf{\bar m}),
where \mathbf{Y} denotes the template matrix with the values of the known or previously determined model for any of the reference values ''β'', \mathbf{d} are the random variables (e.g. a measurement), and the matrix \mathbf{\tilde M} and the vector \mathbf{\bar m} are calculated from the values of ''β''. The LTF can also be expressed for log-normally distributed random variables. A generalization of the LTF is the Quadratic Template Fit, which assumes a second-order regression of the model, requires predictions for at least n^2 + 2n distinct values of ''β'', and finds the best estimator using Newton's method.

* ''Percentage least squares'' focuses on reducing percentage errors, which is useful in the field of forecasting or time series analysis. It is also useful in situations where the dependent variable has a wide range without constant variance, as here the larger residuals at the upper end of the range would dominate if OLS were used. When the percentage or relative error is normally distributed, least squares percentage regression provides maximum likelihood estimates. Percentage regression is linked to a multiplicative error model, whereas OLS is linked to models containing an additive error term.

* ''Constrained least squares'' denotes a linear least squares problem with additional constraints on the solution.
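The IRLS idea described above can be sketched in a few lines. This is an added, simplified illustration: the variance model (error standard deviation proportional to the regressor) and the data are invented, and a production implementation would use a more careful covariance estimate.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(1, 10, size=n)])
beta_true = np.array([1.0, 0.3])
y = X @ beta_true + rng.normal(scale=0.2 * X[:, 1])   # noise grows with x

# Iteration 0: ordinary least squares
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Iteratively reweighted least squares with a simple variance model:
# var(eps_i) ~ (gamma * x_i)^2, with gamma estimated from the current residuals.
for _ in range(5):
    resid = y - X @ beta
    gamma = np.mean(np.abs(resid) / X[:, 1])          # crude scale estimate
    w = 1.0 / (gamma * X[:, 1])**2                    # weights ~ 1 / var(eps_i)
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # weighted (GLS-style) step

print(beta)   # close to beta_true
```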


Objective function

In OLS (i.e., assuming unweighted observations), the optimal value of the objective function is found by substituting the optimal expression for the coefficient vector:
S = \mathbf{y}^\mathsf{T} (\mathbf{I} - \mathbf{H})^\mathsf{T} (\mathbf{I} - \mathbf{H}) \mathbf{y} = \mathbf{y}^\mathsf{T} (\mathbf{I} - \mathbf{H}) \mathbf{y},
where \mathbf{H} = \mathbf{X}(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1} \mathbf{X}^\mathsf{T}, the latter equality holding since (\mathbf{I} - \mathbf{H}) is symmetric and idempotent. It can be shown from this that under an appropriate assignment of weights the expected value of ''S'' is m - n. If instead unit weights are assumed, the expected value of ''S'' is (m - n)\sigma^2, where \sigma^2 is the variance of each observation.

If it is assumed that the residuals belong to a normal distribution, the objective function, being a sum of weighted squared residuals, will belong to a chi-squared (\chi^2) distribution with m - n degrees of freedom. Percentile values of the \chi^2 distribution can be used as a statistical criterion for the goodness of fit; when unit weights are used, these values should be divided by the variance of an observation. For WLS, the ordinary objective function above is replaced by a weighted sum of squared residuals.
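As an added numerical check (not part of the original text), the following sketch simulates repeated data sets with unit weights and verifies that the average of S is close to (m - n)\sigma^2; the model dimensions and noise level are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, sigma = 50, 3, 2.0                   # observations, parameters, noise std. dev.
X = rng.normal(size=(m, n))
beta = rng.normal(size=n)
H = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix X (X^T X)^{-1} X^T

S_values = []
for _ in range(2000):
    y = X @ beta + rng.normal(scale=sigma, size=m)
    S = y @ (np.eye(m) - H) @ y            # residual sum of squares
    S_values.append(S)

print(np.mean(S_values))                   # close to (m - n) * sigma**2 = 188
print((m - n) * sigma**2)
```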


Discussion

In statistics and mathematics, linear least squares is an approach to fitting a mathematical or statistical model to data in cases where the idealized value provided by the model for any data point is expressed linearly in terms of the unknown parameters of the model. The resulting fitted model can be used to summarize the data, to predict unobserved values from the same system, and to understand the mechanisms that may underlie the system.

Mathematically, linear least squares is the problem of approximately solving an overdetermined system of linear equations A x = b, where b is not an element of the column space of the matrix A. The approximate solution is realized as an exact solution to A x = b', where b' is the projection of b onto the column space of A. The best approximation is then that which minimizes the sum of squared differences between the data values and their corresponding modeled values. The approach is called ''linear'' least squares since the assumed function is linear in the parameters to be estimated. Linear least squares problems are convex and have a closed-form solution that is unique, provided that the number of data points used for fitting equals or exceeds the number of unknown parameters, except in special degenerate situations. In contrast, non-linear least squares problems generally must be solved by an iterative procedure, and the problems can be non-convex with multiple optima for the objective function. If prior distributions are available, then even an underdetermined system can be solved using the Bayesian MMSE estimator.

In statistics, linear least squares problems correspond to a particularly important type of statistical model called linear regression, which arises as a particular form of regression analysis. One basic form of such a model is an ordinary least squares model. The present article concentrates on the mathematical aspects of linear least squares problems; the formulation and interpretation of statistical regression models and the statistical inferences related to them are dealt with in the articles just mentioned. See outline of regression analysis for an outline of the topic.


Properties

If the experimental errors \varepsilon are uncorrelated, have a mean of zero and a constant variance \sigma^2, the Gauss–Markov theorem states that the least-squares estimator \hat{\boldsymbol\beta} has the minimum variance of all estimators that are linear combinations of the observations. In this sense it is the best, or optimal, estimator of the parameters. Note particularly that this property is independent of the statistical distribution function of the errors. In other words, ''the distribution function of the errors need not be a normal distribution''. However, for some probability distributions, there is no guarantee that the least-squares solution is even possible given the observations; still, in such cases it is the best estimator that is both linear and unbiased.

For example, it is easy to show that the arithmetic mean of a set of measurements of a quantity is the least-squares estimator of the value of that quantity. If the conditions of the Gauss–Markov theorem apply, the arithmetic mean is optimal, whatever the distribution of errors of the measurements might be.

However, in the case that the experimental errors do belong to a normal distribution, the least-squares estimator is also a maximum likelihood estimator. These properties underpin the use of the method of least squares for all types of data fitting, even when the assumptions are not strictly valid.
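To make the statement about the arithmetic mean concrete, a short derivation (added here for illustration) runs as follows. Fitting the constant model f(x, \boldsymbol\beta) = \beta_1 to measurements y_1, \dots, y_m by least squares means minimizing
S(\beta_1) = \sum_{i=1}^{m} (y_i - \beta_1)^2.
Setting the derivative to zero,
\frac{dS}{d\beta_1} = -2 \sum_{i=1}^{m} (y_i - \beta_1) = 0 \quad\Longrightarrow\quad \hat\beta_1 = \frac{1}{m} \sum_{i=1}^{m} y_i = \bar{y},
so the least-squares estimate of the quantity is exactly the arithmetic mean of the measurements.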


Limitations

An assumption underlying the treatment given above is that the independent variable, ''x'', is free of error. In practice, the errors on the measurements of the independent variable are usually much smaller than the errors on the dependent variable and can therefore be ignored. When this is not the case, total least squares, or more generally errors-in-variables models, or ''rigorous least squares'', should be used. This can be done by adjusting the weighting scheme to take into account errors on both the dependent and independent variables and then following the standard procedure.

In some cases the (weighted) normal equations matrix ''X''^T''X'' is ill-conditioned. When fitting polynomials the normal equations matrix is a Vandermonde matrix. Vandermonde matrices become increasingly ill-conditioned as the order of the matrix increases. In these cases, the least squares estimate amplifies the measurement noise and may be grossly inaccurate. Various regularization techniques can be applied in such cases, the most common of which is called ridge regression. If further information about the parameters is known, for example, a range of possible values of \boldsymbol\beta, then various techniques can be used to increase the stability of the solution. For example, see constrained least squares.

Another drawback of the least squares estimator is the fact that the norm of the residuals, \|\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}\|, is minimized, whereas in some cases one is truly interested in obtaining a small error in the parameter \boldsymbol\beta, e.g., a small value of \|\boldsymbol\beta - \hat{\boldsymbol\beta}\|. However, since the true parameter \boldsymbol\beta is necessarily unknown, this quantity cannot be directly minimized. If a prior probability on \boldsymbol\beta is known, then a Bayes estimator can be used to minimize the mean squared error, \operatorname{E}\left\{ \|\boldsymbol\beta - \hat{\boldsymbol\beta}\|^2 \right\}. The least squares method is often applied when no prior is known. When several parameters are being estimated jointly, better estimators can be constructed, an effect known as Stein's phenomenon. For example, if the measurement error is Gaussian, several estimators are known which dominate, or outperform, the least squares technique; the best known of these is the James–Stein estimator. This is an example of the more general shrinkage estimators that have been applied to regression problems.
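The ill-conditioning of polynomial normal equations and the ridge remedy can be illustrated numerically. The sketch below is an added example; the polynomial degree, the data, and the regularization strength lam are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.05, size=x.size)

degree = 14
X = np.vander(x, degree + 1)                 # Vandermonde design matrix
print("condition number of X^T X:", np.linalg.cond(X.T @ X))

# Plain normal equations: amplifies noise when X^T X is ill-conditioned
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge regression: add lam * I to stabilize the solution
lam = 1e-6
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)

print("norm of plain LS coefficients:", np.linalg.norm(beta_ls))
print("norm of ridge coefficients:   ", np.linalg.norm(beta_ridge))
```

The ridge coefficients are much smaller in norm, reflecting the stabilizing effect of the regularization term.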


Applications

* Polynomial fitting: models are polynomials in an independent variable, ''x'':
** Straight line: f(x, \boldsymbol\beta) = \beta_1 + \beta_2 x.
** Quadratic: f(x, \boldsymbol\beta) = \beta_1 + \beta_2 x + \beta_3 x^2.
** Cubic, quartic and higher polynomials. For regression with high-order polynomials, the use of orthogonal polynomials is recommended (a brief numerical sketch follows this list).
* Numerical smoothing and differentiation, an application of polynomial fitting.
* Multinomials in more than one independent variable, including surface fitting.
* Curve fitting with B-splines.
* Chemometrics: calibration curves, standard addition, Gran plots, analysis of mixtures.
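As an added illustration of the recommendation to use orthogonal polynomials for high-order fits, the sketch below compares a monomial-basis fit with a Chebyshev-basis fit using NumPy's polynomial module; the test function, degree, and noise level are invented.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 40)
y = np.exp(x) + rng.normal(scale=0.01, size=x.size)

# High-order fit in the monomial basis vs. an orthogonal (Chebyshev) basis.
deg = 12
mono_coef = np.polynomial.polynomial.polyfit(x, y, deg)
cheb_coef = np.polynomial.chebyshev.chebfit(x, y, deg)

# Both are linear least squares fits; the Chebyshev (orthogonal) basis keeps
# the design matrix far better conditioned at high degree.
print(np.polynomial.polynomial.polyval(0.5, mono_coef))
print(np.polynomial.chebyshev.chebval(0.5, cheb_coef))
print(np.exp(0.5))   # value of the underlying test function
```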


Uses in data fitting

The primary application of linear least squares is in data fitting. Given a set of ''m'' data points y_1, y_2, \dots, y_m, consisting of experimentally measured values taken at ''m'' values x_1, x_2, \dots, x_m of an independent variable (x_i may be scalar or vector quantities), and given a model function y = f(x, \boldsymbol\beta), with \boldsymbol\beta = (\beta_1, \beta_2, \dots, \beta_n), it is desired to find the parameters \beta_j such that the model function "best" fits the data. In linear least squares, linearity is meant to be with respect to the parameters \beta_j, so
f(x, \boldsymbol\beta) = \sum_{j=1}^{n} \beta_j \varphi_j(x).
Here, the functions \varphi_j may be nonlinear with respect to the variable x.

Ideally, the model function fits the data exactly, so y_i = f(x_i, \boldsymbol\beta) for all i = 1, 2, \dots, m. This is usually not possible in practice, as there are more data points than there are parameters to be determined. The approach chosen then is to find the minimal possible value of the sum of squares of the residuals r_i(\boldsymbol\beta) = y_i - f(x_i, \boldsymbol\beta), \ (i = 1, 2, \dots, m), that is, to minimize the function
S(\boldsymbol\beta) = \sum_{i=1}^{m} r_i^2(\boldsymbol\beta).
After substituting for r_i and then for f, this minimization problem becomes the quadratic minimization problem above with X_{ij} = \varphi_j(x_i), and the best fit can be found by solving the normal equations.
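A minimal sketch of this recipe (an added illustration; the basis functions \varphi_j and the data are invented): build the design matrix X_{ij} = \varphi_j(x_i) and solve the normal equations.

```python
import numpy as np

# Hypothetical basis functions phi_j(x); the model is linear in beta even though
# the phi_j themselves are nonlinear in x.
basis = [lambda x: np.ones_like(x), np.sin, np.cos]

rng = np.random.default_rng(4)
x = np.linspace(0, 2 * np.pi, 25)
y = 1.0 + 2.0 * np.sin(x) - 0.5 * np.cos(x) + rng.normal(scale=0.1, size=x.size)

# Design matrix X_ij = phi_j(x_i)
X = np.column_stack([phi(x) for phi in basis])

# Best-fit parameters from the normal equations X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # approximately [1.0, 2.0, -0.5]
```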


Example

A hypothetical researcher conducts an experiment and obtains four (x, y) data points: (1, 6), (2, 5), (3, 7), and (4, 10) (shown in red in the diagram on the right). Because of exploratory data analysis or prior knowledge of the subject matter, the researcher suspects that the y-values depend on the x-values systematically. The x-values are assumed to be exact, but the y-values contain some uncertainty or "noise", because of the phenomenon being studied, imperfections in the measurements, etc.


Fitting a line

One of the simplest possible relationships between x and y is a line, y = \beta_1 + \beta_2 x. The intercept \beta_1 and the slope \beta_2 are initially unknown. The researcher would like to find values of \beta_1 and \beta_2 that cause the line to pass through the four data points. In other words, the researcher would like to solve the system of linear equations
\begin{align}
\beta_1 + 1\beta_2 &= 6, \\
\beta_1 + 2\beta_2 &= 5, \\
\beta_1 + 3\beta_2 &= 7, \\
\beta_1 + 4\beta_2 &= 10.
\end{align}
With four equations in two unknowns, this system is overdetermined. There is no exact solution. To consider approximate solutions, one introduces residuals r_1, r_2, r_3, r_4 into the equations:
\begin{align}
\beta_1 + 1\beta_2 + r_1 &= 6, \\
\beta_1 + 2\beta_2 + r_2 &= 5, \\
\beta_1 + 3\beta_2 + r_3 &= 7, \\
\beta_1 + 4\beta_2 + r_4 &= 10.
\end{align}
The ''i''th residual r_i is the misfit between the ''i''th observation y_i and the ''i''th prediction \beta_1 + \beta_2 x_i:
\begin{align}
r_1 &= 6 - (\beta_1 + 1\beta_2), \\
r_2 &= 5 - (\beta_1 + 2\beta_2), \\
r_3 &= 7 - (\beta_1 + 3\beta_2), \\
r_4 &= 10 - (\beta_1 + 4\beta_2).
\end{align}
Among all approximate solutions, the researcher would like to find the one that is "best" in some sense. In least squares, one focuses on the sum S of the squared residuals:
\begin{align}
S(\beta_1, \beta_2) &= r_1^2 + r_2^2 + r_3^2 + r_4^2 \\
&= [6 - (\beta_1 + 1\beta_2)]^2 + [5 - (\beta_1 + 2\beta_2)]^2 + [7 - (\beta_1 + 3\beta_2)]^2 + [10 - (\beta_1 + 4\beta_2)]^2 \\
&= 4\beta_1^2 + 30\beta_2^2 + 20\beta_1\beta_2 - 56\beta_1 - 154\beta_2 + 210.
\end{align}
The best solution is defined to be the one that minimizes S with respect to \beta_1 and \beta_2. The minimum can be calculated by setting the partial derivatives of S to zero:
\begin{align}
0 &= \frac{\partial S}{\partial \beta_1} = 8\beta_1 + 20\beta_2 - 56, \\
0 &= \frac{\partial S}{\partial \beta_2} = 20\beta_1 + 60\beta_2 - 154.
\end{align}
These ''normal equations'' constitute a system of two linear equations in two unknowns. The solution is \beta_1 = 3.5 and \beta_2 = 1.4, and the best-fit line is therefore y = 3.5 + 1.4x. The residuals are 1.1, -1.3, -0.7, and 0.9 (see the diagram on the right). The minimum value of the sum of squared residuals is S(3.5, 1.4) = 1.1^2 + (-1.3)^2 + (-0.7)^2 + 0.9^2 = 4.2.

This calculation can be expressed in matrix notation as follows. The original system of equations is \mathbf{y} = \mathbf{X}\boldsymbol\beta, where
\mathbf{y} = \begin{bmatrix} 6 \\ 5 \\ 7 \\ 10 \end{bmatrix}, \quad
\mathbf{X} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}, \quad
\boldsymbol\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}.
Intuitively,
\mathbf{y} = \mathbf{X}\boldsymbol\beta \;\Rightarrow\; \mathbf{X}^\top \mathbf{y} = \mathbf{X}^\top \mathbf{X}\boldsymbol\beta \;\Rightarrow\; \hat{\boldsymbol\beta} = \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{y} = \begin{bmatrix} 3.5 \\ 1.4 \end{bmatrix}.
More rigorously, if \mathbf{X}^\top \mathbf{X} is invertible, then the matrix \mathbf{X}\left(\mathbf{X}^\top \mathbf{X}\right)^{-1}\mathbf{X}^\top represents orthogonal projection onto the column space of \mathbf{X}. Therefore, among all vectors of the form \mathbf{X}\boldsymbol\beta, the one closest to \mathbf{y} is \mathbf{X}\left(\mathbf{X}^\top \mathbf{X}\right)^{-1}\mathbf{X}^\top \mathbf{y}. Setting \mathbf{X}\left(\mathbf{X}^\top \mathbf{X}\right)^{-1}\mathbf{X}^\top \mathbf{y} = \mathbf{X}\hat{\boldsymbol\beta}, it is evident that \hat{\boldsymbol\beta} = \left(\mathbf{X}^\top \mathbf{X}\right)^{-1}\mathbf{X}^\top \mathbf{y} is a solution.
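The matrix calculation above can be reproduced with a few lines of NumPy; this is an added illustration that simply re-derives the numbers of this example.

```python
import numpy as np

X = np.array([[1, 1],
              [1, 2],
              [1, 3],
              [1, 4]], dtype=float)
y = np.array([6, 5, 7, 10], dtype=float)

# Solve the normal equations X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat

print(beta_hat)               # [3.5 1.4]
print(residuals)              # [ 1.1 -1.3 -0.7  0.9]
print(np.sum(residuals**2))   # 4.2
```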


Fitting a parabola

Suppose that the hypothetical researcher wishes to fit a parabola of the form y = \beta_1 x^2. Importantly, this model is still linear in the unknown parameters (now just \beta_1), so linear least squares still applies. The system of equations incorporating residuals is
\begin{align}
6 &= \beta_1 (1)^2 + r_1, \\
5 &= \beta_1 (2)^2 + r_2, \\
7 &= \beta_1 (3)^2 + r_3, \\
10 &= \beta_1 (4)^2 + r_4.
\end{align}
The sum of squared residuals is
S(\beta_1) = (6 - \beta_1)^2 + (5 - 4\beta_1)^2 + (7 - 9\beta_1)^2 + (10 - 16\beta_1)^2.
There is just one partial derivative to set to 0:
0 = \frac{\partial S}{\partial \beta_1} = 708\beta_1 - 498.
The solution is \beta_1 = 0.703, and the fit model is y = 0.703 x^2.

In matrix notation, the equations without residuals are again \mathbf{y} = \mathbf{X}\boldsymbol\beta, where now
\mathbf{y} = \begin{bmatrix} 6 \\ 5 \\ 7 \\ 10 \end{bmatrix}, \quad
\mathbf{X} = \begin{bmatrix} 1 \\ 4 \\ 9 \\ 16 \end{bmatrix}, \quad
\boldsymbol\beta = \begin{bmatrix} \beta_1 \end{bmatrix}.
By the same logic as above, the solution is
\hat{\boldsymbol\beta} = \left(\mathbf{X}^\top \mathbf{X}\right)^{-1}\mathbf{X}^\top \mathbf{y} = \begin{bmatrix} 0.703 \end{bmatrix}.
The figure shows an extension to fitting the three-parameter parabola using a design matrix \mathbf{X} with three columns (one each for x^0, x^1, and x^2), and one row for each of the red data points.
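As with the line fit, the parabola result can be checked numerically (an added illustration reproducing the numbers above).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([6.0, 5.0, 7.0, 10.0])

# Design matrix for the one-parameter model y = beta_1 * x^2
X = (x**2).reshape(-1, 1)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)               # [0.7033...] ~ 0.703

# Three-parameter parabola y = beta_1 + beta_2 x + beta_3 x^2, as in the figure
X3 = np.column_stack([np.ones_like(x), x, x**2])
print(np.linalg.solve(X3.T @ X3, X3.T @ y))
```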


Fitting other curves and surfaces

More generally, one can have n regressors x_j, and a linear model
y = \beta_0 + \sum_{j=1}^{n} \beta_j x_j.


See also

* Line–line intersection
* Line fitting
* Nonlinear least squares
* Regularized least squares
* Simple linear regression
* Partial least squares regression
* Linear function



External links


Least Squares Fitting – From MathWorld