Standard Error Of The Equation
In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model (with fixed level-one effects of a linear function of a set of explanatory variables) by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being observed) in the input dataset and the output of the (linear) function of the independent variable. Some sources consider OLS to be linear regression.

Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression surface; the smaller the differences, the better the model fits the data. The resulting estimator can be expressed by a simple formula, especially in the case of simple linear regression, in which there is a single regressor on the right-hand side of the regression equation.

The OLS estimator is consistent for the level-one fixed effects when the regressors are exogenous and there is no perfect collinearity (the rank condition), and consistent for the variance estimate of the residuals when the regressors have finite fourth moments. By the Gauss–Markov theorem, it is optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, the method of OLS provides minimum-variance mean-unbiased estimation when the errors have finite variances. Under the additional assumption that the errors are normally distributed with zero mean, OLS is the maximum likelihood estimator, and it outperforms all non-linear unbiased estimators.


Linear model

Suppose the data consists of n observations \left\{\mathbf{x}_i, y_i\right\}_{i=1}^{n}. Each observation i includes a scalar response y_i and a column vector \mathbf{x}_i of p parameters (regressors), i.e., \mathbf{x}_i = \left[x_{i1}, x_{i2}, \dots, x_{ip}\right]^\operatorname{T}. In a linear regression model, the response variable, y_i, is a linear function of the regressors:
:y_i = \beta_1\ x_{i1} + \beta_2\ x_{i2} + \cdots + \beta_p\ x_{ip} + \varepsilon_i,
or in vector form,
: y_i = \mathbf{x}_i^\operatorname{T} \boldsymbol{\beta} + \varepsilon_i, \,
where \mathbf{x}_i, as introduced previously, is a column vector of the i-th observation of all the explanatory variables; \boldsymbol{\beta} is a p \times 1 vector of unknown parameters; and the scalar \varepsilon_i represents unobserved random variables (errors) of the i-th observation. \varepsilon_i accounts for the influences upon the responses y_i from sources other than the explanatory variables \mathbf{x}_i. This model can also be written in matrix notation as
: \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}, \,
where \mathbf{y} and \boldsymbol{\varepsilon} are n \times 1 vectors of the response variables and the errors of the n observations, and \mathbf{X} is an n \times p matrix of regressors, also sometimes called the design matrix, whose row i is \mathbf{x}_i^\operatorname{T} and contains the i-th observations on all the explanatory variables.

Typically, a constant term is included in the set of regressors \mathbf{X}, say, by taking x_{i1}=1 for all i=1, \dots, n. The coefficient \beta_1 corresponding to this regressor is called the ''intercept''. Without the intercept, the fitted line is forced to pass through the origin.

Regressors do not have to be independent of one another for estimation to be consistent; for example, they may be non-linearly dependent. Short of perfect multicollinearity, parameter estimates may still be consistent; however, as multicollinearity rises, the standard errors of the estimates increase, reducing their precision. When there is perfect multicollinearity, it is no longer possible to obtain unique estimates for the coefficients of the related regressors; estimation of these parameters cannot converge (and thus cannot be consistent). As a concrete example where regressors are non-linearly dependent yet estimation may still be consistent, suppose the response depends linearly both on a value and on its square; then one regressor's value is simply the square of another regressor's. In that case the model is ''quadratic'' in the second regressor, but it is nonetheless still considered a ''linear'' model, because the model is still linear in the parameters (\boldsymbol{\beta}).
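As a minimal illustrative sketch of the model y = X\beta + \varepsilon, the following Python snippet (assuming the NumPy library; the data, coefficients, and variable names are synthetic and chosen only for illustration) simulates n observations from a linear model whose first regressor is the constant term:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
beta_true = np.array([2.0, -1.0, 0.5])       # beta_1 (intercept), beta_2, beta_3

X = np.column_stack([np.ones(n),             # constant regressor x_{i1} = 1
                     rng.normal(size=n),     # x_{i2}
                     rng.normal(size=n)])    # x_{i3}
eps = rng.normal(scale=0.3, size=n)          # unobserved errors epsilon_i
y = X @ beta_true + eps                      # y = X beta + epsilon
```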


Matrix/vector formulation

Consider an overdetermined system
:\sum_{j=1}^{p} x_{ij} \beta_j = y_i,\ (i=1, 2, \dots, n),
of n linear equations in p unknown coefficients, \beta_1, \beta_2, \dots, \beta_p, with n > p. This can be written in matrix form as
:\mathbf{X} \boldsymbol{\beta} = \mathbf{y},
where
:\mathbf{X} = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix},\qquad \boldsymbol{\beta} = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix},\qquad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.
(Note: for a linear model as above, not all elements of \mathbf{X} contain information on the data points. The first column is populated with ones, X_{i1} = 1; only the other columns contain actual data. So here p is equal to the number of regressors plus one.)

Such a system usually has no exact solution, so the goal is instead to find the coefficients \boldsymbol{\beta} which fit the equations "best", in the sense of solving the quadratic minimization problem
:\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\operatorname{arg\,min}}\,S(\boldsymbol{\beta}),
where the objective function S is given by
:S(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left| y_i - \sum_{j=1}^{p} X_{ij}\beta_j \right|^2 = \left\| \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right\|^2.
A justification for choosing this criterion is given in Properties below. This minimization problem has a unique solution, provided that the p columns of the matrix \mathbf{X} are linearly independent, given by solving the so-called ''normal equations'':
:\left( \mathbf{X}^\operatorname{T} \mathbf{X} \right)\hat{\boldsymbol{\beta}} = \mathbf{X}^\operatorname{T} \mathbf{y}\ .
The matrix \mathbf{X}^\operatorname{T} \mathbf{X} is known as the ''normal matrix'' or Gram matrix, and \mathbf{X}^\operatorname{T} \mathbf{y} is known as the moment matrix of the regressand by the regressors. Finally, \hat{\boldsymbol{\beta}} is the coefficient vector of the least-squares hyperplane, expressed as
:\hat{\boldsymbol{\beta}} = \left( \mathbf{X}^\operatorname{T} \mathbf{X} \right)^{-1} \mathbf{X}^\operatorname{T} \mathbf{y},
or equivalently
:\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + \left(\mathbf{X}^\operatorname{T}\mathbf{X}\right)^{-1}\mathbf{X}^\operatorname{T} \boldsymbol{\varepsilon}.
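A short numerical sketch of the normal equations (illustrative only, assuming NumPy and synthetic data): forming X^T X and solving the linear system reproduces the same estimate as a dedicated least-squares routine, which is generally preferred in practice for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # first column of ones
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=n)

# Solve the normal equations (X^T X) beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The same estimate via a numerically more stable least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```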


Estimation

Suppose ''b'' is a "candidate" value for the parameter vector ''β''. The quantity y_i - x_i^\operatorname{T} b, called the ''residual'' for the ''i''-th observation, measures the vertical distance between the data point (x_i, y_i) and the hyperplane y = x^\operatorname{T} b, and thus assesses the degree of fit between the actual data and the model. The ''sum of squared residuals'' (''SSR'') (also called the ''error sum of squares'' (''ESS'') or ''residual sum of squares'' (''RSS'')) is a measure of the overall model fit:
: S(b) = \sum_{i=1}^n (y_i - x_i^\operatorname{T} b)^2 = (y - Xb)^\operatorname{T}(y - Xb),
where ''T'' denotes the matrix transpose, and the rows of ''X'', denoting the values of all the independent variables associated with a particular value of the dependent variable, are ''X''''i'' = ''x''''i''T. The value of ''b'' which minimizes this sum is called the OLS estimator for ''β''. The function ''S''(''b'') is quadratic in ''b'' with positive-definite Hessian, and therefore this function possesses a unique global minimum at b = \hat\beta, which can be given by the explicit formula
: \hat\beta = \operatorname{arg\,min}_{b} S(b) = (X^\operatorname{T}X)^{-1}X^\operatorname{T}y\ .
The product ''N'' = ''X''T''X'' is a Gram matrix, and its inverse, ''Q'' = ''N''−1, is the ''cofactor matrix'' of ''β'', closely related to its covariance matrix, ''C''''β''. The matrix (''X''T''X'')−1''X''T = ''QX''T is called the Moore–Penrose pseudoinverse matrix of ''X''. This formulation highlights the point that estimation can be carried out if, and only if, there is no perfect multicollinearity between the explanatory variables (which would cause the Gram matrix to have no inverse).
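Since (X^{T}X)^{-1}X^{T} is the Moore–Penrose pseudoinverse of X when X has full column rank, the estimator can also be computed through the pseudoinverse. A brief sketch with synthetic data (illustrative, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

beta_hat = np.linalg.pinv(X) @ y     # equals (X^T X)^{-1} X^T y for full column rank X
assert np.allclose(beta_hat, np.linalg.solve(X.T @ X, X.T @ y))
```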


Prediction

After we have estimated ''β'', the ''fitted values'' (or ''predicted values'') from the regression will be
: \hat{y} = X\hat\beta = Py,
where ''P'' = ''X''(''X''T''X'')−1''X''T is the ''projection matrix'' onto the space ''V'' spanned by the columns of ''X''. This matrix ''P'' is also sometimes called the ''hat matrix'' because it "puts a hat" onto the variable ''y''. Another matrix, closely related to ''P'', is the ''annihilator'' matrix ''M'' = ''I''''n'' − ''P''; this is a projection matrix onto the space orthogonal to ''V''. Both matrices ''P'' and ''M'' are symmetric and idempotent (meaning that ''P''2 = ''P'' and ''M''2 = ''M''), and relate to the data matrix ''X'' via the identities ''PX'' = ''X'' and ''MX'' = 0. Matrix ''M'' creates the ''residuals'' from the regression:
: \hat\varepsilon = y - \hat y = y - X\hat\beta = My = M(X\beta + \varepsilon) = (MX)\beta + M\varepsilon = M\varepsilon.
The variances of the predicted values s^2_{\hat{y}_i} are found on the main diagonal of the variance-covariance matrix of the predicted values:
: C_{\hat{y}} = s^2 P,
where ''P'' is the projection matrix and ''s''2 is the sample variance. The full matrix is very large; its diagonal elements can be calculated individually as
: s^2_{\hat{y}_i} = s^2 X_i (X^\operatorname{T} X)^{-1} X_i^\operatorname{T},
where ''X''''i'' is the ''i''-th row of matrix ''X''.


Sample statistics

Using these residuals we can estimate the sample variance ''s''2 using the ''reduced chi-squared'' statistic:
: s^2 = \frac{\hat\varepsilon^\operatorname{T} \hat\varepsilon}{n-p} = \frac{(My)^\operatorname{T} My}{n-p} = \frac{y^\operatorname{T} M^\operatorname{T} My}{n-p} = \frac{y^\operatorname{T} My}{n-p} = \frac{S(\hat\beta)}{n-p},\qquad \hat\sigma^2 = \frac{n-p}{n}\;s^2.
The denominator, ''n'' − ''p'', is the statistical degrees of freedom. The first quantity, ''s''2, is the OLS estimate for ''σ''2, whereas the second, \scriptstyle\hat\sigma^2, is the MLE estimate for ''σ''2. The two estimators are quite similar in large samples; the first estimator is always unbiased, while the second estimator is biased but has a smaller mean squared error. In practice ''s''2 is used more often, since it is more convenient for hypothesis testing. The square root of ''s''2 is called the ''regression standard error'', ''standard error of the regression'', or ''standard error of the equation''.

It is common to assess the goodness-of-fit of the OLS regression by comparing how much the initial variation in the sample can be reduced by regressing onto ''X''. The ''coefficient of determination'' ''R''2 is defined as a ratio of the "explained" variance to the "total" variance of the dependent variable ''y'', in the cases where the regression sum of squares equals the sum of squares of residuals:
: R^2 = \frac{\sum (\hat y_i - \overline{y})^2}{\sum (y_i - \overline{y})^2} = \frac{y^\operatorname{T} P^\operatorname{T} L P y}{y^\operatorname{T} L y} = 1 - \frac{y^\operatorname{T} M y}{y^\operatorname{T} L y} = 1 - \frac{\rm RSS}{\rm TSS},
where TSS is the ''total sum of squares'' for the dependent variable, L = I_n - \frac{1}{n}J_n, and J_n is an ''n''×''n'' matrix of ones. (''L'' is a centering matrix which is equivalent to regression on a constant; it simply subtracts the mean from a variable.) In order for ''R''2 to be meaningful, the matrix ''X'' of data on regressors must contain a column vector of ones to represent the constant whose coefficient is the regression intercept. In that case, ''R''2 will always be a number between 0 and 1, with values close to 1 indicating a good degree of fit.
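A short sketch of these sample statistics (illustrative, assuming NumPy; the data and coefficients are synthetic): the unbiased estimate s^2, the MLE \hat\sigma^2, the standard error of the regression, and R^2 computed from the residual and total sums of squares.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(size=n)

p = X.shape[1]
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat

s2 = resid @ resid / (n - p)        # unbiased OLS estimate of sigma^2
sigma2_mle = resid @ resid / n      # biased MLE of sigma^2 (smaller mean squared error)
ser = np.sqrt(s2)                   # standard error of the regression / equation

tss = np.sum((y - y.mean()) ** 2)   # total sum of squares (X includes a constant column)
r2 = 1.0 - (resid @ resid) / tss    # coefficient of determination R^2
```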


Simple linear regression model

If the data matrix ''X'' contains only two variables, a constant and a scalar regressor ''x''''i'', then this is called the "simple regression model". This case is often considered in beginner statistics classes, as it provides much simpler formulas, even suitable for manual calculation. The parameters are commonly denoted as (''α'', ''β''):
: y_i = \alpha + \beta x_i + \varepsilon_i.
The least squares estimates in this case are given by the simple formulas
: \begin{align} \widehat\beta &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \\[2pt] \widehat\alpha &= \bar{y} - \widehat\beta\,\bar{x}\ . \end{align}
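The following sketch (illustrative, assuming NumPy and synthetic data) computes the simple-regression slope and intercept directly from these formulas and checks that the general matrix formula with X = [1, x] gives the same answer.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=25)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=25)

beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

# The same result from the general matrix formula with X = [1, x]
X = np.column_stack([np.ones_like(x), x])
assert np.allclose(np.linalg.lstsq(X, y, rcond=None)[0], [alpha_hat, beta_hat])
```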


Alternative derivations

In the previous section the least squares estimator \hat\beta was obtained as a value that minimizes the sum of squared residuals of the model. However, it is also possible to derive the same estimator from other approaches. In all cases the formula for the OLS estimator remains the same: \hat\beta = (X^\operatorname{T}X)^{-1}X^\operatorname{T}y; the only difference is in how we interpret this result.


Projection

For mathematicians, OLS is an approximate solution to an overdetermined system of linear equations ''Xβ'' ≈ ''y'', where ''β'' is the unknown. Assuming the system cannot be solved exactly (the number of equations ''n'' being much larger than the number of unknowns ''p''), we are looking for a solution that provides the smallest discrepancy between the right- and left-hand sides. In other words, we are looking for the solution that satisfies
: \hat\beta = \underset{\beta}{\operatorname{arg\,min}}\,\lVert \mathbf{y} - \mathbf{X}\boldsymbol\beta \rVert^2,
where \lVert\cdot\rVert is the standard ''L''2 norm in the ''n''-dimensional Euclidean space R''n''. The predicted quantity ''Xβ'' is just a certain linear combination of the vectors of regressors. Thus, the residual vector will have the smallest length when ''y'' is projected orthogonally onto the linear subspace spanned by the columns of ''X''. The OLS estimator \hat\beta in this case can be interpreted as the coefficients of the vector decomposition of the fitted vector \hat{y} = Py along the basis formed by the columns of ''X''.

In other words, the gradient equations at the minimum can be written as:
:(\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta})^{\operatorname{T}} \mathbf{X} = 0.
A geometrical interpretation of these equations is that the vector of residuals, \mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}, is orthogonal to the column space of ''X'', since the dot product (\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta})\cdot \mathbf{X}\mathbf{v} is equal to zero for ''any'' conformal vector, \mathbf{v}. This means that \mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta} is the shortest of all possible vectors \mathbf{y} - \mathbf{X}\boldsymbol\beta, that is, the variance of the residuals is the minimum possible.

Introducing \hat{\boldsymbol\gamma} and a matrix ''K'' with the assumption that the matrix [\mathbf{X}\ \mathbf{K}] is non-singular and ''K''T''X'' = 0 (cf. Orthogonal projections), the residual vector should satisfy the following equation:
:\hat{\mathbf{r}} := \mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta} = \mathbf{K}\hat{\boldsymbol\gamma}.
The equation and solution of linear least squares are thus described as follows:
:\begin{align} \mathbf{y} &= \begin{bmatrix}\mathbf{X} & \mathbf{K}\end{bmatrix} \begin{bmatrix} \hat{\boldsymbol\beta} \\ \hat{\boldsymbol\gamma} \end{bmatrix}, \\ \Rightarrow \begin{bmatrix} \hat{\boldsymbol\beta} \\ \hat{\boldsymbol\gamma} \end{bmatrix} &= \begin{bmatrix}\mathbf{X} & \mathbf{K}\end{bmatrix}^{-1} \mathbf{y} = \begin{bmatrix} \left(\mathbf{X}^\operatorname{T} \mathbf{X}\right)^{-1} \mathbf{X}^\operatorname{T} \\ \left(\mathbf{K}^\operatorname{T} \mathbf{K}\right)^{-1} \mathbf{K}^\operatorname{T} \end{bmatrix} \mathbf{y}. \end{align}
Another way of looking at it is to consider the regression line to be a weighted average of the lines passing through the combination of any two points in the dataset. Although this way of calculation is more computationally expensive, it provides a better intuition on OLS.
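A tiny numerical check of the orthogonality (gradient) equations, under the same illustrative assumptions as the earlier sketches (NumPy, synthetic data): the OLS residual vector is orthogonal to every column of X.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat

# Gradient (normal) equations: the residuals are orthogonal to the column space of X
assert np.allclose(X.T @ resid, 0.0)
```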


Maximum likelihood

The OLS estimator is identical to the maximum likelihood estimator (MLE) under the normality assumption for the error terms. This normality assumption has historical importance, as it provided the basis for the early work in linear regression analysis by Yule and Pearson. From the properties of MLE, we can infer that the OLS estimator is asymptotically efficient (in the sense of attaining the Cramér–Rao bound for variance) if the normality assumption is satisfied.


Generalized method of moments

In the iid case the OLS estimator can also be viewed as a GMM estimator arising from the moment conditions
: \mathrm{E}\big[\, x_i\left(y_i - x_i^\operatorname{T} \beta\right) \,\big] = 0.
These moment conditions state that the regressors should be uncorrelated with the errors. Since ''x''''i'' is a ''p''-vector, the number of moment conditions is equal to the dimension of the parameter vector ''β'', and thus the system is exactly identified. This is the so-called classical GMM case, when the estimator does not depend on the choice of the weighting matrix. Note that the original strict exogeneity assumption implies a far richer set of moment conditions than stated above. In particular, this assumption implies that for any vector-function ''f'', the moment condition \mathrm{E}[\,f(x_i)\,\varepsilon_i\,] = 0 will hold. However it can be shown using the Gauss–Markov theorem that the optimal choice of function is to take f(x_i) = x_i, which results in the moment equation posted above.


Properties


Assumptions

There are several different frameworks in which the linear regression model can be cast in order to make the OLS technique applicable. Each of these settings produces the same formulas and same results. The only difference is the interpretation and the assumptions which have to be imposed in order for the method to give meaningful results. The choice of the applicable framework depends mostly on the nature of the data at hand, and on the inference task which has to be performed.

One of the lines of difference in interpretation is whether to treat the regressors as random variables, or as predefined constants. In the first case (random design) the regressors ''x''''i'' are random and sampled together with the ''y''''i''s from some population, as in an observational study. This approach allows for more natural study of the asymptotic properties of the estimators. In the other interpretation (fixed design), the regressors ''X'' are treated as known constants set by a design, and ''y'' is sampled conditionally on the values of ''X'' as in an experiment. For practical purposes, this distinction is often unimportant, since estimation and inference are carried out while conditioning on ''X''. All results stated in this article are within the random design framework.


Classical linear regression model

The classical model focuses on "finite sample" estimation and inference, meaning that the number of observations ''n'' is fixed. This contrasts with the other approaches, which study the asymptotic behavior of OLS as the number of observations grows large.
* Correct specification. The linear functional form must coincide with the form of the actual data-generating process.
* Strict exogeneity. The errors in the regression should have conditional mean zero: \operatorname{E}[\,\varepsilon \mid X\,] = 0. The immediate consequence of the exogeneity assumption is that the errors have mean zero, \operatorname{E}[\varepsilon] = 0 (by the law of total expectation), and that the regressors are uncorrelated with the errors, \operatorname{E}[X^\operatorname{T}\varepsilon] = 0. The exogeneity assumption is critical for the OLS theory. If it holds then the regressor variables are called ''exogenous''. If it does not, then those regressors that are correlated with the error term are called ''endogenous'', and the OLS estimator becomes biased. In such a case the method of instrumental variables may be used to carry out inference.
* No linear dependence. The regressors in ''X'' must all be linearly independent. Mathematically, this means that the matrix ''X'' must have full column rank almost surely: \Pr\!\big[\,\operatorname{rank}(X) = p\,\big] = 1. Usually, it is also assumed that the regressors have finite moments up to at least the second moment. Then the matrix Q_{xx} = \operatorname{E}[\,x_i x_i^\operatorname{T}\,] is finite and positive semi-definite. When this assumption is violated the regressors are called linearly dependent or perfectly multicollinear. In such a case the value of the regression coefficient ''β'' cannot be learned, although prediction of ''y'' values is still possible for new values of the regressors that lie in the same linearly dependent subspace.
* Spherical errors: \operatorname{E}[\,\varepsilon\varepsilon^\operatorname{T} \mid X\,] = \sigma^2 I_n, where I_n is the identity matrix in dimension ''n'', and ''σ''2 is a parameter which determines the variance of each observation. This ''σ''2 is considered a nuisance parameter in the model, although usually it is also estimated. If this assumption is violated then the OLS estimates are still valid, but no longer efficient. It is customary to split this assumption into two parts:
** Homoscedasticity: \operatorname{E}[\,\varepsilon_i^2 \mid X\,] = \sigma^2, which means that the error term has the same variance ''σ''2 in each observation. When this requirement is violated this is called heteroscedasticity, in which case a more efficient estimator would be weighted least squares. If the errors have infinite variance then the OLS estimates will also have infinite variance (although by the law of large numbers they will nonetheless tend toward the true values so long as the errors have zero mean). In this case, robust estimation techniques are recommended.
** No autocorrelation: the errors are uncorrelated between observations: \operatorname{E}[\,\varepsilon_i\varepsilon_j \mid X\,] = 0 for i \neq j. This assumption may be violated in the context of time series data, panel data, cluster samples, hierarchical data, repeated measures data, longitudinal data, and other data with dependencies. In such cases generalized least squares provides a better alternative than OLS. Another expression for autocorrelation is ''serial correlation''.
* Normality. It is sometimes additionally assumed that the errors have a normal distribution conditional on the regressors: \varepsilon \mid X \sim \mathcal{N}(0, \sigma^2 I_n). This assumption is not needed for the validity of the OLS method, although certain additional finite-sample properties can be established when it does hold (especially in the area of hypothesis testing). Also when the errors are normal, the OLS estimator is equivalent to the maximum likelihood estimator (MLE), and therefore it is asymptotically efficient in the class of all regular estimators. Importantly, the normality assumption applies only to the error terms; contrary to a popular misconception, the response (dependent) variable is not required to be normally distributed.


Independent and identically distributed (iid)

In some applications, especially with cross-sectional data, an additional assumption is imposed: that all observations are independent and identically distributed. This means that all observations are taken from a random sample, which makes all the assumptions listed earlier simpler and easier to interpret. Also this framework allows one to state asymptotic results (as the sample size n \to \infty), which are understood as a theoretical possibility of fetching new independent observations from the data generating process. The list of assumptions in this case is:
* IID observations: (''x''''i'', ''y''''i'') is independent from, and has the same distribution as, (''x''''j'', ''y''''j'') for all i \neq j;
* No perfect multicollinearity: Q_{xx} = \operatorname{E}[\,x_i x_i^\operatorname{T}\,] is a positive-definite matrix;
* Exogeneity: \operatorname{E}[\,\varepsilon_i \mid x_i\,] = 0;
* Homoscedasticity: \operatorname{Var}[\,\varepsilon_i \mid x_i\,] = \sigma^2.


Time series model

* The stochastic process \{x_i, y_i\} is stationary and ergodic; if \{x_i, y_i\} is nonstationary, OLS results are often spurious unless \{x_i, y_i\} is co-integrating.
* The regressors are ''predetermined'': \operatorname{E}[\,x_i\varepsilon_i\,] = 0 for all ''i'' = 1, ..., ''n'';
* The ''p''×''p'' matrix Q_{xx} = \operatorname{E}[\,x_i x_i^\operatorname{T}\,] is of full rank, and hence positive-definite;
* \{x_i\varepsilon_i\} is a martingale difference sequence, with a finite matrix of second moments \operatorname{E}[\,\varepsilon_i^2 x_i x_i^\operatorname{T}\,].


Finite sample properties

First of all, under the ''strict exogeneity'' assumption the OLS estimators \scriptstyle\hat\beta and ''s''2 are unbiased, meaning that their expected values coincide with the true values of the parameters:
: \operatorname{E}[\, \hat\beta \mid X \,] = \beta, \quad \operatorname{E}[\,s^2 \mid X\,] = \sigma^2.
If the strict exogeneity does not hold (as is the case with many time series models, where exogeneity is assumed only with respect to the past shocks but not the future ones), then these estimators will be biased in finite samples.

The ''variance-covariance matrix'' (or simply ''covariance matrix'') of \scriptstyle\hat\beta is equal to
: \operatorname{Var}[\, \hat\beta \mid X \,] = \sigma^2\left(X^\operatorname{T} X\right)^{-1} = \sigma^2 Q.
In particular, the standard error of each coefficient \scriptstyle\hat\beta_j is equal to the square root of the ''j''-th diagonal element of this matrix. The estimate of this standard error is obtained by replacing the unknown quantity ''σ''2 with its estimate ''s''2. Thus,
: \widehat{\operatorname{s.\!e.}}(\hat{\beta}_j) = \sqrt{s^2 \left(X^\operatorname{T} X\right)^{-1}_{jj}}.
It can also be easily shown that the estimator \scriptstyle\hat\beta is uncorrelated with the residuals from the model:
: \operatorname{Cov}[\, \hat\beta, \hat\varepsilon \mid X\,] = 0.

The ''Gauss–Markov theorem'' states that under the ''spherical errors'' assumption (that is, the errors should be uncorrelated and homoscedastic) the estimator \scriptstyle\hat\beta is efficient in the class of linear unbiased estimators. This is called the ''best linear unbiased estimator'' (BLUE). Efficiency should be understood as if we were to find some other estimator \scriptstyle\tilde\beta which would be linear in ''y'' and unbiased, then
: \operatorname{Var}[\, \tilde\beta \mid X \,] - \operatorname{Var}[\, \hat\beta \mid X \,] \geq 0
in the sense that this is a nonnegative-definite matrix. This theorem establishes optimality only in the class of linear unbiased estimators, which is quite restrictive. Depending on the distribution of the error terms ''ε'', other, non-linear estimators may provide better results than OLS.
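The following sketch (illustrative, assuming NumPy; data and coefficients are synthetic) computes the estimated covariance matrix s^2 (X^T X)^{-1} and the coefficient standard errors described above.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, -0.7, 0.3]) + rng.normal(size=n)

p = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)

cov_beta = s2 * XtX_inv                # estimated Var[beta_hat | X] = s^2 (X^T X)^{-1}
se_beta = np.sqrt(np.diag(cov_beta))   # standard error of each coefficient
```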


Assuming normality

The properties listed so far are all valid regardless of the underlying distribution of the error terms. However, if one is willing to assume that the ''normality assumption'' holds (that is, that \varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)), then additional properties of the OLS estimators can be stated.

The estimator \scriptstyle\hat\beta is normally distributed, with mean and variance as given before:
: \hat\beta\ \sim\ \mathcal{N}\big(\beta,\ \sigma^2(X^\operatorname{T} X)^{-1}\big).
This estimator reaches the Cramér–Rao bound for the model, and thus is optimal in the class of all unbiased estimators. Note that unlike the Gauss–Markov theorem, this result establishes optimality among both linear and non-linear estimators, but only in the case of normally distributed error terms.

The estimator ''s''2 will be proportional to a chi-squared distribution:
: s^2\ \sim\ \frac{\sigma^2}{n-p} \cdot \chi^2_{n-p}.
The variance of this estimator is equal to 2\sigma^4/(n-p), which does not attain the Cramér–Rao bound of 2\sigma^4/n. However it was shown that there are no unbiased estimators of ''σ''2 with variance smaller than that of the estimator ''s''2. If we are willing to allow biased estimators, and consider the class of estimators that are proportional to the sum of squared residuals (SSR) of the model, then the best (in the sense of the mean squared error) estimator in this class will be \mathrm{SSR}/(n-p+2), which even beats the Cramér–Rao bound in the case when there is only one regressor (p = 1).

Moreover, the estimators \scriptstyle\hat\beta and ''s''2 are independent, a fact which comes in useful when constructing the t- and F-tests for the regression.


Influential observations

As was mentioned before, the estimator \hat\beta is linear in ''y'', meaning that it represents a linear combination of the dependent variables ''y''''i''. The weights in this linear combination are functions of the regressors ''X'', and generally are unequal. The observations with high weights are called influential because they have a more pronounced effect on the value of the estimator.

To analyze which observations are influential we remove a specific ''j''-th observation and consider how much the estimated quantities are going to change (similarly to the jackknife method). It can be shown that the change in the OLS estimator for ''β'' will be equal to
: \hat\beta^{(j)} - \hat\beta = - \frac{1}{1-h_j} (X^\operatorname{T} X)^{-1}x_j \hat\varepsilon_j\,,
where h_j = x_j^\operatorname{T}(X^\operatorname{T} X)^{-1}x_j is the ''j''-th diagonal element of the hat matrix ''P'', and ''x''''j'' is the vector of regressors corresponding to the ''j''-th observation. Similarly, the change in the predicted value for the ''j''-th observation resulting from omitting that observation from the dataset will be equal to
: \hat{y}_j^{(j)} - \hat{y}_j = x_j^\operatorname{T} \hat\beta^{(j)} - x_j^\operatorname{T} \hat\beta = - \frac{h_j}{1-h_j}\,\hat\varepsilon_j.
From the properties of the hat matrix, 0 \leq h_j \leq 1, and they sum up to ''p'', so that on average h_j \approx p/n. These quantities ''h''''j'' are called the leverages, and observations with high ''h''''j'' are called leverage points. Usually the observations with high leverage ought to be scrutinized more carefully, in case they are erroneous, or outliers, or in some other way atypical of the rest of the dataset.
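A small sketch of the leave-one-out formula (illustrative, assuming NumPy and synthetic data): the leverages are computed from the hat matrix diagonal, and the closed-form change in \hat\beta from dropping the highest-leverage observation is checked against an actual re-fit.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)    # leverages h_j = x_j^T (X^T X)^{-1} x_j

j = int(np.argmax(h))                          # the highest-leverage observation
# Closed-form change in beta_hat from dropping observation j
delta = -XtX_inv @ X[j] * resid[j] / (1.0 - h[j])

# Compare against actually re-fitting without observation j
mask = np.arange(n) != j
beta_wo_j = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
assert np.allclose(beta_wo_j - beta_hat, delta)
```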


Partitioned regression

Sometimes the variables and corresponding parameters in the regression can be logically split into two groups, so that the regression takes the form
: y = X_1\beta_1 + X_2\beta_2 + \varepsilon,
where ''X''1 and ''X''2 have dimensions ''n''×''p''1 and ''n''×''p''2, and ''β''1, ''β''2 are ''p''1×1 and ''p''2×1 vectors, with p_1 + p_2 = p.

The Frisch–Waugh–Lovell theorem states that in this regression the residuals \hat\varepsilon and the OLS estimate \scriptstyle\hat\beta_2 will be numerically identical to the residuals and the OLS estimate for ''β''2 in the following regression:
: M_1 y = M_1 X_2\beta_2 + \eta\,,
where ''M''1 is the annihilator matrix for the regressors ''X''1.

The theorem can be used to establish a number of theoretical results. For example, having a regression with a constant and another regressor is equivalent to subtracting the means from the dependent variable and the regressor and then running the regression for the de-meaned variables but without the constant term.
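A brief numerical check of the Frisch–Waugh–Lovell theorem (illustrative, assuming NumPy; the split into X1 and X2 and the data are synthetic): the β2 estimate from the full regression equals the estimate from regressing M1·y on M1·X2.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 50
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # first group (includes the constant)
X2 = rng.normal(size=(n, 2))                             # second group
y = X1 @ np.array([1.0, 0.5]) + X2 @ np.array([2.0, -1.0]) + rng.normal(size=n)

# Full regression of y on [X1, X2]
beta_full = np.linalg.lstsq(np.column_stack([X1, X2]), y, rcond=None)[0]
beta2_full = beta_full[X1.shape[1]:]

# Frisch-Waugh-Lovell: regress M1 y on M1 X2, where M1 annihilates X1
M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
beta2_fwl = np.linalg.lstsq(M1 @ X2, M1 @ y, rcond=None)[0]
assert np.allclose(beta2_full, beta2_fwl)
```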


Constrained estimation

Suppose it is known that the coefficients in the regression satisfy a system of linear equations
: A\colon\quad Q^\operatorname{T} \beta = c, \,
where ''Q'' is a ''p''×''q'' matrix of full rank, and ''c'' is a ''q''×1 vector of known constants, with q < p. In this case least squares estimation is equivalent to minimizing the sum of squared residuals of the model subject to the constraint ''A''. The constrained least squares (CLS) estimator can be given by an explicit formula:
: \hat\beta^c = \hat\beta - (X^\operatorname{T} X)^{-1}Q\Big(Q^\operatorname{T}(X^\operatorname{T} X)^{-1}Q\Big)^{-1}(Q^\operatorname{T}\hat\beta - c).
This expression for the constrained estimator is valid as long as the matrix ''X''T''X'' is invertible. It was assumed from the beginning of this article that this matrix is of full rank, and it was noted that when the rank condition fails, ''β'' will not be identifiable. However it may happen that adding the restriction ''A'' makes ''β'' identifiable, in which case one would like to find the formula for the estimator. The estimator is equal to
: \hat\beta^c = R(R^\operatorname{T} X^\operatorname{T} XR)^{-1}R^\operatorname{T} X^\operatorname{T} y + \Big(I_p - R(R^\operatorname{T} X^\operatorname{T} XR)^{-1}R^\operatorname{T} X^\operatorname{T} X\Big)Q(Q^\operatorname{T} Q)^{-1}c,
where ''R'' is a ''p''×(''p'' − ''q'') matrix such that the matrix [Q\ R] is non-singular, and R^\operatorname{T} Q = 0. Such a matrix can always be found, although generally it is not unique. The second formula coincides with the first in the case when ''X''T''X'' is invertible.
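A sketch of the first CLS formula (illustrative, assuming NumPy; the single restriction "β2 + β3 = 5" and the data are synthetic examples, not part of the original text):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# One linear restriction (q = 1): Q^T beta = c, here "beta_2 + beta_3 = 5"
Q = np.array([[0.0], [1.0], [1.0]])
c = np.array([5.0])

middle = Q.T @ XtX_inv @ Q
beta_cls = beta_hat - XtX_inv @ Q @ np.linalg.solve(middle, Q.T @ beta_hat - c)

assert np.allclose(Q.T @ beta_cls, c)   # the constrained estimate satisfies the restriction
```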


Large sample properties

The least squares estimators are point estimates of the linear regression model parameters ''β''. However, generally we also want to know how close those estimates might be to the true values of the parameters. In other words, we want to construct interval estimates.

Since we have not made any assumption about the distribution of the error term ''ε''''i'', it is impossible to infer the distribution of the estimators \hat\beta and \hat\sigma^2. Nevertheless, we can apply the central limit theorem to derive their ''asymptotic'' properties as the sample size ''n'' goes to infinity. While the sample size is necessarily finite, it is customary to assume that ''n'' is "large enough" so that the true distribution of the OLS estimator is close to its asymptotic limit.

We can show that under the model assumptions, the least squares estimator for ''β'' is consistent (that is, \hat\beta converges in probability to ''β'') and asymptotically normal:
: \sqrt{n}(\hat\beta - \beta)\ \xrightarrow{d}\ \mathcal{N}\big(0,\;\sigma^2 Q_{xx}^{-1}\big),
where Q_{xx} = \operatorname{E}[\,x_i x_i^\operatorname{T}\,].


Intervals

Using this asymptotic distribution, approximate two-sided confidence intervals for the ''j''-th component of the vector \hat\beta can be constructed as
: \beta_j \in \bigg[\ \hat\beta_j \pm q_{1-\alpha/2}\!\sqrt{\hat\sigma^2\left[Q_{xx}^{-1}\right]_{jj}/n}\ \bigg]
at the 1 − ''α'' confidence level, where ''q'' denotes the quantile function of the standard normal distribution, and [·]''jj'' is the ''j''-th diagonal element of a matrix.

Similarly, the least squares estimator for ''σ''2 is also consistent and asymptotically normal (provided that the fourth moment of ''ε''''i'' exists) with limiting distribution
: \sqrt{n}(\hat\sigma^2 - \sigma^2)\ \xrightarrow{d}\ \mathcal{N}\left(0,\;\operatorname{E}\left[\varepsilon_i^4\right] - \sigma^4\right).
These asymptotic distributions can be used for prediction, testing hypotheses, constructing other estimators, and so on. As an example consider the problem of prediction. Suppose x_0 is some point within the domain of distribution of the regressors, and one wants to know what the response variable would have been at that point. The mean response is the quantity y_0 = x_0^\operatorname{T} \beta, whereas the predicted response is \hat{y}_0 = x_0^\operatorname{T} \hat\beta. Clearly the predicted response is a random variable; its distribution can be derived from that of \hat\beta:
: \sqrt{n}\left(\hat{y}_0 - y_0\right)\ \xrightarrow{d}\ \mathcal{N}\left(0,\;\sigma^2 x_0^\operatorname{T} Q_{xx}^{-1} x_0\right),
which allows confidence intervals for the mean response y_0 to be constructed:
: y_0 \in \left[\ x_0^\operatorname{T} \hat\beta \pm q_{1-\alpha/2}\!\sqrt{\hat\sigma^2\, x_0^\operatorname{T} Q_{xx}^{-1} x_0 / n}\ \right]
at the 1 − ''α'' confidence level.
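A sketch of approximate coefficient confidence intervals (illustrative, assuming NumPy and SciPy; the data are synthetic). Here the finite-sample estimate s^2 (X^T X)^{-1} of the covariance matrix is used, which is the usual computational equivalent of the asymptotic expression above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=n)

p = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)
se = np.sqrt(s2 * np.diag(XtX_inv))      # standard errors of the coefficients

z = stats.norm.ppf(0.975)                # q_{1 - alpha/2} for alpha = 0.05
ci_lower = beta_hat - z * se
ci_upper = beta_hat + z * se             # approximate 95% confidence intervals
```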


Hypothesis testing

Two hypothesis tests are particularly widely used. First, one wants to know if the estimated regression equation is any better than simply predicting that all values of the response variable equal its sample mean (if not, it is said to have no explanatory power). The
null hypothesis The null hypothesis (often denoted ''H''0) is the claim in scientific research that the effect being studied does not exist. The null hypothesis can also be described as the hypothesis in which no relationship exists between two sets of data o ...
of no explanatory value of the estimated regression is tested using an
F-test An F-test is a statistical test that compares variances. It is used to determine if the variances of two samples, or if the ratios of variances among multiple samples, are significantly different. The test calculates a Test statistic, statistic, ...
. If the calculated F-value is found to be large enough to exceed its critical value for the pre-chosen level of significance, the null hypothesis is rejected and the
alternative hypothesis In statistical hypothesis testing, the alternative hypothesis is one of the proposed propositions in the hypothesis test. In general the goal of hypothesis test is to demonstrate that in the given condition, there is sufficient evidence supporting ...
, that the regression has explanatory power, is accepted. Otherwise, the null hypothesis of no explanatory power is accepted. Second, for each explanatory variable of interest, one wants to know whether its estimated coefficient differs significantly from zero—that is, whether this particular explanatory variable in fact has explanatory power in predicting the response variable. Here the null hypothesis is that the true coefficient is zero. This hypothesis is tested by computing the coefficient's
t-statistic In statistics, the ''t''-statistic is the ratio of the difference in a number’s estimated value from its assumed value to its standard error. It is used in hypothesis testing via Student's ''t''-test. The ''t''-statistic is used in a ''t''-t ...
, as the ratio of the coefficient estimate to its
standard error The standard error (SE) of a statistic (usually an estimator of a parameter, like the average or mean) is the standard deviation of its sampling distribution or an estimate of that standard deviation. In other words, it is the standard deviati ...
. If the t-statistic is larger than a predetermined value, the null hypothesis is rejected and the variable is found to have explanatory power, with its coefficient significantly different from zero. Otherwise, the null hypothesis of a zero value of the true coefficient is accepted. In addition, the
Chow test The Chow test (), proposed by econometrician Gregory Chow in 1960, is a statistical test of whether the true coefficients in two linear regressions on different data sets are equal. In econometrics, it is most commonly used in time series analysis ...
is used to test whether two subsamples both have the same underlying true coefficient values. The sums of squared residuals from regressions on each of the subsets and on the combined data set are compared by computing an F-statistic; if this exceeds a critical value, the null hypothesis of no difference between the two subsets is rejected; otherwise, it is accepted.
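A rough sketch of the two common tests on synthetic placeholder data (not part of the source text), assuming NumPy and SciPy: per-coefficient t-tests of the hypothesis that a true coefficient is zero, and the overall F-test of the hypothesis that all slope coefficients are zero.

```python
import numpy as np
from scipy.stats import t as t_dist, f as f_dist

# Sketch on synthetic data: t-tests (H0: beta_j = 0) and the overall F-test
# (H0: all slopes are zero) for an OLS fit.
rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + two regressors
y = X @ np.array([0.5, 1.5, 0.0]) + rng.normal(size=n)
p = X.shape[1]

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)                                 # residual variance
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

t_stats = beta_hat / se
t_pvalues = 2 * t_dist.sf(np.abs(t_stats), df=n - p)

rss = resid @ resid                                          # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)                            # total sum of squares
F = ((tss - rss) / (p - 1)) / (rss / (n - p))
F_pvalue = f_dist.sf(F, p - 1, n - p)

print("t:", t_stats.round(2), "p:", t_pvalues.round(4))
print("F:", round(F, 2), "p:", round(F_pvalue, 4))
```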


Example with real data

The following data set gives average heights and weights for American women aged 30–39 (source: ''The World Almanac and Book of Facts, 1975''); heights are recorded in metres and weights in kilograms. When only one dependent variable is being modeled, a
scatterplot A scatter plot, also called a scatterplot, scatter graph, scatter chart, scattergram, or scatter diagram, is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of dat ...
will suggest the form and strength of the relationship between the dependent variable and regressors. It might also reveal outliers, heteroscedasticity, and other aspects of the data that may complicate the interpretation of a fitted regression model. The scatterplot suggests that the relationship is strong and can be approximated by a quadratic function. OLS can handle such non-linear relationships by introducing the squared height as an additional regressor. The regression model then becomes a multiple linear model: :w_i = \beta_1 + \beta_2 h_i + \beta_3 h_i^2 + \varepsilon_i. The output from most popular statistical packages will look similar to this. In this table: * The ''Value'' column gives the least squares estimates of the parameters ''βj'' * The ''Std error'' column shows
standard error The standard error (SE) of a statistic (usually an estimator of a parameter, like the average or mean) is the standard deviation of its sampling distribution or an estimate of that standard deviation. In other words, it is the standard deviati ...
s of each coefficient estimate: \hat\sigma_j = \left(\hat\sigma^2\left[Q_{xx}^{-1}\right]_{jj}\right)^{\frac{1}{2}} * The ''
t-statistic In statistics, the ''t''-statistic is the ratio of the difference in a number’s estimated value from its assumed value to its standard error. It is used in hypothesis testing via Student's ''t''-test. The ''t''-statistic is used in a ''t''-t ...
'' and ''p-value'' columns are testing whether any of the coefficients might be equal to zero. The ''t''-statistic is calculated simply as t=\hat\beta_j/\hat\sigma_j. If the errors ε follow a normal distribution, ''t'' follows a Student-t distribution. Under weaker conditions, ''t'' is asymptotically normal. Large values of ''t'' indicate that the null hypothesis can be rejected and that the corresponding coefficient is not zero. The second column, ''p''-value, expresses the results of the hypothesis test as a
significance level In statistical hypothesis testing, a result has statistical significance when a result at least as "extreme" would be very infrequent if the null hypothesis were true. More precisely, a study's defined significance level, denoted by \alpha, is the ...
. Conventionally, ''p''-values smaller than 0.05 are taken as evidence that the population coefficient is nonzero. * ''R-squared'' is the
coefficient of determination In statistics, the coefficient of determination, denoted ''R''2 or ''r''2 and pronounced "R squared", is the proportion of the variation in the dependent variable that is predictable from the independent variable(s). It is a statistic used in t ...
indicating goodness-of-fit of the regression. This statistic equals one if the fit is perfect, and zero when the regressors ''X'' have no explanatory power whatsoever. This is a biased estimate of the population ''R-squared'', and it will never decrease if additional regressors are added, even if they are irrelevant. * ''Adjusted R-squared'' is a slightly modified version of R^2, designed to penalize for the excess number of regressors which do not add to the explanatory power of the regression. This statistic is always smaller than R^2, can decrease as new regressors are added, and can even be negative for poorly fitting models: :: \overline{R}^2 = 1 - \frac{n-1}{n-p}(1 - R^2) * ''Log-likelihood'' is calculated under the assumption that the errors follow a normal distribution. Even though the assumption is not very reasonable, this statistic may still find its use in conducting LR tests. * ''
Durbin–Watson statistic In statistics, the Durbin–Watson statistic is a test statistic used to detect the presence of autocorrelation at lag 1 in the residuals (prediction errors) from a regression analysis. It is named after James Durbin and Geoffrey Watson. The ...
'' tests whether there is any evidence of serial correlation between the residuals. As a rule of thumb, a value smaller than 2 is evidence of positive serial correlation. * ''
Akaike information criterion The Akaike information criterion (AIC) is an estimator of prediction error and thereby relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to ...
'' and ''Schwarz criterion'' are both used for model selection. Generally, when comparing two alternative models, smaller values of one of these criteria indicate a better model. * ''Standard error of regression'' is an estimate of ''σ'', the standard error of the error term. * ''Total sum of squares'', ''model sum of squares'', and ''residual sum of squares'' tell us how much of the initial variation in the sample is explained by the regression. * ''F-statistic'' tests the hypothesis that all coefficients (except the intercept) are equal to zero. This statistic has an ''F''(''p''–1, ''n''–''p'') distribution under the null hypothesis and normality assumption, and its ''p-value'' is the probability of observing an ''F''-value at least this large if the null hypothesis were true. Note that when errors are not normal this statistic becomes invalid, and other tests such as the
Wald test In statistics, the Wald test (named after Abraham Wald) assesses constraints on statistical parameters based on the weighted distance between the unrestricted estimate and its hypothesized value under the null hypothesis, where the weight is the ...
or LR test should be used. Ordinary least squares analysis often includes the use of diagnostic plots designed to detect departures of the data from the assumed form of the model. These are some of the common diagnostic plots: * Residuals against the explanatory variables in the model. A non-linear relation between these variables suggests that the linearity of the conditional mean function may not hold. Different levels of variability in the residuals for different levels of the explanatory variables suggest possible heteroscedasticity. * Residuals against explanatory variables not in the model. Any relation of the residuals to these variables would suggest considering these variables for inclusion in the model. * Residuals against the fitted values, \hat{y}. * Residuals against the preceding residual. This plot may identify serial correlations in the residuals. An important consideration when carrying out statistical inference using regression models is how the data were sampled. In this example, the data are averages rather than measurements on individual women. The fit of the model is very good, but this does not imply that the weight of an individual woman can be predicted with high accuracy based only on her height.
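As a rough sketch of how the estimates and summary statistics in such a table can be reproduced, the following uses NumPy with the height/weight values commonly cited for this data set; the values are an assumption here and should be checked against the original almanac table.

```python
import numpy as np

# Quadratic fit w = b1 + b2*h + b3*h^2 described above.  The height (m) and
# weight (kg) values below are the figures commonly cited for this data set;
# verify them against the original table before relying on the numbers.
h = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
              1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
w = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
              63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

X = np.column_stack([np.ones_like(h), h, h**2])       # design matrix
beta_hat, *_ = np.linalg.lstsq(X, w, rcond=None)       # least squares estimates
resid = w - X @ beta_hat
n, p = X.shape
sigma2_hat = resid @ resid / (n - p)                   # squared standard error of regression
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))
r2 = 1 - (resid @ resid) / np.sum((w - w.mean()) ** 2)
adj_r2 = 1 - (n - 1) / (n - p) * (1 - r2)

print("coefficients:", beta_hat.round(3))
print("std errors:  ", se.round(3))
print("R2 / adj R2: ", round(r2, 4), round(adj_r2, 4))
```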


Sensitivity to rounding

This example also demonstrates that the coefficients determined by these calculations are sensitive to how the data are prepared. The heights were originally given rounded to the nearest inch and have been converted and rounded to the nearest centimetre. Since the conversion factor is one inch to 2.54 cm, this is ''not'' an exact conversion. The original inches can be recovered by Round(x/0.0254) and then re-converted to metric without rounding. If this is done, the estimated coefficients change slightly, giving a second fitted equation. Using either of these equations to predict the weight of a 5' 6" (1.6764 m) woman gives similar values: 62.94 kg with rounding vs. 62.98 kg without rounding. Thus a seemingly small variation in the data has a real effect on the coefficients but a small effect on the results of the equation. While this may look innocuous in the middle of the data range, it could become significant at the extremes or in the case where the fitted model is used to project outside the data range (
extrapolation In mathematics Mathematics is a field of study that discovers and organizes methods, Mathematical theory, theories and theorems that are developed and Mathematical proof, proved for the needs of empirical sciences and mathematics itself. ...
). This highlights a common error: this example is an abuse of OLS, which inherently requires that the errors in the independent variable (in this case height) are zero or at least negligible. The initial rounding to the nearest inch plus any actual measurement errors constitute a finite and non-negligible error. As a result, the fitted parameters are not the best estimates they are presumed to be. Though not totally spurious, the error in the estimation will depend upon the relative sizes of the ''x'' and ''y'' errors.
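A short sketch of the rounding experiment, again using the same assumed height/weight values as in the previous sketch (verify them against the source): recover the inches via round(h/0.0254), re-convert exactly, refit, and compare predictions for a 5' 6" woman.

```python
import numpy as np

# Rounding experiment: recover original inches, re-convert exactly, refit the
# quadratic model, and compare coefficients and a single prediction.
# Heights/weights are assumed values, as in the previous sketch.
h_rounded = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                      1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
w = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
              63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])
h_exact = np.round(h_rounded / 0.0254) * 0.0254       # exact inch-to-metre conversion

def fit(h):
    X = np.column_stack([np.ones_like(h), h, h**2])
    return np.linalg.lstsq(X, w, rcond=None)[0]

def predict(b, h0):
    return b[0] + b[1] * h0 + b[2] * h0 ** 2

b_rounded, b_exact = fit(h_rounded), fit(h_exact)
h0 = 66 * 0.0254                                      # 5' 6" in metres
print("rounded data: ", b_rounded.round(2), round(predict(b_rounded, h0), 2))
print("exact heights:", b_exact.round(2), round(predict(b_exact, h0), 2))
```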


Another example with less real data


Problem statement

We can use the least squares method to determine the equation of a two-body orbit in polar coordinates. The equation typically used is r(\theta) = \frac{p}{1 - e\cos(\theta)}, where r(\theta) is the distance of the object from one of the bodies. In the equation, the parameters p and e determine the path of the orbit. We have measured r(\theta) at six values of \theta, and we need to find the least-squares approximation of e and p for the given data.


Solution

First we need to represent e and p in a linear form. So we are going to rewrite the equation r(\theta) as \frac{1}{r(\theta)} = \frac{1}{p} - \frac{e}{p}\cos(\theta). Furthermore, one could fit for
apsides An apsis (; ) is the farthest or nearest point in the orbit of a planetary body about its primary body. The line of apsides (also called apse line, or major axis of the orbit) is the line connecting the two extreme values. Apsides perta ...
by expanding \cos(\theta) with an extra parameter as \cos(\theta-\theta_0)=\cos(\theta)\cos(\theta_0)+\sin(\theta)\sin(\theta_0), which is linear in both \cos(\theta) and in the extra basis function \sin(\theta). We use the original two-parameter form to represent our observational data as the normal equations
: A^{\operatorname{T}}A \binom{x}{y} = A^{\operatorname{T}}b,
where x = 1/p\,; y = e/p\,; A contains the coefficients of 1/p in the first column, which are all 1, and the coefficients of e/p in the second column, given by -\cos(\theta)\,; and b = 1/r(\theta), such that
: A = \begin{pmatrix} 1 & -0.731354\\ 1 & -0.707107\\ 1 & -0.615661\\ 1 & \ \ 0.052336\\ 1 & \ \ 0.309017\\ 1 & \ \ 0.438371 \end{pmatrix},\quad b = \begin{pmatrix} 0.21220\\ 0.21958\\ 0.24741\\ 0.45071\\ 0.52883\\ 0.56820 \end{pmatrix}.
On solving we get \binom{x}{y} = \binom{0.43478}{0.30435}\,, so p = \tfrac{1}{x} = 2.3000 and e = p\cdot y = 0.70001
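A minimal sketch of this solution step, assuming NumPy: it solves the normal equations with the A and b given above and recovers p and e from x and y.

```python
import numpy as np

# Solve the normal equations A^T A (x, y)^T = A^T b for the orbit fit,
# then recover p and e from x = 1/p and y = e/p.
A = np.array([[1.0, -0.731354],
              [1.0, -0.707107],
              [1.0, -0.615661],
              [1.0,  0.052336],
              [1.0,  0.309017],
              [1.0,  0.438371]])
b = np.array([0.21220, 0.21958, 0.24741, 0.45071, 0.52883, 0.56820])

x, y = np.linalg.solve(A.T @ A, A.T @ b)   # least-squares solution via normal equations
p = 1.0 / x                                # semi-latus rectum
e = p * y                                  # eccentricity
print(round(x, 5), round(y, 5))            # approx. 0.43478, 0.30435
print(round(p, 4), round(e, 5))            # approx. 2.3000, 0.70001
```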


See also

* Bayesian least squares * Fama–MacBeth regression * Nonlinear least squares * Numerical methods for linear least squares *
Nonlinear system identification System identification is a method of identifying or measuring the mathematical model of a system from measurements of the system inputs and outputs. The applications of system identification include any system where the inputs and outputs can be mea ...


References

