Generalized least squares

In statistics, generalized least squares (GLS) is a method used to estimate the unknown parameters in a linear regression model. It is used when there is a non-zero amount of correlation between the residuals in the regression model. GLS is employed to improve statistical efficiency and reduce the risk of drawing erroneous inferences, as compared to conventional least squares and weighted least squares methods. It was first described by Alexander Aitken in 1935. It requires knowledge of the covariance matrix of the residuals. If this is unknown, estimating the covariance matrix gives the method of feasible generalized least squares (FGLS); however, FGLS provides fewer guarantees of improvement.


Method

In standard linear regression models, one observes data \{y_i, x_{ij}\}_{i=1,\dots,n;\ j=2,\dots,k} on ''n'' statistical units with ''k'' − 1 predictor values and one response value each. The response values are placed in a vector,

: \mathbf{y} \equiv \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix},

and the predictor values are placed in the design matrix,

: \mathbf{X} \equiv \begin{pmatrix} 1 & x_{12} & x_{13} & \cdots & x_{1k} \\ 1 & x_{22} & x_{23} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n2} & x_{n3} & \cdots & x_{nk} \end{pmatrix},

where each row is a vector of the ''k'' predictor variables (including a constant) for the ''i''th data point. The model assumes that the conditional mean of \mathbf{y} given \mathbf{X} is a linear function of \mathbf{X} and that the conditional variance of the error term given \mathbf{X} is a known non-singular covariance matrix \mathbf{\Omega}. That is,

: \mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon, \quad \operatorname{E}[\boldsymbol\varepsilon \mid \mathbf{X}] = 0, \quad \operatorname{Cov}[\boldsymbol\varepsilon \mid \mathbf{X}] = \mathbf{\Omega},

where \boldsymbol\beta \in \mathbb{R}^k is a vector of unknown constants, called "regression coefficients", which are estimated from the data.

If \mathbf{b} is a candidate estimate for \boldsymbol\beta, then the residual vector for \mathbf{b} is \mathbf{y} - \mathbf{X}\mathbf{b}. The generalized least squares method estimates \boldsymbol\beta by minimizing the squared Mahalanobis length of this residual vector:

: \begin{align} \hat{\boldsymbol\beta} & = \underset{\mathbf{b}}{\operatorname{argmin}}\,(\mathbf{y} - \mathbf{X}\mathbf{b})^{\mathsf T}\mathbf{\Omega}^{-1}(\mathbf{y} - \mathbf{X}\mathbf{b}) \\ & = \underset{\mathbf{b}}{\operatorname{argmin}}\,\mathbf{y}^{\mathsf T}\mathbf{\Omega}^{-1}\mathbf{y} + (\mathbf{X}\mathbf{b})^{\mathsf T}\mathbf{\Omega}^{-1}\mathbf{X}\mathbf{b} - \mathbf{y}^{\mathsf T}\mathbf{\Omega}^{-1}\mathbf{X}\mathbf{b} - (\mathbf{X}\mathbf{b})^{\mathsf T}\mathbf{\Omega}^{-1}\mathbf{y}, \end{align}

which is equivalent to

: \hat{\boldsymbol\beta} = \underset{\mathbf{b}}{\operatorname{argmin}}\,\mathbf{y}^{\mathsf T}\mathbf{\Omega}^{-1}\mathbf{y} + \mathbf{b}^{\mathsf T}\mathbf{X}^{\mathsf T}\mathbf{\Omega}^{-1}\mathbf{X}\mathbf{b} - 2\mathbf{b}^{\mathsf T}\mathbf{X}^{\mathsf T}\mathbf{\Omega}^{-1}\mathbf{y},

which is a quadratic programming problem. The stationary point of the objective function occurs when

: 2\mathbf{X}^{\mathsf T}\mathbf{\Omega}^{-1}\mathbf{X}\mathbf{b} - 2\mathbf{X}^{\mathsf T}\mathbf{\Omega}^{-1}\mathbf{y} = 0,

so the estimator is

: \hat{\boldsymbol\beta} = \left(\mathbf{X}^{\mathsf T}\mathbf{\Omega}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathsf T}\mathbf{\Omega}^{-1}\mathbf{y}.

The quantity \mathbf{\Omega}^{-1} is known as the ''precision matrix'' (or ''concentration matrix''), a generalization of the diagonal weight matrix.
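
To make the closed-form estimator concrete, the following is a minimal NumPy sketch; the helper name gls and the toy data are illustrative assumptions rather than anything prescribed above, and later sketches in this article reuse these variables.

```python
import numpy as np

def gls(X, y, Omega):
    """Closed-form GLS: solve (X' Omega^-1 X) b = X' Omega^-1 y for b."""
    Omega_inv = np.linalg.inv(Omega)
    XtOi = X.T @ Omega_inv
    return np.linalg.solve(XtOi @ X, XtOi @ y)

# Toy data: a constant plus one predictor, n = 4 observations.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([0.1, 1.2, 1.9, 3.2])

# A known, non-singular error covariance with correlated, unequal errors.
Omega = np.array([[1.0, 0.5, 0.0, 0.0],
                  [0.5, 1.0, 0.5, 0.0],
                  [0.0, 0.5, 1.0, 0.5],
                  [0.0, 0.0, 0.5, 1.0]])

beta_hat = gls(X, y, Omega)  # estimate of (intercept, slope)
```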


Properties

The GLS estimator is unbiased, consistent, efficient, and asymptotically normal, with

: \operatorname{E}[\hat{\boldsymbol\beta} \mid \mathbf{X}] = \boldsymbol\beta \quad \text{and} \quad \operatorname{Cov}[\hat{\boldsymbol\beta} \mid \mathbf{X}] = (\mathbf{X}^{\mathsf T}\boldsymbol\Omega^{-1}\mathbf{X})^{-1}.

GLS is equivalent to applying ordinary least squares (OLS) to a linearly transformed version of the data. This can be seen by factoring \mathbf{\Omega} = \mathbf{C}\mathbf{C}^{\mathsf T} using a method such as Cholesky decomposition. Left-multiplying both sides of \mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon by \mathbf{C}^{-1} yields an equivalent linear model:

: \mathbf{y}^{*} = \mathbf{X}^{*}\boldsymbol\beta + \boldsymbol\varepsilon^{*}, \quad \text{where} \quad \mathbf{y}^{*} = \mathbf{C}^{-1}\mathbf{y}, \quad \mathbf{X}^{*} = \mathbf{C}^{-1}\mathbf{X}, \quad \boldsymbol\varepsilon^{*} = \mathbf{C}^{-1}\boldsymbol\varepsilon.

In this model, \operatorname{Var}[\boldsymbol\varepsilon^{*} \mid \mathbf{X}] = \mathbf{C}^{-1}\mathbf{\Omega}\left(\mathbf{C}^{-1}\right)^{\mathsf T} = \mathbf{I}, where \mathbf{I} is the identity matrix. Then, \boldsymbol\beta can be efficiently estimated by applying OLS to the transformed data, which requires minimizing the objective

: \left(\mathbf{y}^{*} - \mathbf{X}^{*}\mathbf{b}\right)^{\mathsf T}(\mathbf{y}^{*} - \mathbf{X}^{*}\mathbf{b}) = (\mathbf{y} - \mathbf{X}\mathbf{b})^{\mathsf T}\,\mathbf{\Omega}^{-1}(\mathbf{y} - \mathbf{X}\mathbf{b}).

This transformation effectively standardizes the scale of the errors and de-correlates them. Since OLS is then applied to data with homoscedastic errors, the Gauss–Markov theorem applies, so the GLS estimate is the best linear unbiased estimator for \boldsymbol\beta.
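
This equivalence can be checked numerically. A sketch, reusing the illustrative X, y, Omega, and beta_hat from the previous snippet:

```python
import numpy as np

# Factor Omega = C C^T via Cholesky decomposition, then whiten the data.
C = np.linalg.cholesky(Omega)
X_star = np.linalg.solve(C, X)  # X* = C^-1 X
y_star = np.linalg.solve(C, y)  # y* = C^-1 y

# Plain OLS on the whitened model recovers the GLS estimate.
beta_star, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
assert np.allclose(beta_star, beta_hat)
```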


Weighted least squares

A special case of GLS, called weighted least squares (WLS), occurs when all the off-diagonal entries of \mathbf{\Omega} are 0. This situation arises when the variances of the observed values are unequal or when heteroscedasticity is present, but the errors are uncorrelated across observations. The weight for unit ''i'' is proportional to the reciprocal of the variance of the response for unit ''i'', as the sketch below illustrates.
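
Because \mathbf{\Omega} is diagonal here, the GLS formula collapses to weighting each observation by the reciprocal of its variance. A short sketch of that equivalence, reusing the hypothetical gls helper and toy data from the Method section:

```python
import numpy as np

# Diagonal Omega: unequal variances, uncorrelated errors.
variances = np.array([0.5, 1.0, 2.0, 4.0])
w = 1.0 / variances                      # weights proportional to 1 / variance

# WLS normal equations: (X^T W X) b = X^T W y with W = diag(w).
Xw = X * w[:, None]
beta_wls = np.linalg.solve(Xw.T @ X, Xw.T @ y)

# Identical to GLS with the diagonal covariance matrix.
assert np.allclose(beta_wls, gls(X, y, np.diag(variances)))
```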


Derivation by maximum likelihood estimation

Ordinary least squares can be interpreted as maximum likelihood estimation with the prior that the errors are independent and normally distributed with zero mean and common variance. In GLS, the prior is generalized to the case where errors may not be independent and may have differing variances. For given fit parameters \mathbf{b}, the conditional probability density function of the errors is assumed to be

: p(\boldsymbol\varepsilon \mid \mathbf{b}) = \frac{1}{\sqrt{(2\pi)^n \det\boldsymbol\Omega}} \exp\left(-\frac{1}{2}\boldsymbol\varepsilon^{\mathsf T}\boldsymbol\Omega^{-1}\boldsymbol\varepsilon\right).

By Bayes' theorem,

: p(\mathbf{b} \mid \boldsymbol\varepsilon) = \frac{p(\boldsymbol\varepsilon \mid \mathbf{b})\,p(\mathbf{b})}{p(\boldsymbol\varepsilon)}.

In GLS, a uniform (improper) prior is taken for p(\mathbf{b}), and as p(\boldsymbol\varepsilon) is a marginal distribution, it does not depend on \mathbf{b}. Therefore the log-probability is

: \log p(\mathbf{b} \mid \boldsymbol\varepsilon) = \log p(\boldsymbol\varepsilon \mid \mathbf{b}) + \cdots = -\frac{1}{2}\boldsymbol\varepsilon^{\mathsf T}\boldsymbol\Omega^{-1}\boldsymbol\varepsilon + \cdots,

where the hidden terms are those that do not depend on \mathbf{b}, and \log p(\boldsymbol\varepsilon \mid \mathbf{b}) is the log-likelihood. The maximum a posteriori (MAP) estimate is then the maximum likelihood estimate (MLE), which is equivalent to the optimization problem from above,

: \hat{\boldsymbol\beta} = \underset{\mathbf{b}}{\operatorname{argmax}}\; p(\mathbf{b} \mid \boldsymbol\varepsilon) = \underset{\mathbf{b}}{\operatorname{argmax}}\; \log p(\mathbf{b} \mid \boldsymbol\varepsilon) = \underset{\mathbf{b}}{\operatorname{argmax}}\; \log p(\boldsymbol\varepsilon \mid \mathbf{b}),

where the optimization problem has been re-written using the fact that the logarithm is a strictly increasing function and that the argument solving an optimization problem is independent of terms in the objective function which do not involve the optimization variable. Substituting \mathbf{y} - \mathbf{X}\mathbf{b} for \boldsymbol\varepsilon,

: \hat{\boldsymbol\beta} = \underset{\mathbf{b}}{\operatorname{argmin}}\; \frac{1}{2}(\mathbf{y} - \mathbf{X}\mathbf{b})^{\mathsf T}\boldsymbol\Omega^{-1}(\mathbf{y} - \mathbf{X}\mathbf{b}).
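
One can check numerically that the closed-form GLS solution coincides with this MAP/ML optimum; a sketch using SciPy's general-purpose optimizer, again reusing the illustrative X, y, Omega, and beta_hat from earlier sketches:

```python
import numpy as np
from scipy.optimize import minimize

Omega_inv = np.linalg.inv(Omega)

def neg_log_posterior(b):
    # Up to b-independent constants, -log p(b | eps) = 1/2 eps^T Omega^-1 eps,
    # with eps = y - X b.
    eps = y - X @ b
    return 0.5 * eps @ Omega_inv @ eps

result = minimize(neg_log_posterior, x0=np.zeros(X.shape[1]))
assert np.allclose(result.x, beta_hat, atol=1e-5)
```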


Feasible generalized least squares

If the covariance of the errors \Omega is unknown, one can get a consistent estimate of \Omega, say \widehat{\Omega} (Baltagi 2008), using an implementable version of GLS known as the feasible generalized least squares (FGLS) estimator. In FGLS, modeling proceeds in two stages:

1. The model is estimated by OLS or another consistent (but inefficient) estimator, and the residuals are used to build a consistent estimator of the errors' covariance matrix. To do so, one often needs to impose additional constraints on the model; for example, if the errors follow a time series process, a statistician generally needs some theoretical assumptions on this process to ensure that a consistent estimator is available.
2. Using the consistent estimator of the covariance matrix of the errors, one can implement GLS ideas.

Whereas GLS is more efficient than OLS under heteroscedasticity (also spelled heteroskedasticity) or autocorrelation, this is not true for FGLS. The feasible estimator is ''asymptotically'' more efficient (provided the errors' covariance matrix is consistently estimated), but for a small to medium-sized sample, it can actually be less efficient than OLS. This is why some authors prefer to use OLS and reformulate their inferences by simply considering an alternative estimator for the variance of the estimator that is robust to heteroscedasticity or serial autocorrelation. However, for large samples, FGLS is preferred over OLS under heteroskedasticity or serial correlation (Greene 2003).

A cautionary note is that the FGLS estimator is not always consistent. One case in which FGLS might be inconsistent is if there are individual-specific fixed effects. In general, this estimator has different properties than GLS. For large samples (i.e., asymptotically), all properties are (under appropriate conditions) common with respect to GLS, but for finite samples, the properties of FGLS estimators are unknown: they vary dramatically with each particular model, and as a general rule, their exact distributions cannot be derived analytically. For finite samples, FGLS may be less efficient than OLS in some cases. Thus, while GLS can be made feasible, it is not always wise to apply this method when the sample is small.

A method used to improve the accuracy of the estimators in finite samples is to iterate: take the residuals from FGLS to update the errors' covariance estimator, then update the FGLS estimation, applying the same idea iteratively until the estimators vary by less than some tolerance. However, this method does not necessarily improve the efficiency of the estimator very much if the original sample was small.

A reasonable option when samples are not too large is to apply OLS but discard the classical variance estimator

: \sigma^2 (X^{\mathsf T} X)^{-1}

(which is inconsistent in this framework) and instead use a HAC (Heteroskedasticity and Autocorrelation Consistent) estimator. In the context of autocorrelation, the Newey–West estimator can be used, and in heteroscedastic contexts, the Eicker–White estimator can be used instead. This approach is much safer, and it is the appropriate path to take unless the sample is large, where "large" is sometimes a slippery issue (e.g., if the error distribution is asymmetric, the required sample will be much larger).
The ordinary least squares (OLS) estimator is calculated by

: \widehat{\beta}_\text{OLS} = (X^{\mathsf T} X)^{-1} X^{\mathsf T} y,

and estimates of the residuals \widehat{u}_j = (y - X\widehat{\beta}_\text{OLS})_j are constructed.

For simplicity, consider the model for heteroscedastic and non-autocorrelated errors. Assume that the variance-covariance matrix \Omega of the error vector is diagonal, or equivalently that errors from distinct observations are uncorrelated. Then each diagonal entry may be estimated using the fitted residuals \widehat{u}_j, so \widehat{\Omega}_\text{OLS} may be constructed by

: \widehat{\Omega}_\text{OLS} = \operatorname{diag}(\widehat{\sigma}^2_1, \widehat{\sigma}^2_2, \dots, \widehat{\sigma}^2_n).

It is important to notice that the squared residuals cannot be used directly in the previous expression; an estimator of the errors' variances is needed. To do so, a parametric heteroskedasticity model or a nonparametric estimator can be used. Estimate \beta_\text{FGLS1} using \widehat{\Omega}_\text{OLS} and weighted least squares:

: \widehat{\beta}_\text{FGLS1} = (X^{\mathsf T} \widehat{\Omega}^{-1}_\text{OLS} X)^{-1} X^{\mathsf T} \widehat{\Omega}^{-1}_\text{OLS} y.

The procedure can be iterated. The first iteration is given by

: \widehat{u}_\text{FGLS1} = Y - X \widehat{\beta}_\text{FGLS1},
: \widehat{\Omega}_\text{FGLS1} = \operatorname{diag}(\widehat{\sigma}^2_{\text{FGLS1},1}, \widehat{\sigma}^2_{\text{FGLS1},2}, \dots, \widehat{\sigma}^2_{\text{FGLS1},n}),
: \widehat{\beta}_\text{FGLS2} = (X^{\mathsf T} \widehat{\Omega}^{-1}_\text{FGLS1} X)^{-1} X^{\mathsf T} \widehat{\Omega}^{-1}_\text{FGLS1} y.

This estimation of \widehat{\Omega} can be iterated to convergence. Under regularity conditions, the FGLS estimator (or the estimator of its iterations, if a finite number of iterations are conducted) is asymptotically distributed as

: \sqrt{n}(\hat\beta_\text{FGLS} - \beta)\ \xrightarrow{d}\ \mathcal{N}\!\left(0,\, V\right),

where ''n'' is the sample size and

: V = \operatorname{plim}\left(X^{\mathsf T} \Omega^{-1} X / n\right)^{-1},

where \operatorname{plim} means limit in probability.
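
A sketch of the two-stage, optionally iterated procedure for the diagonal-\Omega case described above. The variance model used here (log-variance linear in the regressors, fitted by regressing log squared residuals on X) is just one simple parametric choice, adopted for illustration only:

```python
import numpy as np

def fgls(X, y, n_iter=5):
    """Iterated FGLS for heteroscedastic, uncorrelated errors.

    Illustrative variance model: log sigma_i^2 is linear in the regressors,
    estimated by regressing log squared residuals on X.
    """
    # Stage 1: consistent but inefficient OLS fit.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    for _ in range(n_iter):
        u = y - X @ beta
        # Fit the parametric variance model to the log squared residuals.
        gamma, *_ = np.linalg.lstsq(X, np.log(u**2 + 1e-12), rcond=None)
        sigma2 = np.exp(X @ gamma)       # fitted variances, diagonal of Omega-hat
        # Stage 2: weighted least squares with the estimated weights.
        Xw = X / sigma2[:, None]
        beta = np.linalg.solve(Xw.T @ X, Xw.T @ y)
    return beta

beta_fgls = fgls(X, y)  # reusing the toy X, y from the sketches above
```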


See also

* Confidence region
* Effective degrees of freedom
* Prais–Winsten estimation
* Whitening transformation


References

* Baltagi, B. H. (2008). ''Econometrics'' (4th ed.). New York: Springer.
* Greene, W. H. (2003). ''Econometric Analysis'' (5th ed.). Upper Saddle River, NJ: Prentice Hall.
