In statistics, omitted-variable bias (OVB) occurs when a statistical model leaves out one or more relevant variables. The bias results in the model attributing the effect of the missing variables to those that were included. More specifically, OVB is the bias that appears in the estimates of parameters in a regression analysis when the assumed specification is incorrect in that it omits an independent variable that is a determinant of the dependent variable and correlated with one or more of the included independent variables.


In linear regression


Intuition

Suppose the true cause-and-effect relationship is given by

:y = a + bx + cz + u

with parameters ''a'', ''b'', ''c'', dependent variable ''y'', independent variables ''x'' and ''z'', and error term ''u''. We wish to know the effect of ''x'' itself upon ''y'' (that is, we wish to obtain an estimate of ''b'').

Two conditions must hold true for omitted-variable bias to exist in linear regression:
* the omitted variable must be a determinant of the dependent variable (i.e., its true regression coefficient must not be zero); and
* the omitted variable must be correlated with an independent variable specified in the regression (i.e., cov(''z'', ''x'') must not equal zero).

Suppose we omit ''z'' from the regression, and suppose the relation between ''x'' and ''z'' is given by

:z = d + fx + e

with parameters ''d'', ''f'' and error term ''e''. Substituting the second equation into the first gives

:y = (a + cd) + (b + cf)x + (u + ce).

If a regression of ''y'' is conducted upon ''x'' only, this last equation is what is estimated, and the regression coefficient on ''x'' is actually an estimate of (''b'' + ''cf''), giving not simply an estimate of the desired direct effect of ''x'' upon ''y'' (which is ''b''), but rather of its sum with the indirect effect (the effect ''f'' of ''x'' on ''z'' times the effect ''c'' of ''z'' on ''y''). Thus by omitting the variable ''z'' from the regression, we have estimated the total derivative of ''y'' with respect to ''x'' rather than its partial derivative with respect to ''x''. These differ if both ''c'' and ''f'' are non-zero.

The direction and extent of the bias are both contained in ''cf'', since the effect sought is ''b'' but the regression estimates ''b'' + ''cf''. The extent of the bias is the absolute value of ''cf'', and the direction of bias is upward (toward a more positive or less negative value) if ''cf'' > 0 (that is, if the direction of correlation between ''y'' and ''z'' is the same as that between ''x'' and ''z''), and it is downward otherwise.
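This result can be checked with a short simulation. The sketch below is a minimal illustration in Python with NumPy; the parameter values are arbitrary assumptions chosen for the example, not values from any particular study. It generates data from the true model and compares the slope on ''x'' from the correctly specified regression with the slope from the regression that omits ''z''.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed structural parameters (arbitrary values for illustration)
a, b, c = 1.0, 2.0, 0.5   # true model: y = a + b*x + c*z + u
d, f = 0.3, 1.5           # relation:   z = d + f*x + e

x = rng.normal(size=n)
z = d + f * x + rng.normal(size=n)
y = a + b * x + c * z + rng.normal(size=n)

# Correctly specified ("long") regression of y on [1, x, z]
X_long = np.column_stack([np.ones(n), x, z])
b_long = np.linalg.lstsq(X_long, y, rcond=None)[0][1]

# Misspecified ("short") regression of y on [1, x] only
X_short = np.column_stack([np.ones(n), x])
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0][1]

print(b_long)   # close to b = 2.0
print(b_short)  # close to b + c*f = 2.75
```

With these values the short regression's slope estimate converges to ''b'' + ''cf'' = 2.0 + 0.5 × 1.5 = 2.75 rather than to the direct effect ''b'' = 2.0, matching the algebra above.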


Detailed analysis

As an example, consider a linear model of the form

:y_i = x_i \beta + z_i \delta + u_i, \qquad i = 1,\dots,n

where
* ''x''''i'' is a 1 × ''p'' row vector of values of ''p'' independent variables observed at time ''i'' or for the ''i''th study participant;
* ''β'' is a ''p'' × 1 column vector of unobservable parameters (the response coefficients of the dependent variable to each of the ''p'' independent variables in ''x''''i'') to be estimated;
* ''z''''i'' is a scalar and is the value of another independent variable that is observed at time ''i'' or for the ''i''th study participant;
* ''δ'' is a scalar and is an unobservable parameter (the response coefficient of the dependent variable to ''z''''i'') to be estimated;
* ''u''''i'' is the unobservable error term occurring at time ''i'' or for the ''i''th study participant; it is an unobserved realization of a random variable having expected value 0 (conditionally on ''x''''i'' and ''z''''i'');
* ''y''''i'' is the observation of the dependent variable at time ''i'' or for the ''i''th study participant.

We collect the observations of all variables subscripted ''i'' = 1, ..., ''n'', and stack them one below another, to obtain the matrix ''X'' and the vectors ''Y'', ''Z'', and ''U'':

:X = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^{n \times p}, \qquad Y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad Z = \begin{bmatrix} z_1 \\ \vdots \\ z_n \end{bmatrix}, \quad U = \begin{bmatrix} u_1 \\ \vdots \\ u_n \end{bmatrix} \in \mathbb{R}^{n \times 1}.

If the independent variable ''z'' is omitted from the regression, then the estimated values of the response parameters of the other independent variables will be given by the usual least-squares calculation,

:\widehat{\beta} = (X'X)^{-1}X'Y

(where the "prime" notation means the transpose of a matrix and the −1 superscript denotes matrix inversion). Substituting for ''Y'' based on the assumed linear model,

:\begin{align} \widehat{\beta} &= (X'X)^{-1}X'(X\beta + Z\delta + U) \\ &= (X'X)^{-1}X'X\beta + (X'X)^{-1}X'Z\delta + (X'X)^{-1}X'U \\ &= \beta + (X'X)^{-1}X'Z\delta + (X'X)^{-1}X'U. \end{align}

On taking expectations, the contribution of the final term is zero; this follows from the assumption that ''U'' is uncorrelated with the regressors ''X''. On simplifying the remaining terms:

:\begin{align} E[\widehat{\beta} \mid X] &= \beta + (X'X)^{-1}E[X'Z \mid X]\,\delta \\ &= \beta + \text{bias}. \end{align}

The second term after the equals sign is the omitted-variable bias in this case, which is non-zero if the omitted variable ''z'' is correlated with any of the included variables in the matrix ''X'' (that is, if ''X′Z'' does not equal a vector of zeroes). Note that the bias is equal to the weighted portion of ''z''''i'' which is "explained" by ''x''''i''.
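The matrix form of the bias can likewise be verified numerically. The following sketch (again an illustrative NumPy example; the dimensions, coefficients, and the way ''z'' is constructed are assumptions made for the demonstration) computes the term (''X′X'')⁻¹''X′Z''δ directly and compares it with the gap between the short-regression estimate and the true ''β''.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50_000, 3

# Assumed true parameters (arbitrary choices for illustration)
beta = np.array([1.0, -2.0, 0.5])   # coefficients on the included regressors
delta = 1.5                          # coefficient on the omitted variable z

X = rng.normal(size=(n, p))
# Construct z so it is correlated with the first column of X only (X'Z != 0)
z = 0.8 * X[:, 0] + rng.normal(size=n)
u = rng.normal(size=n)
y = X @ beta + z * delta + u

# Short regression: omit z
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Bias term from the derivation: (X'X)^{-1} X'Z delta
bias = np.linalg.solve(X.T @ X, X.T @ z) * delta

print(beta_hat - beta)  # approximately equals the bias term below
print(bias)             # large in the first coordinate, near zero elsewhere
```

Because ''z'' was built to covary only with the first included regressor, the predicted bias is concentrated in the first coordinate of the estimate, as the expression (''X′X'')⁻¹''X′Z''δ implies.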


Effect in ordinary least squares

The Gauss–Markov theorem states that regression models which fulfill the classical linear regression model assumptions provide the most efficient linear unbiased estimators. In ordinary least squares, the relevant assumption of the classical linear regression model is that the error term is uncorrelated with the regressors.

The presence of omitted-variable bias violates this particular assumption. The violation causes the OLS estimator to be biased and inconsistent. The direction of the bias depends on the estimators as well as the covariance between the regressors and the omitted variables. A positive covariance of the omitted variable with both a regressor and the dependent variable will lead the OLS estimate of the included regressor's coefficient to be greater than the true value of that coefficient. This effect can be seen by taking the expectation of the parameter, as shown in the previous section.
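The sign claim can be illustrated with one more small simulation (a hedged sketch; the coefficients are arbitrary assumptions). Here the omitted variable covaries positively with both the included regressor and, through its positive coefficient, the dependent variable, so the estimated slope exceeds the true one.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

x = rng.normal(size=n)
z = 0.6 * x + rng.normal(size=n)            # cov(x, z) > 0
y = 1.0 * x + 2.0 * z + rng.normal(size=n)  # z enters positively, so cov(z, y) > 0

# OLS slope of y on x alone, via the covariance formula for simple regression
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
print(slope)  # close to 1.0 + 2.0*0.6 = 2.2, above the true direct effect 1.0
```

Flipping the sign of either the 0.6 (the omitted variable's covariance with the regressor) or the 2.0 (its coefficient in the outcome equation) makes the product negative and biases the estimate downward instead.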


See also

* Confounding variable

