In statistics, '''errors-in-variables models''' or '''measurement error models''' are regression models that account for measurement errors in the independent variables. In contrast, standard regression models assume that those regressors have been measured exactly, or observed without error; as such, those models account only for errors in the dependent variables, or responses.

In the case when some regressors have been measured with errors, estimation based on the standard assumption leads to inconsistent estimates, meaning that the parameter estimates do not tend to the true values even in very large samples. For simple linear regression the effect is an underestimate of the coefficient, known as the ''attenuation bias''. In non-linear models the direction of the bias is likely to be more complicated.
Motivating example
Consider a simple linear regression model of the form
:<math> y_t = \alpha + \beta x_t^* + \varepsilon_t, \qquad t = 1, \ldots, T, </math>
where <math>x_t^*</math> denotes the ''true'' but unobserved regressor. Instead we observe this value with an error:
:<math> x_t = x_t^* + \eta_t, </math>
where the measurement error <math>\eta_t</math> is assumed to be independent of the true value <math>x_t^*</math>.
If the <math>y_t</math>′s are simply regressed on the <math>x_t</math>′s (see simple linear regression), then the estimator for the slope coefficient is
:<math> \hat\beta = \frac{\tfrac{1}{T}\sum_{t=1}^T (x_t - \bar x)(y_t - \bar y)}{\tfrac{1}{T}\sum_{t=1}^T (x_t - \bar x)^2}, </math>
which converges as the sample size <math>T</math> increases without bound:
:<math> \hat\beta\ \xrightarrow{p}\ \frac{\operatorname{Cov}[\,x_t, y_t\,]}{\operatorname{Var}[\,x_t\,]} = \frac{\beta \sigma^2_{x^*}}{\sigma^2_{x^*} + \sigma^2_\eta} = \frac{\beta}{1 + \sigma^2_\eta / \sigma^2_{x^*}}. </math>
Variances are non-negative, so that in the limit the estimate is smaller in magnitude than the true value of <math>\beta</math>, an effect which statisticians call ''attenuation'' or regression dilution. Thus the ‘naïve’ least squares estimator is inconsistent in this setting. However, the estimator is a consistent estimator of the parameter required for a best linear predictor of <math>y</math> given <math>x</math>: in some applications this may be what is required, rather than an estimate of the ‘true’ regression coefficient <math>\beta</math>, although that would assume that the variance of the errors in observing <math>x^*</math> remains fixed. This follows directly from the result quoted immediately above, and the fact that the regression coefficient relating the <math>y_t</math>′s to the actually observed <math>x_t</math>′s, in a simple linear regression, is given by
:<math> \beta_x = \frac{\operatorname{Cov}[\,x_t, y_t\,]}{\operatorname{Var}[\,x_t\,]}. </math>
It is this coefficient, rather than <math>\beta</math>, that would be required for constructing a predictor of <math>y</math> based on an observed <math>x</math> which is subject to noise.
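The attenuation effect is easy to reproduce by simulation. The following sketch (an illustration added here, not part of the article's sources; it assumes numpy and uses arbitrary parameter values <math>\beta = 2</math>, <math>\sigma^2_{x^*} = \sigma^2_\eta = 1</math>) regresses <math>y</math> on the noisy <math>x</math> and compares the naïve slope with the theoretical limit <math>\beta / (1 + \sigma^2_\eta / \sigma^2_{x^*})</math>:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

T = 1_000_000            # large sample, so the estimate is near its limit
alpha, beta = 0.5, 2.0   # true intercept and slope (arbitrary choices)
sigma_xstar, sigma_eta = 1.0, 1.0

x_star = rng.normal(0.0, sigma_xstar, T)   # latent regressor x*
eps    = rng.normal(0.0, 1.0, T)           # equation error
eta    = rng.normal(0.0, sigma_eta, T)     # measurement error, independent of x*

y = alpha + beta * x_star + eps            # true model
x = x_star + eta                           # observed, error-ridden regressor

# naive OLS slope of y on the observed x: cov(x, y) / var(x)
beta_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

limit = beta / (1 + sigma_eta**2 / sigma_xstar**2)
print(f"naive slope: {beta_hat:.3f}, theoretical limit: {limit:.3f}, true beta: {beta}")
# naive slope is near 1.0 here: attenuated to half the true slope, since the
# two variances are equal
</syntaxhighlight>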
It can be argued that almost all existing data sets contain errors of different nature and magnitude, so that attenuation bias is extremely frequent (although in multivariate regression the direction of bias is ambiguous).
Jerry Hausman sees this as an ''iron law of econometrics'': "The magnitude of the estimate is usually smaller than expected."
Specification
Usually measurement error models are described using the latent variables approach. If <math>y</math> is the response variable and <math>x</math> are observed values of the regressors, then it is assumed there exist some latent variables <math>y^*</math> and <math>x^*</math> which follow the model's “true” functional relationship <math>g(\cdot)</math>, and such that the observed quantities are their noisy observations:
:<math> \begin{cases} y_t^* = g(x_t^*,\, w_t \,|\, \theta), \\ y_t = y_t^* + \varepsilon_t, \\ x_t = x_t^* + \eta_t, \end{cases} </math>
where <math>\theta</math> is the model's parameter and <math>w_t</math> are those regressors which are assumed to be error-free (for example, when linear regression contains an intercept, the regressor which corresponds to the constant certainly has no "measurement errors"). Depending on the specification these error-free regressors may or may not be treated separately; in the latter case it is simply assumed that the corresponding entries in the variance matrix of the <math>\eta_t</math>'s are zero.
The variables <math>y_t</math>, <math>x_t</math>, <math>w_t</math> are all ''observed'', meaning that the statistician possesses a data set of <math>T</math> statistical units <math>\{ y_t,\, x_t,\, w_t \}_{t=1}^{T}</math> which follow the data generating process described above; the latent variables <math>x_t^*</math>, <math>y_t^*</math>, <math>\varepsilon_t</math>, and <math>\eta_t</math> are not observed, however.
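As a concrete illustration, the following sketch (hypothetical; a linear <math>g</math>, normal errors, and arbitrary parameter values are assumed purely for demonstration) generates one such data set. The statistician would see only <math>y</math>, <math>x</math>, and <math>w</math>, while <math>x^*</math>, <math>y^*</math>, <math>\varepsilon</math>, and <math>\eta</math> stay hidden:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
T = 500                               # number of statistical units

theta = np.array([1.0, 2.0, -0.5])    # model parameters (arbitrary)

# latent regressor and an error-free regressor
x_star = rng.normal(0.0, 1.0, T)
w      = rng.binomial(1, 0.4, T).astype(float)   # e.g. a treatment dummy

# true functional relationship g(x*, w | theta), here taken to be linear
y_star = theta[0] + theta[1] * x_star + theta[2] * w

# noisy observations of the latent quantities
eps = rng.normal(0.0, 0.3, T)         # error in the response
eta = rng.normal(0.0, 0.5, T)         # measurement error in the regressor
y = y_star + eps
x = x_star + eta

observed = {"y": y, "x": x, "w": w}   # all the statistician gets to see
</syntaxhighlight>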
This specification does not encompass all the existing errors-in-variables models. For example, in some of them the function <math>g(\cdot)</math> may be non-parametric or semi-parametric. Other approaches model the relationship between <math>y^*</math> and <math>x^*</math> as distributional instead of functional, that is, they assume that <math>y^*</math> conditionally on <math>x^*</math> follows a certain (usually parametric) distribution.
Terminology and assumptions
* The observed variable <math>x</math> may be called the ''manifest'', ''indicator'', or ''proxy'' variable.
* The unobserved variable <math>x^*</math> may be called the ''latent'' or ''true'' variable. It may be regarded either as an unknown constant (in which case the model is called a ''functional model''), or as a random variable (correspondingly a ''structural model'').
* The relationship between the measurement error <math>\eta</math> and the latent variable <math>x^*</math> can be modeled in different ways:
** ''Classical errors'': <math>\eta \perp x^*,</math> the errors are independent of the latent variable. This is the most common assumption; it implies that the errors are introduced by the measuring device and their magnitude does not depend on the value being measured.
** ''Mean-independence'': <math>\operatorname{E}[\eta \,|\, x^*] = 0,</math> the errors are mean-zero for every value of the latent regressor. This is a less restrictive assumption than the classical one, as it allows for the presence of heteroscedasticity or other effects in the measurement errors.
** ''Berkson's errors'': <math>\eta \perp x,</math> the errors are independent of the ''observed'' regressor <math>x</math>. This assumption has very limited applicability. One example is round-off errors: for example, if a person's age* is a continuous random variable, whereas the observed age is truncated to the next smallest integer, then the truncation error is approximately independent of the observed age. Another possibility is with the fixed design experiment: for example, if a scientist decides to make a measurement at a certain predetermined moment of time <math>x</math>, then the real measurement may occur at some other value of <math>x^*</math> (for example, due to her finite reaction time) and such measurement error will be generally independent of the "observed" value of the regressor.
** ''Misclassification errors'': special case used for the dummy regressors. If <math>x^*</math> is an indicator of a certain event or condition (such as person is male/female, some medical treatment given/not, etc.), then the measurement error in such a regressor will correspond to incorrect classification, similar to type I and type II errors in statistical testing. In this case the error <math>\eta</math> may take only 3 possible values, and its distribution conditional on <math>x^*</math> is modeled with two parameters: <math>\alpha = \operatorname{Pr}[\eta = -1 \,|\, x^* = 1]</math> and <math>\beta = \operatorname{Pr}[\eta = 1 \,|\, x^* = 0]</math>.
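A minimal sketch of such misclassification (a hypothetical illustration with arbitrary values <math>\alpha = 0.1</math> and <math>\beta = 0.2</math>) flips a binary latent regressor with these probabilities, so that <math>\eta = x - x^*</math> indeed takes values only in <math>\{-1, 0, 1\}</math>:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
T = 100_000

alpha = 0.1   # Pr[eta = -1 | x* = 1]: a true 1 recorded as 0
beta  = 0.2   # Pr[eta = +1 | x* = 0]: a true 0 recorded as 1

x_star = rng.binomial(1, 0.5, T)   # latent binary regressor
u = rng.random(T)

# observed regressor after possible misclassification
x = np.where(x_star == 1,
             np.where(u < alpha, 0, 1),
             np.where(u < beta, 1, 0))

eta = x - x_star                   # takes only the values -1, 0, +1
for v in (-1, 0, 1):
    print(f"Pr[eta = {v:+d}] is approximately {np.mean(eta == v):.3f}")
</syntaxhighlight>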