
In statistics, an errors-in-variables model or a measurement error model is a regression model that accounts for measurement errors in the independent variables. In contrast, standard regression models assume that those regressors have been measured exactly, or observed without error; as such, those models account only for errors in the dependent variables, or responses.
In the case when some regressors have been measured with errors, estimation based on the standard assumption leads to inconsistent estimates, meaning that the parameter estimates do not tend to the true values even in very large samples. For simple linear regression the effect is an underestimate of the coefficient, known as the ''attenuation bias''. In non-linear models the direction of the bias is likely to be more complicated.
Motivating example
Consider a simple linear regression model of the form
:
y_t = \alpha + \beta x_t^* + \varepsilon_t, \qquad t = 1, \ldots, T,
where x_t^* denotes the ''true'' but unobserved regressor. Instead, we observe this value with an error:
:
x_t = x_t^* + \eta_t,
where the measurement error \eta_t is assumed to be independent of the true value x_t^*.
A practical application is the standard school science experiment for Hooke's law, in which one estimates the relationship between the weight added to a spring and the amount by which the spring stretches.
If the y_t′s are simply regressed on the x_t′s (see simple linear regression), then the estimator for the slope coefficient is
:
\widehat\beta_x = \frac{\tfrac{1}{T}\sum_{t=1}^T (x_t-\bar x)(y_t-\bar y)}{\tfrac{1}{T}\sum_{t=1}^T (x_t-\bar x)^2},
which converges as the sample size T increases without bound:
:
\widehat\beta_x \ \xrightarrow{p}\ \frac{\operatorname{Cov}[\,x_t, y_t\,]}{\operatorname{Var}[\,x_t\,]} = \frac{\beta\sigma^2_{x^*}}{\sigma^2_{x^*} + \sigma^2_\eta} = \frac{\beta}{1 + \sigma^2_\eta/\sigma^2_{x^*}}.
This is in contrast to the "true" effect of \beta, estimated using the x_t^*:
:
\widehat\beta_{x^*} \ \xrightarrow{p}\ \frac{\operatorname{Cov}[\,x^*_t, y_t\,]}{\operatorname{Var}[\,x^*_t\,]} = \beta.
Variances are non-negative, so that in the limit the estimated \widehat\beta_x is smaller than \widehat\beta_{x^*}, an effect which statisticians call ''attenuation'' or regression dilution. Thus the ‘naïve’ least squares estimator \widehat\beta_x is an inconsistent estimator for \beta. However, \widehat\beta_x is a consistent estimator of the parameter required for a best linear predictor of y given the observed x: in some applications this may be what is required, rather than an estimate of the ‘true’ regression coefficient \beta, although that would assume that the variance of the errors in the estimation and prediction is identical. This follows directly from the result quoted immediately above, and the fact that the regression coefficient relating the y_t′s to the actually observed x_t′s, in a simple linear regression, is given by
:
\beta_x = \frac{\operatorname{Cov}[\,x_t, y_t\,]}{\operatorname{Var}[\,x_t\,]}.
It is this coefficient, rather than \beta, that would be required for constructing a predictor of y based on an observed x which is subject to noise.
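The attenuation result can be checked with a short simulation. The sketch below (Python with NumPy; the parameter values and variable names are illustrative choices, not part of the model) regresses y on both the noisy and the true regressor and compares the naïve slope with the limit \beta/(1 + \sigma^2_\eta/\sigma^2_{x^*}) derived above.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
T = 200_000                    # large sample, so estimates sit near their probability limits
alpha, beta = 1.0, 2.0         # "true" intercept and slope
sd_xstar, sd_eta, sd_eps = 1.0, 0.5, 0.3   # illustrative standard deviations

x_star = rng.normal(0.0, sd_xstar, T)                   # latent regressor x*
x = x_star + rng.normal(0.0, sd_eta, T)                 # observed regressor with classical error
y = alpha + beta * x_star + rng.normal(0.0, sd_eps, T)  # response

def slope(regressor, response):
    """Least-squares slope of response regressed on regressor."""
    rc = regressor - regressor.mean()
    return rc @ (response - response.mean()) / (rc @ rc)

naive = slope(x, y)            # attenuated estimate based on the noisy x
oracle = slope(x_star, y)      # infeasible estimate based on the true x*
limit = beta / (1 + sd_eta**2 / sd_xstar**2)            # predicted probability limit

print(f"naive slope     : {naive:.3f}")   # about 1.6 with these variances
print(f"oracle slope    : {oracle:.3f}")  # about 2.0
print(f"predicted limit : {limit:.3f}")   # 1.6
</syntaxhighlight>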
It can be argued that almost all existing data sets contain errors of different nature and magnitude, so that attenuation bias is extremely frequent (although in multivariate regression the direction of bias is ambiguous).
Jerry Hausman sees this as an ''iron law of econometrics'': "The magnitude of the estimate is usually smaller than expected."
Specification
Usually, measurement error models are described using the latent variables approach. If y_t is the response variable and x_t are observed values of the regressors, then it is assumed there exist some latent variables y_t^* and x_t^* which follow the model's "true" functional relationship, and such that the observed quantities are their noisy observations:
:
\begin{cases}
y_t^* = g(x_t^*, w_t \,|\, \theta), \\
y_t = y_t^* + \varepsilon_t, \\
x_t = x_t^* + \eta_t,
\end{cases}
where \theta is the model's parameter and w_t are those regressors which are assumed to be error-free (for example, when linear regression contains an intercept, the regressor which corresponds to the constant certainly has no "measurement errors"). Depending on the specification these error-free regressors may or may not be treated separately; in the latter case it is simply assumed that the corresponding entries in the variance matrix of the \eta_t's are zero.
The variables y_t, x_t, w_t are all ''observed'', meaning that the statistician possesses a data set of T statistical units \{y_t,\, x_t,\, w_t\}_{t=1}^{T} which follow the data generating process described above; the latent variables y_t^*, x_t^*, \varepsilon_t, and \eta_t are not observed, however.
This specification does not encompass all the existing errors-in-variables models. For example, in some of them, the function g may be non-parametric or semi-parametric. Other approaches model the relationship between y_t^* and x_t^* as distributional instead of functional; that is, they assume that y_t^* conditionally on x_t^* follows a certain (usually parametric) distribution.
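As a concrete picture of this data generating process, the following sketch simulates the functional model above. The particular function g, the parameter values, and the error distributions are arbitrary assumptions made only for the example; the point is merely the split between latent and observed quantities.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
T = 1_000
theta = (0.5, 2.0)                      # illustrative parameter vector

def g(x_star, w, theta):
    """An arbitrary 'true' functional relationship, chosen only for illustration."""
    a, b = theta
    return a * w + b * np.sin(x_star)

# Latent quantities: none of these are available to the statistician.
x_star = rng.normal(0.0, 1.0, T)        # true regressor x*
w = rng.normal(0.0, 1.0, T)             # error-free regressor (this one is observed)
y_star = g(x_star, w, theta)            # true response y*
eps = rng.normal(0.0, 0.2, T)           # observation error on the response
eta = rng.normal(0.0, 0.4, T)           # measurement error on x*

# Observed data set: (y_t, x_t, w_t) for t = 1..T.
y = y_star + eps
x = x_star + eta
</syntaxhighlight>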
Terminology and assumptions
* The observed variable x_t may be called the ''manifest'', ''indicator'', or ''proxy'' variable.
* The unobserved variable x_t^* may be called the ''latent'' or ''true'' variable. It may be regarded either as an unknown constant (in which case the model is called a ''functional model''), or as a random variable (correspondingly a ''structural model'').
* The relationship between the measurement error \eta_t and the latent variable x_t^* can be modeled in different ways:
** ''Classical errors'': \eta_t \perp x_t^*, that is, the errors are independent of the latent variable. This is the most common assumption; it implies that the errors are introduced by the measuring device and their magnitude does not depend on the value being measured.
** ''Mean-independence'': \operatorname{E}[\,\eta_t \mid x_t^*\,] = 0, that is, the errors are mean-zero for every value of the latent regressor. This is a less restrictive assumption than the classical one, as it allows for the presence of heteroscedasticity or other effects in the measurement errors.
** ''Berkson's errors'': \eta_t \perp x_t, that is, the errors are independent of the ''observed'' regressor ''x''. This assumption has very limited applicability. One example is round-off errors: for example, if a person's ''age''* is a continuous random variable, whereas the observed ''age'' is truncated to the next smallest integer, then the truncation error is approximately independent of the observed ''age''. Another possibility is the fixed design experiment: for example, if a scientist decides to make a measurement at a certain predetermined moment of time ''x'', then the real measurement may occur at some other value of ''x'' (for example, due to her finite reaction time), and such measurement error will be generally independent of the "observed" value of the regressor. (A short simulation contrasting classical and Berkson errors follows this list.)
** ''Misclassification errors'': special case used for the dummy regressors. If x_t^* is an indicator of a certain event or condition (such as person is male/female, some medical treatment given/not, etc.), then the measurement error in such a regressor will correspond to incorrect classification, similar to type I and type II errors in statistical testing. In this case the error \eta_t may take only 3 possible values, and its distribution conditional on x_t^* is modeled with two parameters, the misclassification probabilities \operatorname{Pr}[\,\eta_t = 1 \mid x_t^* = 0\,] and \operatorname{Pr}[\,\eta_t = -1 \mid x_t^* = 1\,].
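The following is a minimal simulation contrasting classical and Berkson errors in a simple linear model (all numbers are illustrative). With classical errors the least-squares slope is attenuated, whereas with Berkson errors regressing on the observed value leaves the slope consistent, because the composite error term is independent of the observed regressor.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
T, beta = 100_000, 2.0

def slope(regressor, response):
    rc = regressor - regressor.mean()
    return rc @ (response - response.mean()) / (rc @ rc)

# Classical error: the observation scatters around the true value, x = x* + eta.
x_star = rng.normal(0.0, 1.0, T)
x_obs = x_star + rng.normal(0.0, 0.5, T)
y = beta * x_star + rng.normal(0.0, 0.2, T)
print("classical errors:", round(slope(x_obs, y), 3))    # attenuated, about 1.6

# Berkson error: the true value scatters around the observed one, x* = x + u
# (think of a nominal dial setting x chosen by the experimenter).
x_set = rng.normal(0.0, 1.0, T)
x_true = x_set + rng.normal(0.0, 0.5, T)
y_b = beta * x_true + rng.normal(0.0, 0.2, T)
print("Berkson errors:  ", round(slope(x_set, y_b), 3))  # close to 2.0
</syntaxhighlight>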
Linear model
Linear errors-in-variables models were studied first, probably because linear models were so widely used and they are easier than non-linear ones. Unlike standard least-squares regression (OLS), extending errors-in-variables regression (EiV) from the simple to the multivariable case is not straightforward, unless one treats all variables in the same way, i.e. assumes equal reliability.
Simple linear model
The simple linear errors-in-variables model was already presented in the "motivation" section:
:
\begin{cases}
y_t = \alpha + \beta x_t^* + \varepsilon_t, \\
x_t = x_t^* + \eta_t,
\end{cases}
where all variables are scalar. Here ''α'' and ''β'' are the parameters of interest, whereas σ_ε and σ_η (the standard deviations of the error terms) are the nuisance parameters. The "true" regressor ''x*'' is treated as a random variable (''structural'' model), independent of the measurement error ''η'' (''classic'' assumption).
This model is identifiable in two cases: (1) either the latent regressor ''x*'' is ''not'' normally distributed, or (2) ''x*'' has a normal distribution, but neither ''ε_t'' nor ''η_t'' is divisible by a normal distribution. That is, the parameters ''α'', ''β'' can be consistently estimated from the data set
\scriptstyle(x_t,\,y_t)_{t=1}^{T} without any additional information, provided the latent regressor is not Gaussian.
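To make the identifiability claim concrete, the sketch below uses a third-moment-based estimator that is consistent when the latent regressor is skewed (and hence certainly not Gaussian), assuming mutually independent classical errors. It is offered only as one illustration of how non-normality of ''x*'' can be exploited, not as a general-purpose method.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)
T, alpha, beta = 500_000, 1.0, 2.0

x_star = rng.exponential(1.0, T)             # skewed latent regressor
x = x_star + rng.normal(0.0, 0.5, T)         # classical measurement error
y = alpha + beta * x_star + rng.normal(0.0, 0.3, T)

xc, yc = x - x.mean(), y - y.mean()

# With x*, eta, eps mutually independent and the errors mean-zero,
#   E[yc^2 * xc] = beta^2 * E[(x* - Ex*)^3]   and   E[yc * xc^2] = beta * E[(x* - Ex*)^3],
# so their ratio recovers beta whenever the third central moment of x* is nonzero.
beta_third_moment = np.mean(yc**2 * xc) / np.mean(yc * xc**2)
beta_naive = (xc @ yc) / (xc @ xc)

print(f"naive OLS slope        : {beta_naive:.3f}")         # attenuated, about 1.6
print(f"third-moment estimator : {beta_third_moment:.3f}")  # close to 2.0
</syntaxhighlight>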
Before this identifiability result was established, statisticians attempted to apply the maximum likelihood technique by assuming that all variables are normal, and then concluded that the model is not identified. The suggested remedy was to ''assume'' that some of the parameters of the model are known or can be estimated from an outside source. Such estimation methods include:
* Deming regression — assumes that the ratio ''δ'' = σ²_ε/σ²_η is known. This could be appropriate, for example, when errors in ''y'' and ''x'' are both caused by measurements, and the accuracy of the measuring devices or procedures is known. The case when ''δ'' = 1 is also known as orthogonal regression.
* Regression with known reliability ratio ''λ'' = σ²_∗/(σ²_η + σ²_∗), where σ²_∗ is the variance of the latent regressor. Such an approach may be applicable, for example, when repeated measurements of the same unit are available, or when the reliability ratio is known from an independent study. In this case the consistent estimate of the slope is equal to the least-squares estimate divided by ''λ''.
* Regression with known σ²_η may occur when the source of the errors in the ''x''′s is known and their variance can be calculated. This could include rounding errors, or errors introduced by the measuring device. When σ²_η is known, we can compute the reliability ratio as ''λ'' = (σ²_x − σ²_η)/σ²_x and reduce the problem to the previous case (see the sketch after this list).
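A minimal sketch of the corrections in the last two bullets above, assuming the error variance σ²_η is known (all numbers are illustrative): estimate the reliability ratio ''λ'' from σ²_η and the sample variance of ''x'', then divide the least-squares slope by it.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(4)
T, alpha, beta = 200_000, 1.0, 2.0
sd_eta = 0.5                                   # error standard deviation, assumed known

x_star = rng.normal(0.0, 1.0, T)
x = x_star + rng.normal(0.0, sd_eta, T)
y = alpha + beta * x_star + rng.normal(0.0, 0.3, T)

xc, yc = x - x.mean(), y - y.mean()
beta_ols = (xc @ yc) / (xc @ xc)               # naive, attenuated slope

# Known sigma^2_eta: lambda = (sigma^2_x - sigma^2_eta) / sigma^2_x,
# and the corrected slope is the least-squares slope divided by lambda.
var_x = xc @ xc / T
lam = (var_x - sd_eta**2) / var_x
beta_corrected = beta_ols / lam

print(f"naive slope     : {beta_ols:.3f}")        # about 1.6
print(f"corrected slope : {beta_corrected:.3f}")  # about 2.0
</syntaxhighlight>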
There are also estimation methods that do not assume knowledge of any of the parameters of the model.
Multivariable linear model
The multivariable model looks exactly like the simple linear model, only this time ''β'', ''η_t'', ''x_t'' and ''x*_t'' are ''k''×1 vectors.
:
\begin{cases}
y_t = \alpha + \beta'x_t^* + \varepsilon_t, \\
x_t = x_t^* + \eta_t.
\end{cases}
In the case when (''ε_t'', ''η_t'') is jointly normal, the parameter ''β'' is not identified if and only if there is a non-singular ''k×k'' block matrix [''a'' ''A''], where ''a'' is a ''k''×1 vector, such that ''a′x*'' is distributed normally and independently of ''A′x*''. In the case when ''ε_t'', ''η_t1'', ..., ''η_tk'' are mutually independent, the parameter ''β'' is not identified if and only if, in addition to the conditions above, some of the errors can be written as the sum of two independent variables, one of which is normal.
Several estimation methods exist for multivariable linear models.
Non-linear models
A generic non-linear measurement error model takes the form
:
\begin{cases}
y_t = g(x^*_t) + \varepsilon_t, \\
x_t = x^*_t + \eta_t.
\end{cases}
Here the function ''g'' can be either parametric or non-parametric. When the function ''g'' is parametric it will be written as ''g''(''x*'', ''β'').
For a general vector-valued regressor ''x*'' the conditions for model
identifiability are not known. However, in the case of scalar ''x*'' the model is identified unless the function ''g'' is of the "log-exponential" form
:
g(x^*) = a + b \ln\big(e^{cx^*} + d\big)
and the latent regressor ''x*'' has density
:
f_{x^*}(x) = \begin{cases}
A e^{-Be^{Cx} + CDx}\left(e^{Cx} + E\right)^{-F}, & \text{if}\ d>0, \\
A e^{-Bx^2 + Cx}, & \text{if}\ d=0,
\end{cases}
where constants ''A'',''B'',''C'',''D'',''E'',''F'' may depend on ''a'',''b'',''c'',''d''.
Despite this optimistic result, as of now no methods exist for estimating non-linear errors-in-variables models without any extraneous information. However, there are several techniques which make use of some additional data: either the instrumental variables, or repeated observations.
Instrumental variables methods
Repeated observations
In this approach two (or maybe more) repeated observations of the regressor ''x*'' are available. Both observations contain their own measurement errors; however, those errors are required to be independent:
:
\begin{cases}
x_{1t} = x^*_t + \eta_{1t}, \\
x_{2t} = x^*_t + \eta_{2t},
\end{cases}
where ''x*'' ⊥ ''η_1'' ⊥ ''η_2''. The variables ''η_1'', ''η_2'' need not be identically distributed (although if they are, the efficiency of the estimator can be slightly improved). With only these two observations it is possible to consistently estimate the density function of ''x*'' using Kotlarski's deconvolution technique.
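Kotlarski's deconvolution argument itself is too involved for a short example, but in the linear errors-in-variables model two independently contaminated measurements already give a simple consistent slope estimator: using the second measurement as an instrument for the first, Cov(x_2, y)/Cov(x_2, x_1) converges to ''β''. The sketch below illustrates only this instrumental-variable use of repeated observations, not the deconvolution construction described above.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(5)
T, alpha, beta = 200_000, 1.0, 2.0

x_star = rng.normal(0.0, 1.0, T)               # latent regressor
x1 = x_star + rng.normal(0.0, 0.5, T)          # first noisy measurement
x2 = x_star + rng.normal(0.0, 0.7, T)          # second, independently contaminated measurement
y = alpha + beta * x_star + rng.normal(0.0, 0.3, T)

def cov(a, b):
    return np.mean((a - a.mean()) * (b - b.mean()))

beta_naive = cov(x1, y) / cov(x1, x1)          # attenuated
beta_repeated = cov(x2, y) / cov(x2, x1)       # consistent: independent errors cancel from the ratio

print(f"naive slope (x1 only)         : {beta_naive:.3f}")     # about 1.6
print(f"slope using both measurements : {beta_repeated:.3f}")  # close to 2.0
</syntaxhighlight>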
External links
* ''An Historical Overview of Linear Regression with Errors in both Variables'', J.W. Gillard, 2006
* by Mark Thoma.
{{DEFAULTSORT:Errors-In-Variables Models}}
[[Category:Regression models]]
[[Category:Least squares]]