Regression dilution

Regression dilution, also known as regression attenuation, is the biasing of the linear regression slope towards zero (the underestimation of its absolute value), caused by errors in the independent variable.

Consider fitting a straight line for the relationship of an outcome variable ''y'' to a predictor variable ''x'', and estimating the slope of the line. Statistical variability, measurement error or random noise in the ''y'' variable causes uncertainty in the estimated slope, but not bias: on average, the procedure calculates the right slope. However, variability, measurement error or random noise in the ''x'' variable causes bias in the estimated slope (as well as imprecision). The greater the variance in the measurement of ''x'', the closer the estimated slope is to zero instead of the true value.

It may seem counter-intuitive that noise in the predictor variable ''x'' induces a bias, but noise in the outcome variable ''y'' does not. Recall that linear regression is not symmetric: the line of best fit for predicting ''y'' from ''x'' (the usual linear regression) is not the same as the line of best fit for predicting ''x'' from ''y''.
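The attenuation can be demonstrated directly by simulation. The sketch below (with arbitrary illustrative parameters, not values from any study) fits an ordinary least squares slope twice: once with noise only in ''y'', and once with noise in ''x'':

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_slope = 100_000, 2.0
x = rng.normal(size=n)                    # true predictor values

def ols_slope(pred, out):
    # ordinary least squares slope of `out` regressed on `pred`
    return np.cov(pred, out, ddof=1)[0, 1] / np.var(pred, ddof=1)

# Noise in y only: the slope estimate is noisy but unbiased.
y = true_slope * x + rng.normal(size=n)
print(ols_slope(x, y))                    # ~2.0

# Noise in x: the slope estimate is biased towards zero.
w = x + rng.normal(size=n)                # error-prone measurement of x
print(ols_slope(w, y))                    # ~1.0, i.e. 2.0 * var(x) / (var(x) + var(error))
```

Here var(''x'') and the measurement-error variance are both 1, so the slope is attenuated by a factor of 1/2.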


Slope correction

The regression slope and other regression coefficients can be disattenuated as follows.


The case of a fixed ''x'' variable

The case that ''x'' is fixed, but measured with noise, is known as the ''functional model'' or ''functional relationship''. It can be corrected using total least squares and errors-in-variables models in general.
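When the errors on the two variables are assumed to have equal variance, the total least squares (orthogonal regression) slope has a simple closed form. A minimal sketch under that assumption:

```python
import numpy as np

def tls_slope(x, y):
    # Total least squares (orthogonal regression) slope, assuming
    # the errors in x and y have equal variance.
    sxx = np.var(x, ddof=1)
    syy = np.var(y, ddof=1)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    return (syy - sxx + np.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
```

Applied to the simulated ''w'' and ''y'' above, where the two error variances happen to be equal, this recovers a slope near 2.0 rather than the attenuated 1.0.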


The case of a randomly distributed ''x'' variable

The case that the ''x'' variable arises randomly is known as the ''structural model'' or ''structural relationship''. For example, in a medical study patients are recruited as a sample from a population, and their characteristics such as blood pressure may be viewed as arising from a random sample.

Under certain assumptions (typically, normal distribution assumptions) there is a known ratio between the true slope and the expected estimated slope. Frost and Thompson (2000) review several methods for estimating this ratio and hence correcting the estimated slope. The term ''regression dilution ratio'', although not defined in quite the same way by all authors, is used for this general approach, in which the usual linear regression is fitted, and then a correction applied. The reply to Frost & Thompson by Longford (2001) refers the reader to other methods, which expand the regression model to acknowledge the variability in the ''x'' variable, so that no bias arises. Fuller (1987) is one of the standard references for assessing and correcting for regression dilution. Hughes (1993) shows that the regression dilution ratio methods apply approximately in survival models. Rosner (1992) shows that the ratio methods apply approximately to logistic regression models. Carroll et al. (1995) give more detail on regression dilution in nonlinear models, presenting the regression dilution ratio methods as the simplest case of ''regression calibration'' methods, in which additional covariates may also be incorporated.

In general, methods for the structural model require some estimate of the variability of the ''x'' variable. This will require repeated measurements of the ''x'' variable in the same individuals, either in a sub-study of the main data set or in a separate data set. Without this information it will not be possible to make a correction.
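As an illustration only (not one of the specific estimators compared by Frost and Thompson), the ratio approach can be sketched for the simple case in which each subject has two replicate measurements of the predictor: the within-subject differences estimate the measurement-error variance, which yields the regression dilution ratio, and the naive slope is divided by that ratio.

```python
import numpy as np

def corrected_slope(w1, w2, y):
    # Sketch of a simple ratio correction under the structural model,
    # assuming two replicate measurements (w1, w2) of the predictor.
    w = (w1 + w2) / 2                       # averaged measurement used in the fit
    err_var = np.var(w1 - w2, ddof=1) / 2   # single-measurement error variance
    # Regression dilution ratio of the averaged measurement
    # (averaging two replicates halves the error variance):
    ratio = 1 - (err_var / 2) / np.var(w, ddof=1)
    naive = np.cov(w, y, ddof=1)[0, 1] / np.var(w, ddof=1)
    return naive / ratio
```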


Multiple ''x'' variables

The case of multiple predictor variables subject to variability (possibly correlated) has been well-studied for linear regression, and for some non-linear regression models. Other non-linear models, such as proportional hazards models for survival analysis, have been considered only with a single predictor subject to variability.


Correlation correction

In 1904, Charles Spearman developed a procedure for correcting correlations for regression dilution, i.e., to "rid a correlation coefficient from the weakening effect of measurement error". In measurement and statistics, the procedure is also called correlation disattenuation or the disattenuation of correlation. The correction ensures that the Pearson correlation coefficient across data units (for example, people) between two sets of variables is estimated in a manner that accounts for error contained within the measurement of those variables.


Formulation

Let \beta and \theta be the true values of two attributes of some person or statistical unit. These values are variables by virtue of the assumption that they differ for different statistical units in the population. Let \hat{\beta} and \hat{\theta} be estimates of \beta and \theta derived either directly by observation-with-error or from application of a measurement model, such as the Rasch model. Also, let

:: \hat{\beta} = \beta + \epsilon_\beta , \quad\quad \hat{\theta} = \theta + \epsilon_\theta ,

where \epsilon_\beta and \epsilon_\theta are the measurement errors associated with the estimates \hat{\beta} and \hat{\theta}. The estimated correlation between the two sets of estimates is

:: \operatorname{corr}(\hat{\beta},\hat{\theta}) = \frac{\operatorname{cov}(\hat{\beta},\hat{\theta})}{\sqrt{\operatorname{var}[\hat{\beta}]\operatorname{var}[\hat{\theta}]}} = \frac{\operatorname{cov}(\beta+\epsilon_\beta,\ \theta+\epsilon_\theta)}{\sqrt{\operatorname{var}[\hat{\beta}]\operatorname{var}[\hat{\theta}]}},

which, assuming the errors are uncorrelated with each other and with the true attribute values, gives

:: \operatorname{corr}(\hat{\beta},\hat{\theta}) = \frac{\operatorname{cov}(\beta,\theta)}{\sqrt{\operatorname{var}[\hat{\beta}]\operatorname{var}[\hat{\theta}]}} = \frac{\operatorname{cov}(\beta,\theta)}{\sqrt{\operatorname{var}[\beta]\operatorname{var}[\theta]}} \cdot \frac{\sqrt{\operatorname{var}[\beta]\operatorname{var}[\theta]}}{\sqrt{\operatorname{var}[\hat{\beta}]\operatorname{var}[\hat{\theta}]}} = \rho \sqrt{R_\beta R_\theta},

where R_\beta is the ''separation index'' of the set of estimates of \beta, which is analogous to Cronbach's alpha; that is, in terms of classical test theory, R_\beta is analogous to a reliability coefficient. Specifically, the separation index is given as follows:

:: R_\beta = \frac{\operatorname{var}[\beta]}{\operatorname{var}[\hat{\beta}]} = \frac{\operatorname{var}[\hat{\beta}] - \operatorname{var}[\epsilon_\beta]}{\operatorname{var}[\hat{\beta}]},

where the mean squared standard error of the person estimates gives an estimate of the variance of the errors, \epsilon_\beta. The standard errors are normally produced as a by-product of the estimation process (see Rasch model estimation).

The disattenuated estimate of the correlation between the two sets of parameter estimates is therefore

:: \rho = \frac{\operatorname{corr}(\hat{\beta},\hat{\theta})}{\sqrt{R_\beta R_\theta}}.

That is, the disattenuated correlation estimate is obtained by dividing the correlation between the estimates by the geometric mean of the separation indices of the two sets of estimates. Expressed in terms of classical test theory, the correlation is divided by the geometric mean of the reliability coefficients of the two tests.

Given two random variables X^\prime and Y^\prime measured as X and Y with measured correlation r_{xy} and a known reliability for each variable, r_{xx} and r_{yy}, the estimated correlation between X^\prime and Y^\prime corrected for attenuation is

:: r_{x'y'} = \frac{r_{xy}}{\sqrt{r_{xx} r_{yy}}}.

How well the variables are measured affects the correlation of ''X'' and ''Y''. The correction for attenuation tells one what the estimated correlation is expected to be if one could measure ''X′'' and ''Y′'' with perfect reliability. Thus if ''X'' and ''Y'' are taken to be imperfect measurements of underlying variables ''X′'' and ''Y′'' with independent errors, then r_{x'y'} estimates the true correlation between ''X′'' and ''Y′''.
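In code, the classical-test-theory form of the correction is a one-line calculation; the numbers below are hypothetical:

```python
import math

def disattenuate(r_xy, r_xx, r_yy):
    # Spearman's correction for attenuation: divide the observed correlation
    # by the geometric mean of the two reliabilities.
    return r_xy / math.sqrt(r_xx * r_yy)

# Hypothetical values: observed correlation 0.5, reliabilities 0.7 and 0.8.
print(disattenuate(0.5, 0.7, 0.8))   # ~0.67
```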


Is correction necessary?

In statistical inference based on regression coefficients, yes; in predictive modelling applications, correction is neither necessary nor appropriate. To understand this, consider the measurement error as follows. Let ''y'' be the outcome variable, ''x'' be the true predictor variable, and ''w'' be an approximate observation of ''x''. Frost and Thompson suggest, for example, that ''x'' may be the true, long-term blood pressure of a patient, and ''w'' may be the blood pressure observed on one particular clinic visit. Regression dilution arises if we are interested in the relationship between ''y'' and ''x'', but estimate the relationship between ''y'' and ''w''. Because ''w'' is measured with variability, the slope of a regression line of ''y'' on ''w'' is less than that of the regression line of ''y'' on ''x''.

Does this matter? In predictive modelling, no. Standard methods can fit a regression of ''y'' on ''w'' without bias. There is bias only if we then use the regression of ''y'' on ''w'' as an approximation to the regression of ''y'' on ''x''. In the example, assuming that blood pressure measurements are similarly variable in future patients, our regression line of ''y'' on ''w'' (observed blood pressure) gives unbiased predictions.

An example of a circumstance in which correction is desired is prediction of change. Suppose the change in ''x'' is known under some new circumstance: to estimate the likely change in an outcome variable ''y'', the slope of the regression of ''y'' on ''x'' is needed, not that of ''y'' on ''w''. This arises in epidemiology. To continue the example in which ''x'' denotes blood pressure, perhaps a large clinical trial has provided an estimate of the change in blood pressure under a new treatment; then the possible effect on ''y'', under the new treatment, should be estimated from the slope in the regression of ''y'' on ''x''.

Another circumstance in which correction is desired is predictive modelling in which future observations are also variable, but not (in the phrase used above) "similarly variable". This would arise, for example, if the current data set included blood pressure measured with greater precision than is common in clinical practice. One specific example of this arose when developing a regression equation based on a clinical trial, in which blood pressure was the average of six measurements, for use in clinical practice, where blood pressure is usually a single measurement.
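The distinction can be checked by simulation. In this sketch (arbitrary parameters, in the spirit of the blood-pressure example), the model fitted to ''y'' on ''w'' has an attenuated slope, yet it predicts new, similarly measured patients without bias:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)              # true long-term blood pressure (standardised)
w = x + rng.normal(size=n)          # single-visit measurement of x
y = 2.0 * x + rng.normal(size=n)    # outcome driven by the true x

# Fit y on w: the slope is attenuated (about 1.0 here rather than 2.0) ...
slope = np.cov(w, y, ddof=1)[0, 1] / np.var(w, ddof=1)
intercept = y.mean() - slope * w.mean()

# ... but predictions for new patients measured the same way are unbiased.
x_new = rng.normal(size=n)
w_new = x_new + rng.normal(size=n)
y_new = 2.0 * x_new + rng.normal(size=n)
print(slope)                                           # ~1.0
print(np.mean(y_new - (intercept + slope * w_new)))    # ~0.0
```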


Caveats

All of these results can be shown mathematically, in the case of simple linear regression assuming normal distributions throughout (the framework of Frost & Thompson). It has been noted that a poorly executed correction for regression dilution, in particular one performed without checking the underlying assumptions, may do more damage to an estimate than no correction.


Further reading

Regression dilution was first mentioned, under the name attenuation, by Spearman (1904). Those seeking a readable mathematical treatment might like to start with Frost and Thompson (2000).


See also

* Quantization (signal processing) – a common source of error in the explanatory or independent variables


References

* Carroll, R. J., Ruppert, D., and Stefanski, L. A. (1995). ''Measurement Error in Non-linear Models''. New York: Wiley.
* Frost, C. and Thompson, S. (2000). "Correcting for regression dilution bias: comparison of methods for a single predictor variable." ''Journal of the Royal Statistical Society'', Series A, 163: 173–190.