Control functions (also known as two-stage residual inclusion) are statistical methods to correct for
endogeneity problems by modelling the endogeneity in the error term. The approach thereby differs in important ways from other models that try to account for the same
econometric problem.
Instrumental variables, for example, attempt to model the endogenous variable ''X'' as an often
invertible model with respect to a relevant and exogenous instrument ''Z''.
Panel analysis uses special data properties to difference out unobserved heterogeneity that is assumed to be fixed over time.
Control functions were introduced by Heckman and Robb, although the principle can be traced back to earlier papers. One reason for their popularity is that they work for non-invertible models (such as discrete choice models) and allow for heterogeneous effects, where effects at the individual level can differ from effects at the aggregate. A well-known example of the control function approach is the Heckman correction.
Formal definition
Assume we start from a standard endogenous variable setup with additive errors, where ''X'' is an endogenous variable and ''Z'' is an exogenous variable that can serve as an instrument:

$$Y = g(X) + U \qquad (1)$$

$$X = \pi(Z) + V \qquad (2)$$

$$E[U \mid Z, V] = E[U \mid V] \qquad (3)$$

$$E[V \mid Z] = 0 \qquad (4)$$

A popular instrumental variable approach is to use a two-step procedure and estimate equation (2) first and then use the estimates of this first step to estimate equation (1) in a second step. The control function approach, however, uses the fact that this model implies

$$E[Y \mid Z, X] = g(X) + E[U \mid Z, X] = g(X) + E[U \mid Z, V] = g(X) + E[U \mid V] = g(X) + h(V) \qquad (5)$$

The function ''h''(''V'') is effectively the control function that models the endogeneity, and it is what gives this econometric approach its name.
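As an illustration, the two-step idea can be sketched for the special case in which both the outcome equation and the control function are linear, so that ''h''(''V'') is a constant times ''V''. That linearity, the simulated data, and all names in the snippet (`ols`, `control_function_fit`, the coefficient values) are assumptions made for the example only and do not come from the literature cited here.

```python
import numpy as np

def ols(design, y):
    """Least-squares coefficients for y = design @ beta + error."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def control_function_fit(y, x, z):
    """Two-step control function estimate assuming linear g and linear h."""
    const = np.ones(len(y))
    # Step 1: regress X on Z and keep the residuals as an estimate of V.
    first_stage = np.column_stack([const, z])
    v_hat = x - first_stage @ ols(first_stage, x)
    # Step 2: regress Y on X and the residuals; v_hat stands in for h(V).
    second_stage = np.column_stack([const, x, v_hat])
    _, beta_x, rho = ols(second_stage, y)
    return beta_x, rho

# Simulated data (hypothetical design): U depends on V, so X is endogenous.
rng = np.random.default_rng(0)
n = 20_000
z = rng.normal(size=n)
v = rng.normal(size=n)
u = 0.8 * v + rng.normal(size=n)
x = 1.0 + 0.7 * z + v
y = 2.0 + 1.5 * x + u              # true coefficient on X is 1.5

naive = ols(np.column_stack([np.ones(n), x]), y)[1]
beta_cf, rho_cf = control_function_fit(y, x, z)
print(f"naive OLS: {naive:.3f}, control function: {beta_cf:.3f}, rho: {rho_cf:.3f}")
```

In this simulated design the naive regression of ''Y'' on ''X'' should overstate the coefficient, while the estimate that includes the first-stage residual should land near the true value of 1.5. In the fully linear case the coefficient on ''X'' coincides with the two-stage least squares estimate; the approaches differ once the model is nonlinear.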
In a Rubin causal model potential outcomes framework, where ''Y''<sub>1</sub> is the outcome variable of people for whom the participation indicator ''D'' equals 1, the control function approach leads to the following model

$$E[Y_1 \mid X, Z, D = 1] = \mu_1(X) + E[U \mid D = 1]$$

as long as the potential outcomes ''Y''<sub>0</sub> and ''Y''<sub>1</sub> are independent of ''D'' conditional on ''X'' and ''Z''.
[Heckman, J. J., and E. J. Vytlacil (2007): Econometric Evaluation of Social Programs, Part II: Using the Marginal Treatment Effect to Organize Alternative Econometric Estimators to Evaluate Social Programs, and to Forecast the Effects in New Environments. Handbook of Econometrics, Vol 6, ed. by J. J. Heckman and E. E. Leamer. North Holland.]
Variance correction
Since the second-stage regression includes generated regressors, its variance-covariance matrix needs to be adjusted.
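One common way to make such an adjustment in practice is to bootstrap the whole two-step procedure, so that the sampling noise of the first stage is carried into the second. The sketch below is an assumption-laden illustration rather than the method of any source cited here: it uses a linear model, a pairs bootstrap, simulated data, and hypothetical helper names (`two_step`, `bootstrap_se`).

```python
import numpy as np

def two_step(y, x, z):
    """Return (beta_x, naive_se): the control function estimate and the
    second-stage OLS standard error that treats v_hat as observed data."""
    n = len(y)
    const = np.ones(n)
    zmat = np.column_stack([const, z])
    v_hat = x - zmat @ np.linalg.lstsq(zmat, x, rcond=None)[0]
    xmat = np.column_stack([const, x, v_hat])
    coef = np.linalg.lstsq(xmat, y, rcond=None)[0]
    resid = y - xmat @ coef
    sigma2 = resid @ resid / (n - xmat.shape[1])
    cov = sigma2 * np.linalg.inv(xmat.T @ xmat)
    return coef[1], np.sqrt(cov[1, 1])

def bootstrap_se(y, x, z, n_boot=500, seed=0):
    """Pairs bootstrap: resample rows and redo BOTH steps on every draw."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        draws[b] = two_step(y[idx], x[idx], z[idx])[0]
    return draws.std(ddof=1)

# Simulated data (hypothetical design) with an endogenous regressor x.
rng = np.random.default_rng(1)
n = 2_000
z = rng.normal(size=n)
v = rng.normal(size=n)
x = 0.7 * z + v
y = 1.5 * x + 0.8 * v + rng.normal(size=n)

beta_hat, naive_se = two_step(y, x, z)
boot_se = bootstrap_se(y, x, z)
print(f"beta: {beta_hat:.3f}, naive SE: {naive_se:.4f}, bootstrap SE: {boot_se:.4f}")
```

In this particular design the unadjusted second-stage standard error is expected to be too small, because it ignores that the residual was estimated; the bootstrap figure should come out noticeably larger. Analytical corrections exist as well and may be preferable when resampling is costly.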
Examples
Endogeneity in Poisson regression
Wooldridge and Terza provide a methodology to both deal with and test for endogeneity within the exponential regression framework, which the following discussion follows closely. While the example focuses on a
Poisson regression model, it is possible to generalize to other exponential regression models, although this may come at the cost of additional assumptions (e.g. for binary response or censored data models).
Assume the following exponential regression model, where $c_i$ is an unobserved term in the latent variable. We allow for correlation between $c_i$ and $x_i$ (implying $x_i$ is possibly endogenous), but allow for no such correlation between $c_i$ and $z_i$.

$$E[y_i \mid x_i, z_i, c_i] = \exp(x_i b_0 + z_i c_0 + c_i)$$

The variables $z_i$ serve as instrumental variables for the potentially endogenous $x_i$. One can assume a linear relationship between these two variables or alternatively project the endogenous variable $x_i$ onto the instruments to get the following reduced form equation:

$$x_i = z_i \Pi + v_i$$

The usual rank condition is needed to ensure identification. The endogeneity is then modeled in the following way, where $\rho$ determines the severity of endogeneity and $v_i$ is assumed to be independent of $e_i$.

$$c_i = v_i \rho + e_i$$

Imposing these assumptions, assuming the models are correctly specified, and normalizing $E[\exp(e_i)] = 1$, we can rewrite the conditional mean as follows:

$$E[y_i \mid x_i, z_i, v_i] = \exp(x_i b_0 + z_i c_0 + v_i \rho)$$

If $v_i$ were known at this point, it would be possible to estimate the relevant parameters by quasi-maximum likelihood estimation (QMLE). Following the two-step procedure strategies, Wooldridge and Terza propose estimating the reduced form equation by ordinary least squares. The fitted residuals from this regression can then be plugged into the rewritten conditional mean in place of $v_i$, and QMLE methods will lead to consistent estimators of the parameters of interest. Significance tests on $\hat{\rho}$ can then be used to test for endogeneity within the model.
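The two-step procedure described above can be sketched with NumPy alone. Everything specific in the snippet is an assumption made for illustration: the simulated data, the split of the exogenous variables into an included regressor `z1` and an excluded instrument `z2`, the parameter values, and the hand-written Newton solver `poisson_qmle`, which is a hypothetical helper and not a library routine.

```python
import numpy as np

def poisson_qmle(design, y, n_iter=100, tol=1e-10):
    """Poisson quasi-MLE (log link) by Newton's method with step halving.
    Assumes column 0 of `design` is a constant."""
    def quasi_loglik(b):
        eta = design @ b
        return y @ eta - np.exp(eta).sum()

    beta = np.zeros(design.shape[1])
    beta[0] = np.log(y.mean())           # intercept-only starting value
    for _ in range(n_iter):
        mu = np.exp(design @ beta)
        score = design.T @ (y - mu)
        hessian = design.T @ (design * mu[:, None])
        step = np.linalg.solve(hessian, score)
        scale = 1.0
        # Halve the step until the quasi-log-likelihood does not decrease.
        while quasi_loglik(beta + scale * step) < quasi_loglik(beta) and scale > 1e-6:
            scale /= 2.0
        beta = beta + scale * step
        if np.max(np.abs(scale * step)) < tol:
            break
    return beta

# Simulated data (hypothetical design).
rng = np.random.default_rng(2)
n = 20_000
const = np.ones(n)
z1 = rng.normal(size=n)                  # exogenous regressor in the outcome model
z2 = rng.normal(size=n)                  # excluded instrument
v = rng.normal(size=n)
x = 0.6 * z2 + v                         # reduced form for the endogenous regressor
sigma_e = 0.3
e = rng.normal(-sigma_e**2 / 2, sigma_e, size=n)   # chosen so that E[exp(e)] = 1
c = 0.4 * v + e                          # rho = 0.4 links the unobservable to v
y = rng.poisson(np.exp(0.2 + 0.5 * x + 0.3 * z1 + c))

# Step 1: OLS of x on all exogenous variables; keep the fitted residuals.
zmat = np.column_stack([const, z1, z2])
v_hat = x - zmat @ np.linalg.lstsq(zmat, x, rcond=None)[0]

# Step 2: Poisson QMLE with the residuals added as an extra regressor.
b_naive = poisson_qmle(np.column_stack([const, x, z1]), y)[1]
fit = poisson_qmle(np.column_stack([const, x, z1, v_hat]), y)
b_cf, rho_hat = fit[1], fit[3]
print(f"naive: {b_naive:.3f}, control function: {b_cf:.3f}, rho_hat: {rho_hat:.3f}")
```

With this design the coefficient on `x` from the Poisson fit that ignores endogeneity should be pushed away from the true value of 0.5, while the fit that includes `v_hat` should recover it approximately. The snippet stops at point estimates; a formal test on the residual coefficient would additionally need standard errors that account for the estimated first stage.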
Extensions
The original Heckit procedure makes distributional assumptions about the error terms; however, more flexible estimation approaches with weaker distributional assumptions have been established. Furthermore, Blundell and Powell show how the control function approach can be particularly helpful in models with nonadditive errors, such as discrete choice models.
[Blundell, R., and J. L. Powell (2003): Endogeneity in Nonparametric and Semiparametric Regression Models. Advances in Economics and Econometrics, Theory and Applications, Eighth World Congress. Volume II, ed. by M. Dewatripont, L.P. Hansen, and S.J. Turnovsky. Cambridge University Press, Cambridge.] This latter approach, however, does implicitly make strong distributional and functional form assumptions.
See also
* Two-stage least squares
* Heckman correction
Further reading
* Wooldridge, Jeffrey M. (2015). "Control Function Methods in Applied Econometrics". Journal of Human Resources. 50 (2): 420–445. doi:10.3368/jhr.50.2.420. S2CID 119604644.