History
Local regression and closely related procedures have a long and rich history, having been discovered and rediscovered in different fields on multiple occasions. An early work by Robert Henderson studying the problem of graduation (a term for smoothing used in the actuarial literature) introduced local regression using cubic polynomials. Specifically, let <math>u_x</math> denote an ungraduated sequence of observations. Following Henderson, suppose that only the terms from <math>u_{x-m}</math> to <math>u_{x+m}</math> are to be taken into account when computing the graduated value of <math>u_x</math>, and that <math>w_t</math> is the weight to be assigned to <math>u_{x+t}</math>. Henderson then uses a local polynomial approximation <math>a_0 + a_1 t + a_2 t^2 + a_3 t^3</math>, and sets up the following four equations for the coefficients:
:<math>\sum_{t=-m}^{m} w_t\, t^k \bigl( u_{x+t} - a_0 - a_1 t - a_2 t^2 - a_3 t^3 \bigr) = 0, \qquad k = 0, 1, 2, 3.</math>
Solving these equations for the polynomial coefficients yields the graduated value, <math>\hat{u}_x = a_0</math>.

Henderson went further. In preceding years, many 'summation formula' methods of graduation had been developed, which derived graduation rules based on summation formulae (convolution of the series of observations with a chosen set of weights). Two such rules are the 15-point and 21-point rules of Spencer (1904). These graduation rules were carefully designed to have a quadratic-reproducing property: if the ungraduated values exactly follow a quadratic formula, then the graduated values equal the ungraduated values. This is an important property: a simple moving average, by contrast, cannot adequately model peaks and troughs in the data. Henderson's insight was to show that ''any'' such graduation rule can be represented as a local cubic (or quadratic) fit for an appropriate choice of weights. Further discussions of the historical work on graduation and local polynomial fitting can be found in Macaulay (1931).

Model definition
Local regression uses a weighted least squares criterion to fit a low-degree polynomial in a neighbourhood of each point of estimation. Suppose the data consist of observations <math>(x_1, Y_1), \ldots, (x_n, Y_n)</math> following the model <math>Y_i = \mu(x_i) + \varepsilon_i</math>, where <math>\mu</math> is the unknown mean function and the <math>\varepsilon_i</math> are random errors. To estimate <math>\mu</math> at a fitting point <math>x</math>, a low-degree polynomial is fitted by weighted least squares, with each observation receiving a smoothing weight <math>w_i(x)</math> that decreases as the distance of <math>x_i</math> from <math>x</math> increases. That is, the local regression coefficients minimize
:<math>\sum_{i=1}^{n} w_i(x) \bigl( Y_i - \beta_0 - \beta_1 (x_i - x) - \cdots - \beta_p (x_i - x)^p \bigr)^2,</math>
and the local regression estimate of the mean function is <math>\hat{\mu}(x) = \hat{\beta}_0</math>, the fitted polynomial evaluated at the fitting point.

Matrix representation of the local regression estimate
As with all least squares estimates, the estimated regression coefficients can be expressed in closed form (see Weighted least squares for details):
:<math>\hat{\beta}(x) = \left( X^\mathsf{T} W X \right)^{-1} X^\mathsf{T} W Y,</math>
where <math>\hat{\beta}(x)</math> is a vector of the local regression coefficients; <math>X</math> is the <math>n \times (p+1)</math> design matrix with entries <math>(x_i - x)^j</math>; <math>W</math> is a diagonal matrix of the smoothing weights <math>w_i(x)</math>; and <math>Y</math> is a vector of the responses <math>Y_i</math>. This matrix representation is crucial for studying the theoretical properties of local regression estimates. With appropriate definitions of the design and weight matrices, it immediately generalizes to the multiple-predictor setting.
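The closed-form expression above translates directly into code. The following is a minimal sketch in Python/NumPy; the function name, the tricube weight choice (the traditional LOESS weight), and the default bandwidth are illustrative assumptions rather than part of any particular library.

<syntaxhighlight lang="python">
import numpy as np

def local_regression(x0, x, y, h=1.0, degree=2):
    """Estimate the mean function at x0 by locally weighted polynomial fitting.

    Implements the closed form  beta = (X^T W X)^{-1} X^T W Y  with a tricube
    weight.  All names and defaults here are illustrative.
    """
    # Smoothing weights: tricube kernel on the scaled distance |x_i - x0| / h.
    u = np.abs(x - x0) / h
    w = np.where(u < 1, (1 - u**3)**3, 0.0)

    # Design matrix with columns 1, (x_i - x0), (x_i - x0)^2, ...
    X = np.vander(x - x0, N=degree + 1, increasing=True)
    W = np.diag(w)

    # Weighted least squares coefficients.
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

    # The fitted polynomial evaluated at x0 is the intercept term.
    return beta[0]

# Example: smooth noisy observations of a sine curve at a grid of points.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
fit = np.array([local_regression(x0, x, y, h=1.5, degree=2) for x0 in x])
</syntaxhighlight>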
Selection issues: bandwidth, local model, fitting criteria
Implementation of local regression requires specification and selection of several components:
# The bandwidth, and more generally the localized subsets of the data.
# The degree of local polynomial, or more generally, the form of the local model.
# The choice of weight function <math>w_i(x)</math>.
# The choice of fitting criterion (least squares or something else).
Each of these components has been the subject of extensive study; a summary is provided below.

Localized subsets of data; Bandwidth
The bandwidth ''h'' controls the resolution of the local regression estimate. If ''h'' is too small, the estimate may show high-resolution features that represent noise in the data, rather than any real structure in the mean function. Conversely, if ''h'' is too large, the estimate will only show low-resolution features, and important structure may be lost. This is the ''bias-variance tradeoff'': if ''h'' is too small, the estimate exhibits large variance, while at large ''h'' the estimate exhibits large bias. Careful choice of bandwidth is therefore crucial when applying local regression.

Mathematical methods for bandwidth selection require, firstly, formal criteria to assess the performance of an estimate. One such criterion is prediction error: if a new observation is made at <math>x</math>, how well does the estimate <math>\hat{\mu}(x)</math> predict the new response <math>Y_{\text{new}}</math>? Performance is often assessed using a squared-error loss function. The mean squared prediction error is
:<math>\operatorname{E}\bigl[(Y_{\text{new}} - \hat{\mu}(x))^2\bigr] = \sigma^2 + \operatorname{E}\bigl[(\hat{\mu}(x) - \mu(x))^2\bigr].</math>
The first term, <math>\sigma^2</math>, is the random variation of the observation; this is entirely independent of the local regression estimate. The second term, <math>\operatorname{E}\bigl[(\hat{\mu}(x) - \mu(x))^2\bigr]</math>, is the mean squared estimation error. This relation shows that, for squared error loss, minimizing prediction error and estimation error are equivalent problems. In global bandwidth selection, these measures can be integrated over the <math>x</math> space ("mean integrated squared error", often used in theoretical work), or averaged over the observed <math>x_i</math> (more useful for practical implementations).

Some standard techniques from model selection can be readily adapted to local regression:
# Cross-validation, which estimates the mean squared prediction error.
# Mallows's Cp and Akaike's information criterion, which estimate the mean squared estimation error.
# Other methods which attempt to estimate the bias and variance components of the estimation error directly.
Any of these criteria can be minimized to produce an automatic bandwidth selector. Cleveland and Devlin prefer a graphical method (the ''M''-plot) to visually display the bias-variance trade-off and guide bandwidth choice.

One question not addressed above is how the bandwidth should depend upon the fitting point <math>x</math>. Often a constant bandwidth is used, while LOWESS and LOESS prefer a nearest-neighbor bandwidth, meaning ''h'' is smaller in regions with many data points. Formally, the smoothing parameter, <math>\alpha</math>, is the fraction of the total number ''n'' of data points that are used in each local fit. The subset of data used in each weighted least squares fit thus comprises the <math>n\alpha</math> points (rounded to the next largest integer) whose explanatory variables' values are closest to the point at which the response is being estimated (NIST).
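As an illustration of criterion-based selection, the sketch below computes a leave-one-out cross-validation score over a grid of candidate bandwidths, reusing the illustrative local_regression function and simulated data from the earlier sketch; the grid of bandwidths and all names are assumptions made for the example.

<syntaxhighlight lang="python">
import numpy as np

def loocv_score(h, x, y, degree=2):
    """Leave-one-out cross-validation estimate of prediction error for bandwidth h."""
    errors = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i          # drop the i-th observation
        try:
            pred = local_regression(x[i], x[mask], y[mask], h=h, degree=degree)
        except np.linalg.LinAlgError:
            return np.inf                      # too few points fall inside the window
        errors.append((y[i] - pred) ** 2)
    return np.mean(errors)

# Choose the bandwidth minimizing the cross-validation score.
candidate_h = np.linspace(0.5, 3.0, 11)        # illustrative grid of bandwidths
scores = [loocv_score(h, x, y) for h in candidate_h]
h_best = candidate_h[int(np.argmin(scores))]
</syntaxhighlight>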
Degree of local polynomials
Most sources, in both theoretical and computational work, use low-order polynomials as the local model, with polynomial degree ranging from 0 to 3. The degree 0 (local constant) model is equivalent to a kernel smoother, usually credited to Elizbar Nadaraya (1964) and G. S. Watson (1964). This is the simplest model to use, but can suffer from bias when fitting near boundaries of the dataset. Local linear (degree 1) fitting can substantially reduce this boundary bias. Local quadratic (degree 2) and local cubic (degree 3) fits can result in improved estimates, particularly when the underlying mean function has substantial curvature, or equivalently a large second derivative. In theory, higher orders of polynomial can lead to faster convergence of the estimate <math>\hat{\mu}(x)</math> to the true mean <math>\mu(x)</math>, ''provided that <math>\mu</math> has a sufficient number of derivatives''; see C. J. Stone (1980). Generally, it takes a large sample size for this faster convergence to be realized. There are also computational and stability issues that arise, particularly for multivariate smoothing, so it is generally not recommended to use local polynomials with degree greater than 3. As with bandwidth selection, methods such as cross-validation can be used to compare the fits obtained with different degrees of polynomial.
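The boundary behaviour is easy to see numerically. The following sketch, again reusing the illustrative local_regression function defined above, compares a local constant and a local linear fit at the left edge of a sample whose true mean is a straight line; the data, bandwidth, and names are assumptions made for the example.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 100)
y = 2.0 * x + rng.normal(scale=0.05, size=x.size)     # true mean is 2x, so mu(0) = 0

# Estimate the mean at the left boundary x = 0 with the same bandwidth.
deg0 = local_regression(0.0, x, y, h=0.3, degree=0)   # local constant (kernel smoother)
deg1 = local_regression(0.0, x, y, h=0.3, degree=1)   # local linear

print(f"true mean at 0: 0.0, local constant: {deg0:.3f}, local linear: {deg1:.3f}")
# The local constant fit returns a weighted one-sided average of y over [0, 0.3],
# noticeably above 0, while the local linear fit is close to the true boundary value 0.
</syntaxhighlight>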
Weight function
As mentioned above, the weight function <math>w_i(x)</math> gives the most weight to the data points nearest the point of estimation and the least weight to the data points that are furthest away. The use of the weights is based on the idea that points near each other in the explanatory variable space are more likely to be related to each other in a simple way than points that are further apart. Following this logic, points that are likely to follow the local model best influence the local model parameter estimates the most, while points that are less likely to actually conform to the local model have less influence on the parameter estimates.
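Although many weight functions can be used, the traditional choice in LOWESS and LOESS is the tricube weight function, stated here for concreteness (the scaled-distance notation <math>d</math> is introduced only for this display):
:<math>w(d) = \begin{cases} \bigl(1 - |d|^3\bigr)^3 & |d| < 1, \\ 0 & |d| \ge 1, \end{cases}</math>
where <math>d</math> is the distance of a data point from the point of estimation, divided by the bandwidth.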
Choice of fitting criterion
As described above, local regression uses a locally weighted least squares criterion to estimate the regression parameters. This inherits many of the advantages (ease of implementation and interpretation; good properties when errors are normally distributed) and disadvantages (sensitivity to extreme values and outliers; inefficiency when errors have unequal variance or are not normally distributed) usually associated with least squares regression. These disadvantages can be addressed by replacing the local least-squares estimation by something else. Two such ideas are presented here: local likelihood estimation, which applies local estimation to a likelihood criterion rather than to least squares, and robust local regression, which reduces the influence of outlying observations.

Local likelihood estimation
In local likelihood estimation, developed in Tibshirani and Hastie (1987), the observations <math>Y_i</math> are assumed to come from a parametric family of distributions, with a known probability density function (or mass function, for discrete data) <math>f(y, \theta(x_i))</math>, where the parameter function <math>\theta(x)</math> is the unknown quantity to be estimated. To estimate <math>\theta</math> at a particular point <math>x</math>, the parameter function is approximated locally by a polynomial, and the local likelihood criterion is
:<math>\sum_{i=1}^{n} w_i(x) \log f\bigl(Y_i,\; a_0 + a_1 (x_i - x) + \cdots + a_p (x_i - x)^p\bigr).</math>
Estimates of the regression coefficients (in particular, <math>\hat{a}_0</math>) are obtained by maximizing the local likelihood criterion, and the local likelihood estimate is
:<math>\hat{\theta}(x) = \hat{a}_0.</math>
When <math>f</math> is the normal density and <math>\theta</math> is the mean function, the local likelihood method reduces to the standard local least-squares regression. For other likelihood families, there is (usually) no closed-form solution for the local likelihood estimate, and iterative procedures such as iteratively reweighted least squares must be used to compute the estimate.

''Example'' (local logistic regression). All response observations <math>Y_i</math> are 0 or 1, and the mean function is the "success" probability, <math>p(x)</math>. Since <math>p(x)</math> must be between 0 and 1, a local polynomial model should not be used for <math>p(x)</math> directly. Instead, the logistic transformation
:<math>\theta(x) = \log \frac{p(x)}{1 - p(x)}</math>
can be used; equivalently,
:<math>p(x) = \frac{e^{\theta(x)}}{1 + e^{\theta(x)}},</math>
and the mass function is
:<math>f(y, \theta(x)) = p(x)^{y} \bigl(1 - p(x)\bigr)^{1 - y}.</math>

An asymptotic theory for local likelihood estimation is developed in J. Fan, Nancy E. Heckman and M. P. Wand (1995); the book Loader (1999) discusses many more applications of local likelihood.
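To make the example concrete, here is a minimal sketch of local logistic regression in Python, maximizing the local likelihood numerically with scipy.optimize.minimize. The function names, the local-linear model for the log-odds, the tricube weights, and the simulated data are illustrative assumptions, not a reference implementation.

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize

def local_logistic(x0, x, y, h=1.0):
    """Local likelihood estimate of the success probability p(x0).

    Fits a local linear model for the log-odds theta(x) by maximizing the
    tricube-weighted Bernoulli log-likelihood; illustrative sketch only.
    """
    u = np.abs(x - x0) / h
    w = np.where(u < 1, (1 - u**3)**3, 0.0)
    t = x - x0                                     # centred predictor

    def neg_local_loglik(a):
        theta = a[0] + a[1] * t                    # local linear log-odds
        # log f(y, theta) = y*theta - log(1 + exp(theta)) for Bernoulli data
        return -np.sum(w * (y * theta - np.logaddexp(0.0, theta)))

    res = minimize(neg_local_loglik, np.zeros(2), method="BFGS")
    a0 = res.x[0]                                  # estimated log-odds at x0
    return 1.0 / (1.0 + np.exp(-a0))               # back-transform to a probability

# Example: binary responses whose success probability increases with x.
rng = np.random.default_rng(2)
x = rng.uniform(0, 4, size=300)
p_true = 1.0 / (1.0 + np.exp(-(x - 2.0)))
y = rng.binomial(1, p_true)
p_hat = local_logistic(2.0, x, y, h=1.0)           # should be near 0.5
</syntaxhighlight>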
Robust local regression
To address the sensitivity to outliers, techniques from robust regression can be employed. In local M-estimation, the local least-squares criterion is replaced by a criterion of the form
:<math>\sum_{i=1}^{n} w_i(x)\, \rho\!\left( \frac{Y_i - a_0 - a_1 (x_i - x) - \cdots - a_p (x_i - x)^p}{s} \right),</math>
where <math>\rho</math> is a robustness function and <math>s</math> is a scale parameter. Discussion of the merits of different choices of robustness function is best left to the robust regression literature. The scale parameter <math>s</math> must also be estimated. References for local M-estimation include Katkovnik (1985) and Alexandre Tsybakov (1986). The robustness iterations in LOWESS and LOESS correspond to the bisquare (Tukey biweight) robustness function and a robust global estimate of the scale parameter.

If <math>\rho(u) = |u|</math>, the local <math>L_1</math> criterion results; this does not require a scale parameter. For a local constant model, this criterion is minimized by a locally weighted median, and local regression can then be interpreted as estimating the ''median'', rather than ''mean'', response. If the loss function is skewed, this becomes local quantile regression. See Keming Yu and M. C. Jones (1998).
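In practice, robust local regression is available in standard software; for example, the lowess function in the Python statsmodels package implements Cleveland's LOWESS with robustness iterations. A brief illustration follows; the simulated data and parameter values are arbitrary choices for the example.

<syntaxhighlight lang="python">
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
y[::25] += 5.0                       # inject a few large outliers

# frac is the nearest-neighbour smoothing fraction; it=0 disables the
# robustness iterations, while it=3 (the default) downweights outliers.
fit_plain  = sm.nonparametric.lowess(y, x, frac=0.3, it=0)
fit_robust = sm.nonparametric.lowess(y, x, frac=0.3, it=3)

# Each result is an array of (x, fitted value) pairs sorted by x.
print(fit_robust[:5])
</syntaxhighlight>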
Advantages
As discussed above, the biggest advantage LOESS has over many other methods is that the process of fitting a model to the sample data does not begin with the specification of a function. Instead the analyst only has to provide a smoothing parameter value and the degree of the local polynomial. In addition, LOESS is very flexible, making it ideal for modeling complex processes for which no theoretical models exist. These two advantages, combined with the simplicity of the method, make LOESS one of the most attractive of the modern regression methods for applications that fit the general framework of least squares regression but which have a complex deterministic structure.

Although it is less obvious than for some of the other methods related to linear least squares regression, LOESS also accrues most of the benefits typically shared by those procedures. The most important of these is the theory for computing uncertainties for prediction and calibration. Many other tests and procedures used for validation of least squares models can also be extended to LOESS models.

Disadvantages
LOESS makes less efficient use of data than other least squares methods. It requires fairly large, densely sampled data sets in order to produce good models. This is because LOESS relies on the local data structure when performing the local fitting. Thus, LOESS provides less complex data analysis in exchange for greater experimental costs.

Another disadvantage of LOESS is the fact that it does not produce a regression function that is easily represented by a mathematical formula. This can make it difficult to transfer the results of an analysis to other people. In order to transfer the regression function to another person, they would need the data set and software for LOESS calculations. In nonlinear regression, on the other hand, it is only necessary to write down a functional form in order to provide estimates of the unknown parameters and the estimated uncertainty. Depending on the application, this could be either a major or a minor drawback to using LOESS. In particular, the simple form of LOESS cannot be used for mechanistic modelling where fitted parameters specify particular physical properties of a system.

Finally, as discussed above, LOESS is a computationally intensive method (with the exception of evenly spaced data, where the regression can then be phrased as a non-causal finite impulse response filter, as illustrated below). LOESS is also prone to the effects of outliers in the data set, like other least squares methods. There is an iterative, robust version of LOESS, Cleveland (1979), that can be used to reduce the sensitivity of LOESS to outliers, but too many extreme outliers can still overcome even the robust method.
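To illustrate the remark about evenly spaced data, the fixed set of filter coefficients implied by a local fit can be computed once and then applied by convolution. The sketch below derives such coefficients from the weighted least squares closed form; the window length, degree, and tricube weights are illustrative assumptions, and the construction mirrors that of Savitzky–Golay and classical graduation filters.

<syntaxhighlight lang="python">
import numpy as np

def local_fit_filter(half_width, degree=2):
    """Filter coefficients equivalent to a local polynomial fit on an evenly spaced grid.

    The fitted value at the window centre is a fixed linear combination of the
    observations, e_0^T (X^T W X)^{-1} X^T W, so smoothing reduces to convolution.
    """
    t = np.arange(-half_width, half_width + 1, dtype=float)
    u = np.abs(t) / (half_width + 1)
    w = (1 - u**3) ** 3                       # tricube weights (illustrative choice)
    X = np.vander(t, N=degree + 1, increasing=True)
    W = np.diag(w)
    coef = np.linalg.solve(X.T @ W @ X, X.T @ W)
    return coef[0]                            # row giving the fitted value at t = 0

kernel = local_fit_filter(half_width=7, degree=2)

# Applying the filter to evenly spaced data is a single convolution.
rng = np.random.default_rng(4)
y = np.sin(np.linspace(0, 10, 200)) + rng.normal(scale=0.3, size=200)
smooth = np.convolve(y, kernel, mode="same")

# The coefficients sum to 1 and reproduce quadratics exactly, like the
# quadratic-reproducing graduation rules discussed in the History section.
print(kernel.sum())
</syntaxhighlight>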
Further reading
Books substantially covering local regression and extensions:
* Macaulay (1931), "The Smoothing of Time Series", discusses graduation methods with several chapters related to local polynomial fitting.
* Katkovnik (1985), "Nonparametric Identification and Smoothing of Data" (in Russian).
* Fan and Gijbels (1996), "Local Polynomial Modelling and Its Applications".
* Loader (1999), "Local Regression and Likelihood".
* Fotheringham, Brunsdon and Charlton (2002), "Geographically Weighted Regression" (a development of local regression for spatial data).
Book chapters and reviews:
* "Smoothing by Local Regression: Principles and Methods"
* "Local Regression and Likelihood", Chapter 13 of ''Observed Brain Dynamics'', Mitra and Bokil (2007)
* Rafael Irizarry, "Local Regression", Chapter 3 of "Applied Nonparametric and Modern Statistics".

See also
* Degrees of freedom (statistics)#In non-standard regression
* Kernel regression
* Moving least squares
* Moving average
* Multivariate adaptive regression splines
* Non-parametric statistics
* Savitzky–Golay filter
* Segmented regression