In statistics, projection pursuit regression (PPR) is a statistical model developed by Jerome H. Friedman and Werner Stuetzle which is an extension of additive models. This model adapts the additive models in that it first projects the data matrix of explanatory variables in the optimal direction before applying smoothing functions to these explanatory variables.
Model overview
The model consists of linear combinations of ridge functions: non-linear transformations of linear combinations of the explanatory variables. The basic model takes the form
: <math>\hat{y}_i = \sum_{j=1}^{r} f_j(\beta_j^\mathsf{T} x_i)</math>
where x_i is the vector of explanatory variables for example i (a row of the design matrix), y_i is a 1 × 1 prediction, {β_j} is a collection of r vectors (each a unit vector of length p) which contain the unknown parameters, {f_j} is a collection of r initially unknown smooth functions that map from ℝ → ℝ, and r is a hyperparameter. Good values for r can be determined through cross-validation or a forward stage-wise strategy which stops when the model fit cannot be significantly improved. As r approaches infinity and with an appropriate set of functions {f_j}, the PPR model is a universal estimator, as it can approximate any continuous function in ℝ^p.
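As a concrete illustration, a prediction from a fitted PPR model is just a sum of ridge-function outputs. The following minimal Python/NumPy sketch (all function and variable names here are illustrative, not from the original paper) evaluates the basic model:

<syntaxhighlight lang="python">
import numpy as np

def ppr_predict(X, betas, ridge_functions):
    """Evaluate y_hat[i] = sum_j f_j(beta_j . x_i) for every row x_i of X.

    X               -- (n, p) design matrix
    betas           -- list of r unit vectors, each of length p
    ridge_functions -- list of r callables mapping a 1-D array to a 1-D array
    """
    y_hat = np.zeros(X.shape[0])
    for beta_j, f_j in zip(betas, ridge_functions):
        y_hat += f_j(X @ beta_j)  # project onto beta_j, then apply the ridge function
    return y_hat

# Example with r = 2 hand-chosen ridge terms on synthetic data:
X = np.random.randn(5, 3)
betas = [np.array([1.0, 0.0, 0.0]), np.ones(3) / np.sqrt(3)]
ridge_functions = [np.tanh, np.square]
print(ppr_predict(X, betas, ridge_functions))
</syntaxhighlight>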
Model estimation
For a given set of data {(y_i, x_i)}, i = 1, ..., n, the goal is to minimize the error function

: <math>S = \sum_{i=1}^n \Big[ y_i - \sum_{j=1}^r f_j(\beta_j^\mathsf{T} x_i) \Big]^2</math>
over the functions f_j and vectors β_j. No method exists for solving over all variables at once, but it can be solved via alternating optimization. First, consider each (f_j, β_j) pair individually: let all other parameters be fixed, and find a "residual", the variance of the output not accounted for by those other parameters, given by
: <math>r_i = y_i - \sum_{l \neq j} f_l(\beta_l^\mathsf{T} x_i).</math>
The task of minimizing the error function now reduces to solving

: <math>\min_{f_j,\, \beta_j} \sum_{i=1}^n \big[ r_i - f_j(\beta_j^\mathsf{T} x_i) \big]^2</math>

for each j in turn. Typically new (f_j, β_j) pairs are added to the model in a forward stage-wise fashion.
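The forward stage-wise strategy can be sketched as follows; fit_pair is a hypothetical helper standing in for any solver of the one-term problem above (such as the alternating scheme described below):

<syntaxhighlight lang="python">
import numpy as np

def forward_stagewise(X, y, r, fit_pair):
    """Grow a PPR model one ridge term at a time.

    fit_pair(X, residual) is assumed to return a (beta_j, f_j) pair that
    approximately minimizes sum_i [residual_i - f_j(beta_j . x_i)]^2.
    """
    betas, fs = [], []
    residual = y.astype(float).copy()
    for _ in range(r):
        beta_j, f_j = fit_pair(X, residual)    # fit one new pair to the current residual
        betas.append(beta_j)
        fs.append(f_j)
        residual -= f_j(X @ beta_j)            # remove what the new term explains
    return betas, fs
</syntaxhighlight>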
Aside: Previously-fitted pairs can be readjusted after new fit-pairs are determined by an algorithm known as backfitting, which entails reconsidering a previous pair, recalculating the residual given how other pairs have changed, refitting to account for that new information, and then cycling through all fit-pairs this way until parameters converge. This process typically results in a model that performs better with fewer fit-pairs, though it takes longer to train, and it is usually possible to achieve the same performance by skipping backfitting and simply adding more fits to the model (increasing ''r'').
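A sketch of that backfitting cycle, reusing the hypothetical fit_pair helper from above:

<syntaxhighlight lang="python">
def backfit(X, y, betas, fs, fit_pair, n_cycles=10):
    """Readjust previously fitted pairs until the parameters stabilize."""
    r = len(betas)
    for _ in range(n_cycles):                  # in practice, cycle until convergence
        for j in range(r):
            # Residual left over when every pair except pair j is included.
            partial = sum(fs[l](X @ betas[l]) for l in range(r) if l != j)
            betas[j], fs[j] = fit_pair(X, y - partial)  # refit pair j to that residual
    return betas, fs
</syntaxhighlight>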
Solving the simplified error function to determine an (f_j, β_j) pair can be done with alternating optimization, where first a random β_j is used to project X into 1D space, and then the optimal f_j is found to describe the relationship between that projection and the residuals via your favorite scatter plot regression method. Then, if f_j is held constant and assumed to be once differentiable, the optimal updated weights β_j can be found via the Gauss–Newton method, a quasi-Newton method in which the part of the Hessian involving the second derivative is discarded. To derive this, first Taylor expand

: <math>f_j(\beta^\mathsf{T} x_i) \approx f_j(\beta_\text{old}^\mathsf{T} x_i) + f_j'(\beta_\text{old}^\mathsf{T} x_i)\,(\beta - \beta_\text{old})^\mathsf{T} x_i,</math>

then plug the expansion back into the simplified error function and do some algebraic manipulation to put it in the form
: <math>S \approx \sum_{i=1}^n f_j'(\beta_\text{old}^\mathsf{T} x_i)^2 \left[ \left( \beta_\text{old}^\mathsf{T} x_i + \frac{r_i - f_j(\beta_\text{old}^\mathsf{T} x_i)}{f_j'(\beta_\text{old}^\mathsf{T} x_i)} \right) - \beta^\mathsf{T} x_i \right]^2.</math>
This is a weighted least squares problem. If we solve for all weights w_i = f_j'(β_old^T x_i)^2 and put them in a diagonal matrix W, stack all the new targets b_i = β_old^T x_i + (r_i − f_j(β_old^T x_i)) / f_j'(β_old^T x_i) into a vector b, and use the full data matrix X instead of a single example x_i, then the optimal β is given by the closed-form

: <math>\beta = (X^\mathsf{T} W X)^{-1} X^\mathsf{T} W b.</math>
Use this updated β to find a new projection of X and refit f_j to the new scatter plot. Then use that new f_j to update β by re-solving the above, and continue this alternating process until (f_j, β_j) converges.
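Putting the pieces together, one plausible sketch of this alternating scheme for a single pair is given below; smooth is an assumed helper that fits a differentiable one-dimensional smoother to a scatter plot and returns the fitted function together with its derivative, and the β update is the closed-form weighted least squares solution derived above:

<syntaxhighlight lang="python">
import numpy as np

def fit_pair(X, residual, smooth, n_iter=20, tol=1e-6):
    """Alternate between fitting f_j and updating beta_j for one ridge term.

    smooth(t, residual) is an assumed helper that fits a differentiable 1-D
    smoother to the scatter plot (t, residual) and returns callables (f, f_prime).
    """
    n, p = X.shape
    rng = np.random.default_rng()
    beta = rng.standard_normal(p)
    beta /= np.linalg.norm(beta)                 # random unit starting direction
    f = None
    for _ in range(n_iter):
        t = X @ beta                             # 1-D projection of the data
        f, f_prime = smooth(t, residual)         # fit f to the scatter plot (t, residual)
        g = f_prime(t)                           # assumed nonzero at the sample points
        w = g ** 2                               # weights w_i = f'(beta_old . x_i)^2
        b = t + (residual - f(t)) / g            # working targets b_i
        W = np.diag(w)
        beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ b)  # closed-form WLS solution
        beta_new /= np.linalg.norm(beta_new)     # keep beta a unit vector
        if np.linalg.norm(beta_new - beta) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta, f
</syntaxhighlight>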
It has been shown that the convergence rate, the bias and the variance are affected by the estimation of β_j and f_j.
Discussion
The PPR model takes the form of a basic additive model, but with the additional β_j component: each f_j is fitted to a scatter plot of the projection β_j^T x against the residual (unexplained variance) during training, rather than to the raw inputs themselves. This constrains the problem of finding each f_j to low dimension, making it solvable with common least squares or spline fitting methods and sidestepping the curse of dimensionality during training. Because f_j is taken of a projection of x, the result looks like a "ridge" orthogonal to the projection dimension, so the f_j are often called "ridge functions". The directions β_j are chosen to optimize the fit of their corresponding ridge functions.
Because PPR attempts to fit projections of the data, it can be difficult to interpret the fitted model as a whole: each input variable has been accounted for in a complex and multifaceted way. This can make the model more useful for prediction than for understanding the data, though visualizing the individual ridge functions and considering which projections the model discovers can yield some insight.
Advantages of PPR estimation
*It uses univariate regression functions instead of their multivariate form, thus effectively dealing with the curse of dimensionality.
*Univariate regression allows for simple and efficient estimation.
*Relative to generalized additive models, PPR can estimate a much richer class of functions.
*Unlike local averaging methods (such as k-nearest neighbors), PPR can ignore variables with low explanatory power.
Disadvantages of PPR estimation
*PPR requires searching a p-dimensional parameter space in order to estimate each β_j.
*One must select the smoothing parameter for each f_j.
*The model is often difficult to interpret.
Extensions of PPR
*Alternate smoothers, such as the radial function, harmonic function and additive function, have been suggested and their performances vary depending on the data sets used.
*Alternate optimization criteria have been used as well, such as standard absolute deviations and mean absolute deviations.
*Ordinary least squares can be used to simplify calculations, as the data often does not have strong non-linearities.
*Sliced Inverse Regression (SIR) has been used to choose the direction vectors for PPR.
*Generalized PPR combines regular PPR with iteratively reweighted least squares (IRLS) and a link function to estimate binary data.
PPR vs neural networks (NN)
Both projection pursuit regression and neural network models project the input vector onto a one-dimensional hyperplane and then apply a nonlinear transformation of the input variables, the results of which are then added in a linear fashion. Thus both follow the same steps to overcome the curse of dimensionality. The main difference is that the functions f_j fitted in PPR can be different for each combination of input variables and are estimated one at a time and then updated with the weights, whereas in NN these are all specified upfront and estimated simultaneously. Thus, PPR estimation is more straightforward than NN, and the transformations of the variables in PPR are data driven, whereas in NN these transformations are fixed.
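To make the contrast concrete, here is an illustrative sketch of the two functional forms side by side (the choice of tanh for the NN activation, and all names, are assumptions for the example):

<syntaxhighlight lang="python">
import numpy as np

def nn_one_hidden_layer(X, weights, a, sigma=np.tanh):
    # NN: the nonlinearity sigma is fixed upfront; only the weights and the
    # output coefficients a are estimated, all simultaneously.
    return sum(a_j * sigma(X @ w_j) for a_j, w_j in zip(a, weights))

def ppr(X, betas, fs):
    # PPR: each ridge function f_j is itself estimated from the data,
    # one term at a time.
    return sum(f_j(X @ beta_j) for f_j, beta_j in zip(fs, betas))
</syntaxhighlight>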
See also
*Projection pursuit
References
*Friedman, J.H. and Stuetzle, W. (1981) Projection Pursuit Regression. Journal of the American Statistical Association, 76, 817–823.
*Hand, D., Mannila, H. and Smyth, P. (2001) Principles of Data Mining. MIT Press.
*Hall, P. (1988) Estimating the direction in which a data set is the most interesting. Probab. Theory Related Fields, 80, 51–77.
*Hastie, T. J., Tibshirani, R. J. and Friedman, J.H. (2009) The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer. ISBN 978-0-387-84857-0.
*Klinke, S. and Grassmann, J. (2000) ‘Projection Pursuit Regression’ in Smoothing and Regression: Approaches, Computation and Application, ed. Schimek, M.G. Wiley Interscience.
*Lingjarde, O. C. and Liestol, K. (1998) Generalized Projection Pursuit Regression. SIAM Journal of Scientific Computing, 20, 844–857.