statistics Statistics (from German language, German: ', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a s ...

, the Huber loss is a

loss function In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost ...

used in

robust regression In robust statistics, robust regression seeks to overcome some limitations of traditional regression analysis. A regression analysis models the relationship between one or more independent variables and a dependent variable. Standard types of re ...

, that is less sensitive to

outlier In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to a variability in the measurement, an indication of novel data, or it may be the result of experimental error; the latter are ...

s in data than the squared error loss. A variant for classification is also sometimes used.

Definition

The Huber loss function describes the penalty incurred by an estimation procedure . Huber (1964) defines the loss function piecewise by

\delta \cdot \left(, a, - \frac\delta\right), & \text \end

This function is quadratic for small values of , and linear for large values, with equal values and slopes of the different sections at the two points where

, a,  = \delta

. The variable often refers to the residuals, that is to the difference between the observed and predicted values

a = y - f(x)

, so the former can be expanded to

\delta\ \cdot \left(\left, y - f(x)\ - \frac\delta\right), & \text \end

The Huber loss is the

convolution In mathematics (in particular, functional analysis), convolution is a operation (mathematics), mathematical operation on two function (mathematics), functions f and g that produces a third function f*g, as the integral of the product of the two ...

of the

absolute value In mathematics, the absolute value or modulus of a real number x, is the non-negative value without regard to its sign. Namely, , x, =x if x is a positive number, and , x, =-x if x is negative (in which case negating x makes -x positive), ...

function with the

rectangular function The rectangular function (also known as the rectangle function, rect function, Pi function, Heaviside Pi function, gate function, unit pulse, or the normalized boxcar function) is defined as \operatorname\left(\frac\right) = \Pi\left(\frac\ri ...

, scaled and translated. Thus it "smoothens out" the former's corner at the origin. Comparison of loss functions

Motivation

Two very commonly used loss functions are the squared loss,

L(a) = a^2

, and the absolute loss,

L(a)=, a,

. The squared loss function results in an

arithmetic mean In mathematics and statistics, the arithmetic mean ( ), arithmetic average, or just the ''mean'' or ''average'' is the sum of a collection of numbers divided by the count of numbers in the collection. The collection is often a set of results fr ...

unbiased estimator In statistics, the bias of an estimator (or bias function) is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called ''unbiased''. In stat ...

, and the absolute-value loss function results in a

median The median of a set of numbers is the value separating the higher half from the lower half of a Sample (statistics), data sample, a statistical population, population, or a probability distribution. For a data set, it may be thought of as the “ ...

-unbiased estimator (in the one-dimensional case, and a

geometric median In geometry, the geometric median of a discrete point set in a Euclidean space is the point minimizing the sum of distances to the sample points. This generalizes the median, which has the property of minimizing the sum of distances or absolute ...

-unbiased estimator for the multi-dimensional case). The squared loss has the disadvantage that it has the tendency to be dominated by outliers—when summing over a set of

a

's (as in

\sum_^n L(a_i)

), the sample mean is influenced too much by a few particularly large

a

-values when the distribution is heavy tailed: in terms of

estimation theory Estimation theory is a branch of statistics that deals with estimating the values of Statistical parameter, parameters based on measured empirical data that has a random component. The parameters describe an underlying physical setting in such ...

, the asymptotic relative efficiency of the mean is poor for heavy-tailed distributions. As defined above, the Huber loss function is strongly convex in a uniform neighborhood of its minimum

a=0

; at the boundary of this uniform neighborhood, the Huber loss function has a differentiable extension to an affine function at points

a=-\delta

and

a = \delta

. These properties allow it to combine much of the sensitivity of the mean-unbiased, minimum-variance estimator of the mean (using the quadratic loss function) and the robustness of the median-unbiased estimator (using the absolute value function).

Pseudo-Huber loss function

The Pseudo-Huber loss function can be used as a smooth approximation of the Huber loss function. It combines the best properties of L2 squared loss and L1 absolute loss by being strongly convex when close to the target/minimum and less steep for extreme values. The scale at which the Pseudo-Huber loss function transitions from L2 loss for values close to the minimum to L1 loss for extreme values and the steepness at extreme values can be controlled by the

\delta

value. The Pseudo-Huber loss function ensures that derivatives are continuous for all degrees. It is defined as

L_\delta (a) = \delta^2\left(\sqrt-1\right).

As such, this function approximates

a^2/2

for small values of

a

, and approximates a straight line with slope

\delta

for large values of

a

. While the above is the most common form, other smooth approximations of the Huber loss function also exist.

Variant for classification

For

classification Classification is the activity of assigning objects to some pre-existing classes or categories. This is distinct from the task of establishing the classes themselves (for example through cluster analysis). Examples include diagnostic tests, identif ...

purposes, a variant of the Huber loss called ''modified Huber'' is sometimes used. Given a prediction

f(x)

(a real-valued classifier score) and a true

binary Binary may refer to: Science and technology Mathematics * Binary number, a representation of numbers using only two values (0 and 1) for each digit * Binary function, a function that takes two arguments * Binary operation, a mathematical op ...

class label

y \in \

, the modified Huber loss is defined as

-4y \, f(x) & \text \end

The term

\max(0, 1 - y \, f(x))

is the

hinge loss In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs). For an intended output and a classifier score , ...

used by

support vector machine In machine learning, support vector machines (SVMs, also support vector networks) are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laborato ...

s; the quadratically smoothed hinge loss is a generalization of

L

Applications

The Huber loss function is used in

robust statistics Robust statistics are statistics that maintain their properties even if the underlying distributional assumptions are incorrect. Robust Statistics, statistical methods have been developed for many common problems, such as estimating location parame ...

, M-estimation and

additive model In statistics, an additive model (AM) is a nonparametric regression method. It was suggested by Jerome H. Friedman and Werner Stuetzle (1981) and is an essential part of the ACE algorithm. The ''AM'' uses a one-dimensional smoother to build a ...

ling.

References