In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its opposite (in specific domains, variously called a reward function, a profit function, a utility function, a fitness function, etc.), in which case it is to be maximized. The loss function could include terms from several levels of the hierarchy.

In statistics, typically a loss function is used for parameter estimation, and the event in question is some function of the difference between estimated and true values for an instance of data. The concept, as old as Laplace, was reintroduced in statistics by Abraham Wald in the middle of the 20th century. In the context of economics, for example, this is usually economic cost or regret. In classification, it is the penalty for an incorrect classification of an example. In actuarial science, it is used in an insurance context to model benefits paid over premiums, particularly since the works of Harald Cramér in the 1920s. In optimal control, the loss is the penalty for failing to achieve a desired value. In financial risk management, the function is mapped to a monetary loss.

Figure: Comparison of loss functions

Examples

Regret

Leonard J. Savage argued that when using non-Bayesian methods such as minimax, the loss function should be based on the idea of ''regret'', i.e., the loss associated with a decision should be the difference between the consequences of the best decision that could have been made had the underlying circumstances been known and the decision that was in fact taken before they were known.
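The definition of regret can be made concrete with a small loss table. The following Python sketch uses hypothetical decisions, states, and loss values (none of them come from the text). For each state it subtracts the smallest achievable loss from each decision's loss, then picks the decision whose worst-case regret is smallest.

```python
# Hypothetical loss table: losses[decision][state]; all numbers are illustrative.
losses = {
    "umbrella":    {"rain": 1.0, "sun": 2.0},
    "no_umbrella": {"rain": 8.0, "sun": 0.0},
}
states = ["rain", "sun"]

# Loss of the best decision for each state, i.e. what hindsight would have chosen.
best_loss = {s: min(row[s] for row in losses.values()) for s in states}

# Regret = loss actually incurred minus the loss of the best decision in that state.
regret = {
    d: {s: row[s] - best_loss[s] for s in states}
    for d, row in losses.items()
}

# Minimax regret: choose the decision with the smallest worst-case regret.
minimax_regret_choice = min(regret, key=lambda d: max(regret[d].values()))

print(regret)
print(minimax_regret_choice)
```

With these assumed numbers, carrying the umbrella has a worst-case regret of 2, against 7 for leaving it at home, so the minimax-regret choice is the umbrella.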

Quadratic loss function

The use of a quadratic loss function is common, for example when using least squares techniques. It is often more mathematically tractable than other loss functions because of the properties of variances, as well as being symmetric: an error above the target causes the same loss as the same magnitude of error below the target. If the target is ''t'', then a quadratic loss function is

$$\lambda(x) = C (t-x)^2$$

for some positive constant ''C''; the value of the constant makes no difference to a decision, and can be ignored by setting it equal to 1. This is also known as the squared error loss (SEL).

Many common statistics, including t-tests, regression models, design of experiments, and much else, use least squares methods applied using linear regression theory, which is based on the quadratic loss function.

The quadratic loss function is also used in linear-quadratic optimal control problems. In these problems, even in the absence of uncertainty, it may not be possible to achieve the desired values of all target variables. Often loss is expressed as a quadratic form in the deviations of the variables of interest from their desired values; this approach is tractable because it results in linear first-order conditions. In the context of stochastic control, the expected value of the quadratic form is used.

Because errors are squared, the quadratic loss gives outliers disproportionate weight relative to typical data points, so alternatives such as the Huber, Log-Cosh and SMAE losses are used when the data has many large outliers.

Figure: Fitting a straight line to data with outliers
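Two properties of the quadratic loss described in this section, its symmetry about the target and the irrelevance of the scale constant to the decision, can be checked numerically. This is a minimal sketch; the function name `quadratic_loss`, the target value, and the candidate list are assumptions made for illustration, and the constant is assumed to be positive.

```python
def quadratic_loss(x, t, C=1.0):
    """Quadratic loss C * (t - x)**2 for outcome x and target t."""
    return C * (t - x) ** 2

t = 3.0
candidates = [0.0, 1.5, 2.5, 4.0, 6.0]  # hypothetical candidate outcomes

# The minimizing candidate is the same for any positive C.
best_c1 = min(candidates, key=lambda x: quadratic_loss(x, t, C=1.0))
best_c10 = min(candidates, key=lambda x: quadratic_loss(x, t, C=10.0))

# Symmetry: equal-magnitude errors above and below the target cost the same.
above = quadratic_loss(t + 0.5, t)
below = quadratic_loss(t - 0.5, t)

print(best_c1, best_c10)  # 2.5 2.5
print(above, below)       # 0.25 0.25
```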

0-1 loss function

In statistics and decision theory, a frequently used loss function is the ''0-1 loss function''

$$L(\hat{y}, y) = \left[ \hat{y} \ne y \right]$$

using Iverson bracket notation, i.e. it evaluates to 1 when $\hat{y} \ne y$, and 0 otherwise.
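As a small illustration, the 0-1 loss can be written directly from the Iverson bracket. The labels below are hypothetical. Averaging the per-example loss gives the misclassification rate.

```python
def zero_one_loss(y_hat, y):
    """0-1 loss: 1 if the prediction differs from the true label, else 0."""
    return int(y_hat != y)

# Hypothetical predicted and true labels.
predicted = ["cat", "dog", "dog", "bird"]
actual = ["cat", "dog", "cat", "bird"]

per_example = [zero_one_loss(p, a) for p, a in zip(predicted, actual)]
error_rate = sum(per_example) / len(per_example)  # mean 0-1 loss

print(per_example)  # [0, 0, 1, 0]
print(error_rate)   # 0.25
```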

Constructing loss and objective functions

In many applications, objective functions, including loss functions as a particular case, are determined by the problem formulation. In other situations, the decision maker's preference must be elicited and represented by a scalar-valued function (also called a utility function) in a form suitable for optimization, a problem that Ragnar Frisch highlighted in his Nobel Prize lecture. The existing methods for constructing objective functions are collected in the proceedings of two dedicated conferences. In particular, Andranik Tangian showed that the most usable objective functions, quadratic and additive, are determined by a few indifference points. He used this property in the models for constructing these objective functions from either ordinal or cardinal data that were elicited through computer-assisted interviews with decision makers. Among other things, he constructed objective functions to optimally distribute budgets for 16 Westphalian universities and the European subsidies for equalizing unemployment rates among 271 German regions.

Expected loss

In some contexts, the value of the loss function itself is a random quantity because it depends on the outcome of a random variable ''X''.

Statistics

Both frequentist and Bayesian statistical theory involve making a decision based on the expected value of the loss function; however, this quantity is defined differently under the two paradigms.

Frequentist expected loss

We first define the expected loss in the frequentist context. It is obtained by taking the expected value with respect to the probability distribution, $P_\theta$, of the observed data, ''X''. This is also referred to as the risk function of the decision rule ''δ'' and the parameter ''θ''. Here the decision rule depends on the outcome of ''X''. The risk function is given by:

$$R(\theta, \delta) = \operatorname{E}_\theta L\big( \theta, \delta(X) \big) = \int_X L\big( \theta, \delta(x) \big) \, \mathrm{d}P_\theta(x).$$

Here, ''θ'' is a fixed but possibly unknown state of nature, ''X'' is a vector of observations stochastically drawn from a population, $\operatorname{E}_\theta$ is the expectation over all population values of ''X'', $\mathrm{d}P_\theta$ is a probability measure over the event space of ''X'' (parametrized by ''θ'') and the integral is evaluated over the entire support of ''X''.
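When the integral defining the risk function is inconvenient to evaluate, it can be approximated by simulation. The sketch below is one way to do that under stated assumptions: the data are taken to be independent normal draws with mean ''θ'', the decision rule is the sample mean, and the loss is squared error. The function names, sample sizes, and seed are all illustrative choices, not part of the text.

```python
import random

def sample_mean(xs):
    """Decision rule delta(X): estimate theta by the sample mean."""
    return sum(xs) / len(xs)

def squared_error(theta, estimate):
    """Loss L(theta, delta(x))."""
    return (theta - estimate) ** 2

def monte_carlo_risk(theta, rule, loss, n_obs, n_rep, sigma=1.0, seed=0):
    """Approximate R(theta, delta) by averaging the loss over simulated data sets."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        # One simulated data set X drawn from P_theta (assumed normal here).
        x = [rng.gauss(theta, sigma) for _ in range(n_obs)]
        total += loss(theta, rule(x))
    return total / n_rep

risk = monte_carlo_risk(theta=2.0, rule=sample_mean, loss=squared_error,
                        n_obs=10, n_rep=20000)
print(risk)  # should be close to sigma**2 / n_obs = 0.1
```

For this particular rule and loss the exact risk is the variance of the sample mean, so the simulated value can be compared against a known answer.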

Bayes Risk

In a Bayesian approach, the expectation is calculated using the prior distribution $\pi^*$ of the parameter ''θ'':

$$\rho(\pi^*,a) = \int_\Theta \int_{\mathbf X} L(\theta, a(\mathbf x)) \, \mathrm{d}P(\mathbf x \vert \theta) \, \mathrm{d}\pi^*(\theta) = \int_{\mathbf X} \int_\Theta L(\theta, a(\mathbf x)) \, \mathrm{d}\pi^*(\theta \vert \mathbf x) \, \mathrm{d}M(\mathbf x)$$

where $M$ is the marginal distribution of the data, whose density $m(\mathbf x)$ is known as the ''predictive likelihood'' wherein ''θ'' has been "integrated out", $\pi^*(\theta \vert \mathbf x)$ is the posterior distribution, and the order of integration has been changed. One then should choose the action $a^*$ which minimizes this expected loss, which is referred to as ''Bayes Risk''. In the latter equation, the inner integral, taken over ''θ'' for a fixed $\mathbf x$, is known as the ''Posterior Risk'', and minimizing it with respect to the decision ''a'' also minimizes the overall Bayes Risk. This optimal decision, $a^*$, is known as the ''Bayes (decision) Rule'': it minimizes the average loss over all possible states of nature ''θ'', over all possible (probability-weighted) data outcomes.

One advantage of the Bayesian approach is that one need only choose the optimal action under the actual observed data to obtain a uniformly optimal one, whereas choosing the actual frequentist optimal decision rule, as a function of all possible observations, is a much more difficult problem. Of equal importance, though, the Bayes Rule reflects consideration of loss outcomes under different states of nature ''θ''.
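A discrete toy problem shows how the two forms of the Bayes risk agree, and how minimizing the posterior risk observation by observation yields the Bayes rule. Every number below (prior, likelihood, loss table) is hypothetical, and sums replace the integrals because the state and data spaces are finite.

```python
# All numbers below are hypothetical and chosen only for illustration.
thetas = [0, 1]                     # states of nature
xs = [0, 1]                         # possible observations
actions = [0, 1]                    # available actions
prior = {0: 0.7, 1: 0.3}            # pi*(theta)
likelihood = {0: {0: 0.8, 1: 0.2},  # likelihood[theta][x] = P(x | theta)
              1: {0: 0.1, 1: 0.9}}
loss = {(0, 0): 0.0, (0, 1): 1.0,   # loss[(theta, a)] = L(theta, a)
        (1, 0): 5.0, (1, 1): 0.0}

# Predictive (marginal) probability m(x), with theta "integrated out".
m = {x: sum(likelihood[t][x] * prior[t] for t in thetas) for x in xs}

# Posterior pi*(theta | x) by Bayes' rule.
posterior = {x: {t: likelihood[t][x] * prior[t] / m[x] for t in thetas} for x in xs}

def posterior_risk(a, x):
    """Expected loss of action a under the posterior given x."""
    return sum(loss[(t, a)] * posterior[x][t] for t in thetas)

# Bayes rule: for each x, the action with the smallest posterior risk.
bayes_rule = {x: min(actions, key=lambda a: posterior_risk(a, x)) for x in xs}

# Bayes risk, computed in both orders of summation.
risk_via_posterior = sum(m[x] * posterior_risk(bayes_rule[x], x) for x in xs)
risk_via_prior = sum(
    prior[t] * sum(likelihood[t][x] * loss[(t, bayes_rule[x])] for x in xs)
    for t in thetas
)

print(bayes_rule)
print(risk_via_posterior, risk_via_prior)  # both approximately 0.29
```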

Examples in statistics

* For a scalar parameter ''θ'', a decision function whose output $\hat\theta$ is an estimate of ''θ'', and a quadratic loss function (squared error loss) $L(\theta,\hat\theta)=(\theta-\hat\theta)^2$, the risk function becomes the mean squared error of the estimate, $R(\theta,\hat\theta)= \operatorname{E}_\theta \left[ (\theta-\hat\theta)^2 \right]$. An estimator found by minimizing the mean squared error estimates the posterior distribution's mean.
* In density estimation, the unknown parameter is the probability density itself. The loss function is typically chosen to be a norm in an appropriate function space. For example, for the ''L''² norm, $L(f,\hat f) = \|f-\hat f\|_2^2$, the risk function becomes the mean integrated squared error $R(f,\hat f)=\operatorname{E} \left( \|f-\hat f\|_2^2 \right)$.

Economic choice under uncertainty

In economics, decision-making under uncertainty is often modelled using the von Neumann–Morgenstern utility function of the uncertain variable of interest, such as end-of-period wealth. Since the value of this variable is uncertain, so is the value of the utility function; it is the expected value of utility that is maximized.
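The following sketch illustrates why maximizing expected utility can differ from maximizing expected wealth. The two lotteries are hypothetical, and the logarithm is used only as one common example of a concave (risk-averse) utility function; the text does not prescribe a particular form.

```python
import math

# Hypothetical end-of-period wealth lotteries: lists of (probability, wealth).
safe = [(1.0, 100.0)]
risky = [(0.5, 50.0), (0.5, 160.0)]

def expected_utility(lottery, utility):
    """Probability-weighted average of utility over the lottery's outcomes."""
    return sum(p * utility(w) for p, w in lottery)

# Identity utility: the comparison reduces to expected wealth.
ev_safe = expected_utility(safe, lambda w: w)
ev_risky = expected_utility(risky, lambda w: w)

# Log utility (an assumed concave choice): expected utility is compared instead.
eu_safe = expected_utility(safe, math.log)
eu_risky = expected_utility(risky, math.log)

print(ev_safe, ev_risky)   # 100.0 105.0
print(eu_safe > eu_risky)  # True
```

Under these assumptions the risky lottery has the higher expected wealth, yet the agent with log utility prefers the safe one.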

Decision rules

A decision rule makes a choice using an optimality criterion. Some commonly used criteria are:
* Minimax: Choose the decision rule with the lowest worst loss, that is, minimize the worst-case (maximum possible) loss: $\underset{\delta}{\operatorname{arg\,min}} \ \max_{\theta \in \Theta} \ R(\theta,\delta).$
* Invariance: Choose the decision rule which satisfies an invariance requirement.
* Choose the decision rule with the lowest average loss (i.e. minimize the expected value of the loss function): $\underset{\delta}{\operatorname{arg\,min}} \operatorname{E}_{\theta \in \Theta} [R(\theta,\delta)] = \underset{\delta}{\operatorname{arg\,min}} \ \int_{\theta \in \Theta} R(\theta,\delta) \, p(\theta) \,d\theta.$
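The minimax and average-loss criteria can select different rules from the same risk table. In the sketch below, the three rules, the two states, the risk values, and the weights `prior` (standing in for ''p''(''θ'')) are all hypothetical, and a sum replaces the integral because the state space is finite.

```python
# Hypothetical risk table: risk[rule][theta] = R(theta, delta); values are illustrative.
risk = {
    "delta_1": {"theta_a": 1.0, "theta_b": 6.0},
    "delta_2": {"theta_a": 3.0, "theta_b": 4.0},
    "delta_3": {"theta_a": 5.0, "theta_b": 5.0},
}
prior = {"theta_a": 0.9, "theta_b": 0.1}  # assumed weights p(theta)

# Minimax: the rule with the smallest worst-case risk.
minimax_rule = min(risk, key=lambda d: max(risk[d].values()))

def average_risk(d):
    """Prior-weighted average of the risk of rule d."""
    return sum(prior[t] * risk[d][t] for t in prior)

# Lowest average loss: the rule with the smallest prior-weighted risk.
average_rule = min(risk, key=average_risk)

print(minimax_rule)  # delta_2
print(average_rule)  # delta_1
```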

Selecting a loss function

Sound statistical practice requires selecting an estimator consistent with the actual acceptable variation experienced in the context of a particular applied problem. Thus, in the applied use of loss functions, selecting which statistical method to use to model an applied problem depends on knowing the losses that will be experienced from being wrong under the problem's particular circumstances.

A common example involves estimating "location". Under typical statistical assumptions, the mean or average is the statistic for estimating location that minimizes the expected loss experienced under the squared-error loss function, while the median is the estimator that minimizes expected loss experienced under the absolute-difference loss function. Still different estimators would be optimal under other, less common circumstances.

In economics, when an agent is risk neutral, the objective function is simply expressed as the expected value of a monetary quantity, such as profit, income, or end-of-period wealth. For risk-averse or risk-loving agents, loss is measured as the negative of a utility function, and the objective function to be optimized is the expected value of utility.

Other measures of cost are possible, for example mortality or morbidity in the field of public health or safety engineering.

For most optimization algorithms, it is desirable to have a loss function that is globally continuous and differentiable.

Two very commonly used loss functions are the squared loss, $L(a) = a^2$, and the absolute loss, $L(a)=|a|$. However, the absolute loss has the disadvantage that it is not differentiable at $a=0$. The squared loss has the disadvantage that it tends to be dominated by outliers: when summing over a set of values of ''a'' (as in $\sum_{i=1}^n L(a_i)$), the final sum tends to be the result of a few particularly large ''a''-values, rather than an expression of the average ''a''-value.

The choice of a loss function is not arbitrary. It is very restrictive and sometimes the loss function may be characterized by its desirable properties. Among the choice principles are, for example, the requirement of completeness of the class of symmetric statistics in the case of i.i.d. observations, the principle of complete information, and some others.
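Two claims from this section can be checked on a small sample: the mean minimizes total squared loss and the median minimizes total absolute loss, and a single outlier can account for most of the squared loss. The data set and the search grid below are hypothetical. A brute-force grid search is used only to keep the sketch simple.

```python
import statistics

data = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical sample with one large outlier

def total_squared_loss(c):
    """Sum of squared deviations of the data from the location estimate c."""
    return sum((x - c) ** 2 for x in data)

def total_absolute_loss(c):
    """Sum of absolute deviations of the data from the location estimate c."""
    return sum(abs(x - c) for x in data)

# Search a grid of candidate location estimates from 0.0 to 100.0 in steps of 0.1.
grid = [i / 10 for i in range(0, 1001)]
best_squared = min(grid, key=total_squared_loss)
best_absolute = min(grid, key=total_absolute_loss)

print(best_squared, statistics.mean(data))     # both 22.0
print(best_absolute, statistics.median(data))  # both 3.0

# Share of the squared loss at the mean that is due to the single outlier.
outlier_share = (100.0 - 22.0) ** 2 / total_squared_loss(22.0)
print(outlier_share)
```

In this sample the outlier pulls the squared-loss minimizer far from the bulk of the data and contributes most of the total squared loss, while the absolute-loss minimizer stays at the middle observation.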

W. Edwards Deming and Nassim Nicholas Taleb argue that empirical reality, not nice mathematical properties, should be the sole basis for selecting loss functions, and that real losses often are not mathematically nice: they may fail to be differentiable, continuous, symmetric, etc. For example, a person who arrives before a plane gate closure can still make the plane, but a person who arrives after cannot, a discontinuity and asymmetry which makes arriving slightly late much more costly than arriving slightly early. In drug dosing, the cost of too little drug may be lack of efficacy, while the cost of too much may be tolerable toxicity, another example of asymmetry. Traffic, pipes, beams, ecologies, climates, etc. may tolerate increased load or stress with little noticeable change up to a point, then become backed up or break catastrophically. These situations, Deming and Taleb argue, are common in real-life problems, perhaps more common than the classical smooth, continuous, symmetric, differentiable cases.
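The plane gate example can be written as a loss function that is both discontinuous and asymmetric. The sketch below is purely illustrative: the function name `gate_loss` and the cost figures are assumptions, with time measured in minutes relative to gate closure.

```python
def gate_loss(arrival, closure, wait_cost=1.0, miss_cost=500.0):
    """Hypothetical loss: a waiting cost if early, a large fixed cost if late."""
    if arrival <= closure:
        return wait_cost * (closure - arrival)  # cost of minutes spent waiting
    return miss_cost                            # cost of missing the plane

closure = 0.0
slightly_early = gate_loss(-1.0, closure)  # one minute before closure
slightly_late = gate_loss(1.0, closure)    # one minute after closure

print(slightly_early, slightly_late)  # 1.0 500.0
```

The loss jumps at the closure time, so errors of equal size on either side of it carry very different costs, which a squared or absolute loss cannot represent.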