The softmax function, also known as softargmax or normalized exponential function, converts a vector of real numbers into a probability distribution over possible outcomes. It is a generalization of the logistic function to multiple dimensions, and is used in multinomial logistic regression. The softmax function is often used as the last activation function of a neural network to normalize the output of the network to a probability distribution over predicted output classes, based on Luce's choice axiom.

Definition

The softmax function takes as input a vector of real numbers, and normalizes it into a probability distribution consisting of probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative or greater than one, and they might not sum to 1; after applying softmax, each component lies in the interval (0, 1) and the components add up to 1, so that they can be interpreted as probabilities. Furthermore, larger input components correspond to larger probabilities. The standard (unit) softmax function

\sigma : \R^K \to (0, 1)^K

is defined when

K \ge 1

by the formula

\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} \ \ \text{ for } i = 1, \dotsc, K \text{ and } \mathbf{z} = (z_1, \dotsc, z_K) \in \R^K.

In simple words, it applies the standard

exponential function

to each element

z_i

of the input vector

\mathbf z

and normalizes these values by dividing by the sum of all these exponentials; this normalization ensures that the sum of the components of the output vector

\sigma(\mathbf z)

is 1. Instead of e, a different base b > 0 can be used. If 0 < b < 1, smaller input components will result in larger output probabilities, and decreasing the value of b will create probability distributions that are more concentrated around the positions of the smallest input values. Conversely, if b > 1, larger input components will result in larger output probabilities, and increasing the value of b will create probability distributions that are more concentrated around the positions of the largest input values. Writing

b = e^\beta

or

b = e^{-\beta}

(for real β) yields the expressions:

\sigma(\mathbf{z})_i = \frac{e^{\beta z_i}}{\sum_{j=1}^K e^{\beta z_j}} \text{ or } \sigma(\mathbf{z})_i = \frac{e^{-\beta z_i}}{\sum_{j=1}^K e^{-\beta z_j}} \text{ for } i = 1, \dotsc , K .

In some fields, the base is fixed, corresponding to a fixed scale, while in others the parameter β is varied.
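The definition above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation: the function name `softmax` and the keyword `beta` are choices made here, and subtracting the largest scaled input before exponentiating is a common stability trick that leaves the result unchanged.

```python
import math

def softmax(z, beta=1.0):
    """Softmax of a non-empty sequence of reals, using base b = e**beta."""
    scaled = [beta * x for x in z]
    # Shifting by the maximum does not change the ratios, but keeps every
    # exponent <= 0 so that math.exp cannot overflow.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 2.0, 3.0]
p = softmax(z)             # standard softmax (base e)
q = softmax(z, beta=-1.0)  # base 1/e < 1: the smallest input gets the largest probability
print(p, sum(p))
print(q)
```

With a positive `beta` the ordering of the inputs is preserved in the output; with a negative `beta` it is reversed, matching the two cases for the base described above.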

Interpretations

Smooth arg max

The name "softmax" is misleading; the function is not a smooth maximum (a smooth approximation to the

maximum function), but is rather a smooth approximation to the arg max function: the function whose value is the index at which the maximum is attained. In fact, the term "softmax" is also used for the closely related LogSumExp function, which is a smooth maximum. For this reason, some prefer the more accurate term "softargmax", but the term "softmax" is conventional in machine learning. This section uses the term "softargmax" to emphasize this interpretation. Formally, instead of considering the arg max as a function with categorical output

1, \dots, n

(corresponding to the index), consider the arg max function with one-hot representation of the output (assuming there is a unique maximum arg):

\operatorname{arg\,max}(z_1,\, \dots,\, z_n) = (y_1,\, \dots,\, y_n) = (0,\, \dots,\, 0,\, 1,\, 0,\, \dots,\, 0),

where the output coordinate

y_i = 1

if and only if

i

is the arg max of

(z_1, \dots, z_n)

, meaning

z_i

is the unique maximum value of

(z_1,\, \dots,\, z_n)

. For example, in this encoding

\operatorname{arg\,max}(1, 5, 10) = (0, 0, 1),

since the third argument is the maximum. This can be generalized to multiple arg max values (multiple equal

z_i

being the maximum) by dividing the 1 between all max args; formally, each maximal coordinate is assigned 1/k, where k is the number of arguments assuming the maximum. For example,

\operatorname{arg\,max}(1,\, 5,\, 5) = (0,\, 1/2,\, 1/2),

since the second and third arguments are both the maximum. In case all arguments are equal, this is simply

\operatorname{arg\,max}(z, \dots, z) = (1/n, \dots, 1/n).
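The one-hot encoding of arg max, including the tie-splitting rule, can be written down directly. The following is a small sketch; the helper name `one_hot_argmax` is made up for this illustration.

```python
def one_hot_argmax(z):
    """One-hot arg max of a non-empty sequence; ties share the 1 equally."""
    m = max(z)
    k = sum(1 for x in z if x == m)  # number of arguments attaining the maximum
    return [1 / k if x == m else 0.0 for x in z]

print(one_hot_argmax([1, 5, 10]))    # unique maximum in the third position
print(one_hot_argmax([1, 5, 5]))     # two-way tie
print(one_hot_argmax([2, 2, 2, 2]))  # all arguments equal
```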

Points with multiple arg max values are singular points (or singularities, and form the singular set); these are the points where arg max is discontinuous (with a jump discontinuity). Points with a single arg max are known as non-singular or regular points. With the β-parameterized expression given in the definition, softargmax is now a smooth approximation of arg max: as β → ∞, softargmax converges to arg max. There are various notions of convergence of a function; softargmax converges to arg max pointwise, meaning that for each fixed input z, as β → ∞,

\sigma_\beta(\mathbf{z}) \to \operatorname{arg\,max}(\mathbf{z}).

However, softargmax does not converge uniformly to arg max, meaning intuitively that different points converge at different rates, and may converge arbitrarily slowly. In fact, softargmax is continuous, but arg max is not continuous at the singular set where two coordinates are equal, while the uniform limit of continuous functions is continuous. The reason it fails to converge uniformly is that for inputs where two coordinates are almost equal (and one is the maximum), the arg max is the index of one or the other, so a small change in input yields a large change in output. For example,

\sigma_\beta(1,\, 1.0001) \to (0, 1),

but

\sigma_\beta(1,\, 0.9999) \to (1,\, 0),

and

\sigma_\beta(1,\, 1) = (1/2,\, 1/2)

for all β: the closer the points are to the singular set

(x, x)

, the slower they converge. However, softargmax does converge compactly on the non-singular set. Conversely, as β → −∞, softargmax converges to arg min in the same way, where here the singular set is points with two arg min values. In the language of tropical analysis, the softmax is a deformation or "quantization" of arg max and arg min, corresponding to using the log semiring instead of the max-plus semiring (respectively min-plus semiring), and recovering the arg max or arg min by taking the limit is called "tropicalization" or "dequantization". It is also the case that, for any fixed β, if one input is much larger than the others relative to the temperature,

T = 1/\beta

, the output is approximately the arg max. For example, a difference of 10 is large relative to a temperature of 1:

\sigma(0,\, 10) := \sigma_1(0,\, 10) = \left(1/\left(1 + e^{10}\right),\, e^{10}/\left(1 + e^{10}\right)\right) \approx (0.00005,\, 0.99995)

However, if the difference is small relative to the temperature, the value is not close to the arg max. For example, a difference of 10 is small relative to a temperature of 100:

\sigma_{1/100}(0,\, 10) = \left(1/\left(1 + e^{1/10}\right),\, e^{1/10}/\left(1 + e^{1/10}\right)\right) \approx (0.475,\, 0.525).

As β → ∞, temperature goes to zero,

T = 1/\beta \to 0

, so eventually all differences become large (relative to a shrinking temperature), which gives another interpretation for the limit behavior.
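The limiting behavior described in this section can be observed numerically. The sketch below uses only the standard library; the function name `softargmax` and the particular values of β are choices made for this illustration.

```python
import math

def softargmax(z, beta):
    """Softmax with inverse temperature beta (temperature T = 1 / beta)."""
    scaled = [beta * x for x in z]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Near the singular set, convergence to arg max is slow: beta must be large
# compared with 1 / 0.0001 before the output approaches (0, 1).
for beta in (1, 100, 10_000, 1_000_000):
    print(beta, softargmax([1.0, 1.0001], beta))

near = softargmax([1.0, 1.0001], 100)        # still close to (1/2, 1/2)
far = softargmax([1.0, 1.0001], 1_000_000)   # close to (0, 1)
tie = softargmax([1.0, 1.0], 1_000_000)      # (1/2, 1/2) for every beta
cold = softargmax([0.0, 10.0], 1)            # temperature 1
hot = softargmax([0.0, 10.0], 1 / 100)       # temperature 100
print(cold, hot)
```

The last two calls reproduce the two temperature examples given above: a difference of 10 is decisive at temperature 1 and nearly irrelevant at temperature 100.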

Probability theory

In probability theory, the output of the softargmax function can be used to represent a categorical distribution, that is, a probability distribution over different possible outcomes.

Statistical mechanics

In statistical mechanics, the softargmax function is known as the

Boltzmann distribution (or Gibbs distribution): the index set \{1, \dots, K\} is the set of microstates of the system; the inputs z_i are the energies of those states; the denominator is known as the partition function, often denoted by Z; and the factor β is called the coldness (or thermodynamic beta, or inverse temperature).

Applications

The softmax function is used in various multiclass classification methods, such as multinomial logistic regression (also known as softmax regression), multiclass linear discriminant analysis, naive Bayes classifiers, and artificial neural networks. Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of K distinct linear functions, and the predicted probability for the jth class given a sample vector x and a weighting vector w is:

P(y=j\mid \mathbf{x}) = \frac{e^{\mathbf{x}^\mathsf{T}\mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^\mathsf{T}\mathbf{w}_k}}

This can be seen as the composition of K linear functions

\mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_1, \ldots, \mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_K

and the softmax function (where

\mathbf{x}^\mathsf{T}\mathbf{w}

denotes the inner product of

\mathbf{x}

and

\mathbf{w}

). The operation is equivalent to applying a linear operator defined by

\mathbf{w}

to vectors

\mathbf{x}

, thus transforming the original, probably high-dimensional, input to vectors in a K-dimensional space

\mathbb{R}^K

.
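The composition of linear score functions with softmax can be illustrated with a toy example. The sample vector and the per-class weight vectors below are invented for this sketch, and the helper name `class_probabilities` is hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def class_probabilities(x, weights):
    """Return P(y = j | x) for each class j, given one weight vector per class."""
    # Each score is the inner product x^T w_j of the sample with a class weight vector.
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in weights]
    return softmax(scores)

# Made-up data: a 3-dimensional sample and K = 2 classes.
x = [1.0, 2.0, -1.0]
weights = [[0.5, 0.0, 1.0],    # w_1, giving score -0.5
           [0.0, 0.25, 0.0]]   # w_2, giving score 0.5
probs = class_probabilities(x, weights)
print(probs)
```

Here the three-dimensional input is first mapped to a two-dimensional score vector, and only then normalized into probabilities.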

Neural networks

The standard softmax function is often used in the final layer of a neural network-based classifier. Such networks are commonly trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression. Since the function maps a vector and a specific index

i

to a real value, the derivative needs to take the index into account:

\frac{\partial}{\partial z_k}\sigma(\textbf{z}, i) = \sigma(\textbf{z}, i)(\delta_{ik} - \sigma(\textbf{z}, k)).

This expression is symmetrical in the indexes

i, k

and thus may also be expressed as

\frac{\partial}{\partial z_k}\sigma(\textbf{z}, i) = \sigma(\textbf{z}, k)(\delta_{ik} - \sigma(\textbf{z}, i)).

Here, the Kronecker delta is used for simplicity (cf. the derivative of a sigmoid function, being expressed via the function itself). To obtain numerically stable computations of the derivative, one often subtracts a constant from the input vector. In theory, this changes neither the output nor the derivative, but it is more stable in practice because it explicitly controls the largest value computed in each exponent. If the function is scaled with the parameter

\beta

, then these expressions must be multiplied by

\beta

. See multinomial logit for a probability model which uses the softmax activation function.
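The derivative formula and the constant-subtraction trick can both be checked numerically. The sketch below is an illustration under stated assumptions rather than a production implementation: the constant subtracted is taken to be the largest input (a common choice, not one fixed by the text), and the names `softmax_jacobian`, `numeric`, and `shifted` are introduced here.

```python
import math

def softmax(z):
    m = max(z)  # subtracting a constant leaves the output unchanged
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """J[i][k] = d softmax(z)_i / d z_k = s_i * (delta_ik - s_k)."""
    s = softmax(z)
    n = len(s)
    return [[s[i] * ((1.0 if i == k else 0.0) - s[k]) for k in range(n)]
            for i in range(n)]

z = [1.0, 2.0, 3.0]
J = softmax_jacobian(z)

# Central finite-difference estimate of one entry, d sigma_0 / d z_2.
h = 1e-6
numeric = (softmax([1.0, 2.0, 3.0 + h])[0] - softmax([1.0, 2.0, 3.0 - h])[0]) / (2 * h)
print(J[0][2], numeric)

# Inputs this large would overflow a naive exp(); with the shift they do not.
shifted = softmax([x + 1000.0 for x in z])
print(shifted)
```

Each row of the Jacobian sums to zero, which reflects the fact that the outputs always sum to 1.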

Reinforcement learning

In the field of reinforcement learning, a softmax function can be used to convert values into action probabilities. The function commonly used is:

P_t(a) = \frac{\exp(q_t(a)/\tau)}{\sum_{i=1}^n \exp(q_t(i)/\tau)} \text{,}

where the action value

q_t(a)

corresponds to the expected reward of following action a and

\tau

is called a temperature parameter (in allusion to statistical mechanics). For high temperatures (

\tau \to \infty

), all actions have nearly the same probability and the lower the temperature, the more expected rewards affect the probability. For a low temperature (

\tau \to 0^+

), the probability of the action with the highest expected reward tends to 1.
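The effect of the temperature on action selection can be sketched as follows. The action values are invented for this illustration, and the helper names `action_probabilities` and `sample_action` are hypothetical; sampling uses `random.choices` from the standard library.

```python
import math
import random

def action_probabilities(q_values, tau):
    """P_t(a) proportional to exp(q_t(a) / tau), for a temperature tau > 0."""
    scaled = [q / tau for q in q_values]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(q_values, tau, rng=random):
    """Draw one action index according to the softmax probabilities."""
    probs = action_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

q = [1.0, 2.0, 0.5]                           # made-up action-value estimates
hot = action_probabilities(q, tau=1000.0)     # nearly uniform
cold = action_probabilities(q, tau=0.01)      # nearly greedy
action = sample_action(q, tau=0.5, rng=random.Random(0))
print(hot, cold, action)
```

A high temperature gives almost pure exploration, while a low temperature concentrates nearly all probability on the action with the highest estimated value.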

Computational complexity and remedies

In neural network applications, the number of possible outcomes is often large, e.g. in case of neural language models that predict the most likely outcome out of a vocabulary which might contain millions of possible words. This can make the calculations for the softmax layer (i.e. the matrix multiplications to determine the

z_i

, followed by the application of the softmax function itself) computationally expensive. What's more, the gradient descent backpropagation method for training such a neural network involves calculating the softmax for every training example, and the number of training examples can also become large. The computational effort for the softmax became a major limiting factor in the development of larger neural language models, motivating various remedies to reduce training times.

Approaches that reorganize the softmax layer for more efficient calculation include the hierarchical softmax and the differentiated softmax. The hierarchical softmax (introduced by Morin and Bengio in 2005) uses a binary tree structure where the outcomes (vocabulary words) are the leaves and the intermediate nodes are suitably selected "classes" of outcomes, forming latent variables. The desired probability (softmax value) of a leaf (outcome) can then be calculated as the product of the probabilities of all nodes on the path from the root to that leaf. Ideally, when the tree is balanced, this would reduce the computational complexity from

O(K)

to

O(\log_2 K)

. In practice, results depend on choosing a good strategy for clustering the outcomes into classes. A Huffman tree was used for this in Google's word2vec models (introduced in 2013) to achieve scalability.

A second kind of remedy is based on approximating the softmax (during training) with modified loss functions that avoid the calculation of the full normalization factor. These include methods that restrict the normalization sum to a sample of outcomes (e.g. Importance Sampling, Target Sampling).
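The path-product idea behind the hierarchical softmax can be illustrated with a toy complete binary tree. This is only a sketch under simplifying assumptions: each internal node carries a fixed, made-up score, whereas in an actual model the score would be computed from the network's hidden state; the heap-style node numbering and the name `leaf_probability` are also choices made here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def leaf_probability(leaf, node_scores, depth):
    """Probability of `leaf` in a complete binary tree with 2**depth leaves.

    Internal nodes are numbered heap-style: node 0 is the root and node n has
    children 2n + 1 (left) and 2n + 2 (right). sigmoid(node_scores[n]) is the
    probability of branching right at node n.
    """
    prob, node = 1.0, 0
    for level in reversed(range(depth)):
        go_right = (leaf >> level) & 1  # next bit of the leaf index, most significant first
        p_right = sigmoid(node_scores[node])
        prob *= p_right if go_right else 1.0 - p_right
        node = 2 * node + 1 + go_right
    return prob

depth = 3                                        # K = 2**3 = 8 outcomes
scores = [0.3, -1.2, 0.8, 2.0, -0.5, 0.1, 1.5]   # one made-up score per internal node
probs = [leaf_probability(k, scores, depth) for k in range(2 ** depth)]
print(probs, sum(probs))
```

Evaluating one leaf touches only `depth` nodes (here 3 rather than 8), and the leaf probabilities still sum to 1 without computing a normalization over all outcomes.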

Mathematical properties

Geometrically the softmax function maps the vector space

\mathbb{R}^K

to the interior of the standard

(K-1)

-simplex, cutting the dimension by one (the range is a

(K - 1)

-dimensional simplex in

K

-dimensional space), due to the linear constraint that all outputs sum to 1, meaning the range lies on a hyperplane

. Along the main diagonal

(x,\, x,\, \dots,\, x),

softmax is just the uniform distribution on outputs,

(1/n, \dots, 1/n)

: equal scores yield equal probabilities. More generally, softmax is invariant under translation by the same value in each coordinate: adding

\mathbf{c} = (c,\, \dots,\, c)

to the inputs

\mathbf{z}

yields

\sigma(\mathbf{z} + \mathbf{c}) = \sigma(\mathbf{z})

, because it multiplies each exponential by the same factor,

e^c

(because

e^{z_i + c} = e^{z_i} \cdot e^c

), so the ratios do not change:

\sigma(\mathbf{z} + \mathbf{c})_j = \frac{e^{z_j + c}}{\sum_{k=1}^K e^{z_k + c}} = \frac{e^{z_j} \cdot e^c}{\sum_{k=1}^K e^{z_k} \cdot e^c} = \sigma(\mathbf{z})_j.

Geometrically, softmax is constant along diagonals: this is the dimension that is eliminated, and corresponds to the softmax output being independent of a translation in the input scores (a choice of 0 score). One can normalize input scores by assuming that the sum is zero (subtract the average:

\mathbf{c}

where

c = \frac{1}{n} \sum z_i

), and then the softmax takes the hyperplane of points that sum to zero,

\sum z_i = 0

, to the open simplex of positive values that sum to 1

\sum \sigma(\mathbf{z})_i = 1

, analogously to how the exponent takes 0 to 1,

e^0 = 1

and is positive. By contrast, softmax is not invariant under scaling. For instance,

\sigma\bigl((0,\, 1)\bigr) = \bigl(1/(1 + e),\, e/(1 + e)\bigr)

but

\sigma\bigl((0, 2)\bigr) = \bigl(1/\left(1 + e^2\right),\, e^2/\left(1 + e^2\right)\bigr).
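Both properties are easy to confirm numerically. The following sketch reuses the two-component example from the text; the shift of 5 and the variable names are arbitrary choices for this illustration.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [0.0, 1.0]
base = softmax(z)
shifted = softmax([x + 5.0 for x in z])  # translation: same output as softmax(z)
doubled = softmax([2.0 * x for x in z])  # scaling: a different output
print(base, shifted, doubled)
```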

The standard logistic function is the special case for a 1-dimensional axis in 2-dimensional space, say the x-axis in the plane. One variable is fixed at 0 (say

z_2 = 0

), so

e^0 = 1

, and the other variable can vary, denote it

z_1 = x

, so

e^{z_1}/\sum_{k=1}^2 e^{z_k} = e^x/\left(e^x + 1\right),

the standard logistic function, and

e^{z_2}/\sum_{k=1}^2 e^{z_k} = 1/\left(e^x + 1\right),

its complement (meaning they add up to 1). The 1-dimensional input could alternatively be expressed as the line

(x/2,\, -x/2)

, with outputs

e^{x/2}/\left(e^{x/2} + e^{-x/2}\right) = e^x/\left(e^x + 1\right)

and

e^{-x/2}/\left(e^{x/2} + e^{-x/2}\right) = 1/\left(e^x + 1\right).
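The reduction to the logistic function can be checked for a sample value. The value x = 0.7 and the helper name `logistic` are arbitrary choices for this sketch.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 0.7
first = softmax([x, 0.0])           # z_1 = x, z_2 = 0
second = softmax([x / 2, -x / 2])   # the same input written on the line (x/2, -x/2)
print(first[0], second[0], logistic(x))
```

Both parameterizations give the same first component because they differ only by a translation of the inputs, under which softmax is invariant.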

The softmax function is also the gradient of the LogSumExp function, a smooth maximum:

\frac{\partial}{\partial z_i} \operatorname{LSE}(\mathbf{z}) = \frac{\exp z_i}{\sum_{j=1}^K \exp z_j} = \sigma(\mathbf{z})_i, \quad \text{ for } i = 1, \dotsc , K, \quad \mathbf{z} = (z_1,\, \dotsc,\, z_K) \in\R^K,

where the LogSumExp function is defined as

\operatorname{LSE}(z_1,\, \dots,\, z_n) = \log\left(\exp(z_1) + \cdots + \exp(z_n)\right).
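The gradient relationship can be verified with central finite differences. This is a numerical sketch, with the step size `h` and the test vector chosen for illustration; `logsumexp` is written here from the definition above rather than taken from a library.

```python
import math

def logsumexp(z):
    m = max(z)  # factor out the maximum for numerical stability
    return m + math.log(sum(math.exp(x - m) for x in z))

def softmax(z):
    lse = logsumexp(z)
    return [math.exp(x - lse) for x in z]

z = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]
h = 1e-6
grad = []
for i in range(len(z)):
    up = z[:i] + [z[i] + h] + z[i + 1:]
    down = z[:i] + [z[i] - h] + z[i + 1:]
    grad.append((logsumexp(up) - logsumexp(down)) / (2 * h))

print(grad)
print(softmax(z))
```

The finite-difference gradient and the softmax output agree to within the accuracy of the approximation, and `logsumexp(z)` lies slightly above `max(z)`, as expected of a smooth maximum.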

History

The softmax function was used in statistical mechanics as the Boltzmann distribution in a foundational paper by Boltzmann, and was formalized and popularized in an influential textbook by Gibbs. The use of the softmax in decision theory is credited to Luce, who used the axiom of independence of irrelevant alternatives in rational choice theory to deduce the softmax in Luce's choice axiom for relative preferences. In machine learning, the term "softmax" is credited to John S. Bridle in two 1989 conference papers.

Example

If we take an input of [1, 2, 3, 4, 1, 2, 3], the softmax of that is [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175]. The output has most of its weight where the "4" was in the original input. This is what the function is normally used for: to highlight the largest values and suppress values which are significantly below the maximum value. But note: softmax is not scale invariant, so if the input were [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3] (which sums to 1.6) the softmax would be [0.125, 0.138, 0.153, 0.169, 0.125, 0.138, 0.153]. This shows that for values between 0 and 1 softmax, in fact, de-emphasizes the maximum value (note that 0.169 is not only less than 0.475, it is also less than the initial proportion of 0.4/1.6 = 0.25).

Computation of this example using Python code:

    >>> import numpy as np
    >>> a = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]
    >>> np.exp(a) / np.sum(np.exp(a))
    array([0.02364054, 0.06426166, 0.1746813, 0.474833, 0.02364054,
           0.06426166, 0.1746813])

Here is an example of Julia code:

    julia> A = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0];  # semicolon to suppress interactive output
    julia> exp.(A) ./ sum(exp.(A))
    7-element Array{Float64,1}:
     0.0236405
     0.0642617
     0.174681
     0.474833
     0.0236405
     0.0642617
     0.174681

Here is an example of R code:

    > z <- c(1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0)
    > softmax <- exp(z)/sum(exp(z))
    > softmax
    [1] 0.02364054 0.06426166 0.17468130 0.47483300 0.02364054 0.06426166 0.17468130

Here is an example of Elixir code:

    iex> t = Nx.tensor([[1, 2], [3, 4]])
    iex> Nx.divide(Nx.exp(t), Nx.sum(Nx.exp(t)))
    #Nx.Tensor<
      f64[2][2]
      [
        [0.03205860328008499, 0.08714431874203257],
        [0.23688281808991013, 0.6439142598879722]
      ]
    >

Here is an example of Raku code:

    > my @z = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0];
    > say @z.map: {exp($_)/sum(@z.map: {exp($_)})}
    (0.023640543021591385 0.06426165851049616 0.17468129859572226 0.4748329997443803 0.023640543021591385 0.06426165851049616 0.17468129859572226)
