Softmax

The softmax function, also known as softargmax or normalized exponential function, converts a tuple of K real numbers into a probability distribution of K possible outcomes. It is a generalization of the logistic function to multiple dimensions, and is used in multinomial logistic regression. The softmax function is often used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes.


Definition

The softmax function takes as input a tuple of K real numbers, and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some tuple components could be negative, or greater than one, and might not sum to 1; but after applying softmax, each component will be in the interval (0, 1), and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.

Formally, the standard (unit) softmax function \sigma\colon \R^K \to (0, 1)^K, where K \ge 1, takes a tuple \mathbf{z} = (z_1, \dotsc, z_K) \in \R^K and computes each component of the vector \sigma(\mathbf{z}) \in (0, 1)^K with

\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}\,.

In words, the softmax applies the standard exponential function to each element z_i of the input tuple \mathbf{z} (consisting of K real numbers), and normalizes these values by dividing by the sum of all these exponentials. The normalization ensures that the sum of the components of the output vector \sigma(\mathbf{z}) is 1. The term "softmax" derives from the amplifying effects of the exponential on any maxima in the input tuple. For example, the standard softmax of (1, 2, 8) is approximately (0.001, 0.002, 0.997), which amounts to assigning almost all of the total unit weight in the result to the position of the tuple's maximal element (of 8).

In general, instead of e a different base b > 0 can be used. As above, if b > 1 then larger input components will result in larger output probabilities, and increasing the value of b will create probability distributions that are more concentrated around the positions of the largest input values. Conversely, if 0 < b < 1 then smaller input components will result in larger output probabilities, and decreasing the value of b will create probability distributions that are more concentrated around the positions of the smallest input values. Writing b = e^\beta or b = e^{-\beta} (for real \beta) yields the expressions:

\sigma(\mathbf{z})_i = \frac{e^{\beta z_i}}{\sum_{j=1}^K e^{\beta z_j}} \quad \text{or} \quad \sigma(\mathbf{z})_i = \frac{e^{-\beta z_i}}{\sum_{j=1}^K e^{-\beta z_j}}, \quad \text{for } i = 1, \dotsc, K.

A value proportional to the reciprocal of \beta is sometimes referred to as the ''temperature'': \beta = 1 / kT, where k is typically 1 or the Boltzmann constant and T is the temperature. A higher temperature results in a more uniform output distribution (i.e. with higher entropy; it is "more random"), while a lower temperature results in a sharper output distribution, with one value dominating. In some fields, the base is fixed, corresponding to a fixed scale, while in others the parameter \beta (or T) is varied.
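A minimal NumPy sketch of this definition, with an explicit inverse-temperature parameter beta (the function and variable names here are illustrative only), reproduces the (1, 2, 8) example and shows how the temperature controls sharpness:

import numpy as np

def softmax(z, beta=1.0):
    """Softmax with inverse temperature beta; beta = 1 gives the standard (unit) softmax."""
    z = np.asarray(z, dtype=float)
    e = np.exp(beta * z)       # exponentiate each component
    return e / e.sum()         # normalize so the outputs sum to 1

print(softmax([1, 2, 8]))              # ~ [0.001, 0.002, 0.997]
print(softmax([1, 2, 8], beta=0.1))    # high temperature (T = 1/beta): more uniform
print(softmax([1, 2, 8], beta=10.0))   # low temperature: sharply peaked at the maximum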


Interpretations


Smooth arg max

The softmax function is a smooth approximation to the arg max function: the function whose value is the ''index'' of a tuple's largest element. The name "softmax" may be misleading. Softmax is not a smooth maximum (that is, a smooth approximation to the maximum function). The term "softmax" is also used for the closely related LogSumExp function, which is a smooth maximum. For this reason, some prefer the more accurate term "softargmax", though the term "softmax" is conventional in machine learning. This section uses the term "softargmax" for clarity.

Formally, instead of considering the arg max as a function with categorical output 1, \dots, n (corresponding to the index), consider the arg max function with one-hot representation of the output (assuming there is a unique maximum arg):

\operatorname{arg\,max}(z_1,\, \dots,\, z_n) = (y_1,\, \dots,\, y_n) = (0,\, \dots,\, 0,\, 1,\, 0,\, \dots,\, 0),

where the output coordinate y_i = 1 if and only if i is the arg max of (z_1, \dots, z_n), meaning z_i is the unique maximum value of (z_1,\, \dots,\, z_n). For example, in this encoding \operatorname{arg\,max}(1, 5, 10) = (0, 0, 1), since the third argument is the maximum.

This can be generalized to multiple arg max values (multiple equal z_i being the maximum) by dividing the 1 between all max args; formally 1/k where k is the number of arguments assuming the maximum. For example, \operatorname{arg\,max}(1,\, 5,\, 5) = (0,\, 1/2,\, 1/2), since the second and third argument are both the maximum. In case all arguments are equal, this is simply \operatorname{arg\,max}(z, \dots, z) = (1/n, \dots, 1/n). Points with multiple arg max values are singular points (or singularities, and form the singular set) – these are the points where arg max is discontinuous (with a jump discontinuity) – while points with a single arg max are known as non-singular or regular points.

With the last expression given in the introduction, softargmax is now a smooth approximation of arg max: as \beta \to \infty, softargmax converges to arg max. There are various notions of convergence of a function; softargmax converges to arg max pointwise, meaning for each fixed input \mathbf{z}, as \beta \to \infty, \sigma_\beta(\mathbf{z}) \to \operatorname{arg\,max}(\mathbf{z}). However, softargmax does not converge uniformly to arg max, meaning intuitively that different points converge at different rates, and may converge arbitrarily slowly. In fact, softargmax is continuous, but arg max is not continuous at the singular set where two coordinates are equal, while the uniform limit of continuous functions is continuous. The reason it fails to converge uniformly is that for inputs where two coordinates are almost equal (and one is the maximum), the arg max is the index of one or the other, so a small change in input yields a large change in output. For example, \sigma_\beta(1,\, 1.0001) \to (0,\, 1), but \sigma_\beta(1,\, 0.9999) \to (1,\, 0), and \sigma_\beta(1,\, 1) = (1/2,\, 1/2) for all \beta: the closer the points are to the singular set (x, x), the slower they converge. However, softargmax does converge compactly on the non-singular set.

Conversely, as \beta \to -\infty, softargmax converges to arg min in the same way, where here the singular set is points with two arg ''min'' values. In the language of tropical analysis, the softmax is a deformation or "quantization" of arg max and arg min, corresponding to using the log semiring instead of the max-plus semiring (respectively min-plus semiring), and recovering the arg max or arg min by taking the limit is called "tropicalization" or "dequantization".

It is also the case that, for any fixed \beta, if one input z_i is much larger than the others ''relative'' to the temperature, T = 1/\beta, the output is approximately the arg max. For example, a difference of 10 is large relative to a temperature of 1:

\sigma(0,\, 10) := \sigma_1(0,\, 10) = \left(1/\left(1 + e^{10}\right),\, e^{10}/\left(1 + e^{10}\right)\right) \approx (0.00005,\, 0.99995)

However, if the difference is small relative to the temperature, the value is not close to the arg max. For example, a difference of 10 is small relative to a temperature of 100:

\sigma_{1/100}(0,\, 10) = \left(1/\left(1 + e^{1/10}\right),\, e^{1/10}/\left(1 + e^{1/10}\right)\right) \approx (0.475,\, 0.525).

As \beta \to \infty, temperature goes to zero, T = 1/\beta \to 0, so eventually all differences become large (relative to a shrinking temperature), which gives another interpretation for the limit behavior.
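The pointwise convergence of \sigma_\beta to the one-hot arg max, and its slowness near the singular set, can be checked numerically. The following is a small illustrative sketch (the helper name softargmax is ad hoc):

import numpy as np

def softargmax(z, beta):
    z = np.asarray(z, dtype=float)
    e = np.exp(beta * (z - z.max()))   # subtract the maximum for numerical stability
    return e / e.sum()

z = np.array([1.0, 5.0, 10.0])
for beta in [0.1, 1.0, 10.0, 100.0]:
    print(beta, softargmax(z, beta))   # tends to the one-hot (0, 0, 1) as beta grows

# Near the singular set (two almost-equal coordinates) convergence is slow:
print(softargmax(np.array([1.0, 1.0001]), beta=100.0))   # still close to (1/2, 1/2)
print(softargmax(np.array([1.0, 1.0001]), beta=1e6))     # now close to (0, 1)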


Statistical mechanics

In statistical mechanics, the softargmax function is known as the Boltzmann distribution (or Gibbs distribution): the index set \{1, \dotsc, K\} are the microstates of the system; the inputs z_i are the energies of that state; the denominator is known as the partition function, often denoted by Z; and the factor \beta is called the coldness (or thermodynamic beta, or inverse temperature).


Applications

The softmax function is used in various multiclass classification methods, such as multinomial logistic regression (also known as softmax regression), multiclass linear discriminant analysis, naive Bayes classifiers, and artificial neural networks. Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of K distinct linear functions, and the predicted probability for the j-th class given a sample tuple \mathbf{x} and a weighting vector \mathbf{w} is:

P(y = j \mid \mathbf{x}) = \frac{e^{\mathbf{x}^\mathsf{T}\mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^\mathsf{T}\mathbf{w}_k}}

This can be seen as the composition of K linear functions \mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_1, \ldots, \mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_K and the softmax function (where \mathbf{x}^\mathsf{T}\mathbf{w} denotes the inner product of \mathbf{x} and \mathbf{w}). The operation is equivalent to applying a linear operator defined by \mathbf{w} to tuples \mathbf{x}, thus transforming the original, probably highly-dimensional, input to vectors in a K-dimensional space \mathbb{R}^K.
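As an illustration of this composition, the following is a brief NumPy sketch with made-up weights (the arrays W and x and the helper softmax are hypothetical, not a fixed API):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))    # columns are the weighting vectors w_1, ..., w_K (K = 3 classes)
x = rng.normal(size=4)         # sample vector with 4 features

scores = x @ W                 # K distinct linear functions x -> x^T w_j
probs = softmax(scores)        # P(y = j | x) for j = 1, ..., K
print(probs, probs.sum())      # class probabilities, summing to 1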


Neural networks

The standard softmax function is often used in the final layer of a neural network-based classifier. Such networks are commonly trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression. Since the function maps a tuple and a specific index i to a real value, the derivative needs to take the index into account:

\frac{\partial}{\partial q_k}\sigma(\textbf{q}, i) = \sigma(\textbf{q}, i)(\delta_{ik} - \sigma(\textbf{q}, k)).

This expression is symmetrical in the indexes i, k and thus may also be expressed as

\frac{\partial}{\partial q_k}\sigma(\textbf{q}, i) = \sigma(\textbf{q}, k)(\delta_{ik} - \sigma(\textbf{q}, i)).

Here, the Kronecker delta is used for simplicity (cf. the derivative of a sigmoid function, being expressed via the function itself). To ensure stable numerical computations, subtracting the maximum value from the input tuple is common. This approach, while not altering the output or the derivative theoretically, enhances stability by directly controlling the maximum exponent value computed. If the function is scaled with the parameter \beta, then these expressions must be multiplied by \beta. See multinomial logit for a probability model which uses the softmax activation function.
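The derivative formula and the max-subtraction trick above can be sketched in a few lines of NumPy (an illustrative sketch, not any particular library's implementation):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # subtract the maximum exponent for stability
    return e / e.sum()

def softmax_jacobian(z):
    s = softmax(z)
    # d sigma_i / d z_k = sigma_i * (delta_ik - sigma_k)
    return np.diag(s) - np.outer(s, s)

z = np.array([1000.0, 1001.0, 1002.0])
print(softmax(z))                        # stable: ~ [0.090, 0.245, 0.665]
# np.exp(z) / np.exp(z).sum() would overflow to inf and give nan without the shift
print(softmax_jacobian(np.array([1.0, 2.0, 3.0])))   # each row and column sums to 0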


Reinforcement learning

In the field of reinforcement learning, a softmax function can be used to convert values into action probabilities. The function commonly used is:

P_t(a) = \frac{e^{q_t(a)/\tau}}{\sum_{i=1}^n e^{q_t(i)/\tau}},

where the action value q_t(a) corresponds to the expected reward of following action a and \tau is called a temperature parameter (in allusion to statistical mechanics). For high temperatures (\tau \to \infty), all actions have nearly the same probability, and the lower the temperature, the more expected rewards affect the probability. For a low temperature (\tau \to 0^+), the probability of the action with the highest expected reward tends to 1.
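A brief illustrative sketch of such softmax action selection (the names softmax_policy, q and tau are chosen here for illustration):

import numpy as np

def softmax_policy(q_values, tau):
    """Convert action values q_t(a) into action probabilities at temperature tau."""
    q = np.asarray(q_values, dtype=float)
    e = np.exp((q - q.max()) / tau)      # stabilized exponentials of q_t(a) / tau
    return e / e.sum()

q = np.array([1.0, 2.0, 4.0])            # expected rewards of three actions
print(softmax_policy(q, tau=100.0))      # high temperature: nearly uniform probabilities
print(softmax_policy(q, tau=0.1))        # low temperature: the best action dominates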


Computational complexity and remedies

In neural network applications, the number K of possible outcomes is often large, e.g. in case of neural language models that predict the most likely outcome out of a vocabulary which might contain millions of possible words. This can make the calculations for the softmax layer (i.e. the matrix multiplications to determine the z_i, followed by the application of the softmax function itself) computationally expensive. What's more, the gradient descent backpropagation method for training such a neural network involves calculating the softmax for every training example, and the number of training examples can also become large. The computational effort for the softmax became a major limiting factor in the development of larger neural language models, motivating various remedies to reduce training times.

Approaches that reorganize the softmax layer for more efficient calculation include the hierarchical softmax and the differentiated softmax. The hierarchical softmax (introduced by Morin and Bengio in 2005) uses a binary tree structure where the outcomes (vocabulary words) are the leaves and the intermediate nodes are suitably selected "classes" of outcomes, forming latent variables. The desired probability (softmax value) of a leaf (outcome) can then be calculated as the product of the probabilities of all nodes on the path from the root to that leaf. Ideally, when the tree is balanced, this would reduce the computational complexity from O(K) to O(\log_2 K). In practice, results depend on choosing a good strategy for clustering the outcomes into classes. A Huffman tree was used for this in Google's word2vec models (introduced in 2013) to achieve scalability.

A second kind of remedies is based on approximating the softmax (during training) with modified loss functions that avoid the calculation of the full normalization factor. These include methods that restrict the normalization sum to a sample of outcomes (e.g. Importance Sampling, Target Sampling).
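As a rough, schematic sketch of the path-product idea (not Morin and Bengio's exact formulation; the vectors v_n, the context vector h and the helper names below are hypothetical), each internal node n can hold a parameter vector v_n and make a binary (sigmoid) decision, and the leaf probability is the product of these decisions along the root-to-leaf path:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def leaf_probability(h, path):
    """Probability of a leaf given a context vector h.

    `path` is a list of (v_n, sign) pairs: the parameter vector of each internal
    node on the root-to-leaf path, and +1/-1 for branching one way or the other.
    """
    p = 1.0
    for v_n, sign in path:
        p *= sigmoid(sign * np.dot(v_n, h))   # binary decision at each internal node
    return p

# Toy example: a balanced tree over 4 leaves needs log2(4) = 2 decisions per leaf.
rng = np.random.default_rng(0)
h = rng.normal(size=8)
path_to_leaf = [(rng.normal(size=8), +1), (rng.normal(size=8), -1)]
print(leaf_probability(h, path_to_leaf))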


Numerical algorithms

The standard softmax is numerically unstable because of large exponentiations. The safe softmax method calculates instead

\sigma(\mathbf{z})_i = \frac{e^{z_i - m}}{\sum_{j=1}^K e^{z_j - m}},

where m = \max_i z_i is the largest factor involved. Subtracting it guarantees that the exponentiations result in at most 1.

The attention mechanism in Transformers takes three arguments: a "query vector" q, a list of "key vectors" k_1, \dots, k_N, and a list of "value vectors" v_1, \dots, v_N, and outputs a softmax-weighted sum over value vectors:

o = \sum_{i=1}^N \frac{e^{q^\mathsf{T} k_i}}{\sum_{j=1}^N e^{q^\mathsf{T} k_j}} v_i

The standard softmax method involves several loops over the inputs, which would be bottlenecked by memory bandwidth. The FlashAttention method is a communication-avoiding algorithm that fuses these operations into a single loop, increasing the arithmetic intensity. It is an online algorithm that computes the following quantities:

\begin{align}
z_i &= q^\mathsf{T} k_i \\
m_i &= \max(z_1, \dots, z_i) &&= \max(m_{i-1}, z_i)\\
l_i &= e^{z_1 - m_i} + \dots + e^{z_i - m_i} &&= e^{m_{i-1} - m_i}\, l_{i-1} + e^{z_i - m_i}\\
o_i &= e^{z_1 - m_i}\, v_1 + \dots + e^{z_i - m_i}\, v_i &&= e^{m_{i-1} - m_i}\, o_{i-1} + e^{z_i - m_i}\, v_i
\end{align}

and returns o_N / l_N. In practice, FlashAttention operates over multiple queries and keys per loop iteration, in a similar way as blocked matrix multiplication. If backpropagation is needed, then the output vectors and the intermediate arrays [m_1, \dots, m_N], [l_1, \dots, l_N] are cached, and during the backward pass, attention matrices are rematerialized from these, making it a form of gradient checkpointing.
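The online recurrences can be written out directly. Below is a simplified single-query NumPy sketch (it ignores the blocking over multiple queries and keys that FlashAttention actually uses; the helper name attention_online is illustrative), checked against the naive softmax-weighted sum:

import numpy as np

def attention_online(q, K, V):
    """Single-query attention via the online (streaming) softmax recurrences."""
    m = -np.inf                       # running maximum m_i
    l = 0.0                           # running normalizer l_i
    o = np.zeros(V.shape[1])          # running unnormalized output o_i
    for k_i, v_i in zip(K, V):
        z_i = q @ k_i
        m_new = max(m, z_i)
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        l = scale * l + np.exp(z_i - m_new)
        o = scale * o + np.exp(z_i - m_new) * v_i
        m = m_new
    return o / l

rng = np.random.default_rng(0)
q = rng.normal(size=4)
K = rng.normal(size=(6, 4))           # six key vectors
V = rng.normal(size=(6, 3))           # six value vectors

# Naive reference: softmax over q^T k_i, then a weighted sum of the v_i.
w = np.exp(K @ q - (K @ q).max())
w /= w.sum()
print(np.allclose(attention_online(q, K, V), w @ V))   # True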


Mathematical properties

Geometrically the softmax function maps the Euclidean space \mathbb{R}^K to the (relative) interior of the standard (K-1)-simplex, cutting the dimension by one (the range is a (K-1)-dimensional simplex in K-dimensional space), due to the linear constraint that all outputs sum to 1, meaning the output lies on a hyperplane.

Along the main diagonal (x,\, x,\, \dots,\, x), softmax is just the uniform distribution on outputs, (1/n, \dots, 1/n): equal scores yield equal probabilities. More generally, softmax is invariant under translation by the same value in each coordinate: adding \mathbf{c} = (c,\, \dots,\, c) to the inputs \mathbf{z} yields \sigma(\mathbf{z} + \mathbf{c}) = \sigma(\mathbf{z}), because it multiplies each exponent by the same factor, e^c (since e^{z_i + c} = e^{z_i} \cdot e^c), so the ratios do not change:

\sigma(\mathbf{z} + \mathbf{c})_j = \frac{e^{z_j + c}}{\sum_{k=1}^K e^{z_k + c}} = \frac{e^{z_j} \cdot e^c}{\sum_{k=1}^K e^{z_k} \cdot e^c} = \sigma(\mathbf{z})_j.

Geometrically, softmax is constant along diagonals: this is the dimension that is eliminated, and corresponds to the softmax output being independent of a translation in the input scores (a choice of 0 score). One can normalize input scores by assuming that the sum is zero (subtract the average: \mathbf{z} - (c, \dots, c) where c = \tfrac{1}{n} \sum_i z_i), and then the softmax takes the hyperplane of points that sum to zero, \sum_i z_i = 0, to the open simplex of positive values that sum to 1, \sum_i \sigma(\mathbf{z})_i = 1, analogously to how the exponential takes 0 to 1, e^0 = 1, and is positive.

By contrast, softmax is not invariant under scaling. For instance, \sigma\bigl((0,\, 1)\bigr) = \bigl(1/(1 + e),\, e/(1 + e)\bigr) but \sigma\bigl((0,\, 2)\bigr) = \bigl(1/\left(1 + e^2\right),\, e^2/\left(1 + e^2\right)\bigr).

The standard logistic function is the special case for a 1-dimensional axis in 2-dimensional space, say the ''x''-axis in the (x, y) plane. One variable is fixed at 0 (say z_2 = 0), so e^{z_2} = e^0 = 1, and the other variable can vary, denote it z_1 = x, so e^{z_1}/\sum_{k=1}^2 e^{z_k} = e^x/\left(e^x + 1\right), the standard logistic function, and e^{z_2}/\sum_{k=1}^2 e^{z_k} = 1/\left(e^x + 1\right), its complement (meaning they add up to 1). The 1-dimensional input could alternatively be expressed as the line (x/2,\, -x/2), with outputs e^{x/2}/\left(e^{x/2} + e^{-x/2}\right) = e^x/\left(e^x + 1\right) and e^{-x/2}/\left(e^{x/2} + e^{-x/2}\right) = 1/\left(e^x + 1\right).
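These invariance properties are easy to check numerically (a brief sketch):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.0, 1.0])
print(softmax(z))                           # ~ [0.269, 0.731]
print(softmax(z + 5.0))                     # identical: invariant under adding a constant
print(softmax(2.0 * z))                     # ~ [0.119, 0.881]: not invariant under scaling
print(softmax(np.array([3.0, 3.0, 3.0])))   # uniform [1/3, 1/3, 1/3] along the diagonal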


Gradients

The softmax function is also the gradient of the LogSumExp function:

\frac{\partial}{\partial z_i} \operatorname{LSE}(\mathbf{z}) = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} = \sigma(\mathbf{z})_i, \quad \text{for } i = 1, \dotsc, K, \quad \mathbf{z} = (z_1,\, \dotsc,\, z_K) \in \R^K,

where the LogSumExp function is defined as \operatorname{LSE}(z_1,\, \dots,\, z_n) = \log\left(\exp(z_1) + \cdots + \exp(z_n)\right). The gradient of softmax itself is thus \partial_{z_j} \sigma_i = \sigma_i (\delta_{ij} - \sigma_j).
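This relationship can be verified with a central-difference approximation of the LogSumExp gradient (a quick numerical sketch):

import numpy as np

def logsumexp(z):
    m = z.max()
    return m + np.log(np.sum(np.exp(z - m)))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.5, 1.5, -2.0])
eps = 1e-6
grad = np.array([(logsumexp(z + eps * np.eye(3)[i]) - logsumexp(z - eps * np.eye(3)[i])) / (2 * eps)
                 for i in range(3)])
print(grad)          # central-difference gradient of LogSumExp
print(softmax(z))    # agrees with the softmax of z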


History

The softmax function was used in statistical mechanics as the Boltzmann distribution in the foundational paper of Boltzmann (1868), and was formalized and popularized in the influential textbook of Gibbs (1902). The use of the softmax in decision theory is credited to R. Duncan Luce, who used the axiom of independence of irrelevant alternatives in rational choice theory to deduce the softmax in Luce's choice axiom for relative preferences. In machine learning, the term "softmax" is credited to John S. Bridle in two 1989 conference papers.


Example

With an input of (1, 2, 3, 4, 1, 2, 3), the softmax is approximately (0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175). The output has most of its weight where the "4" was in the original input. This is what the function is normally used for: to highlight the largest values and suppress values which are significantly below the maximum value. But note: a change of ''temperature'' changes the output. When the temperature is multiplied by 10, the inputs are effectively (0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3) and the softmax is approximately (0.125, 0.138, 0.153, 0.169, 0.125, 0.138, 0.153). This shows that high temperatures de-emphasize the maximum value.

Computation of this example using Python code:

>>> import numpy as np
>>> z = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0])
>>> beta = 1.0
>>> np.exp(beta * z) / np.sum(np.exp(beta * z))
array([0.02364054, 0.06426166, 0.1746813 , 0.474833  , 0.02364054, 0.06426166, 0.1746813 ])


Alternatives

The softmax function generates probability predictions densely distributed over its support. Other functions like sparsemax or α-entmax can be used when sparse probability predictions are desired ("Speeding Up Entmax" by Maxat Tezekbayev, Vassilina Nikoulina, Matthias Gallé, Zhenisbek Assylbekov, https://arxiv.org/abs/2111.06832v3). The Gumbel-softmax reparametrization trick can also be used when sampling from a discrete distribution needs to be mimicked in a differentiable manner.
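For instance, the Gumbel-softmax trick adds Gumbel noise to the log-scores and applies a temperature-scaled softmax, giving a differentiable relaxation of categorical sampling. A rough NumPy sketch (the function name and parameters are illustrative):

import numpy as np

def gumbel_softmax_sample(logits, temperature, rng):
    """Differentiable relaxation of sampling from softmax(logits)."""
    gumbel_noise = -np.log(-np.log(rng.uniform(size=logits.shape)))   # Gumbel(0, 1) samples
    y = (logits + gumbel_noise) / temperature
    e = np.exp(y - y.max())
    return e / e.sum()          # near one-hot for small temperatures, smoother for large ones

rng = np.random.default_rng(0)
logits = np.log(np.array([0.1, 0.3, 0.6]))          # class probabilities 0.1, 0.3, 0.6
print(gumbel_softmax_sample(logits, temperature=0.1, rng=rng))   # close to a one-hot sample
print(gumbel_softmax_sample(logits, temperature=5.0, rng=rng))   # much softer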


See also

* Softplus
* Multinomial logistic regression
* Dirichlet distribution – an alternative way to sample categorical distributions
* Partition function
* Exponential tilting – a generalization of Softmax to more general probability distributions

