
In information theory, the cross-entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set if a coding scheme used for the set is optimized for an estimated probability distribution q, rather than the true distribution p.


Definition

The cross-entropy of the distribution q relative to a distribution p over a given set is defined as follows:

:H(p, q) = -\operatorname{E}_p[\log q],

where \operatorname{E}_p[\cdot] is the expected value operator with respect to the distribution p.

The definition may be formulated using the Kullback–Leibler divergence D_{\mathrm{KL}}(p \parallel q), the divergence of p from q (also known as the ''relative entropy'' of p with respect to q):

:H(p, q) = H(p) + D_{\mathrm{KL}}(p \parallel q),

where H(p) is the entropy of p.

For discrete probability distributions p and q with the same support \mathcal{X}, this means

:H(p, q) = -\sum_{x \in \mathcal{X}} p(x)\, \log q(x).

The situation for continuous distributions is analogous. We have to assume that p and q are absolutely continuous with respect to some reference measure r (usually r is a Lebesgue measure on a Borel σ-algebra). Let P and Q be probability density functions of p and q with respect to r. Then

:-\int_{\mathcal{X}} P(x)\, \log Q(x)\, dr(x) = \operatorname{E}_p[-\log Q],

and therefore

:H(p, q) = -\int_{\mathcal{X}} P(x)\, \log Q(x)\, dr(x).

NB: The notation H(p, q) is also used for a different concept, the joint entropy of p and q.
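For illustration (not part of the original article), a minimal Python sketch of the discrete formula above, using made-up example distributions:

import math

def cross_entropy(p, q):
    """Cross-entropy in bits of q relative to p (dicts over the same support)."""
    return -sum(p[x] * math.log2(q[x]) for x in p if p[x] > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}   # "true" distribution (hypothetical)
q = {"a": 0.25, "b": 0.5, "c": 0.25}   # model / coding distribution (hypothetical)

print(cross_entropy(p, p))  # entropy H(p) = 1.5 bits
print(cross_entropy(p, q))  # H(p, q) = 1.75 bits >= H(p)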


Motivation

In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to identify one value x_i out of a set of possibilities \{x_1, \ldots, x_n\} can be seen as representing an implicit probability distribution q(x_i) = \left(\tfrac{1}{2}\right)^{\ell_i} over \{x_1, \ldots, x_n\}, where \ell_i is the length of the code for x_i in bits. Therefore, cross-entropy can be interpreted as the expected message-length per datum when a wrong distribution q is assumed while the data actually follows a distribution p. That is why the expectation is taken over the true probability distribution p and not q. Indeed the expected message-length under the true distribution p is

:\operatorname{E}_p[\ell] = -\operatorname{E}_p\left[\frac{\ln q(x)}{\ln 2}\right] = -\operatorname{E}_p\left[\log_2 q(x)\right] = -\sum_{x_i} p(x_i)\, \log_2 q(x_i) = -\sum_x p(x)\, \log_2 q(x) = H(p, q).
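As a small numerical illustration (the code lengths and distributions below are made up, not from the article), the expected length of a prefix code with lengths \ell_i equals the cross-entropy of its implicit distribution q(x_i) = 2^{-\ell_i}:

import math

lengths = {"a": 2, "b": 1, "c": 2}               # hypothetical prefix-code lengths in bits
q = {x: 2.0 ** -l for x, l in lengths.items()}   # implicit distribution; sums to 1 here
p = {"a": 0.5, "b": 0.25, "c": 0.25}             # true distribution of the data

expected_length = sum(p[x] * lengths[x] for x in p)
cross_entropy   = -sum(p[x] * math.log2(q[x]) for x in p)
print(expected_length, cross_entropy)            # both equal 1.75 bits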


Estimation

There are many situations where cross-entropy needs to be measured but the distribution p is unknown. An example is language modeling, where a model is created based on a training set T, and then its cross-entropy is measured on a test set to assess how accurate the model is in predicting the test data. In this example, p is the true distribution of words in any corpus, and q is the distribution of words as predicted by the model. Since the true distribution is unknown, cross-entropy cannot be directly calculated. In these cases, an estimate of cross-entropy is calculated using the following formula:

:H(T, q) = -\sum_{i=1}^N \frac{1}{N} \log_2 q(x_i),

where N is the size of the test set, and q(x) is the probability of event x estimated from the training set. In other words, q(x_i) is the probability estimate of the model that the i-th word of the text is x_i. The sum is averaged over the N words of the test set. This is a Monte Carlo estimate of the true cross-entropy, where the test set is treated as samples from p(x).
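A minimal Python sketch of this Monte Carlo estimate (the model and test set below are invented for illustration):

import math

def estimated_cross_entropy(test_tokens, q):
    """Average negative log2-probability the model q assigns to the test tokens."""
    n = len(test_tokens)
    return -sum(math.log2(q[x]) for x in test_tokens) / n

q = {"the": 0.5, "cat": 0.25, "sat": 0.25}      # made-up unigram model
test = ["the", "cat", "the", "sat"]             # made-up test set
print(estimated_cross_entropy(test, q))         # 1.5 bits per word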


Relation to maximum likelihood

In classification problems we want to estimate the probability of different outcomes. Let the estimated probability of outcome i be q_{\theta}(X=i) with to-be-optimized parameters \theta, and let the frequency (empirical probability) of outcome i in the training set be p(X=i). Given N conditionally independent samples in the training set, the likelihood of the parameters \theta of the model q_{\theta}(X=x) on the training set is

:\mathcal{L}(\theta) = \prod_i (\text{estimated probability of } i)^{\text{number of occurrences of } i} = \prod_i q_{\theta}(X=i)^{N p(X=i)},

so the log-likelihood, divided by N, is

:\frac{1}{N} \log(\mathcal{L}(\theta)) = \frac{1}{N} \log \prod_i q_{\theta}(X=i)^{N p(X=i)} = \sum_i p(X=i) \log q_{\theta}(X=i) = -H(p, q),

so that maximizing the likelihood with respect to the parameters \theta is the same as minimizing the cross-entropy.
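The equivalence can be checked numerically; the following sketch (with invented data and an invented model q_theta) shows that the average negative log-likelihood equals the cross-entropy between the empirical distribution of the training set and the model:

import math
from collections import Counter

samples = ["a", "a", "b", "c", "a", "b"]        # hypothetical training set
q = {"a": 0.6, "b": 0.3, "c": 0.1}              # hypothetical model q_theta

n = len(samples)
avg_nll = -sum(math.log(q[x]) for x in samples) / n

counts = Counter(samples)
p = {x: counts[x] / n for x in counts}          # empirical distribution of the samples
cross_entropy = -sum(p[x] * math.log(q[x]) for x in p)

print(avg_nll, cross_entropy)                   # identical up to floating-point rounding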


Cross-entropy minimization

Cross-entropy minimization is frequently used in optimization and rare-event probability estimation. When comparing a distribution q against a fixed reference distribution p, cross-entropy and KL divergence are identical up to an additive constant (since p is fixed):

:D_{\mathrm{KL}}(p \parallel q) = H(p, q) - H(p).

According to Gibbs' inequality, both take on their minimal values when p = q, which is 0 for KL divergence and H(p) for cross-entropy. In the engineering literature, the principle of minimizing KL divergence (Kullback's "Principle of Minimum Discrimination Information") is often called the Principle of Minimum Cross-Entropy (MCE), or Minxent.

However, as discussed in the article ''Kullback–Leibler divergence'', sometimes the distribution q is the fixed prior reference distribution, and the distribution p is optimized to be as close to q as possible, subject to some constraint. In this case the two minimizations are ''not'' equivalent. This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross-entropy to be D_{\mathrm{KL}}(p \parallel q), rather than H(p, q).
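A short numerical check of the decomposition and of Gibbs' inequality (illustrative values only, not from the article):

import math

def H(p, q):
    return -sum(p[x] * math.log2(q[x]) for x in p if p[x] > 0)

def D_kl(p, q):
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.25, "b": 0.5, "c": 0.25}

print(H(p, q), H(p, p) + D_kl(p, q))   # equal: 1.75 = 1.5 + 0.25
print(H(p, p) <= H(p, q))              # True: the minimum over q is at q = p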


Cross-entropy loss function and logistic regression

Cross-entropy can be used to define a loss function in machine learning and optimization. The true probability p_i is the true label, and the given distribution q_i is the predicted value of the current model. This is also known as the log loss (or logarithmic loss or logistic loss); the terms "log loss" and "cross-entropy loss" are used interchangeably.

More specifically, consider a binary regression model which can be used to classify observations into two possible classes (often simply labelled 0 and 1). The output of the model for a given observation, given a vector of input features x, can be interpreted as a probability, which serves as the basis for classifying the observation. In logistic regression, the probability is modeled using the logistic function g(z) = 1/(1+e^{-z}), where z is some function of the input vector x, commonly just a linear function. The probability of the output y=1 is given by

:q_{y=1} = \hat{y} \equiv g(\mathbf{w}\cdot\mathbf{x}) = \frac{1}{1+e^{-\mathbf{w}\cdot\mathbf{x}}},

where the vector of weights \mathbf{w} is optimized through some appropriate algorithm such as gradient descent. Similarly, the complementary probability of finding the output y=0 is simply given by

:q_{y=0} = 1-\hat{y}.

Having set up our notation, p\in\{y, 1-y\} and q\in\{\hat{y}, 1-\hat{y}\}, we can use cross-entropy to get a measure of dissimilarity between p and q:

:H(p, q)\ =\ -\sum_i p_i\log q_i\ =\ -y\log\hat{y} - (1-y)\log(1-\hat{y}).

Logistic regression typically optimizes the log loss for all the observations on which it is trained, which is the same as optimizing the average cross-entropy in the sample. For example, suppose we have N samples with each sample indexed by n=1,\dots,N. The ''average'' of the loss function is then given by

:J(\mathbf{w})\ =\ \frac{1}{N}\sum_{n=1}^N H(p_n, q_n)\ =\ -\frac{1}{N}\sum_{n=1}^N\ \left[y_n \log \hat{y}_n + (1 - y_n) \log (1 - \hat{y}_n)\right],

where \hat{y}_n \equiv g(\mathbf{w}\cdot\mathbf{x}_n) = 1/(1+e^{-\mathbf{w}\cdot\mathbf{x}_n}), with g(z) the logistic function as before.

The logistic loss is sometimes called cross-entropy loss. It is also known as log loss (in this case, the binary label is often denoted by \{-1, +1\}).

Remark: The gradient of the cross-entropy loss for logistic regression is the same as the gradient of the squared error loss for linear regression. That is, define the design matrix

:X = \begin{pmatrix} 1 & x_{11} & \dots & x_{1p}\\ 1 & x_{21} & \cdots & x_{2p}\\ \vdots & \vdots & & \vdots \\ 1 & x_{N1} & \cdots & x_{Np} \end{pmatrix} \in \mathbb{R}^{N\times(p+1)},

:\hat{y}^i = \hat{f}(x_{i1},\dots,x_{ip}) = \frac{1}{1+\exp(-\beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip})},

:L(\overrightarrow{\beta}) = -\sum_{i=1}^N \left[y^i\log \hat{y}^i + (1-y^i)\log(1-\hat{y}^i)\right].

Then we have the result

:\frac{\partial}{\partial\overrightarrow{\beta}} L(\overrightarrow{\beta}) = X^T(\hat{Y}-Y).

The proof is as follows. For any \hat{y}^i, we have

:\frac{\partial}{\partial\beta_0}\ln \hat{y}^i = 1-\hat{y}^i,

:\frac{\partial}{\partial\beta_0}\ln\left(1-\hat{y}^i\right) = -\hat{y}^i,

:\frac{\partial}{\partial\beta_0}L(\overrightarrow{\beta}) = -\sum_{i=1}^{N}\left[y^i\left(1-\hat{y}^i\right) - (1-y^i)\,\hat{y}^i\right] = -\sum_{i=1}^{N}\left[y^i-\hat{y}^i\right] = \sum_{i=1}^{N}\left(\hat{y}^i-y^i\right),

and similarly

:\frac{\partial}{\partial\beta_1}\ln \hat{y}^i = x_{i1}\left(1-\hat{y}^i\right),

:\frac{\partial}{\partial\beta_1}\ln\left(1-\hat{y}^i\right) = -x_{i1}\,\hat{y}^i,

:\frac{\partial}{\partial\beta_1}L(\overrightarrow{\beta}) = -\sum_{i=1}^N x_{i1}\left(y^i-\hat{y}^i\right) = \sum_{i=1}^N x_{i1}\left(\hat{y}^i-y^i\right).

In a similar way, we eventually obtain the desired result.
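A small NumPy sketch of the average logistic loss and the gradient result above (the data, seed, and dimensions are invented for illustration); the closed form X^T(\hat{Y}-Y) is checked against finite differences:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(beta, X, y):
    """L(beta) = -sum_i [ y_i log yhat_i + (1 - y_i) log(1 - yhat_i) ]."""
    y_hat = sigmoid(X @ beta)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def grad(beta, X, y):
    """Closed form from the text: X^T (y_hat - y)."""
    return X.T @ (sigmoid(X @ beta) - y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])  # rows (1, x_i1, x_i2)
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])                     # made-up binary labels
beta = rng.normal(size=3)

# Finite-difference check of the gradient formula.
eps = 1e-6
numeric = np.array([(loss(beta + eps * e, X, y) - loss(beta - eps * e, X, y)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(numeric, grad(beta, X, y), atol=1e-5))    # True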


See also

* Cross-entropy method
* Logistic regression
* Conditional entropy
* Maximum likelihood estimation
* Mutual information

