In information theory, perplexity is a measure of uncertainty in the value of a sample from a discrete probability distribution. The larger the perplexity, the less likely it is that an observer can guess the value which will be drawn from the distribution. Perplexity was originally introduced in 1977 in the context of speech recognition by Frederick Jelinek, Robert Leroy Mercer, Lalit R. Bahl, and James K. Baker.
Perplexity of a probability distribution
The perplexity ''PP'' of a discrete probability distribution ''p'' is a concept widely used in information theory, machine learning, and statistical modeling. It is defined as
:<math>\mathit{PP}(p) := b^{-\sum_x p(x)\log_b p(x)} = \prod_x p(x)^{-p(x)}</math>
where ''x'' ranges over the events, where <math>0 \log_b 0</math> is defined to be <math>0</math>, and where the value of the base <math>b</math> does not affect the result; <math>b</math> can be chosen to be 2, 10, <math>e</math>, or any other positive value other than <math>1</math>. In some contexts, this measure is also referred to as the ''(order-1 true) diversity''.
The logarithm <math>\log_b \mathit{PP}(p)</math> is the entropy of the distribution; it is expressed in bits if the base of the logarithm is 2, and it is expressed in nats if the natural logarithm is used.
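For illustration, the definition can be evaluated directly from a probability vector. The following Python sketch (an illustrative helper, not a standard library function) computes perplexity as the base raised to the entropy, using the convention <math>0 \log 0 = 0</math>:
<syntaxhighlight lang="python">
import math

def perplexity(probs, base=2):
    """Perplexity of a discrete distribution given as a list of probabilities.

    Zero-probability outcomes contribute nothing, matching the 0*log(0) = 0 convention.
    """
    entropy = -sum(p * math.log(p, base) for p in probs if p > 0)
    return base ** entropy

# A fair six-sided die: the perplexity is 6 regardless of the base used.
print(perplexity([1/6] * 6))           # ~6.0
print(perplexity([1/6] * 6, base=10))  # ~6.0 as well
</syntaxhighlight>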
Perplexity of a random variable ''X'' may be defined as the perplexity of the distribution over its possible values ''x''. It can be thought of as a measure of uncertainty or "surprise" related to the outcomes.
For a probability distribution ''p'' where exactly ''k'' outcomes each have a probability of ''1/k'' and all other outcomes have a probability of zero, the perplexity of this distribution is simply ''k''. This is because the distribution models a fair ''k''-sided die, with each of the ''k'' outcomes being equally likely. In this context, a perplexity of ''k'' indicates that there is as much uncertainty as there would be when rolling a fair ''k''-sided die. Even if a random variable has more than ''k'' possible outcomes, the perplexity will still be ''k'' if the distribution is uniform over ''k'' outcomes and zero for the rest. Thus, a random variable with a perplexity of ''k'' can be described as being "''k''-ways perplexed," meaning it has the same level of uncertainty as a fair ''k''-sided die.
Perplexity is sometimes used as a measure of the difficulty of a prediction problem. It is, however, generally not a straightforward representation of the relevant probability. For example, if you have two choices, one with probability 0.9, your chances of a correct guess using the optimal strategy are 90 percent. Yet, the perplexity is <math>2^{-0.9\log_2 0.9 - 0.1\log_2 0.1} \approx 1.38</math>. The inverse of the perplexity, 1/1.38 = 0.72, does not correspond to the 0.9 probability.
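This arithmetic can be checked in a few lines of Python:
<syntaxhighlight lang="python">
import math

# Two outcomes with probabilities 0.9 and 0.1 (entropy in bits, perplexity base 2).
h = -(0.9 * math.log2(0.9) + 0.1 * math.log2(0.1))
pp = 2 ** h
print(round(pp, 2))      # 1.38
print(round(1 / pp, 2))  # 0.72, which does not match the 0.9 chance of a correct guess
</syntaxhighlight>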
The perplexity is the exponentiation of the entropy, a more commonly encountered quantity. Entropy measures the expected or "average" number of bits required to encode the outcome of the random variable using an optimal variable-length code. It can also be regarded as the expected information gain from learning the outcome of the random variable, providing insight into the uncertainty and complexity of the underlying probability distribution.
Perplexity of a probability model
A model of an unknown probability distribution ''p'' may be proposed based on a training sample that was drawn from ''p''. Given a proposed probability model ''q'', one may evaluate ''q'' by asking how well it predicts a separate test sample ''x''<sub>1</sub>, ''x''<sub>2</sub>, ..., ''x''<sub>''N''</sub> also drawn from ''p''. The perplexity of the model ''q'' is defined as
:<math>\mathit{PP}(q) := b^{-\frac{1}{N}\sum_{i=1}^N \log_b q(x_i)} = \left(\prod_{i=1}^N q(x_i)\right)^{-1/N}</math>
where <math>b</math> is customarily 2. Better models ''q'' of the unknown distribution ''p'' will tend to assign higher probabilities ''q''(''x''<sub>''i''</sub>) to the test events. Thus, they have lower perplexity because they are less surprised by the test sample. This is equivalent to saying that better models have higher likelihoods for the test data, which leads to a lower perplexity value.
The exponent above may be regarded as the average number of bits needed to represent a test event ''x''<sub>''i''</sub> if one uses an optimal code based on ''q''. Low-perplexity models do a better job of compressing the test sample, requiring few bits per test element on average because ''q''(''x''<sub>''i''</sub>) tends to be high.
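As an illustrative sketch (assuming the model ''q'' is available as a function returning the probability it assigns to each test event; the example model and test sample here are invented), model perplexity can be computed as follows:
<syntaxhighlight lang="python">
import math

def model_perplexity(q, test_sample, base=2):
    """Perplexity of a probability model q on a held-out test sample.

    q: a callable mapping a test event to the probability the model assigns to it.
    The exponent is the average number of bits per event (for base=2) under an
    optimal code derived from q.
    """
    n = len(test_sample)
    avg_bits = -sum(math.log(q(x), base) for x in test_sample) / n
    return base ** avg_bits

# Hypothetical model that assigns probability 0.8 to "a" and 0.2 to "b".
q = {"a": 0.8, "b": 0.2}.get
test = ["a", "a", "b", "a"]
print(model_perplexity(q, test))  # about 1.77
</syntaxhighlight>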
The exponent <math>-\frac{1}{N}\sum_{i=1}^N \log_b q(x_i)</math> may also be interpreted as a cross-entropy:
:<math>H(\tilde{p}, q) = -\sum_x \tilde{p}(x) \log_b q(x)</math>
where <math>\tilde{p}</math> denotes the empirical distribution of the test sample (i.e., <math>\tilde{p}(x) = n/N</math> if ''x'' appeared ''n'' times in the test sample of size ''N'').
By the definition of KL divergence, it is also equal to <math>H(\tilde{p}) + D_{\mathrm{KL}}(\tilde{p} \parallel q)</math>, which is <math>\geq H(\tilde{p})</math>. Consequently, the perplexity is minimized when <math>q = \tilde{p}</math>.
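A small numerical sketch (with arbitrary example distributions) makes the decomposition concrete:
<syntaxhighlight lang="python">
import math

def cross_entropy(p, q):
    # H(p, q) = -sum_x p(x) * log2 q(x)
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return cross_entropy(p, p)

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_emp = [0.5, 0.25, 0.25]  # empirical distribution of a test sample
q_mod = [0.4, 0.4, 0.2]    # model distribution

# Cross-entropy equals entropy plus KL divergence, so the perplexity
# 2 ** cross_entropy(p_emp, q_mod) is minimized when q_mod equals p_emp.
print(cross_entropy(p_emp, q_mod))                   # ~1.572
print(entropy(p_emp) + kl_divergence(p_emp, q_mod))  # same value
</syntaxhighlight>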
Perplexity per token
In natural language processing (NLP), a corpus is a structured collection of texts or documents, and a language model is a probability distribution over entire texts or documents. Consequently, in NLP, the more commonly used measure is perplexity per token (word or, more frequently, sub-word), defined as
:<math>\left(\prod_{i=1}^m q(x_i)\right)^{-1/N} = b^{-\frac{1}{N}\sum_{i=1}^m \log_b q(x_i)}</math>
where <math>x_1, \ldots, x_m</math> are the documents in the corpus and <math>N</math> is the number of ''tokens'' in the corpus. This normalizes the perplexity by the length of the text, allowing for more meaningful comparisons between different texts or models.
Suppose the average text ''x''<sub>''i''</sub> in the corpus has a probability of <math>2^{-190}</math> according to the language model. This would give a model perplexity of <math>2^{190}</math> per sentence. However, in NLP, it is more common to normalize by the length of a text. Thus, if the test sample has a length of 1,000 tokens, and could be coded using 7.95 bits per token, one could report a model perplexity of <math>2^{7.95} = 247</math> ''per token''. In other words, the model is as confused on the test data as if it had to choose uniformly and independently among 247 possibilities for each token.
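In practice, per-token perplexity is computed from the per-token probabilities (or log-probabilities) produced by the language model; the following sketch uses made-up numbers rather than the output of any particular model:
<syntaxhighlight lang="python">
import math

# Hypothetical probabilities assigned by a language model to each token of a test text.
token_probs = [0.2, 0.05, 0.1, 0.5, 0.01, 0.3]

n_tokens = len(token_probs)
bits_per_token = -sum(math.log2(p) for p in token_probs) / n_tokens
perplexity_per_token = 2 ** bits_per_token

print(f"{bits_per_token:.2f} bits per token")
print(f"perplexity per token: {perplexity_per_token:.1f}")
</syntaxhighlight>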
There are two standard evaluation metrics for language models: perplexity and word error rate (WER). The simpler of these measures, WER, is the ratio of erroneously recognized words ''E'' (deletions, insertions, and substitutions) to the total number of words ''N'' in a speech recognition task, expressed as a percentage:
:<math>\mathit{WER} = \frac{E}{N} \times 100\%</math>
The second metric, perplexity (per token), is an information-theoretic measure that evaluates the similarity of a proposed model ''m'' to the original distribution ''p''. It can be computed as the inverse of the (geometric) average probability of the test set ''T'':
:<math>\mathit{PPL}(p, m) = \left(\prod_{i=1}^N m(w_i)\right)^{-1/N}</math>
where ''N'' is the number of tokens <math>w_1, \ldots, w_N</math> in test set ''T''. This equation can be seen as the exponentiated cross-entropy, where the cross-entropy <math>H(p; m)</math> is approximated as
:<math>H(p; m) \approx -\frac{1}{N} \sum_{i=1}^N \log_2 m(w_i)</math>
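A minimal sketch of the WER computation, counting substitutions, deletions, and insertions through a word-level edit-distance alignment (the reference and hypothesis sentences are invented):
<syntaxhighlight lang="python">
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # 1/6 ≈ 0.17
</syntaxhighlight>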
Recent advances in language modeling
Since 2007, significant advancements in language modeling have emerged, particularly with the advent of deep learning techniques. Perplexity per token, a measure that quantifies the predictive power of a language model, has remained central to evaluating the dominant transformer models, such as Google's BERT, OpenAI's GPT-4, and other large language models (LLMs).
This measure was employed to compare different models on the same dataset and guide the optimization of hyperparameters, although it has been found sensitive to factors such as linguistic features and sentence length.
Despite its pivotal role in language model development, perplexity has shown limitations, particularly as an inadequate predictor of speech recognition performance, overfitting, and generalization, raising questions about the benefits of blindly optimizing perplexity alone.
Brown Corpus
The lowest perplexity that had been published on the Brown Corpus (1 million words of American English of varying topics and genres) as of 1992 is indeed about 247 per word/token, corresponding to a cross-entropy of <math>\log_2 247 = 7.95</math> bits per word, or 1.75 bits per letter, using a trigram model. While this figure represented the state of the art (SOTA) at the time, advancements in techniques such as deep learning have led to significant improvements in perplexity on other benchmarks, such as the One Billion Word Benchmark.
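The conversion between these reported figures is a matter of arithmetic; for example (the characters-per-word ratio below is inferred from the quoted numbers, not stated in the original sources):
<syntaxhighlight lang="python">
import math

perplexity_per_word = 247
bits_per_word = math.log2(perplexity_per_word)  # ~7.95
# The ratio of the two quoted figures implies roughly 4.5 characters per word
# (letters plus the following space).
chars_per_word = bits_per_word / 1.75           # ~4.5
print(round(bits_per_word, 2), round(chars_per_word, 2))
</syntaxhighlight>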
In the context of the Brown Corpus, simply guessing that the next word is "the" will achieve an accuracy of 7 percent, contrasting with the 1/247 = 0.4 percent that might be expected from a naive use of perplexity. This difference underscores the importance of the statistical model used and the nuanced nature of perplexity as a measure of predictiveness.[Wilcox, Ethan Gotlieb, et al. "On the predictive power of neural language models for human real-time comprehension behavior." arXiv preprint arXiv:2006.01912 (2020).] The guess is based on unigram statistics, not on the trigram statistics that yielded the perplexity of 247; utilizing trigram statistics would further refine the prediction.
See also
* Cross-entropy
* Statistical model validation