In probability theory and statistics, the Zipf–Mandelbrot law is a discrete probability distribution. Also known as the Pareto–Zipf law, it is a power-law distribution on ranked data, named after the linguist George Kingsley Zipf, who suggested a simpler distribution called Zipf's law, and the mathematician Benoit Mandelbrot, who subsequently generalized it.
The probability mass function is given by
:<math>f(k;N,q,s) = \frac{1/(k+q)^s}{H_{N,q,s}}</math>
where <math>H_{N,q,s}</math> is given by
:<math>H_{N,q,s} = \sum_{i=1}^N \frac{1}{(i+q)^s}</math>
which may be thought of as a generalization of a harmonic number. In the formula, ''k'' is the rank of the data, and ''q'' and ''s'' are parameters of the distribution. In the limit as ''N'' approaches infinity, the normalizing sum becomes the Hurwitz zeta function <math>\zeta(s, q+1)</math>, using the convention <math>\zeta(s,a) = \sum_{n=0}^\infty (n+a)^{-s}</math> and requiring ''s'' > 1. For finite ''N'' and ''q'' = 0 the Zipf–Mandelbrot law becomes Zipf's law. For infinite ''N'' and ''q'' = 0 it becomes a zeta distribution.
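The definition can be checked numerically. The following is a minimal Python sketch, assuming the usual form of the law in which rank ''k'' has probability (''k'' + ''q'')<sup>−''s''</sup> divided by the sum of (''i'' + ''q'')<sup>−''s''</sup> over ''i'' = 1, ..., ''N''. The function names and the parameter values are illustrative choices, not taken from any reference implementation or dataset.

```python
def generalized_harmonic(n, q, s):
    """Normalizing constant: sum of 1 / (i + q)**s for i = 1..n."""
    return sum(1.0 / (i + q) ** s for i in range(1, n + 1))


def zipf_mandelbrot_pmf(k, n, q, s):
    """Probability of rank k under the Zipf-Mandelbrot law with N = n."""
    if not 1 <= k <= n:
        return 0.0  # ranks outside 1..n have zero probability
    return (1.0 / (k + q) ** s) / generalized_harmonic(n, q, s)


# Hypothetical parameters chosen only for illustration.
N, Q, S = 10, 2.7, 1.2
probs = [zipf_mandelbrot_pmf(k, N, Q, S) for k in range(1, N + 1)]
total = sum(probs)  # equals 1 up to floating-point rounding

# With q = 0 the law reduces to Zipf's law; for N = 3 and s = 1 the
# first rank has probability 1 / (1 + 1/2 + 1/3) = 6/11.
zipf_first = zipf_mandelbrot_pmf(1, 3, 0.0, 1.0)
```

Recomputing the normalizing constant on every call keeps the sketch short; code that evaluates many ranks would typically compute it once and reuse it.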
Applications
The distribution of words ranked by their frequency in a random text corpus is approximated by a power-law distribution, known as Zipf's law.
If one plots the frequency rank of words contained in a moderately sized corpus of text data versus the number of occurrences or actual frequencies, one obtains a power-law distribution, with exponent close to one (but see Powers, 1998 and Gelbukh & Sidorov, 2001). Zipf's law implicitly assumes a fixed vocabulary size, but the harmonic series with ''s'' = 1 does not converge, while the Zipf–Mandelbrot generalization with ''s'' > 1 does. Furthermore, there is evidence that the closed class of functional words that define a language obeys a Zipf–Mandelbrot distribution with different parameters from the open classes of contentive words that vary by topic, field and register.
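The convergence remark can be illustrated with partial sums of the normalizing series. The sketch below is written under stated assumptions: ''q'' = 0, and ''s'' = 2 as an arbitrary example of an exponent greater than one. The helper name and the vocabulary sizes are hypothetical.

```python
import math


def partial_sum(n, s, q=0.0):
    """Sum of 1 / (i + q)**s for i = 1..n."""
    return sum(1.0 / (i + q) ** s for i in range(1, n + 1))


# Hypothetical vocabulary sizes, growing by factors of ten.
sizes = [10**2, 10**3, 10**4, 10**5]

# s = 1: the harmonic series, whose partial sums grow roughly like ln(n).
harmonic = [partial_sum(n, 1.0) for n in sizes]

# s = 2: the partial sums level off near pi**2 / 6.
convergent = [partial_sum(n, 2.0) for n in sizes]
limit = math.pi ** 2 / 6

for n, h, c in zip(sizes, harmonic, convergent):
    print(f"N={n:>6}  s=1: {h:.4f}  s=2: {c:.6f}")
```

With ''s'' = 1 the normalizing constant keeps growing as the vocabulary grows, so no limiting distribution exists over an unbounded vocabulary, whereas with ''s'' > 1 the constant settles to a finite value.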
In ecological field studies, the relative abundance distribution (i.e. the graph of the number of species observed as a function of their abundance) is often found to conform to a Zipf–Mandelbrot law.
Within music, many metrics used to measure "pleasing" music conform to Zipf–Mandelbrot distributions.
Notes
References
* Reprinted as
**
*
*
* Van Droogenbroeck F. J. (2019). "An essential rephrasing of the Zipf–Mandelbrot law to solve authorship attribution applications by Gaussian statistics".
External links
* Z. K. Silagadze: Citations and the Zipf–Mandelbrot's law
* https://github.com/gkohri/discreteRNG C++ library for generating random Zipf–Mandelbrot deviates.