In probability theory and statistics, a categorical distribution (also called a generalized Bernoulli distribution or multinoulli distribution) is a discrete probability distribution that describes the possible results of a random variable that can take on one of ''K'' possible categories, with the probability of each category separately specified. There is no innate underlying ordering of these outcomes, but numerical labels are often attached for convenience in describing the distribution (e.g. 1 to ''K''). The ''K''-dimensional categorical distribution is the most general distribution over a ''K''-way event; any other discrete distribution over a size-''K'' sample space is a special case. The parameters specifying the probabilities of each possible outcome are constrained only by the fact that each must be in the range 0 to 1, and all must sum to 1.

The categorical distribution is the generalization of the Bernoulli distribution for a categorical random variable, i.e. for a discrete variable with more than two possible outcomes, such as the roll of a die. On the other hand, the categorical distribution is a special case of the multinomial distribution, in that it gives the probabilities of potential outcomes of a single drawing rather than multiple drawings.


Terminology

Occasionally, the categorical distribution is termed the "discrete distribution". However, this properly refers not to one particular family of distributions but to a general class of distributions.

In some fields, such as machine learning and natural language processing, the categorical and multinomial distributions are conflated, and it is common to speak of a "multinomial distribution" when a "categorical distribution" would be more precise.Minka, T. (2003). ''Bayesian Inference, Entropy, and the Multinomial Distribution''. Technical report, Microsoft Research. This imprecise usage stems from the fact that it is sometimes convenient to express the outcome of a categorical distribution as a "1-of-''K''" vector (a vector with one element containing a 1 and all other elements containing a 0) rather than as an integer in the range 1 to ''K''; in this form, a categorical distribution is equivalent to a multinomial distribution for a single observation (see below).

However, conflating the categorical and multinomial distributions can lead to problems. For example, in a Dirichlet-multinomial distribution, which arises commonly in natural language processing models (although not usually with this name) as a result of collapsed Gibbs sampling where Dirichlet distributions are collapsed out of a hierarchical Bayesian model, it is very important to distinguish categorical from multinomial. The joint distribution of the same variables with the same Dirichlet-multinomial distribution has two different forms depending on whether it is characterized as a distribution whose domain is over individual categorical nodes or over multinomial-style counts of nodes in each particular category (similar to the distinction between a set of Bernoulli-distributed nodes and a single binomial-distributed node). Both forms have very similar-looking probability mass functions (PMFs), which both make reference to multinomial-style counts of nodes in a category. However, the multinomial-style PMF has an extra factor, a multinomial coefficient, that is a constant equal to 1 in the categorical-style PMF. Confusing the two can easily lead to incorrect results in settings where this extra factor is not constant with respect to the distributions of interest. The factor is frequently constant in the complete conditionals used in Gibbs sampling and the optimal distributions in variational methods.


Formulating distributions

A categorical distribution is a discrete probability distribution whose sample space is the set of ''k'' individually identified items. It is the generalization of the Bernoulli distribution for a categorical random variable.

In one formulation of the distribution, the sample space is taken to be a finite sequence of integers. The exact integers used as labels are unimportant; they might be {0, 1, ..., ''k'' − 1} or {1, 2, ..., ''k''} or any other arbitrary set of values. In the following descriptions, we use {1, 2, ..., ''k''} for convenience, although this disagrees with the convention for the Bernoulli distribution, which uses {0, 1}. In this case, the probability mass function ''f'' is:

: f(x=i\mid \boldsymbol{p}) = p_i ,

where \boldsymbol{p} = (p_1,\ldots,p_k), p_i represents the probability of seeing element ''i'' and \textstyle\sum_{i=1}^k p_i = 1.

Another formulation that appears more complex but facilitates mathematical manipulations is as follows, using the Iverson bracket:

: f(x\mid \boldsymbol{p}) = \prod_{i=1}^k p_i^{[x=i]} ,

where [x=i] evaluates to 1 if x=i, 0 otherwise. There are various advantages of this formulation, e.g.:
* It is easier to write out the likelihood function of a set of independent identically distributed categorical variables.
* It connects the categorical distribution with the related multinomial distribution.
* It shows why the Dirichlet distribution is the conjugate prior of the categorical distribution, and allows the posterior distribution of the parameters to be calculated.

Yet another formulation makes explicit the connection between the categorical and multinomial distributions by treating the categorical distribution as a special case of the multinomial distribution in which the parameter ''n'' of the multinomial distribution (the number of sampled items) is fixed at 1. In this formulation, the sample space can be considered to be the set of 1-of-''K'' encoded random vectors \mathbf{x} of dimension ''k'' having the property that exactly one element has the value 1 and the others have the value 0. The particular element having the value 1 indicates which category has been chosen. The probability mass function ''f'' in this formulation is:

: f(\mathbf{x}\mid \boldsymbol{p}) = \prod_{i=1}^k p_i^{x_i} ,

where p_i represents the probability of seeing element ''i'' and \textstyle\sum_{i=1}^k p_i = 1. This is the formulation adopted by Bishop.Bishop, C. (2006). ''Pattern Recognition and Machine Learning'', Springer.
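
As an illustration of the equivalence of these formulations, the following minimal sketch (Python with NumPy; the function names are illustrative, not from any source) evaluates the same probability mass both from an integer label and from its 1-of-''K'' encoding.

import numpy as np

p = np.array([0.2, 0.5, 0.3])           # parameter vector; entries are non-negative and sum to 1

def pmf_integer(i, p):
    # First formulation: f(x = i | p) = p_i, with categories labelled 1..k
    return p[i - 1]

def pmf_one_hot(x, p):
    # One-hot formulation: f(x | p) = prod_i p_i^{x_i}, where x has a single 1
    return np.prod(p ** x)

i = 2                                    # observe category 2
x = np.zeros(len(p)); x[i - 1] = 1       # its 1-of-K encoding (0, 1, 0)
assert np.isclose(pmf_integer(i, p), pmf_one_hot(x, p))   # both give 0.5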


Properties

* The distribution is completely given by the probabilities associated with each number ''i'': p_i = P(X = i), ''i'' = 1,...,''k'', where \textstyle\sum_{i=1}^k p_i = 1. The possible sets of probabilities are exactly those in the standard (k-1)-dimensional simplex; for ''k'' = 2 this reduces to the possible probabilities of the Bernoulli distribution being the 1-simplex, p_1+p_2=1, 0 \leq p_1,p_2 \leq 1 .
* The distribution is a special case of a "multivariate Bernoulli distribution" in which exactly one of the ''k'' 0-1 variables takes the value one.
* \operatorname{E}\left[\mathbf{X}\right] = \boldsymbol{p}, where \mathbf{X} is the 1-of-''K'' encoded random vector.
* Let X be the realisation from a categorical distribution. Define the random vector ''Y'' as composed of the elements:
:: Y_i=I(X=i),
: where ''I'' is the indicator function. Then ''Y'' has a distribution which is a special case of the multinomial distribution with parameter n=1. The sum of n independent and identically distributed such random variables ''Y'' constructed from a categorical distribution with parameter \boldsymbol{p} is multinomially distributed with parameters n and \boldsymbol{p} .
* The conjugate prior distribution of a categorical distribution is a Dirichlet distribution. See the section below for more discussion.
* The sufficient statistic from ''n'' independent observations is the set of counts (or, equivalently, the proportions) of observations in each category, where the total number of trials (=''n'') is fixed.
* The indicator function of an observation having a value ''i'', equivalent to the Iverson bracket function [x=i] or the Kronecker delta function \delta_{xi}, is Bernoulli distributed with parameter p_i .
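
The relationship between 1-of-''K'' encoded categorical draws and the multinomial distribution can be checked numerically. The following sketch (assuming NumPy; an illustrative example, not a definitive procedure) sums ''n'' independent categorical draws and compares the resulting counts with a single multinomial draw using the same parameters.

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])
n = 100_000

# n categorical draws; bincount sums their indicator (1-of-K) vectors
labels = rng.choice(len(p), size=n, p=p)
counts_from_indicators = np.bincount(labels, minlength=len(p))

# a single multinomial draw with the same n and p
counts_multinomial = rng.multinomial(n, p)

# Both count vectors concentrate around n * p; E[X] = p for a single draw
print(counts_from_indicators / n)   # approximately [0.2, 0.5, 0.3]
print(counts_multinomial / n)       # approximately [0.2, 0.5, 0.3]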


Bayesian inference using conjugate prior

In Bayesian statistics, the Dirichlet distribution is the conjugate prior distribution of the categorical distribution (and also of the multinomial distribution). This means that, in a model consisting of a data point having a categorical distribution with unknown parameter vector p, if (in standard Bayesian style) we treat this parameter as a random variable and give it a prior distribution defined using a Dirichlet distribution, then the posterior distribution of the parameter, after incorporating the knowledge gained from the observed data, is also a Dirichlet distribution. Intuitively, in such a case, starting from what is known about the parameter prior to observing the data point, knowledge can then be updated based on the data point, yielding a new distribution of the same form as the old one. As such, knowledge of a parameter can be successively updated by incorporating new observations one at a time, without running into mathematical difficulties.

Formally, this can be expressed as follows. Given a model

: \begin{array}{lclcl} \boldsymbol\alpha &=& (\alpha_1, \ldots, \alpha_K) &=& \text{concentration hyperparameter} \\ \mathbf{p}\mid\boldsymbol\alpha &=& (p_1, \ldots, p_K) &\sim& \operatorname{Dir}(K, \boldsymbol\alpha) \\ \mathbb{X}\mid\mathbf{p} &=& (x_1, \ldots, x_N) &\sim& \operatorname{Cat}(K,\mathbf{p}) \end{array}

then the following holds:

: \begin{array}{lclcl} \mathbf{c} &=& (c_1, \ldots, c_K) &=& \text{number of occurrences of category } i\text{, so that } c_i = \sum_{j=1}^N [x_j=i] \\ \mathbf{p} \mid \mathbb{X},\boldsymbol\alpha &\sim& \operatorname{Dir}(K,\mathbf{c}+\boldsymbol\alpha) &=& \operatorname{Dir}(K,c_1+\alpha_1,\ldots,c_K+\alpha_K) \end{array}

This relationship is used in Bayesian statistics to estimate the underlying parameter p of a categorical distribution given a collection of ''N'' samples. Intuitively, we can view the hyperparameter vector α as pseudocounts, i.e. as representing the number of observations in each category that we have already seen. Then we simply add in the counts for all the new observations (the vector c) in order to derive the posterior distribution.

Further intuition comes from the expected value of the posterior distribution (see the article on the Dirichlet distribution):

: \operatorname{E}[p_i \mid \mathbb{X},\boldsymbol\alpha] = \frac{c_i+\alpha_i}{N+\sum_k \alpha_k}

This says that the expected probability of seeing a category ''i'' among the various discrete distributions generated by the posterior distribution is simply equal to the proportion of occurrences of that category actually seen in the data, including the pseudocounts in the prior distribution. This makes a great deal of intuitive sense: if, for example, there are three possible categories, and category 1 is seen in the observed data 40% of the time, one would expect on average to see category 1 40% of the time in the posterior distribution as well.

(This intuition ignores the effect of the prior distribution. Furthermore, the posterior is a ''distribution over distributions''. The posterior distribution in general describes the parameter in question, and in this case the parameter itself is a discrete probability distribution, i.e. the actual categorical distribution that generated the data. For example, if 3 categories in the ratio 40:5:55 are in the observed data, then ignoring the effect of the prior distribution, the true parameter – i.e. the true, underlying distribution that generated our observed data – would be expected to have the average value of (0.40,0.05,0.55), which is indeed what the posterior reveals. However, the true distribution might actually be (0.35,0.07,0.58) or (0.42,0.04,0.54) or various other nearby possibilities. The amount of uncertainty involved here is specified by the variance of the posterior, which is controlled by the total number of observations – the more data observed, the less uncertainty about the true parameter.)

(Technically, the prior parameter \alpha_i should actually be seen as representing \alpha_i-1 prior observations of category i. Then, the updated posterior parameter c_i+\alpha_i represents c_i+\alpha_i-1 posterior observations. This reflects the fact that a Dirichlet distribution with \boldsymbol\alpha = (1,1,\ldots) has a completely flat shape, essentially a uniform distribution over the simplex of possible values of p. Logically, a flat distribution of this sort represents total ignorance, corresponding to no observations of any sort. However, the mathematical updating of the posterior works fine if we ignore the -1 term and simply think of the α vector as directly representing a set of pseudocounts. Furthermore, doing this avoids the issue of interpreting \alpha_i values less than 1.)
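
The conjugate update amounts to adding the observed counts to the prior concentration parameters. A minimal sketch of this update (Python with NumPy; the variable names are illustrative):

import numpy as np

alpha = np.array([1.0, 1.0, 1.0])        # Dirichlet prior concentration (pseudocounts)
observations = [0, 2, 2, 1, 2, 0, 2]     # N categorical observations, labelled 0..K-1

c = np.bincount(observations, minlength=len(alpha))   # per-category counts
alpha_post = alpha + c                                 # posterior is Dir(alpha + c)

# posterior mean E[p_i | X, alpha] = (c_i + alpha_i) / (N + sum_k alpha_k)
posterior_mean = alpha_post / alpha_post.sum()
print(posterior_mean)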


MAP estimation

The maximum-a-posteriori estimate of the parameter p in the above model is simply the mode of the posterior Dirichlet distribution, i.e.,

: \operatorname{arg\,max}\limits_{\mathbf{p}} p(\mathbf{p} \mid \mathbb{X}) = \frac{\alpha_i + c_i - 1}{\sum_k (\alpha_k + c_k - 1)}, \qquad \forall i \; \alpha_i + c_i > 1

In many practical applications, the only way to guarantee the condition that \forall i \; \alpha_i + c_i > 1 is to set \alpha_i > 1 for all ''i''.
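
A minimal sketch of the MAP computation (Python with NumPy; the example values are illustrative and assume every \alpha_i + c_i exceeds 1):

import numpy as np

alpha_post = np.array([3.0, 2.0, 5.0])           # posterior parameters alpha_i + c_i
p_map = (alpha_post - 1) / (alpha_post - 1).sum()  # mode of Dir(alpha + c)
print(p_map)                                       # components sum to 1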


Marginal likelihood

In the above model, the marginal likelihood of the observations (i.e. the joint distribution of the observations, with the prior parameter marginalized out) is a Dirichlet-multinomial distribution:

: \begin{align} p(\mathbb{X}\mid\boldsymbol{\alpha}) &= \int_{\mathbf{p}}p(\mathbb{X}\mid \mathbf{p})\,p(\mathbf{p}\mid\boldsymbol{\alpha})\,\textrm{d}\mathbf{p} \\ &= \frac{\Gamma\left(\sum_k \alpha_k\right)}{\Gamma\left(N+\sum_k \alpha_k\right)} \prod_{k=1}^K\frac{\Gamma(c_k+\alpha_k)}{\Gamma(\alpha_k)} \end{align}

This distribution plays an important role in hierarchical Bayesian models, because when doing inference over such models using methods such as Gibbs sampling or variational Bayes, Dirichlet prior distributions are often marginalized out. See the article on this distribution for more details.
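
In practice this quantity is usually evaluated in log space via the log-gamma function to avoid overflow. A sketch (Python, assuming SciPy; the function name is illustrative) follows; it computes the marginal likelihood of an ordered sequence of observations, i.e. without the multinomial coefficient.

import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(counts, alpha):
    # log p(X | alpha) for categorical data with a Dirichlet prior
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    N = counts.sum()
    return (gammaln(alpha.sum()) - gammaln(N + alpha.sum())
            + np.sum(gammaln(counts + alpha) - gammaln(alpha)))

print(log_marginal_likelihood([2, 1, 4], [1.0, 1.0, 1.0]))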


Posterior predictive distribution

The posterior predictive distribution of a new observation in the above model is the distribution that a new observation \tilde{x} would take given the set \mathbb{X} of ''N'' categorical observations. As shown in the Dirichlet-multinomial distribution article, it has a very simple form:

: \begin{align} p(\tilde{x}=i\mid\mathbb{X},\boldsymbol{\alpha}) &= \int_{\mathbf{p}}p(\tilde{x}=i\mid\mathbf{p})\,p(\mathbf{p}\mid\mathbb{X},\boldsymbol{\alpha})\,\textrm{d}\mathbf{p} \\ &=\, \frac{c_i+\alpha_i}{N+\sum_k \alpha_k} \\ &=\, \operatorname{E}[p_i \mid \mathbb{X},\boldsymbol\alpha] \\ &\propto\, c_i + \alpha_i. \end{align}

There are various relationships between this formula and the previous ones:
* The posterior predictive probability of seeing a particular category is the same as the relative proportion of previous observations in that category (including the pseudo-observations of the prior). This makes logical sense: intuitively, we would expect to see a particular category according to the frequency already observed of that category.
* The posterior predictive probability is the same as the expected value of the posterior distribution. This is explained more below.
* As a result, this formula can be expressed as simply "the posterior predictive probability of seeing a category is proportional to the total observed count of that category", or as "the ''expected count'' of a category is the same as the total observed count of the category", where "observed count" is taken to include the pseudo-observations of the prior.

The reason for the equivalence between posterior predictive probability and the expected value of the posterior distribution of p is evident with re-examination of the above formula. As explained in the posterior predictive distribution article, the formula for the posterior predictive probability has the form of an expected value taken with respect to the posterior distribution:

: \begin{align} p(\tilde{x}=i\mid\mathbb{X},\boldsymbol{\alpha}) &= \int_{\mathbf{p}}p(\tilde{x}=i\mid\mathbf{p})\,p(\mathbf{p}\mid\mathbb{X},\boldsymbol{\alpha})\,\textrm{d}\mathbf{p} \\ &=\, \operatorname{E}_{\mathbf{p}\mid\mathbb{X},\boldsymbol{\alpha}} \left[p(\tilde{x}=i\mid\mathbf{p})\right] \\ &=\, \operatorname{E}_{\mathbf{p}\mid\mathbb{X},\boldsymbol{\alpha}} \left[p_i\right] \\ &=\, \operatorname{E}[p_i \mid \mathbb{X},\boldsymbol\alpha] \end{align}

The crucial line above is the third. The second follows directly from the definition of expected value. The third line is particular to the categorical distribution, and follows from the fact that, in the categorical distribution specifically, the probability of seeing a particular value ''i'' is directly specified by the associated parameter ''p_i''. The fourth line is simply a rewriting of the third in a different notation, using the notation farther up for an expectation taken with respect to the posterior distribution of the parameters.

Observe data points one by one and each time consider their predictive probability before observing the data point and updating the posterior. For any given data point, the probability of that point assuming a given category depends on the number of data points already in that category. In this scenario, if a category has a high frequency of occurrence, then new data points are more likely to join that category, further enriching the same category. This type of scenario is often termed a preferential attachment (or "rich get richer") model. This models many real-world processes, and in such cases the choices made by the first few data points have an outsize influence on the rest of the data points.


Posterior conditional distribution

In Gibbs sampling, one typically needs to draw from conditional distributions in multi-variable Bayes networks where each variable is conditioned on all the others. In networks that include categorical variables with Dirichlet priors (e.g. mixture models and models including mixture components), the Dirichlet distributions are often "collapsed out" (marginalized out) of the network, which introduces dependencies among the various categorical nodes dependent on a given prior (specifically, their joint distribution is a Dirichlet-multinomial distribution). One of the reasons for doing this is that in such a case, the distribution of one categorical node given the others is exactly the posterior predictive distribution of the remaining nodes. That is, for a set of nodes \mathbb{X}, if the node in question is denoted as x_n and the remainder as \mathbb{X}^{(-n)}, then

: p(x_n=i\mid\mathbb{X}^{(-n)},\boldsymbol{\alpha}) \,=\, \frac{c_i^{(-n)} + \alpha_i}{N - 1 + \sum_k \alpha_k} \,\propto\, c_i^{(-n)} + \alpha_i

where c_i^{(-n)} is the number of nodes having category ''i'' among the nodes other than node ''n''.
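
In a collapsed Gibbs sampler this conditional is typically computed by removing the current node from the counts, forming the unnormalized weights c_i^{(-n)} + \alpha_i, and resampling. A minimal sketch (Python with NumPy; variable names are illustrative, and this simple model has no likelihood term beyond the counts):

import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([1.0, 1.0, 1.0])             # Dirichlet prior (now collapsed out)
z = np.array([0, 2, 2, 1, 2, 0, 2])           # current category assignments of all nodes
counts = np.bincount(z, minlength=len(alpha))

def resample_node(n):
    # Resample node n from its posterior conditional given all other nodes
    counts[z[n]] -= 1                          # remove node n from the counts: c^(-n)
    weights = counts + alpha                   # proportional to c_i^(-n) + alpha_i
    z[n] = rng.choice(len(alpha), p=weights / weights.sum())
    counts[z[n]] += 1                          # add the node back under its new category

resample_node(3)
print(z)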


Sampling

There are a number of methods, but the most common way to sample from a categorical distribution uses a type of inverse transform sampling:

Assume a distribution is expressed as "proportional to" some expression, with unknown normalizing constant. Before taking any samples, one prepares some values as follows:
# Compute the unnormalized value of the distribution for each category.
# Sum them up and divide each value by this sum, in order to normalize them.
# Impose some sort of order on the categories (e.g. by an index that runs from 1 to ''k'', where ''k'' is the number of categories).
# Convert the values to a cumulative distribution function (CDF) by replacing each value with the sum of all of the previous values. This can be done in time ''O(k)''. The resulting value for the first category will be 0.

Then, each time it is necessary to sample a value:
# Pick a uniformly distributed number between 0 and 1.
# Locate the greatest number in the CDF whose value is less than or equal to the number just chosen. This can be done in time ''O(log(k))'', by binary search.
# Return the category corresponding to this CDF value.

If it is necessary to draw many values from the same categorical distribution, the following approach is more efficient. It draws n samples in O(n) time (assuming an O(1) approximation is used to draw values from the binomial distribution).
import numpy as np

def draw_categorical(p, n, rng=np.random.default_rng()):
    # Draw n samples from a categorical distribution with probability vector p.
    z = []                                    # results are accumulated here
    r = 1.0                                   # probability mass not yet accounted for
    remaining = n                             # samples still to be drawn
    for i, p_i in enumerate(p):               # loop over the k categories
        if remaining == 0 or r <= 0:
            break
        v = rng.binomial(remaining, min(p_i / r, 1.0))   # samples falling in category i
        z.extend([i] * v)
        remaining -= v
        r -= p_i                              # remove category i's mass from the remainder
    z = np.array(z, dtype=int)
    rng.shuffle(z)                            # randomly re-order the results
    return z
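
For comparison, the inverse-transform procedure described above can be sketched as follows (Python with NumPy; it uses inclusive cumulative sums rather than the 0-based prefix sums described above, which amounts to the same search, and np.searchsorted performs the binary-search step).

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])       # normalized category probabilities, ordered by index

cdf = np.cumsum(p)                  # inclusive cumulative distribution function

def sample_categorical():
    u = rng.random()                               # uniform draw in [0, 1)
    return np.searchsorted(cdf, u, side='right')   # binary search for the matching category

print([sample_categorical() for _ in range(10)])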


Sampling via the Gumbel distribution

In machine learning it is typical to parametrize the categorical distribution, p_1,\ldots,p_k, via an unconstrained representation in \mathbb{R}^k, whose components are given by:

: \gamma_i = \log p_i + \alpha

where \alpha is any real constant. Given this representation, p_1,\ldots,p_k can be recovered using the softmax function, which can then be sampled using the techniques described above. There is however a more direct sampling method that uses samples from the Gumbel distribution. Let g_1,\ldots,g_k be ''k'' independent draws from the standard Gumbel distribution, then

: c = \operatorname{arg\,max}\limits_i \left( \gamma_i + g_i \right)

will be a sample from the desired categorical distribution. (If u_i is a sample from the standard uniform distribution, then g_i=-\log(-\log u_i) is a sample from the standard Gumbel distribution.)
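
A minimal sketch of this Gumbel-max procedure (Python with NumPy; the constant added to the log-probabilities is arbitrary, as described above):

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])
gamma = np.log(p) + 7.0                  # unconstrained representation; the constant is arbitrary

def gumbel_max_sample():
    u = rng.random(len(gamma))           # standard uniform draws
    g = -np.log(-np.log(u))              # transformed into standard Gumbel draws
    return np.argmax(gamma + g)          # index of the maximum is a categorical sample

samples = [gumbel_max_sample() for _ in range(10_000)]
print(np.bincount(samples) / len(samples))   # approximately [0.2, 0.5, 0.3]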


See also

* Categorical variable


Related distributions

* Dirichlet distribution
* Multinomial distribution
* Bernoulli distribution
* Dirichlet-multinomial distribution

