In probability theory and statistics, the Dirichlet-multinomial distribution is a family of discrete multivariate probability distributions on a finite support of non-negative integers. It is also called the Dirichlet compound multinomial distribution (DCM) or multivariate Pólya distribution (after
George Pólya). It is a
compound probability distribution, where a probability vector p is drawn from a
Dirichlet distribution with parameter vector \boldsymbol\alpha, and an observation is drawn from a
multinomial distribution with probability vector p and number of trials ''n''. The Dirichlet parameter vector captures the prior belief about the situation and can be seen as a pseudocount: observations of each outcome that occur before the actual data are collected. The compounding corresponds to a
Pólya urn scheme. It is frequently encountered in
Bayesian statistics,
machine learning,
empirical Bayes methods and
classical statistics as an
overdispersed multinomial distribution.
It reduces to the
categorical distribution as a special case when ''n'' = 1. It also approximates the
multinomial distribution arbitrarily well for large ''α''. The Dirichlet-multinomial is a multivariate extension of the
beta-binomial distribution, as the multinomial and Dirichlet distributions are multivariate versions of the
binomial and beta distributions, respectively.
Specification
Dirichlet-multinomial as a compound distribution
The Dirichlet distribution is a
conjugate distribution to the multinomial distribution. This fact leads to an analytically tractable
compound distribution.
For a random vector of category counts \mathbf{x} = (x_1, \ldots, x_K), distributed according to a multinomial distribution, the marginal distribution is obtained by integrating on the distribution for p, which can be thought of as a random vector following a Dirichlet distribution:
: \Pr(\mathbf{x}\mid\boldsymbol\alpha) = \int_{\mathbf{p}} \Pr(\mathbf{x}\mid\mathbf{p}) \Pr(\mathbf{p}\mid\boldsymbol\alpha) \, d\mathbf{p}
which results in the following explicit formula:
: \Pr(\mathbf{x}\mid\boldsymbol\alpha) = \frac{n!\,\Gamma(\alpha_0)}{\Gamma(n+\alpha_0)} \prod_{k=1}^{K} \frac{\Gamma(x_k+\alpha_k)}{x_k!\,\Gamma(\alpha_k)}
where \alpha_0 is defined as the sum \alpha_0 = \sum_k \alpha_k. Another form for this same compound distribution, written more compactly in terms of the
beta function, ''B'', is as follows:
: \Pr(\mathbf{x}\mid\boldsymbol\alpha) = \frac{n \, B(\alpha_0, n)}{\prod_{k : x_k > 0} x_k \, B(\alpha_k, x_k)}
The latter form emphasizes the fact that zero-count categories can be ignored in the calculation: a useful fact when the number of categories is very large and the count vectors are
sparse (e.g. word counts in documents).
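For concreteness, the explicit formula can be evaluated numerically in log-space, which avoids overflow in the gamma functions for large counts. The following sketch (the function name is ours, not a standard API) uses only the Python standard library:

```python
from math import lgamma, exp

def dirichlet_multinomial_pmf(x, alpha):
    """Pr(x | alpha) for a Dirichlet-multinomial, via log-gamma:
    n! Gamma(a0)/Gamma(n+a0) * prod_k Gamma(x_k+a_k)/(x_k! Gamma(a_k))."""
    n, a0 = sum(x), sum(alpha)
    log_p = lgamma(n + 1) + lgamma(a0) - lgamma(n + a0)
    for xk, ak in zip(x, alpha):
        log_p += lgamma(xk + ak) - lgamma(ak) - lgamma(xk + 1)
    return exp(log_p)
```

For ''n'' = 1 this reduces to the categorical probabilities \alpha_k / \alpha_0, matching the special case noted above.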
Observe that the pmf is the beta-binomial distribution when ''K'' = 2. It can also be shown that it approaches the multinomial distribution as \alpha_0 approaches infinity. The parameter \alpha_0 governs the degree of overdispersion or burstiness relative to the multinomial. Alternative notations for \alpha_0 found in the literature are ''S'' and ''A''.
Dirichlet-multinomial as an urn model
The Dirichlet-multinomial distribution can also be motivated via an
urn model for positive
integer values of the vector ''α'', known as the
Pólya urn model. Specifically, imagine an urn containing balls of ''K'' colors numbering \alpha_i for the ''i''th color, from which random draws are made. When a ball is randomly drawn and observed, two balls of the same color are returned to the urn. If this is performed ''n'' times, then the probability of observing the random vector \mathbf{x}
of color counts is a Dirichlet-multinomial with parameters ''n'' and ''α''.
If the random draws are with simple replacement (no balls over and above the observed ball are added to the urn), then the distribution follows a multinomial distribution, and if the random draws are made without replacement, the distribution follows a
multivariate hypergeometric distribution.
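The urn scheme is straightforward to simulate; a minimal sketch (the helper name is illustrative):

```python
import random

def polya_urn_counts(alpha, n, rng):
    """Draw n balls from a Polya urn with initial per-color counts alpha.
    Each observed ball is returned along with one extra ball of its color,
    so colors already observed become more likely to be observed again."""
    urn = list(alpha)            # current number of balls of each color
    counts = [0] * len(alpha)    # observed color counts
    for _ in range(n):
        # draw a color with probability proportional to its current count
        i = rng.choices(range(len(urn)), weights=urn)[0]
        counts[i] += 1
        urn[i] += 1              # reinforcement step
    return counts
```

Deleting the reinforcement line `urn[i] += 1` gives ordinary multinomial sampling, matching the simple-replacement remark above.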
Properties
Moments
Once again, let \alpha_0 = \sum_k \alpha_k and let \pi_i = \alpha_i / \alpha_0; then the
expected number of times the outcome ''i'' was observed over ''n'' trials is
: \operatorname{E}(X_i) = n \pi_i = \frac{n \alpha_i}{\alpha_0}
The
covariance matrix
is as follows. Each diagonal entry is the
variance of a beta-binomially distributed random variable, and is therefore
: \operatorname{var}(X_i) = n \pi_i (1 - \pi_i) \left( \frac{n + \alpha_0}{1 + \alpha_0} \right)
The off-diagonal entries are the
covariances:
: \operatorname{cov}(X_i, X_j) = -n \pi_i \pi_j \left( \frac{n + \alpha_0}{1 + \alpha_0} \right)
for ''i'', ''j'' distinct.
All covariances are negative because for fixed ''n'', an increase in one component of a Dirichlet-multinomial vector requires a decrease in another component.
This is a ''K'' × ''K''
positive-semidefinite matrix of
rank ''K'' − 1.
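These moment formulas can be cross-checked by brute-force enumeration of the support for small ''n'' and ''K''; a throwaway verification sketch (function names are ours):

```python
from math import lgamma, exp
from itertools import product

def dm_pmf(x, alpha):
    # Dirichlet-multinomial pmf via log-gamma (explicit formula above)
    n, a0 = sum(x), sum(alpha)
    lp = lgamma(n + 1) + lgamma(a0) - lgamma(n + a0)
    for xk, ak in zip(x, alpha):
        lp += lgamma(xk + ak) - lgamma(ak) - lgamma(xk + 1)
    return exp(lp)

def exact_moments(n, alpha):
    """Mean vector and covariance matrix computed by summing over the
    full support {x : sum(x) = n} rather than from the closed forms."""
    K = len(alpha)
    support = [x for x in product(range(n + 1), repeat=K) if sum(x) == n]
    mean = [sum(dm_pmf(x, alpha) * x[i] for x in support) for i in range(K)]
    cov = [[sum(dm_pmf(x, alpha) * (x[i] - mean[i]) * (x[j] - mean[j])
                for x in support)
            for j in range(K)] for i in range(K)]
    return mean, cov
```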
The entries of the corresponding
correlation matrix
are
: \rho_{ii} = 1
: \rho_{ij} = \frac{\operatorname{cov}(X_i, X_j)}{\sqrt{\operatorname{var}(X_i)\operatorname{var}(X_j)}} = -\sqrt{\frac{\pi_i \pi_j}{(1 - \pi_i)(1 - \pi_j)}} = -\sqrt{\frac{\alpha_i \alpha_j}{(\alpha_0 - \alpha_i)(\alpha_0 - \alpha_j)}}
The sample size drops out of this expression.
Each of the ''K'' components separately has a beta-binomial distribution.
The
support
of the Dirichlet-multinomial distribution is the set
: \left\{ \mathbf{x} \in \mathbb{Z}_{\ge 0}^{K} : x_1 + \cdots + x_K = n \right\}
Its number of elements is
: \binom{n + K - 1}{K - 1}
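For small ''n'' and ''K'' the count \binom{n+K-1}{K-1} is easy to confirm by enumeration (a throwaway sketch; the helper name is made up):

```python
from math import comb
from itertools import product

def dm_support(n, K):
    """All K-tuples of non-negative integers summing to n."""
    return [x for x in product(range(n + 1), repeat=K) if sum(x) == n]
```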
Matrix notation
In matrix notation,
: \operatorname{E}(\mathbf{X}) = n \boldsymbol{\pi}
and
: \operatorname{var}(\mathbf{X}) = n \left\{ \operatorname{diag}(\boldsymbol{\pi}) - \boldsymbol{\pi}\boldsymbol{\pi}^{\rm T} \right\} \left( \frac{n + \alpha_0}{1 + \alpha_0} \right)
with \boldsymbol{\pi}^{\rm T} = the row vector transpose of the column vector \boldsymbol{\pi}. Letting
: \rho = \frac{1}{1 + \alpha_0}
, we can write alternatively
: \operatorname{var}(\mathbf{X}) = n \left\{ \operatorname{diag}(\boldsymbol{\pi}) - \boldsymbol{\pi}\boldsymbol{\pi}^{\rm T} \right\} \left( 1 + (n - 1)\rho \right)
The parameter \rho
is known as the "intra class" or "intra cluster" correlation. It is this positive correlation which gives rise to overdispersion relative to the multinomial distribution.
Aggregation
If
: X = (X_1, \ldots, X_K) \sim \operatorname{DM}(\alpha_1, \ldots, \alpha_K)
then, if the random variables with subscripts ''i'' and ''j'' are dropped from the vector and replaced by their sum,
: X' = (X_1, \ldots, X_i + X_j, \ldots, X_K) \sim \operatorname{DM}(\alpha_1, \ldots, \alpha_i + \alpha_j, \ldots, \alpha_K)
This aggregation property may be used to derive the marginal distribution of X_i.
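The aggregation property can be checked numerically: the marginal of one component, obtained by summing the joint pmf over the other components, coincides with a two-category Dirichlet-multinomial (a beta-binomial) with parameters \alpha_i and \alpha_0 - \alpha_i. A brute-force sketch (function names are ours):

```python
from math import lgamma, exp
from itertools import product

def dm_pmf(x, alpha):
    # Dirichlet-multinomial pmf via log-gamma
    n, a0 = sum(x), sum(alpha)
    lp = lgamma(n + 1) + lgamma(a0) - lgamma(n + a0)
    for xk, ak in zip(x, alpha):
        lp += lgamma(xk + ak) - lgamma(ak) - lgamma(xk + 1)
    return exp(lp)

def marginal_first(n, alpha, v):
    """Pr(X_1 = v) by summing the joint pmf over the remaining components."""
    K = len(alpha)
    rest = [x for x in product(range(n + 1), repeat=K - 1) if sum(x) == n - v]
    return sum(dm_pmf((v,) + x, alpha) for x in rest)
```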
Likelihood function
Conceptually, we are making ''N'' independent draws from a categorical distribution with ''K'' categories. Let us represent the independent draws as random categorical variables z_n
for n = 1, \ldots, N. Let us denote the number of times a particular category ''k''
has been seen (for k = 1, \ldots, K) among all the categorical variables as n_k, with \sum_k n_k = N. Then, we have two separate views onto this problem:
# A set of ''N'' categorical variables z_1, \ldots, z_N.
# A single vector-valued variable \mathbf{x} = (n_1, \ldots, n_K), distributed according to a
multinomial distribution.
The former case is a set of random variables specifying each ''individual'' outcome, while the latter is a variable specifying the ''number'' of outcomes of each of the ''K'' categories. The distinction is important, as the two cases have correspondingly different probability distributions.
The parameter of the categorical distribution is \mathbf{p} = (p_1, \ldots, p_K),
where p_k
is the probability of drawing value ''k'';
\mathbf{p} is likewise the parameter of the multinomial distribution \Pr(\mathbf{x}\mid\mathbf{p}). Rather than specifying \mathbf{p}
directly, we give it a
conjugate prior distribution, and hence it is drawn from a Dirichlet distribution with parameter vector \boldsymbol\alpha = (\alpha_1, \ldots, \alpha_K).
By integrating out \mathbf{p}, we obtain a compound distribution. However, the form of the distribution is different depending on which view we take.
For a set of individual outcomes
Joint distribution
For categorical variables \mathbb{Z} = z_1, \ldots, z_N, the
marginal joint distribution is obtained by integrating out \mathbf{p}:
: \Pr(\mathbb{Z}\mid\boldsymbol\alpha) = \int_{\mathbf{p}} \Pr(\mathbb{Z}\mid\mathbf{p}) \Pr(\mathbf{p}\mid\boldsymbol\alpha) \, d\mathbf{p}
which results in the following explicit formula:
: \Pr(\mathbb{Z}\mid\boldsymbol\alpha) = \frac{\Gamma(A)}{\Gamma(N + A)} \prod_{k=1}^{K} \frac{\Gamma(n_k + \alpha_k)}{\Gamma(\alpha_k)}
where \Gamma is the
gamma function, with
: A = \sum_k \alpha_k \quad \text{and} \quad N = \sum_k n_k
Note the absence of the multinomial coefficient, due to the formula being about the probability of a sequence of categorical variables instead of a probability on the counts within each category.
Although the variables z_1, \ldots, z_N do not appear explicitly in the above formula, they enter in through the n_k
values.
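The sequence formula differs from the count formula only by the multinomial coefficient, and it depends on the sequence only through the counts (the draws are exchangeable). Both facts can be verified directly; a small sketch with names of our own choosing:

```python
from math import lgamma, exp, factorial

def dm_sequence_logprob(z, alpha):
    """Log-probability of a specific sequence of category labels z
    (each label in 0..K-1) under the collapsed Dirichlet prior."""
    K = len(alpha)
    A = sum(alpha)
    counts = [z.count(k) for k in range(K)]
    lp = lgamma(A) - lgamma(len(z) + A)
    for nk, ak in zip(counts, alpha):
        lp += lgamma(nk + ak) - lgamma(ak)
    return lp
```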
Conditional distribution
Another useful formula, particularly in the context of
Gibbs sampling, asks what the conditional density of a given variable z_n
is, conditioned on all the other variables (which we will denote \mathbb{Z}^{(-n)}). It turns out to have an extremely simple form:
: \Pr(z_n = k \mid \mathbb{Z}^{(-n)}, \boldsymbol\alpha) \propto n_k^{(-n)} + \alpha_k
where n_k^{(-n)}
specifies the number of counts of category ''k''
seen in all variables other than z_n.
It may be useful to show how to derive this formula. In general,
conditional distributions are proportional to the corresponding
joint distributions, so we simply start with the above formula for the joint distribution of all the z_n
values and then eliminate any factors not dependent on the particular z_n
in question. To do this, we make use of the notation n_k^{(-n)}
defined above, and
: n_k = \begin{cases} n_k^{(-n)} + 1, & \text{if } z_n = k \\ n_k^{(-n)}, & \text{otherwise} \end{cases}
We also use the fact that
: \Gamma(x + 1) = x \, \Gamma(x)
Then:
: \Pr(z_n = k \mid \mathbb{Z}^{(-n)}, \boldsymbol\alpha) \propto \frac{\Gamma(A)}{\Gamma(N + A)} \prod_{j=1}^{K} \frac{\Gamma(n_j + \alpha_j)}{\Gamma(\alpha_j)} \propto \Gamma(n_k^{(-n)} + \alpha_k + 1) \prod_{j \ne k} \Gamma(n_j^{(-n)} + \alpha_j) = \left( n_k^{(-n)} + \alpha_k \right) \prod_{j=1}^{K} \Gamma(n_j^{(-n)} + \alpha_j) \propto n_k^{(-n)} + \alpha_k
In general, it is not necessary to worry about the
normalizing constant
at the time of deriving the equations for conditional distributions. The normalizing constant will be determined as part of the algorithm for sampling from the distribution (see
Categorical distribution#Sampling). However, when the conditional distribution is written in the simple form above, it turns out that the normalizing constant assumes a simple form:
: \sum_{k=1}^{K} \left( n_k^{(-n)} + \alpha_k \right) = N - 1 + A
Hence
: \Pr(z_n = k \mid \mathbb{Z}^{(-n)}, \boldsymbol\alpha) = \frac{n_k^{(-n)} + \alpha_k}{N - 1 + A}
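The normalized conditional is trivial to compute from the exclusive counts; a sketch suitable as the inner step of a collapsed Gibbs sampler (the function name is illustrative):

```python
def gibbs_conditional(z, n, alpha):
    """Collapsed conditional Pr(z_n = k | all other z's) for each k,
    computed from category counts that exclude position n."""
    K = len(alpha)
    counts = [0] * K
    for i, zi in enumerate(z):
        if i != n:
            counts[zi] += 1
    A = sum(alpha)
    denom = len(z) - 1 + A   # normalizing constant N - 1 + A
    return [(counts[k] + alpha[k]) / denom for k in range(K)]
```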
This formula is closely related to the
Chinese restaurant process, which results from taking the limit as K \to \infty.
In a Bayesian network
In a larger
Bayesian network in which categorical (or so-called "multinomial") distributions occur with
Dirichlet distribution priors as part of a larger network, all Dirichlet priors can be collapsed provided that the only nodes depending on them are categorical distributions. The collapsing happens for each Dirichlet-distribution node separately from the others, and occurs regardless of any other nodes that may depend on the categorical distributions. It also occurs regardless of whether the categorical distributions depend on nodes additional to the Dirichlet priors (although in such a case, those other nodes must remain as additional conditioning factors). Essentially, all of the categorical distributions depending on a given Dirichlet-distribution node become connected into a single Dirichlet-multinomial joint distribution defined by the above formula. The joint distribution as defined this way will depend on the parent(s) of the integrated-out Dirichlet prior nodes, as well as any parent(s) of the categorical nodes other than the Dirichlet prior nodes themselves.
In the following sections, we discuss different configurations commonly found in Bayesian networks. We repeat the probability density from above, and define it using the symbol \operatorname{DirMult}(\mathbb{Z}\mid\boldsymbol\alpha):
: \operatorname{DirMult}(\mathbb{Z}\mid\boldsymbol\alpha) = \frac{\Gamma(A)}{\Gamma(N + A)} \prod_{k=1}^{K} \frac{\Gamma(n_k + \alpha_k)}{\Gamma(\alpha_k)}
Multiple Dirichlet priors with the same hyperprior
Imagine we have a hierarchical model as follows:
: \begin{array}{lcl} \boldsymbol\alpha &\sim& \text{some distribution} \\ \boldsymbol\theta_{d=1 \dots M} &\sim& \operatorname{Dirichlet}_K(\boldsymbol\alpha) \\ z_{d=1 \dots M, n=1 \dots N_d} &\sim& \operatorname{Categorical}_K(\boldsymbol\theta_d) \end{array}
In cases like this, we have multiple Dirichlet priors, each of which generates some number of categorical observations (possibly a different number for each prior). The fact that they are all dependent on the same hyperprior, even if this is a random variable as above, makes no difference. The effect of integrating out a Dirichlet prior links the categorical variables attached to that prior, whose joint distribution simply inherits any conditioning factors of the Dirichlet prior. The fact that multiple priors may share a hyperprior makes no difference:
: \Pr(\mathbb{Z}\mid\boldsymbol\alpha) = \prod_{d=1}^{M} \operatorname{DirMult}(\mathbb{Z}_d\mid\boldsymbol\alpha)
where \mathbb{Z}_d
is simply the collection of categorical variables dependent on prior ''d'', and \operatorname{DirMult} denotes the Dirichlet-multinomial density defined above.
Accordingly, the conditional probability distribution can be written as follows:
: \Pr(z_{dn} = k \mid \mathbb{Z}^{(-dn)}, \boldsymbol\alpha) \propto n_{dk}^{(-dn)} + \alpha_k
where n_{dk}^{(-dn)}
specifically means the number of variables ''among the set'' \mathbb{Z}_d, excluding z_{dn}
itself, that have the value ''k''.
It is necessary to count only the variables having the value ''k'' that are tied together to the variable in question through having the same prior. We do not want to count any other variables also having the value ''k''.
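In code, the only change from the single-prior case is that the exclusive counts are taken over the variables sharing the same prior (here indexed by a `groups` array, a device of our own for the sketch):

```python
def grouped_gibbs_conditional(z, groups, d, n, alpha):
    """Pr(z_n = k | rest) for variable n belonging to prior group d,
    counting only the variables tied to the same Dirichlet prior."""
    K = len(alpha)
    counts = [0] * K
    for i, zi in enumerate(z):
        if groups[i] == d and i != n:
            counts[zi] += 1
    denom = sum(counts) + sum(alpha)
    return [(counts[k] + alpha[k]) / denom for k in range(K)]
```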
Multiple Dirichlet priors with the same hyperprior, with dependent children
Now imagine a slightly more complicated hierarchical model as follows:
: \begin{array}{lcl} \boldsymbol\alpha &\sim& \text{some distribution} \\ \boldsymbol\theta_{d=1 \dots M} &\sim& \operatorname{Dirichlet}_K(\boldsymbol\alpha) \\ z_{d=1 \dots M, n=1 \dots N_d} &\sim& \operatorname{Categorical}_K(\boldsymbol\theta_d) \\ w_{d=1 \dots M, n=1 \dots N_d} &\sim& \operatorname{F}(w \mid z_{dn}) \end{array}
This model is the same as above, but in addition, each of the categorical variables has a child variable w_{dn} dependent on it. This is typical of a
mixture model.
Again, in the joint distribution, only the categorical variables dependent on the same prior are linked into a single Dirichlet-multinomial:
: \Pr(\mathbb{Z}, \mathbb{W}\mid\boldsymbol\alpha) = \left( \prod_{d=1}^{M} \operatorname{DirMult}(\mathbb{Z}_d\mid\boldsymbol\alpha) \right) \prod_{d=1}^{M} \prod_{n=1}^{N_d} \Pr(w_{dn}\mid z_{dn})
where \operatorname{DirMult} denotes the Dirichlet-multinomial density defined above.
The conditional distribution of the categorical variables dependent only on their parents and ancestors would have the identical form as above in the simpler case. However, in Gibbs sampling it is necessary to determine the conditional distribution of a given node z_{dn}
dependent not only on \mathbb{Z}^{(-dn)}
and ancestors such as \boldsymbol\alpha
but on ''all'' the other parameters.
The simplified expression for the conditional distribution is derived above simply by rewriting the expression for the joint probability and removing constant factors. Hence, the same simplification would apply in a larger joint probability expression such as the one in this model, composed of Dirichlet-multinomial densities plus factors for many other random variables dependent on the values of the categorical variables.
This yields the following:
: \Pr(z_{dn} = k \mid \mathbb{Z}^{(-dn)}, \mathbb{W}, \boldsymbol\alpha) \propto \left( n_{dk}^{(-dn)} + \alpha_k \right) \Pr(w_{dn}\mid z_{dn} = k)
Here the probability density \Pr(w_{dn}\mid z_{dn} = k)
appears directly. To do
random sampling over z_{dn}, we would compute the unnormalized probabilities for all ''K'' possibilities for z_{dn}
using the above formula, then normalize them and proceed as normal using the algorithm described in the
categorical distribution article.
Strictly speaking, the additional factor that appears in the conditional distribution is derived not from the model specification but directly from the joint distribution. This distinction is important when considering models where a given node with Dirichlet-prior parent has multiple dependent children, particularly when those children are dependent on each other (e.g. if they share a parent that is collapsed out). This is discussed more below.
Multiple Dirichlet priors with shifting prior membership
Now imagine we have a hierarchical model as follows:
: \begin{array}{lcl} \boldsymbol\theta_{k=1 \dots K} &\sim& \operatorname{Dirichlet}_V(\boldsymbol\beta) \\ z_{n=1 \dots N} &\sim& \text{some distribution} \\ w_{n=1 \dots N} &\sim& \operatorname{Categorical}_V(\boldsymbol\theta_{z_n}) \end{array}
Here we have a tricky situation where we have multiple Dirichlet priors as before and a set of dependent categorical variables, but the relationship between the priors and dependent variables isn't fixed, unlike before. Instead, the choice of which prior to use is dependent on another random categorical variable. This occurs, for example, in topic models, and indeed the names of the variables above are meant to correspond to those in
latent Dirichlet allocation. In this case, the set \mathbb{W}
is a set of words, each of which is drawn from one of ''K''
possible topics, where each topic is a Dirichlet prior over a vocabulary of ''V''
possible words, specifying the frequency of different words in the topic. However, the topic membership of a given word isn't fixed; rather, it's determined from a set of
latent variables \mathbb{Z}. There is one latent variable per word, a ''K''-dimensional
categorical variable
specifying the topic the word belongs to.
In this case, all variables dependent on a given prior are tied together (i.e.
correlated) in a group, as before — specifically, all words belonging to a given topic are linked. In this case, however, the group membership shifts, in that the words are not fixed to a given topic but the topic depends on the value of a latent variable associated with the word. However, the definition of the Dirichlet-multinomial density doesn't actually depend on the number of categorical variables in a group (i.e. the number of words in the document generated from a given topic), but only on the counts of how many variables in the group have a given value (i.e. among all the word tokens generated from a given topic, how many of them are a given word). Hence, we can still write an explicit formula for the joint distribution:
: \Pr(\mathbb{W}\mid\mathbb{Z},\boldsymbol\beta) = \prod_{k=1}^{K} \frac{\Gamma(B)}{\Gamma(N_k + B)} \prod_{v=1}^{V} \frac{\Gamma(n_k^{v} + \beta_v)}{\Gamma(\beta_v)}
where B = \sum_v \beta_v, n_k^{v} is the number of word tokens with value ''v'' generated from topic ''k'', and N_k = \sum_v n_k^{v} is the total number of word tokens generated from topic ''k''.