Dirichlet-multinomial Distribution

In probability theory and statistics, the Dirichlet-multinomial distribution is a family of discrete multivariate probability distributions on a finite support of non-negative integers. It is also called the Dirichlet compound multinomial distribution (DCM) or multivariate Pólya distribution (after George Pólya). It is a compound probability distribution, where a probability vector p is drawn from a Dirichlet distribution with parameter vector \boldsymbol\alpha, and an observation is drawn from a multinomial distribution with probability vector p and number of trials ''n''. The Dirichlet parameter vector captures the prior belief about the situation and can be seen as a pseudocount: observations of each outcome that occur before the actual data are collected. The compounding corresponds to a Pólya urn scheme. The distribution is frequently encountered in Bayesian statistics, machine learning, empirical Bayes methods and classical statistics as an overdispersed multinomial distribution. It reduces to the categorical distribution as a special case when ''n'' = 1, and it approximates the multinomial distribution arbitrarily well for large ''α''. The Dirichlet-multinomial is a multivariate extension of the beta-binomial distribution, as the multinomial and Dirichlet distributions are multivariate versions of the binomial distribution and beta distribution, respectively.


Specification


Dirichlet-multinomial as a compound distribution

The Dirichlet distribution is a conjugate distribution to the multinomial distribution. This fact leads to an analytically tractable compound distribution. For a random vector of category counts \mathbf{x}=(x_1,\dots,x_K), distributed according to a multinomial distribution, the marginal distribution is obtained by integrating out the distribution for \mathbf{p}, which can be thought of as a random vector following a Dirichlet distribution:

:\Pr(\mathbf{x}\mid n,\boldsymbol\alpha)=\int_{\mathbf{p}}\mathrm{Mult}(\mathbf{x}\mid n,\mathbf{p})\,\mathrm{Dir}(\mathbf{p}\mid\boldsymbol\alpha)\,\textrm{d}\mathbf{p}

which results in the following explicit formula:

:\Pr(\mathbf{x}\mid n, \boldsymbol\alpha)=\frac{\left(n!\right)\Gamma\left(\alpha_0\right)}{\Gamma\left(n+\alpha_0\right)}\prod_{k=1}^K\frac{\Gamma(x_k+\alpha_k)}{\left(x_k!\right)\Gamma(\alpha_k)}

where \alpha_0 is defined as the sum \alpha_0 = \sum_k \alpha_k. Another form for this same compound distribution, written more compactly in terms of the beta function, ''B'', is as follows:

:\Pr(\mathbf{x}\mid n,\boldsymbol\alpha)=\frac{n\,B\left(\alpha_0,n\right)}{\prod_{k:x_k>0} x_k\,B\left(\alpha_k,x_k\right)}.

The latter form emphasizes the fact that zero-count categories can be ignored in the calculation, a useful fact when the number of categories is very large and sparse (e.g. word counts in documents). Observe that the pdf is the beta-binomial distribution when K=2. It can also be shown that it approaches the multinomial distribution as \alpha_0 approaches infinity. The parameter \alpha_0 governs the degree of overdispersion or burstiness relative to the multinomial. Alternative choices to denote \alpha_0 found in the literature are S and A.
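
The explicit formula above is straightforward to evaluate in log space. The following is a minimal sketch (not part of the article; the function name and the use of numpy/scipy are assumptions) of computing the Dirichlet-multinomial log-pmf with log-gamma functions for numerical stability:

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_multinomial_logpmf(x, alpha):
    """log Pr(x | n, alpha) for a count vector x summing to n and Dirichlet parameters alpha."""
    x = np.asarray(x, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    n = x.sum()
    a0 = alpha.sum()
    # log n! + log Gamma(alpha_0) - log Gamma(n + alpha_0)
    out = gammaln(n + 1) + gammaln(a0) - gammaln(n + a0)
    # sum_k [ log Gamma(x_k + alpha_k) - log x_k! - log Gamma(alpha_k) ]
    out += np.sum(gammaln(x + alpha) - gammaln(x + 1) - gammaln(alpha))
    return out

# Example: three categories, n = 10 trials
print(np.exp(dirichlet_multinomial_logpmf([3, 5, 2], [1.0, 2.0, 0.5])))
```

Working with gammaln avoids overflow of the factorials and gamma functions when the counts are large.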


Dirichlet-multinomial as an urn model

The Dirichlet-multinomial distribution can also be motivated via an urn model for positive integer values of the vector \boldsymbol\alpha, known as the Pólya urn model. Specifically, imagine an urn containing balls of K colors numbering \alpha_i for the ''i''th color, where random draws are made. When a ball is randomly drawn and observed, then two balls of the same color are returned to the urn. If this is performed n times, then the probability of observing the random vector \mathbf{x} of color counts is a Dirichlet-multinomial with parameters n and \boldsymbol\alpha. If the random draws are with simple replacement (no balls over and above the observed ball are added to the urn), then the distribution follows a multinomial distribution, and if the random draws are made without replacement, the distribution follows a multivariate hypergeometric distribution.
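
As an illustration of the urn scheme, the following sketch (hypothetical helper names, not from the article) simulates the Pólya urn and, for comparison, the equivalent compound sampling of a Dirichlet vector followed by a multinomial draw; both target the same Dirichlet-multinomial distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def polya_urn_counts(alpha, n):
    """Colour counts after n draws from an urn initialised with alpha balls per colour."""
    urn = np.asarray(alpha, dtype=float).copy()
    counts = np.zeros(len(urn), dtype=int)
    for _ in range(n):
        i = rng.choice(len(urn), p=urn / urn.sum())  # draw a ball at random
        counts[i] += 1
        urn[i] += 1                                  # return it plus one extra ball of the same colour
    return counts

def compound_counts(alpha, n):
    """Equivalent compound sampling: p ~ Dirichlet(alpha), x ~ Multinomial(n, p)."""
    p = rng.dirichlet(alpha)
    return rng.multinomial(n, p)

print(polya_urn_counts([1.0, 2.0, 3.0], 20))
print(compound_counts([1.0, 2.0, 3.0], 20))
```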


Properties


Moments

Once again, let \alpha_0 = \sum_k \alpha_k and let p_i = \frac{\alpha_i}{\alpha_0} = \frac{\alpha_i}{\sum_k \alpha_k}; then the expected number of times the outcome ''i'' was observed over ''n'' trials is

:\operatorname{E}(X_i) = n p_i = n\frac{\alpha_i}{\alpha_0}.\,

The covariance matrix is as follows. Each diagonal entry is the variance of a beta-binomially distributed random variable, and is therefore

:\operatorname{var}(X_i)=np_i(1-p_i)\left(\frac{n+\alpha_0}{1+\alpha_0}\right)=n\frac{\alpha_i}{\alpha_0}\left(1-\frac{\alpha_i}{\alpha_0}\right)\left(\frac{n+\alpha_0}{1+\alpha_0}\right).\,

The off-diagonal entries are the covariances:

:\operatorname{cov}(X_i,X_j)=-np_i p_j\left(\frac{n+\alpha_0}{1+\alpha_0}\right)=-n\frac{\alpha_i\alpha_j}{\alpha_0^2}\left(\frac{n+\alpha_0}{1+\alpha_0}\right)\,

for ''i'', ''j'' distinct. All covariances are negative because for fixed ''n'', an increase in one component of a Dirichlet-multinomial vector requires a decrease in another component. This is a ''K'' × ''K'' positive-semidefinite matrix of rank ''K'' − 1. The entries of the corresponding correlation matrix are

:\rho(X_i,X_i) = 1,

:\rho(X_i,X_j) = \frac{\operatorname{cov}(X_i,X_j)}{\sqrt{\operatorname{var}(X_i)\operatorname{var}(X_j)}} = \frac{-p_i p_j}{\sqrt{p_i(1-p_i)\,p_j(1-p_j)}} = -\sqrt{\frac{\alpha_i\alpha_j}{(\alpha_0-\alpha_i)(\alpha_0-\alpha_j)}}.

The sample size drops out of this expression. Each of the ''K'' components separately has a beta-binomial distribution. The support of the Dirichlet-multinomial distribution is the set

:\left\{(x_1,\dots,x_K)\in\mathbb{N}^{K} : \textstyle\sum_i x_i = n\right\}.\,

Its number of elements is

:\binom{n+K-1}{K-1}.
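
The moment formulas above can be checked numerically. The sketch below (an illustration with assumed parameter values, not from the article) evaluates the mean, variance, covariance and correlation for a small example and confirms that the correlation computed from cov/sqrt(var·var) agrees with the closed form:

```python
import numpy as np

alpha = np.array([1.0, 2.0, 3.0])
n = 10
a0 = alpha.sum()
p = alpha / a0

mean = n * p                                      # E[X_i] = n * alpha_i / alpha_0
var = n * p * (1 - p) * (n + a0) / (1 + a0)       # Var(X_i): beta-binomial variance of each component
cov_01 = -n * p[0] * p[1] * (n + a0) / (1 + a0)   # Cov(X_0, X_1), always negative
rho_01 = -np.sqrt(alpha[0] * alpha[1] / ((a0 - alpha[0]) * (a0 - alpha[1])))  # independent of n

print(mean, var)
print(cov_01 / np.sqrt(var[0] * var[1]), rho_01)  # the two correlation values coincide
```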


Matrix notation

In matrix notation,

:\operatorname{E}(\mathbf{X}) = n \mathbf{p},\,

and

:\operatorname{var}(\mathbf{X}) = n \lbrace \operatorname{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^{\rm T} \rbrace \left( \frac{n+\alpha_0}{1+\alpha_0} \right),\,

with \mathbf{p}^{\rm T} = the row vector transpose of the column vector \mathbf{p}. Letting

:\rho^2 = \frac{1}{1+\alpha_0}\,,

we can write alternatively

:\operatorname{var}(\mathbf{X}) = n \lbrace \operatorname{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^{\rm T} \rbrace (1+\rho^2(n-1)).\,

The parameter \rho\! is known as the "intra class" or "intra cluster" correlation. It is this positive correlation which gives rise to overdispersion relative to the multinomial distribution.
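
A small sketch of the matrix form (hypothetical function name; it assumes the reconstruction \rho^2 = 1/(1+\alpha_0) used above):

```python
import numpy as np

def dm_mean_cov(alpha, n):
    """Mean n*p and covariance n*(diag(p) - p p^T)*(1 + rho^2 (n-1)) of a Dirichlet-multinomial."""
    alpha = np.asarray(alpha, dtype=float)
    a0 = alpha.sum()
    p = alpha / a0
    rho2 = 1.0 / (1.0 + a0)                        # squared intra-class correlation
    cov = n * (np.diag(p) - np.outer(p, p)) * (1.0 + rho2 * (n - 1))
    return n * p, cov

mean, cov = dm_mean_cov([1.0, 2.0, 3.0], n=10)
print(mean)
print(np.linalg.matrix_rank(cov))  # K - 1 = 2: each row of the covariance sums to zero
```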


Aggregation

If

:X = (X_1, \ldots, X_K)\sim\operatorname{DM}(\alpha_1,\cdots,\alpha_K)

then, if the random variables with subscripts ''i'' and ''j'' are dropped from the vector and replaced by their sum,

:X' = (X_1, \ldots, X_i + X_j, \ldots, X_K)\sim\operatorname{DM}\left(\alpha_1,\cdots,\alpha_i+\alpha_j,\cdots,\alpha_K\right).

This aggregation property may be used to derive the marginal distribution of X_i.
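
The aggregation property can be illustrated by simulation. In the sketch below (assumed helper names and sample sizes, not from the article), merging two components of Dirichlet-multinomial draws behaves like drawing directly with the corresponding Dirichlet parameters summed:

```python
import numpy as np

rng = np.random.default_rng(1)

def dm_sample(alpha, n, size):
    """Draw `size` Dirichlet-multinomial count vectors by compound sampling."""
    p = rng.dirichlet(alpha, size=size)
    return np.array([rng.multinomial(n, pi) for pi in p])

alpha, n = np.array([1.0, 2.0, 3.0]), 20
x = dm_sample(alpha, n, size=20_000)
merged = np.column_stack([x[:, 0] + x[:, 1], x[:, 2]])          # replace X_1, X_2 by their sum
y = dm_sample(np.array([alpha[0] + alpha[1], alpha[2]]), n, size=20_000)

# The empirical means (and, up to sampling noise, the whole distributions) agree.
print(merged.mean(axis=0), y.mean(axis=0))
```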


Likelihood function

Conceptually, we are making ''N'' independent draws from a categorical distribution with ''K'' categories. Let us represent the independent draws as random categorical variables z_n for n = 1 \dots N. Let us denote the number of times a particular category k has been seen (for k = 1 \dots K) among all the categorical variables as n_k, and \sum_k n_k = N. Then, we have two separate views onto this problem:
# A set of N categorical variables z_1,\dots,z_N.
# A single vector-valued variable \mathbf{x}=(n_1,\dots,n_K), distributed according to a multinomial distribution.
The former case is a set of random variables specifying each ''individual'' outcome, while the latter is a variable specifying the ''number'' of outcomes of each of the ''K'' categories. The distinction is important, as the two cases have correspondingly different probability distributions. The parameter of the categorical distribution is \mathbf{p} = (p_1,p_2,\dots,p_K), where p_k is the probability of drawing value k; \mathbf{p} is likewise the parameter of the multinomial distribution P(\mathbf{x}\mid\mathbf{p}). Rather than specifying \mathbf{p} directly, we give it a conjugate prior distribution, and hence it is drawn from a Dirichlet distribution with parameter vector \boldsymbol\alpha=(\alpha_1,\alpha_2,\ldots,\alpha_K). By integrating out \mathbf{p}, we obtain a compound distribution. However, the form of the distribution is different depending on which view we take.


For a set of individual outcomes


Joint distribution

For categorical variables \mathbb{Z}=z_1,\dots,z_N, the marginal joint distribution is obtained by integrating out \mathbf{p}:

:\Pr(\mathbb{Z}\mid\boldsymbol\alpha)=\int_{\mathbf{p}}\Pr(\mathbb{Z}\mid \mathbf{p})\Pr(\mathbf{p}\mid\boldsymbol\alpha)\,\textrm{d}\mathbf{p}

which results in the following explicit formula:

:\Pr(\mathbb{Z}\mid\boldsymbol\alpha)=\frac{\Gamma\left(\alpha_0\right)}{\Gamma\left(N+\alpha_0\right)}\prod_{k=1}^K\frac{\Gamma(n_k+\alpha_k)}{\Gamma(\alpha_k)}

where \Gamma is the gamma function, with

:\alpha_0=\sum_k \alpha_k,\quad N=\sum_k n_k,\quad n_k=\text{the number of }z_n\text{ with value }k.

Note the absence of the multinomial coefficient due to the formula being about the probability of a sequence of categorical variables instead of a probability on the counts within each category. Although the variables z_1,\dots,z_N do not appear explicitly in the above formula, they enter in through the n_k values.
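
A short sketch (assumed function name, not from the article) of this sequence-level formula, which contains no multinomial coefficient because it is the probability of one particular ordered sequence of outcomes:

```python
import numpy as np
from scipy.special import gammaln

def categorical_sequence_logprob(z, alpha):
    """log Pr(z_1..z_N | alpha) where z contains category indices 0..K-1."""
    alpha = np.asarray(alpha, dtype=float)
    counts = np.bincount(np.asarray(z), minlength=len(alpha)).astype(float)  # n_k
    N, a0 = counts.sum(), alpha.sum()
    return (gammaln(a0) - gammaln(N + a0)
            + np.sum(gammaln(counts + alpha) - gammaln(alpha)))

# Probability of the specific sequence (0, 2, 2, 1) under a symmetric Dirichlet prior.
print(np.exp(categorical_sequence_logprob([0, 2, 2, 1], [1.0, 1.0, 1.0])))
```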


Conditional distribution

Another useful formula, particularly in the context of Gibbs sampling, asks what the conditional density of a given variable z_n is, conditioned on all the other variables (which we will denote \mathbb{Z}^{(-n)}). It turns out to have an extremely simple form:

:\Pr(z_n=k\mid\mathbb{Z}^{(-n)},\boldsymbol\alpha) \propto n_k^{(-n)} + \alpha_k

where n_k^{(-n)} specifies the number of counts of category k seen in all variables other than z_n.

It may be useful to show how to derive this formula. In general, conditional distributions are proportional to the corresponding joint distributions, so we simply start with the above formula for the joint distribution of all the z_1,\dots,z_N values and then eliminate any factors not dependent on the particular z_n in question. To do this, we make use of the notation n_k^{(-n)} defined above, and

:n_j = \begin{cases} n_j^{(-n)}, & \text{if }j\not=k \\ n_j^{(-n)}+1, & \text{if }j=k \end{cases}

We also use the fact that

:\Gamma(n+1) = n\Gamma(n).

Then:

:\begin{align}
& \Pr(z_n=k\mid\mathbb{Z}^{(-n)},\boldsymbol\alpha)\\
\propto\ & \Pr(z_n=k,\mathbb{Z}^{(-n)}\mid\boldsymbol\alpha) \\
=\ & \frac{\Gamma\left(\alpha_0\right)}{\Gamma\left(N+\alpha_0\right)}\prod_{j=1}^K\frac{\Gamma(n_j+\alpha_j)}{\Gamma(\alpha_j)} \\
\propto\ & \prod_{j=1}^K\Gamma(n_j+\alpha_j) \\
=\ & \Gamma(n_k+\alpha_k)\prod_{j\not=k}\Gamma(n_j+\alpha_j) \\
=\ & \Gamma(n_k^{(-n)}+1+\alpha_k)\prod_{j\not=k}\Gamma(n_j^{(-n)}+\alpha_j) \\
=\ & (n_k^{(-n)}+\alpha_k)\,\Gamma(n_k^{(-n)}+\alpha_k)\prod_{j\not=k}\Gamma(n_j^{(-n)}+\alpha_j) \\
=\ & (n_k^{(-n)}+\alpha_k)\prod_{j=1}^K\Gamma(n_j^{(-n)}+\alpha_j) \\
\propto\ & n_k^{(-n)}+\alpha_k
\end{align}

In general, it is not necessary to worry about the normalizing constant at the time of deriving the equations for conditional distributions. The normalizing constant will be determined as part of the algorithm for sampling from the distribution (see Categorical distribution#Sampling). However, when the conditional distribution is written in the simple form above, it turns out that the normalizing constant assumes a simple form:

:\sum_k \left( n_k^{(-n)} + \alpha_k \right) = \alpha_0 + \sum_k n_k^{(-n)} = \alpha_0 + N - 1.

Hence

:\Pr(z_n=k\mid\mathbb{Z}^{(-n)},\boldsymbol\alpha) = \frac{n_k^{(-n)} + \alpha_k}{\alpha_0 + N - 1}.

This formula is closely related to the Chinese restaurant process, which results from taking the limit as K \to \infty.
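
The conditional above is exactly the update used in collapsed Gibbs sampling. A minimal sketch (hypothetical variable names, not from the article) of resampling a single z_n given all the others:

```python
import numpy as np

rng = np.random.default_rng(2)

def resample_z(z, n_idx, counts, alpha):
    """Resample z[n_idx] in place; `counts` holds the category counts of all of z."""
    counts[z[n_idx]] -= 1                      # remove the current value: counts are now n_k^(-n)
    probs = counts + alpha                     # Pr(z_n = k | rest) proportional to n_k^(-n) + alpha_k
    probs = probs / probs.sum()                # normaliser is alpha_0 + N - 1
    z[n_idx] = rng.choice(len(alpha), p=probs)
    counts[z[n_idx]] += 1

alpha = np.array([0.5, 0.5, 0.5])
z = rng.integers(0, 3, size=10)
counts = np.bincount(z, minlength=3).astype(float)
for n_idx in range(len(z)):                    # one Gibbs sweep over all variables
    resample_z(z, n_idx, counts, alpha)
print(z)
```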


In a Bayesian network

In a larger Bayesian network in which categorical (or so-called "multinomial") distributions occur with Dirichlet distribution priors as part of a larger network, all Dirichlet priors can be collapsed provided that the only nodes depending on them are categorical distributions. The collapsing happens for each Dirichlet-distribution node separately from the others, and occurs regardless of any other nodes that may depend on the categorical distributions. It also occurs regardless of whether the categorical distributions depend on nodes additional to the Dirichlet priors (although in such a case, those other nodes must remain as additional conditioning factors). Essentially, all of the categorical distributions depending on a given Dirichlet-distribution node become connected into a single Dirichlet-multinomial joint distribution defined by the above formula. The joint distribution as defined this way will depend on the parent(s) of the integrated-out Dirichlet prior nodes, as well as any parent(s) of the categorical nodes other than the Dirichlet prior nodes themselves.

In the following sections, we discuss different configurations commonly found in Bayesian networks. We repeat the probability density from above, and define it using the symbol \operatorname{DirMult}(\mathbb{Z}\mid\boldsymbol\alpha):

:\Pr(\mathbb{Z}\mid\boldsymbol\alpha)=\operatorname{DirMult}(\mathbb{Z}\mid\boldsymbol\alpha)=\frac{\Gamma\left(\alpha_0\right)}{\Gamma\left(N+\alpha_0\right)}\prod_{k=1}^K\frac{\Gamma(n_k+\alpha_k)}{\Gamma(\alpha_k)}


Multiple Dirichlet priors with the same hyperprior

Imagine we have a hierarchical model as follows:

:\begin{array}{lcl}
\boldsymbol\alpha &\sim& \text{some distribution} \\
\boldsymbol\theta_{d=1\dots M} &\sim& \operatorname{Dirichlet}_K(\boldsymbol\alpha) \\
z_{d=1\dots M,\,n=1\dots N_d} &\sim& \operatorname{Categorical}_K(\boldsymbol\theta_d)
\end{array}

In cases like this, we have multiple Dirichlet priors, each of which generates some number of categorical observations (possibly a different number for each prior). The fact that they are all dependent on the same hyperprior, even if this is a random variable as above, makes no difference. The effect of integrating out a Dirichlet prior links the categorical variables attached to that prior, whose joint distribution simply inherits any conditioning factors of the Dirichlet prior. The fact that multiple priors may share a hyperprior makes no difference:

:\Pr(\mathbb{Z}\mid\boldsymbol\alpha) = \prod_d \operatorname{DirMult}(\mathbb{Z}_d\mid\boldsymbol\alpha)

where \mathbb{Z}_d is simply the collection of categorical variables dependent on prior ''d''.

Accordingly, the conditional probability distribution can be written as follows:

:\Pr(z_{dn}=k\mid\mathbb{Z}^{(-dn)},\boldsymbol\alpha)\ \propto\ n_{k,d}^{(-dn)} + \alpha_k

where n_{k,d}^{(-dn)} specifically means the number of variables ''among the set'' \mathbb{Z}_d, excluding z_{dn} itself, that have the value k. It is necessary to count only the variables having the value ''k'' that are tied together to the variable in question through having the same prior. We do not want to count any other variables also having the value ''k''.


Multiple Dirichlet priors with the same hyperprior, with dependent children

Now imagine a slightly more complicated hierarchical model as follows:

:\begin{array}{lcl}
\boldsymbol\alpha &\sim& \text{some distribution} \\
\boldsymbol\theta_{d=1\dots M} &\sim& \operatorname{Dirichlet}_K(\boldsymbol\alpha) \\
z_{d=1\dots M,\,n=1\dots N_d} &\sim& \operatorname{Categorical}_K(\boldsymbol\theta_d) \\
\boldsymbol\phi &\sim& \text{some other distribution} \\
w_{d=1\dots M,\,n=1\dots N_d} &\sim& \operatorname{F}(w_{dn}\mid z_{dn},\boldsymbol\phi)
\end{array}

This model is the same as above, but in addition, each of the categorical variables has a child variable dependent on it. This is typical of a mixture model.

Again, in the joint distribution, only the categorical variables dependent on the same prior are linked into a single Dirichlet-multinomial:

:\Pr(\mathbb{Z},\mathbb{W}\mid\boldsymbol\alpha,\boldsymbol\phi) = \prod_d \operatorname{DirMult}(\mathbb{Z}_d\mid\boldsymbol\alpha) \prod_{d=1}^{M} \prod_{n=1}^{N_d} \operatorname{F}(w_{dn}\mid z_{dn},\boldsymbol\phi)

The conditional distribution of the categorical variables dependent only on their parents and ancestors would have the identical form as above in the simpler case. However, in Gibbs sampling it is necessary to determine the conditional distribution of a given node z_{dn} dependent not only on \mathbb{Z}^{(-dn)} and ancestors such as \boldsymbol\alpha but on ''all'' the other parameters. The simplified expression for the conditional distribution is derived above simply by rewriting the expression for the joint probability and removing constant factors. Hence, the same simplification would apply in a larger joint probability expression such as the one in this model, composed of Dirichlet-multinomial densities plus factors for many other random variables dependent on the values of the categorical variables. This yields the following:

:\Pr(z_{dn}=k\mid\mathbb{Z}^{(-dn)},\mathbb{W},\boldsymbol\alpha,\boldsymbol\phi)\ \propto\ (n_{k,d}^{(-dn)} + \alpha_k)\operatorname{F}(w_{dn}\mid z_{dn}=k,\boldsymbol\phi)

Here the probability density of \operatorname{F} appears directly. To do random sampling over z_{dn}, we would compute the unnormalized probabilities for all ''K'' possibilities for z_{dn} using the above formula, then normalize them and proceed as normal using the algorithm described in the categorical distribution article. Correctly speaking, the additional factor that appears in the conditional distribution is derived not from the model specification but directly from the joint distribution. This distinction is important when considering models where a given node with a Dirichlet-prior parent has multiple dependent children, particularly when those children are dependent on each other (e.g. if they share a parent that is collapsed out). This is discussed more below.


Multiple Dirichlet priors with shifting prior membership

Now imagine we have a hierarchical model as follows:

:\begin{array}{lcl}
\boldsymbol\theta &\sim& \text{some distribution} \\
z_{n=1\dots N} &\sim& \operatorname{Categorical}_K(\boldsymbol\theta) \\
\boldsymbol\alpha &\sim& \text{some distribution} \\
\boldsymbol\phi_{k=1\dots K} &\sim& \operatorname{Dirichlet}_V(\boldsymbol\alpha) \\
w_{n=1\dots N} &\sim& \operatorname{Categorical}_V(\boldsymbol\phi_{z_n})
\end{array}

Here we have a tricky situation where we have multiple Dirichlet priors as before and a set of dependent categorical variables, but the relationship between the priors and dependent variables isn't fixed, unlike before. Instead, the choice of which prior to use is dependent on another random categorical variable. This occurs, for example, in topic models, and indeed the names of the variables above are meant to correspond to those in latent Dirichlet allocation. In this case, the set \mathbb{W} is a set of words, each of which is drawn from one of K possible topics, where each topic is a Dirichlet prior over a vocabulary of V possible words, specifying the frequency of different words in the topic. However, the topic membership of a given word isn't fixed; rather, it's determined from a set of latent variables \mathbb{Z}. There is one latent variable per word, a K-dimensional categorical variable specifying the topic the word belongs to.

In this case, all variables dependent on a given prior are tied together (i.e. correlated) in a group, as before; specifically, all words belonging to a given topic are linked. In this case, however, the group membership shifts, in that the words are not fixed to a given topic but the topic depends on the value of a latent variable associated with the word. However, the definition of the Dirichlet-multinomial density doesn't actually depend on the number of categorical variables in a group (i.e. the number of words in the document generated from a given topic), but only on the counts of how many variables in the group have a given value (i.e. among all the word tokens generated from a given topic, how many of them are a given word). Hence, we can still write an explicit formula for the joint distribution:

:\Pr(\mathbb{W}\mid\boldsymbol\alpha,\mathbb{Z}) = \prod_{k=1}^K \operatorname{DirMult}(\mathbb{W}_k\mid\mathbb{Z},\boldsymbol\alpha) = \prod_{k=1}^K \left[\frac{\Gamma\left(\alpha_0\right)}{\Gamma\left(N_k+\alpha_0\right)}\prod_{v=1}^V\frac{\Gamma(n_v^{k}+\alpha_v)}{\Gamma(\alpha_v)}\right]

Here we use the notation n_v^{k} to denote the number of word tokens whose value is word symbol ''v'' and which belong to topic ''k'', and N_k = \sum_v n_v^{k} to denote the total number of word tokens generated by topic ''k''. The conditional distribution still has the same form:

:\Pr(w_n=v\mid\mathbb{W}^{(-n)},\mathbb{Z},\boldsymbol\alpha)\ \propto\ n_v^{k,(-n)} + \alpha_v

Here again, only the categorical variables for words belonging to a given topic are linked (even though this linking will depend on the assignments of the latent variables), and hence the word counts need to be over only the words generated by a given topic. Hence the symbol n_v^{k,(-n)}, which is the count of word tokens having the word symbol ''v'', but only among those generated by topic ''k'', and excluding the word itself whose distribution is being described. (The reason why excluding the word itself is necessary, and why it even makes sense at all, is that in a Gibbs sampling context, we repeatedly resample the values of each random variable, after having run through and sampled all previous variables. Hence the variable will already have a value, and we need to exclude this existing value from the various counts that we make use of.)


A combined example: LDA topic models

We now show how to combine some of the above scenarios to demonstrate how to Gibbs sample a real-world model, specifically a smoothed latent Dirichlet allocation (LDA) topic model. The model is as follows:

:\begin{array}{lcl}
\boldsymbol\alpha &\sim& \text{some distribution} \\
\boldsymbol\beta &\sim& \text{some distribution} \\
\boldsymbol\theta_{d=1\dots M} &\sim& \operatorname{Dirichlet}_K(\boldsymbol\alpha) \\
\boldsymbol\phi_{k=1\dots K} &\sim& \operatorname{Dirichlet}_V(\boldsymbol\beta) \\
z_{d=1\dots M,\,n=1\dots N_d} &\sim& \operatorname{Categorical}_K(\boldsymbol\theta_d) \\
w_{d=1\dots M,\,n=1\dots N_d} &\sim& \operatorname{Categorical}_V(\boldsymbol\phi_{z_{dn}})
\end{array}

Essentially we combine the previous three scenarios: We have categorical variables dependent on multiple priors sharing a hyperprior; we have categorical variables with dependent children (the latent variable topic identities); and we have categorical variables with shifting membership in multiple priors sharing a hyperprior. In the standard LDA model, the words are completely observed, and hence we never need to resample them. (However, Gibbs sampling would equally be possible if only some or none of the words were observed. In such a case, we would want to initialize the distribution over the words in some reasonable fashion, e.g. from the output of some process that generates sentences, such as a machine translation model, in order for the resulting posterior latent variable distributions to make any sense.)

Using the above formulas, we can write down the conditional probabilities directly:

:\begin{array}{lcl}
\Pr(w_{dn}=v\mid\mathbb{W}^{(-dn)},\mathbb{Z},\boldsymbol\beta)\ &\propto\ & \#\mathbb{W}_v^{k,(-dn)} + \beta_v \\
\Pr(z_{dn}=k\mid\mathbb{Z}^{(-dn)},w_{dn}=v,\mathbb{W}^{(-dn)},\boldsymbol\alpha)\ &\propto\ & (\#\mathbb{Z}_k^{d,(-dn)} + \alpha_k)\Pr(w_{dn}=v\mid\mathbb{W}^{(-dn)},\mathbb{Z},\boldsymbol\beta)
\end{array}

Here we have defined the counts more explicitly to clearly separate counts of words and counts of topics:

:\begin{array}{lcl}
\#\mathbb{W}_v^{k,(-dn)} &=& \text{number of word tokens with value }v\text{ generated by topic }k\text{, excluding }w_{dn} \\
\#\mathbb{Z}_k^{d,(-dn)} &=& \text{number of topic assignments with value }k\text{ in document }d\text{, excluding }z_{dn}
\end{array}

As in the scenario above with categorical variables with dependent children, the conditional probability of those dependent children appears in the definition of the parent's conditional probability. In this case, each latent variable has only a single dependent child word, so only one such term appears. (If there were multiple dependent children, all would have to appear in the parent's conditional probability, regardless of whether there was overlap between different parents and the same children, i.e. regardless of whether the dependent children of a given parent also have other parents. In a case where a child has multiple parents, the conditional probability for that child appears in the conditional probability definition of each of its parents.)

The definition above specifies only the ''unnormalized'' conditional probability of the words, while the topic conditional probability requires the ''actual'' (i.e. normalized) probability. Hence we have to normalize by summing over all word symbols:

:\begin{array}{lcl}
\Pr(z_{dn}=k\mid\mathbb{Z}^{(-dn)},w_{dn}=v,\mathbb{W}^{(-dn)},\boldsymbol\alpha)\ &\propto\ & \bigl(\#\mathbb{Z}_k^{d,(-dn)} + \alpha_k\bigr)\dfrac{\#\mathbb{W}_v^{k,(-dn)} + \beta_v}{\sum_{v'=1}^{V}\bigl(\#\mathbb{W}_{v'}^{k,(-dn)} + \beta_{v'}\bigr)} \\
&& \\
&=& \bigl(\#\mathbb{Z}_k^{d,(-dn)} + \alpha_k\bigr)\dfrac{\#\mathbb{W}_v^{k,(-dn)} + \beta_v}{\#\mathbb{W}^{k,(-dn)} + B}
\end{array}

where

:\begin{array}{lcl}
\#\mathbb{W}^{k,(-dn)} &=& \text{number of word tokens generated by topic }k\text{, excluding }w_{dn} \\
B &=& \sum_{v=1}^{V} \beta_v
\end{array}

It's also worth making another point in detail, which concerns the second factor above in the conditional probability. Remember that the conditional distribution in general is derived from the joint distribution, and simplified by removing terms not dependent on the domain of the conditional (the part on the left side of the vertical bar). When a node z has dependent children, there will be one or more factors \operatorname{F}(\dots\mid z) in the joint distribution that are dependent on z. ''Usually'' there is one factor for each dependent node, and it has the same density function as the distribution appearing in the mathematical definition. However, if a dependent node has another parent as well (a co-parent), and that co-parent is collapsed out, then the node will become dependent on all other nodes sharing that co-parent, and in place of multiple terms for each such node, the joint distribution will have only one joint term. We have exactly that situation here. Even though z_{dn} has only one child w_{dn}, that child has a Dirichlet co-parent that we have collapsed out, which induces a Dirichlet-multinomial over the entire set of nodes \mathbb{W}^{(-dn)}.

It happens in this case that this issue does not cause major problems, precisely because of the one-to-one relationship between z_{dn} and w_{dn}. We can rewrite the joint distribution as follows:

:\begin{array}{lcl}
p(\mathbb{W}\mid z_{dn}) &=& p(w_{dn}\mid\mathbb{W}^{(-dn)},z_{dn})\,p(\mathbb{W}^{(-dn)}\mid z_{dn}) \\
&=& p(w_{dn}\mid\mathbb{W}^{(-dn)},z_{dn})\,p(\mathbb{W}^{(-dn)}) \\
&\propto& p(w_{dn}\mid\mathbb{W}^{(-dn)},z_{dn})
\end{array}

where in the set \mathbb{W}^{(-dn)} (i.e. the set of nodes \mathbb{W} excluding w_{dn}), none of the nodes have z_{dn} as a parent.
Hence it can be eliminated as a conditioning factor (line 2), meaning that the entire factor can be eliminated from the conditional distribution (line 3).
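
Putting the conditional probabilities together gives the standard collapsed Gibbs sampler for smoothed LDA. The following is a compact sketch (hypothetical variable names, and symmetric scalar hyperparameters are an assumption for simplicity; it is not the article's code):

```python
import numpy as np

rng = np.random.default_rng(3)

def lda_gibbs(docs, K, V, alpha, beta, iters=50):
    """docs: list of lists of word ids in 0..V-1. Returns per-token topic assignments."""
    z = [rng.integers(0, K, size=len(d)) for d in docs]
    ndk = np.zeros((len(docs), K))          # topic counts per document
    nkv = np.zeros((K, V))                  # word counts per topic
    nk = np.zeros(K)                        # total words per topic
    for d, doc in enumerate(docs):
        for i, v in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkv[k, v] += 1; nk[k] += 1
    B = V * beta                            # sum of the beta pseudocounts over the vocabulary
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                k = z[d][i]
                ndk[d, k] -= 1; nkv[k, v] -= 1; nk[k] -= 1      # exclude the current token
                # Pr(z = k | rest) prop. to (ndk + alpha) * (nkv + beta) / (nk + B)
                p = (ndk[d] + alpha) * (nkv[:, v] + beta) / (nk + B)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkv[k, v] += 1; nk[k] += 1
    return z

docs = [[0, 1, 1, 2], [2, 3, 3, 0], [1, 1, 0, 3]]
print(lda_gibbs(docs, K=2, V=4, alpha=0.1, beta=0.01))
```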


A second example: Naive Bayes document clustering

Here is another model, with a different set of issues. This is an implementation of an unsupervised Naive Bayes model for document clustering. That is, we would like to classify documents into multiple categories (e.g. "spam" or "non-spam", or "scientific journal article", "newspaper article about finance", "newspaper article about politics", "love letter") based on textual content. However, we don't already know the correct category of any documents; instead, we want to cluster them based on mutual similarities. (For example, a set of scientific articles will tend to be similar to each other in word use but very different from a set of love letters.) This is a type of unsupervised learning. (The same technique can be used for doing semi-supervised learning, i.e. where we know the correct category of some fraction of the documents and would like to use this knowledge to help in clustering the remaining documents.) The model is as follows:

:\begin{array}{lcl}
\boldsymbol\alpha &\sim& \text{some distribution} \\
\boldsymbol\beta &\sim& \text{some distribution} \\
\boldsymbol\theta_{d=1\dots M} &\sim& \operatorname{Dirichlet}_K(\boldsymbol\alpha) \\
\boldsymbol\phi_{k=1\dots K} &\sim& \operatorname{Dirichlet}_V(\boldsymbol\beta) \\
z_{d=1\dots M} &\sim& \operatorname{Categorical}_K(\boldsymbol\theta_d) \\
w_{d=1\dots M,\,n=1\dots N_d} &\sim& \operatorname{Categorical}_V(\boldsymbol\phi_{z_d})
\end{array}

In many ways, this model is very similar to the LDA topic model described above, but it assumes one topic per document rather than one topic per word (as in LDA, where a document consists of a mixture of topics). This can be seen clearly in the above model, which is identical to the LDA model except that there is only one latent variable per document instead of one per word. Once again, we assume that we are collapsing all of the Dirichlet priors.

The conditional probability for a given word is almost identical to the LDA case. Once again, all words generated by the same Dirichlet prior are interdependent. In this case, this means the words of all documents having a given label; again, this can vary depending on the label assignments, but all we care about is the total counts. Hence:

:\Pr(w_{dn}=v\mid\mathbb{W}^{(-dn)},\mathbb{Z},\boldsymbol\beta)\ \propto\ \#\mathbb{W}_v^{k,(-dn)} + \beta_v

where

:\#\mathbb{W}_v^{k,(-dn)} = \text{number of word tokens with value }v\text{ in documents with label }k\text{, excluding }w_{dn}

However, there is a critical difference in the conditional distribution of the latent variables for the label assignments, which is that a given label variable has multiple child nodes instead of just one; in particular, the nodes for all the words in the label's document. This relates closely to the discussion above about the factor \operatorname{F}(\dots\mid z_d) that stems from the joint distribution. In this case, the joint distribution needs to be taken over all words in all documents containing a label assignment equal to the value of z_d, and has the value of a Dirichlet-multinomial distribution. Furthermore, we cannot reduce this joint distribution down to a conditional distribution over a single word. Rather, we can reduce it down only to a smaller joint conditional distribution over the words in the document for the label in question, and hence we cannot simplify it using the trick above that yields a simple sum of expected count and prior. Although it is in fact possible to rewrite it as a product of such individual sums, the number of factors is very large, and is not clearly more efficient than directly computing the Dirichlet-multinomial distribution probability.


Related distributions

The one-dimensional version of the Dirichlet-multinomial distribution is known as the beta-binomial distribution. The Dirichlet-multinomial distribution has a relationship with the negative binomial distribution analogous to the relationship of the multinomial distribution with the Poisson distribution.


Uses

The Dirichlet-multinomial distribution is used in automated document classification and clustering, genetics, economics, combat modeling, and quantitative marketing.


See also

* Beta-binomial distribution
* Chinese restaurant process
* Stars and bars (combinatorics)
* Dirichlet process
* Generalized Dirichlet distribution
* Krichevsky–Trofimov estimator
* Dirichlet negative multinomial distribution


References




Sources

* Elkan, C. (2006). "Clustering documents with an exponential-family approximation of the Dirichlet compound multinomial distribution". ICML, 289–296.
* Johnson, N. L., Kotz, S. and Balakrishnan, N. (1997). Discrete Multivariate Distributions (Vol. 165). New York: Wiley.
* Kvam, P. and Day, D. (2001). "The multivariate Polya distribution in combat modeling". Naval Research Logistics, 48, 1–17.
* Madsen, R. E., Kauchak, D. and Elkan, C. (2005). "Modeling Word Burstiness Using the Dirichlet Distribution". ICML, 545–552.
* Minka, T. (2003). "Estimating a Dirichlet distribution". Technical report, Microsoft Research. Includes Matlab code for fitting distributions to data.
* Mosimann, J. E. (1962). "On the compound multinomial distribution, the multivariate β-distribution, and correlations among proportions". Biometrika, 49(1–2), 65–82.
* Wagner, U. and Taudes, A. (1986). "A Multivariate Polya Model of Brand Choice and Purchase Incidence". Marketing Science, 5(3), 219–244.