probability theory Probability theory or probability calculus is the branch of mathematics concerned with probability. Although there are several different probability interpretations, probability theory treats the concept in a rigorous mathematical manner by expre ...

and

statistics Statistics (from German language, German: ', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a s ...

, a concentration parameter is a special kind of numerical parameter of a

parametric family In mathematics and its applications, a parametric family or a parameterized family is a indexed family, family of objects (a set of related objects) whose differences depend only on the chosen values for a set of parameters. Common examples are p ...

probability distribution In probability theory and statistics, a probability distribution is a Function (mathematics), function that gives the probabilities of occurrence of possible events for an Experiment (probability theory), experiment. It is a mathematical descri ...

s. Concentration parameters occur in two kinds of distribution: In the

Von Mises–Fisher distribution In directional statistics, the von Mises–Fisher distribution (named after Richard von Mises and Ronald Fisher), is a probability distribution on the (p-1)-sphere in \mathbb^. If p=2 the distribution reduces to the von Mises distribution on the c ...

, and in conjunction with distributions whose domain is a probability distribution, such as the symmetric Dirichlet distribution and the

Dirichlet process In probability theory, Dirichlet processes (after the distribution associated with Peter Gustav Lejeune Dirichlet) are a family of stochastic processes whose realizations are probability distributions. In other words, a Dirichlet process is a pro ...

. The rest of this article focuses on the latter usage. The larger the value of the concentration parameter, the more evenly distributed is the resulting distribution (the more it tends towards the uniform distribution). The smaller the value of the concentration parameter, the more sparsely distributed is the resulting distribution, with most values or ranges of values having a probability near zero (in other words, the more it tends towards a distribution concentrated on a single point, the

degenerate distribution In probability theory, a degenerate distribution on a measure space (E, \mathcal, \mu) is a probability distribution whose support is a null set with respect to \mu. For instance, in the -dimensional space endowed with the Lebesgue measure, an ...

defined by the

Dirac delta function In mathematical analysis, the Dirac delta function (or distribution), also known as the unit impulse, is a generalized function on the real numbers, whose value is zero everywhere except at zero, and whose integral over the entire real line ...

Dirichlet distribution

In the case of multivariate Dirichlet distributions, there is some confusion over how to define the concentration parameter. In the topic modelling literature, it is often defined as the sum of the individual Dirichlet parameters, when discussing symmetric Dirichlet distributions (where the parameters are the same for all dimensions) it is often defined to be the value of the single Dirichlet parameter used in all dimensions. This second definition is smaller by a factor of the dimension of the distribution. A concentration parameter of 1 (or ''k'', the dimension of the Dirichlet distribution, by the definition used in the topic modelling literature) results in all sets of probabilities being equally likely, i.e., in this case the Dirichlet distribution of dimension ''k'' is equivalent to a uniform distribution over a ''k-1''-dimensional simplex. This is ''not'' the same as what happens when the concentration parameter tends towards infinity. In the former case, all resulting distributions are equally likely (the distribution over distributions is uniform). In the latter case, only near-uniform distributions are likely (the distribution over distributions is highly peaked around the uniform distribution). Meanwhile, in the limit as the concentration parameter tends towards zero, only distributions with nearly all mass concentrated on one of their components are likely (the distribution over distributions is highly peaked around the ''k'' possible

Dirac delta distribution In mathematical analysis, the Dirac delta function (or distribution), also known as the unit impulse, is a generalized function on the real numbers, whose value is zero everywhere except at zero, and whose integral over the entire real line ...

s centered on one of the components, or in terms of the ''k''-dimensional simplex, is highly peaked at corners of the simplex).

Sparse prior

An example of where a sparse prior (concentration parameter much less than 1) is called for, consider a

topic model In statistics and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden ...

, which is used to learn the topics that are discussed in a set of documents, where each "topic" is described using a

categorical distribution In probability theory and statistics, a categorical distribution (also called a generalized Bernoulli distribution, multinoulli distribution) is a discrete probability distribution that describes the possible results of a random variable that can ...

over a vocabulary of words. A typical vocabulary might have 100,000 words, leading to a 100,000-dimensional categorical distribution. The

prior distribution A prior probability distribution of an uncertain quantity, simply called the prior, is its assumed probability distribution before some evidence is taken into account. For example, the prior could be the probability distribution representing the ...

for the parameters of the categorical distribution would likely be a symmetric Dirichlet distribution. However, a coherent topic might only have a few hundred words with any significant probability mass. Accordingly, a reasonable setting for the concentration parameter might be 0.01 or 0.001. With a larger vocabulary of around 1,000,000 words, an even smaller value, e.g. 0.0001, might be appropriate.

References

{{reflist Statistical parameters

Dirichlet distribution

Sparse prior

See also

References