In statistics, a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set should identify the sub-population to which an individual observation belongs. Formally a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. However, while problems associated with "mixture distributions" relate to deriving the properties of the overall population from those of the sub-populations, "mixture models" are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information.

Mixture models should not be confused with models for compositional data, i.e., data whose components are constrained to sum to a constant value (1, 100%, etc.). However, compositional models can be thought of as mixture models, where members of the population are sampled at random. Conversely, mixture models can be thought of as compositional models, where the total size of the population has been normalized to 1.


Structure


General mixture model

A typical finite-dimensional mixture model is a hierarchical model consisting of the following components:

* ''N'' random variables that are observed, each distributed according to a mixture of ''K'' components, with the components belonging to the same parametric family of distributions (e.g., all normal, all Zipfian, etc.) but with different parameters
* ''N'' random latent variables specifying the identity of the mixture component of each observation, each distributed according to a ''K''-dimensional categorical distribution
* A set of ''K'' mixture weights, which are probabilities that sum to 1
* A set of ''K'' parameters, each specifying the parameter of the corresponding mixture component. In many cases, each "parameter" is actually a set of parameters. For example, if the mixture components are Gaussian distributions, there will be a mean and variance for each component. If the mixture components are categorical distributions (e.g., when each observation is a token from a finite alphabet of size ''V''), there will be a vector of ''V'' probabilities summing to 1.

In addition, in a Bayesian setting, the mixture weights and parameters will themselves be random variables, and prior distributions will be placed over the variables. In such a case, the weights are typically viewed as a ''K''-dimensional random vector drawn from a Dirichlet distribution (the conjugate prior of the categorical distribution), and the parameters will be distributed according to their respective conjugate priors.

Mathematically, a basic parametric mixture model can be described as follows:

: \begin{align}
K &= \text{number of mixture components} \\
N &= \text{number of observations} \\
\theta_{i=1 \dots K} &= \text{parameter of distribution of observation associated with component } i \\
\phi_{i=1 \dots K} &= \text{mixture weight, i.e., prior probability of component } i \\
\boldsymbol\phi &= K\text{-dimensional vector composed of all the individual } \phi_{1 \dots K}\text{; must sum to 1} \\
z_{i=1 \dots N} &= \text{component of observation } i \\
x_{i=1 \dots N} &= \text{observation } i \\
F(x, \theta) &= \text{probability distribution of an observation, parametrized on } \theta \\
z_{i=1 \dots N} &\sim \operatorname{Categorical}(\boldsymbol\phi) \\
x_{i=1 \dots N} \mid z_{i=1 \dots N} &\sim F(\theta_{z_i})
\end{align}

In a Bayesian setting, all parameters are associated with random variables, as follows:

: \begin{align}
K, N &= \text{as above} \\
\theta_{i=1 \dots K}, \phi_{i=1 \dots K}, \boldsymbol\phi &= \text{as above} \\
z_{i=1 \dots N}, x_{i=1 \dots N}, F(x, \theta) &= \text{as above} \\
\alpha &= \text{shared hyperparameter for component parameters} \\
\beta &= \text{shared hyperparameter for mixture weights} \\
H(\theta, \alpha) &= \text{prior probability distribution of component parameters, parametrized on } \alpha \\
\theta_{i=1 \dots K} &\sim H(\theta, \alpha) \\
\boldsymbol\phi &\sim \operatorname{Symmetric-Dirichlet}_K(\beta) \\
z_{i=1 \dots N} \mid \boldsymbol\phi &\sim \operatorname{Categorical}(\boldsymbol\phi) \\
x_{i=1 \dots N} \mid z_{i=1 \dots N}, \theta_{i=1 \dots K} &\sim F(\theta_{z_i})
\end{align}

This characterization uses ''F'' and ''H'' to describe arbitrary distributions over observations and parameters, respectively. Typically ''H'' will be the conjugate prior of ''F''. The two most common choices of ''F'' are Gaussian aka "normal" (for real-valued observations) and categorical (for discrete observations). Other common possibilities for the distribution of the mixture components are:

* Binomial distribution, for the number of "positive occurrences" (e.g., successes, yes votes, etc.) given a fixed number of total occurrences
* Multinomial distribution, similar to the binomial distribution, but for counts of multi-way occurrences (e.g., yes/no/maybe in a survey)
* Negative binomial distribution, for binomial-type observations but where the quantity of interest is the number of failures before a given number of successes occurs
* Poisson distribution, for the number of occurrences of an event in a given period of time, for an event that is characterized by a fixed rate of occurrence
* Exponential distribution, for the time before the next event occurs, for an event that is characterized by a fixed rate of occurrence
* Log-normal distribution, for positive real numbers that are assumed to grow exponentially, such as incomes or prices
* Multivariate normal distribution (aka multivariate Gaussian distribution), for vectors of correlated outcomes that are individually Gaussian-distributed
* Multivariate Student's ''t''-distribution, for vectors of heavy-tailed correlated outcomes
* A vector of Bernoulli-distributed values, corresponding, e.g., to a black-and-white image, with each value representing a pixel; see the handwriting-recognition example below
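
As an illustration of the generative process defined above, the following is a minimal Python sketch (an assumption for illustration, using numpy and made-up parameters, not code from any referenced source). Component identities are drawn from a categorical distribution with weights \boldsymbol\phi, and each observation is then drawn from the corresponding component distribution, here taken to be univariate normal.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example parameters: K = 3 components.
phi = np.array([0.5, 0.3, 0.2])                 # mixture weights, sum to 1
theta = [(-2.0, 0.5), (0.0, 1.0), (3.0, 0.8)]   # (mean, std) of each normal component

N = 1000
# z_i ~ Categorical(phi): component identity of each observation
z = rng.choice(len(phi), size=N, p=phi)
# x_i | z_i ~ F(theta_{z_i}), with F taken here to be a normal distribution
x = np.array([rng.normal(*theta[k]) for k in z])
</syntaxhighlight>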


Specific examples


Gaussian mixture model

A typical non-Bayesian Gaussian mixture model looks like this:

: \begin{align}
K, N &= \text{as above} \\
\phi_{i=1 \dots K}, \boldsymbol\phi &= \text{as above} \\
z_{i=1 \dots N}, x_{i=1 \dots N} &= \text{as above} \\
\theta_{i=1 \dots K} &= \{ \mu_{i=1 \dots K}, \sigma^2_{i=1 \dots K} \} \\
\mu_{i=1 \dots K} &= \text{mean of component } i \\
\sigma^2_{i=1 \dots K} &= \text{variance of component } i \\
z_{i=1 \dots N} &\sim \operatorname{Categorical}(\boldsymbol\phi) \\
x_{i=1 \dots N} &\sim \mathcal{N}(\mu_{z_i}, \sigma^2_{z_i})
\end{align}

A Bayesian version of a Gaussian mixture model is as follows:

: \begin{align}
K, N &= \text{as above} \\
\phi_{i=1 \dots K}, \boldsymbol\phi &= \text{as above} \\
z_{i=1 \dots N}, x_{i=1 \dots N} &= \text{as above} \\
\theta_{i=1 \dots K} &= \{ \mu_{i=1 \dots K}, \sigma^2_{i=1 \dots K} \} \\
\mu_{i=1 \dots K} &= \text{mean of component } i \\
\sigma^2_{i=1 \dots K} &= \text{variance of component } i \\
\mu_0, \lambda, \nu, \sigma_0^2 &= \text{shared hyperparameters} \\
\mu_{i=1 \dots K} &\sim \mathcal{N}(\mu_0, \lambda\sigma_i^2) \\
\sigma_{i=1 \dots K}^2 &\sim \operatorname{Inverse-Gamma}(\nu, \sigma_0^2) \\
\boldsymbol\phi &\sim \operatorname{Symmetric-Dirichlet}_K(\beta) \\
z_{i=1 \dots N} &\sim \operatorname{Categorical}(\boldsymbol\phi) \\
x_{i=1 \dots N} &\sim \mathcal{N}(\mu_{z_i}, \sigma^2_{z_i})
\end{align}
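
For a concrete illustration, the non-Bayesian model above can be fit by maximum likelihood with off-the-shelf tools. The following sketch (an assumption for illustration: it uses the scikit-learn library and made-up one-dimensional data) relies on sklearn.mixture.GaussianMixture, which implements EM for this kind of model.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Made-up 1-D data drawn from two normal components for illustration.
x = np.concatenate([rng.normal(-2.0, 0.5, 300),
                    rng.normal(3.0, 1.0, 700)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
print(gmm.weights_)              # estimated mixture weights phi_i
print(gmm.means_.ravel())        # estimated component means mu_i
print(gmm.covariances_.ravel())  # estimated component variances sigma_i^2
labels = gmm.predict(x)          # most probable component z_i for each observation
</syntaxhighlight>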


Multivariate Gaussian mixture model

A Bayesian Gaussian mixture model is commonly extended to fit a vector of unknown parameters (denoted in bold), or multivariate normal distributions. In a multivariate distribution (i.e. one modelling a vector \boldsymbol{x} with ''N'' random variables) one may model a vector of parameters (such as several observations of a signal or patches within an image) using a Gaussian mixture model prior distribution on the vector of estimates given by

: p(\boldsymbol\theta) = \sum_{i=1}^K \phi_i \mathcal{N}(\boldsymbol\mu_i, \boldsymbol\Sigma_i)

where the ''i''th vector component is characterized by normal distributions with weights \phi_i, means \boldsymbol\mu_i and covariance matrices \boldsymbol\Sigma_i. To incorporate this prior into a Bayesian estimation, the prior is multiplied with the known distribution p(\boldsymbol{x} \mid \boldsymbol\theta) of the data \boldsymbol{x} conditioned on the parameters \boldsymbol\theta to be estimated. With this formulation, the posterior distribution p(\boldsymbol\theta \mid \boldsymbol{x}) is ''also'' a Gaussian mixture model of the form

: p(\boldsymbol\theta \mid \boldsymbol{x}) = \sum_{i=1}^K \tilde{\phi}_i \mathcal{N}(\tilde{\boldsymbol\mu}_i, \tilde{\boldsymbol\Sigma}_i)

with new parameters \tilde{\phi}_i, \tilde{\boldsymbol\mu}_i and \tilde{\boldsymbol\Sigma}_i that are updated using the EM algorithm. Although EM-based parameter updates are well-established, providing the initial estimates for these parameters is currently an area of active research. Note that this formulation yields a closed-form solution to the complete posterior distribution. Estimations of the random variable \boldsymbol\theta may be obtained via one of several estimators, such as the mean or maximum of the posterior distribution.

Such distributions are useful for assuming patch-wise shapes of images and clusters, for example. In the case of image representation, each Gaussian may be tilted, expanded, and warped according to the covariance matrices \boldsymbol\Sigma_i. One Gaussian distribution of the set is fit to each patch (usually of size 8×8 pixels) in the image. Notably, any distribution of points around a cluster (see ''k''-means) may be accurately modelled given enough Gaussian components, but rarely more than ''K''=20 components are needed to accurately model a given image distribution or cluster of data.
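
As a small illustration of the mixture-of-Gaussians prior above, the following sketch (an assumption for illustration, using scipy and made-up weights, means and covariances) evaluates p(\boldsymbol\theta) = \sum_i \phi_i \mathcal{N}(\boldsymbol\theta; \boldsymbol\mu_i, \boldsymbol\Sigma_i) at a given point.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-D mixture prior with K = 2 components.
phi = np.array([0.6, 0.4])
mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]

def mixture_density(theta):
    """Evaluate p(theta) = sum_i phi_i * N(theta; mu_i, Sigma_i)."""
    return sum(w * multivariate_normal(mean=m, cov=S).pdf(theta)
               for w, m, S in zip(phi, mus, sigmas))

print(mixture_density(np.array([1.0, 0.5])))
</syntaxhighlight>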


Categorical mixture model

A typical non-Bayesian mixture model with categorical observations looks like this:

* K, N: as above
* \phi_{i=1 \dots K}, \boldsymbol\phi: as above
* z_{i=1 \dots N}, x_{i=1 \dots N}: as above
* V: dimension of categorical observations, e.g., size of word vocabulary
* \theta_{i=1 \dots K, j=1 \dots V}: probability for component i of observing item j
* \boldsymbol\theta_{i=1 \dots K}: vector of dimension V, composed of \theta_{i, 1 \dots V}; must sum to 1

The random variables:

: \begin{align}
z_{i=1 \dots N} &\sim \operatorname{Categorical}(\boldsymbol\phi) \\
x_{i=1 \dots N} &\sim \operatorname{Categorical}(\boldsymbol\theta_{z_i})
\end{align}

A typical Bayesian mixture model with categorical observations looks like this:

* K, N: as above
* \phi_{i=1 \dots K}, \boldsymbol\phi: as above
* z_{i=1 \dots N}, x_{i=1 \dots N}: as above
* V: dimension of categorical observations, e.g., size of word vocabulary
* \theta_{i=1 \dots K, j=1 \dots V}: probability for component i of observing item j
* \boldsymbol\theta_{i=1 \dots K}: vector of dimension V, composed of \theta_{i, 1 \dots V}; must sum to 1
* \alpha: shared concentration hyperparameter of \boldsymbol\theta for each component
* \beta: concentration hyperparameter of \boldsymbol\phi

The random variables:

: \begin{align}
\boldsymbol\phi &\sim \operatorname{Symmetric-Dirichlet}_K(\beta) \\
\boldsymbol\theta_{i=1 \dots K} &\sim \operatorname{Symmetric-Dirichlet}_V(\alpha) \\
z_{i=1 \dots N} &\sim \operatorname{Categorical}(\boldsymbol\phi) \\
x_{i=1 \dots N} &\sim \operatorname{Categorical}(\boldsymbol\theta_{z_i})
\end{align}
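
The Bayesian version above can be simulated directly. The following Python sketch (illustrative assumptions only: numpy, with small made-up values of K, V and the hyperparameters) draws the weights and the per-component item probabilities from their symmetric Dirichlet priors and then generates observations.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)

K, V, N = 3, 5, 20       # components, vocabulary size, observations (made up)
alpha, beta = 0.5, 1.0   # concentration hyperparameters (made up)

phi = rng.dirichlet(np.full(K, beta))               # mixture weights ~ Dirichlet_K(beta)
theta = rng.dirichlet(np.full(V, alpha), size=K)    # item probabilities ~ Dirichlet_V(alpha), one row per component

z = rng.choice(K, size=N, p=phi)                    # component of each observation
x = np.array([rng.choice(V, p=theta[k]) for k in z])  # observed items
</syntaxhighlight>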


Examples


A financial model

Financial returns often behave differently in normal situations and during crisis times, so a mixture model for return data seems reasonable. Sometimes the model used is a jump-diffusion model, or a mixture of two normal distributions.


House prices

Assume that we observe the prices of ''N'' different houses. Different types of houses in different neighborhoods will have vastly different prices, but the price of a particular type of house in a particular neighborhood (e.g., a three-bedroom house in a moderately upscale neighborhood) will tend to cluster fairly closely around the mean. One possible model of such prices would be to assume that the prices are accurately described by a mixture model with ''K'' different components, each distributed as a normal distribution with unknown mean and variance, with each component specifying a particular combination of house type/neighborhood. Fitting this model to observed prices, e.g., using the expectation-maximization algorithm, would tend to cluster the prices according to house type/neighborhood and reveal the spread of prices in each type/neighborhood. (Note that for values such as prices or incomes that are guaranteed to be positive and which tend to grow exponentially, a log-normal distribution might actually be a better model than a normal distribution.)


Topics in a document

Assume that a document is composed of ''N'' different words from a total vocabulary of size ''V'', where each word corresponds to one of ''K'' possible topics. The distribution of such words could be modelled as a mixture of ''K'' different ''V''-dimensional categorical distributions. A model of this sort is commonly termed a topic model. Note that expectation maximization applied to such a model will typically fail to produce realistic results, due (among other things) to the excessive number of parameters. Some sorts of additional assumptions are typically necessary to get good results. Typically two sorts of additional components are added to the model:

# A prior distribution is placed over the parameters describing the topic distributions, using a Dirichlet distribution with a concentration parameter that is set significantly below 1, so as to encourage sparse distributions (where only a small number of words have significantly non-zero probabilities).
# Some sort of additional constraint is placed over the topic identities of words, to take advantage of natural clustering.
:* For example, a Markov chain could be placed on the topic identities (i.e., the latent variables specifying the mixture component of each observation), corresponding to the fact that nearby words belong to similar topics. (This results in a hidden Markov model, specifically one where a prior distribution is placed over state transitions that favors transitions that stay in the same state.)
:* Another possibility is the latent Dirichlet allocation model, which divides up the words into ''D'' different documents and assumes that in each document only a small number of topics occur with any frequency.


Handwriting recognition

The following example is based on an example in Christopher M. Bishop, ''Pattern Recognition and Machine Learning''. Imagine that we are given an ''N''×''N'' black-and-white image that is known to be a scan of a hand-written digit between 0 and 9, but we don't know which digit is written. We can create a mixture model with K=10 different components, where each component is a vector of size N^2 of Bernoulli distributions (one per pixel). Such a model can be trained with the expectation-maximization algorithm on an unlabeled set of hand-written digits, and will effectively cluster the images according to the digit being written. The same model could then be used to recognize the digit of another image simply by holding the parameters constant, computing the probability of the new image for each possible digit (a trivial calculation), and returning the digit that generated the highest probability.
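
The clustering procedure described here can be written compactly. The following is a minimal numpy sketch of EM for a mixture of independent-Bernoulli vectors (an illustrative assumption, not the reference implementation from Bishop's book), where X is an array of binary pixel vectors.

<syntaxhighlight lang="python">
import numpy as np

def bernoulli_mixture_em(X, K=10, n_iter=50, seed=0):
    """EM for a mixture of independent-Bernoulli components.
    X: (N, D) binary array, e.g., flattened black-and-white digit images."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                   # mixture weights
    mu = rng.uniform(0.25, 0.75, size=(K, D))  # per-pixel "on" probabilities, one row per component
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for numerical stability
        log_p = (X @ np.log(mu).T) + ((1 - X) @ np.log(1 - mu).T) + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)      # (N, K) responsibilities
        # M-step: update weights and Bernoulli parameters
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = np.clip((r.T @ X) / Nk[:, None], 1e-6, 1 - 1e-6)
    return pi, mu, r

# Usage (hypothetical): pi, mu, r = bernoulli_mixture_em(images.reshape(len(images), -1))
</syntaxhighlight>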


Assessing projectile accuracy (a.k.a. circular error probable, CEP)

Mixture models apply in the problem of directing multiple projectiles at a target (as in air, land, or sea defense applications), where the physical and/or statistical characteristics of the projectiles differ within the multiple projectiles. An example might be shots from multiple munitions types or shots from multiple locations directed at one target. The combination of projectile types may be characterized as a Gaussian mixture model. Further, a well-known measure of accuracy for a group of projectiles is the circular error probable (CEP), which is the number ''R'' such that, on average, half of the group of projectiles falls within the circle of radius ''R'' about the target point. The mixture model can be used to determine (or estimate) the value ''R''. The mixture model properly captures the different types of projectiles.
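
One way to estimate ''R'' from a Gaussian mixture of impact points is by simulation. The sketch below (an illustrative assumption with made-up parameters, not a fielded method) draws impact points from a two-component 2-D Gaussian mixture around the target and takes the median radial miss distance as the CEP estimate.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)

# Made-up two-component mixture of impact points (two munition types), target at the origin.
phi = np.array([0.7, 0.3])
mus = [np.array([0.0, 0.0]), np.array([5.0, -2.0])]
covs = [np.diag([4.0, 4.0]), np.diag([25.0, 16.0])]

n = 100_000
z = rng.choice(2, size=n, p=phi)
counts = np.bincount(z, minlength=2)
impacts = np.vstack([rng.multivariate_normal(mus[k], covs[k], size=counts[k])
                     for k in range(2)])

# CEP: radius R such that half of the impacts fall within distance R of the target.
cep = np.median(np.linalg.norm(impacts, axis=1))
print(f"Estimated CEP: {cep:.2f}")
</syntaxhighlight>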


Direct and indirect applications

The financial example above is one direct application of the mixture model, a situation in which we assume an underlying mechanism so that each observation belongs to one of some number of different sources or categories. This underlying mechanism may or may not, however, be observable. In this form of mixture, each of the sources is described by a component probability density function, and its mixture weight is the probability that an observation comes from this component.

In an indirect application of the mixture model we do not assume such a mechanism. The mixture model is simply used for its mathematical flexibility. For example, a mixture of two normal distributions with different means may result in a density with two modes, which is not modeled by standard parametric distributions. Another example is given by the possibility of mixture distributions to model fatter tails than the basic Gaussian ones, so as to be a candidate for modeling more extreme events. When combined with dynamical consistency, this approach has been applied to financial derivatives valuation in the presence of the volatility smile in the context of local volatility models.


Predictive Maintenance

Mixture model-based clustering is also predominantly used in identifying the state of a machine in predictive maintenance. Density plots are used to analyze the density of high-dimensional features. If multi-modal densities are observed, then it is assumed that a finite set of densities are formed by a finite set of normal mixtures. A multivariate Gaussian mixture model is used to cluster the feature data into ''k'' groups, where ''k'' represents each state of the machine. The machine state can be a normal state, power-off state, or faulty state. Each formed cluster can be diagnosed using techniques such as spectral analysis. In recent years, this has also been widely used in other areas such as early fault detection.


Fuzzy image segmentation

In image processing and computer vision, traditional image segmentation models often assign to one pixel only one exclusive pattern. In fuzzy or soft segmentation, any pattern can have certain "ownership" over any single pixel. If the patterns are Gaussian, fuzzy segmentation naturally results in Gaussian mixtures. Combined with other analytic or geometric tools (e.g., phase transitions over diffusive boundaries), such spatially regularized mixture models could lead to more realistic and computationally efficient segmentation methods.


Point set registration

Probabilistic mixture models such as Gaussian mixture models (GMM) are used to resolve point set registration problems in image processing and computer vision. For pair-wise point set registration, one point set is regarded as the centroids of the mixture model, and the other point set is regarded as data points (observations). State-of-the-art methods include coherent point drift (CPD) and Student's ''t''-distribution mixture models (TMM). Results of recent research demonstrate the superiority of hybrid mixture models (e.g., combining the Student's ''t''-distribution and the Watson distribution/Bingham distribution to model spatial positions and axis orientations separately) compared to CPD and TMM, in terms of inherent robustness, accuracy and discriminative capacity.


Identifiability

Identifiability refers to the existence of a unique characterization for any one of the models in the class (family) being considered. Estimation procedures may not be well-defined and asymptotic theory may not hold if a model is not identifiable.


Example

Let ''J'' be the class of all binomial distributions with ''n'' = 2. Then a mixture of two members of ''J'' would have

:p_0=\pi(1-\theta_1)^2+(1-\pi)(1-\theta_2)^2
:p_1=2\pi\theta_1(1-\theta_1)+2(1-\pi)\theta_2(1-\theta_2)

and p_2=1-p_0-p_1. Clearly, given ''p''0 and ''p''1, it is not possible to determine the above mixture model uniquely, as there are three parameters (''π'', ''θ''1, ''θ''2) to be determined.


Definition

Consider a mixture of parametric distributions of the same class. Let

:J=\{f(\cdot;\theta):\theta\in\Omega\}

be the class of all component distributions. Then the convex hull ''K'' of ''J'' defines the class of all finite mixtures of distributions in ''J'':

:K=\left\{p(\cdot):p(\cdot)=\sum_{i=1}^n a_i f_i(\cdot;\theta_i),\ a_i>0,\ \sum_{i=1}^n a_i=1,\ f_i(\cdot;\theta_i)\in J,\ n\ge 1\right\}

''K'' is said to be identifiable if all its members are unique, that is, given two members ''p'' and ''p''′ in ''K'', being mixtures of ''k'' distributions and ''k''′ distributions respectively in ''J'', we have ''p'' = ''p''′ if and only if, first of all, ''k'' = ''k''′ and secondly we can reorder the summations such that ''a''i = ''a''′i and ''f''i = ''f''′i for all ''i''.


Parameter estimation and system identification

Parametric mixture models are often used when we know the distribution ''Y'' and we can sample from ''X'', but we would like to determine the ''a''i and ''θ''i values. Such situations can arise in studies in which we sample from a population that is composed of several distinct subpopulations.

It is common to think of probability mixture modeling as a missing data problem. One way to understand this is to assume that the data points under consideration have "membership" in one of the distributions we are using to model the data. When we start, this membership is unknown, or missing. The job of estimation is to devise appropriate parameters for the model functions we choose, with the connection to the data points being represented as their membership in the individual model distributions.

A variety of approaches to the problem of mixture decomposition have been proposed, many of which focus on maximum likelihood methods such as expectation maximization (EM) or maximum ''a posteriori'' estimation (MAP). Generally these methods consider separately the questions of system identification and parameter estimation; methods to determine the number and functional form of components within a mixture are distinguished from methods to estimate the corresponding parameter values. Some notable departures are the graphical methods as outlined in Tarter and Lock and more recently minimum message length (MML) techniques such as Figueiredo and Jain and to some extent the moment matching pattern analysis routines suggested by McWilliam and Loh (2009).


Expectation maximization (EM)

Expectation maximization (EM) is seemingly the most popular technique used to determine the parameters of a mixture with an ''a priori'' given number of components. This is a particular way of implementing maximum likelihood estimation for this problem. EM is of particular appeal for finite normal mixtures where closed-form expressions are possible, such as in the following iterative algorithm by Dempster ''et al.'' (1977):

: w_s^{(j+1)} = \frac{1}{N}\sum_{t=1}^N h_s^{(j)}(t)

: \mu_s^{(j+1)} = \frac{\sum_{t=1}^N h_s^{(j)}(t)\, x^{(t)}}{\sum_{t=1}^N h_s^{(j)}(t)}

: \Sigma_s^{(j+1)} = \frac{\sum_{t=1}^N h_s^{(j)}(t)\left[x^{(t)}-\mu_s^{(j+1)}\right]\left[x^{(t)}-\mu_s^{(j+1)}\right]^\top}{\sum_{t=1}^N h_s^{(j)}(t)}

with the posterior probabilities

: h_s^{(j)}(t) = \frac{w_s^{(j)}\, p_s\!\left(x^{(t)};\mu_s^{(j)},\Sigma_s^{(j)}\right)}{\sum_{r=1}^K w_r^{(j)}\, p_r\!\left(x^{(t)};\mu_r^{(j)},\Sigma_r^{(j)}\right)}.

Thus on the basis of the current estimate for the parameters, the conditional probability for a given observation ''x''(''t'') being generated from state ''s'' is determined for each ''t'' = 1, …, ''N''; ''N'' being the sample size. The parameters are then updated such that the new component weights correspond to the average conditional probability and each component mean and covariance is the component-specific weighted average of the mean and covariance of the entire sample.

Dempster also showed that each successive EM iteration will not decrease the likelihood, a property not shared by other gradient-based maximization techniques. Moreover, EM naturally embeds within it constraints on the probability vector, and for sufficiently large sample sizes positive definiteness of the covariance iterates. This is a key advantage since explicitly constrained methods incur extra computational costs to check and maintain appropriate values. Theoretically EM is a first-order algorithm and as such converges slowly to a fixed-point solution. Redner and Walker (1984) make this point, arguing in favour of superlinear and second-order Newton and quasi-Newton methods and reporting slow convergence in EM on the basis of their empirical tests. They do concede that convergence in likelihood was rapid even if convergence in the parameter values themselves was not. The relative merits of EM and other algorithms vis-à-vis convergence have been discussed in other literature.

Other common objections to the use of EM are that it has a propensity to spuriously identify local maxima, as well as displaying sensitivity to initial values. One may address these problems by evaluating EM at several initial points in the parameter space, but this is computationally costly and other approaches, such as the annealing EM method of Ueda and Nakano (1998) (in which the initial components are essentially forced to overlap, providing a less heterogeneous basis for initial guesses), may be preferable.

Figueiredo and Jain note that convergence to 'meaningless' parameter values obtained at the boundary (where regularity conditions break down, e.g., Ghosh and Sen (1985)) is frequently observed when the number of model components exceeds the optimal/true one. On this basis they suggest a unified approach to estimation and identification in which the initial ''n'' is chosen to greatly exceed the expected optimal value. Their optimization routine is constructed via a minimum message length (MML) criterion that effectively eliminates a candidate component if there is insufficient information to support it. In this way it is possible to systematize reductions in ''n'' and consider estimation and identification jointly.


The expectation step

With initial guesses for the parameters of our mixture model, "partial membership" of each data point in each constituent distribution is computed by calculating expectation values for the membership variables of each data point. That is, for each data point ''x''j and distribution ''Y''i, the membership value ''y''i, j is:

: y_{i,j} = \frac{a_i f_Y(x_j;\theta_i)}{f_X(x_j)}.


The maximization step

With expectation values in hand for group membership, plug-in estimates are recomputed for the distribution parameters. The mixing coefficients ''a''i are the means of the membership values over the ''N'' data points:

: a_i = \frac{1}{N}\sum_{j=1}^N y_{i,j}

The component model parameters ''θ''i are also calculated by expectation maximization using data points ''x''j that have been weighted using the membership values. For example, if ''θ'' is a mean ''μ'':

: \mu_i = \frac{\sum_j y_{i,j}\, x_j}{\sum_j y_{i,j}}.

With new estimates for ''a''i and the ''θ''i's, the expectation step is repeated to recompute new membership values. The entire procedure is repeated until model parameters converge.
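
Putting the expectation and maximization steps together, the following is a minimal numpy sketch of EM for a univariate Gaussian mixture, following the update formulas above (an illustrative assumption, not a reference implementation; the function name and initialization are made up).

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

def em_gaussian_mixture(x, K=2, n_iter=100, seed=0):
    """EM for a univariate Gaussian mixture, following the E and M steps above."""
    rng = np.random.default_rng(seed)
    N = len(x)
    a = np.full(K, 1.0 / K)                     # mixing coefficients a_i
    mu = rng.choice(x, size=K, replace=False)   # crude initial means
    sigma = np.full(K, x.std())
    for _ in range(n_iter):
        # E-step: membership values y[i, j] of data point j in component i
        dens = np.array([a_i * norm.pdf(x, m, s) for a_i, m, s in zip(a, mu, sigma)])
        y = dens / (dens.sum(axis=0, keepdims=True) + 1e-300)
        # M-step: plug-in estimates from membership-weighted data
        Nk = y.sum(axis=1)
        a = Nk / N
        mu = (y @ x) / Nk
        sigma = np.sqrt(np.array([np.sum(y[i] * (x - mu[i]) ** 2) for i in range(K)]) / Nk)
    return a, mu, sigma

# Usage (hypothetical): a, mu, sigma = em_gaussian_mixture(observations, K=3)
</syntaxhighlight>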


Markov chain Monte Carlo

As an alternative to the EM algorithm, the mixture model parameters can be deduced using posterior sampling as indicated by Bayes' theorem. This is still regarded as an incomplete data problem whereby membership of data points is the missing data. A two-step iterative procedure known as Gibbs sampling can be used. The previous example of a mixture of two Gaussian distributions can demonstrate how the method works. As before, initial guesses of the parameters for the mixture model are made. Instead of computing partial memberships for each elemental distribution, a membership value for each data point is drawn from a Bernoulli distribution (that is, it will be assigned to either the first or the second Gaussian). The Bernoulli parameter ''θ'' is determined for each data point on the basis of one of the constituent distributions. Draws from the distribution generate membership associations for each data point. Plug-in estimators can then be used as in the M step of EM to generate a new set of mixture model parameters, and the binomial draw step repeated.
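
The two-step procedure can be sketched as follows for a two-component univariate Gaussian mixture (an illustrative assumption: a simplified data-augmentation scheme with plug-in parameter updates, as described in the text; a full Gibbs sampler would also draw the parameters from their conditional posteriors).

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

def stochastic_membership_sampler(x, n_iter=200, seed=0):
    """Two-step sampler for a two-component univariate Gaussian mixture:
    (1) draw a hard membership for each point from its Bernoulli posterior,
    (2) recompute plug-in parameter estimates from the assigned points."""
    rng = np.random.default_rng(seed)
    N = len(x)
    w = 0.5                                              # weight of the second component
    mu = np.array([x.min(), x.max()], dtype=float)       # rough initial means
    sigma = np.array([x.std(), x.std()], dtype=float)
    for _ in range(n_iter):
        # Bernoulli parameter: posterior probability of the second component for each point
        p1 = w * norm.pdf(x, mu[1], sigma[1])
        p0 = (1 - w) * norm.pdf(x, mu[0], sigma[0])
        theta = p1 / (p0 + p1 + 1e-300)
        z = rng.random(N) < theta                        # membership draw for each data point
        # Plug-in estimates from the hard assignments, as in the M step of EM
        for k, mask in enumerate([~z, z]):
            if mask.sum() > 1:
                mu[k] = x[mask].mean()
                sigma[k] = x[mask].std() + 1e-6
        w = z.mean()
    return w, mu, sigma
</syntaxhighlight>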


Moment matching

The method of moment matching is one of the oldest techniques for determining the mixture parameters dating back to Karl Pearson's seminal work of 1894. In this approach the parameters of the mixture are determined such that the composite distribution has moments matching some given value. In many instances extraction of solutions to the moment equations may present non-trivial algebraic or computational problems. Moreover, numerical analysis by Day has indicated that such methods may be inefficient compared to EM. Nonetheless, there has been renewed interest in this method, e.g., Craigmile and Titterington (1998) and Wang. McWilliam and Loh (2009) consider the characterisation of a hyper-cuboid normal mixture copula in large dimensional systems for which EM would be computationally prohibitive. Here a pattern analysis routine is used to generate multivariate tail-dependencies consistent with a set of univariate and (in some sense) bivariate moments. The performance of this method is then evaluated using equity log-return data with Kolmogorov–Smirnov test statistics suggesting a good descriptive fit.


Spectral method

Some problems in mixture model estimation can be solved using spectral methods. In particular it becomes useful if data points ''x''i are points in high-dimensional real space, and the hidden distributions are known to be log-concave (such as the Gaussian distribution or the exponential distribution).

Spectral methods of learning mixture models are based on the use of the singular value decomposition of a matrix which contains data points. The idea is to consider the top ''k'' singular vectors, where ''k'' is the number of distributions to be learned. The projection of each data point to a linear subspace spanned by those vectors groups points originating from the same distribution very close together, while points from different distributions stay far apart.

One distinctive feature of the spectral method is that it allows us to prove that if distributions satisfy a certain separation condition (e.g., not too close), then the estimated mixture will be very close to the true one with high probability.
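
The core projection step can be sketched briefly. The following (an illustrative assumption using numpy and scikit-learn, with made-up well-separated data) projects the data onto the span of the top ''k'' right singular vectors and then clusters in that subspace.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)

# Made-up data: two well-separated Gaussian clusters in 50 dimensions.
k, d = 2, 50
centers = rng.normal(0, 10, size=(k, d))
X = np.vstack([rng.normal(c, 1.0, size=(500, d)) for c in centers])

# Top-k singular vectors of the centered data matrix
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_proj = Xc @ Vt[:k].T          # project onto the k-dimensional spectral subspace

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_proj)
</syntaxhighlight>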


Graphical Methods

Tarter and Lock describe a graphical approach to mixture identification in which a kernel function is applied to an empirical frequency plot so as to reduce intra-component variance. In this way one may more readily identify components having differing means. While this ''λ''-method does not require prior knowledge of the number or functional form of the components, its success does rely on the choice of the kernel parameters, which to some extent implicitly embeds assumptions about the component structure.


Other methods

Some of them can even provably learn mixtures of heavy-tailed distributions, including those with infinite variance (see links to papers below). In this setting, EM-based methods would not work, since the expectation step would diverge due to the presence of outliers.


A simulation

To simulate a sample of size ''N'' that is from a mixture of distributions ''F''i, ''i'' = 1 to ''n'', with probabilities ''p''i (where \sum_i p_i = 1):

# Generate ''N'' random numbers from a categorical distribution of size ''n'' and probabilities ''p''i for ''i'' = 1 to ''n''. These tell you which of the ''F''i each of the ''N'' values will come from. Denote by ''m''i the quantity of random numbers assigned to the ''i''th category.
# For each ''i'', generate ''m''i random numbers from the ''F''i distribution.

A code sketch of this procedure is given below.
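
The following is a minimal Python sketch of the two-step procedure above (an illustrative assumption: numpy, with the ''F''i taken to be normal distributions with made-up parameters).

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(5)

N = 10_000
p = np.array([0.2, 0.5, 0.3])                   # mixing probabilities, sum to 1
components = [lambda m: rng.normal(-3, 1, m),   # F_1, F_2, F_3 (made-up normals)
              lambda m: rng.normal(0, 2, m),
              lambda m: rng.normal(4, 0.5, m)]

# Step 1: draw the category counts m_i with a single multinomial draw of size N
m = rng.multinomial(N, p)

# Step 2: for each i, generate m_i values from F_i, then pool (and optionally shuffle)
sample = np.concatenate([F(m_i) for F, m_i in zip(components, m)])
rng.shuffle(sample)
</syntaxhighlight>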


Extensions

In a Bayesian setting, additional levels can be added to the graphical model defining the mixture model. For example, in the common latent Dirichlet allocation topic model, the observations are sets of words drawn from ''D'' different documents and the ''K'' mixture components represent topics that are shared across documents. Each document has a different set of mixture weights, which specify the topics prevalent in that document. All sets of mixture weights share common hyperparameters.

A very common extension is to connect the latent variables defining the mixture component identities into a Markov chain, instead of assuming that they are independent identically distributed random variables. The resulting model is termed a hidden Markov model and is one of the most common sequential hierarchical models. Numerous extensions of hidden Markov models have been developed; see the resulting article for more information.


History

Mixture distributions and the problem of mixture decomposition, that is the identification of its constituent components and the parameters thereof, has been cited in the literature as far back as 1846 (Quetelet in McLachlan, 2000), although common reference is made to the work of Karl Pearson (1894) as the first author to explicitly address the decomposition problem in characterising non-normal attributes of forehead to body length ratios in female shore crab populations. The motivation for this work was provided by the zoologist Walter Frank Raphael Weldon, who had speculated in 1893 (in Tarter and Lock) that asymmetry in the histogram of these ratios could signal evolutionary divergence. Pearson's approach was to fit a univariate mixture of two normals to the data by choosing the five parameters of the mixture such that the empirical moments matched those of the model. While his work was successful in identifying two potentially distinct sub-populations and in demonstrating the flexibility of mixtures as a moment matching tool, the formulation required the solution of a 9th degree (nonic) polynomial, which at the time posed a significant computational challenge.

Subsequent works focused on addressing these problems, but it was not until the advent of the modern computer and the popularisation of maximum likelihood (MLE) parameterisation techniques that research really took off. Since that time there has been a vast body of research on the subject spanning areas such as fisheries research, agriculture, botany, economics, medicine, genetics, psychology, palaeontology, electrophoresis, finance, geology and zoology.


See also


Mixture

* Mixture density
* Mixture (probability)
* Flexible Mixture Model (FMM)
* Subspace Gaussian mixture model


Hierarchical models

* Graphical model
* Hierarchical Bayes model


Outlier detection

* RANSAC




External links

* The SOCR demonstrations of EM and Mixture Modeling
* A program for Minimum Message Length (MML) applied to finite mixture models, maintained by D.L. Dowe
* PyMix – Python Mixture Package, algorithms and data structures for a broad variety of mixture-model-based data mining applications in Python
* A module from the scikit-learn Python library for learning Gaussian mixture models (and sampling from them), previously packaged with SciPy and now packaged as a SciKit
* GMM.m – Matlab code for GMM implementation
* C++ implementation of Bayesian mixture models using EM and MCMC, with 100x speed acceleration using GPGPU
* Matlab code for GMM implementation using the EM algorithm
* jMEF – a Java open source library for learning and processing mixtures of exponential families (using duality with Bregman divergences); includes a Matlab wrapper
* A very fast and clean C implementation of the Expectation Maximization (EM) algorithm for estimating Gaussian mixture models (GMMs)
* An R package for mixture modeling
* dpgmm – pure Python Dirichlet process Gaussian mixture model implementation (variational)
* Blog post on Gaussian mixture models trained via expectation maximization, with an implementation in Python