Mixture of experts (MoE) is a machine learning technique where multiple expert networks (learners) are used to divide a problem space into homogeneous regions. MoE represents a form of ensemble learning. They were also called committee machines.
Basic theory
MoE always has the following components, but they are implemented and combined differently according to the problem being solved:
* Experts $f_1, \dots, f_n$, each taking the same input $x$, and producing outputs $f_1(x), \dots, f_n(x)$.
* A weighting function (also known as a gating function) $w$, which takes input $x$ and produces a vector of outputs $(w(x)_1, \dots, w(x)_n)$. This may or may not be a probability distribution, but in both cases, its entries are non-negative.
* $\theta = (\theta_0, \theta_1, \dots, \theta_n)$ is the set of parameters. The parameter $\theta_0$ is for the weighting function. The parameters $\theta_1, \dots, \theta_n$ are for the experts.
* Given an input $x$, the mixture of experts produces a single output by combining $f_1(x), \dots, f_n(x)$ according to the weights $w(x)_1, \dots, w(x)_n$ in some way, usually by $\sum_i w(x)_i f_i(x)$, as in the sketch below.
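As an illustration, here is a minimal sketch of this combination rule in JAX. The affine experts, the softmax gate, and all names, dimensions, and parameter values are arbitrary choices made for the example; they are not part of the general definition.

```python
import jax.numpy as jnp
from jax.nn import softmax

def expert(theta_i, x):
    # One expert f_i; an affine map is an arbitrary illustrative choice.
    W, b = theta_i
    return W @ x + b

def gate(theta_0, x):
    # Weighting function w: a softmax keeps the entries non-negative and summing to 1.
    K, c = theta_0
    return softmax(K @ x + c)

def moe(theta, x):
    theta_0, expert_params = theta
    w = gate(theta_0, x)                                        # w(x)_1, ..., w(x)_n
    outputs = jnp.stack([expert(t, x) for t in expert_params])  # f_1(x), ..., f_n(x)
    return jnp.einsum("i,id->d", w, outputs)                    # sum_i w(x)_i f_i(x)

# Tiny usage example: n = 2 experts, 3-dimensional input, 2-dimensional output.
x = jnp.array([1.0, 2.0, 3.0])
theta_0 = (jnp.ones((2, 3)), jnp.zeros(2))
expert_params = [(jnp.eye(2, 3), jnp.zeros(2)), (-jnp.eye(2, 3), jnp.zeros(2))]
print(moe((theta_0, expert_params), x))
```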
Both the experts and the weighting function are trained by minimizing some loss function, generally via gradient descent. There is much freedom in choosing the precise form of experts, the weighting function, and the loss function.
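To make the training loop concrete, the following self-contained sketch fits a two-expert mixture to a toy piecewise-linear 1-D regression problem by gradient descent on a mean-squared error loss. The model form, data, learning rate, and number of steps are all illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def moe_output(theta, x):
    (K, c), mus = theta                 # gate parameters and per-expert slopes (assumed forms)
    w = jax.nn.softmax(K * x + c)       # gating weights over the 2 experts
    f = mus * x                         # expert i predicts mus[i] * x
    return jnp.sum(w * f)               # combined output  sum_i w(x)_i f_i(x)

def loss(theta, xs, ys):
    preds = jax.vmap(lambda x: moe_output(theta, x))(xs)
    return jnp.mean((preds - ys) ** 2)  # mean-squared error

# Toy target with a different slope on each half of the input range,
# so each expert can specialise on one region of the problem space.
xs = jnp.linspace(-1.0, 1.0, 64)
ys = jnp.where(xs < 0, -2.0 * xs, 3.0 * xs)

theta = ((jnp.array([5.0, -5.0]), jnp.zeros(2)), jnp.array([1.0, -1.0]))
for _ in range(500):
    grads = jax.grad(loss)(theta, xs, ys)                       # gradients for gate and experts
    theta = jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, theta, grads)

print("final training MSE:", loss(theta, xs, ys))
```

The intent of the toy target is that the gate learns to route negative and positive inputs to different experts, illustrating the division of the problem space into regions described above.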
Meta-pi network
The meta-pi network, reported by Hampshire and Waibel, uses $f(x) = \sum_i w(x)_i f_i(x)$ as the output. The model is trained by performing gradient descent on the mean-squared error loss $L := \frac{1}{N} \sum_k \|y_k - f(x_k)\|^2$. The experts may be arbitrary functions.
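A hypothetical sketch of this recipe for a small classification task is shown below: expert class scores are mixed by the gate, and the mixture is trained against one-hot targets with a mean-squared error loss. The shapes, the random data, and the per-expert softmax outputs are assumptions made for the example, not details taken from Hampshire and Waibel's paper.

```python
import jax
import jax.numpy as jnp

def meta_pi(params, x):
    gate_W, expert_Ws = params
    w = jax.nn.softmax(gate_W @ x)                                 # gating weights, (n_experts,)
    outs = jnp.stack([jax.nn.softmax(W @ x) for W in expert_Ws])   # per-expert class scores
    return jnp.einsum("e,ec->c", w, outs)                          # sum_i w(x)_i f_i(x)

def mse_loss(params, X, Y_onehot):
    preds = jax.vmap(lambda x: meta_pi(params, x))(X)
    return jnp.mean(jnp.sum((preds - Y_onehot) ** 2, axis=-1))     # mean-squared error

# Toy setup: 6 experts, 8-dim features, 5 classes; all numbers are arbitrary.
key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
params = (0.1 * jax.random.normal(k1, (6, 8)),
          0.1 * jax.random.normal(k2, (6, 5, 8)))
X = jax.random.normal(k3, (32, 8))
Y_onehot = jax.nn.one_hot(jax.random.randint(k4, (32,), 0, 5), 5)

# One gradient-descent step on both the gate and the experts.
grads = jax.grad(mse_loss)(params, X, Y_onehot)
params = jax.tree_util.tree_map(lambda p, g: p - 0.05 * g, params, grads)
```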
In their original publication, they were solving the problem of classifying phonemes in a speech signal from 6 different Japanese speakers, 2 female and 4 male. They trained 6 experts, each being a "time-delayed neural network" (essentially a multilayered convolutional network over the mel spectrogram). They found that the resulting mixture of experts dedicated 5 experts to 5 of the speakers, but the 6th (male) speaker did not have a dedicated expert; instead, his voice was classified by a linear combination of the experts for the other 3 male speakers.
Adaptive mixtures of local experts
The adaptive mixtures of local experts uses a Gaussian mixture model. Each expert simply predicts a Gaussian distribution and ignores the input entirely. Specifically, the $i$-th expert predicts that the output is $y \sim N(\mu_i, I)$, where $\mu_i$ is a learnable parameter. The weighting function is a linear-softmax function:
$$w(x)_i = \frac{e^{k_i^T x + b_i}}{\sum_j e^{k_j^T x + b_j}}$$
The mixture of experts predicts that the output is distributed according to the log-probability density function: