In machine learning, ensemble averaging is the process of creating multiple models (typically artificial neural networks) and combining them to produce a desired output, as opposed to creating just one model. Ensembles of models often outperform individual models, as the various errors of the ensemble's constituents "average out".


Overview

Ensemble averaging is one of the simplest types of committee machines. Along with boosting, it is one of the two major types of static committee machines (Haykin 1999). In contrast to standard network design, in which many networks are generated but only one is kept, ensemble averaging keeps the less satisfactory networks as well, but assigns less weight to their outputs (Hashem 1997). The theory of ensemble averaging relies on two properties of artificial neural networks (Naftaly, Intrator, and Horn 1997):

# In any single network, the bias can be reduced at the cost of increased variance.
# In a group of networks, the variance can be reduced at no cost to the bias.

This relationship is known as the bias–variance tradeoff. Ensemble averaging creates a group of networks, each with low bias and high variance, and combines them into a new network that should, in theory, exhibit both low bias and low variance. It can therefore be viewed as a resolution of the bias–variance dilemma (Geman, Bienenstock, and Doursat 1992). The idea of combining experts in this way can be traced back to Pierre-Simon Laplace.
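The second property can be made concrete with a standard argument (not spelled out in the source). Suppose the ensemble consists of ''N'' experts whose outputs y_1, \dots, y_N each estimate a target t with the same bias b = \operatorname{E}[y_j] - t and variance \sigma^2, and suppose their errors are mutually independent. For the plain average \bar{y} = \frac{1}{N} \sum_{j=1}^{N} y_j,

: \operatorname{E}[\bar{y}] - t = b, \qquad \operatorname{Var}(\bar{y}) = \frac{\sigma^2}{N},

so the bias is unchanged while the variance is divided by ''N''. In practice the experts' errors are correlated and the reduction is smaller, but for experts of equal variance the average is never more variable than a single expert.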


Method

The theory above suggests an obvious strategy: create a set of experts with low bias and high variance, and then average them. In general, this means creating a set of experts with varying parameters; frequently, what is varied is the initial synaptic weights of each network, although other factors (such as the learning rate, momentum, etc.) may be varied as well. Some authors recommend against varying weight decay and early stopping. The steps are therefore:

# Generate ''N'' experts, each with its own initial parameter values (these values are usually sampled randomly from a distribution).
# Train each expert separately.
# Combine the experts and average their outputs.

Alternatively, domain knowledge may be used to generate several ''classes'' of experts; an expert from each class is trained, and their outputs are then combined.
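A minimal sketch of this basic procedure, assuming a small regression task and using scikit-learn's MLPRegressor as the expert model (the dataset, architecture, and ensemble size here are illustrative choices, not from the source):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(3.0 * X).ravel() + rng.normal(scale=0.1, size=200)

# Steps 1-2: generate N experts that differ only in their randomly
# initialised synaptic weights, and train each one separately.
N = 10
experts = [
    MLPRegressor(hidden_layer_sizes=(20,), random_state=seed,
                 max_iter=2000).fit(X, y)
    for seed in range(N)
]

# Step 3: the committee output is the plain average of the experts'
# outputs, i.e. every expert receives the same weight 1/N.
X_new = np.linspace(-1.0, 1.0, 50).reshape(-1, 1)
y_ensemble = np.mean([e.predict(X_new) for e in experts], axis=0)
```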
A more complex version of ensemble averaging views the final result not as a mere average of all the experts, but rather as a weighted sum. If the output of expert j is y_j, then the combined output \tilde{y} can be defined as

: \tilde{y}(\mathbf{x}; \boldsymbol{\alpha}) = \sum_{j=1}^{N} \alpha_j y_j(\mathbf{x})

where \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_N) is a vector of combination weights. The optimization problem of finding \boldsymbol{\alpha} can itself be solved with a neural network: a "meta-network", in which each "neuron" is an entire neural network, can be trained, and the synaptic weights of the final network are then the weights applied to the experts. This is known as a ''linear combination of experts''. Most forms of neural network can be seen as special cases of such a linear combination: a standard single network (where only one expert is used) corresponds to setting all \alpha_j = 0 except for one \alpha_k = 1, while a plain average corresponds to setting every \alpha_j to the same constant, namely 1/N for ''N'' experts. A more recent ensemble averaging method is negative correlation learning, proposed by Y. Liu and X. Yao, which has been widely used in evolutionary computing.
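Rather than training a meta-network, the following sketch fits \boldsymbol{\alpha} directly by ordinary least squares on held-out data, one simple route to an MSE-optimal linear combination in the spirit of Hashem (1997); the data split and the reuse of `X`, `y`, and `experts` from the previous sketch are illustrative assumptions:

```python
from sklearn.model_selection import train_test_split

# Hold out some data on which to fit the combination weights alpha.
# (Ideally this set would be disjoint from the experts' training data.)
_, X_val, _, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Column j holds expert j's predictions on the held-out inputs.
P = np.column_stack([e.predict(X_val) for e in experts])

# Least-squares fit of alpha in  P @ alpha ~= y_val  gives the
# MSE-optimal linear combination of the experts on this data.
alpha, *_ = np.linalg.lstsq(P, y_val, rcond=None)

def committee(X_query):
    """Weighted sum of expert outputs: y~(x) = sum_j alpha_j * y_j(x)."""
    return np.column_stack([e.predict(X_query) for e in experts]) @ alpha
```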


Benefits

* The resulting committee is almost always less complex than a single network that would achieve the same level of performance (Pearlmutter and Rosenfeld 1990).
* The resulting committee can be trained more easily on smaller datasets.
* The resulting committee often has improved performance over any single constituent model.
* The risk of overfitting is lessened, as there are fewer parameters (e.g. neural network weights) that need to be set.


See also

* Ensemble learning


References

* Geman, S., E. Bienenstock, and R. Doursat. "Neural networks and the bias/variance dilemma." Neural Computation 4, no. 1 (1992): 1–58.
* Hashem, S. "Optimal linear combinations of neural networks." Neural Networks 10, no. 4 (1997): 599–614.
* Haykin, Simon. Neural Networks: A Comprehensive Foundation. 2nd ed. Upper Saddle River, NJ: Prentice Hall, 1999.
* Naftaly, U., N. Intrator, and D. Horn. "Optimal ensemble averaging of neural networks." Network: Computation in Neural Systems 8, no. 3 (1997): 283–296.
* Pearlmutter, B. A., and R. Rosenfeld. "Chaitin–Kolmogorov complexity and generalization in neural networks." In Proceedings of the 1990 Conference on Advances in Neural Information Processing Systems 3, 931. Morgan Kaufmann Publishers Inc., 1990.

Further reading

* Hashem, S., and B. Schmeiser. "Approximating a function and its derivatives using MSE-optimal linear combinations of trained feedforward neural networks." Proceedings of the Joint Conference on Neural Networks 87 (1993): 617–620.