In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.
Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.
Overview
Supervised learning algorithms perform the task of searching through a hypothesis space to find a suitable hypothesis that will make good predictions for a particular problem. Even if the hypothesis space contains hypotheses that are very well-suited for a particular problem, it may be very difficult to find a good one. Ensembles combine multiple hypotheses to form a (hopefully) better hypothesis. The term ''ensemble'' is usually reserved for methods that generate multiple hypotheses using the same base learner.
The broader term of ''multiple classifier systems'' also covers hybridization of hypotheses that are not induced by the same base learner.
Evaluating the prediction of an ensemble typically requires more computation than evaluating the prediction of a single model. In one sense, ensemble learning may be thought of as a way to compensate for poor learning algorithms by performing a lot of extra computation; the alternative is to do a lot more learning on one non-ensemble system. For a given increase in compute, storage, or communication resources, an ensemble system may improve overall accuracy more by spending that increase on two or more methods than a single method would by consuming the same increase. Fast algorithms such as
decision trees are commonly used in ensemble methods (for example,
random forests), although slower algorithms can benefit from ensemble techniques as well.
By analogy, ensemble techniques have been used also in
unsupervised learning scenarios, for example in
consensus clustering or in
anomaly detection.
Ensemble theory
Empirically, ensembles tend to yield better results when there is significant diversity among the models. Many ensemble methods therefore seek to promote diversity among the models they combine. Although perhaps counterintuitive, more random algorithms (like random decision trees) can be used to produce a stronger ensemble than very deliberate algorithms (like entropy-reducing decision trees). Using a variety of strong learning algorithms, however, has been shown to be more effective than using techniques that attempt to ''dumb down'' the models in order to promote diversity. It is possible to increase diversity in the training stage of the model using correlation for regression tasks or information measures such as cross entropy for classification tasks.
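To illustrate, the following sketch (a hypothetical comparison on a synthetic dataset, using scikit-learn) pits a single entropy-reducing decision tree against an ensemble of highly randomized trees:
<syntaxhighlight lang="python">
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# A single, deliberately optimized (entropy-reducing) decision tree.
single = DecisionTreeClassifier(criterion="entropy", random_state=0)

# An ensemble of trees whose splits are chosen much more randomly; each
# member is weaker, but their diversity can strengthen the ensemble.
ensemble = ExtraTreesClassifier(n_estimators=100, random_state=0)

print("single tree:", cross_val_score(single, X, y).mean())
print("random ensemble:", cross_val_score(ensemble, X, y).mean())
</syntaxhighlight>
On many datasets the diverse but individually weaker random trees vote away each other's errors and score higher, though the outcome depends on the data.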
Ensemble size
While the number of component classifiers of an ensemble has a great impact on the accuracy of prediction, only a limited number of studies address this problem. Determining the ensemble size ''a priori'', as well as the volume and velocity of big data streams, makes this even more crucial for online ensemble classifiers. Statistical tests were mostly used for determining the proper number of components. More recently, a theoretical framework suggested that there is an ideal number of component classifiers for an ensemble, such that having more or fewer classifiers than this number would deteriorate the accuracy. It is called "the law of diminishing returns in ensemble construction." This theoretical framework shows that using the same number of independent component classifiers as class labels gives the highest accuracy.
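One way to examine this empirically is to sweep the ensemble size and track held-out accuracy. The sketch below (a hypothetical setup using scikit-learn's bagged decision trees; the dataset and sizes are arbitrary) illustrates the typical flattening of gains:
<syntaxhighlight lang="python">
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Accuracy as the number of component classifiers grows; gains usually
# flatten out, consistent with diminishing returns.
for n in (1, 5, 10, 25, 50, 100):
    clf = BaggingClassifier(n_estimators=n, random_state=0)
    print(n, cross_val_score(clf, X, y).mean())
</syntaxhighlight>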
Common types of ensembles
Bayes optimal classifier
The Bayes optimal classifier is a classification technique. It is an ensemble of all the hypotheses in the hypothesis space. On average, no other ensemble can outperform it. The naive Bayes optimal classifier is a version of this that assumes the data is conditionally independent given the class, which makes the computation more feasible. Each hypothesis is given a vote proportional to the likelihood that the training dataset would be sampled from a system if that hypothesis were true. To accommodate training data of finite size, the vote of each hypothesis is also multiplied by the prior probability of that hypothesis. The Bayes optimal classifier can be expressed with the following equation:
:<math>y = \underset{c_j \in C}{\operatorname{argmax}} \sum_{h_i \in H} P(c_j \mid h_i)\, P(T \mid h_i)\, P(h_i)</math>
where <math>y</math> is the predicted class, <math>C</math> is the set of all possible classes, <math>H</math> is the hypothesis space, <math>P</math> refers to a ''probability'', and <math>T</math> is the training data. As an ensemble, the Bayes optimal classifier represents a hypothesis that is not necessarily in <math>H</math>. The hypothesis represented by the Bayes optimal classifier, however, is the optimal hypothesis in ''ensemble space'' (the space of all possible ensembles consisting only of hypotheses in <math>H</math>).
This formula can be restated using Bayes' theorem, which says that the posterior is proportional to the likelihood times the prior:
:<math>P(h_i \mid T) \propto P(T \mid h_i)\, P(h_i)</math>
hence,
:<math>y = \underset{c_j \in C}{\operatorname{argmax}} \sum_{h_i \in H} P(c_j \mid h_i)\, P(h_i \mid T)</math>
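When the hypothesis space is small enough to enumerate, the sums above can be computed directly. The following toy sketch (with made-up probabilities; not a practical algorithm, since real hypothesis spaces are vast) implements the weighted vote:
<syntaxhighlight lang="python">
# Toy hypothesis space: each hypothesis carries a prior P(h), a likelihood
# P(T | h) of the observed training data, and class probabilities P(c | h).
hypotheses = [
    {"prior": 0.5, "likelihood": 0.10, "class_probs": {"pos": 0.9, "neg": 0.1}},
    {"prior": 0.3, "likelihood": 0.40, "class_probs": {"pos": 0.2, "neg": 0.8}},
    {"prior": 0.2, "likelihood": 0.25, "class_probs": {"pos": 0.6, "neg": 0.4}},
]

def bayes_optimal_class(hypotheses, classes=("pos", "neg")):
    # y = argmax over c of: sum over h of P(c | h) * P(T | h) * P(h)
    return max(classes, key=lambda c: sum(
        h["class_probs"][c] * h["likelihood"] * h["prior"] for h in hypotheses))

print(bayes_optimal_class(hypotheses))  # "neg" wins for these numbers
</syntaxhighlight>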
Bootstrap aggregating (bagging)

Bootstrap aggregation (''bagging'') involves training an ensemble on ''bootstrapped'' data sets. A bootstrapped set is created by selecting from the original training data set with replacement. Thus, a bootstrap set may contain a given example zero, one, or multiple times. Ensemble members can also have limits on the features they consider (e.g., the nodes of a decision tree), to encourage the exploration of diverse features. The variance of local information in the bootstrap sets and the feature considerations promote diversity in the ensemble, and can strengthen it. To reduce overfitting, a member can be validated using the out-of-bag set (the examples that are not in its bootstrap set).
Inference is done by voting on the predictions of ensemble members, which is called aggregation. For example, with an ensemble of four decision trees, a query example is classified by each tree; if three of the four predict the ''positive'' class, the ensemble's overall classification is ''positive''.
Random forests are a common application of bagging.
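A minimal bagging implementation might look as follows (a sketch assuming scikit-learn-style estimators and integer class labels; sklearn.ensemble.BaggingClassifier provides a production version):
<syntaxhighlight lang="python">
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_members=4, base=DecisionTreeClassifier(), seed=0):
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        # Bootstrap set: sample n examples with replacement, so a given
        # example may appear zero, one, or multiple times.
        idx = rng.integers(0, len(X), size=len(X))
        members.append(clone(base).fit(X[idx], y[idx]))
    return members

def bagging_predict(members, X):
    # Aggregation: each member votes and the majority class wins.
    votes = np.stack([m.predict(X) for m in members])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
</syntaxhighlight>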
Boosting
Boosting involves training successive models by emphasizing training data mis-classified by previously learned models. Initially, all data (D1) has equal weight and is used to learn a base model M1. The examples mis-classified by M1 are assigned a weight greater than that of correctly classified examples. This boosted data (D2) is used to train a second base model M2, and so on. Inference is done by voting.
In some cases, boosting has yielded better accuracy than bagging, but it also tends to overfit more. The most common implementation of boosting is AdaBoost, although some newer algorithms are reported to achieve better results.
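The core reweighting loop can be sketched as follows (a sketch for labels in {-1, +1}, using decision stumps as base models; sklearn.ensemble.AdaBoostClassifier is a production implementation):
<syntaxhighlight lang="python">
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):  # y must be in {-1, +1}
    w = np.full(len(X), 1.0 / len(X))  # D1: all data has equal weight
    members = []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()                      # weighted error
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
        w *= np.exp(-alpha * y * pred)                # boost mis-classified weights
        w /= w.sum()
        members.append((alpha, stump))
    return members

def adaboost_predict(members, X):
    # Inference: weighted vote of the members.
    return np.sign(sum(alpha * m.predict(X) for alpha, m in members))
</syntaxhighlight>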
Bayesian model averaging
Bayesian model averaging (BMA) makes predictions by averaging the predictions of models weighted by their posterior probabilities given the data. BMA is known to generally give better answers than a single model, obtained, e.g., via
stepwise regression, especially where very different models have nearly identical performance in the training set but may otherwise perform quite differently.
The question with any use of Bayes' theorem is the prior, i.e., the probability (perhaps subjective) that each model is the best to use for a given purpose. Conceptually, BMA can be used with any prior. The ''R'' packages ensembleBMA and BMA use the prior implied by the Bayesian information criterion (BIC), following Raftery (1995). The ''R'' package BAS supports the use of the priors implied by the Akaike information criterion (AIC) and other criteria over the alternative models as well as priors over the coefficients.
The difference between BIC and AIC is the strength of preference for parsimony. BIC's penalty for model complexity is <math>\ln(n)\, k</math>, while AIC's is <math>2k</math>, where <math>k</math> is the number of estimated parameters and <math>n</math> is the sample size. Large-sample asymptotic theory establishes that if there is a best model, then with increasing sample sizes, BIC is strongly consistent, i.e., it will almost certainly find it, while AIC may not, because AIC may continue to place excessive posterior probability on models that are more complicated than they need to be. On the other hand, AIC and AICc are asymptotically "efficient" (i.e., they attain minimum mean squared prediction error), while BIC is not.
Haussler et al. (1994) showed that when BMA is used for classification, its expected error is at most twice the expected error of the Bayes optimal classifier. Burnham and Anderson (1998, 2002) contributed greatly to introducing a wider audience to the basic ideas of Bayesian model averaging and popularizing the methodology. The availability of software, including other free open-source packages for ''R'' beyond those mentioned above, helped make the methods accessible to a wider audience.
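As a rough illustration of BIC-based weighting, the sketch below (hypothetical, for Gaussian linear regression over candidate feature subsets; the packages above handle priors and model spaces far more carefully) converts each model's BIC into an approximate posterior weight via P(model | data) proportional to exp(-BIC/2):
<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import LinearRegression

def bic(model, X, y):
    # BIC = n*ln(RSS/n) + k*ln(n) for a Gaussian linear model with k parameters.
    n, k = len(y), X.shape[1] + 1
    rss = ((y - model.predict(X)) ** 2).sum()
    return n * np.log(rss / n) + k * np.log(n)

def bma_predict(feature_sets, X, y, X_new):
    fitted = [LinearRegression().fit(X[:, fs], y) for fs in feature_sets]
    bics = np.array([bic(m, X[:, fs], y) for m, fs in zip(fitted, feature_sets)])
    w = np.exp(-(bics - bics.min()) / 2)  # posterior weights, up to normalization
    w /= w.sum()
    preds = [m.predict(X_new[:, fs]) for m, fs in zip(fitted, feature_sets)]
    return w @ np.stack(preds)            # posterior-weighted average prediction
</syntaxhighlight>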
Bayesian model combination
Bayesian model combination (BMC) is an algorithmic correction to Bayesian model averaging (BMA). Instead of sampling each model in the ensemble individually, it samples from the space of possible ensembles (with model weights drawn randomly from a Dirichlet distribution having uniform parameters). This modification overcomes the tendency of BMA to converge toward giving all the weight to a single model. Although BMC is somewhat more computationally expensive than BMA, it tends to yield dramatically better results. BMC has been shown to be better on average (with statistical significance) than BMA and bagging.
Use of Bayes' law to compute model weights requires computing the probability of the data given each model. Typically, none of the models in the ensemble are exactly the distribution from which the training data were generated, so all of them correctly receive a value close to zero for this term. This would work well if the ensemble were big enough to sample the entire model-space, but this is rarely possible. Consequently, each pattern in the training data will cause the ensemble weight to shift toward the model in the ensemble that is closest to the distribution of the training data. BMA thus essentially reduces to an unnecessarily complex method for doing model selection.
The possible weightings for an ensemble can be visualized as lying on a simplex. At each vertex of the simplex, all of the weight is given to a single model in the ensemble. BMA converges toward the vertex that is closest to the distribution of the training data. By contrast, BMC converges toward the point where this distribution projects onto the simplex. In other words, instead of selecting the one model that is closest to the generating distribution, it seeks the combination of models that is closest to the generating distribution.
The results from BMA can often be approximated by using cross-validation to select the best model from a bucket of models. Likewise, the results from BMC may be approximated by using cross-validation to select the best ensemble combination from a random sampling of possible weightings.
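The contrast can be sketched numerically (a toy illustration under a uniform Dirichlet prior, where model_probs[m, i] is the probability model m assigns to the i-th observed label): rather than selecting one weighting, BMC averages candidate weightings by how well each weighted mixture explains the data.
<syntaxhighlight lang="python">
import numpy as np

def bmc_weights(model_probs, n_draws=2000, seed=0):
    """model_probs: array of shape (n_models, n_samples)."""
    rng = np.random.default_rng(seed)
    n_models = model_probs.shape[0]
    # Candidate ensemble weightings drawn uniformly from the simplex.
    draws = rng.dirichlet(np.ones(n_models), size=n_draws)
    # Log-likelihood of the data under each weighted mixture of models.
    loglik = np.log(draws @ model_probs).sum(axis=1)
    post = np.exp(loglik - loglik.max())  # unnormalized posterior over ensembles
    post /= post.sum()
    return post @ draws                   # posterior-averaged weighting
</syntaxhighlight>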
Bucket of models
A "bucket of models" is an ensemble technique in which a model selection algorithm is used to choose the best model for each problem. When tested with only one problem, a bucket of models can produce no better results than the best model in the set, but when evaluated across many problems, it will typically produce much better results, on average, than any model in the set.
The most common approach used for model-selection is
cross-validation selection (sometimes called a "bake-off contest"). It is described with the following pseudo-code, given here as a runnable Python sketch (assuming scikit-learn-style models with ''fit'' and ''score'' methods):
<syntaxhighlight lang="python">
from sklearn.model_selection import train_test_split

def select_best(bucket, X, y, c=10):
    def avg_score(m):
        total = 0.0
        for _ in range(c):  # do c times (where 'c' is some constant)
            # randomly divide the training dataset into two sets: A and B
            XA, XB, yA, yB = train_test_split(X, y, test_size=0.5)
            total += m.fit(XA, yA).score(XB, yB)  # train m with A, test m with B
        return total / c
    # select the model that obtains the highest average score
    return max(bucket, key=avg_score)
</syntaxhighlight>
Cross-Validation Selection can be summed up as: "try them all with the training set, and pick the one that works best".
Gating is a generalization of Cross-Validation Selection. It involves training another learning model to decide which of the models in the bucket is best-suited to solve the problem. Often, a
perceptron is used for the gating model. It can be used to pick the "best" model, or it can be used to give a linear weight to the predictions from each model in the bucket.
When a bucket of models is used with a large set of problems, it may be desirable to avoid training some of the models that take a long time to train. Landmark learning is a meta-learning approach that seeks to solve this problem. It involves training only the fast (but imprecise) algorithms in the bucket, and then using the performance of these algorithms to help determine which slow (but accurate) algorithm is most likely to do best.
Stacking
Stacking (sometimes called ''stacked generalization'') involves training a model to combine the predictions of several other learning algorithms. First, all of the other algorithms are trained using the available data, then a combiner algorithm is trained to make a final prediction using all the predictions of the other algorithms as additional inputs. If an arbitrary combiner algorithm is used, then stacking can theoretically represent any of the ensemble techniques described in this article, although, in practice, a
logistic regression
model is often used as the combiner.
Stacking typically yields better performance than any single one of the trained models. It has been successfully used on both supervised learning tasks (regression, classification, and distance learning) and unsupervised learning (density estimation). It has also been used to estimate bagging's error rate.
It has been reported to outperform Bayesian model averaging. The two top performers in the Netflix competition utilized blending, which may be considered a form of stacking.
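scikit-learn implements stacking directly; a minimal sketch (the base learners here are an arbitrary choice) with a logistic-regression combiner:
<syntaxhighlight lang="python">
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(random_state=0)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("svm", SVC())],
    final_estimator=LogisticRegression(),  # combiner trained on base predictions
)
print(stack.fit(X, y).score(X, y))
</syntaxhighlight>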
Voting
Voting is another form of ensembling. See, e.g., the weighted majority algorithm (machine learning).
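A plain majority vote can be set up with scikit-learn's VotingClassifier (a minimal sketch; the choice of base learners is arbitrary):
<syntaxhighlight lang="python">
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(random_state=0)

# voting="hard" counts one vote per classifier; "soft" averages probabilities.
vote = VotingClassifier([
    ("lr", LogisticRegression()),
    ("nb", GaussianNB()),
    ("tree", DecisionTreeClassifier()),
], voting="hard")
print(vote.fit(X, y).score(X, y))
</syntaxhighlight>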
Implementations in statistics packages
*
R: at least three packages offer Bayesian model averaging tools, including the BMS (an acronym for Bayesian Model Selection) package, the BAS (an acronym for Bayesian Adaptive Sampling) package, and the BMA package.
*
Python: scikit-learn, a package for machine learning in Python, offers packages for ensemble learning, including bagging, voting and averaging methods.
*
MATLAB: classification ensembles are implemented in the Statistics and Machine Learning Toolbox.
Ensemble learning applications
In recent years, growing computational power, which allows training large ensembles in a reasonable time frame, has led to an increasing number of ensemble learning applications.
Some of the applications of ensemble classifiers include:
Remote sensing
Land cover mapping
Land cover mapping is one of the major applications of Earth observation satellite sensors, using remote sensing and geospatial data to identify the materials and objects which are located on the surface of target areas. Generally, the classes of target materials include roads, buildings, rivers, lakes, and vegetation.
Various ensemble learning approaches based on artificial neural networks, kernel principal component analysis (KPCA), decision trees with boosting, random forests and automatic design of multiple classifier systems have been proposed to efficiently identify land cover objects.
Change detection
Change detection is an image analysis problem, consisting of the identification of places where the land cover has changed over time.
Change detection is widely used in fields such as urban growth, forest and vegetation dynamics, land use and disaster monitoring.
The earliest applications of ensemble classifiers in change detection were designed with majority voting, Bayesian averaging and the maximum posterior probability.
Computer security
Distributed denial of service
Distributed denial of service is one of the most threatening
cyber-attacks that may happen to an
internet service provider.
By combining the output of single classifiers, ensemble classifiers reduce the total error of detecting and discriminating such attacks from legitimate
flash crowds.
Malware detection
Classification of malware codes such as computer viruses, computer worms, trojans, ransomware and spyware with the usage of machine learning techniques is inspired by the document categorization problem. Ensemble learning systems have proven effective in this area.
Intrusion detection
An intrusion detection system monitors computer networks or computer systems to identify intruder codes, similar to an anomaly detection process. Ensemble learning successfully aids such monitoring systems in reducing their total error.
Face recognition
Face recognition, which has recently become one of the most popular research areas of pattern recognition, deals with the identification or verification of a person by their digital images.
Hierarchical ensembles based on the Gabor Fisher classifier and independent component analysis preprocessing techniques are some of the earliest ensembles employed in this field.
Emotion recognition
While speech recognition is mainly based on deep learning, since most of the industry players in this field, like Google, Microsoft and IBM, reveal that the core technology of their speech recognition is based on this approach, speech-based emotion recognition can also have a satisfactory performance with ensemble learning.
It is also being successfully used in facial emotion recognition.
Fraud detection
Fraud detection deals with the identification of bank fraud, such as money laundering, credit card fraud and telecommunication fraud, which have vast domains of research and applications of machine learning. Because ensemble learning improves the robustness of normal behavior modelling, it has been proposed as an efficient technique to detect such fraudulent cases and activities in banking and credit card systems.
Financial decision-making
The accuracy of business failure prediction is a very crucial issue in financial decision-making. Therefore, different ensemble classifiers have been proposed to predict financial crises and financial distress.
Also, in the trade-based manipulation problem, where traders attempt to manipulate stock prices by buying and selling activities, ensemble classifiers are required to analyze the changes in stock market data and detect suspicious symptoms of stock price manipulation.
Medicine
Ensemble classifiers have been successfully applied in neuroscience, proteomics and medical diagnosis, such as in neuro-cognitive disorder (i.e. Alzheimer's disease or myotonic dystrophy) detection based on MRI datasets, and cervical cytology classification.
See also
*
Ensemble averaging (machine learning)
* Bayesian structural time series (BSTS)
References
External links
* {{scholarpedia|title=Ensemble learning|urlname=Ensemble_learning|curator=Robi Polikar}}
* The Waffles (machine learning) toolkit contains implementations of Bagging, Boosting, Bayesian Model Averaging, Bayesian Model Combination, Bucket-of-models, and other ensemble techniques