Ensemble learning

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.


Overview

Supervised learning algorithms search through a hypothesis space to find a suitable hypothesis that will make good predictions for a particular problem. Even if this space contains hypotheses that are very well suited to the problem, it may be very difficult to find a good one. Ensembles combine multiple hypotheses to form one that should, in theory, be better. ''Ensemble learning'' trains two or more machine learning algorithms on a specific classification or regression task. The algorithms within the ensemble model are generally referred to as "base models", "base learners", or "weak learners" in the literature. These base models can be constructed using a single modelling algorithm, or several different algorithms. The idea is to train a diverse set of weak models on the same modelling task, such that each weak learner individually has poor predictive ability (i.e., high bias), while the outcomes and errors across the weak learners exhibit high variance. Fundamentally, an ensemble learning model trains at least two high-bias (weak) and high-variance (diverse) models, which are then combined into a better-performing model. The set of weak models, which would not produce satisfactory predictive results individually, is combined or averaged to produce a single high-performing, accurate, and low-variance model fitted to the task.

Ensemble learning typically refers to bagging (bootstrap aggregating), boosting, or stacking/blending techniques that induce high variance among the base models. Bagging creates diversity by generating random samples from the training observations and fitting the same model to each different sample; such ensembles are also known as ''homogeneous parallel ensembles''. Boosting follows an iterative process, sequentially training each base model on the up-weighted errors of the previous base model to produce an additive model that reduces the final model's errors; this is also known as ''sequential ensemble learning''. Stacking or blending combines different base models, each trained independently (i.e., diverse, with high variance), into the ensemble model, producing a ''heterogeneous parallel ensemble''. Common applications of ensemble learning include random forests (an extension of bagging), boosted tree models, and gradient-boosted tree models. Models in applications of stacking are generally more task-specific, such as combining clustering techniques with other parametric and/or non-parametric techniques.

Evaluating the prediction of an ensemble typically requires more computation than evaluating the prediction of a single model. In one sense, ensemble learning may be thought of as a way to compensate for poor learning algorithms by performing a lot of extra computation; the alternative is to do a lot more learning with one non-ensemble model. For the same increase in compute, storage, or communication resources, spending that increase on two or more methods may improve overall accuracy more than spending it on a single method. Fast algorithms such as decision trees are commonly used in ensemble methods (e.g., random forests), although slower algorithms can benefit from ensemble techniques as well. By analogy, ensemble techniques have also been used in unsupervised learning scenarios, for example in consensus clustering or in anomaly detection.


Ensemble theory

Empirically, ensembles tend to yield better results when there is significant diversity among the models. Many ensemble methods, therefore, seek to promote diversity among the models they combine. Although perhaps non-intuitive, more random algorithms (like random decision trees) can be used to produce a stronger ensemble than very deliberate algorithms (like entropy-reducing decision trees). Using a variety of strong learning algorithms, however, has been shown to be more effective than using techniques that attempt to ''dumb-down'' the models in order to promote diversity. It is possible to increase diversity in the training stage of the model using correlation for regression tasks, or using information measures such as cross entropy for classification tasks. Theoretically, one can justify the diversity concept because the lower bound of the error rate of an ensemble system can be decomposed into accuracy, diversity, and a third term.


The geometric framework

Ensemble learning, including both regression and classification tasks, can be explained using a geometric framework. Within this framework, the output of each individual classifier or regressor for the entire dataset can be viewed as a point in a multi-dimensional space. Additionally, the target result is represented as a point in this space, referred to as the "ideal point". The Euclidean distance is used as the metric to measure both the performance of a single classifier or regressor (the distance between its point and the ideal point) and the dissimilarity between two classifiers or regressors (the distance between their respective points). This perspective turns ensemble learning into a deterministic problem. For example, within this geometric framework it can be proved that averaging the outputs (scores) of all base classifiers or regressors leads to results equal to or better than the average of the individual models. It can also be proved that if the optimal weighting scheme is used, a weighted-averaging approach can outperform any of the individual classifiers or regressors that make up the ensemble, or at least match the best performer.
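The averaging claim can be checked numerically. Below is a minimal Python sketch (all numbers synthetic and purely illustrative): by convexity of the Euclidean norm, the averaged model's distance to the ideal point never exceeds the average of the individual models' distances.

 import numpy as np
 
 rng = np.random.default_rng(0)
 
 # Hypothetical setup: 5 regressors, each predicting 100 targets; each model's
 # full output vector is one point in R^100, as in the geometric framework.
 ideal = rng.normal(size=100)                           # the "ideal point"
 preds = ideal + rng.normal(scale=0.5, size=(5, 100))   # noisy individual models
 
 dist_each = np.linalg.norm(preds - ideal, axis=1)      # performance of each model
 dist_avg = np.linalg.norm(preds.mean(axis=0) - ideal)  # performance of the average
 
 print(dist_each.mean(), dist_avg)  # the second value is never larger than the first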


Ensemble size

While the number of component classifiers of an ensemble has a great impact on the accuracy of prediction, only a limited number of studies address this problem. Determining the ensemble size ''a priori'', together with the volume and velocity of big data streams, makes this even more crucial for online ensemble classifiers. Mostly, statistical tests have been used to determine the proper number of components. More recently, a theoretical framework suggested that there is an ideal number of component classifiers for an ensemble, such that having more or fewer classifiers than this number would deteriorate the accuracy. This is called "the law of diminishing returns in ensemble construction". The framework shows that using the same number of independent component classifiers as class labels gives the highest accuracy.


Common types of ensembles


Bayes optimal classifier

The Bayes optimal classifier is a classification technique. It is an ensemble of all the hypotheses in the hypothesis space. On average, no other ensemble can outperform it. The Naive Bayes classifier is a version of this that assumes that the data is conditionally independent given the class, which makes the computation more feasible. Each hypothesis is given a vote proportional to the likelihood that the training dataset would be sampled from a system if that hypothesis were true. To accommodate training data of finite size, the vote of each hypothesis is also multiplied by the prior probability of that hypothesis. The Bayes optimal classifier can be expressed with the following equation:

:y=\underset{c_j \in C}{\mathrm{argmax}} \sum_{h_i \in H}{P(c_j|h_i)P(T|h_i)P(h_i)}

where y is the predicted class, C is the set of all possible classes, H is the hypothesis space, P refers to a ''probability'', and T is the training data. As an ensemble, the Bayes optimal classifier represents a hypothesis that is not necessarily in H. The hypothesis represented by the Bayes optimal classifier, however, is the optimal hypothesis in ''ensemble space'' (the space of all possible ensembles consisting only of hypotheses in H). This formula can be restated using Bayes' theorem, which says that the posterior is proportional to the likelihood times the prior:

:P(h_i|T) \propto P(T|h_i)P(h_i)

hence,

:y=\underset{c_j \in C}{\mathrm{argmax}} \sum_{h_i \in H}{P(c_j|h_i)P(h_i|T)}
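As a toy illustration of the vote, the following Python sketch evaluates the first formula directly; the priors, likelihoods, and per-hypothesis class distributions are invented numbers, not derived from any real dataset.

 import numpy as np
 
 # Hypothetical toy values: 3 hypotheses, 2 classes.
 prior = np.array([0.5, 0.3, 0.2])          # P(h_i)
 likelihood = np.array([0.10, 0.40, 0.25])  # P(T|h_i): prob. of the training data under h_i
 # P(c_j|h_i): each row is one hypothesis's class distribution for a query example.
 p_class_given_h = np.array([[0.9, 0.1],
                             [0.2, 0.8],
                             [0.6, 0.4]])
 
 posterior = likelihood * prior          # proportional to P(h_i|T) by Bayes' theorem
 scores = p_class_given_h.T @ posterior  # sum_i P(c_j|h_i) P(T|h_i) P(h_i) per class
 print(scores.argmax(), scores)          # predicted class y and the per-class sums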


Bootstrap aggregating (bagging)

Bootstrap aggregation (''bagging'') involves training an ensemble on ''bootstrapped'' data sets. A bootstrapped set is created by selecting from the original training data set with replacement. Thus, a bootstrap set may contain a given example zero, one, or multiple times. Ensemble members can also have limits on the features they consider (e.g., at the nodes of a decision tree), to encourage exploration of diverse features. The variance of local information in the bootstrap sets and the feature considerations promote diversity in the ensemble, and can strengthen the ensemble. To reduce overfitting, a member can be validated using the out-of-bag set (the examples that are not in its bootstrap set). Inference is done by voting on the predictions of ensemble members, called aggregation. Consider, for illustration, an ensemble of four decision trees classifying a query example: because three of the four trees predict the ''positive'' class, the ensemble's overall classification is ''positive''. Random forests are a common application of bagging.
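A minimal bagging sketch using scikit-learn (whose ensemble module is mentioned later in this article); the dataset is synthetic and the parameter choices are illustrative, not a recipe:

 from sklearn.datasets import make_classification
 from sklearn.ensemble import BaggingClassifier
 from sklearn.tree import DecisionTreeClassifier
 
 X, y = make_classification(n_samples=500, random_state=0)
 
 # Bag 50 trees, each fit on a bootstrap sample of the training data;
 # oob_score=True validates each tree on its out-of-bag examples.
 bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                         oob_score=True, random_state=0)
 bag.fit(X, y)
 print(bag.oob_score_)      # out-of-bag accuracy estimate
 print(bag.predict(X[:5]))  # aggregation: majority vote over the 50 trees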


Boosting

Boosting involves training successive models by emphasizing training data mis-classified by previously learned models. Initially, all data (D1) has equal weight and is used to learn a base model M1. The examples mis-classified by M1 are assigned a weight greater than that of correctly classified examples. This boosted data (D2) is used to train a second base model M2, and so on. Inference is done by voting. In some cases, boosting has yielded better accuracy than bagging, but it also tends to over-fit more. The most common implementation of boosting is AdaBoost, but some newer algorithms are reported to achieve better results.
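A comparable sketch of boosting, using scikit-learn's AdaBoost implementation on synthetic data with illustrative settings:

 from sklearn.datasets import make_classification
 from sklearn.ensemble import AdaBoostClassifier
 from sklearn.tree import DecisionTreeClassifier
 
 X, y = make_classification(n_samples=500, random_state=0)
 
 # Depth-1 trees ("stumps") are the classic AdaBoost base model; each new stump
 # is fit with extra weight on the examples the previous stumps got wrong.
 ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                          n_estimators=100, random_state=0)
 ada.fit(X, y)
 print(ada.score(X, y))  # training accuracy of the additive model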


Bayesian model averaging

Bayesian model averaging (BMA) makes predictions by averaging the predictions of models weighted by their posterior probabilities given the data. BMA is known to generally give better answers than a single model, obtained, e.g., via stepwise regression, especially where very different models have nearly identical performance on the training set but may otherwise perform quite differently.

The central question with any use of Bayes' theorem is the prior, i.e., the probability (perhaps subjective) that each model is the best to use for a given purpose. Conceptually, BMA can be used with any prior. The ''R'' packages ensembleBMA and BMA use the prior implied by the Bayesian information criterion (BIC), following Raftery (1995). The ''R'' package BAS supports the use of the priors implied by the Akaike information criterion (AIC) and other criteria over the alternative models, as well as priors over the coefficients.

The difference between BIC and AIC is the strength of the preference for parsimony. BIC's penalty for model complexity is \ln(n)k, while AIC's is 2k. Large-sample asymptotic theory establishes that if there is a best model, then with increasing sample sizes BIC is strongly consistent, i.e., will almost certainly find it, while AIC may not, because AIC may continue to place excessive posterior probability on models that are more complicated than they need to be. On the other hand, AIC and AICc are asymptotically "efficient" (i.e., they minimize mean square prediction error), while BIC is not. Haussler et al. (1994) showed that when BMA is used for classification, its expected error is at most twice the expected error of the Bayes optimal classifier. Burnham and Anderson (1998, 2002) contributed greatly to introducing a wider audience to the basic ideas of Bayesian model averaging and popularizing the methodology. The availability of software, including other free open-source packages for ''R'' beyond those mentioned above, helped make the methods accessible to a wider audience.
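To make the BIC weighting concrete, here is a small Python sketch that turns BIC values into approximate posterior model probabilities and averages predictions. The log-likelihoods, parameter counts, and predictions are invented for illustration, and bic_weights is a hypothetical helper, not a function from any package named above.

 import numpy as np
 
 def bic_weights(log_likelihoods, n_params, n_obs):
     # BIC = k*ln(n) - 2*ln(L); exp(-BIC/2) approximates the marginal likelihood,
     # so normalizing gives approximate posterior model probabilities.
     bic = np.asarray(n_params) * np.log(n_obs) - 2.0 * np.asarray(log_likelihoods)
     w = np.exp(-0.5 * (bic - bic.min()))  # shift by the minimum for numerical stability
     return w / w.sum()
 
 # Hypothetical fits: three candidate models on the same n = 200 observations.
 weights = bic_weights(log_likelihoods=[-310.2, -305.8, -304.9],
                       n_params=[2, 4, 7], n_obs=200)
 preds = np.array([1.10, 1.25, 1.40])  # each model's prediction for a new case
 print(weights, preds @ weights)       # BMA prediction = posterior-weighted average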


Bayesian model combination

Bayesian model combination (BMC) is an algorithmic correction to Bayesian model averaging (BMA). Instead of sampling each model in the ensemble individually, it samples from the space of possible ensembles (with model weights drawn randomly from a Dirichlet distribution having uniform parameters). This modification overcomes the tendency of BMA to converge toward giving all the weight to a single model. Although BMC is somewhat more computationally expensive than BMA, it tends to yield dramatically better results. BMC has been shown to be better on average (with statistical significance) than BMA and bagging. Using Bayes' law to compute model weights requires computing the probability of the data given each model. Typically, none of the models in the ensemble are exactly the distribution from which the training data were generated, so all of them correctly receive a value close to zero for this term. This would work well if the ensemble were big enough to sample the entire model-space, but that is rarely possible. Consequently, each pattern in the training data will cause the ensemble weight to shift toward the model in the ensemble that is closest to the distribution of the training data, and BMA essentially reduces to an unnecessarily complex method for doing model selection. The possible weightings for an ensemble can be visualized as lying on a simplex. At each vertex of the simplex, all of the weight is given to a single model in the ensemble. BMA converges toward the vertex that is closest to the distribution of the training data. By contrast, BMC converges toward the point where this distribution projects onto the simplex. In other words, instead of selecting the one model that is closest to the generating distribution, it seeks the combination of models that is closest to the generating distribution. The results from BMA can often be approximated by using cross-validation to select the best model from a bucket of models. Likewise, the results from BMC may be approximated by using cross-validation to select the best ensemble combination from a random sampling of possible weightings.
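A rough Python sketch of the BMC idea under simplifying assumptions: candidate weightings are drawn from a uniform Dirichlet over the simplex and scored by the likelihood of the data under the mixture each defines. The per-example likelihoods are invented, and bmc_weights is an illustrative helper, not a reference implementation.

 import numpy as np
 
 rng = np.random.default_rng(0)
 
 def bmc_weights(log_lik, n_draws=2000):
     # log_lik: (n_models, n_examples) per-example log-likelihoods from each model.
     # Each Dirichlet draw is one candidate weighting (a point on the simplex).
     n_models = log_lik.shape[0]
     draws = rng.dirichlet(np.ones(n_models), size=n_draws)    # uniform over the simplex
     mix_loglik = np.log(draws @ np.exp(log_lik)).sum(axis=1)  # score of each weighting
     post = np.exp(mix_loglik - mix_loglik.max())  # unnormalized posterior over weightings
     return (post / post.sum()) @ draws            # posterior-mean ensemble weighting
 
 # Hypothetical per-example likelihoods for 3 models on 4 examples.
 ll = np.log(np.array([[0.9, 0.2, 0.8, 0.7],
                       [0.1, 0.9, 0.3, 0.6],
                       [0.5, 0.5, 0.5, 0.5]]))
 print(bmc_weights(ll))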


Bucket of models

A "bucket of models" is an ensemble technique in which a model selection algorithm is used to choose the best model for each problem. When tested with only one problem, a bucket of models can produce no better results than the best model in the set, but when evaluated across many problems, it will typically produce much better results, on average, than any model in the set. The most common approach used for model-selection is cross-validation selection (sometimes called a "bake-off contest"). It is described with the following pseudo-code: For each model m in the bucket: Do c times: (where 'c' is some constant) Randomly divide the training dataset into two sets: A and B Train m with A Test m with B Select the model that obtains the highest average score Cross-Validation Selection can be summed up as: "try them all with the training set, and pick the one that works best". Gating is a generalization of Cross-Validation Selection. It involves training another learning model to decide which of the models in the bucket is best-suited to solve the problem. Often, a
perceptron In machine learning, the perceptron is an algorithm for supervised classification, supervised learning of binary classification, binary classifiers. A binary classifier is a function that can decide whether or not an input, represented by a vect ...
is used for the gating model. It can be used to pick the "best" model, or it can be used to give a linear weight to the predictions from each model in the bucket. When a bucket of models is used with a large set of problems, it may be desirable to avoid training some of the models that take a long time to train. Landmark learning is a meta-learning approach that seeks to solve this problem. It involves training only the fast (but imprecise) algorithms in the bucket, and then using the performance of these algorithms to help determine which slow (but accurate) algorithm is most likely to do best.
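The cross-validation selection pseudo-code above can be realized in a few lines; this sketch uses scikit-learn on a synthetic dataset with an arbitrary three-model bucket.

 from sklearn.datasets import make_classification
 from sklearn.linear_model import LogisticRegression
 from sklearn.model_selection import cross_val_score
 from sklearn.neighbors import KNeighborsClassifier
 from sklearn.tree import DecisionTreeClassifier
 
 X, y = make_classification(n_samples=500, random_state=0)
 
 # The "bucket": any set of candidate models.
 bucket = {"logreg": LogisticRegression(max_iter=1000),
           "tree": DecisionTreeClassifier(random_state=0),
           "knn": KNeighborsClassifier()}
 
 # Try them all on the training set with cross-validation; keep the best scorer.
 scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in bucket.items()}
 best = max(scores, key=scores.get)
 print(scores, "->", best)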


Amended Cross-Entropy Cost: An Approach for Encouraging Diversity in Classification Ensemble

The most common approach for training a classifier is to use the cross-entropy cost function. However, one would like to train an ensemble of models that have diversity, so that combining them provides the best results. Suppose we use a simple ensemble that averages K classifiers. Then the amended cross-entropy cost is

:e^k = H(p,q^k)-\frac{\lambda}{K-1}\sum_{j \neq k}H(q^j,q^k)

where e^k is the cost function of the k^{th} classifier, q^k is the output probability of the k^{th} classifier, p is the true probability that we need to estimate, and \lambda is a parameter between 0 and 1 that defines the diversity that we would like to establish. When \lambda=0 we want each classifier to do its best regardless of the ensemble, and when \lambda=1 we would like the classifiers to be as diverse as possible.
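A small numpy sketch of the amended cost, computed for each classifier in a K = 3 toy ensemble; the distributions p and q^k are invented, and amended_cost is an illustrative helper:

 import numpy as np
 
 def cross_entropy(p, q, eps=1e-12):
     return -np.sum(p * np.log(q + eps))  # H(p, q)
 
 def amended_cost(p, qs, k, lam):
     # e^k = H(p, q^k) - lam/(K-1) * sum over j != k of H(q^j, q^k)
     K = len(qs)
     penalty = sum(cross_entropy(qs[j], qs[k]) for j in range(K) if j != k)
     return cross_entropy(p, qs[k]) - lam / (K - 1) * penalty
 
 p = np.array([1.0, 0.0, 0.0])  # true (one-hot) distribution to estimate
 qs = [np.array([0.7, 0.2, 0.1]),
       np.array([0.6, 0.3, 0.1]),
       np.array([0.5, 0.25, 0.25])]  # the K classifiers' output distributions
 
 for k in range(3):
     print(k, amended_cost(p, qs, k, lam=0.5))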


Stacking

Stacking (sometimes called ''stacked generalization'') involves training a model to combine the predictions of several other learning algorithms. First, all of the other algorithms are trained using the available data; then a combiner algorithm (final estimator) is trained to make a final prediction using all the predictions of the other algorithms (base estimators) as additional inputs, or using cross-validated predictions from the base estimators, which can prevent overfitting. If an arbitrary combiner algorithm is used, then stacking can theoretically represent any of the ensemble techniques described in this article, although, in practice, a logistic regression model is often used as the combiner. Stacking typically yields performance better than any single one of the trained models. It has been successfully used on both supervised learning tasks (regression, classification and distance learning) and unsupervised learning (density estimation). It has also been used to estimate bagging's error rate, and it has been reported to out-perform Bayesian model averaging. The two top performers in the Netflix competition utilized blending, which may be considered a form of stacking.
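A minimal stacking sketch with scikit-learn's StackingClassifier, in which the combiner (a logistic regression, as noted above) is fit on cross-validated predictions from the base estimators; the dataset and estimator choices are illustrative:

 from sklearn.datasets import make_classification
 from sklearn.ensemble import StackingClassifier
 from sklearn.linear_model import LogisticRegression
 from sklearn.svm import SVC
 from sklearn.tree import DecisionTreeClassifier
 
 X, y = make_classification(n_samples=500, random_state=0)
 
 # Heterogeneous base estimators; cv=5 means the combiner is trained on
 # out-of-fold predictions, which guards against overfitting.
 stack = StackingClassifier(
     estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                 ("svm", SVC(probability=True, random_state=0))],
     final_estimator=LogisticRegression(),
     cv=5)
 stack.fit(X, y)
 print(stack.score(X, y))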


Voting

Voting is another form of ensembling. See, e.g., the weighted majority algorithm (machine learning).
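For completeness, a short voting sketch with scikit-learn's VotingClassifier on synthetic data; "hard" voting takes a majority vote over class labels, while voting="soft" would average predicted probabilities instead:

 from sklearn.datasets import make_classification
 from sklearn.ensemble import VotingClassifier
 from sklearn.linear_model import LogisticRegression
 from sklearn.naive_bayes import GaussianNB
 from sklearn.tree import DecisionTreeClassifier
 
 X, y = make_classification(n_samples=500, random_state=0)
 
 vote = VotingClassifier(
     estimators=[("lr", LogisticRegression(max_iter=1000)),
                 ("nb", GaussianNB()),
                 ("tree", DecisionTreeClassifier(random_state=0))],
     voting="hard")  # each model casts one vote per example
 vote.fit(X, y)
 print(vote.predict(X[:5]))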


Implementations in statistics packages

* ''R'': at least three packages offer Bayesian model averaging tools, including the BMS (an acronym for Bayesian Model Selection) package, the BAS (an acronym for Bayesian Adaptive Sampling) package, and the BMA package.
* Python: scikit-learn, a package for machine learning in Python, offers modules for ensemble learning, including bagging, voting, and averaging methods.
* MATLAB: classification ensembles are implemented in the Statistics and Machine Learning Toolbox.


Ensemble learning applications

In recent years, due to growing computational power, which allows training large ensembles in a reasonable time frame, the number of ensemble learning applications has grown. Some applications of ensemble classifiers include:


Remote sensing


Land cover mapping

Land cover mapping is one of the major applications of Earth observation satellite sensors, using remote sensing and geospatial data to identify the materials and objects which are located on the surface of target areas. Generally, the classes of target materials include roads, buildings, rivers, lakes, and vegetation. Some different ensemble learning approaches based on artificial neural networks, kernel principal component analysis (KPCA), decision trees with boosting, random forest and automatic design of multiple classifier systems are proposed to efficiently identify land cover objects.


Change detection

Change detection is an image analysis problem, consisting of the identification of places where the land cover has changed over time. Change detection is widely used in fields such as urban growth, forest and vegetation dynamics, land use and disaster monitoring. The earliest applications of ensemble classifiers in change detection were designed with majority voting, Bayesian model averaging, and the maximum posterior probability. Given the growth of satellite data over time, the past decade has seen more use of time series methods for continuous change detection from image stacks. One example is a Bayesian ensemble changepoint detection method called BEAST, with the software available as the package Rbeast in R, Python, and Matlab.


Computer security


Distributed denial of service

Distributed denial of service is one of the most threatening cyber-attacks that may happen to an internet service provider. By combining the output of single classifiers, ensemble classifiers reduce the total error of detecting and discriminating such attacks from legitimate flash crowds.


Malware detection

Classification of malware codes such as computer viruses, computer worms, trojans, ransomware and spyware with the usage of machine learning techniques is inspired by the document categorization problem. Ensemble learning systems have proven effective in this area.


Intrusion detection

An intrusion detection system monitors a computer network or computer systems to identify intruder codes, in a manner akin to an anomaly detection process. Ensemble learning successfully aids such monitoring systems in reducing their total error.


Face recognition

Face recognition, which has recently become one of the most popular research areas of pattern recognition, copes with identification or verification of a person by their digital images. Hierarchical ensembles based on the Gabor Fisher classifier and independent component analysis preprocessing techniques are some of the earliest ensembles employed in this field.


Emotion recognition

While speech recognition is mainly based on deep learning (most of the industry players in this field, such as Google, Microsoft and IBM, reveal that the core technology of their speech recognition is based on this approach), speech-based emotion recognition can also achieve satisfactory performance with ensemble learning. Ensemble learning is also being used successfully in facial emotion recognition.


Fraud detection

Fraud detection deals with the identification of bank fraud, such as money laundering, credit card fraud and telecommunication fraud, which are vast domains of research and machine learning applications. Because ensemble learning improves the robustness of normal-behavior modelling, it has been proposed as an efficient technique to detect such fraudulent cases and activities in banking and credit card systems.


Financial decision-making

The accuracy of predicting business failure is a crucial issue in financial decision-making, so different ensemble classifiers have been proposed to predict financial crises and financial distress. Also, in the trade-based manipulation problem, where traders attempt to manipulate stock prices by buying and selling activities, ensemble classifiers are required to analyze the changes in stock market data and detect suspicious symptoms of stock price manipulation.


Medicine

Ensemble classifiers have been successfully applied in neuroscience, proteomics and medical diagnosis, for example in neuro-cognitive disorder (e.g., Alzheimer's disease or myotonic dystrophy) detection based on MRI datasets, and in cervical cytology classification. Ensembles have also been successfully applied to medical segmentation tasks, for example brain tumor and hyperintensities segmentation.


See also

* Ensemble averaging (machine learning)
* Bayesian structural time series (BSTS)
* Mixture of experts


References




External links

* {{scholarpedia, title=Ensemble learning, urlname=Ensemble_learning, curator=Robi Polikar}}
* The Waffles (machine learning) toolkit contains implementations of Bagging, Boosting, Bayesian Model Averaging, Bayesian Model Combination, Bucket-of-models, and other ensemble techniques