Linear Programming Boosting (LPBoost) is a
supervised classifier from the
boosting family of classifiers. LPBoost maximizes a ''margin'' between training samples of different classes and hence also belongs to the class of margin-maximizing supervised classification algorithms. Consider a classification function
:
which classifies samples from a space
into one of two classes, labelled 1 and -1, respectively. LPBoost is an algorithm to ''learn'' such a classification function given a set of training examples with known class labels. LPBoost is a
machine learning
Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence.
Machine ...
technique and especially suited for applications of joint classification and feature selection in structured domains.
LPBoost overview
As in all boosting classifiers, the final classification function is of the form
:
where
are non-negative weightings for ''weak'' classifiers
. Each individual weak classifier
may be just a little bit better than random, but the resulting linear combination of many weak classifiers can perform very well.
LPBoost constructs
by starting with an empty set of weak classifiers. Iteratively, a single weak classifier to add to the set of considered weak classifiers is selected, added and all the weights
for the current set of weak classifiers are adjusted. This is repeated until no weak classifiers to add remain.
The property that all classifier weights are adjusted in each iteration is known as ''totally-corrective'' property. Early boosting methods, such as
AdaBoost
AdaBoost, short for ''Adaptive Boosting'', is a statistical classification meta-algorithm formulated by Yoav Freund and Robert Schapire in 1995, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many other types ...
do not have this property and converge slower.
Linear program
More generally, let
be the possibly infinite set of weak classifiers, also termed ''hypotheses''. One way to write down the problem LPBoost solves is as a
linear program
Linear programming (LP), also called linear optimization, is a method to achieve the best outcome (such as maximum profit or lowest cost) in a mathematical model whose requirements are represented by linear relationships. Linear programming is ...
with infinitely many variables.
The primal linear program of LPBoost, optimizing over the non-negative weight vector
, the non-negative vector
of slack variables and the ''margin''
is the following.
:
Note the effects of slack variables
: their one-norm is penalized in the objective function by a constant factor
, which—if small enough—always leads to a primal feasible linear program.
Here we adopted the notation of a parameter space
, such that for a choice
the weak classifier
is uniquely defined.
When the above linear program was first written down in early publications about boosting methods it was disregarded as intractable due to the large number of variables
. Only later it was discovered that such linear programs can indeed be solved efficiently using the classic technique of
column generation.
Column generation for LPBoost
In a
linear program
Linear programming (LP), also called linear optimization, is a method to achieve the best outcome (such as maximum profit or lowest cost) in a mathematical model whose requirements are represented by linear relationships. Linear programming is ...
a ''column'' corresponds to a primal variable.
Column generation is a technique to solve large linear programs. It typically works in a restricted problem, dealing only with a subset of variables. By generating primal variables iteratively and on-demand, eventually the original unrestricted problem with all variables is recovered. By cleverly choosing the columns to generate the problem can be solved such that while still guaranteeing the obtained solution to be optimal for the original full problem, only a small fraction of columns has to be created.
LPBoost dual problem
Columns in the primal linear program corresponds to rows in the
dual linear program The dual of a given linear program (LP) is another LP that is derived from the original (the primal) LP in the following schematic way:
* Each variable in the primal LP becomes a constraint in the dual LP;
* Each constraint in the primal LP becomes ...
. The equivalent dual linear program of LPBoost is the following linear program.
:
For
linear program
Linear programming (LP), also called linear optimization, is a method to achieve the best outcome (such as maximum profit or lowest cost) in a mathematical model whose requirements are represented by linear relationships. Linear programming is ...
s the optimal value of the primal and
dual problem
In mathematical optimization theory, duality or the duality principle is the principle that optimization problems may be viewed from either of two perspectives, the primal problem or the dual problem. If the primal is a minimization problem then t ...
are equal. For the above primal and dual problems, the optimal value is equal to the negative 'soft margin'. The soft margin is the size of the margin separating positive from negative training instances minus positive slack variables that carry penalties for margin-violating samples. Thus, the soft margin may be positive although not all samples are linearly separated by the classification function. The latter is called the 'hard margin' or 'realized margin'.
Convergence criterion
Consider a subset of the satisfied constraints in the dual problem. For any finite subset we can solve the linear program and thus satisfy all constraints. If we could prove that of all the constraints which we did not add to the dual problem no single constraint is violated, we would have proven that solving our restricted problem is equivalent to solving the original problem. More formally, let
be the optimal objective function value for any restricted instance. Then, we can formulate a search problem for the 'most violated constraint' in the original problem space, namely finding
as
:
That is, we search the space
for a single
decision stump
A decision stump is a machine learning model consisting of a one-level decision tree. That is, it is a decision tree with one internal node (the root) which is immediately connected to the terminal nodes (its leaves). A decision stump makes a predi ...
maximizing the left hand side of the dual constraint. If the constraint cannot be violated by any choice of decision stump, none of the corresponding constraint can be active in the original problem and the restricted problem is equivalent.
Penalization constant
The positive value of penalization constant
has to be found using
model selection
Model selection is the task of selecting a statistical model from a set of candidate models, given data. In the simplest cases, a pre-existing set of data is considered. However, the task can also involve the design of experiments such that the ...
techniques. However, if we choose
, where
is the number of training samples and
, then the new parameter
has the following properties.
*
is an upper bound on the fraction of training errors; that is, if
denotes the number of misclassified training samples, then
.
*
is a lower bound on the fraction of training samples outside or on the margin.
Algorithm
* Input:
** Training set
,
** Training labels
,
** Convergence threshold
* Output:
** Classification function
# Initialization
## Weights, uniform
## Edge
## Hypothesis count
# Iterate
##
## if
then
### break
##
##
##
solution of the LPBoost dual
##
Lagrangian multipliers of solution to LPBoost dual problem
#
Note that if the convergence threshold is set to
the solution obtained is the global optimal solution of the above linear program. In practice,
is set to a small positive value in order obtain a good solution quickly.
Realized margin
The actual margin separating the training samples is termed the ''realized margin'' and is defined as
:
The realized margin can and will usually be negative in the first iterations. For a hypothesis space that permits singling out of any single sample, as is commonly the case, the realized margin will eventually converge to some positive value.
Convergence guarantee
While the above algorithm is proven to converge, in contrast to other boosting formulations, such as
AdaBoost
AdaBoost, short for ''Adaptive Boosting'', is a statistical classification meta-algorithm formulated by Yoav Freund and Robert Schapire in 1995, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many other types ...
and
TotalBoost, there are no known convergence bounds for LPBoost. In practise however, LPBoost is known to converge quickly, often faster than other formulations.
Base learners
LPBoost is an
ensemble learning
In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.
Unlike a statistical ensemble in statist ...
method and thus does not dictate the choice of base learners, the space of hypotheses
. Demiriz et al. showed that under mild assumptions, any base learner can be used. If the base learners are particularly simple, they are often referred to as ''
decision stump
A decision stump is a machine learning model consisting of a one-level decision tree. That is, it is a decision tree with one internal node (the root) which is immediately connected to the terminal nodes (its leaves). A decision stump makes a predi ...
s''.
The number of base learners commonly used with Boosting in the literature is large. For example, if
, a base learner could be a linear soft margin
support vector machine
In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories ...
. Or even more simple, a simple stump of the form
:
The above decision stumps looks only along a single dimension
of the input space and simply thresholds the respective column of the sample using a constant threshold
. Then, it can decide in either direction, depending on
for a positive or negative class.
Given weights for the training samples, constructing the optimal decision stump of the above form simply involves searching along all sample columns and determining
,
and
in order to optimize the gain function.
References
Linear Programming Boosting via Column Generation A. Demiriz and K.P. Bennett and J. Shawe-Taylor. Published 2002 in Kluwer Machine Learning 46, pages 225–254.
Ensemble learning