Random forests or random decision forests is an
ensemble learning
In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.
Unlike a statistical ensemble in statist ...
method for
classification
Classification is the activity of assigning objects to some pre-existing classes or categories. This is distinct from the task of establishing the classes themselves (for example through cluster analysis). Examples include diagnostic tests, identif ...
,
regression and other tasks that works by creating a multitude of
decision trees
A decision tree is a decision support system, decision support recursive partitioning structure that uses a Tree (graph theory), tree-like Causal model, model of decisions and their possible consequences, including probability, chance event ou ...
during training. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the output is the average of the predictions of the trees.
Random forests correct for decision trees' habit of
overfitting
In mathematical modeling, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably". An overfi ...
to their
training set
In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. Such algorithms function by making data-driven predictions or decisions, through building a mathematical model from ...
.
The first algorithm for random decision forests was created in 1995 by
Tin Kam Ho using the
random subspace method,
which, in Ho's formulation, is a way to implement the "stochastic discrimination" approach to classification proposed by Eugene Kleinberg.
An extension of the algorithm was developed by
Leo Breiman and
Adele Cutler,
who registered "Random Forests" as a
trademark
A trademark (also written trade mark or trade-mark) is a form of intellectual property that consists of a word, phrase, symbol, design, or a combination that identifies a Good (economics and accounting), product or Service (economics), service f ...
in 2006 (, owned by
Minitab, Inc.). The extension combines Breiman's "
bagging" idea and random selection of features, introduced first by Ho
and later independently by Amit and
Geman in order to construct a collection of decision trees with controlled variance.
History
The general method of random decision forests was first proposed by Salzberg and Heath in 1993, with a method that used a randomized decision tree algorithm to create multiple trees and then combine them using majority voting. This idea was developed further by Ho in 1995.
Ho established that forests of trees splitting with oblique hyperplanes can gain accuracy as they grow without suffering from overtraining, as long as the forests are randomly restricted to be sensitive to only selected
feature
Feature may refer to:
Computing
* Feature recognition, could be a hole, pocket, or notch
* Feature (computer vision), could be an edge, corner or blob
* Feature (machine learning), in statistics: individual measurable properties of the phenome ...
dimensions. A subsequent work along the same lines
concluded that other splitting methods behave similarly, as long as they are randomly forced to be insensitive to some feature dimensions. This observation that a more complex classifier (a larger forest) gets more accurate nearly monotonically is in sharp contrast to the common belief that the complexity of a classifier can only grow to a certain level of accuracy before being hurt by overfitting. The explanation of the forest method's resistance to overtraining can be found in Kleinberg's theory of stochastic discrimination.
The early development of Breiman's notion of random forests was influenced by the work of Amit and Geman
who introduced the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single
tree
In botany, a tree is a perennial plant with an elongated stem, or trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, e.g., including only woody plants with secondary growth, only ...
. The idea of random subspace selection from Ho
was also influential in the design of random forests. This method grows a forest of trees, and introduces variation among the trees by projecting the training data into a randomly chosen
subspace before fitting each tree or each node. Finally, the idea of randomized node optimization, where the decision at each node is selected by a randomized procedure, rather than a deterministic optimization was first introduced by
Thomas G. Dietterich.
The proper introduction of random forests was made in a paper by
Leo Breiman.
This paper describes a method of building a forest of uncorrelated trees using a
CART
A cart or dray (Australia and New Zealand) is a vehicle designed for transport, using two wheels and normally pulled by draught animals such as horses, donkeys, mules and oxen, or even smaller animals such as goats or large dogs.
A handcart ...
like procedure, combined with randomized node optimization and
bagging. In addition, this paper combines several ingredients, some previously known and some novel, which form the basis of the modern practice of random forests, in particular:
# Using
out-of-bag error as an estimate of the
generalization error
For supervised learning applications in machine learning and statistical learning theory, generalization errorMohri, M., Rostamizadeh A., Talwakar A., (2018) ''Foundations of Machine learning'', 2nd ed., Boston: MIT Press (also known as the out-of- ...
.
# Measuring variable importance through permutation.
The report also offers the first theoretical result for random forests in the form of a bound on the
generalization error
For supervised learning applications in machine learning and statistical learning theory, generalization errorMohri, M., Rostamizadeh A., Talwakar A., (2018) ''Foundations of Machine learning'', 2nd ed., Boston: MIT Press (also known as the out-of- ...
which depends on the strength of the trees in the forest and their
correlation
In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. Although in the broadest sense, "correlation" may indicate any type of association, in statistics ...
.
Algorithm
Preliminaries: decision tree learning
Decision trees are a popular method for various machine learning tasks. Tree learning is almost "an off-the-shelf procedure for data mining", say
Hastie ''et al.'', "because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate".
In particular, trees that are grown very deep tend to learn highly irregular patterns: they
overfit their training sets, i.e. have
low bias, but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance.
This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance in the final model.
Bagging

The training algorithm for random forests applies the general technique of
bootstrap aggregating, or bagging, to tree learners. Given a training set = , ..., with responses = , ..., , bagging repeatedly (''B'' times) selects a
random sample with replacement of the training set and fits trees to these samples:
After training, predictions for unseen samples can be made by averaging the predictions from all the individual regression trees on :
or by taking the plurality vote in the case of classification trees.
This bootstrapping procedure leads to better model performance because it decreases the
variance
In probability theory and statistics, variance is the expected value of the squared deviation from the mean of a random variable. The standard deviation (SD) is obtained as the square root of the variance. Variance is a measure of dispersion ...
of the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.
Additionally, an estimate of the uncertainty of the prediction can be made as the standard deviation of the predictions from all the individual regression trees on :
The number of samples (equivalently, of trees) is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. can be optimized using
cross-validation, or by observing the ''
out-of-bag error'': the mean prediction error on each training sample , using only the trees that did not have in their bootstrap sample.
The training and test error tend to level off after some number of trees have been fit.
From bagging to random forests
The above procedure describes the original bagging algorithm for trees. Random forests also include another type of bagging scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a
random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few
features are very strong predictors for the response variable (target output), these features will be selected in many of the trees, causing them to become correlated. An analysis of how bagging and random subspace projection contribute to accuracy gains under different conditions is given by Ho.
Typically, for a classification problem with features, (rounded down) features are used in each split.
For regression problems the inventors recommend (rounded down) with a minimum node size of 5 as the default.
In practice, the best values for these parameters should be tuned on a case-to-case basis for every problem.
ExtraTrees
Adding one further step of randomization yields ''extremely randomized trees'', or ExtraTrees. As with ordinary random forests, they are an ensemble of individual trees, but there are two main differences: (1) each tree is trained using the whole learning sample (rather than a bootstrap sample), and (2) the top-down splitting is randomized: for each feature under consideration, a number of ''random'' cut-points are selected, instead of computing the locally ''optimal'' cut-point (based on, e.g.,
information gain
Information is an abstract concept that refers to something which has the power to inform. At the most fundamental level, it pertains to the interpretation (perhaps formally) of that which may be sensed, or their abstractions. Any natur ...
or the
Gini impurity). The values are chosen from a uniform distribution within the feature's empirical range (in the tree's training set). Then, of all the randomly chosen splits, the split that yields the highest score is chosen to split the node.
Similar to ordinary random forests, the number of randomly selected features to be considered at each node can be specified. Default values for this parameter are
for classification and
for regression, where
is the number of features in the model.
Random forests for high-dimensional data
The basic random forest procedure may not work well in situations where there are a large number of features but only a small proportion of these features are informative with respect to sample classification. This can be addressed by encouraging the procedure to focus mainly on features and trees that are informative. Some methods for accomplishing this are:
* Prefiltering: Eliminate features that are mostly just noise.
* Enriched random forest (ERF): Use weighted random sampling instead of simple random sampling at each node of each tree, giving greater weight to features that appear to be more informative.
* Tree-weighted random forest (TWRF): Give more weight to more accurate trees.
Properties
Variable importance
Random forests can be used to rank the importance of variables in a regression or classification problem in a natural way. The following technique was described in Breiman's original paper
[ and is implemented in the R package ]randomForest
.[
]
Permutation importance
To measure a feature's importance in a data set , first a random forest is trained on the data. During training, the out-of-bag error for each data point is recorded and averaged over the forest. (If bagging is not used during training, we can instead compute errors on an independent test set.)
After training, the values of the feature are permuted in the out-of-bag samples and the out-of-bag error is again computed on this perturbed data set. The importance for the feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees. The score is normalized by the standard deviation of these differences.
Features which produce large values for this score are ranked as more important than features which produce small values. The statistical definition of the variable importance measure was given and analyzed by Zhu ''et al.''
This method of determining variable importance has some drawbacks:
* When features have different numbers of values, random forests favor features with more values. Solutions to this problem include partial permutation
In combinatorial mathematics, a partial permutation, or sequence without repetition, on a finite set ''S''
is a bijection between two specified subsets of ''S''. That is, it is defined by two subsets ''U'' and ''V'' of equal size, and a one-to-one ...
s and growing unbiased trees.
* If the data contain groups of correlated features of similar relevance, then smaller groups are favored over large groups.
* If there are collinear features, the procedure may fail to identify important features. A solution is to permute groups of correlated features together.
Mean decrease in impurity feature importance
This approach to feature importance for random forests considers as important the variables which decrease a lot the impurity during splitting. It is described in the book ''Classification and Regression Trees'' by Leo Breiman and is the default implementation in sci-kit learn
and R. The definition is:where
* is a feature
* is the number of trees in the forest
* is tree
* is the fraction of samples reaching node
* is the change in impurity in tree at node .
As impurity measure for samples falling in a node e.g. the following statistics can be used:
*Entropy
Entropy is a scientific concept, most commonly associated with states of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodynamics, where it was first recognized, to the micros ...
*Gini coefficient
In economics, the Gini coefficient ( ), also known as the Gini index or Gini ratio, is a measure of statistical dispersion intended to represent the income distribution, income inequality, the wealth distribution, wealth inequality, or the ...
*Mean squared error
In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors—that is, the average squared difference betwee ...
The normalized importance is then obtained by normalizing over all features, so that the sum of normalized feature importances is 1.
The sci-kit learn
default implementation can report misleading feature importance:
* it favors high cardinality features
* it uses training statistics and so does not reflect a feature's usefulness for predictions on a test set
Relationship to nearest neighbors
A relationship between random forests and the -nearest neighbor algorithm (-NN) was pointed out by Lin and Jeon in 2002. Both can be viewed as so-called ''weighted neighborhoods schemes''. These are models built from a training set that make predictions for new points by looking at the "neighborhood" of the point, formalized by a weight function :Here, is the non-negative weight of the 'th training point relative to the new point in the same tree. For any , the weights for points must sum to 1. Weight functions are as follows:
* In -NN, if is one of the points closest to , and zero otherwise.
* In a tree, if is one of the points in the same leaf as , and zero otherwise.
Since a forest averages the predictions of a set of trees with individual weight functions , its predictions are
This shows that the whole forest is again a weighted neighborhood scheme, with weights that average those of the individual trees. The neighbors of in this interpretation are the points sharing the same leaf in any tree . In this way, the neighborhood of depends in a complex way on the structure of the trees, and thus on the structure of the training set. Lin and Jeon show that the shape of the neighborhood used by a random forest adapts to the local importance of each feature.
Unsupervised learning
As part of their construction, random forest predictors naturally lead to a dissimilarity measure among observations. One can analogously define dissimilarity between unlabeled data, by training a forest to distinguish original "observed" data from suitably generated synthetic data drawn from a reference distribution.[ A random forest dissimilarity is attractive because it handles mixed variable types very well, is invariant to monotonic transformations of the input variables, and is robust to outlying observations. Random forest dissimilarity easily deals with a large number of semi-continuous variables due to its intrinsic variable selection; for example, the "Addcl 1" random forest dissimilarity weighs the contribution of each variable according to how dependent it is on other variables. Random forest dissimilarity has been used in a variety of applications, e.g. to find clusters of patients based on tissue marker data.
]
Variants
Instead of decision trees, linear models have been proposed and evaluated as base estimators in random forests, in particular multinomial logistic regression
In statistics, multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes. That is, it is a model that is used to predict the prob ...
and naive Bayes classifier
In statistics, naive (sometimes simple or idiot's) Bayes classifiers are a family of " probabilistic classifiers" which assumes that the features are conditionally independent, given the target class. In other words, a naive Bayes model assumes th ...
s. In cases that the relationship between the predictors and the target variable is linear, the base learners may have an equally high accuracy as the ensemble learner.
Kernel random forest
In machine learning, kernel random forests (KeRF) establish the connection between random forests and kernel methods. By slightly modifying their definition, random forests can be rewritten as kernel methods, which are more interpretable and easier to analyze.
History
Leo Breiman was the first person to notice the link between random forest and kernel methods
In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). These methods involve using linear classifiers to solve nonlinear problems. The general task of pa ...
. He pointed out that random forests trained using i.i.d. random vectors in the tree construction are equivalent to a kernel acting on the true margin. Lin and Jeon established the connection between random forests and adaptive nearest neighbor, implying that random forests can be seen as adaptive kernel estimates. Davies and Ghahramani proposed Kernel Random Forest (KeRF) and showed that it can empirically outperform state-of-art kernel methods. Scornet first defined KeRF estimates and gave the explicit link between KeRF estimates and random forest. He also gave explicit expressions for kernels based on centered random forest and uniform random forest, two simplified models of random forest. He named these two KeRFs Centered KeRF and Uniform KeRF, and proved upper bounds on their rates of consistency.
Notations and definitions
Preliminaries: Centered forests
Centered forest is a simplified model for Breiman's original random forest, which uniformly selects an attribute among all attributes and performs splits at the center of the cell along the pre-chosen attribute. The algorithm stops when a fully binary tree of level is built, where is a parameter of the algorithm.
Uniform forest
Uniform forest is another simplified model for Breiman's original random forest, which uniformly selects a feature among all features and performs splits at a point uniformly drawn on the side of the cell, along the preselected feature.
From random forest to KeRF
Given a training sample of -valued independent random variables distributed as the independent prototype pair , where . We aim at predicting the response , associated with the random variable , by estimating the regression function