Covariance matrix adaptation evolution strategy (CMA-ES) is a particular kind of strategy for
numerical optimization
Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criteria, from some set of available alternatives. It is generally divided into two subfiel ...
.
Evolution strategies
Evolution strategy (ES) from computer science is a subclass of evolutionary algorithms, which serves as an optimization (mathematics), optimization technique. It uses the major genetic operators mutation (evolutionary algorithm), mutation, recomb ...
(ES) are
stochastic Stochastic (; ) is the property of being well-described by a random probability distribution. ''Stochasticity'' and ''randomness'' are technically distinct concepts: the former refers to a modeling approach, while the latter describes phenomena; i ...
,
derivative-free methods for
numerical optimization
Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criteria, from some set of available alternatives. It is generally divided into two subfiel ...
of non-
linear
In mathematics, the term ''linear'' is used in two distinct senses for two different properties:
* linearity of a '' function'' (or '' mapping'');
* linearity of a '' polynomial''.
An example of a linear function is the function defined by f(x) ...
or non-
convex
Convex or convexity may refer to:
Science and technology
* Convex lens, in optics
Mathematics
* Convex set, containing the whole line segment that joins points
** Convex polygon, a polygon which encloses a convex set of points
** Convex polytop ...
continuous optimization
Continuous optimization is a branch of optimization in applied mathematics.
As opposed to discrete optimization, the variables used in the objective function are required to be continuous variables—that is, to be chosen from a set of ...
problems. They belong to the class of
evolutionary algorithms
Evolutionary algorithms (EA) reproduce essential elements of the biological evolution in a computer algorithm in order to solve "difficult" problems, at least Approximation, approximately, for which no exact or satisfactory solution methods are k ...
and
evolutionary computation
Evolutionary computation from computer science is a family of algorithms for global optimization inspired by biological evolution, and the subfield of artificial intelligence and soft computing studying these algorithms. In technical terms ...
. An
evolutionary algorithm
Evolutionary algorithms (EA) reproduce essential elements of the biological evolution in a computer algorithm in order to solve "difficult" problems, at least Approximation, approximately, for which no exact or satisfactory solution methods are k ...
is broadly based on the principle of
biological evolution
Evolution is the change in the heritable characteristics of biological populations over successive generations. It occurs when evolutionary processes such as natural selection and genetic drift act on genetic variation, resulting in certai ...
, namely the repeated interplay of variation (via recombination and mutation) and selection: in each generation (iteration) new individuals (candidate solutions, denoted as
) are generated by variation of the current parental individuals, usually in a stochastic way. Then, some individuals are selected to become the parents in the next generation based on their fitness or
objective function
In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost ...
value
. Like this, individuals with better and better
-values are generated over the generation sequence.
In an
evolution strategy
Evolution strategy (ES) from computer science is a subclass of evolutionary algorithms, which serves as an optimization (mathematics), optimization technique. It uses the major genetic operators mutation (evolutionary algorithm), mutation, recomb ...
, new candidate solutions are usually sampled according to a
multivariate normal distribution
In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional ( univariate) normal distribution to higher dimensions. One d ...
in
. Recombination amounts to selecting a new mean value for the distribution. Mutation amounts to adding a random vector, a perturbation with zero mean. Pairwise dependencies between the variables in the distribution are represented by a
covariance matrix
In probability theory and statistics, a covariance matrix (also known as auto-covariance matrix, dispersion matrix, variance matrix, or variance–covariance matrix) is a square matrix giving the covariance between each pair of elements of ...
. The covariance matrix adaptation (CMA) is a method to update the
covariance matrix
In probability theory and statistics, a covariance matrix (also known as auto-covariance matrix, dispersion matrix, variance matrix, or variance–covariance matrix) is a square matrix giving the covariance between each pair of elements of ...
of this distribution. This is particularly useful if the function
is
ill-conditioned
In numerical analysis, the condition number of a function measures how much the output value of the function can change for a small change in the input argument. This is used to measure how sensitive a function is to changes or errors in the inpu ...
.
Adaptation of the
covariance matrix
In probability theory and statistics, a covariance matrix (also known as auto-covariance matrix, dispersion matrix, variance matrix, or variance–covariance matrix) is a square matrix giving the covariance between each pair of elements of ...
amounts to learning a second order model of the underlying
objective function
In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost ...
similar to the approximation of the inverse
Hessian matrix
In mathematics, the Hessian matrix, Hessian or (less commonly) Hesse matrix is a square matrix of second-order partial derivatives of a scalar-valued Function (mathematics), function, or scalar field. It describes the local curvature of a functio ...
in the
quasi-Newton method
In numerical analysis, a quasi-Newton method is an iterative numerical method used either to find zeroes or to find local maxima and minima of functions via an iterative recurrence formula much like the one for Newton's method, except using ap ...
in classical
optimization
Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criteria, from some set of available alternatives. It is generally divided into two subfiel ...
. In contrast to most classical methods, fewer assumptions on the underlying objective function are made. Because only a ranking (or, equivalently, sorting) of candidate solutions is exploited, neither derivatives nor even an (explicit) objective function is required by the method. For example, the ranking could come about from pairwise competitions between the candidate solutions in a
Swiss-system tournament
A Swiss-system tournament is a non-eliminating tournament format that features a fixed number of rounds of competition, but considerably fewer than for a round-robin tournament; thus each competitor (team or individual) does not play all the other ...
.
Principles
Two main principles for the adaptation of parameters of the search distribution are exploited in the CMA-ES algorithm.
First, a
maximum-likelihood
In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed stati ...
principle, based on the idea to increase the probability of successful candidate solutions and search steps. The mean of the distribution is updated such that the
likelihood
A likelihood function (often simply called the likelihood) measures how well a statistical model explains observed data by calculating the probability of seeing that data under different parameter values of the model. It is constructed from the j ...
of previously successful candidate solutions is maximized. The
covariance matrix
In probability theory and statistics, a covariance matrix (also known as auto-covariance matrix, dispersion matrix, variance matrix, or variance–covariance matrix) is a square matrix giving the covariance between each pair of elements of ...
of the distribution is updated (incrementally) such that the likelihood of previously successful search steps is increased. Both updates can be interpreted as a
natural gradient descent. Also, in consequence, the CMA conducts an iterated
principal components analysis
Principal component analysis (PCA) is a Linear map, linear dimensionality reduction technique with applications in exploratory data analysis, visualization and Data Preprocessing, data preprocessing.
The data is linear map, linearly transformed ...
of successful search steps while retaining ''all'' principal axes.
Estimation of distribution algorithms
''Estimation of distribution algorithms'' (EDAs), sometimes called ''probabilistic model-building genetic algorithms'' (PMBGAs), are stochastic optimization methods that guide the search for the optimum by building and sampling explicit probabili ...
and the
Cross-Entropy Method are based on very similar ideas, but estimate (non-incrementally) the covariance matrix by maximizing the likelihood of successful solution ''points'' instead of successful search ''steps''.
Second, two paths of the time evolution of the distribution mean of the strategy are recorded, called search or evolution paths. These paths contain significant information about the correlation between consecutive steps. Specifically, if consecutive steps are taken in a similar direction, the evolution paths become long. The evolution paths are exploited in two ways. One path is used for the covariance matrix adaptation procedure in place of single successful search steps and facilitates a possibly much faster variance increase of favorable directions. The other path is used to conduct an additional step-size control. This step-size control aims to make consecutive movements of the distribution mean orthogonal in expectation. The step-size control effectively prevents
premature convergence
Premature convergence is an unwanted effect in evolutionary algorithms (EA), a metaheuristic that mimics the basic principles of biological evolution as a computer algorithm for solving an optimization problem. The effect means that the population ...
yet allowing fast convergence to an optimum.
Algorithm
In the following the most commonly used (''μ''/''μ''
''w'', ''λ'')-CMA-ES is outlined, where in each iteration step a weighted combination of the ''μ'' best out of ''λ'' new candidate solutions is used to update the distribution parameters. The main loop consists of three main parts: 1) sampling of new solutions, 2) re-ordering of the sampled solutions based on their fitness, 3) update of the internal state variables based on the re-ordered samples. A
pseudocode
In computer science, pseudocode is a description of the steps in an algorithm using a mix of conventions of programming languages (like assignment operator, conditional operator, loop) with informal, usually self-explanatory, notation of actio ...
of the algorithm looks as follows.
set
// number of samples per iteration, at least two, generally > 4
initialize
,
,
,
,
// initialize state variables
while ''not terminate'' do // iterate
for
in
do // sample
new solutions and evaluate them
sample_multivariate_normal(mean
, covariance_matrix
)
←
with
// sort solutions
// we need later
and
← update_m
// move mean to better solutions
← update_ps
// update isotropic evolution path
← update_pc
// update anisotropic evolution path
← update_C
// update covariance matrix
← update_sigma
// update step-size using isotropic path length
return
or
The order of the five update assignments is relevant:
must be updated first,
and
must be updated before
, and
must be updated last. The update equations for the five state variables are specified in the following.
Given are the search space dimension
and the iteration step
. The five state variables are
*
, the distribution mean and current favorite solution to the optimization problem,
*
, the step-size,
*
, a symmetric and
positive-definite In mathematics, positive definiteness is a property of any object to which a bilinear form or a sesquilinear form may be naturally associated, which is positive-definite. See, in particular:
* Positive-definite bilinear form
* Positive-definite ...
covariance matrix
In probability theory and statistics, a covariance matrix (also known as auto-covariance matrix, dispersion matrix, variance matrix, or variance–covariance matrix) is a square matrix giving the covariance between each pair of elements of ...
with
and
*
, two evolution paths, initially set to the zero vector.
The iteration starts with sampling
candidate solutions
from a
multivariate normal distribution
In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional ( univariate) normal distribution to higher dimensions. One d ...
, i.e.
for
The second line suggests the interpretation as unbiased perturbation (mutation) of the current favorite solution vector
(the distribution mean vector). The candidate solutions
are evaluated on the objective function
to be minimized. Denoting the
-sorted candidate solutions as
the new mean value is computed as
where the positive (recombination) weights
sum to one. Typically,
and the weights are chosen such that
. The only feedback used from the objective function here and in the following is an ordering of the sampled candidate solutions due to the indices
.
The step-size
is updated using ''cumulative step-size adaptation'' (CSA), sometimes also denoted as ''path length control''. The evolution path (or search path)
is updated first.
where
*
is the backward time horizon for the evolution path
and larger than one (
is reminiscent of an
exponential decay
A quantity is subject to exponential decay if it decreases at a rate proportional to its current value. Symbolically, this process can be expressed by the following differential equation, where is the quantity and (lambda
Lambda (; uppe ...
constant as
where
is the associated lifetime and
the half-life),
*
is the variance effective selection mass and
by definition of
,
*
is the unique symmetric
square root
In mathematics, a square root of a number is a number such that y^2 = x; in other words, a number whose ''square'' (the result of multiplying the number by itself, or y \cdot y) is . For example, 4 and −4 are square roots of 16 because 4 ...
of the
inverse of
, and
*
is the damping parameter usually close to one. For
or
the step-size remains unchanged.
The step-size
is increased if and only if
is larger than the
expected value
In probability theory, the expected value (also called expectation, expectancy, expectation operator, mathematical expectation, mean, expectation value, or first Moment (mathematics), moment) is a generalization of the weighted average. Informa ...
and decreased if it is smaller. For this reason, the step-size update tends to make consecutive steps
-conjugate, in that after the adaptation has been successful
.
Finally, the
covariance matrix
In probability theory and statistics, a covariance matrix (also known as auto-covariance matrix, dispersion matrix, variance matrix, or variance–covariance matrix) is a square matrix giving the covariance between each pair of elements of ...
is updated, where again the respective evolution path is updated first.
where
denotes the transpose and
*
is the backward time horizon for the evolution path
and larger than one,
*
and the
indicator function
In mathematics, an indicator function or a characteristic function of a subset of a set is a function that maps elements of the subset to one, and all other elements to zero. That is, if is a subset of some set , then the indicator functio ...
evaluates to one
if and only if
In logic and related fields such as mathematics and philosophy, "if and only if" (often shortened as "iff") is paraphrased by the biconditional, a logical connective between statements. The biconditional is true in two cases, where either bo ...