Natural evolution strategies (NES) are a family of

numerical optimization Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criteria, from some set of available alternatives. It is generally divided into two subfiel ...

algorithms for

black box In science, computing, and engineering, a black box is a system which can be viewed in terms of its inputs and outputs (or transfer characteristics), without any knowledge of its internal workings. Its implementation is "opaque" (black). The te ...

problems. Similar in spirit to

evolution strategies Evolution strategy (ES) from computer science is a subclass of evolutionary algorithms, which serves as an optimization (mathematics), optimization technique. It uses the major genetic operators mutation (evolutionary algorithm), mutation, recomb ...

, they iteratively update the (continuous) parameters of a ''search distribution'' by following the natural gradient towards higher expected fitness.

Method

The general procedure is as follows: the ''parameterized'' search distribution is used to produce a batch of search points, and the

fitness function A fitness function is a particular type of objective or cost function that is used to summarize, as a single figure of merit, how close a given candidate solution is to achieving the set aims. It is an important component of evolutionary algorit ...

is evaluated at each such point. The distribution’s parameters (which include ''strategy parameters'') allow the algorithm to adaptively capture the (local) structure of the fitness function. For example, in the case of a

Gaussian distribution In probability theory and statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real number, real-valued random variable. The general form of its probability density function is f(x ...

, this comprises the mean and the

covariance matrix In probability theory and statistics, a covariance matrix (also known as auto-covariance matrix, dispersion matrix, variance matrix, or variance–covariance matrix) is a square matrix giving the covariance between each pair of elements of ...

. From the samples, NES estimates a search gradient on the parameters towards higher expected fitness. NES then performs a gradient ascent step along the natural gradient, a second order method which, unlike the plain gradient, renormalizes the update with respect to uncertainty. This step is crucial, since it prevents oscillations, premature convergence, and undesired effects stemming from a given parameterization. The entire process reiterates until a stopping criterion is met. All members of the NES family operate based on the same principles. They differ in the type of

probability distribution In probability theory and statistics, a probability distribution is a Function (mathematics), function that gives the probabilities of occurrence of possible events for an Experiment (probability theory), experiment. It is a mathematical descri ...

and the gradient

approximation An approximation is anything that is intentionally similar but not exactly equal to something else. Etymology and usage The word ''approximation'' is derived from Latin ''approximatus'', from ''proximus'' meaning ''very near'' and the prefix ...

method used. Different search spaces require different search distributions; for example, in low dimensionality it can be highly beneficial to model the full covariance matrix. In high dimensions, on the other hand, a more scalable alternative is to limit the covariance to the

diagonal In geometry, a diagonal is a line segment joining two vertices of a polygon or polyhedron, when those vertices are not on the same edge. Informally, any sloping line is called diagonal. The word ''diagonal'' derives from the ancient Greek � ...

only. In addition, highly multi-modal search spaces may benefit from more

heavy-tailed distributions In probability theory, heavy-tailed distributions are probability distributions whose tails are not exponentially bounded: that is, they have heavier tails than the exponential distribution. Roughly speaking, “heavy-tailed” means the distribu ...

(such as

Cauchy Baron Augustin-Louis Cauchy ( , , ; ; 21 August 1789 – 23 May 1857) was a French mathematician, engineer, and physicist. He was one of the first to rigorously state and prove the key theorems of calculus (thereby creating real a ...

, as opposed to the Gaussian). A last distinction arises between distributions where we can analytically compute the natural gradient, and more general distributions where we need to estimate it from samples.

Search gradients

Let

\theta

denote the parameters of the search distribution

\pi(x \,, \, \theta)

and

f(x)

the fitness function evaluated at

x

. NES then pursues the objective of maximizing the ''expected fitness under the search distribution'' ::

J(\theta) = \operatorname_\theta

(x) An emoticon (, , rarely , ), short for emotion icon, is a pictorial representation of a facial expression using characters—usually punctuation marks, numbers and letters—to express a person's feelings, mood or reaction, without needin ...

= \int f(x) \; \pi(x \,, \, \theta) \; dx through

gradient ascent Gradient descent is a method for unconstrained mathematical optimization. It is a first-order iterative algorithm for minimizing a differentiable multivariate function. The idea is to take repeated steps in the opposite direction of the grad ...

. The gradient can be rewritten as ::

\nabla_ J(\theta) = \nabla_ \int f(x) \; \pi(x \,, \, \theta) \; dx

:::

= \int f(x) \; \nabla_ \pi(x \,, \, \theta) \; dx

:::

= \int f(x) \; \nabla_ \pi(x \,, \, \theta) \; \frac \; dx

:::

\; \pi(x \,, \, \theta) \; dx

:::

= \operatorname_\theta \left \, \theta)\right /math>
that is, the

expected value In probability theory, the expected value (also called expectation, expectancy, expectation operator, mathematical expectation, mean, expectation value, or first Moment (mathematics), moment) is a generalization of the weighted average. Informa ...

f(x)

times the log-derivatives at

x

. In practice, it is possible to use the

Monte Carlo Monte Carlo ( ; ; or colloquially ; , ; ) is an official administrative area of Monaco, specifically the Ward (country subdivision), ward of Monte Carlo/Spélugues, where the Monte Carlo Casino is located. Informally, the name also refers to ...

approximation based on a finite number of

\lambda

samples ::

\nabla_ J(\theta) \approx 
   \frac 
\sum_^ f(x_k) \; \nabla_ 
\log\pi(x_k \,, \, \theta)

. Finally, the parameters of the search distribution can be updated iteratively ::

\theta \leftarrow \theta + \eta \nabla_ J(\theta)

Natural gradient ascent

Instead of using the plain stochastic gradient for updates, NES follows the natural gradient, which has been shown to possess numerous advantages over the plain (''vanilla'') gradient, e.g.: * the gradient direction is independent of the parameterization of the search distribution * the updates magnitudes are automatically adjusted based on uncertainty, in turn speeding convergence on

plateaus In geology and physical geography, a plateau (; ; : plateaus or plateaux), also called a high plain or a tableland, is an area of a highland consisting of flat terrain that is raised sharply above the surrounding area on at least one side. Oft ...

and ridges. The NES update is therefore ::

\theta \leftarrow \theta + \eta \mathbf^\nabla_ J(\theta)

, where

\mathbf

is the

Fisher information matrix In mathematical statistics, the Fisher information is a way of measuring the amount of information that an observable random variable ''X'' carries about an unknown parameter ''θ'' of a distribution that models ''X''. Formally, it is the variance ...

. The Fisher matrix can sometimes be computed exactly, otherwise it is estimated from samples, reusing the log-derivatives

\nabla_\theta \log\pi(x, \theta)

Fitness shaping

NES utilizes

rank A rank is a position in a hierarchy. It can be formally recognized—for example, cardinal, chief executive officer, general, professor—or unofficial. People Formal ranks * Academic rank * Corporate title * Diplomatic rank * Hierarchy ...

-based fitness shaping in order to render the algorithm more robust, and ''invariant'' under monotonically increasing transformations of the fitness function. For this purpose, the fitness of the population is transformed into a set of

utility In economics, utility is a measure of a certain person's satisfaction from a certain state of the world. Over time, the term has been used with at least two meanings. * In a normative context, utility refers to a goal or objective that we wish ...

values

u_1 \geq \dots \geq u_\lambda

. Let

x_i

denote the i^th best individual. Replacing fitness with utility, the gradient estimate becomes ::

\nabla_ J (\theta) = \sum_^\lambda u_k \; \nabla_ \log\pi(x_k \,, \, \theta)

. The choice of utility function is a free parameter of the algorithm.

Pseudocode

1 repeat 2 ''// is the population size'' 3 4 5 6 end 7 ''// based on rank'' 8 9 ''// or compute it exactly'' 10 ''// is the learning rate'' 11 until stopping criterion is met

Bibliography

* D. Wierstra, T. Schaul, J. Peters and J. Schmidhuber (2008)
Natural Evolution Strategies
IEEE Congress on Evolutionary Computation (CEC). * Y. Sun, D. Wierstra, T. Schaul and J. Schmidhuber (2009)
Stochastic Search using the Natural Gradient
International Conference on Machine Learning (ICML). * T. Glasmachers, T. Schaul, Y. Sun, D. Wierstra and J. Schmidhuber (2010)
Exponential Natural Evolution Strategies
Genetic and Evolutionary Computation Conference (GECCO). * T. Schaul, T. Glasmachers and J. Schmidhuber (2011)
High Dimensions and Heavy Tails for Natural Evolution Strategies
Genetic and Evolutionary Computation Conference (GECCO). * T. Schaul (2012)
Natural Evolution Strategies Converge on Sphere Functions
Genetic and Evolutionary Computation Conference (GECCO).

External links

{{Evolutionary computation Evolution strategy Stochastic optimization Articles with example pseudocode