Simultaneous perturbation stochastic approximation (SPSA) is an algorithmic method for optimizing systems with multiple unknown parameters. It is a type of stochastic approximation algorithm. As an optimization method, it is appropriately suited to large-scale population models, adaptive modeling, simulation optimization, and atmospheric modeling. Many examples are presented at the SPSA website http://www.jhuapl.edu/SPSA. A comprehensive book on the subject is Bhatnagar et al. (2013). An early paper on the subject is Spall (1987), and the foundational paper providing the key theory and justification is Spall (1992).
SPSA is a descent method capable of finding global minima, sharing this property with other methods such as simulated annealing. Its main feature is the gradient approximation, which requires only two measurements of the objective function regardless of the dimension of the optimization problem. Recall that we want to find the optimal control $u^*$ with loss function $J(u)$:

: $u^* = \arg\min_{u \in U} J(u).$
Both Finite Differences Stochastic Approximation (FDSA) and SPSA use the same iterative process:

: $u_{n+1} = u_n - a_n \hat{g}_n(u_n),$

where $u_n$ represents the $n$-th iterate, $\hat{g}_n(u_n)$ is the estimate of the gradient of the objective function $g(u) = \nabla_u J(u)$ evaluated at $u_n$, and $(a_n)$ is a positive number sequence converging to 0. If $u_n$ is a $p$-dimensional vector, the $i$-th component of the symmetric finite difference gradient estimator is:

: FD: $(\hat{g}_n(u_n))_i = \frac{J(u_n + c_n e_i) - J(u_n - c_n e_i)}{2 c_n}, \qquad 1 \le i \le p,$

where $e_i$ is the unit vector with a 1 in the $i$-th place, and $c_n$ is a small positive number that decreases with $n$. With this method, $2p$ evaluations of $J$ are needed for each $\hat{g}_n$. When $p$ is large, this estimator loses efficiency.
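As an illustration, here is a minimal sketch of the FD estimator in Python (the names J, u, and fd_gradient are illustrative, not from the original text; NumPy is assumed):

    import numpy as np

    def fd_gradient(J, u, c_n):
        """Symmetric finite-difference estimate of the gradient of J at u.

        Requires 2p evaluations of J, where p = len(u).
        """
        p = len(u)
        g = np.zeros(p)
        for i in range(p):
            e_i = np.zeros(p)
            e_i[i] = 1.0  # unit vector with a 1 in the i-th place
            g[i] = (J(u + c_n * e_i) - J(u - c_n * e_i)) / (2.0 * c_n)
        return g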
Now let $\Delta_n$ be a random perturbation vector. The $i$-th component of the stochastic perturbation gradient estimator is:

: SP: $(\hat{g}_n(u_n))_i = \frac{J(u_n + c_n \Delta_n) - J(u_n - c_n \Delta_n)}{2 c_n (\Delta_n)_i}.$

Note that FD perturbs only one direction at a time, while the SP estimator disturbs all directions at the same time (the numerator is identical in all $p$ components). The number of loss-function measurements needed in the SPSA method for each $\hat{g}_n$ is always 2, independent of the dimension $p$. Thus, SPSA uses $p$ times fewer function evaluations than FDSA, which makes it far more efficient.
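A hedged sketch of the SP estimator and of the common iteration $u_{n+1} = u_n - a_n \hat{g}_n(u_n)$ follows; the ±1 Bernoulli perturbations and the particular gain sequences $a_n$ and $c_n$ are common choices rather than anything prescribed above:

    import numpy as np

    def spsa_gradient(J, u, c_n, rng):
        """SPSA gradient estimate: two evaluations of J, whatever the dimension p."""
        delta = rng.choice([-1.0, 1.0], size=len(u))  # random perturbation vector
        # One numerator shared by all p components; only (Delta_n)_i varies.
        diff = J(u + c_n * delta) - J(u - c_n * delta)
        return diff / (2.0 * c_n * delta)

    def spsa(J, u0, n_iter=1000, seed=0):
        """Run u_{n+1} = u_n - a_n * g_hat_n(u_n) with decaying gains."""
        rng = np.random.default_rng(seed)
        u = np.asarray(u0, dtype=float)
        for n in range(1, n_iter + 1):
            a_n = 0.1 / n          # positive step sizes converging to 0
            c_n = 0.1 / n ** 0.25  # perturbation sizes decreasing with n
            u = u - a_n * spsa_gradient(J, u, c_n, rng)
        return u

For instance, spsa(lambda u: float(np.sum(u ** 2)), [1.0, -2.0]) drives a simple quadratic loss toward its minimizer at the origin using just two loss measurements per iteration.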
Simple experiments with $p = 2$ showed that SPSA converges in the same number of iterations as FDSA. The latter follows approximately the steepest-descent direction, behaving like the gradient method. SPSA, by contrast, with its random search direction, does not follow the gradient path exactly. On average, though, it tracks it closely, because the gradient approximation is an almost unbiased estimator of the gradient, as shown in the following lemma.
Convergence lemma
Denote by

: $b_n = E[\hat{g}_n \mid u_n] - \nabla J(u_n)$

the bias in the estimator $\hat{g}_n$. Assume that the $(\Delta_n)_i$ are all mutually independent with zero mean and bounded second moments, and that $E(|(\Delta_n)_i|^{-1})$ is uniformly bounded. Then $b_n \to 0$ w.p. 1.
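Before the proof sketch, the lemma can be checked numerically: averaging many independent SP gradient estimates at a fixed point should nearly recover the true gradient. A sketch reusing the illustrative spsa_gradient above, on a quadratic whose gradient is known in closed form:

    import numpy as np

    J = lambda u: u[0] ** 2 + 3.0 * u[1] ** 2   # true gradient: (2*u1, 6*u2)
    u = np.array([1.0, -2.0])
    true_grad = np.array([2.0 * u[0], 6.0 * u[1]])

    rng = np.random.default_rng(0)
    estimates = [spsa_gradient(J, u, c_n=0.01, rng=rng) for _ in range(100_000)]
    bias = np.mean(estimates, axis=0) - true_grad
    print(bias)  # each component is close to 0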
Sketch of the proof
The main idea is to use conditioning on $\Delta_n$ to express $E[(\hat{g}_n)_i \mid u_n]$ and then to apply a second-order Taylor expansion of $J(u_n + c_n \Delta_n)$ and $J(u_n - c_n \Delta_n)$ around $u_n$.