Simultaneous perturbation stochastic approximation (SPSA) is an

algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algorithms are used as specificat ...

ic method for optimizing systems with multiple unknown

parameters A parameter (), generally, is any characteristic that can help in defining or classifying a particular system (meaning an event, project, object, situation, etc.). That is, a parameter is an element of a system that is useful, or critical, when ...

. It is a type of

stochastic approximation Stochastic approximation methods are a family of iterative methods typically used for root-finding problems or for optimization problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving l ...

algorithm. As an optimization method, it is appropriately suited to large-scale population models, adaptive modeling, simulation

optimization Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criterion, from some set of available alternatives. It is generally divided into two subfi ...

, and

atmospheric model An atmospheric model is a mathematical model constructed around the full set of primitive dynamical equations which govern atmospheric motions. It can supplement these equations with parameterizations for turbulent diffusion, radiation, moist ...

ing. Many examples are presented at the SPSA website http://www.jhuapl.edu/SPSA. A comprehensive book on the subject is Bhatnagar et al. (2013). An early paper on the subject is Spall (1987) and the foundational paper providing the key theory and justification is Spall (1992). SPSA is a descent method capable of finding global minima, sharing this property with other methods as

simulated annealing Simulated annealing (SA) is a probabilistic technique for approximating the global optimum of a given function. Specifically, it is a metaheuristic to approximate global optimization in a large search space for an optimization problem. It ...

. Its main feature is the gradient approximation that requires only two measurements of the objective function, regardless of the dimension of the optimization problem. Recall that we want to find the optimal control

u^*

with loss function

J(u)

: :

u^* = \arg  \min_ J(u).

Both Finite Differences Stochastic Approximation (FDSA) and SPSA use the same iterative process: :

u_ = u_n - a_n\hat_n(u_n),

where

u_n=((u_n)_1,(u_n)_2,\ldots,(u_n)_p)^T

represents the

n^

iterate,

\hat_n(u_n)

is the estimate of the gradient of the objective function

g(u)= \fracJ(u)

evaluated at

, and

\

is a positive number sequence converging to 0. If

u_n

is a ''p''-dimensional vector, the

i^

component of the

symmetric Symmetry (from grc, συμμετρία "agreement in dimensions, due proportion, arrangement") in everyday language refers to a sense of harmonious and beautiful proportion and balance. In mathematics, "symmetry" has a more precise definiti ...

finite difference gradient estimator is: :FD:

(\hat(u_n))_i = \frac,

''1 ≤i ≤p'', where

e_i

is the unit vector with a 1 in the

i^

place, and

c_n

is a small positive number that decreases with ''n''. With this method, ''2p'' evaluations of ''J'' for each

g_n

are needed. Clearly, when ''p'' is large, this estimator loses efficiency. Let now

\Delta_n

be a random perturbation vector. The

i^

component of the stochastic perturbation gradient estimator is: :SP:

(\hat(u_n))_i = \frac.

Remark that FD perturbs only one direction at a time, while the SP estimator disturbs all directions at the same time (the numerator is identical in all ''p'' components). The number of loss function measurements needed in the SPSA method for each

g_n

is always 2, independent of the

dimension In physics and mathematics, the dimension of a Space (mathematics), mathematical space (or object) is informally defined as the minimum number of coordinates needed to specify any Point (geometry), point within it. Thus, a Line (geometry), lin ...

''p''. Thus, SPSA uses ''p'' times fewer function evaluations than FDSA, which makes it a lot more efficient. Simple experiments with ''p=2'' showed that SPSA converges in the same number of iterations as FDSA. The latter follows

approximately An approximation is anything that is intentionally similar but not exactly equal to something else. Etymology and usage The word ''approximation'' is derived from Latin ''approximatus'', from ''proximus'' meaning ''very near'' and the prefix ' ...

the steepest descent direction, behaving like the gradient method. On the other hand, SPSA, with the random search direction, does not follow exactly the gradient path. In average though, it tracks it nearly because the gradient approximation is an almost

unbiased Bias is a disproportionate weight ''in favor of'' or ''against'' an idea or thing, usually in a way that is closed-minded, prejudicial, or unfair. Biases can be innate or learned. People may develop biases for or against an individual, a group, ...

estimator of the gradient, as shown in the following lemma.

Convergence lemma

Denote by :

-\nabla J(u_n)

the bias in the estimator

\hat_n

. Assume that

\

are all mutually independent with zero-mean, bounded second moments, and

E(, (\Delta_n)_i, ^)

uniformly bounded. Then

b_n

→0 w.p. 1.

Sketch of the proof

The main

idea In common usage and in philosophy, ideas are the results of thought. Also in philosophy, ideas can also be mental representational images of some object. Many philosophers have considered ideas to be a fundamental ontological category of being ...

is to use conditioning on

\Delta_n

to express

E \hat_n)_i /math> and then to use a second order Taylor expansion of J(u_n+c_n\Delta_n)_i and J(u_n-c_n\Delta_n)_i . After algebraic manipulations using the zero mean and the independence of \, we get 

: E \hat_n)_i (g_n)_i + O(c_n^2) The result follows from the

hypothesis A hypothesis (plural hypotheses) is a proposed explanation for a phenomenon. For a hypothesis to be a scientific hypothesis, the scientific method requires that one can test it. Scientists generally base scientific hypotheses on previous obse ...

that

c_n

→0. Next we resume some of the hypotheses under which

u_t

converges in

probability Probability is the branch of mathematics concerning numerical descriptions of how likely an Event (probability theory), event is to occur, or how likely it is that a proposition is true. The probability of an event is a number between 0 and ...

to the set of global minima of

J(u)

. The efficiency of the method depends on the shape of

J(u)

, the values of the parameters

a_n

and

c_n

and the distribution of the perturbation terms

\Delta_

. First, the algorithm parameters must satisfy the following conditions: *

a_n

>0,

a_n

→0 when n→∝ and

\sum_^ a_n = \infty

. A good choice would be

a_n=\frac;

a>0; *

c_n=\frac

, where c>0,

\gamma \in \left frac,\frac\right /math>;
* \sum_^ (\frac )^2 < \infty * \Delta_must be mutually independent zero-mean random variables, symmetrically distributed about zero, with \Delta_ < a_1 < \infty . The inverse first and second moments of the \Delta_must be finite.
A good choice for \Delta_is the

Rademacher distribution In probability theory and statistics, the Rademacher distribution (which is named after Hans Rademacher) is a discrete probability distribution where a random variate ''X'' has a 50% chance of being +1 and a 50% chance of being -1. A series (th ...

, i.e. Bernoulli +-1 with probability 0.5. Other choices are possible too, but note that the uniform and normal distributions cannot be used because they do not satisfy the finite inverse moment conditions. The loss function ''J(u)'' must be thrice continuously

differentiable In mathematics, a differentiable function of one real variable is a function whose derivative exists at each point in its domain. In other words, the graph of a differentiable function has a non-vertical tangent line at each interior point in its ...

and the individual elements of the third derivative must be bounded:

, J^(u),  < a_3 < \infty

. Also,

, J(u), \rightarrow\infty

u\rightarrow\infty

. In addition,

\nabla J

must be Lipschitz continuous, bounded and the ODE

\dot=g(u)

must have a unique solution for each initial condition. Under these conditions and a few others,

u_k

converges in probability to the set of global minima of J(u) (see Maryak and Chin, 2008). It has been shown that differentiability is not required: continuity and convexity are sufficient for convergence.

Extension to second-order (Newton) methods

It is known that a stochastic version of the standard (deterministic) Newton-Raphson algorithm (a “second-order” method) provides an asymptotically optimal or near-optimal form of stochastic approximation. SPSA can also be used to efficiently estimate the Hessian matrix of the loss function based on either noisy loss measurements or noisy gradient measurements (stochastic gradients). As with the basic SPSA method, only a small fixed number of loss measurements or gradient measurements are needed at each iteration, regardless of the problem dimension ''p''. See the brief discussion in

Stochastic gradient descent Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of ...

References

* Bhatnagar, S., Prasad, H. L., and Prashanth, L. A. (2013), ''Stochastic Recursive Algorithms for Optimization: Simultaneous Perturbation Methods'', Springe

* Hirokami, T., Maeda, Y., Tsukada, H. (2006) "Parameter estimation using simultaneous perturbation stochastic approximation", Electrical Engineering in Japan, 154 (2), 30–

* Maryak, J.L., and Chin, D.C. (2008), "Global Random Optimization by Simultaneous Perturbation Stochastic Approximation," ''IEEE Transactions on Automatic Control'', vol. 53, pp. 780-783. * Spall, J. C. (1987), “A Stochastic Approximation Technique for Generating Maximum Likelihood Parameter Estimates,” ''Proceedings of the American Control Conference'', Minneapolis, MN, June 1987, pp. 1161–1167. * Spall, J. C. (1992), “Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation,” ''IEEE Transactions on Automatic Control'', vol. 37(3), pp. 332–341. * Spall, J.C. (1998). "Overview of the Simultaneous Perturbation Method for Efficient Optimization
2
''Johns Hopkins APL Technical Digest'', 19(4), 482–492. * Spall, J.C. (2003) ''Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control'', Wiley. {{isbn, 0-471-33052-3 (Chapter 7) Numerical climate and weather models Stochastic optimization Randomized algorithms Optimization algorithms and methods