Differential dynamic programming (DDP) is an
optimal control algorithm of the
trajectory optimization
class. The algorithm was introduced in 1966 by
Mayne and subsequently analysed in Jacobson and Mayne's eponymous book. The algorithm uses locally-quadratic models of the dynamics and cost functions, and displays
quadratic convergence. It is closely related to Pantoja's step-wise Newton's method.
Finite-horizon discrete-time problems
The dynamics
: <math>\mathbf{x}_{i+1} = \mathbf{f}(\mathbf{x}_i,\mathbf{u}_i)</math> (1)
describe the evolution of the state <math>\mathbf{x}</math> given the control <math>\mathbf{u}</math> from time <math>i</math> to time <math>i+1</math>. The ''total cost'' <math>J_0</math> is the sum of running costs <math>\ell</math> and final cost <math>\ell_f</math>, incurred when starting from state <math>\mathbf{x}</math> and applying the control sequence <math>\mathbf{U}\equiv\{\mathbf{u}_0,\mathbf{u}_1,\dots,\mathbf{u}_{N-1}\}</math> until the horizon is reached:
: <math>J_0(\mathbf{x},\mathbf{U}) = \sum_{i=0}^{N-1} \ell(\mathbf{x}_i,\mathbf{u}_i) + \ell_f(\mathbf{x}_N),</math>
where <math>\mathbf{x}_0\equiv\mathbf{x}</math>, and the <math>\mathbf{x}_i</math> for <math>i>0</math> are given by (1). The solution of the optimal control problem is the minimizing control sequence
: <math>\mathbf{U}^*(\mathbf{x})\equiv\operatorname{argmin}_{\mathbf{U}} J_0(\mathbf{x},\mathbf{U}).</math>
''Trajectory optimization'' means finding <math>\mathbf{U}^*(\mathbf{x})</math> for a particular <math>\mathbf{x}_0</math>, rather than for all possible initial states.
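For concreteness, the total cost of a candidate control sequence can be evaluated by rolling the dynamics forward and accumulating costs. The following is a minimal Python sketch, assuming generic user-supplied callables <code>f</code>, <code>running_cost</code> and <code>final_cost</code> (all names are illustrative, not part of any particular DDP implementation):
<syntaxhighlight lang="python">
def total_cost(x0, U, f, running_cost, final_cost):
    """Evaluate J_0(x, U): roll the dynamics forward from x0 under the
    control sequence U, accumulating running costs plus the final cost."""
    x, J = x0, 0.0
    for u in U:                    # u_0, ..., u_{N-1}
        J += running_cost(x, u)    # accumulate l(x_i, u_i)
        x = f(x, u)                # x_{i+1} = f(x_i, u_i)
    return J + final_cost(x)       # add l_f(x_N)
</syntaxhighlight>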
Dynamic programming
Let <math>\mathbf{U}_i</math> be the partial control sequence <math>\mathbf{U}_i\equiv\{\mathbf{u}_i,\mathbf{u}_{i+1},\dots,\mathbf{u}_{N-1}\}</math> and define the ''cost-to-go'' <math>J_i</math> as the partial sum of costs from <math>i</math> to <math>N</math>:
: <math>J_i(\mathbf{x},\mathbf{U}_i) = \sum_{j=i}^{N-1} \ell(\mathbf{x}_j,\mathbf{u}_j) + \ell_f(\mathbf{x}_N).</math>
The optimal cost-to-go or ''value function'' at time <math>i</math> is the cost-to-go given the minimizing control sequence:
: <math>V(\mathbf{x},i)\equiv \min_{\mathbf{U}_i} J_i(\mathbf{x},\mathbf{U}_i).</math>
Setting <math>V(\mathbf{x},N)\equiv \ell_f(\mathbf{x}_N)</math>, the dynamic programming principle reduces the minimization over an entire sequence of controls to a sequence of minimizations over a single control, proceeding backwards in time:
: <math>V(\mathbf{x},i) = \min_{\mathbf{u}}\left[\ell(\mathbf{x},\mathbf{u}) + V\bigl(\mathbf{f}(\mathbf{x},\mathbf{u}),i+1\bigr)\right].</math> (2)
This is the Bellman equation.
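As an illustration of the backward recursion itself (exact dynamic programming on a finite grid, not yet DDP), the following sketch assumes hypothetical arrays <code>cost</code>, <code>next_state</code> and <code>final_cost</code> that tabulate a problem over a small finite set of states and controls:
<syntaxhighlight lang="python">
import numpy as np

def backward_dp(cost, next_state, final_cost, N):
    """Exact dynamic programming on a finite state/control grid:
    V[i, x] = min_u ( cost[i, x, u] + V[i+1, next_state[x, u]] )."""
    n_states, n_controls = next_state.shape
    V = np.empty((N + 1, n_states))
    policy = np.empty((N, n_states), dtype=int)
    V[N] = final_cost                       # V(x, N) = l_f(x)
    for i in range(N - 1, -1, -1):          # proceed backwards in time
        Q = cost[i] + V[i + 1][next_state]  # Q[x, u] for every state/control pair
        policy[i] = np.argmin(Q, axis=1)    # minimizing control at each state
        V[i] = np.min(Q, axis=1)            # V(x, i)
    return V, policy
</syntaxhighlight>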
Differential dynamic programming
DDP proceeds by iteratively performing a backward pass on the nominal trajectory to generate a new control sequence, and then a forward pass to compute and evaluate a new nominal trajectory. We begin with the backward pass. If
: <math>\ell(\mathbf{x},\mathbf{u}) + V\bigl(\mathbf{f}(\mathbf{x},\mathbf{u}),i+1\bigr)</math>
is the argument of the <math>\min[\cdot]</math> operator in (2), let <math>Q</math> be the variation of this quantity around the <math>i</math>-th <math>(\mathbf{x},\mathbf{u})</math> pair:
: <math>Q(\delta\mathbf{x},\delta\mathbf{u}) \equiv \ell(\mathbf{x}+\delta\mathbf{x},\mathbf{u}+\delta\mathbf{u}) + V\bigl(\mathbf{f}(\mathbf{x}+\delta\mathbf{x},\mathbf{u}+\delta\mathbf{u}),i+1\bigr) - \ell(\mathbf{x},\mathbf{u}) - V\bigl(\mathbf{f}(\mathbf{x},\mathbf{u}),i+1\bigr)</math>
and expand to second order
: <math>\approx \frac{1}{2}\begin{bmatrix}1\\ \delta\mathbf{x}\\ \delta\mathbf{u}\end{bmatrix}^{\mathsf T}\begin{bmatrix}0 & Q_\mathbf{x}^{\mathsf T} & Q_\mathbf{u}^{\mathsf T}\\ Q_\mathbf{x} & Q_{\mathbf{x}\mathbf{x}} & Q_{\mathbf{x}\mathbf{u}}\\ Q_\mathbf{u} & Q_{\mathbf{u}\mathbf{x}} & Q_{\mathbf{u}\mathbf{u}}\end{bmatrix}\begin{bmatrix}1\\ \delta\mathbf{x}\\ \delta\mathbf{u}\end{bmatrix}</math> (3)
The <math>Q</math> notation used here is a variant of the notation of Morimoto where subscripts denote differentiation in denominator layout.

Dropping the index <math>i</math> for readability, primes denoting the next time-step <math>V'\equiv V(i+1)</math>, the expansion coefficients are
: <math>Q_\mathbf{x} = \ell_\mathbf{x} + \mathbf{f}_\mathbf{x}^{\mathsf T} V'_\mathbf{x}</math>
: <math>Q_\mathbf{u} = \ell_\mathbf{u} + \mathbf{f}_\mathbf{u}^{\mathsf T} V'_\mathbf{x}</math>
: <math>Q_{\mathbf{x}\mathbf{x}} = \ell_{\mathbf{x}\mathbf{x}} + \mathbf{f}_\mathbf{x}^{\mathsf T} V'_{\mathbf{x}\mathbf{x}}\mathbf{f}_\mathbf{x} + V'_\mathbf{x}\cdot\mathbf{f}_{\mathbf{x}\mathbf{x}}</math>
: <math>Q_{\mathbf{u}\mathbf{u}} = \ell_{\mathbf{u}\mathbf{u}} + \mathbf{f}_\mathbf{u}^{\mathsf T} V'_{\mathbf{x}\mathbf{x}}\mathbf{f}_\mathbf{u} + V'_\mathbf{x}\cdot\mathbf{f}_{\mathbf{u}\mathbf{u}}</math>
: <math>Q_{\mathbf{u}\mathbf{x}} = \ell_{\mathbf{u}\mathbf{x}} + \mathbf{f}_\mathbf{u}^{\mathsf T} V'_{\mathbf{x}\mathbf{x}}\mathbf{f}_\mathbf{x} + V'_\mathbf{x}\cdot\mathbf{f}_{\mathbf{u}\mathbf{x}}</math>
The last terms in the last three equations denote contraction of a vector with a tensor. Minimizing the quadratic approximation (3) with respect to <math>\delta\mathbf{u}</math> we have
: <math>\delta\mathbf{u}^* = \operatorname{argmin}_{\delta\mathbf{u}} Q(\delta\mathbf{x},\delta\mathbf{u}) = -Q_{\mathbf{u}\mathbf{u}}^{-1}\bigl(Q_\mathbf{u} + Q_{\mathbf{u}\mathbf{x}}\,\delta\mathbf{x}\bigr),</math> (4)
giving an open-loop term <math>\mathbf{k} = -Q_{\mathbf{u}\mathbf{u}}^{-1}Q_\mathbf{u}</math> and a feedback gain term <math>\mathbf{K} = -Q_{\mathbf{u}\mathbf{u}}^{-1}Q_{\mathbf{u}\mathbf{x}}</math>. Plugging the result back into (3), we now have a quadratic model of the value at time <math>i</math>:
: <math>\Delta V(i) = -\tfrac{1}{2}Q_\mathbf{u}^{\mathsf T} Q_{\mathbf{u}\mathbf{u}}^{-1}Q_\mathbf{u}</math>
: <math>V_\mathbf{x}(i) = Q_\mathbf{x} - Q_{\mathbf{x}\mathbf{u}}Q_{\mathbf{u}\mathbf{u}}^{-1}Q_\mathbf{u}</math>
: <math>V_{\mathbf{x}\mathbf{x}}(i) = Q_{\mathbf{x}\mathbf{x}} - Q_{\mathbf{x}\mathbf{u}}Q_{\mathbf{u}\mathbf{u}}^{-1}Q_{\mathbf{u}\mathbf{x}}.</math>
Recursively computing the local quadratic models of <math>V(i)</math> and the control modifications <math>\{\mathbf{k}(i),\mathbf{K}(i)\}</math>, from <math>i=N-1</math> down to <math>i=1</math>, constitutes the backward pass. As above, the value is initialized with <math>V(\mathbf{x},N)\equiv\ell_f(\mathbf{x}_N)</math>.
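The backward pass translates fairly directly into code. The sketch below assumes the user supplies the first and second derivatives of the running cost and dynamics along the nominal trajectory (<code>lx</code>, <code>lu</code>, <code>lxx</code>, <code>luu</code>, <code>lux</code>, <code>fx</code>, <code>fu</code>, in denominator layout); the second-order dynamics terms <math>\mathbf{f}_{\mathbf{x}\mathbf{x}},\mathbf{f}_{\mathbf{u}\mathbf{u}},\mathbf{f}_{\mathbf{u}\mathbf{x}}</math> are dropped, which corresponds to the common iLQR simplification rather than the full second-order expansion. All names are illustrative:
<syntaxhighlight lang="python">
import numpy as np

def backward_pass(lx, lu, lxx, luu, lux, fx, fu, Vx_N, Vxx_N):
    """One DDP/iLQR-style backward pass (second-order dynamics terms omitted).
    Each argument is a list over time steps i = 0, ..., N-1 of the derivatives
    of the running cost l and dynamics f along the nominal trajectory;
    Vx_N, Vxx_N are the derivatives of the final cost."""
    N = len(lx)
    k, K = [None] * N, [None] * N
    Vx, Vxx = Vx_N, Vxx_N                     # V(x, N) = l_f(x_N)
    for i in reversed(range(N)):              # proceed backwards in time
        Qx  = lx[i]  + fx[i].T @ Vx
        Qu  = lu[i]  + fu[i].T @ Vx
        Qxx = lxx[i] + fx[i].T @ Vxx @ fx[i]
        Quu = luu[i] + fu[i].T @ Vxx @ fu[i]
        Qux = lux[i] + fu[i].T @ Vxx @ fx[i]
        Quu_inv = np.linalg.inv(Quu)          # assumes Quu is positive definite
        k[i] = -Quu_inv @ Qu                  # open-loop term
        K[i] = -Quu_inv @ Qux                 # feedback gain term
        Vx  = Qx  - Qux.T @ Quu_inv @ Qu      # quadratic model of V at time i
        Vxx = Qxx - Qux.T @ Quu_inv @ Qux
    return k, K
</syntaxhighlight>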
Once the backward pass is completed, a forward pass computes a new trajectory:
: <math>\hat{\mathbf{x}}(1) = \mathbf{x}(1)</math>
: <math>\hat{\mathbf{u}}(i) = \mathbf{u}(i) + \mathbf{k}(i) + \mathbf{K}(i)\bigl(\hat{\mathbf{x}}(i) - \mathbf{x}(i)\bigr)</math>
: <math>\hat{\mathbf{x}}(i+1) = \mathbf{f}\bigl(\hat{\mathbf{x}}(i),\hat{\mathbf{u}}(i)\bigr)</math>
The backward passes and forward passes are iterated until convergence.
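A corresponding sketch of the forward pass and the outer loop, with the same illustrative conventions (<code>f</code> is the dynamics, <code>x_nom</code>/<code>u_nom</code> the nominal trajectory, <code>k</code>/<code>K</code> the terms produced by a backward pass such as the one above):
<syntaxhighlight lang="python">
def forward_pass(f, x_nom, u_nom, k, K):
    """Roll out a new trajectory using the open-loop terms k and the
    feedback gains K computed in the backward pass."""
    N = len(u_nom)
    x_new, u_new = [x_nom[0]], []                 # new trajectory starts at x(1)
    for i in range(N):
        u = u_nom[i] + k[i] + K[i] @ (x_new[i] - x_nom[i])
        u_new.append(u)
        x_new.append(f(x_new[i], u))              # propagate the dynamics
    return x_new, u_new

# Backward and forward passes are alternated until the trajectory converges:
# for _ in range(max_iters):
#     k, K = backward_pass(...)                   # derivatives along (x_nom, u_nom)
#     x_nom, u_nom = forward_pass(f, x_nom, u_nom, k, K)
</syntaxhighlight>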
Regularization and line-search
Differential dynamic programming is a second-order algorithm like Newton's method. It therefore takes large steps toward the minimum and often requires regularization and/or line-search to achieve convergence. Regularization in the DDP context means ensuring that the <math>Q_{\mathbf{u}\mathbf{u}}</math> matrix in (4) is positive definite. Line-search in DDP amounts to scaling the open-loop control modification <math>\mathbf{k}</math> by some <math>0 < \alpha \le 1</math>.
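One common way to implement both ideas is a Levenberg–Marquardt-style shift of <math>Q_{\mathbf{u}\mathbf{u}}</math> together with a backtracking scale on the open-loop term. The sketch below uses the same illustrative conventions as above and shows only one of several regularization schemes found in the literature:
<syntaxhighlight lang="python">
import numpy as np

def regularized_gains(Qu, Quu, Qux, mu=1e-6):
    """Compute k and K from a regularized Quu. Adding mu * I shifts the
    eigenvalues of Quu so the matrix used in (4) is positive definite."""
    Quu_reg = Quu + mu * np.eye(Quu.shape[0])
    k = -np.linalg.solve(Quu_reg, Qu)     # open-loop term
    K = -np.linalg.solve(Quu_reg, Qux)    # feedback gain term
    return k, K

# Line-search: scale only the open-loop term by some 0 < alpha <= 1 and redo
# the forward pass, shrinking alpha until the total cost actually decreases:
#   for alpha in (1.0, 0.5, 0.25, 0.125):
#       x_try, u_try = forward_pass with u = u_nom + alpha * k + K @ (x_try - x_nom)
#       if total_cost of (x_try, u_try) < current best: break
</syntaxhighlight>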
Monte Carlo version
Sampled differential dynamic programming (SaDDP) is a Monte Carlo variant of differential dynamic programming. It is based on treating the quadratic cost of differential dynamic programming as the energy of a
Boltzmann distribution. This way the quantities of DDP can be matched to the statistics of a
multidimensional normal distribution. The statistics can be recomputed from sampled trajectories without differentiation.
Sampled differential dynamic programming has been extended to Path Integral Policy Improvement with Differential Dynamic Programming. This creates a link between differential dynamic programming and path integral control, which is a framework of stochastic optimal control.
Constrained problems
Interior Point Differential dynamic programming (IPDDP) is an
interior-point method generalization of DDP that can address the optimal control problem with nonlinear state and input constraints.
See also
* Optimal control
References
{{Reflist}}
External links
* A Python implementation of DDP
* A MATLAB implementation of DDP