Differential dynamic programming (DDP) is an optimal control algorithm of the trajectory optimization class. The algorithm was introduced in 1966 by Mayne and subsequently analysed in Jacobson and Mayne's eponymous book. It uses locally quadratic models of the dynamics and cost functions, and displays quadratic convergence. It is closely related to Pantoja's step-wise Newton's method.

Finite-horizon discrete-time problems

The dynamics

$$\mathbf{x}_{i+1} = \mathbf{f}(\mathbf{x}_i,\mathbf{u}_i)$$

describe the evolution of the state $\mathbf{x}$ given the control $\mathbf{u}$ from time $i$ to time $i+1$. The ''total cost'' $J_0$ is the sum of running costs $\ell$ and final cost $\ell_f$, incurred when starting from state $\mathbf{x}$ and applying the control sequence $\mathbf{U} \equiv \{\mathbf{u}_0,\mathbf{u}_1,\dots,\mathbf{u}_{N-1}\}$ until the horizon is reached:

$$J_0(\mathbf{x},\mathbf{U})=\sum_{i=0}^{N-1}\ell(\mathbf{x}_i,\mathbf{u}_i) + \ell_f(\mathbf{x}_N),$$

where $\mathbf{x}_0\equiv\mathbf{x}$, and the $\mathbf{x}_i$ for $i>0$ are given by the dynamics. The solution of the optimal control problem is the minimizing control sequence

$$\mathbf{U}^*(\mathbf{x})\equiv \operatorname{argmin}_{\mathbf{U}} J_0(\mathbf{x},\mathbf{U}).$$

''Trajectory optimization'' means finding $\mathbf{U}^*(\mathbf{x})$ for a particular $\mathbf{x}_0$, rather than for all possible initial states.
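The total cost of a given control sequence can be evaluated by rolling the controls through the dynamics. The following is a minimal Python sketch of that evaluation; the scalar integrator `f`, the quadratic costs `ell` and `ell_f`, and the helper name `rollout_cost` are assumptions made for the example and do not come from the original formulation.

```python
from typing import Callable, List, Sequence, Tuple


def rollout_cost(
    f: Callable[[float, float], float],
    ell: Callable[[float, float], float],
    ell_f: Callable[[float], float],
    x0: float,
    U: Sequence[float],
) -> Tuple[List[float], float]:
    """Return the trajectory x_0..x_N and the total cost J_0(x0, U)."""
    xs = [x0]
    total = 0.0
    for u in U:
        x = xs[-1]
        total += ell(x, u)      # running cost at step i
        xs.append(f(x, u))      # x_{i+1} = f(x_i, u_i)
    total += ell_f(xs[-1])      # final cost at the horizon
    return xs, total


# Hypothetical toy problem: scalar integrator with quadratic costs.
f = lambda x, u: x + u
ell = lambda x, u: 0.5 * (x * x + u * u)
ell_f = lambda x: 0.5 * x * x

xs, J0 = rollout_cost(f, ell, ell_f, 1.0, [-0.5, -0.25])
```

Trajectory optimization then amounts to searching over `U` for the fixed initial state `x0` so that the returned cost is as small as possible.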

Dynamic programming

Let $\mathbf{U}_i$ be the partial control sequence $\mathbf{U}_i \equiv \{\mathbf{u}_i,\mathbf{u}_{i+1},\dots,\mathbf{u}_{N-1}\}$ and define the ''cost-to-go'' $J_i$ as the partial sum of costs from $i$ to $N$:

$$J_i(\mathbf{x},\mathbf{U}_i)=\sum_{j=i}^{N-1}\ell(\mathbf{x}_j,\mathbf{u}_j) + \ell_f(\mathbf{x}_N).$$

The optimal cost-to-go or ''value function'' at time $i$ is the cost-to-go given the minimizing control sequence:

$$V(\mathbf{x},i)\equiv \min_{\mathbf{U}_i}J_i(\mathbf{x},\mathbf{U}_i).$$

Setting $V(\mathbf{x},N)\equiv \ell_f(\mathbf{x}_N)$, the dynamic programming principle reduces the minimization over an entire sequence of controls to a sequence of minimizations over a single control, proceeding backwards in time:

$$V(\mathbf{x},i)= \min_{\mathbf{u}}\left[\ell(\mathbf{x},\mathbf{u}) + V(\mathbf{f}(\mathbf{x},\mathbf{u}),i+1)\right].$$

This is the Bellman equation.
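To make the backward recursion concrete, the sketch below applies it to a small problem with finitely many states and controls, where the value function can be stored in a table, and compares the result against brute-force enumeration of all control sequences. The clipped integer integrator, the costs, and all names are assumptions chosen for illustration. DDP itself does not tabulate the value function; it replaces the table with a local quadratic model.

```python
from itertools import product

STATES = range(-3, 4)        # hypothetical integer state grid
CONTROLS = (-1, 0, 1)        # hypothetical control set
N = 3                        # horizon


def f(x, u):
    """Clipped integrator: the state stays on the grid."""
    return max(-3, min(3, x + u))


def ell(x, u):
    return x * x + u * u


def ell_f(x):
    return 2 * x * x


def value_function():
    """Tabulate V(x, i) for all i by the backward Bellman recursion."""
    V = [dict() for _ in range(N + 1)]
    for x in STATES:
        V[N][x] = ell_f(x)                       # V(x, N) = ell_f(x)
    for i in range(N - 1, -1, -1):               # proceed backwards in time
        for x in STATES:
            V[i][x] = min(ell(x, u) + V[i + 1][f(x, u)] for u in CONTROLS)
    return V


def brute_force(x0):
    """Minimize J_0 by enumerating every control sequence of length N."""
    best = None
    for U in product(CONTROLS, repeat=N):
        x, J = x0, 0
        for u in U:
            J += ell(x, u)
            x = f(x, u)
        J += ell_f(x)
        best = J if best is None else min(best, J)
    return best


V = value_function()
agree = all(V[0][x] == brute_force(x) for x in STATES)
```

The recursion performs `N` sweeps of single-control minimizations, whereas the enumeration grows exponentially with the horizon; both give the same optimal cost.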

Differential dynamic programming

DDP proceeds by iteratively performing a backward pass on the nominal trajectory to generate a new control sequence, and then a forward pass to compute and evaluate a new nominal trajectory. We begin with the backward pass. If

$$\ell(\mathbf{x},\mathbf{u}) + V(\mathbf{f}(\mathbf{x},\mathbf{u}),i+1)$$

is the argument of the $\min[\cdot]$ operator in the Bellman equation, let $Q$ be the variation of this quantity around the $i$-th $(\mathbf{x},\mathbf{u})$ pair:

$$\begin{aligned}
Q(\delta\mathbf{x},\delta\mathbf{u})\equiv {} &\ell(\mathbf{x}+\delta\mathbf{x},\mathbf{u}+\delta\mathbf{u})+V(\mathbf{f}(\mathbf{x}+\delta\mathbf{x},\mathbf{u}+\delta\mathbf{u}),i+1) \\
&-\ell(\mathbf{x},\mathbf{u})-V(\mathbf{f}(\mathbf{x},\mathbf{u}),i+1)
\end{aligned}$$

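and expand to second order:

$$Q(\delta\mathbf{x},\delta\mathbf{u}) \approx \frac{1}{2}
\begin{bmatrix} 1 \\ \delta\mathbf{x} \\ \delta\mathbf{u} \end{bmatrix}^\mathsf{T}
\begin{bmatrix}
0 & Q_\mathbf{x}^\mathsf{T} & Q_\mathbf{u}^\mathsf{T} \\
Q_\mathbf{x} & Q_{\mathbf{x}\mathbf{x}} & Q_{\mathbf{x}\mathbf{u}} \\
Q_\mathbf{u} & Q_{\mathbf{u}\mathbf{x}} & Q_{\mathbf{u}\mathbf{u}}
\end{bmatrix}
\begin{bmatrix} 1 \\ \delta\mathbf{x} \\ \delta\mathbf{u} \end{bmatrix}.$$

The $Q$ notation used here is a variant of the notation of Morimoto, where subscripts denote differentiation in denominator layout. Dropping the index $i$ for readability, and with primes denoting the next time-step, $V'\equiv V(i+1)$, the expansion coefficients are

$$\begin{aligned}
Q_\mathbf{x} &= \ell_\mathbf{x}+ \mathbf{f}_\mathbf{x}^\mathsf{T} V'_\mathbf{x} \\
Q_\mathbf{u} &= \ell_\mathbf{u}+ \mathbf{f}_\mathbf{u}^\mathsf{T} V'_\mathbf{x} \\
Q_{\mathbf{x}\mathbf{x}} &= \ell_{\mathbf{x}\mathbf{x}} + \mathbf{f}_\mathbf{x}^\mathsf{T} V'_{\mathbf{x}\mathbf{x}}\mathbf{f}_\mathbf{x}+V'_\mathbf{x}\cdot\mathbf{f}_{\mathbf{x}\mathbf{x}}\\
Q_{\mathbf{u}\mathbf{u}} &= \ell_{\mathbf{u}\mathbf{u}} + \mathbf{f}_\mathbf{u}^\mathsf{T} V'_{\mathbf{x}\mathbf{x}}\mathbf{f}_\mathbf{u}+V'_\mathbf{x}\cdot\mathbf{f}_{\mathbf{u}\mathbf{u}}\\
Q_{\mathbf{u}\mathbf{x}} &= \ell_{\mathbf{u}\mathbf{x}} + \mathbf{f}_\mathbf{u}^\mathsf{T} V'_{\mathbf{x}\mathbf{x}}\mathbf{f}_\mathbf{x} + V'_\mathbf{x} \cdot \mathbf{f}_{\mathbf{u}\mathbf{x}}.
\end{aligned}$$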

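The last terms in the last three equations denote contraction of a vector with a tensor. Minimizing the quadratic approximation with respect to $\delta\mathbf{u}$ we have

$$\delta\mathbf{u}^* = \operatorname{argmin}_{\delta\mathbf{u}} Q(\delta\mathbf{x},\delta\mathbf{u}) = -Q_{\mathbf{u}\mathbf{u}}^{-1}\left(Q_\mathbf{u}+Q_{\mathbf{u}\mathbf{x}}\,\delta\mathbf{x}\right),$$

giving an open-loop term $\mathbf{k}=-Q_{\mathbf{u}\mathbf{u}}^{-1}Q_\mathbf{u}$ and a feedback gain term $\mathbf{K}=-Q_{\mathbf{u}\mathbf{u}}^{-1}Q_{\mathbf{u}\mathbf{x}}$. Plugging the result back into the quadratic expansion of $Q$, we now have a quadratic model of the value at time $i$:

$$\begin{aligned}
\Delta V(i) &= -\tfrac{1}{2}Q^\mathsf{T}_\mathbf{u} Q_{\mathbf{u}\mathbf{u}}^{-1}Q_\mathbf{u}\\
V_\mathbf{x}(i) &= Q_\mathbf{x} - Q_{\mathbf{x}\mathbf{u}} Q_{\mathbf{u}\mathbf{u}}^{-1}Q_{\mathbf{u}}\\
V_{\mathbf{x}\mathbf{x}}(i) &= Q_{\mathbf{x}\mathbf{x}} - Q_{\mathbf{x}\mathbf{u}}Q_{\mathbf{u}\mathbf{u}}^{-1}Q_{\mathbf{u}\mathbf{x}},
\end{aligned}$$

where $Q_{\mathbf{x}\mathbf{u}} = Q_{\mathbf{u}\mathbf{x}}^\mathsf{T}$.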
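A single step of this recursion can be sketched in a few lines of Python. To keep the example dependency-free it assumes a scalar state and a scalar control, so every matrix product reduces to ordinary multiplication and the inverse of $Q_{\mathbf{u}\mathbf{u}}$ to a division. The function name `backward_step` and its argument order are assumptions made for the example.

```python
def backward_step(lx, lu, lxx, luu, lux,
                  fx, fu, fxx, fuu, fux,
                  Vx_next, Vxx_next):
    """One backward-pass step for scalar x and u.

    Inputs are derivatives of the running cost and of the dynamics at the
    nominal (x, u) pair, plus V'_x and V'_xx from the next time step.
    Returns (k, K, dV, Vx, Vxx).
    """
    # Expansion coefficients of Q
    Qx = lx + fx * Vx_next
    Qu = lu + fu * Vx_next
    Qxx = lxx + fx * Vxx_next * fx + Vx_next * fxx
    Quu = luu + fu * Vxx_next * fu + Vx_next * fuu
    Qux = lux + fu * Vxx_next * fx + Vx_next * fux

    if Quu <= 0.0:
        # The quadratic model has no minimum in delta-u without regularization.
        raise ValueError("Quu must be positive")

    k = -Qu / Quu                 # open-loop term
    K = -Qux / Quu                # feedback gain
    dV = -0.5 * Qu * Qu / Quu     # expected change in value
    Vx = Qx - Qux * Qu / Quu
    Vxx = Qxx - Qux * Qux / Quu
    return k, K, dV, Vx, Vxx


# Hypothetical check: f(x, u) = x + u, ell = (x^2 + u^2)/2, V' = x'^2 / 2,
# evaluated at the nominal pair x = 1, u = 0 (so x' = 1).
k, K, dV, Vx, Vxx = backward_step(1.0, 0.0, 1.0, 1.0, 0.0,
                                  1.0, 1.0, 0.0, 0.0, 0.0,
                                  1.0, 1.0)
```

For this linear-quadratic check the exact value function is $0.75\,x^2$ with minimizing control $u=-x/2$, which matches the returned `k = -0.5`, `Vx = 1.5` and `Vxx = 1.5` at $x=1$.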

Recursively computing the local quadratic models of $V(i)$ and the control modifications $\{\mathbf{k}(i),\mathbf{K}(i)\}$, from $i=N-1$ down to $i=0$, constitutes the backward pass. As above, the value function is initialized with $V(\mathbf{x},N)\equiv \ell_f(\mathbf{x}_N)$. Once the backward pass is completed, a forward pass computes a new trajectory:

$$\begin{aligned}
\hat{\mathbf{x}}(0)&=\mathbf{x}(0)\\
\hat{\mathbf{u}}(i)&=\mathbf{u}(i) + \mathbf{k}(i) +\mathbf{K}(i)(\hat{\mathbf{x}}(i) - \mathbf{x}(i))\\
\hat{\mathbf{x}}(i+1)&=\mathbf{f}(\hat{\mathbf{x}}(i),\hat{\mathbf{u}}(i))
\end{aligned}$$

The backward passes and forward passes are iterated until convergence.
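The alternation of backward and forward passes can be sketched as follows. This is an illustrative scalar implementation under stated assumptions, not a reference implementation: the linear dynamics `f`, the quadratic costs, the constants `A`, `B`, `N`, and all function names are hypothetical. A linear-quadratic toy problem is chosen on purpose, because there the quadratic model is exact and a single iteration already reaches the optimum, which makes the behaviour easy to check.

```python
A, B, N = 1.1, 0.5, 10       # hypothetical scalar dynamics x' = A*x + B*u


def f(x, u):
    return A * x + B * u


def f_derivs(x, u):
    """(f_x, f_u, f_xx, f_uu, f_ux); second derivatives vanish for linear f."""
    return A, B, 0.0, 0.0, 0.0


def ell(x, u):
    return 0.5 * (x * x + u * u)


def ell_derivs(x, u):
    """(l_x, l_u, l_xx, l_uu, l_ux) of the running cost."""
    return x, u, 1.0, 1.0, 0.0


def ell_f(x):
    return 0.5 * x * x


def total_cost(xs, us):
    return sum(ell(x, u) for x, u in zip(xs, us)) + ell_f(xs[-1])


def backward_pass(xs, us):
    """Return the open-loop terms k(i) and feedback gains K(i)."""
    Vx, Vxx = xs[-1], 1.0                 # derivatives of ell_f at x_N
    ks, Ks = [0.0] * N, [0.0] * N
    for i in reversed(range(N)):
        lx, lu, lxx, luu, lux = ell_derivs(xs[i], us[i])
        fx, fu, fxx, fuu, fux = f_derivs(xs[i], us[i])
        Qx = lx + fx * Vx
        Qu = lu + fu * Vx
        Qxx = lxx + fx * Vxx * fx + Vx * fxx
        Quu = luu + fu * Vxx * fu + Vx * fuu
        Qux = lux + fu * Vxx * fx + Vx * fux
        ks[i], Ks[i] = -Qu / Quu, -Qux / Quu
        Vx = Qx - Qux * Qu / Quu
        Vxx = Qxx - Qux * Qux / Quu
    return ks, Ks


def forward_pass(xs, us, ks, Ks):
    """Roll out the updated controls from the same initial state."""
    new_xs, new_us = [xs[0]], []
    for i in range(N):
        u = us[i] + ks[i] + Ks[i] * (new_xs[i] - xs[i])
        new_us.append(u)
        new_xs.append(f(new_xs[i], u))
    return new_xs, new_us


def ddp(x0, iterations):
    us = [0.0] * N                        # initial nominal controls
    xs = [x0]
    for u in us:
        xs.append(f(xs[-1], u))
    costs = [total_cost(xs, us)]
    for _ in range(iterations):
        ks, Ks = backward_pass(xs, us)
        xs, us = forward_pass(xs, us, ks, Ks)
        costs.append(total_cost(xs, us))
    return xs, us, costs


xs, us, costs = ddp(x0=1.0, iterations=3)
```

In this sketch `costs[1]` is lower than `costs[0]` and later iterations leave the cost unchanged up to rounding. For nonlinear dynamics or non-quadratic costs, only the derivative functions would change, but the plain full step used here is not guaranteed to decrease the cost.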

Regularization and line-search

Differential dynamic programming is a second-order algorithm like Newton's method. It therefore takes large steps toward the minimum and often requires regularization and/or line-search to achieve convergence. Regularization in the DDP context means ensuring that the $Q_{\mathbf{u}\mathbf{u}}$ matrix that is inverted in the minimization over $\delta\mathbf{u}$ is positive definite. Line-search in DDP amounts to scaling the open-loop control modification $\mathbf{k}$ by some $0<\alpha<1$.
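Both safeguards can be sketched for the scalar case. The regularization shown here adds a growing constant `mu` to $Q_{\mathbf{u}\mathbf{u}}$ (for matrices this would correspond to adding a multiple of the identity), and the line search tries the unscaled step first and then a fixed list of decreasing $\alpha$ values. Both schemes, the parameter defaults, and the function names are assumptions for illustration; other regularization and step-size rules are possible.

```python
def regularized_gains(Qu, Quu, Qux, mu_init=1e-3, factor=10.0, eps=1e-8):
    """Scalar sketch: grow mu until Quu + mu is positive, then form k and K."""
    mu = 0.0
    while Quu + mu < eps:
        mu = mu_init if mu == 0.0 else mu * factor
    Quu_reg = Quu + mu
    return -Qu / Quu_reg, -Qux / Quu_reg, mu


def line_search_forward(f, cost, xs, us, ks, Ks,
                        alphas=(1.0, 0.5, 0.25, 0.125, 0.0625)):
    """Forward pass that scales only the open-loop term k by alpha.

    Returns (new_xs, new_us, new_cost, alpha) for the first alpha that
    lowers the cost, or the nominal trajectory with alpha = 0.0 otherwise.
    """
    old_cost = cost(xs, us)
    for alpha in alphas:
        new_xs, new_us = [xs[0]], []
        for i in range(len(us)):
            u = us[i] + alpha * ks[i] + Ks[i] * (new_xs[i] - xs[i])
            new_us.append(u)
            new_xs.append(f(new_xs[i], u))
        new_cost = cost(new_xs, new_us)
        if new_cost < old_cost:
            return new_xs, new_us, new_cost, alpha
    return xs, us, old_cost, 0.0


# Hypothetical one-step example with a deliberately overlong step k = 4:
# the full step overshoots the target x = 1, so the search backs off.
f = lambda x, u: x + u
cost = lambda xs, us: (xs[-1] - 1.0) ** 2
new_xs, new_us, new_cost, alpha = line_search_forward(
    f, cost, xs=[0.0, 0.0], us=[0.0], ks=[4.0], Ks=[0.0])

# A negative Quu is shifted until the regularized curvature is positive.
k_reg, K_reg, mu = regularized_gains(Qu=1.0, Quu=-0.5, Qux=0.0)
```

In the toy example the steps with $\alpha=1$ and $\alpha=0.5$ do not reduce the cost, and the search accepts $\alpha=0.25$.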

Monte Carlo version

Sampled differential dynamic programming (SaDDP) is a Monte Carlo variant of differential dynamic programming. It is based on treating the quadratic cost of differential dynamic programming as the energy of a Boltzmann distribution. This way the quantities of DDP can be matched to the statistics of a multidimensional normal distribution. The statistics can be recomputed from sampled trajectories without differentiation. Sampled differential dynamic programming has been extended to Path Integral Policy Improvement with Differential Dynamic Programming. This creates a link between differential dynamic programming and path integral control, which is a framework of stochastic optimal control.

Constrained problems

Interior Point Differential dynamic programming (IPDDP) is an interior-point method generalization of DDP that can address the optimal control problem with nonlinear state and input constraints.


External links

A Python implementation of DDP

A MATLAB implementation of DDP