The Frank–Wolfe algorithm is an iterative first-order optimization algorithm for constrained convex optimization. Also known as the conditional gradient method, the reduced gradient algorithm and the convex combination algorithm, the method was originally proposed by Marguerite Frank and Philip Wolfe in 1956. In each iteration, the Frank–Wolfe algorithm considers a linear approximation of the objective function, and moves towards a minimizer of this linear function (taken over the same domain).
Problem statement
Suppose $\mathcal{D}$ is a compact convex set in a vector space and $f \colon \mathcal{D} \to \mathbb{R}$ is a convex, differentiable real-valued function. The Frank–Wolfe algorithm solves the optimization problem

:Minimize $f(\mathbf{x})$
:subject to $\mathbf{x} \in \mathcal{D}$.
Algorithm

:''Initialization:'' Let $k \leftarrow 0$, and let $\mathbf{x}_0$ be any point in $\mathcal{D}$.
:Step 1. ''Direction-finding subproblem:'' Find $\mathbf{s}_k$ solving
::Minimize $\mathbf{s}^T \nabla f(\mathbf{x}_k)$
::Subject to $\mathbf{s} \in \mathcal{D}$
:''(Interpretation: Minimize the linear approximation of the problem given by the first-order Taylor approximation of $f$ around $\mathbf{x}_k$, constrained to stay within $\mathcal{D}$.)''
:Step 2. ''Step size determination:'' Set $\alpha \leftarrow \frac{2}{k+2}$, or alternatively find $\alpha$ that minimizes $f(\mathbf{x}_k + \alpha(\mathbf{s}_k - \mathbf{x}_k))$ subject to $0 \le \alpha \le 1$.
:Step 3. ''Update:'' Let $\mathbf{x}_{k+1} \leftarrow \mathbf{x}_k + \alpha(\mathbf{s}_k - \mathbf{x}_k)$, let $k \leftarrow k + 1$ and go to Step 1.
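The steps above translate almost line for line into code. The following is a minimal sketch, not part of the original article: it assumes the feasible set is the probability simplex, for which the direction-finding subproblem has the closed-form solution $\mathbf{s}_k = \mathbf{e}_i$ with $i = \arg\min_i [\nabla f(\mathbf{x}_k)]_i$, uses the default step size $\alpha = \frac{2}{k+2}$, and all function names are illustrative.

```python
# Sketch of the Frank-Wolfe loop above (illustrative, not from the article).
import numpy as np

def frank_wolfe(grad, lmo, x0, num_iters=100):
    """Generic Frank-Wolfe loop with the default step size 2/(k+2).

    grad(x) -- gradient of the objective at x
    lmo(g)  -- linear minimization oracle: argmin_{s in D} <s, g>
    x0      -- starting point in D
    """
    x = x0
    for k in range(num_iters):
        g = grad(x)              # Step 1: gradient at x_k
        s = lmo(g)               # Step 1: direction-finding subproblem
        alpha = 2.0 / (k + 2.0)  # Step 2: default step size
        x = x + alpha * (s - x)  # Step 3: convex combination, stays in D
    return x

# Example problem: least squares over the probability simplex.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

def grad(x):
    return A.T @ (A @ x - b)

def simplex_lmo(g):
    # argmin over the simplex of <s, g> is the vertex at the smallest
    # coordinate of g.
    s = np.zeros_like(g)
    s[np.argmin(g)] = 1.0
    return s

x0 = np.full(5, 1.0 / 5)  # any point in the simplex
x_approx = frank_wolfe(grad, simplex_lmo, x0, num_iters=500)
```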
Properties
While competing methods such as gradient descent for constrained optimization require a projection step back to the feasible set in each iteration, the Frank–Wolfe algorithm only needs the solution of a linear problem over the same set in each iteration, and automatically stays in the feasible set.
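To make this contrast concrete, the following sketch (illustrative, not from the article; the helper names and the sort-and-threshold projection routine are assumptions of this sketch) compares the two per-iteration subproblems on the probability simplex: projected gradient descent must compute a Euclidean projection back onto the simplex, while Frank–Wolfe only has to pick the cheapest vertex.

```python
# Contrast of per-iteration subproblems on the probability simplex
# (illustrative sketch, helper names not from the article).
import numpy as np

def project_simplex(v):
    # Euclidean projection onto {x : x >= 0, sum(x) = 1} via the standard
    # sort-and-threshold method; this is the extra work projected gradient
    # descent must do every iteration.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def projected_gradient_step(x, g, eta=0.1):
    return project_simplex(x - eta * g)      # gradient step, then project back

def frank_wolfe_step(x, g, k):
    s = np.zeros_like(g)
    s[np.argmin(g)] = 1.0                    # linear subproblem: cheapest vertex
    return x + (2.0 / (k + 2.0)) * (s - x)   # convex combination stays feasible
```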
The convergence of the Frank–Wolfe algorithm is sublinear in general: the error in the objective function to the optimum is $O(1/k)$ after $k$ iterations, so long as the gradient is Lipschitz continuous with respect to some norm. The same convergence rate can also be shown if the sub-problems are only solved approximately.
The iterates of the algorithm can always be represented as a sparse convex combination of the extreme points of the feasible set, which has contributed to the popularity of the algorithm for sparse greedy optimization in machine learning and signal processing problems, as well as, for example, the optimization of minimum-cost flows in transportation networks.
If the feasible set is given by a set of linear constraints, then the subproblem to be solved in each iteration becomes a linear program.
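For some polytopes this linear program need not be handed to a generic LP solver. For example, over the $\ell_1$-ball $\{\mathbf{x} : \|\mathbf{x}\|_1 \le \tau\}$ the minimizer of a linear function is always attained at one of the $2n$ vertices $\pm\tau\mathbf{e}_i$, giving the closed form sketched below (an illustration, not from the article):

```python
# Closed-form linear minimization oracle over the l1-ball, a common special
# case of the per-iteration linear program (illustrative sketch).
import numpy as np

def l1_ball_lmo(g, tau=1.0):
    # argmin_{||s||_1 <= tau} <s, g>: put all mass tau on the coordinate of g
    # with the largest magnitude, with opposite sign.
    i = np.argmax(np.abs(g))
    s = np.zeros_like(g)
    s[i] = -tau * np.sign(g[i])
    return s
```

If $\mathbf{x}_0$ is itself a vertex, each iterate is a convex combination of at most $k+1$ such vertices after $k$ iterations, which is the sparsity property mentioned above.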
While the worst-case convergence rate of $O(1/k)$ cannot be improved in general, faster convergence can be obtained for special problem classes, such as some strongly convex problems.
Lower bounds on the solution value, and primal-dual analysis
Since $f$ is convex, for any two points $\mathbf{x}, \mathbf{y} \in \mathcal{D}$ we have:

: $f(\mathbf{y}) \geq f(\mathbf{x}) + (\mathbf{y} - \mathbf{x})^T \nabla f(\mathbf{x})$

This also holds for the (unknown) optimal solution $\mathbf{x}^*$. That is, $f(\mathbf{x}^*) \geq f(\mathbf{x}) + (\mathbf{x}^* - \mathbf{x})^T \nabla f(\mathbf{x})$. The best lower bound with respect to a given point $\mathbf{x}$ is given by

: $f(\mathbf{x}^*) \geq f(\mathbf{x}) + (\mathbf{x}^* - \mathbf{x})^T \nabla f(\mathbf{x}) \geq \min_{\mathbf{y} \in \mathcal{D}} \left\{ f(\mathbf{x}) + (\mathbf{y} - \mathbf{x})^T \nabla f(\mathbf{x}) \right\} = f(\mathbf{x}) - \mathbf{x}^T \nabla f(\mathbf{x}) + \min_{\mathbf{y} \in \mathcal{D}} \mathbf{y}^T \nabla f(\mathbf{x})$
The latter optimization problem is solved in every iteration of the Frank–Wolfe algorithm, therefore the solution $\mathbf{s}_k$ of the direction-finding subproblem of the $k$-th iteration can be used to determine increasing lower bounds $l_k$ during each iteration by setting $l_0 = -\infty$ and

: $l_k := \max(l_{k-1},\ f(\mathbf{x}_k) + (\mathbf{s}_k - \mathbf{x}_k)^T \nabla f(\mathbf{x}_k))$

Such lower bounds on the unknown optimal value are important in practice because they can be used as a stopping criterion, and give an efficient certificate of the approximation quality in every iteration, since always $l_k \leq f(\mathbf{x}^*) \leq f(\mathbf{x}_k)$.
It has been shown that this corresponding duality gap, that is, the difference between $f(\mathbf{x}_k)$ and the lower bound $l_k$, decreases with the same convergence rate, i.e. $f(\mathbf{x}_k) - l_k = O(1/k)$.
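In code, this certificate is cheap: the gap $(\mathbf{x}_k - \mathbf{s}_k)^T \nabla f(\mathbf{x}_k)$ is built from quantities the iteration already computes. A minimal sketch, under the same assumptions as the earlier examples:

```python
# Frank-Wolfe with the duality gap as stopping criterion (sketch; grad and
# lmo play the same roles as in the earlier examples).
import numpy as np

def frank_wolfe_with_gap(grad, lmo, x0, eps=1e-6, max_iters=10000):
    x = x0
    gap = np.inf
    for k in range(max_iters):
        g = grad(x)
        s = lmo(g)
        gap = float((x - s) @ g)  # upper bound on f(x_k) - f(x*)
        if gap <= eps:            # certified eps-approximate solution
            break
        x = x + (2.0 / (k + 2.0)) * (s - x)
    return x, gap
```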
External links
* https://conditional-gradients.org/: a survey of Frank–Wolfe algorithms.
* Marguerite Frank giving a personal account of the history of the algorithm
See also
* Proximal gradient methods