Limited-memory BFGS
   HOME

TheInfoList



OR:

Limited-memory BFGS (L-BFGS or LM-BFGS) is an
optimization Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criterion, from some set of available alternatives. It is generally divided into two subfi ...
algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing ...
in the family of
quasi-Newton method Quasi-Newton methods are methods used to either find zeroes or local maxima and minima of functions, as an alternative to Newton's method. They can be used if the Jacobian or Hessian is unavailable or is too expensive to compute at every iteration. ...
s that approximates the Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS) using a limited amount of
computer memory In computing, memory is a device or system that is used to store information for immediate use in a computer or related computer hardware and digital electronic devices. The term ''memory'' is often synonymous with the term '' primary storag ...
. It is a popular algorithm for parameter estimation in
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
. The algorithm's target problem is to minimize f(\mathbf) over unconstrained values of the real-vector \mathbf where f is a differentiable scalar function. Like the original BFGS, L-BFGS uses an estimate of the inverse
Hessian matrix In mathematics, the Hessian matrix or Hessian is a square matrix of second-order partial derivatives of a scalar-valued function, or scalar field. It describes the local curvature of a function of many variables. The Hessian matrix was developed ...
to steer its search through variable space, but where BFGS stores a dense n\times n approximation to the inverse Hessian (''n'' being the number of variables in the problem), L-BFGS stores only a few vectors that represent the approximation implicitly. Due to its resulting linear memory requirement, the L-BFGS method is particularly well suited for optimization problems with many variables. Instead of the inverse Hessian H''k'', L-BFGS maintains a history of the past ''m'' updates of the position x and gradient ∇''f''(x), where generally the history size ''m'' can be small (often m<10). These updates are used to implicitly do operations requiring the H''k''-vector product.


Algorithm

The algorithm starts with an initial estimate of the optimal value, \mathbf_0, and proceeds iteratively to refine that estimate with a sequence of better estimates \mathbf_1,\mathbf_2,\ldots. The derivatives of the function g_k:=\nabla f(\mathbf_k) are used as a key driver of the algorithm to identify the direction of steepest descent, and also to form an estimate of the Hessian matrix (second derivative) of f(\mathbf). L-BFGS shares many features with other quasi-Newton algorithms, but is very different in how the matrix-vector multiplication d_k=-H_k g_k is carried out, where d_k is the approximate Newton's direction, g_k is the current gradient, and H_k is the inverse of the Hessian matrix. There are multiple published approaches using a history of updates to form this direction vector. Here, we give a common approach, the so-called "two loop recursion." We take as given x_k, the position at the -th iteration, and g_k\equiv\nabla f(x_k) where f is the function being minimized, and all vectors are column vectors. We also assume that we have stored the last updates of the form :s_k = x_ - x_k :y_k = g_ - g_k. We define \rho_k = \frac , and H^0_k will be the 'initial' approximate of the inverse Hessian that our estimate at iteration begins with. The algorithm is based on the BFGS recursion for the inverse Hessian as :H_ = (I-\rho_k s_k y_k^\top)H_k(I-\rho_k y_k s_k^\top) + \rho_k s_k s_k^\top. For a fixed we define a sequence of vectors q_,\ldots,q_k as q_k:=g_k and q_i:=(I-\rho_i y_i s_i^\top)q_. Then a recursive algorithm for calculating q_i from q_ is to define \alpha_i := \rho_i s_i^\top q_ and q_i=q_-\alpha_i y_i. We also define another sequence of vectors z_,\ldots,z_k as z_i:=H_iq_i. There is another recursive algorithm for calculating these vectors which is to define z_=H_k^0 q_ and then recursively define \beta_i:=\rho_i y_i^\top z_i and z_=z_i + (\alpha_i - \beta_i)s_i. The value of z_k is then our ascent direction. Thus we can compute the descent direction as follows: : \begin q = g_k\\ \mathtt\ i=k-1, k-2, \ldots, k-m\\ \qquad \alpha_i = \rho_i s^\top_i q\\ \qquad q = q - \alpha_i y_i\\ \gamma_k = \frac \\ H^0_k= \gamma_k I\\ z = H^0_k q\\ \mathtt\ i=k-m, k-m+1, \ldots, k-1\\ \qquad \beta_i = \rho_i y^\top_i z\\ \qquad z = z + s_i (\alpha_i - \beta_i)\\ z = -z \end This formulation gives the search direction for the minimization problem, i.e., z = - H_k g_k. For maximization problems, one should thus take instead. Note that the initial approximate inverse Hessian H^0_k is chosen as a diagonal matrix or even a multiple of the identity matrix since this is numerically efficient. The scaling of the initial matrix \gamma_k ensures that the search direction is well scaled and therefore the unit step length is accepted in most iterations. A Wolfe line search is used to ensure that the curvature condition is satisfied and the BFGS updating is stable. Note that some software implementations use an Armijo backtracking line search, but cannot guarantee that the curvature condition y_k^ s_k > 0 will be satisfied by the chosen step since a step length greater than 1 may be needed to satisfy this condition. Some implementations address this by skipping the BFGS update when y_k^ s_k is negative or too close to zero, but this approach is not generally recommended since the updates may be skipped too often to allow the Hessian approximation H_k to capture important curvature information. This two loop update only works for the inverse Hessian. Approaches to implementing L-BFGS using the direct approximate Hessian B_k have also been developed, as have other means of approximating the inverse Hessian.


Applications

L-BFGS has been called "the algorithm of choice" for fitting log-linear (MaxEnt) models and
conditional random field Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction. Whereas a classifier predicts a label for a single sample without consid ...
s with \ell_2-regularization.


Variants

Since BFGS (and hence L-BFGS) is designed to minimize
smooth Smooth may refer to: Mathematics * Smooth function, a function that is infinitely differentiable; used in calculus and topology * Smooth manifold, a differentiable manifold for which all the transition maps are smooth functions * Smooth algebrai ...
functions without constraints, the L-BFGS algorithm must be modified to handle functions that include non-
differentiable In mathematics, a differentiable function of one real variable is a function whose derivative exists at each point in its domain. In other words, the graph of a differentiable function has a non-vertical tangent line at each interior point in its ...
components or constraints. A popular class of modifications are called active-set methods, based on the concept of the active set. The idea is that when restricted to a small neighborhood of the current iterate, the function and constraints can be simplified.


L-BFGS-B

The L-BFGS-B algorithm extends L-BFGS to handle simple box constraints (aka bound constraints) on variables; that is, constraints of the form where and are per-variable constant lower and upper bounds, respectively (for each , either or both bounds may be omitted). The method works by identifying fixed and free variables at every step (using a simple gradient method), and then using the L-BFGS method on the free variables only to get higher accuracy, and then repeating the process.


OWL-QN

Orthant-wise limited-memory quasi-Newton (OWL-QN) is an L-BFGS variant for fitting \ell_1- regularized models, exploiting the inherent
sparsity In numerical analysis and scientific computing, a sparse matrix or sparse array is a matrix in which most of the elements are zero. There is no strict definition regarding the proportion of zero-value elements for a matrix to qualify as sparse b ...
of such models. It minimizes functions of the form :f(\vec x) = g(\vec x) + C \, \vec x\, _1 where g is a
differentiable In mathematics, a differentiable function of one real variable is a function whose derivative exists at each point in its domain. In other words, the graph of a differentiable function has a non-vertical tangent line at each interior point in its ...
convex Convex or convexity may refer to: Science and technology * Convex lens, in optics Mathematics * Convex set, containing the whole line segment that joins points ** Convex polygon, a polygon which encloses a convex set of points ** Convex polytop ...
loss function. The method is an active-set type method: at each iterate, it estimates the sign of each component of the variable, and restricts the subsequent step to have the same sign. Once the sign is fixed, the non-differentiable \, \vec x\, _1 term becomes a smooth linear term which can be handled by L-BFGS. After an L-BFGS step, the method allows some variables to change sign, and repeats the process.


O-LBFGS

Schraudolph ''et al.'' present an
online In computer technology and telecommunications, online indicates a state of connectivity and offline indicates a disconnected state. In modern terminology, this usually refers to an Internet connection, but (especially when expressed "on line" o ...
approximation to both BFGS and L-BFGS. Similar to stochastic gradient descent, this can be used to reduce the computational complexity by evaluating the error function and gradient on a randomly drawn subset of the overall dataset in each iteration. It has been shown that O-LBFGS has a global almost sure convergence while the online approximation of BFGS (O-BFGS) is not necessarily convergent.


Implementation of variants

Notable open source implementations include: *
ALGLIB ALGLIB is a cross-platform open source numerical analysis and data processing library. It can be used from several programming languages (C++, C#, VB.NET, Python, Delphi). ALGLIB started in 1999 and has a long history of steady development wi ...
implements L-BFGS in C++ and C# as well as a separate box/linearly constrained version, BLEIC. * R's optim general-purpose optimizer routine uses the L-BFGS-B method. *
SciPy SciPy (pronounced "sigh pie") is a free and open-source Python library used for scientific computing and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, ...
's optimization module's minimize method also includes an option to use L-BFGS-B. Notable non open source implementations include: * The L-BFGS-B variant also exists as ACM TOMS algorithm 778. In February 2011, some of the authors of the original L-BFGS-B code posted a major update (version 3.0). * A reference implementation in Fortran 77 (and with a Fortran 90 interface). This version, as well as older versions, has been converted to many other languages. * An OWL-QN C++ implementation by its designers.


Works cited


Further reading

* * * {{Optimization algorithms, unconstrained Optimization algorithms and methods