Łojasiewicz Inequality
In real algebraic geometry, the Łojasiewicz inequality, named after Stanisław Łojasiewicz, gives an upper bound for the distance of a point to the nearest zero of a given real analytic function. Specifically, let ƒ : ''U'' → R be a real analytic function on an open set ''U'' in R''n'', and let ''Z'' be the zero locus of ƒ. Assume that ''Z'' is not empty. Then for any compact set ''K'' in ''U'', there exist positive constants α and ''C'' such that, for all ''x'' in ''K'',

:\operatorname{dist}(x,Z)^\alpha \le C|f(x)|.

Here α can be large.

The following form of this inequality is often seen in more analytic contexts: with the same assumptions on ''f'', for every ''p'' ∈ ''U'' there is a possibly smaller open neighborhood ''W'' of ''p'' and constants θ ∈ (0,1) and ''c'' > 0 such that, for all ''x'' in ''W'',

:|f(x)-f(p)|^\theta \le c\|\nabla f(x)\|.
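As a simple illustration (an example added here, not part of the original statement), take f(x) = x^2 on ''U'' = R, so that ''Z'' = {0} and \operatorname{dist}(x,Z) = |x|. Then

:\operatorname{dist}(x,Z)^2 = x^2 = |f(x)|,

so the inequality holds on every compact set with α = 2 and ''C'' = 1, while no exponent α < 2 works on any neighborhood of 0. The smooth but non-analytic function f(x) = e^{-1/x^2} (extended by f(0) = 0) admits no finite α at all, which is why analyticity is essential.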


Polyak inequality

A special case of the Łojasiewicz inequality, due to Polyak, is commonly used to prove linear convergence of gradient descent algorithms.
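As a hedged illustration (not from the original text) of how the PL condition yields linear convergence, the following Python sketch runs plain gradient descent on a convex quadratic, which satisfies the PL condition defined below with \mu equal to its smallest nonzero eigenvalue; the particular matrix, starting point, and step size 1/L are assumptions made for this example.

import numpy as np

# Illustrative objective: f(x) = 0.5 * x^T A x with A = diag(1, 4, 10),
# so f* = 0, the gradient is L-Lipschitz with L = 10, and the PL constant is mu = 1.
A = np.diag([1.0, 4.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

L, mu = 10.0, 1.0
x = np.array([3.0, -2.0, 1.0])
gap0 = f(x)
for k in range(1, 51):
    x = x - (1.0 / L) * grad(x)   # gradient step with constant step size 1/L
    # linear-convergence bound implied by the PL condition:
    # f(x_k) - f* <= (1 - mu/L)^k * (f(x_0) - f*)
    assert f(x) <= (1 - mu / L) ** k * gap0 + 1e-12

print("suboptimality after 50 steps:", f(x))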


Definitions

''f'' is a function of type \R^d \to \R with a continuous gradient \nabla f. X^* is the subset of \R^d on which f achieves its global minimum (if one exists). Throughout this section we assume such a global minimum value f^* exists, unless otherwise stated. The optimization objective is to find some point x in X^*. \mu, L > 0 are constants.

\nabla f is ''L''-Lipschitz continuous iff
:\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|, \quad \forall x, y.

''f'' is \mu-strongly convex iff
:f(y) \ge f(x) + \nabla f(x)^T (y-x) + \frac{\mu}{2} \lVert y-x \rVert^2, \quad \forall x, y.

''f'' is \mu-PL (where "PL" means "Polyak-Łojasiewicz") iff
:\frac{1}{2} \|\nabla f(x)\|^2 \geq \mu \left(f(x)-f^*\right), \quad \forall x.
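As an added remark (an illustration, not part of the original text), \mu-strong convexity implies the \mu-PL condition: minimizing both sides of the strong convexity inequality over ''y'' gives

:f^* \ge f(x) - \frac{1}{2\mu} \|\nabla f(x)\|^2,

which rearranges to the PL inequality. The converse fails: the convex quadratic f(x) = \frac{1}{2} x^T A x with ''A'' positive semidefinite and singular is not strongly convex, yet it is \mu-PL with \mu equal to the smallest nonzero eigenvalue of ''A''.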


Basic properties


Gradient descent


Coordinate descent

The coordinate descent algorithm first samples a random coordinate i_k uniformly, then performs a gradient step along that coordinate:
:x_{k+1} = x_k - \eta \, \partial_{i_k} f(x_k) \, e_{i_k},
where e_{i_k} is the i_k-th standard basis vector.
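A minimal Python sketch of this update, assuming the same illustrative quadratic objective as above and a hand-picked step size (neither is from the original text):

import numpy as np

rng = np.random.default_rng(0)

A = np.diag([1.0, 4.0, 10.0])    # illustrative quadratic f(x) = 0.5 * x^T A x
grad = lambda x: A @ x           # full gradient; only one entry is used per step

x = np.array([3.0, -2.0, 1.0])
eta = 0.1                        # step size, chosen by hand for this example
for k in range(200):
    i = rng.integers(len(x))     # sample coordinate i_k uniformly
    x[i] -= eta * grad(x)[i]     # x_{k+1} = x_k - eta * d_{i_k} f(x_k) * e_{i_k}

print(x)                         # approaches the minimizer x = 0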


Stochastic gradient descent

In stochastic gradient descent, we have a function to minimize f(x), but we cannot sample its gradient directly. Instead, we sample a random gradient \nabla f_i(x), where the f_i are such that
:f(x) = \mathbb E_i [f_i(x)].
For example, in typical machine learning, x is the parameter vector of the neural network, f_i(x) is the loss incurred on the i-th training data point, and f(x) is the average loss over all training data points. The gradient update step is
:x_{k+1} = x_k - \eta_k \nabla f_{i_k}(x_k),
where \eta_k > 0 is a sequence of learning rates (the learning rate schedule).

As it is, the proposition is difficult to use. We can make it easier to use by some further assumptions. The second-moment term on the right can be removed by assuming a uniform upper bound. That is, if there exists some C > 0 such that during the SGD process we have
:\mathbb E_i \|\nabla f_i(x_k)\|^2 \leq C \quad \text{for all } k = 0, 1, \dots,
then
:\mathbb E\left[f\left(x_{k+1}\right)-f^*\right] \leq \left(1-2 \eta_k \mu\right) \mathbb E\left[f\left(x_k\right)-f^*\right] + \frac{L C \eta_k^2}{2}.
Similarly, if
:\mathbb E_i \|\nabla f_i(x_k) - \nabla f(x_k)\|^2 \leq C \quad \forall k,
then
:\mathbb E\left[f\left(x_{k+1}\right)-f^*\right] \leq \left(1-\mu\left(2\eta_k - L\eta_k^2\right)\right) \mathbb E\left[f\left(x_k\right)-f^*\right] + \frac{L C \eta_k^2}{2}.
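A minimal SGD sketch in Python under these assumptions, using a synthetic least-squares problem (the data, loss, and learning rate below are illustrative choices, not from the original text):

import numpy as np

rng = np.random.default_rng(0)

# Illustrative finite-sum problem: f(x) = mean_i f_i(x) with
# f_i(x) = 0.5 * (a_i . x - b_i)^2, a PL objective (a convex quadratic).
n, d = 100, 5
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d)            # noiseless targets, so f* = 0

def grad_i(x, i):
    # stochastic gradient: gradient of the single-sample loss f_i
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(d)
eta = 0.01                            # constant learning rate, chosen by hand
for k in range(5000):
    i = rng.integers(n)               # sample a training point uniformly
    x = x - eta * grad_i(x, i)        # x_{k+1} = x_k - eta_k * grad f_{i_k}(x_k)

print("average loss:", 0.5 * np.mean((A @ x - b) ** 2))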


Learning rate schedules

For a constant learning rate schedule, with \eta_k = \eta = 1/L, we have
:\mathbb E\left[f\left(x_{k+1}\right)-f^*\right] \leq \left(1-\mu/L\right) \mathbb E\left[f\left(x_k\right)-f^*\right] + \frac{C}{2L}.
By induction (summing the geometric series), we have
:\mathbb E\left[f\left(x_{k}\right)-f^*\right] \leq \left(1-\mu/L\right)^k \left[f\left(x_0\right)-f^*\right] + \frac{C}{2\mu}.
We see that the loss first decreases exponentially in expectation, but then stops decreasing, which is caused by the C/(2L) term. In short, because the gradient descent steps are too large, the variance in the stochastic gradient starts to dominate, and x_k starts doing a random walk in the vicinity of X^*. For a decreasing learning rate schedule with \eta_k = O(1/k), we have
:\mathbb E\left[f\left(x_{k}\right)-f^*\right] = O(1/k).
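A brief Python sketch contrasting the two schedules on a synthetic noisy least-squares problem (the data, noise level, and step-size constants are illustrative assumptions, not from the original text):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic noisy least-squares problem (illustrative only).
n, d = 200, 5
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.5 * rng.normal(size=n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)     # minimizer of the average loss
f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
f_star = f(x_star)

def run_sgd(schedule, steps=20000):
    x = np.zeros(d)
    for k in range(steps):
        i = rng.integers(n)
        g = (A[i] @ x - b[i]) * A[i]               # stochastic gradient of f_i
        x = x - schedule(k) * g
    return f(x) - f_star                           # suboptimality gap, always >= 0

print("constant eta = 0.05:", run_sgd(lambda k: 0.05))                   # plateaus at a noise floor
print("eta_k = O(1/k):     ", run_sgd(lambda k: 1.0 / (0.05 * k + 20)))  # keeps shrinking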

