Łojasiewicz Inequality
In real algebraic geometry, the Łojasiewicz inequality, named after Stanisław Łojasiewicz, gives an upper bound for the distance of a point to the nearest zero of a given real analytic function. Specifically, let ƒ : ''U'' → R be a real analytic function on an open set ''U'' in R''n'', and let ''Z'' be the zero locus of ƒ. Assume that ''Z'' is not empty. Then for any compact set ''K'' in ''U'', there exist positive constants α and ''C'' such that, for all ''x'' in ''K'',

:\operatorname{dist}(x,Z)^\alpha \le C|f(x)|.

Here α can be large.

The following form of this inequality is often seen in more analytic contexts: with the same assumptions on ''f'', for every ''p'' ∈ ''U'' there is a possibly smaller open neighborhood ''W'' of ''p'' and constants θ ∈ (0,1) and ''c'' > 0 such that, for all ''x'' in ''W'',

:|f(x)-f(p)|^\theta \le c\|\nabla f(x)\|.
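As a simple illustration (an example added here, not part of the original statement), take f(x) = x^2 on ''U'' = R, so that ''Z'' = {0} and \operatorname{dist}(x,Z) = |x|. Then

:\operatorname{dist}(x,Z)^2 = x^2 = |f(x)|,

so the inequality holds on every compact set with α = 2 and ''C'' = 1, while no exponent α < 2 works on any neighborhood of 0. The smooth but non-analytic function f(x) = e^{-1/x^2} (extended by f(0) = 0) admits no finite α at all, which is why analyticity is essential.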


Polyak inequality

A special case of the Łojasiewicz inequality, due to Polyak, is commonly used to prove linear convergence of gradient descent algorithms.
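As a hedged illustration (not from the original text) of how the PL condition yields linear convergence, the following Python sketch runs plain gradient descent on a convex quadratic, which satisfies the PL condition defined below with \mu equal to its smallest nonzero eigenvalue; the particular matrix, starting point, and step size 1/L are assumptions made for this example.

import numpy as np

# Illustrative objective: f(x) = 0.5 * x^T A x with A = diag(1, 4, 10),
# so f* = 0, the gradient is L-Lipschitz with L = 10, and the PL constant is mu = 1.
A = np.diag([1.0, 4.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

L, mu = 10.0, 1.0
x = np.array([3.0, -2.0, 1.0])
gap0 = f(x)
for k in range(1, 51):
    x = x - (1.0 / L) * grad(x)   # gradient step with constant step size 1/L
    # linear-convergence bound implied by the PL condition:
    # f(x_k) - f* <= (1 - mu/L)^k * (f(x_0) - f*)
    assert f(x) <= (1 - mu / L) ** k * gap0 + 1e-12

print("suboptimality after 50 steps:", f(x))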


Definitions

''f'' is a function of type \R^d \to \R with a continuous gradient \nabla f. X^* is the subset of \R^d on which f achieves its global minimum (if one exists). Throughout this section we assume such a global minimum value f^* exists, unless otherwise stated. The optimization objective is to find some point x in X^*. \mu, L > 0 are constants.

\nabla f is ''L''-Lipschitz continuous iff
:\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|, \quad \forall x, y.

''f'' is \mu-strongly convex iff
:f(y) \ge f(x) + \nabla f(x)^T (y-x) + \frac{\mu}{2} \lVert y-x \rVert^2, \quad \forall x, y.

''f'' is \mu-PL (where "PL" means "Polyak-Łojasiewicz") iff
:\frac{1}{2} \|\nabla f(x)\|^2 \geq \mu \left(f(x)-f^*\right), \quad \forall x.
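As an added remark (an illustration, not part of the original text), \mu-strong convexity implies the \mu-PL condition: minimizing both sides of the strong convexity inequality over ''y'' gives

:f^* \ge f(x) - \frac{1}{2\mu} \|\nabla f(x)\|^2,

which rearranges to the PL inequality. The converse fails: the convex quadratic f(x) = \frac{1}{2} x^T A x with ''A'' positive semidefinite and singular is not strongly convex, yet it is \mu-PL with \mu equal to the smallest nonzero eigenvalue of ''A''.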


Basic properties


Gradient descent


Coordinate descent

The coordinate descent algorithm first samples a random coordinate i_k uniformly, then performs a gradient step along that coordinate:
:x_{k+1} = x_k - \eta \, \partial_{i_k} f(x_k) \, e_{i_k},
where e_{i_k} is the i_k-th standard basis vector.
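A minimal Python sketch of this update, assuming the same illustrative quadratic objective as above and a hand-picked step size (neither is from the original text):

import numpy as np

rng = np.random.default_rng(0)

A = np.diag([1.0, 4.0, 10.0])    # illustrative quadratic f(x) = 0.5 * x^T A x
grad = lambda x: A @ x           # full gradient; only one entry is used per step

x = np.array([3.0, -2.0, 1.0])
eta = 0.1                        # step size, chosen by hand for this example
for k in range(200):
    i = rng.integers(len(x))     # sample coordinate i_k uniformly
    x[i] -= eta * grad(x)[i]     # x_{k+1} = x_k - eta * d_{i_k} f(x_k) * e_{i_k}

print(x)                         # approaches the minimizer x = 0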


Stochastic gradient descent

In stochastic gradient descent, we have a function to minimize f(x), but we cannot sample its gradient directly. Instead, we sample a random gradient \nabla f_i(x), where the f_i are such that
:f(x) = \mathbb E_i [f_i(x)].
For example, in typical machine learning, x is the parameter vector of the neural network, f_i(x) is the loss incurred on the i-th training data point, and f(x) is the average loss over all training data points. The gradient update step is
:x_{k+1} = x_k - \eta_k \nabla f_{i_k}(x_k),
where \eta_k > 0 is a sequence of learning rates (the learning rate schedule).

As it is, the proposition is difficult to use. We can make it easier to use by some further assumptions. The second-moment term on the right can be removed by assuming a uniform upper bound. That is, if there exists some C > 0 such that during the SGD process we have
:\mathbb E_i \|\nabla f_i(x_k)\|^2 \leq C \quad \text{for all } k = 0, 1, \dots,
then
:\mathbb E\left[f\left(x_{k+1}\right)-f^*\right] \leq \left(1-2 \eta_k \mu\right) \mathbb E\left[f\left(x_k\right)-f^*\right] + \frac{L C \eta_k^2}{2}.
Similarly, if
:\mathbb E_i \|\nabla f_i(x_k) - \nabla f(x_k)\|^2 \leq C \quad \forall k,
then
:\mathbb E\left[f\left(x_{k+1}\right)-f^*\right] \leq \left(1-\mu\left(2\eta_k - L\eta_k^2\right)\right) \mathbb E\left[f\left(x_k\right)-f^*\right] + \frac{L C \eta_k^2}{2}.
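A minimal SGD sketch in Python under these assumptions, using a synthetic least-squares problem (the data, loss, and learning rate below are illustrative choices, not from the original text):

import numpy as np

rng = np.random.default_rng(0)

# Illustrative finite-sum problem: f(x) = mean_i f_i(x) with
# f_i(x) = 0.5 * (a_i . x - b_i)^2, a PL objective (a convex quadratic).
n, d = 100, 5
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d)            # noiseless targets, so f* = 0

def grad_i(x, i):
    # stochastic gradient: gradient of the single-sample loss f_i
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(d)
eta = 0.01                            # constant learning rate, chosen by hand
for k in range(5000):
    i = rng.integers(n)               # sample a training point uniformly
    x = x - eta * grad_i(x, i)        # x_{k+1} = x_k - eta_k * grad f_{i_k}(x_k)

print("average loss:", 0.5 * np.mean((A @ x - b) ** 2))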


Learning rate schedules

For a constant learning rate schedule, with \eta_k = \eta = 1/L, we have
:\mathbb E\left[f\left(x_{k+1}\right)-f^*\right] \leq \left(1-\mu/L\right) \mathbb E\left[f\left(x_k\right)-f^*\right] + \frac{C}{2L}.
By induction (summing the geometric series), we have
:\mathbb E\left[f\left(x_{k}\right)-f^*\right] \leq \left(1-\mu/L\right)^k \left[f\left(x_0\right)-f^*\right] + \frac{C}{2\mu}.
We see that the loss first decreases exponentially in expectation, but then stops decreasing, which is caused by the C/(2L) term. In short, because the gradient descent steps are too large, the variance in the stochastic gradient starts to dominate, and x_k starts doing a random walk in the vicinity of X^*. For a decreasing learning rate schedule with \eta_k = O(1/k), we have
:\mathbb E\left[f\left(x_{k}\right)-f^*\right] = O(1/k).
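A brief Python sketch contrasting the two schedules on a synthetic noisy least-squares problem (the data, noise level, and step-size constants are illustrative assumptions, not from the original text):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic noisy least-squares problem (illustrative only).
n, d = 200, 5
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.5 * rng.normal(size=n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)     # minimizer of the average loss
f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
f_star = f(x_star)

def run_sgd(schedule, steps=20000):
    x = np.zeros(d)
    for k in range(steps):
        i = rng.integers(n)
        g = (A[i] @ x - b[i]) * A[i]               # stochastic gradient of f_i
        x = x - schedule(k) * g
    return f(x) - f_star                           # suboptimality gap, always >= 0

print("constant eta = 0.05:", run_sgd(lambda k: 0.05))                   # plateaus at a noise floor
print("eta_k = O(1/k):     ", run_sgd(lambda k: 1.0 / (0.05 * k + 20)))  # keeps shrinking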

