Łojasiewicz inequality explained

In real algebraic geometry, the Łojasiewicz inequality, named after Stanisław Łojasiewicz, gives an upper bound for the distance of a point to the nearest zero of a given real analytic function. Specifically, let ƒ : U → R be a real analytic function on an open set U in Rn, and let Z be the zero locus of ƒ. Assume that Z is not empty. Then for any compact set K in U, there exist positive constants α and C such that, for all x in K

\operatorname{dist}(x,Z)\alpha\leC|f(x)|.

Here α can be large.

The following form of this inequality is often seen in more analytic contexts: with the same assumptions on f, for every p ∈ U there is a possibly smaller open neighborhood W of p and constants θ ∈ (0,1) and c > 0 such that

|f(x)-f(p)|\theta\lec|\nablaf(x)|.

Polyak inequality

A special case of the Łojasiewicz inequality, due to, is commonly used to prove linear convergence of gradient descent algorithms. This section is based on and .

Definitions

f is a function of type \R^d \to \R, and has a continuous derivative

\nablaf

.

X*

is the subset of

\Rd

on which

f

achieves its global minimum (if one exists). Throughout this section we assume such a global minimum value

f*

exists, unless otherwise stated. The optimization objective is to find some point

x

in

X*

.

\mu, L > 0 are constants.

\nabla f is

L

-Lipschitz continuous iff

\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|, \quad \forall x, y

f is \mu-strongly convex ifff(y)\ge f(x)+\nabla f(x)^T(y-x)+\frac\lVert y-x \rVert^2 \quad \forall x, y

f is \mu-PL (where "PL" means "Polyak-Łojasiewicz") iff \frac\|\nabla f(x)\|^2 \geq \mu\left(f(x)-f(x^*)\right), \quad \forall x

Coordinate descent

The coordinate descent algorithm first samples a random coordinate i_k uniformly, then perform gradient descent by x_ = x_k - \eta \partial_ f(x_k) e_

Stochastic gradient descent

In stochastic gradient descent, we have a function to minimize f(x), but we cannot sample its gradient directly. Instead, we sample a random gradient \nabla f_i(x), where f_i are such that f(x) = \mathbb E_i[f_i(x)] For example, in typical machine learning, x are the parameters of the neural network, and f_i(x) is the loss incurred on the i -th training data point, while f(x) is the average loss over all training data points.

The gradient update step is x_ = x_k - \eta_k \nabla f_(x_k) where \eta_k > 0 are a sequence of learning rates (the learning rate schedule).

As it is, the proposition is difficult to use. We can make it easier to use by some further assumptions.

The second-moment on the right can be removed by assuming a uniform upper bound. That is, if there exists some C> 0 such that during the SG process, we have\mathbb E_i[\|\nabla f_i(x_k)\|^2] \leq C for all k = 0, 1, \dots, then \mathbb\left[f\left(x_{k+1}\right)-f^*\right] \leq\left(1-2 \eta_k \mu\right)\left[f\left(x_k\right)-f^*\right]+\frac Similarly, if \forall k, \quad \mathbb E_i[\|\nabla f_i(x_k) - \nabla f(x_k)\|^2]\leq C then \mathbb\left[f\left(x_{k+1}\right)-f^*\right] \leq\left(1-\mu (2\eta_k - L\eta_k^2)\right)\left[f\left(x_k\right)-f^*\right]+\frac

Learning rate schedules

For constant learning rate schedule, with \eta_k = \eta = 1/L, we have \mathbb\left[f\left(x_{k+1}\right)-f^*\right] \leq\left(1-\mu/L\right)\left[f\left(x_k\right)-f^*\right]+\frac By induction, we have \mathbb\left[f\left(x_{k}\right)-f^*\right] \leq\left(1-\mu/L\right)^k \left[f\left(x_0\right)-f^*\right]+\frac We see that the loss decreases in expectation first exponentially, but then stops decreasing, which is caused by the C/(2L) term. In short, because the gradient descent steps are too large, the variance in the stochastic gradient starts to dominate, and x_k starts doing a random walk in the vicinity of X^*.

For decreasing learning rate schedule with \eta_k = O(1/k), we have

E\left[f\left(xk\right)-f*\right]=O(1/k)

.

References