Łojasiewicz inequality explained

In real algebraic geometry, the Łojasiewicz inequality, named after Stanisław Łojasiewicz, gives an upper bound for the distance of a point to the nearest zero of a given real analytic function. Specifically, let ƒ : U → R be a real analytic function on an open set U in Rⁿ, and let Z be the zero locus of ƒ. Assume that Z is not empty. Then for any compact set K in U, there exist positive constants α and C such that, for all x in K

\operatorname{dist}(x,Z)^\alpha\leC|f(x)|.

Here α can be large.

The following form of this inequality is often seen in more analytic contexts: with the same assumptions on f, for every p ∈ U there is a possibly smaller open neighborhood W of p and constants θ ∈ (0,1) and c > 0 such that

|f(x)-f(p)|^\theta\lec|\nablaf(x)|.

Polyak inequality

A special case of the Łojasiewicz inequality, due to, is commonly used to prove linear convergence of gradient descent algorithms. This section is based on and .

Definitions

$f$ is a function of type $\R^d \to \R$ , and has a continuous derivative

\nablaf

X^*

is the subset of

\R^d

on which

achieves its global minimum (if one exists). Throughout this section we assume such a global minimum value

f^*

exists, unless otherwise stated. The optimization objective is to find some point

X^*

$\mu, L > 0$ are constants.

$\nabla f$ is

-Lipschitz continuous iff

$\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|, \quad \forall x, y$

$f$ is $\mu$ -strongly convex iff $f(y)\ge f(x)+\nabla f(x)^T(y-x)+\frac\lVert y-x \rVert^2 \quad \forall x, y$

$f$ is $\mu$ -PL (where "PL" means "Polyak-Łojasiewicz") iff $\frac\|\nabla f(x)\|^2 \geq \mu\left(f(x)-f(x^*)\right), \quad \forall x$

Coordinate descent

The coordinate descent algorithm first samples a random coordinate $i_k$ uniformly, then perform gradient descent by $x_ = x_k - \eta \partial_ f(x_k) e_$

Stochastic gradient descent

In stochastic gradient descent, we have a function to minimize $f(x)$ , but we cannot sample its gradient directly. Instead, we sample a random gradient $\nabla f_i(x)$ , where $f_i$ are such that $f(x) = \mathbb E_i[f_i(x)]$ For example, in typical machine learning, $x$ are the parameters of the neural network, and $f_i(x)$ is the loss incurred on the $i$ -th training data point, while $f(x)$ is the average loss over all training data points.

The gradient update step is $x_ = x_k - \eta_k \nabla f_(x_k)$ where $\eta_k > 0$ are a sequence of learning rates (the learning rate schedule).

As it is, the proposition is difficult to use. We can make it easier to use by some further assumptions.

The second-moment on the right can be removed by assuming a uniform upper bound. That is, if there exists some $C> 0$ such that during the SG process, we have $\mathbb E_i[\|\nabla f_i(x_k)\|^2] \leq C$ for all $k = 0, 1, \dots$ , then $\mathbb\left[f\left(x_{k+1}\right)-f^*\right] \leq\left(1-2 \eta_k \mu\right)\left[f\left(x_k\right)-f^*\right]+\frac$ Similarly, if $\forall k, \quad \mathbb E_i[\|\nabla f_i(x_k) - \nabla f(x_k)\|^2]\leq C$ then $\mathbb\left[f\left(x_{k+1}\right)-f^*\right] \leq\left(1-\mu (2\eta_k - L\eta_k^2)\right)\left[f\left(x_k\right)-f^*\right]+\frac$

Learning rate schedules

For constant learning rate schedule, with $\eta_k = \eta = 1/L$ , we have $\mathbb\left[f\left(x_{k+1}\right)-f^*\right] \leq\left(1-\mu/L\right)\left[f\left(x_k\right)-f^*\right]+\frac$ By induction, we have $\mathbb\left[f\left(x_{k}\right)-f^*\right] \leq\left(1-\mu/L\right)^k \left[f\left(x_0\right)-f^*\right]+\frac$ We see that the loss decreases in expectation first exponentially, but then stops decreasing, which is caused by the $C/(2L)$ term. In short, because the gradient descent steps are too large, the variance in the stochastic gradient starts to dominate, and $x_k$ starts doing a random walk in the vicinity of $X^*$ .

For decreasing learning rate schedule with $\eta_k = O(1/k)$ , we have

E\left[f\left(x_k\right)-f^*\right]=O(1/k)

References

- - 1608.04636 . cs.LG . Hamed . Karimi . Julie . Nutini . Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak–Łojasiewicz Condition . Mark . Schmidt . 2016.
Liu . Chaoyue . Zhu . Libin . Belkin . Mikhail . 2022-07-01 . Loss landscapes and optimization in over-parameterized non-linear systems and neural networks . Applied and Computational Harmonic Analysis . Special Issue on Harmonic Analysis and Machine Learning . 59 . 85–116 . 10.1016/j.acha.2021.12.009 . 1063-5203. 2003.00307 .