Steepest Descent
Since the gradient of a function points in the direction of steepest ascent, we can follow the negative gradient to move towards a minimum. This method is called steepest descent or gradient descent.
Theory: Steepest Descent with Fixed Step Size
We shall formalise this idea into an algorithm:
- Choose a starting point \(x_0\) and a step size \(\alpha\).
- Calculate the gradient at the current point \(\nabla f(x_k)\).
- Update the current point using \(x_{k+1} = x_k - \alpha \cdot \nabla f(x_k)\).
- Repeat until the convergence criterion is met.
The step size \(\alpha\), also known as the learning rate, scales the gradient. Because the gradient vanishes at a stationary point, this algorithm, if it converges, is guaranteed to converge to a stationary point. One should check afterwards whether this stationary point is a (local) minimum.
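Before analysing convergence, a minimal sketch of this loop may help to make the update rule concrete. The toy quadratic objective, step size and tolerance below are illustrative choices only; a more flexible, class-based implementation follows in the implementation sections.
import numpy as np

# Illustrative objective f(x) = x_1^2 + 10 x_2^2 and its gradient.
def f(x):
    return x[0]**2 + 10.0 * x[1]**2

def grad_f(x):
    return np.array([2.0 * x[0], 20.0 * x[1]])

x = np.array([2.0, 1.0])   # starting point x_0
alpha = 0.05               # fixed step size

for k in range(1000):
    g = grad_f(x)
    if np.linalg.norm(g) < 1e-6:   # convergence criterion: small gradient norm
        break
    x = x - alpha * g              # update x_{k+1} = x_k - alpha * grad f(x_k)

print(x)   # close to the minimum at (0, 0)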
If we assume that our objective function is convex and has a Lipschitz continuous gradient with Lipschitz constant \(L\), then we can show that the steepest descent algorithm always converges if the step size is sufficiently small. In the worst case, steepest descent converges sublinearly.
Proof
We start by inserting \(y = x - \alpha \nabla f(x)\) into the quadratic upper bound of the objective function: $$ \begin{align*} f(y) &\leq f(x) + \langle \nabla f(x), (y - x)\rangle + \frac{L}{2} \|y - x\|^2 \\ f(x - \alpha \nabla f(x)) &\leq f(x) - \alpha \langle \nabla f(x), \nabla f(x)\rangle + \frac{L}{2} \| \alpha \nabla f(x)\|^2 \\ &= f(x) - \alpha \left( 1 - \frac{L \alpha}{2}\right) \| \nabla f(x)\|^2 \end{align*} $$
If we choose \(0 < \alpha \leq \frac{1}{L}\), then \(\alpha \left(1 - \frac{L \alpha}{2}\right) \geq \frac{\alpha}{2}\) and we have $$ f(x_{k+1}) \leq f(x_k) - \frac{\alpha}{2} \| \nabla f(x_k)\|^2 $$ by setting \(x = x_k\) and identifying \(x_{k+1} = y = x_k - \alpha \nabla f(x_k)\).
Because of the convexity of \(f\), we have
$$ f(x^{*}) \geq f(x_{k}) + \langle \nabla f(x_{k}), x^{*} - x_{k} \rangle \quad \Leftrightarrow \quad f(x^{*}) + \langle \nabla f(x_{k}), x_{k} - x^{*} \rangle \geq f(x_{k}) $$
Therefore, we can further estimate \(f(x_{k+1})\) using $$ \begin{align*} f(x_{k+1}) &\leq f(x_k) - \frac{\alpha}{2} \| \nabla f(x_k)\|^2 \\ &\leq f(x^{*}) + \langle \nabla f(x_k), x_k - x^* \rangle - \frac{\alpha}{2} \| \nabla f(x_k)\|^2 \\ &= f(x^{*}) + \left\langle \frac{1}{2} \nabla f(x_k), 2(x_k - x^{*}) \right\rangle - \left\langle \frac{1}{2} \nabla f(x_k), \alpha \nabla f(x_k) \right\rangle \\ &= f(x^{*}) + \left\langle \frac{1}{2} \nabla f(x_k), 2 x_k - 2 x^{*} - \alpha \nabla f(x_k) \right\rangle \\ &= f(x^{*}) + \frac{1}{2\alpha} \langle \alpha \nabla f(x_k), 2 x_k - 2 x^{*} - \alpha \nabla f(x_k) \rangle \\ &= f(x^{*}) + \frac{1}{2\alpha} \left( \| x_k - x^{*} \|^2 - \| x_k - x^{*} - \alpha \nabla f(x_k) \|^2 \right) \\ &= f(x^{*}) + \frac{1}{2\alpha} \left( \| x_k - x^{*} \|^2 - \| x_{k+1} - x^{*} \|^2 \right) \end{align*} $$ In the second-to-last step we used the identity \(\langle a, 2b - a \rangle = \|b\|^2 - \|b - a\|^2\) with \(a = \alpha \nabla f(x_k)\) and \(b = x_k - x^{*}\).
Summing over all \(k\)'s on both sides, we obtain: $$ \begin{align*} \sum_{i=0}^k (f(x_{i+1}) - f(x^{*})) &\leq \frac{1}{2\alpha} \sum_{i=0}^k \left( \| x_i - x^{*} \|^2 - \| x_{i+1} - x^{*} \|^2 \right) \\ &= \frac{1}{2\alpha} \left( \| x_0 - x^{*} \|^2 - \| x_{k+1} - x^{*} \|^2 \right) \\ &\leq \frac{1}{2\alpha} \| x_0 - x^{*} \|^2 \end{align*} $$
Because \(f(x_{k+1}) \leq f(x_k) - \frac{\alpha}{2} \|\nabla f(x_k)\|^2 \leq f(x_k)\), \((f(x_{k}))\) is a decreasing sequence. Therefore, the sum on the left-hand side can be bounded from below by $$ \sum_{i=0}^k (f(x_{i+1}) - f(x^{*})) \geq \sum_{i=0}^k (f(x_{k+1}) - f(x^{*})) = (k+1)(f(x_{k+1}) - f(x^{*})) $$ which leads to $$ f(x_{k+1}) - f(x^{*}) \leq \frac{1}{k+1} \sum_{i=0}^k (f(x_{i+1}) - f(x^{*})) \leq \frac{1}{2\alpha(k+1)} \| x_0 - x^{*} \|^2 $$
This shows that for any \(\epsilon > 0\), we can choose \(k\) large enough such that \(f(x_{k+1}) - f(x^{*}) \leq \epsilon\). Therefore, with a step size \(\alpha \leq \frac{1}{L}\), steepest descent is guaranteed to converge to the minimum of the objective function. In the worst case, steepest descent converges sublinearly with order \(\mathcal{O}\left(\frac{1}{k}\right)\) for convex functions.
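As a quick numerical sanity check (not part of the derivation above), we can verify the bound \(f(x_{k+1}) - f(x^{*}) \leq \frac{1}{2\alpha(k+1)} \|x_0 - x^{*}\|^2\) on a simple convex quadratic, assuming the Lipschitz constant is the largest eigenvalue of its Hessian.
import numpy as np

# Convex quadratic f(x) = 0.5 x^T A x with Hessian A and gradient A x.
# The Lipschitz constant of the gradient is the largest eigenvalue of A.
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x      # minimum f(x*) = 0 at x* = 0
grad = lambda x: A @ x

L = 10.0
alpha = 1.0 / L
x0 = np.array([5.0, -3.0])
x = x0.copy()

for k in range(200):
    x = x - alpha * grad(x)        # x is now the iterate x_{k+1}
    bound = np.dot(x0, x0) / (2.0 * alpha * (k + 1))
    assert f(x) <= bound           # the O(1/k) bound holds at every iteration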
If the objective function is strongly convex (with parameter \(\mu\)) and has a Lipschitz continuous gradient, steepest descent can be shown to converge linearly in the worst case, provided a sufficiently small step size \(\alpha\) is chosen.
Proof
By inserting \(x_{k+1} = x_k - \alpha \nabla f(x_k)\), we obtain $$ \begin{align*} \|x_{k+1} - x^{*}\|^2 &= \|x_k - \alpha \nabla f(x_k) - x^{*}\|^2 \\ &= \|x_k - x^{*}\|^2 - 2 \alpha \langle \nabla f(x_k), x_k - x^{*} \rangle + \alpha^2 \| \nabla f(x_k) \|^2 \end{align*} $$
From the quadratic lower bound of the strongly convex function, we have $$ f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \| y - x \|^2 $$ Putting \(y = x^{*}\) and \(x = x_k\), we obtain $$ f(x^{*}) \geq f(x_k) + \langle \nabla f(x_k), x^{*} - x_k \rangle + \frac{\mu}{2} \| x^{*} - x_k \|^2 \\ \Leftrightarrow \quad \langle \nabla f(x_k), x_k - x^{*} \rangle \geq f(x_k) - f(x^{*}) + \frac{\mu}{2} \| x_k - x^{*} \|^2 $$
Inserting this inequality into the previous equation, we can write $$ \begin{align*} \|x_{k+1} - x^{*}\|^2 &\leq \|x_k - x^{*}\|^2 - 2 \alpha \left(f(x_k) - f(x^{*}) + \frac{\mu}{2} \| x_k - x^{*} \|^2\right) + \alpha^2 \| \nabla f(x_k) \|^2 \\ &= (1 - \alpha \mu) \|x_k - x^{*}\|^2 - 2 \alpha (f(x_k) - f(x^{*})) + \alpha^2 \| \nabla f(x_k) \|^2 \end{align*} $$
For \(\alpha \leq \frac{1}{L}\), we have already shown for convex functions that $$ f(x_{k+1}) \leq f(x_k) - \frac{\alpha}{2} \| \nabla f(x_k)\|^2 $$ Because \(f(x^{*}) \leq f(x_k)\ \forall k\), we get $$ f(x^{*}) - f(x_k) \leq f(x_{k+1}) - f(x_k) \leq - \frac{\alpha}{2} \| \nabla f(x_k)\|^2 \\ \Leftrightarrow \quad \| \nabla f(x_k)\|^2 \leq \frac{2}{\alpha} (f(x_k) - f(x^{*})) $$ which allows us to rewrite the previous inequality as $$ \begin{align*} \|x_{k+1} - x^{*}\|^2 &\leq (1 - \alpha \mu) \|x_k - x^{*}\|^2 - 2 \alpha (f(x_k) - f(x^{*})) + \alpha^2 \| \nabla f(x_k) \|^2 \\ &\leq (1 - \alpha \mu) \|x_k - x^{*}\|^2 - 2 \alpha (f(x_k) - f(x^{*})) + 2 \alpha (f(x_k) - f(x^{*})) \\ &= (1 - \alpha \mu) \|x_k - x^{*}\|^2 \end{align*} $$ Taking the square root on both sides, we get $$ \|x_{k+1} - x^{*}\| \leq \sqrt{1 - \alpha \mu} \|x_k - x^{*}\| $$ which shows that steepest descent converges linearly for strongly convex functions.
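Again as an illustration rather than part of the proof, the contraction factor \(\sqrt{1 - \alpha \mu}\) can be checked numerically on a strongly convex quadratic, where \(\mu\) and \(L\) are the smallest and largest eigenvalues of the Hessian.
import numpy as np

# Strongly convex quadratic with mu = 1 and L = 10; the minimiser is x* = 0.
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x

mu, L = 1.0, 10.0
alpha = 1.0 / L
rate = np.sqrt(1.0 - alpha * mu)    # guaranteed per-step contraction factor

x = np.array([5.0, -3.0])
for _ in range(50):
    x_new = x - alpha * grad(x)
    assert np.linalg.norm(x_new) <= rate * np.linalg.norm(x) + 1e-12
    x = x_new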
Theory: Steepest Descent with Variable Step Size
A very easy modification to improve the performance of the steepest descent algorithm with fixed step size is to make \(\alpha\) variable. We can use the gradient to determine a direction and find a minimum along this direction. This is called a line search. Since we just want to obtain a reasonable step size, this line search does not have to be exact. A common inexact method is the so-called backtracking line search.
The Armijo variant of this method can be described as follows:
- Choose a starting point \(x\), a maximum step size \(\alpha_0\), a descent direction \(d\) and control parameters \(\tau \in (0, 1)\) and \(c \in (0, 1)\).
- Calculate the directional derivative \(m = \langle \nabla f(x), d \rangle\) and the threshold \(t = -c \cdot m\). Note that it is assumed that \(d\) leads to a local decrease in \(f\) and hence \(m < 0\).
- Starting from \(j = 0\), check if the condition \( f(x) - f(x + \alpha_j d) \geq \alpha_j t \) is satisfied. If yes, return \(\alpha_j\). If not, set \(\alpha_{j+1} = \tau \cdot \alpha_j\) and increment \(j\).
- Repeat until the condition above is satisfied.
Let us try to gain some intuition from this cryptic description. The right-hand side of the condition, \(\alpha_j t = c \alpha_j |m|\), is the decrease in \(f\) predicted by the linear approximation, scaled by \(c\). So we start with a large step size \(\alpha_0\) and check if it leads to a sufficient decrease in \(f\). If not, we shrink the step size and check again. In this way, we will almost never find the exact minimum along the search direction, but that is good enough: we do not use the line search to find the minimum, but to find a reasonable step size for steepest descent.
Steepest descent with backtracking line search can be summarized as follows:
- Choose a starting point \(x_0\) and a maximum step size \(\alpha_0\).
- Calculate the gradient at the current point \(\nabla f(x_k)\).
- Use the negative gradient as the search direction for the backtracking line search to find a step size \(\alpha_k\).
- Update the current point using \(x_{k+1} = x_k - \alpha_k \nabla f(x_k)\).
- Repeat until the convergence criterion is met.
Implementation: Base Class for Optimizers
Because we aim to implement several optimization algorithms, we will first
implement a base class OptimiserBase, which contains the methods and
attributes common to all of them.
import numpy as np
from abc import ABC, abstractmethod
class OptimiserBase(ABC):
def __init__(self, func, p0, maxiter=200, **kwargs):
self.func = func
self.p = np.copy(p0)
self.p_new = np.copy(p0)
self.maxiter = maxiter
self.kwargs = kwargs
@abstractmethod
def next_step(self):
pass
@abstractmethod
def check_convergence(self):
pass
def _check_convergence_grad(self):
tol = self.kwargs.get('grad_tol', 1e-6)
grad_norm = np.linalg.norm(self.func(self.p, deriv=1))
return grad_norm < tol
def run(self, full_output=False):
converged = False
ps = [self.p]
for i in range(0, self.maxiter):
self.p_new = self.next_step()
ps.append(self.p_new)
converged = self.check_convergence()
if converged:
break
else:
self.p = np.copy(self.p_new)
if converged:
print(f'Optimisation converged in {i + 1} iterations!')
if not converged:
print('WARNING: Optimisation could not converge '
f'after {i + 1} iterations!')
if full_output:
info_dict = {
'niter': len(ps),
'p_opt': self.p_new,
'fval_opt': self.func(self.p_new, deriv=0),
'grad_opt': self.func(self.p_new, deriv=1),
'p_traj': np.array(ps),
}
return self.p_new, info_dict
else:
return self.p_new
This base class implements the _check_convergence_grad method, which
checks whether the gradient norm is smaller than a given tolerance. One could
implement other convergence criteria, e.g. the absolute or relative change of
the iterate, the absolute or relative change of the objective function value, etc.
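For instance, a convergence check based on the change of the objective function value could look like the following sketch; the method name and the fval_tol keyword are hypothetical and not part of the base class above.
# Hypothetical additional criterion: converged if the objective value
# changes by less than a given tolerance between iterates.
def _check_convergence_fval(self):
    tol = self.kwargs.get('fval_tol', 1e-8)
    fval_old = self.func(self.p, deriv=0)
    fval_new = self.func(self.p_new, deriv=0)
    return abs(fval_new - fval_old) < tol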
This base class also implements
the run method which runs the optimization algorithm until convergence
or until a maximum number of iterations is reached. The run method calls
next_step method to obtain the next iterate and check_convergence method
to check whether the algorithm has converged. These two methods are abstract
methods which need to be implemented by subclasses.
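Note that OptimiserBase only assumes the objective to be a callable of the form func(p, deriv=...), returning the function value for deriv=0 and the gradient for deriv=1; this is how it is used in _check_convergence_grad and run. As a hypothetical stand-in for the objective-function classes used later, a minimal quadratic objective following this convention could look like this:
import numpy as np

# Hypothetical toy objective following the func(p, deriv=...) convention:
# deriv=0 returns the function value, deriv=1 returns the gradient.
def quadratic(p, deriv=0):
    p = np.asarray(p, dtype=float)
    if deriv == 0:
        return p[0]**2 + 10.0 * p[1]**2
    if deriv == 1:
        return np.array([2.0 * p[0], 20.0 * p[1]])
    raise ValueError('Only deriv=0 and deriv=1 are supported.')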
Implementation: Steepest Descent
With this boilerplate code in place, we can concentrate on the important part: how to obtain the next iterate. For steepest descent with a fixed step size, the implementation is very straightforward.
class SimpleSteepestDescent(OptimiserBase):
def next_step(self):
alpha = self.kwargs.get('alpha', 0.01)
grad = self.func(self.p, deriv=1)
return self.p - alpha * grad
def check_convergence(self):
return self._check_convergence_grad()
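With the hypothetical toy quadratic from above (or any callable with the same interface), a minimal usage example could look like this; alpha and grad_tol are the keyword arguments read from kwargs in the code above.
# Minimal usage sketch with the hypothetical quadratic objective from above.
optimiser = SimpleSteepestDescent(quadratic, [2.0, 1.0],
                                  maxiter=500, alpha=0.05, grad_tol=1e-6)
p_opt = optimiser.run()
print(p_opt)   # close to the minimum at (0, 0)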
For steepest descent with a variable step size, we need to incorporate a line search, such as the Armijo line search mentioned in the theory section.
def armijo_line_search(func, p0, vec, maxiter=100,
alpha0=1.0, c=0.5, tau=0.5):
m = np.dot(func(p0, deriv=1), vec)
t = -c * m
alpha = alpha0
for _ in range(0, maxiter):
if func(p0, deriv=0) - func(p0 + alpha * vec, deriv=0) > alpha * t:
break
else:
alpha *= tau
return alpha
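As a quick sanity check, we can call the line search with the hypothetical toy quadratic from above, using the negative gradient as the search direction; it should return a step size that satisfies the Armijo condition.
# Sanity check of the backtracking line search on the toy quadratic.
p0 = np.array([2.0, 1.0])
direction = -quadratic(p0, deriv=1)      # descent direction: negative gradient
alpha = armijo_line_search(quadratic, p0, direction)
print(alpha)   # a step size accepted by the Armijo condition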
Afterwards, we can implement steepest descent with a variable step size with only minor modifications to the fixed-step-size version.
class SteepestDescent(OptimiserBase):
def next_step(self):
alpha0 = self.kwargs.get('alpha0', 1.0)
grad = self.func(self.p, deriv=1)
# alpha0 *= max(1.0, np.linalg.norm(grad))
alpha = armijo_line_search(
self.func, self.p, -grad, alpha0=alpha0,
)
return self.p - alpha * grad
def check_convergence(self):
return self._check_convergence_grad()
We can apply our implementation of the steepest descent algorithm to both of our test functions like this:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
from objective_function import RosenbrockFunction, HimmelblauFunction, \
plot_2d_optimisation
from optimiser import SimpleSteepestDescent, BFGS
norm = LogNorm()
hf = HimmelblauFunction()
xs1 = np.linspace(-6.0, 6.0, 200)
ys1 = np.linspace(-6.0, 6.0, 200)
rf = RosenbrockFunction(args=(1.0, 100.0))
xs2 = np.linspace(-2.0, 2.0, 200)
ys2 = np.linspace(-1.0, 3.0, 200)
fig_sd, axs_sd = plt.subplots(1, 2, figsize=(10, 4))
p0 = [0.0, 0.0]
optimiser = SimpleSteepestDescent(hf, p0, maxiter=200,
alpha=0.01, grad_tol=1e-6)
popt, info = optimiser.run(full_output=True)
axs_sd[0].set_title(f'Converged after {info["niter"]} iterations')
plot_2d_optimisation(axs_sd[0], hf, xs1, ys1, norm=norm, traj=info['p_traj'])
p0 = [0.0, 0.0]
optimiser = SimpleSteepestDescent(rf, p0, maxiter=50000,
alpha=0.001, grad_tol=1e-6)
popt, info = optimiser.run(full_output=True)
axs_sd[1].set_title(f'Converged after {info["niter"]} iterations')
plot_2d_optimisation(axs_sd[1], rf, xs2, ys2, norm=norm, traj=info['p_traj'])
fig_sd.tight_layout()
plt.show()
If you play around with the step size, you may realize that the Rosenbrock
function requires a very small step size to converge, which leads to very
slow convergence. Starting from \(x_0 = (0, 0)\), the optimization
trajectories for both functions are shown in the following figure.