(Quasi-)Newton Methods

The steepest descent method has a guaranteed linear convergence for strongly convex functions. In order to achieve quadratic convergence, we have to extract more information from the function. The most straightforward approach is to use the second derivative, i.e. the Hessian $\mathbf{H}$ of the function. This is called Newton's method in optimisation.

Theory: Newton's Method

Just like its namesake for root finding, which constructs a local linear approximation of the function, Newton's method in optimisation constructs a local quadratic approximation and moves the current point to the minimum of that quadratic model. The corresponding algorithm reads:

  1. Choose a starting point \(x_0\).
  2. Obtain a quadratic approximation at the current point: \(f(x) \approx f_k(x) = f(x_k) + \langle \nabla f(x_k), (x - x_k) \rangle + \frac{1}{2} \langle (x - x_k), \mathbf{H}(x_k)\, (x - x_k) \rangle \).
  3. Update the current point using \(x_{k+1} = x_{k} - \mathbf{H}(x_k)^{-1}\, \nabla f(x_k) \).
  4. Repeat until some convergence criterion is met.
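
A minimal sketch of this iteration, assuming the gradient and the Hessian are available as callables, could look as follows (the names newton_minimise, grad and hess are purely illustrative and not part of the optimiser module used later):

import numpy as np

def newton_minimise(grad, hess, x0, tol=1e-8, maxiter=50):
    """Newton iteration x_{k+1} = x_k - H(x_k)^{-1} grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(maxiter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve H d = -g instead of forming the explicit inverse.
        x = x + np.linalg.solve(hess(x), -g)
    return x

# Example: f(x, y) = (x - 1)^2 + 10 (y + 2)^2 has its minimum at (1, -2).
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)])
hess = lambda x: np.diag([2.0, 20.0])
print(newton_minimise(grad, hess, [0.0, 0.0]))   # [ 1. -2.]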

For a problem with a handful of independent variables, the Hessian can be calculated and inverted without any problem. If we want to optimise a molecule with \(N\) atoms, however, we have about \(3N\) degrees of freedom and the Hessian becomes very expensive to calculate. In this case, we can start with a guess for the initial Hessian and use the information gathered along the iterations to update it. This leads to quasi-Newton methods. One very widespread algorithm of this class is the Broyden-Fletcher-Goldfarb-Shanno algorithm, or the BFGS algorithm.

Theory: BFGS Algorithm

At first, we expand the function at iteration \(k + 1\): $$ f(x) \approx f_{k + 1}(x) = f(x_{k + 1}) + \langle \nabla f(x_{k + 1}), (x - x_{k + 1}) \rangle + \frac{1}{2} \langle (x - x_{k + 1}), \mathbf{A}_{k + 1}\, (x - x_{k + 1}) \rangle $$ where \(\mathbf{A}_{k}\) is the approximate Hessian at the \(k\)-th iteration.

Taking the gradient of the approximant, we get $$ \nabla f(x) \approx \nabla f_{k + 1}(x) = \nabla f(x_{k + 1}) + \mathbf{A}_{k + 1}\, (x - x_{k + 1}) $$

Inserting \(x = x_k\), the condition $$ \nabla f(x_k) = \nabla f(x_{k + 1}) + \mathbf{A}_{k + 1}\, (x_k - x_{k + 1}) $$ approximately holds. After some rearrangement, the quasi-Newton condition imposed on the update of \(\mathbf{A}_{k + 1}\) is $$ \mathbf{A}_{k + 1} s_k = y_k $$ with \(y_k = \nabla f(x_{k + 1}) - \nabla f(x_k)\) and \(s_k = x_{k + 1} - x_k\).
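
For a purely quadratic function, the secant condition is fulfilled exactly by the true Hessian, which is easy to verify numerically (a quick illustrative check with an arbitrary positive definite matrix, not part of the optimiser code):

import numpy as np

rng = np.random.default_rng(0)

# f(x) = 1/2 <x, Q x> + <b, x>  has the gradient  Q x + b  and the Hessian Q.
Q = rng.normal(size=(3, 3))
Q = Q @ Q.T + 3.0 * np.eye(3)            # symmetric positive definite
b = rng.normal(size=3)
grad = lambda x: Q @ x + b

x_k = rng.normal(size=3)
x_k1 = rng.normal(size=3)
s_k = x_k1 - x_k
y_k = grad(x_k1) - grad(x_k)             # = Q (x_{k+1} - x_k)
print(np.allclose(Q @ s_k, y_k))         # True: the Hessian satisfies A s_k = y_k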

To keep the update of the Hessian as simple as possible, we choose to add two rank-1 matrices at each step, i.e. $$ \mathbf{A}_{k + 1} = \mathbf{A}_k + \mathbf{U}_k + \mathbf{V}_k $$

To ensure the Hessian remains symmetric, \(\mathbf{U}_k\) and \(\mathbf{V}_k\) are chosen to be symmetric. The equation can thus be written as $$ \mathbf{A}_{k + 1} = \mathbf{A}_k + \alpha\ u_k \otimes u_k + \beta\ v_k \otimes v_k $$ where \(\otimes\) denotes the dyadic product. By choosing \(u_k = y_k\) and \(v_k = \mathbf{A}_k s_k\), and imposing the secant condition on \(\mathbf{A}_{k + 1}\), we get \(\alpha = \frac{1}{ \langle y_k, s_k \rangle }\) and \(\beta = -\frac{1}{ \langle s_k, \mathbf{A}_k s_k \rangle }\).
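
Writing out the secant condition \(\mathbf{A}_{k + 1} s_k = y_k\) for this ansatz shows where these coefficients come from: $$ \mathbf{A}_{k + 1} s_k = \mathbf{A}_k s_k + \alpha\, y_k \langle y_k, s_k \rangle + \beta\, (\mathbf{A}_k s_k) \langle s_k, \mathbf{A}_k s_k \rangle = y_k $$ Matching the coefficients of \(y_k\) and \(\mathbf{A}_k s_k\) separately gives \(\alpha \langle y_k, s_k \rangle = 1\) and \(1 + \beta \langle s_k, \mathbf{A}_k s_k \rangle = 0\), which are exactly the values stated above.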

Finally, we arrive at a viable algorithm, which can be formulated as:

  1. Choose a starting point \(x_0\) and a starting Hessian \(\mathbf{A}_0\).
  2. Obtain a direction by solving \(\mathbf{A}_k d_k = -\nabla f(x_k)\).
  3. Perform a line search along \(d_k\) to find an appropriate step size \(\alpha_k\).
  4. Set \(s_k = \alpha_k d_k\) and update \(x_{k + 1} = x_k + s_k\).
  5. Calculate \(y_k = \nabla f(x_{k + 1}) - \nabla f(x_{k})\).
  6. Update the Hessian using \(\mathbf{A}_{k + 1} = \mathbf{A}_k + \frac{y_k \otimes y_k}{\langle y_k, s_k \rangle} - \frac{(\mathbf{A}_k s_k) \otimes (s_k \mathbf{A}_k)}{\langle s_k, \mathbf{A}_k s_k \rangle} \).
  7. Repeat until the convergence criterion is met.
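
A sketch of a single iteration in this direct-Hessian form, assuming the gradient is available as a callable and some line search routine is provided (both are placeholders here, not the classes used later), could look like this:

import numpy as np

def bfgs_step_direct(grad, x_k, a_k, line_search):
    """One BFGS iteration that maintains the Hessian approximation A_k."""
    g_k = grad(x_k)
    d_k = np.linalg.solve(a_k, -g_k)       # step 2: solve A_k d_k = -grad f(x_k)
    alpha_k = line_search(x_k, d_k)        # step 3: step size along d_k
    s_k = alpha_k * d_k                    # step 4
    x_k1 = x_k + s_k
    y_k = grad(x_k1) - g_k                 # step 5
    a_s = a_k @ s_k
    a_k1 = a_k \
        + np.outer(y_k, y_k) / np.dot(y_k, s_k) \
        - np.outer(a_s, a_s) / np.dot(s_k, a_s)   # step 6 (A_k is symmetric)
    return x_k1, a_k1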

Although this method works, the determination of \(d_k\) requires the inversion of \(\mathbf{A}_k\), which can be quite expensive for larger Hessians. Fortunately, since the update rule for the Hessian is simple enough, we can invert the whole equation analytically with the help of the Sherman-Morrison-Woodbury formula, which results in $$ \mathbf{B}_{k + 1} = \mathbf{B}_k + \frac{( \langle s_k, y_k\rangle + \langle y_k, \mathbf{B}_k y_k\rangle)(s_k \otimes s_k)}{\langle s_k, y_k \rangle^2} - \frac{(\mathbf{B}_k y_k) \otimes s_k + s_k \otimes (y_k \mathbf{B}_k)}{ \langle s_k, y_k \rangle} $$ with the approximate inverse Hessian \(\mathbf{B}_k\).
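
That the two update formulas are indeed consistent can be checked numerically on random data (again just an illustrative sanity check, not part of the optimiser code):

import numpy as np

rng = np.random.default_rng(1)

# Random symmetric positive definite A_k and its exact inverse B_k.
a_k = rng.normal(size=(4, 4))
a_k = a_k @ a_k.T + 4.0 * np.eye(4)
b_k = np.linalg.inv(a_k)

s_k = rng.normal(size=4)
y_k = rng.normal(size=4)
if np.dot(s_k, y_k) < 0:                  # enforce the curvature condition <s_k, y_k> > 0
    y_k = -y_k

# Direct update of the Hessian approximation ...
a_s = a_k @ s_k
a_k1 = a_k + np.outer(y_k, y_k) / np.dot(y_k, s_k) \
           - np.outer(a_s, a_s) / np.dot(s_k, a_s)
# ... and the Sherman-Morrison-Woodbury update of its inverse.
b_y = b_k @ y_k
b_k1 = b_k + (np.dot(s_k, y_k) + y_k @ b_y) * np.outer(s_k, s_k) / np.dot(s_k, y_k)**2 \
           - (np.outer(b_y, s_k) + np.outer(s_k, b_y)) / np.dot(s_k, y_k)

print(np.allclose(a_k1 @ b_k1, np.eye(4)))   # True: B_{k+1} is the inverse of A_{k+1}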

The corresponding algorithm can be formulated as follows:

  1. Choose a starting point \(x_0\) and a starting inverse Hessian \(\mathbf{B}_0\).
  2. Obtain a direction by calculating \(d_k = -\mathbf{B}_k \nabla f(x_k)\).
  3. Perform a line search along \(d_k\) to find an appropriate step size \(\alpha_k\).
  4. Set \(s_k = \alpha_k d_k\) and update \(x_{k + 1} = x_k + s_k\).
  5. Calculate \(y_k = \nabla f(x_{k + 1}) - \nabla f(x_{k})\).
  6. Update the inverse Hessian using \( \mathbf{B}_{k + 1} = \mathbf{B}_k + \frac{( \langle s_k, y_k\rangle + \langle y_k, \mathbf{B}_k y_k\rangle)(s_k \otimes s_k)}{\langle s_k, y_k \rangle^2} - \frac{(\mathbf{B}_k y_k) \otimes s_k + s_k \otimes (y_k \mathbf{B}_k)}{ \langle s_k, y_k \rangle} \).
  7. Repeat until the convergence criterion is met.

An easy and practical choice of the initial Hessian is the identity matrix, whose inverse is again the identity matrix. With this choice, the very first iteration reduces to a plain steepest descent step.

Implementation: BFGS Algorithm

Even though the formulas for the BFGS algorithm look intimidating, the implementation is actually quite simple. We just have to pay attention not to mistype the rather long formula for the update of the inverse Hessian and make sure to use the correct kind of multiplication (dot, outer or matrix product) in each term.

class BFGS(OptimiserBase):

    def __init__(self, func, p0, maxiter=200, **kwargs):
        super().__init__(func, p0, maxiter, **kwargs)
        self.b_k = np.eye(len(p0))

    def next_step(self):
        alpha0 = self.kwargs.get('alpha0', 1.0)
        # Gradient at the current point, reused for the difference y_k below.
        grad = self.func(self.p, deriv=1)
        # Search direction d_k = -B_k grad f(x_k).
        v_k = -np.dot(self.b_k, grad)
        alpha = armijo_line_search(
            self.func, self.p, v_k, alpha0=alpha0,
        )
        s_k = alpha * v_k
        y_k = self.func(self.p + s_k, deriv=1) - grad
        # BFGS update of the inverse Hessian B_k (note dot vs. outer products).
        self.b_k = self.b_k \
            + (np.dot(s_k, y_k) + np.linalg.multi_dot((y_k, self.b_k, y_k))) \
                * np.outer(s_k, s_k) / np.dot(s_k, y_k)**2 \
            - (np.outer(np.dot(self.b_k, y_k), s_k) \
                + np.outer(s_k, np.dot(y_k, self.b_k))) / np.dot(s_k, y_k)

        return self.p + s_k

    def check_convergence(self):
        return self._check_convergence_grad()

We can apply our implementation of the BFGS algorithm to both of our test functions like this:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
from objective_function import RosenbrockFunction, HimmelblauFunction, \
    plot_2d_optimisation
from optimiser import SimpleSteepestDescent, BFGS
norm = LogNorm()

hf = HimmelblauFunction()
xs1 = np.linspace(-6.0, 6.0, 200)
ys1 = np.linspace(-6.0, 6.0, 200)

rf = RosenbrockFunction(args=(1.0, 100.0))
xs2 = np.linspace(-2.0, 2.0, 200)
ys2 = np.linspace(-1.0, 3.0, 200)
fig_bfgs, axs_bfgs = plt.subplots(1, 2, figsize=(10, 4))

p0 = [0.0, 0.0]
optimiser = BFGS(hf, p0, maxiter=200, alpha0=1.0, grad_tol=1e-6)
popt, info = optimiser.run(full_output=True)
axs_bfgs[0].set_title(f'Converged after {info["niter"]} iterations')
plot_2d_optimisation(axs_bfgs[0], hf, xs1, ys1, norm=norm, traj=info['p_traj'])

p0 = [0.0, 0.0]
optimiser = BFGS(rf, p0, maxiter=5000, alpha0=1.0, grad_tol=1e-6)
popt, info = optimiser.run(full_output=True)
axs_bfgs[1].set_title(f'Converged after {info["niter"]} iterations')
plot_2d_optimisation(axs_bfgs[1], rf, xs2, ys2, norm=norm, traj=info['p_traj'])

fig_bfgs.tight_layout()
plt.show()

You may understand now why the BFGS algorithm is so popular: both functions can be optimised in a handful of iterations. Starting from \(x_0 = (0, 0)\), the optimisation trajectories for both functions are shown in the following figure.