Searching for the minimum? Gradient descent to the rescue!

In one of the previous posts, we talked about the Kaczmarz method for solving linear systems. This time, we are gonna look at three more ways to solve Ax=b iteratively, namely the methods of Gradient Descent, Conjugate Gradient and Newton's method. We will also take a look at the code; all of the methods were implemented in MATLAB.

We will first look at how the methods are applied in unconstrained optimization and then use them to solve linear systems.

What’s the main idea?

The main idea in using these methods is to solve an equivalent optimization problem, that is, an optimization problem that has the same solution as Ax=b. One such problem is, for example, minimizing the quadratic form,

f(x)=\frac{1}{2}x^{T}Ax-b^Tx

Note that, assuming A is symmetric, the gradient of f is \nabla f(x)=Ax-b. By picking the solution of the system as x=A^{-1}b, we then have that

\nabla f(x)=Ax-b=AA^{-1}b-b=0

So, the solution of the linear system is a critical point of f; if A is also positive definite, it is in fact the unique minimizer.
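As a quick sanity check, here is a small MATLAB snippet (the particular A and b are just an example; any symmetric positive definite matrix will do):

A = [4 1; 1 3];       % a symmetric positive definite matrix
b = [1; 2];
x = A \ b;            % solution of the linear system Ax = b
gradf = A*x - b;      % gradient of the quadratic form at x
disp(norm(gradf))     % prints ~0, so x is a critical point of f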

Why bother with optimization?

The next question we will try to tackle is a motivational one. Why would you want to add an extra layer of complexity to your problem by writing it as an optimization problem? Why not solve it via some of the usual iterative methods, like Jacobi or Gauss-Seidel?

  1. Alternative ways of solving a problem can come in handy, depending on the problem. For example, the convergence of the Jacobi and Gauss-Seidel methods depends heavily on the spectral radius of their iteration matrices. A different approach might work better in cases where those methods struggle.
  2. Because it's instructive to use a tool for multiple jobs and …
  3. Because optimization methods are cool!  😀

We move on to the …

General form of the methods

Let’s suppose that we have a cost function f and we want to minimize it. The methods that we are gonna talk about in this post can all be written as

x_{k+1}=x_{k}+a_k\cdot p_k

And they are all called "descent methods", because at each step we move to a new point where our function has a lower value than before. Here, p_k is called the descent direction and it tells us where to look for smaller values of f. After we pick a direction, we also need to decide how far to move in that direction, i.e., pick the step size a_k.
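In code, all of these methods share the same loop; only the rules for picking p_k and a_k differ. A minimal runnable sketch (using, as a stand-in, steepest descent with a constant step on f(x)=x_1^2+x_2^2):

% Generic descent loop; only these two rules change between methods.
gradf = @(x) [2*x(1); 2*x(2)];        % gradient of f(x) = x1^2 + x2^2
chooseDirection = @(x) -gradf(x);     % steepest descent direction
chooseStepsize  = @(x, p) 0.1;        % constant stepsize, for simplicity
x = [1; 1];
for k = 1 : 100
    p = chooseDirection(x);
    a = chooseStepsize(x, p);
    x = x + a * p;
end
disp(x)    % approaches the minimizer (0, 0)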

Our first method : Gradient descent

The easiest way to build a descent method is to think of what p_k should be at each step, say for a smooth, convex function f:\mathbb{R}\to[0,\infty). Now, what's the easiest choice for p_k?

We would like to move in the direction opposite to the one in which f is increasing. How can we decide which direction that is? Since f is differentiable, we can just pick at every step a p_k such that p_k\cdot f'(x_k)<0 or, more generally, p_k\cdot \nabla f(x_k)<0. Remember that the gradient of a function points in the direction in which the function increases fastest. The same holds in 1D: f'(x_k) is positive if f is increasing to the right of the point x_k and negative if it is increasing to the left. Thus, we will move in the direction we want by going opposite to the gradient of f, that is, by picking p_k=-\nabla f(x_k).
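For a concrete 1D example, take f(x)=x^2 and x_k=3. Then f'(3)=6>0, so f is increasing to the right of x_k, and the choice p_k=-f'(x_k)=-6 satisfies p_k\cdot f'(x_k)=-36<0 and indeed points downhill, towards the minimum at 0.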

What about the stepsize?

We still need to choose our stepsize, i.e., how much we are going to move in that particular direction. The idea is that, given the current point x_k and the new direction p_k, we set h(a)=f(x_k+a\cdot p_k) and look for an a_k such that

h(a_k)=\min_{a>0}h(a)

This corresponds to the idea that we want to minimize our function along the line that passes through x_k in the direction p_k. If this one-dimensional problem is easy to solve, it can be done exactly at every step. Often that's expensive (and unnecessary, since we don't really need the exact minimizer), so backtracking is used instead.
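As an illustration, here is a minimal sketch of backtracking with the Armijo (sufficient decrease) condition; the function name and the constants are typical textbook choices, not the exact scheme used in the experiments below:

function a = backtracking(f, gradfx, xk, pk)
% BACKTRACKING LINE SEARCH (Armijo condition)
% f      : objective, as a function handle on column vectors
% gradfx : gradient of f at xk (column vector)
% xk     : current point, pk : descent direction
a = 1;        % initial stepsize
c = 1e-4;     % sufficient decrease parameter
rho = 0.5;    % shrinking factor
% Shrink a until f decreases enough along pk.
% Terminates as long as pk is a descent direction.
while f(xk + a*pk) > f(xk) + c*a*(gradfx'*pk)
    a = rho * a;
end
end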

A complete convergence analysis for both cases (exact and inexact stepsize) can be found in Boyd and Vandenberghe's book, Convex Optimization, pp. 466-469.

Any problems?

Gradient descent is very sensitive to the conditioning of the Hessian of f near the minimizer we are trying to find. If the Hessian has a large condition number, the method will stall and generally take a large number of iterations to converge. A more intuitive explanation is this: the gradient descent method always proposes a very specific path downhill. That path may not be the best (curvature-wise) way to get to the minimum.
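To make this quantitative: for a quadratic cost with a Hessian of condition number \kappa, a classical result says that steepest descent with exact line search satisfies

f(x_{k+1})-f(x^{*})\le\left(\frac{\kappa-1}{\kappa+1}\right)^{2}\left(f(x_k)-f(x^{*})\right)

so when \kappa is large the contraction factor is close to 1 and each step makes very little progress.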

Let us for example use the method on the Rosenbrock function,

f(x,y)=(1-x)^2+100(y-x^2)^2

The minimum of the function is at the bottom of the banana-shaped valley, at (1,1), where the function is equal to 0.
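For reference, here are the function and its gradient as MATLAB handles (the names rosen and rosengrad are mine):

rosen     = @(x,y) (1-x).^2 + 100*(y - x.^2).^2;
rosengrad = @(x,y) [ -2*(1-x) - 400*x.*(y - x.^2) ;
                      200*(y - x.^2) ];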

[Figure: the Rosenbrock function ~ pic from Wikipedia]

This function is often used as a benchmark for optimization methods. Let's see how the error drops for different initial points.

[Figure: error per iteration of gradient descent for different starting points]

We can see here that the method stalls for the first two points and even for the point (0.9, 0.9), which is next to the point we are searching for. In this case we used a constant stepsize of 10^{-3}. Let’s see what happens if we use a smaller stepsize, a=10^{-4}.

[Figure: error per iteration for the same starting points, with stepsize a = 10^{-4}]

Since we now explore more smoothly, i.e., in smaller steps, we can see more of the slowdown.

As this is the simplest of methods, the implementation is also very simple:

function [xk] = gdsFun(gradf, x0, NumberOfIter)
% GRADIENT DESCENT
% gradf : gradient of f, passed as a function handle
% This function implements the gradient descent method for
% finding the minimum of a function f. In this version,
% there is no stopping criterion. Instead, the number of
% iterations, NumberOfIter, must be given, along with a
% starting point, x0.
xk = zeros(2, NumberOfIter);    % store the whole iterate history
xk(:,1) = x0;
a = 0.001;                      % constant stepsize
for i = 1 : (NumberOfIter - 1)
    p = gradf(xk(1,i), xk(2,i));    % gradient at the current point
    xk(:,i+1) = xk(:,i) - a * p;    % step opposite to the gradient
end
end
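As a usage sketch with the rosengrad handle defined above (the starting point and iteration count are arbitrary):

x0 = [-1; 1];
xk = gdsFun(rosengrad, x0, 50000);
xk(:, end)    % slowly approaches the minimizer (1, 1)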

In the next post, we will go a little deeper into the convergence results for Gradient Descent and introduce the Conjugate Gradient method.
