Finding the minimum : Step 1, apply heat. Step 2, wait for the function to cool off. Step 3, enjoy.

Hi everyone!

Today I will show you a cool method to find the minimum (or maximum) of a function f:A\to \mathbb{R}, where A could be a subset of the reals or of the complex numbers.

So, let’s say that you want to find the minimum of

f(x)= -x\cdot sin(x)\cdot cos(2x)^2,\ x\in[-10, 10]

This function has a complex landscape.

and of course it’s easy to see the where the global minimum is and that there are many local minima.

To find the exact value of  x for which f gets its minimum, you could use Fermat’s theorem and solve the equation f'(x)=0. Which means, get ready for it, here it comes,


and that’s the derivative. Now you could try to solve that (good luck!) or perhaps do some number crunching using Newton’s method or another optimization algorithm (like steepest descent).

What we are gonna do is to try to explore the landscape of the function using an algorithm called simulated annealing.

Annealing is a process used in metallurgy to further refine the properties of a metal. The idea is that if you have, for example, silver, it’s atoms might not sit in the optimal positions (regarding a certain property, for example ductility of the metal). By heating it up, the extra energy causes the atoms to jump around and then by cooling it, you allow them to sit in better positions, thus improving that property.

The idea is quite similar in the optimization algorithm of simulated annealing. Let’s describe one step of it. We provide the algorithm with a starting point x_{0} and a function f. Then we define a number, which can be either variable during the execution or a constant, the search radius R>0. By providing R, the algorithm knows that it should search for the minimum/maximum in [x_{0}-R,x_{0}+R]. If we would like to also limit ourselves to a particular interval [a,b] then we could write [max(a,x_{0}-R), min(b,x_{0}+R)]=[c,d].

Here is the interesting part. We also define a temperature T, which in general is a decreasing function of time and in this example will be T(n) = 1/log(n). Then, for the n-th iteration, we pick two values u_{1},u_{2}\sim U[c,d]  and calculate


Now we just have to check the following acceptance criterion, that P> u_{2}, and if it’s true, then repeat the whole thing with x =u_{1} . If it’s not,  repeat the whole thing with the same x (so that iteration was wasted).

OK, so too many things happened too fast. Let’s back up and explain a bit. Even better, let’s simplify things so that we can understand what’s happening under the hood. We will try a few iterations of the algorithm by using f:[0,1]\to \mathbb{R}, f(x)=x. and x_{0}=1/2. Let’s also define R=1.

For n = 1, we pick two values from the uniform distribution in [0,1]. Let them be 0.3 and 0.13.

We calculate  P now. So P=min(e^{(1/2 - 0.3)\cdot log(1)},1)=1 . That means that we will move from our place to x=0.3 cause P>0.13. Note that for the first step we would move from our position, no matter the value of u_{2}.

For n = 2, let u_{1}=0.46, u_{2}=0.91 Then  P=min(e^{(0.3 - 0.46)\cdot log(2)},1)=0.895 . That means that we don’t accept this move and x doesn’t change.

Perhaps you are beginning to see what’s happening. If the new move makes the value of the function smaller, then we accept it with probability P and if it doesn’t, well, again we might accept it with a probability 1-P.  The reason for that is to let our algorithm “explore” the domain. Remember that in each step, our algorithm only sees the [c,d] interval, which again changes at the beginning of every step.

And what about the temperature T? Well, that function puts an extra idea into P. When T is small, then the probability that we will move is mainly affected by the values of f, thus allowing both good and bad steps (a bad step is one in which we move to a larger rather than a smaller value). As the algorithm moves forward, the temperature makes sure that we will have more and more good steps and less of the bad ones.

Some people prefer to see the actual program in order to understand what’s happening, so here it is for Matlab. Of course, this program is not written in the most efficient way and I don’t claim that it is free of errors but it works fine for illustration purposes. You may use and modify it as you like.

function simAnneal(f, x, lower, upper)
% f : the function to be minimized % x : the starting point % lower : the lower limit of your interval % upper : the upper limit of your interval % for example, for finding the minimum of f(x) = x, starting from 1/2 and searching in [-1,1] you would write %simAnneal(@(x)x,1/2,-1,1) close all;
%Search radius
r = 10;
T =@(t) 1/log(t);
hold on;
for i=1:100
 xprev = x;

 %pick a, b inside the radius
 a = max(min([x-r,1]),lower);
 b = min(max([x+r,1]),upper);

 % Get a sample from U[a,b]
 u = rand()*(b-a)+a;

 p = min([exp((h(x)-h(u))/T(i)),1]);
 if p>rand();
 x = u;
 % Show what's happening
 fprintf('h(x) = %f\n',h(x));
% Point to the maxima/minima
plot([x,x],[f(x) + 1, f(x) + 1/10 ],'g',...
hold off;

So all in all, that’s all. This algorithm will not always be able to find the global minimum/maximum and will probably waste a lot of it’s samples before it makes a step. So why should anyone use it?

First of all, it’s fun to see how this method searches around for the minimum. Then, we didn’t make assumption whatsoever on smoothness. We will see in the examples below that this probabilistic view doesn’t require it.

Perhaps the most important reason comes from problems where the space is just too large to search with a deterministic approach. In those problems, even getting a not-so-optimal value for the minimum might be very good. An example of those problems is the traveling salesman.

Time for graphs!

So, here is how this algorithm works for different functions. Remember that this function uses a random number generator and thus you can’t expect it to run in the same way each time. In each case, we start from 0.

Enjoy. The green arrow points to the last place visited.

f(x)= -x\cdot sin(x)\cdot cos(2x)^2,\ x\in[-10, 10]

Notice how the method jumped from a place close to the one minimum to the other. That’s because R is a constant. By making it a variable number, decreasing with the number of iterations, we could “trap” our search in places where it’s highly likely that we can find a minimum.

Now, for a more challenging example.


Oops. That didn’t go too good. What happened? Well, the problem is that we made a lot of bad choices and thus, when we ran out of iterations to do, we were pretty far away from the minimum. Let’s try something new! Let’s change the temperature to T(n)=1/log(n)^2 and see what happens.

Now that’s cool! From this graph we can immediately deduce two things. First, changing the temperature to a function that decreases faster assures that we only do “logical” steps. Second, that we threw away a lot of our samples. Check out how many dot’s we had in the previous graph. We used only 19 out of 100 samples.  Of course we could always add a part to our code so that it would check if we are at a good spot and then stop.

That’s all for this post.

Until next time.

*Oh and I am sure that google can provide better references than me for further reading on simulated annealing.