The Travelling Santa Problem.
I spent the past couple of days on a problem called the “Travelling Santa Problem”, a variant of the NP complete “Travelling Salesman Problem”. There is an ongoing competition on Kaggle, a site which hosts various machine learning and optimization contests. In the original TSP problem you need to find a tour of a graph, visiting every vertex exactly once, minimizing the total cost. Since there can be an arbitrarily large (exponential in number of vertices) number of such paths (Hamiltonian cycles), no polynomial time exact algorithm is known for this problem. Applications of TSP are in the areas of scheduling, logistics and implementation of electronic circuitry. Kaggle’s version of the problem gives you a set of 150000 points on a two dimensional plane, and you need to find two paths, that visit all the points and have no edges that are common to both the paths, ie, disjoint paths. The aim of the contest is to find the minimum distance and your score is the distance on the longer path.
If you visualize the set of points, you get a picture of Santa.
They provided a benchmark of random paths to get started with, that is two completely random sequences of points. It had a score of about 1.3 billion, and at that time the leader had a score of 6.5 million. I decided to start with a very simple heuristic, which was to sort the points according to their x coordinate and connect the path from left to right. With this I could find a single path, but what about the second path? The disjoint path condition makes this a little trickier. But given the first path, it should be easy to check if adding a node to the second path will be valid. Since every node would have a maximum of two connections, we just need to store the ids of the previous and next nodes for each node in the first sequence. So using this, I sorted the points according to the y coordinate and started building the second path from top to bottom, if there was an edge that would be repeated from the first path, I would simply select a random vertex and make a connection. So now that I had two disjoint paths, I decided to make a submission, the score turned out to be 400 million. Much better than random, but way behind the leaders and quite inefficient. If you come to think of it, this approach would create a very zig-zag path on random data. The disjointness condition forces you to add random edges to the second path, which take the distance up even further. So time for a different strategy.
So the next heuristic I tried was the nearest neighbour approach. If you connect each node to its nearest neighbour, you should get a shorter path and this is quite intuitive. But the problem here is, it would have a complexity of O(n^2) and there are 150,000 points, go figure. A k-d tree can find the nearest neighbour in O(n lg n) time, but its implementation is non trivial and it would be pretty much useless for generating a second disjoint path, we would still have to rely on random links. Time for an approximation of nearest neighbours: its highly likely that the nearest neighbour would lie on one of the points closest by x or y coordinates. This can be found by simply sorting the points and considering the k nearest such points and finding the nearest neighbour (we can also check for the disjointness condition, while doing so). Doing this on the nearest 100 points for each node, got me a solution with a score of 25 million. I decided to increase the number of points, and at around 4000 points, I could get a solution of around 8.5 million and the code runs in a minute. Within 30% of the leaders, not bad, huh?. So now, how can this be improved further.
I opened CLRS to look for ideas, there was an approximation scheme called the 2-approximation scheme which is based on finding vertices on a walk of a minimum spanning tree. CLRS has an elegant proof of how it is less than twice the minimum possible value, and in practice much better than that. But the problem is in creating the minimum spanning tree. Using Prim’s algorithm would have a complexity of O(n^2 lg n), which is not feasible. It turns out that an MST for a Euclidean plane can be created in O(n lg n) time using a Voronoi diagram or a Delaunuay triangulation. As I had no background in computational geometry, I decided not to go down that path. Creating an approximate MST taking 40 nearest by x or y coordinates yielded a poor result.
An optimization technique, that was mentioned a lot in the forums was 2-opt. It is a local optimization technique, that involves taking two adjacent edges, and replacing them with a different combination that reduces the distance. If you have a sequence of points ABCD, you replace it with ACBD if it has a lower cost. I tried this technique, by randomly picking consecutive edges and swapping vertices if they had a lower cost. Even after 1 billion iterations and 45 minutes, it only yielded an improvement of 0.1%. It was getting stuck at a local minimum. Then I ran the optimization in a sequential order of all vertices in the path, instead of random choice, it converged in 5-6 iterations and yielded a 2-3% improvement.
There is a randomized algorithm known as simulated annealing based on the metallurgical principle of annealing, a metal is heated to a high temperature and allowed to cool slowly to remove crystal imperfections. Heating allows a different configuration to be reached, and on cooling the imperfection is removed. Here, perturbations of the sequence are created using 2-opt and during the high temperature phase, changes are accepted even if it increases the total cost. In the cooling phase the acceptance rule is tightened until a point is reached where only those moves that reduce the cost are accepted. This helps in finding a global optimum. Unfortunately, this algorithm didn’t improve the solution by much even after running for an hour.
I could get one path at around 7.1 million and the second at 8.25 million under 5 minutes of running time using sequential 2-opt. So the difference in the two paths could be the main culprit. The Lin - Kernighan heuristic is considered to be one of the best heuristics for TSP and it is a generalization of 2-opt. But my motivation to read more about it diminished after seeing some posts in the competition forum, that some of the leaders ran their programs for days, and worse, even weeks to get their results. In my opinion running a program for days to achieve a very marginal improvement is totally not justified. This was really disturbing and I decided to quit the competition.
What I learned from this competition was that NP complete problems could be approximated reasonably within a short amount of time, and the last mile takes a humongous effort, and whether its worth it is questionable. I also got to know about some randomized algorithms and Monte Carlo simulations. You can find my C++ code over here.