Try to kill anything called hope in you. You never want to put yourself in a situation where you wish very badly for something *specific* to happen to you, an event that would tomorrow suddenly make a big difference for the rest of your life.
—(Say, the outcome of a job interview or a lawsuit, a battle, winning the lotto, meeting a significant other, getting a paper accepted, have gold rise, have your employer go public, or other events that can markedly change your life.) If one event, a single event makes a difference, *and* it doesn’t depend on you, then you are megafragile, a prisoner of circumstances.
—This is not to say that no good event can make a difference to you, rather that you do not *need* to dream about it. For the alternative should be acceptable enough for you. The outcome of the battle is not what matters, but how *you* fight; for that you do not need to hope and dream like a loser.
—Likewise if you end up losing a lawsuit or something of the sort, show off how honorable you are in handling defeat.

So, simply, get organized in a way to not have to dream tomorrow or the day after. It takes a while, a lot of work, and the certainty that the counterfactual is equally good for you.

Nassim Nicholas Taleb
Make mistakes of ambition, not mistakes of sloth
Niccolo Machiavelli

Frameworks, features and architecture astronauts

You don’t need to use every feature of your framework just because it is there. Some frameworks, like Backbone.js, are widely used, but their documentation is terse and assumes a lot, so it takes time to wrap your head around them. What people end up doing is reading large tutorials or books on the topic. These “books” completely miss the point: the goal is to solve the problem at hand, not to obsess over the framework you are using. For everything there is supposedly “one correct way” that uses the features the framework provides. Things are usually opinionated, even if the developers claim otherwise. For most of these things you can come up with simpler and much more flexible solutions yourself, instead of going through the process of adapting your idea to the framework.

I am not saying you must never use frameworks; they serve a few purposes. But only use them for the things you really need, and for everything else stick to something simple. It’s not worth the additional effort, no matter what the “architecture astronauts” tell you. For example, I like Backbone’s idea of event dispatchers attached to views. It makes for a neatly organised structure for coding views and makes user interface work less painful. But I wouldn’t go anywhere near Backbone’s models and collections; plain JSON and AJAX work well for me there, and I don’t see the need to add a layer of incidental complexity.

The Trouble with SVMs

People tend to use complicated machine learning models like support vector machines just because they are believed to give the best accuracy. The fact that SVMs are available as black-box implementations exacerbates this problem. I would suggest focusing on your data instead of on slow techniques that you only partially understand. I achieved better results with a simple Naive Bayes classifier by choosing the right features. The best part is that it takes only about as much time as it takes to read through the documents, unlike an SVM.

I had the task of classifying text documents (movie reviews) by sentiment polarity for a college project. I had a dataset of 50,000 movie reviews: 25,000 positive and 25,000 negative. After dividing the data into training and test sets, the first thing I tried was a Naive Bayes classifier based on the word frequencies of each term in the text, with add-one (Laplace) smoothing to handle words that did not appear in the training set. With this, the accuracy was around 69%, which is bad: only marginally better than random guessing. I then added a mechanism to transfer words to the opposite class when they are negated; that is, if a word like “not” is followed by some other word, the other word is counted in the opposite class. This, along with multinomial Naive Bayes with binary counts (counting a word only once per document), helped increase the accuracy to 82.4%.
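To make this concrete, here is a minimal sketch of such a classifier, not the actual project code: add-one (Laplace) smoothing, each word counted at most once per document, and negated words credited to the opposite class. The negation word list and the class names are my own assumptions.

```python
import math
from collections import defaultdict

NEGATIONS = {"not", "no", "never"}          # assumption: a small hand-picked list
OPPOSITE = {"pos": "neg", "neg": "pos"}

def negated_tokens(text):
    """Yield (word, negated) pairs; a word is negated if the previous token was a negation."""
    negate, seen = False, set()
    for w in text.lower().split():
        if (w, negate) not in seen:         # count each word at most once per document
            seen.add((w, negate))
            yield w, negate
        negate = w in NEGATIONS

class NaiveBayes:
    def __init__(self):
        self.counts = {"pos": defaultdict(int), "neg": defaultdict(int)}
        self.docs = {"pos": 0, "neg": 0}
        self.vocab = set()

    def train(self, text, label):
        self.docs[label] += 1
        for w, negated in negated_tokens(text):
            # a negated word is credited to the opposite class
            target = OPPOSITE[label] if negated else label
            self.counts[target][w] += 1
            self.vocab.add(w)

    def classify(self, text):
        total = sum(self.docs.values())
        best, best_score = None, float("-inf")
        for label in ("pos", "neg"):
            score = math.log(self.docs[label] / total)          # log prior
            denom = sum(self.counts[label].values()) + len(self.vocab)
            denom_opp = sum(self.counts[OPPOSITE[label]].values()) + len(self.vocab)
            for w, negated in negated_tokens(text):
                src = OPPOSITE[label] if negated else label
                d = denom_opp if negated else denom
                # add-one (Laplace) smoothing handles words unseen in training
                score += math.log((self.counts[src][w] + 1) / d)
            if score > best_score:
                best, best_score = label, score
        return best
```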

Then, against my better judgement, I decided to try an SVM without completely understanding how it works. I used the TinySVM library; training took over 45 minutes (vs 1 minute for NB) and resulted in a measly improvement of 0.3%. Not convinced by the black-box approach, I decided to learn more about SVMs. The first place I went was Andrew Ng’s lectures on YouTube. They helped me gain a general intuition about how an SVM works, but the treatment of the math was hand-wavy and there was nothing about the actual algorithm you would implement. The papers on SVMs and SMO (Sequential Minimal Optimization) were too dense, and the math was way above my level since I knew nothing about convex optimization. I spent part of the next month watching Stephen Boyd’s convex optimization lectures. I now understood Lagrange multipliers and KKT conditions, but the task of implementing an SVM was still too daunting; I realized it might take a few months to write one from scratch. What a waste of time!

Next, I thought of removing the features that were only contributing noise to the classifier. Stop-word removal came to mind, but there are research papers suggesting that it actually removes information relevant to sentiment. Mutual information can be used quite effectively for feature selection: it measures how much knowing a feature tells you about the class label. Since it is based on probabilities, it suited my Naive Bayes model very well. Using it, I reduced the number of features from over 300,000 words to 6,000; I chose this number by plotting accuracy against the number of features. The accuracy shot up to 85.2% and the training time was still under a minute.
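Here is a rough sketch of how such a selection could look: it computes the mutual information between each word’s presence and the class label from per-class document frequencies. The data layout and function names are illustrative, not the project’s actual code.

```python
import math
from collections import defaultdict

def top_features_by_mi(docs, k):
    """
    docs: list of (set_of_words, label) pairs, label in {"pos", "neg"}.
    Returns the k words with the highest mutual information with the class label.
    """
    n = len(docs)
    df = defaultdict(lambda: {"pos": 0, "neg": 0})   # document frequency per class
    class_count = {"pos": 0, "neg": 0}
    for words, label in docs:
        class_count[label] += 1
        for w in words:
            df[w][label] += 1

    def mi(word):
        score = 0.0
        for label in ("pos", "neg"):
            for present in (True, False):
                n_joint = df[word][label] if present else class_count[label] - df[word][label]
                p_joint = n_joint / n
                p_word = (sum(df[word].values()) if present
                          else n - sum(df[word].values())) / n
                p_class = class_count[label] / n
                if p_joint > 0:                      # zero terms contribute nothing
                    score += p_joint * math.log(p_joint / (p_word * p_class))
        return score

    return sorted(df, key=mi, reverse=True)[:k]
```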

The fact that I could throw away irrelevant features meant I could initially define a very large number of features and then prune the irrelevant ones. So I added bigrams and trigrams as additional features, which grew the feature space to 11 million. By selecting the top 32,000 features by mutual information, I was able to get an accuracy of 88.80%. A paper by Andrew Maas of Stanford achieved 88.89% on the same dataset using complicated vector space and probability models; Naive Bayes comes within 0.1% of that.
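For illustration, a tiny helper along these lines (not the original code) that generates unigram, bigram and trigram features from a tokenized review:

```python
def ngram_features(words, n_max=3):
    """Unigrams, bigrams and trigrams as features, joined with underscores."""
    feats = set()
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            feats.add("_".join(words[i:i + n]))
    return feats
```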

The lesson from this is that you can achieve a lot using a very simple algorithm, if you focus on selecting the right features for your data. No need to resort to complex black box models.

The Travelling Santa Problem.

I spent the past couple of days on a problem called the “Travelling Santa Problem”, a variant of the NP-complete “Travelling Salesman Problem”. There is an ongoing competition on Kaggle, a site which hosts various machine learning and optimization contests. In the original TSP you need to find a tour of a graph, visiting every vertex exactly once, that minimizes the total cost. The number of such tours (Hamiltonian cycles) grows exponentially with the number of vertices, and no polynomial-time exact algorithm is known for the problem. Applications of TSP include scheduling, logistics and the layout of electronic circuitry. Kaggle’s version gives you a set of 150,000 points on a two-dimensional plane, and you need to find two paths that each visit all the points and have no edges in common, i.e., edge-disjoint paths. Your score is the length of the longer of the two paths, and the aim of the contest is to minimize it.

If you visualize the set of points, you get a picture of Santa.


They provided a benchmark of random paths to get started with, that is, two completely random sequences of points. It had a score of about 1.3 billion, and at that time the leader’s score was 6.5 million. I decided to start with a very simple heuristic: sort the points by their x coordinate and connect them from left to right. With this I could find a single path, but what about the second one? The disjoint-path condition makes that a little trickier. But given the first path, it is easy to check whether adding an edge to the second path is valid: since every node has at most two connections, we just need to store, for each node, the ids of the previous and next nodes in the first sequence. Using this, I sorted the points by y coordinate and built the second path from top to bottom; whenever an edge would repeat one from the first path, I simply connected to a random vertex instead. Now that I had two disjoint paths, I made a submission, and the score turned out to be 400 million. Much better than random, but way behind the leaders and quite inefficient. If you think about it, this approach creates a very zig-zag path on random data, and the disjointness condition forces you to add random edges to the second path, which pushes the distance up even further. So, time for a different strategy.
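A minimal sketch of that bookkeeping (illustrative, not my original code): remember each node’s neighbours in the first path, and only allow an edge in the second path if the first path doesn’t already use it.

```python
def build_neighbors(path):
    """For each node in the first path, remember the nodes it is connected to."""
    neighbors = {node: set() for node in path}
    for a, b in zip(path, path[1:]):
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors

def edge_allowed(neighbors, a, b):
    """Edge (a, b) may be used in the second path only if the first path doesn't contain it."""
    return b not in neighbors[a]
```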

The next heuristic I tried was the nearest-neighbour approach: if you connect each node to its nearest unvisited neighbour, you should get a shorter path, which is quite intuitive. The problem is that the naive version has a complexity of O(n^2), and there are 150,000 points, go figure. A k-d tree can bring this down to about O(n lg n), but its implementation is non-trivial, and it would be pretty much useless for generating the second disjoint path, where we would still have to rely on random links. So, time for an approximation of nearest neighbours: it is highly likely that the nearest neighbour lies among the points closest by x or by y coordinate. Those candidates can be found by simply sorting the points by each coordinate and, for each node, taking the closest of its k nearest such points (we can also check the disjointness condition while doing so). Doing this with the nearest 100 candidates for each node got me a solution with a score of 25 million. I kept increasing the number of candidates, and at around 4,000 per node I got a solution of around 8.5 million, with the code running in a minute. Within 30% of the leaders, not bad, huh? So now, how can this be improved further?
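Here is a rough sketch of the idea, assuming the points are given as an id -> (x, y) mapping; the parameter names and the fallback to a full scan are my own choices, not the code I actually submitted.

```python
import math

def approx_nn_path(points, k=100):
    """Greedy nearest-neighbour tour using only candidates that are close in x or y coordinate."""
    by_x = sorted(points, key=lambda i: points[i][0])
    by_y = sorted(points, key=lambda i: points[i][1])
    rank_x = {p: r for r, p in enumerate(by_x)}
    rank_y = {p: r for r, p in enumerate(by_y)}

    def candidates(p):
        rx, ry = rank_x[p], rank_y[p]
        cands = by_x[max(0, rx - k): rx + k + 1] + by_y[max(0, ry - k): ry + k + 1]
        return [q for q in cands if q != p]

    def dist(a, b):
        (x1, y1), (x2, y2) = points[a], points[b]
        return math.hypot(x1 - x2, y1 - y2)

    unvisited = set(points)
    current = next(iter(points))
    unvisited.remove(current)
    path = [current]
    while unvisited:
        options = [q for q in candidates(current) if q in unvisited]
        # fall back to a full scan if no nearby candidate is still unvisited
        nxt = (min(options, key=lambda q: dist(current, q)) if options
               else min(unvisited, key=lambda q: dist(current, q)))
        unvisited.remove(nxt)
        path.append(nxt)
        current = nxt
    return path
```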

I opened CLRS to look for ideas. There is a 2-approximation algorithm based on visiting the vertices in the order of a preorder walk of a minimum spanning tree; CLRS has an elegant proof that the resulting tour is at most twice the optimal length, and in practice it does much better than that. The problem is in creating the minimum spanning tree: using Prim’s algorithm on the complete graph would cost O(n^2 lg n), which is not feasible. It turns out that an MST of points in the Euclidean plane can be built in O(n lg n) time using a Voronoi diagram or a Delaunay triangulation, but as I had no background in computational geometry, I decided not to go down that path. Creating an approximate MST from the 40 nearest points by x or y coordinate yielded a poor result.

An optimization technique that was mentioned a lot in the forums was 2-opt. It is a local optimization technique that takes two adjacent edges and replaces them with a different combination that reduces the distance: if you have a sequence of points A B C D, you replace it with A C B D if that has a lower cost. I tried this by randomly picking consecutive edges and swapping the vertices whenever the swap had a lower cost. Even after 1 billion iterations and 45 minutes, it yielded an improvement of only 0.1%; it was getting stuck in a local minimum. Then I ran the optimization over all vertices of the path in sequential order instead of choosing them randomly; it converged in 5-6 iterations and yielded a 2-3% improvement.
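A sketch of one sequential sweep of this move (illustrative only): for every window A B C D along the path, swap B and C whenever that shortens the tour, and repeat the sweep until no swap helps.

```python
import math

def dist(points, a, b):
    (x1, y1), (x2, y2) = points[a], points[b]
    return math.hypot(x1 - x2, y1 - y2)

def two_opt_pass(path, points):
    """One sequential sweep: for every window A B C D, swap B and C if that shortens the tour."""
    improved = False
    for i in range(len(path) - 3):
        a, b, c, d = path[i:i + 4]
        before = dist(points, a, b) + dist(points, b, c) + dist(points, c, d)
        after = dist(points, a, c) + dist(points, c, b) + dist(points, b, d)
        if after < before:
            path[i + 1], path[i + 2] = c, b
            improved = True
    return improved

# repeat sweeps until no further improvement:
# while two_opt_pass(path, points):
#     pass
```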

There is a randomized algorithm known as simulated annealing, based on the metallurgical process of annealing, in which a metal is heated to a high temperature and allowed to cool slowly to remove crystal imperfections: heating lets the material reach a different configuration, and on cooling the imperfections are removed. Here, perturbations of the sequence are created using 2-opt, and during the high-temperature phase changes are accepted even if they increase the total cost. In the cooling phase the acceptance rule is tightened until only moves that reduce the cost are accepted. This helps in escaping local minima on the way to a global optimum. Unfortunately, this algorithm didn’t improve my solution by much even after running for an hour.
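For reference, a bare-bones version of the acceptance rule might look like the following; the temperatures and cooling rate are placeholder values, not the ones I used, and a real run would attempt many moves per temperature step.

```python
import math
import random

def dist(points, a, b):
    # same helper as in the 2-opt sketch above
    (x1, y1), (x2, y2) = points[a], points[b]
    return math.hypot(x1 - x2, y1 - y2)

def anneal(path, points, t_start=100.0, t_end=0.1, cooling=0.999):
    """Simulated annealing over the A B C D -> A C B D moves used by 2-opt (a rough sketch)."""
    t = t_start
    while t > t_end:
        i = random.randint(0, len(path) - 4)
        a, b, c, d = path[i:i + 4]
        delta = (dist(points, a, c) + dist(points, b, d)) - \
                (dist(points, a, b) + dist(points, c, d))
        # always accept improving moves; accept worsening moves with probability exp(-delta / t)
        if delta < 0 or random.random() < math.exp(-delta / t):
            path[i + 1], path[i + 2] = c, b
        t *= cooling            # the cooling schedule tightens the acceptance rule over time
    return path
```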

Using sequential 2-opt, I could get one path to around 7.1 million and the second to 8.25 million in under 5 minutes of running time, so the difference between the two paths could be the main culprit. The Lin-Kernighan heuristic is considered one of the best heuristics for TSP and is a generalization of 2-opt, but my motivation to read more about it diminished after seeing posts in the competition forum saying that some of the leaders ran their programs for days, and worse, even weeks, to get their results. In my opinion, running a program for days to achieve a very marginal improvement is not justified. This was really off-putting, and I decided to quit the competition.

What I learned from this competition is that NP-complete problems can be approximated reasonably well within a short amount of time, but the last mile takes a humongous effort, and whether it’s worth it is questionable. I also got to know about some randomized algorithms and Monte Carlo methods. You can find my C++ code over here.

Roll your own autocomplete solution using Tries.

You might have come across many websites with autocomplete suggestions, most notably Google.

Google autosuggest

Adding such an option to your site or application might seem daunting, but there is a very simple recursive data structure that solves the problem. There is a ton of literature on the net about doing this with black-box packages like Lucene, Solr, Sphinx, Redis, etc., but all of these require a lot of configuration and you also lose flexibility. Tries can be implemented in a few lines of code in any language of your choice.

Trie

A trie is basically a tree, with each node representing a letter as illustrated in the figure above. Words are paths along this tree and the root node has no characters associated with it.

  1. The value of each node is the path, i.e. the character sequence leading up to it.
  2. The children of each node are ideally represented using a hash table mapping the next character to the child nodes.
  3. It’s also useful to set a flag at every node to indicate whether a word/phrase ends there.

Now we can define methods to insert a string and to search for one.
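A minimal Python sketch of such a node and the two methods; it mirrors the structure described in the list above, though not necessarily the exact code in the repository.

```python
class TrieNode:
    def __init__(self, value=""):
        self.value = value          # the character sequence leading up to this node
        self.children = {}          # hash table: next character -> child node
        self.is_word = False        # does a word/phrase end here?

def insert(root, word):
    """Add a word to the trie, creating nodes as needed; linear in the word's length."""
    node = root
    for ch in word:
        if ch not in node.children:
            node.children[ch] = TrieNode(node.value + ch)
        node = node.children[ch]
    node.is_word = True

def search(root, word):
    """Return True if the exact word was inserted earlier."""
    node = root
    for ch in word:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return node.is_word
```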

The insertion cost is linear in the length of the string. Now let’s define a method for listing all the strings that start with a certain prefix. The idea is to traverse down to the node representing the prefix and from there do a breadth-first search of all its descendants; whenever a node’s flag marks a word boundary, we append that node’s value to the results. A caveat of a recursive approach is that you might run into stack-depth problems once the maximum string length reaches tens of thousands of characters.
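A sketch of that prefix listing, reusing the TrieNode from the previous snippet; using an explicit queue keeps it iterative and sidesteps the stack-depth caveat.

```python
from collections import deque

def autocomplete(root, prefix):
    """List every stored word that starts with `prefix`."""
    node = root
    for ch in prefix:               # walk down to the node representing the prefix
        if ch not in node.children:
            return []
        node = node.children[ch]
    results = []
    queue = deque([node])           # breadth-first search over the prefix node's descendants
    while queue:
        current = queue.popleft()
        if current.is_word:         # word boundary: record the node's value
            results.append(current.value)
        queue.extend(current.children.values())
    return results
```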

To delete an item, first traverse down to the node where the keyword ends and then work backwards, pruning nodes along the shared path that no other word still needs.
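A possible sketch of deletion along those lines, again assuming the TrieNode above:

```python
def delete(root, word):
    """Remove `word`, then walk back up pruning nodes no other word still needs."""
    path = [root]
    node = root
    for ch in word:
        if ch not in node.children:
            return False            # word not present
        node = node.children[ch]
        path.append(node)
    if not node.is_word:
        return False
    node.is_word = False
    # work backwards, removing childless nodes that no longer end a word
    for parent, child, ch in zip(reversed(path[:-1]), reversed(path[1:]), reversed(word)):
        if child.children or child.is_word:
            break
        del parent.children[ch]
    return True
```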

And there you have it: an efficient way of retrieving strings with a common prefix in less than 40 lines of Python code. I also wrote a C++ version using the STL data structures in about 90 lines; the time complexity for that version, however, is O(n log n), as the STL uses a red-black tree for its associative array. Colin Dean has written a Ruby version and Marcus McCurdy a Java one. You can find them all over here - https://github.com/vivekn/autocomplete. Read more about tries at Wikipedia.

Update: Thanks to bmahler on reddit for pointing out that the wordlist approach was unnecessary and space inefficient. It has been corrected.

Nothing is original. Steal from anywhere that resonates with inspiration or fuels your imagination. Devour old films, new films, music, books, paintings, photographs, poems, dreams, random conversations, architecture, bridges, street signs, trees, clouds, bodies of water, light and shadows. Select only things to steal from that speak directly to your soul. If you do this, your work (and theft) will be authentic. Authenticity is invaluable; originality is non-existent. And don’t bother concealing your thievery - celebrate it if you feel like it. In any case, always remember what Jean-Luc Godard said: “It’s not where you take things from - it’s where you take them to.”
Jim Jarmusch
A good traveler has no fixed plans, and is not intent on arriving.
Lao Tzu
The master of the art of living makes little distinction between his work and his play, his labor and his leisure, his mind and his body, his education and his recreation, his love and his religion. He simply pursues his vision of excellence in whatever he does, leaving others to decide whether he is working or playing. To him, he is always doing both.
– Buddha

Validators.js - javascript based form validation using jQuery

Validators.js is a JavaScript library for form validation that I wrote for my projects. It provides a simple and elegant interface for creating different types of form validators and chaining them together. Common checks like validating email addresses, matching a regular expression, catching unfilled required fields and password-change validation are provided out of the box. It inserts HTML elements like divs or spans immediately after the fields that fail validation; these can be styled with either the “.error” CSS class or any custom class you specify. Here is an example of using the library; it requires jQuery.

The Validators() function is the main interface to the library. Its first argument is an object that maps selectors (CSS or XPath) of the form fields to their corresponding Validator objects; to use multiple selectors, separate them with commas. The second argument is a callback that is invoked when validation succeeds: put any code you want to run before form submission there, or use it to submit the form via AJAX.

If you want to use the ‘submit’ input type in the form, use jQuery’s .submit() function like this:

You can also create your own validators, either by extending the classes defined in the CoffeeScript source or in the following ways:

  • By providing a function that takes the field value and returns true or false depending on whether validation succeeded.

  • By providing a regular expression to match with.

You can find the CoffeeScript source and documentation over here.