It's not about the number of characters in your code when you are choosing a language for productivity. Sure, I would rather write something in Python than in C if I don't need to worry about performance. But it's hard for me to translate my ideas into Haskell, even though it results in even shorter code. By that metric, one could argue that writing GZipped code is better, but think of the trouble you would need to go through for that. This is where I believe Lisp/Scheme really shines. You can just program by "wishful thinking", as Hal Abelson says. Previously, I was under the impression that Lisp was just another language. I would look at some Lisp code and say to myself, "I could implement the same thing in Python in the same number of lines, so what's the big deal?" Recently, I started reading "Structure and Interpretation of Computer Programs" (SICP). (If you haven't read SICP, you should start: it is the best book at conveying the idea of abstraction in programming.) This book made me realize what is meant by the statement that "Lisp has no syntax at all". There is absolutely no friction in translating your ideas into code once you start thinking at the right levels of abstraction. I don't get why Lisp isn't as popular as it should be; probably because most people don't get abstraction. At the same time, much of what SICP says about procedural abstraction applies to any programming language with lambdas and first-class functions. So even if you don't intend to use Lisp/Scheme, I would suggest getting a copy of SICP.
You don't need to use every feature of your framework just because it is there. Some frameworks, like Backbone.js, are widely used, but their documentation is terse and assumes a lot, which takes some time to wrap your head around. So what people end up doing is reading long tutorials or books on the topic. These "books" completely miss the point: the goal is to solve the problem at hand, not to obsess over the framework you are using. For everything there is supposedly "one correct way" using the features provided by the framework, and things are usually opinionated even if the developers claim otherwise. For most of these things you can come up with simpler and much more flexible solutions yourself, instead of going through the process of adapting the idea to your framework. I am not saying that you must not use frameworks; they do serve a purpose. But only use them for the things you really need, and for everything else stick to something simple. It's not worth the additional effort, no matter what the "architecture astronauts" tell you. For example, I like Backbone's idea of event dispatchers attached to views. It makes for a neatly organised structure for view code and makes user interface work less painful. But I wouldn't go anywhere near Backbone's models and collections; plain JSON and AJAX work well for me there. I don't see the need to add a layer of incidental complexity.
People tend to use complicated machine learning models like support vector machines just because they are believed to give the best accuracy. The fact that SVMs are available as black box implementations exacerbates this problem. I would suggest focusing on your data instead of on slow techniques that you only partially understand. I achieved better results with a simple Naive Bayes classifier by choosing the right features. The best part is that training takes only about as long as reading the documents, unlike an SVM.
I had a task of classifying text documents (movie reviews) by sentiment polarity for a college project. I had a dataset of 50,000 movie reviews: 25,000 positive and 25,000 negative. After dividing the data into training and test sets, the first thing I tried was a Naive Bayes classifier based on the word frequencies of each term in the text, with add-one (Laplace) smoothing to handle words that were not in the training set. With this, the accuracy was around 69%, which is bad and only marginally better than random guessing. I then added a mechanism to transfer words to the opposite class when they were negated: if a word like "not" is followed by some other word, that word is counted in the opposite class. This, along with counting each word only once per document (a binarized multinomial Naive Bayes), increased the accuracy to 82.4%.
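Here is a minimal sketch of that classifier in Python, assuming a simple whitespace tokenizer and a hypothetical NEGATORS word list; it is not the exact code I used for the project:

```python
import math
from collections import defaultdict

NEGATORS = {"not", "no", "never"}  # hypothetical list of negation words

def features(text):
    """Binarized unigram features with simple negation handling."""
    feats, negate = set(), False
    for word in text.lower().split():
        feats.add("NOT_" + word if negate else word)
        negate = word in NEGATORS  # move only the *next* word to the opposite class
    return feats

class NaiveBayes:
    def __init__(self):
        self.word_counts = {"pos": defaultdict(int), "neg": defaultdict(int)}
        self.doc_counts = {"pos": 0, "neg": 0}
        self.vocab = set()

    def train(self, text, label):
        self.doc_counts[label] += 1
        for f in features(text):               # count each word at most once per document
            self.word_counts[label][f] += 1
            self.vocab.add(f)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in ("pos", "neg"):
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for f in features(text):
                # add-one (Laplace) smoothing for words unseen in this class
                score += math.log((self.word_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```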
Then, against my better judgement, I decided to try an SVM without completely understanding how it works. I used the TinySVM library; training took over 45 minutes (versus 1 minute for Naive Bayes) and resulted in a measly improvement of 0.3%. Not convinced by the black box approach, I decided to learn more about SVMs. The first place I went was Andrew Ng's lectures on YouTube. They helped me gain a general intuition about how an SVM works, but the treatment of the math was rather hand-wavy and they said nothing about the actual algorithm you would implement. The papers on SVMs and SMO (Sequential Minimal Optimization) were too dense, and the math was way above my level, as I knew nothing about convex optimization. I spent part of the next month watching Stephen Boyd's convex optimization lectures. Now I understood Lagrange multipliers and KKT conditions, but the task of implementing an SVM still looked too daunting; I realized it might take a few months to write one from scratch. What a waste of time!
Next, I thought of removing the features that were only contributing noise to the classifier. Stop word removal came to mind, but there are research papers suggesting that it actually removes information relevant to sentiment. Mutual information can be used quite effectively for feature selection instead: it measures how much knowing whether a term occurs in a document tells you about the document's class. Since it is based on probabilities, it fits a Naive Bayes model very well. Using it, I reduced the number of features from over 300,000 words to 6,000, choosing that number by plotting accuracy against the number of features. The accuracy shot up to 85.2% and the training time was still under a minute.
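As a sketch (the exact estimator I used may have differed slightly), the mutual information of a term can be computed from a 2x2 contingency table of term occurrence against class membership:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI of a term with the class label.
    n11: positive docs containing the term, n10: negative docs containing it,
    n01: positive docs without it,          n00: negative docs without it."""
    n = n11 + n10 + n01 + n00

    def part(n_tc, n_t, n_c):
        # contribution of one cell of the table; empty cells contribute nothing
        return (n_tc / n) * math.log2(n * n_tc / (n_t * n_c)) if n_tc else 0.0

    term_yes, term_no = n11 + n10, n01 + n00
    cls_pos, cls_neg = n11 + n01, n10 + n00
    return (part(n11, term_yes, cls_pos) + part(n10, term_yes, cls_neg) +
            part(n01, term_no, cls_pos) + part(n00, term_no, cls_neg))

# keep the top-k terms ranked by MI, with k chosen from the accuracy-vs-features plot
# selected = sorted(vocab, key=lambda t: mutual_information(*table[t]), reverse=True)[:6000]
```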
The fact that I could throw away irrelevant features meant I could start with a very large feature set and then prune the irrelevant ones. So I added bigrams and trigrams as additional features, which grew the feature space to around 11 million. By selecting the top 32,000 features by mutual information, I was able to get an accuracy of 88.80%. A paper by Andrew Maas of Stanford achieved 88.89% on the same dataset using much more complicated vector space and probability models; Naive Bayes comes within 0.1% of that.
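A sketch of how those extra features could be generated (again assuming whitespace tokenization):

```python
def ngram_features(text, max_n=3):
    """Unigrams, bigrams and trigrams over a whitespace-tokenized document."""
    words = text.lower().split()
    feats = set(words)
    for n in range(2, max_n + 1):
        for i in range(len(words) - n + 1):
            feats.add(" ".join(words[i:i + n]))
    return feats
```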
The lesson from this is that you can achieve a lot with a very simple algorithm if you focus on selecting the right features for your data. There is no need to resort to complex black box models.
I spent the past couple of days on a problem called the "Travelling Santa Problem", a variant of the NP-complete Travelling Salesman Problem. It is an ongoing competition on Kaggle, a site that hosts various machine learning and optimization contests. In the original TSP you need to find a tour of a graph that visits every vertex exactly once while minimizing the total cost. The number of such tours (Hamiltonian cycles) can be exponential in the number of vertices, and no polynomial-time exact algorithm is known for the problem. TSP has applications in scheduling, logistics and the layout of electronic circuitry. Kaggle's version gives you a set of 150,000 points on a two-dimensional plane, and you need to find two paths that each visit all the points and share no edges, i.e. edge-disjoint paths. The aim is to minimize the total distance, and your score is the length of the longer of the two paths.
If you visualize the set of points, you get a picture of Santa.
They provided a benchmark of random paths to get started with, that is, two completely random sequences of points. It had a score of about 1.3 billion, and at the time the leader's score was around 6.5 million. I decided to start with a very simple heuristic: sort the points by their x coordinate and connect them from left to right. With this I could find a single path, but what about the second one? The disjoint path condition makes this a little trickier, but given the first path it is easy to check whether adding an edge to the second path is valid: since every node has at most two connections, we just need to store, for each node, the ids of its previous and next nodes in the first sequence. Using this, I sorted the points by their y coordinate and built the second path from top to bottom; whenever an edge would repeat one from the first path, I simply connected to a random vertex instead. Now that I had two disjoint paths, I made a submission, and the score turned out to be 400 million. Much better than random, but way behind the leaders and quite inefficient. If you think about it, this approach creates a very zig-zag path on random data, and the disjointness condition forces you to add random edges to the second path, which drives the distance up even further. Time for a different strategy.
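A rough Python sketch of that bookkeeping (my actual submission was in C++; here the points are assumed to be (id, x, y) tuples):

```python
def build_first_path(points):
    """points: list of (id, x, y) tuples. Returns the x-sorted path and a neighbour map."""
    path1 = [p[0] for p in sorted(points, key=lambda p: p[1])]  # left to right by x
    # Every node touches at most two edges in path1, so recording its neighbours
    # is enough to test whether an edge in the second path would be repeated.
    neighbours = {p[0]: set() for p in points}
    for a, b in zip(path1, path1[1:]):
        neighbours[a].add(b)
        neighbours[b].add(a)
    return path1, neighbours

def edge_repeated(neighbours, a, b):
    """True if the edge (a, b) already appears in the first path."""
    return b in neighbours[a]
```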
The next heuristic I tried was the nearest neighbour approach: if you connect each node to its nearest unvisited neighbour, you should get a much shorter path, which is quite intuitive. The problem is that the naive version has a complexity of O(n^2), and there are 150,000 points, go figure. A k-d tree can answer all the nearest neighbour queries in about O(n lg n) time, but its implementation is non-trivial, and it would be pretty much useless for generating a second disjoint path, where we would still have to rely on random links. Time for an approximation: it is highly likely that the nearest neighbour lies among the points closest by x or y coordinate. These can be found by simply sorting the points, considering the k nearest such points and picking the nearest neighbour among them (we can also check the disjointness condition while doing so). Doing this with the nearest 100 points for each node got me a solution with a score of 25 million. I then increased the number of candidate points, and at around 4,000 I could get a solution of around 8.5 million with the code running in a minute. Within 30% of the leaders, not bad, huh? So how could this be improved further?
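A sketch of that candidate generation (same assumed (id, x, y) point format as above):

```python
def neighbour_candidates(points, k=100):
    """For each point, consider only the k closest points by x-rank and by y-rank."""
    by_x = sorted(points, key=lambda p: p[1])
    by_y = sorted(points, key=lambda p: p[2])
    rank_x = {p[0]: i for i, p in enumerate(by_x)}
    rank_y = {p[0]: i for i, p in enumerate(by_y)}

    def candidates(pid):
        i, j = rank_x[pid], rank_y[pid]
        window = by_x[max(0, i - k):i + k + 1] + by_y[max(0, j - k):j + k + 1]
        return {p[0] for p in window} - {pid}

    return candidates

# the nearest-neighbour tour then only measures distances to candidates(pid),
# skipping any edge already used by the first path
```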
I opened CLRS to look for ideas. There is an approximation algorithm, the 2-approximation, based on visiting vertices in the order of a walk of a minimum spanning tree. CLRS has an elegant proof that the resulting tour is less than twice the minimum possible length, and in practice it does much better than that. But the problem is in creating the minimum spanning tree: using Prim's algorithm on the complete graph would cost O(n^2 lg n), which is not feasible. It turns out that an MST for points in the Euclidean plane can be built in O(n lg n) time using a Voronoi diagram or a Delaunay triangulation, but as I had no background in computational geometry, I decided not to go down that path. Building an approximate MST from the 40 nearest points by x or y coordinate yielded a poor result.
An optimization technique that was mentioned a lot in the forums was 2-opt. It is a local optimization technique that takes two edges and replaces them with a different combination if that reduces the distance. In the restricted form I used, if you have a sequence of points ABCD, you replace it with ACBD when that has a lower cost. I tried this by randomly picking consecutive edges and swapping the vertices whenever the swap lowered the cost. Even after 1 billion iterations and 45 minutes it only yielded an improvement of 0.1%; it was getting stuck in a local minimum. Then I ran the optimization over all vertices of the path in sequential order instead of picking them at random; it converged in 5-6 passes and yielded a 2-3% improvement.
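A sketch of that sequential pass (coords is assumed to map a point id to its (x, y) pair):

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def two_opt_pass(path, coords):
    """One sequential pass over consecutive quadruples ABCD, swapping B and C when it helps."""
    improved = False
    for i in range(1, len(path) - 2):
        a, b, c, d = (coords[path[j]] for j in (i - 1, i, i + 1, i + 2))
        # the swap replaces edges (A,B) and (C,D) with (A,C) and (B,D); edge B-C is unchanged
        if dist(a, c) + dist(b, d) < dist(a, b) + dist(c, d):
            path[i], path[i + 1] = path[i + 1], path[i]
            improved = True
    return improved

# repeat passes until no swap improves the path any further
# while two_opt_pass(path, coords): pass
```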
There is a randomized algorithm known as simulated annealing, based on the metallurgical process of annealing, in which a metal is heated to a high temperature and allowed to cool slowly to remove imperfections in its crystal structure. Heating lets the material reach a different configuration, and on slow cooling the imperfections are removed. Here, perturbations of the sequence are created using 2-opt, and during the high temperature phase changes are accepted even if they increase the total cost. As the temperature drops, the acceptance rule is tightened until only moves that reduce the cost are accepted. This helps the search escape local minima. Unfortunately, this algorithm didn't improve my solution by much even after running for an hour.
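A sketch of that acceptance rule on top of the same adjacent-swap 2-opt move (the temperature schedule and parameters here are assumptions, not the ones I used):

```python
import math, random

def euclid(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def anneal(path, coords, t0=1.0, cooling=0.9999, steps=1_000_000):
    """Simulated annealing over the adjacent-swap 2-opt neighbourhood."""
    t = t0
    for _ in range(steps):
        i = random.randint(1, len(path) - 3)
        a, b, c, d = (coords[path[j]] for j in (i - 1, i, i + 1, i + 2))
        delta = (euclid(a, c) + euclid(b, d)) - (euclid(a, b) + euclid(c, d))
        # accept a worsening move with probability exp(-delta / t) while the temperature is high
        if delta < 0 or random.random() < math.exp(-delta / t):
            path[i], path[i + 1] = path[i + 1], path[i]
        t *= cooling  # cool down, tightening the acceptance rule
    return path
```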
Using sequential 2-opt I could get one path to around 7.1 million and the second to 8.25 million in under 5 minutes of running time, so the gap between the two paths was now the main culprit. The Lin-Kernighan heuristic, a generalization of 2-opt, is considered one of the best heuristics for TSP. But my motivation to read more about it diminished after seeing posts in the competition forum saying that some of the leaders ran their programs for days, or even weeks, to get their results. In my opinion, running a program for days to achieve a very marginal improvement is just not justified. That put me off, and I decided to quit the competition.
What I learned from this competition is that NP-complete problems can be approximated reasonably well within a short amount of time, while the last mile takes a humongous effort, and whether it's worth it is questionable. I also got to know about randomized algorithms and Monte Carlo methods. You can find my C++ code over here.
So I decided to take the plunge into the world of competitive programming after testing the waters for a while. I spent my free time over the last couple of weeks solving algorithmic puzzles on different websites and wanted to share my experience here.
I had done a few easy problems here and there earlier; the real problem was focus. Selecting a site to start with is tough, as there are many good contest sites with great repositories of problems. There's TopCoder, which is the most popular, but it forces you to download a standalone Java applet and doesn't have a great interface. I don't know if it's just me, but I didn't enjoy the experience much. Codeforces, Codechef, Interviewstreet, SPOJ, Project Euler and a whole host of online judges run by various universities leave you with a practically infinite supply of algorithmic problems. Almost all of them host weekly or monthly contests and have a rating system similar to chess Elo ratings. Believe me, solving these problems gives you a kick and helps improve your algorithmic skills. It is also quite addictive once you get rolling.
You need to select a particular site and stick to it for a while. Make sure you choose one with a good mix of easy and hard problems so that your confidence doesn't take a beating. Codeforces has hints for the problems in its archive and would be a good starting point. I don't have much of a background in algorithms and haven't taken any formal course, but this is probably a great way to learn about them and also improve your debugging skills.
Though most programming languages are supported on these sites, learning C++ and the STL will be incredibly useful. Many problems have tight time limits that will frustrate the hell out of you if you are using an interpreted language like Python, even if your algorithm is optimal. This is what made me give up the last time I tried these sites; not any more.
Codechef runs a monthly programming contest with some nice algorithmic problems, so I decided to have a go at it last week. The contest is a ten-day event with ten problems. I spent a couple of hours a day working on them and managed to solve 7 out of 10. The ones I could solve used the standard algorithms you would find in CLRS, with some minor changes. So I guess if I can do it, anyone can. I need to ramp up my skills and read about more algorithms to solve the rest, but it was really fun and exciting.
Another thing I learnt over the past couple of weeks is that multitasking doesn't work, even with great productivity apps and todo lists. Just pick something you would like to work on and spend a considerable amount of time on it without context switching; you get more done, and life becomes less hectic and more enjoyable this way.
You might have come across many websites with autocomplete suggestions, most notably Google.
Adding such an option to your site or application might seem daunting, but there is a very simple recursive data structure that solves the problem. There is a ton of literature on the net about doing this with black box tools like Lucene, Solr, Sphinx or Redis, but those packages require a lot of configuration and you lose flexibility. A trie can be implemented in a few lines of code in any language of your choice.
A trie is basically a tree with each node representing a letter, as illustrated in the figure above. Words are paths along this tree, and the root node has no character associated with it.
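In Python, a node can be as simple as this (a minimal sketch, not necessarily identical to the code in the repository):

```python
class TrieNode:
    """A trie node: children keyed by character, plus a flag marking a word boundary."""
    def __init__(self):
        self.children = {}
        self.is_word = False
```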
Now we can define methods to insert a string and to search for one.
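A minimal sketch of both operations:

```python
class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Walk down the tree, creating nodes as needed, and mark the last one as a word."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def search(self, word):
        """Return True only if the exact word was inserted earlier."""
        node = self.root
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_word
```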
The insertion cost is linear in the length of the string. Now let's define the method for listing all the strings that start with a given prefix. The idea is to traverse down to the node representing the prefix and from there visit all of its descendants, checking whether each node is at a word boundary using the flag we defined earlier and appending the corresponding word to the results. A caveat of the recursive approach is that you might run into stack depth problems when strings reach tens of thousands of characters.
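A sketch of the prefix lookup, continuing the Trie class above (written here as a recursive traversal):

```python
    def autocomplete(self, prefix):
        """All inserted words that start with `prefix`."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return self._collect(node, prefix)

    def _collect(self, node, prefix):
        # recursion depth equals the length of the longest suffix below `node`
        results = [prefix] if node.is_word else []
        for ch, child in node.children.items():
            results.extend(self._collect(child, prefix + ch))
        return results
```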
To delete an item, first traverse down to the node at the end of the keyword and then work backwards, pruning nodes that are not shared with other words.
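Continuing the same sketch:

```python
    def delete(self, word):
        """Unmark `word` and prune any nodes that no longer lead to a word."""
        path = [self.root]
        for ch in word:
            if ch not in path[-1].children:
                return  # the word is not in the trie
            path.append(path[-1].children[ch])
        path[-1].is_word = False
        # walk backwards, removing nodes with no children that mark no other word
        for i in range(len(word), 0, -1):
            node, parent = path[i], path[i - 1]
            if node.children or node.is_word:
                break
            del parent.children[word[i - 1]]
```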
And there you have it: an efficient way of retrieving strings with a common prefix in less than 40 lines of Python code. I also wrote a C++ version using STL data structures in about 90 lines; the time complexity for that version, however, is O(n log n), as the STL uses a red-black tree for its associative arrays. Colin Dean has written a Ruby version and Marcus McCurdy a Java one. You can find them all over here - https://github.com/vivekn/autocomplete. Read more about tries on Wikipedia.
Update: Thanks to bmahler on reddit for pointing out that the wordlist approach was unnecessary and space inefficient. It has been corrected.
Nothing is original. Steal from anywhere that resonates with inspiration or fuels your imagination. Devour old films, new films, music, books, paintings, photographs, poems, dreams, random conversations, architecture, bridges, street signs, trees, clouds, bodies of water, light and shadows. Select only things to steal from that speak directly to your soul. If you do this, your work (and theft) will be authentic. Authenticity is invaluable; originality is non-existent. And don’t bother concealing your thievery - celebrate it if you feel like it. In any case, always remember what Jean-Luc Godard said: “It’s not where you take things from - it’s where you take them to.” — Jim Jarmusch
A good traveler has no fixed plans, and is not intent on arriving. — Lao Tzu
The master of the art of living makes little distinction between his work and his play, his labor and his leisure, his mind and his body, his education and his recreation, his love and his religion. He simply pursues his vision of excellence in whatever he does, leaving others to decide whether he is working or playing. To him, he is always doing both.
The Validators() function is the main interface to the library. The first argument is an object that maps the selectors (CSS or XPath) of the form fields to their corresponding Validator objects; to use multiple selectors, separate them with commas. The second argument is a callback function that is called when form validation succeeds; put any code you want to run before form submission there, or use it to submit the form via AJAX.
If you want to use the ‘submit’ input type in the form, use jQuery’s .submit() function like this:
You can also create your own validators, either by extending the classes defined in the CoffeeScript source or in the following ways:
You can find the CoffeeScript source and documentation over here.