### The Trouble with SVMs

People tend to use complicated machine learning models like support vector machines, just because it is believed to be the model with the best accuracy. The fact that SVMs are available as black box implementations exacerbate this problem. I would suggest focusing on your data instead of using slow techniques that you only partially understand. I achieved better results with a simple Naive Bayes classifier by choosing the right features. The best part is it only takes as much time as it does to read the document unlike an SVM.

I had a task of classifying text documents or reviews for sentiment polarity for a college project. I had a dataset of over 50,000 movie reviews with 25,000 positive and 25,000 negative reviews. After dividing my data into training and test data, the first thing I came up with was a Naive Bayes classifier based on the word frequencies of each term in the text. I used add one (or Laplacian) smoothing to handle words which were not in the training set. With this, the accuracy was around 69%. That is bad and is only marginally better than random guessing. I added a mechanism to transfer words to the opposite class when they were negated. That is to say if there is a word like “not” followed by some other word, the other word is placed in the opposite class. This along with Multinomial Naive Bayes, i.e counting a word only once in a document, helped increase the accuracy to 82.4%.

Then, against my better judgement I decided to try an SVM without completely understanding how it works. I used the TinySVM library and the training took over 45 minutes (vs 1 minute for NB) and resulted in a measly improvement of 0.3%. Not convinced by the black box approach, I decided to learn more about SVMs. The first place I went to were Andrew Ng’s lectures on Youtube. They helped me gain a general intuition about how an SVM works but the treatment of math was more hand wavy and it didn’t have anything about the actual algorithm to implement it. The papers on SVMs and SMO (Sequential Minimal Optimization) were too dense and the math was way above my level as I knew nothing about convex optimization. I spent part of the next month watching Stephen Boyd’s convex optimization lectures. Now I understood about Lagrange multipliers and KKT conditions but the task of implementing an SVM was too daunting. I realized that it may take a few months to write one from scratch. What a waste of time!

Now, I thought of removing the features that were only contributing noise to the classifiers. Stop word removal came to my mind but there were research papers suggesting that it actually removes information relevant to sentiment. Mutual information can be used quite effectively for feature selection. It is a measure of correlation of a feature with the whole dataset. Since it was based on probabilities, it suited my Naive Bayes model very well. So I reduced the number of features from over 300,000 words to 6,000. I chose this number by plotting a graph of the accuracy v/s the number of features. The accuracy shot up to 85.2% and the training time was still under a minute.

The fact that I could throw away irrelevant features meant I could initially define a very large number of features and then prune the irrelevant ones. So I added bigrams and trigrams as additional features. The feature space had now increased to 11 million. By selecting the top 32,000 features by mutual information, I was able to get an accuracy of 88.80%. There was a paper by Andrew Maas of Stanford achieving an accuracy of 88.89% using complicated vector space and probability models on the same dataset. Naive Bayes comes within 0.1% of that.

The lesson from this is that you can achieve a lot using a very simple algorithm, if you focus on selecting the right features for your data. No need to resort to complex black box models.