Roll your own autocomplete solution using Tries.

You might have come across many websites with autocomplete suggestions, most notably Google.

Google autosuggest

Adding such an option to your site or application might seem daunting but there is a very simple recursive data structure that solves the problem. There is a ton of literature on the net on how to do this using black box approaches like Lucene, Solr, Sphinx, Redis etc. But all these packages require a lot of configuration and you also lose flexibility. Tries can be implemented in a few lines of code in any language of your choice.

Trie

A trie is basically a tree, with each node representing a letter as illustrated in the figure above. Words are paths along this tree and the root node has no characters associated with it.

  1. The value of each node is the path or the character sequence leading upto it.
  2. The children at each node are ideally represented using a hash table mapping the next character to the child nodes.
  3. Its also useful to set a flag at every node to indicate whether a word/phrase ends there.

Now we can define methods to insert a string and to search for one.

The insertion cost has a linear relationship with the string length. Now lets define the methods for listing all the strings which start with a certain prefix. The idea is to traverse down to the node representing the prefix and from there do a breadth first search of all the nodes that are descendants of the prefix node. We check if the node is at a word boundary from the flag we defined earlier and append the node’s value to the results. A caveat of the recursive approach, is you might run into stack depth problems when the maximum length of a string reaches tens of thousands of characters.

To delete an item,  first traverse down to the leaf of the keyword and then work backwards, checking for common paths.

And there you have it, an efficient way of retrieving strings with a common prefix in less than 40 lines of python code. I also wrote a C++ version using the STL data structures in about 90 lines, the time complexity for this version however is O(n log n) as the STL uses a red-black tree implementation for an associative array. Colin Dean has wrote a Ruby version and Marcus McCurdy, a Java one. You can find them all over here - https://github.com/vivekn/autocomplete. Read more about tries at Wikipedia.

Update: Thanks to bmahler on reddit for pointing out that the wordlist approach was unnecessary and space inefficient. It has been corrected.

Nothing is original. Steal from anywhere that resonates with inspiration or fuels your imagination. Devour old films, new films, music, books, paintings, photographs, poems, dreams, random conversations, architecture, bridges, street signs, trees, clouds, bodies of water, light and shadows. Select only things to steal from that speak directly to your soul. If you do this, your work (and theft) will be authentic. Authenticity is invaluable; originality is non-existent. And don’t bother concealing your thievery - celebrate it if you feel like it. In any case, always remember what Jean-Luc Godard said: “It’s not where you take things from - it’s where you take them to.
Jim Jarmusch
A good traveler has no fixed plans, and is not intent
on arriving.
Lao Tzu
The master of the art of living makes little distinction between his work and his play, his labor and his leisure, his mind and his body, his education and his recreation, his love and his religion. He simply pursues his vision of excellence in whatever he does, leaving others to decide whether he is working or playing. To him, he is always doing both.
– Buddha

Validators.js - javascript based form validation using jQuery

Validators.js is a javascript library for form validation that I wrote for my projects. It provides a simple and elegant interface for creating different types of form validators and chaining them together. Common functions like checking email addresses, matching a regular expression, checking for unfilled required fields, password-change validation are provided out of the box. It inserts html elments like divs or spans immediately after those fields which have failed validation, these can be styled with either the “.error” css class or any custom class you would like to specify. Here is an example on using the library, it requires jQuery.

The Validators() function is the main interface to the library, the first argument is an object that contains a mapping between the selectors(css or xpath) of the form fields and their corresponding Validator objects. To use multiple selectors, separate the selectors with a comma. The second argument is a callback function which is called if form validation is success, insert any code that you may want to call before form submission or if you want to submit the form by ajax.

If you want to use the ‘submit’ input type in the form, use jQuery’s .submit() function like this:

You can also create your own validators either by extending the classes defined in the Coffeescript source or in the following ways:

  • By providing a function that performs an operation on the field value and returns a true or a false value based on the validation success.

  • By providing a regular expression to match with.

You can find the coffeescript source and documentation over here.

    A python ORM for Redis - RORM

    Of late, I have been using Redis a lot, using the redis-py library on the client side. To speed up things and increase productivity I used Amix’s redis_wrap module. One feature I missed out, due to moving from Postgres to Redis was Django’s ORM. My code was full of repetitive patterns and boilerplate, but luckily python is an awesome language and you can fix it easily using magic methods and metaclasses.

    For this ORM wrapper, I used __getattr__ and __setattr__, which allow you to handle the events where an object’s fields are set or retrieved. I changed the default behaviour to getting and setting from Redis instead.

    I also used some metaclasses to integrate with the cool list, set and hash wrappers Amix made. 

    So now I can use it like this

    I agree that this may not be very elegant, but that’s not a bad tradeoff, considering it is only 80 lines and took half an hour to write.

    You can check the code here 

    Firing up asynchronous tasks in Django with Celery

    While working on a new project of mine I encountered a problem of blocking the django server for too long. It is sort of a recommendation engine and does some data crunching by sending hundreds of network requests to different sites. The problem was that these tasks were performed in a synchronous fashion.

    Even if such a site handles only low traffic, most browsers timeout and close the connection this leads to a broken pipe like this:

    So the solution is to run these tasks asynchronously on a different process rather using the precious webserver time. Two options came to my mind for doing this:

    • Node.js - It is a super-fast asynchronous javascript framework, based on the premise of requests and response callback objects. 
    • Celery - A python based task queue, which runs tasks on a separate python process. Basically what it does is, it searches for functions or objects on which a ‘task’ decorator has been applied and serializes the function and the arguments passed to it. It then stores this data in a database and queues calls to the function. It is much slower than node but existing code can be ported easily.

    I decided to go with Celery because of two reasons, the major one being the fact that I have to rewrite my Models, ORM and the data analysis code in javascript. Also I needed some tasks to be run synchronously, as they depended on the results of the previous tasks. Using node meant I had to change my algorithms as well. So I decided to go for the much slower Celery.

    Setting up celery was a really painful experience, and took a whole day. The documentation on their site is a bit confusing, so let me sum it up for you. I used Redis as the database for the queue.

    1. First you need to install by doing an “easy_install Celery”.
    2. Then, go to your project folder and create a file called “celeryconfig.py” and add these settings:
    3. Make sure you are using absolute module names while importing modules in your project, Celery crashes on relative names.
    4. Now to setup an asynchronous task, go to the module you specified in CELERY_IMPORTS in your config file and do something like this:
    5. Make sure that the arguments you are passing to the task function are simple python builtin type objects. This is because they need to be serialized and stored in a database. If you need to use complex objects like ORM models, pass their ids or other parameters from which they can be reconstructed within the task function.
    6. Now start up your redis server and fire up a celery worker by issuing this command inside your project directory, where you have created your config file: “celeryd -l info”. The additional parameter provides some useful logging features.
    7. Start your Django server, which now has the power to perform asynchronous tasks.

    Coffeescript - a neat new language.

    Javascript has been around for more than 15 years as the language for client side development, and despite its name and to some extent its syntax, its a lot more similar to python and ruby than java or c. In js, functions are objects and can be passed around quite easily, it uses prototypal inheritance, which is a little different from object oriented programming.

    Coffeescript is a language that compiles to javascript, it provides javascript a cool syntax, much similar to python. And it abstracts the prototypal inheritance to an object oriented style, we are more familiar with. It also comes with utilities like list comprehension, variable and default arguments and a lot of syntactic sugar. It makes programming in javascript a lot more fun. A point to be noted is that all these features are inherently present in javascript, Coffeescript just brings them out. Also worth noting is the fact that it wraps the generated code in an anonymous function, so if you need to add objects or functions to the global namespace, you have to attach it to “this” or “window”.

    Two years back, a really fast asynchronous framework called Node.js made it possible to use javascript on the server side. Now Coffeescript makes it a lot more accessible. Also earlier this year, I started developing an Android application, I had to use Java and XML, the verbosity was so immense that it took 3500+ lines to develop a proof-of-concept client which communicated with a django backend. Now a lot of frameworks which support development of Android and iOS apps in javascript have come up. Titanium Appcelerator is one such framework, using which I tried developing the same app again to get the hang of it. With coffeescript, I could cut the lines of code from 3500 to 350! . Also the generated javascript was only 500 lines.

    With javascript going beyond the browser and expanding pervasively, Coffeescript is the swiss army knife you are looking for.

    Vincent Van Gogh’s The Starry Night 

    Vincent Van Gogh’s The Starry Night 

    Simplicity: Creating a social network in 200 lines using Redis and Python

    If you are developing a Facebook app, or for that matter an app for Twitter, Linkedin, Foursquare etc, you need to come up with some form of a social network that mirrors the users’ friends. Well, creating this using SQL, or through an ORM, might be the first thing that strikes your mind. I am not saying that it is a hard thing to do, but you’ll be amazed by how simple it is with a NoSQL database, or rather a key – value store, Redis.

    I had heard a lot about Redis and decided to take it for a spin. After using it for a while I realized that the problem is, people are generally obsessed with features. They want to have features they don’t need. For example, to sort posts in a user’s feed, in the SQL world, you need to use an “ORDER BY” command and sort by the timestamp each time someone loads the feed.

    Redis is a key store with a flat namespace and is amazingly fast. It is an “in memory” database, which means it stays completely on the RAM, writing to the hard disk periodically. In re dis you can map strings or keys to the following data types:

    • Strings up to 1GB in size

    • Numbers

    • Lists of strings and numbers

    • Sets and sorted sets

    • Hashes, similar to a JSON object

    Redis doesn’t come with as many features like Postgres or MySQL, but the best thing about it is it forces you to think simple. To store user updates in a feed, simply assign an id to each update and store it as a part of the update’s key. For every new update increment the id, now the feed is stored in a presorted manner. Now keep separate lists representing the feeds of every user, and when a friend creates an update , push the update id to the appropriate lists.

    You may ask what about complex interrelationships?, the answer is to avoid them completely. The simpler your design is, the better.

    A side note, Redis requires a lot of memory for a large database, but using the cloud, you can scale up easily.

    I hacked up a library in python that provides the main features of a social network, friends, followers, news feed, authentication etc. out of the box. I am using it in one of my future projects, if you would like to use it or check out the source, head on to http://github.com/vivekn/resn/ . Its only 200 odd lines long.