Erik's blog

Code, notes, recipes, general musings

Thoughts on a simple search suggest implementation

leave a comment »

Preamble:
Most search engines will suggest search terms as you’re typing. For example, if I type “appleipod” on yahoo.com, it suggests “apple ipod”, “apple new ipod phone”, “apple ipod nano”, etc.

Screen shot of Yahoo! search suggest results

Problem:
You are given an input string “appleipod” and a hash table of search terms { … “apple ipod”, “apple new ipod phone”, “apple ipod nano” … }. The table could contain 50,000 terms. How would you write a function to resolve the input string to least one suggestion?

Approach 1:
A couple approaches came to mind immediately: 1) search for “search suggestion algorithm”, as this is a solved problem; 2) iterate through the table performing regular expression tests on each entry, which would be slow, but simple.

Since the point is to come up with a solution from scratch that simply produces a suggestion, I started thinking about option two and came up with the following:

function suggest ( input, dictionary ) {
   var suggestions = [];
   for ( var term in dictionary ) {
      var regexp = new RegExp( "(.*)" + term + "(.*)" );
      if ( regexp.test( input ) ) {
         suggestions.push( term );
      }
   }
   return suggestions;
}

Running suggest( "appleipod", {"apple":true, "apple ipod":true, "a":true, "app":true} ); would match “a”, “app”, and “apple” in “appleipod”. It wouldn’t match “appleipod” to “apple ipod”, but the goal was just to suggest something, so it passes.

Another person suggested a different approach, Approach 2:

We can take advantage of the random access features of the hash table by iterating through the input string, character by character, checking to see if there’s an entry in the hash table:

function suggest ( input, dictionary ) {
   var suggestions = [];
   var suggestion = '';
   for ( var i = 0; i < input.length; i++ ) {
      suggestion += input[i];
      if ( dictionary[ suggestion ] ) {
         suggestions.push( suggestion );
      }
   }
   return suggestions;
}

That this would suggest “a”, “app”, and “apple” for “appleipod”, so it meets the requirements and is a much faster option.

As an Approach 3, I wonder is there’s a way to do hash/regex look up, i.e., dictionary[ regex ], so we could match any dictionary term containing the string “apple”. Sounds like another problem to think about 😉

Update (1/6/11)

Hold on. Neither Approach 1 or 2 satisfy the requirement of returning at least one suggestion given a table such as { … “apple ipod”, “apple new ipod phone”, “apple ipod nano” … }. What we should be testing is suggest( "appleipod", {"noise":true, "apple ipod":true, "apple new ipod phone": true, "apple ipod nano": true, "more noise": true} ); and verifying that it returns at least one of the target terms, e.g., “apple ipod”. So, let’s modify Approach 2 to sequentially introduce a space in between segments of the input, i.e., “appleipod” –> “a ppleipod”, “ap pleipod”, etc., instead of testing character by character.

Let’s leave Approach 3 unsolved for now, and call this Approach 4:

function suggest ( input, dictionary ) {
   var suggestions = [];

   //the 1st suggestion would have a leading space, which is a bit strange so skip ahead
   var suggestion = input;
   for ( var i = 1; i < input.length; i++ ) {

      if ( dictionary[ suggestion ] ) {
         suggestions.push( suggestion );
      }
      suggestion = input.substring(0, i) + ' ' + input.substr(i);
   }
   return suggestions;
}
Advertisements

Written by Erik

January 6, 2011 at 12:44 pm

Posted in algorithm

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: