Home > Collaborative Filtering in PHP

Collaborative Filtering in PHP

4th January 2011

I recently completed a POC project for a friend that called for a recommendation engine. That is, a mechanism was needed for the application to suggest domain-specific macguffins that similar users are interested in that might also interest the current user. The technique I used is called collaborative filtering. You can read more about this technique and others designed to extract wisdom from crowds in O'Reilly's excellent Programming Collective Intelligence and Ron Zacharski's work in progress called A Programmer's Guide to Data Mining.

Collaborative filtering essentially boils down to two steps:

Finding credible critics
Finding the difference in the set of the things liked by the user and the crediable critics

Credible critics are a cute way of labeling that set of users whose tastes are (perhaps) similar to the target user. There are many ways to determine this set, but the method I chose is processor friendly and easy to code.

First, you need a user-specified metric. This could be an actually rating (1-5) of your domain-specific widgets or simple "like" buttons. Whatever the shape of the rating, it must be represented a number.

After a while, your users will have generated a lot of rating data about your widgets. If you take two domain-specific widgets and plot their ratings on a cartesian plane, you'll begin to see patterns. Users who rate these items "closely" to the target user will make excellent credible critics (in theory). The game then is how to determine "close."

The most brain-dead simple method for the calculating distance of two cartesian points is called the Manhattan method. Very briefly, the formula to calculate this is to add the absolute value of the difference of each dimensions and divide these by the number of dimensions. In this case, dimensions are widgets that both users have rated. Let's look at some code.

function manhattan_distance($rating1=array(), $rating2=array()) {
   $MAX_DISTANCE = PHP_INT_MAX;
   $distance = $total = 0;
   foreach ($array_keys($rating1) as $widget_id) {
      if (array_key_exists($widget_id, $rating2) {
         $distance += abs($rating1[$widget_id] - $rating2[$widget_id]);
	 $total += 1;
      }
   }
   if ($total > 0) {
     return $distance / $total;
   }
   return $MAX_DISTANCE;
}

This function expects parameters to be associative arrays mapping widget IDs to ratings. Here's an example of calling this function:

$user1 = array("widget-1" => 3,
       	       "widget-2" => 1,
	       "widget-3" => 5,
              );

$user2 = array("widget-2" => 5,
               "widget-3" => 1,);
printf("Distance of user1 from user2: %0.2f\n", manhattan_distance($user1,$user2));

You will need to compare every user to each other, which is an algorithm with a runtime of O(n**2). This is not good for large user populations, so some future optimization will be needed. However, that optimization is not covered here.

To make a recommendation, we need to find a few critics with similar tastes. This is to say, we need to find users with the smallest distance from the target user.

function computeNearestNeighbor ($user_id, $user_list=array(), $count=1) {
   $your_ratings = get_ratings($user_id);
   $distances = array();
   foreach ($user_list as $uid) {
     $distances[] = array($uid,
     		          manhattan_distance($your_ratings, 
			  		     get_ratings($uid))
                         );
   }
   usort($distances, 'compare_recs');
   $tmp = array();
   while ($count > 0) {
      $i = array_shift($distances);	
      $tmp[] = $i[0];
      $count -= 1;
   }
   return $tmp;
}

function compare_recs ($a, $b) {
   if ($a[1] == $b[1]) {
      return 0;
   }
   return ($a[1] < $b[1]) ? -1 : 1;
}

function get_ratings ($user_id) {
   // Assume $data is held in a data store
   $tmp = array();
   foreach ($data as $row) {
      $tmp[$row["wiget_id"]] = $row["cnt"];
   }  
   return $tmp;
}

Finding the set of closests neighbors really is a fairly mechanical process of calculating the distances of all users to the target one, sorting that set by distance and using some number of the first several users in the list.

That completes the first part of the recommendation process: get the list of critics. Now we need to see what widgets those critics use that our target user is not. With the infrasture we've built, it is straight-forward to get the set of widgets not currently used by the target user.

function recommend_widgets ($user_id, $user_list=array()) {
   $recs = array();
   $seen = array();

   $your_ratings = get_ratings($user_id);
   $nearest_users = computeNearestNeighbor($user_id, $user_list, 3);

   foreach ($nearest_users as $uid) {
      $neighbor_ratings = get_invites($uid);
      foreach ($neighbor_ratings as $widget_id => $rating) {
      	 if ($seen[$widget_id]) {
	    continue;
         }
        
	if (!array_key_exists($widget_id, $your_ratings)) {
	   $recs[] = array($widget_id, $rating);
	   $seen[$widget_id] = True;
        }
      }
   }

   usort($recs, "compare_ratings");
   return $recs;
}

function compare_ratings ($a, $b) {
   if ($a[1] == $b[1]) {
      return 0;
   }
   return ($b[1] < $a[1]) ? -1 : 1;
   
}

Notice that the sort routine here orders the records by rating is descending order. This means that recommendations with high ratings will be listed first. You should consider whether this is the behavior you want your recommendation engine to have.

Collaborative filter is simply algorithm that invites all kinds of tweaking. There are many better ways to find critics or even to calculate "distance." You might want to add other weighting mechanisms to the recommendations of specific items. The key is to be able to articulate exactly how you want your recommendations to work.

Tags: data mining, php, programming