Get Social with Pearson Correlation


In one of my earlier posts, I discussed using Pearson correlation for making social recommendations. In this post we will delve deeper into it, including the Hadoop map reduce implementation. There are many correlation techniques, including cosine distance, slope one etc., and these are already implemented in sifarish. The latest addition to the arsenal of correlation techniques in sifarish is Pearson correlation.

One advantage of Pearson correlation over other techniques is that it can handle bias in the ratings of a user. For example, user A may consistently give higher ratings than user B. There may still be good correlation between the ratings of users A and B, as long as the difference between their ratings is consistent.

Social Recommendation Basics

As you may recall from my earlier posts, there are four main steps in social recommendation solutions, specifically for item based recommendation

  1. Find the correlation between two items based on the ratings of a common set of users that have rated both items. This step is repeated for all possible item pairs.
  2. For an item A that a user has rated and an item B that the user has not rated, find the predicted rating for item B, based on the rating of item A and the correlation between items A and B (from step 1).
  3. Aggregate all the predicted ratings for an item, based on all items that have been rated by a user and are correlated to the target item. This is an aggregation of the results from step 2. The aggregation function could be average, median, maximum etc.
  4. Repeat steps 2 and 3 for all items that have not been rated by the user. For a given user, we end up with a predicted rating for every item not yet rated by the user.
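Steps 2 and 3 can be sketched in plain Java. The per-pair prediction formula and the average aggregation below are illustrative choices of mine, not sifarish's exact formulas:

```java
import java.util.List;

public class PredictionSketch {
    // Step 2: one simple way to predict a rating for item B from a rated
    // item A is to weight the known rating by the correlation between A and B.
    static double predictOne(double ratingA, double corrAB) {
        return ratingA * corrAB;
    }

    // Step 3: aggregate all such predictions for the target item; the post
    // mentions average, median, or maximum -- average is used here.
    static double aggregate(List<Double> predictions) {
        double sum = 0;
        for (double p : predictions) sum += p;
        return sum / predictions.size();
    }

    public static void main(String[] args) {
        // Item B predicted from two rated items with correlations 0.8 and 0.5
        List<Double> preds = List.of(predictOne(4, 0.8), predictOne(2, 0.5));
        System.out.println(aggregate(preds)); // (3.2 + 1.0) / 2 = 2.1
    }
}
```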

As you may have realized, the correlation computed in step 1 is independent of the other steps in this process. So we have the option of using different correlation techniques, including Pearson.

Pearson Correlation Map Reduce

The map reduce implementation is in the class PearsonCorrelator. As before, the input consists of an item ID, followed by a list of userID and rating pairs, as below. If your data is not in this format, you need to convert it, preferably with a map reduce job.

ZAG2HF3A4A,LWLFD3A6SYT5:2,010MUJK6M1O2:2,53W34HFT4X4Q:4,R097F20QGSO9:2,PERT7XRM16IN:1 .....
1N50L7AH40,5040F212AL7P:3,5IXSH6Y46OT3:1,HUJHCV7U1B1O:5,Y341MW2SUXTY:5,02030IV24RGC:1 ......
.................................
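A minimal parser for lines in this format could look like the sketch below; the class and method names are mine, not sifarish's:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RatingLineParser {
    // Parses "itemID,user1:rating1,user2:rating2,..." into the item ID
    // and a userID -> rating map.
    static Map.Entry<String, Map<String, Integer>> parse(String line) {
        String[] fields = line.split(",");
        String itemId = fields[0];
        Map<String, Integer> ratings = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            String[] userRating = fields[i].split(":");
            ratings.put(userRating[0], Integer.parseInt(userRating[1]));
        }
        return Map.entry(itemId, ratings);
    }
}
```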

As in the content based recommendation solution described in my earlier post, we use partitioned hashing to enumerate all possible pairs of items. We map the itemID into a hash bucket and then create all possible pairs of such buckets. The two hash buckets are combined using a mathematical function to generate the mapper output key. For the value, we emit all the ratings along with the itemID. Details can be found in the map() method implementation.
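The idea of bucketing item IDs and combining two buckets into one mapper output key can be sketched as below; the bucket count and the combining function are illustrative, not what sifarish's map() actually uses:

```java
public class BucketPairKey {
    static final int NUM_BUCKETS = 10; // illustrative; a real job would make this configurable

    // Map the item ID into one of NUM_BUCKETS hash buckets.
    static int bucketOf(String itemId) {
        return (itemId.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
    }

    // Combine two buckets into a single mapper output key, ordering them so
    // that (a, b) and (b, a) land on the same reducer key.
    static int pairKey(int bucketA, int bucketB) {
        int lo = Math.min(bucketA, bucketB);
        int hi = Math.max(bucketA, bucketB);
        return lo * NUM_BUCKETS + hi;
    }
}
```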

On the reducer side, there is one call to the reducer for each pair of hash buckets, and we get the data values for both buckets. However, we need to segregate the two sets of values corresponding to the two buckets, because we need to do a nested loop join on the values from the two sets. We accomplish this using secondary sorting: we prepend 0 to the key for one bucket and 1 for the other.

For the Pearson correlation between two items, we consider two sets of user ratings, such that the set of users is the same for both rating sets. The Pearson correlation is a function of the following

  • Rating covariance between the two sets
  • Rating standard deviation of the first set
  • Rating standard deviation of the second set

The actual formula can be found in my earlier post. The function findCorrelation() computes the correlation value. The computation is performed only when the size of the intersection set of users that have rated both items is above a threshold. This threshold is defined through the config param min.rating.intersection.set. The default value is 3.
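The math behind findCorrelation() is standard and can be sketched as below, with the threshold check folded in. This is my own sketch under the default threshold of 3, not sifarish's code:

```java
public class PearsonSketch {
    static final int MIN_INTERSECTION = 3; // mirrors min.rating.intersection.set

    // Pearson correlation of two equal-length rating arrays: the ratings by
    // the common users for the two items. Returns 0 if the intersection is
    // below the threshold, i.e. the pair is skipped.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        if (n < MIN_INTERSECTION) return 0;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n; meanY /= n;
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            cov  += (x[i] - meanX) * (y[i] - meanY); // rating covariance
            varX += (x[i] - meanX) * (x[i] - meanX); // -> std dev of first set
            varY += (y[i] - meanY) * (y[i] - meanY); // -> std dev of second set
        }
        return cov / Math.sqrt(varX * varY);
    }
}
```

This also illustrates the bias-handling point from earlier: ratings {3, 4, 5} are uniformly one point higher than {1, 2, 3}, yet the correlation between them is a perfect 1.0, because the difference is consistent.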

Here is some sample output of the map reduce. It consists of two itemIDs followed by the correlation coefficient and the correlation weight. The raw coefficient is in the range -1 to 1, where 1 implies the strongest positive correlation and -1 the strongest negative. In the output it has been normalized to the range 0 to 1000. The normalization scale is specified through the config parameter correlation.scale.

The correlation weight is the size of the intersection between the two sets of users that have rated the two items. It's used by the rating predictor MR class UtilityPredictor, in addition to the correlation coefficient.
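One plausible linear mapping from the raw coefficient in [-1, 1] onto [0, correlation.scale] is sketched below; this is my reconstruction consistent with the sample output, not sifarish's actual scaling code:

```java
public class CorrelationScaler {
    // Maps a coefficient in [-1, 1] linearly onto [0, scale]; with the
    // scale of 1000 used in the sample output, -1 -> 0, 0 -> 500, 1 -> 1000.
    static int scale(double coefficient, int scale) {
        return (int) Math.round((coefficient + 1.0) / 2.0 * scale);
    }
}
```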

64MWC36DFA,WQPQA3Y41R,621,3
64MWC36DFA,S85609YCFD,62,3
64MWC36DFA,NEBNQU74NS,55,4
64MWC36DFA,03JMNWB5SL,248,3
6J4G4Y54K0,K6ZQZSLPNL,41,3
6J4G4Y54K0,7NUCX72VID,295,4
6J4G4Y54K0,EAQ4S628IP,1000,4
WQPQA3Y41R,S85609YCFD,154,4
WQPQA3Y41R,NEBNQU74NS,633,5
WQPQA3Y41R,03JMNWB5SL,209,4
.......................

Rest of the Story

These correlation values could be fed into rating prediction map reduce. You could follow this tutorial to run the remaining map reduce jobs to get the top k recommendations for a given user. The only change necessary is to use the class org.sifarish.social.PearsonCorrelator for the first map reduce in the tutorial. This earlier post of mine provides details of the sequence of map reduce jobs for the complete solution.


About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Collaborative Filtering, Hadoop and Map Reduce, Predictive Analytic, Recommendation Engine.

