In one of my earlier posts, I discussed about using Pearson correlation for making social recommendation. In this post we will delve deeper into it including the Hadoop map reduce implementation. There are many correlation techniques, including cosine distance, slope one etc. These are already implemented in sifarish. The latest addition to the arsenal of correlation techniques in sifarish is Pearson correlation.
One advantage that Pearson correlation over other techniques is that it can handle bias in ratings for an user. For example, user A may consistently give better rating compared to user B. However, there may be good correlation between ratings of users A and B if the difference between their ratings is consistent.
Social Recommendation Basics
As you may recall from my earlier posts, there are three main steps in social recommendations solutions, specifically for item based recommendation
- Find correlation between two items based on ratings by a common set of users that have rated both items. This step is repeated for all possible item pairs.
- For an item A that an user has rated and and Item B that the user has not rated, find the predicted rating for item B, based on rating of item A and the correlation between item A and B (from step 1)
- Aggregate all the predicted ratings for an item, based on all items that have been rated by an user and are correlated to the target item. This is an aggregation of the result from step 2. The aggregation function could be average, median, maximum etc.
- Repeat steps 2 and 3 for all items that have not been rated by the user. For a given user, we end up with predicted rating for all items not yest rated by the user.
As you may have realized, the correlation performed in step 1 is independent of the other steps in this process. So we have the option of using different correlation technique, including Pearson.
Pearson Correlation Map Reduce
The map reduce implementation is in the class PearsonCorrelator. As before the input consists of an item ID, followed by a list of userID and rating pair as below. If you data is not in this format, you need to get them into this format, preferably with a map reduce job.
ZAG2HF3A4A,LWLFD3A6SYT5:2,010MUJK6M1O2:2,53W34HFT4X4Q:4,R097F20QGSO9:2,PERT7XRM16IN:1 ..... 1N50L7AH40,5040F212AL7P:3,5IXSH6Y46OT3:1,HUJHCV7U1B1O:5,Y341MW2SUXTY:5,02030IV24RGC:1 ...... .................................
As in content based recommendation solution described in my earlier post, we use partitoned hashing to enumerate all possible pairs of items. We map the itemID into a hash bucket and then create all possible pairs of such buckets. The two hash buckets are combined using a mathematical function to generate the mapper output key. For the value we emit all the ratings along with the itemID. Deatils can be found in map() method implementation.
On the reducer side, there is one call to reducer for each pair of hash buckets. We also get the data values for both the buckets. However we need to segregate the two sets of values corresponding to the two buckets. Because we need to do a nested loop join the values from the two sets. We accomplish all these using secondary sorting. We preppend 0 to the key for one bucket and with 1 for the other bucket to implement secondary sorting.
For Pearson correlation between two items, we consider two sets of user ratings, such that the set of users is same between the two rating sets. Pearson correlation is function of the following
- Rating co variance between the two sets
- Rating standard deviation of the first set
- Rating standard deviation of the second set
The actual formula can be found in my earlier post. The function findCorrelation() computes the correlation value. The computation is only performed only when the intersection set of user ratings for some common set of users is above some threshold. This threshold is defined through the config param min.rating.intersection.set. The default value is 3.
Here is some sample output of the map reduce. it consists of two itemIDs followed by the correlation coefficient and correlation weight. The coefficient is in the range -1 to 1, where 1 implies best correlation and -1 the worst. We have normalized it to be in the range 0 to 1000. The normalization scale is specified through the config parameter correlation.scale.
The correlation weight is size of the intersection between the two sets of users that have rated the two items. It’s used by the rating predictor MR class UtilityPredictor, in addition the correlation coefficient.
64MWC36DFA,WQPQA3Y41R,621,3 64MWC36DFA,S85609YCFD,62,3 64MWC36DFA,NEBNQU74NS,55,4 64MWC36DFA,03JMNWB5SL,248,3 6J4G4Y54K0,K6ZQZSLPNL,41,3 6J4G4Y54K0,7NUCX72VID,295,4 6J4G4Y54K0,EAQ4S628IP,1000,4 WQPQA3Y41R,S85609YCFD,154,4 WQPQA3Y41R,NEBNQU74NS,633,5 WQPQA3Y41R,03JMNWB5SL,209,4 .......................
Rest of the Story
These correlation values could be fed into rating prediction map reduce. You could follow this tutorial to run the remaining map reduce jobs to get the top k recommendations for a given user. The only change necessary is to use the class org.sifarish.social.PearsonCorrelator for the first map reduce in the tutorial. This earlier post of mine provides details of the sequence of map reduce jobs for the complete solution.