The ultimate goal of any recommendation engine is predict rating for items that an user has not engaged with so far. The set of items to recommended is then based on the predicted ratings. There are two main aspects of the processing logic of any recommendation engine. The first goal is to find the correlation between items, which could be content based or based on the user engagement data. The final goal is to predict the ratings based on the correlation data from the first step and available ratings for an user.
Most of my posts in the past focused on the correlation part of recommendation. In this post, I will delve into rating prediction aspect of recommendation. I will be making reference to my Hadoop based open source recommendation engine implementation, sifarish on gibhub.
The predicted rating of an item j based on rating of item i by user u is as follows. The function relating know rating and correlation depends on the type of correlation. For Cosine distance or Pearson correlation, it’s multiplication.
The final predicted rating is found by aggregating over all items that an user has rated where all those items are correlated to the target item. It’s as follows
The aggr function is typically average of the ratings weighted by the correlation value. But it could be median or maximum instead of average. We can think of the processing as a two layer network, where the first layer contains all the items rated by an user and the second layer contains all the items whose ratings need to be predicted. The edges between the nodes in the first layer and the second layer represent the correlation between the items in question
I will be using correlation based on user engagement data (aka collaborating filtering). Essentially the rating matrix is a matrix of items and users. The rows are for the items and columns for the users. The cell values are the ratings.
In case ratings are not explicitly provided by users, they can be derived from user engagement signals by applying some heuristics. I had discussed this in an earlier post. As discussed there, implicit ratings are estimated, based on the affinity of an user towards an items as indicated by different engagement events e.g., browsed an item from search result, placed an item in shopping cart etc.
Typically the matrix is very sparse in nature, because an user engages with a very small percentage of items available. The goal of a recommendation engine is to fill in the blanks in the rating matrix. That’s another way of looking at recommendation engines.
We want to find correlation between items i.e. rows in the rating matrix. Although correlation of users is another approach, items based correlation has been found to be more reliable. Multiple correlation algorithms are available in sifarish. They are as follows
|Jaccard Coefficient||Appropriate for boolean values. Uses intersection and union of two sets|
|Cosine Distance||Cosine distance between two vectors|
|Average Difference||Takes the average of the two sets and then their difference|
|Pearson correlation||Considers covariance std deviation|
I will be using the Cosine Distance based correlation in this post. Here is some sample input rating matrix data. In each row we have item ID, followed by variable number of user ID and rating pairs.
KRF04EPAOK,2TLRE2ZZBL30:3,PESB0L620G27:5,OR2SC24DNYYQ:2,0HBS2M41OG0S:4,... 4E10WO9FPV,YE1FGW878H7J:4,D1S54NA44TQQ:3,0OZNT6A0GCCQ:5,A8G03VORW64Z:4,PFMXP8PH0O3G:3,.... 61N4NAAVWY,3WTFVGK4BQEJ:3,591O2NN2O7BO:5,K9H23EUZJNLL:2,J5688XZXJN1S:4,18C4HVTN242C:4,... RKHGE1DFPC,FA31X5H6MOD6:3,8UX3BL45ANX9:1,U7JBK6320IYB:3,... YB4A4A1X3O,P3Y4EFUU2O24:4,ZVSDBAZTE1UU:1,PFMXP8PH0O3G:5,3C2RUF3GYBY4:1,463W3042D42V:4,... .....
This data is consumed by the MR class ItemDynamicAttributeSimilarity. This is a general purpose map reduce class that finds similarities between entities with variable number of attributes where the attributes are of the same type. Rating matrix rows can be thought of as entities where the attributes are rating by users.
Here is the output from correlation map reduce class. The output consists of two items IDs followed by their correlation value.
B05D64CFKQ,T1RJEY9030,37,2 B05D64CFKQ,7KIW2CPG2I,0,0 B05D64CFKQ,R154QPBYMB,0,0 B05D64CFKQ,CIHI3LL27X,123,3 B05D64CFKQ,3233GCW0KZ,266,2 B05D64CFKQ,IHL1LNY1VR,0,0 .....
The last field is the number of intersecting elements found when computing the correlation between the two item vectors. As we will see later this gets used in the next MR job. Since the rating matrix is very sparse, we find many items pairs with correlation value of 0.
This MR implemented by the UtilityPredictor class takes the rating matrix and the item correlation which is the output of the earlier MR. It uses secondary sorting and the mapper output is keyed by item ID.
The correlated rating depends on the known rating of an item and correlation between the item and the correlated item. There is a parameter called correlation.modifier through which the impact of the correlation can be controlled. It’ value is around 1.0. If it’s greater than 1 it has the the effect of reducing the impact of low correlation values. In other words, an item that is poorly correlated to the target item, will tend to be filtered out. Choosing a value less than 1.0 has an equalizing effect of the correlation values.
Here is some sample output of this MR
EL8QWC5MV2AW,0X333SMLNB,34,2,58 EL8QWC5MV2AW,HMJ2YD7Z60,4,2,8 EL8QWC5MV2AW,3VNC83W40D,132,3,221 EL8QWC5MV2AW,CPQL8TJ0ZR,12,2,21 R3H0X6173D0L,26WVDB6260,22,2,28 R3H0X6173D0L,37ZVOZRFW1,22,3,28 ......
The output fields are
- user ID
- item ID
- predicted rating
- correlation weight
- correlation value
You may be wondering about some of the additional fields in the output. They are used in the next map reduce which does an aggregation over known ratings by an user and predicts the final rating.
We could set minimum threshold for input rating and items correlation coefficients through configuration parameters. These configuration parameters will allow items with low input rating and poorly correlated items to be filtered out.
The final rating predicted by the MR class UtilityAggregator. It takes the output of the previous MR and predicts rating. Generally a weighted average is used based on the correlation value.
Final predicted rating could be weighted by the correlation weight which is the intersection length between two item vectors. It reflects the quality of the correlation. We could have a high correlation value even if there are only few users rating the two items in question, but they happen to be an agreement resulting in high correlation value. But the correlation weight will be small.
Correlation length based weighted average is controlled through a configuration parameter corr.length.weighted.average .
The input rating standard deviation could be used for weighted averaging during aggregation. A smaller standard deviation implies more unanimity of opinions of users who have rated the item. So an item with lower standard deviation could have more weight towards the rating aggregation.
This is behavior controlled through a configuration parameter input.rating.stdDev.weighted.average. To use this feature, an additional MR job needs to be executed with the rating matrix as the input
Here is some sample output
049EC13GEGE0,GOU4QGOBWQ,600,5 049EC13GEGE0,H3ZAN0C44Q,599,4 049EC13GEGE0,HNAEY0QRLD,589,2 04N71XN2P5AZ,04M6LVLGDF,1000,3 04N71XN2P5AZ,0MOHXO0E24,634,6 04N71XN2P5AZ,0X333SMLNB,1000,4 04N71XN2P5AZ,16KZU3FK2H,748,3 ......
The output field are
- user ID
- item ID
- predicted rating
- count of known ratings contributing to the prediction
This brings us to the end. All that remains is to pick to the top n items with high predicted to make the final rating.
Irrespective of how items correlations are computed, they all go through the MR jobs discussed here for the final rating predictions. There are number of configurations parameters to control the behavior of the MR jobs. Details of the complete workflow can be found here in this tutorial document.
I am working on introducing diversity and novelty in sifarish, so that there is an element of unexpectedness and serendipity in the recommendation. I will come back and blog about it after I have made some progress.
For commercial support for this solution or other solutions in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud including installation, configuration and testing,