From Item Correlation to Rating Prediction

The ultimate goal of any recommendation engine is to predict ratings for items that a user has not engaged with so far. The set of items to recommend is then based on the predicted ratings. There are two main aspects to the processing logic of any recommendation engine. The first is to find the correlation between items, which could be content based or based on user engagement data. The second is to predict the ratings based on the correlation data from the first step and the available ratings for a user.

Most of my posts in the past have focused on the correlation part of recommendation. In this post, I will delve into the rating prediction aspect. I will be making references to my Hadoop based open source recommendation engine implementation, sifarish, on github.

Prediction Logic

The predicted rating of an item j based on the rating of item i by user u is as follows. The function relating the known rating and the correlation depends on the type of correlation. For cosine distance or Pearson correlation, it's multiplication.

rating(u,i, j) = fun(rating(u,i), corr(i,j))
rating(u,i,j) is the predicted rating of item j based on the rating of item i by user u
rating(u,i) is the known rating of an item i by user u
corr(i,j) is the correlation between item i and j

The final predicted rating is found by aggregating over all items that a user has rated, where those items are correlated to the target item. It's as follows

rating(u,j) = aggr(rating(u,i,j))
aggr applies some aggregate function over all items i rated by user u

The aggr function is typically an average of the ratings weighted by the correlation value, although it could be a median or maximum instead. We can think of the processing as a two layer network, where the first layer contains all the items rated by a user and the second layer contains all the items whose ratings need to be predicted. The edges between the nodes in the two layers represent the correlation between the items in question.
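The correlation weighted average described above can be sketched as follows. This is a minimal illustration, not sifarish's actual implementation; the function and variable names are my own.

```python
from typing import Dict

def predict_rating(known: Dict[str, float], corr: Dict[str, float]) -> float:
    """Predict a target item's rating from a user's known ratings,
    weighting each known rating by that item's correlation to the target."""
    num = 0.0
    den = 0.0
    for item, rating in known.items():
        c = corr.get(item, 0.0)
        num += rating * c
        den += abs(c)
    return num / den if den > 0 else 0.0

# User has rated items i1 and i2; both correlate with the target item j.
known = {"i1": 4.0, "i2": 2.0}
corr_with_j = {"i1": 0.8, "i2": 0.2}
# (4*0.8 + 2*0.2) / (0.8 + 0.2) = 3.6
print(round(predict_rating(known, corr_with_j), 2))  # 3.6
```

Swapping the weighted average for a median or maximum over the per-item contributions would give the alternative aggr functions mentioned above.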

Rating Matrix

I will be using correlation based on user engagement data (aka collaborative filtering). Essentially the rating matrix is a matrix of items and users. The rows are for the items and the columns for the users. The cell values are the ratings.

In case ratings are not explicitly provided by users, they can be derived from user engagement signals by applying some heuristics. I discussed this in an earlier post. As discussed there, implicit ratings are estimated based on the affinity of a user towards an item, as indicated by different engagement events, e.g., browsing an item from a search result, placing an item in the shopping cart etc.

Typically the matrix is very sparse, because a user engages with only a very small percentage of the items available. The goal of a recommendation engine is to fill in the blanks in the rating matrix. That's another way of looking at recommendation engines.
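A sparse rating matrix is naturally represented by storing only the cells that have values. A minimal sketch, with made-up item and user IDs:

```python
# Sparse rating matrix: one dict per item (row), keyed by user ID (column).
# Unrated cells are simply absent; these are the blanks the
# recommendation engine tries to fill in.
rating_matrix = {
    "item1": {"u1": 4, "u3": 5},
    "item2": {"u1": 3, "u2": 2, "u4": 5},
    "item3": {"u2": 1},
}

def rating(item: str, user: str):
    """Return the known rating, or None for a blank cell."""
    return rating_matrix.get(item, {}).get(user)

print(rating("item1", "u1"))  # 4
print(rating("item1", "u2"))  # None -> a blank to be predicted
```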

Item Correlation

We want to find the correlation between items, i.e., rows in the rating matrix. Although correlation between users is another approach, item based correlation has been found to be more reliable. Multiple correlation algorithms are available in sifarish. They are as follows

Jaccard coefficient: appropriate for boolean values; uses the intersection and union of the two sets
Cosine distance: the cosine distance between the two rating vectors
Average difference: takes the averages of the two sets and then their difference
Pearson correlation: considers the covariance and the standard deviations of the two sets
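For the cosine case used in the rest of this post, the computation on two sparse rating rows can be sketched as below. This is an illustrative version, not sifarish's code; note that it also returns the number of common users, which shows up later as the correlation weight.

```python
from math import sqrt

def cosine_correlation(a: dict, b: dict):
    """Cosine similarity between two sparse rating rows (user -> rating).
    Also returns the count of common users, later used as the
    correlation weight."""
    common = set(a) & set(b)
    dot = sum(a[u] * b[u] for u in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0, 0
    return dot / (norm_a * norm_b), len(common)

# Only u1 rated both items, so the intersection length is 1.
sim, weight = cosine_correlation({"u1": 4, "u2": 5}, {"u1": 4, "u3": 1})
print(round(sim, 3), weight)  # 0.606 1
```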

I will be using cosine distance based correlation in this post. Here is some sample input rating matrix data. Each row contains an item ID, followed by a variable number of user ID and rating pairs.


This data is consumed by the MR class ItemDynamicAttributeSimilarity. This is a general purpose map reduce class that finds the similarity between entities with a variable number of attributes, where the attributes are all of the same type. Rating matrix rows can be thought of as entities whose attributes are the ratings by users.
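To make the row format concrete, here is a sketch of parsing one such row. The line below is hypothetical and the delimiters are my assumption for illustration; sifarish's actual field delimiters are set in its configuration.

```python
def parse_row(line: str):
    """Parse an 'itemID,user:rating,user:rating,...' row into
    an item ID and a sparse user -> rating dict."""
    fields = line.split(",")
    item_id = fields[0]
    ratings = {}
    for pair in fields[1:]:
        user, rating = pair.split(":")
        ratings[user] = int(rating)
    return item_id, ratings

# Hypothetical input line: item ID, then variable user:rating pairs.
line = "item123,u1:4,u7:2,u9:5"
print(parse_row(line))  # ('item123', {'u1': 4, 'u7': 2, 'u9': 5})
```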

Here is the output from the correlation map reduce class. The output consists of two item IDs followed by their correlation value.


The last field is the number of intersecting elements found when computing the correlation between the two item vectors. As we will see later, this gets used in the next MR job. Since the rating matrix is very sparse, we find many item pairs with a correlation value of 0.

Correlated Rating

This MR, implemented by the UtilityPredictor class, takes the rating matrix and the item correlation output of the earlier MR. It uses secondary sorting, and the mapper output is keyed by item ID.

The correlated rating depends on the known rating of an item and the correlation between that item and the correlated item. There is a parameter called correlation.modifier through which the impact of the correlation can be controlled. Its value is around 1.0. If it's greater than 1, it has the effect of reducing the impact of low correlation values. In other words, an item that is poorly correlated with the target item will tend to be filtered out. Choosing a value less than 1.0 has an equalizing effect on the correlation values.

effectCorr = pow(corr, modifier)

Here is some sample output of this MR.


The output fields are

  1. user ID
  2. item ID
  3. predicted rating
  4. correlation weight
  5. correlation value

You may be wondering about some of the additional fields in the output. They are used in the next map reduce job, which does an aggregation over the known ratings by a user and predicts the final rating.

We can set minimum thresholds for the input rating and the item correlation coefficient through configuration parameters. These parameters allow items with low input ratings and poorly correlated items to be filtered out.

Predicted Rating

The final rating is predicted by the MR class UtilityAggregator. It takes the output of the previous MR and predicts the rating. Generally a weighted average based on the correlation value is used.

The final predicted rating could also be weighted by the correlation weight, which is the intersection length between the two item vectors. It reflects the quality of the correlation. We could have a high correlation value even if only a few users have rated the two items in question but those users happen to be in agreement. In that case the correlation value will be high, but the correlation weight will be small.

Correlation length based weighted averaging is controlled through the configuration parameter corr.length.weighted.average.
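The aggregation step with intersection length as the weight can be sketched as follows. This is an illustration of the idea, not the UtilityAggregator code; the names are mine.

```python
from typing import List, Tuple

def aggregate(predictions: List[Tuple[float, int]]) -> float:
    """Weighted average of correlated ratings for one (user, item) pair,
    where each prediction carries the intersection length of the two
    item vectors as its weight (a proxy for correlation quality)."""
    num = sum(rating * weight for rating, weight in predictions)
    den = sum(weight for _, weight in predictions)
    return num / den if den else 0.0

# Two correlated ratings: one backed by 10 common users, one by only 2.
preds = [(4.0, 10), (2.0, 2)]
# (4*10 + 2*2) / (10 + 2) = 44/12
print(round(aggregate(preds), 2))  # 3.67
```

The well supported rating dominates the result, which is the point of weighting by intersection length.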

The input rating standard deviation could also be used for weighted averaging during aggregation. A smaller standard deviation implies more unanimity among the users who have rated the item. So an item with a lower standard deviation could carry more weight in the rating aggregation.

This behavior is controlled through the configuration parameter input.rating.stdDev.weighted.average. To use this feature, an additional MR job needs to be executed with the rating matrix as the input.
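One way to turn the standard deviation into a weight is an inverse relationship, so that unanimous items count more. The 1/(eps + stddev) form below is my assumption for illustration; sifarish's exact weighting formula may differ.

```python
from statistics import pstdev

def stddev_weight(ratings, eps: float = 0.1) -> float:
    # Items whose raters agree (low standard deviation) get a larger
    # weight; eps avoids division by zero for unanimous ratings.
    # This inverse form is an assumption, not sifarish's formula.
    return 1.0 / (eps + pstdev(ratings))

print(round(stddev_weight([4, 4, 4]), 2))  # unanimous -> large weight
print(round(stddev_weight([1, 4, 5]), 2))  # divided opinions -> small weight
```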

Here is some sample output


The output fields are

  1. user ID
  2. item ID
  3. predicted rating
  4. count of known ratings contributing to the prediction

This brings us to the end. All that remains is to pick the top n items with the highest predicted ratings to make the final recommendation.
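The top n selection over the predicted ratings for a user is a one-liner with a heap. A minimal sketch with made-up item IDs:

```python
import heapq

def top_n(predicted: dict, n: int = 3):
    """Pick the n items with the highest predicted rating for a user."""
    return heapq.nlargest(n, predicted.items(), key=lambda kv: kv[1])

predicted = {"i1": 3.6, "i2": 4.8, "i3": 2.1, "i4": 4.1}
print(top_n(predicted, 2))  # [('i2', 4.8), ('i4', 4.1)]
```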

Summing Up

Irrespective of how the item correlations are computed, they all go through the MR jobs discussed here for the final rating prediction. There are a number of configuration parameters to control the behavior of the MR jobs. Details of the complete workflow can be found in this tutorial document.

I am working on introducing diversity and novelty in sifarish, so that there is an element of unexpectedness and serendipity in the recommendation. I will come back and blog about it after I have made some progress.

For commercial support for this solution or other solutions in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on the cloud, including installation, configuration and testing.


About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains, for early stage startups, large corporations and everything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have a neat closed form solution.
This entry was posted in Big Data, Data Mining, Hadoop and Map Reduce, Recommendation Engine. Bookmark the permalink.

