Recommendation Engine Powered by Hadoop (Part 2)

In Part 1 of this post, the focus was on finding the correlation between items, based on the rating data available for individual items. The MR job output was the correlation coefficient matrix, with a correlation coefficient value between 0 and 1 for each item pair.

Next Step

Armed with the item correlation data and a visitor's item rating data, we will find the new items correlated with the items the visitor has already shown interest in. Those are the items we will recommend to the visitor.

This final phase of the processing is done by another MR job. We are using two MR jobs in tandem: the output of the first MR job, the item correlation data, is fed into the second MR job, which is the topic of this post. We run this job for a given visitor, with the visitor's item rating data as input.

The full implementation of my Hadoop-driven recommendation engine sifarish is available on GitHub.


This is a sample mapper input. In each row, we have the product ids of two items followed by their correlation coefficient.

Mapper input
Item one Item two Correlation Coefficient
5gtr63t5 gh7e5ja6 .67
3dy456sh 7e65gft8 .21

We have one row for each product id pair. In reality, not every possible item pair will appear: an item pair that no visitor has shown mutual interest in will not appear in the output of the first MR job. You can take a look at the reducer code of the first MR job in the earlier post to verify this.

We also need the rating data for the individual visitor as input. Since this data is needed by all the mapper tasks, it cannot be sent as regular mapper input; instead, we distribute it as what is known as "side data".

Side data files are distributed by the Hadoop framework to the distributed cache of the Hadoop nodes. Here is some sample rating data for a visitor:

Item rating
Item Rating
4ud863t5 2.7
3d8d7tsh 3.1

This data will be in a simple text file distributed through the Hadoop command line option -files. The file ends up in the distributed cache of each Hadoop node, where the mappers and reducers can access it locally as a native file system file.
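The parsing of the side data file is not shown in the post. As a minimal sketch, assuming the file has one item id and rating per line separated by whitespace (the format of the sample above), a hypothetical `parseRating` helper might look like this:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class RatingFileParser {
    //parses the side data file: one "itemId rating" pair per line,
    //separated by whitespace (format assumed from the sample above)
    public static Map<String, Double> parseRating(String path) throws IOException {
        Map<String, Double> ratings = new HashMap<String, Double>();
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue;
                }
                String[] tokens = line.split("\\s+");
                ratings.put(tokens[0], Double.parseDouble(tokens[1]));
            }
        } finally {
            reader.close();
        }
        return ratings;
    }
}
```

Since the file sits in the distributed cache, the path passed in can simply be the local file name.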

The mapper implementation is as follows.

For every item correlated with the visitor's item set, it outputs the item id as the key, with the correlation coefficient and the corresponding correlated rating as the value.

import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecommendationMapper extends Mapper<LongWritable, Text, Text, Text> {
  private Text keyHolder = new Text();
  private Text valueHolder = new Text();
  private Map<String, Double> visitorRatings = new HashMap<String, Double>();

  protected void setup(Context context) throws IOException {
    //parse side data, distributed through the -files option
    String ratingFile = "rating.txt";
    visitorRatings = parseRating(ratingFile);
  }

  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    //parse comma separated tokens
    List<String> corrData = parseInput(value);

    String otherItem = null;
    double corrCoeff = 0;
    double corrRating = 0;
    Double rating = visitorRatings.get(corrData.get(0));
    if (null != rating) {
      //first item in the pair is an item with visitor rating
      otherItem = corrData.get(1);
      corrCoeff = Double.parseDouble(corrData.get(2));
      corrRating = rating * corrCoeff;
    } else {
      rating = visitorRatings.get(corrData.get(1));
      if (null != rating) {
        //second item in the pair is an item with visitor rating
        otherItem = corrData.get(0);
        corrCoeff = Double.parseDouble(corrData.get(2));
        corrRating = rating * corrCoeff;
      }
    }

    //emit only if one of the items in the pair belongs to the set of items
    //with rating data for the visitor
    if (null != otherItem) {
      //keyed by correlated item
      keyHolder.set(otherItem);

      //comma separated string with correlation coeff and correlated rating
      String valueSt = concatenate(corrCoeff, corrRating);
      valueHolder.set(valueSt);
      context.write(keyHolder, valueHolder);
    }
  }
}

For any item, all the correlated ratings are accumulated by the shuffle and go as input to the reducer, which we discuss next.


The reducer input looks as follows. For each correlated item, there is a list of correlation coefficient and correlated rating pairs.

Reducer input
Item Corr coeff and Corr rating list
4ud863t5 [.63,2.46][.43,2.83]
3d8d7tsh [.38,0.89][.73,3.18][.52,2.11]
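Using the first row above, the weighted rating computed by the reducer works out as shown below. This is just a quick arithmetic check on the sample data, not part of the MR code:

```java
public class WeightedRatingExample {
    //sums the correlation coefficients and correlated ratings and returns
    //the weighted rating, exactly as the reducer does
    public static double weightedRate(double[] corrCoeffs, double[] corrRatings) {
        double sumCorrCoeff = 0;
        double sumCorrRate = 0;
        for (int i = 0; i < corrCoeffs.length; ++i) {
            sumCorrCoeff += corrCoeffs[i];
            sumCorrRate += corrRatings[i];
        }
        return sumCorrRate / sumCorrCoeff;
    }

    public static void main(String[] args) {
        //values for item 4ud863t5 from the sample reducer input above
        double rate = weightedRate(new double[]{0.63, 0.43}, new double[]{2.46, 2.83});
        System.out.println(rate);   //(2.46 + 2.83) / (0.63 + 0.43), roughly 4.99
    }
}
```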

The reducer iterates through the correlation coefficient and correlated rating pairs, sums them and finally computes the weighted rating.

The implementation is as follows.

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class RecommendationReducer extends Reducer<Text, Text, NullWritable, Text> {
  private Text valueHolder = new Text();

  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    //iterate through the correlation coeff and correlated rating pairs
    double sumCorrCoeff = 0;
    double sumCorrRate = 0;
    for (Text value : values) {
      //string has correlation coeff and correlated rating
      List<Double> fields = parseValue(value.toString());
      sumCorrCoeff += fields.get(0);
      sumCorrRate += fields.get(1);
    }

    //weighted rating: sum of correlated ratings divided by sum of corr coefficients
    double weightedRate = sumCorrRate / sumCorrCoeff;
    String valueSt = concatenate(key.toString(), weightedRate);
    valueHolder.set(valueSt);
    context.write(NullWritable.get(), valueHolder);
  }
}
The reducer output is the list of items with their corresponding weighted ratings. All that is left is to sort the items by weighted rating, and there we have it: the recommendation list for a visitor, ordered by rating. Typically we want to show only the top few recommendations.
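The final sorting step happens outside the MR jobs described in this post. A minimal sketch, assuming the reducer output has been loaded into a map from item id to weighted rating, could look like this:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class TopRecommendations {
    //returns the top n item ids, ordered by weighted rating, highest first
    public static List<String> topN(Map<String, Double> weightedRatings, int n) {
        List<Map.Entry<String, Double>> entries =
            new ArrayList<Map.Entry<String, Double>>(weightedRatings.entrySet());
        //sort descending by weighted rating
        Collections.sort(entries, new Comparator<Map.Entry<String, Double>>() {
            public int compare(Map.Entry<String, Double> a, Map.Entry<String, Double> b) {
                return b.getValue().compareTo(a.getValue());
            }
        });
        List<String> top = new ArrayList<String>();
        for (Map.Entry<String, Double> entry : entries.subList(0, Math.min(n, entries.size()))) {
            top.add(entry.getKey());
        }
        return top;
    }
}
```

Since the recommendation list per visitor is small, this sort can comfortably run on a single machine rather than in another MR job.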

Further Thoughts

We are at the end of our journey through the land of recommendation engines. There are some issues that require further thought.

One issue has to do with the temporal nature of correlation data. As users interact with products or purchase them, the underlying rating data changes, which in turn affects the correlation data. The correlation data needs to be kept accurate and up to date, so that recommendations reflect recent user behavior.

How frequently should the correlation matrix be recalculated? It depends on how busy your web site is. The busier the site and the more user activity you have, the more frequently the correlation matrix needs to be recalculated.

Cold start problem

The other issue is what is usually described as the "cold start" problem, which relates to new users and new items.

When a new user arrives at the site without an activity history, we have no rating data for the user and cannot make any recommendations until we have tracked some user activity and built up rating data.

We face the same issue when a new item is introduced, for which no user interaction data is available. We won't be able to recommend the item until some user interaction data becomes available for it. This problem is somewhat easier than the new user problem: it is possible to leverage product category and ontology data to relate a newly introduced item to existing items for which rating data is available, and come up with recommendations.
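As a sketch of the category based idea, assuming we have a map from item to category and the weighted ratings of existing items, a new item could borrow the average weighted rating of the rated items in its category. This is a hypothetical illustration, not part of sifarish:

```java
import java.util.Map;

public class ColdStartFallback {
    //estimates a rating for a new item as the average weighted rating of
    //already-rated items sharing its category (hypothetical helper)
    public static double estimateRating(String newItem,
                                        Map<String, String> itemCategory,
                                        Map<String, Double> weightedRatings) {
        String category = itemCategory.get(newItem);
        double sum = 0;
        int count = 0;
        for (Map.Entry<String, Double> entry : weightedRatings.entrySet()) {
            if (category != null && category.equals(itemCategory.get(entry.getKey()))) {
                sum += entry.getValue();
                ++count;
            }
        }
        //fall back to 0 when no rated item shares the category
        return count > 0 ? sum / count : 0;
    }
}
```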

Memory based vs model based

I have used a simple and commonly used solution for recommendation engine in this post. There are many other more complex solutions available and it’s an active area of research.

My approach is based on what is known as the memory based approach: we essentially crunch through all the data to find a specific solution for a given data set. In comparison, model based approaches try to learn from the data to discover models and patterns. These approaches are more holistic in nature and have their roots in data mining and machine learning. You can learn more here, if interested.


For the implementation, please refer to this post for the map reduce workflow for the complete solution and this one for the Pearson correlation map reduce. The final implementation differs from the solution proposed here, because I had to accommodate multiple correlation algorithms, of which Pearson correlation is one.

Please refer to this tutorial for the complete workflow involving all the MR jobs. Some of the MR jobs in the workflow are optional.


About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains, for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have a neat closed form solution.

10 Responses to Recommendation Engine Powered by Hadoop (Part 2)

  1. Pingback: Recommendation Engine Powered By Hadoop « Another Word For It

  2. manishdunani says:

    which type of dataset it requires??
    Is it type of====>>>u_id,P_id,Ratings dataset?

  3. pkghosh says:

    That should work. I will be implementing this soon. You will be able to get more details from


  4. Sugandha says:

    Can you please tell me what was your database design?

  5. Pranab says:

    If your book data is already in Oracle, you could either 1) export it to HDFS using Sqoop, or 2) use an Oracle tool to export to CSV files and then use the hadoop file system command to copy the CSV files to HDFS.

    If using sifarish, you have to create a JSON file containing data schema. There are some examples in the resource directory

  6. Gowsalya says:

    Hello sir,
    I am a newbie to hadoop. I am doing my final year and have submitted an abstract about a cross recommender system (like books recommended for a movie) for my current project. But I couldn't find where to start with the project. There are lots of questions on my mind, like what would be the dataset, how to process it, how to correlate and so on. Could you please help by guiding me? Thank you.

  7. tek classes says:

    Article gives relevant information about Hadoop
