Socially Accepted Recommendation

All my earlier posts on recommendation  systems focused on the so called content based recommendation. These systems rely on finding similarities between the attributes of entities  e.g., between products.

They are useful to address the so called cold start problem, when sufficient user engagement data is not available.  However, they are limited in use e.g., you can not make cross sell recommendation with these algorithms.

However, as user engagement data becomes available, they are used in a class of algorithms called collaborative filtering. I like calling them social recommendation algorithms. The basic idea is very simple and intuitive. If an user has shown interest in item A and a second user has shown interest in A and B, then it’s very likely that the first user may also be interested in item B. I have started implemented these algorithms for my open source project sifarish.

User Engagement

Level of user engagement can be estimated based on the following user actions. Typically, explicit rating by user is not available and other engagement events as below are used to estimate user’s interest or engagement level.

  • Time spent on a product page
  • Item added to wish list
  • Item added to shopping cart
  • Item purchased

Utility Matrix

All the social algorithms take an utility matrix as input. It contains data on user engagement for different items

If there are m items and n users, an utility matrix is a m x n matrix. The value of a matrix element (i, j) represents the rating of an item i by an user j.

 A row represents an item and it’s associated ratings. A columns represents an user and all the ratings for that user.

user_1 user_2 user_3 user_4 user_n
item_1 2 4
item_2 3 5
item_n 6

The utility matrix is very sparse, because  typically a user will engage with a small fraction of the items available in the inventory.

There is a duality about the utility matrix. The utility matrix can be used either to find similar items or similar users. I am using item based similarity which has been found to produce better results.

Social Algorithms

There are various social algorithms in use. I will do a quick overview of them. So far I have implemented  Cosine distance, Jaccard distance  and Mean rating correlation algorithms in sifarish. In this post my focus is on the cosine distance algorithm.  I am not planning to implement association rule mining, at least not for now.

Algorithm Description
Cosine distance Based on cosine distance between items vectors in the utility matrix. Appropriate when rating is numeric
Jaccard distance Based on Jaccard distance between items vectors in the utility matrix. Appropriate when rating is boolean
Average rating Based on correlation of average rating of items, a.k.a slope one recommender
Pearson Correlation Based on Pearson correlation between items. More rigorous correlation calculation.
Association Rule Association rule mining, a.k.a  frequent item set analysis as used in market basket analysis.

Vector Distance

Each row of the utility matrix represents an item. The values of the row vector are the ratings by different user. Essentially we find the cosine distance or jaccard between each pair of item vectors. Since these vector are sparse, they are effective means of finding similarities between the vectors.

Cosine distance is the cosine of the angle between two vectors. When the vectors  are parallel, the distance is 1 which corresponds to  maximum similarity. Jaccard distance is the ratio of the number matches to number of non matches between the vectors.

If an item is rated 2 and 4 by two users and 3 and 5 for another item  by the same two users as in the utility matrix above , we have the two item vector as (2, 4) and (3.5).  The cosine distance between the two vectors will be 1. The similarity between  two items don’t  depend on the absolute value of the rating, but on how the rating differs between the 2 items.

Map Reduce

I am starting with some simple eCommerce order data, consisting of orderID, userID, itemID and quantity. I run a MR called Projection on it to generate, for a given item all the users that have purchased the items. Here is some sample output.


The first field is the itemID, followed by list userIDs. For a given user, we have list of all the items purchased by the user. What I have is a boolean utility matrix. If I was using numerical rating value, the userID would have been followed by a rating  separated by a colon as  574MWIVH7QRV:3.

The output from Projection is fed into ItemDynamicAttributeSimilarity.   The map reduce class ItemDynamicAttributeSimilarity in sifarish takes every possible pair of items vectors and calculates Cosine similarity or Jaccard similarity. It uses the partitioned hashing algorithm as in other similarity calculating map reduce implementations in sifarish.

The output of this MR is shown below. The first two columns are item IDs. The last column is the similarity between the two items which is (1 – cosineDist) * 1000.  I have used Cosine distance as the distance algorithm. The distance algorithm can be chosen through a configuration parameters.


The other map reduce TopMatches has been used in all the content based recommendation implementation in sifarish. For any item it finds the top n matches.  It processes the output of ItemDynamicAttributeSimilarity to generate the following sample output.


It shows the similar items for a given item e.g., for item with itemID 0E8FAPF91N. They are ranked by degree of similarity. If an user has engaged with an item  in the first column, then the items in the second column are potential candidates for recommendation for that user.

If we have the rating for 0E8FAPF91N for an user, we can compute the predicted rating for all the candidate items. However, that’s not the end of the story, as you will see in the next section.

Rating Prediction

As per the result in the last section, the item 2TURYTEVPX is a candidate recommendation based on the item 0E8FAPF91N. However, 2TURYTEVPX could also have been a candidate based on another item 1HGT3K5N02 that the use may have been engaged with and for which rating is available.

So we need to aggregate the the impact of 0E8FAPF91N and 1HGT3K5N02 on 2TURYTEVPX to compute the final predicted rating for  2TURYTEVPX. There are some interesting ways of doing that and it will be the topic of a future post.

User Rating Support

It may be desirable to exclude users who are not sufficiently engaged i.e the number of items that an user has been engaged with is below some percentage of the total number of items. This threshold is know as support in the data mining world. This feature is not supported in sifarish right now. I might add it in future.

Social Network Angle

An user’s social network can be overlaid on the   utility matrix to create a personalized utility matrix, which could be used for recommendations.  Essentially, user’s rating in the utility matrix could be weighted down by some radial function based on the degree separation in the social graph. Although it can be done, the question is whether it makes sense.

There are some caveats though, which makes me wonder whether it’s useful to bring the social graph in the picture.

We humans are multifaceted. Only some of these facets are manifested in a social network. Those facets may be irrelevant in the context of recommendation of some products and services.

For example, my social graph in linkedin is characterized by my interest in certain technologies or certain kinds of businesses. There will be very little value in overlaying my linked in social graph for making recommendations for songs and movies.

For commercial support for this solution or other solutions in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud including installation, configuration and testing,


About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Collaborative Filtering, Hadoop and Map Reduce, Predictive Analytic and tagged , , . Bookmark the permalink.

One Response to Socially Accepted Recommendation

  1. Pingback: Making Recommendations in Real Time | Mawazo

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s