All my earlier posts on recommendation systems focused on the so called content based recommendation. These systems rely on finding similarities between the attributes of entities e.g., between products.
However, as user engagement data becomes available, they are used in a class of algorithms called collaborative filtering. I like calling them social recommendation algorithms. The basic idea is very simple and intuitive. If an user has shown interest in item A and a second user has shown interest in A and B, then it’s very likely that the first user may also be interested in item B. I have started implemented these algorithms for my open source project sifarish.
Level of user engagement can be estimated based on the following user actions. Typically, explicit rating by user is not available and other engagement events as below are used to estimate user’s interest or engagement level.
- Time spent on a product page
- Item added to wish list
- Item added to shopping cart
- Item purchased
All the social algorithms take an utility matrix as input. It contains data on user engagement for different items
A row represents an item and it’s associated ratings. A columns represents an user and all the ratings for that user.
The utility matrix is very sparse, because typically a user will engage with a small fraction of the items available in the inventory.
There is a duality about the utility matrix. The utility matrix can be used either to find similar items or similar users. I am using item based similarity which has been found to produce better results.
There are various social algorithms in use. I will do a quick overview of them. So far I have implemented Cosine distance, Jaccard distance and Mean rating correlation algorithms in sifarish. In this post my focus is on the cosine distance algorithm. I am not planning to implement association rule mining, at least not for now.
|Cosine distance||Based on cosine distance between items vectors in the utility matrix. Appropriate when rating is numeric|
|Jaccard distance||Based on Jaccard distance between items vectors in the utility matrix. Appropriate when rating is boolean|
|Average rating||Based on correlation of average rating of items, a.k.a slope one recommender|
|Pearson Correlation||Based on Pearson correlation between items. More rigorous correlation calculation.|
|Association Rule||Association rule mining, a.k.a frequent item set analysis as used in market basket analysis.|
Each row of the utility matrix represents an item. The values of the row vector are the ratings by different user. Essentially we find the cosine distance or jaccard between each pair of item vectors. Since these vector are sparse, they are effective means of finding similarities between the vectors.
If an item is rated 2 and 4 by two users and 3 and 5 for another item by the same two users as in the utility matrix above , we have the two item vector as (2, 4) and (3.5). The cosine distance between the two vectors will be 1. The similarity between two items don’t depend on the absolute value of the rating, but on how the rating differs between the 2 items.
I am starting with some simple eCommerce order data, consisting of orderID, userID, itemID and quantity. I run a MR called Projection on it to generate, for a given item all the users that have purchased the items. Here is some sample output.
0D1Q5PEB71,7GAMPEHB9EOG,HMB450ON513A 0E8FAPF91N,73DX1OCNCUML,574MWIVH7QRV,UMIDIYR4EF83,BLU6TITE1DU8,EJMWAP30A4B6,2ZQ5N83NYQV5,8WCXUGKL0BE6,ILLIQTKY8M7Z 0FAUQVKI8P,DJFLC7XQH0IJ,CPACJA9KH17T,2FEEADZ7ZHRL,3YBXWNJ2719F,JJG38F2U7YZG,YGDWAAATCK2M,8J49UZN70OLE,RM2GDXHVVG29,TL6N9DVI5RVM,LZO6T6GZU9CB,1KAHYT4STFP4,UPGH8BBE8FBC
The first field is the itemID, followed by list userIDs. For a given user, we have list of all the items purchased by the user. What I have is a boolean utility matrix. If I was using numerical rating value, the userID would have been followed by a rating separated by a colon as 574MWIVH7QRV:3.
The output from Projection is fed into ItemDynamicAttributeSimilarity. The map reduce class ItemDynamicAttributeSimilarity in sifarish takes every possible pair of items vectors and calculates Cosine similarity or Jaccard similarity. It uses the partitioned hashing algorithm as in other similarity calculating map reduce implementations in sifarish.
The output of this MR is shown below. The first two columns are item IDs. The last column is the similarity between the two items which is (1 – cosineDist) * 1000. I have used Cosine distance as the distance algorithm. The distance algorithm can be chosen through a configuration parameters.
0KP7QY3FFV,VKK38HZWM6,928 0KP7QY3FFV,FE70S2U68S,1000 0KP7QY3FFV,KO10PQ5ELA,1000 0KP7QY3FFV,9ICAJ58SN1,1000
The other map reduce TopMatches has been used in all the content based recommendation implementation in sifarish. For any item it finds the top n matches. It processes the output of ItemDynamicAttributeSimilarity to generate the following sample output.
0D1Q5PEB71,IJ9HBHIM6X,786 0D1Q5PEB71,RKB9DUV91L,803 0D1Q5PEB71,EQD4N0KW3F,823 0E8FAPF91N,2TURYTEVPX,823 0E8FAPF91N,6IKJO6ZWUX,866 0E8FAPF91N,PXF73TYJWH,866
It shows the similar items for a given item e.g., for item with itemID 0E8FAPF91N. They are ranked by degree of similarity. If an user has engaged with an item in the first column, then the items in the second column are potential candidates for recommendation for that user.
If we have the rating for 0E8FAPF91N for an user, we can compute the predicted rating for all the candidate items. However, that’s not the end of the story, as you will see in the next section.
As per the result in the last section, the item 2TURYTEVPX is a candidate recommendation based on the item 0E8FAPF91N. However, 2TURYTEVPX could also have been a candidate based on another item 1HGT3K5N02 that the use may have been engaged with and for which rating is available.
So we need to aggregate the the impact of 0E8FAPF91N and 1HGT3K5N02 on 2TURYTEVPX to compute the final predicted rating for 2TURYTEVPX. There are some interesting ways of doing that and it will be the topic of a future post.
User Rating Support
It may be desirable to exclude users who are not sufficiently engaged i.e the number of items that an user has been engaged with is below some percentage of the total number of items. This threshold is know as support in the data mining world. This feature is not supported in sifarish right now. I might add it in future.
Social Network Angle
An user’s social network can be overlaid on the utility matrix to create a personalized utility matrix, which could be used for recommendations. Essentially, user’s rating in the utility matrix could be weighted down by some radial function based on the degree separation in the social graph. Although it can be done, the question is whether it makes sense.
There are some caveats though, which makes me wonder whether it’s useful to bring the social graph in the picture.
For example, my social graph in linkedin is characterized by my interest in certain technologies or certain kinds of businesses. There will be very little value in overlaying my linked in social graph for making recommendations for songs and movies.