All my earlier posts on recommendation systems focused on so-called content-based recommendation. These systems rely on finding similarities between the attributes of entities, e.g., between products.

They are useful for addressing the so-called cold-start problem, when sufficient user engagement data is not available. However, they are limited in use; e.g., you cannot make cross-sell recommendations with these algorithms.

As user engagement data becomes available, it can be used by a class of algorithms called *collaborative filtering*. I like calling them social recommendation algorithms. The basic idea is simple and intuitive: if one user has shown interest in item A, and a second user has shown interest in both A and B, then it's very likely that the first user will also be interested in item B. I have started implementing these algorithms for my open source project sifarish.

## User Engagement

The level of user engagement can be estimated from the following user actions. Typically, an explicit rating by the user is not available, and engagement events like those below are used to estimate the user's interest or engagement level.

- Time spent on a product page
- Item added to wish list
- Item added to shopping cart
- Item purchased
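These events can be collapsed into a single implicit rating per user and item. A minimal sketch of one way to do that; the event names and weights here are illustrative assumptions, not sifarish's actual configuration:

```python
# Map implicit engagement events to a rating-like score.
# The event names and weights below are illustrative assumptions,
# not values used by sifarish.
EVENT_WEIGHTS = {
    "page_view": 1,   # time spent on a product page
    "wish_list": 2,   # item added to wish list
    "cart_add": 3,    # item added to shopping cart
    "purchase": 5,    # item purchased
}

def engagement_rating(events):
    """Estimate a user's interest in an item from its engagement
    events, taking the strongest signal observed."""
    return max((EVENT_WEIGHTS[e] for e in events), default=0)

print(engagement_rating(["page_view", "cart_add"]))  # 3
```

Taking the strongest event is just one choice; summing or time-decaying the events are equally plausible.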

## Utility Matrix

All the social algorithms take a utility matrix as input. It contains user engagement data for different items.

If there are m items and n users, the utility matrix is an m x n matrix. The value of matrix element (i, j) represents the rating of item i by user j.

A row represents an item and its associated ratings. A column represents a user and all the ratings by that user.

|        | user_1 | user_2 | user_3 | user_4 | user_n |
|--------|--------|--------|--------|--------|--------|
| item_1 | 2      | 4      |        |        |        |
| item_2 | 3      | 5      |        |        |        |
| item_n | 6      |        |        |        |        |

The utility matrix is very sparse, because a user typically engages with only a small fraction of the items in the inventory.

There is a duality to the utility matrix: it can be used either to find similar items or to find similar users. I am using item-based similarity, which has been found to produce better results.
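Because of the sparsity, a dictionary-of-dictionaries is a natural in-memory representation. A minimal sketch using the ratings from the table above (only item_1 and item_2 are shown):

```python
# Sparse representation of the utility matrix: only observed ratings
# are stored, one {user: rating} dict per item. Values follow the
# sample utility matrix in the post.
utility = {
    "item_1": {"user_1": 2, "user_2": 4},
    "item_2": {"user_1": 3, "user_2": 5},
}

# The duality: a row of `utility` is an item vector (item-based view);
# collecting one user's entries across rows gives a user vector
# (user-based view).
def user_column(utility, user):
    """Extract the column for one user from the item-keyed matrix."""
    return {item: ratings[user]
            for item, ratings in utility.items() if user in ratings}

print(user_column(utility, "user_1"))  # {'item_1': 2, 'item_2': 3}
```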

## Social Algorithms

There are various social algorithms in use; I will do a quick overview of them. So far I have implemented the Cosine distance, Jaccard distance, and Mean rating correlation algorithms in sifarish. In this post my focus is on the cosine distance algorithm. I am not planning to implement association rule mining, at least not for now.

| Algorithm | Description |
|-----------|-------------|
| Cosine distance | Based on the cosine distance between item vectors in the utility matrix. Appropriate when the rating is numeric. |
| Jaccard distance | Based on the Jaccard distance between item vectors in the utility matrix. Appropriate when the rating is boolean. |
| Average rating | Based on correlation of the average rating of items, a.k.a. the slope one recommender. |
| Pearson correlation | Based on the Pearson correlation between items. A more rigorous correlation calculation. |
| Association rule | Association rule mining, a.k.a. frequent item set analysis, as used in market basket analysis. |

## Vector Distance

Each row of the utility matrix represents an item. The values of the row vector are the ratings by different users. Essentially, we find the cosine or Jaccard distance between each pair of item vectors. Since these vectors are sparse, a sparse representation makes computing the similarities efficient.

Cosine distance, as used here, is the cosine of the angle between two vectors. When the vectors are parallel, the value is 1, which corresponds to maximum similarity. Jaccard similarity is the ratio of the number of elements the two vectors have in common to the number of elements in either vector, i.e., the size of the intersection divided by the size of the union.

If one item is rated 2 and 4 by two users, and another item is rated 3 and 5 by the same two users, as in the utility matrix above, we have the two item vectors (2, 4) and (3, 5). The cosine distance between the two vectors will be close to 1. The similarity between two items does not depend on the absolute values of the ratings, but on how the ratings differ between the two items.
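The worked example above can be checked with a few lines of code; this is a plain sketch of the formula, not sifarish's implementation:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two item rating vectors:
    dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# The two item vectors from the utility matrix above.
print(round(cosine_similarity((2, 4), (3, 5)), 3))   # 0.997, nearly parallel
# Scaling one vector does not change the cosine: absolute rating
# values do not matter, only the direction of the vector.
print(round(cosine_similarity((2, 4), (4, 8)), 6))   # 1.0
```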

## Map Reduce

I am starting with some simple eCommerce order data, consisting of *orderID*, *userID*, *itemID* and *quantity*. I run a MR called Projection on it to generate, for a given item, all the users that have purchased the item. Here is some sample output.

0D1Q5PEB71,7GAMPEHB9EOG,HMB450ON513A
0E8FAPF91N,73DX1OCNCUML,574MWIVH7QRV,UMIDIYR4EF83,BLU6TITE1DU8,EJMWAP30A4B6,2ZQ5N83NYQV5,8WCXUGKL0BE6,ILLIQTKY8M7Z
0FAUQVKI8P,DJFLC7XQH0IJ,CPACJA9KH17T,2FEEADZ7ZHRL,3YBXWNJ2719F,JJG38F2U7YZG,YGDWAAATCK2M,8J49UZN70OLE,RM2GDXHVVG29,TL6N9DVI5RVM,LZO6T6GZU9CB,1KAHYT4STFP4,UPGH8BBE8FBC

The first field is the *itemID*, followed by a list of *userIDs*. For a given item, we have the list of all the users who have purchased it. What I have is a boolean utility matrix. If I were using numerical rating values, each *userID* would have been followed by a rating separated by a colon, as in 574MWIVH7QRV:3.
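With a boolean utility matrix, each item reduces to the set of users who purchased it, and Jaccard similarity is the natural measure. A minimal sketch; the purchase data here is made up, merely in the shape of the Projection output:

```python
def jaccard_similarity(users_a, users_b):
    """Jaccard similarity between two items, each represented by the
    set of users who purchased it: |intersection| / |union|."""
    if not users_a and not users_b:
        return 0.0
    return len(users_a & users_b) / len(users_a | users_b)

# Hypothetical data in the shape of the Projection output:
# itemID -> set of userIDs (IDs are made up for illustration).
purchases = {
    "item_a": {"u1", "u2", "u3"},
    "item_b": {"u2", "u3", "u4"},
}
# Two users in common out of four distinct purchasers.
print(jaccard_similarity(purchases["item_a"], purchases["item_b"]))  # 0.5
```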

The output from Projection is fed into ItemDynamicAttributeSimilarity. This map reduce class in sifarish takes every possible pair of item vectors and calculates the Cosine or Jaccard similarity. It uses the partitioned hashing algorithm, as in the other similarity-calculating map reduce implementations in sifarish.

The output of this MR is shown below. The first two columns are item IDs. The last column is the similarity between the two items, which is *(1 – cosineDist) * 1000*. I have used Cosine distance as the distance algorithm. The distance algorithm can be chosen through a configuration parameter.

0KP7QY3FFV,VKK38HZWM6,928
0KP7QY3FFV,FE70S2U68S,1000
0KP7QY3FFV,KO10PQ5ELA,1000
0KP7QY3FFV,9ICAJ58SN1,1000

The other map reduce, TopMatches, has been used in all the content-based recommendation implementations in sifarish. For any item it finds the top n matches. It processes the output of *ItemDynamicAttributeSimilarity* to generate the following sample output.
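The top-n selection that TopMatches performs at scale can be sketched in a few lines; this is a toy single-machine version, not the Hadoop implementation, and it assumes a higher score means more similar, as in the scaled output above:

```python
from collections import defaultdict

def top_matches(similarities, n):
    """For each item, keep the n most similar items.
    `similarities` is a list of (item1, item2, score) tuples in the
    shape of the ItemDynamicAttributeSimilarity output, where a
    higher score means more similar."""
    by_item = defaultdict(list)
    for item1, item2, score in similarities:
        by_item[item1].append((item2, score))
    return {item: sorted(pairs, key=lambda p: -p[1])[:n]
            for item, pairs in by_item.items()}

# Scores borrowed from the sample output above.
sims = [("0D1Q5PEB71", "IJ9HBHIM6X", 786),
        ("0D1Q5PEB71", "RKB9DUV91L", 803),
        ("0D1Q5PEB71", "EQD4N0KW3F", 823)]
print(top_matches(sims, 2))  # the two highest-scoring matches survive
```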

0D1Q5PEB71,IJ9HBHIM6X,786
0D1Q5PEB71,RKB9DUV91L,803
0D1Q5PEB71,EQD4N0KW3F,823
0E8FAPF91N,2TURYTEVPX,823
0E8FAPF91N,6IKJO6ZWUX,866
0E8FAPF91N,PXF73TYJWH,866

It shows the similar items for a given item, e.g., for the item with itemID *0E8FAPF91N*. They are ranked by degree of similarity. If a user has engaged with an item in the first column, then the items in the second column are potential candidates for recommendation to that user.

If we have a user's rating for *0E8FAPF91N*, we can compute the predicted rating for all the candidate items. However, that's not the end of the story, as you will see in the next section.

## Rating Prediction

As per the results in the last section, the item *2TURYTEVPX* is a candidate recommendation based on the item *0E8FAPF91N*. However, *2TURYTEVPX* could also have been a candidate based on another item, *1HGT3K5N02*, that the user may have engaged with and for which a rating is available.

So we need to aggregate the impact of *0E8FAPF91N* and *1HGT3K5N02* on *2TURYTEVPX* to compute the final predicted rating for *2TURYTEVPX*. There are some interesting ways of doing that, and it will be the topic of a future post.
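One common aggregation, sketched here only as an illustration and not necessarily what sifarish will use, is a similarity-weighted average of the user's known ratings:

```python
def predict_rating(known_ratings, similarities):
    """Predict a rating for a candidate item as the similarity-weighted
    average of the user's ratings for items similar to it.
    known_ratings: {item: user's rating for that item}
    similarities:  {item: similarity of that item to the candidate}"""
    num = sum(similarities[i] * r for i, r in known_ratings.items())
    den = sum(similarities[i] for i in known_ratings)
    return num / den if den else 0.0

# Hypothetical: the user rated the two source items 4 and 2; the
# similarity of 1HGT3K5N02 to the candidate is a made-up value.
known = {"0E8FAPF91N": 4, "1HGT3K5N02": 2}
sims = {"0E8FAPF91N": 823, "1HGT3K5N02": 600}
print(round(predict_rating(known, sims), 2))  # 3.16
```

The prediction is pulled toward 4 because the more similar item carries more weight.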

## User Rating Support

It may be desirable to exclude users who are not sufficiently engaged, i.e., those for whom the number of items engaged with is below some percentage of the total number of items. This threshold is known as *support* in the data mining world. This feature is not supported in sifarish right now. I might add it in the future.
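A support filter of this kind could look like the following sketch; since the feature does not exist in sifarish yet, this is purely an illustration of the idea:

```python
def filter_by_support(user_items, total_items, min_support):
    """Drop users whose engaged-item count is below a fraction
    (min_support) of the total item count. Illustrates the support
    threshold described above; not an actual sifarish feature."""
    return {user: items for user, items in user_items.items()
            if len(items) / total_items >= min_support}

users = {"u1": {"a"}, "u2": {"a", "b", "c"}}
# u1 engaged with 10% of the inventory, below the 20% threshold.
print(filter_by_support(users, total_items=10, min_support=0.2))  # keeps u2 only
```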

## Social Network Angle

A user's social network can be overlaid on the utility matrix to create a personalized utility matrix, which could be used for recommendations. Essentially, each user's rating in the utility matrix could be weighted down by some radial function based on the degree of separation in the social graph. Although it can be done, the question is whether it makes sense.
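If one did want to try it, the radial down-weighting might be sketched as follows; the exponential form and decay constant are arbitrary assumptions, chosen only to illustrate the idea:

```python
import math

def social_weight(degree, decay=0.5):
    """Radial weighting by degree of separation in the social graph:
    weight 1 for the user themself, decaying with distance. The
    exponential form and decay constant are illustrative assumptions."""
    return math.exp(-decay * degree)

def personalized_rating(rating, degree):
    """Down-weight another user's rating by their social distance."""
    return rating * social_weight(degree)

# A rating of 4 from a friend-of-a-friend (degree 2) counts for less.
print(round(personalized_rating(4.0, degree=2), 2))  # 1.47
```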

There are some caveats, though, which make me wonder whether it's useful to bring the social graph into the picture.

We humans are multifaceted. Only some of these facets are manifested in a social network, and those facets may be irrelevant in the context of recommending some products and services.

For example, my social graph on LinkedIn is characterized by my interest in certain technologies and certain kinds of businesses. There would be very little value in overlaying my LinkedIn social graph when making recommendations for songs and movies.

For commercial support for this solution or other solutions in my github repositories, please talk to ThirdEye Data Science Services. Support is available for *Hadoop* or Spark deployment on the cloud, including installation, configuration and testing.
