Warm Starting a Recommender with Hadoop


In my earlier post I discussed a solution for cold starting a recommender. Cold starting refers to the situation where no user interaction data is available. You may have a newly registered user on your web site who provided some personal data, including interests, while registering. That profile data is all you have at your disposal, and it can be leveraged to find recommended items. It boils down to finding matches between two different entity types that share some common attributes.

In this post my focus is on warm starting, when a limited amount of user behavior data is available, i.e., the user may have purchased or browsed a few items on the site. The items purchased or browsed can be used to find similar items to recommend. I have used electronics product data as the example for this post. The attribute data types in this example are numeric, categorical and text.

Similarity Matching Map Reduce

The solution is very similar to matching between two different entity types. Similarity is defined in terms of the distance between two entities in a multi dimensional attribute space. The only difference is how the items are partitioned and the partitions paired up.

The ID of each item is hashed into one of a configurable number of buckets. We take all possible combinations of hash values. For each hash value pair, we pair up the entities from one hash bucket with the entities from the other hash bucket and calculate the distance.

For each pair of hash values we compute a unique value through some function; that value serves as part of the mapper output key. The following code snippet shows how the mapper key is generated.

// one of the attributes may be designated as the partitioning attribute
String partition = partitonOrdinal >= 0 ? items[partitonOrdinal] : "none";
// map the item ID to a hash bucket in the range [0, bucketCount)
hash = (items[idOrdinal].hashCode() % bucketCount + bucketCount) / 2;
for (int i = 0; i < bucketCount; ++i) {
    if (i < hash) {
        // the larger bucket value forms the high digits, so both orderings
        // of a bucket pair map to the same key; suffix 0 marks the first set
        hashPair = hash * 1000 + i;
        keyHolder.set(partition, hashPair, 0);
        valueHolder.set("0" + value.toString());
    } else {
        // suffix 1 marks the second set of the bucket pair
        hashPair = i * 1000 + hash;
        keyHolder.set(partition, hashPair, 1);
        valueHolder.set("1" + value.toString());
    }
    context.write(keyHolder, valueHolder);
}

On the reducer side, we need to segregate the items from one hash bucket from the items in the other hash bucket. We use secondary sorting for this purpose. The mapper key consists of the following components.

  • Partitioning field value:  One of the attributes may be designated as the partitioning attribute through configuration, so that only items with the same partitioning attribute value are paired up for similarity matching, e.g., category in product data.
  • Hash pair:  The result of a mapping function that maps the two hash values of the item IDs to a unique value.
  • A suffix:  This enables secondary sorting to work. A value of 0 or 1 segregates the set of items from one hash bucket from the set corresponding to the other hash bucket. This forces the values in a reducer invocation not to be intermixed, as sketched below.
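
For reference, here is a minimal sketch of the secondary sort plumbing. The class and accessor names (CompositeKey, getPartitionField(), getHashPair()) are assumptions for illustration, not the actual types in the project. The idea is that partitioning and grouping ignore the 0/1 suffix, while the full sort order includes it, so all suffix 0 values reach the reducer before the suffix 1 values.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: routes on (partition field, hash pair) only,
// so both sets of a bucket pair land on the same reducer
public class BucketPairPartitioner extends Partitioner<CompositeKey, Text> {
    @Override
    public int getPartition(CompositeKey key, Text value, int numPartitions) {
        // exclude the 0/1 suffix from the partitioning decision
        int hash = key.getPartitionField().hashCode() * 31 + key.getHashPair();
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }
}

// Hypothetical grouping comparator: groups values by (partition field,
// hash pair), again ignoring the suffix, so one reducer invocation sees
// the items from both buckets of a pair
public class BucketPairGroupingComparator extends WritableComparator {
    protected BucketPairGroupingComparator() {
        super(CompositeKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        CompositeKey k1 = (CompositeKey) a;
        CompositeKey k2 = (CompositeKey) b;
        int cmp = k1.getPartitionField().compareTo(k2.getPartitionField());
        return cmp != 0 ? cmp : Integer.compare(k1.getHashPair(), k2.getHashPair());
    }
}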

The full source code is available here. In the reducer, we essentially do a join of two sets, each set corresponding to a hash bucket. For each pair of items considered, we compute the distance between them, using the algorithms discussed in my earlier post.
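
As a rough illustration of the kind of distance calculation involved, here is a sketch covering the three attribute types mentioned earlier: a normalized difference for numeric attributes, a 0/1 mismatch for categorical attributes, and a Jaccard measure over tokens for text. The field positions, the numeric range parameter and the equal weighting are assumptions for illustration, not the project's actual implementation.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch only: distance between two items, each assumed to have a numeric
// attribute at index 1, a categorical attribute at index 2 and a text
// attribute at index 3; smaller means more similar
public static double distance(String[] item1, String[] item2, double numericRange) {
    // numeric: normalized absolute difference in [0, 1]
    double dNum = Math.abs(Double.parseDouble(item1[1])
            - Double.parseDouble(item2[1])) / numericRange;

    // categorical: 0 on match, 1 on mismatch
    double dCat = item1[2].equals(item2[2]) ? 0.0 : 1.0;

    // text: 1 minus the Jaccard similarity of the token sets
    Set<String> t1 = new HashSet<>(Arrays.asList(item1[3].toLowerCase().split("\\s+")));
    Set<String> t2 = new HashSet<>(Arrays.asList(item2[3].toLowerCase().split("\\s+")));
    Set<String> union = new HashSet<>(t1);
    union.addAll(t2);
    t1.retainAll(t2);
    double dText = union.isEmpty() ? 0.0 : 1.0 - (double) t1.size() / union.size();

    // aggregate the per-attribute distances, here with equal weights
    return Math.sqrt((dNum * dNum + dCat * dCat + dText * dText) / 3.0);
}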

When a hash bucket is paired with itself, we don't need to make a distinction between the two sets in the reducer; we simply pair up each item with every other item for the distance calculation.
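
The reducer-side join can be sketched as follows; the key type and the findDistance method are hypothetical stand-ins, not the actual code. Because of the secondary sort, all values prefixed with 0 arrive before those prefixed with 1, so the first bucket can be buffered in memory and the second bucket streamed against it.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

// Sketch of the reducer join (hypothetical names)
protected void reduce(CompositeKey key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    List<String> firstBucket = new ArrayList<>();
    for (Text value : values) {
        String item = value.toString();
        if (item.charAt(0) == '0') {
            // secondary sort guarantees these arrive first
            firstBucket.add(item.substring(1));
        } else {
            // join each second-bucket item against the buffered first bucket;
            // for a bucket paired with itself, every value carries the same
            // prefix and each item is instead paired with the items seen
            // before it (not shown here)
            String second = item.substring(1);
            for (String first : firstBucket) {
                int dist = findDistance(first, second); // distance as above
                context.write(NullWritable.get(),
                        new Text(first + "," + second + "," + dist));
            }
        }
    }
}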

Nearest Neighbor Map Reduce

This map reduce job, discussed in an earlier post, takes the output of the first map reduce and finds the n nearest neighbors of each item. The output of this map reduce is an item ID followed by the item ID of each of its n nearest neighbors and the corresponding distance, as below. The smaller the distance, the better the match. The number n is configurable.

DBL-049959,DBL-T40306,172
DBL-049959,DBL-CA0357,183
DBL-049959,DBL-CA0352,185
DBL-049959,DBL-CA0355,186
DBL-049959,DBL-CA0099,188

This output will typically be imported into a database to be used in a web application to show recommended items based on the items a user has purchased or browsed.

Final Recommendation

Once the final output is in a database, we can run a query to find all the matching items for a given set of source items, ordered by distance in ascending order. The source items are the set of items the user has purchased or browsed.
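
A minimal sketch of such a query over JDBC, assuming a table item_similarity with columns src_item, matched_item and distance; the schema and the open connection are assumptions, not prescribed by the project.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Take the best (smallest) distance per matched item across all source
// items, then return the closest matches first
String sql =
    "SELECT matched_item, MIN(distance) AS distance " +
    "FROM item_similarity " +
    "WHERE src_item IN (?, ?) " +       // the user's purchased/browsed items
    "GROUP BY matched_item " +
    "ORDER BY distance ASC " +
    "LIMIT 10";
try (PreparedStatement stmt = connection.prepareStatement(sql)) {
    stmt.setString(1, "DBL-049959");
    stmt.setString(2, "DBL-T40306");
    try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
            System.out.println(rs.getString("matched_item")
                    + "," + rs.getInt("distance"));
        }
    }
}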

Essentially we are doing a merge operation. This works better if we choose the nearest neighbors within a given distance, instead of the n nearest neighbors, in the nearest neighbor map reduce. Accordingly, I have modified the nearest neighbor map reduce so that it's a configurable option whether to choose the n nearest neighbors or all the neighbors within a specified distance.

While transitioning from a cold start recommender to a warm start recommender, it might be a good idea to combine both, at least until a significant amount of user behavior data is available. In other words, the top matches from the cold start recommender are merged with the results of the warm start recommender, and the final top n recommendations are picked from the merged and sorted list.
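
A simple sketch of that merge, assuming both recommenders produce item-to-distance maps on a comparable scale (in practice the two score ranges may first need to be normalized):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch only: combine cold start and warm start candidates, keep the
// best (smallest) distance per item, and return the top n overall
public static List<Map.Entry<String, Integer>> mergeRecommendations(
        Map<String, Integer> coldStart, Map<String, Integer> warmStart, int n) {
    Map<String, Integer> merged = new HashMap<>(coldStart);
    warmStart.forEach((item, dist) -> merged.merge(item, dist, Math::min));
    return merged.entrySet().stream()
            .sorted(Map.Entry.comparingByValue())
            .limit(n)
            .collect(Collectors.toList());
}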

Other Uses of the Similarity Map Reduce

The similarity map reduce used here can be applied in various other contexts, e.g., finding document similarity. I am also using it in a fraud and outlier analysis open source project I am working on. I will be blogging about that in the near future.

Hot Starting a Recommender

So far I have discussed cold starting and warm starting. Next comes hot starting, which is essentially social recommendation based on collaborative filtering over user behavior data. By social, I mean the use of purchase or browsing history across a large number of users, compared to the approach taken by the solution outlined in this post, which is based on the behavior data of a given user only.

I have discussed the social recommender solution in earlier posts. I will be implementing the collaborative filtering based social recommender soon and adding it to the sifarish project on GitHub. So, stay tuned.
