Similarity Based Recommendation – Hadoop Way


In my earlier post, I discussed some of the basic concepts for Similarity Based Recommendation. As discussed, the distance between entities in a multi-dimensional attribute space is used as a measure of similarity. In this post I will take a deep dive into the Hadoop based implementation called sifarish, which is hosted as an open source project on github. This particular Hadoop implementation is for finding similarities between entities of different entity types.

This kind of solution is useful for the so called cold start problem, to bootstrap your recommender when absolutely no user interaction history is available. You may only have at your disposal some user profile data indicative of the user's interest in some products or services.

I have tested it with two different data sets. One is for making recipe recommendations based on user profiles. The other recommends Kiva micro loans by matching the lender's profile with loans. For this example, I have used actual data from Kiva, and I will be using the Kiva micro loan case as the example throughout this post.

Matching Process

The matching process uses two Hadoop map reduce jobs working in tandem. The first one calculates distance between each pair of entities. The second map reduce job finds the top N matches of the target entity for every instance of the source entity, where N is user configurable.

For the Kiva loan example, as the final output we will get the top N matching loans for every lender profile.

For the distance computing map reduce, we essentially pair each instance of the source entity (e.g., lender profile) with each instance of the target entity (e.g., loan) we are matching. This causes a combinatorial explosion in computation. For each pair, we compute the distance. The details of the distance computation logic can be found in my earlier post.

The second map reduce processes the output of the first map reduce and uses the secondary sorting pattern to get the top N matches of the target entity (e.g., loan) for every instance of the source entity (e.g., lender profile).

Kiva Micro Loan

For Kiva we match lender profiles with loans and recommend loans to lenders. I obtained loan data using Kiva's REST API. The pertinent attributes of a loan are as follows

  • Sector: the business sector of borrower e.g., agriculture, retail etc
  • Country: country of the borrower
  • Amount: loan amount
  • Raised: amount raised so far

Kiva does not have any data pertinent to a lender's profile. I used a script to generate fake lender profile data. It has the following attributes

  • Sector: the business sector e.g., agriculture, retail etc that the lender prefers
  • Country: country of the borrower that lender prefers
  • Raised: the lender prefers loans with a raised amount below this level

Similarity Measure Map Reduce

The immediate issue for this map reduce is how to divide up the work of pairing up the entities and calculating the distance for each pair. Essentially we are doing a Cartesian join.

The approach I took is similar to what is known as a Block Nested Loop Join for doing joins in an RDBMS, except that I use hashing for partitioning.

The idea is to partition each set of entities and pair up the partitions from the two entity types. I have decided to do the partitioning by hashing the entity Id.

The Id of each entity is hashed. For each entity type we get a set of hash values. We take all possible combinations of hash values for the two entity types. For each hash value pair, we pair up the entities from each type falling into the two hash buckets. The distance computation between entities for each hash value pair is distributed among the reducers.

The mapper output key is a function of the two hash values. I am using the following function to generate the mapper output key:

key = (hash(SEID) % hashBucketCount) * hashBucketCount + hash(TEID) % hashBucketCount

where

  • SEID = Source entity ID
  • TEID = Target entity ID
  • hashBucketCount = Number of hash buckets, which is configurable

To distribute the load uniformly across reducers through Hadoop's default reducer partitioner, hashBucketCount should be chosen properly. For example, if the value chosen is 15, there will be 225 unique values for the key. If the number of reducers chosen is 20, each reducer will on average process about 11 keys.
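
To make the pairing concrete, here is a minimal mapper sketch in Java. It is not sifarish's actual mapper; the position of the Id field, the way a record is recognized as source or target, and the emission of each record once per hash bucket of the other entity type are all assumptions used purely for illustration (the key will be refined a bit further down to also order the two entity types).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: each record is emitted once per hash bucket of the other entity
// type, so that every (source bucket, target bucket) combination ends up under
// one reducer key, realizing the Cartesian join described above.
public class PairingMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
	private static final int HASH_BUCKET_COUNT = 15;    // hashBucketCount, configurable in reality
	private final IntWritable outKey = new IntWritable();

	@Override
	protected void map(LongWritable offset, Text record, Context context)
			throws IOException, InterruptedException {
		String[] fields = record.toString().split(",");
		String entityId = fields[0];                     // assumes the Id is the first field
		int myBucket = (entityId.hashCode() & Integer.MAX_VALUE) % HASH_BUCKET_COUNT;
		boolean isSource = isSourceEntity(record);       // e.g., derived from the input file

		for (int otherBucket = 0; otherBucket < HASH_BUCKET_COUNT; ++otherBucket) {
			int keyBase = isSource
					? myBucket * HASH_BUCKET_COUNT + otherBucket    // hash(SEID) part first
					: otherBucket * HASH_BUCKET_COUNT + myBucket;   // hash(TEID) part second
			outKey.set(keyBase);
			context.write(outKey, record);
		}
	}

	private boolean isSourceEntity(Text record) {
		// placeholder: in practice the entity type could come from the input path or a field
		return true;
	}
}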

The values for a given mapper key will contain instances of entities of both types. Since we have to pair up instances of one entity type with instances of the other type, we need a way to segregate the instances, so that instances of one type appear before the instances of the other type in the list of values when the reducer gets invoked. This enables an easier nested loop join.

The key defined earlier will be used as the base part of the key (we will call it keyBase). The key is enhanced as follows, which ensures that the source entities appear before the target entities:

For source entity: key = keyBase * 10
For target entity: key = (keyBase * 10) + 1

We are using the secondary sorting pattern, with a custom reducer partitioner and group comparator based on keyBase. Here is the source code for this map reduce.
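
The partitioner and group comparator simply strip off the entity type bit and work off keyBase, so that all values for a (source bucket, target bucket) pair land in one reduce call with the source entities first. Here is a rough sketch with my own class names, not the actual sifarish classes:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch only: partition and group on keyBase = key / 10, while the full key
// (keyBase * 10 + type bit) still drives the sort, putting source entities first.
public class SecondarySortSupport {

	public static class KeyBasePartitioner extends Partitioner<IntWritable, Text> {
		@Override
		public int getPartition(IntWritable key, Text value, int numReducers) {
			int keyBase = key.get() / 10;               // drop the entity type bit
			return keyBase % numReducers;
		}
	}

	public static class KeyBaseGroupComparator extends WritableComparator {
		public KeyBaseGroupComparator() {
			super(IntWritable.class, true);
		}

		@Override
		public int compare(WritableComparable a, WritableComparable b) {
			int keyBaseA = ((IntWritable) a).get() / 10;
			int keyBaseB = ((IntWritable) b).get() / 10;
			return Integer.compare(keyBaseA, keyBaseB);
		}
	}
}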

Here is some sample output of this map reduce. The first field is the lender profile Id, followed by the loan Id and other loan attributes. The last attribute is the distance, which is normalized to be in the range 0 – 999.

ZJHUVP7N9GGT,362664,Wekhonya B Shg Group,Agriculture,Kenya,900,5,http://www.kiva.org/lend/362664,477
ZJHUVP7N9GGT,350136,Mo'ayad Al-Khateeb,Retail,Palestine,1000,77,http://www.kiva.org/lend/350136,533
ZJHUVP7N9GGT,354980,Luis,Food,Bolivia,800,12,http://www.kiva.org/lend/354980,477
ZJHUVP7N9GGT,362666,Buyanarvjih Luvsan,Services,Mongolia,2700,0,http://www.kiva.org/lend/362666,477
ZJHUVP7N9GGT,354647,Naseem's Group,Retail,Pakistan,1625,92,http://www.kiva.org/lend/354647,558
ZJHUVP7N9GGT,361820,Virginia Gumantaron,Food,Philippines,1175,0,http://www.kiva.org/lend/361820,477
ZJHUVP7N9GGT,359888,Angela Antequisa,Retail,Philippines,300,58,http://www.kiva.org/lend/359888,507

This output is consumed by the top N match map reduce, which only needs the two Ids and the distance. The other loan attributes were part of the output for debugging purposes.

MetaData and Configuration

Both map reduce jobs consume a meta data and configuration file in JSON format. Here is the configuration file for the Kiva micro loan project. Its contents are as follows

  • For both entity types, meta data for each field
  • Overridden distance  between categorical attribute values
  • Concept hierarchy for categorical attributes
  • Min and max values for numerical attributes
  • Categorical value mapping between entity types
  • Attribute weight in distance aggregation

Overriding Categorical Attribute Distance

By default, the distance between two categorical attribute values is 1 if they are not the same. But in some cases, the distance could be less than 1 even for dissimilar values, because semantically they could be closer.

Let's take three countries as an example: Chad, Sudan and Peru. Because of the geographical, social and economic similarity between Chad and Sudan, their distance could be set to 0.65, which overrides the default value of 1.0. Here is a snippet from the configuration. It's part of the meta data for the field country.

{
	"thisValue" : "Chad",
	"thatValue" : "Sudan",
	"distance" : 0.65
}

Categorical Attribute Concept Hierarchy

Some attributes have a natural hierarchy. For the Kiva example, there is a hierarchy between country and continent. For example, if the lender's profile specifies Asia and the loan has Cambodia as the country, it should be considered a match and the distance should be 0, although the two values are not literally the same. This is how a concept hierarchy is configured

{
	"parent" : "Africa",
	"children" : ["Kenya","Benin","Ghana","Sierra Leone","Uganda","Senegal"]
},
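
Putting the default distance of 1, the overrides and the concept hierarchy together, the distance between two categorical values could be computed roughly as follows. This is just a sketch with hypothetical data structures standing in for the JSON meta data, not sifarish's actual code:

import java.util.Map;
import java.util.Set;

// Sketch of categorical distance with overrides and a concept hierarchy; the
// maps are hypothetical stand-ins for what the JSON meta data defines.
public class CategoricalDistance {
	private final Map<String, Double> overriddenDistance;   // e.g., "Chad|Sudan" -> 0.65
	private final Map<String, Set<String>> hierarchy;       // e.g., "Africa" -> {"Kenya", ...}

	public CategoricalDistance(Map<String, Double> overriddenDistance,
			Map<String, Set<String>> hierarchy) {
		this.overriddenDistance = overriddenDistance;
		this.hierarchy = hierarchy;
	}

	public double distance(String profileValue, String loanValue) {
		if (profileValue.equals(loanValue)) {
			return 0.0;                                      // exact match
		}
		Set<String> children = hierarchy.get(profileValue);
		if (children != null && children.contains(loanValue)) {
			return 0.0;                                      // e.g., Asia matches Cambodia
		}
		Double override = overriddenDistance.get(profileValue + "|" + loanValue);
		if (override != null) {
			return override;                                 // e.g., Chad vs Sudan -> 0.65
		}
		return 1.0;                                          // default for dissimilar values
	}
}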

Numerical Attribute Min and Max Values

Min and max values are needed to normalize the distance between numerical attribute values. Here is an example for the field raised, which indicates the amount raised for a loan so far, expressed as a percentage of the loan amount.

{
	"name" : "raised",
	"ordinal" : 5,
	"dataType" : "int",
	"min" : 0,
	"max" : 100
},
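
With the min and max in hand, the normalized distance between two numerical values is essentially the absolute difference scaled by the attribute's range. Roughly, as a sketch rather than the project's exact formula:

// Sketch: normalized distance between two numerical attribute values, using the
// min and max defined in the meta data (0 and 100 for the raised field above).
public class NumericDistance {
	public static double distance(double thisValue, double thatValue, double min, double max) {
		return Math.abs(thisValue - thatValue) / (max - min);   // normalized to the range 0 to 1
	}
}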

Attribute Weight

The distance between two matching entities is computed by aggregating the distances between corresponding attributes. While aggregating, all attributes contribute uniformly by default. But that can be changed by assigning a weight as follows. Here the attribute country is given more weight compared to the other attributes.

{
	"name" : "country",
	"ordinal" : 3,
	"dataType" : "categorical",
	"weight" : 1.2
}

When the weight for an attribute is more than 1, the effective distance between the two attribute values for the entity pair drops, which makes the two entities closer than they would have been otherwise.
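
Purely as an illustration of that behavior, a weighted aggregation could divide each attribute distance by its weight before combining them. Both the division by weight and the Euclidean style aggregation below are my assumptions, not necessarily sifarish's exact formula:

import java.util.List;

// Sketch: each attribute distance is divided by its weight before aggregation,
// so a weight above 1 pulls the two entities closer; the Euclidean style
// aggregation is also just an assumption here.
public class DistanceAggregator {
	public static double aggregate(List<Double> attrDistances, List<Double> weights) {
		double sumSquares = 0.0;
		for (int i = 0; i < attrDistances.size(); ++i) {
			double effective = attrDistances.get(i) / weights.get(i);   // weight 1.0 leaves it unchanged
			sumSquares += effective * effective;
		}
		return Math.sqrt(sumSquares / attrDistances.size());
	}
}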

Numerical Attribute Distance Nonlinearity

For numerical attributes, the distance is the difference in the values. However, the distance can also be a nonlinear function of the difference in values. For example, in the case of Kiva, the raised attribute in the lender's profile has the function maxSoft associated with it, as shown below.

{
	"name" : "raised",
	"ordinal" : 3,
	"dataType" : "int",
	"numDistFunction" : "maxSoft",
	"mappings" :
	[
		{
			"matchingOrdinal" : 5
		}
	]
}

It implies that if the value of the raised attribute in the loan is less than the raised attribute in the lender's profile, it's a perfect match, i.e., the distance is 0. Otherwise, the distance is the difference in values. In plain English, the lender prefers to lend money to those loans whose raised amount, as a percentage of the loan amount, is less than what's specified for raised in the lender's profile. It's akin to range queries in SQL.

Here is the complete list of the different nonlinear functions

  • maxSoft: 0 if the target entity attribute value is less than the source value, the difference of the values otherwise
  • minSoft: 0 if the target entity attribute value is more than the source value, the difference of the values otherwise
  • maxHard: 0 if the target entity attribute value is less than the source value, 1 otherwise
  • minHard: 0 if the target entity attribute value is more than the source value, 1 otherwise
  • equalSoft: 0 if the target entity attribute value is equal to the source value, the difference of the values otherwise
  • equalHard: 0 if the target entity attribute value is equal to the source value, 1 otherwise
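
In code, these functions boil down to a simple comparison followed by either the difference or a hard 0/1 result. Here is a sketch; the function names match the list above but the implementation is mine, and the differences would still be normalized by the attribute's min and max range:

// Sketch of the nonlinear distance functions listed above; source is the value
// from the source entity (e.g., lender profile), target from the target entity
// (e.g., loan).
public class NonLinearDistance {
	public static double distance(String function, double source, double target) {
		switch (function) {
			case "maxSoft":   return target <= source ? 0.0 : target - source;
			case "minSoft":   return target >= source ? 0.0 : source - target;
			case "maxHard":   return target <= source ? 0.0 : 1.0;
			case "minHard":   return target >= source ? 0.0 : 1.0;
			case "equalSoft": return target == source ? 0.0 : Math.abs(target - source);
			case "equalHard": return target == source ? 0.0 : 1.0;
			default:          return Math.abs(target - source);     // plain difference
		}
	}
}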

Top Match Map Reduce

This map reduce job processes the output of the Similarity Measure Map Reduce and finds the top N target entities (e.g., loans) for every source entity (e.g., lender profile). The parameter N is user configurable.

The mapper emits the profile Id and distance as the key and the loan Id and distance as the value. This is a classic secondary sort map reduce. This earlier post of mine has a detailed example of secondary sort in map reduce.

We use the base part of the key (profile Id) for the reducer partitioner and the group comparator. In the reducer, as we iterate through the values, which are sorted in ascending order of distance, we emit only the top N values and ignore the rest. Here is the source code for this map reduce.
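
For illustration, here is a bare bones sketch of the top N reducer, simplified to treat the profile Id as the key with the composite key and comparators hidden; the class name and the configuration parameter name are mine, not sifarish's:

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the top N reducer: values for a profile Id arrive sorted by ascending
// distance thanks to the secondary sort, so emitting the first N gives the best matches.
public class TopMatchReducer extends Reducer<Text, Text, NullWritable, Text> {
	private int topMatchCount;                     // N, user configurable
	private final Text outValue = new Text();

	@Override
	protected void setup(Context context) {
		topMatchCount = context.getConfiguration().getInt("top.match.count", 5);   // hypothetical parameter name
	}

	@Override
	protected void reduce(Text profileId, Iterable<Text> values, Context context)
			throws IOException, InterruptedException {
		int count = 0;
		for (Text value : values) {                // value holds the loan Id and the distance
			if (count++ >= topMatchCount) {
				break;                             // ignore everything beyond the top N
			}
			outValue.set(profileId.toString() + "," + value.toString());
			context.write(NullWritable.get(), outValue);
		}
	}
}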

The final output looks like this. The fields from left to right are the profile Id, the matching loan Id and the distance, which essentially ranks the loans. Here we are showing the top 5 matches.

0UQ972OAXIEM,361111,0
0UQ972OAXIEM,362406,0
0UQ972OAXIEM,362390,0
0UQ972OAXIEM,363101,45
0UQ972OAXIEM,361678,105

Text Matching

I am planning to support text fields and implement text similarity. A Kiva loan has a textual description. If the lender profile contained a text field with some keywords, we could match them against the loan description text. It could be as simple as the percentage of keywords found in the text.

The text will have to go through the normal processing of stop word removal, stemming etc., as done in any text search solution. We may need a preprocessing map reduce that will do this processing and replace the text field with a set of tokens.
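
As a rough idea of how simple that keyword matching could be, here is a sketch of the percentage overlap, assuming the description has already been reduced to a set of stemmed, stop word free tokens; this is not implemented in sifarish yet:

import java.util.Set;

// Sketch: fraction of the profile keywords found among the tokens of the loan
// description; both sets are assumed to be already stemmed and stop word free.
public class KeywordMatch {
	public static double matchFraction(Set<String> profileKeywords, Set<String> descriptionTokens) {
		if (profileKeywords.isEmpty()) {
			return 0.0;
		}
		int found = 0;
		for (String keyword : profileKeywords) {
			if (descriptionTokens.contains(keyword)) {
				++found;
			}
		}
		return (double) found / profileKeywords.size();
	}
}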

Wrapping Up

I am planning to make this, along with other recommendation engine algorithms I will be implementing, available as a service. My next project will be a Hadoop implementation of a collaborative filtering based recommendation engine, which is a better solution when there is enough social and behavioral data.

Let me know if you have an interesting application that could use my solution. I will be happy to look into it.

Here are the links to the sample configuration files and shell script for this use case, which you might find useful.

  1. Configuration properties file
  2. Schema JSON file
  3. Shell script

Just to reiterate, this Map Reduce class is for finding similarities between entities of two different types. To find similarities between entities of the same type, the Map Reduce class SameTypeSimilarity should be used.
