Similarity Based Recommendation – The Basics


For recommendation engines, the preferred approach is to use social rating data. My earlier post on recommendation engines provides details of using Collaborative Filtering with social data. However, when no such data is available, or when it's available but not in sufficient quantity, other approaches are necessary. This is the so-called cold start problem. One such approach uses distance based similarity calculation in a multi-dimensional feature space. It's also known as content based recommendation.

Introduction

In this post and in a follow-up post, I will present a Hadoop based similarity calculation implementation. The implementation is available on github as a project called sifarish. This is the first of a series of Hadoop based recommendation engine implementations in sifarish. It will be augmented later with a social rating data based collaborative filtering implementation.

Examples

For a shopping site, if user profile data in terms of the user's interests is available, then products in line with the user profile can be found by simply computing the similarity between the profile and the products. The profile and the products are considered to be vectors of features. The similarity calculation is based on the proximity of the two feature vectors in an n-dimensional feature space.

Another example is a job recommendation system. If the resume and the job posting are represented as vectors of attributes, job recommendations can be made based on similarity matching between the two entities.

I will use recipe recommendation as an example in this post. It's assumed that the user profile is available in terms of the user's preference for cuisine category and other attributes. The system will recommend recipes ordered by their similarity with the profile.

Our goal is to find the distances between a profile and the recipes and list the recipes in ascending order of distance, i.e., in descending order of similarity.

For all these problems, we are essentially interested in finding the k nearest neighbors to some entity. Nearness is defined in terms of the distance between entities.

Distance Based Similarity

Similarity is based on the distance between two entities: the smaller the distance, the greater the similarity. A distance of 0 implies maximum similarity.

Each entity is represented as a vector of attributes. The distance between two vectors in a multi-dimensional space is expressed in a generic form by the following expression, called the Minkowski distance. Every attribute contributes to the overall distance between two objects.

dist(x,y) = power(sum(power(abs(x(k) - y(k)), r)), 1/r)

It's computed as follows (a code sketch appears after the list):

  1. Take the absolute value of the difference between the two attribute values
  2. Raise the result of 1 to the power of r
  3. Sum the result of 2 over all attributes
  4. Raise the sum to the power of 1/r
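As a small illustrative sketch in Python (my own, not the sifarish implementation), the computation looks like this; the Manhattan and Euclidean distances discussed below fall out as the special cases r = 1 and r = 2.

    def minkowski_distance(x, y, r=2.0):
        # Minkowski distance between two numeric attribute vectors;
        # r = 1 gives Manhattan, r = 2 gives Euclidean
        if len(x) != len(y):
            raise ValueError("vectors must have the same number of attributes")
        total = sum(abs(a - b) ** r for a, b in zip(x, y))
        return total ** (1.0 / r)

    # the same pair of points under different values of r
    p, q = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
    print(minkowski_distance(p, q, r=1))    # Manhattan: 5.0
    print(minkowski_distance(p, q, r=2))    # Euclidean: ~3.61
    print(minkowski_distance(p, q, r=10))   # approaching the max attribute difference of 3.0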

Two specific distance calculation algorithms can be derived from the generic Minkowski distance, depending on the value of r.

When r = 1, it's called the Manhattan or City Block distance. When r = 2, it is called the Euclidean distance. As r approaches infinity, the distance approaches the maximum difference between any of the attributes. It's called the Supremum distance.

Euclidean distance is a popular choice. In our implementation, the distance calculation algorithm can be specified as Manhattan, Euclidean or Minkowski through configuration. If Minkowski is chosen, the value of r can also be configured.

As r goes from 1 to infinity, the distance between two entities goes from being the sum of the differences between the attributes to the maximum of the differences between the attributes.

Comparison with Database Query

Similarity matching can be thought of like a database query, where one of the entities (e.g., the user profile) is like the query and the other entities (e.g., products) are like the query results. But there are some significant differences.

A database query is very crisp in nature, i.e., results are returned only when the query conditions are met exactly. Similarity based matching is fuzzier in nature. Perhaps an example will help. If you specify tv_size >= 42 in a product search, a normal database query will return only TVs with size 42 or more. But similarity based matching will also return TVs with size 40 or lower, because the algorithm is based on distance in a multi-dimensional attribute space. The inexact matches will simply rank lower in the similarity results.
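To make the contrast concrete, here is a toy sketch (the TV data and the 42 inch target are made up for illustration):

    target_size = 42.0
    tvs = {"tv_a": 46.0, "tv_b": 42.0, "tv_c": 40.0}

    # crisp database style query: the 40 inch TV is excluded outright
    crisp = [name for name, size in tvs.items() if size >= target_size]
    print(crisp)    # ['tv_a', 'tv_b']

    # similarity based matching: every TV is returned, ranked by distance
    ranked = sorted(tvs, key=lambda name: abs(tvs[name] - target_size))
    print(ranked)   # ['tv_b', 'tv_c', 'tv_a'] -- the 40 inch set ranks lower, but is not dropped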

In my next post on this topic, I will talk about some of the functions (minSoft, maxHard etc.) that can be applied in the distance calculation for numerical attributes. They make the distance between numerical attributes non-linear.

Distance and Data Types

For numerical attributes, distance is intuitive enough. It's simply the difference in values, although, as mentioned earlier, it's possible to impose different functions on the distance between numerical attributes.

What about categorical attributes, like a recipe category with values such as veg, non veg, vegan and meat lover? For categorical attributes, the distance is 0 if the values are the same and 1 if they are different.

For categorical data, if the values are not the same, the distance is by default always 1. However, some values are naturally closer to each other than others. Sifarish accommodates this by letting the user specify a distance less than 1.0 for some value pairs.

For example, the user may specify the distance between non veg and meat lover to be 0.8, because they are naturally closer to each other, but for veg and non veg the default distance value of 1.0 makes sense.
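As a sketch, the value pair overrides could be kept in a lookup table like this (the table format is illustrative, not sifarish's actual configuration):

    # pair specific distances; any pair not listed defaults to 1.0
    overrides = {frozenset(["non veg", "meat lover"]): 0.8}

    def categorical_distance(a, b):
        # 0 for identical values, an overridden distance if configured,
        # otherwise the default of 1.0
        if a == b:
            return 0.0
        return overrides.get(frozenset([a, b]), 1.0)

    print(categorical_distance("veg", "veg"))               # 0.0
    print(categorical_distance("non veg", "meat lover"))    # 0.8
    print(categorical_distance("veg", "non veg"))           # 1.0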

For ranked categorical attributes, there is an implied order between the values, e.g., income level. The common approach is to assign a sequence of numbers to the values. Once the mapping is done, the attribute can be treated as numerical data. In this case, the intervals between the mapped values are fixed. If that is not desirable, the values could be mapped to any set of numbers with the same ordering as the original data.
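For example, income level might be handled as follows (the levels and the numbers assigned to them are my own illustration):

    # uniform mapping: equal intervals between adjacent levels
    income_rank = {"low": 1, "medium": 2, "high": 3, "very high": 4}

    # same ordering, but a larger gap before 'very high'
    income_rank_skewed = {"low": 1, "medium": 2, "high": 3, "very high": 6}

    def ranked_distance(a, b, mapping):
        # treat the ranked values as numbers, normalized by the full range
        lo, hi = min(mapping.values()), max(mapping.values())
        return abs(mapping[a] - mapping[b]) / float(hi - lo)

    print(ranked_distance("low", "high", income_rank))          # 2/3 = 0.67
    print(ranked_distance("low", "high", income_rank_skewed))   # 2/5 = 0.4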

Some categorical attributes have a hierarchical structure, e.g., product category. Each node of the tree represents a categorical attribute value. The distance between two attribute values is defined as the smallest number of edges traversed in going from the tree node corresponding to one attribute value to the node representing the other. For example, the distance between two sibling nodes will be 2. The distance is normalized by the maximum distance between any two nodes in the tree.
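A minimal sketch with a hand built category tree (the tree itself, and normalizing by a precomputed maximum path length, are my assumptions for illustration):

    # toy product category tree, expressed as child -> parent
    parent = {
        "tv": "electronics", "camera": "electronics",
        "shirt": "apparel", "shoes": "apparel",
        "electronics": "root", "apparel": "root",
    }

    def path_to_root(node):
        # the node followed by all of its ancestors, up to the root
        path = [node]
        while node in parent:
            node = parent[node]
            path.append(node)
        return path

    def tree_distance(a, b):
        # edges traversed from a up to the lowest common ancestor and down to b
        pa, pb = path_to_root(a), path_to_root(b)
        pb_set = set(pb)
        common = next(n for n in pa if n in pb_set)
        return pa.index(common) + pb.index(common)

    max_dist = 4.0  # longest node to node path in this tree, e.g. tv -> shoes
    print(tree_distance("tv", "camera") / max_dist)   # siblings: 2 edges -> 0.5
    print(tree_distance("tv", "shoes") / max_dist)    # 4 edges -> 1.0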

For a boolean attribute, the distance is 0 if the values are the same and 1 otherwise. For an asymmetric boolean attribute, a pair of false values is not significant and hence is not considered in the distance calculation. With asymmetric attributes, sometimes a certain value is so prevalent that it's better to ignore it in the distance calculation when two entities share that value for the attribute; otherwise it would dominate the calculated distance.
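A sketch covering both cases (returning None to mean "skip this attribute" is my own convention, not sifarish's):

    def boolean_distance(a, b, asymmetric=False):
        # symmetric: 0 if equal, 1 otherwise; asymmetric: a shared
        # false is uninformative, so return None and let the caller
        # leave the attribute out of the overall distance
        if asymmetric and not a and not b:
            return None
        return 0.0 if a == b else 1.0

    print(boolean_distance(True, True))                     # 0.0
    print(boolean_distance(True, False))                    # 1.0
    print(boolean_distance(False, False, asymmetric=True))  # None -> ignored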

Distance Normalization

The distances between different attributes, potentially of different data types, need to be combined to find the distance between two entity instances. To ensure equal contribution from attributes of different data types, some normalization is necessary. Distances for categorical data are naturally scaled to the range 0 to 1.0.

For numerical attributes, the difference is divided by the difference between the min and max values of the attribute, forcing the resulting normalized distance into the range 0 to 1.0. In sifarish, the min and max values are specified as metadata configuration.
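A sketch of the min-max normalization (the attribute bounds below stand in for sifarish's metadata configuration):

    def normalized_numeric_distance(x, y, attr_min, attr_max):
        # scale the raw difference into the range 0 to 1.0 using the
        # configured min and max for the attribute
        return abs(x - y) / float(attr_max - attr_min)

    # cooking time in minutes, with configured bounds of 0 and 240
    print(normalized_numeric_distance(30.0, 90.0, 0.0, 240.0))  # 0.25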

Attribute Weight

Sometimes it may be necessary to give more importance to some attributes in the distance calculation. This is achieved by configuring a weight for an attribute and taking the weight into consideration. The modified expression for distance is as follows.

dist = 1/weight * diff + (1 - 1/weight) * diff * diff
dist(x,y) = power(sum(power(dist(k), r)), 1/r)
where dist = distance between two attribute values, diff = normalized difference between the two attribute values, dist(k) = that distance for the kth attribute and dist(x,y) = distance between the two entities x and y

Generally, the distance between two attributes is simply the difference in their values. When a weight is introduced, the distance becomes a non-linear function of the value difference. The second term is introduced to meet the boundary condition dist = diff when diff = 1, i.e., when the two attribute values are maximally different, the weight should have no effect.
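The boundary condition is easy to verify with a small sketch (illustrative, not sifarish's code):

    def weighted_attribute_distance(diff, weight=1.0):
        # blend linear and quadratic terms so that dist == diff when
        # diff == 1.0, i.e., weight has no effect at maximum difference
        return (1.0 / weight) * diff + (1.0 - 1.0 / weight) * diff * diff

    print(weighted_attribute_distance(0.5))               # 0.5, weight 1 is a no-op
    print(weighted_attribute_distance(0.5, weight=2.0))   # 0.375, distance lowered
    print(weighted_attribute_distance(1.0, weight=2.0))   # 1.0, boundary condition holds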

By assigning a weight greater than 1.0, the distance along that attribute dimension is lowered, increasing its contribution towards the similarity between two entities, and vice versa. Sifarish allows attribute weights to be specified through configuration. If not specified, the weight defaults to 1.0.

Other Similarity Measures

In this post, my focus is distance based similarity calculation, which is appropriate for dense data.

For sparse data, which often contains asymmetric data, similarity depends more on the characteristics that are shared than on the characteristics that are lacking.

In such cases the cosine, Jaccard and extended Jaccard measures are more appropriate. One example of sparse and asymmetric data is the document term vector used for finding document similarity. When two documents are compared, most of the terms will be absent from both.

If the terms that are absent from both documents are included in a distance based similarity calculation, they will dominate the similarity value, which will not reflect the true picture. In such cases, the cosine or Jaccard measures are more meaningful. In future, I will add these measures to sifarish.
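As a rough sketch of the two measures on binary term vectors (again illustrative, since these are not yet in sifarish):

    import math

    def cosine_similarity(x, y):
        # dot product over the product of the vector magnitudes
        dot = sum(a * b for a, b in zip(x, y))
        norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
        return dot / norm if norm else 0.0

    def jaccard_similarity(x, y):
        # shared 1s over positions where either vector has a 1;
        # positions where both are 0 are ignored entirely
        both = sum(1 for a, b in zip(x, y) if a and b)
        either = sum(1 for a, b in zip(x, y) if a or b)
        return both / float(either) if either else 0.0

    # two sparse document term vectors: most terms absent from both
    d1 = [1, 0, 1, 0, 0, 0, 0, 1]
    d2 = [1, 0, 0, 0, 0, 0, 0, 1]
    print(cosine_similarity(d1, d2))    # ~0.82
    print(jaccard_similarity(d1, d2))   # 2/3 = 0.67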

Summing Up

In this post I have covered the basics. In the follow-up post, I will cover the Hadoop implementation. Distance based similarity is an O(m x n) problem, where m and n are the numbers of entities of the two different types.

Each entity of one type needs to be paired up with each entity of the other type. For a large data set, scaling becomes an issue, and that's where Hadoop comes into play.

For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on the cloud, including installation, configuration and testing.
