For recommendation engines, the preferred approach is to use social rating data. My earlier post on recommendation engine provides details of using Collaborative Filtering with social data.However, when no such data is available, or even if it’s available but the amount is not large enough, other approaches are necessary. This is the so called cold start problem. One such approach uses distance based similarity calculation in a multi dimensional feature space. It ‘s also known as content based recommendation.
In this post and in a follow up post, I will present a Hadoop based similarity calculation implementation. The implementation is available on github as a project called sifarish. This is the first of a series of Hadoop based recommendation engine implementation in sifarish. It will be augmented later with social rating data based collaborative filtering implementation.
Another example is a job recommendation system. If the resume and the job posting are represented as vectors of attributes, job recommendations can be made based on similarity matching between the two entities.
I will use recipe recommendation as an example in this post. It’s assumed that the user profile is available in terms user’s preference for cuisine category and other attributes. The system will recommend recipes ordered similarities with profile.
Our goal is to find the distances between a profile and the recipes and have the recipes in an ascending order of the distance i.e., in a descending order of the similarity.
Distance Based Similarity
Each entity is represented as a vector of attributes. Distance between two vectors in a multi dimensional space is expressed in a generic form by the following expression, which is called Minkwoski distance. Every attribute contributes to the overall distance between two objects.
It’s computed as follows
- Take absolute value of the difference between two points
- Raise result of 1 to the power of r
- Sum result of 2 over all attributes
- Raise the sum to the power of 1/r
Two specific distance calculation algorithm can be derived from the generic Minkowski distance depending on the value of r.
Euclidean distance is a popular distance calculation algorithm. In our implementation, distance calculation algorithm can be specified as either Manhattan, Euclidean or Minkowski, through configuration. If Minkowski is chosen, the value of r can also be configured.
Comparson with Database Query
Similarity matching can be thought of like a database query, where one the entities (e.g. user profile) is like the query and the other entities (e.g. products) are like the query results. But there are are some significant differences.
Database query is very crisp in nature i.e., results are returned only when the query conditions are met exactly. Similarity based matching is more fuzzy in nature. Perhaps an example will help. If you specify tv_size >= 42 in a product search, it will return only TV’s with size 42 or more with normal database query. But with with similarity based matching it will also return TV’s with size 40 or lower, because the algorithm is based on distance in a multi dimensional attribute space. However inexact matches will rank lower in similarity results.
In my next post on this topic, I will talk about some of the functions (minSoft, maxHard etc.) that can be applied in distance calculation of numerical attributes. They make distance between numerical attributes non linear.
Distance and Data Types
For numerical attributes, distance is intuitive enough. It’s simply the difference in values, although as mentioned earlier, it’s possible to impose different functions on the distance between numerical attributes.
What about categorical attributes like recipe category with values like veg, non veg, vegan, meat lover? For categorical attributes, the distance is 0 if the values are same and 1 if they are different.
For example, the user may specify the distance between non veg and meat lover to be 0.8, because they are naturally closer to each other, but for veg and non veg the default distance value of 1.0 makes sense.
In ranked categorical attribute, there is an implied order between the values, e.g., income level. The common approach is assign a sequence of numbers to the values. Once the mapping is done, it can can be treated as numerical data. In this case, the interval between the mapped values are fixed. If that is not desirable, the values could be mapped to a any set of numbers with the same ordering as the original data.
Some categorical attributes have hierarchical structure e.g., product category. Each node of the tree represents a categorical attribute value. The distance between two attribute values is defined as the smallest number of edges traversed in going from the tree node corresponding to one attribute value to the other node representing the other attribute value. For example, the distance two sibling nodes will be 2. The distance is normalized by the maximum distance between two nodes in the tree.
For boolean attribute the distance is 0 if the boolean values are same and 1 otherwise. For asymmetric boolean attribute, when both values are false it’s not significant and hence is not considered in the distance calculation. In asymmetric attribute, some times certain value is so prevalent, that it’s better to ignore such value in distance calculation, when two entities have that same value for the attribute. Otherwise, it will dominate the calculated distance value.
The distance between different attributes, potentially of different data types need to be combined to find the distance between two entity instances. To ensure equal contribution from attributes of different data types, some normalization is necessary. Attributes with categorical data is naturally scaled to the range 0 to 1.0.
For numerical attributes, the difference is divided by the difference between the min and max values for the attribute, forcing the resulting normalized distance to be in the range 0 to 1.0. In sifarish, the min and max values are specified as meta data configuration.
Sometimes it may be necessary to give more importance to some attributes, in the distance calculation. This is achieved by configuring an weight for attribute and taking the weight into considerations. The modified expression for distance is as follows.
Generally, the distance between two attributes is the difference in their values. However when weight is introduced, distance becomes a non linear function of the value difference. The second term is introduced to meet the boundary condition: dist = diff when diff = 1 i.e., when the two attribute values are maximally different, weight should not have any effect.
By assigning a weight greater than 1.0, the distance along that attribute dimension is lowered, increasing it’s contribution towards similarity between two entities and vice versa Sifarish allows attribute weights to be specified through configuration. If not specified, it defaults to 1.0.
Other Similarity Measures
In this post, my focus is distance based similarity calculation, which is appropriate for dense data.
In such cases cosine, jaccard and extended jaccard measures are more appropriate. One example of sparse and asymmetric data is document term vector for finding document similarity. When two documents are considered, most of the terms will be absent from both the documents.
If we consider the terms that are lacking in similarity calculation based on distance, then they will dominate the similarity value which will not reflect the true picture. In such cases, cosine or jaccard measures are more meaningful. In future, I will add these measures to sifarish.
In this post I have covered the basics. In the follow up post, I will cover the Hadoop implementation. Distance based similarity is an O(m x n) problem where m and n are the number of entities of the two different types.
Each entity of one type needs to be paired up with each entity of the other type. For large data set, scaling becomes an issue and that’s where Hadoop comes into play.