When I implemented feature similarity based matching engine in my open source Personalization and Recommendation Engine sifarish, it was for addressing the cold start problem. It allowed me to do content or feature based recommendation for users with limited engagement.
But later on, it evolved into a more general purpose search and matching engine work horse which was used to solve several other problems. In this post, I will show how to implement a location and time based search service by leveraging location and time window attributes that were added to the matching engine. In the front end of this service there could be mobile app.
It’s been reported that about 25 % of the location based searches from mobile phones are for restaurants. We will take this as our use case with a small caveat, as I will describe later.
The matching engine implemented in the map reduce class SameTypeSimilarity has been discussed in details in this earlier post. The basic concepts in similarity based matching is discussed in another earlier post.
Our goal is to find the distance between two objects in an n dimensional feature space. Similarity is inversely proportional to distance. Here are the main steps in the distance calculation.
- Distance is calculated between corresponding attributes of a pair of objects. The distance between attributes is always normalized to 1.0, based on value range meta data provided.
- The attributes could be numerical, categorical, text, geo location, location, time window, hour window and event.
- The attributes could be given different weights, reflecting their relative importance
- Attribute distances are aggregated using Euclidean or some other metric, which gives us the final distance between two objects.
Our focus in this will be on the location and time related attributes. These are attributes that allow us to contextual matching.
Contextual Search Service
In stead of impromptu search, our search service will allow you to plan ahead and find restaurants. As an user you will specify as part of a search request a geo location, some hour window and the type of cuisine and the service will find the matching restaurants that are close to the geo location specified and open during the hour specified and offering the cuisine requested.
There are two sets of objects; the user’s search request and the restaurant details. There are the attributes of the object.
- Geo location
- Hour window
The first attribute which is an unique identifier is needed. The second attribute is used for identification and documentation purpose and plays no role in matching. For restaurant data, ID will be the ID of the restaurant and for the search request data it will be search ID.
Location and Time Attributes
Location and time attributes supported are as follows. These are complex attributes involving multiple components
|timeWindow||Pair of date time values with the format yyyy-MM-dd HH:mm:ss;yyyy-MM-dd HH:mm:ss|
|hourWindow||Pair of hour minute values with the format HHmm;HHmm|
|location||It’s triplet consisting land mark, city and state with the format land mark;city;state|
|geoLocation||Pair of lat and long values with the format lat;long|
|event||Combination of location and time window|
With timeWindow we specify a date time range that may span across o or more days. For timeWindow, to compute distance we find the the intersection between two the the two ranges and normalize that with a max intersection value provided through metadata.
The attribute type hourWindow is handled the same way for distance computation. With hourWindow we specify hour and minute range that applies to any day.
For location attributes, distance in the range of o to 1 is computed based how many of the three components match. The first component of this attribute which is a land mark, could be a street name .
For geoLocation attributes comprising of lattitude and longitude, Haversine formula is used. A maximum distance is specified in metadata, which is used to normalize the calculated distance. The geoLocation attribute provides more accurate fine grained distance calculation compared to location.
Matching With MapReduce
There are two map reduce jobs involved. The first one SameTypeSimilarity calculates distance between between two objects one from each data set. In our example, the two data sets are restaurant and search request. The output will consists of search ID, restaurant ID and distance. Details of this map reduce can be found in earlier post.
The second map reduce TopMatches groups, sorts and limits the output of the first map reduce. Number of top matches desired is specified through a configuration parameter. The output will consist of n records for each search ID, where n is the number of top matches. Each row will contain search ID, restaurant ID and distance with distance in ascending order.
The Hadoop jobs require metadata about the different fields. I have created the JSON metadata file for this example. There is also a properties file containing the configuration parameters. There are few point about the metadata worth noting
- I have chosen euclidean for attribute distance aggregation. Other choices are manhattan and minkwoski
- For the cuisine field I have used text attribute type. If we have a predefined set of cuisine values, we could have used the categorical type instead. With categorical attribute, matching is binary with distance either 0 or 1.
- I am using jaccard coefficient for text similarity. Other choice is cosine. The parameter explodeThreshold is set. We will discuss this shortly.
- I have applied a nonlinear transformation for the location attribute distance. With a weight greater than 1, effective distance will be less than actual distance. As a result this attribute will have a greater influence in the final distance between objects
Effective Distance Between Attributes
By default, the effective distance is same as the distance. The effective distance between attributes could be computed based on a non linear functional transformation of the distance.
For example if the geolocation distance is below some threshold, you may not want to make a distinction and set the distance to 0. Otherwise it should be same as the distance. In this case the appropriate function is the ramp function.
Here are all the functions that are available. Some of them use the parameter weight. Some of them use the parameter functionThreshold. Some use both.
|nonLinear||Second order polynomial. Can be concave or convex depending on weight parameter|
|sigmoid||Sigmoid function. The parameter threshold is used as the transition point. Higher weight causes sharper transition making it more like a step function.|
|step||Step function. The transition from 0 to 1 happens at threshold|
|ramp||Ramp function. The effective distance is 0 below threshold otherwise equal to distance|
Imploding and Exploding Object Distance
Sometimes, you may want the distance between certain attributes to dictate the overall distance between objects irrespective of what other attribute values are. If the distance between some attribute of two objects is below implodeThreshold, the distance between the objects goes to 0. On the same token, if the attribute distance is above explodeThreshold, the distance between the objects goes to the maximum value.
For example, the user may have a strong preference for the cuisine that is specified in the search request. If the cuisines don’t match, the distance between the two objects should be maximally mismatched, irrespective of what other attribute values are. Effectively it will push those restaurant objects that don’t match based on that one attribute out of the neighborhood of the search object.
Other Usages of Matching Engine
As I mentioned earlier, the map reduce SameTypeSimilarity has been re purposed to solve various problems as below
- Near duplicate detection. All fields in the record are modeled as text field and edit distance algorithm is used to find near duplicates
- Nearest neighbor classifier (KNN). As the first stage of this process nearest neighbors for a test data point is found using this map reduce
We have seen how temporal and spatial dimensions can play crucial role in a contextual search service. They are equally important for contextual recommendation. Consider a collaborative filtering based recommendation system for music.
Generally in collaborative filtering we use a 2 dimensional item x user correlation matrix. However we may find that the kind of music people listen is correlated to the hour of the day. We could make the recommendations more contextual by making the correlation matrix 3 dimensional by adding the hour of the day as the third dimension.
Real Time Matching
The solution outlined is based on Hadoop batch processing. As search request trickles in, the system is expected to accumulate the requests for a period and then run Hadoop map reduce jobs to find the matches
For real time or near real time response, the matching algorithms could be implemented on Storm or Spark Streaming. However it’s not feasible to go through all restaurant data in real time. For real time response, it will be necessary to index restaurant data with the geo location and potentially some other attributes and store the data in a NOSQL database.
For faster response, it should also be possible to partition the restaurant data based on geo location and hold the the partitions either as cached RDD in Spark streaming or as in memory cache in Storm bolts.
We have gone through an exercise of how to implement a location and time based service using the matching engine available in sifarish.
There is no tutorial doc for this example. However, you could generate some data and follow the tutorial for product matching for the processing steps.