Location and Time Based Service


When I implemented feature similarity based matching engine in my open source Personalization and Recommendation Engine sifarish, it was for addressing the cold start problem. It allowed me to do content or feature based recommendation for users with limited engagement.

But later on, it evolved into a more general purpose search and matching engine work horse which was used to solve several other problems. In this post, I will show how to implement a location and time based search service by leveraging location and time window attributes that were added to the matching engine. In the front end of this service there could be mobile app.

It’s been reported that about 25 % of the location based searches from mobile phones are for restaurants. We will take this as our use case with a small caveat, as I will describe later.

Background

The matching engine implemented in the map reduce class SameTypeSimilarity has been discussed in details in this earlier post. The basic concepts in similarity based matching is discussed in another earlier post.

Our goal is to find the distance between two objects in an n dimensional feature space. Similarity is inversely proportional to distance. Here are the main steps in the distance calculation.

  1. Distance is calculated between corresponding attributes of a pair of objects. The distance between attributes is always normalized to 1.0, based on value range meta data provided.
  2. The attributes could be numerical, categorical, text, geo location, location, time window, hour window and event.
  3. The attributes could be given different weights, reflecting their relative importance
  4. Attribute distances are aggregated using Euclidean or some other metric, which gives us the final distance between two objects.

Our focus in this will be on the location and time related attributes. These are attributes that allow us to contextual matching.

Contextual Search Service

In stead of impromptu search, our search service will allow you to plan ahead and find restaurants. As an user you will specify as part of a search request a geo location,  some hour window and the type of cuisine and the service will find the matching restaurants that are close to the geo location specified and open during the hour specified and offering the cuisine requested.

There are two sets of objects; the user’s search request and the restaurant details. There are the attributes of the object.

  1. ID
  2. Name
  3. Cuisine
  4. Geo location
  5. Hour window
  6. Rating

The first attribute which is an unique identifier is needed. The second attribute is used for identification and documentation purpose and  plays no role in matching. For restaurant data, ID will be the ID of the restaurant and for the search request data it will be search ID.

Location and Time Attributes

Location and time attributes supported are as follows. These are complex attributes involving multiple components

Attribute Description
timeWindow Pair of date time values with the format yyyy-MM-dd HH:mm:ss;yyyy-MM-dd HH:mm:ss
hourWindow Pair of hour minute values with the format HHmm;HHmm
location It’s triplet consisting land mark, city and state with the format land mark;city;state
geoLocation Pair of lat and long values with the format lat;long
event Combination of location and time window

With timeWindow we specify a date time range that may span across o or more days. For timeWindow, to compute distance we find the the intersection between two the the two ranges and normalize that with a max intersection value provided through metadata.

The attribute type hourWindow is handled the same way for distance computation. With hourWindow we specify hour and minute range that applies to any day.

For location attributes, distance in the range of o to 1 is computed based how many of the three components match. The first component of this attribute which is a land mark, could be a street name .

For geoLocation attributes comprising of lattitude and longitude, Haversine formula is used. A maximum distance is specified in metadata, which is used to normalize the calculated distance. The geoLocation attribute provides more accurate fine grained distance calculation compared to location.

Matching With MapReduce

There are two map reduce jobs involved. The first one SameTypeSimilarity calculates distance between between two objects one from each data set. In our example, the two data sets are restaurant and search request. The output will consists of search ID, restaurant ID and  distance. Details of this map reduce can be found in earlier post.

The second map reduce TopMatches groups, sorts and limits the output of the first map reduce. Number of top matches desired is specified through a configuration parameter. The output will consist of n records for each search ID, where n is the number of top matches. Each row will contain search ID, restaurant ID and distance with distance in ascending order.

The Hadoop jobs require metadata about the different fields. I have created the JSON metadata file for this example. There is also a properties file containing the  configuration parameters. There are few point about the metadata worth noting

  1. I have chosen euclidean for  attribute distance aggregation. Other choices are manhattan and minkwoski
  2. For the  cuisine field I have used text attribute type. If we have a predefined set of cuisine values, we could have used the categorical type instead. With categorical attribute, matching is binary with distance either 0 or 1.
  3. I am using jaccard coefficient for text similarity. Other choice is cosine. The parameter explodeThreshold is set. We will discuss this shortly.
  4. I have applied a nonlinear transformation  for the location attribute distance. With a weight greater than 1, effective distance will be less than actual distance. As a result this attribute will have a greater influence in the final distance between objects

Effective Distance Between Attributes

By default, the effective distance is same as the distance. The effective distance between attributes could be computed based on a non linear functional transformation of the distance.

For example if the geolocation distance is below some threshold, you may not want to make a distinction and set the distance to 0. Otherwise it should be same as the distance. In this case the appropriate function is the ramp function.

Here are all the functions that are available. Some of them use the parameter weight. Some of them use the parameter functionThreshold. Some use both.

Function Description
nonLinear Second order polynomial. Can be concave or convex depending on weight parameter
sigmoid Sigmoid function. The parameter threshold is used as the transition point. Higher weight causes sharper transition making it more like a step function.
step Step function. The transition from 0 to 1 happens at threshold
ramp Ramp function. The effective distance is 0 below threshold otherwise equal to distance

 Imploding and Exploding Object Distance

Sometimes, you may want the distance between certain attributes to dictate the overall distance between objects irrespective of what other attribute values are. If the distance between some attribute of two objects is below implodeThreshold, the distance between the objects goes to 0. On the same token, if the attribute distance is above explodeThreshold, the distance between the objects goes to the maximum value.

For example, the user may have a strong preference for the cuisine that is specified in the search request. If the cuisines don’t match, the distance between the two objects should be maximally mismatched, irrespective of what other attribute values are. Effectively it will push those restaurant objects that don’t match based on that one attribute out of the neighborhood of the search object.

Other Usages of Matching Engine

As I mentioned earlier, the map reduce SameTypeSimilarity has been re purposed  to solve various problems as below

  1. Near duplicate detection. All fields in the record are modeled as text field and edit distance algorithm is used to find near duplicates
  2. Nearest neighbor classifier (KNN). As the first stage of this process nearest neighbors for a test data point is found using this map reduce

Contextual Recommendation

We have seen how temporal and spatial dimensions can play crucial role in a contextual search service. They are equally important for contextual recommendation. Consider a collaborative filtering based recommendation system for music.

Generally in collaborative filtering we use a 2 dimensional  item x user correlation matrix. However we may find that the kind of music people listen is  correlated to the hour of the day. We could make the recommendations more contextual  by making the correlation matrix 3 dimensional by adding the hour of the day as the third dimension.

Real Time Matching

The solution outlined is based on Hadoop batch processing. As search request trickles in, the system is expected to  accumulate the requests for a period and then run Hadoop map reduce jobs to find the matches

For real time or near real time response, the matching algorithms could be implemented on Storm or Spark Streaming. However it’s not feasible to go through all restaurant data in real time. For real time response, it will be necessary to index restaurant  data with the geo location and potentially some other attributes and store the data in a NOSQL database.

For faster response, it should also be possible to  partition the restaurant data based on geo location and hold the the partitions either as cached RDD in Spark streaming  or as in memory cache in Storm bolts.

Final Thoughts

We have gone through an exercise of how to implement a location and time based service using the matching engine available in sifarish.

There is no tutorial doc for this example. However, you could generate some data and  follow the tutorial for product matching for the processing steps.

Advertisements

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Big Data, Hadoop and Map Reduce, Mobile, Real Time Processing, Recommendation Engine, Search, Spark, Storm and tagged , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s