Presence Data Analytics using MongoDB and Map Reduce


My last post was on location data query and indexing using MongoDB. Support for location data queries and indexes is a unique and powerful feature of MongoDB. Continuing along the same thread, I will dig into the map reduce framework built right into MongoDB.

Some NoSQL database systems provide a built-in map reduce framework. When the query engine is not enough for complex aggregate queries or other complex computation, you can take control by using the map reduce engine provided by the database, in this case MongoDB.

With MongoDB, the developer provides the map and reduce functions in JavaScript and then executes the MongoDB mapreduce command. Although map reduce is a powerful and scalable parallel processing framework, its full power cannot be unleashed unless your data is sharded in MongoDB. Otherwise, MongoDB’s map reduce executes sequentially in a single thread.

The Ad Campaign Scenario

In this post I will show how to plan a mobile ad campaign, leveraging MongoDB’s map reduce to perform analytics on presence data.

A restaurant in San Francisco wants to bring in 50 additional customers during the lunch hour. The owner has decided to launch a mobile advertisement campaign and serve coupons through SMS or email. Assuming a conversion rate of 5%, 1000 mobile phone users need to be targeted with the advertisement every day (50 / 0.05 = 1000). The owner wants to strike a campaign deal with a big service provider.

Before it makes a deal with the restaurant owner, the service provider needs some assurance that 1000 of its customers are in the vicinity of the restaurant within a time window, say between 12 and 1 PM, every day during the campaign period.

The service provider decides to perform analytics on its presence data for the past 2 or 3 months. The service provider wants to find the number n such that the probability of finding at least n customers is at least 0.9.

Once found, the number should be 1000 or more to meet the campaign’s criteria. Another way to express the requirement is that if the campaign runs for 10 days, then on 9 out of the 10 days there will be at least 1000 users in the vicinity between 12 and 1 PM. It can be thought of as an SLA for the ad campaign. The requirement can be summarized as below.

 P(x >= n) >= 0.9  and  n >= 1000
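
To make the requirement concrete, here is a small Ruby sketch that checks it directly against a list of daily counts. The method names and the sample numbers are made up for illustration; they are not the service provider's data, and this is not the map reduce approach used in the rest of the post.

```ruby
# Fraction of days on which at least n users were present.
def fraction_of_days_with_at_least(daily_counts, n)
  return 0.0 if daily_counts.empty?
  daily_counts.count { |c| c >= n }.to_f / daily_counts.size
end

# Hypothetical helper: true when P(x >= n) >= confidence and n >= target.
def meets_sla?(daily_counts, n, target = 1000, confidence = 0.9)
  n >= target && fraction_of_days_with_at_least(daily_counts, n) >= confidence
end

# Users to target per day: 50 extra customers at a 5% conversion rate.
required_users = 50 * 100 / 5

# Made-up counts for a 10 day campaign; 9 of the 10 days reach 1000 users.
sample_counts = [1040, 1100, 980, 1210, 1005, 1150, 1075, 1020, 1300, 1090]

puts "required users per day: #{required_users}"
puts "SLA met: #{meets_sla?(sample_counts, required_users)}"
```

With real data the daily counts would come from the presence records, which is what the first map reduce job below produces.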

Analytic Solution

Our ultimate goal is the cumulative distribution of the number of mobile phone users in the vicinity of the restaurant between 12 and 1 PM. I have two map reduce jobs to accomplish this.

The first one counts the number of users subject to the location and time constraints, for every day in the past data. Essentially it’s an aggregator. I have added the map and reduce functions in the initialize method of the Ruby class as follows. They are essentially strings containing JavaScript code fragments.


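require 'mongo'

class Tracker
  def initialize
    @db = Mongo::Connection.new.db("test")
    @coll = @db.collection("trackerData")
    @bucket_size = 5

    # Map for job 1: emit one count for each record that falls in the
    # 12 to 1 PM (UTC) hour and inside the hard-coded bounding box.
    # The key is month:day. Note that getUTCMonth() is zero-based.
    @daily_count_map = <<-eos
      function() {
        var key = this.time.getUTCMonth() + ':' + this.time.getUTCDate();
        if (this.time.getUTCHours() == 12 && this.loc.lat > 36.9 &&
            this.loc.lat < 37.1 && this.loc.long > 121.9 && this.loc.long < 122.1) {
          emit(key, {count: 1});
        }
      };
    eos

    # Reduce for job 1: add up the counts emitted for one day.
    @daily_count_reduce = <<-eos
      function(key, values) {
        var sum = 0;
        values.forEach(function(f) {
          sum += f.count;
        });
        return {count: sum};
      };
    eos

    # Map for job 2: the key is the histogram bucket index of a daily count.
    # The bucket width is interpolated from @bucket_size instead of repeating 5.
    @presence_histogram_map = <<-eos
      function() {
        var key = Math.floor(this.value.count / #{@bucket_size});
        emit(key, {count: 1});
      };
    eos

    # Reduce for job 2: count the days that fall in one bucket.
    @presence_histogram_reduce = <<-eos
      function(key, values) {
        var sum = 0;
        values.forEach(function(f) {
          sum += f.count;
        });
        return {count: sum};
      };
    eos
  end
end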

In the map function, I am using a concatenation of the month and the day of the month as the key. In the reduce function I am just adding up all the counts for a given day. Unfortunately, I had to hard code the geospatial condition inside the map function. The downside of this hack is that all the data in the collection gets passed to the map function, and the map function does the filtering.
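
If it helps to see the data flow without a database, the same map, group and reduce steps can be imitated in plain Ruby. This is only a sketch of the semantics, not how MongoDB runs the job; the method names and the five sample records are made up, and the records merely mimic the shape of the trackerData documents.

```ruby
# Map step: return the [key, value] pairs a record would emit.
def daily_count_map(rec)
  t = rec[:time]
  loc = rec[:loc]
  in_box = loc[:lat] > 36.9 && loc[:lat] < 37.1 &&
           loc[:long] > 121.9 && loc[:long] < 122.1
  return [] unless t.hour == 12 && in_box
  # JavaScript's getUTCMonth() is zero-based, so subtract 1 to get the same key.
  [["#{t.month - 1}:#{t.day}", { :count => 1 }]]
end

# Reduce step: add up the counts emitted for one key.
def daily_count_reduce(_key, values)
  { :count => values.inject(0) { |sum, v| sum + v[:count] } }
end

# Run map over every record, group the emitted pairs by key, then reduce.
def run_daily_count(records)
  emitted = records.flat_map { |r| daily_count_map(r) }
  result = {}
  emitted.group_by { |pair| pair[0] }.each do |key, pairs|
    result[key] = daily_count_reduce(key, pairs.map { |pair| pair[1] })
  end
  result
end

# Made-up sample records.
records = [
  { :time => Time.utc(2011, 11, 10, 12, 5),  :loc => { :lat => 37.0,  :long => 122.0 } },
  { :time => Time.utc(2011, 11, 10, 12, 40), :loc => { :lat => 37.05, :long => 121.95 } },
  { :time => Time.utc(2011, 11, 10, 14, 0),  :loc => { :lat => 37.0,  :long => 122.0 } }, # wrong hour
  { :time => Time.utc(2011, 11, 11, 12, 15), :loc => { :lat => 36.0,  :long => 122.0 } }, # outside box
  { :time => Time.utc(2011, 11, 11, 12, 30), :loc => { :lat => 37.0,  :long => 122.0 } }
]

run_daily_count(records).each { |key, value| puts "#{key} => #{value[:count]}" }
```

Every record goes through the map step here, including the two that are filtered out, which is exactly the cost of doing the filtering inside map.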

The right way is to pass the query as an option called query, along with the map and reduce functions, when invoking map reduce. Unfortunately, MongoDB throws an exception when a geospatial query is passed to the map reduce command.

Generally, MongoDB stores the output of map reduce in a temporary collection. But I wanted to store it in a known collection, so that it could be used as input for the second map reduce job. The collection name is provided through the out option.
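
One possible workaround, which I have not verified against the MongoDB version used for this post, is to express the bounding box as plain range conditions on loc.lat and loc.long rather than as a geospatial query, and pass that through the query option. The helper name below is hypothetical. The hour-of-day check would still have to stay in the map function.

```ruby
# Hypothetical helper: builds the options hash for map_reduce, with an
# ordinary range query on the bounding box (no geospatial operator).
def bounding_box_options(ll_lat, ll_long, ur_lat, ur_long, out_name)
  {
    :query => {
      'loc.lat'  => { '$gt' => ll_lat,  '$lt' => ur_lat },
      'loc.long' => { '$gt' => ll_long, '$lt' => ur_long }
    },
    :out => out_name
  }
end

opts = bounding_box_options(36.9, 121.9, 37.1, 122.1, 'daily_count')
puts opts.inspect

# Assumed usage inside the Tracker class (not run here):
#   @coll.map_reduce(@daily_count_map, @daily_count_reduce, opts)
```

If this works in your setup, only the documents inside the box reach the map function, instead of the whole collection.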


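# map reduce for daily presence count
def daily_presence_count(ll_lat, ll_long, ur_lat, ur_long)
  out_col = @db.collection('daily_count')
  out_col.remove

  # The bounding box is built here but not used yet, because the
  # geospatial condition is hard coded in the map function.
  lower_left = [ll_lat, ll_long]
  upper_right = [ur_lat, ur_long]
  box = [lower_left, upper_right]

  # Store the result in a named collection so the second job can read it.
  results = @coll.map_reduce(@daily_count_map, @daily_count_reduce,
                             {:out => 'daily_count'})

  out_col.find().each { |doc| puts doc.inspect }
end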

Here is some output of this map reduce, which is stored in the collection daily_count. The _id of each document is the key emitted by the map function, which is the month and the day of the month concatenated (the month comes from getUTCMonth() and is therefore zero-based). The value of count is the number of users found for that day.

 {"_id"=>"10:10", "value"=>{"count"=>36.0}}
 {"_id"=>"10:11", "value"=>{"count"=>34.0}}
 {"_id"=>"10:12", "value"=>{"count"=>26.0}}
 {"_id"=>"10:13", "value"=>{"count"=>35.0}}
 {"_id"=>"10:14", "value"=>{"count"=>37.0}}
 {"_id"=>"10:15", "value"=>{"count"=>34.0}}
 {"_id"=>"10:16", "value"=>{"count"=>38.0}}
 {"_id"=>"10:17", "value"=>{"count"=>28.0}}
 {"_id"=>"10:18", "value"=>{"count"=>33.0}}
 {"_id"=>"10:19", "value"=>{"count"=>35.0}}

The second map reduce operates on the output of the first map reduce and creates a histogram. You can think of a histogram as a discrete, sampled representation of a probability density function. With a histogram, you can get a rough estimate of the probability P(n) of having n users subject to the location and time constraints on a given day.

I am using a bucket width of 5 users. The key emitted by the map function is the bucket index. The reducer sums the counts for each bucket. This is how the second map reduce gets invoked.
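
As a quick illustration of the bucketing, here is the same computation in plain Ruby, applied to the ten daily counts printed above. The method name is mine; the real job runs inside MongoDB on the whole daily_count collection.

```ruby
# Count how many days fall into each bucket of the given width.
# The bucket index is floor(count / bucket_size), as in the map function.
def presence_histogram(daily_counts, bucket_size = 5)
  histogram = Hash.new(0)
  daily_counts.each { |count| histogram[(count / bucket_size).floor] += 1 }
  histogram
end

# The ten daily counts shown in the sample output of the first job.
daily_counts = [36, 34, 26, 35, 37, 34, 38, 28, 33, 35]

presence_histogram(daily_counts).sort.each do |bucket, days|
  low = bucket * 5
  puts "bucket #{bucket} (#{low} to #{low + 4} users): #{days} days"
end
```

For these ten days, bucket 5 holds 2 days, bucket 6 holds 3 days and bucket 7 holds 5 days.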

# map reduce for daily presence histogram
def daily_presence_hist
  out_col = @db.collection('presence_hist')
  out_col.remove
  count_col = @db.collection('daily_count')
  results = count_col.map_reduce(@presence_histogram_map,
                                 @presence_histogram_reduce, {:out => 'presence_hist'})

  out_col.find().each { |doc| puts doc.inspect }
end

The output of this map reduce, which is stored in the collection presence_hist, is as follows. The _id is the histogram bucket index. The value of count is the number of days.

 {"_id"=>4.0, "value"=>{"count"=>7.0}}
 {"_id"=>5.0, "value"=>{"count"=>11.0}}
 {"_id"=>6.0, "value"=>{"count"=>26.0}}
 {"_id"=>7.0, "value"=>{"count"=>13.0}}
 {"_id"=>8.0, "value"=>{"count"=>4.0}}

From the histogram we can infer, for example, that among all the days sampled in our analysis, on 11 different days the number of users found within the time and location constraints was between 25 and 29. These numbers are low, because I didn’t have enough test data in my database.

With the histogram in hand, the only thing left is to find the number of users such that there is at least a 90% chance that that many users will be found on any given day. This is found by a cumulative distribution calculation, as below.


# cumulative distribution for presence
def cum_dist(percent)
  pres_hist = @db.collection('presence_hist')
  # Sort by bucket index so the running sum walks up from the lowest bucket.
  ar = pres_hist.find().to_a.sort_by { |doc| doc['_id'] }
  count = ar.inject(0) { |s, m| s + m['value']['count'] }
  # Number of days that are allowed to fall short of the answer.
  cut_off = (count * (100 - percent)) / 100.0
  puts "count: #{count}  cut_off: #{cut_off}"

  sum = 0
  presence = 0
  ar.each do |doc|
    sum += doc['value']['count']
    if sum > cut_off
      # Lower bound of the first bucket whose running total exceeds the cut-off.
      presence = @bucket_size * doc['_id']
      break
    end
  end
  puts "presence: #{presence}"
  presence
end

This method is called with 90 as the argument in our case. If the number returned by this method is 1000 or more, the service provider has reasonable assurance of targeting 1000 users on 9 out of 10 days, and the SLA requirement will be met.

If you visualize the histogram with the number of users along the x axis and the number of days on which that many users were found along the y axis, what we have done is find a point along the x axis such that the area under the histogram to the right of the point is at least 90% of the total area under the histogram.
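
Here is the same calculation worked through in plain Ruby on the five histogram buckets shown earlier, without the database. The method names are mine, and the code is a sketch of the idea rather than the class method above.

```ruby
# Lower bound (in users) of the first bucket at which the running count of
# days, walking up from the lowest bucket, exceeds (100 - percent)% of all days.
def presence_threshold(histogram, percent, bucket_size = 5)
  total = histogram.values.inject(0) { |s, c| s + c }
  cut_off = total * (100 - percent) / 100.0
  running = 0
  histogram.sort.each do |bucket, days|
    running += days
    return bucket_size * bucket if running > cut_off
  end
  0
end

# Fraction of days whose bucket starts at or above the given number of users.
def fraction_at_or_above(histogram, users, bucket_size = 5)
  total = histogram.values.inject(0) { |s, c| s + c }
  kept = histogram.select { |bucket, _days| bucket * bucket_size >= users }
  kept.values.inject(0) { |s, c| s + c }.to_f / total
end

# Bucket index => number of days, copied from the presence_hist output.
hist = { 4 => 7, 5 => 11, 6 => 26, 7 => 13, 8 => 4 }

puts "threshold for 90%: #{presence_threshold(hist, 90)}"
puts "fraction of days with 20 or more users: #{fraction_at_or_above(hist, 20)}"
puts "fraction of days with 25 or more users: #{fraction_at_or_above(hist, 25).round(3)}"
```

There are 61 days in total, so the cut-off is 6.1 days. The lowest bucket already holds 7 days, so the answer is its lower bound, 20 users. Moving the point one bucket to the right, to 25 users, leaves only 54 of the 61 days (about 88.5%), which falls short of 90%.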

Final Thoughts

This is a good post on MapReduce in MongoDB. Here is an interesting post on using MapReduce and MongoDB for log data analysis.

My ad campaign example is somewhat naive. An actual ad campaign is far more complex, involving many parameters. What I wanted to demonstrate is how you can perform location or presence data analytics with MongoDB and its mapreduce command.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Data Mining, Map Reduce, MongoDB, Ruby.
