Measuring Campaign Effectiveness for an Online Service on Spark

Measuring campaign effectiveness is critical for any company to justify the marketing money being spent. Consider a company that offers a free online service on signup. It is critical for the company to convert its free users into paid subscribers as soon as possible.

In this post, we will use simple statistical techniques to find the relative merits of different campaigns in terms of effectiveness, which is measured by conversions. The Spark-based solution is available in the open source project chombo.

Campaign Effectiveness

We will consider a very specific campaign scenario. The online service company tries to lure its free users into becoming paid users through an online campaign shown when the user accesses its service. The campaign effectiveness is measured by the following criteria.

  1. Whether the user converts
  2. Conversion rate
  3. Time lag between signup time and conversion time

The conversion rate is a dynamic quantity and varies with time. For the free online service, people are signing up on a continuous basis and some of them convert at some point in the future.

So the conversion rate has to be associated with a time window. For example, we may define the conversion rate as the percentage of people who signed up in a given calendar year and subsequently converted.
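The cohort definition above can be sketched in a few lines of Python. This is only an illustration, not part of chombo: the function name `conversion_rate` and the record layout (signup date, conversion date or `None`) are assumptions made for the example, and the records are made up.

```python
from datetime import date

def conversion_rate(signups, year):
    """Percentage of users who signed up in `year` and converted at any later time."""
    # Cohort = everyone whose signup date falls in the given calendar year.
    cohort = [rec for rec in signups if rec[0].year == year]
    if not cohort:
        return 0.0
    converted = sum(1 for _, conv_date in cohort if conv_date is not None)
    return 100.0 * converted / len(cohort)

# Hypothetical records: (signup date, conversion date or None if not converted)
signups = [
    (date(2015, 3, 20), date(2015, 7, 18)),
    (date(2015, 8, 16), None),
    (date(2015, 5, 23), date(2015, 5, 30)),
    (date(2014, 11, 2), date(2015, 1, 9)),
    (date(2015, 6, 22), None),
]

rate_2015 = conversion_rate(signups, 2015)
print(f"2015 cohort conversion rate: {rate_2015:.1f}%")
```

Note that the user who signed up in November 2014 counts towards the 2014 cohort even though the conversion happened in 2015, which is the point of tying the rate to the signup window.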

The marketing department of the company changes its campaign every calendar year. The decision on changes to the campaign for a new year is highly influenced by the effectiveness of last year’s campaign relative to the previous year’s campaign.

If the campaign for the recent year is better than the previous year’s campaign, then the marketing department might stick to it. Otherwise, if they see a degradation, they might make changes to the campaign. This is where the analysis comes into the picture.

Although we are considering specific criteria for campaign effectiveness, it can be defined in a broader context.

Conversion Data

Our conversion data for customers who signed up in a given year is as follows. We are mostly interested in the time lag in months between signup date and conversion date, which is the last field.

  1. Customer ID
  2. Referring channel
  3. Signup date
  4. Number of months to conversion

Here is some sample data:

9CZ7ID43O4,organic search,2015-03-20,4
SG37B4L7WS,paid search,2015-08-16,3
215509TFBR,organic search,2015-05-23,0
N2RB32HMMP,paid search,2015-06-22,3
RT1W3YC0QS,paid search,2015-05-23,4
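As a minimal sketch, the records above could be parsed in Python as follows. The `Conversion` type and the `parse_conversions` helper are hypothetical names introduced for this example; they are not part of chombo.

```python
import csv
from datetime import date
from io import StringIO
from typing import NamedTuple

class Conversion(NamedTuple):
    customer_id: str
    channel: str        # referring channel
    signup_date: date
    lag_months: int     # number of months to conversion

SAMPLE = """\
9CZ7ID43O4,organic search,2015-03-20,4
SG37B4L7WS,paid search,2015-08-16,3
215509TFBR,organic search,2015-05-23,0
N2RB32HMMP,paid search,2015-06-22,3
RT1W3YC0QS,paid search,2015-05-23,4
"""

def parse_conversions(text):
    """Parse comma separated conversion records into typed tuples."""
    rows = csv.reader(StringIO(text))
    return [
        Conversion(r[0], r[1], date.fromisoformat(r[2]), int(r[3]))
        for r in rows
        if r  # skip blank lines
    ]

records = parse_conversions(SAMPLE)
lags = [rec.lag_months for rec in records]
print(lags)
```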

Statistical Analysis

We are going to calculate the distribution of conversion lag time. The Spark job NumericalAttrDistrStats calculates mean, median, standard deviation and other statistical quantities for any field containing numerical data. It’s a general purpose statistics calculator. In our case, we are only analyzing the 4th field, which is the time lag for conversion in months.

We perform the analysis on data for 2 consecutive years, 2014 and 2015. Each year has a slightly different online campaign. We want to compare the statistical analysis results side by side for those 2 years. Specifically, we will compare the campaign result of 2015 with reference to the campaign result of 2014. Here is the output for those 2 years.
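To make the kind of output concrete, here is a small single-machine sketch using only the Python standard library. It is not the NumericalAttrDistrStats implementation, which runs on Spark and may bin and interpolate values differently; the function name `lag_stats`, the choice of population standard deviation, the "inclusive" quantile method, and the lag values are all assumptions for illustration.

```python
import statistics
from collections import Counter

def lag_stats(lags):
    """Histogram and summary statistics for a list of conversion lags in months."""
    hist = Counter(lags)
    # Quartiles computed directly from the raw values (no binning).
    q25, q50, q75 = statistics.quantiles(lags, n=4, method="inclusive")
    return {
        "histogram": dict(sorted(hist.items())),
        "mean": statistics.mean(lags),
        "median": statistics.median(lags),
        "std_dev": statistics.pstdev(lags),   # population standard deviation
        "mode": statistics.mode(lags),
        "quantiles": (q25, q50, q75),
    }

# Hypothetical lag values in months, not taken from the real data set
lags = [4, 3, 0, 3, 4, 3, 1, 6]
stats = lag_stats(lags)
for name, value in stats.items():
    print(name, value)
```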

[Output listings: Year: 2014, Year: 2015]

The first part of the output is the actual distribution or histogram. The distribution is over number of customers converting against number of months lagging the signup date. The second part of the output contains various summary statistics. Here is the output with annotation and comments for better clarity.

| Statistics    | Year: 2014 | Year: 2015 | Comment |
|---------------|------------|------------|---------|
| Mean          | 3.912      | 3.315      | Mean lag time for 2015 is lower |
| Median        | 3.957      | 3.317      | Median lag time for 2015 is lower |
| Std Dev       | 2.770      | 2.700      | Std dev of lag time is about the same |
| Mode          | 2.500      | 0.500      | Most conversions for 2015 happen in the first month after signup |
| Quantile(25%) | 2.218      | 1.567      | Conversion happens earlier for 2015 |
| Quantile(50%) | 3.957      | 3.317      | Conversion happens earlier for 2015 |
| Quantile(75%) | 6.284      | 5.649      | Conversion happens earlier for 2015 |
| KL Divergence | 0.023      |            | The conversion lag distributions differ |

It goes without saying that conversion rate is the most important measurement. However, we are going beyond it and peering into more detailed statistics based on the distribution of conversion lag time for additional insight. For example, if two campaigns have the same conversion rate but customers convert earlier with one of them, that campaign is the better one.

Based on the review of the results for the two years as listed above, we can conclude that the campaign for 2015 was more effective. The customers who signed up in 2015 tended to convert earlier relative to their signup dates.

The last statistic, Kullback-Leibler (KL) divergence, requires some explanation. Essentially, it measures the difference between two distributions in terms of entropy. Entropy is a measure of randomness in a distribution. We will discuss this in more detail later.

The KL divergence only tells us of any difference in distributions at a macro level. We still have to look at other statistics listed above to gain better insight on the difference in responses of the two campaigns.

Kullback Leibler Divergence

KL divergence measures the difference between two distributions and it’s defined as below.

KL(p,q) = Σx p(x) log(p(x)/q(x))   where
p(x) = first distribution
q(x) = second distribution

KL divergence is the expected value, under p, of the logarithm of the ratio p(x)/q(x). Equivalently, it is the cross entropy of p relative to q minus the entropy of p. KL divergence is not a symmetric function, i.e. KL(p,q) is not equal to KL(q,p) in general. In our use case, we calculated the divergence between the conversion lag distributions of customers who signed up in 2014 and in 2015.
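The formula can be sketched directly in Python. This is an illustration under a few assumptions rather than the chombo code: it uses the natural logarithm (the base is not stated above), it skips terms where p(x) is zero, and it raises an error where q(x) is zero but p(x) is not, since the divergence is unbounded there. The helper `lag_distribution` and the lag samples are hypothetical.

```python
import math
from collections import Counter

def lag_distribution(lags, max_lag):
    """Normalized histogram of lags over the months 0..max_lag."""
    counts = Counter(lags)
    total = len(lags)
    return [counts.get(month, 0) / total for month in range(max_lag + 1)]

def kl_divergence(p, q):
    """KL(p,q) = sum over x of p(x) * log(p(x) / q(x)), natural log assumed."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue  # a term with p(x) = 0 contributes nothing
        if qx == 0.0:
            raise ValueError("q(x) is zero where p(x) is positive")
        total += px * math.log(px / qx)
    return total

# Hypothetical lag samples for two cohorts (months 0 and 1 only)
lags_a = [0] * 5 + [1] * 5   # p = [0.5, 0.5]
lags_b = [0] * 9 + [1] * 1   # q = [0.9, 0.1]
p = lag_distribution(lags_a, 1)
q = lag_distribution(lags_b, 1)

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print(kl_pq, kl_qp)  # the two values differ, showing the asymmetry
```

Because the result depends on the argument order, it helps to fix which year plays the role of p and which plays q before comparing values across analyses.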

Customer Segmentation

We could have done the analysis by customer segment, if there were reasons to believe that campaign response behavior differs based on the segment a customer belongs to. For example, we could have segmented customers based on the referring channel at signup.

Generally, in the reduceByKey operation, the index of the field being analyzed is used as the key. To analyze by customer segment, we need to use the composite key (referrer channel, field index). If the segment were defined by some other set of attributes, then those attributes would be included in the key.
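The composite key idea can be illustrated without Spark by grouping on a tuple in plain Python. This is a sketch of the keying scheme only, not the chombo job: the dictionary stands in for reduceByKey, the field positions are zero-based (an assumption), and the function name `mean_lag_by_segment` is made up for the example. The rows are the sample records shown earlier.

```python
from collections import defaultdict

CHANNEL_FIELD = 1  # referring channel (zero-based position, assumed)
LAG_FIELD = 3      # months to conversion, the field being analyzed

rows = [
    ["9CZ7ID43O4", "organic search", "2015-03-20", "4"],
    ["SG37B4L7WS", "paid search", "2015-08-16", "3"],
    ["215509TFBR", "organic search", "2015-05-23", "0"],
    ["N2RB32HMMP", "paid search", "2015-06-22", "3"],
    ["RT1W3YC0QS", "paid search", "2015-05-23", "4"],
]

def mean_lag_by_segment(rows):
    """Mean lag per composite key (referrer channel, field index)."""
    acc = defaultdict(lambda: [0, 0])  # key -> [count, sum of lags]
    for row in rows:
        key = (row[CHANNEL_FIELD], LAG_FIELD)
        acc[key][0] += 1
        acc[key][1] += int(row[LAG_FIELD])
    # Final step, analogous to mapping over the reduced (count, sum) pairs.
    return {key: total / count for key, (count, total) in acc.items()}

means = mean_lag_by_segment(rows)
for key, mean in sorted(means.items()):
    print(key, round(mean, 3))
```

Accumulating a (count, sum) pair per key keeps the merge step associative, which is what a reduceByKey style aggregation needs.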

Campaign Attribution

We have assumed that the conversion of a user is attributed to the campaign of the year in which the user signed up. This is a simplistic assumption. If a user signs up towards the tail end of a year, the campaign for the next year is likely to have a stronger influence on the user.

Attribution is a very challenging problem. There may have been other influencing factors and touch points besides the campaign that drove a customer to convert. There is no easy way to tell.

Engagement Score

Often the conversion rate is very small, which makes it necessary to consider other measures for evaluating campaign effectiveness. For an online campaign, as in our use case, we can consider other relevant non-conversion events, such as clicks, browsing reviews of the service or product, or adding to the shopping cart, and translate them into a net engagement score.

The engagement score will reflect the user’s level of interest in a product or service. This earlier post of mine has details on computation of engagement score.

We can calculate distribution of engagement score and use percentiles and statistical quantities as alluded to earlier to evaluate the effectiveness of a campaign.
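As a rough illustration, an engagement score could be a weighted sum of event counts, summarized across users with the same kind of statistics as before. The event names, the weights, and the sessions below are entirely hypothetical and chosen only to make the example run; the earlier post mentioned above describes the actual computation.

```python
import statistics

# Hypothetical weights: events closer to a purchase decision count more.
EVENT_WEIGHTS = {"click": 1.0, "review_browse": 2.0, "add_to_cart": 5.0}

def engagement_score(events, weights=EVENT_WEIGHTS):
    """Weighted sum of a user's non-conversion events; unknown events score 0."""
    return sum(weights.get(event, 0.0) for event in events)

# Hypothetical per-user event logs
sessions = {
    "u1": ["click", "click", "add_to_cart"],
    "u2": ["click"],
    "u3": ["review_browse", "click"],
}

scores = {user: engagement_score(events) for user, events in sessions.items()}
median_score = statistics.median(scores.values())
print(scores, median_score)
```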

Summing Up

We have used some simple statistical techniques to evaluate the relative effectiveness of campaigns. A Hadoop MR implementation of the solution is also available. The tutorial document contains the execution steps for this use case.

For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud, including installation, configuration and testing.


About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have a neat closed-form solution.