Measuring campaign effectiveness is critical for any company that needs to justify the marketing money being spent. Consider a company that provides a free online service on signup. It’s critical for the company to convert those free users to a paid subscription as soon as possible.
In this post, we will use simple statistical techniques to find the relative merits of different campaigns in terms of effectiveness, which is measured by conversions. The Spark-based solution is available in the open source project chombo.
We will consider a very specific campaign scenario. The online service company tries to lure its free users into becoming paid users through an online campaign shown when the user accesses its service. The campaign effectiveness is measured by the following criteria.
- Whether the user converts
- Conversion rate
- Time lag between signup time and conversion time
The conversion rate is a dynamic quantity and varies with time. For the free online service, people sign up on a continuous basis and some of them convert at some point in the future.
So the conversion rate has to be associated with a time window. For example, we may define the conversion rate as the percentage of people who signed up in a given calendar year and subsequently converted.
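To make the windowed definition concrete, here is a minimal Python sketch. The record layout (customer ID, signup date, conversion date or `None`), the sample records, and the function name `conversion_rate` are all made up for illustration and are not part of chombo.

```python
from datetime import date

# Hypothetical records: (customer_id, signup_date, conversion_date or None).
signups = [
    ("A1", date(2015, 3, 20), date(2015, 7, 2)),
    ("A2", date(2015, 8, 16), None),
    ("A3", date(2014, 5, 23), date(2014, 5, 30)),
    ("A4", date(2015, 11, 30), date(2016, 7, 15)),
    ("A5", date(2015, 6, 22), None),
]

def conversion_rate(records, year):
    """Percentage of users who signed up in `year` and converted at any later time."""
    cohort = [r for r in records if r[1].year == year]
    if not cohort:
        return 0.0
    converted = sum(1 for r in cohort if r[2] is not None)
    return 100.0 * converted / len(cohort)

rate_2015 = conversion_rate(signups, 2015)
print(f"2015 cohort conversion rate: {rate_2015:.1f}%")
```

The cohort is fixed by the signup year, while the conversion itself may happen in a later year, as with customer A4 above.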
The marketing department of the company changes its campaign every calendar year. The decision on changes to the campaign for a new year is highly influenced by the effectiveness of last year’s campaign relative to the previous year’s campaign.
If the campaign for the most recent year is better than the previous year’s campaign, the marketing department might stick with it. Otherwise, if they see a degradation, they might make changes to the campaign. This is where the analysis comes into the picture.
Although we are considering specific criteria for campaign effectiveness, it can be defined in a broader context.
Our conversion data for customers who signed up in a given year is as follows. We are mostly interested in the time lag in months between signup date and conversion date, which is the last field.
- Customer ID
- Referring channel
- Signup date
- Number of months to conversion
Here is some sample data:
9CZ7ID43O4,organic search,2015-03-20,4
SG37B4L7WS,paid search,2015-08-16,3
215509TFBR,organic search,2015-05-23,0
YO1O82VP2J,direct,2015-11-30,8
N2RB32HMMP,paid search,2015-06-22,3
RT1W3YC0QS,paid search,2015-05-23,4
JGUU8XKG6C,social,2015-05-11,9
We are going to calculate the distribution of conversion lag time. The Spark job NumericalAttrDistrStats calculates mean, median, standard deviation and other statistical quantities for any field containing numerical data. It is a general-purpose statistics calculator. In our case, we are only analyzing the 4th field, which is the time lag for conversion in months.
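As a rough illustration of the kind of statistics involved, the following plain Python sketch computes a few of them over the 4th field of the sample records. It is not the NumericalAttrDistrStats implementation; the variable names are illustrative, and the choice of population standard deviation is an assumption.

```python
import statistics
from collections import Counter

sample = """\
9CZ7ID43O4,organic search,2015-03-20,4
SG37B4L7WS,paid search,2015-08-16,3
215509TFBR,organic search,2015-05-23,0
YO1O82VP2J,direct,2015-11-30,8
N2RB32HMMP,paid search,2015-06-22,3
RT1W3YC0QS,paid search,2015-05-23,4
JGUU8XKG6C,social,2015-05-11,9
"""

LAG_FIELD = 3  # zero-based index of the 4th field (months to conversion)

lags = [int(line.split(",")[LAG_FIELD]) for line in sample.splitlines() if line.strip()]

summary = {
    "count": len(lags),
    "mean": statistics.mean(lags),
    "median": statistics.median(lags),
    # Population std dev is an assumption; the Spark job may use another estimator.
    "std_dev": statistics.pstdev(lags),
}

# Normalized histogram: fraction of customers converting at each lag in months.
counts = Counter(lags)
histogram = {month: counts[month] / len(lags) for month in sorted(counts)}

print(summary)
print(histogram)
```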
We perform the analysis on data for 2 consecutive years, 2014 and 2015. Each year has a slightly different online campaign. We want to compare the statistical analysis results side by side for those 2 years. Specifically, we will compare the campaign result of 2015 with reference to the campaign result of 2014. Here is the output for those 2 years:
Year: 2014
(3,12,0.500,0.094,1.500,0.124,2.500,0.147,3.500,0.141,4.500,0.113,5.500,0.104,6.500,0.095,7.500,0.065,8.500,0.034,9.500,0.046,10.500,0.019,11.500,0.018,3.912,3.957,2.770,2.500,2.218,3.957,6.284,0.0,0)
Year: 2015
(3,12,0.500,0.161,1.500,0.157,2.500,0.144,3.500,0.120,4.500,0.107,5.500,0.094,6.500,0.069,7.500,0.067,8.500,0.031,9.500,0.025,10.500,0.019,11.500,0.006,3.315,3.317,2.700,0.500,1.567,3.317,5.649,0.022,11)
The first part of the output is the actual distribution or histogram. Each pair gives the midpoint of a one-month bin of lag after the signup date, followed by the fraction of converting customers that fall into that bin. The second part of the output contains various summary statistics. Here is the output with annotation and comments for better clarity.
| Statistics | Year: 2014 | Year: 2015 | Comment |
|---|---|---|---|
| Mean | 3.912 | 3.315 | Mean lag time for 2015 is lower |
| Median | 3.957 | 3.317 | Median lag time for 2015 is lower |
| Std Dev | 2.770 | 2.700 | Std dev of lag time is about the same |
| Mode | 2.500 | 0.500 | Most conversions for 2015 happen in the first month after signup |
| Quantile(25%) | 2.218 | 1.567 | Conversion happens earlier for 2015 |
| Quantile(50%) | 3.957 | 3.317 | Conversion happens earlier for 2015 |
| Quantile(75%) | 6.284 | 5.649 | Conversion happens earlier for 2015 |
| KL Divergence | | 0.022 | There is a difference in conversion lag distribution |
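The raw output lines can be split into the histogram and the summary statistics with a few lines of Python. This is a sketch only: the layout assumed here (a field index, a bin count, then bin midpoint and fraction pairs, then the statistics in the order shown in the annotated output) is inferred from the output in this post, not from chombo documentation, and `parse_output` is a hypothetical helper.

```python
RAW_2014 = ("(3,12,0.500,0.094,1.500,0.124,2.500,0.147,3.500,0.141,4.500,0.113,"
            "5.500,0.104,6.500,0.095,7.500,0.065,8.500,0.034,9.500,0.046,"
            "10.500,0.019,11.500,0.018,3.912,3.957,2.770,2.500,2.218,3.957,6.284,0.0,0)")
RAW_2015 = ("(3,12,0.500,0.161,1.500,0.157,2.500,0.144,3.500,0.120,4.500,0.107,"
            "5.500,0.094,6.500,0.069,7.500,0.067,8.500,0.031,9.500,0.025,"
            "10.500,0.019,11.500,0.006,3.315,3.317,2.700,0.500,1.567,3.317,5.649,0.022,11)")

STAT_NAMES = ["mean", "median", "std_dev", "mode", "q25", "q50", "q75", "kl_divergence"]

def parse_output(raw):
    """Split one output tuple into (histogram, stats) under the assumed layout."""
    values = raw.strip("()").split(",")
    num_bins = int(values[1])  # assumption: the second value is the number of bins
    pairs = values[2:2 + 2 * num_bins]
    histogram = [(float(pairs[i]), float(pairs[i + 1])) for i in range(0, len(pairs), 2)]
    # The trailing value after the KL divergence is not interpreted here.
    stats = dict(zip(STAT_NAMES, map(float, values[2 + 2 * num_bins:])))
    return histogram, stats

hist_2014, stats_2014 = parse_output(RAW_2014)
hist_2015, stats_2015 = parse_output(RAW_2015)

# The mode is the midpoint of the bin holding the largest fraction of customers.
mode_2014 = max(hist_2014, key=lambda b: b[1])[0]
mode_2015 = max(hist_2015, key=lambda b: b[1])[0]
print(mode_2014, mode_2015)
```

The bin fractions in each histogram sum to 1, so each one can be read directly as a probability distribution over lag months.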
It goes without saying that conversion rate is the most important measurement. However, we are going beyond it and peering into more detailed statistics based on the distribution of conversion lag time for additional insight. For example, if two campaigns have the same conversion rate but customers convert earlier with one of them, that campaign is the better one.
Based on the review of the results for the two years as listed above, we can conclude that the campaign for 2015 was more effective. The customers who signed up in 2015 tended to convert sooner after their signup dates.
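One way to see the earlier conversion directly is to accumulate the two histograms from the output above. The sketch below is only an illustration; the variable names are ours, and the fractions are shares of customers who converted, not of all signups.

```python
from itertools import accumulate

# Fraction of converting customers per one-month lag bin, copied from the output above.
p_2014 = [0.094, 0.124, 0.147, 0.141, 0.113, 0.104,
          0.095, 0.065, 0.034, 0.046, 0.019, 0.018]
p_2015 = [0.161, 0.157, 0.144, 0.120, 0.107, 0.094,
          0.069, 0.067, 0.031, 0.025, 0.019, 0.006]

# Cumulative share of converters whose lag is under k months, for k = 1..12.
cum_2014 = list(accumulate(p_2014))
cum_2015 = list(accumulate(p_2015))

for k, (c14, c15) in enumerate(zip(cum_2014, cum_2015), start=1):
    print(f"lag under {k:2d} months: 2014 {c14:.3f}   2015 {c15:.3f}")
```

At every cutoff the 2015 cumulative share is at least as large as the 2014 share, which is consistent with the lower mean, median and quantiles in the table.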
The last statistic, Kullback-Leibler (KL) divergence, requires some explanation. Essentially, it measures the difference between two distributions in terms of entropy. Entropy is a measure of randomness in a distribution. We will discuss more on this later.
The KL divergence only tells us whether the distributions differ at a macro level. We still have to look at the other statistics listed above to gain better insight into the difference in responses to the two campaigns.
Kullback-Leibler Divergence
KL divergence measures the difference between two distributions p and q defined over the same set of values, and it’s defined as below.

$$KL(p,q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$
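KL divergence, also called relative entropy, is the expected value under p of the log ratio of the two distributions. It is zero when the two distributions are identical and grows as they diverge. KL divergence is not a symmetric function, i.e. KL(p,q) is not equal to KL(q,p) in general. In our use case, we calculated the distribution difference between customers who signed up in 2014 and those who signed up in 2015.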
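Here is a minimal Python sketch of the calculation, summing p * log(p / q) over the bins of the two lag histograms. The natural log and the direction (2015 as p, 2014 as q) are assumptions made for illustration. The Spark job's log base, direction and bin handling are not stated in this post, so the result should not be expected to reproduce the figure in its output.

```python
import math

def kl_divergence(p, q):
    """KL(p, q) = sum of p_i * log(p_i / q_i), using the natural log."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue          # a bin with no mass in p contributes nothing
        if qi == 0.0:
            return math.inf   # p has mass where q has none
        total += pi * math.log(pi / qi)
    return total

# Fraction of converting customers per one-month lag bin, copied from the output above.
p_2014 = [0.094, 0.124, 0.147, 0.141, 0.113, 0.104,
          0.095, 0.065, 0.034, 0.046, 0.019, 0.018]
p_2015 = [0.161, 0.157, 0.144, 0.120, 0.107, 0.094,
          0.069, 0.067, 0.031, 0.025, 0.019, 0.006]

kl_15_vs_14 = kl_divergence(p_2015, p_2014)
kl_14_vs_15 = kl_divergence(p_2014, p_2015)
print(f"KL(2015, 2014) = {kl_15_vs_14:.4f}")
print(f"KL(2014, 2015) = {kl_14_vs_15:.4f}")
```

Swapping the arguments gives a different number, which is the asymmetry mentioned above. When reporting a KL divergence, it helps to state which distribution is the reference.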
We could have done the analysis by customer segment, if there were reasons to believe that the campaign response behavior was likely to differ based on the segment a customer belongs to. For example, we could have segmented customers by the referring channel through which they signed up.
Generally, the index of the field being analyzed is used as the key in the reduceByKey operation. To analyze by customer segment, we need to use the composite key (referrer channel, field index). If the segment were defined by some other set of attributes, those attributes would be included in the key instead.
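The effect of the composite key can be illustrated without Spark. The sketch below groups the sample records by (referrer channel, field index) using a plain Python dictionary as a stand-in for reduceByKey. It is not the chombo code, and the names are illustrative.

```python
from collections import defaultdict

sample = """\
9CZ7ID43O4,organic search,2015-03-20,4
SG37B4L7WS,paid search,2015-08-16,3
215509TFBR,organic search,2015-05-23,0
YO1O82VP2J,direct,2015-11-30,8
N2RB32HMMP,paid search,2015-06-22,3
RT1W3YC0QS,paid search,2015-05-23,4
JGUU8XKG6C,social,2015-05-11,9
"""

CHANNEL_FIELD = 1  # referring channel
LAG_FIELD = 3      # months to conversion

# Group lag values under the composite key (referrer channel, field index).
grouped = defaultdict(list)
for line in sample.splitlines():
    fields = line.split(",")
    key = (fields[CHANNEL_FIELD], LAG_FIELD)
    grouped[key].append(int(fields[LAG_FIELD]))

# Any per-key statistic can now be computed; the mean lag is used as an example.
segment_means = {key: sum(lags) / len(lags) for key, lags in grouped.items()}
for key, mean_lag in sorted(segment_means.items()):
    print(key, round(mean_lag, 3))
```

With only the field index as the key, all seven records would fall into a single group. Adding the channel to the key yields one distribution per segment.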
We have assumed that the conversion of a user is attributed to the campaign of the year in which the user signed up. This is a simplistic assumption. If a user signs up towards the tail end of a year, the campaign for the next year is likely to have a stronger influence on the user.
Attribution is a very challenging problem. There may have been other influencing factors and touch points besides the campaign that drove a customer to convert. There is no easy way to tell.
Often the conversion rate is very small, which necessitates consideration of other measures for evaluating campaign effectiveness. For an online campaign as in our use case, we can consider other relevant non-conversion events such as clicks, browsing reviews of a service or product, or adding to a shopping cart, and translate them into a net engagement score.
The engagement score will reflect the user’s level of interest in a product or service. This earlier post of mine has details on computation of engagement score.
We can calculate the distribution of the engagement score and use percentiles and the other statistical quantities described earlier to evaluate the effectiveness of a campaign.
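As a hedged illustration of this idea, the sketch below scores users with a simple weighted sum of event counts and compares the quartiles of the score for two campaigns. The event names, weights, user data and the weighted-sum formula are all hypothetical; the engagement score in the earlier post may be computed differently.

```python
import statistics

# Hypothetical event weights; real weights would reflect business judgment.
EVENT_WEIGHTS = {"click": 1.0, "review_browse": 2.0, "add_to_cart": 5.0}

def engagement_score(event_counts):
    """Weighted sum of a user's non-conversion event counts."""
    return sum(EVENT_WEIGHTS[event] * count for event, count in event_counts.items())

# Made-up event counts per user for two campaigns.
campaign_a = [
    {"click": 3, "review_browse": 1},
    {"click": 1},
    {"click": 5, "add_to_cart": 1},
    {"review_browse": 2, "add_to_cart": 2},
]
campaign_b = [
    {"click": 2},
    {"click": 1, "review_browse": 1},
    {"click": 4},
    {"review_browse": 1, "add_to_cart": 1},
]

def quartiles(users):
    """25th, 50th and 75th percentiles of the engagement score."""
    scores = [engagement_score(u) for u in users]
    return statistics.quantiles(scores, n=4)

print("campaign A quartiles:", quartiles(campaign_a))
print("campaign B quartiles:", quartiles(campaign_b))
```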
We have used some simple statistical techniques to evaluate the relative effectiveness of campaigns. A Hadoop MapReduce implementation of the solution is also available. The tutorial document contains the execution steps for this use case.