Visitor Conversion with Bayesian Discriminant and Hadoop

You have lots of visitors on your eCommerce web site, and obviously you would like most of them to convert. By conversion, I mean buying your product or service. It could also mean the visitor taking an action that could financially benefit the business, e.g., opening an account or signing up for an email newsletter. In this post, I will cover some predictive data mining techniques that may facilitate a higher conversion rate.

Wouldn’t it be nice if, for any ongoing session, you could predict the odds of the visitor converting during the session, based on the visitor’s behavior during the session?

Armed with such information, you could take different kinds of actions to enhance the chances of conversion. You could entice the visitor with a discount offer. Or you could engage the visitor in a live chat to answer any product related questions.

There are simple predictive analytic techniques to predict the probability of a visitor converting. When the predicted probability crosses a predefined threshold, the visitor could be considered to have high potential of converting.

Predictive Analytics

In predictive analytics, we have a set of input or predictor attributes and an output or class attribute. Using known data, a.k.a. the training set, we train a model based on some learning algorithm.

The model predicts the output based on the input. Once the model is built, we can use it to make predictions for fresh input. This process is also known as supervised learning in machine learning parlance.

In our case, we have a simple model with one input. The input is the number of clicks during a session. The output is a boolean indicating whether the visitor converted in the session. It’s been found that there is a strong correlation between the number of clicks and conversion. We could also consider additional input attributes, e.g., the hour of the day when the session started, or whether the referrer for the visitor was a search engine. For simplicity, let’s stick to one input. The simple model can easily be extended to multidimensional input.

Discriminant Analysis

In discriminant analysis, which has its grounding in statistical learning, we assign an input to the output class with the highest probability, based on the model obtained from the training data. Discriminant analysis is generally used when the output or class variable is binary.

In our case, instead of predicting the output class for an input, we are more interested in the input value, i.e., the number of clicks per session, that lies at the decision boundary.

If we imagine two clusters of input in a multidimensional hyperspace, one group corresponding to one value of the output and the other group corresponding to the other value, we are interested in the separating hyperplane between the two clusters and the corresponding input values.

Once the dividing hyperplane is found, we can infer that the input data set on one side of the hyperplane corresponds to one output value and the input data set on the other side corresponds to the other output value.

In our case, we have only one input, i.e., the number of clicks, and a one dimensional input space. The separating hyperplane is just a point for our problem. So, the magic number we are after is the number of clicks in a session above which the visitor is more likely to convert, i.e., with a probability equal to or greater than 0.5. Once this event is detected, we can perform some of the actions described earlier.

Next, I will go through some of the options for discriminant analysis and finally zero in on Bayesian discriminant analysis.

Fisher Discriminant

It’s also known as linear discriminant analysis. It’s based on the assumption that the conditional probability density function of the input is Gaussian. There are two such conditional probability density functions, each conditioned on one of the two output values. There is a further assumption that the variances of the two probability densities are the same. Based on these assumptions, it can be shown that the number of clicks at the decision boundary is as follows.

c = (m(c | v=1) + m(c | v=0)) / 2 - s * log(p(v=1) / p(v=0)) / (m(c | v=1) - m(c | v=0))

m(c | v=1) = mean number of clicks for a session in which the visitor converted
m(c | v=0) = mean number of clicks for a session in which the visitor did not convert
p(v=1) = probability of the visitor converting
p(v=0) = probability of the visitor not converting
s = common variance of the number of clicks within each class

All these quantities can be calculated from the training data. This approach belongs to the so-called parametric modeling family of solutions, because it’s based on assumptions about the probabilistic model, i.e., Gaussian with certain parameters, i.e., mean and variance.
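As a rough sketch of how these quantities could be computed from the training data (the class name, method, and array-based inputs below are my own illustrative choices, not part of the actual implementation):

```java
// Illustrative sketch of the Fisher boundary computation for one input.
// Assumes both classes are present in the training data.
public class FisherBoundary {
    // clicks: click count per session; converted: matching conversion flags
    public static double boundary(int[] clicks, boolean[] converted) {
        double sum1 = 0, sum0 = 0, n1 = 0, n0 = 0;
        for (int i = 0; i < clicks.length; ++i) {
            if (converted[i]) { sum1 += clicks[i]; ++n1; }
            else { sum0 += clicks[i]; ++n0; }
        }
        double m1 = sum1 / n1;            // m(c | v=1)
        double m0 = sum0 / n0;            // m(c | v=0)
        double p1 = n1 / clicks.length;   // p(v=1)
        double p0 = n0 / clicks.length;   // p(v=0)
        // common (pooled) variance s of clicks within each class
        double ss = 0;
        for (int i = 0; i < clicks.length; ++i) {
            double m = converted[i] ? m1 : m0;
            ss += (clicks[i] - m) * (clicks[i] - m);
        }
        double s = ss / clicks.length;
        // midpoint of the class means, shifted by the log prior odds
        return (m1 + m0) / 2 - s * Math.log(p1 / p0) / (m1 - m0);
    }
}
```

With equal priors the log-odds term vanishes and the boundary is simply the midpoint of the two class means.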

Logistic Discriminant

Logistic discriminant and Bayesian discriminant analysis belong to the so-called non-parametric modeling methods, because we don’t make any assumption regarding the form of the probability density and use our training data more directly. There are some other advantages of non-parametric discriminant analysis.

In logistic regression, the log of the odds of the output being of one class rather than the other is linearly related to the input as follows. This relation will be more complex if there are multiple input variables and if cross terms are included.

log(p(v=1)/p(v=0)) = a + b * c

a,b = Coefficients
c = Number of clicks in session

The probability of the output or class variable is directly estimated from the input. The coefficients of the linear relation are found from the training data by a technique called maximum likelihood estimation, instead of linear regression, since we are dealing with probabilistic quantities. The right hand side would have more terms if we were dealing with multiple input variables instead of one.

When the probabilities of the output belonging to one class or the other are equal, the left hand side of the equation is zero, which is of interest to us, because that signifies the decision boundary. Under this condition, the following is true.

c = -a / b

When the number of clicks in a session is equal to or greater than the quantity above, we conclude that a visitor is more likely to convert.
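A minimal sketch of this idea, using plain gradient ascent on the log likelihood as a simple stand-in for a full maximum likelihood estimation routine (the class name, learning rate, and iteration count are illustrative assumptions):

```java
// Illustrative sketch: fit log(p(v=1)/p(v=0)) = a + b*c by gradient ascent,
// then read off the decision boundary c = -a/b where the log-odds are zero.
public class LogisticBoundary {
    public static double boundary(int[] clicks, boolean[] converted) {
        double a = 0.0, b = 0.0, rate = 0.05;
        for (int iter = 0; iter < 20000; ++iter) {
            double ga = 0, gb = 0;
            for (int i = 0; i < clicks.length; ++i) {
                // predicted probability of conversion for this session
                double p = 1.0 / (1.0 + Math.exp(-(a + b * clicks[i])));
                double err = (converted[i] ? 1.0 : 0.0) - p;
                ga += err;               // gradient w.r.t. intercept a
                gb += err * clicks[i];   // gradient w.r.t. slope b
            }
            a += rate * ga / clicks.length;
            b += rate * gb / clicks.length;
        }
        return -a / b;   // log-odds are zero here: the decision boundary
    }
}
```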

Bayesian Discriminant

Bayesian discriminant analysis is based on Bayes’ theorem of conditional probability, which is as follows.

p(A | B) = p(B | A) * p(A)/ p(B)

If we consider A to be the output and B to be the input, the probability of the output for a given input can be estimated, if we know the probability of the input for a given output, the unconditional probability of the input and the unconditional probability of the output. All the quantities on the right hand side can be computed from the training data.

In the context of our problem, Bayes’ theorem can be written as follows.

p(v=1|c) = p(c|v=1) * p(v=1) / (p(c|v=0) * p(v=0) + p(c|v=1) * p(v=1))

p(v=1|c) = Probability of the visitor converting, given the clicks per session
p(c|v=1) = Probability of the clicks per session, given the visitor converted in the session
p(v=1) = Unconditional probability of the visitor converting in a session
p(c|v=0) = Probability of the clicks per session, given the visitor did not convert in the session
p(v=0) = Unconditional probability of the visitor not converting in a session

Again, all the quantities on the right hand side can be computed, giving us the probability of the visitor converting for different values of clicks per session. At the decision boundary, p(v=1|c) will have a value of 0.5. To find it, we search through the results of the previous step until we find a value of c for which this condition is true.
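Here is a sketch of that calculation and search, assuming the click count histograms and session totals have already been computed; the class name and the map-based representation are illustrative, not taken from the actual implementation:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative sketch: Bayes posterior p(v=1|c) from click count histograms.
public class BayesBoundary {
    // hist1: click count -> number of converted sessions with that count
    // hist0: same for unconverted sessions
    // total1, total0: total converted and unconverted session counts
    public static Integer boundary(Map<Integer, Integer> hist1, Map<Integer, Integer> hist0,
                                   int total1, int total0) {
        double pv1 = (double) total1 / (total1 + total0);   // p(v=1)
        double pv0 = 1.0 - pv1;                             // p(v=0)
        TreeSet<Integer> clicks = new TreeSet<Integer>();
        clicks.addAll(hist1.keySet());
        clicks.addAll(hist0.keySet());
        TreeMap<Integer, Double> posterior = new TreeMap<Integer, Double>();
        for (int c : clicks) {
            double pc1 = hist1.getOrDefault(c, 0) / (double) total1;   // p(c|v=1)
            double pc0 = hist0.getOrDefault(c, 0) / (double) total0;   // p(c|v=0)
            posterior.put(c, pc1 * pv1 / (pc1 * pv1 + pc0 * pv0));
        }
        // scan in increasing click order for the first posterior >= 0.5
        for (Map.Entry<Integer, Double> e : posterior.entrySet()) {
            if (e.getValue() >= 0.5) {
                return e.getKey();
            }
        }
        return null;   // never crosses 0.5: no clear decision boundary
    }
}
```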

Hadoop Analysis

In the first step of our analysis, we need to process session data to calculate the different conditional probability densities. Since a histogram can approximate a probability density when the data set is large enough and the histogram bucket width is narrow enough, we will be computing histograms. Since the histogram calculation will be over hundreds of thousands of sessions, Hadoop is a natural choice for processing this kind of large data set.

Our input contains one row per session. Each row contains the number of clicks in the session, the time spent in the session and a boolean indicating whether the user converted during the session.

The data does not include any session data for existing customers. This data set may have been produced by another map reduce job which took the raw click stream data as input. In our analysis we will only use the number of clicks as input. The rows are comma separated, with the number of clicks in the first field, the time spent in the second and the conversion flag in the third.


We are interested in computing the following

  • Click count histogram for unconverted visitors
  • Click count histogram for converted visitors
  • Number of sessions where the visitor did not convert
  • Number of sessions where the visitor converted

Once we have these quantities, we can calculate the probability of visitor conversion, as per Bayes’ theorem. Our mapper emits the conversion flag and click count as the key and the aggregated count as the value. The map method aggregates the counts and the cleanup method does the emit. This pattern is called the “in memory combiner”. A combiner is an optimization technique for reducing I/O during the shuffle stage between the mapper and the reducer.

public static class BayesDiscriminatorMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
	private Text keyHolder = new Text();
	private IntWritable valueHolder = new IntWritable();
	private Map<String, Integer> clickCount = new HashMap<String, Integer>();

	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		//input row: clicks, time spent, conversion flag
		String[] items = value.toString().split(",");
		String keyVal = items[2] + "," + items[0];
		Integer count = clickCount.get(keyVal);
		if (null == count) {
			count = 0;
		}
		clickCount.put(keyVal, count + 1);
	}

	protected void cleanup(Context context) throws IOException, InterruptedException {
		//in memory combiner: emit the aggregated count for each key
		for (String keyVal : clickCount.keySet()) {
			keyHolder.set(keyVal);
			valueHolder.set(clickCount.get(keyVal));
			context.write(keyHolder, valueHolder);
		}
	}
}

public static class BayesDiscriminatorReducer extends Reducer<Text, IntWritable, NullWritable, Text> {
	private Text valueHolder = new Text();
	private Map<String, Integer> totalCount = new HashMap<String, Integer>();

	protected void reduce(Text key, Iterable<IntWritable> values, Context context)
			throws IOException, InterruptedException {
		int count = 0;
		for (IntWritable value : values) {
			count += value.get();
		}
		//accumulate per class session totals
		String[] items = key.toString().split(",");
		String classVal = items[0];
		Integer classCount = totalCount.get(classVal);
		if (null == classCount) {
			classCount = 0;
		}
		totalCount.put(classVal, classCount + count);

		valueHolder.set(key.toString() + "," + count);
		context.write(NullWritable.get(), valueHolder);
	}

	protected void cleanup(Context context) throws IOException, InterruptedException {
		//emit total session count for each class value at the end
		for (String classValue : totalCount.keySet()) {
			int count = totalCount.get(classValue);
			valueHolder.set(classValue + "," + count);
			context.write(NullWritable.get(), valueHolder);
		}
	}
}

Each row of the output contains the conversion flag, the click count per session and the number of sessions with that click count. The last two rows show the total session counts for conversion and no conversion.


The final calculation of the conditional probability of conversion for a given number of clicks per session is done by a simple Java class. Here is the final output.

It consists of the number of clicks per session and the corresponding probability of visitor converting during the session.

All that remains to be done is to find the number of clicks for the threshold probability of 0.5, which partitions the input range.

1	0.004
2	0.001
3	0.002
4	0.002
5	0.001
6	0.007
7	0.017
8	0.037
9	0.253
10	0.561
11	0.812
12	0.670
13	0.583
14	0.588
15	0.416

What we find is that the probability of conversion increases with the number of clicks per session. It crosses the threshold of 0.5 between 9 and 10. We conclude that the decision boundary lies between 9 and 10 clicks. Interestingly, it drops below 0.5 between 14 and 15, which is another decision boundary.

The range that is of interest to us is 10-14. In this range of click counts, the visitor is more likely to convert. When a visitor’s session has a click count in this range, we can take some of the actions discussed earlier.

It’s possible for the probability to never cross the threshold value of 0.5 and so not have any solution. It means that the input space is not separable and there is no clear decision boundary.

Our predicted model can be expressed as a step function as follows, where f(c) is a function of the number of clicks indicating whether the visitor converts or not.

f(c) = 1 if 9 < c < 15, and 0 otherwise


The input we are using, the number of clicks per session, albeit a crude one, is indicative of the visitor’s intent to convert. One way to improve this is to give more weight to pages that are more relevant, e.g., product or service related pages. With this approach, we sum the weights corresponding to the pages visited, instead of simply counting the pages.
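A sketch of this weighting idea; the page categories and weight values below are made up purely for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: a weighted page score instead of a raw click count.
public class WeightedClicks {
    private static final Map<String, Double> PAGE_WEIGHT = new HashMap<String, Double>();
    static {
        PAGE_WEIGHT.put("home", 1.0);
        PAGE_WEIGHT.put("product", 2.0);   // product pages signal stronger intent
        PAGE_WEIGHT.put("cart", 3.0);
    }

    // sum of weights over the pages visited in a session
    public static double score(String[] pagesVisited) {
        double sum = 0.0;
        for (String page : pagesVisited) {
            sum += PAGE_WEIGHT.getOrDefault(page, 1.0);   // unknown pages get weight 1
        }
        return sum;
    }
}
```

The rest of the analysis stays the same; the histogram would simply be built over this score instead of the raw click count.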

Some additional inputs we could have considered, as alluded to earlier, are as follows. Including more input variables makes our model more complex and computationally more demanding.

  • Referrer e.g., search engine, social network, advertisement, other
  • Hour of day for visit
  • Geo zone of the visitor based on reverse IP lookup
  • Number of prior visits

If we included all the variables listed above in our analysis, then the threshold probability of 0.5 would lie on a hyperplane in a multidimensional space. In our simple analysis, since we considered only one variable, the threshold probability of 0.5 corresponds to a point value of that one variable, i.e., the number of clicks in a session.

Input Selection

Deciding how many input variables to use and which ones to use is a complex issue in machine learning.

Generally, you want to use input variables that are more strongly correlated to the output and have stronger influence on the output.

For example, to decide whether to include the referrer as an input variable in our model, we could do the following.

  1. Find probability of conversion from the whole data set
  2. Find conditional probability of conversion conditioned on referrer being a search engine
  3. If the result of 2 is significantly different from 1, include referrer as an input variable

You could also repeat the steps for the case of no conversion. When the conditional probability significantly deviates from the unconditional probability, it implies a strong influence of that input on the output.
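The three steps above can be sketched as follows; the class name and the significance threshold parameter are illustrative assumptions, not part of the actual implementation:

```java
// Illustrative sketch of the referrer test: compare the overall conversion
// probability with the conversion probability conditioned on the referrer
// being a search engine.
public class InputSelection {
    // converted[i] and fromSearch[i] describe session i
    public static boolean includeReferrer(boolean[] converted, boolean[] fromSearch,
                                          double threshold) {
        int n = converted.length, nConv = 0, nSearch = 0, nSearchConv = 0;
        for (int i = 0; i < n; ++i) {
            if (converted[i]) ++nConv;
            if (fromSearch[i]) {
                ++nSearch;
                if (converted[i]) ++nSearchConv;
            }
        }
        double pConv = (double) nConv / n;                         // step 1
        double pConvGivenSearch = (double) nSearchConv / nSearch;  // step 2
        // step 3: include if the conditional deviates significantly
        return Math.abs(pConvGivenSearch - pConv) > threshold;
    }
}
```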

Another issue that requires careful consideration is the relationship between input variables; in some situations, only one among a subset of input variables should be included in the model.

When two input variables are strongly correlated with each other, only one of them should be included. Including both does not provide any additional information and may even have an adverse effect on the prediction model. In our example, the number of clicks and the time spent in a session are highly correlated with each other. That’s why we used only one of them, i.e., the number of clicks.
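One simple way to check such a relationship is the Pearson correlation coefficient between two candidate inputs, e.g., click count and time spent; this sketch is illustrative and not part of the actual implementation:

```java
// Illustrative sketch: Pearson correlation between two candidate inputs.
// If |r| is close to 1, keep only one of the two variables.
public class InputCorrelation {
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
        mx /= n;
        my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; ++i) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
        }
        return sxy / Math.sqrt(sxx * syy);
    }
}
```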


The implementation is available as a project called visitante on GitHub. The implementation is somewhat naive right now. I am in the process of including in the model some additional parameters, as mentioned earlier.

I am also in the process of adding Hadoop implementations of traditional web analytics metrics, e.g., page statistics, bounce rate, shopping cart abandonment rate and conversion rate.

Final thoughts

Data mining and machine learning is a complex and vast area of technology, and it’s an active area of research. I have a strong personal interest in this area and I am more or less self taught. I am particularly interested in using data mining algorithms to solve everyday problems, especially in the context of big data and Hadoop. Here are some useful reference books.

1. The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie, Tibshirani and Friedman

2. Applied Data Mining for Business and Industry  by Giudici and Figini


About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source contributor. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Data Mining and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Data Mining, Hadoop and Map Reduce, Java, Predictive Analytic.

6 Responses to Visitor Conversion with Bayesian Discriminant and Hadoop


  2. Mani says:

    Hi Pranab, thanks for the effort you put into this blog. I am a beginner in Hadoop and would like some help with the code for visitante. How do these statements work?
    fieldDelimRegex = context.getConfiguration().get("field.delim.regex", "\\s+");
    String fieldMetaSt = context.getConfiguration().get("field.meta");
    I copied them from the source code of SessionExtractor. I searched the net but didn't find much info.

  3. Pranab says:

    I am reading custom Hadoop configuration parameters through those calls. The parameters could be specified on the Hadoop command line. In this case, I am specifying a properties file through the command line and then reading the properties file to load the Hadoop configuration.

  4. Mani says:

    Thanks for the clarification and keep up the great work

