Customer Conversion Prediction with Markov Chain Classifier

Posted on July 6, 2015 by Pranab

For on line users, conversion generally refers to the user action that results in some tangible gain for a business e.g., an user opening an account or an user making his or her first purchase. Next to drawing large number of users to a web site, getting an user to convert is the most critical event in an user’s relationship with on line business. Being able to predict when an user will convert to become a customer should be an important tool that on line businesses should have at their disposal. A business could intiate targeted marketing campaign based on the prediction result.

In this post, I will be using user online behavior data to predict whether an user will convert using Markov Chain Classifier. The Hadoop based implementation is part of my OSS project avenir.

Markov Chain Classification for Sequence Data

Depending on whether the data is instance or sequence , there two classes of learning algorithms. With sequence data, the order of the data points is considered an important aspect of the data and the learning algorithms will exploit the order in the data. Markov Chain Classification is a supervised learning algorithm for sequential data.

Sequence data with a temporal context is called time series data. For many learning problems, sequence data is more effective. When we use instance data, the order between the data points, temporal or something else, is lost.

For our example, each web site visit or session will be considered a data point in a sequence of data points. Session data ordered by time will be our sequence data. The data elements in the sequence i.e. session data will a vector of several attributes.

There are many relevant attributes for web session. I will be considering only the following. I am not implying that these are only feature attributes that matter.

Time elapsed since the last visit
Time spent in the session

Other feature attributes that could be relevant in the predictive model are as follows. It requires something called feature analysis to determine the relevant sub set of features for any predictive analytic problem.

Referrer for the first visit
Hour of the day for visit
Day of the week for visit
Number of pages visited in a session

Building Markov Model

The first step towards building the complete solution is to build the Markov State Transition model. It’s a state transition probability matrix. If a problem is modeled with n states, the matrix will be n x n. A matrix element value in the matrix is the probability of transitioning from one state to another.

Going back to our problem, we are are considering only two session attributes. To keep the state transition matrix manageable, we will discretize the attributes into 3 levels; High, Medium and Low. With two attributes, we will end up with 9 states in out problem. Each session will be characterized with two symbols, which stands for a state. For example, HM will imply that time elapsed since last session is high and time spent in the current session is medium.

Each line in our input will capture the data about all consecutive past sessions for a given user. It will consist of the following

Cookie ID (or User ID)
Class variable indicating whether the user converted or not (True or False)
Sequence of session data where each element of the sequence is a 2 alphabet symbol

This is essentially our training data. Here is some sample data

4F014156K07N,F,LL,ML,HH,HL,LL,HM,HL,LH,ML,HH,HL,LH
G7C0M9H5SUZ1,F,HL,LM,HL,MH,HH,HH,ML,HL
GWBX875AD31D,F,LL,HM,HL,HL,HM
KRO2F24JUDE5,T,HL,HM,HM,HL,HM,MH,HM,HL,HL
3J0G4BB9BI1Q,F,LM,LH,LH,MH,LM,MH,LH

This data is ingested by the map reduce class MarkovStateTransitionModel. The output will consist of 2 matrices, each being a 9 x 9 state transition probability matrix. Here is the output. The first line contains the list of states.

LL,LM,LH,ML,MM,MH,HL,HM,HH
classLabel:T
85,103,219,81,46,105,159,94,105
100,142,294,55,61,127,79,49,89
102,142,311,54,61,135,45,46,100
100,93,155,88,61,93,214,118,72
98,105,218,93,55,115,120,100,91
98,124,299,55,39,136,91,47,107
121,51,81,121,61,75,276,101,108
108,84,150,110,62,67,201,100,115
87,108,208,74,59,114,166,74,106
classLabel:F
106,51,50,129,66,53,300,130,110
128,53,69,124,60,57,276,125,102
105,59,89,109,52,62,296,111,113
108,46,46,130,59,54,289,147,116
108,49,51,154,43,57,294,125,115
115,64,55,120,58,60,276,131,115
111,52,41,135,64,55,288,137,111
112,52,43,133,54,55,288,136,122
117,48,50,143,62,56,270,137,112

The cell values in the matrix are not actual probability, but normalized frequency count. As we will see later, since we will take ratio of the matrix cell values, actual probability is not necessary.

Predicting with Markov Model

Our test data format will be similar to the training data, except that the class variable is missing, which is going to be predicted. The test data and the Markov State Transition matrix produced by the first map reduce is processed by the map reduce class MarkovModelClassifier. For each sequence we calculate log odd ratio as below.

logOdds = Σlog(tp(i,j) / tn(i,j))
where
tp(i,j) = transition probability for transition from state i to j for positive (T) class value
tn(i,j) = transition probability for transition from state i to j for negative (F) class value

From the sequence, we take contiguous pairs of data points by sliding an window of size 2 over the sequence. For each state pair, we read two normalized frequency count values from the two matrices and calculate the log odds ratio as above. If the ratio is greater than 0, then the user is likely to convert.

In plain language, we determine whether the sequence of states for an user is more likely for positive class values or negative class values and make the prediction accordingly. The prediction score i.e. log odds ratio changes with time based on the sequence of states at any given time.

Instead of 0, a threshold value can be provided through a configuration parameter. For example, a small positive threshold value could be provided to keep us way from the class separation boundary and reduce the number of false positives. Here is some sample output, for a log odds threshold of 0.5. In this output snippet, one user has been predicted for conversion

6F1RIJZ882GA,F,-0.9419187552056059
R393GY9VWM9P,F,-1.0870003657324971
X573L7KOBRR0,F,-0.01828609983127566
KMUY5UN2M7TY,T,1.1076440756594925
2VBVQLCDDYJE,F,-4.542646122905596

All the predictive model does is that it identifies users that have the potential to convert. A business could target those users with marketing campaign, hopefully to turn that potential into reality.

The marketing could specifically target users that are close to the border line i.e., log odds ratio is slightly above the threshold. Those are the users that need to be nurtured into paying customers. Those with high log odds ratio, are more likely to convert any way and may not need any special attention.

Meta Learning and Model Complexity

Meta learning is about understanding and learning about the learning process. In machine learning context, typically it involves learning about and selecting some higher order parameters that drive the learning algorithm performance. In our case, we are concerned about two meta learning parameters.

We would like find out the optimum sub set of feature attributes among all feature attributes listed earlier. We would also like to learn whether or not a higher order Markov Model will have more predictive accuracy and if so what order.

It’s good to remind ourselves that making a model more complex will not necessarily yield the best result, because of bias variance trade off. Generally there is an optimum level of complexity when the generalization error is minimum.

Feature Set Selection

We could increase the complexity of our model by including more feature attributes.We have used only two session related variable, with each one having 3 possible values. We ended up with 9 x 9 state transition probability matrices.

If we included some of the other attributes mentioned earlier, the state transition probability matrices would have been much bigger.

These are some of the ways we could explore the feature space looking for an optimum sub set of features providing best prediction accuracy.

Find relevance of each feature attribute with respect to the class atribute and other feature attributes
Brute force way of trying different feature sub sets with learning algorithm being used

Details of the first approach can be find in my earlier post on feature sub set selection. The solution is based on calculating certain statistical parameters based entropy and mutual information.

Higher Order Markov Model

We used first order Markov Model i.e., the state probability is only conditioned on the immediately earlier state. As a result, we have two dimensional state transition probability matrices. If the probability was conditional on n earlier states, we would have used (n +1) dimensional state transition probability matrices.

The question is how do we select the number conditioning data points n for best predictive accuracy. Here are are couple of approaches.

For different values of n, construct higher order conditional probability matrix. Find the conditional entropy for each case. Choose n based on lowest conditional entropy.
Brute force way of trying different n with learning algorithm being used

Summing Up

We have explored Markov Model based predictive analytic technique that operates on sequence data. The tutorial for preparing data and executing the map reduce jobs are available.

Python Version

A Python version of the implementation is now available. Please follow the Python section of the tutorial document.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.

View all posts by Pranab →

This entry was posted in Big Data, Data Science, Hadoop and Map Reduce, Machine Learning, Marketing Analytic, Predictive Analytic, Statistics and tagged customer conversion, markov model classification, targeted marketing. Bookmark the permalink.

25 Responses to Customer Conversion Prediction with Markov Chain Classifier

Kieran Smith says:

June 29, 2016 at 1:49 pm

This is a great breakdown, thank you

Reply
mytrades4u says:

August 12, 2016 at 4:06 pm

Hi Pranab,
You mention that the state transition probability matrix is a normalized frequency count. Is that just the total number of times a state appears across all customers? Wondering what you mean by “normalized frequency?”

As a follow up, doesn’t taking the frequency count of states somehow ignore the temporal information? That is, the model is no longer using information about customer visits early on vs. at a later time.

Reply
- avaneesh kumar says:
  
  January 2, 2017 at 1:37 am
  
  Normalized frequency is the count of the specific states upon total count.
  It is a basic normalization process used in feature scaling.
  
  Reply
Pranab says:

August 24, 2016 at 8:20 am

State transition probabilities are across all customers, but there are 2 of them one for converting customers and one for non converting customers. The counts for each row is normalized i.e. divided by the row sum and multiplied by the constant

The state transition matrix captures the temporal information. A row is the current state and the column is the next state state. The corresponding cell value is the probability of being in the next state.

Since it’s a first order Markov chain, the complete sequence for all visits for an user is an aggregate of many first order transitions, involving 2 states, which are current and next.

Reply
- sarbani dasgupta says:
  
  April 9, 2019 at 8:04 am
  
  Hi Pranab
  
  I have used case where the target variable is derived by if demand will go up or go down from the previous week.So y is binary UP or down and I have predictors.
  
  In such a case do we get separate transition matrix for UP and separate for down?
  Secondly can you please refer me to the code used for prediction of future events
  
  Reply
ABHINABA ROY says:

April 6, 2017 at 11:43 pm

Hello Pranab,

Could you provide a pseudo code for ‘MarkovStateTransition MAtrix’ and ‘MarkovModel Classifier’?

I don’t know how to read java programs as I code in R.

Thank you

Reply
- Pranab says:
  
  April 7, 2017 at 6:59 am
  
  If you do google search you will find plenty of material
  
  Reply
david says:

April 11, 2017 at 8:17 am

Hello Pranab,

I have a doubt about balances between number of instances at classLabel:T and number of instances at classLabel:F. Probably you will have many mores instances at classLabel:F. Is it not has any impact in the results?

Thank you

Reply
- Pranab says:
  
  April 11, 2017 at 8:20 am
  
  That’s the class imbalance problem. There are techniques to overcome them
  
  Reply
  - david says:
    
    April 11, 2017 at 8:34 am
    
    Thanks Pranab.
    
    In this setup with R do you have any preference R library?
    Thanks.
Esmée Winnubst says:

December 8, 2017 at 2:17 am

Hi Pranab,
thank you for your very interesting article.

Is there a method for including more data you have on the ID for the markov chain classifier?
For example if a user has logged in to the website, you know his age and how much he purchased before and the types of items he bought.

Reply
- Pranab says:
  
  December 10, 2017 at 6:16 pm
  
  Esmee,
  You can segment the users based on age, types of items bought in past and other attributes and then build a separate. Markov chain model for each segment. Purchase amount in past transactions can be made part of the state.
  
  Reply
  - Esmée Winnubst says:
    
    December 11, 2017 at 1:48 am
    
    Thank you
antony says:

March 5, 2018 at 11:03 am

Hi Pranab,
Do you have python code for this article?

Reply
- Pranab says:
  
  March 5, 2018 at 11:16 am
  
  Unfortunately, I don’t
  
  Reply
Shankar says:

January 27, 2019 at 10:29 pm

Hi Pranab,
I am working on similar data. My unique states are A, B and C. Sample data is as below-
UserID1,T,A->B->C->A->A->B
UserID2,F,C->C->A->B

I wanted to understand, how do I make pairs for 1st and 2nd order Markov chain? How do I calculate transition matrix?

Reply
Pranab says:

April 9, 2019 at 8:11 am

Works for first order Markov Chain only. For transition matrix, from any given state, count the number of translations to all other states

Reply
Pranab says:

April 9, 2019 at 8:15 am

If I understood correctly, up and down will be the states in your case. There are 2 transition matrix , one for converting and the other for non converting . The blog should have a link to the github repo

Reply
- Sarbani says:
  
  April 9, 2019 at 9:51 pm
  
  one for converting and the other for non converting what does it mean?
  I will check the GitHub link is it coded in python?
  
  Reply
  - Pranab says:
    
    April 9, 2019 at 10:15 pm
    
    Please read the blog thoroughly and read up on Markov chain classifier
Sarbani says:

April 9, 2019 at 11:23 pm

Hi Pranab

The way I understand the problem and the blog
We have predictors say Var1 Var2 Var3 and target variable taking values -UP and down
so we have a matrix with state transition of Var1 Var2 Var3 for UP and a separate matrix for down.

Please correct me if that is not the case .I would really help if you can elaborate in a few lines ,rest I will read up.

Thanks

Reply
- Pranab says:
  
  October 20, 2020 at 8:01 am
  
  Your understanding is correct
  
  Reply
Anshuman Sinha says:

October 20, 2020 at 3:32 am

Hi Pranab,

If in addition to feature attributes, you also had demographic data on the user that you wanted to include, how would you go about it? This data, say gender/age/location is constant for a user but may be useful predictors.

Reply
- Pranab says:
  
  October 20, 2020 at 8:04 am
  
  You could build separate models for each combination of gender, age , location etc. In other words. cluster customers based on demographic and then build separate model for each cluster
  
  Reply
  - Anshuman Sinha says:
    
    October 22, 2020 at 5:56 am
    
    Dear Pranab,
    
    Thanks so much for your prompt response and suggestion. Your suggestion is exactly what I was planning to do so it’s great to get that affirmation. I am basically going to try and follow this paper (http://library.usc.edu.ph/ACM/KKD%202017/pdfs/p2081.pdf “Learning Temporal State of Diabetes Patients via Combining Behavioral and Demographic Data” by Gao et al) which uses a few approaches. One is exactly what you suggested: Clustering + HMM per cluster. The second they use is have a single HMM where the transition probabilities are some type of logistic function of all the static and dynamic variables. Thank you! 🙂