Stop the Customer Separation Pain with Bayesian Classifier

Posted on February 19, 2013 by Pranab

In my last post, we did some exploratory analytic for customer churn. We identified the parameters that have most influence on whether a customer account gets closed or not. We performed correlation analysis using Cramer index.

In this post, we will take the next step forward i.e., build a Bayesian prediction model for predicting customer churn. The Hadoop based Bayesian classifier implementation is part of my open source project avenir on github. We will use the same mobile service provider customer data as an example, as in the last post. New customer acquisition is expensive. So any predictive analytic solution, that can predict whether an existing customer will be closing his or her account or not is extremely valuable.

Bayesian Classifier

Bayesian classifier is a probabilistic prediction model founded on conditional probability. In our case, it predict the probability of a customer keeping his or her account open or closing it. The most well known application for Bayesian Classifier is spam filtering.

Every time you click on spam button in your email application, chances are that information gets used in building a Bayesian prediction model behind the scene. Bayesian classifiers are based on Bayes theorem, which uses conditional probability. It is defined as follows.

P(C|X) = P(X|C) * P (C) / P(X)
where
P(C|X) = Probability of class attribute value C given the feature value X , a.k.a class posterior probability
P(X|C) = Probability of feature attribute value X given the class attribute value C, a.k.a feature posterior probability
P(C) = Probability of the class attribute value C, a.k.a class prior probability
P(X) = Probability of of feature attribute value X, a.k.a feature prior probability

Feature attributes are attributes used to make a prediction. A class attribute is an attribute whose value is predicted based on all the feature attribute values. Since an entity has multiple features, the feature values form a vector of values.

P(X|C) is the joint probability of a vector of feature values conditioned by the class attribute value C. In the so called naive Bayes model, we assume the independence of the features. So the joint probability distribution turns into a product of individual probabilities, as below

P(X|C) = P(X(1)|C) P(X(2)|C) …P(X(n)|C)
P(X) = P(X(1)) P(X(2)) …P(X(n))

Referring to out example of mobile service provider customer, an example of P(X(i) | C) might be probability of data usage, given the account is closed. In our example P(C) is the probability of account being open or closed. P(X(j)) could be probability of minutes used. P(C|X) would be the probability of account status being open or closed given a set of account feature attribute values and this is what we are after.

Customer Account Data

As described in the earlier post, the account feature attributes are as follows. All the feature attributes are categorical.

Minutes used
Data used
Customer Service Calls
Payment history
Account age

The class attribute is the account status. Our goal is to to predict the probability of account status being open or closed given a set of feature attribute values.

Solution

Any classification problem has two parts: training a model based on training data set and then the validation of the trained model. In our case the validation will consist of comparing the predicted account status with the actual actual account. The validation is done on a separate data set called the validation data set.

We will have one map reduce for training the model using the training data set and the second one to validate the model using the validation data set and the output of the first map reduce i.e., the Bayesian model.

Bayesian Model Map Reduce

This map reduce implemented by the class BayesianDistribution, takes the training data and essentially does counting for the the different probabilities needed as per Bayes theorem. These estimated probabilities will be close to the real probabilities when the sample size is large enough according to maximum likelihood principle. Here is some sample input for this map reduce.

VD6Z2289O2A1,med,med,med,average,1,open
Q0GCO89ZJ1LV,med,high,low,good,1,closed
11Y0UP3KIB6V,med,low,low,average,2,closed
9N432K6QJGJO,overage,high,med,average,1,closed
S40EG4GHTDGN,overage,low,low,average,3,closed
8FLU4G6KOHIE,high,med,low,average,2,open
J3SHLF7AFEGN,med,med,low,good,4,open
E0U2M1W5BMX0,med,low,med,poor,2,closed
NMF0IWEIFBLL,overage,med,low,average,3,open

Counting is performed for the following

Every possible set of class attribute and feature attribute value combination. It gets used for computing P(X|C)
Every possible class attribute values. It gets used for computing P(C)
Every possible set of feature attribute value combinations . It gets used for computing P (X)

Here is some sample output for the model. It contains various count values used for building the Bayesian model. Lot of insight can be gained by simple examination of these counts. For example, you will be able to detect the the unique combination of account feature attribute values that triggers account closing.

Each line has 4 fields and they are 1. class attribute value 2. feature attribute ordinal 3. feature attribute value 4. count. For numerical attribute, the attribute value is bin index as explained in a later section. The line with all values present, is used in computing P(X|C). A line with the first and fourth field only gets used in computing P(C). A line with the second, third and fourth field only gets used in computing P(X)

closed,5,4,1462
closed,,,1462
,5,4,1462
open,1,high,1050
open,,,1050
,1,high,1050
open,1,low,865
open,,,865

Numerical Attribute

For categorical attribute, we count for each unique value of the attribute in the data. When the data is numerical, the traditional approach is to find the mean and standard deviation and fit a normal distribution. I have taken a different approach, instead force fitting an uni modal Gaussian distribution, which may not be most prudent choice in many cases. Using jargon, I am using a non parametric distribution model, rather than a parametric model.

I discretize the data, using the bin width as specified in the data schema. This way, I preserve, whatever the natural probability distribution for the attribute is. For example, minute usage distribution my not be uni modal, and there may three different peaks, corresponding low, medium and high usage.

The bin index goes into the output as the attribute value. The corresponding count is the count for that bin index.

Bayesian Prediction Map Reduce

This is another map reduce implementation, which has no reducer. It reads the output of the previous map reduce i.e., the different count values and build the Bayesian model. As a new record is processed in the mapper, it’s run through the model and the account status probabilities are generated. It’s implemented by the class BayesianPredictor.

Here is some sample output. The first part of the record is the original input from the validation data set. The field second from the last is the predicted account status and the last field is the associated probability, in a scale of 100. As you can see, in most cases the account status is correctly predicted. However in many cases the prediction is wrong.

7EEF3U8U26HI,high,med,low,average,2,open,open,51
MZFK7PA61W2K,med,low,low,good,4,closed,open,68
Q7A8A9C3A30C,high,high,low,good,1,closed,closed,56
C124V89X2M2F,high,med,low,poor,3,closed,closed,62
KA9T3KF0CW1B,med,med,low,good,1,open,open,72
WRQ1TJN96VWN,low,med,high,poor,1,open,closed,79

Here we have calculated probabilities of class attribute values (open and closed). The output includes only the class attribute value with the highest probability.

Using the Bayesian model, we calculate the probabilities of the different class attribute values e.g. open or closed. The value associated with the highest probability is chosen as the predicted value. The output consists of that value and the associated probability in the last two fields.

Classification Metrics

The prediction map reduce also generates additional metric as Hadoop counters. These metrics are based on the so called confusion matrix. If a class attribute has n possible values, then the confusion matrix is a n x n matrix where the columns are the predicted class values and the rows are the actual values. For a perfect classifier, only the diagonal elements of the confusion matrix will be non zero and the off diagonal elements will be zero. Here is a 2 x 2 confusion matrix.

True Positive(TP)	False Negative(FN)
False Positive(FP)	True Negative(TN)

For our example, account close is associated with positive class attribute. For any business use case, there are costs associated with inaccurate predictions as indicated by the off diagonal elements. However, the cost may not be symmetric.

For our example, false negative is costlier than false positive. In our case the false negative counts stands for the number of cases where the predicted value of account status is open, where it’s actual value is closed. it’s costly because, the prediction gives the business a false sense of security that the accounts are safe and will remain open.

There are various metrics based on the confusion matrix. They are as follows. Correctness of the overall prediction is expressed by accuracy.

accuracy = (TP + TN) / (TP + TN + FP + FN)
recall = TP / (TP +FN)
precision = TP / (TP + FP)

Proportion of positive cases predicted correctly by the classifier to the actual number of positive cases is indicated by recall. The proportion cases that are actually positive to the cases predicted to be positive is indicated by precision. Here the different metric values as processed by Hadoop counters

13/02/18 23:01:56 INFO mapred.JobClient:   Validation
13/02/18 23:01:56 INFO mapred.JobClient:     Recall=60
13/02/18 23:01:56 INFO mapred.JobClient:     Correct=632
13/02/18 23:01:56 INFO mapred.JobClient:     Precision=67
13/02/18 23:01:56 INFO mapred.JobClient:     TrueNagative=307
13/02/18 23:01:56 INFO mapred.JobClient:     FalseNegative=214
13/02/18 23:01:56 INFO mapred.JobClient:     Accuracy=63
13/02/18 23:01:56 INFO mapred.JobClient:     Incorrect=368
13/02/18 23:01:56 INFO mapred.JobClient:     TruePositive=325
13/02/18 23:01:56 INFO mapred.JobClient:     FalsePositive=154

They are all processed under a counter group called Validation. As we can see, the quality of our model could be better. In data mining, the quality of the model depends heavily on the quantity and quality of the data.

Prediction Output

Prediction output can be controlled in two ways. First, you may be interested in some specific outcome only e.g., account closed. You may only be interested in the probability of account closed, whether or not it’s the most probable outcome. The desired class attribute value is specified through the configuration parameter bp.predict.class. Actually you can specify a subset of the possible values for the class attribute.

Second, you are interested the class attribute value with the highest probability i.e., the most probable outcome. This is the approach taken for our example

There is a caveat when multiple class attribute values are specified. In such cases, you may want to define a threshold, such that the class attribute value with the highest probability value will have to have a probability higher than the sum of the probability of the class attribute with the next highest probability and the defined threshold. Basically, we are making the class attribute value with highest probability more discriminative. This is defined through the configuration parameter class.prob.diff.threshold.

For example, using a scale of 100 if this threshold value is set to 30 and the probability of close is found to be higher that the probability of open for a record, then it will be classified as closed only if P(close) >= P(open) + 30. When the threshold is set, the output will consist of an extra field as the last field. It’s value will be either classified, when the the threshold condition is met. Otherwise, it’s set to ambiguous. When ambiguous, the classifier is saying “I give up. I can’t figure this out”

Wrapping Up

Bayesian classifiers are simple and it’s not very computationally intensive to build models with large data sets. Another advantage of Bayesian classifiers over other classifiers is that it can easily handle incremental training data. It’s not necessary to re process the whole training data set when new data arrives incrementally.

In data mining, there is no one size fit all algorithm and solution. Whether Bayesian classifier works for you will depend on the problem and the data set.

Here is a tutorial document with details on how to run the two map reduce jobs for the Bayesian classifier. You could use this script to generate the input.

For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud including installation, configuration and testing,

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.

View all posts by Pranab →

This entry was posted in Big Data, Data Mining, Data Science, Hadoop and Map Reduce, Predictive Analytic, Uncategorized and tagged bayesian classifier, customer churn. Bookmark the permalink.

9 Responses to Stop the Customer Separation Pain with Bayesian Classifier

nellaivijay says:

February 20, 2013 at 9:07 am

Well written Pranab

Reply
Jason Lynn says:

March 7, 2013 at 6:13 pm

Pranab, really enjoy your blog. You mention that using non-parametric methods might be a poor choice when evaluating an attribute – I found this comment a little surprising as my experience indicates normal distributions should not be assumed. Are there any general guidelines you use when deciding which approach to use, or is it necessary to perform a distribution test?

Reply
- Pranab says:
  
  March 7, 2013 at 6:39 pm
  
  Jason
  
  I don’t see that statement in my blog. This is the exact quote – “I have taken a different approach, instead force fitting an uni modal Gaussian distribution, which may not be most prudent choice in many cases”. Clearly, I am preferring a non parametric model.
  
  Reply
  - Jason Lynn says:
    
    March 7, 2013 at 7:19 pm
    
    Oh, I misread that sentence – I thought you were referring to the choice b/w normal distro (mentioned in the previous sentence) and non-parametric methods. That makes a lot more sense! Thanks.
koushik says:

November 1, 2013 at 10:40 pm

My Question is regarding mapreduce and i hope u will give much clear idea to my question..
i have to use 2 files at a time in my map function hw could it possible(am newbie to this mapreduce) both files are stored in same directory.. plz respond to my query…..
thanks in advance.

Reply
- Pranab says:
  
  November 3, 2013 at 7:35 pm
  
  Koushik
  You could put both files in HDFS input directory. Haddop will treat each one as separate split and each split will be processed by a separate mapper
  
  Reply
  - koushik says:
    
    November 4, 2013 at 1:31 am
    
    No sir,my question is “i have to read(use) values from both files in my mapper function”.
    thanks….
- Pranab says:
  
  November 4, 2013 at 6:07 pm
  
  Wondering why you need one mapper to process multiple files. This kind of constraints in the state less processing model of Hadoop, may indicate some issue in your map reduce design.
  
  Anyway, here are two options. The first one is not reliable.
  
  1. Write a script to merge the files into one big file. This will not work for you if the combined size is more than one HDFS block size
  2. Instead of using TextInputFormat, write a custom input format to process multiple input files
  
  If you are doing some kind of join like operation across two files, you can still use multiple mapper. In side each mapper, you could track what kind file you are processing by looking for file name patterns and having your mapper emit accordingly. For this to work, you have to follow some file naming convention.
  
  Here is an example from one of my projects. It calculates running aggregates and deals with two kinds of files: 1. current aggregate 2. new incremental data
  
  https://github.com/pranab/chombo/blob/master/src/main/java/org/chombo/mr/RunningAggregator.java
  
  Reply
  - koushik says:
    
    November 7, 2013 at 5:01 am
    
    thanks.. it is very helpful…