Explore Customer Churn with Cramer Index

Classification problems involve predicting a response variable based on a set of feature variables for some entity. But there is another problem whose solution is a prerequisite for solving a classification problem. We may want to know which of the feature variables are most strongly correlated with the response variable. Once we have identified those, we may want to use only that subset of the feature variables to build the prediction model.

To put this in context, we will use the customer churn prediction problem, specifically for customers of a mobile telecom service provider. A customer may have attributes like minutes used, data used, number of customer service calls etc. We are interested in identifying the feature attributes critical to effective prediction of whether a customer will close his or her account. It’s always beneficial to reduce the number of features used in building prediction models. This is known as feature subset selection.

Correlation

For numerical variables, the correlation between two variables is simply the covariance normalized by the product of the two standard deviations. For categorical variables, i.e. variables having a finite set of unordered values, it’s more complex. There are several approaches for handling categorical data. The correlation statistic we will be using is called the Cramer Index.
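For the numerical case mentioned above, the calculation can be sketched in a few lines of Python. This is only an illustration: the function name `pearson` and the sample data are made up for this example and are not part of the avenir project.

```python
import math

def pearson(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)

# Hypothetical data: the second list is an exact linear function of the first,
# so the correlation should come out at (approximately) 1.
minutes = [100.0, 200.0, 300.0, 400.0]
charges = [15.0, 25.0, 35.0, 45.0]
r = pearson(minutes, charges)
print(r)
```

This only works when the values can be ordered and averaged, which is why categorical attributes need a different statistic.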

The basic building block for many correlation statistics between categorical variables is the Contingency Matrix. If a variable a has n possible values and a variable b has m possible values, then the contingency matrix will be an n x m matrix. Each cell of the matrix contains a count of the number of samples that have the corresponding attribute value pair.

If we consider the attribute pair minutes used (MU) and account status (AS), we have a 3 x 2 Contingency Matrix as shown below.

MU(low) : AS(open)	MU(low) : AS(closed)
MU(med) : AS(open)	MU(med) : AS(closed)
MU(high) : AS(open)	MU(high) : AS(closed)

Just inspecting the matrix may provide valuable insight. For example, if we see a high value in the bottom right cell, i.e. minutes used high and account status closed, we know that customers are closing their accounts because there is no suitable plan for high minute usage.
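The counting behind a contingency matrix can be sketched as follows. The helper name `contingency_matrix` and the seven samples are hypothetical and chosen only to make the 3 x 2 layout above concrete.

```python
from collections import Counter

def contingency_matrix(pairs, row_values, col_values):
    """Count how many samples fall into each (row value, column value) cell."""
    counts = Counter(pairs)
    return [[counts[(r, c)] for c in col_values] for r in row_values]

# Hypothetical (minutes used, account status) samples.
samples = [
    ("low", "open"), ("low", "open"), ("med", "open"),
    ("med", "closed"), ("high", "closed"), ("high", "closed"),
    ("high", "open"),
]
mu_values = ["low", "med", "high"]
as_values = ["open", "closed"]
matrix = contingency_matrix(samples, mu_values, as_values)

# One row per minutes used value, columns in the order open, closed.
for label, row in zip(mu_values, matrix):
    print(label, row)
```

In this made-up sample the bottom right cell, MU(high) : AS(closed), holds the largest count in the closed column.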

This is how the Cramer Index is defined in terms of the Contingency Matrix. The index depends on how concentrated the values are across the cells.

CramerIndex = (sum(n(i,j) * n(i,j) / (nr(i) * nc(j))) - 1) / (min(numRow, numCol) - 1)
where
n(i,j) = value of the (i,j) cell of the contingency matrix
nr(i) = sum of values over all columns for the i-th row
nc(j) = sum of values over all rows for the j-th column
sum = sum over all i and j

The Cramer index will always be between 0 and 1, 0 indicating weakest correlation and 1 indicating the strongest correlation.
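The formula above translates directly into code. The sketch below is a plain Python transcription, not the avenir implementation, and the two test matrices are invented to show the two extremes. Note that this quantity equals the chi-squared statistic divided by N * (min(numRow, numCol) - 1), where N is the total sample count, which is the square of the statistic usually called Cramér's V.

```python
def cramer_index(matrix):
    """Cramer Index of a contingency matrix given as a list of rows of counts."""
    num_rows = len(matrix)
    num_cols = len(matrix[0])
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[i][j] for i in range(num_rows)) for j in range(num_cols)]
    total = 0.0
    for i in range(num_rows):
        for j in range(num_cols):
            # A non-zero cell guarantees non-zero row and column sums,
            # so skipping empty cells also avoids dividing by zero.
            if matrix[i][j]:
                total += matrix[i][j] ** 2 / (row_sums[i] * col_sums[j])
    return (total - 1.0) / (min(num_rows, num_cols) - 1)

# Hypothetical matrices: identical proportions in every row (no association)
# and all counts on the diagonal (perfect association).
independent = [[10, 10], [20, 20], [30, 30]]
dependent = [[10, 0], [0, 20]]
print(cramer_index(independent), cramer_index(dependent))
```

The sketch assumes both attributes have at least two values; with a single row or column the denominator would be zero.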

Map Reduce

The attributes to be correlated are specified through the configuration parameters source.attributes and dest.attributes. In the initialize() method of the mapper, an instance of a contingency matrix is created for every possible attribute pair from the two sets.

The Cramer Index implementation is part of my open source project avenir, which contains a collection of classification and prediction algorithms implemented on Hadoop. This specific map reduce implementation is available here.

As the mapper processes each record, the values for each possible attribute pair from the two sets are extracted from the record. The value pair is used to locate a cell in the corresponding contingency matrix, and its value is incremented.

In the cleanup() method of the mapper, the contingency matrix for each attribute pair is emitted. The key is the attribute pair and the value is the serialized contingency matrix.

On the reducer side, the contingency matrices for a given key, i.e. attribute pair, are aggregated and the Cramer Index is calculated from the final aggregated contingency matrix. The reducer emits the attribute pair followed by the Cramer Index.
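The flow described above can be simulated in plain Python without Hadoop. This is a sketch under several assumptions, not the avenir code: the function names, the use of field ordinals for the two attribute sets, the sparse dictionary representation of the matrix, and the four input records are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical stand-ins for source.attributes and dest.attributes,
# expressed here as field ordinals within a comma separated record.
SOURCE_ATTRS = [1, 2]
DEST_ATTRS = [3]

def run_mapper(records):
    """Accumulate one sparse contingency matrix per attribute pair for a split
    and emit them all at the end, as the mapper does in cleanup()."""
    matrices = defaultdict(Counter)
    for record in records:
        fields = record.split(",")
        for src in SOURCE_ATTRS:
            for dst in DEST_ATTRS:
                matrices[(src, dst)][(fields[src], fields[dst])] += 1
    return list(matrices.items())

def run_reducer(key, partial_matrices):
    """Add up the partial matrices for one attribute pair and compute the index."""
    merged = Counter()
    for partial in partial_matrices:
        merged.update(partial)
    row_sums, col_sums = Counter(), Counter()
    for (row, col), count in merged.items():
        row_sums[row] += count
        col_sums[col] += count
    total = sum(count * count / (row_sums[row] * col_sums[col])
                for (row, col), count in merged.items())
    # Assumes every attribute shows at least two distinct values in the data.
    return key, (total - 1.0) / (min(len(row_sums), len(col_sums)) - 1)

# Two hypothetical input splits, each handled by its own mapper.
splits = [
    ["c1,low,low,open", "c2,high,low,closed"],
    ["c3,low,high,open", "c4,high,high,closed"],
]
shuffled = defaultdict(list)  # groups mapper output by key, like the shuffle phase
for split in splits:
    for key, matrix in run_mapper(split):
        shuffled[key].append(matrix)
results = dict(run_reducer(key, parts) for key, parts in shuffled.items())
print(results)
```

Emitting one matrix per attribute pair from each mapper, instead of one value per record, keeps the volume of shuffled data small. One simplification to be aware of: the sketch sizes the matrix by the values actually observed, whereas a fixed matrix would be sized by the declared cardinality of each attribute.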

Customer Churn Analysis

The mobile service provider customer data I am using has the following feature attributes. The data is historical, covering a period in the past.

minutes used (low, med, high, overage)
data used (low, med, high)
customer service calls (low, med, high)
payment history (poor, average, good)
account age (low, med, high)

Here is some sample input. The first field is the customer ID and the last field is the response attribute, i.e. the account status. The remaining fields are the feature attributes as listed above.

KX9LBZ3ZVLII,med,med,med,poor,4,open
94PMT4ZQU47W,overage,high,low,average,1,closed
DIINUH7HZUUX,low,high,med,good,4,open
H6W0HROO0H2X,high,low,high,average,4,open
31P1TG4RTGQI,overage,med,low,average,1,closed
GTL7W53933LU,high,med,low,good,4,closed
JU39F4BSB70Z,overage,low,low,poor,2,open
2A4RURJLJ5EZ,high,high,low,good,3,open
FS2DZZ2VK063,low,low,low,good,2,open
B3U4OECQ628K,med,med,med,poor,3,closed
5OWQFS2EGIKV,med,low,low,good,2,open
JU2JVU0WL1Y1,overage,med,low,average,2,open

Most of the attributes above are numerical. They have been discretized into categorical values. The response variable is account status which is either open or closed. Each feature variable is paired with the response variable and the corresponding Cramer Index is emitted as output.

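Here is the output. From the output, we can see that minutes used has the strongest correlation with account status, followed by customer service calls and payment history, although all the index values are close to 0.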

minUsed,status,0.022663449872222907
dataUsed,status,0.0038947124486503615
CSCalls,status,0.010164336836900434
payment,status,0.00905448707197265
acctAge,status,0.0030426459155057373
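To connect this output back to feature subset selection, the lines can be ranked by index and cut off at some number of features. The sketch below uses the output values listed above; the function name `rank_features` and the cut-off of three features are arbitrary choices made for illustration.

```python
# Reducer output lines copied from the listing above: feature, response, Cramer Index.
output_lines = [
    "minUsed,status,0.022663449872222907",
    "dataUsed,status,0.0038947124486503615",
    "CSCalls,status,0.010164336836900434",
    "payment,status,0.00905448707197265",
    "acctAge,status,0.0030426459155057373",
]

def rank_features(lines):
    """Return (feature, index) pairs sorted from strongest to weakest correlation."""
    scored = []
    for line in lines:
        feature, _response, index = line.split(",")
        scored.append((feature, float(index)))
    return sorted(scored, key=lambda item: item[1], reverse=True)

ranked = rank_features(output_lines)
top_k = 3  # hypothetical cut-off; choose it to suit the prediction model
selected = [feature for feature, _ in ranked[:top_k]]
print(selected)
```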

Wrapping Up

Not only does the correlation calculation help you identify the feature attributes critical to the final prediction, it also provides valuable insight. Sometimes that insight is all you need, even if you don’t go all the way to solving the full blown prediction problem.

For example, if high minutes used is found to have the strongest correlation with account closing, a mobile service provider could proactively seek out such customers and offer them alternative calling plans before they leave. Here is the tutorial for the example.

For commercial support for any solution in my GitHub repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud, including installation, configuration and testing.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have a neat closed form solution.