Explore Customer Churn with Cramer Index

Classification problems involve predicting a response variable based on a set of feature variables for some entity. But there is another problem whose solution is a prerequisite for solving a classification problem: we may want to know which of the feature variables are most strongly correlated with the response variable. Once we have identified those, we may want to use only that subset of the feature variables to build the prediction model.

To put this in context, we will use the customer churn prediction problem, specifically for customers of a mobile telecom service provider. A customer may have attributes like minutes used, data used, number of customer service calls, etc. We are interested in identifying the feature attributes that are critical to effectively predicting whether a customer will close his or her account. Reducing the number of features is generally beneficial when building prediction models. This is known as feature subset selection.


For numerical variables, the correlation between two variables is simply the covariance normalized by the product of the two standard deviations. For categorical variables, i.e. variables having a finite set of unordered values, it’s more complex. There are several approaches for handling categorical data. The correlation statistic we will be using is called the Cramer Index.
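For comparison, here is a minimal sketch of the numerical case described above: covariance divided by the product of the two standard deviations. The function name and the sample values are made up for illustration and are not taken from the churn data.

```python
import math

def pearson_correlation(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)

# Hypothetical numbers: minutes used and data used for five customers.
minutes = [100, 200, 300, 400, 500]
data_used = [150, 240, 310, 420, 480]
r = pearson_correlation(minutes, data_used)
print(f"correlation = {r:.3f}")
```

This only works when both variables are numeric, which is why a different statistic is needed for categorical attributes.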

The basic building block for many correlation statistics between categorical variables is the contingency matrix. If a variable a has n possible values and a variable b has m possible values, then the contingency matrix will be an n x m matrix. Each cell of the matrix contains the count of samples that have the corresponding pair of attribute values.

If we consider the attribute pair minutes used (MU) and account status (AS), we have a 3 x 2 contingency matrix as shown below.

MU(low) : AS(open) MU(low) : AS(closed)
MU(med) : AS(open) MU(med) : AS(closed)
MU(high) : AS(open) MU(high) : AS(closed)

Simple inspection of the matrix may already provide valuable insight. For example, if we see a high value in the bottom right cell, i.e. minutes used high and account status closed, we know that heavy users are closing their accounts, which suggests there is no suitable plan for high minute usage.
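Building such a matrix amounts to counting value pairs. The following is a minimal sketch, assuming the records have already been reduced to (minutes used, account status) pairs; the function name and the seven sample records are hypothetical.

```python
from collections import Counter

def contingency_matrix(pairs, row_values, col_values):
    """Count how often each (row value, column value) pair occurs."""
    counts = Counter(pairs)
    # Counter returns 0 for pairs that never occur.
    return [[counts[(r, c)] for c in col_values] for r in row_values]

# Hypothetical (minutes used, account status) pairs.
records = [
    ("low", "open"), ("low", "open"),
    ("med", "open"), ("med", "closed"),
    ("high", "closed"), ("high", "closed"), ("high", "open"),
]
matrix = contingency_matrix(records, ["low", "med", "high"], ["open", "closed"])
for label, row in zip(["low", "med", "high"], matrix):
    print(label, row)
```

Rows follow the order of the minutes used values and columns the order of the account status values, matching the layout shown above.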

This is how the Cramer Index is defined in terms of the contingency matrix. The index depends on how concentrated the values are across the cells.

CramerIndex = (sum(n(i,j) * n(i,j) / (nr(i) * nc(j))) - 1) / (min(numRow, numCol) - 1)
n(i,j) = value of the (i,j) cell of the contingency matrix
nr(i) = sum of values over all columns for the i-th row
nc(j) = sum of values over all rows for the j-th column
sum = sum over all i and j

The Cramer Index always lies between 0 and 1, with 0 indicating the weakest correlation and 1 the strongest.
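The formula translates almost line for line into code. This is a minimal sketch, not the avenir implementation; the function name is my own, and the matrix is assumed to be a list of rows of counts. As defined here, the index equals the square of the statistic usually called Cramér's V.

```python
def cramer_index(matrix):
    """Cramer Index of a contingency matrix given as a list of rows of counts."""
    row_totals = [sum(row) for row in matrix]        # nr(i)
    col_totals = [sum(col) for col in zip(*matrix)]  # nc(j)
    total = 0.0
    for i, row in enumerate(matrix):
        for j, n_ij in enumerate(row):
            if n_ij:  # empty cells add nothing; skipping them also avoids 0/0
                total += n_ij * n_ij / (row_totals[i] * col_totals[j])
    return (total - 1.0) / (min(len(matrix), len(matrix[0])) - 1)

# Counts concentrated on the diagonal: strongest possible correlation.
print(cramer_index([[5, 0], [0, 5]]))
# Counts spread evenly across the cells: no correlation.
print(cramer_index([[2, 2], [2, 2]]))
```

The two calls illustrate the extremes: counts concentrated in one cell per row give an index of 1, while evenly spread counts give 0.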

Map Reduce

The attributes to be correlated are specified through the configuration parameters source.attributes and dest.attributes. In the initialize() method of the mapper, a contingency matrix instance is created for every possible attribute pair drawn from the two sets.

The Cramer Index implementation is part of my open source project avenir, which contains a collection of classification and prediction algorithms implemented on Hadoop. This specific map reduce implementation is available here.

As the mapper processes each record, the values for each possible attribute pair from the two sets are extracted from the record. The value pair is used to locate a cell in the corresponding contingency matrix, and its value is incremented.

In the cleanup() method of the mapper, the contingency matrix for each attribute pair is emitted. The key is the attribute pair and the value is the serialized contingency matrix.

On the reducer side, the contingency matrices for a given key, i.e. attribute pair, are aggregated and the Cramer Index is calculated from the final aggregated contingency matrix. The reducer emits the attribute pair followed by the Cramer Index.
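The flow described in this section can be simulated in a few lines of plain Python. This is only a sketch of the idea, not the Hadoop code in avenir: the function names, the attribute names minUsed and status, and the two input splits are all hypothetical.

```python
from collections import defaultdict

def cramer_index(matrix):
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    total = sum(n * n / (row_totals[i] * col_totals[j])
                for i, row in enumerate(matrix)
                for j, n in enumerate(row) if n)
    return (total - 1.0) / (min(len(matrix), len(matrix[0])) - 1)

def map_partition(records, source_attrs, dest_attrs, values):
    """Mapper sketch: one local contingency matrix per attribute pair."""
    matrices = {(s, d): [[0] * len(values[d]) for _ in values[s]]
                for s in source_attrs for d in dest_attrs}
    for rec in records:
        for (s, d), m in matrices.items():
            # The value pair locates the cell to increment.
            m[values[s].index(rec[s])][values[d].index(rec[d])] += 1
    return matrices  # emitted once per mapper, after all records are seen

def reduce_pair(matrices):
    """Reducer sketch: add the partial matrices cell by cell, then compute the index."""
    total = [[sum(cells) for cells in zip(*rows)] for rows in zip(*matrices)]
    return total, cramer_index(total)

# Hypothetical attribute values and two input splits.
values = {"minUsed": ["low", "med", "high"], "status": ["open", "closed"]}
split_1 = [{"minUsed": "low", "status": "open"},
           {"minUsed": "high", "status": "closed"}]
split_2 = [{"minUsed": "med", "status": "open"},
           {"minUsed": "high", "status": "closed"},
           {"minUsed": "low", "status": "open"}]

# Shuffle step: group the partial matrices by attribute pair.
grouped = defaultdict(list)
for split in (split_1, split_2):
    for key, m in map_partition(split, ["minUsed"], ["status"], values).items():
        grouped[key].append(m)

results = {key: reduce_pair(ms) for key, ms in grouped.items()}
for key, (total, index) in results.items():
    print(key, total, round(index, 3))
```

Emitting one matrix per attribute pair per mapper, rather than one record per input row, keeps the amount of data shuffled to the reducers small.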

Customer Churn Analysis

The mobile service provider customer data I am using has the following feature attributes. The data covers a period going into the past.

  • minutes used (low, med, high, overage)
  • data used (low, med, high)
  • customer service calls (low, med, high)
  • payment history (poor, average, good)
  • account age  (low, med, high)

Here is some sample input. The first field is the customer ID and the last field is the response attribute, i.e. the account status. The remaining fields are the feature attributes as listed above.


Most of the attributes above are numerical. They have been discretized into categorical values. The response variable is account status which is either open or closed. Each feature variable is paired with the response variable and the corresponding Cramer Index is emitted as output.
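Discretization of this kind can be sketched as a lookup against a sorted list of cut points. The thresholds below are invented for illustration; the actual cut points used for the data set are not given here.

```python
from bisect import bisect_right

def discretize(value, cut_points, labels):
    """Map a number to a label; cut_points must be sorted and one shorter than labels."""
    return labels[bisect_right(cut_points, value)]

# Hypothetical thresholds for monthly minutes used.
MINUTES_CUTS = [300, 700, 1000]
MINUTES_LABELS = ["low", "med", "high", "overage"]

for minutes in (250, 300, 850, 1200):
    print(minutes, discretize(minutes, MINUTES_CUTS, MINUTES_LABELS))
```

With bisect_right, a value equal to a cut point falls into the higher bucket, so 300 minutes is labeled med under these assumed thresholds.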

Here is the output. From the output, we can see that minutes used has the strongest correlation with account status.


Wrapping Up

Not only does correlation calculation help you identify the feature attributes critical to the final prediction, it also provides valuable insight. Sometimes that insight is all you need, even if you don’t go all the way to solving the full blown prediction problem.

For example, if high minutes used is found to have the strongest correlation with account closing, a mobile service provider could proactively seek out such customers and offer them alternative calling plans before they leave. Here is the tutorial for the example.

For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud, including installation, configuration and testing.


About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solutions.