Customer Churn Prediction with SVM using Scikit-Learn


Support Vector Machine (SVM) is unique among the supervised machine learning algorithms in the sense that it focuses on training data points along the separating hyper planes. In this post, I will go over the details of how I have used SVM from the excellent python machine learning library scikit-learn to predict customer churn for a hypothetical telecommunication company.

Along the way we will also explore the interplay between model complexity, training data size and generalization error rate to gain deeper  insight into learning problems.

The python implementation is available in my open source project avenir on github. The implementation provides a nice abstraction on the SVM implementation of scikit-learn. It can handle any kind of data. The feature field column indices and the class field column index are provided through configuration parameters.

Support Vector Machine Basics

Consider a learning problem with n features and we are interested in binary classification. There is a hyper plane in the n dimensional feature space that separates the training data points belonging to the two classes.

For now we are considering only the case where the hyper plane completely separates the training data points belonging to the different classes.  We will discuss the non separating case later   There could be many such separating hyper planes. How does SVM choose among them? I will provide an intuitive explanation without the math.

Now imagine two other hyper planes parallel to the separating hyper plane one on either side of the separating hyper plane. Now imagine the two hyper planes being pushed away from each other until they come in contact with some training data points. in other words, we are trying to find two hyper planes with maximum margin between them.

SVM tries to find a separating hyper plane, such that the two parallel hyper planes on two sides of it has maximum margin between them. The training data points that fall on the maximum margin hyper planes are called support vectors.

For prediction with test data, only thing required are the support vectors. Support vectors essentially constitute the model for SVM. For prediction, dot product between the test data point and the support vectors is necessary.

Mathematically, SVM is constrained quadratic optimization problem. For   non separating problems a penalty term is introduced.  For handling non linear separating hyper planes, kernel function is used, which transforms a non linear separating hyper plane in the original feature space to separating linear hyper plane in a higher dimension.

Model Complexity, Training Data Size and Error

There is a non linear relationship between generalization error, model complexity and training data size as below

eg < et  +  f(c,n,d)
where
eg = generalization error
et = training error
f = non linear function
c = model complexity
n = training data size
d = error probability

The error will be less than the value on the right side of the expression with a probability of (1-d) or higher. The second term on the right hand side is error due model complexity for some training data size. The exact expression is available in my earlier post.

Since the relationship is non linear, for some model complexity and error, the minimum training data size can only be found through iteration. To further exacerbate  the  problem, the model complexity is expressed as the VC number which is very difficult to find in most cases. However, qualitatively we can make the following well known statements.

  1. Generalization error is minimum for certain model complexity, below which the error is higher due to under fitting or bias and above which it’s higher due over fitting or variance.
  2. Generalization error goes down with increasing training data size until it converges to some error level.

To build a good model, our goal is two fold as below, with generalization error minimization in mind. It’s essentially an unconstrained optimization problem which can only be solved by repeating lots of tests with different parameters and tracking generalization error. This process is known as meta learning.

  1. We want to find the optimum model complexity
  2. We want to find minimum required training data size

In meta learning since the cost function i.e. error rate does not have a closed form expression, but only an upper bound, traditional optimization techniques can not be used. The only option is to resort  to brute force grid search through the parameter space.

For most real world problems, you have some training data and you don’t have the luxury of training data of various sizes. In that case, it’s a constrained optimization problem and you can only experiment with model complexity. However, because of the constraint on the training data size, the model built may not be guaranteed to be the best.

Customer Churn Data

The customer churn data for a hypothetical telecommunication company has the following attributes

  1. Plan ID
  2. Minutes used
  3. Data used
  4. Number of customer service calls
  5. Number of customer service emails
  6. Number of people within calling network with the same provider
  7. Churned

The usage data and the customer service data is over some time horizon e.g., 3 months. The python script that generates the data takes the following command line arguments

  1. Number of customers
  2. Percentage of churning customers
  3. Error percentage

To simulate noisy data, error can  be deliberately being my own. introduced in the training data through the last argument. Error is created by introducing class overlap. For example, if the error percentage is 10, then 10% of data points that should have been labeled churned will be labelled as non churned and vice versa.

Training and Validation

Model is validated after training using k fold validation. In k  fold validation, 1/k fraction of data is used for validation and the rest is used for training the model. I have used 2 flavors of k fold validation as follows

  • Linear k fold validation : Data is divided  statically into k subsets. Out of which k-1 sunsets are used for training and 1 subset is used for validation
  • Random k fold validation: One subset is randomly created consisting of 1/k fraction of data. Remaining data is used for training
  • Bagging: Uses Scikit provided functions

For  the first 2 validation approaches, there are 2 flavors of implementations, one based on native scikit library, and the other being my own. The main reason is for my own implementation is to get more details of the error i.e., false positive error and false negative error.

Each training and validation method produces an ensemble of models, which can be persisted in files and later used for prediction purpose.

For some problems, breakdown of the error is critical. With customer churn, false negative is critical to judge the quality of a model. Because predicting a customer to be not churning when in reality the customer is churning i.e. false negative  is significantly more critical than false positive.

My implementation is a user friendly wrapper around scikit-learn library. All the configuration parameters are provided through a properties file Here are the configuration parameters.

Parameter Comment
common.mode execution mode: train or predict
common.model.directory directory for saving model
common.model.file.prefix file name prefix for saved mode
common.preprocessing pre processing for training data: scale or normalize
train.data.file training data file name
train.data.feature.fields coma separated feature field column indices
train.data.class.field class label field column index
train.validation validation method: kfold, rfold or bagging
train.num.folds number of folds (k)
train.num.iter number of iterations for rfold validation
train.native.kfold.validation true if using native scikit implementation
train.native.rfold.validation true if using native scikit implementation
train.algorithm SVM implementation: svc, nusvc or linearsvc
train.kernel.function kernel function (scikit parameter kernel: linear, rbf, poly or sigmoid
train.poly.degree polynomial kernel function degree (scikit parameter degree):2,3,4 etc
train.penalty penalty for error (scikit parameter C): -1.0 for default
train.gamma kernel coefficient (scikit parameter gamma); -1.0 for default
train.print.sup.vectors true if support vectors are to be printed
train.persist.model true if trained model need to be persisted
train.bagging.num.estimators number of models for bagging (scikit parameter n_estimators)
train.bagging.sample.fraction fraction of data to be used for training in bagging (scikit parameter max_samples)
train.bagging.use.oob true if out of bag data to be used for validation (scikit parameter oob_score)
pred.data.file test data file in predict mode
pred.data.feature.fields test data feature field column indices
pred.num.models number of models in saved ensemble of models

 

There are lot of configuration parameters, to make the script more like SVM based training and prediction work bench

Experimenting with Model Complexity

Model complexity is decided by the parameters of the learning algorithm and the features of the training data used. Among the model parameters kernel function seems to have the greatest influence on error and model quality.

I ran the tests for a fixed training  data set size of 5000. I did not vary the features set. All tests are based on 5 features listed above. Here are the test results. I used svc implementation for all tests. these tests will tell us the best model parameters.

The generalization error rate in the table is the average error from an ensemble of models. With k fold validation, we will get k trained models.

Kernel Error False positive error False negative error
linear 0.107 0.018 0.089
rbf 0.101 0.019 0.082
poly (degree 2) 0.143 0.013 0.130
poly (degree 3) 0.131 0.014 0.117
poly (degree 4) 0.159 0.012 0.147
poly (degree 5) 0.162 0.011 0.151
sigmoid 0.252 0.000 0.252

 

Here are some observation from the test results. These result are relevant for our data set and can not be generalized

  • The rbf kernel gives the least error rate, but it’s only slightly better than the linear kernel. It implies that complexity of the rbf kernel matches with the complexity of innate in the data set.
  • The sigmoid kernel seems to be the worst, possible because it’s too complex for the data and there is over fitting.
  • The polynomial kernel provides some interesting insight. It gives  least error for the degree of 3. We can see higher error due to bias or under fitting  at low model complexity when the degree is 2. We also see higher error for degree 3 and above due to variance or over fitting.

Experimenting with Training Data Set Size

In the next set of tests we will hold the model parameters fixed and vary the training data set size. We will run tests for 3 different training data size. Here are the test results

Data set size Error False positive error False negative error
1000 0.125 0.024 0.101
3000 0.117 0.020 0.097
5000 0.101 0.019 0.082

Since we have been holding 20% of the data for validation, the actual training data size is 80% of the data size. The error rate drops with increasing data size as expected and converging around 5000. I ran a test with data size of 8000 but didn’t see any significant drop in error.

Final Model

I experimented with some other parameters like penalty, kernel coefficient etc. They did not seem to make much difference. The default values for them just worked fine. The model that gave the lowest generalization error has the following characteristics

  • Kernel function : rbf
  • Other parameters: default values
  • Data set size : 5000

I have used my linear k fold validation method for all tests. Since I was bench marking error rate for different parameters, it was important that training data being used for each test was same. With random k fold, a different training data set would have been selected randomly for each test.

The best error rate from training tests was slightly above 10%, which is expected because we have artificially introduced an error rate of 10% in the training data.. This encouraging because if the data was noise free, our error rate would have been very close to 0.

We discussed about supp0rt vectors earlier. I found that about 25% of the data was selected as support vectors. The number of support vectors is an indication of the complexity of the trained model.

Making Prediction

If the ensemble of models were saved in files during training by appropriately setting the configuration parameters, they can reloaded to make predictions with test data. The configuration parameter common.mode needs to be set to predict.

Using the ensemble of models, majority voting method is used to make prediction. The output also shows how the votes are split between the ensemble of models. To avoid ties, odd number of models should used.

Running the Script

Here are the steps necessary to set up the environment and run tests. Python needs to be installed

The script uses numpy and scikit-leran. It’s best to install anaconda to get all related packages.

I used properties file for configuration and python module called jprops for processing properties file. Here is how you install it

pip install jprops 

To create test data use the python script telecom_churn.py as below. The 3 command line arguments are as explained before.

./telecom_churn.py 5000 20 10 > churn_train_5000.txt

Edit the properties configuration file svm.properties as per your needs. Run the script svm.py as below. Output will appear in the console. You could pipe it and save it.

./svm.py svm.properties

Here is a tutorial document with the details of all the steps. If you are going to run the use case, please refer to the tutorial.

Wrapping Up

We have used SVM from scikit-learn to build a predictive model for customer churn. We have also gone through the process of so called meta learning for choosing the best performing model by searching through the parameter space.

The python script implemented is general enough for applying SVM to any predictive problem as long as the data is in tabular CSV format. There are enough parameters through a configuration properties file to try various tuning parameters.

Advertisements

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Data Science, Machine Learning, Predictive Analytic, Python and tagged , , , , . Bookmark the permalink.

2 Responses to Customer Churn Prediction with SVM using Scikit-Learn

  1. Alex says:

    Thanks for the post! I’m learning how to do churn predictive, very helpfull. A question: why you don’t use KFold functions of sklearn?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s