Customer Churn Prediction with SVM using Scikit-Learn

Support Vector Machine (SVM) is unique among the supervised machine learning algorithms in the sense that it focuses on training data points along the separating hyper planes. In this post, I will go over the details of how I have used SVM from the excellent python machine learning library scikit-learn to predict customer churn for a hypothetical telecommunication company.

Along the way we will also explore the interplay between model complexity, training data size and generalization error rate to gain deeper insight into learning problems.

The python implementation is available in my open source project avenir on github. The implementation provides a nice abstraction on the SVM implementation of scikit-learn. It can handle any kind of data. The feature field column indices and the class field column index are provided through configuration parameters.

Support Vector Machine Basics

Consider a learning problem with n features and we are interested in binary classification. There is a hyper plane in the n dimensional feature space that separates the training data points belonging to the two classes.

For now we are considering only the case where the hyper plane completely separates the training data points belonging to the different classes. We will discuss the non separating case later There could be many such separating hyper planes. How does SVM choose among them? I will provide an intuitive explanation without the math.

Now imagine two other hyper planes parallel to the separating hyper plane one on either side of the separating hyper plane. Now imagine the two hyper planes being pushed away from each other until they come in contact with some training data points. in other words, we are trying to find two hyper planes with maximum margin between them.

SVM tries to find a separating hyper plane, such that the two parallel hyper planes on two sides of it has maximum margin between them. The training data points that fall on the maximum margin hyper planes are called support vectors.

For prediction with test data, only thing required are the support vectors. Support vectors essentially constitute the model for SVM. For prediction, dot product between the test data point and the support vectors is necessary.

Mathematically, SVM is constrained quadratic optimization problem. For non separating problems a penalty term is introduced. For handling non linear separating hyper planes, kernel function is used, which transforms a non linear separating hyper plane in the original feature space to separating linear hyper plane in a higher dimension.

Model Complexity, Training Data Size and Error

There is a non linear relationship between generalization error, model complexity and training data size as below

e_g < e_t + f(c,n,d)
where
e_g = generalization error
e_t = training error
f = non linear function
c = model complexity
n = training data size
d = error probability

The error will be less than the value on the right side of the expression with a probability of (1-d) or higher. The second term on the right hand side is error due model complexity for some training data size. The exact expression is available in my earlier post.

Since the relationship is non linear, for some model complexity and error, the minimum training data size can only be found through iteration. To further exacerbate the problem, the model complexity is expressed as the VC number which is very difficult to find in most cases. However, qualitatively we can make the following well known statements.

Generalization error is minimum for certain model complexity, below which the error is higher due to under fitting or bias and above which it’s higher due over fitting or variance.
Generalization error goes down with increasing training data size until it converges to some error level.

To build a good model, our goal is two fold as below, with generalization error minimization in mind. It’s essentially an unconstrained optimization problem which can only be solved by repeating lots of tests with different parameters and tracking generalization error. This process is known as meta learning.

We want to find the optimum model complexity
We want to find minimum required training data size

In meta learning since the cost function i.e. error rate does not have a closed form expression, but only an upper bound, traditional optimization techniques can not be used. The only option is to resort to brute force grid search through the parameter space.

For most real world problems, you have some training data and you don’t have the luxury of training data of various sizes. In that case, it’s a constrained optimization problem and you can only experiment with model complexity. However, because of the constraint on the training data size, the model built may not be guaranteed to be the best.

Customer Churn Data

The customer churn data for a hypothetical telecommunication company has the following attributes

Plan ID
Minutes used
Data used
Number of customer service calls
Number of customer service emails
Number of people within calling network with the same provider
Churned

The usage data and the customer service data is over some time horizon e.g., 3 months. The python script that generates the data takes the following command line arguments

Number of customers
Percentage of churning customers
Error percentage

To simulate noisy data, error can be deliberately being my own. introduced in the training data through the last argument. Error is created by introducing class overlap. For example, if the error percentage is 10, then 10% of data points that should have been labeled churned will be labelled as non churned and vice versa.

Training and Validation

Model is validated after training using k fold validation. In k fold validation, 1/k fraction of data is used for validation and the rest is used for training the model. I have used 2 flavors of k fold validation as follows

Linear k fold validation : Data is divided statically into k subsets. Out of which k-1 sunsets are used for training and 1 subset is used for validation
Random k fold validation: One subset is randomly created consisting of 1/k fraction of data. Remaining data is used for training
Bagging: Uses Scikit provided functions

For the first 2 validation approaches, there are 2 flavors of implementations, one based on native scikit library, and the other being my own. The main reason is for my own implementation is to get more details of the error i.e., false positive error and false negative error.

Each training and validation method produces an ensemble of models, which can be persisted in files and later used for prediction purpose.

For some problems, breakdown of the error is critical. With customer churn, false negative is critical to judge the quality of a model. Because predicting a customer to be not churning when in reality the customer is churning i.e. false negative is significantly more critical than false positive.

My implementation is a user friendly wrapper around scikit-learn library. All the configuration parameters are provided through a properties file Here are the configuration parameters.

Parameter	Comment
common.mode	execution mode: train or predict
common.model.directory	directory for saving model
common.model.file.prefix	file name prefix for saved mode
common.preprocessing	pre processing for training data: scale or normalize
train.data.file	training data file name
train.data.feature.fields	coma separated feature field column indices
train.data.class.field	class label field column index
train.validation	validation method: kfold, rfold or bagging
train.num.folds	number of folds (k)
train.num.iter	number of iterations for rfold validation
train.native.kfold.validation	true if using native scikit implementation
train.native.rfold.validation	true if using native scikit implementation
train.algorithm	SVM implementation: svc, nusvc or linearsvc
train.kernel.function	kernel function (scikit parameter kernel: linear, rbf, poly or sigmoid
train.poly.degree	polynomial kernel function degree (scikit parameter degree):2,3,4 etc
train.penalty	penalty for error (scikit parameter C): -1.0 for default
train.gamma	kernel coefficient (scikit parameter gamma); -1.0 for default
train.print.sup.vectors	true if support vectors are to be printed
train.persist.model	true if trained model need to be persisted
train.bagging.num.estimators	number of models for bagging (scikit parameter n_estimators)
train.bagging.sample.fraction	fraction of data to be used for training in bagging (scikit parameter max_samples)
train.bagging.use.oob	true if out of bag data to be used for validation (scikit parameter oob_score)
pred.data.file	test data file in predict mode
pred.data.feature.fields	test data feature field column indices
pred.num.models	number of models in saved ensemble of models

There are lot of configuration parameters, to make the script more like SVM based training and prediction work bench

Experimenting with Model Complexity

Model complexity is decided by the parameters of the learning algorithm and the features of the training data used. Among the model parameters kernel function seems to have the greatest influence on error and model quality.

I ran the tests for a fixed training data set size of 5000. I did not vary the features set. All tests are based on 5 features listed above. Here are the test results. I used svc implementation for all tests. these tests will tell us the best model parameters.

The generalization error rate in the table is the average error from an ensemble of models. With k fold validation, we will get k trained models.

Kernel	Error	False positive error	False negative error
linear	0.107	0.018	0.089
rbf	0.101	0.019	0.082
poly (degree 2)	0.143	0.013	0.130
poly (degree 3)	0.131	0.014	0.117
poly (degree 4)	0.159	0.012	0.147
poly (degree 5)	0.162	0.011	0.151
sigmoid	0.252	0.000	0.252

Here are some observation from the test results. These result are relevant for our data set and can not be generalized

The rbf kernel gives the least error rate, but it’s only slightly better than the linear kernel. It implies that complexity of the rbf kernel matches with the complexity of innate in the data set.
The sigmoid kernel seems to be the worst, possible because it’s too complex for the data and there is over fitting.
The polynomial kernel provides some interesting insight. It gives least error for the degree of 3. We can see higher error due to bias or under fitting at low model complexity when the degree is 2. We also see higher error for degree 3 and above due to variance or over fitting.

Experimenting with Training Data Set Size

In the next set of tests we will hold the model parameters fixed and vary the training data set size. We will run tests for 3 different training data size. Here are the test results

Data set size	Error	False positive error	False negative error
1000	0.125	0.024	0.101
3000	0.117	0.020	0.097
5000	0.101	0.019	0.082

Since we have been holding 20% of the data for validation, the actual training data size is 80% of the data size. The error rate drops with increasing data size as expected and converging around 5000. I ran a test with data size of 8000 but didn’t see any significant drop in error.

Final Model

I experimented with some other parameters like penalty, kernel coefficient etc. They did not seem to make much difference. The default values for them just worked fine. The model that gave the lowest generalization error has the following characteristics

Kernel function : rbf
Other parameters: default values
Data set size : 5000

I have used my linear k fold validation method for all tests. Since I was bench marking error rate for different parameters, it was important that training data being used for each test was same. With random k fold, a different training data set would have been selected randomly for each test.

The best error rate from training tests was slightly above 10%, which is expected because we have artificially introduced an error rate of 10% in the training data.. This encouraging because if the data was noise free, our error rate would have been very close to 0.

We discussed about supp0rt vectors earlier. I found that about 25% of the data was selected as support vectors. The number of support vectors is an indication of the complexity of the trained model.

Making Prediction

If the ensemble of models were saved in files during training by appropriately setting the configuration parameters, they can reloaded to make predictions with test data. The configuration parameter common.mode needs to be set to predict.

Using the ensemble of models, majority voting method is used to make prediction. The output also shows how the votes are split between the ensemble of models. To avoid ties, odd number of models should used.

Running the Script

Here are the steps necessary to set up the environment and run tests. Python needs to be installed

The script uses numpy and scikit-leran. It’s best to install anaconda to get all related packages.

I used properties file for configuration and python module called jprops for processing properties file. Here is how you install it

pip install jprops

To create test data use the python script telecom_churn.py as below. The 3 command line arguments are as explained before.

./telecom_churn.py 5000 20 10 > churn_train_5000.txt

Edit the properties configuration file svm.properties as per your needs. Run the script svm.py as below. Output will appear in the console. You could pipe it and save it.

./svm.py svm.properties

Here is a tutorial document with the details of all the steps. If you are going to run the use case, please refer to the tutorial.

Wrapping Up

We have used SVM from scikit-learn to build a predictive model for customer churn. We have also gone through the process of so called meta learning for choosing the best performing model by searching through the parameter space.

The python script implemented is general enough for applying SVM to any predictive problem as long as the data is in tabular CSV format. There are enough parameters through a configuration properties file to try various tuning parameters.

Support

For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud including installation, configuration and testing,

2 Responses to Customer Churn Prediction with SVM using Scikit-Learn

Alex says:

August 4, 2016 at 2:18 am

Thanks for the post! I’m learning how to do churn predictive, very helpfull. A question: why you don’t use KFold functions of sklearn?

- Pranab says:
  
  August 4, 2016 at 6:41 pm
  
  K fold cross validation should be used as with with any predictive model building