Support Vector Machine (SVM) is unique among the supervised machine learning algorithms in the sense that it focuses on training data points along the separating hyper planes. In this post, I will go over the details of how I have used SVM from the excellent python machine learning library scikit-learn to predict customer churn for a hypothetical telecommunication company.
Along the way we will also explore the interplay between model complexity, training data size and generalization error rate to gain deeper insight into learning problems.
The python implementation is available in my open source project avenir on github. The implementation provides a nice abstraction on the SVM implementation of scikit-learn. It can handle any kind of data. The feature field column indices and the class field column index are provided through configuration parameters.
Support Vector Machine Basics
Consider a learning problem with n features and we are interested in binary classification. There is a hyper plane in the n dimensional feature space that separates the training data points belonging to the two classes.
For now we are considering only the case where the hyper plane completely separates the training data points belonging to the different classes. We will discuss the non separating case later There could be many such separating hyper planes. How does SVM choose among them? I will provide an intuitive explanation without the math.
Now imagine two other hyper planes parallel to the separating hyper plane one on either side of the separating hyper plane. Now imagine the two hyper planes being pushed away from each other until they come in contact with some training data points. in other words, we are trying to find two hyper planes with maximum margin between them.
SVM tries to find a separating hyper plane, such that the two parallel hyper planes on two sides of it has maximum margin between them. The training data points that fall on the maximum margin hyper planes are called support vectors.
For prediction with test data, only thing required are the support vectors. Support vectors essentially constitute the model for SVM. For prediction, dot product between the test data point and the support vectors is necessary.
Mathematically, SVM is constrained quadratic optimization problem. For non separating problems a penalty term is introduced. For handling non linear separating hyper planes, kernel function is used, which transforms a non linear separating hyper plane in the original feature space to separating linear hyper plane in a higher dimension.
Model Complexity, Training Data Size and Error
There is a non linear relationship between generalization error, model complexity and training data size as below
The error will be less than the value on the right side of the expression with a probability of (1-d) or higher. The second term on the right hand side is error due model complexity for some training data size. The exact expression is available in my earlier post.
Since the relationship is non linear, for some model complexity and error, the minimum training data size can only be found through iteration. To further exacerbate the problem, the model complexity is expressed as the VC number which is very difficult to find in most cases. However, qualitatively we can make the following well known statements.
- Generalization error is minimum for certain model complexity, below which the error is higher due to under fitting or bias and above which it’s higher due over fitting or variance.
- Generalization error goes down with increasing training data size until it converges to some error level.
To build a good model, our goal is two fold as below, with generalization error minimization in mind. It’s essentially an unconstrained optimization problem which can only be solved by repeating lots of tests with different parameters and tracking generalization error. This process is known as meta learning.
- We want to find the optimum model complexity
- We want to find minimum required training data size
In meta learning since the cost function i.e. error rate does not have a closed form expression, but only an upper bound, traditional optimization techniques can not be used. The only option is to resort to brute force grid search through the parameter space.
For most real world problems, you have some training data and you don’t have the luxury of training data of various sizes. In that case, it’s a constrained optimization problem and you can only experiment with model complexity. However, because of the constraint on the training data size, the model built may not be guaranteed to be the best.
Customer Churn Data
The customer churn data for a hypothetical telecommunication company has the following attributes
- Plan ID
- Minutes used
- Data used
- Number of customer service calls
- Number of customer service emails
- Number of people within calling network with the same provider
The usage data and the customer service data is over some time horizon e.g., 3 months. The python script that generates the data takes the following command line arguments
- Number of customers
- Percentage of churning customers
- Error percentage
To simulate noisy data, error can be deliberately being my own. introduced in the training data through the last argument. Error is created by introducing class overlap. For example, if the error percentage is 10, then 10% of data points that should have been labeled churned will be labelled as non churned and vice versa.
Training and Validation
Model is validated after training using k fold validation. In k fold validation, 1/k fraction of data is used for validation and the rest is used for training the model. I have used 2 flavors of k fold validation as follows
- Linear k fold validation : Data is divided statically into k subsets. Out of which k-1 sunsets are used for training and 1 subset is used for validation
- Random k fold validation: One subset is randomly created consisting of 1/k fraction of data. Remaining data is used for training
- Bagging: Uses Scikit provided functions
For the first 2 validation approaches, there are 2 flavors of implementations, one based on native scikit library, and the other being my own. The main reason is for my own implementation is to get more details of the error i.e., false positive error and false negative error.
Each training and validation method produces an ensemble of models, which can be persisted in files and later used for prediction purpose.
For some problems, breakdown of the error is critical. With customer churn, false negative is critical to judge the quality of a model. Because predicting a customer to be not churning when in reality the customer is churning i.e. false negative is significantly more critical than false positive.
My implementation is a user friendly wrapper around scikit-learn library. All the configuration parameters are provided through a properties file Here are the configuration parameters.
|common.mode||execution mode: train or predict|
|common.model.directory||directory for saving model|
|common.model.file.prefix||file name prefix for saved mode|
|common.preprocessing||pre processing for training data: scale or normalize|
|train.data.file||training data file name|
|train.data.feature.fields||coma separated feature field column indices|
|train.data.class.field||class label field column index|
|train.validation||validation method: kfold, rfold or bagging|
|train.num.folds||number of folds (k)|
|train.num.iter||number of iterations for rfold validation|
|train.native.kfold.validation||true if using native scikit implementation|
|train.native.rfold.validation||true if using native scikit implementation|
|train.algorithm||SVM implementation: svc, nusvc or linearsvc|
|train.kernel.function||kernel function (scikit parameter kernel: linear, rbf, poly or sigmoid|
|train.poly.degree||polynomial kernel function degree (scikit parameter degree):2,3,4 etc|
|train.penalty||penalty for error (scikit parameter C): -1.0 for default|
|train.gamma||kernel coefficient (scikit parameter gamma); -1.0 for default|
|train.print.sup.vectors||true if support vectors are to be printed|
|train.persist.model||true if trained model need to be persisted|
|train.bagging.num.estimators||number of models for bagging (scikit parameter n_estimators)|
|train.bagging.sample.fraction||fraction of data to be used for training in bagging (scikit parameter max_samples)|
|train.bagging.use.oob||true if out of bag data to be used for validation (scikit parameter oob_score)|
|pred.data.file||test data file in predict mode|
|pred.data.feature.fields||test data feature field column indices|
|pred.num.models||number of models in saved ensemble of models|
There are lot of configuration parameters, to make the script more like SVM based training and prediction work bench
Experimenting with Model Complexity
Model complexity is decided by the parameters of the learning algorithm and the features of the training data used. Among the model parameters kernel function seems to have the greatest influence on error and model quality.
I ran the tests for a fixed training data set size of 5000. I did not vary the features set. All tests are based on 5 features listed above. Here are the test results. I used svc implementation for all tests. these tests will tell us the best model parameters.
The generalization error rate in the table is the average error from an ensemble of models. With k fold validation, we will get k trained models.
|Kernel||Error||False positive error||False negative error|
|poly (degree 2)||0.143||0.013||0.130|
|poly (degree 3)||0.131||0.014||0.117|
|poly (degree 4)||0.159||0.012||0.147|
|poly (degree 5)||0.162||0.011||0.151|
Here are some observation from the test results. These result are relevant for our data set and can not be generalized
- The rbf kernel gives the least error rate, but it’s only slightly better than the linear kernel. It implies that complexity of the rbf kernel matches with the complexity of innate in the data set.
- The sigmoid kernel seems to be the worst, possible because it’s too complex for the data and there is over fitting.
- The polynomial kernel provides some interesting insight. It gives least error for the degree of 3. We can see higher error due to bias or under fitting at low model complexity when the degree is 2. We also see higher error for degree 3 and above due to variance or over fitting.
Experimenting with Training Data Set Size
In the next set of tests we will hold the model parameters fixed and vary the training data set size. We will run tests for 3 different training data size. Here are the test results
|Data set size||Error||False positive error||False negative error|
Since we have been holding 20% of the data for validation, the actual training data size is 80% of the data size. The error rate drops with increasing data size as expected and converging around 5000. I ran a test with data size of 8000 but didn’t see any significant drop in error.
I experimented with some other parameters like penalty, kernel coefficient etc. They did not seem to make much difference. The default values for them just worked fine. The model that gave the lowest generalization error has the following characteristics
- Kernel function : rbf
- Other parameters: default values
- Data set size : 5000
I have used my linear k fold validation method for all tests. Since I was bench marking error rate for different parameters, it was important that training data being used for each test was same. With random k fold, a different training data set would have been selected randomly for each test.
The best error rate from training tests was slightly above 10%, which is expected because we have artificially introduced an error rate of 10% in the training data.. This encouraging because if the data was noise free, our error rate would have been very close to 0.
We discussed about supp0rt vectors earlier. I found that about 25% of the data was selected as support vectors. The number of support vectors is an indication of the complexity of the trained model.
If the ensemble of models were saved in files during training by appropriately setting the configuration parameters, they can reloaded to make predictions with test data. The configuration parameter common.mode needs to be set to predict.
Using the ensemble of models, majority voting method is used to make prediction. The output also shows how the votes are split between the ensemble of models. To avoid ties, odd number of models should used.
Running the Script
Here are the steps necessary to set up the environment and run tests. Python needs to be installed
The script uses numpy and scikit-leran. It’s best to install anaconda to get all related packages.
I used properties file for configuration and python module called jprops for processing properties file. Here is how you install it
pip install jprops
To create test data use the python script telecom_churn.py as below. The 3 command line arguments are as explained before.
./telecom_churn.py 5000 20 10 > churn_train_5000.txt
Here is a tutorial document with the details of all the steps. If you are going to run the use case, please refer to the tutorial.
We have used SVM from scikit-learn to build a predictive model for customer churn. We have also gone through the process of so called meta learning for choosing the best performing model by searching through the parameter space.
The python script implemented is general enough for applying SVM to any predictive problem as long as the data is in tabular CSV format. There are enough parameters through a configuration properties file to try various tuning parameters.