Categorical feature variables are a thorny issue for many supervised machine learning algorithms, many of which cannot handle them directly. In this post, we will go over an encoding scheme called Leave One Out Encoding, as implemented with Spark. It’s a recent algorithm and popular on Kaggle. It is particularly useful for high cardinality categorical features.
The Spark implementation of the encoding algorithms can be found in my open source project avenir on GitHub.
Encoding Categorical Variables
The best known algorithm is Dummy Variable Generator, which creates a binary variable for each unique value of each categorical variable. If you have too many categorical variables and/or categorical variables with high cardinality, this algorithm will create too many additional feature variables.
There is another class of algorithms which encode each categorical variable value as a numeric value and so keep the number of dimensions under control. These algorithms have something in common: they are all based on the correlation between the categorical variable and the class or target variable. I discussed two of them in an earlier post. Because they rely on the target variable, these algorithms can only be used with supervised machine learning data sets.
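To make the dimensionality problem concrete, here is a minimal sketch in plain Python (not the avenir Scala code) of dummy variable generation. The helper names `dummy_columns` and `one_hot` are made up for illustration, and the values are taken from the sample loan records shown later in this post.

```python
def dummy_columns(values):
    """Return the sorted distinct values; each one becomes a binary column."""
    return sorted(set(values))


def one_hot(value, columns):
    """Encode a single value as a list of 0/1 flags, one per column."""
    return [1 if value == c else 0 for c in columns]


# Values taken from the five sample loan records.
marital = ["single", "married", "divorced", "single", "single"]
zips = ["95106", "95067", "95113", "95103", "95115"]

marital_cols = dummy_columns(marital)   # 3 columns
zip_cols = dummy_columns(zips)          # one column per distinct zip code

print(len(marital_cols), len(zip_cols))
print(one_hot("married", marital_cols))
```

Even in five records, the zip code field already needs five binary columns, and the count keeps growing with every new zip code in the data.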
Leave One Out Encoding
Leave One Out encoding essentially calculates the mean of the target variable over all the records containing the same value for the categorical feature variable in question. The encoding algorithm is slightly different for training and test data sets. For a training data set, the record under consideration is left out, hence the name Leave One Out. For a given value of a given categorical variable, the encoded value for a record is the sum of the target values of the other n - 1 records with that value, divided by n - 1 + R, where n is the number of records with that value and R is a regularization factor. A small random factor is also applied.
For a validation or prediction data set, the definition is simpler. We don’t need to leave the current record out and we don’t need the randomness factor: the encoded value is the sum of the target values of all n training records with that value, divided by n + R. As we will see later, the target variable statistics for each value of each categorical variable, as calculated for the training data set, are saved and then used for validation and prediction data sets.
The factor R acts as a regularizer. When the support for a particular value of a variable is low, i.e. when n is low, the encoded value is not reliable. The regularization factor R remedies the problem. It also ensures that the denominator is always positive for the training data set, even when n is 1.
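The two definitions can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the avenir implementation: the function names are made up, and the way the randomness factor is applied (a multiplicative zero-mean Gaussian term) is an assumption, since this post only says that a random factor exists for training data.

```python
import random


def encode_train(target, total, count, reg, noise_sd=0.0, rng=random):
    """Leave-one-out value for one training record.

    total and count are the sum and number of target values over all training
    records sharing the categorical value, including the current record.
    reg is the regularization factor R and should be positive.
    """
    # Leave the current record out of both the sum and the count.
    base = (total - target) / (count - 1 + reg)
    # Assumption: the randomness factor is multiplicative Gaussian noise.
    eps = rng.gauss(0.0, noise_sd) if noise_sd > 0 else 0.0
    return base * (1.0 + eps)


def encode_valid(total, count, reg):
    """Value for a validation or prediction record: nothing left out, no noise."""
    return total / (count + reg)


# A categorical value seen in 293 training records whose targets sum to -37.
print(encode_train(1, -37, 293, reg=1.0))
print(encode_valid(-37, 293, reg=1.0))
# A value seen only once: R keeps the training denominator positive.
print(encode_train(1, 1, 1, reg=1.0))
```

With `noise_sd` left at zero the training value is deterministic, which makes the effect of R easy to inspect before adding noise.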
Loan Approval Data
A loan approval data set is used for encoding. The data set has the following fields:
- loan ID
- marital status (categorical with 3 values)
- number of children
- education level
- employment status
- number of years of experience
- if any outstanding loan
- outstanding loan amount
- loan term
- credit score
- residence zip code (categorical with many values)
- approval status
There are two categorical variables. One of them (zip code) has very high cardinality. The last field is the class or target variable. Here is some sample data:
ETEC7F2YLP,single,2,1,0,107,8,23,362,10,698,95106,0
1XPXKLE4UQ,married,1,2,1,127,10,28,237,30,733,95067,1
MKXAM7YQ0C,divorced,2,2,0,117,15,18,286,30,654,95113,0
5U17SLR3DV,single,1,3,1,98,6,22,264,30,730,95103,1
VE8175H9NI,single,2,2,1,100,11,20,294,30,746,95115,1
Spark Job for Encoding
The Spark job for Leave One Out encoding is implemented in the Scala object CategoricalLeaveOneOutEncoding. The input data RDD is cached, as two passes are made through it.
In the first pass, target variable statistics are calculated for each value of each categorical variable. The statistics are also saved if the data set is a training data set, and are used subsequently for validation and prediction data sets. Encoding takes place in the second pass. Here is some sample output:
5P186RXY11,-0.218,1,2,0,112,10,20,232,30,729,-0.277,1
1W5DXF5ECK,-0.210,2,1,0,119,9,20,402,30,621,-0.115,0
B5FA7FB017,0.030,1,1,0,155,8,16,286,15,772,-0.101,1
39JQ9MHRIE,0.032,1,1,1,115,14,18,212,30,677,-0.128,1
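The two passes can be sketched in plain Python without Spark. This is a simplified illustration, not the CategoricalLeaveOneOutEncoding code: the function names, the value of `REG`, and the omission of the randomness factor are all assumptions made for the sketch. The records are the five sample loan records, with the raw label 1 treated as the positive class and everything else mapped to -1.

```python
from collections import defaultdict

RECORDS = [
    "ETEC7F2YLP,single,2,1,0,107,8,23,362,10,698,95106,0",
    "1XPXKLE4UQ,married,1,2,1,127,10,28,237,30,733,95067,1",
    "MKXAM7YQ0C,divorced,2,2,0,117,15,18,286,30,654,95113,0",
    "5U17SLR3DV,single,1,3,1,98,6,22,264,30,730,95103,1",
    "VE8175H9NI,single,2,2,1,100,11,20,294,30,746,95115,1",
]
CAT_COLS = [1, 11]   # zero-based: marital status and zip code
POS_CLASS = "1"      # raw label treated as the positive class
REG = 1.0            # regularization factor R (illustrative value)


def signed_target(fields):
    """Map the raw class label in the last field to 1 or -1."""
    return 1 if fields[-1] == POS_CLASS else -1


def target_stats(rows):
    """First pass: (column, value) -> [count, sum of signed targets]."""
    stats = defaultdict(lambda: [0, 0])
    for fields in rows:
        t = signed_target(fields)
        for col in CAT_COLS:
            entry = stats[(col, fields[col])]
            entry[0] += 1
            entry[1] += t
    return stats


def encode_training(rows, stats):
    """Second pass: replace categorical values with leave-one-out means."""
    out = []
    for fields in rows:
        t = signed_target(fields)
        enc = list(fields)
        for col in CAT_COLS:
            count, total = stats[(col, fields[col])]
            # Randomness factor omitted to keep the sketch deterministic.
            enc[col] = "%.3f" % ((total - t) / (count - 1 + REG))
        out.append(enc)
    return out


rows = [line.split(",") for line in RECORDS]
stats = target_stats(rows)
encoded = encode_training(rows, stats)
for fields in encoded:
    print(",".join(fields))
```

In this tiny sample every zip code occurs once, so leaving the current record out leaves nothing to average and the zip code encodes to zero. That is the low support case the regularization factor is meant to guard against.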
The 2nd and 12th columns have been replaced with encoded values. The range of the encoded values is the same as the range of the class or target variable values. If a categorical variable is highly correlated with the class variable, i.e. it’s a good predictor, the encoded values will tend to be extreme values within the range, e.g. either close to 1.0 or close to -1.0.
Here is some sample output for the target variable statistics, which are saved while in training mode. The fields are 1) categorical variable column index (zero-based) 2) categorical variable value 3) count of target variable values 4) sum of target variable values
11,95118,293,-37
11,95109,292,-34
11,95376,278,-76
11,95106,287,-71
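To illustrate how saved statistics in this format could be reused for a validation or prediction record, here is a minimal plain Python sketch. It is not the avenir code: the function names, the value of `REG`, and the fallback of 0.0 for a value never seen in training are all assumptions made for the example.

```python
STAT_LINES = [
    "11,95118,293,-37",
    "11,95109,292,-34",
    "11,95376,278,-76",
    "11,95106,287,-71",
]
REG = 1.0  # regularization factor R (illustrative value)


def load_stats(lines):
    """Parse 'column,value,count,sum' lines into a lookup table."""
    stats = {}
    for line in lines:
        col, value, count, total = line.split(",")
        stats[(int(col), value)] = (int(count), int(total))
    return stats


def encode_validation(fields, stats, cat_cols, reg=REG, default=0.0):
    """Replace categorical values using training statistics only."""
    enc = list(fields)
    for col in cat_cols:
        key = (col, fields[col])
        if key in stats:
            count, total = stats[key]
            enc[col] = total / (count + reg)
        else:
            # Assumption: unseen values fall back to the middle of the range.
            enc[col] = default
    return enc


stats = load_stats(STAT_LINES)
# One of the sample loan records, used here only as an example input.
record = "ETEC7F2YLP,single,2,1,0,107,8,23,362,10,698,95106,0".split(",")
encoded = encode_validation(record, stats, cat_cols=[11])
print(encoded[11])
```

Only column 11 is encoded here because the sample statistics cover only the zip code column.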
For this encoding process to work properly, the class values need to meet the following requirements
- Should be integer
- Should be symmetrical e.g. 1 and -1
If your class values don’t meet these requirements, you can specify the positive class value through the configuration parameter class.pos.val. In this case, the actual class values get mapped to 1 and -1 for encoding purposes.
When encoding a training data set, the configuration parameter train.data.set needs to be set to true. The configuration parameter target.stat.file also needs to be set to a file path where the target variable aggregate data is saved.
We have gone through a recent and popular encoding algorithm for categorical attributes. To run the use case for loan data, please follow the instructions in the tutorial document.