Categorical feature variables are a thorny issue for many supervised Machine Learning algorithms, since many learning algorithms cannot handle categorical features directly. In this post, we will go over an encoding scheme called Leave One Out Encoding, as implemented with Spark. It is a recent algorithm, popular on Kaggle, and particularly useful for high cardinality categorical features.

The Spark implementation of the encoding algorithm can be found in my OSS project *avenir* on GitHub.

## Encoding Categorical Variables

The most well known technique is the Dummy Variable Generator (one hot encoding), which creates a binary variable for each unique value of a categorical variable. If you have many categorical variables and/or categorical variables with high cardinality, this technique will create too many additional feature variables.

There is another class of algorithms which encode each categorical variable value as a numeric value and keep the number of dimensions under control. These algorithms all have something in common: they are based on the correlation between the categorical variable and the class or target variable. I discussed two of them in an earlier post. Essentially, these algorithms can only be used for supervised machine learning data sets.

## Leave One Out Encoding

Leave One Out encoding essentially calculates the mean of the target variable over all the records containing the same value for the categorical feature variable in question. The encoding algorithm is slightly different between the training and test data sets. For the training data set, the record under consideration is left out, hence the name *Leave One Out*. The encoding for a certain value of a certain categorical variable is as follows.

*c_i = (Σ_{j ≠ i} t_j / (n − 1 + R)) × (1 + ε_i)* where

- *c_i* = encoded value for the i-th record
- *t_j* = target variable value for the j-th record
- *n* = number of records with the same categorical variable value
- *R* = regularization factor
- *ε_i* = zero mean random variable with normal distribution N(0, s)

For validation or prediction data sets, the definition is slightly different. We do not need to leave the current record out, and we do not need the randomness factor. The simpler definition is as below.
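To make the formula concrete, here is a minimal Python sketch of the training-time encoding for one categorical value. The function and parameter names are illustrative, not taken from the avenir code base.

```python
import random

def loo_encode_train(targets, i, reg, noise_std, rng=random):
    """Leave One Out encoding of record i during training.

    targets   : target values of all records sharing this categorical value
    i         : index of the record being encoded (left out of the sum)
    reg       : regularization factor R
    noise_std : std dev s of the zero mean Gaussian noise eps_i
    """
    sum_others = sum(targets) - targets[i]   # sum over j != i
    n = len(targets)
    eps = rng.gauss(0.0, noise_std)          # eps_i ~ N(0, s)
    return (sum_others / (n - 1 + reg)) * (1.0 + eps)
```

Setting `noise_std` to 0 makes the noise term drop out, which is handy for checking the arithmetic by hand.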

*c_i = Σ_j t_j / (n + R)* where

- *c_i* = encoded value for the i-th record
- *t_j* = target variable value for the j-th record
- *n* = number of records with the same categorical variable value
- *R* = regularization factor

For the validation data set, we do not leave the current record out. As we will see later, the target variable statistics for each value of each categorical variable, as calculated from the training data set, are saved and then used for the validation and prediction data sets.

The factor *R* acts as a regularizer. When the support for a particular variable value is low, i.e. when *n* is small, the encoded value is not reliable. The regularization factor *R* remedies this problem by shrinking the encoded value toward zero. It also ensures that the denominator is always positive for the training data set.
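A corresponding sketch for validation or prediction data (again with illustrative names) shows the simpler formula and how the regularizer *R* shrinks the encoded value when support *n* is low:

```python
def loo_encode_predict(target_sum, n, reg):
    """Encoding for validation/prediction: a regularized mean.

    target_sum : sum of target values for this categorical value,
                 computed on the training data set
    n          : number of training records with this value
    reg        : regularization factor R
    """
    return target_sum / (n + reg)

# Low support: only 2 records, both with target +1 (sum = 2)
print(loo_encode_predict(2.0, 2, 0.0))   # unregularized: 1.0
print(loo_encode_predict(2.0, 2, 8.0))   # regularized: 0.2, pulled toward 0
```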

## Loan Approval Data

A loan approval data set is used for encoding. The data set has the following fields:

- loan ID
- marital status (categorical with 3 values)
- number of children
- education level
- employment status
- income
- number of years of experience
- if any outstanding loan
- outstanding loan amount
- loan term
- credit score
- residence zip code (categorical with many values)
- approval status

There are two categorical variables. One of them (zip code) has very high cardinality. The last field is the class or target variable. Here is some sample data.

```
ETEC7F2YLP,single,2,1,0,107,8,23,362,10,698,95106,0
1XPXKLE4UQ,married,1,2,1,127,10,28,237,30,733,95067,1
MKXAM7YQ0C,divorced,2,2,0,117,15,18,286,30,654,95113,0
5U17SLR3DV,single,1,3,1,98,6,22,264,30,730,95103,1
VE8175H9NI,single,2,2,1,100,11,20,294,30,746,95115,1
```

## Spark Job for Encoding

The Spark job for *Leave One Out* encoding is implemented in the Scala object CategoricalLeaveOneOutEncoding. The input data *RDD* is cached, as two passes are made through it.

In the first pass, target variable statistics are calculated for each value of each categorical variable. The statistics are also saved if the data set is a training data set; they get used subsequently for the validation and prediction data sets. Encoding takes place in the second pass. Here is some sample output.

```
5P186RXY11,-0.218,1,2,0,112,10,20,232,30,729,-0.277,1
1W5DXF5ECK,-0.210,2,1,0,119,9,20,402,30,621,-0.115,0
B5FA7FB017,0.030,1,1,0,155,8,16,286,15,772,-0.101,1
39JQ9MHRIE,0.032,1,1,1,115,14,18,212,30,677,-0.128,1
```

The 2nd and 12th columns have been replaced with encoded values. The range of the encoded values is the same as the range of the class or target variable values. If a categorical variable is highly correlated with the class variable, i.e. it is a good predictor, the encoded values will tend toward the extremes of the range, e.g. close to 1.0 or close to -1.0.
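The two-pass structure can be mimicked in a Spark-free Python sketch, just to make the mechanics concrete. The column layout and names are illustrative, and the noise term from the training formula is omitted here for clarity; the actual avenir implementation is in Scala on Spark.

```python
from collections import defaultdict

def loo_encode_column(rows, cat_col, target_col, reg):
    """Two passes over training rows: first collect per-value target
    statistics (count and sum), then encode each row leaving itself out."""
    # pass 1: per-value count and sum of the target variable
    stats = defaultdict(lambda: [0, 0.0])          # value -> [count, sum]
    for row in rows:
        s = stats[row[cat_col]]
        s[0] += 1
        s[1] += row[target_col]
    # pass 2: replace the categorical value with its encoded value
    encoded = []
    for row in rows:
        n, total = stats[row[cat_col]]
        code = (total - row[target_col]) / (n - 1 + reg)
        encoded.append(row[:cat_col] + [code] + row[cat_col + 1:])
    return encoded, dict(stats)

rows = [
    ["r1", "single",  1],
    ["r2", "single",  1],
    ["r3", "single", -1],
    ["r4", "married", 1],
    ["r5", "married", 1],
]
enc, stats = loo_encode_column(rows, cat_col=1, target_col=2, reg=1.0)
```

The saved `stats` dictionary plays the role of the target statistics file: it is all that is needed to encode validation and prediction data later.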

Here is some sample output for the target variable statistics, which are saved in training mode. The fields are **1)** categorical variable column index **2)** categorical variable value **3)** count of target variable values **4)** sum of target variable values.

```
11,95118,293,-37
11,95109,292,-34
11,95376,278,-76
11,95106,287,-71
```
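Reading such a saved statistics line back is straightforward. This small parser is illustrative and not the avenir format reader:

```python
def parse_stat_line(line):
    """Parse a 'colIndex,value,count,sum' statistics line into typed fields."""
    col, value, count, total = line.split(",")
    return int(col), value, int(count), float(total)

print(parse_stat_line("11,95118,293,-37"))
```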

For this encoding process to work properly, the class values need to meet the following requirements:

- Should be integers
- Should be symmetrical, e.g. 1 and -1

If your class values don’t meet these requirements, you can specify the positive class value through the configuration parameter *class.pos.val*. In this case, the actual class values get mapped to 1 and -1 for encoding purposes.
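Such a mapping amounts to a one-line rule; this helper is illustrative (only the *class.pos.val* parameter name comes from the post):

```python
def map_class_value(value, pos_val):
    """Map an arbitrary binary class value to +1 / -1 for encoding,
    given the configured positive class value (class.pos.val)."""
    return 1 if value == pos_val else -1

print(map_class_value("approved", "approved"))   # 1
print(map_class_value("rejected", "approved"))   # -1
```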

When encoding the training data set, the configuration parameter *train.data.set* needs to be set to true. The configuration parameter *target.stat.file* also needs to be set to a file path where the target variable aggregate data is saved.

## Wrapping Up

We have gone through a recent and popular encoding algorithm for categorical attributes. To run the use case for the loan data, please follow the instructions in the tutorial document.

## Support

For commercial support for any solution in my GitHub repositories, please talk to ThirdEye Data Science Services. Support is available for *Hadoop* or Spark deployment on cloud, including installation, configuration and testing.
