Leave One Out Encoding for Categorical Feature Variables on Spark


Categorical feature variables are a thorny issue for many supervised machine learning algorithms, since many of them cannot handle categorical features directly. In this post, we will go over an encoding scheme called Leave One Out Encoding, as implemented on Spark. It is a relatively recent technique, popular in Kaggle competitions, and particularly useful for high cardinality categorical features.

The Spark implementation of the encoding algorithm can be found in my open source project avenir on GitHub.

Encoding Categorical Variables

The most well known technique is dummy variable generation, also known as one hot encoding, which creates a binary variable for each unique value of a categorical variable, e.g. a marital status variable with values single, married and divorced becomes three binary columns. If you have many categorical variables and/or categorical variables with high cardinality, this technique creates far too many additional feature variables.

There is another class of algorithms that encode each categorical variable value as a numeric value and keep the number of dimensions under control. These algorithms all have something in common: they are based on the correlation between the categorical variable and the class or target variable. I discussed two of them in an earlier post. Consequently, these algorithms can only be used for supervised machine learning data sets.

Leave One Out Encoding

Leave One Out encoding essentially calculates the mean of the target variable over all the records containing the same value of the categorical feature variable in question. The algorithm differs slightly between the training and test data sets. For the training data set, the record under consideration is left out of the mean, hence the name Leave One Out. For a given value of a given categorical variable, the encoding is as follows.

c_i = (Σ_{j != i} t_j / (n - 1 + R)) x (1 + ε_i)   where
c_i = encoded value for the ith record
t_j = target variable value for the jth record
n = number of records with the same categorical variable value
R = regularization factor
ε_i = zero mean random variable with normal distribution N(0, s)
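
As a concrete illustration, here is a minimal Scala sketch of the training time formula. It is a hypothetical helper, not the actual avenir code; sum and count are the aggregates over all training records sharing the categorical value, and target is the current record's target value.

import scala.util.Random

// training time encoding for one record, given the per value aggregates
// (sum, count) of the target variable over the training data
def encodeTrain(sum: Double, count: Long, target: Double, reg: Double,
    noiseStdDev: Double, rand: Random): Double = {
  // leave the current record out of both numerator and denominator
  val looMean = (sum - target) / (count - 1 + reg)
  // multiply by (1 + ε) where ε ~ N(0, noiseStdDev)
  looMean * (1 + rand.nextGaussian() * noiseStdDev)
}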

For the validation or prediction data set, the definition is slightly different: we don't leave the current record out, and we don't need the randomness factor. The simpler definition is as below.

c_i = Σ_j t_j / (n + R)   where
c_i = encoded value for the ith record
t_j = target variable value for the jth record
n = number of records with the same categorical variable value
R = regularization factor
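
A corresponding sketch for validation and prediction time, again hypothetical: the same training aggregates are reused, without leaving the record out and without noise.

// validation / prediction time encoding: reuse the saved training aggregates
def encodePredict(sum: Double, count: Long, reg: Double): Double =
  sum / (count + reg)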

For the validation data set, the current record is not left out. As we will see later, the target variable statistics for each value of each categorical variable, as calculated on the training data set, are saved and then used for the validation and prediction data sets.

The factor R acts as a regularizer. When the support for a particular value of a variable is low, i.e. when n is small, the encoded value is not reliable; the regularization factor R remedies the problem by shrinking the encoded value toward zero. It also ensures that the denominator is always positive for the training data set, even when n = 1.

Loan Approval Data

A loan approval data set is used for the encoding example. The data set has the following fields

  1. loan ID
  2. marital status (categorical with 3 values)
  3. number of children
  4. education level
  5. employment status
  6. income
  7. number of years of experience
  8. if any outstanding loan
  9. outstanding loan amount
  10. loan term
  11. credit score
  12. residence zip code (categorical with many values)
  13. approval status

There are two categorical variables. One of them (zip code) has very high cardinality. The last field is the class or target variable. Here is some sample data

ETEC7F2YLP,single,2,1,0,107,8,23,362,10,698,95106,0
1XPXKLE4UQ,married,1,2,1,127,10,28,237,30,733,95067,1
MKXAM7YQ0C,divorced,2,2,0,117,15,18,286,30,654,95113,0
5U17SLR3DV,single,1,3,1,98,6,22,264,30,730,95103,1
VE8175H9NI,single,2,2,1,100,11,20,294,30,746,95115,1

Spark Job for Encoding

The Spark job for Leave One Out encoding is implemented in the Scala object CategoricalLeaveOneOutEncoding. The input data RDD is cached, since two passes are made through it.

In the first pass, target variable statistics are calculated for each value of each categorical variable. If the data set is a training data set, the statistics are also saved; they get used subsequently for the validation and prediction data sets. Encoding takes place in the second pass, roughly as sketched below.
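
Here is a simplified Scala sketch of the two passes for a single categorical column. It is assumed code, not the actual CategoricalLeaveOneOutEncoding implementation, which handles multiple columns and, at validation and prediction time, reads the aggregates saved during training instead of recomputing them.

import org.apache.spark.rdd.RDD
import scala.util.Random

// comma separated records; catCol and targetCol are column indexes and the
// class values are assumed to be already mapped to +1 / -1
def looEncode(data: RDD[String], catCol: Int, targetCol: Int, reg: Double,
    noiseStdDev: Double, training: Boolean): RDD[String] = {
  val cached = data.cache()

  // first pass: per categorical value, count of records and sum of targets
  val stats = cached.map { line =>
    val items = line.split(",")
    (items(catCol), (1L, items(targetCol).toDouble))
  }.reduceByKey { case ((c1, s1), (c2, s2)) => (c1 + c2, s1 + s2) }
   .collectAsMap()

  // second pass: replace the categorical value with its encoded value
  cached.map { line =>
    val items = line.split(",")
    val (count, sum) = stats(items(catCol))
    val target = items(targetCol).toDouble
    val encoded =
      if (training)
        (sum - target) / (count - 1 + reg) * (1 + Random.nextGaussian() * noiseStdDev)
      else
        sum / (count + reg)
    items(catCol) = encoded.toString
    items.mkString(",")
  }
}

Here is some sample output of the encoding.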

5P186RXY11,-0.218,1,2,0,112,10,20,232,30,729,-0.277,1
1W5DXF5ECK,-0.210,2,1,0,119,9,20,402,30,621,-0.115,0
B5FA7FB017,0.030,1,1,0,155,8,16,286,15,772,-0.101,1
39JQ9MHRIE,0.032,1,1,1,115,14,18,212,30,677,-0.128,1

The 2nd and 12th columns have been replaced with encoded values. The range of the encoded values is the same as the range of the class or target variable values. If a categorical variable is highly correlated with the class variable, i.e. it's a good predictor, the encoded values will tend toward the extremes of the range, e.g. close to 1.0 or close to -1.0.

Here is some sample output of the target variable statistics saved while in training mode. The fields are 1) categorical variable column index, 2) categorical variable value, 3) count of records with that value and 4) sum of the target variable values for those records.

11,95118,293,-37
11,95109,292,-34
11,95376,278,-76
11,95106,287,-71
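
At validation or prediction time, these saved rows are enough to reconstruct the encodings. Here is a minimal sketch, again a hypothetical helper, using the field layout described above.

// build a (column index, categorical value) -> encoded value lookup
// from the saved statistics lines
def loadEncodings(lines: Seq[String], reg: Double): Map[(Int, String), Double] =
  lines.map { line =>
    val Array(col, value, count, sum) = line.split(",")
    ((col.toInt, value), sum.toDouble / (count.toDouble + reg))
  }.toMap

For example, with an illustrative R = 10, the zip code 95118 above would encode to -37 / (293 + 10) ≈ -0.122.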

For this encoding process to work properly, the class values need to meet the following requirements

  1. They should be integers
  2. They should be symmetric, e.g. 1 and -1

If your class values don’t meet these requirements, you can specify the positive class value through the configuration parameter class.pos.val. In this case, the actual class values get mapped to 1 and -1 for encoding purposes.

When encoding a training data set, the configuration parameter train.data.set needs to be set to true. The configuration parameter target.stat.file also needs to be set to a file path where the target variable aggregate data is saved.
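
A training run configuration fragment might then look like this, using only the parameters named above; the file path is just an illustration.

train.data.set=true
target.stat.file=/path/to/loan_target_stat.txt
class.pos.val=1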

Wrapping Up

We have gone through a recent and popular encoding algorithm for categorical attributes. To run the use case for the loan data, please follow the instructions in the tutorial document.

Support

For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud, including installation, configuration and testing.
