Handling Categorical Feature Variables in Machine Learning using Spark


Categorical feature variables, i.e. feature variables with a fixed set of unique values, appear in the training data set for many real world problems. However, categorical variables pose a serious problem for many Machine Learning algorithms. Some examples of such algorithms are Logistic Regression, Support Vector Machine (SVM) and any regression algorithm.

In this post we will go over a Spark based solution to alleviate the problem. The solution implementation can be found in my open source projects chombo and avenir. We will be using CRM data as the use case.

Categorical Feature Variable Problem

The underlying data type of a categorical variable is string. However, the values are constrained as below.

  • Has a finite set of unique values
  • There is no ordering or any other relationship between the values

Some popular Machine Learning algorithms expect all the feature variables to be numeric. What do you do when you have categorical feature variables in your training data set and you want to use one of those algorithms? As alluded to earlier, examples of such algorithms are Logistic Regression and Support Vector Machine (SVM).

Most Decision Tree algorithm implementations also cannot handle categorical variables. My implementation of Decision Tree can handle categorical variables.

One popular solution is to have one numeric binary variable for each value of the categorical variable. For any particular value of the categorical variable, the binary variable in the corresponding position is set to 1 and the rest are set to 0. This is also known as One Hot Encoding: each categorical variable value is replaced with a binary vector, with only one element set to 1 and the rest set to 0.

Let’s consider the categorical variable color with value set (red, green, blue, yellow, brown, violet). Since the cardinality is 6, we need 6 numerical binary variables, one for each of the values in the set. For the color yellow, the binary value set will be (0, 0, 0, 1, 0, 0). Since yellow is the fourth value, only the fourth binary variable has been set to 1.

In this example, we have replaced one categorical variable with 6 binary variables. Effectively, we have added 5 additional feature variables to our training set.

The solution involves two steps. In the first step, we find all the unique values for all the categorical variables in the data set. If this information is already available, the first step is not necessary. In the second step, we generate the dummy binary variables as outlined earlier.

Sales Lead Use Case

We will be using sales lead data as gleaned from a hypothetical CRM system. The context is that a Data Scientist wants to build a predictive model that will predict whether a sales lead will convert or not. The Data Scientist wants to use SVM for building the model.

The data set contains 12 variables, including the class variable. Among the feature variables, there are 4 categorical variables. The variables are as below.

  1. id
  2. source of lead (categorical)
  3. lead contact type (categorical)
  4. lead company size (categorical)
  5. number of days in sales pipeline
  6. number of meetings with lead
  7. number of emails exchanged with the client
  8. number of web site visits by the lead
  9. number of demos shown to the client
  10. expected revenue from the deal
  11. proposal with price sent to the lead (categorical)
  12. converted  (class label)

The first field, which is an ID, will obviously not be used for building the learning model. Excluding the first and the last field, there are 10 feature variables.

Here is some sample input data

SNVLC4X156,referral,canReccommend,medium,7,6,8,5,2,62916,N,0
81K11016AU,referral,canReccommend,large,31,5,13,7,2,56402,N,0
SN1U9G2BE4,advertisement,canDecide,medium,49,0,9,3,4,47007,Y,0
JRR174F6OM,referral,canDecide,large,102,3,9,5,3,54579,Y,1
KJY1MP6LQP,tradeShow,canDecide,large,72,0,13,5,4,41249,Y,1
08W49U4557,webDownload,canReccommend,large,102,1,7,4,2,48673,N,0
Q4G22I9N7T,referral,canReccommend,small,99,3,11,4,1,35852,N,0

Discovering Unique Values

The Spark job that finds all the unique values for categorical variables is implemented in the Scala object UniqueValueCounter. As mentioned before, if the unique values are already known, then running this job is not necessary.

The Spark job has 2 main steps. In the first step, a map operation generates paired records, with the column index as the key and a set containing the column value as the value. The second step performs a reduce by key operation whereby the sets of column values are merged.
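The two steps can be mimicked in plain Python, as a rough sketch of what the Spark job does with map and reduceByKey on an RDD. The record list and the categorical column indexes here are taken from the sample data above; the actual UniqueValueCounter job reads them from input files and configuration.

```python
# Sketch of the unique value discovery job: map each record to
# (column index, {value}) pairs, then merge the sets per column,
# mirroring a Spark map followed by reduceByKey.
records = [
    "SNVLC4X156,referral,canReccommend,medium,7,6,8,5,2,62916,N,0",
    "SN1U9G2BE4,advertisement,canDecide,medium,49,0,9,3,4,47007,Y,0",
]
cat_cols = [1, 2, 3, 10]  # categorical column indexes

# step 1: the map, emitting (column index, singleton set) pairs
pairs = [(i, {rec.split(",")[i]}) for rec in records for i in cat_cols]

# step 2: the reduce by key, merging value sets for the same column
unique = {}
for col, vals in pairs:
    unique[col] = unique.get(col, set()) | vals

print(unique[1])  # {'referral', 'advertisement'}
```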

Here is the output from this Spark job. The first field is the column index. The remaining fields are the unique values for the column.

(10,N,Y)
(2,canReccommend,canDecide)
(1,referral,tradeShow,webDownload,advertisement)
(3,medium,large,small)

There is a case insensitivity configuration parameter available. If set to true, all categorical variable values are converted to lower case before processing.

Dummy Binary Value Generation

This Spark job is implemented in the Scala object BinaryDummyVariableGenerator. The unique value list for each categorical variable is provided through configuration. If the lists are not already known, they can be discovered by running the first Spark job.

The Spark job has a map function which, for each categorical variable, creates as many binary fields as the number of unique values for that categorical variable. As in the first job, there is a case insensitivity configuration parameter available for this Spark job also.
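A minimal sketch of this map step follows, using a shortened hypothetical record and unique value lists for two of its columns; the real BinaryDummyVariableGenerator job operates on the full records and takes the lists from configuration.

```python
# Sketch of the per record map step: replace each categorical field,
# in place, by its one hot vector, leaving the other fields as they are.
unique_values = {
    1: ["referral", "tradeShow", "webDownload", "advertisement"],
    2: ["medium", "large", "small"],
}

def expand_record(rec, unique_values):
    out = []
    for i, field in enumerate(rec.split(",")):
        if i in unique_values:
            out += ["1" if v == field else "0" for v in unique_values[i]]
        else:
            out.append(field)
    return ",".join(out)

print(expand_record("X1,referral,medium,7", unique_values))
# X1,1,0,0,0,1,0,0,7
```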

Here is some sample output. The 4 categorical fields have been replaced with 11 binary fields.

0C70F76M50,0,1,0,0,1,0,0,0,1,73,6,7,5,3,33930,0,1,0
B0YBN7V21R,0,0,1,0,1,0,1,0,0,77,4,7,8,4,48972,0,1,0
UQTO1LREAZ,0,0,0,1,0,1,1,0,0,47,6,7,3,1,43251,1,0,0
2INR6KVTKU,1,0,0,0,1,0,0,0,1,51,5,11,7,3,79233,0,1,0
4M0JWKZ95V,0,0,1,0,1,0,0,1,0,37,5,6,9,3,40794,0,1,0
909GY7EZEZ,0,0,0,1,1,0,0,0,1,57,6,10,6,1,64915,0,1,0

High Cardinality Categorical Variables

What happens if there are categorical variables with high cardinality, i.e. too many unique values? With the binary dummy variable or One Hot Encoding approach, too many new fields will be added and you will end up with an explosion of feature dimensions in your data set.

Too many feature dimensions is problematic for most Machine Learning algorithms. This is also known as the curse of dimensionality.

Binary Encoding looks promising because it does not introduce as many new variables, but it’s faulty as we will find out soon. You choose the smallest n such that c ≤ 2^n, where c is the number of values in the categorical variable. Then you convert the position of each value into its binary representation. With this scheme the categorical variable will be replaced with n binary variables.

Going back to the example of color, n will be 3. The range of position based values will be 0 through 5. The binary encoding for the color yellow will be (0, 1, 1).
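The scheme sketched above, again as illustrative Python rather than any particular library's implementation:

```python
import math

# Sketch of Binary Encoding: a value's position in the unique value
# list is written out in n bits, where n is the smallest integer
# with c <= 2^n.
def binary_encode(value, unique_values):
    c = len(unique_values)
    n = max(1, math.ceil(math.log2(c)))
    idx = unique_values.index(value)
    return [(idx >> b) & 1 for b in reversed(range(n))]

colors = ["red", "green", "blue", "yellow", "brown", "violet"]
print(binary_encode("yellow", colors))  # [0, 1, 1]
```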

Although this scheme introduces only 3 variables, instead of 6 as with simple binary dummy variables, essentially we have assigned a numerical value to each value of the categorical variable. The numerical value happens to be represented with binary encoding.

We have essentially introduced a relationship, and to be more specific an ordering, between the values of the categorical variable. This goes against the definition of categorical variables.

In Label Encoding, there are no additional fields. Each categorical value is replaced with a number. However, it is as bad as Binary Encoding, and for the same reason: it artificially introduces an ordering between the values.

If your data set has class labels, as in the training data set for supervised machine learning, the categorical variable values can be replaced with a numerical value with the Supervised Ratio or Weight of Evidence algorithms. In both algorithms, the numerical value depends on the correlation between the categorical variable value and the class label.
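A sketch of the two encodings, using their commonly cited definitions: for a categorical value with p positive and n negative training records, Supervised Ratio is p / (p + n), and Weight of Evidence is ln((p / P) / (n / N)), where P and N are the total positive and negative counts. The exact formulas used in chombo and avenir may differ in details such as smoothing.

```python
import math

def supervised_ratio(p, n):
    # fraction of records with this value that belong to the positive class
    return p / (p + n)

def weight_of_evidence(p, n, total_p, total_n):
    # log odds of this value's positive share vs its negative share
    return math.log((p / total_p) / (n / total_n))

# hypothetical counts for the value "referral": 30 converted leads and
# 10 unconverted, out of 100 positive and 200 negative records overall
print(supervised_ratio(30, 10))                        # 0.75
print(round(weight_of_evidence(30, 10, 100, 200), 4))  # 1.7918
```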

Summing Up

You may have to perform many pre processing steps before the training data set is ready for building the machine learning model. The problem addressed in this article is one such example. The use case can be executed by following the steps in the tutorial document.

Support

For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud including installation, configuration and testing.


About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.
