A typical training data set for real world machine learning problems has a mixture of different types of data, including numerical and categorical. Many machine learning algorithms cannot handle categorical variables. Even for those that can, categorical data can pose a serious problem when it has high cardinality, i.e., too many unique values.

In this post we will go through a technique to convert high cardinality categorical attributes to numerical values, based on how the categorical variable correlates with the class or target variable. The Map Reduce implementations are available in my open source projects *avenir* and *chombo*.

## Categorical Variables

Some Machine Learning algorithms, e.g., *Logistic Regression* and *Support Vector Machine*, cannot handle categorical variables and expect all variables to be numeric. The popular approach is to convert a categorical variable of cardinality *n* into *n* dummy numerical variables.

It’s also called *one hot encoding*, because after a categorical variable value is converted to a vector of size *n*, only one of the vector elements has the value 1 and the rest are 0. Although simple, this approach may cause a significant increase in the number of dimensions, which is not good for machine learning problems.
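As a minimal sketch of the idea, the following Python snippet builds one hot vectors for a small list of category values. The function name `one_hot_encode` and the month values are made up for illustration; this is not code from *avenir* or *chombo*.

```python
def one_hot_encode(values):
    """Map each distinct category to a 0/1 vector with exactly one 1."""
    categories = sorted(set(values))  # fixed, repeatable column order
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        vector = [0] * len(categories)  # one dimension per distinct category
        vector[index[value]] = 1
        encoded.append(vector)
    return categories, encoded


# Illustrative values only: a "month of year" style attribute.
months = ["05", "04", "05", "06"]
categories, encoded = one_hot_encode(months)
print(categories)
for month, vector in zip(months, encoded):
    print(month, vector)
```

The vector length equals the number of distinct values, which is why this encoding becomes expensive as cardinality grows.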

## High Cardinality Categorical Variables

Even if the Machine Learning algorithm is tolerant of categorical variables, e.g., Decision Tree, high cardinality can be taxing in terms of computing resources.

A decision tree splits the feature space into sub spaces such that the data population in each sub space is as homogeneous as possible. It splits iteratively, evaluating the different candidate partitions at each step.

For categorical variables, the number of possible splits grows non linearly with cardinality. If we are splitting the categorical values into 2 sub sets, for example, an exhaustive search has to consider all possible such pairs of sub sets, and there are $2^{n-1} - 1$ of them for a variable of cardinality *n*. Zip code is a good example of a categorical variable with very high cardinality.

Generating numerical dummy variables, as alluded to earlier, is not practical when the cardinality is high. With a significant increase in the number of dimensions, you will be impacted by the so called curse of dimensionality.
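To make the growth concrete, here is a small Python sketch that counts the ways to split a set of distinct values into two non-empty sub sets, and enumerates them for a tiny case. It assumes an exhaustive search over binary splits; the function names are hypothetical and only meant for illustration.

```python
from itertools import combinations


def binary_split_count(cardinality):
    """Number of ways to split distinct values into two non-empty sub sets."""
    if cardinality < 2:
        return 0
    return 2 ** (cardinality - 1) - 1


def enumerate_binary_splits(values):
    """List every two-way split of the values, counting each split once."""
    values = sorted(values)
    first, rest = values[0], values[1:]
    splits = []
    # Pin the first value to the left side so mirrored splits are not repeated.
    # The left side takes 0..len(rest)-1 extra values, so the right is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = {first, *extra}
            right = set(values) - left
            splits.append((left, right))
    return splits


for n in (3, 10, 50):
    print(n, binary_split_count(n))

print(len(enumerate_binary_splits(["a", "b", "c", "d"])))
```

With several hundred distinct values, the count is far beyond anything that can be enumerated directly.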

In our supply chain example, described below, there may be several hundred products. Using dummy variables, there will be several hundred additional dimensions in the data, which is not tenable.

## Encoding to Numeric Value

In the supervised Machine Learning context, where class or target variables are available, high cardinality categorical attribute values can be converted to numerical values. The encoding algorithms are based on the correlation of such categorical attributes with the target or class variables.

As an added bonus, we get some feature engineering done when using these algorithms. When the categorical attribute is highly correlated with the target attribute, the numerical values will have high variance, with most values concentrated around the extremes.

In the **supervised ratio** algorithm, the numerical value is a function of the number of records with the categorical value in question and how they break down between positive and negative class attribute values, as follows.

$$v_i = \frac{p_i}{t_i}$$

where

- $v_i$ = numerical value for the $i$-th value of some categorical attribute
- $p_i$ = number of records with positive class value for the categorical attribute value in question
- $t_i$ = total number of records with the categorical attribute value in question

In the **weight of evidence** algorithm, the total numbers of records with the positive and negative class labels are additionally taken into account, as follows. For class imbalanced data, this algorithm works better.

$$v_i = \log\left(\frac{p_i / p}{n_i / n}\right)$$

where

- $p_i$ = number of records with positive class value for the categorical attribute value in question
- $n_i$ = number of records with negative class value for the categorical attribute value in question
- $p$ = total number of records with positive class value
- $n$ = total number of records with negative class value
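The two formulas can be sketched in a few lines of Python. This is an illustration only, not the Map Reduce implementation in *avenir*: the function names and the toy records are made up, and returning `None` when a category has no positive or no negative records (where the log ratio is undefined) is a choice of this sketch.

```python
import math
from collections import Counter


def class_counts(records, positive="T"):
    """Count positive and negative records for each categorical value."""
    pos, neg = Counter(), Counter()
    for value, label in records:
        if label == positive:
            pos[value] += 1
        else:
            neg[value] += 1
    return pos, neg


def supervised_ratio(records, positive="T"):
    """v_i = p_i / t_i, with t_i = p_i + n_i."""
    pos, neg = class_counts(records, positive)
    return {v: pos[v] / (pos[v] + neg[v]) for v in set(pos) | set(neg)}


def weight_of_evidence(records, positive="T"):
    """v_i = log((p_i / p) / (n_i / n)); None where the ratio is undefined."""
    pos, neg = class_counts(records, positive)
    p, n = sum(pos.values()), sum(neg.values())
    encoded = {}
    for v in set(pos) | set(neg):
        if pos[v] == 0 or neg[v] == 0:
            encoded[v] = None  # assumption of this sketch: leave undefined cases unset
        else:
            encoded[v] = math.log((pos[v] / p) / (neg[v] / n))
    return encoded


# Made-up (categorical value, class label) pairs, for illustration only.
records = [
    ("A", "T"), ("A", "T"), ("A", "T"), ("A", "F"),
    ("B", "T"), ("B", "F"), ("B", "F"), ("B", "F"),
    ("C", "F"), ("C", "F"),
]
ratio = supervised_ratio(records)
woe = weight_of_evidence(records)
for value in sorted(ratio):
    print(value, ratio[value], woe[value])
```

In this toy data, value A is mostly positive and gets a supervised ratio of 0.75 and a positive weight of evidence, while B is mostly negative and gets 0.25 and a negative weight of evidence.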

## Supply Chain Delivery Data

We will use supply chain delivery data as our use case. Consider an electronic product manufacturer setting up a production cycle to manufacture a certain number of products. The product requires many electronic components supplied by other manufacturers. The manufacturer has some of them in stock, but not all that are needed.

To compensate for the deficit, more components need to be ordered. The manufacturer needs to have some idea about the lead time, i.e., the time difference between order placement time and delivery time. Only with knowledge of the lead time can the manufacturer decide the start time of the production cycle.

If the production starts too early, it may have to be halted because components run out of stock. If it starts too late, the ordered components will be delivered but not used immediately, resulting in additional holding cost.

Generally the lead time for the component manufacturer is 5 days. However, sometimes it takes longer. Fortunately for the manufacturer, some smart Data Scientists have built a predictive model for lead time, using all the past delivery data from the manufacturer. The data has the following 5 attributes:

1. *order Id*
2. *product Id*
3. *quantity*
4. *month of year*
5. *delivered late*

The 2nd, 3rd and 4th attributes are the feature attributes for the predictive model. The last attribute is the class variable, with values T and F. The 2 categorical attributes are *product Id* and *month of year*. Here is some sample input data.

    97SXK8F0BT3H,3U4488754B,906,05,T
    97SXK8F0BT3H,57960CFF2B,601,05,T
    97SXK8F0BT3H,S4508G9XOH,816,05,F
    97SXK8F0BT3H,S4508G9XOH,278,05,F
    T72VL3HJ69BD,S4508G9XOH,397,04,F
    T72VL3HJ69BD,68I8O60993,776,04,F
    T72VL3HJ69BD,P0XC99Z8JJ,560,04,F

The *product Id* variable has high cardinality, because there may be several hundred electronic components. We will convert the values of this attribute to numerical values using the encoding algorithms described earlier.

This conversion is necessary for the training data set before building the machine learning predictive model. It’s also necessary to perform this transformation for any data that’s going to be used for prediction after the model is built, because the learned model is defined in terms of the transformed numerical type.

## Encoding Map Reduce

The encoding is performed by the Map Reduce class *CategoricalContinuousEncoding*. The algorithm to be used is selected through a configuration parameter.

The number of lines in the output will depend on the number of categorical attributes being transformed and the number of unique values in those categorical attributes. Here is some sample output. The fields in the output are 1) the index of the column in the input being transformed, 2) the categorical attribute value and 3) the corresponding encoded numerical value.

    1,R453VJMYXF,15
    1,IZCMPKC3VC,69
    1,Z0PGR17843,69
    1,SIIF2641H9,5
    1,B5D6F5XJ70,72
    1,8R35R104X2,6
    1,LFPMZ9IBBV,9
    1,D2B7502TBC,77
    1,3WXN2E4589,2

## Transformation Map Reduce

Our next task is to replace the categorical values with the numerical values generated by the first Map Reduce job. For this we use the data transformation ETL Map Reduce class *Transformer* in *chombo*.

It comes with a lot of out of the box transformers. Each transformer is identified by a unique tag. For any column in the data that needs transformation, we can provide a set of transformer tags. All the corresponding transformers execute in a chain, with the output of one going into the input of the next.

We are using a transformer called *keyValueTrans*. It’s configured with a set of key strings and corresponding value strings. When any of the keys is found in a column being transformed, it is replaced with the corresponding value.

Generally the key value list is provided through configuration. There is also an option of providing the key value list through a file in HDFS or the file system. We are using this option. Here is some sample output. It’s the same as the input, except that the values in the 2nd column have been replaced with numerical values.

    AL0Y12RG1YK7,2,882,06,F
    30OSLEBO536E,15,855,03,F
    30OSLEBO536E,5,385,03,F
    30OSLEBO536E,71,379,03,F
    30OSLEBO536E,2,649,03,F
    30OSLEBO536E,2,183,03,F
    30OSLEBO536E,0,562,03,F
    3103PUM82HBY,0,867,06,F
    3103PUM82HBY,2,776,06,F
    3103PUM82HBY,74,720,06,T
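The key value replacement step can be sketched in Python as follows. This is not the *chombo* `Transformer` code. It assumes the encoding file uses the three field format shown earlier with a zero based column index. The function names and input rows are hypothetical, and leaving values without a mapping unchanged is a choice of this sketch.

```python
def load_encoding(lines):
    """Parse 'column index,categorical value,encoded value' lines into a lookup."""
    mapping = {}
    for line in lines:
        column, value, encoded = line.strip().split(",")
        mapping[(int(column), value)] = encoded
    return mapping


def transform(record, mapping, delimiter=","):
    """Replace each field that has a mapping for its column; keep the rest as is."""
    fields = record.split(delimiter)
    return delimiter.join(
        mapping.get((i, field), field) for i, field in enumerate(fields)
    )


# Encoding lines in the format of the first job's output.
encoding_lines = ["1,R453VJMYXF,15", "1,SIIF2641H9,5", "1,3WXN2E4589,2"]
mapping = load_encoding(encoding_lines)

# Made-up input rows for illustration; the last product Id has no mapping.
records = [
    "ORD000000001,R453VJMYXF,855,03,F",
    "ORD000000001,SIIF2641H9,385,03,F",
    "ORD000000002,UNSEEN0000,720,06,T",
]
transformed = [transform(record, mapping) for record in records]
for line in transformed:
    print(line)
```

Keying the lookup on both column index and value keeps a string that happens to appear in another column from being replaced by mistake.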

## Wrapping Up

For real world problems, before building a Machine Learning model, there are many data munging hurdles to overcome. This is just one example.

In this post, we have gone through the process of handling high cardinality categorical feature variables in a predictive modelling context. The details of the steps for executing the use case are available in a tutorial document.

There is another categorical attribute encoding algorithm called *Leave One Out*, which is recent and popular among *Kaggle* competition participants.


