Categorical variables are ubiquitous in data. They pose a serious problem in many Data Science analysis processes. For example, many supervised Machine Learning algorithms work only with numerical data. With high cardinality categorical variables, popular encoding solutions like One Hot Encoding is not feasible.

In this post we will go through a technique called Feature Hashing for encoding high cardinality categorical variables as implemented on Spark. We will showcase the solution with a use case from mobile advertisement.

You can find the Spark implementation in my open source github project *avenir*. As with my other Spark implementation, the solution is meta data driven and agnostic of any specific application or data set.

## Feature Hashing

With a set categorical variables, the only meaningful characteristic we can talk about is the distance between the categorical variables in two records. Distance could be could be computed using *Jaccard* formula based on set intersection and union.

In Feature Hashing, a vector of categorical variables gets converted to a higher dimensional space of integers, where the distance between two vectors of categorical variables in approximately maintained the transformed numerical dimensional space. With Feature Hashing, the number of dimensions will be far less than the number of dimensions with simple binary encoding a.k.a One Hot Encoding.

Let’s consider the case of a data set with 2 categorical variables, the first one with a cardinality of 70 and the second one with a cardinality of 50. With simple binary encoding you will have to introduce 118 (70 + 50 – 2) additional fields to replace the 2 categorical variable fields in the data set.

With One Hot Encoding, the distance between categorical variables in any pair of records in preserved in the new space of dimension 118. With Feature Hashing you can get away with much smaller dimension e.g 10 in this case while recognizing that inter record distances will not be fully preserved. Hash collision is the reason for the failure to preserve the distances, making the mapping less than perfect

Given a categorical vector of size *m* and the encoded vector of size * n*, and 2 hash functions

*and*

**h1***, here are the steps for encoding with Feature Hashing.*

**h2***For each categorical variable v, in a record do the following**Find the index into encoding vector i as h1(v) % n**Add 1 or -1 to the encoding vector element at index i from previous step based on h2(v) % 2*

The big unknown is the encoding vector size n. Bigger the encoded vector size., lesser the chance of hash collision and hence better the quality of the encoding. Shortly we will discuss some guidelines on choosing encoded victor size.

## Encoded Vector Size

There are steps that could be taken to estimate the size of the encoded vector size. If you have 3 categorical variables with cardinalities c_{1}, c_{2} and c_{3} then the maximum possible combination of unique values is c_{1} x c_{2} x c_{3}. Because there is always bias in real world data, the actual number of unique combination of values. is likely to be less than the maximum value.

Following this intuition, here are the steps to estimate the size of the encoded vector size, where * n* is encoded vector size and

*is the categorical vector size.*

**m***Find the number unique combination of values, u for the categorical variables**Choose the smallest n such C(n,m) is less than u where C(n,m) is combination defined as n!/(m! x (n-m)!)*

Applying this procedure to determine encoded vector size will alleviate the problem of hash collision but not remove it completely. Any hash function for string will have some collision, although the degree of collision will depend on the specific hash function.

## Mobile Avertisement Data

We are going going use a data set related to mobile advertisement as the use case. The data is generated synthetically with a python script. the data set has the following fields.

*Impression ID**Timestamp**Device OS**Device model ID**Mobile app ID**Advertisement ID**Zip code corresponding to device location**Whether advertisement is tapped or not*

Here is some sample data

3PXRQAT8O56V235G,1564803337,Android,4ZLA33K4,HX383YK1,6X67ULH9HE,91097,1 T5BOETM6I08HXK04,1564803358,iOS,0OGZ3V8I,3HHNM572,P25445XZ27,61445,0 S91N42830HZMQ37U,1564803380,iOS,0OGZ3V8I,T4R78G56,4AYNQDUH7S,45456,0 9AB1JWJ983CQGYBO,1564803408,iOS,MQBV29J0,87X9ANI5,EHK19BE921,95596,0 VT396L30SE0S8OB0,1564803416,iOS,1Y20SWUZ,GM252HGQ,966I579GBO,29224,0 YV993DAQO8879D05,1564803446,Android,YRGAZ69L,70W2W6O6,9SVTNLYRAX,07803,0

The plan is to build some Machine Learning model t on this data and which requires encoding the categorical variables *Device model ID, Mobile app ID, Advertisement ID* and *Zip code* as numerical data.. The cardinality for these variables is high, for some typically in the range of hundreds.

If you were to apply One Hot Encoding, you will have an encoded vector size in the hundreds, which is completely impractical. With One Hot Encoding, the encoded vector size will be the sum of the cardinalities of the categorical variables being encoded.

## Unique Value Combination Count Spark Job

As per the guidelines for the encoded vector size, we are going run the Spark job *UniqueValueCounter* to get a count of the unique value combinations in the data set. Here is some sample output. It contains the unique value combination of categorical variables followed by a count

P0H478NC,W9Q5333O,932T4724P6,13676,1 D9HJ65N0,B88DB48N,9S2L186EL2,37468,1 CK116CK0,O0Y6DD1K,41082L73HJ,08766,1 WCA4X9P7,HKM4LNT1,064Q026YGG,40897,1 V7R33T8N,RR72SYS5,3FMWE02M0Q,82145,1

The number of unique value combinations is the total number of lines in the output files. This value can be used to estimate the encoded vector size following the procedure as alluded to earlier. With the data set under consideration, the encoded vector size as per the procedure outlined comes to 28. I have used 24 which is set through a configuration parameter.

There are several reasons to take this process to estimate the encoded vector size with a grain of salt. You should not feel compelled the use the estimated encoding size. It’s just a guideline.

*Hash collisions will still occur, in spite of the efforts to stave it off**The number of uniques values found pertain to the the data set being used. The actual count could be higher.*

## Feature Hashing Encoding Spark Job

The encoding based on feature hashing is implemented by the Spark job *CategoricalFeatureHashingEncoding*. The encoding vector size as estimated should be provided as a configuration parameter.

Other important parameter are hash functions for indexing and choosing sign. Here the choices. All hash functions return positive integer.

*JDK**Simple (repeated multiplication with prime number)**Fowler Noll Vo (FNV)**Murmur**Encryption followed by JDK hash*

Here is some sample output. You can choose the position of the encoded vector in a record withy respect to the remaining fields. I have chosen to place it after the first field.

YV993DAQO8879D05,0,0,0,0,0,0,-1,0,0,0,0,0,-1,0,-1,0,0,0,0,0,0,0,1,0,1564803446,Android,0 W6L88N9C8OX5TQ6X,0,-1,0,0,0,0,-1,0,0,0,0,0,0,-1,1,0,0,0,0,0,0,0,0,0,1564803451,Android,0 AGTYJ3H36H1UHBHT,0,0,0,1,1,0,0,0,0,0,0,0,0,0,-1,-1,0,0,0,0,0,0,0,0,1564803463,iOS,0 14KZ22M353W36X6P,0,-1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,1564803487,iOS,0 N955R2R1TC6MRY94,-1,0,0,0,0,0,0,0,0,0,1,1,0,0,-1,0,0,0,0,0,0,0,0,0,1564803499,iOS,0

## Wrapping Up

I was motivated to write this post, after someone contacted me looking for a Spark based implementation of Feature Hashing.We have gone though the process of encoding categorical variables with Feature Hashing. The tutorial document could be used for step by step execution of the use case.

If you are dealing with data set for supervised Machine Learning, there are techniques for encoding based on correlation between the categorical feature variables and the target variable.