Data Normalization with Spark


Data normalization is a required data preparation step for many Machine Learning algorithms, because these algorithms are sensitive to the relative values of the feature attributes. Data normalization is the process of bringing all the attribute values within some desired range. Unless the data is normalized, such algorithms don’t behave correctly.

In this post, we will go through various data normalization techniques, as implemented on Spark. To provide some context, we will also discuss how different supervised learning algorithms are negatively impacted by the lack of normalization.

The Spark-based implementation is available in my open source project chombo. There is also a Hadoop-based implementation in the same project.

Why Normalize

Some Machine Learning algorithms are sensitive to the relative magnitudes of the feature attributes. Normalization alleviates this problem.

The K Nearest Neighbor (KNN) algorithm is based on the distance between records. Unless data is normalized, the distance will be calculated incorrectly, because different attributes will not contribute to it in a uniform way. Attributes with a larger value range will have an unduly large influence on the distance, because they will make a greater contribution to it.
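
To make this concrete, here is a small illustrative Scala sketch (not part of chombo) using two houses from the sample data later in this post. It shows how the attribute with the larger range, floor area, swamps a Euclidean distance calculation until the data is normalized.

// Two houses described by (floor area in sq ft, number of bedrooms).
// The raw values are taken from the sample data set shown later in the post.
object DistanceSkew extends App {
  def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  val houseA = Array(1987.0, 5.0)
  val houseB = Array(1365.0, 3.0)

  // Raw distance: the floor area difference (622) swamps the bedroom difference (2).
  println(f"raw distance:        ${euclidean(houseA, houseB)}%.2f")

  // After minmax normalization (assuming, hypothetically, that these happen to be
  // the min and max records of the data set) both attributes contribute equally.
  val normA = Array(1.0, 1.0)
  val normB = Array(0.0, 0.0)
  println(f"normalized distance: ${euclidean(normA, normB)}%.2f")
}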

In an Artificial Neural Network (ANN), linear algebra operations are performed between the input vector and the weight vectors. With an ANN, normalization is not strictly necessary, as the weights can accommodate varying ranges of the input feature attributes. However, training can be more efficient and convergence can be reached faster when the data is normalized.

In a Support Vector Machine (SVM), the algorithm finds the hyperplane separating the data points belonging to the different classes using optimization techniques, and distance calculations enter the picture. Hence, normalization becomes a necessity. However, if kernel functions are used instead of calculating distances directly, the function may be able to handle differences in scale between attributes, and normalization may be skipped.

As a counter example, let’s consider Decision Tree and Random Forest. In a Decision Tree, the feature space is subdivided into different regions, with data homogeneity in each region as the criterion. The algorithm operates on each attribute independently, so the relative values of different attributes are irrelevant and normalization is not necessary.

Normalization Techniques

There are various normalization techniques. The appropriate technique depends on the machine learning algorithm to be used on the normalized data. The most popular techniques are minmax and zscore.

The minmax technique is based on the min and max values of the attribute as follows. Normalized values will be between 0 and 1.

vn = (v – vmin) / (vmax – vmin)       where
vn = normalized value
v = original value
vmin = minimum value
vmax = maximum value
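
Here is a minimal sketch of minmax normalization for a single column in plain Spark SQL (Scala). The file name and the column name floorArea are assumptions for illustration; this is not the chombo Normalizer API.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, min, max}

val spark = SparkSession.builder.appName("minmaxNorm").getOrCreate()

// load the house price CSV (no header) and cast the floor area column to double
val houses = spark.read.csv("house_price.csv")
  .withColumnRenamed("_c2", "floorArea")
  .withColumn("floorArea", col("floorArea").cast("double"))

// vmin and vmax of the attribute
val stats = houses.agg(min("floorArea").as("vmin"), max("floorArea").as("vmax")).head
val (vMin, vMax) = (stats.getDouble(0), stats.getDouble(1))

// vn = (v - vmin) / (vmax - vmin), so the result lies between 0 and 1
val minmaxNormalized = houses.withColumn(
  "floorAreaNorm", (col("floorArea") - vMin) / (vMax - vMin))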

The max technique only uses the maximum absolute value for normalization. The normalized values will be between -1 and 1.

vn = v / vamax    where
vamax = max(abs(vmax), abs(vmin))

The zscore technique is based on the mean and standard deviation. Most of the normalized data will be between -1 and 1. Since the normalized data has zero mean and unit standard deviation, resembling the standard distribution, this technique is also known as standardization. The standard distribution N(0,1) is a normal distribution with a mean of 0 and a standard deviation of 1.

vn = (v – vmean) / s    where
vmean = mean value
s = standard deviation
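
Continuing with the houses DataFrame from the minmax sketch above, a zscore version could look like the following; again, this is a plain Spark SQL sketch rather than the chombo implementation.

import org.apache.spark.sql.functions.{col, mean, stddev}

// vmean and s of the attribute
val zStats = houses.agg(mean("floorArea").as("m"), stddev("floorArea").as("s")).head
val (vMean, s) = (zStats.getDouble(0), zStats.getDouble(1))

// vn = (v - vmean) / s, so most values fall roughly between -1 and 1
val standardized = houses.withColumn("floorAreaZ", (col("floorArea") - vMean) / s)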

The center technique is based on the mean only, as shown below. The normalized data is not constrained by any range limit.

vn = v – vmean

In the decimal technique, the value is scaled by a quantity which is a power of 10 and greater than the maximum absolute value. Normalized values will be between -1 and 1.

vn = v / 10^m      where
vamax = max(abs(vmax), abs(vmin))
m = smallest integer such that 10^m is greater than vamax
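
Here is a plain Scala sketch of decimal scaling, using a few of the prices from the sample data later in the post as hypothetical input.

// Divide by the smallest power of 10 that is greater than the largest absolute value.
def decimalScale(values: Seq[Double]): Seq[Double] = {
  val vamax = values.map(math.abs).max
  // smallest integer m such that 10^m > vamax
  val m = math.floor(math.log10(vamax)).toInt + 1
  values.map(_ / math.pow(10, m))
}

// For these sample prices the largest value is 1394000, so m = 7 and
// the normalized values fall between -1 and 1.
println(decimalScale(Seq(1394000.0, 894000.0, 930000.0)))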

The unitSum technique is based on the sum of the values, as shown below. The normalized data is not constrained by any range limit.

vn = v / sum     where
sum = ∑vi

All the techniques described are sensitive to outliers. The zscore technique provides the option of purging outlier records while normalizing. Since outliers have a high absolute zscore, we can remove any record with a zscore above some threshold.
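
For example, continuing with the standardized DataFrame from the zscore sketch above, purging could be a simple filter on the absolute zscore; the threshold of 2.0 matches the one used in the Spark job described later.

import org.apache.spark.sql.functions.{abs, col}

// keep only records whose zscore magnitude is within the threshold
val threshold = 2.0
val purged = standardized.filter(abs(col("floorAreaZ")) <= threshold)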

House Price

We will use house price data as the use case. Consider a scenario where you want to build a regression-based predictive model for house price based on various input feature attributes.

You have also decided to use the KNN regression algorithm. As alluded to earlier, nearest neighbor based algorithms perform distance calculations, which require normalized data. Here are the attributes of the house price data set.

  1. transaction ID
  2. zip code
  3. floor area
  4. number of bedrooms
  5. number of bathrooms
  6. price

Here is some sample input

8544WY7325,94602,1987,5,2,1394000
VSK634510N,94702,1473,3,2,1178000
07C64O7OK0,94540,1680,4,2,1191000
6KR117M8EA,94538,1779,5,2,1186000
2A0T80P51T,95129,1365,3,2,894000
JMM83NVNM6,94540,1406,3,2,930000
PD7ES0I5G1,94602,1368,3,2,950000

Normalization Spark Job

The normalization implementation is in the Scala object Normalizer. We are doing zscore normalization and also purging outlier records. The threshold for outliers has been set at 2 x standard deviation. Here is some sample output

7OARSM21CR,94501,0.238,0.327,0.395,928000
R0JW4A171T,94538,0.213,0.327,0.395,867000
VSK634510N,94702,0.427,0.327,0.395,1178000
07C64O7OK0,94540,0.685,0.641,0.395,1191000
6KR117M8EA,94538,0.809,0.954,0.395,1186000
2A0T80P51T,95129,0.293,0.327,0.395,894000
JMM83NVNM6,94540,0.344,0.327,0.395,930000

The first field, which is the ID, and the last field, which is the target of the regression analysis, have been excluded from normalization. The zip code, being a categorical attribute, is also left unchanged.
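
For readers who do not want to pull in chombo, here is a minimal end-to-end sketch in plain Spark SQL of what the job does: zscore normalization of the three numeric features with 2 x standard deviation outlier purging, leaving the ID, zip code and price untouched. The file names and column names are assumptions; the actual Normalizer in chombo is configuration driven and differs in detail.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{abs, col, mean, stddev}

object HousePriceNormalizer extends App {
  val spark = SparkSession.builder.appName("housePriceNorm").getOrCreate()

  // read the raw CSV and name the fields as in the attribute list above
  val raw = spark.read.csv("house_price.csv")
    .toDF("id", "zip", "floorArea", "bedrooms", "bathrooms", "price")

  val numeric = Seq("floorArea", "bedrooms", "bathrooms")
  var df = numeric.foldLeft(raw)((d, c) => d.withColumn(c, col(c).cast("double")))

  // zscore normalize each numeric feature and purge records beyond 2 std deviations
  for (c <- numeric) {
    val stats = df.agg(mean(c).as("m"), stddev(c).as("s")).head
    val (m, s) = (stats.getDouble(0), stats.getDouble(1))
    df = df.withColumn(c, (col(c) - m) / s).filter(abs(col(c)) <= 2.0)
  }

  df.write.csv("house_price_normalized")
  spark.stop()
}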

Wrapping Up

We have used house price data as an example and gone through the data normalization process using a Spark-based implementation. Execution steps for this use case are detailed in the tutorial document.

Support

For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on the cloud, including installation, configuration and testing.


About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have a neat closed form solution.
