Anomaly detection techniques based on statistical modeling are simple and effective. The Zscore based technique is one of them. Zscore is defined as the absolute difference between a data value and the mean, normalized by the standard deviation. A data point with a Zscore above some threshold is considered a potential outlier. One criticism of Zscore is that it is itself prone to being influenced by outliers. To remedy that, a technique called robust Zscore can be used, which is much more tolerant of outliers.

In this post, I will go over a robust Zscore based implementation on Hadoop to detect outliers in data. The solution is part of my open source project *chombo*. As a use case, I will be using IP network data to detect anomalous packets.

## Robust Zscore

Normal Zscore is based on mean and standard deviation as below and it’s a measure of how far a data point is from the mean.

*z = |x – m(x)| / s(x)*

*where*

*z = Zscore*

*m(x) = Mean*

*s(x) = Standard deviation*
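To make the formula concrete, here is a minimal Python sketch. It is not part of chombo; the function name `zscores` and the packet sizes are made up for illustration. It also hints at the weakness discussed next: a single extreme value inflates the mean and standard deviation so much that its own Zscore stays modest.

```python
import statistics

def zscores(values):
    """Absolute Zscore of each value, using the population mean and standard deviation."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return [abs(x - m) / s for x in values]

# Hypothetical packet sizes with one obvious outlier (9973).
sizes = [500, 510, 495, 505, 498, 502, 9973]
scores = zscores(sizes)

# The outlier has the largest score, but it still falls below a typical threshold of 3.
print(max(scores) == scores[-1], scores[-1] < 3)
```

With a population standard deviation, the largest possible Zscore in a sample of n points is the square root of n – 1, so in a small sample even a gross outlier may never cross the threshold.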

As is evident, we have a chicken and egg problem. The outliers that we are trying to detect have already influenced the estimates of mean and standard deviation, unless the outliers are removed from the data set before calculating them. The removal can be done with an iterative process as below.

1. Calculate mean and standard deviation
2. Find outliers and remove them from the data set
3. Repeat steps 1 and 2 until some convergence criterion is met
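The iterative process could be sketched in Python as follows. This is only an illustration under assumptions of my choosing: the function name, the threshold of 2.0, the iteration cap, and "no new outliers found" as the convergence criterion are all hypothetical.

```python
import statistics

def iterative_zscore_outliers(values, threshold=2.0, max_iter=10):
    """Repeatedly drop points whose Zscore exceeds the threshold until none are found."""
    kept = list(values)
    removed = []
    for _ in range(max_iter):
        if len(kept) < 2:
            break
        # Step 1: mean and standard deviation of the remaining data
        m = statistics.fmean(kept)
        s = statistics.pstdev(kept)
        if s == 0:
            break  # all remaining values are identical
        # Step 2: find outliers and remove them
        flagged = [x for x in kept if abs(x - m) / s > threshold]
        if not flagged:
            break  # converged: this pass found no new outliers
        removed.extend(flagged)
        kept = [x for x in kept if abs(x - m) / s <= threshold]
    return kept, removed

sizes = [500, 510, 495, 505, 498, 502, 9973]
kept, removed = iterative_zscore_outliers(sizes)
```

The result is sensitive to the threshold and to the stopping rule, and every pass needs another scan of the data.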

However, there is a better way, and that is what we will delve into now. Statistical methods that are not unduly affected by outliers are called robust statistical methods. From a robust statistics point of view, we can make the following observations.

- Median is a robust measure of central tendency, while mean is not.
- Median absolute deviation (MAD) is a robust measure of statistical dispersion, while standard deviation is not.

Robust Zscore as a function of median and median absolute deviation (MAD) is defined as below.

*rz = |x – med(x)| / mad(x)*

*where*

*rz = Robust Zscore*

*med(x) = Median*

*mad(x) = Median absolute deviation*

With robust Zscore we can detect outliers reliably even in the presence of outliers in the data used to compute median and median absolute deviation.

Median, as we know, corresponds to the 50th percentile value, i.e., half the data points are below the median and the other half are above it. Median absolute deviation is defined as below.

*mad = 1.4826 x med(|x – med(x)|)*

*where*

*mad = Median absolute deviation*

*med = Median*

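We take the absolute value of the deviation of each data point from the median and then take the median of those absolute deviations. The constant factor scales the result so that, for normally distributed data, it is comparable to the standard deviation. The robust Zscore definition was shown earlier.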
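Putting the two definitions together, a minimal single-machine sketch in Python might look like this. It is not the chombo implementation; the names `MAD_SCALE` and `robust_zscores` and the sample data are hypothetical. The scale factor used is 1.4826, the commonly cited value that makes MAD consistent with the standard deviation for normal data.

```python
import statistics

MAD_SCALE = 1.4826  # commonly used normal-consistency constant for MAD

def robust_zscores(values):
    """Robust Zscore of each value, based on median and scaled MAD."""
    med = statistics.median(values)
    mad = MAD_SCALE * statistics.median([abs(x - med) for x in values])
    if mad == 0:
        # Happens when more than half of the values are identical.
        raise ValueError("MAD is zero; robust Zscore is undefined")
    return [abs(x - med) / mad for x in values]

# Hypothetical packet sizes with one obvious outlier (9973).
sizes = [500, 510, 495, 505, 498, 502, 9973]
scores = robust_zscores(sizes)
```

For this data the median is 502 and the median of the absolute deviations is 4, so the outlier barely moves either statistic and its robust Zscore ends up far above that of every other point.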

## Map Reduce for Median and Median Absolute Deviation

The map reduce class *NumericalAttrMedian* calculates either median or median absolute deviation (MAD), depending on how a configuration parameter is set. The first run of the map reduce calculates the median. The second run, which calculates MAD, uses the median calculated in the first run.

Generally, median calculation involves sorting the data. However, I have used a bucketing approach, so that only the data in the bucket where the 50th percentile value falls is sorted. We can make this optimization because we know that all values in the buckets before this bucket are below the 50th percentile mark. By the same token, all values in the following buckets are above the 50th percentile mark.
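The bucketing idea can be sketched on a single machine as below. This is an illustration only and not the code in *NumericalAttrMedian*; the function name and the fixed `bucket_width` parameter are assumptions. Values are grouped into fixed-width buckets, bucket sizes are accumulated in bucket order, and only the bucket containing the middle rank is sorted.

```python
from collections import defaultdict

def bucketed_median(values, bucket_width):
    """Median via fixed-width buckets; only the bucket holding a middle rank is sorted."""
    if not values:
        raise ValueError("no data")
    buckets = defaultdict(list)
    for x in values:
        buckets[int(x // bucket_width)].append(x)

    n = len(values)
    # 0-based ranks of the middle element(s): one for odd n, two for even n
    targets = [n // 2] if n % 2 else [n // 2 - 1, n // 2]

    picked = []
    seen = 0  # number of values in the buckets already passed
    for idx in sorted(buckets):  # sorts bucket indexes, not the data
        bucket = buckets[idx]
        in_bucket = [r for r in targets if seen <= r < seen + len(bucket)]
        if in_bucket:
            ordered = sorted(bucket)  # the only place data values are sorted
            picked.extend(ordered[r - seen] for r in in_bucket)
        seen += len(bucket)
        if len(picked) == len(targets):
            break
    return sum(picked) / len(picked)

sizes = [500, 510, 495, 505, 498, 502, 9973]
median_size = bucketed_median(sizes, bucket_width=100)
```

In this sketch, when the number of values is even and the two middle ranks fall on either side of a bucket boundary, two adjacent buckets get sorted instead of one.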

The mapper key consists of any partitioning fields, the column ordinal of the column for which the median is being computed, and the bucket index. The mapper output value is the list of all values in the corresponding bucket. The sorting is done on the reducer side.

The input consists of the following fields.

- Source IP address
- Target IP address
- Time stamp
- Packet size

The pair (source IP address, target IP address) serves as the partitioning fields, assuming that the input data consists of packet size data for many host pair combinations.
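To illustrate how such a key could be built from one input record, here is a plain Python sketch that simulates the map and shuffle steps in memory. It does not use the Hadoop API and is not the chombo code; `map_record`, `shuffle`, and the bucket width of 100 are hypothetical, and the input is assumed to be comma separated with packet size in column 3 (0-based).

```python
from collections import defaultdict

PACKET_SIZE_ORDINAL = 3  # 0-based column of packet size (assumed layout)
BUCKET_WIDTH = 100       # hypothetical bucket width

def map_record(line):
    """Return ((source IP, target IP, column ordinal, bucket index), value) for one record."""
    fields = line.strip().split(",")
    src, dst = fields[0], fields[1]
    size = int(fields[PACKET_SIZE_ORDINAL])
    key = (src, dst, PACKET_SIZE_ORDINAL, size // BUCKET_WIDTH)
    return key, size

def shuffle(pairs):
    """Group mapped values by key, as the framework would do before the reducer runs."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

records = [
    "165.68.112.84,165.68.103.116,1436192602,9973",
    "165.68.112.84,165.68.103.116,1436194507,514",
    "165.68.112.84,165.68.103.116,1436194600,520",
]
grouped = shuffle(map_record(r) for r in records)
```

Each host pair and bucket ends up with its own list of values, so a reducer only needs to sort the list for the bucket that contains the median.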

## Map Reduce for Outlier Detection with Robust Zscore

With the median and MAD values in hand, outliers can easily be found using the generic data validation map reduce class *ValidationChecker*. With this map reduce, one or more validators can be configured for any field. Details about this map reduce can be found in an earlier post.

In our case, we are analyzing only one field, which is the packet size. We have configured only one validator for this field, which is *robustZscoreBasedRange*. The validation checker map reduce will generate a report containing details of the records that were found to be invalid according to the validators configured for a field.

In our case, any record whose packet size has a robust Zscore exceeding some user defined threshold is considered invalid. Those are also the outliers we are looking for. Here is the output.

    165.68.75.105,165.68.65.106,1436192435,50 field:3 robustZscoreBasedRange
    165.68.112.84,165.68.103.116,1436192602,9973 field:3 robustZscoreBasedRange
    165.68.112.84,165.68.103.116,1436194507,514 field:3 robustZscoreBasedRange

For each invalid record, the output consists of a list of the fields found invalid. For each field, the list of validators that found the field to be invalid is also part of the output.
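The check itself is simple once the median and MAD for each host pair are available. The Python sketch below shows the idea and imitates the report line format shown above. It is not the *ValidationChecker* code; the function name, the threshold of 3.0, and the median and MAD numbers in `stats` are made up for illustration.

```python
def check_record(line, stats, threshold=3.0, ordinal=3):
    """Return a report line if the field's robust Zscore exceeds the threshold, else None."""
    record = line.strip()
    fields = record.split(",")
    med, mad = stats[(fields[0], fields[1])]  # per host pair median and MAD
    rz = abs(float(fields[ordinal]) - med) / mad
    if rz > threshold:
        return f"{record} field:{ordinal} robustZscoreBasedRange"
    return None

# Made-up median and MAD for one host pair; real values come from the earlier jobs.
stats = {("165.68.112.84", "165.68.103.116"): (1500.0, 120.0)}

flagged = check_record("165.68.112.84,165.68.103.116,1436192602,9973", stats)
passed = check_record("165.68.112.84,165.68.103.116,1436192700,1400", stats)
```

Here the first record is reported and the second is not, since its packet size lies within three scaled MADs of the median.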

## Summing Up

We have gone through an outlier detection technique based on robust Zscore. Anomaly or outlier detection techniques can be of two types: they are based on either instance data or sequence data. The algorithm discussed in this post is instance data based.

The tutorial document contains step-by-step instructions for generating data and executing the use case.

Hi, Pranab. Would calculating the robust zscore using the population median absolute deviation also work? Is there a wikipedia entry or article on the robust zscore that you can refer me to? Thanks

I am using median absolute deviation. There is a link to wikipedia in my post.

Hi, I did not get notified of your reply. I came back to this topic today. You are using the median absolute deviation but there are two kinds: population and sample. In my case I have a symmetric distribution with zero mean and I am using elasticsearch to store my data so I can calculate the 75th percentile (population median standard deviation) for a large dataset in milliseconds. I read the wikipedia entry and it says that mad = 1.4296 x med(|x – med(x)) is in fact the population MAD.

Thanks for taking the time to reply to me.


Archie

I am calculating population statistics. There is no low latency requirement for calculating these median related stats. Hadoop batch processing on historical data is fine. However, these stats could be used in a real time streaming context for outlier detection. The outlier detection may have a very low latency requirement. One example would be detecting outliers in sensor data and taking some action promptly when outliers are found.

Hi! Thanks for the great explanations here. However, one (I think important) remark: if one takes the absolute value, one loses the ‘direction’ in the normalization. If one wants or needs this, better use:

z = (x – m(x)) / s(x) instead of |x – m(x)| / s(x)

and accordingly

rz = (x – med(x)) / mad(x)

in case you need the direction of an outlier/data-point or if you derive other measures from the normalized data.

Also needed when using it in a robust Z-score normalization procedure (as in gene-array screens and stuff like this)

cheers