Anomaly detection techniques based on statistical modeling are simple and effective. The Zscore based technique is one of them. Zscore is defined as the absolute difference between a data value and its mean, normalized by the standard deviation. A data point with a Zscore above some threshold is considered a potential outlier. One criticism of Zscore is that it is itself prone to being influenced by outliers. To remedy that, a technique called robust Zscore can be used, which is much more tolerant of outliers.
In this post, I will go over a robust Zscore based implementation on Hadoop to detect outliers in data. The solution is part of my open source project chombo. As a use case, I will be using IP network data to detect anomalous packets.
The normal Zscore is based on the mean and standard deviation, as below, and it is a measure of how far a data point is from the mean.
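Following the definition given in the introduction, the normal Zscore can be written as follows. The symbols are chosen here for illustration: $x$ is a data value, $\mu$ the mean and $\sigma$ the standard deviation of the data set.

```latex
z = \frac{\lvert x - \mu \rvert}{\sigma}
```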
As is evident, we have a chicken and egg problem. The outliers that we are trying to detect have already influenced the estimates of the mean and standard deviation, unless the outliers are removed from the data set before the mean and standard deviation are calculated. The removal can be done with an iterative process as below.
- Calculate mean and standard deviation
- Find outliers and remove them from the data set
- Repeat steps 1 and 2 until some convergence criterion is met
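The iterative process above can be sketched in a few lines of Python. This is only an illustration, not code from chombo: the function name, the default threshold of 3.0, the iteration cap and the convergence criterion (stop when a pass flags nothing) are all assumptions made for the example.

```python
import statistics


def iterative_zscore_outliers(values, threshold=3.0, max_iter=20):
    """Repeatedly remove points whose Zscore exceeds `threshold`.

    Hypothetical helper: stops when a pass finds no outliers,
    or after `max_iter` passes.
    """
    kept = list(values)
    removed = []
    for _ in range(max_iter):
        if len(kept) < 2:
            break
        # Step 1: mean and standard deviation of the remaining data
        mean = statistics.mean(kept)
        std = statistics.pstdev(kept)
        if std == 0:
            break  # no spread left, nothing more to flag
        # Step 2: find outliers and remove them
        flagged = [v for v in kept if abs(v - mean) / std > threshold]
        if not flagged:
            break  # Step 3: converged, the last pass removed nothing
        removed.extend(flagged)
        kept = [v for v in kept if abs(v - mean) / std <= threshold]
    return kept, removed
```

Each pass recomputes the mean and standard deviation from scratch, which is exactly the cost the robust approach avoids.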
However, there is a better way, and that is what we will delve into now. Statistical methods that are not unduly affected by outliers are called robust statistical methods. From a robust statistics point of view, we can make the following observations.
- The median is a robust measure of central tendency, while the mean is not.
- The median absolute deviation (MAD) is a robust measure of statistical dispersion, while the standard deviation is not.
Robust Zscore as a function of median and median absolute deviation (MAD) is defined as below.
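In its simplest form, the robust Zscore replaces the mean with the median and the standard deviation with MAD. The notation below is chosen here for illustration, with $\tilde{x}$ standing for the median of the data set.

```latex
z_{\text{robust}} = \frac{\lvert x - \tilde{x} \rvert}{\mathrm{MAD}}
```

A widely used variant, the modified Zscore, multiplies this ratio by the constant 0.6745 so that it is comparable to the normal Zscore for normally distributed data. Whether chombo applies such a scaling constant is not stated here, so treat it as an optional refinement.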
With robust Zscore, we can detect outliers reliably even when outliers are present in the data used to compute the median and the median absolute deviation.
The median, as we know, corresponds to the 50th percentile value, i.e., half the data points are below the median and the other half are above it. Median absolute deviation is defined as below.
We take the absolute value of the deviation of each data point from the median and then take the median of those absolute deviations. The robust Zscore definition was shown earlier.
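To make the two definitions concrete, here is a minimal single-machine sketch in Python. The function names are hypothetical, and the unscaled ratio of absolute deviation to MAD is an assumption about the exact form of the robust Zscore.

```python
import statistics


def median_and_mad(values):
    """Return (median, median absolute deviation) of a list of numbers."""
    med = statistics.median(values)
    # Absolute deviation of every point from the median, then their median
    mad = statistics.median([abs(v - med) for v in values])
    return med, mad


def robust_zscore(value, med, mad):
    """Unscaled robust Zscore: |value - median| / MAD (assumed form)."""
    if mad == 0:
        raise ValueError("MAD is zero; robust Zscore is undefined")
    return abs(value - med) / mad
```

For the values 1, 2, 3, 4 and 100, the median is 3 and the absolute deviations are 2, 1, 0, 1 and 97, so MAD is 1. The extreme value 100 barely moves either statistic, yet its robust Zscore is 97.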
Map Reduce for Median and Median Absolute Deviation
The map reduce class NumericalAttrMedian calculates the median as well as the median absolute deviation (MAD), depending on how a configuration parameter is set. The first run of the map reduce calculates the median. The second run, which calculates MAD, uses the median calculated in the first run.
Generally, median calculation involves sorting the data. However, I have used a bucketing approach, so that only the data in the bucket where the 50th percentile value falls is sorted. We can make this optimization because we know that all buckets before this bucket are below the 50th percentile mark. By the same token, all the following buckets are above the 50th percentile mark.
The mapper key consists of any partitioning fields, the column ordinal of the column for which the median is being computed, and the bucket index. The mapper output value is the list of all values in the corresponding bucket. The sorting is done on the reducer side.
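The bucketing idea can be illustrated with a single-process Python sketch. It is not the NumericalAttrMedian implementation: the function name and the fixed `bucket_width` parameter are assumptions, and the map and reduce phases are simulated with an in-memory dictionary.

```python
from collections import defaultdict


def bucketed_median(values, bucket_width):
    """Median via bucketing: only the bucket(s) holding the middle
    rank(s) are sorted. Hypothetical stand-in for the map reduce job."""
    if not values:
        raise ValueError("no data")

    # "Map" side: group values by bucket index
    buckets = defaultdict(list)
    for v in values:
        buckets[int(v // bucket_width)].append(v)

    def value_at_rank(rank):
        # "Reduce" side: walk buckets in order, using only their counts,
        # and sort just the bucket that contains the requested rank
        seen = 0
        for idx in sorted(buckets):
            bucket = buckets[idx]
            if rank < seen + len(bucket):
                return sorted(bucket)[rank - seen]
            seen += len(bucket)
        raise IndexError(rank)

    n = len(values)
    # For an even count, average the two middle values
    return (value_at_rank((n - 1) // 2) + value_at_rank(n // 2)) / 2
```

Buckets before the one containing the middle rank contribute only their counts, which is why they never need to be sorted.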
The input consists of the following fields.
- Source IP address
- Target IP address
- Time stamp
- Packet size
The pair (source IP address, target IP address) serves as the partitioning fields, assuming that the input data consists of packet size data for many host pair combinations.
Map Reduce for Outlier Detection with Robust Zscore
With the median and MAD values in hand, outliers can easily be found using the generic data validation map reduce class ValidationChecker. With this map reduce, one or more validators can be configured for any field. Details about this map reduce can be found in an earlier post.
In our case, we are analyzing only one field, which is the packet size. We have configured only one validator for this field, which is robustZscoreBasedRange. The validation checker map reduce generates a report containing details of the records that were found to be invalid according to the validators configured for a field.
In our case, any record whose packet size has a robust Zscore exceeding some user defined threshold is considered invalid. Those are also the outliers we are looking for. Here is the output.
220.127.116.11,18.104.22.168,1436192435,50 field:3 robustZscoreBasedRange
22.214.171.124,126.96.36.199,1436192602,9973 field:3 robustZscoreBasedRange
188.8.131.52,184.108.40.206,1436194507,514 field:3 robustZscoreBasedRange
For each invalid record, the output contains the list of fields found to be invalid. For each field, the list of validators that found the field to be invalid is also part of the output.
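As a rough single-machine analogue of this validation step, the sketch below groups records by host pair and flags packet sizes whose robust Zscore exceeds a threshold. It does not use the ValidationChecker API: the function name, the record layout as tuples, the threshold of 3.5 and the unscaled robust Zscore are all assumptions made for illustration.

```python
import statistics
from collections import defaultdict


def find_packet_outliers(records, threshold=3.5):
    """records: iterable of (source_ip, target_ip, timestamp, packet_size).

    Returns the records whose packet size has a robust Zscore above
    `threshold` within their (source_ip, target_ip) partition.
    """
    # Partition by host pair, mirroring the partitioning fields
    by_pair = defaultdict(list)
    for rec in records:
        by_pair[(rec[0], rec[1])].append(rec)

    outliers = []
    for recs in by_pair.values():
        sizes = [r[3] for r in recs]
        med = statistics.median(sizes)
        mad = statistics.median([abs(s - med) for s in sizes])
        if mad == 0:
            continue  # no spread in this partition; score undefined
        for r in recs:
            if abs(r[3] - med) / mad > threshold:
                outliers.append(r)
    return outliers
```

In the Hadoop version, the median and MAD come from the earlier map reduce runs rather than being recomputed here.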
We have gone through an outlier detection technique based on robust Zscore. Anomaly or outlier detection techniques can be of two types: they are based either on instance data or on sequence data. The algorithm discussed in this post is instance data based.
The tutorial document contains step by step instructions for generating the data and executing the use case.