Anomaly Detection with Robust Zscore


Anomaly detection techniques based on statistical modeling are simple and effective. The Zscore based technique is one of them. Zscore is defined as the absolute difference between a data value and the mean, normalized by the standard deviation. A data point with a Zscore above some threshold is considered a potential outlier. One criticism of Zscore is that the mean and standard deviation it relies on are themselves influenced by outliers. To remedy that, a technique called robust Zscore can be used, which is much more tolerant of outliers.

In this post, I will go over a robust Zscore based implementation on Hadoop to detect outliers in data. The solution is part of my open source project chombo. As a use case, I will use IP network data to detect anomalous packets.

Robust Zscore

Normal Zscore is based on mean and standard deviation as shown below. It is a measure of how far a data point is from the mean, in units of standard deviation.

z = |x – m(x)| / s(x)
where
z = Zscore
m(x) = Mean
s(x) = Standard deviation
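
As a quick illustration, here is a minimal Python sketch of the plain Zscore. It is not part of chombo; the function name `zscores`, the use of population standard deviation, and the sample packet sizes are all assumptions made for this example.

```python
import statistics

def zscores(values):
    """Return the absolute Zscore of every value (population standard deviation)."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    if s == 0:
        raise ValueError("standard deviation is zero; Zscore is undefined")
    return [abs(x - m) / s for x in values]

# Hypothetical packet sizes with one obvious outlier at the end
packet_sizes = [510, 498, 505, 502, 495, 9973]
scores = zscores(packet_sizes)
```

With this made-up sample, the value 9973 gets a Zscore of only about 2.24, which is below a commonly used threshold of 3, because the outlier itself inflates both the mean and the standard deviation.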

As is evident, we have a chicken and egg problem. The outliers that we are trying to detect have already influenced the estimates of mean and standard deviation, unless the outliers are removed from the data set before calculating them. The removal can be done with an iterative process as below.

  1. Calculate mean and standard deviation
  2. Find outliers and remove them from the data set
  3. Repeat steps 1 and 2 until some convergence criterion is met
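
The iterative process above could be sketched in Python roughly as follows. The function name, the threshold of 3.0, the iteration cap and the sample data are hypothetical choices for illustration; the convergence criterion assumed here is simply that a pass finds no more outliers.

```python
import statistics

def iterative_outlier_removal(values, threshold=3.0, max_iter=20):
    """Repeatedly drop points whose Zscore exceeds threshold until a pass finds none."""
    kept = list(values)
    removed = []
    for _ in range(max_iter):
        if len(kept) < 2:
            break
        m = statistics.mean(kept)        # step 1: mean and standard deviation
        s = statistics.pstdev(kept)
        if s == 0:
            break
        is_outlier = [abs(x - m) / s > threshold for x in kept]
        if not any(is_outlier):
            break                        # step 3: converged, nothing left to remove
        # step 2: remove the outliers found in this pass
        removed.extend(x for x, out in zip(kept, is_outlier) if out)
        kept = [x for x, out in zip(kept, is_outlier) if not out]
    return kept, removed

# Hypothetical data: 30 ordinary packet sizes plus two outliers
data = [500 + (i % 7) for i in range(30)] + [9973, 3000]
kept, removed = iterative_outlier_removal(data)
```

In this made-up sample the value 3000 is masked by 9973 in the first pass and is only caught in the second pass, which is why more than one iteration is needed.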

However, there is a better way, and that is what we will delve into now. Statistical methods that are not unduly affected by outliers are called robust statistical methods. From a robust statistics point of view, we can make the following observations.

  1. Median is a robust measure of central tendency, while mean is not.
  2. Median absolute deviation (MAD) is a robust measure of statistical dispersion, while standard deviation is not.

Robust Zscore as a function of median and median absolute deviation (MAD) is defined as below.

rz = |x – med(x)| / mad(x)
where
rz = Robust Zscore
med(x) = Median
mad(x) = Median absolute deviation

With robust Zscore we can detect outliers reliably even in the presence of outliers in the data used to compute median and median absolute deviation.

Median, as we know, corresponds to the 50th percentile value, i.e., half the data points are below the median and the other half are above it. Median absolute deviation is defined as below.

mad = 1.4826 x med(|x – med(x)|)
where
mad = Median absolute deviation
med = Median

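We take the absolute value of the deviation of each data point from the median and then take the median of those absolute deviations. The constant scale factor makes MAD comparable to standard deviation for normally distributed data. The robust Zscore definition was shown earlier.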
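
Putting the two definitions together, a minimal single machine Python sketch might look like this. It is not the chombo implementation; the function names and sample data are hypothetical, and the scale is left as a parameter that defaults to 1.4826, the commonly used consistency constant for normally distributed data.

```python
import statistics

MAD_SCALE = 1.4826  # assumed default: consistency constant for normal data

def median_abs_deviation(values, scale=MAD_SCALE):
    """Scaled median of the absolute deviations from the median."""
    med = statistics.median(values)
    return scale * statistics.median([abs(x - med) for x in values])

def robust_zscores(values, scale=MAD_SCALE):
    """Absolute robust Zscore of every value, based on median and MAD."""
    med = statistics.median(values)
    mad = median_abs_deviation(values, scale)
    if mad == 0:
        raise ValueError("MAD is zero; robust Zscore is undefined")
    return [abs(x - med) / mad for x in values]

# Hypothetical packet sizes with one obvious outlier at the end
packet_sizes = [510, 498, 505, 502, 495, 9973]
scores = robust_zscores(packet_sizes)
```

For this made-up sample the median is 503.5 and the unscaled MAD is 6, so the value 9973 gets a robust Zscore above 1000 while every other value stays below 1. The outlier barely moves the median or the MAD, which is the point of the robust variant.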

Map Reduce for Median and Median Absolute Deviation

The map reduce class NumericalAttrMedian calculates the median as well as the median absolute deviation (MAD), depending on how a configuration parameter is set. The first run of the map reduce calculates the median. The second run, which calculates MAD, uses the median calculated in the first run.

Generally, median calculation involves sorting data. However, I have used a bucketing approach, so that only the data in the bucket where the 50th percentile value falls is sorted. We can make this optimization because we know that all values in the buckets before this bucket are below the 50th percentile mark. By the same token, all values in the following buckets are above it.
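
The bucketing idea can be sketched on a single machine as below. This is only an illustration under assumptions, not the NumericalAttrMedian code: fixed width buckets, the `bucket_width` parameter and the function name are hypothetical, and the real implementation spreads this work across mappers and reducers.

```python
from collections import defaultdict

def bucketed_median(values, bucket_width):
    """Median via bucketing: only the bucket holding a middle rank gets sorted."""
    if not values:
        raise ValueError("no data")
    buckets = defaultdict(list)
    for x in values:
        # floor division keeps bucket indexes in the same order as the values
        buckets[int(x // bucket_width)].append(x)

    def kth_smallest(k):
        # k is a 0-based rank; earlier buckets are skipped using their counts only
        seen = 0
        for idx in sorted(buckets):
            bucket = buckets[idx]
            if seen + len(bucket) > k:
                return sorted(bucket)[k - seen]
            seen += len(bucket)
        raise IndexError(k)

    n = len(values)
    if n % 2 == 1:
        return kth_smallest(n // 2)
    # even count: average the two middle values
    return (kth_smallest(n // 2 - 1) + kth_smallest(n // 2)) / 2

# Hypothetical packet sizes
packet_sizes = [510, 498, 505, 502, 495, 9973, 50, 514, 1500]
median = bucketed_median(packet_sizes, bucket_width=100)
```

Only the bucket covering 500 to 599 is sorted for this sample; the other buckets contribute nothing but their counts.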

The mapper key consists of any partitioning fields, the column ordinal of the column for which the median is being computed, and the bucket index. The mapper output value is the list of all values in the corresponding bucket. The sorting is done on the reducer side.

The input consists of the following fields.

  1. Source IP address
  2. Target IP address
  3. Time stamp
  4. Packet size

The pair (source IP address, target IP address) serves as the partitioning fields, assuming that the input data consists of packet size data for many host pair combinations.
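
To make the partitioning concrete, here is a small Python sketch that groups records by host pair and computes the median and MAD per pair. It assumes comma separated records with the four fields listed above; the function names, the IP addresses, the record values and the 1.4826 default scale are hypothetical and not taken from chombo.

```python
import statistics
from collections import defaultdict

def packet_sizes_by_host_pair(lines):
    """Group packet sizes by the (source IP, target IP) partitioning fields."""
    groups = defaultdict(list)
    for line in lines:
        src, dst, _timestamp, size = line.strip().split(",")
        groups[(src, dst)].append(int(size))
    return groups

def median_and_mad(lines, scale=1.4826):
    """Return {(src, dst): (median, MAD)} for every host pair."""
    stats = {}
    for pair, sizes in packet_sizes_by_host_pair(lines).items():
        med = statistics.median(sizes)  # first pass: median
        # second pass: MAD, which needs the median from the first pass
        mad = scale * statistics.median([abs(x - med) for x in sizes])
        stats[pair] = (med, mad)
    return stats

# Hypothetical input records: source IP, target IP, time stamp, packet size
records = [
    "10.0.0.1,10.0.0.2,1436192400,500",
    "10.0.0.1,10.0.0.2,1436192401,510",
    "10.0.0.1,10.0.0.2,1436192402,520",
    "10.0.0.3,10.0.0.4,1436192403,60",
    "10.0.0.3,10.0.0.4,1436192404,64",
    "10.0.0.3,10.0.0.4,1436192405,1500",
]
stats = median_and_mad(records)
```

Each host pair gets its own median and MAD, so a packet size that is normal for one pair can still be an outlier for another.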

Map Reduce for Outlier Detection with Robust Zscore

With the median and MAD values in hand, outliers can easily be found using the generic data validation map reduce class ValidationChecker. With this map reduce, one or more validators can be configured for any field. Details about this map reduce can be found in an earlier post.

In our case, we are analyzing only one field, the packet size. We have configured only one validator for this field, which is robustZscoreBasedRange. The validation checker map reduce generates a report containing details of the records that were found to be invalid according to the validators configured for a field.

In our case, any record whose packet size has a robust Zscore exceeding some user defined threshold is considered invalid. Those are also the outliers we are looking for. Here is the output.

165.68.75.105,165.68.65.106,1436192435,50
field:3
robustZscoreBasedRange  
165.68.112.84,165.68.103.116,1436192602,9973
field:3
robustZscoreBasedRange  
165.68.112.84,165.68.103.116,1436194507,514
field:3
robustZscoreBasedRange  

For each invalid record, the output consists of a list of the fields found invalid. For each field, the list of validators that found the field to be invalid is also part of the output.
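
The check itself is simple once the median and MAD are available. The Python sketch below shows the general idea only; it is not the ValidationChecker code. The function name, the threshold of 3.0, the per pair (median, MAD) values and the second record are made up for illustration.

```python
def find_outliers(lines, stats, threshold=3.0, field=3):
    """Yield (record, field ordinal, robust Zscore) for records above the threshold."""
    for line in lines:
        fields = line.strip().split(",")
        pair = (fields[0], fields[1])
        med, mad = stats[pair]
        if mad == 0:
            continue  # robust Zscore is undefined when MAD is zero
        rz = abs(float(fields[field]) - med) / mad
        if rz > threshold:
            yield line.strip(), field, rz

# Hypothetical (median, MAD) for one host pair; not values produced by chombo
stats = {("165.68.112.84", "165.68.103.116"): (1200.0, 300.0)}
records = [
    "165.68.112.84,165.68.103.116,1436192602,9973",
    "165.68.112.84,165.68.103.116,1436192700,1250",  # made-up ordinary record
]
outliers = list(find_outliers(records, stats))
```

With these assumed statistics only the record with packet size 9973 is reported, against field ordinal 3, which mirrors the shape of the report shown above.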

Summing Up

We have gone through an outlier detection technique based on robust Zscore. Anomaly or outlier detection techniques can be of two types: they are based on either instance data or sequence data. The algorithm discussed in this post is instance data based.

The tutorial document contains step by step instructions for generating data and executing the use case.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Anomaly Detection, Big Data, data quality, Data Science, Hadoop and Map Reduce.

7 Responses to Anomaly Detection with Robust Zscore

  1. Hi, Pranab. Would calculating the robust zscore using the population median absolute deviation also work? Is there a wikipedia entry or article on the robust zscore that you can refer me to? Thanks

    • Pranab says:

      I am using median absolute deviation. There is a link to wikipedia in my post.

      • Hi, I did not get notified of your reply. I came back to this topic today. You are using the median absolute deviation but there are two kinds: population and sample. In my case I have a symmetric distribution with zero mean and I am using elasticsearch to store my data so I can calculate the 75th percentile (population median standard deviation) for a large dataset in milliseconds. I read the wikipedia entry and it says that mad = 1.4296 x med(|x – med(x)) is in fact the population MAD.

        Thanks for taking the time to reply to me.

  2. Pingback: Profiling Big Data | Mawazo

  3. Pranab says:

    Archie
    I am calculating population statistics. There is no low latency requirement for calculating these median related stats. Hadoop batch processing on historical data is fine. However, these stats could be used in a real time streaming context for outlier detection. The outlier detection may have a very low latency requirement. One example would be detecting outliers in sensor data and taking some action promptly when outliers are found.

  4. roliarnold says:

    Hi! Thanks for the great explanations here. However, one (I think important) remark: if one takes the absolute value, one loses the ‘direction’ in the normalization. If one wants or needs this, better use:

    z = (x – m(x))/ s(x) instead of |x – m(x)|/ s(x)
    and accordingly
    rz = (x – med(x)) / mad(x)

    in case you need the direction of an outlier/data-point or if you derive other measures from the normalized data.

    Also needed when using it in a robust Z-score normalization procedure (as in gene-array screens and stuff like this)

    cheers

  5. Francesco says:

    Hi, is it correct the formula: 1.4296 x med(|x – med(x)) which is SQRT(2) by X by MAD?
