Data Quality Control With Outlier Detection


For many Big Data projects, it has been reported that a significant part of the effort, sometimes as much as 70-80% of the time, is spent on data cleaning and preparation. Typically, in most ETL tools, you define constraints and rules statically for data validation. Examples of such rules are limit checks for numerical quantities and pattern matching for text data.

Sometimes it is not feasible to define the rules statically, because there may be too many variables and the variables may be non-stationary. Data is non-stationary when its statistical properties change with time. In this post, we will go through a technique for detecting whether numerical data falls outside an acceptable range by identifying outliers. Outliers will be detected using statistical techniques. This earlier post of mine contains an overview of different outlier detection algorithms.

We will use a supply chain use case to show how the solution works. The Hadoop based implementation is part of my OSS project chombo.

Outlier Detection

The essential premise is that data points identified as outliers are highly likely to be invalid. Outliers are data points that are significantly different from most of the data. This definition may sound subjective and nebulous; however, there are various techniques for defining outliers in objective ways. For details on different outlier detection techniques, please refer to my earlier post.

We will be using statistical model based outlier detection techniques, which require computation of the mean and standard deviation. Based on the mean and standard deviation, we will use one of two statistical criteria, the Z score or the Chebyshev inequality, to detect outliers.

Supply Chain Data Quality

A supply chain typically has the following elements. Demand is generated at the retail front end and propagates down the chain all the way to the suppliers.

  1. Retailer
  2. Distributor
  3. Manufacturer
  4. Supplier

By the same token, supply originates at the supplier at the back end of the chain and propagates upward toward the retailer to fulfill the demand.

Consider the scenario where each retailer orders items from the distributor at the end of each business day. An order from a retailer will contain a product ID and a quantity, among other details. Essentially, each daily order the distributor receives consists of a (retailer ID, product ID) pair and a quantity.

We are interested in a data validation check for the quantity. However, it is not viable to define the valid quantity range for each (retailer ID, product ID) pair, for the following reasons.

  1. There are too many (retailer ID, product ID) pairs to define the ranges manually.
  2. The quantity variable is non-stationary.

The presence of seasonal variation and long-term trend makes the quantity non-stationary. Examples of seasonal variation are month-of-the-year and day-of-the-week effects. If a product has increasing demand, it will show an upward long-term trend.

We will use historical order data to compute the mean and standard deviation of the quantity for each (retailer ID, product ID) pair.

Running Aggregation Map Reduce

There are two MR jobs involved in the solution. In the first phase, the running aggregation MR calculates the mean and standard deviation of the quantity for each (retailer ID, product ID) pair. There are two kinds of input data for this map reduce. The order data received daily from the retailers is the incremental data. The statistics calculated from the data received so far make up the aggregation data. Every time new incremental order data arrives, this map reduce job is run.

As incremental data files are processed, a new aggregation file is generated, which replaces the previous aggregation file. This process is repeated until enough data has been aggregated to produce reliable statistics.
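
To make the running aggregation concrete, here is a minimal Python sketch, not the chombo implementation, assuming the aggregation record for each (retailer ID, product ID) key carries a count, a sum, and a sum of squares, from which the mean and standard deviation are derived.

import math

def merge_increment(aggregate, incremental_orders):
    """Merge a new batch of orders into the running aggregate.

    aggregate: dict mapping (retailer_id, product_id) -> [count, sum, sum_of_squares]
    incremental_orders: iterable of (retailer_id, product_id, quantity) tuples
    """
    for retailer_id, product_id, quantity in incremental_orders:
        stats = aggregate.setdefault((retailer_id, product_id), [0, 0.0, 0.0])
        stats[0] += 1                    # count
        stats[1] += quantity             # sum
        stats[2] += quantity * quantity  # sum of squares

def mean_and_std(stats):
    """Derive mean and standard deviation from (count, sum, sum of squares)."""
    count, total, total_sq = stats
    mean = total / count
    variance = max(total_sq / count - mean * mean, 0.0)
    return mean, math.sqrt(variance)

# usage: process the daily incremental files one at a time; the quantities
# below are illustrative
aggregate = {}
merge_increment(aggregate, [("7B0ABG6G9S8K", "084RUB6CK0", 58.0),
                            ("7B0ABG6G9S8K", "084RUB6CK0", 63.0)])
print(mean_and_std(aggregate[("7B0ABG6G9S8K", "084RUB6CK0")]))  # (60.5, 2.5)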

The columns in the incremental file that constitute the unique key are defined through configuration. This composite key is used as the mapper output key, which in our case is the (retailer ID, product ID) pair.
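
As an illustration (a sketch, not the actual chombo mapper), the configured key column ordinals would be used roughly as follows; the column positions of the retailer ID, product ID and quantity are assumptions inferred from the sample records shown in this post.

def map_record(line, key_ordinals=(0, 2), quantity_ordinal=3, delimiter=","):
    """Build the composite key from the configured key columns and emit (key, quantity)."""
    fields = line.strip().split(delimiter)
    composite_key = delimiter.join(fields[i] for i in key_ordinals)
    return composite_key, float(fields[quantity_ordinal])

# example record, with the retailer ID in column 0, the product ID in column 2
# and the quantity in column 3
print(map_record("7B0ABG6G9S8K,U4MMNEFKIPFK,084RUB6CK0,58,normal"))
# ('7B0ABG6G9S8K,084RUB6CK0', 58.0)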

The map reduce job can calculate statistics for multiple fields, which are defined through configuration. In our case we are calculating statistics for only one field. Here is sample output of the running aggregation after several incremental data files have been processed. Each record contains the composite key followed by the aggregated statistics for the quantity field, including the count, sum, sum of squares, mean, and standard deviation.

7B0ABG6G9S8K,06L15V17V4,3,1,58,3364,58,0.0
7B0ABG6G9S8K,084RUB6CK0,3,2,121,7333,60,2.5
7B0ABG6G9S8K,09UZ825L9V,3,4,245,15067,61,3.897
7B0ABG6G9S8K,0C36OQ2EL6,3,1,69,4761,69,0.0
7B0ABG6G9S8K,0CJKO005GF,3,1,48,2304,48,0.0
7B0ABG6G9S8K,0F689BRF4P,3,1,63,3969,63,0.0
7B0ABG6G9S8K,0FW4WRMV1F,3,2,141,9941,70,0.5
7B0ABG6G9S8K,0GMA0U36M8,3,1,50,2500,50,0.0
7B0ABG6G9S8K,0GR4YRD9MW,3,1,50,2500,50,0.0
7B0ABG6G9S8K,0K5R526787,3,3,165,9179,55,5.887

Outlier Detection Map Reduce

In the second phase of the solution, the aggregate statistics generated in the first phase are used to detect outliers in fresh incremental order data. The outlier detection map reduce job also takes two kinds of input, the incremental order data and the aggregate statistics data. Using the aggregate statistics and a statistical criterion, it detects the quantity values that are outliers. Any value detected as an outlier is deemed invalid.

The criterion for outliers can be defined in terms of either the Z score or the Chebyshev inequality. The Z score is the ratio of the absolute difference between a value and its mean to the standard deviation, and is defined as below.

zscore = |x – xm| / δ
where
x = some value
xm = mean value
δ = standard deviation

If the Z score is being used, a maximum limit for the Z score is defined through configuration. A data value whose Z score exceeds the configured limit is considered an outlier. A typical upper limit for the Z score is 3.
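
As a minimal sketch of the Z score check (not the actual chombo code), assuming the mean and standard deviation come from the aggregation record for the (retailer ID, product ID) key:

def is_outlier_zscore(value, mean, std_dev, max_zscore=3.0):
    """Flag a value as an outlier if its Z score exceeds the configured limit."""
    if std_dev == 0.0:
        # no variation seen so far; treat any deviation from the mean as an outlier
        return value != mean
    zscore = abs(value - mean) / std_dev
    return zscore > max_zscore

# example: quantity 95 against an aggregate mean of 61 and standard deviation of 3.897
print(is_outlier_zscore(95.0, 61.0, 3.897))  # True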

The Chebyshev inequality is defined as below. It bounds the probability that the absolute difference between a value and its mean exceeds some threshold, in terms of the standard deviation and the threshold value.

p(|x – xm| ≥ t) ≤ (δ / t)²
where
p = probability
x = some value
xm = mean value
t = threshold value
δ = standard deviation

Here we define a probability threshold, e.g., 0.05, which is the right-hand side of the expression. The probability threshold corresponds to a threshold value t: solving (δ / t)² = p for t gives t = δ / √p, and any value deviating from the mean by at least t is treated as an outlier. Which of the two criteria is used is selected through configuration parameters.
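
A corresponding sketch of the Chebyshev based check, under the same assumptions; the probability threshold of 0.05 is just the example value mentioned above.

import math

def is_outlier_chebyshev(value, mean, std_dev, prob_threshold=0.05):
    """Flag a value as an outlier using the Chebyshev inequality.

    The probability threshold p corresponds to a deviation threshold
    t = std_dev / sqrt(p); any deviation of at least t is flagged.
    """
    t = std_dev / math.sqrt(prob_threshold)
    return abs(value - mean) >= t

# example: with standard deviation 3.897 and p = 0.05, t is about 17.4,
# so a quantity of 95 against a mean of 61 is flagged as an outlier
print(is_outlier_chebyshev(95.0, 61.0, 3.897))  # True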

Output is controlled through a configuration parameter that allows the following options.

  1. All records
  2. Valid records
  3. Invalid records

If all records are output, invalid records are appended with the ordinal values of the columns that are found to be invalid.

Here is some sample output of the outlier detection map reduce. All of these records are invalid, because we have used the output control configuration to generate only invalid records.

7B0ABG6G9S8K,U4MMNEFKIPFK,6LHV4BF638,63,normal
7B0ABG6G9S8K,U4MMNEFKIPFK,8IE1KV724G,67,normal
7B0ABG6G9S8K,U4MMNEFKIPFK,AOY2518FV2,55,normal
7B0ABG6G9S8K,U4MMNEFKIPFK,K8RXM08T54,43,normal
7B0ABG6G9S8K,U4MMNEFKIPFK,NSEKYR1921,76,express
G2M8W33H93P9,6LF38A21BX6E,0F689BRF4P,55,normal

Non-Stationary Time Series Data

The quantity data in this use case is not stationary, because there may be seasonal and long-term trend components in the data. Seasonal cycles could be month-of-the-year or day-of-the-week. The long-term trend could be upward or downward.

Because of the non-stationary nature of the data, we had to resort to statistical techniques to detect invalid data. To properly handle the non-stationarity, the aggregate statistics calculated by the first MR job should be computed on a sliding window basis. This can be achieved by recalculating the aggregate statistics from the most recent n incremental data files, as sketched after the list below.

The window size n will depend on the nature of the seasonal and trend components in the data, and it should be chosen based on consideration of the following

  1. Period of seasonal component
  2. Magnitude of variation in the seasonal component
  3. Magnitude of variation of trending components
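
Here is a minimal sketch of the sliding window idea referred to above, not the chombo implementation, assuming per-file partial statistics (count, sum, sum of squares) are retained for the most recent n incremental files and recombined each time a new file arrives.

from collections import deque

def update_window(window, file_partials, window_size):
    """Append per-file partial statistics and drop files that fall outside the window.

    window: deque of dicts, one per incremental file, each mapping
            (retailer_id, product_id) -> (count, sum, sum_of_squares)
    """
    window.append(file_partials)
    while len(window) > window_size:
        window.popleft()

def combine_window(window):
    """Recompute the aggregate statistics from the files currently in the window."""
    combined = {}
    for file_partials in window:
        for key, (count, total, total_sq) in file_partials.items():
            c, s, sq = combined.get(key, (0, 0.0, 0.0))
            combined[key] = (c + count, s + total, sq + total_sq)
    return combined

# usage: keep only the 30 most recent daily files in the aggregate
window = deque()
update_window(window, {("7B0ABG6G9S8K", "084RUB6CK0"): (2, 121.0, 7333.0)}, 30)
aggregate = combine_window(window)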

Non-Stationary Data with Only Seasonal Components

If there are no long-term trend components, we could take a different approach: break up the time horizon of the data into segments and treat the data within each segment as stationary. Here is an example with sensor data.

Consider data from a temperature sensor. If we knew that the temperature is set to one level between 8 AM and 6 PM and another level between 6 PM and 8 AM, we could calculate two different sets of aggregate statistics corresponding to the two time segments. Depending on the timestamp of a reading, we would apply the right set of aggregate statistics to detect outliers.
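
A sketch of the segmented approach for the temperature sensor example; the two segments, the statistics values and the Z score limit are assumptions for illustration.

from datetime import datetime

# aggregate statistics per time segment, assumed to have been computed
# separately from historical readings falling in each segment
SEGMENT_STATS = {
    "day":   {"mean": 72.0, "std_dev": 1.5},   # 8 AM to 6 PM
    "night": {"mean": 65.0, "std_dev": 1.2},   # 6 PM to 8 AM
}

def segment_for(timestamp):
    """Pick the time segment a reading belongs to, based on its hour."""
    return "day" if 8 <= timestamp.hour < 18 else "night"

def is_outlier(timestamp, value, max_zscore=3.0):
    """Apply the Z score check using the statistics of the reading's segment."""
    stats = SEGMENT_STATS[segment_for(timestamp)]
    zscore = abs(value - stats["mean"]) / stats["std_dev"]
    return zscore > max_zscore

print(is_outlier(datetime(2016, 3, 14, 10, 30), 80.0))  # True: too warm for the day segment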

False Alarm and No Alarm

The threshold values for the Z score or the Chebyshev inequality probability should be carefully selected based on the nature of the data. Note that the two move in opposite directions: a lower Z score limit makes the detector more sensitive, while for the Chebyshev criterion a higher probability threshold does, since it corresponds to a smaller deviation threshold t. If the detector is made too sensitive, there will be false positives, i.e., potentially valid data will be identified as invalid.

On the other hand, if the Z score limit is too high or the Chebyshev probability threshold is too low, the detector becomes too lax and there will be false negatives, i.e., potentially invalid data will be accepted as valid.

Summing Up

We have gone through outlier detection based techniques for identifying invalid numerical data that is non-stationary. There are numerous cases where these techniques could be used; one important use case is sensor data in IoT. Here is the tutorial document for running the supply chain use case.
