For many Big Data projects, it has been reported that a significant part of the time, sometimes up to 70-80%, is spent on data cleaning and preparation. Typically, in most ETL tools, you define constraints and rules statically for data validation. Some examples of such rules are limit checking for numerical quantities and pattern matching for text data.

Sometimes it’s not feasible to define the rules statically, because there could be too many variables and the variables could be non stationary. Data is non stationary when its statistical properties change with time. In this post, we will go through a technique for detecting whether some numerical data is outside an acceptable range by detecting outliers. Outliers will be detected using statistical techniques. This earlier post of mine contains an overview of different outlier detection algorithms.

We will take a use case from supply chain to show how the solution works. The Hadoop based implementation is part of my OSS project *chombo*.

## Outlier Detection

The essential premise is that data points that are identified as outliers are highly likely to be invalid. Outliers are data points that are significantly different from most of the data. This definition may sound subjective and nebulous. However, there are various techniques for defining outliers in objective ways. For details on different outlier detection techniques, please refer to my earlier post.

We will be using statistical model based outlier detection techniques. This requires computation of the mean and standard deviation. Based on the mean and standard deviation, we will use one of two statistical criteria, the *Z Score* or the *Chebyshev Inequality*, to detect outliers.

## Supply Chain Data Quality

A supply chain typically has the following elements. Demand is generated at the retail front end and propagates down the chain all the way to the suppliers.

*Retailer → Distributor → Manufacturer → Supplier*

By the same token, supply is initiated at the supplier at the back end of the chain and propagates upward towards the retailer to fulfill the demand.

Consider the scenario where each retailer orders items from the distributor at the end of every business day. An order from a retailer will consist of product ID and quantity, among other details. Essentially, every day the distributor gets orders consisting of a (*retailer ID, product ID*) pair and a quantity.

We are interested in a data validation check for the quantity. However, it’s not viable to define the valid range of quantity for each (*retailer ID, product ID*) pair, for the following reasons.

- There are too many (*retailer ID, product ID*) pairs to define them manually.
- The *quantity* variable is non stationary.

Presence of seasonal variation and long term trend makes *quantity* non stationary. Examples of seasonal variation are month of the year and day of the week cycles. If a product has increasing demand, it will have an upward long term trend.

We will use historical order data to compute the mean and standard deviation of quantity for each (*retailer ID, product ID*) pair.

## Running Aggregation Map Reduce

There are two MR jobs involved in the solution. In the first phase, the running aggregation MR calculates the mean and standard deviation of quantity for each (*retailer ID, product ID*) pair. There are two kinds of data involved in this map reduce. The order data received daily from the retailers is the incremental data. Statistics calculated from the data received so far are in the aggregation data. Every time an incremental order file arrives, this map reduce job is run.

As incremental data files are processed, a new aggregation file is generated, which replaces the original aggregation file. This process is repeated until enough data is aggregated to generate reliable statistics.
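The merge of incremental data into running aggregates can be sketched as below. This is a minimal illustration of the idea, not chombo's actual API; the function and record names are made up for the example. Keeping count, sum, and sum of squares per key is enough to recover the mean and standard deviation at any point.

```python
import math

def aggregate(incremental, running=None):
    """Merge an incremental batch of (retailer_id, product_id, quantity)
    records into the running (count, sum, sum of squares) per key."""
    running = dict(running or {})
    for retailer_id, product_id, qty in incremental:
        count, total, sum_sq = running.get((retailer_id, product_id), (0, 0, 0))
        running[(retailer_id, product_id)] = (count + 1, total + qty, sum_sq + qty * qty)
    return running

def stats(running, key):
    """Derive mean and standard deviation from the running aggregates."""
    count, total, sum_sq = running[key]
    mean = total / count
    std_dev = math.sqrt(sum_sq / count - mean * mean)
    return mean, std_dev
```

Because the aggregates are just sums, each run only needs the previous aggregation file and the new incremental file, which is exactly what makes the MR job cheap to re-run daily.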

The columns in the incremental file constituting the unique key are defined through configuration. This composite key is used as the mapper output key, which in our case is the (*retailer ID, product ID*) pair.

The map reduce job can calculate statistics for multiple fields, which are defined through configuration. In our case we are calculating statistics for only one field. Here is sample output of the running aggregation after several incremental data files have been processed.

```
7B0ABG6G9S8K,06L15V17V4,3,1,58,3364,58,0.0
7B0ABG6G9S8K,084RUB6CK0,3,2,121,7333,60,2.5
7B0ABG6G9S8K,09UZ825L9V,3,4,245,15067,61,3.897
7B0ABG6G9S8K,0C36OQ2EL6,3,1,69,4761,69,0.0
7B0ABG6G9S8K,0CJKO005GF,3,1,48,2304,48,0.0
7B0ABG6G9S8K,0F689BRF4P,3,1,63,3969,63,0.0
7B0ABG6G9S8K,0FW4WRMV1F,3,2,141,9941,70,0.5
7B0ABG6G9S8K,0GMA0U36M8,3,1,50,2500,50,0.0
7B0ABG6G9S8K,0GR4YRD9MW,3,1,50,2500,50,0.0
7B0ABG6G9S8K,0K5R526787,3,3,165,9179,55,5.887
```
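Reading the record layout off the values (an assumption on my part, since the schema is configuration driven), each line appears to hold the composite key followed by the field ordinal, count, sum, sum of squares, mean, and standard deviation. The second record can be checked from its count, sum, and sum of squares:

```python
import math

# Fields taken from the record 7B0ABG6G9S8K,084RUB6CK0,3,2,121,7333,60,2.5
count, total, sum_sq = 2, 121, 7333

mean = total / count                              # 60.5, reported truncated to 60
std_dev = math.sqrt(sum_sq / count - mean * mean) # 2.5, matching the record
```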

## Outlier Detection Map Reduce

In the second phase of the solution, the aggregation statistics generated in the first phase are used to detect outliers in fresh incremental order data. The outlier detection map reduce job also takes two kinds of input, the incremental order data and the aggregate statistics data. Using the aggregate statistics data and some statistical criterion, it detects the *quantity* values that are outliers. Anything that is detected to be an outlier is deemed to be invalid.

The criterion for outliers can be defined in terms of either the *Z Score* or the *Chebyshev Inequality*. The *Z Score* is defined as below. It is the absolute difference between a value of a quantity and its mean, divided by the standard deviation.

*zscore = |x – x<sub>m</sub>| / δ*

*where*

*x = some value*

*x<sub>m</sub> = mean value*

*δ = standard deviation*

If Z score is being used, a maximum limit for Z Score is defined through configuration. A data value that exceeds the configured limit is considered to be an outlier. A typical upper limit for Z score is 3.
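The *Z Score* check above can be sketched in a few lines. This is an illustration of the criterion, not chombo's implementation; the function name and the zero standard deviation guard are my own choices.

```python
def is_outlier_zscore(x, mean, std_dev, max_z=3.0):
    """Flag x as an outlier if its Z Score exceeds the configured limit.

    A zero standard deviation means no spread has been observed for this
    key yet, so no score can be computed and the value is not flagged.
    """
    if std_dev == 0:
        return False
    return abs(x - mean) / std_dev > max_z
```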

The Chebyshev Inequality is defined as below. It bounds the probability of the absolute difference between a value of a quantity and its mean exceeding some threshold value, in terms of the standard deviation and the threshold value.

*p(|x – x<sub>m</sub>| ≥ t) ≤ (δ / t)<sup>2</sup>*

*where*

*p = probability*

*x = some value*

*x<sub>m</sub> = mean value*

*t = threshold value*

*δ = standard deviation*

Here we define a probability threshold, e.g., 0.05, which is the right hand side of the inequality. The probability threshold corresponds to a threshold value t, which is used to detect outliers. One of the two criteria is selected through configuration parameters.
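Solving the inequality for t gives t = δ / √p, which turns the configured probability threshold into a distance threshold from the mean. The sketch below illustrates this; the function names are hypothetical, not chombo's.

```python
import math

def chebyshev_threshold(std_dev, prob):
    """Smallest t such that p(|x - mean| >= t) <= prob,
    obtained by solving (std_dev / t)**2 = prob for t."""
    return std_dev / math.sqrt(prob)

def is_outlier_chebyshev(x, mean, std_dev, prob=0.05):
    """Flag x if it is farther from the mean than the Chebyshev threshold."""
    return abs(x - mean) >= chebyshev_threshold(std_dev, prob)
```

For example, with a standard deviation of 2.5 and a probability threshold of 0.05, any value more than about 11.18 away from the mean is flagged.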

Output is controlled through a configuration parameter allowing the following options

- All records
- Valid records
- Invalid records

If all records are output, invalid records are appended with the ordinal values of the columns that are found to be invalid.

Here is some sample output of the outlier detection map reduce. All of these records are invalid, because we have used the output control configuration to generate only invalid data as output.

```
7B0ABG6G9S8K,U4MMNEFKIPFK,6LHV4BF638,63,normal
7B0ABG6G9S8K,U4MMNEFKIPFK,8IE1KV724G,67,normal
7B0ABG6G9S8K,U4MMNEFKIPFK,AOY2518FV2,55,normal
7B0ABG6G9S8K,U4MMNEFKIPFK,K8RXM08T54,43,normal
7B0ABG6G9S8K,U4MMNEFKIPFK,NSEKYR1921,76,express
G2M8W33H93P9,6LF38A21BX6E,0F689BRF4P,55,normal
```

## Non Stationary Time Series Data

The quantity data in this use case is not stationary, because there may be seasonal and long term trend components in the data. Seasonal cycles could be month of the year or day of the week. The long term trend could be upward or downward.

Because of the non stationary nature of the data, we had to resort to statistical techniques to detect invalid data. To properly handle the non stationary nature of the data, the aggregation statistics calculated by the first MR job should be computed on a sliding window basis. This can be achieved by recalculating the aggregate statistics based on the most recent n incremental data files.

The window size n will depend on the nature of the seasonal and trend components in the data, and it should be chosen based on consideration of the following

- Period of seasonal component
- Magnitude of variation in the seasonal component
- Magnitude of variation of trending components
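The sliding window recalculation can be sketched as below. This is a simplified in-memory illustration of the idea, assuming one batch per incremental data file; the class and method names are invented for the example, and the real MR job would of course work over files rather than in-memory lists.

```python
import math
from collections import deque

class SlidingWindowStats:
    """Per-key running statistics over the n most recent incremental batches."""

    def __init__(self, n):
        self.batches = deque(maxlen=n)  # oldest batch is dropped automatically

    def add_batch(self, records):
        """records: iterable of (key, quantity) pairs for one incremental file."""
        self.batches.append(list(records))

    def stats(self, key):
        """Mean and standard deviation of quantity for key, over the window."""
        count = total = sum_sq = 0
        for batch in self.batches:
            for k, qty in batch:
                if k == key:
                    count += 1
                    total += qty
                    sum_sq += qty * qty
        mean = total / count
        return mean, math.sqrt(sum_sq / count - mean * mean)
```

As each new incremental file arrives, the oldest one falls out of the window, so slow seasonal and trend drift is continually absorbed into the statistics.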

## Non Stationary Data with Only Seasonal Components

If there are no long term trending components, we could take a different approach. We could break up the time horizon of the data into different segments, and within each segment treat the data as stationary. Here is an example for sensor data.

Consider data from a temperature sensor. If we knew that the temperature is set to some level between 8 AM and 6 PM and another level between 6 PM and 8 AM, we could calculate two different sets of aggregate statistics corresponding to the two hourly regions. Depending on the time stamp on the data, we could apply the right set of aggregate statistics to detect outliers.
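The segmented approach can be sketched as below. The segment boundaries come from the sensor example above, but the per-segment statistics and all names are hypothetical, purely for illustration.

```python
def segment(hour):
    """Map an hour of day to a regime: 'day' for 8 AM-6 PM, 'night' otherwise."""
    return "day" if 8 <= hour < 18 else "night"

# Hypothetical per-segment (mean, std_dev) pairs, as would be produced
# by running the aggregation separately over each segment's data.
SEGMENT_STATS = {"day": (72.0, 1.5), "night": (65.0, 1.0)}

def is_outlier(temperature, hour, max_z=3.0):
    """Apply the Z Score check using the statistics of the matching segment."""
    mean, std_dev = SEGMENT_STATS[segment(hour)]
    return abs(temperature - mean) / std_dev > max_z
```

Within each segment the data is treated as stationary, so the plain Z Score criterion applies unchanged; only the lookup of the statistics depends on the time stamp.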

## False Alarm and No Alarm

The threshold values for the *Z Score* or the *Chebyshev Inequality* probability should be carefully selected based on the nature of the data. If the *Z Score* limit is too low or the *Chebyshev Inequality* probability threshold is too high, there will be false positives, i.e., potentially valid data will be identified as invalid.

On the other hand, if the *Z Score* limit is too high or the *Chebyshev Inequality* probability threshold is too low, there will be false negatives, i.e., potentially invalid data will be accepted as valid.

## Summing Up

We have gone through outlier detection based techniques for identifying invalid numerical data that is non stationary. There are numerous cases where these techniques could be used. One important use case is sensor data in IoT. Here is the tutorial document for running the supply chain use case.
