In my earlier post, I did an overview of the outlier detection techniques in big data and specifically Hadoop context. As I mentioned, fraud detection is essentially translates to outlier detection in data mining parlance.

In his post, I will go over a distribution model based technique which has been implemented as two map reduce jobs in beymani available in github. I will use the credit card transaction as the example.

Although distribution based models focus on the temporal aspect of the data, they are the essential initial steps towards a complete fraud detection process.

Our distribution model is non parametric, because we don’t make any assumption about the Gaussian nature of the distribution and we don’t try to compute any of the parameters like mean and standard deviation.

## Multivariate histogram

We need to compute histogram for objects with multiple attributes. Then we focus on those buckets with with low frequency count. All objects in these low frequency buckets are potential outliers i.e., objects that don’t fit the distribution model well.

With objects of n attributes, the buckets are essentially n dimensional regions in an n dimensional space. For example, with 2 dimensions, each bucket is a rectangular region.

When all the attributes are numerical, it’s easy to visualize the the n dimensional buckets. For numerical attributes, we divide the range of values into segments of equal length. Each segment corresponds to a bucket. How do we handle categorical attributes. For categorical attributes, each value of the attribute defines a bucket range along that dimension.

Total number of buckets = Product of number of buckets along each dimension.

Here is sample credit card transaction data that goes as input. The attributes are *transaction id, time of the day, amount of money and the vendor type*. The final goal is to get a list of the transaction id’s sorted by ascending order of frequency.

YX66AJ9U[]588[]20.47[]drug store |

98ZCM6B1[]0[]55.50[]restaurant |

………… |

XXXX7362[]1300[]1875.40[]jewellery store |

The last record is an outlier that I seeded in data. I wanted to see if this records surfaces through my outlier analysis.

The *MutivaritateHistogram* map reduce class mapper builds the multi dimensional bucket for each record. The bucket width of each attributes is defined in a a json metadata file. Here is a snippet for time of the day attribute. The time of day is in terms of minutes. We use a bucket width of 60, so the bucket granularity is hour.

{ "name" : "time", "ordinal" : 1, "dataType" : "int", "bucketWidth" : 60 }

For any record, we find the bucket for each attribute and then create a multidimensional bucket by building a list of the single dimensional buckets encapsulated in *Tuple* object. The *Tuple* class is a generic *WritableComparable* class that encapsulate a list of java primitive objects. It can serve as a key or value object in a map reduce implementation. The mapper of *MutivaritateHistogram* emits the *Tuple* as the key and the ID of the object as the value.

So, the key is the bucket and the value is the ID of the object that falls into that bucket. The mapper output may be sparse, because there is no guarantee that all buckets will have some object in them.

In the reducer, all the object ID’s for a given bucket gets consolidated into a list. The reducer output consists of the bucket as Tuple object appended with id list of the objects in that bucket. Here is some sample output

0,2,clothing store[]PEK3SLKR,KW2YKPEF,GQEMYH49,41ROR62M |

1,4,air fare[]B53WMWEW,O7Q61GLK,KX1TSGL0,TUIW3U8C |

2,5,electronic store[]ZI289XPV,DN274M45 |

In the third record for example, the bucket consists of the following bucket components corresponding to the 3 attributes in our data set

- 2 for time of day
- 5 for money spent
- “electronic store” for vendor

The list on the right side of [] is the list of transaction ID’s for the transaction falling into this bucket.

## Sorting the distribution

The only thing left to do is to sort the histogram bucket in the ascending order of the number of objects in a bucket. In the mapper of the *DistributionSorter* map reduce, we count the number of objects in a bucket and then prepends the Tuple object which represents a bucket with the count value. Then the mapper emits the modified bucket as the key and the list of object ID’s as the value.

By the time the data gets to the reducer, the keys are sorted by the frequency count. In the reducers output the buckets are sorted by the ascending order of the number of objects in a bucket. Here are some sample output, which shows only the buckets with object count of 1. The outlier we seeded is in one on these buckets among many other buckets with object count of 1.

N7WB0VDQ |

SA5PQI6E |

……. |

XXXX7362 |

……. |

## Grid index

Multivariate histogram can also be used as a grid index to optimize some of the algorithms. For example, in case of proximity analysis based on average distance to nearest k neighbors, we could cut down the amount of computation by considering only the objects in neighboring grids of the grid that contains the object under consideration.

On an average if there are m objects in the neighboring grids, then we have to perform n x m computations instead of n x n computations, which could be a significant improvement.

## Where do we go from here

As I mentioned in my earlier post, outlier detection requires multiple techniques, since there is a temporal and a feature based aspect of objects. The distribution model based technique described here only deals with the temporal aspect.

One solution is to take all the objects with low frequency of occurrence e.g., objects with in buckets with frequency count of 1 and then apply proximity analysis on them, to further narrow down on the outliers. Essentially we try to identify those objects in this set that are far away from other objects in an n dimensional feature space. I will cover implementation of proximity analysis in a future post. So stay tuned.

Pingback: It’s a lonely life for outliers | Mawazo

Pingback: Fraudsters are not Model Citizens | BigDataCloud.com

How Exactly I would Run the Code to get a Fraud Outlier ?

Hi Pranab,

Excellent work. I am wondering whether the multivariate histogram is applicable for finding anomaly in application performance. There are a number of metrics like cpu, memory, disk util, queue depth, response time can be collected in a specified time interval. One of the challenge I see is that it is difficult to create bucket for some attributes like queue depth.

I appreciate your comments on this,

Thanks,

Seth.

Hi Seth

Sure, I don’t see any issue. For attributes with no known range, one approach will be to create buckets based on mean and standard deviation. I have meaning to to add this feature. That should take care of queue depth.

Pranab

Great. Make sense. Will that work similar to Gaussian distribution?

Regarding finding the anomaly in real time, would it possible to expose the calculated histogram representing normal behavior in cache. And whenever new transaction hit the system, the system check with the cache and if the transaction does not fall under one of the histogram, it is an anomaly.

Thoughts?

Seth.

Pingback: Profiling Big Data | Mawazo