## It’s a lonely life for outliers

In this post, I am back to outliers and fraud analytic. In this earlier post, I did an overview of outliers detection techniques that are being implemented with Hadoop in my open source project beymani. In this earlier post, I talked about a multivariate distribution model based implementation in beymani. We postulated that data point falling in the low frequency histogram bins are potentially outliers. In this post, the focus will be on a technique that belongs to the so called local memory based model based approach. In contrast to the distribution based approach, we don’t build any global model using training data that applies to all data, with the current approach.

## Proximity Based Analysis

As I mentioned, we don’t build any model, but instead hold on to all the data collected so far.  The idea is simple and intuitive. An outlier by it’s very nature is likely to be far away from it’s neighbors.

We essentially find the k nearest neighbors for a data point and find some aggregate distance to the k nearest neighbors and use that as the outlier score.

We declare that the data points with large distance to it’s nearest k neighbors are potential outliers. In other words, the outliers are maximally separated from their neighbors.

Here are the steps

• Use SameTypeSimilarity MR to find pair wise distance between all data points. This MR is from my Recommendation Engine project sifarish in github. In sifarish, it gets used to find similarities between entities of same type e.g., product
• We use the MR AverageDistance which takes the k nearest neighbor of a data point and then find the average distance to the k nearest neighbor.  This MR is part of the fraud detection project beymani on sifarish. Instead of average, we could have used median of the distance to the kth nearest neighbor.
One way to contrast the current solution with distribution  based model is that in distribution based model we were constructing constant bin width  histogram. In the solution we are discussing here, we are constructing equal number of sample and variable width histogram. By equal number of sample, I mean the k nearest neighbor.
We are doing the computation for every data point. That’s why it’s called memory based model. It’s  more computationally intensive than solutions based on global model.
I am using the same credit card transaction record as the test data with outlier data injected into it. The attributes of a transaction are transaction id, time of the day, amount of money and the vendor type

## Average Distance Map Reduce

Implementation details of this MapReduce can be found here. It takes the output SameTypeSimilarity MR as input. If you are familiar with the sifarish project, this MR generates output with 3 fields as below

1. First entity ID
2. Second entity ID
3. Distance between them

Here is an example of the output

```6JHQ79UA,JSXNUV9R,5
6JHQ79UA,Y1AWCM5P,89
6JHQ79UA,7QPXVXAN,5
6JHQ79UA,UFS5ZM0K,178
6JHQ79UA,DPTX02DI,343
6JHQ79UA,DQQEZ52C,269```

The mapper of AverageDistance has the first entity ID along with the distance as the key and the distance and the second entity ID as the value. The key is setup in a way to enable secondary sorting by the distance. The first entity ID is the base part of the key and the distance is the key extension.

In the reducer, for a given entity we get all the other entities sorted by distance. The reducer retains the first k neighboring entities and finds the average distance to them. Here is some sample output from this MR.  The outlier is the last record. This is the case of jewelry purchase after midnight.

```1IKVOMZE,5
1JI0A0UE,173
1JLNC1MS,168
1JY1JBJK,21
1K82ZJJD,7
1KWBJ4W3,278
...........
XXXX7362,538```

The first column is the entity ID and the second column is the average distance to the K nearest neighbor. In the last row I am showing the distance of the outlier from it’s k neighbors. As we can see it has a very large average distance to it’s neighbors.There are number of ways to detect outliers from the result, as follows

• Sort the result  by the  descending order of the average distance and pick the top n
• Sort the result  by the  descending order of the average distance and pick all with average distance over a predefined threshold

In reality, these results will typically  go through a  review by human experts to make the final determination. Oulier detection problem is like finding needle in haystack. Even the best algorithm may point to few needles. It’s up to human experts to make the final call.

## Incremental Processing

In real data processing pipeline, typically we have fresh data arriving continuously and accumulating. So the processing should focus on the freshly arrived data. For our example, let’s say we have decided to process data that arrived at the end of a 24 hour cycle.

The interval or the acceptable latency depends very much on the problem. At one extreme, the processing could be done real time, as soon a new data point arrives. Although Hadoop may not be a very effective tool when we have such real time processing requirement.

If E is the existing data set and N is the data set arrived in the last 24 hours, we need to do a Cartesian join of N and (N+E) in the first MR job where we find the distance between every pair of data points.

This can be broken up into 2 MR jobs and the results combined. The first will do the join N x N, for which we can the use the same MR SameTypeSimilarity. For the other job, we have to do the join N x E. Essentially we have to treat the data as having to distinct data sets. For this we can use DiffTypeSimilarity which is another MR job in sifarish.

Finally, we have to feed the output of both of these MR into AverageDistance as input. When done, we will detect outliers if any only the new data that arrived in the last 24 hours.

## Choosing Number of Neighbors

How to choose k, the number of nearest neighbors. The underlying model complexity is inversely related to k, in our case.

A smaller k implies a more complex underlying model. As model complexity grows, error due to bias comes down, but the error due to variance goes up. This is known as bias variance trade off.

The total error plotted against model complexity (inverse of k in our case) is an V shaped curve.  The error is large for very simple model and very complex model and minimum somewhere in between.  One approach is to try different k and choose the one that gives smallest error, which should  be near the bottom of the V.

We can treat the data for each day as part the validation data set. If we use data for N days, we have N number of data sets in our validation data set. We can process the whole set for several values of k, and choose the one that gives the lowest error.

## Tutorial

You can find the details of the execution of the two map reduce jobs in this tutorial.  You could  use the ruby script in the resource directory to generate credit card transaction data.

## What’s Next

When I implement my next outlier detection algorithm in beymani, I will be back with another post.