In this post, I am back to outliers and fraud analytic. In this earlier post, I did an overview of outliers detection techniques that are being implemented with Hadoop in my open source project beymani. In this earlier post, I talked about a multivariate distribution model based implementation in beymani. We postulated that data point falling in the low frequency histogram bins are potentially outliers. In this post, the focus will be on a technique that belongs to the so called local memory based model based approach. In contrast to the distribution based approach, we don’t build any global model using training data that applies to all data, with the current approach.
Proximity Based Analysis
As I mentioned, we don’t build any model, but instead hold on to all the data collected so far. The idea is simple and intuitive. An outlier by it’s very nature is likely to be far away from it’s neighbors.
We essentially find the k nearest neighbors for a data point and find some aggregate distance to the k nearest neighbors and use that as the outlier score.
Here are the steps
- Use SameTypeSimilarity MR to find pair wise distance between all data points. This MR is from my Recommendation Engine project sifarish in github. In sifarish, it gets used to find similarities between entities of same type e.g., product
- We use the MR AverageDistance which takes the k nearest neighbor of a data point and then find the average distance to the k nearest neighbor. This MR is part of the fraud detection project beymani on sifarish. Instead of average, we could have used median of the distance to the kth nearest neighbor.
Average Distance Map Reduce
Implementation details of this MapReduce can be found here. It takes the output SameTypeSimilarity MR as input. If you are familiar with the sifarish project, this MR generates output with 3 fields as below
- First entity ID
- Second entity ID
- Distance between them
Here is an example of the output
6JHQ79UA,JSXNUV9R,5 6JHQ79UA,Y1AWCM5P,89 6JHQ79UA,7QPXVXAN,5 6JHQ79UA,UFS5ZM0K,178 6JHQ79UA,DPTX02DI,343 6JHQ79UA,DQQEZ52C,269
The mapper of AverageDistance has the first entity ID along with the distance as the key and the distance and the second entity ID as the value. The key is setup in a way to enable secondary sorting by the distance. The first entity ID is the base part of the key and the distance is the key extension.
In the reducer, for a given entity we get all the other entities sorted by distance. The reducer retains the first k neighboring entities and finds the average distance to them. Here is some sample output from this MR. The outlier is the last record. This is the case of jewelry purchase after midnight.
1IKVOMZE,5 1JI0A0UE,173 1JLNC1MS,168 1JY1JBJK,21 1K82ZJJD,7 1KWBJ4W3,278 ........... XXXX7362,538
The first column is the entity ID and the second column is the average distance to the K nearest neighbor. In the last row I am showing the distance of the outlier from it’s k neighbors. As we can see it has a very large average distance to it’s neighbors.There are number of ways to detect outliers from the result, as follows
- Sort the result by the descending order of the average distance and pick the top n
- Sort the result by the descending order of the average distance and pick all with average distance over a predefined threshold
In reality, these results will typically go through a review by human experts to make the final determination. Oulier detection problem is like finding needle in haystack. Even the best algorithm may point to few needles. It’s up to human experts to make the final call.
In real data processing pipeline, typically we have fresh data arriving continuously and accumulating. So the processing should focus on the freshly arrived data. For our example, let’s say we have decided to process data that arrived at the end of a 24 hour cycle.
The interval or the acceptable latency depends very much on the problem. At one extreme, the processing could be done real time, as soon a new data point arrives. Although Hadoop may not be a very effective tool when we have such real time processing requirement.
This can be broken up into 2 MR jobs and the results combined. The first will do the join N x N, for which we can the use the same MR SameTypeSimilarity. For the other job, we have to do the join N x E. Essentially we have to treat the data as having to distinct data sets. For this we can use DiffTypeSimilarity which is another MR job in sifarish.
Finally, we have to feed the output of both of these MR into AverageDistance as input. When done, we will detect outliers if any only the new data that arrived in the last 24 hours.
Choosing Number of Neighbors
How to choose k, the number of nearest neighbors. The underlying model complexity is inversely related to k, in our case.
The total error plotted against model complexity (inverse of k in our case) is an V shaped curve. The error is large for very simple model and very complex model and minimum somewhere in between. One approach is to try different k and choose the one that gives smallest error, which should be near the bottom of the V.
We can treat the data for each day as part the validation data set. If we use data for N days, we have N number of data sets in our validation data set. We can process the whole set for several values of k, and choose the one that gives the lowest error.
You can find the details of the execution of the two map reduce jobs in this tutorial. You could use the ruby script in the resource directory to generate credit card transaction data.
When I implement my next outlier detection algorithm in beymani, I will be back with another post.