Recently I did some work on my open source fraud analytic project beymani. I implemented one of the proximity based algorithms using Relative Density of a data point as described in my earlier post. When the density of a data point varies widely across the feature space, average distance based approach may not be very effective and a density based approach is more appropriate.
In this post, I will go through a set of Hadoop Map Reduce jobs for detecting outliers using relative density. By relative density, I mean density of a data point with respect to it’s nearest k neighbours.
To put things in context, let’s consider the following example going back to the credit card use case from my earlier posts. Let’s say you fill up your gas tank and your cost comes in the range of $35 and $40.
You would expect a dense and cohesive cluster for your gas expenses around those numbers. If there is a charge of $50, in terms of average distance to it’s neighbours, it may not stand out as an outlier. But with a relative density based approach, it’s far more likely to be identified as an outlier.
The main steps in the solution are as follows. The complete solution requires 4 map reduce jobs. A data point along with it’s k nearest neighbour is defined as a neighbourhood. The data point with respect to which the neighbours are defined is the neighbourhood center. So there are as many neighbourhoods as the data points.
- Find density of a point based on k nearest neighbors
- Find all the neighbourhoods a point belongs to
- Use the output of 1 and 2 to generate entity ID, density keyed by the neighbourhood group
- Use the output of 3 to generate relative density of each data point
- Data points with low relative density are potential outliers
We start with pair wise distances between entities generated by the MR class SameTypeSimilarity. I will be using the same credit card transaction data that I have used before. So the entity is essentially a credit card transaction record.
We use the pair wise distance data as input to the AverageDistance MR, which calculates density for each data point. The density of a data point is simply the inverse of the average distance to the k nearest neighbours. Here is some sample output.
04QCQKGK,66666 04RM4MH2,83333 08SG2DWR,35714 0A7PSZ1A,35714 0AOX30XC,200000 0AULB5HP,250000 0B9S63ED,166666 0BNYXFXN,4201 0BVX2A0H,4739 0EDXJ518,200000
The first column is the entity ID and the second field is the density value, scaled by some pre configured factor
The input is same as before i.e., pairwise entity distance. We use the same MR AverageDistance but configure it appropriately to generate different kind of output. The output consists of entity ID, neighbourhood group ID, and group membership index as below
DPTX02DI,044VPJF3,2 ZZRE6PXG,044VPJF3,3 3AXY7N4Y,044VPJF3,4 FNDUIV00,044VPJF3,5 MIHDM7RV,044VPJF3,6 KKZXHSKC,044VPJF3,7 YAAZ4GKB,044VPJF3,8
Neighbourhood and Density
The purpose of the MR NeighborDensity is to prepare the data for final MR. Essentially we want get group ID, entity ID and the density. The entity ID and density will be repeated as many times as the number of neighbourhood groups an entity belongs to.
It takes the output of the last two phases as input. The density data and the neighbourhood data are secondary sorted, such that the in the input for the reducer, the first value is the density, and the subsequent values are the neighbourhood group IDs. The final output is as follows
04IDYE63,04IDYE63,17857 CGJXZUY9,04IDYE63,17857 U3W7JV20,04IDYE63,17857 KY31WOW2,04IDYE63,17857 BICJNFB3,04QCQKGK,66666, 04QCQKGK,04QCQKGK,66666 K4DJF7A3,04QCQKGK,66666 TH2N09JG,04QCQKGK,66666
The fields in the output are groupID, entityID and the density. This output goes to the final MR to calculate the relative density.
The output of the previous phase is run through the MR RelativeDensity to generate relative density of each data point. The final output is as follows
JP6SLGIT,188 JQK22JBS,71 JS7OAPPY,132 JSXNUV9R,120 JT4FR2SP,55 JV8SJN8W,129 JVKLOCF1,235 ............ XXXX7362,9
The second and the fifth row show some data points with low relative density and they are potential outliers. The last row shows the data point with the lowest relative density, which was the outlier seeded into our data. It does show up as the data point with the lowest relative density.
Here is a tutorial providing further details of the execution of the map reduce jobs for relative density.
Relative density based solutions are effective tools for detecting outliers when the density of the data points vary across feature space.