Relative Density and Outliers


Recently I did some work on my open source fraud analytic project beymani. I implemented one of the proximity based algorithms  using Relative Density of a data point as described in my earlier post. When the density of a data point varies widely across the feature space, average distance based approach may  not be very effective and a density based approach is more appropriate.

In this post, I will go through a set of Hadoop Map Reduce jobs for detecting outliers using relative density. By relative density, I mean density of a data point with respect to it’s nearest k neighbours.

Simple Example

To put things in context, let’s consider the following example going back to the credit card use case from my earlier posts. Let’s say you fill up your gas tank and your cost comes in the range of $35 and $40.

You would expect a dense and cohesive cluster for your gas expenses around those numbers. If there is a charge of $50, in terms of average distance to it’s neighbours, it may not stand out as an outlier. But with a relative density based approach, it’s far more likely to be identified as an outlier.

The Solution

The main steps in the solution are as follows. The complete solution requires 4 map reduce jobs. A data point along with it’s k nearest neighbour is defined as a neighbourhood. The data point with respect to which the neighbours are defined is the neighbourhood center. So there are as many neighbourhoods as the data points.

  1. Find density of a point based on k nearest neighbors
  2. Find all the neighbourhoods a point belongs to
  3. Use the output of 1 and 2 to generate entity ID, density keyed by the neighbourhood group
  4. Use the output of 3 to generate relative density of each data point
  5. Data points with low relative density are potential outliers

Density

We start with pair wise distances  between entities generated by the MR class SameTypeSimilarity. I will be using the same credit card transaction data that I have used before. So the entity is essentially a credit card transaction record.

We use the pair wise distance data as input to  the AverageDistance MR, which calculates density for each data point. The density of a data point is simply the inverse of the average distance to the k nearest neighbours. Here is some sample output.

04QCQKGK,66666
04RM4MH2,83333
08SG2DWR,35714
0A7PSZ1A,35714
0AOX30XC,200000
0AULB5HP,250000
0B9S63ED,166666
0BNYXFXN,4201
0BVX2A0H,4739
0EDXJ518,200000

The first column is the entity ID and the second field is the density value, scaled by some pre configured factor

Neighbourhood Group

The input is same as before i.e., pairwise entity  distance. We use the same MR AverageDistance but configure it appropriately to generate different kind of output. The output consists of entity ID, neighbourhood group ID, and group membership index as below

DPTX02DI,044VPJF3,2
ZZRE6PXG,044VPJF3,3
3AXY7N4Y,044VPJF3,4
FNDUIV00,044VPJF3,5
MIHDM7RV,044VPJF3,6
KKZXHSKC,044VPJF3,7
YAAZ4GKB,044VPJF3,8

Neighbourhood and Density

The purpose of the MR NeighborDensity is to prepare the data for final MR. Essentially we want get group ID, entity ID and the density. The entity ID and density will be repeated as many times as the number of neighbourhood groups an entity belongs to.

It takes the output of the last two phases as input. The density data and the neighbourhood data are secondary sorted, such that the in the input for the reducer, the first value is the density, and the subsequent values are the neighbourhood group IDs. The final output is as follows

04IDYE63,04IDYE63,17857
CGJXZUY9,04IDYE63,17857
U3W7JV20,04IDYE63,17857
KY31WOW2,04IDYE63,17857
BICJNFB3,04QCQKGK,66666, 
04QCQKGK,04QCQKGK,66666
K4DJF7A3,04QCQKGK,66666
TH2N09JG,04QCQKGK,66666

The fields in the output are groupID, entityID and the density. This output goes to the final MR to calculate the relative density.

Relative Density

The output of the previous phase is  run through the MR RelativeDensity to generate relative density of each data point. The final output is as follows

JP6SLGIT,188
JQK22JBS,71
JS7OAPPY,132
JSXNUV9R,120
JT4FR2SP,55
JV8SJN8W,129
JVKLOCF1,235
............
XXXX7362,9

The second and the fifth row show some data points with low relative density and they are potential outliers. The last row  shows the data point with the lowest relative density, which was the outlier seeded into our data. It does show up as the data point with the lowest relative density.

Tutorial

Here is a tutorial providing further details of the execution of the map reduce jobs for relative density.

Summing Up

Relative density based solutions are effective tools for detecting outliers when the density of the data points vary across feature space.

Advertisements

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Big Data, Data Mining, Fraud Detection, Hadoop and Map Reduce and tagged , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s