Fraudsters are not Model Citizens

In my earlier post, I gave an overview of outlier detection techniques in a big data and, specifically, Hadoop context. As I mentioned there, fraud detection essentially translates to outlier detection in data mining parlance.

In this post, I will go over a distribution model based technique, implemented as two map reduce jobs in beymani, available on github. I will use credit card transactions as the example.

Although distribution based models focus on the temporal aspect of the data, they are an essential initial step toward a complete fraud detection process.

Our distribution model is non-parametric: we make no assumption about the distribution being Gaussian, and we don't try to estimate parameters like the mean and standard deviation.

Multivariate histogram

We need to compute a histogram for objects with multiple attributes and then focus on the buckets with low frequency counts. All objects in these low frequency buckets are potential outliers, i.e., objects that don't fit the distribution model well.

With objects of n attributes, the buckets are n-dimensional regions in an n-dimensional space. For example, with 2 dimensions, each bucket is a rectangular region.

When all the attributes are numerical, it's easy to visualize the n-dimensional buckets. For numerical attributes, we divide the range of values into segments of equal length; each segment corresponds to a bucket along that dimension. How do we handle categorical attributes? For those, each distinct value of the attribute defines a bucket along its dimension.

Total number of buckets = Product of number of buckets along each dimension.
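To make the bucketing concrete, here is a minimal Python sketch (illustrative only; the beymani implementation is in Java, and the amount bucket width and per-dimension bucket counts here are assumptions):

```python
# Compute the bucket for a single attribute: numerical attributes map to
# equal-length segments, categorical attributes map to their own value.

def bucket(value, bucket_width=None):
    """Numerical attributes map to floor(value / width); categorical
    attributes (bucket_width is None) get one bucket per distinct value."""
    if bucket_width is None:
        return value
    return int(value) // bucket_width

# time of day (width 60 min), amount (assumed width 100), vendor (categorical)
b = (bucket(588, 60), bucket(20.47, 100), bucket("drug store"))
# b == (9, 0, "drug store")

# total buckets = product of the number of buckets along each dimension
# (the per-dimension counts below are made-up examples)
time_buckets, amount_buckets, vendor_buckets = 24, 50, 30
total = time_buckets * amount_buckets * vendor_buckets   # 36000
```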

Here is sample credit card transaction data that goes in as input. The attributes are the transaction id, time of the day, amount of money and the vendor type. The final goal is to get a list of the transaction id's sorted in ascending order of frequency.

YX66AJ9U[]588[]20.47[]drug store
XXXX7362[]1300[]1875.40[]jewellery store

The last record is an outlier that I seeded into the data. I wanted to see if this record surfaces through my outlier analysis.

The mapper of the MultivariateHistogram map reduce class builds the multidimensional bucket for each record. The bucket width of each attribute is defined in a JSON metadata file. Here is a snippet for the time of day attribute. The time of day is in minutes; we use a bucket width of 60, so the bucket granularity is one hour.

	"name" : "time",
	"ordinal" : 1,
	"dataType" : "int",
	"bucketWidth" : 60

For any record, we find the bucket for each attribute and then create a multidimensional bucket by building a list of the single dimensional buckets, encapsulated in a Tuple object. The Tuple class is a generic WritableComparable class that encapsulates a list of Java primitive objects. It can serve as a key or value object in a map reduce implementation. The mapper of MultivariateHistogram emits the Tuple as the key and the ID of the object as the value.

So, the key is the bucket and the value is the ID of the object that falls into that bucket. The mapper output may be sparse, because there is no guarantee that every bucket will have objects in it.
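A rough Python rendering of the mapper logic, assuming the field ordinals and bucket widths of a hypothetical metadata file (the real implementation is the Java mapper described above):

```python
# Hypothetical mapper sketch: parse a "[]"-delimited record, build the
# multidimensional bucket as a tuple, and emit (bucket, object ID).

WIDTHS = {1: 60, 2: 100}   # ordinal -> bucket width; field 3 is categorical

def map_record(line):
    fields = line.split("[]")
    txn_id = fields[0]
    key = []
    for ordinal in (1, 2, 3):
        if ordinal in WIDTHS:
            # numerical attribute: equal-length segment index
            key.append(int(float(fields[ordinal])) // WIDTHS[ordinal])
        else:
            # categorical attribute: the value itself is the bucket
            key.append(fields[ordinal])
    return tuple(key), txn_id

print(map_record("YX66AJ9U[]588[]20.47[]drug store"))
# -> ((9, 0, 'drug store'), 'YX66AJ9U')
```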

In the reducer, all the object ID's for a given bucket get consolidated into a list. The reducer output consists of the bucket as a Tuple object, appended with the list of IDs of the objects in that bucket. Here is some sample output.

0,2,clothing store[]PEK3SLKR,KW2YKPEF,GQEMYH49,41ROR62M
1,4,air fare[]B53WMWEW,O7Q61GLK,KX1TSGL0,TUIW3U8C
2,5,electronic store[]ZI289XPV,DN274M45

In the third record, for example, the bucket consists of the following components, corresponding to the 3 attributes in our data set:

  1. 2 for time of day
  2. 5 for money spent
  3. “electronic store” for vendor

The list on the right side of [] is the list of transaction ID's for the transactions falling into this bucket.
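The reducer-side consolidation can be sketched in Python as follows, with a plain dict standing in for Hadoop's shuffle-and-sort machinery:

```python
# Sketch of the reducer step: group object IDs by bucket and format each
# bucket as in the sample output above (bucket components, "[]", ID list).
from collections import defaultdict

def reduce_by_bucket(mapped):
    """mapped: iterable of (bucket_tuple, txn_id) pairs from the mapper."""
    groups = defaultdict(list)
    for bucket, txn_id in mapped:
        groups[bucket].append(txn_id)
    return [",".join(map(str, b)) + "[]" + ",".join(ids)
            for b, ids in groups.items()]

pairs = [((2, 5, "electronic store"), "ZI289XPV"),
         ((2, 5, "electronic store"), "DN274M45")]
print(reduce_by_bucket(pairs))
# -> ['2,5,electronic store[]ZI289XPV,DN274M45']
```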

Sorting the distribution

The only thing left to do is to sort the histogram buckets in ascending order of the number of objects in each bucket. The mapper of the DistributionSorter map reduce counts the number of objects in a bucket and prepends the count to the Tuple object that represents the bucket. The mapper then emits the modified bucket as the key and the list of object ID's as the value.

By the time the data gets to the reducer, the keys are sorted by the frequency count. In the reducer's output, the buckets appear in ascending order of the number of objects they contain. Here is some sample output, showing only the buckets with an object count of 1. The outlier we seeded is in one of these buckets, among many other buckets with an object count of 1.
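Here is an illustrative Python sketch of the sorting step (not the actual map reduce job, where the sort happens in the shuffle); the bucket values for the jewellery store record are assumptions, since they depend on the configured bucket widths:

```python
# Sort buckets in ascending order of object count so that the rarest
# buckets, the potential outliers, surface first.

def sort_by_frequency(bucket_to_ids):
    """bucket_to_ids: dict of bucket tuple -> list of object IDs."""
    return sorted(bucket_to_ids.items(), key=lambda kv: len(kv[1]))

histogram = {
    (0, 2, "clothing store"): ["PEK3SLKR", "KW2YKPEF", "GQEMYH49", "41ROR62M"],
    (2, 5, "electronic store"): ["ZI289XPV", "DN274M45"],
    (21, 18, "jewellery store"): ["XXXX7362"],   # the seeded outlier
}
for bucket, ids in sort_by_frequency(histogram):
    print(len(ids), bucket, ids)   # the count-1 bucket prints first
```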


Grid index

The multivariate histogram can also be used as a grid index to optimize some other algorithms. For example, in proximity analysis based on the average distance to the k nearest neighbors, we can cut down the amount of computation by considering only the objects in the grid cells neighboring the cell that contains the object under consideration.

If, on average, there are m objects in the neighboring cells, then we perform n x m distance computations instead of n x n, which can be a significant improvement when m is much smaller than n.
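A sketch of this pruning idea in Python, using a 2-dimensional grid over numeric attributes (the grid cells and points below are made up for illustration):

```python
# Use the histogram buckets as a grid index: when looking for an object's
# nearest neighbors, only examine objects in the 3x3 block of cells around
# the object's own cell instead of scanning all n objects.
from itertools import product

def neighbor_candidates(grid, cell):
    """grid: dict mapping (i, j) cell -> list of points.
    Returns the points in `cell` and its 8 surrounding cells."""
    i, j = cell
    cands = []
    for di, dj in product((-1, 0, 1), repeat=2):
        cands.extend(grid.get((i + di, j + dj), []))
    return cands

grid = {(0, 0): [(5, 5)], (0, 1): [(8, 15)], (3, 3): [(35, 32)]}
# for a point in cell (0, 0) we consider only 2 candidates, not all 3
print(len(neighbor_candidates(grid, (0, 0))))   # -> 2
```

In practice the candidate list still contains the query object itself, which would be excluded from its own nearest-neighbor computation.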


Where do we go from here

As I mentioned in my earlier post, outlier detection requires multiple techniques, since objects have both a temporal and a feature based aspect. The distribution model based technique described here deals only with the temporal aspect.

One solution is to take all the objects with a low frequency of occurrence, e.g., objects in buckets with a frequency count of 1, and then apply proximity analysis to them to further narrow down the outliers. Essentially, we try to identify those objects in this set that are far away from the other objects in an n-dimensional feature space. I will cover the implementation of proximity analysis in a future post, so stay tuned.

For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployments on the cloud, including installation, configuration and testing.


About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains, for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have a neat closed form solution.

7 Responses to Fraudsters are not Model Citizens


  Jagaran says:

    How exactly would I run the code to get a fraud outlier?

  Seth says:

    Hi Pranab,

    Excellent work. I am wondering whether the multivariate histogram is applicable for finding anomalies in application performance. There are a number of metrics, like CPU, memory, disk utilization, queue depth and response time, that can be collected in a specified time interval. One of the challenges I see is that it is difficult to create buckets for some attributes like queue depth.

    I appreciate your comments on this.


    • Pranab says:

      Hi Seth,
      Sure, I don't see any issue. For attributes with no known range, one approach would be to create buckets based on the mean and standard deviation. I have been meaning to add this feature. That should take care of queue depth.


      • Seth says:

        Great, makes sense. Will that work similarly to a Gaussian distribution?

        Regarding finding anomalies in real time, would it be possible to expose the calculated histogram representing normal behavior in a cache? Whenever a new transaction hits the system, the system checks against the cache, and if the transaction does not fall into one of the histogram buckets, it is an anomaly.


