Recently, I started working on Hadoop-based solutions for fraud detection. Fraud detection is critical for many industries, including but not limited to financial services, insurance and retail. Data mining is a key enabler in effective fraud detection.
In this and some following posts, I will cover commonly used data mining solutions for fraud detection. I also have an accompanying open source project on github called beymani where the implementations of these algorithms will be available.
An outlier record, e.g., a fraudulent credit card transaction, has attribute values that are significantly different from typical attribute values. In a data mining context, fraud detection translates to outlier detection in a data set. Finding outliers is like a “needle in a haystack” problem. Two data mining approaches that are applicable to outlier detection are supervised learning and unsupervised learning.
In real world fraud detection, the process is far more complex than the simple application of some data mining techniques. In addition to data mining, the process may involve rule engine based processing, review by domain experts, etc.
I will be referring to a simple example of a credit card transaction throughout this post. Each transaction in my simple model has the following attributes.
- Time of the day for the transaction
- Dollar amount
- Product or service purchased
The data model may be simplistic, but it is good enough to explore the different algorithms. In the real world, a credit card transaction will have many other attributes.
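To make this concrete, here is how such a transaction might be represented in code. This is just an illustrative sketch in Python; the field names and product codes are made up and are not part of beymani.

```python
# A hypothetical encoding of the simple transaction model as a numeric feature vector.
# Field names and category codes are illustrative only.
PRODUCT_CODES = {"grocery": 0, "gas": 1, "electronics": 2, "jewelry": 3}

def encode(txn):
    """Convert a raw transaction dict into a numeric feature vector."""
    return [
        txn["hour_of_day"],              # time of day, 0-23
        txn["amount"],                   # dollar amount
        PRODUCT_CODES[txn["product"]],   # categorical product, coded as an integer
    ]

sample = {"hour_of_day": 2, "amount": 4800.0, "product": "jewelry"}
print(encode(sample))   # -> [2, 4800.0, 3]
```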
In supervised learning, we build a predictive model from a labeled training data set, so that when a new record is encountered, we can predict its label using the model. Some of the techniques that may be useful are Decision Trees, Bayesian filtering and Neural Networks.
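As an illustration only, here is a minimal sketch of the supervised approach using a decision tree from scikit-learn. The tiny labeled training set is made up, and this is not an implementation from beymani.

```python
# Sketch: supervised fraud detection with a decision tree (scikit-learn).
# Feature vectors follow the encoding above; labels 0 = normal, 1 = fraud.
from sklearn.tree import DecisionTreeClassifier

X_train = [[14, 35.0, 0], [18, 60.0, 1], [12, 25.0, 0], [2, 4800.0, 3]]
y_train = [0, 0, 0, 1]

model = DecisionTreeClassifier().fit(X_train, y_train)
print(model.predict([[3, 5200.0, 3]]))  # predict the label of a new transaction
```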
By their very definition, outliers occur very infrequently, which presents a dilemma in the application of supervised learning methods for fraud detection: there are very few labeled examples of fraud to learn from. My focus is on unsupervised learning methods, which are discussed next.
In unsupervised learning, we don’t have any labels associated with the data. Instead, we try to discover interesting patterns. The classic example is clustering.
Outliers lie far apart from their neighbors, in low density regions of the multi-dimensional attribute space. All the techniques described below are based on this fundamental heuristic.
I don’t know about you, but I generally don’t buy expensive jewelry online at 2 AM on a weekday. If someone does clustering analysis of my credit card transactions, such a transaction will be found to be far away from all the clusters of my transactions and should be flagged as a potentially fraudulent transaction.
Outlier detection based on unsupervised learning generally falls under the following categories. This list is not exhaustive by any means.
- Distribution Model
- Proximity Analysis
- Relative Density
- Shared Nearest Neighbor
- Projection Methods
Clustering itself can also be used for outlier detection. Any object that has weak membership in the cluster it belongs to is a potential outlier. If there is a small cluster far away from the other clusters, all the members of that small cluster could also potentially be outliers. For example, there could be multiple fraudulent transactions that are similar to each other and as a result form a cluster.
It may not be easy to distinguish between the two cases. To make the final call, we may have to perform additional analysis and/or subject the results to review by domain experts. I won’t be pursuing any clustering based solution for my open source implementation.
If we consider the probability distribution of the data, an outlier will have a low probability of occurrence. The outliers could be defined in two ways. In the first approach, we consider outliers to be those values that lie more than a certain multiple of the standard deviation away from the mean, i.e., |x − μ| > kσ, where μ is the mean and σ the standard deviation.
If the distribution is assumed to be Gaussian, the x values meeting this criterion can easily be found. The second approach is based on percentiles. We find some threshold value y such that the probability of exceeding it in absolute value is some small predefined probability p, i.e., P(|x| > y) = p. Any value higher than y or lower than -y is likely to be an outlier.
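To make the two criteria concrete, here is a small numpy sketch of both; the multiplier, tail probability and sample data are arbitrary choices of mine.

```python
# Sketch: two parametric ways of flagging outliers in a univariate sample.
import numpy as np

data = np.random.normal(loc=50.0, scale=10.0, size=1000)   # made-up dollar amounts
data = np.append(data, [500.0])                            # inject one outlier

# Approach 1: more than k standard deviations away from the mean
k = 3.0
mean, std = data.mean(), data.std()
outliers_1 = data[np.abs(data - mean) > k * std]

# Approach 2: outside the central (1 - p) probability mass, via percentiles
p = 0.01
lo, hi = np.percentile(data, [100 * p / 2, 100 * (1 - p / 2)])
outliers_2 = data[(data < lo) | (data > hi)]

print(outliers_1, outliers_2)
```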
With both approaches, all we are saying is that the points lying in the two tail ends of the distribution are likely to be outliers. I will be implementing a model based solution, except that my approach will be non parametric. A parametric model has several issues: the distribution is assumed to be Gaussian, and the presence of categorical attributes, if any, makes things complex.
I will have one map reduce job to get the multivariate distribution of the data. With a multivariate histogram, each bucket corresponds to a bounded region in the multi-dimensional attribute space.
Once I have the histogram, the focus will be on identifying the histogram buckets with the smallest populations through a second map reduce job. The criterion is that the number of objects within a bucket should be smaller than a predefined percentage of the total number of objects.
Essentially, we are selecting objects with a probability of occurrence below some threshold, i.e., the second approach described above. Objects within those buckets are likely candidates to be outliers. This implementation is available in beymani.
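The actual implementation in beymani is map reduce based, but the core idea fits in a few lines. Here is a single machine Python sketch with made up bucket widths and threshold, just to show the logic.

```python
# Sketch: multivariate histogram based outlier detection on a single machine.
# Each record is bucketed by discretizing every attribute; buckets whose relative
# population falls below a threshold are flagged, and so are the records in them.
from collections import Counter

def bucket_of(record, widths):
    """Map a numeric record to a multi-dimensional bucket id."""
    return tuple(int(v // w) for v, w in zip(record, widths))

def rare_records(records, widths, threshold=0.01):
    buckets = [bucket_of(r, widths) for r in records]
    counts = Counter(buckets)
    n = len(records)
    # keep records whose bucket holds less than `threshold` fraction of all records
    return [r for r, b in zip(records, buckets) if counts[b] / n < threshold]

# hour of day bucketed in 4 hour windows, amount in $100 windows, product code as-is
records = [[14, 35.0, 0]] * 200 + [[2, 4800.0, 3]]
print(rare_records(records, widths=[4, 100, 1]))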
Proximity analysis requires computing the distance between pairs of objects, so it’s computationally intensive. For any given point we consider the distance to its k nearest neighbors, where k is a predefined number. We can take either the distance to the k-th nearest neighbor or the average distance to the k nearest neighbors. We can call this distance the outlier score.
We need one map reduce job to find the pairwise distances and a second one to find the k nearest neighbors of a point. Both of these map reduce classes exist in my open source project sifarish. Essentially, we solve the same problem for a recommender system, where we are interested in finding similar items. The final output is the record id of an object and its average distance to the k nearest neighbors or the distance to the k-th nearest neighbor.
Once we sort the list in descending order of the distance, the outliers will be at the beginning of the list. For this I use a sorter map reduce available in my open source project chombo. This project contains some generic map reduce classes and a bunch of utility classes.
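For illustration, here is what the whole pipeline boils down to on a single machine: computing the outlier score naively in Python and sorting in descending order. The real implementation does this with map reduce jobs; this sketch is only to show the logic, and the sample points are made up.

```python
# Sketch: proximity based outlier score = average distance to the k nearest neighbors,
# computed naively on a single machine.
import numpy as np

def knn_outlier_scores(points, k):
    pts = np.asarray(points, dtype=float)
    # pairwise Euclidean distances
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    scores = []
    for i, row in enumerate(dist):
        neighbors = np.sort(np.delete(row, i))[:k]   # k nearest, excluding the point itself
        scores.append((i, neighbors.mean()))
    # descending order of score: outliers come first
    return sorted(scores, key=lambda s: s[1], reverse=True)

points = [[14, 35.0], [15, 40.0], [13, 30.0], [2, 4800.0]]
print(knn_outlier_scores(points, k=2))
```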
The proper selection of k can be tricky. If k is too small, the outlier score may be low when multiple outliers happen to lie close to each other, resulting in a failure to detect the outliers, i.e., false negatives. On the other hand, if k is too big, objects that in reality belong to a small legitimate cluster may be mislabeled as outliers, resulting in false positives.
A proximity based solution already exists in my open source project beymani on github.
Clusters in a data set may have varying density, i.e., some clusters are densely packed and others are not. Density is defined as the inverse of the average distance to the k nearest neighbors. As density goes down, the probability of an object being an outlier goes up. In the relative density approach, an object’s density is compared with the average density of its neighbors, which accounts for the varying density of different clusters.
In addition to the two map reduce jobs mentioned for proximity analysis, relative density computation will require two additional map reduce jobs. This is the most computationally intensive among all the algorithms discussed so far. I will provide more details when I get around to implementing it.
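Although the map reduce implementation is still to come, here is a rough single machine sketch of how relative density could be computed, as I understand the method; the exact formulation I end up with may differ.

```python
# Sketch: relative density based outlier score. Density of a point is the inverse of its
# average distance to the k nearest neighbors; relative density compares it with the
# average density of those neighbors. Low relative density suggests an outlier.
import numpy as np

def relative_density_scores(points, k):
    pts = np.asarray(points, dtype=float)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    n = len(pts)
    neigh = [np.argsort(dist[i])[1:k + 1] for i in range(n)]      # skip self at position 0
    density = np.array([1.0 / dist[i][neigh[i]].mean() for i in range(n)])
    # relative density: own density divided by the average density of the k neighbors
    rel = np.array([density[i] / density[neigh[i]].mean() for i in range(n)])
    # ascending order: lowest relative density (most outlier-like) first
    return sorted(enumerate(rel), key=lambda s: s[1])

points = [[1, 1], [1, 2], [2, 1], [2, 2], [10, 10]]
print(relative_density_scores(points, k=2))
```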
Shared Nearest Neighbor
This approach is similar to the relative density method. It also takes the local context into consideration: the similarity between two objects is defined in terms of the nearest neighbors they share, rather than their direct distance. This approach also works better for higher dimensional data.
For a given object, we find the average similarity with its k nearest neighbors. Then we sort the objects in ascending order of the average similarity. Potential outliers will appear at the beginning of the list.
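Here is a small Python sketch of the idea, where the similarity between two objects is taken as the number of k nearest neighbors they have in common. This is my own illustration of the technique, not an implementation from beymani.

```python
# Sketch: shared nearest neighbor (SNN) based outlier scoring. Similarity between
# two points is the number of k-nearest-neighbors they have in common; points with
# low average similarity to their own neighbors are outlier candidates.
import numpy as np

def snn_outlier_scores(points, k):
    pts = np.asarray(points, dtype=float)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    n = len(pts)
    knn = [set(np.argsort(dist[i])[1:k + 1]) for i in range(n)]   # skip self
    scores = []
    for i in range(n):
        # average number of shared neighbors with each of i's own k nearest neighbors
        shared = [len(knn[i] & knn[j]) for j in knn[i]]
        scores.append((i, sum(shared) / k))
    # ascending order of average similarity: potential outliers at the head of the list
    return sorted(scores, key=lambda s: s[1])

points = [[1, 1], [1, 2], [2, 1], [2, 2], [10, 10]]
print(snn_outlier_scores(points, k=2))
```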
Fraudulent transactions increase the entropy of our data set. What do I mean by that?
Entropy is a measure of the disorder, or lack of homogeneity, in a data set. When an outlier gets added to the data, entropy goes up. In other words, an outlier makes a data set less homogeneous. For an attribute, entropy is defined as H = -Σ p(x) log p(x), where p(x) is the probability of the attribute value x.
Given a data set, we want to pick those objects whose removal causes the largest drop in entropy. This will require a map reduce job which calculates entropy. We remove one object at a time and calculate the entropy of the remaining data set.
Standard deviation is also a good measure of the homogeneity of data. However, with a mixture of numerical and categorical attributes, there is no meaningful definition of standard deviation. Entropy does not pose any such problem. Another option is to use the Gini index, defined as G = 1 - Σ p(x)², where the sum is over the attribute values x.
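Here is a small Python sketch of the leave one out entropy calculation for a single categorical attribute; the sample data is made up, and the map reduce version would of course look different.

```python
# Sketch: entropy based outlier detection for a single categorical attribute.
# We compute the entropy H = -sum(p * log p) of the data set, then remove one value
# at a time and rank objects by how much their removal lowers the entropy.
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def entropy_drop_ranking(values):
    base = entropy(values)
    drops = []
    for i in range(len(values)):
        rest = values[:i] + values[i + 1:]
        drops.append((i, base - entropy(rest)))
    # largest entropy drop first: those objects are the strongest outlier candidates
    return sorted(drops, key=lambda d: d[1], reverse=True)

products = ["grocery"] * 50 + ["gas"] * 45 + ["jewelry"]   # one rare value
print(entropy_drop_ranking(products)[:3])
```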
Data is projected to a lower dimension using Principal Component Analysis (PCA), Self Organizing Maps (SOM) or neural network encoders. Outliers will lie far from their projected values, i.e., they will have a large reconstruction error.
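As an illustration, here is a PCA based sketch using scikit-learn, where the reconstruction error is used as the outlier score; the data is synthetic and this is not taken from beymani.

```python
# Sketch: projection based outlier scoring with PCA. Points are projected to a lower
# dimension and reconstructed; the reconstruction error serves as the outlier score.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# made-up data: points lying roughly along a line in 3D, plus one point far off it
data = np.vstack([np.outer(rng.normal(size=200), [1.0, 2.0, 3.0]), [[5.0, -5.0, 5.0]]])

pca = PCA(n_components=1).fit(data)
reconstructed = pca.inverse_transform(pca.transform(data))
errors = np.linalg.norm(data - reconstructed, axis=1)

# indices with the largest reconstruction error are the outlier candidates
print(np.argsort(errors)[::-1][:3])
```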
The question that comes up is which technique to use. There is no clear answer. The essential characteristics of an outlier are as follows.
- It occurs infrequently
- It’s far away from other objects
Distribution model based approaches are based on the first condition but not the second; the focus is on how frequently an object occurs. Proximity based techniques rely on the second criterion, which is related to the features of an object. They can be made to satisfy the first condition by appropriately choosing the number of nearest neighbors. Implications of the two characteristics are summarized in the table below.
| | Far away from others | Close to others |
| --- | --- | --- |
| Rarely occurring | Very likely outliers | Potential outliers |
| Not rarely occurring | Potential outliers | Not outliers |
Objects that are not so rare but far from other objects will typically be part of a small cluster. They are ambiguous and it may be difficult to label them. The small cluster could represent a small set of outliers that are similar to each other, or they could be legitimate objects.
Objects that are rare but close to their neighbors will typically lie around a cluster boundary. It may be difficult to interpret them correctly.
I will have some follow-up posts as I add these implementations to beymani. They will have details on the Hadoop map reduce implementations. As I mentioned earlier, some implementations are already part of beymani.