Real time fraud detection is one of the use cases, where multiple components of the Big Data eco system come into play in a significant way, Hadoop batch processing for building the predictive model and Storm for predicting fraud from real time transaction stream using the predictive model. Additionally, Redis is used as the glue between the different sub systems.
In this post I will go through the end to end solution for real time fraud detection, using credit card transactions as an example, although the same solution can be used for any kind of sequence based outlier detection. I will be building a Markov chain model using the Hadoop based implementation in my open source project avenir. The prediction algorithm implementation is in my open source project beymani.
Fraud detection is a particular use case for the general category of problems known as outlier detection. My earlier post contains a review of the various machine learning algorithms for outlier detection.
Outlier detection algorithms generally fall into two broad categories: one class of algorithms focuses on individual data points, the other class of algorithms treat the data as sequence in building the model. All the algorithms implemented in beymani belong to the first category.
Real time fraud detection is only possible if the learning algorithm yields a model, which can be used by the real time detector. There are some outlier detection algorithms which scan through the whole data set for detection e.g., proximity based algorithms. It’s not feasible to deploy them in a real time setting.
Hadoop Processing for Training Model
I will be using credit card transactions as our example. I will be using sequence mining algorithm based on markov chain.
For outlier detection, sequence based algorithms are often more powerful than the instance based algorithms. For example, if we consider a set of transaction for a customer and analyze each transaction independent of the others in the sequence, we may not find anything suspicious.
However, when treated as sequence, it might be a reveal a very different story. A sequence of many transactions within a small time window is very likely to be flagged as fraud by sequence mining based algorithms, but not with instance based algorithms.
Each transaction is encoded by the following three quantities and expressed as a 3 letter token, e.g., HNN
- Amount spent: Low, Normal and High
- Whether the transaction includes high price ticket item: Normal and High
- Time elapsed since the last transaction: Large, Normal and Small
There are 18 (3 x 2 x 3) possible types of transactions in our model. The goal of the markov chain training algorithm is to generate a 18 x 18 transaction state transition probability matrix from the training data. An element P(i,j) for the matrix tells us that that given the last transaction of type i, the probability of the next transaction being of type j is P(i,j).
Here is some input transaction data. The fields are: customer ID, transaction ID and transaction type.
9C9BIS4RP1,Y4G041KT6FRH,HNN YUV0EWU5QP,Q26W2NB7XK5C,HNN 332KJ9BKF1,0DDFV6YF1FRA,LNL QV384WVNA3,20R1G2DRL72U,LNN UCFEZ2KL0I,0A8MT34ZXJ73,MNL H31RUNWDMX,O8112VMX3JRQ,MHS
Since we need to get all the transactions for a given customer as a sequence, we perform a group by operation by running the data through the MR class Projection.
Next we run the output of the MR class Projection through the MR class MarkovStateTransitionModel to generate the 18×18 state transition probability matrix. Here is some output, showing only some sample rows of the matrix
0.09550118389897395,0.13338595106550907,0.05919494869771113,0.027624309392265192,0.03157063930544594,0.011838989739542225,0.16416732438831885,0.19494869771112866,0.08524072612470403,0.021310181531176007,0.035516969218626675,0.017363851617995266,0.03867403314917127,0.043409629044988164,0.018153117600631413,0.007892659826361484,0.011049723756906077,0.0031570639305445935 0.09809932556713673,0.11955855303494789,0.05763335377069283,0.011649294911097487,0.023911710606989576,0.012875536480686695,0.16799509503372165,0.21275291232372778,0.08890251379521766,0.024524831391784182,0.043531575720416923,0.015328019619865114,0.03126916002452483,0.05150214592274678,0.028203556100551808,0.003678724708767627,0.0061312078479460455,0.002452483139178418 0.09986859395532194,0.12746386333771353,0.056504599211563734,0.009198423127463863,0.02890932982917214,0.005256241787122208,0.16031537450722733,0.21813403416557162,0.08015768725361366,0.023653088042049936,0.026281208935611037,0.01576872536136662,0.03942181340341656,0.06438896189224705,0.01971090670170828,0.010512483574244415,0.010512483574244415,0.003942181340341655
So far we have built the Markov chain model using training data. Next, we deploy the model in Storm and predict fraud in real time transaction data stream.
Storm for Real Time Prediction
Storm is a clustered, scalaeble real time stream processing engine. My earlier post contains an overview of Storm architecture. There are two kinds of processing elements in Storm : Spout for ingesting realtime stream and Bolt for processing the stream data. I am using Redis as a message queue for incoming transactions and also as a cache for storing the Markov chain.
The class OutlierPredictor implements the storm topology. The topology has one spout for ingesting data from Redis, implemented by the RedisSpout class. The bolt class is PredictorBolt. It’s a generic class, into which various predictive models can be plugged in. I am using field grouping on the customerID field for the bolt, so that all the transactions for a customer will go to the same bolt instance.
For a transaction sequence for for customer, there is a corresponding path through the state transition probability matrix and there is a probability associated with that path. A lower probability value of that path implies higher chances of fraudulent activity.
In other words, a less traveled path through the state transition probability matrix corresponds to potential outliers i.e., fraudulent transactions.
One issue that needs attention is whether sequence analysis should be done globally over all transactions from the transaction stream for a customer or over a sliding window of pre defined size.
This choice is made through the configuration parameter local.predictor. if this parameter is set to true, then the sliding window size is set through the configuration parameter state.seq.window.size. With a smaller window size the detection algorithms is more sensitive to the data. Proper choice of the window size is an important issue.
For local predictor, transaction data of sliding window size for each customer is kept in a cache in the bolt. For global predictor, the only the two most recent transactions are kept track of. This solution is not fault tolerant. If the bolt node crashes, the cached data will be lost.
Outlier Sequence Metric
We glossed over what constitutes an outlier sequence. Now we will define some concrete metric as below. The details can be found in the technical paper Markov Chains, Classifiers and Intrusion Detection by S Jha, K Tan and R A Maxion. For all the metric, a higher value indicates outliers i.e, potential fraudulent transactions.
The first one is called Miss Probability metric. For any pair of consecutive transaction states t(i) and t(j) in a sequence, the following quantity is calculated. For the row corresponding to t(i), we are summing all the probabilities except for the target state t(j).
Then we sum F over all the transaction state pairs in the sequence and normalize by the number of such pairs.
For the second metric called Miss Rate metric, the quantity F is calculates as below. For any transition if transition corresponds to the maximum probability target state, the value is 0, otherwise it’s 1.
Again we sum F over all the consecutive transaction state pairs in the sequence and normalize by the number of such pairs.
The third metric is called Entropy Reduction metric. We calculate two quantities F and G as below. For given row F is the entropy excluding target state for the state pair under consideration. G is the entropy for the whole row.
We sum F and G over all consecutive state pairs and divide the two sums.
I have used the local predictor with a sliding window size of 5. I have used the miss probability metric. The threshold value for the metric is set through the parameter metric.threshold.
Whenever the metric exceeds the threshold, the storm bolt writes the customer ID, transaction sequence and the metric value to a Redis queue. Here is some sample output form the Redis output queue.
F37TXQP4DU : HNN LHL HNN LHS LHS : 0.9811746458811866 EQS5014S3W : MNL HHN MHL LNS MHS : 0.9766672286230776 3B2VYOK31L : MNN MHL MHN LHS LHN : 0.9756136034853501
These results simply raise a red flag. Obviously, they need to be complemented with other analysis and processes, before we can come to the firm conclusion that there are fraudulent activities.
How do we choose the right metric threshold. Unfortunately, there is no easy answer. The best one can do is to start with a value and iteratively refine it. If the value is too low, there will be false positives. On the other hand if it’s too high, there will be false negatives.
False negatives are costlier, because they may let fraudulent activities slip by. So it’s better to start with a lower threshold value. Based on the outcome of further investigation of whether the cases identified are truly fraudulent or not, the threshold could be gradually increased.
Redis as Glue
Redis is a key value store, with the unique characteristic that you can have different data structures for the value. Depending on the data structure, Redis can morph into playing various roles. If the value is a list, it can behave as a queue or stack. If the value is a map, it can be used as a distributed cache.
Redis has played a significant role in our solution in many ways. We have used Redis as a queue to pump transaction stream into Storm. The predictor bolt writes potentially fraudulent transaction sequence to another queue. The trained model is stored as a key, value pair. Here is an overview of Redis in one of my earlier posts.
The Markov chain model as created by Hadoop MR jobs is written as a text blob to Redis. The predictor bolt when it initializes, reads the model definition from Redis and converts it to a state transition probability matrix, which is cached.
As I have shown in this post, you can build powerful end to end big data solutions by tying together different components effectively. Here is a tutorial document with step by step instructions to run the example.