Real time fraud detection is one of the use cases, where multiple components of the Big Data eco system come into play in a significant way, *Hadoop* batch processing for building the predictive model and *Storm* for predicting fraud from real time transaction stream using the predictive model. Additionally, *Redis* is used as the glue between the different sub systems.

In this post I will go through the end to end solution for real time fraud detection, using credit card transactions as an example, although the same solution can be used for any kind of sequence based outlier detection. I will be building a *Markov* chain model using the *Hadoop* based implementation in my open source project *avenir*. The prediction algorithm implementation is in my open source project *beymani*.

## Fraud Detection

Fraud detection is a particular use case for the general category of problems known as outlier detection. My earlier post contains a review of the various machine learning algorithms for outlier detection.

Outlier detection algorithms generally fall into two broad categories: one class of algorithms focuses on individual data points, the other class of algorithms treat the data as sequence in building the model. All the algorithms implemented in *beymani* belong to the first category.

Real time fraud detection is only possible if the learning algorithm yields a model, which can be used by the real time detector. There are some outlier detection algorithms which scan through the whole data set for detection e.g., proximity based algorithms. It’s not feasible to deploy them in a real time setting.

## Hadoop Processing for Training Model

I will be using credit card transactions as our example. I will be using sequence mining algorithm based on markov chain.

For outlier detection, sequence based algorithms are often more powerful than the instance based algorithms. For example, if we consider a set of transaction for a customer and analyze each transaction independent of the others in the sequence, we may not find anything suspicious.

However, when treated as sequence, it might be a reveal a very different story. A sequence of many transactions within a small time window is very likely to be flagged as fraud by sequence mining based algorithms, but not with instance based algorithms.

Each transaction is encoded by the following three quantities and expressed as a 3 letter token, e.g., HNN

*Amount spent:***L**ow,**N**ormal and**H**igh*Whether the transaction includes high price ticket item:***N**ormal and**H**igh*Time elapsed since the last transaction:***L**arge,**N**ormal and**S**mall

There are 18 (3 x 2 x 3) possible types of transactions in our model. The goal of the markov chain training algorithm is to generate a 18 x 18 transaction state transition probability matrix from the training data. An element P(i,j) for the matrix tells us that that given the last transaction of type i, the probability of the next transaction being of type j is P(i,j).

Here is some input transaction data. The fields are: customer ID, transaction ID and transaction type.

9C9BIS4RP1,Y4G041KT6FRH,HNN YUV0EWU5QP,Q26W2NB7XK5C,HNN 332KJ9BKF1,0DDFV6YF1FRA,LNL QV384WVNA3,20R1G2DRL72U,LNN UCFEZ2KL0I,0A8MT34ZXJ73,MNL H31RUNWDMX,O8112VMX3JRQ,MHS

Since we need to get all the transactions for a given customer as a sequence, we perform a group by operation by running the data through the MR class *Projection*.

Next we run the output of the MR class Projection through the MR class *MarkovStateTransitionModel* to generate the 18×18 state transition probability matrix. Here is some output, showing only some sample rows of the matrix

0.09550118389897395,0.13338595106550907,0.05919494869771113,0.027624309392265192,0.03157063930544594,0.011838989739542225,0.16416732438831885,0.19494869771112866,0.08524072612470403,0.021310181531176007,0.035516969218626675,0.017363851617995266,0.03867403314917127,0.043409629044988164,0.018153117600631413,0.007892659826361484,0.011049723756906077,0.0031570639305445935 0.09809932556713673,0.11955855303494789,0.05763335377069283,0.011649294911097487,0.023911710606989576,0.012875536480686695,0.16799509503372165,0.21275291232372778,0.08890251379521766,0.024524831391784182,0.043531575720416923,0.015328019619865114,0.03126916002452483,0.05150214592274678,0.028203556100551808,0.003678724708767627,0.0061312078479460455,0.002452483139178418 0.09986859395532194,0.12746386333771353,0.056504599211563734,0.009198423127463863,0.02890932982917214,0.005256241787122208,0.16031537450722733,0.21813403416557162,0.08015768725361366,0.023653088042049936,0.026281208935611037,0.01576872536136662,0.03942181340341656,0.06438896189224705,0.01971090670170828,0.010512483574244415,0.010512483574244415,0.003942181340341655

So far we have built the *Markov* chain model using training data. Next, we deploy the model in Storm and predict fraud in real time transaction data stream.

## Storm for Real Time Prediction

*Storm* is a clustered, scalaeble real time stream processing engine. My earlier post contains an overview of *Storm* architecture. There are two kinds of processing elements in Storm : *Spout* for ingesting realtime stream and *Bolt* for processing the stream data. I am using *Redis* as a message queue for incoming transactions and also as a cache for storing the M*arkov* chain.

The class *OutlierPredictor* implements the storm topology. The topology has one spout for ingesting data from *Redis*, implemented by the *RedisSpout* class. The bolt class is *PredictorBolt*. It’s a generic class, into which various predictive models can be plugged in. I am using field grouping on the *customerID* field for the bolt, so that all the transactions for a customer will go to the same bolt instance.

For a transaction sequence for for customer, there is a corresponding path through the state transition probability matrix and there is a probability associated with that path. A lower probability value of that path implies higher chances of fraudulent activity.

In other words, a less traveled path through the state transition probability matrix corresponds to potential outliers i.e., fraudulent transactions.

One issue that needs attention is whether sequence analysis should be done globally over all transactions from the transaction stream for a customer or over a sliding window of pre defined size.

This choice is made through the configuration parameter *local.predictor*. if this parameter is set to true, then the sliding window size is set through the configuration parameter *state.seq.window.size.* With a smaller window size the detection algorithms is more sensitive to the data. Proper choice of the window size is an important issue.

For local predictor, transaction data of sliding window size for each customer is kept in a cache in the bolt. For global predictor, the only the two most recent transactions are kept track of. This solution is not fault tolerant. If the bolt node crashes, the cached data will be lost.

## Outlier Sequence Metric

We glossed over what constitutes an outlier sequence. Now we will define some concrete metric as below. The details can be found in the technical paper *Markov Chains, Classifiers and Intrusion Detection by S Jha, K Tan and R A Maxion*. For all the metric, a higher value indicates outliers i.e, potential fraudulent transactions.

The first one is called **Miss Probability** metric. For any pair of consecutive transaction states t(i) and t(j) in a sequence, the following quantity is calculated. For the row corresponding to t(i), we are summing all the probabilities except for the target state t(j).

F(t(i), t(j)) = Sum(P(t(i), t(k)) | k != j)

where P(t(i), t(k)) is the probability of transitioning from transaction state t(i) to t(k)

Then we sum F over all the transaction state pairs in the sequence and normalize by the number of such pairs.

For the second metric called **Miss Rate** metric, the quantity F is calculates as below. For any transition if transition corresponds to the maximum probability target state, the value is 0, otherwise it’s 1.

F(t(i), t(j)) = 0 if t(j) = t(k) else 1

where t(k) is the target state when P(t(i), t(k)) = max(P(t(i), t(l)) for all l

Again we sum F over all the consecutive transaction state pairs in the sequence and normalize by the number of such pairs.

The third metric is called **Entropy Reduction** metric. We calculate two quantities F and G as below. For given row F is the entropy excluding target state for the state pair under consideration. G is the entropy for the whole row.

F(t(i), t(j)) = sum (-P(t(i), t(k)) log(P(t(i), t(k)) | t(k) != t(j)

G(t(i)) = sum (-P(t(i), t(k)) log(P(t(i), t(k))

We sum F and G over all consecutive state pairs and divide the two sums.

## Predictor Output

I have used the local predictor with a sliding window size of 5. I have used the miss probability metric. The threshold value for the metric is set through the parameter *metric.threshold*.

Whenever the metric exceeds the threshold, the storm bolt writes the *customer ID, transaction sequence and the metric value* to a *Redis* queue. Here is some sample output form the Redis output queue.

F37TXQP4DU : HNN LHL HNN LHS LHS : 0.9811746458811866 EQS5014S3W : MNL HHN MHL LNS MHS : 0.9766672286230776 3B2VYOK31L : MNN MHL MHN LHS LHN : 0.9756136034853501

These results simply raise a red flag. Obviously, they need to be complemented with other analysis and processes, before we can come to the firm conclusion that there are fraudulent activities.

## Metric Threshold

How do we choose the right metric threshold. Unfortunately, there is no easy answer. The best one can do is to start with a value and iteratively refine it. If the value is too low, there will be false positives. On the other hand if it’s too high, there will be false negatives.

False negatives are costlier, because they may let fraudulent activities slip by. So it’s better to start with a lower threshold value. Based on the outcome of further investigation of whether the cases identified are truly fraudulent or not, the threshold could be gradually increased.

## Redis as Glue

*Redis* is a key value store, with the unique characteristic that you can have different data structures for the value. Depending on the data structure, *Redis* can morph into playing various roles. If the value is a list, it can behave as a queue or stack. If the value is a map, it can be used as a distributed cache.

*Redis* has played a significant role in our solution in many ways. We have used Redis as a queue to pump transaction stream into Storm. The predictor bolt writes potentially fraudulent transaction sequence to another queue. The trained model is stored as a key, value pair. Here is an overview of *Redis* in one of my earlier posts.

The Markov chain model as created by Hadoop MR jobs is written as a text blob to Redis. The predictor bolt when it initializes, reads the model definition from Redis and converts it to a state transition probability matrix, which is cached.

## Wrapping Up

As I have shown in this post, you can build powerful end to end big data solutions by tying together different components effectively. Here is a tutorial document with step by step instructions to run the example.

great article!

Thanks,

Marco

Thanks for your series of articles. I’m having trouble understanding the diff between HMM and Markov Chains. I read this http://www.wilmott.com/messageview.cfm?catid=8&threadid=94129 but do you have a pointer to a better explanation (for non stats background folks)?

Thanks again and keep writing!

Ashwin

In plain markov model, as in this post, we deal with events that are observable. In hidden markov model, additionally there are hidden states. In this context, we may have two hidden states: “normal” and “fraud”.

In HMM. the hidden states are related to the observation in a probabilistic way (emission probability). The transitions between the hidden states are also defined probabilistically(state transition probability).

Any good machine learning book should be helpful. My personal favorite is :

http://www.amazon.com/Introduction-Machine-Learning-Adaptive-Computation/dp/026201243X/ref=sr_1_13?s=books&ie=UTF8&qid=1382935086&sr=1-13&keywords=machine+learning

I like this paper too

http://www.cs.sjsu.edu/~stamp/RUA/HMM.pdf

BTW, I am not a stats person either. I am pretty much self taught in this area.

Thanks!

Hi,

I found your blog via a presentation (outlier and fraud detection) posted on slideshare, and you literally ruined my schedule for yesterday – and probably for today as well -, I can’t stop reading your blog.

Receantly I started to dig into these two topics (hadoop, fraud detection) at the same time and it’s really supersonic to read about them in the same place.

Now Hadoop – in generally the noSQL based distributed, parallel real-time/batch processing architectures – is getting clear. The concept is quite cool however as far as I see it’s just a very-very sophisticated and well-armoured log analyzing “ecosystem” which enables to reduce the IT related time&cost parameters for business problems which – to get real shiny achievements – need to be handled on extremly large (mostly unstructured) data set.

So, this GREAT log analyzing application-set opens many, long-year-closed doors by noSQL and mapr.

Ok, checked.

But fraud detections – actually the beginnings, the first considerations (and constraints), its natural evolving – is very messy in my head due to the mistery around this topic (and due to my limitation of course.. ).

Could you please light me up a little bit on this? Here are some – quite lame – question:

1.When it was introduced at the first time in Banking why it looked like a good idea to apply the supervised learning?

Why was it good to feed your outlier detection mechanism with (unsufficient by nature) training data then evaluate a super-complex model, to do it for several months, and to get a heart attack if something is changed in the source data structure and begin to create a new more-super-complex model?

What was the technical/business constraints which forced this approach?

Or just the v&v&v of the source data was so low, that this approach fits to the frame?

And now when the v&v&v gets so high that nor the RDBMS nor the usual feed-then-model-it method can cover the new and always evolving fraud activites?

2.On the other hand, it would be great to read about if you are aware of any benchmark which was evaluated on the same big, batch sample data for different type of fraud detection methods. (for example: O(~10million) depersonalized, historical mixed test data-set (clickstream,account logs, transaction logs, etc) from a real(-like) environment, and then race the linear combination of supervised/unsupervised learning, instance/sequence based , model/memory based solutions.

And any similar benchmark for real time detection comparsion. (Actually what acpects could differ the optimal solutions for realtime and for batch processing – if there is any -? I.e. considering the same big data, if you get it in realtime or you have it in a batch, can you apply the same outlier detection approach for both situation? or speed/performance/other aspect would differ it?)

3.On the other hand++ to read a post from You, which summarizes ALL the different approaches for outlier detection in one chart both for batch and realtime processing would be fantastic.

Cheers,

gabor

Gabor

I am glad you found the blog useful and thanks for the compliments. Hadoop is great for batch processing, but not really meant for real time processing. Storm and Samza are two good open source products for real time processing.

For model based fraud detection, I would say unsupervised learning is better. What characterizes as fraud is continuously evolving due new new threat pattern etc. So unsupervised learning is likely to be more effective, as in this post. While training the model, I have not labelled any data as fraudulent.

The so called memory based approach (I call them full data scan based approach e.g, proximity based) can not be deployed for real time detection. Because they generally require batch processing in Hadoop scanning the whole data set.The only way that could be feasible for real time deployement is to store the data in a NOSQL database with proper indexing for fast access.

I am not aware of any such benchmark that you are referring to.

Hi Mr.Pranab,

I just came across your blog and you wouldn’t know how many concepts I have learnt from your blog.First of all thanks for that.

Now I read through your post about “smarter email marketing using markov model” and this one both of which employs markov models.And also i saw you mentioning in the “email” post that markov models are mainly used in intrusion detection systems.I too have a similar scenario.

My requirement is I am building a prediction model to detect SYN-FLOOD attacks.

Now here is what i came up with after reading through your posts and other things on the internet.

Now assume the dataset in my case is going to be IP addresses(analogous to Customer ID in the email post) followed by whether the request sent by the client was ‘SYN’ or ‘ACK’ along with the time I received them.After some processing like grouping assume I have ordered information with respect to a particular IP as follows :

IPAddr ; Time difference between the last two requests ; Change in the type of request (ie whether SYN or ACK is recieved) of the last two requests.

Similar to your “email” post let us assume three states S : Small , M : Medium , L : Large to be the quantifier for the time difference between the two requests.

And let R and W be the states that denotes the change in the type of request

R – last packet was an ACK and the one previous before that a SYN

W – last packet was an SYN and the one previous before that a SYN

For eg.)If my dataset for IP 192.168.1.3 to my computer (192.168.1.2) is

192.168.1.3 12:36:51 SYN

192.168.1.3 12:37:00 ACK

and for IP 192.168.1.4 to my computer (192.168.1.2) is

192.168.1.4 12:38:42 SYN

192.168.1.4 12:38:49 SYN

Then the output after processing would be

192.168.1.3 SR

192.168.1.4 SW

So now we have six states in our method (3 states for time and 2 states for the sequence of packet)

So now I need to build the state transition matrix.Assume that I have built it.So after building the state transition probability matrix now I need to analyse the live network traffic and give into the model the same data that was given earlier to train it and now the model should predict in real time whether the current requests are likely to be a potential SYN-FLOOD attack or not.

I am clear till now.Now I have two problems.

1.)Now you have explained all these processing of things and building the state transition probability matrix using Hadoop.I havent used Hadoop or am familiar with it.So is there any other way other than Hadoop so that i can achieve what you have done.And also since i need to analyse real time traffic like your credit card example can you please suggest me some other alternative with which I can achieve it similar to STORM.

2.)My second question is whether the above problem of detecting SYN Flood attack be done using Hidden Markov Model.If Hidden Markov Model is used all I could think of is the states such as S,M,L,R,W are observations and the hidden states to be “Attack” and “Normal”.So given the sequence of observations for a particular IP we need to predict the hidden state which is “Attack” or “Normal”.But the point where I get confused is if I use this approach how can I build the state transition matrix,the emission matrix and also the initial probability matrix called as Pi.

I came across a few codes that implements Hidden Markov Model but since I am not able to understand how to design these matrices I dont know how to use the code.So in case if you have a sample application involving Hidden Markov Model and if you can share that with me (similar to your posts here) it would be of very much help.

Since this project is not going to analyse very very large streams of real time data and rather only from a single IP (because this is just a POC we are building) it would be of much help if you can guide me in achieving this through some other method instead of Hadoop.

Also I have posted these questions on StackOverflow and Quora but nobody answered.Hopefully I will hear back from you.Sorry for the long post.And thanks once again for the information you have provided in your blog.

Jayanth

good to hear that my blog helped you . I will make some quick comments, but we should continue the discussion through email. My email is pkghosh@yahoo.com. Alternatively, you could also post it as an issue in github.

If you are going to be dealing with large data set in production, there is no harm using Hadoop for POC. If you didn’t want to use Hadoop, you could look at avenir code to do pure java (or any other language) implementation. I have been meaning to decouple all my algorithm implementations from Hadoop, so that they could be used in a pure java environment also as an option. But that has not happened yet. If you want to real time prediction, Storm is a good choice. Other option is Kafka

In Hidden Markov Model. You have train the model. This can be done is two ways 1. Manually tag the the data stream with hidden states (as in NLP). I have this implemented this. 2. Build the model with Baum Welch algorithms (don’t have this implemented). Once you have the model trained, toy could predict hidden states from observation sequence using viterbi algorithm (I have this implemented, but only in Hadoop).

Hi Pranab,

I found your blog very useful for detecting the credit card fraudulent transactions using Markov chain modeling. I’m also doing something similar to it.

Regarding metrics to evaluate the built model, you have mentioned the methods which deal with a sequence of transactions using a window size. For ex: If window size is 5, every 5 transactions will end up in a result value which suggests if the sequence has a fraud or not (correct me if I’m wrong).

However, in real time processing of credit card transactions, I’ll be interested to compute the metric’s result for an incoming transaction. How do we compute in such case ? and how do we decide on threshold to say it as outlier or not.

Siva

Although I have used a sequence based algorithm, the latest transaction does contribute to the prediction. As a transaction arrives, it displaces the earliest transaction from the the window and adds itself. Then the algorithm runs on the new sequence.

If you want prediction based only on the newly arriving transaction, you have to use an instance based algorithm e.g., multivariate distribution. I have a post on that topic that you can look up.

I have been asked multiple times about the metric threshold. I have added a new section on that issue in this post.

Hi Pranab,

Thanks for the quick reply.

I got your point on displacing the earliest transaction to retain the window size for an incoming transaction. Having said, to be more precise, lets consider a case where I train a model(constructing state transition probability matrix) based on a given set of N transactions which forms training set. I’ve moved this model to production. Now consider that a new transaction (N+1) comes from a particular user’s card. At this point of time, I don’t have any window sequence of transactions to add the new one. Now how will I calculate the probability of this card’s transaction’s state transition. If my window size is 5, I can only start computing the state transition probability for (N+5)th transaction. How will detect whether the transactions from (N+1) to (N+4) are anomalous or not ? Please suggest.

Thanks.

Siva

I am not sure I understand your issue. First of all, as my post clearly says I am doing sequence based detection using a window size of n. When a new transaction arrives and the algorithm detects fraud, it’s NOT attributed to the last transaction but to all n transactions in the window.

As I said in my earlier reply, if you want to predict based on 1 transaction, you have to use instance based algorithm. The current Storm implementation may still work., but I have to verify. You have to set the window size to 1 and change the algorithm to multivariate distribution or something. The algorithm is pluggable.

If the window size is 5, then the following will happen

– It will start processing only when the window has 5 transactions

– Then on as new transactions arrive, it will process 2nd through 6th, 3rd through 7th and so on

Thank you Pranab.

Does it works on hadoop 2.x ?

I don’t know is it the best solution but I got work around

in pom.xml i had to add:

org.apache.hadoop

hadoop-common

2.4.1

and I use hadoop-core 2.3.0-mr1-cdh5.1.0

Margus

It has not been ported to hadoop2. Hadoop is used to calculate the correlation matrix. If you want to port to Hadoop2, I will be happy to create a branch in github. I have to keep the master branch for Hadoop1, so that it remains compatible with EMR

that explains why output directory is empty. We are looking for mahout or spark solution to detect outliers. If there is not quick examples for prototype then we may be need to go for porting hadoop1 to hadoop2.

Margus

This link seems seems to have lot of relevant info

http://hortonworks.com/blog/running-existing-applications-on-hadoop-2-yarn/

Margus

Please use the branch nuvo of avenir for Hadoop 2. The project home page in github has instructions for yarn and non yarn build

Pingback: Anomaly Detection: Concepts and Techniques | My views of the World and Systems

Hi, I am curious whether you have ever tested the response time of the fraud detection? Usually, the resonse time required for fraud detection is around 50ms. Many thanks for your reply.

Reblogged this on My Big Data and Machine Learning Hub.

Informative post !

Thanks