Time Series Sequence Anomaly Detection with Markov Chain on Spark


There are many techniques for time series anomaly detection. In this post, the focus is on sequence based anomaly detection of time series data with Markov Chain. The technique will be elucidated with a use case involving data from a health monitoring device. Anomaly detection is critical for this kind of health monitoring data, since it may indicate potential harmful health condition.

The spark implementation is available in my open source project beymani on github. The complete solution also uses my other open source projects avenir and chombo. As with all my other open source Spark implementation, it is agnostic of any specific application. Generous use of configuration and meta data enables us to do that.

Anomaly Detection Technique Taxonomy

Anomaly detection techniques can be categorized along 3 aspects. The first aspect is based on the context of the data. If the focus of anomaly detection is a single record, it called point anomaly. If the the anomaly detection technique hinges on the sequential order of the data, it’s called sequence anomaly. When the context of sequential data is time, it’s called time series data. The choice between point anomaly and sequence anomaly should be based on the type of problem being considered.

The second aspect is related to the dimension of the data. Univariate anomaly detection pertains to scalar data. For vector data, multi variate anomaly detection anomaly detection techniques are applicable. Problem at hand will dictate whether univariate or multi variate technique will be used.

The third aspect is regarding whether the technique falls under the category of supervised or unsupervised machine learning. Supervised machine learning techniques are rarely applied for anomaly detection, because of lack of labeled outlier data. Even if labeled data is available, it will be highly un balanced and the related problems need to be circumvented.

Based on the categories described above, the anomaly detection solution for the health monitoring device data under consideration can be categorized as below

  • Sequential
  • Univariate
  • Unsupervised

Markov Chain Sequence Anomaly Detection

A first order Markov Chain is a finite state machine, where there is probability associated with transition from one state to another. In first order Markov Chain, the probability of transition to a future state from the current state depends on the current state only and not any earlier state.

For higher order Markov Chain, past states come into the picture in predicting the future state. For example for second order Markov Chain, the future state will depend on the current state and the immediate past state.

Markov Chain is applicable for symbolic sequence data. How is that applicable for time series data then? As we will see later, we discretize numeric time series data to convert time series to symbolic sequence.

Unsupervised anomaly detection techniques generally follow these broad steps. The details of the first 2 steps will depend on the nuances of a particular algorithm. For sequence anomaly score, a sliding window of specified size is used.

  • Build model based on training data
  • Calculate point or sequence anomaly score for anomaly detection
  • Label a point or sequence as anomalous based on user defined threshold

With a Markov Chain model, there are several techniques for calculation anomaly score as listed below. The intuition behind all of them is that any sequence that has low probability of occurrence, as determined by the Markov Chain will have have a high anomaly score.

  • Miss probability
  • Miss rate
  • Entropy reduction
  • Sequence probability

Details of these techniques for calculating anomaly score are available. For the use case under consideration, we have used the last technique i.e Sequence probability. The sequence probability is estimated by multiplication of a set of first order conditional probabilities within a window. Since multiplication of several probabilities produces very small number, for pragmatic reasons, we will add negative log probabilities instead.

Health Monitoring Device Data

We will be using a simulated data from a blood sugar monitoring data. The synthetic data has been generated with a python script. The data contains the following fields

  • Device ID
  • Timestamp
  • Measurement

The blood sugar measuring device generates measurement once a day. Here is some example data.

U93HY849EK8R,1546442006,93
63Q8PQ07D6W5,1546440487,99
972733HJY71O,1546440535,93
88MCYTEAE40P,1546440779,164
K5RY9W52R3JK,1546441897,81
QD9NOHYMA7ET,1546440979,101
W8W05ZS0Q202,1546440853,89

The reason we are detecting anomaly based on sequence is based on what is considered anomalous for this problem. For this problem it is anomalous when measurement goes to some extreme region and hovers around there for several consecutive days. Occasional straying into some extreme value is not considered anomalous.

Model Building Spark Job

Model building for Markov Chain involves two steps. In the first step the data is discretized, to convert measurement to discrete states. The data transformation spark job is used for this purpose.

The next step is to calculate state transition probability matrix. The spark job MarkovStateTransitionModel is used for this purpose. Here is part of the output. The first line is the key. Following the first line, we have one line for every row of the state transition probability matrix.

all
0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029,0.029
0.006,0.006,0.006,0.006,0.057,0.096,0.223,0.115,0.115,0.096,0.070,0.051,0.013,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006
0.005,0.005,0.005,0.005,0.068,0.068,0.089,0.284,0.079,0.100,0.063,0.047,0.032,0.005,0.005,0.005,0.005,0.011,0.005,0.011,0.016,0.005,0.011,0.005,0.005,0.005,0.005,0.005,0.005,0.005,0.005,0.011,0.005,0.005,0.005
0.003,0.003,0.003,0.003,0.061,0.058,0.074,0.112,0.282,0.119,0.074,0.064,0.032,0.010,0.006,0.003,0.006,0.003,0.003,0.006,0.010,0.006,0.006,0.003,0.006,0.003,0.010,0.003,0.003,0.006,0.003,0.003,0.003,0.003,0.003
0.001,0.001,0.001,0.001,0.009,0.012,0.036,0.229,0.367,0.148,0.062,0.032,0.012,0.004,0.001,0.001,0.005,0.006,0.004,0.007,0.005,0.004,0.005,0.005,0.005,0.005,0.004,0.007,0.006,0.005,0.005,0.005,0.001,0.001,0.001

We see “all” in the first line. What’s this about. We have two options in building the model as below

  • One model per key (per device for our use case)
  • One global model

We have selected the global option for our use case. In this case it generates an artificial key called “all”.

Since the training data may not cover all the regions of the data space, it’s necessary to mitigate the problem with a correction. The correction adds 1 to all the state transition counts, so that there is no state transition with count of 0. For this technique to work properly for Markov Chain based prediction, it is necessary that the training data contains some occasional stray into extreme regions.

Anomaly Detection Spark Job

There are two steps for anomaly detection. As in model building, the first step is to discretize the data using data transformer Spark job.

The second step is the anomaly detection which is performed by the Spark job The second step is the actual detection which is performed by the Spark job MarkovChainPredictor. The 2 important parameters for this spark job is the window size and anomaly score threshold.

For each key. the data is sorted by ascending order of the time stamp. The window slides through the data, one data point at a time. As sequence anomaly is detected for the sequence, all the data points within the window get marked anomalous. Here is some sample output

SCWVCS4AJ0U5,1561475585,17,0.745768,N
SCWVCS4AJ0U5,1561561634,18,0.777908,N
SCWVCS4AJ0U5,1561647202,18,0.731116,O
SCWVCS4AJ0U5,1561733508,25,0.919162,O
SCWVCS4AJ0U5,1561820996,37,0.973091,O
SCWVCS4AJ0U5,1561905895,38,0.990509,O
SCWVCS4AJ0U5,1561992551,36,0.989914,N
SCWVCS4AJ0U5,1562079348,36,0.985103,N
SCWVCS4AJ0U5,1562165170,27,0.982848,N

When the 6th record enters the sliding window of size 4, the anomaly score for the sequence of 4 records exceeds the threshold which is set at 0.99. As a result, all 3 preceding records are marked as anomalous.

Threshold Parameter

Outliers are in the eyes of the beholder. The outlier score threshold decides when a a data point or sequence is an outlier or not. Proper selection and tuning of the outlier threshold parameter is critical. for the success of any anomaly detection system

Any unsupervised anomaly detection algorithms will generate a score. Based on the score and a user threshold parameter, a record is declared an outlier. How do we know what should be the correct value. We actually don’t know it and we can make an educated guess. However, we may end up with false positive or false negatives, if the threshold is not set properly.

There are two ways to mitigate the problem. Both approaches require the threshold tuning to be done continuously. The frequency of threshold parameter tuning will be dictated by how dynamic the underlying data is.

The first approach, which is not the ideal, can be used when no user feedback is available. We decided on target percentage of outlier we would like to see and tune the threshold parameter accordingly.

The second approach is better and it requires user feedback. Based on user feedback records are marked as as false positive, false negative or correct. The labeled data is fed into a supervised machine learning algorithm to find more accurate threshold level, that will minimize the number of false postives or false negatives.

Sometimes, having one global anomaly score threshold level is not sufficient. Consider an IoT scenario, where outliers are detected for different kind of sensor data.The difference in the dynamics of the different signals may be significant and having different threshold parameter for each signal is the only solution. For our use case per device threshold is not necessary and we a global threshold parameter.

Key specific threshold parameters, if used is provided through a file, which gets loaded when the Spark job starts.

Seasonality

If the data is seasonal, then models need to be built for each seasonal component for each key for the particular seasonality type. The data for our uses case is not seasonal. Here is an anomaly detection use case with seasonality in data.

The first step is to detect the seasonality type in the data. To enable seasonality for anomaly detection, a set of configuration parameters need to be set appropriately.

Other Sequence Anomaly Detection Algorithms

There are many other sequence anomaly detection algorithms for numerical data. Some of the common algorithms are listed below

  • Forecast
  • Nearest sub sequence
  • Sub sequence frequency distribution
  • Sub sequence clustering

With forecast based methods, the difference between forecasted and the actual value is used to compute outlier score.With nearest subsequence, distance to the nearest kth subsequence is used to calculate outlier score. In sub sequence frequency distribution based method, outlier score is inversely related to frequency. With clustering approach, the outlier score is determined by the distance to nearest sequence cluster.

Wrapping Up

We have gone through a Markov Chain based sequence anomaly detection technique and applied to health monitoring time series data. To execute the use case please follow the steps in the tutorial.

Support

For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud including installation, configuration and testing.

Advertisements

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Anomaly Detection, Big Data, Data Science, Machine Learning, Outlier Detection, Scala, Spark and tagged , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s