Time Series Seasonal Cycle Detection with Auto Correlation on Spark


There are many benefits of auto correlation analysis on time series data, as we will describe in detail later. It allows us to gain important insights into the nature of the time series data. Cycle detection is one of them. To put things in context, we will use cycle detection for energy usage time series data as an example to demonstrate the usefulness of auto correlation.

The Spark implementation is available in my open source project ruscello on github. This project has various Spark based implementations of time series analysis.

Auto Correlation Defined

Auto correlation measures how a data point correlates with past data points at some time lag. Typically, auto correlation is calculated for a range of lag values. Generally, the auto correlation value is normalized by the variance, so the normalized auto correlation coefficient ranges from +1 to -1. The definition is as follows

c_k = (1/n) ∑_t (x_t - m)(x_{t-k} - m)
r_k = c_k / c_0
where
c_k = auto covariance for lag k
c_0 = variance
r_k = auto correlation coefficient for lag k
x_t = value at time t
m = mean
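
To make the formula concrete, here is a minimal sketch in Scala that computes the auto correlation coefficient for a single in-memory series and one lag value. The function name and signature are illustrative and are not part of the ruscello implementation.

// auto correlation coefficient r_k for one lag, per the formula above
def autoCorrelation(x: Array[Double], lag: Int): Double = {
  val n = x.length
  val m = x.sum / n
  // c0: variance (auto covariance at lag 0)
  val c0 = x.map(v => (v - m) * (v - m)).sum / n
  // ck: auto covariance at the given lag
  val ck = (lag until n).map(t => (x(t) - m) * (x(t - lag) - m)).sum / n
  ck / c0
}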

Auto correlation can be very useful in gaining insight into time series data, as listed below.

  • If the auto correlation is close to 0 for all lags, the time series is random. If there are distinct peaks, the time series has an underlying pattern.
  • If the time series is stationary, the auto correlation decays quickly with increasing lag. The slower the decay, the more non-stationary the time series.
  • If the time series is stationary, the auto correlation is a function of the lag only and not of time.
  • If there are cycles in the data, there will be distinct peaks in the auto correlation, corresponding to the cycle periods.
  • If there is a trend in the data, the auto correlation will be large for small lags and will decay slowly with increasing lag.
  • If the underlying model of the time series is auto regressive, the data at time t is a function of some immediate past values. This manifests as high auto correlation values for small lags.
  • For many time series forecasting algorithms, you need to specify the cycle types if the data has cycles. Auto correlation can help here.

Having these insights is very useful before undertaking analysis of any time series data.

Cycle Detection and Auto Correlation

Consider a utility company with energy usage data coming from smart meters every minute. They know there are cycles, or seasonality, with peaks and troughs in the usage data. They want to use tools like sub series plots or box plots to analyze the cycles and make personalized recommendations to the consumers to change their usage habits and save costs.

By cycle, I mean a cycle with a fixed period, which is often described as seasonality in the literature.

However, these cycle analysis tools require knowledge of the cycle periods, and that’s where auto correlation comes into the picture. The lags that correspond to peaks in the auto correlation correspond to the cycle periods.

Here is some sample data. It has only 3 fields: meterID, timeStamp and the meter reading. The sampling interval is 1 minute.

5RWN90P4L3,1544445103,0.010
U588G2HW81,1544445107,0.017
552909423K,1544445106,0.014
B3V896XZRF,1544445166,0.010
HZKSEIXXAG,1544445166,0.014
5RWN90P4L3,1544445167,0.012
U588G2HW81,1544445164,0.022
552909423K,1544445167,0.008

The data is synthetic, generated with a python script. There are readings for 5 meters over a period of 10 days. For each consumer, we are interested in finding the cycles in the energy usage.

Statistics Calculation Spark Job

Auto correlation calculation requires the mean of the time series data. Accordingly, the first Spark job we run is NumericalAttrStats, which calculates the mean, std deviation etc for each partition or group by key in the data. The key is the meterID. This Spark job is part of my open source chombo project on github. Here is the statistics output.

5RWN90P4L3,2,$,253.553000,4.333503,15840,0.016007,0.000017,0.004166,0.005000,0.030000
HZKSEIXXAG,2,$,253.475000,4.331147,15840,0.016002,0.000017,0.004167,0.005000,0.030000
552909423K,2,$,253.212000,4.319856,15840,0.015986,0.000017,0.004145,0.005000,0.030000
B3V896XZRF,2,$,253.855000,4.345415,15840,0.016026,0.000017,0.004182,0.005000,0.030000
U588G2HW81,2,$,265.057000,4.848833,15840,0.016733,0.000026,0.005109,0.001000,0.031000

The first field is the group by key (i.e. meter ID). The statistics start from the 4th field. The mean is the 7th field in the output.
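
For reference, the per key statistics could also be sketched directly with the Spark DataFrame API. The sketch below is not the chombo NumericalAttrStats job; the column names and file path are illustrative.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, count, max, min, stddev, sum}

val spark = SparkSession.builder.appName("meterStats").getOrCreate()

// meterId, timeStamp (epoch seconds), reading
val readings = spark.read.csv("meter_readings.csv")
  .toDF("meterId", "timeStamp", "reading")
  .selectExpr("meterId", "cast(timeStamp as long) as timeStamp", "cast(reading as double) as reading")

// mean, std deviation and a few other statistics per meter
val stats = readings.groupBy("meterId").agg(
  sum("reading").as("sum"),
  count("reading").as("count"),
  avg("reading").as("mean"),
  stddev("reading").as("stdDev"),
  min("reading").as("min"),
  max("reading").as("max"))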

Temporal Averaging Spark Job

We anticipate a daily cycle in the data. Since we are interested in discovering daily cycles and the time granularity of the input data is one minute, we do temporal averaging to turn the data into hourly readings. The hourly data will be used as the input to auto correlation; it should make the auto correlation results less noisy.

The temporal averaging is done by the Spark job TemporalAggregator, implemented in Scala. It performs aggregation operations on time series data. The aggregation functions supported are count, sum and average.

The aggregate values are always aligned with the corresponding time unit boundaries. For example, in our case, since we are doing hourly aggregation, all values within a clock hour will be aggregated and the time stamp for the aggregated value will be the time stamp of the corresponding hour end.
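
A minimal sketch of such hour aligned averaging, using the readings DataFrame from the earlier sketch, is shown below. This is not the ruscello TemporalAggregator job; epoch seconds are assumed, and the time stamp of each aggregated value is the end of its clock hour, as described above.

import org.apache.spark.sql.functions.{avg, col}

// snap each reading to its clock hour and average within the hour;
// the aggregated time stamp is the corresponding hour end
val hourly = readings
  .withColumn("hourEnd", ((col("timeStamp") / 3600).cast("long") + 1) * 3600)
  .groupBy(col("meterId"), col("hourEnd"))
  .agg(avg(col("reading")).as("hourlyReading"))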

Auto Correlation Spark Job

Auto correlation is implemented in the Spark job AutoCorrelation. It takes as input the output of the temporal averaging done by the second Spark job and the statistics calculated by the first Spark job.

One key input parameter is the list of lag values. We can also calculate auto correlation for multiple data columns by specifying a list of column indexes as part of the configuration. Since we suspect daily cycles in the data, we have used the lag values 12, 24 and 36, all in hours. Here is the output.

U588G2HW81,2,24,0.997267
5RWN90P4L3,2,24,0.986436
552909423K,2,24,0.986425
HZKSEIXXAG,2,24,0.985413
B3V896XZRF,2,24,0.982911
HZKSEIXXAG,2,12,0.807520
HZKSEIXXAG,2,36,0.807350
5RWN90P4L3,2,12,0.799401
B3V896XZRF,2,36,0.799189
B3V896XZRF,2,12,0.798675
552909423K,2,36,0.795565
5RWN90P4L3,2,36,0.792159
552909423K,2,12,0.789131
U588G2HW81,2,12,-0.931098
U588G2HW81,2,36,-0.933244

The fields in the output are as follows

  1. meter ID
  2. column index of the data being analyzed
  3. lag
  4. auto correlation coefficient

For all meters we find that the auto correlation coefficient is highest for the 24 hour lag. This supports our hypothesis of daily cycles in the data. This is how the auto correlation Spark job is used: we hypothesize about the cycle periods in the data, try those lags and verify from the output whether the hypothesis is correct.
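
For illustration, the same computation can be sketched with the Spark Dataset API, using the hourly DataFrame from the temporal averaging sketch. This is not the ruscello AutoCorrelation job; it simply sorts each meter’s hourly series in memory, which is fine here because each meter has only a few hundred hourly readings.

import spark.implicits._

val lags = Seq(12, 24, 36)

val autoCorr = hourly
  .select("meterId", "hourEnd", "hourlyReading")
  .as[(String, Long, Double)]
  .groupByKey(_._1)
  .flatMapGroups { (meterId, rows) =>
    // sort the meter's hourly readings by time and apply the auto correlation formula
    val series = rows.toSeq.sortBy(_._2).map(_._3).toArray
    val n = series.length
    val m = series.sum / n
    val c0 = series.map(v => (v - m) * (v - m)).sum / n
    lags.map { k =>
      val ck = (k until n).map(t => (series(t) - m) * (series(t - k) - m)).sum / n
      (meterId, k, ck / c0)
    }
  }
  .toDF("meterId", "lag", "autoCorrelation")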

Data Variation within a Cycle

Auto correlation only discovers the periods of cycles, if any. It does not shed any light on how the data changes within a cycle. You might recall that the energy company is interested in the actual variation in energy usage within a cycle after the cycle periods have been found through auto correlation.

There are two ways you can approach this. You can take the visualization route and use a sub series plot or box plot. Or you can do it quantitatively by going back to the statistics calculating Spark job. This time you have to specify the cycle type, which is daily in our case, as a configuration parameter. For each meter and each hour of the day, it will calculate the mean, std deviation and other statistics.
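
A minimal sketch of the quantitative route for a daily cycle is shown below: it computes the mean and std deviation of the reading for each meter and each hour of the day. It illustrates the idea only and is not the actual chombo job configuration.

import org.apache.spark.sql.functions.{avg, col, from_unixtime, hour, stddev}

// cycle index for a daily cycle = hour of the day
val hourOfDayStats = readings
  .withColumn("hourOfDay", hour(from_unixtime(col("timeStamp")).cast("timestamp")))
  .groupBy("meterId", "hourOfDay")
  .agg(avg("reading").as("mean"), stddev("reading").as("stdDev"))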

Time Series Anomaly Detection

I have used auto correlation as a pre processing step in a statistics based time series anomaly detection project. After the cycle periods are found through auto correlation, statistics are calculated for each cycle index.

For example, if the seasonal cycle is found to be weekly, then 7 sets of statistics are calculated, one for each cycle index, i.e. day of the week. These statistics are then used to detect anomalies based on the z score.

Essentially, anomaly detection is contextualized with seasonal cycles. For example, what is an anomaly in CPU usage for a server on a weekend may be normal on a weekday; it depends very much on the seasonal context.
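
Here is a minimal sketch of this idea for our meter data, scoring each reading against the statistics of its own cycle index (hour of the day, reusing hourOfDayStats from the previous sketch). The z score threshold is illustrative.

import org.apache.spark.sql.functions.{abs, col, from_unixtime, hour}

val zThreshold = 3.0

// join each reading with the statistics of its cycle index and flag large z scores
val anomalies = readings
  .withColumn("hourOfDay", hour(from_unixtime(col("timeStamp")).cast("timestamp")))
  .join(hourOfDayStats, Seq("meterId", "hourOfDay"))
  .withColumn("zScore", abs(col("reading") - col("mean")) / col("stdDev"))
  .filter(col("zScore") > zThreshold)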

Summing Up

Auto correlation is a powerful tool for time series analysis. We have shown how you can gain valuable insights into time series data through auto correlation as implemented on Spark. If you want to execute the use case in this post, you could follow the tutorial. One caveat: if there is a trend in the data, it is desirable to remove the trend before computing the auto correlation.
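
For example, a simple way to remove a linear trend is first order differencing of the series before computing the auto correlation. The sketch below illustrates the idea and is not part of the Spark jobs above.

// first order differencing to remove a linear trend before auto correlation
def difference(series: Array[Double]): Array[Double] =
  series.sliding(2).map { case Array(prev, cur) => cur - prev }.toArray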


