There are many benefits of auto correlation analysis on time series data, as we will see in detail later. It gives us important insights into the nature of the time series data, and cycle detection is one of them. To put things in context, we will use cycle detection for energy usage time series data as an example to demonstrate the usefulness of auto correlation.

The Spark implementation is available in my open source project *ruscello* on github. This project has various Spark based implementations of time series analysis.

## Auto Correlation Defined

Auto correlation measures how a data point correlates with past data at some time lag. Typically auto correlation is calculated for various time lag values. Generally, the auto covariance is normalized by the variance, so that the normalized auto correlation coefficient ranges from -1 to +1. The definition is as follows:

$$c_k = \frac{1}{n} \sum (x_t - m)(x_{t-k} - m)$$

$$r_k = \frac{c_k}{c_0}$$

where

* $c_k$ = auto covariance for lag k
* $c_0$ = variance
* $r_k$ = auto correlation for lag k
* $x_t$ = value at time t
* $m$ = mean
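The definition above can be sketched in a few lines of plain Python. This is only an illustration and not the Spark implementation in *ruscello*; the function name `auto_correlation` and the choice to start the sum at t = k (so that x_{t-k} always exists) are assumptions made for this sketch.

```python
from typing import Sequence


def auto_correlation(values: Sequence[float], lag: int) -> float:
    """Return r_k = c_k / c_0 for lag k, following the definition above."""
    n = len(values)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(values)")
    mean = sum(values) / n
    # c_0: the variance, i.e. the auto covariance at lag 0
    c0 = sum((x - mean) ** 2 for x in values) / n
    if c0 == 0:
        raise ValueError("constant series: auto correlation is undefined")
    # c_k: sum of (x_t - m)(x_{t-k} - m) for t = k .. n-1, divided by n
    ck = sum((values[t] - mean) * (values[t - lag] - mean)
             for t in range(lag, n)) / n
    return ck / c0


# A toy series that repeats every 4 samples
series = [1.0, 2.0, 3.0, 2.0] * 6
r0 = auto_correlation(series, 0)  # lag 0 always gives 1.0
r2 = auto_correlation(series, 2)  # half a period: strongly negative
r4 = auto_correlation(series, 4)  # one full period: strongly positive
```

Because the sum is divided by n rather than by the number of terms actually summed, the coefficient shrinks slightly as the lag grows, even for a perfectly periodic series.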

Auto correlation can be very useful in gaining insight into time series data, as listed below.

* If auto correlation is close to 0 for all non-zero lags, the time series is random. If there are distinct peaks, the time series has an underlying pattern.
* If the time series is stationary, auto correlation decays quickly with increasing lag. The slower the decay, the more non stationary the time series.
* If the time series is stationary, auto correlation will be a function of the lag only and not of time.
* If there are cycles in the data, there will be distinct peaks in the auto correlation, corresponding to the periods of the cycles.
* If there is a trend in the data, auto correlation will be large for small lags and will decay slowly with increasing lag.
* If the underlying model of the time series is auto regressive, the data at time t will be a function of some immediate past values. This manifests as high auto correlation values for small lags.
* For many time series forecasting algorithms, you need to specify the cycle types if the data has cycles. Auto correlation can help here.

Having these insights can be very useful before undertaking analysis of any time series data.

## Cycle Detection and Auto Correlation

Consider a utility company with energy usage data coming from smart meters every minute. They know there are cycles, or seasonality, with peaks and troughs in the usage data. They want to use tools like sub series plots or box plots to analyze the cycles and make personalized recommendations to consumers to change their usage habits and save costs.

By cycle, I mean cycles with fixed period, which is often described as seasonality in literature.

However, these cycle analysis tools require knowledge of the cycle periods, and that’s where auto correlation comes into the picture. The lags that correspond to peaks in the auto correlation are the cycle periods.
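To make the link between peaks and cycle periods concrete, here is a small sketch in plain Python. It builds a synthetic hourly series with a 24 hour cycle plus a little noise, computes the auto correlation for a range of lags, and reports the lags where the coefficient is a local maximum. The series, the 0.5 threshold and all names are assumptions made for illustration; none of this comes from the *ruscello* code.

```python
import math
import random


def auto_correlation(values, lag):
    """Normalized auto correlation r_k = c_k / c_0 (1/n normalization)."""
    n = len(values)
    mean = sum(values) / n
    c0 = sum((x - mean) ** 2 for x in values) / n
    ck = sum((values[t] - mean) * (values[t - lag] - mean)
             for t in range(lag, n)) / n
    return ck / c0


random.seed(42)
# 10 days of hypothetical hourly data: a daily sine cycle plus small noise
series = [math.sin(2 * math.pi * h / 24) + random.gauss(0, 0.02)
          for h in range(24 * 10)]

# Auto correlation for every lag from 1 to 48 hours
coeffs = {lag: auto_correlation(series, lag) for lag in range(1, 49)}

# A lag is a peak if its coefficient beats both neighbours and a threshold
peaks = [lag for lag in range(2, 48)
         if coeffs[lag] > coeffs[lag - 1]
         and coeffs[lag] > coeffs[lag + 1]
         and coeffs[lag] > 0.5]

print(peaks)
```

For a sine shaped cycle, the coefficient at half the period (lag 12 here) is strongly negative, because peaks line up with troughs at that lag.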

Here are some sample data. It has only 3 fields, *meterID*, *timeStamp* and the *meter reading*. The sampling interval is 1 minute.

    5RWN90P4L3,1544445103,0.010
    U588G2HW81,1544445107,0.017
    552909423K,1544445106,0.014
    B3V896XZRF,1544445166,0.010
    HZKSEIXXAG,1544445166,0.014
    5RWN90P4L3,1544445167,0.012
    U588G2HW81,1544445164,0.022
    552909423K,1544445167,0.008

The data is synthetic, generated with a Python script. There are readings for 5 meters over a period of 10 days. For each consumer, we are interested in finding the cycles in the energy usage.

## Statistics Calculation Spark Job

Auto correlation calculation requires the mean of the time series data. Accordingly, the first Spark job we run is *NumericalAttrStats*, which calculates *mean*, *std deviation* etc. for each partition or group by key in the data. The key is the *meterID*. This Spark job is part of my open source *chombo* project on github. Here is the statistics output.

    5RWN90P4L3,2,$,253.553000,4.333503,15840,0.016007,0.000017,0.004166,0.005000,0.030000
    HZKSEIXXAG,2,$,253.475000,4.331147,15840,0.016002,0.000017,0.004167,0.005000,0.030000
    552909423K,2,$,253.212000,4.319856,15840,0.015986,0.000017,0.004145,0.005000,0.030000
    B3V896XZRF,2,$,253.855000,4.345415,15840,0.016026,0.000017,0.004182,0.005000,0.030000
    U588G2HW81,2,$,265.057000,4.848833,15840,0.016733,0.000026,0.005109,0.001000,0.031000

The first field is the group by key (i.e. meter ID). Results are from the 4th field onward. Mean is the 7th field in the output.
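The idea of per key statistics can be sketched in plain Python as below. This is not the *NumericalAttrStats* job and does not use Spark; the function `stats_by_key` and its parameters are hypothetical names for this sketch. It accumulates count, sum and sum of squares per meter in a single pass, the kind of accumulation that also lends itself to a distributed reduce, and derives mean and standard deviation from them.

```python
import math
from collections import defaultdict

# The sample records shown earlier in this post
records = [
    "5RWN90P4L3,1544445103,0.010",
    "U588G2HW81,1544445107,0.017",
    "552909423K,1544445106,0.014",
    "B3V896XZRF,1544445166,0.010",
    "HZKSEIXXAG,1544445166,0.014",
    "5RWN90P4L3,1544445167,0.012",
    "U588G2HW81,1544445164,0.022",
    "552909423K,1544445167,0.008",
]


def stats_by_key(lines, key_index=0, value_index=2):
    """Return {key: stats dict} for one numeric column, grouped by key."""
    # Per key accumulator: [count, sum, sum of squares]
    acc = defaultdict(lambda: [0, 0.0, 0.0])
    for line in lines:
        fields = line.split(",")
        value = float(fields[value_index])
        entry = acc[fields[key_index]]
        entry[0] += 1
        entry[1] += value
        entry[2] += value * value
    result = {}
    for key, (count, total, total_sq) in acc.items():
        mean = total / count
        # Population variance; clamp tiny negative values from rounding
        variance = max(total_sq / count - mean * mean, 0.0)
        result[key] = {
            "count": count,
            "sum": total,
            "mean": mean,
            "variance": variance,
            "std_dev": math.sqrt(variance),
        }
    return result


stats = stats_by_key(records)
print(stats["5RWN90P4L3"]["mean"])
```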

## Temporal Averaging Spark Job

We anticipate a daily cycle in the data. Since the time granularity of the input data is minutes, we do temporal averaging to turn the data into hourly readings. The hourly data is used as input to auto correlation, which should make the auto correlation results less noisy.

The temporal averaging is done by the Spark job *TemporalAggregator*, implemented in Scala. It performs aggregation operations on time series data. The aggregation functions supported are *count*, *sum* and *average*.

The aggregate values are always aligned with the corresponding time unit boundaries. For example, since we are doing hourly aggregation in our case, all values within a clock hour are aggregated, and the time stamp for the aggregated value is the time stamp of the corresponding hour end.
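The boundary alignment described above can be sketched in plain Python as follows. This is not the *TemporalAggregator* job; the function `hourly_average` is a hypothetical name, and two details are assumptions of this sketch: time stamps are treated as epoch seconds in UTC, and a reading exactly on an hour boundary is counted in the hour that starts there.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600


def hourly_average(records):
    """Average readings per meter and clock hour.

    records: iterable of (meter_id, epoch_seconds, reading).
    Returns {(meter_id, hour_end_epoch_seconds): average reading}.
    """
    buckets = defaultdict(list)
    for meter_id, ts, reading in records:
        # Floor to the start of the clock hour, then label with the hour end
        hour_start = (ts // SECONDS_PER_HOUR) * SECONDS_PER_HOUR
        hour_end = hour_start + SECONDS_PER_HOUR
        buckets[(meter_id, hour_end)].append(reading)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


readings = [
    ("5RWN90P4L3", 1544445103, 0.010),
    ("5RWN90P4L3", 1544445167, 0.012),
    ("5RWN90P4L3", 1544446800, 0.020),  # exactly on the next hour boundary
]
hourly = hourly_average(readings)
print(hourly)
```

The first two readings fall in the same clock hour and are averaged together under the time stamp 1544446800, while the third starts a new hour labeled 1544450400.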

## Auto Correlation Spark Job

Auto correlation is implemented in the Spark job *AutoCorrelation*. It takes as input the output of temporal averaging done by the second Spark job and statistics as calculated by the first Spark job.

One key input parameter is the list of lag values. We can also calculate auto correlation for multiple data columns by specifying a list of column indexes as part of the configuration. Since we are hypothesizing daily cycles in the data, we have used the lag values 12, 24 and 36, all in hours. Here is the output.

    U588G2HW81,2,24,0.997267
    5RWN90P4L3,2,24,0.986436
    552909423K,2,24,0.986425
    HZKSEIXXAG,2,24,0.985413
    B3V896XZRF,2,24,0.982911
    HZKSEIXXAG,2,12,0.807520
    HZKSEIXXAG,2,36,0.807350
    5RWN90P4L3,2,12,0.799401
    B3V896XZRF,2,36,0.799189
    B3V896XZRF,2,12,0.798675
    552909423K,2,36,0.795565
    5RWN90P4L3,2,36,0.792159
    552909423K,2,12,0.789131
    U588G2HW81,2,12,-0.931098
    U588G2HW81,2,36,-0.933244

The fields in the output are as follows:

* meter ID
* column index of the data being analyzed
* lag
* auto correlation coefficient

For all meters we find that the auto correlation coefficient is highest for the 24 hour lag, which supports our hypothesis of daily cycles in the data. This is how the auto correlation Spark job is used: we hypothesize about the cycle periods in the data, try those periods as lags, and verify from the output whether our hypothesis is correct.

## Data Variation within a Cycle

Auto correlation only discovers the periods of cycles, if any. It does not shed any light on how the data changes within a cycle. You might recall that the energy company is interested in the actual variation in energy usage within a cycle, after the cycle periods have been found through auto correlation.

There are two ways you can approach this. You can take the visualization route and use a sub series plot or box plot. Or you can do it quantitatively by going back to the statistics calculation Spark job. This time you have to specify the cycle type, which is daily, as a configuration parameter. For each meter and each hour of the day, it will calculate *mean*, *std deviation* and other statistics.
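As a rough sketch of the quantitative route, the following plain Python groups hourly values by meter and hour of the day and computes mean and standard deviation for each group. It is not the Spark job from the post; the function `hour_of_day_stats`, the meter ID `M1` and the sample values are hypothetical, and the hour of day is derived from epoch seconds in UTC.

```python
import math
from collections import defaultdict


def hour_of_day_stats(hourly):
    """Return {(meter_id, hour_of_day): (mean, std_dev)}.

    hourly: iterable of (meter_id, epoch_seconds, value).
    """
    groups = defaultdict(list)
    for meter_id, ts, value in hourly:
        hour = (ts // 3600) % 24  # cycle index: UTC hour of the day
        groups[(meter_id, hour)].append(value)
    stats = {}
    for key, vals in groups.items():
        mean = sum(vals) / len(vals)
        variance = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats[key] = (mean, math.sqrt(variance))
    return stats


# Three days of made up values at midnight and noon for one meter
DAY = 86400
data = []
for day, (midnight, noon) in enumerate([(1.0, 5.0), (2.0, 5.0), (3.0, 5.0)]):
    data.append(("M1", day * DAY, midnight))
    data.append(("M1", day * DAY + 12 * 3600, noon))

profile = hour_of_day_stats(data)
print(profile[("M1", 0)], profile[("M1", 12)])
```

The resulting per hour means trace the shape of the daily cycle, and the standard deviations show how stable each hour is from day to day.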

## Time Series Anomaly Detection

I have used auto correlation as a pre processing step in a statistics based time series anomaly detection project. After cycle periods are found through auto correlation, statistics are calculated for each cycle index.

For example, if the seasonal cycle is found to be weekly, then 7 sets of statistics are calculated, one for each cycle index, i.e. day of the week. These statistics are used to detect anomalies based on z score.

Essentially, anomaly detection is contextualized with seasonal cycles. For example, what is an anomaly in CPU usage for a server on a weekend may be normal on a weekday. It depends very much on the seasonal context.
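Here is a minimal sketch of z score based detection with day of the week as the cycle index. It is an illustration under stated assumptions, not the code from the anomaly detection project: the function names, the threshold of 3 and the made up CPU usage numbers are all hypothetical, and the day of the week is taken in UTC.

```python
import math
from collections import defaultdict
from datetime import datetime, timezone

DAY = 86400
MONDAY = 4 * DAY  # 1970-01-05 00:00 UTC, a Monday


def weekday_of(ts):
    """Cycle index: day of the week in UTC, Monday = 0."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).weekday()


def fit_cycle_stats(samples):
    """samples: iterable of (epoch_seconds, value) -> {weekday: (mean, std)}."""
    groups = defaultdict(list)
    for ts, value in samples:
        groups[weekday_of(ts)].append(value)
    stats = {}
    for day, vals in groups.items():
        mean = sum(vals) / len(vals)
        variance = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats[day] = (mean, math.sqrt(variance))
    return stats


def is_anomaly(ts, value, stats, threshold=3.0):
    """Flag a value whose z score within its own weekday exceeds threshold."""
    mean, std = stats[weekday_of(ts)]
    if std == 0:
        return value != mean
    return abs(value - mean) / std > threshold


# Eight weeks of made up daily CPU usage: about 80 on weekdays, 20 on weekends
history = []
for d in range(56):
    ts = MONDAY + d * DAY
    base = 80.0 if weekday_of(ts) < 5 else 20.0
    wiggle = [-2.0, 0.0, 2.0, 0.0][(d // 7) % 4]
    history.append((ts, base + wiggle))

stats = fit_cycle_stats(history)
next_monday = MONDAY + 56 * DAY
next_saturday = next_monday + 5 * DAY
print(is_anomaly(next_monday, 80.0, stats))    # normal weekday load
print(is_anomaly(next_saturday, 80.0, stats))  # same load on a weekend
```

The same reading of 80 is unremarkable against the Monday statistics but far outside the Saturday statistics, which is the contextualization described above.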

## Summing Up

Auto correlation is a powerful tool for time series analysis. We have shown how you can gain valuable insights into time series data through auto correlation as implemented on Spark. If there is a trend in the data, it is desirable to remove the trend before performing auto correlation. If you want to execute the use case in this post, you can follow the tutorial.

## Support

For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for *Hadoop* or Spark deployment on cloud including installation, configuration and testing.
