You may be interested in decomposing a time series into level, trend, seasonality and remainder components to gain more insight into it. You may also be interested in decomposition to separate out the remainder component for anomaly detection. We will allude to various real-life use cases later in this post. We will go through a time series decomposition solution using the *Seasonal and Trend decomposition using Loess (STL)* algorithm, as implemented on Spark.

The Spark implementation is available in my open source project *ruscello*. The implementation is agnostic to any particular problem domain or data set because, as in my other Spark implementations, it is metadata and configuration driven. We will use eCommerce product sales data to showcase the STL implementation.

## Decomposition by STL

The insight gained from time series decomposition can be used in various ways. For eCommerce, decomposing demand time series may help you better optimize your inventory. Cloud resource usage time series could be decomposed to better optimize cloud infrastructure.

Time series may have any combination of the following 4 components. Each component can be thought of as the output of a different generative process.

- **Level**: Average value
- **Trend**: Long term variation, linear or non linear
- **Seasonal**: Short term periodic change
- **Remainder**: The random rest

Time series decomposition may be additive or multiplicative. We will consider only additive decomposition of components, where the time series is considered to be the sum of all the components.
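As a quick illustration with made-up numbers (not from the sales data), an additive series is just the element-wise sum of its components, and any one component can be recovered from the series by subtracting the others. A multiplicative series can be reduced to the additive case by taking logarithms.

```python
# hypothetical components for a short 6-point series
level = [100.0] * 6
trend = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
seasonal = [10.0, -10.0, 10.0, -10.0, 10.0, -10.0]
remainder = [0.5, -0.3, 0.1, 0.4, -0.2, 0.0]

# additive model: the observed value is the sum of the components
y = [l + t + s + r for l, t, s, r in zip(level, trend, seasonal, remainder)]

# given any three components, the fourth follows by subtraction
recovered = [y[i] - level[i] - trend[i] - seasonal[i] for i in range(6)]

# a multiplicative series y = T * S * R becomes additive under log:
# log(y) = log(T) + log(S) + log(R)
```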

The STL algorithm has 2 loops. The outer loop assigns a robustness weight to each data point based on how extreme its value is in the remainder component. This weight is used in the inner loop to mitigate the impact of outliers. Convergence is generally achieved within 2 iterations of the outer and inner loops, depending on the extent of outliers in the data.

If you think your time series is outlier free, the outer loop can be skipped by setting the outer loop counter to 1 in the configuration. If you suspect your data has outliers but are able to detect and remove them beforehand, you can also skip the outer loop.
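The robustness weighting can be sketched as follows. STL computes bisquare weights from the absolute remainders; this is a minimal stand-alone version for illustration, not the ruscello implementation.

```python
from statistics import median

def robustness_weights(remainder):
    """Bisquare robustness weights: points whose remainder is extreme
    relative to the rest get weights near 0, normal points near 1."""
    h = 6.0 * median(abs(r) for r in remainder)
    weights = []
    for r in remainder:
        u = abs(r) / h
        weights.append((1.0 - u * u) ** 2 if u < 1.0 else 0.0)
    return weights

# the data point with a huge remainder is effectively ignored
# in the next pass of the inner loop
w = robustness_weights([1.0, -2.0, 1.5, -1.0, 100.0])
```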

The inner loop extracts the trend and seasonal components and has the following steps. Most of the steps involve low pass filtering and Loess smoothing. For low pass filtering, a moving average is used. For brevity, I have skipped many details, which can be found in the original STL paper.

1. Detrend the time series by subtracting the trend component as DT = Y – T
2. Extract the seasonal sub sequence from DT for each seasonal index. Smooth each sub sequence with Loess smoothing. Regenerate the time series from the smoothed sub sequences for output C
3. Perform multiple low pass filtering passes and one Loess smoothing on the reconstructed and smoothed time series C for output L
4. Detrend the reconstructed and smoothed time series to get the seasonal component as S = C – L
5. Deseasonalize the time series as DS = Y – S
6. Perform Loess smoothing on DS to generate the trend T
7. After all loops complete, get the remainder as R = Y – T – S

Details of the various steps can be found in the STL paper. Guidance on choosing various parameters needed for filtering and smoothing is also available in the paper.

## Product Sales Data

The hourly eCommerce product sales data is generated with a Python script. You can control the trend, seasonal variation and the random remainder while generating the time series data.

For seasonality, I have added a daily cycle, i.e. hour of day variation, and a weekly cycle, i.e. day of the week variation. The remainder component is generated by a Gaussian process with zero mean and a prescribed standard deviation. The trend is linearly increasing. The data has the following fields

- Product ID
- Time stamp (aligned with hour boundaries)
- Sale quantity

Here is some sample input

```
63J1365378,1561456800,1705
8JCVNONHJ7,1561460400,1476
63J1365378,1561460400,1662
8JCVNONHJ7,1561464000,1450
63J1365378,1561464000,1673
```
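The actual generator script ships with the project; a minimal stand-in that produces the same field layout might look like this. The amplitudes, trend rate and noise level below are made-up parameters, not the ones used for the post's data set.

```python
import math
import random

def generate_sales(product_ids, num_hours, start_ts=1561456800,
                   base=1500.0, trend_per_hour=0.05,
                   daily_amp=50.0, weekly_amp=150.0,
                   noise_sd=30.0, seed=42):
    """Emit 'productID,timestamp,quantity' lines with hour-aligned
    timestamps, a linear trend, daily and weekly seasonality and
    Gaussian remainder noise."""
    rng = random.Random(seed)
    lines = []
    for h in range(num_hours):
        ts = start_ts + 3600 * h
        daily = daily_amp * math.sin(2.0 * math.pi * h / 24.0)
        weekly = weekly_amp * math.sin(2.0 * math.pi * h / 168.0)
        for pid in product_ids:
            qty = base + trend_per_hour * h + daily + weekly \
                + rng.gauss(0.0, noise_sd)
            lines.append("%s,%d,%d" % (pid, ts, round(qty)))
    return lines

# one week of hourly data for the 2 products
data = generate_sales(["63J1365378", "8JCVNONHJ7"], 24 * 7)
```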

## Spark Workflow

The complete solution involves 3 Spark jobs. One of the important parameters in STL is the seasonal cycle period, defined as the number of observations per period. Since our data is synthetic, we already have prior knowledge of the cycle period.

However, generally you would not know this. Even if you have some knowledge of the seasonal cycle, you may want to verify your hunch. This is where auto correlation comes into the picture.

Auto correlation requires the mean sale quantity for each product, which requires yet another Spark job. We will get the mean value from a statistics calculating Spark job, so we will have 3 Spark jobs in the complete solution.
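For reference, the auto correlation at a given lag, using a precomputed mean such as the one supplied by the statistics job, is just the lagged covariance normalized by the variance. A plain Python sketch:

```python
import math

def auto_correlation(x, lag, mean):
    """Autocorrelation of series x at the given lag, using a
    precomputed mean (e.g. from an earlier statistics pass)."""
    num = sum((x[i] - mean) * (x[i + lag] - mean)
              for i in range(len(x) - lag))
    den = sum((v - mean) ** 2 for v in x)
    return num / den

# a series with a cycle of 24 correlates strongly at lag 24
series = [math.cos(2.0 * math.pi * i / 24.0) for i in range(240)]
m = sum(series) / len(series)
```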

## Statistics Calculating Spark Job

This Spark job *NumericalAttrStats* calculates the mean, standard deviation and various other related statistics. This Spark job is from my open source project *chombo*. The output is as follows

```
63J1365378,2,$,3335717.000000,5315472911.000000,2160,1544.313426,75963.130931,275.614098,1042.000000,2237.000000
8JCVNONHJ7,2,$,3184040.000000,4736569220.000000,2160,1474.092593,19907.148834,141.092696,1123.000000,1863.000000
```

I have data for 2 products in my synthetic data set. That’s why there are 2 records in this output. This data gets used in the next Spark job, which is for auto correlation.

## Auto Correlation Spark Job

The auto correlation Spark job *AutoCorrelation* calculates auto correlation for various specified lag values. We have added lags of 24 and 168, corresponding to the daily and weekly cycles respectively. There are a few additional lag values to provide some context. We test only a few specific lag values because real world data tends to have well defined cycles, e.g. daily, weekly or yearly.

Lags that result in high correlation will have values close to 1 or -1, for positive and negative correlation respectively. Here is the output

```
63J1365378,2,168,0.945026
8JCVNONHJ7,2,168,0.906365
63J1365378,2,24,0.381904
8JCVNONHJ7,2,24,0.090933
63J1365378,2,130,0.011817
8JCVNONHJ7,2,130,-0.087460
```

The last field is the auto correlation value. The weekly cycle has the highest correlation, followed by the daily cycle. The lag of 130 provides some context, showing a low correlation value.

## Component Decomposition Spark Job

The Spark job is implemented in the Scala object *StlDecomposition*. The output has the following fields

- Product ID
- Time stamp
- Full value
- Average
- Trend
- Seasonal
- Remainder

Here is some sample output

```
8JCVNONHJ7,1565456400,1538.000,1473.155,5.251,35.798,23.796
8JCVNONHJ7,1565460000,1469.000,1473.155,5.129,-2.373,-6.911
8JCVNONHJ7,1565463600,1492.000,1473.155,5.195,2.596,11.053
8JCVNONHJ7,1565467200,1581.000,1473.155,5.150,78.385,24.310
8JCVNONHJ7,1565470800,1527.000,1473.155,5.022,23.000,25.822
8JCVNONHJ7,1565474400,1468.000,1473.155,4.884,-10.302,0.263
8JCVNONHJ7,1565478000,1480.000,1473.155,4.903,-17.780,19.722
8JCVNONHJ7,1565481600,1176.000,1473.155,4.802,-264.883,-37.073
```

You might have noticed that the trend has some high frequency component and is not very smooth. The trend smoothing parameter used in step 6 of the STL algorithm decides how the high frequency variation is distributed between the trend and the remainder.

By choosing a higher value for the Loess smoothing parameter, the trend can be made smoother and most of the high frequency variation assigned to the remainder. It's a choice you have to make.
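The trade-off can be seen even with a simple moving-average smoother standing in for Loess (the window sizes and series parameters below are arbitrary): a wider window produces a smoother trend and pushes more of the high frequency variation into the remainder.

```python
import random
from statistics import mean, pstdev

def smooth(x, window):
    # centered moving average as a simple stand-in for Loess smoothing
    half = window // 2
    return [mean(x[max(0, i - half): i + half + 1])
            for i in range(len(x))]

rng = random.Random(7)
# noisy linear-trend series
y = [1500.0 + 0.05 * i + rng.gauss(0.0, 30.0) for i in range(500)]

trend_narrow = smooth(y, 5)    # jagged trend, small remainder
trend_wide = smooth(y, 51)     # smooth trend, larger remainder
rem_narrow = [y[i] - trend_narrow[i] for i in range(len(y))]
rem_wide = [y[i] - trend_wide[i] for i in range(len(y))]
```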

## Component Decomposition and Anomaly Detection

The anomaly detection problem is exacerbated by the presence of trend and seasonal components, unless the anomaly detection algorithm used is local and sequence based, e.g. forecast based or Markov chain based. For example, if the seasonal component is not removed, a spike could be masked by the seasonal variation and go undetected.

For all other anomaly detection algorithms, the remainder component can be extracted with the help of decomposition and anomaly detection performed on the remainder. While doing the extraction, most of the high frequency variation should be attributed to the remainder by setting the trend smoothing parameter appropriately.
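With the remainder isolated, even a simple threshold test works. The sketch below uses a robust median/MAD z-score so the spike itself does not inflate the threshold; the 3-sigma cutoff and the 1.4826 consistency constant are conventional choices, not anything specific to ruscello.

```python
from statistics import median

def remainder_anomalies(remainder, z=3.0):
    """Flag indexes whose remainder deviates from the median by more
    than z robust standard deviations (MAD scaled by 1.4826)."""
    med = median(remainder)
    mad = median(abs(r - med) for r in remainder)
    cutoff = z * 1.4826 * mad
    return [i for i, r in enumerate(remainder) if abs(r - med) > cutoff]

# the spike at index 5 stands out against small remainders
rem = [2.0, -1.5, 0.5, -0.8, 1.2, 250.0, -1.0, 0.3]
```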

Anomalies in other components could also be detected, e.g. a significant level shift in the trend. In an eCommerce setting, you may find a significant upswing in the demand trend for a certain product because a competing product had a price increase. Any significant change in the seasonal cycle magnitude could also be detected with anomaly detection. With a weekly cycle, you might find that the maximum demand has shifted from one day of the week to another.

The other way to handle anomaly detection, albeit more complex, is to build separate models for each index of the seasonal cycle. Every seasonal cycle type has associated seasonal indexes. For example, weekly cycle has day of the week as the seasonal index.
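A sketch of the per-index approach, with hypothetical helper names and a weekly cycle over daily values: fit a mean and spread for each seasonal index, then score each point against its own index's model.

```python
from statistics import mean, pstdev

def per_index_models(values, period):
    """Build a (mean, stddev) model per seasonal index, e.g. one per
    day of the week for a weekly cycle over daily data."""
    models = []
    for k in range(period):
        sub = values[k::period]
        models.append((mean(sub), pstdev(sub)))
    return models

def is_anomaly(value, index, models, z=3.0):
    # compare the value only against its own seasonal index's model
    m, sd = models[index % len(models)]
    return abs(value - m) > z * sd

# weekly pattern: weekend (indexes 5, 6) runs much higher,
# with a small deterministic jitter standing in for noise
pattern = [100.0, 100.0, 100.0, 100.0, 100.0, 300.0, 300.0]
values = [pattern[i % 7] + (i % 3 - 1) for i in range(56)]
models = per_index_models(values, 7)
```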

## Wrapping Up

We have gone through a time series decomposition solution based on the STL algorithm, as implemented on Spark. Feel free to use the tutorial to execute the use case in this post.