Time Series Trend and Seasonality Component Decomposition with STL on Spark

You may be interested in decomposing a time series into level, trend, seasonality and remainder components to gain more insight into it, or to separate out the remainder component for anomaly detection. We will allude to various real-life use cases later in this post, and walk through a time series decomposition solution based on the Seasonal-Trend decomposition using Loess (STL) algorithm, as implemented on Spark.

The Spark implementation is available in my open source project ruscello. The implementation is agnostic to any problem domain or data set because, as with my other Spark implementations, it is metadata and configuration driven. We will use eCommerce product sales data to showcase the STL implementation.

Decomposition by STL

The insight gained from time series decomposition can be used in various ways. For eCommerce, decomposing demand time series may help you better optimize your inventory. Cloud resource usage time series could be decomposed to better optimize cloud infrastructure.

A time series may have any combination of the following four components. Each component can be thought of as the output of a different generative process.

  • Level : Average value
  • Trend : Long term variation, linear or non linear
  • Seasonal : Short term periodic change
  • Remainder : The random rest

Time series decomposition may be additive or multiplicative. We will consider only additive decomposition, where the time series is treated as the sum of all the components.

The STL algorithm has two loops. The outer loop assigns a robustness weight to each data point, based on how extreme the point's value is in the remainder component. These weights are used in the inner loop to mitigate the impact of outliers. Convergence is generally achieved within two iterations of the outer and inner loops, depending on the extent of outliers in the data.
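The outer-loop weighting can be sketched as follows. This is my own plain Python rendering of the bisquare robustness weighting described in the STL paper, not code from ruscello; function and variable names are mine.

```python
def robustness_weights(remainder):
    """Weight each point by how extreme its remainder is; outliers get ~0.

    Uses the bisquare scheme from the STL paper: h = 6 * median(|r|),
    weight = (1 - u^2)^2 for u = |r| / h < 1, else 0.
    """
    abs_r = [abs(r) for r in remainder]
    srt = sorted(abs_r)
    n = len(srt)
    median = srt[n // 2] if n % 2 == 1 else 0.5 * (srt[n // 2 - 1] + srt[n // 2])
    h = 6.0 * median
    weights = []
    for a in abs_r:
        u = a / h if h > 0 else 1.0
        weights.append((1.0 - u * u) ** 2 if u < 1.0 else 0.0)
    return weights
```

A point whose remainder is far outside the typical range gets a weight near zero, so the inner-loop smoothers effectively ignore it.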

If you think your time series is outlier free, the outer loop can be skipped by setting the outer loop counter to 1 in the configuration. If you suspect your data has outliers but you are able to detect and remove them beforehand, you can also skip the outer loop.

The inner loop extracts the trend and seasonal components and has the following steps. Most of the steps involve low pass filtering and Loess smoothing. For low pass filtering, a moving average is used. For brevity, I have skipped many details, which can be found in the original STL paper.

  • Detrend the time series by subtracting the trend component: DT = Y - T
  • Extract the seasonal subsequence from DT for each seasonal index. Smooth each subsequence with Loess smoothing. Regenerate the time series from the smoothed subsequences to get C
  • Perform multiple passes of low pass filtering and one Loess smoothing on the reconstructed and smoothed time series C to get L
  • Detrend the reconstructed and smoothed time series to get the seasonal component: S = C - L
  • Deseasonalize the time series: DS = Y - S
  • Perform Loess smoothing on DS to generate the trend T
  • Get the remainder R = Y - T - S after all loops complete
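The steps above can be sketched roughly as follows. This is an illustrative single pass, not the ruscello implementation: simple moving averages stand in for the Loess smoothers and for the multi-stage low pass filter of real STL, and all names are mine.

```python
def moving_avg(x, window):
    """Centered moving average; edges fall back to a shorter window."""
    n = len(x)
    half = window // 2
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out

def inner_loop_pass(y, trend, period):
    """One simplified inner-loop pass; returns (seasonal, new trend)."""
    # step 1: detrend, DT = Y - T
    dt = [yi - ti for yi, ti in zip(y, trend)]
    # step 2: smooth each seasonal subsequence, reassemble into C
    c = [0.0] * len(y)
    for k in range(period):
        idx = list(range(k, len(y), period))
        sub = moving_avg([dt[i] for i in idx], 3)
        for j, i in enumerate(idx):
            c[i] = sub[j]
    # step 3: low pass filter C into L (one moving average as a stand-in)
    l = moving_avg(c, period)
    # step 4: seasonal component, S = C - L
    s = [ci - li for ci, li in zip(c, l)]
    # step 5: deseasonalize, DS = Y - S
    ds = [yi - si for yi, si in zip(y, s)]
    # step 6: smooth DS to get the new trend T
    new_trend = moving_avg(ds, period)
    return s, new_trend
```

Starting from a zero trend and iterating this pass a couple of times refines T and S; the remainder is then R = Y - T - S.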

Details of the various steps can be found in the STL paper. Guidance on choosing various parameters needed for filtering and smoothing is also available in the paper.

Product Sales Data

The hourly eCommerce product sales data is generated with a Python script. You can control the trend, the seasonal variation and the random remainder while generating the time series data.

For seasonality, I have added a daily cycle, i.e. hour-of-day variation, and a weekly cycle, i.e. day-of-week variation. The remainder component is generated by a Gaussian process with zero mean and a prescribed standard deviation. The trend is linearly increasing. The data has the following fields

  • Product ID
  • Time stamp (aligned with hour boundaries)
  • Sale quantity
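A minimal generator along these lines might look as follows. This is a hedged sketch, not the actual script used for this post; the level, trend slope, cycle amplitudes and noise level are illustrative assumptions.

```python
import math
import random
from datetime import datetime, timedelta

def gen_sales(product_id, num_hours, seed=42):
    """Hourly sales = level + linear trend + daily cycle + weekly cycle + noise."""
    random.seed(seed)
    start = datetime(2021, 1, 4)  # a Monday, aligned to an hour boundary
    rows = []
    for h in range(num_hours):
        ts = start + timedelta(hours=h)
        level = 100.0
        trend = 0.05 * h                                   # slow linear increase
        daily = 20.0 * math.sin(2 * math.pi * h / 24)      # hour-of-day cycle
        weekly = 10.0 * math.sin(2 * math.pi * h / 168)    # day-of-week cycle
        noise = random.gauss(0.0, 5.0)                     # Gaussian remainder
        qty = max(0, int(round(level + trend + daily + weekly + noise)))
        rows.append((product_id, ts.strftime("%Y-%m-%d %H:00:00"), qty))
    return rows
```

Because the components are simply added, the data matches the additive decomposition model discussed earlier.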

Here is some sample input


Spark Workflow

The complete solution involves three Spark jobs. One of the important parameters in STL is the seasonal cycle period, defined as the number of observations per period. Since our data is synthetic, we already have prior knowledge of the cycle period.

However, generally you would not know this. Even if you have some knowledge about the seasonal cycle, you may want to verify your hunch. This is where auto correlation comes into the picture.

Auto correlation requires the mean sale quantity for each product, which requires yet another Spark job. We will get the mean values from a statistics calculating Spark job, giving us three Spark jobs for the complete solution. Here is the configuration for all the Spark jobs

Statistic Calculating Spark Job

This Spark job, NumericalAttrStats, calculates the mean, standard deviation and various other related statistics. It is from my open source project chombo. The output is as follows


I have data for 2 products in my synthetic data set, which is why there are 2 records in this output. This data gets used in the next Spark job, which computes auto correlation.
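The per-product statistics can be sketched in plain Python as follows. The real NumericalAttrStats is a distributed Spark job; this just shows the arithmetic it performs per key, with names of my choosing.

```python
import math
from collections import defaultdict

def per_key_stats(records):
    """records: (key, value) pairs -> {key: (count, mean, std dev)}."""
    acc = defaultdict(lambda: [0, 0.0, 0.0])  # count, sum, sum of squares
    for key, val in records:
        a = acc[key]
        a[0] += 1
        a[1] += val
        a[2] += val * val
    stats = {}
    for key, (n, s, sq) in acc.items():
        mean = s / n
        var = max(0.0, sq / n - mean * mean)  # population variance
        stats[key] = (n, mean, math.sqrt(var))
    return stats
```

Keeping only count, sum and sum of squares per key is what makes this easy to parallelize as a map-reduce style aggregation.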

Auto Correlation Spark Job

The auto correlation Spark job AutoCorrelation calculates auto correlation for various specified lag values. We have added lags of 24 and 168, corresponding to the daily and weekly cycles respectively. There are a few additional lag values to provide some context. We test only a few specific lag values because real world data has well defined cycles, e.g. daily, weekly or yearly.
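The underlying computation can be sketched as follows. This is plain Python, not the AutoCorrelation Spark job; it assumes the mean has already been computed, as in our workflow.

```python
def auto_corr(x, lags, mean):
    """Autocorrelation of series x at each given lag, using a precomputed mean."""
    n = len(x)
    var = sum((v - mean) ** 2 for v in x) / n
    result = {}
    for lag in lags:
        # covariance between the series and its lag-shifted copy
        cov = sum((x[i] - mean) * (x[i + lag] - mean)
                  for i in range(n - lag)) / n
        result[lag] = cov / var
    return result
```

For a series with a true period of 24, the autocorrelation at lag 24 is close to 1, while a half-period lag of 12 gives a strong negative value.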

Lags that result in high correlation will have values close to 1 or -1, for positive and negative correlation respectively. Here is the output


The last field contains the auto correlation values. The weekly cycle has the highest correlation, followed by the daily cycle. The lag of 130 provides some context, showing a low correlation value.

Component Decomposition Spark Job

The Spark job is implemented in the Scala object StlDecomposition. The output has the following fields

  • Product ID
  • Time stamp
  • Full value
  • Average
  • Trend
  • Seasonal
  • Remainder

Here is some sample output


You might have noticed that the trend has some high frequency content and is not very smooth. The trend smoothing parameter used in step 6 of the STL algorithm decides how the high frequency variation is distributed between trend and remainder.

By choosing a higher value for the Loess smoothing parameter, the trend can be made smoother, so that most of the high frequency variation is assigned to the remainder. It's a choice you have to make.

Component Decomposition and Anomaly Detection

The anomaly detection problem is exacerbated by the presence of trend and seasonal components, unless the anomaly detection algorithm used is local and sequence based, e.g. forecast based or Markov chain based. For example, if the seasonal component is not removed, a spike could be masked by the seasonal variation and go undetected.

For all other anomaly detection algorithms, decomposition lets you extract the remainder component and perform anomaly detection on it. While doing the extraction, most of the high frequency variation should be attributed to the remainder by setting the trend smoothing parameter appropriately.
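One simple way to score the extracted remainder, sketched here as an assumption rather than what ruscello does, is a z-score test against the remainder's own mean and standard deviation. The threshold of 3 is a common convention, not something from this post.

```python
import math

def remainder_anomalies(remainder, z_threshold=3.0):
    """Return indexes of remainder values more than z_threshold std devs from the mean."""
    n = len(remainder)
    mean = sum(remainder) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in remainder) / n)
    return [i for i, r in enumerate(remainder)
            if std > 0 and abs(r - mean) / std > z_threshold]
```

Since the trend and seasonality have already been removed, a plain global threshold like this is reasonable; on the raw series it would not be.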

Anomalies in the other components could also be detected, e.g. a significant level shift in the trend. In an eCommerce setting, you may find a significant upswing in the demand trend for a certain product because a competing product had a price increase. Any significant change in the seasonal cycle magnitude could also be detected. With a weekly cycle, you might find that the maximum demand has shifted from one day of the week to another.

The other way to handle anomaly detection, albeit more complex, is to build a separate model for each index of the seasonal cycle. Every seasonal cycle type has associated seasonal indexes; for example, a weekly cycle has day of the week as the seasonal index.
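A sketch of this per-index approach, with hour of day as the seasonal index (names and the threshold are my own illustrative choices):

```python
import math
from collections import defaultdict

def build_index_models(values, period):
    """Group an hourly series by seasonal index; return {index: (mean, std)}."""
    buckets = defaultdict(list)
    for i, v in enumerate(values):
        buckets[i % period].append(v)
    models = {}
    for idx, vals in buckets.items():
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        models[idx] = (mean, math.sqrt(var))
    return models

def is_anomalous(value, seasonal_index, models, z_threshold=3.0):
    """Score a value against the model for its own seasonal index."""
    mean, std = models[seasonal_index]
    return std > 0 and abs(value - mean) / std > z_threshold
```

The cost of this approach is that each seasonal index gets only a fraction of the data, so longer history is needed to build reliable per-index models.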

Wrapping Up

We have gone through a time series decomposition solution based on the STL algorithm, as implemented on Spark. Feel free to use the tutorial to execute the use case in this post.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains, for early stage startups, large corporations and everything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have a neat closed form solution.
