Operational Analytics with Seasonal Data


Time sequence data is all around us, and it often contains seasonal components. Data is seasonal when its values vary with a recurring calendar cycle, e.g., month of the year, day of the week, or hour of a week day. A seasonal cycle is defined by a time range and a period.

My open source project chombo has a solution for seasonality analysis. The solution is twofold. First, there is a Map Reduce job to detect seasonality in data. Second, there is another Map Reduce job to calculate statistics of the seasonal components. In this post we will go through the steps for analyzing operational data with seasonal components.

Seasonal Data

Time sequence data may have the following components. In this post the focus is on the seasonal component.

  1. Seasonal : The data has a seasonal component with a fixed period and time range
  2. Cyclic : The data has cycles without fixed periods
  3. Long term trend : The data has a long term increasing or decreasing trend

If the data has cyclic or long term trend components, they need to be removed before performing any analysis as outlined in this post.
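As a rough illustration of trend removal, here is a minimal Python sketch that fits a straight line by least squares and subtracts only its slope, so the overall usage level is preserved. This is not part of chombo. The function name and the sample values are hypothetical, and a linear trend is an assumption that may not fit real data.

```python
def remove_linear_trend(values):
    """Subtract a least-squares linear trend, keeping the overall mean level.

    Assumes at least two evenly spaced observations.
    """
    n = len(values)
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    # Remove only the sloped part so the series stays centered on its mean
    return [y - slope * (x - mean_x) for x, y in enumerate(values)]


# Hypothetical series: usage drifting upward by 0.5 per step
trending = [40.0 + 0.5 * i for i in range(10)]
detrended = remove_linear_trend(trending)
print(detrended)
```

When the series does not span a whole number of seasonal cycles, the seasonal component itself can bias the fitted slope, so the trend is better estimated over complete cycles.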

Seasonal Operational Data

We will be using operational data from a data center as a use case. The data is for CPU usage of servers in a data center. Such data is likely to have a strong seasonal component, e.g., hour of week day.

You might wonder about the purpose of such analysis. Here are some examples in operational data analytics. As part of a threat detection strategy, we might be interested in anomalous server behavior, which in this case happens to be unusual CPU usage.

One popular way of detecting anomalies is to compute the z score, which is a function of the mean and standard deviation. Since the data is seasonal, the mean and standard deviation will also be seasonal. The average CPU usage may be 40% in certain hours and 80% in some other hours. Usage of 83% may be normal when the average usage is 80%, but not when it is 40%.

As another example, consider a server cluster that needs to scale elastically depending on the load. If we knew the average usage for different hours on week days and week ends, server provisioning could be done proactively to handle changing load.

The data for our use case will have the following fields. There may be other fields in the data, but through configuration only these relevant fields are picked up.

  1. Server ID
  2. CPU usage as percentage
  3. Time stamp as epoch time

Although the focus of this post is analytics with operational data, there are numerous use cases in other business domains.

Seasonality Detection

The first task at hand is detecting seasonal patterns in the data. Data is grouped by hour at the hour boundary, and the average value is calculated for each hour. This is accomplished with the Map Reduce class SeasonalDetector. Here is some sample output.

server1,16713,19,1,70.211
server1,16713,20,1,28.900
server1,16713,21,1,28.567
server1,16713,22,1,29.594
server1,16713,23,1,29.794
server1,16716,0,1,28.578
server1,16716,1,1,31.467

The fields in the output are as follows. The second and third fields require some explanation. The cycle index, which is the third field, is the index within the time range. For hour of day, the index will be between 0 and 23. The parent cycle index is the index of the parent of the seasonality type. The parent of hour of day is day, so the parent cycle index is the number of days since January 1, 1970.

  1. Entity (server) ID
  2. Parent cycle index
  3. Cycle index
  4. Attribute ordinal
  5. Average
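To make the two indexes concrete, here is a minimal single-process Python sketch of the hourly averaging step. It is not the SeasonalDetector implementation. It assumes time stamps are epoch seconds in UTC (the post does not say whether they are seconds or milliseconds), the function name is hypothetical, the sample records are made up, and the attribute ordinal field is omitted.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86400


def hourly_averages(records):
    """records: iterable of (server_id, cpu_usage, epoch_seconds).

    Returns {(server_id, day_index, hour_of_day): average usage}.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for server_id, usage, ts in records:
        day_index = ts // SECONDS_PER_DAY                       # parent cycle index
        hour_of_day = (ts % SECONDS_PER_DAY) // SECONDS_PER_HOUR  # cycle index
        acc = sums[(server_id, day_index, hour_of_day)]
        acc[0] += usage
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Made-up sample records, for illustration only
records = [
    ("server1", 70.0, 16713 * SECONDS_PER_DAY + 19 * SECONDS_PER_HOUR + 60),
    ("server1", 72.0, 16713 * SECONDS_PER_DAY + 19 * SECONDS_PER_HOUR + 1800),
    ("server1", 29.0, 16713 * SECONDS_PER_DAY + 20 * SECONDS_PER_HOUR + 5),
]
averages = hourly_averages(records)
for (server_id, day_index, hour), avg in sorted(averages.items()):
    print(f"{server_id},{day_index},{hour},{avg:.3f}")
```

Integer division by the number of seconds in a day gives the day count since January 1, 1970, and the remainder divided by the number of seconds in an hour gives the hour of day.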

Actual detection involves building a histogram of the hourly averages and picking hours whose averages fall more or less in the same range. This is done by manual inspection, with the help of some visualization tools. This gives us groups of hours within a day, where all the hours in a group have more or less equal average usage.

I have analyzed data for week days and found the following hour groups. The pattern that emerged contains 4 hour groups. Each hour group contains one or more ranges of contiguous hours. All the hour groups together cover the 24 hours of a day.

Hours            Comment
0-7, 20-23       low usage
8-9, 12-13       moderate usage
10-11, 18-19     more than moderate usage
14-17            high usage

The hour groups and the associated usage levels seem to make sense for a server in a data center.
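The grouping in this post was done by manual inspection. As a rough automated approximation only, the following Python sketch snaps each hourly average to the nearest multiple of a bin width and collects contiguous hours that land on the same level. The function name, the bin width of 10, and the 24 sample averages are all hypothetical.

```python
def group_hours_by_level(hourly_avg, bin_width=10.0):
    """hourly_avg: list of 24 averages indexed by hour of day.

    Returns {level: [(start_hour, end_hour), ...]} where level is the
    average rounded to the nearest multiple of bin_width.
    """
    groups = {}
    for hour, avg in enumerate(hourly_avg):
        level = int(avg / bin_width + 0.5)
        ranges = groups.setdefault(level, [])
        if ranges and ranges[-1][1] == hour - 1:
            # Extend the current range when this hour is adjacent to it
            ranges[-1] = (ranges[-1][0], hour)
        else:
            ranges.append((hour, hour))
    return groups


# Hypothetical hourly averages for one server on week days
hourly_avg = ([30.0] * 8 + [60.0] * 2 + [70.0] * 2 + [60.0] * 2
              + [90.0] * 4 + [70.0] * 2 + [30.0] * 4)
groups = group_hours_by_level(hourly_avg)
for level in sorted(groups):
    print(level * 10, groups[level])
```

Fixed bins are sensitive to where the bin edges fall, which is one reason inspecting the histogram by eye can give better groups than a mechanical rule like this one.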

Statistics of Seasonal Data

Now that we have the hour groups, we are ready to calculate statistics for the different hour groups. This is performed with the Map Reduce class NumericalAttrStats. This is a general purpose Map Reduce class for calculating various statistics of numerical data.

A configuration flag can be set to indicate that the data is seasonal. When seasonal, it’s also necessary to indicate the type of seasonal cycle through another configuration parameter. The following configuration parameter values for seasonal cycle type are supported.

Cycle type configuration    Comment
monthOfYear                 Month of a year
dayOfWeek                   Day of a week
weekDayOfWeek               Week day of a week
weekEndDayOfWeek            Week end day of a week
hourOfDay                   Hour of a day
hourOfWeekDay               Hour of a week day
hourOfWeekEndDay            Hour of a week end day
halfHourOfDay               Half hour of a day
halfHourOfWeekDay           Half hour of a week day
halfHourOfWeekEndDay        Half hour of a week end day
quarterHourOfDay            Quarter hour of a day
quarterHourOfWeekDay        Quarter hour of a week day
quarterHourOfWeekEndDay     Quarter hour of a week end day
hourRangeOfWeekDay          List of hour ranges in a week day
hourRangeOfWeekEndDay       List of hour ranges in a week end day

For our case, we are using the seasonal cycle type hourRangeOfWeekDay. The actual hour ranges, as detected with the first Map Reduce job, are specified through another configuration parameter. Here is the complete output.

server1,1,0,387164.0,13339284.000,12960,29.874,136.824,11.697,0.000,66.000
server1,1,1,260677.0,16562365.000,4320,60.342,192.736,13.883,18.000,102.000
server1,1,2,303725.0,22411777.000,4320,70.307,244.877,15.649,22.000,117.000
server1,1,3,388890.0,35269002.000,4320,90.021,60.370,7.770,66.000,114.000
server2,1,0,387011.0,13351645.000,12960,29.862,138.483,11.768,0.000,66.000
server2,1,1,259414.0,16388676.000,4320,60.050,187.728,13.701,18.000,102.000
server2,1,2,301719.0,22155569.000,4320,69.842,250.649,15.832,23.000,117.000
server2,1,3,388082.0,35130068.000,4320,89.834,61.849,7.864,66.000,114.000

We had data for 2 servers and we had 4 hour groups. That’s why there are 8 rows in the output. The fields in the output are as follows.

  1. Server ID
  2. Attribute ordinal
  3. Cycle Index
  4. Sum
  5. Sum of square
  6. Count
  7. Average
  8. Variance
  9. Std dev
  10. Min
  11. Max
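The average, variance and standard deviation can be recovered from the sum, sum of squares and count. The short Python sketch below checks this against the first output row (server1, cycle index 0). The function name is hypothetical. The numbers in that row agree with the population variance (dividing by the count), which is what the sketch uses.

```python
import math


def derived_stats(total, sum_of_squares, count):
    """Return (mean, population variance, standard deviation)."""
    mean = total / count
    variance = sum_of_squares / count - mean * mean
    return mean, variance, math.sqrt(variance)


# Sum, sum of squares and count from the first output row
mean, variance, std_dev = derived_stats(387164.0, 13339284.0, 12960)
print(f"{mean:.3f},{variance:.3f},{std_dev:.3f}")  # prints 29.874,136.824,11.697
```

Keeping only the sum, sum of squares and count per group is convenient in a Map Reduce setting, because partial results from different mappers can simply be added together before the final statistics are computed.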

Here the cycle index corresponds to the index of the hour group, which is between 0 and 3.

If this result is used for outlier detection in CPU usage depending on the hour of the day, then the mean and standard deviation could be used for z score based outlier detection.
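For illustration, here is a small Python sketch that maps an hour of the day to its hour group index. It assumes the indexes follow the order in which the hour groups were listed earlier, which is consistent with the output: index 0 has three times the count of the other groups (12 hours versus 4) and the lowest average. The names are hypothetical, and this is not how chombo reads its configuration.

```python
# Hour groups for week days, in the order listed in this post
HOUR_GROUPS = [
    [(0, 7), (20, 23)],    # index 0: low usage
    [(8, 9), (12, 13)],    # index 1: moderate usage
    [(10, 11), (18, 19)],  # index 2: more than moderate usage
    [(14, 17)],            # index 3: high usage
]


def hour_group_index(hour):
    """Return the index of the hour group containing the given hour (0-23)."""
    for index, ranges in enumerate(HOUR_GROUPS):
        if any(start <= hour <= end for start, end in ranges):
            return index
    raise ValueError(f"hour {hour} is not covered by any hour group")


print([hour_group_index(h) for h in range(24)])
```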
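A minimal Python sketch of such a check follows, using the mean and standard deviation of server1 from the output above, keyed by hour group index. The function names and the threshold of 3 standard deviations are assumptions, not something prescribed by chombo.

```python
# (mean, std dev) per hour group index for server1, from the output above
SEASONAL_STATS = {
    0: (29.874, 11.697),
    1: (60.342, 13.883),
    2: (70.307, 15.649),
    3: (90.021, 7.770),
}


def z_score(usage, group_index):
    """z score of a CPU usage value relative to its hour group."""
    mean, std_dev = SEASONAL_STATS[group_index]
    return (usage - mean) / std_dev


def is_outlier(usage, group_index, threshold=3.0):
    """Flag usage whose absolute z score exceeds the (assumed) threshold."""
    return abs(z_score(usage, group_index)) > threshold


# The same 83% usage in a low usage hour group and in a high usage hour group
print(is_outlier(83.0, 0), is_outlier(83.0, 3))
```

The same usage value is flagged in the low usage hours, where it is more than 4 standard deviations above the mean, but not in the high usage hours, where it is within one standard deviation of the mean.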

Summing Up

We have gone through a process for extracting statistics from seasonal data, which could be used for various purposes. A tutorial document is available with the steps for executing this use case. Although I have analyzed only week day data, the same steps could be repeated for week end data, with appropriate changes in configuration.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.