Time sequence data, which is all around us, may contain seasonal components. Data is seasonal when it has a component tied to a calendar unit, e.g., month of the year, day of the week, or hour of a week day. A seasonal component is defined by a time range and a period.
My open source project chombo has solutions for seasonality analysis. The solution is twofold. First, there is a Map Reduce job to detect seasonality in data. Second, there is another Map Reduce job to calculate statistics of the seasonal components. In this post we will go through the steps for analysing operational data with seasonal components.
Time sequence data may have the following components. In this post the focus is on the seasonal component.
- Seasonal : The data has a seasonal component with a fixed period and time range
- Cyclic : The data has cycles without a fixed period
- Long term trend : The data has a long term increasing or decreasing trend
If the data has a cyclic component or a long term trend, it needs to be removed before performing any analysis as outlined in this post.
Seasonal Operational Data
We will be using operational data from a data center as a use case. The data is for CPU usage of servers in a data center. Such data is likely to have a strong seasonal component, e.g., hour of week day.
You might wonder about the purpose of such analysis. Here are some examples in operational data analytics. As part of a threat detection strategy, we might be interested in anomalous server behavior, which in this case happens to be unusual CPU usage.
One popular way of detecting anomalies is to compute the z score, which is a function of the mean and standard deviation. Since the data is seasonal, the mean and standard deviation will also be seasonal. The average CPU usage may be 40% in certain hours and 80% in some other hours. Usage of 83% may be normal when the average usage is 80%, but not when it is 40%.
As another example, consider a server cluster that needs to scale elastically depending on the load. If we knew the average usage for different hours on week days and week ends, server provisioning could be done proactively to handle changing load.
The data for our use case will have the following fields. There may be other fields in the data. But through configuration, only these relevant fields are picked up.
- Server ID
- CPU usage as percentage
- Time stamp as epoch time
Although the focus of this post is analytics with operational data, there are numerous use cases in other business domains.
The first task at hand is detecting seasonal patterns in the data. Data is grouped by hour at the hour boundary and the average value is calculated for each hour. This is accomplished with the Map Reduce class SeasonalDetector. Here is some sample output.
server1,16713,19,1,70.211
server1,16713,20,1,28.900
server1,16713,21,1,28.567
server1,16713,22,1,29.594
server1,16713,23,1,29.794
server1,16716,0,1,28.578
server1,16716,1,1,31.467
The fields in the output are as follows. The second and third fields require some explanation. The cycle index, which is the third field, is the index of the time range. For hour of day, the index will be between 0 and 23. The parent cycle index is the index of the parent of the seasonality type. The parent of hour of day is day, so the parent cycle index is the index of the day, counted from January 1, 1970.
- Entity (server) ID
- Parent cycle index
- Cycle index
- Attribute ordinal
- Average value of the attribute for the hour
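The hourly grouping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the SeasonalDetector implementation. It assumes the time stamp is epoch time in seconds and that hours are taken in UTC; the function names and the sample records are made up.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86400


def cycle_indexes(epoch_seconds):
    """Return (parent cycle index, cycle index) for hour of day, in UTC."""
    day_index = epoch_seconds // SECONDS_PER_DAY  # days since January 1, 1970
    hour_of_day = (epoch_seconds % SECONDS_PER_DAY) // SECONDS_PER_HOUR  # 0..23
    return day_index, hour_of_day


def hourly_averages(records):
    """Average the usage per (server ID, day index, hour of day).

    records: iterable of (server_id, cpu_usage_percent, epoch_seconds).
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for server_id, usage, timestamp in records:
        key = (server_id, *cycle_indexes(timestamp))
        totals[key][0] += usage
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up sample records: two readings in hour 19 and one in hour 20 of day 16713
records = [
    ("server1", 70.0, 16713 * SECONDS_PER_DAY + 19 * SECONDS_PER_HOUR + 60),
    ("server1", 72.0, 16713 * SECONDS_PER_DAY + 19 * SECONDS_PER_HOUR + 1800),
    ("server1", 29.0, 16713 * SECONDS_PER_DAY + 20 * SECONDS_PER_HOUR + 5),
]
averages = hourly_averages(records)
```

Each key of `averages` carries the entity ID, parent cycle index and cycle index, which are the leading fields of the output rows shown above.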
Actual detection involves building a histogram of the hourly averages and picking hours whose averages fall more or less in the same range. This is done by manual inspection, with the help of some visualization tools. The result is a set of hour groups within a day, where the hours in each group have more or less equal hourly averages.
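The inspection is manual, but a rough automated first pass can help narrow things down. The sketch below is not part of chombo. It puts hours into fixed-width bins by mean usage and collapses each bin into contiguous hour ranges. The bin width of 25 and the sample means are made-up values for illustration.

```python
from collections import defaultdict


def group_hours(hour_means, bin_width=25.0):
    """Bucket hours whose mean usage falls into the same fixed-width bin.

    hour_means: dict mapping hour of day (0..23) to mean usage.
    Returns a dict mapping bin index to a sorted list of hours.
    """
    groups = defaultdict(list)
    for hour in sorted(hour_means):
        groups[int(hour_means[hour] // bin_width)].append(hour)
    return dict(groups)


def to_ranges(hours):
    """Collapse a sorted list of hours into contiguous (start, end) ranges."""
    ranges = []
    for hour in hours:
        if ranges and hour == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], hour)  # extend the current range
        else:
            ranges.append((hour, hour))  # start a new range
    return ranges


# Made-up mean usage for a few hours of the day
sample_means = {0: 29.0, 1: 31.0, 2: 28.0, 8: 61.0, 9: 59.0, 20: 30.0}
groups = group_hours(sample_means)
hour_ranges = {bin_index: to_ranges(hours) for bin_index, hours in groups.items()}
```

Fixed-width bins split hours that sit close to a bin boundary, so the result still needs a visual check against the histogram before it is used as configuration.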
I have analyzed data for week days and found the following hour groups. The pattern that emerged contains 4 hour groups. Each hour group contains one or more ranges of contiguous hours, and together the hour groups cover all 24 hours of a day. The hours not listed below, 14-17, make up the fourth group.
|Hour ranges||Usage level|
|0-7, 20-23||low usage|
|8-9, 12-13||moderate usage|
|10-11, 18-19||more than moderate usage|
The hour groups and the associated usage levels seem to make sense for a server in a data center.
Statistics of Seasonal Data
Now that we have the hour groups, we are ready to calculate statistics for the different hour groups. This is performed with the Map Reduce class NumericalAttrStats. This is a general purpose Map Reduce class for calculating various statistics of numerical data.
A configuration flag can be set to indicate that the data is seasonal. When seasonal, it’s also necessary to indicate the type of seasonal cycle through another configuration parameter. The following configuration parameter values for seasonal cycle type are supported.
|Cycle type configuration||Comment|
|monthOfYear||Month of a year|
|dayOfWeek||Day of a week|
|weekDayOfWeek||Week day of a week|
|weekEndDayOfWeek||Week end day of a week|
|hourOfDay||Hour of a day|
|hourOfWeekDay||Hour of a week day|
|hourOfWeekEndDay||Hour of a week end day|
|halfHourOfDay||Half hour of a day|
|halfHourOfWeekDay||Half hour of a week day|
|halfHourOfWeekEndDay||Half hour of a week end day|
|quarterHourOfDay||Quarter hour of a day|
|quarterHourOfWeekDay||Quarter hour of a week day|
|quarterHourOfWeekEndDay||Quarter hour of a week end day|
|hourRangeOfWeekDay||List of hour ranges in a week day|
|hourRangeOfWeekEndDay||List of hour ranges in a week end day|
For our case, we are using the seasonal cycle type hourRangeOfWeekDay. The actual hour ranges, as detected with the first Map Reduce job, are specified through another configuration parameter. Here is the complete output.
server1,1,0,387164.0,13339284.000,12960,29.874,136.824,11.697,0.000,66.000
server1,1,1,260677.0,16562365.000,4320,60.342,192.736,13.883,18.000,102.000
server1,1,2,303725.0,22411777.000,4320,70.307,244.877,15.649,22.000,117.000
server1,1,3,388890.0,35269002.000,4320,90.021,60.370,7.770,66.000,114.000
server2,1,0,387011.0,13351645.000,12960,29.862,138.483,11.768,0.000,66.000
server2,1,1,259414.0,16388676.000,4320,60.050,187.728,13.701,18.000,102.000
server2,1,2,301719.0,22155569.000,4320,69.842,250.649,15.832,23.000,117.000
server2,1,3,388082.0,35130068.000,4320,89.834,61.849,7.864,66.000,114.000
We had data for 2 servers and we had 4 hour groups. That’s why there are 8 rows in the output. The fields in the output are as follows.
- Server ID
- Attribute ordinal
- Cycle Index
- Sum
- Sum of square
- Count
- Mean
- Variance
- Std dev
- Minimum
- Maximum
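As a sanity check on how the numeric fields in a row relate to each other, the sketch below recomputes the derived values from the first output row. It assumes the fields after the cycle index are sum, sum of squares, count, mean, variance and standard deviation, and that the variance is the population variance. Both assumptions are consistent with the numbers in that row.

```python
import math


def derive_stats(total, sum_of_squares, count):
    """Return (mean, population variance, standard deviation)."""
    mean = total / count
    variance = sum_of_squares / count - mean * mean  # E[x^2] - (E[x])^2
    return mean, variance, math.sqrt(variance)


# Sum, sum of squares and count from the row for server1, cycle index 0
mean, variance, std_dev = derive_stats(387164.0, 13339284.0, 12960)
print(f"{mean:.3f},{variance:.3f},{std_dev:.3f}")
```

The printed values match the 29.874, 136.824 and 11.697 reported in that row.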
Since there are 4 hour groups, the cycle index here is the index of the hour group, which is between 0 and 3.
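To make the cycle index concrete, here is a small sketch that maps an hour of the day to its hour group index. The data structure is hypothetical and is not the chombo configuration format. The last group, 14-17, is inferred from the hours left uncovered by the groups listed earlier.

```python
# Hour ranges per hour group, with inclusive (start, end) hours
HOUR_GROUPS = [
    [(0, 7), (20, 23)],    # index 0: low usage
    [(8, 9), (12, 13)],    # index 1: moderate usage
    [(10, 11), (18, 19)],  # index 2: more than moderate usage
    [(14, 17)],            # index 3: inferred from the remaining hours
]


def hour_group_index(hour_of_day):
    """Return the index of the hour group containing hour_of_day (0..23)."""
    for index, ranges in enumerate(HOUR_GROUPS):
        if any(start <= hour_of_day <= end for start, end in ranges):
            return index
    raise ValueError(f"hour {hour_of_day} is not covered by any hour group")


indexes = [hour_group_index(hour) for hour in range(24)]
```

Group 0 covers 12 hours and the other groups cover 4 hours each, which agrees with the counts of 12960 and 4320 in the output.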
If this result is used for outlier detection in CPU usage depending on the hour of the day, then the mean and standard deviation could be used for zScore based outlier detection.
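A minimal sketch of such a check follows, using the mean and standard deviation for server1 copied from the output above. The threshold of 3.0 and the function names are assumptions for illustration, not something chombo prescribes.

```python
# (mean, std dev) per (server ID, cycle index), copied from the output rows
SEASONAL_STATS = {
    ("server1", 0): (29.874, 11.697),
    ("server1", 1): (60.342, 13.883),
    ("server1", 2): (70.307, 15.649),
    ("server1", 3): (90.021, 7.770),
}


def z_score(server_id, cycle_index, usage):
    """Z score of a usage value against the statistics of its hour group."""
    mean, std_dev = SEASONAL_STATS[(server_id, cycle_index)]
    return (usage - mean) / std_dev


def is_outlier(server_id, cycle_index, usage, threshold=3.0):
    """Flag usage whose absolute z score exceeds the (assumed) threshold."""
    return abs(z_score(server_id, cycle_index, usage)) > threshold


low_hours = is_outlier("server1", 0, 83.0)   # 83% during the low usage hours
busy_hours = is_outlier("server1", 3, 83.0)  # 83% during the busiest hours
```

The same reading of 83% is flagged during the low usage hours but not during the busiest hours, which is the reason for computing the statistics per hour group.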
We have gone through a process for extracting statistics from seasonal data, which could be used for various purposes. A tutorial document lists the steps for executing this use case. Although I have analysed only week day data, the same steps could be repeated for week end data, with appropriate changes in configuration.