Mining Seasonal Products from Sales Data


The other day someone asked me how to include products with seasonal demand in recommendations based on collaborative filtering or some other technique. The solution to the problem involves two steps. The first step is to identify products with seasonal demand. The second step involves merging two ranked lists: one is the recommendation list produced by some recommendation algorithm, and the other is the list of products with seasonal demand. The second list is a function of time.

Our focus in this post is on the first problem. As we will see, products with seasonal demand can be found with some simple statistical techniques. The solution involves running three Hadoop Map Reduce jobs in tandem. The implementation is available in the open source project chombo on github.

Order Transaction Data as Input

Our use case is an online store. We use order transaction data with the following fields.

  1. Transaction ID
  2. Product ID
  3. Monetary amount
  4. Time stamp

The monetary amount is used as an indicator of demand. We could also have used product quantity. Here is some sample input. Our data contains higher than normal pre-school sales for certain products.
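The exact layout is not critical; a few records might look like the following. The transaction IDs, amounts and epoch-second time stamps here are hypothetical, for illustration only.

3CQ92UR70FXA,ZWTG9FZP,34.99,1374153600
81HGV2D0P5TT,53I098F5,112.50,1374240000
0M6JW8A2QKXD,ZWTG9FZP,27.25,1374326400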

Solution Outline

At any given time of the year, we want the solution to give us a list of products that have high seasonal demand at that time of the year.

The solution has three steps. The first step is to detect seasonality in the data. Data with seasonal components has the following characteristics.

  1. A fixed cycle of seasonality, e.g., for week of the year seasonality, the cycle is a year
  2. A time interval, e.g., for week of the year seasonality, the interval is 7 days
  3. A seasonal cycle index, e.g., for week of the year seasonality, the index is between 0 and 53

In this step, we process past transaction data and bin the transaction amounts, with one bin per week. Since we are using week of the year as the seasonality cycle, there will be 53 bins, one for each week in a year.

We sum the amounts for each bin. The output is essentially a denormalized histogram. A week is considered to begin on a Monday. Since there may be partial weeks near the year boundaries, we may have up to 53 weeks in a year. More on seasonality detection can be found in an earlier post.
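To make the binning concrete, here is a minimal sketch of how an epoch-second time stamp could be mapped to a Monday-aligned week of the year index, with a partial week 0 when the year does not begin on a Monday. This only illustrates the idea; it is not chombo's actual implementation.

import java.time.DayOfWeek;
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;
import java.time.temporal.TemporalAdjusters;

public class WeekIndex {
    // Map an epoch-second time stamp to a Monday-aligned week of the year index.
    // Days before the first Monday of the year form a partial week with index 0.
    public static int weekOfYear(long epochSec) {
        LocalDate date = Instant.ofEpochSecond(epochSec).atZone(ZoneOffset.UTC).toLocalDate();
        LocalDate jan1 = date.withDayOfYear(1);
        LocalDate firstMonday = jan1.with(TemporalAdjusters.nextOrSame(DayOfWeek.MONDAY));
        if (date.isBefore(firstMonday)) {
            return 0;    // partial week at the start of the year
        }
        int fullWeeks = (int) (ChronoUnit.DAYS.between(firstMonday, date) / 7);
        // shift by one when a partial week 0 exists
        return fullWeeks + (jan1.equals(firstMonday) ? 0 : 1);
    }
}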

In the second step, we take the output of the first step and calculate the normalized distribution of the transaction amount, along with certain statistical properties of the distribution.

In the third and final step, we take the output of the second step and identify the modes of the histogram for products with significant non-uniformity in the histogram distribution. The modes correspond to weeks with higher than average demand.

Map Reduce Pipeline

As mentioned earlier, there are three map reduce jobs that run in tandem, one for each of the three steps described earlier. The first map reduce job, SeasonalDetector, maps the time stamp of each transaction to a week of the year index, i.e., the cycle index. Its output is as follows. There is one output record for each product and week of the year.

ZWTG9FZP,0,24,2,342
ZWTG9FZP,0,25,2,156
ZWTG9FZP,0,26,2,360
ZWTG9FZP,0,27,2,381
ZWTG9FZP,0,28,2,1300
ZWTG9FZP,0,29,2,2907
ZWTG9FZP,0,30,2,2259
ZWTG9FZP,0,31,2,2588

The different fields are as follows.

  1. Product ID
  2. Parent cycle index
  3. Cycle index, i.e., week of the year in this case
  4. Attribute index
  5. Total amount

Some seasonality cycles have a parent cycle. For example, day of the month has month of the year as its parent cycle.

We have a field called attribute index, which refers to the field for the transaction amount in the input. It is there to identify the column being analyzed, because this map reduce job can process multiple data columns for seasonality.
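To make the data flow concrete, below is a stripped-down sketch of what this job does, assuming the input field positions listed earlier (transaction ID, product ID, amount, epoch-second time stamp), a hard-coded attribute index of 2 and no parent cycle. The actual SeasonalDetector job in chombo is configuration driven and more general.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SeasonalDetectorSketch {
    public static class SeasonalMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] items = value.toString().split(",");
            String productId = items[1];
            long amount = Math.round(Double.parseDouble(items[2]));
            long epochSec = Long.parseLong(items[3]);
            // map the time stamp to the cycle index, as in the earlier week index sketch
            int cycleIndex = WeekIndex.weekOfYear(epochSec);
            // key: product ID, parent cycle index (0 as a placeholder), cycle index, attribute index
            ctx.write(new Text(productId + ",0," + cycleIndex + ",2"), new LongWritable(amount));
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();    // total amount for this product and week
            }
            // with the key/value separator set to a comma, records come out like ZWTG9FZP,0,24,2,342
            ctx.write(key, new LongWritable(sum));
        }
    }
}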

The next map reduce job, NumericalAttrDistrStats, takes the output of the first job and calculates the distribution and some statistical quantities for each product. The histogram bin is the week of the year index and the bin value is the total amount for the week. Here is some sample output. There is one output record for each product.

XAYSOMO8,53,0,886,1,875,2,778,3,889,4,1149,5,1344,6,196,7,660,8,397,9,761,10,669,11,516,12,393,13,592,14,475,15,1143,16,398,17,899,18,667,19,474,20,777,21,388,22,382,23,745,24,295,25,582,26,878,27,1139,28,379,29,190,30,297,31,87,32,594,33,88,34,378,35,594,36,764,37,897,38,971,39,671,40,651,41,568,42,280,43,1238,44,855,45,946,46,672,47,752,48,840,49,395,50,493,51,369,52,381,3.859,5.500,10.731,25.172,40.023
Y1LHFN55,53,1,376,2,243,3,315,4,177,5,312,6,237,7,246,8,314,9,171,10,283,11,140,12,247,13,207,14,171,15,146,16,427,17,171,18,348,19,310,20,210,21,315,22,244,23,242,24,316,25,345,26,278,27,270,28,173,29,163,30,37,31,181,32,108,33,105,34,109,35,413,36,317,37,177,38,65,39,246,40,342,41,427,42,240,43,243,44,74,45,173,46,312,47,305,48,237,49,244,50,135,51,67,52,138,53,101,3.878,16.500,12.947,24.772,39.955
Y3N81J94,53,0,586,1,872,2,711,3,1235,4,1002,5,827,6,361,7,510,8,1214,9,229,10,947,11,828,12,971,13,844,14,491,15,958,16,822,17,1212,18,352,19,1092,20,847,21,856,22,844,23,476,24,956,25,568,26,1241,27,843,28,714,29,110,30,232,31,264,32,381,33,449,35,600,36,720,37,972,38,983,39,495,40,642,41,498,42,480,43,461,44,835,45,498,46,831,47,1589,48,602,49,808,50,845,51,593,52,487,53,133,3.873,47.500,12.162,23.729,39.640

The fields are as follows. For our problem, we will be using the histogram and optionally the entropy value.

  1. Product ID
  2. Number of bins
  3. Bin and bin value pairs for all the bins. The output can optionally be normalized
  4. Entropy
  5. Mode
  6. 25% quantile
  7. 50% quantile or median
  8. 75% quantile
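Since the entropy value is used for filtering later, here is a small sketch of how entropy could be computed from the binned totals, assuming a map from week index to total amount. Again, this is an illustration, not the actual NumericalAttrDistrStats code.

import java.util.Map;

public class DistrStats {
    // Entropy (in bits) of the normalized bin distribution; lower values mean peakier histograms.
    public static double entropy(Map<Integer, Long> hist) {
        double total = hist.values().stream().mapToLong(Long::longValue).sum();
        double h = 0;
        for (long binValue : hist.values()) {
            if (binValue > 0) {
                double p = binValue / total;            // normalized bin value
                h -= p * (Math.log(p) / Math.log(2));   // convert natural log to log base 2
            }
        }
        return h;
    }
}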

The final map reduce job, SeasonalCycleFinder, detects modes in the histograms of products with significant non-uniformity in their distributions. There are two ways to detect non-uniformity and modes, as follows.

  1. Modes with values some percentage above the average value
  2. Top n modes.

There is also an additional filter based on entropy. Entropy is a measure of randomness in a distribution. Distributions with high non-uniformity and pronounced peaks have low entropy values.
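Combining the first mode identification strategy with the entropy filter, the logic might look roughly like the sketch below. The percentage threshold and entropy limit are hypothetical parameters, and this is not the actual SeasonalCycleFinder code.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ModeFinder {
    // Return the week indexes whose value clears a percentage-of-average threshold,
    // skipping products whose distribution is too uniform, i.e., entropy above the limit.
    public static List<Integer> seasonalWeeks(Map<Integer, Long> hist, double entropy,
                                              double pctOfAvg, double maxEntropy) {
        List<Integer> modes = new ArrayList<>();
        if (entropy > maxEntropy) {
            return modes;    // too close to uniform; no seasonal modes
        }
        double avg = hist.values().stream().mapToLong(Long::longValue).average().orElse(0);
        double threshold = avg * pctOfAvg / 100;    // e.g., 200 percent of average
        for (Map.Entry<Integer, Long> e : hist.entrySet()) {
            if (e.getValue() >= threshold) {
                modes.add(e.getKey());
            }
        }
        return modes;
    }
}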

With the entropy filter on, we apply an upper limit to the entropy. With the second strategy for mode identification, it is important to apply the entropy based filter, because the top n modes exist even for a nearly uniform distribution. Here is some sample output. We have used the first strategy for mode identification, with the mode threshold set to 200% of the average.

23,7P6U253E,2008,45I15DS7,923
24,JTQFOU27,1400,ILCJW987,498,04KLY51P,1303,8L07Q41R,1456
25,GV6DZTX9,1027,WNUB5122,726,N48JP2IT,1629,SAWW6PBM,979,N603Z8V8,2778,939XB2F7,657
26,34109709,1916,962NH84D,1950,N5U0ZPYI,2090,FOPC4URK,1142
27,E069MWVG,611,TWL5XG9G,1943,C4JQZCBI,2618
28,RR44R650,4739,53I098F5,6452,Q5VTTNJJ,4660,ZWTG9FZP,1300,9B8187AP,1419,TN5G0U75,4140,9ZJF3LFT,5801,ON727318,6829,J6H6T2BB,4751,65XZLTQ7,934,FVQHQGEY,4378,LYN5B152,2404,HM99YZ4W,2074
29,J6H6T2BB,11283,9ZJF3LFT,12901,ODH5UK99,3408,RR44R650,8668,LYN5B152,5504,HM99YZ4W,4749,65XZLTQ7,1783,53I098F5,14876,TN5G0U75,6915,29UO8078,5036,9B8187AP,2576,0G0GB7OW,7148,ZWTG9FZP,2907,ON727318,9979,FVQHQGEY,8535,Q5VTTNJJ,10505,8W9AFV83,16016
30,TN5G0U75,5298,LYN5B152,5552,9ZJF3LFT,12947,0G0GB7OW,5753,53I098F5,17555,ON727318,13936,J6H6T2BB,10821,FVQHQGEY,6747,HM99YZ4W,3559,8W9AFV83,17845,65XZLTQ7,1852,RR44R650,8049,29UO8078,4411,Q5VTTNJJ,10507,ZWTG9FZP,2259,9B8187AP,2618,ODH5UK99,4830
31,ZWTG9FZP,2588,9ZJF3LFT,13496,29UO8078,4283,LYN5B152,6043,9B8187AP,3345,0G0GB7OW,4795,65XZLTQ7,2141,ODH5UK99,3942,RR44R650,10101,Q5VTTNJJ,11365,J6H6T2BB,7045,TN5G0U75,7791,8W9AFV83,14827,53I098F5,17611,HM99YZ4W,4223,ON727318,12560,FVQHQGEY,7899
32,65XZLTQ7,2044,8W9AFV83,18309,53I098F5,17405,9B8187AP,2883,9ZJF3LFT,11801,FVQHQGEY,8683,HM99YZ4W,5587,LYN5B152,4354,29UO8078,4506,ODH5UK99,3633,ON727318,13263,Q5VTTNJJ,11611,RR44R650,8102,TN5G0U75,8278,0G0GB7OW,6908,ZWTG9FZP,2562,J6H6T2BB,9449
33,9ZJF3LFT,10660,0G0GB7OW,6827,LYN5B152,5577,ON727318,14155,ZWTG9FZP,2426,TN5G0U75,8942,8W9AFV83,10899,29UO8078,4656,RR44R650,8841,J6H6T2BB,9586,65XZLTQ7,1656,53I098F5,14147,FVQHQGEY,9102,Q5VTTNJJ,10916,9B8187AP,2742,ODH5UK99,3882,HM99YZ4W,5836
34,J6H6T2BB,5486,LYN5B152,2972,FVQHQGEY,4074,ODH5UK99,2209,ON727318,7773,9ZJF3LFT,7188,9B8187AP,1680,HM99YZ4W,2643,65XZLTQ7,1568,Q5VTTNJJ,5153,53I098F5,11457,8W9AFV83,9942,ZWTG9FZP,2054,29UO8078,3602,RR44R650,4346,0G0GB7OW,5115,TN5G0U75,6474
35,B8S0ZDTA,2864,67MENHLU,1430,7X9R4BB2,2532,X577IAFI,2877
36,9GC7AM3X,2462,YY46E21M,2595,7U7258G5,914,U9OZQNHC,1622
37,H8EFF5CG,1493,248WCFGF,2225,VD7B26W5,1243,JTQFOU27,1501
38,PGE7LB8Z,964,99SHYNCN,2332,09YGVQDT,1026,OM710FQA,1514,U9OZQNHC,1846
39,28U770B0,1858,R9B95265,1070,P24SYA92,389
40,90967MZP,2463,C4JQZCBI,2987

The fields in the output are as follows. There is one record for each week of the year.

  1. Week of the year index
  2. List of product ID and weekly sales amount pairs for the products

We notice a sharp rise in the number of products from week 28 through week 34. That range of weeks happens to be the pre-school sales season.

Product Recommendation

Although not the main focus of this post, I will briefly discuss the technique for including seasonal products in the core product recommendation list. Typically, a personalized recommendation engine will produce output with the following components.

  1. User ID
  2. Product ID
  3. Score

We have to convert the seasonal product list to a similar format. We can convert the amount, which reflects demand, to a score on the same scale as the recommendation list. We also have to replicate the list for each user by prepending the user IDs.
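For example, the weekly amounts could be min-max scaled to the recommender's score range and replicated for each user, along these lines. The record layout and the scaling choice are assumptions; this is just a sketch.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SeasonalScoreList {
    // Min-max scale the weekly amounts to the recommender's score range and
    // replicate the list per user as (user ID, product ID, score) records.
    public static List<String> toScoredList(Map<String, Long> weekAmounts,
                                            List<String> userIds, double maxScore) {
        long min = weekAmounts.values().stream().mapToLong(Long::longValue).min().orElse(0);
        long max = weekAmounts.values().stream().mapToLong(Long::longValue).max().orElse(1);
        List<String> records = new ArrayList<>();
        for (String user : userIds) {
            for (Map.Entry<String, Long> e : weekAmounts.entrySet()) {
                double score = maxScore * (e.getValue() - min) / Math.max(1, max - min);
                records.add(user + "," + e.getKey() + "," + score);
            }
        }
        return records;
    }
}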

Once we have both ranked lists, we can use the map reduce job RankAggregator to consolidate the two lists and generate a net recommendation list. The scores from the two lists can be combined in several ways, including average, weighted average, etc.
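As one possible combination scheme, a weighted average of the two score lists could look like the sketch below; it stands in for the general idea rather than reproducing RankAggregator.

import java.util.HashMap;
import java.util.Map;

public class ScoreCombiner {
    // Weighted average of recommender scores and seasonal scores for one user.
    // A product present in only one list keeps its weighted score from that list.
    public static Map<String, Double> combine(Map<String, Double> recScores,
                                              Map<String, Double> seasonalScores,
                                              double recWeight) {
        Map<String, Double> combined = new HashMap<>();
        recScores.forEach((prod, score) -> combined.put(prod, recWeight * score));
        seasonalScores.forEach((prod, score) ->
            combined.merge(prod, (1 - recWeight) * score, Double::sum));
        return combined;
    }
}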

Summing Up

We have gone through a Hadoop map reduce based solution for identifying seasonal products from sales data. To execute the steps in this use case, please use the tutorial.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains, for early stage startups, large corporations and everything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solutions.
