The other day someone asked me how to include products with seasonal demand in recommendations based on collaborative filtering or some other technique. The solution to the problem involves two steps. The first step is to identify products with seasonal demand. The second step involves merging two ranked lists, one list being the recommendation list based on some recommendation algorithm, the other being the list of products with seasonal demand. The second list is a function of time.
Our focus in this post is on the first problem. As we will explore here in this post, products with seasonal demand can be found with some simple statistical technique. The solution involves running three Hadoop Map Reduce jobs in tandem. The implementation is available in open source project chombo available on github.
Order Transaction Data as Input
Our use case is for an on line store. We use the order transaction data with the following components.
- Transaction ID
- Product ID
- Monetary amount
- Time stamp
The monetary amount is used as an indicator for demand. We could have also used product quantity. Here is some sample input. Our data will contain higher than normal pre school sales for certain products.
At any given time of the year we want the solution to give us a list of products that has high seasonal demand at that time of the year.
The solution has three steps. The first step is to detect seasonality in the data. Data with seasonal components has the following characteristics.
- A fixed cycle of seasonality e.g for week of the year seasonality,the cycle is a year
- A time interval e.g for week of the year seasonality, the interval is 7 days
- Seasonal cycle index, e.g for week of the year seasonality, the index is between 0 and 53
In this step, we process past transaction data and bin the transaction amount with one bin per week. Since are using week of the year as the seasonality cycle, there will be 53 bins, one of each week in an year.
We sum the amount for each bin. The output is essentially an de normalized histogram, A week is considered to begin on a Monday. Since there may be partial weeks near the year boundary, we may have have up to 53 weeks in a year.More on seasonality detection can be found in an earlier post.
In the second step we take the output of the first step and calculate normalized distribution of transaction amount and certain statistical properties from the distribution.
In the third and final step, we take the output of the second step and identify the modes of the histogram for products with significant non uniformity in the histogram distribution. The modes will correspond to weeks with higher than average demand.
Map Reduce Pipeline
As mentioned earlier, there are three map reduce jobs that run in tandem, one for each of the 3 steps as described earlier. The first map reduce job SeasonalDetector maps the time stamp to a week of the year index i.e cycle index for each transaction. It’s output is as follows. There is one output record for each product and week of the years.
ZWTG9FZP,0,24,2,342 ZWTG9FZP,0,25,2,156 ZWTG9FZP,0,26,2,360 ZWTG9FZP,0,27,2,381 ZWTG9FZP,0,28,2,1300 ZWTG9FZP,0,29,2,2907 ZWTG9FZP,0,30,2,2259 ZWTG9FZP,0,31,2,2588
The different fields are as follows.
- Product ID
- Parent cycle index
- Cycle index i.e. week of the year in this case
- Attribute index
- Total amount
Some seasonality cycle have a parent cycle. For example, day of the month has parent cycle which is month of the year.
We have a field called attribute index which refers to the field for transaction amount in the input. It’s there to identify the column being analyzed, because this map reduce job can process multiple data columns for seasonality.
The next map reduce job NumericalAttrDistrStats takes the input of the first job and calculates distribution and some statistical quantities for each product. The histogram bin is the week of the year index and the bin value is the total amount for the week. Here is some sample output. There is one output record for each product.
XAYSOMO8,53,0,886,1,875,2,778,3,889,4,1149,5,1344,6,196,7,660,8,397,9,761,10,669,11,516,12,393,13,592,14,475,15,1143,16,398,17,899,18,667,19,474,20,777,21,388,22,382,23,745,24,295,25,582,26,878,27,1139,28,379,29,190,30,297,31,87,32,594,33,88,34,378,35,594,36,764,37,897,38,971,39,671,40,651,41,568,42,280,43,1238,44,855,45,946,46,672,47,752,48,840,49,395,50,493,51,369,52,381,3.859,5.500,10.731,25.172,40.023 Y1LHFN55,53,1,376,2,243,3,315,4,177,5,312,6,237,7,246,8,314,9,171,10,283,11,140,12,247,13,207,14,171,15,146,16,427,17,171,18,348,19,310,20,210,21,315,22,244,23,242,24,316,25,345,26,278,27,270,28,173,29,163,30,37,31,181,32,108,33,105,34,109,35,413,36,317,37,177,38,65,39,246,40,342,41,427,42,240,43,243,44,74,45,173,46,312,47,305,48,237,49,244,50,135,51,67,52,138,53,101,3.878,16.500,12.947,24.772,39.955 Y3N81J94,53,0,586,1,872,2,711,3,1235,4,1002,5,827,6,361,7,510,8,1214,9,229,10,947,11,828,12,971,13,844,14,491,15,958,16,822,17,1212,18,352,19,1092,20,847,21,856,22,844,23,476,24,956,25,568,26,1241,27,843,28,714,29,110,30,232,31,264,32,381,33,449,35,600,36,720,37,972,38,983,39,495,40,642,41,498,42,480,43,461,44,835,45,498,46,831,47,1589,48,602,49,808,50,845,51,593,52,487,53,133,3.873,47.500,12.162,23.729,39.640
The fields are as follows. For our problem, we will be using the histogram and optionally the entropy value.
- Product ID
- Number of bins
- Bin and bin value pairs for all the bins. The output could be normalized optinally
- 25% quantile
- 50% quantile or median
- 75% quantile
The final map reduce job SeasonalCycleFinder detects modes in the histogram of products with significant non uniformity in distribution. There are two ways to detect non uniformity and modes as follows
- Modes with value some percentage above the average value
- Top n modes.
There is also an additional filtering based on entropy. Entropy is a measure of randomness in a distribution. Distributions with high non uniformity and peaks have low entropy value.
With this filter on, we apply a upper limit for the entropy. With the second strategy for mode identification, it is important to apply the entropy based filter. Here is some sample output. We have used the first strategy for mode identification. The threshold for mode is set to 200% of average.
23,7P6U253E,2008,45I15DS7,923 24,JTQFOU27,1400,ILCJW987,498,04KLY51P,1303,8L07Q41R,1456 25,GV6DZTX9,1027,WNUB5122,726,N48JP2IT,1629,SAWW6PBM,979,N603Z8V8,2778,939XB2F7,657 26,34109709,1916,962NH84D,1950,N5U0ZPYI,2090,FOPC4URK,1142 27,E069MWVG,611,TWL5XG9G,1943,C4JQZCBI,2618 28,RR44R650,4739,53I098F5,6452,Q5VTTNJJ,4660,ZWTG9FZP,1300,9B8187AP,1419,TN5G0U75,4140,9ZJF3LFT,5801,ON727318,6829,J6H6T2BB,4751,65XZLTQ7,934,FVQHQGEY,4378,LYN5B152,2404,HM99YZ4W,2074 29,J6H6T2BB,11283,9ZJF3LFT,12901,ODH5UK99,3408,RR44R650,8668,LYN5B152,5504,HM99YZ4W,4749,65XZLTQ7,1783,53I098F5,14876,TN5G0U75,6915,29UO8078,5036,9B8187AP,2576,0G0GB7OW,7148,ZWTG9FZP,2907,ON727318,9979,FVQHQGEY,8535,Q5VTTNJJ,10505,8W9AFV83,16016 30,TN5G0U75,5298,LYN5B152,5552,9ZJF3LFT,12947,0G0GB7OW,5753,53I098F5,17555,ON727318,13936,J6H6T2BB,10821,FVQHQGEY,6747,HM99YZ4W,3559,8W9AFV83,17845,65XZLTQ7,1852,RR44R650,8049,29UO8078,4411,Q5VTTNJJ,10507,ZWTG9FZP,2259,9B8187AP,2618,ODH5UK99,4830 31,ZWTG9FZP,2588,9ZJF3LFT,13496,29UO8078,4283,LYN5B152,6043,9B8187AP,3345,0G0GB7OW,4795,65XZLTQ7,2141,ODH5UK99,3942,RR44R650,10101,Q5VTTNJJ,11365,J6H6T2BB,7045,TN5G0U75,7791,8W9AFV83,14827,53I098F5,17611,HM99YZ4W,4223,ON727318,12560,FVQHQGEY,7899 32,65XZLTQ7,2044,8W9AFV83,18309,53I098F5,17405,9B8187AP,2883,9ZJF3LFT,11801,FVQHQGEY,8683,HM99YZ4W,5587,LYN5B152,4354,29UO8078,4506,ODH5UK99,3633,ON727318,13263,Q5VTTNJJ,11611,RR44R650,8102,TN5G0U75,8278,0G0GB7OW,6908,ZWTG9FZP,2562,J6H6T2BB,9449 33,9ZJF3LFT,10660,0G0GB7OW,6827,LYN5B152,5577,ON727318,14155,ZWTG9FZP,2426,TN5G0U75,8942,8W9AFV83,10899,29UO8078,4656,RR44R650,8841,J6H6T2BB,9586,65XZLTQ7,1656,53I098F5,14147,FVQHQGEY,9102,Q5VTTNJJ,10916,9B8187AP,2742,ODH5UK99,3882,HM99YZ4W,5836 34,J6H6T2BB,5486,LYN5B152,2972,FVQHQGEY,4074,ODH5UK99,2209,ON727318,7773,9ZJF3LFT,7188,9B8187AP,1680,HM99YZ4W,2643,65XZLTQ7,1568,Q5VTTNJJ,5153,53I098F5,11457,8W9AFV83,9942,ZWTG9FZP,2054,29UO8078,3602,RR44R650,4346,0G0GB7OW,5115,TN5G0U75,6474 35,B8S0ZDTA,2864,67MENHLU,1430,7X9R4BB2,2532,X577IAFI,2877 36,9GC7AM3X,2462,YY46E21M,2595,7U7258G5,914,U9OZQNHC,1622 37,H8EFF5CG,1493,248WCFGF,2225,VD7B26W5,1243,JTQFOU27,1501 38,PGE7LB8Z,964,99SHYNCN,2332,09YGVQDT,1026,OM710FQA,1514,U9OZQNHC,1846 39,28U770B0,1858,R9B95265,1070,P24SYA92,389 40,90967MZP,2463,C4JQZCBI,2987
The fields in the output are as follows. There is one record for each week of the year.
- Week of the year index
- List of product ID and weekly sales amount for the product
We notice a sharp rise in the number of products from week 28 through week 34. That range of weeks happens to be the pre school sales season.
Although not the main focus, I will discuss briefly the technique to include seasonal products into the core product recommendation list. Typically, a personalized recommendation engine will have output with following components.
- User ID
- Product ID
We have to convert the seasonal product list to a similar format. We can can convert the amount which reflects demand to a score with same scale as the recommendation list. We also also have to replicate the list for each user by pre pending the the user IDs.
Once we have both ranked list, we can use the map reduce job RankAggregator to consolidate the two lists and generate a net recommendation list. The scores from the two list could be combined in several ways, including average, weighted average etc.
We have gone through a Hadoop map reduce based solution for identifying seasonal products from sales data. To execute the steps in this use case, please use the tutorial.