Mining Seasonal Products from Sales Data


The other day someone asked me how to include products with seasonal demand in recommendations based on collaborative filtering or some other technique. The solution to the problem involves two steps. The first step is to identify products with seasonal demand. The second step involves merging two ranked lists: one is the recommendation list produced by some recommendation algorithm, and the other is the list of products with seasonal demand. The second list is a function of time.

Our focus in this post is on the first problem. As we will see, products with seasonal demand can be found with some simple statistical techniques. The solution involves running three Hadoop Map Reduce jobs in tandem. The implementation is available in the open source project chombo on github.

Order Transaction Data as Input

Our use case is an online store. We use order transaction data with the following fields.

  1. Transaction ID
  2. Product ID
  3. Monetary amount
  4. Time stamp

The monetary amount is used as an indicator of demand. We could also have used product quantity. Here is some sample input. Our data contains higher than normal pre-school sales for certain products.
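The exact layout is not critical; a few records might look like the following. The transaction IDs, amounts and epoch-second time stamps here are hypothetical, for illustration only.

3CQ92UR70FXA,ZWTG9FZP,34.99,1374153600
81HGV2D0P5TT,53I098F5,112.50,1374240000
0M6JW8A2QKXD,ZWTG9FZP,27.25,1374326400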

Solution Outline

At any given time of the year, we want the solution to give us a list of products that have high seasonal demand at that time of the year.

The solution has three steps. The first step is to detect seasonality in the data. Data with seasonal components has the following characteristics.

  1. A fixed cycle of seasonality, e.g., for week of the year seasonality, the cycle is a year
  2. A time interval, e.g., for week of the year seasonality, the interval is 7 days
  3. A seasonal cycle index, e.g., for week of the year seasonality, the index is between 0 and 53

In this step, we process past transaction data and bin the transaction amounts, with one bin per week. Since we are using week of the year as the seasonality cycle, there will be 53 bins, one for each week in a year.

We sum the amounts for each bin. The output is essentially a denormalized histogram. A week is considered to begin on a Monday. Since there may be partial weeks near the year boundaries, we may have up to 53 weeks in a year. More on seasonality detection can be found in an earlier post.
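To make the binning concrete, here is a minimal sketch of how an epoch-second time stamp could be mapped to a Monday-aligned week of the year index, with a partial week 0 when the year does not begin on a Monday. This only illustrates the idea; it is not chombo's actual implementation.

import java.time.DayOfWeek;
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;
import java.time.temporal.TemporalAdjusters;

public class WeekIndex {
    // Map an epoch-second time stamp to a Monday-aligned week of the year index.
    // Days before the first Monday of the year form a partial week with index 0.
    public static int weekOfYear(long epochSec) {
        LocalDate date = Instant.ofEpochSecond(epochSec).atZone(ZoneOffset.UTC).toLocalDate();
        LocalDate jan1 = date.withDayOfYear(1);
        LocalDate firstMonday = jan1.with(TemporalAdjusters.nextOrSame(DayOfWeek.MONDAY));
        if (date.isBefore(firstMonday)) {
            return 0;    // partial week at the start of the year
        }
        int fullWeeks = (int) (ChronoUnit.DAYS.between(firstMonday, date) / 7);
        // shift by one when a partial week 0 exists
        return fullWeeks + (jan1.equals(firstMonday) ? 0 : 1);
    }
}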

In the second step, we take the output of the first step and calculate the normalized distribution of the transaction amount, along with certain statistical properties of the distribution.

In the third and final step, we take the output of the second step and identify the modes of the histogram for products with significant non-uniformity in the histogram distribution. The modes correspond to weeks with higher than average demand.

Map Reduce Pipeline

As mentioned earlier, there are three map reduce jobs that run in tandem, one for each of the three steps described earlier. The first map reduce job, SeasonalDetector, maps the time stamp of each transaction to a week of the year index, i.e., the cycle index. Its output is as follows. There is one output record for each product and week of the year.

ZWTG9FZP,0,24,2,342
ZWTG9FZP,0,25,2,156
ZWTG9FZP,0,26,2,360
ZWTG9FZP,0,27,2,381
ZWTG9FZP,0,28,2,1300
ZWTG9FZP,0,29,2,2907
ZWTG9FZP,0,30,2,2259
ZWTG9FZP,0,31,2,2588

The different fields are as follows.

  1. Product ID
  2. Parent cycle index
  3. Cycle index, i.e., week of the year in this case
  4. Attribute index
  5. Total amount

Some seasonality cycles have a parent cycle. For example, day of the month has month of the year as its parent cycle.

We have a field called attribute index, which refers to the field for the transaction amount in the input. It is there to identify the column being analyzed, because this map reduce job can process multiple data columns for seasonality.
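To make the data flow concrete, below is a stripped-down sketch of what this job does, assuming the input field positions listed earlier (transaction ID, product ID, amount, epoch-second time stamp), a hard-coded attribute index of 2 and no parent cycle. The actual SeasonalDetector job in chombo is configuration driven and more general.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SeasonalDetectorSketch {
    public static class SeasonalMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] items = value.toString().split(",");
            String productId = items[1];
            long amount = Math.round(Double.parseDouble(items[2]));
            long epochSec = Long.parseLong(items[3]);
            // map the time stamp to the cycle index, as in the earlier week index sketch
            int cycleIndex = WeekIndex.weekOfYear(epochSec);
            // key: product ID, parent cycle index (0 as a placeholder), cycle index, attribute index
            ctx.write(new Text(productId + ",0," + cycleIndex + ",2"), new LongWritable(amount));
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();    // total amount for this product and week
            }
            // with the key/value separator set to a comma, records come out like ZWTG9FZP,0,24,2,342
            ctx.write(key, new LongWritable(sum));
        }
    }
}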

The next map reduce job, NumericalAttrDistrStats, takes the output of the first job and calculates the distribution and some statistical quantities for each product. The histogram bin is the week of the year index and the bin value is the total amount for the week. Here is some sample output. There is one output record for each product.

XAYSOMO8,53,0,886,1,875,2,778,3,889,4,1149,5,1344,6,196,7,660,8,397,9,761,10,669,11,516,12,393,13,592,14,475,15,1143,16,398,17,899,18,667,19,474,20,777,21,388,22,382,23,745,24,295,25,582,26,878,27,1139,28,379,29,190,30,297,31,87,32,594,33,88,34,378,35,594,36,764,37,897,38,971,39,671,40,651,41,568,42,280,43,1238,44,855,45,946,46,672,47,752,48,840,49,395,50,493,51,369,52,381,3.859,5.500,10.731,25.172,40.023
Y1LHFN55,53,1,376,2,243,3,315,4,177,5,312,6,237,7,246,8,314,9,171,10,283,11,140,12,247,13,207,14,171,15,146,16,427,17,171,18,348,19,310,20,210,21,315,22,244,23,242,24,316,25,345,26,278,27,270,28,173,29,163,30,37,31,181,32,108,33,105,34,109,35,413,36,317,37,177,38,65,39,246,40,342,41,427,42,240,43,243,44,74,45,173,46,312,47,305,48,237,49,244,50,135,51,67,52,138,53,101,3.878,16.500,12.947,24.772,39.955
Y3N81J94,53,0,586,1,872,2,711,3,1235,4,1002,5,827,6,361,7,510,8,1214,9,229,10,947,11,828,12,971,13,844,14,491,15,958,16,822,17,1212,18,352,19,1092,20,847,21,856,22,844,23,476,24,956,25,568,26,1241,27,843,28,714,29,110,30,232,31,264,32,381,33,449,35,600,36,720,37,972,38,983,39,495,40,642,41,498,42,480,43,461,44,835,45,498,46,831,47,1589,48,602,49,808,50,845,51,593,52,487,53,133,3.873,47.500,12.162,23.729,39.640

The fields are as follows. For our problem, we will be using the histogram and optionally the entropy value.

  1. Product ID
  2. Number of bins
  3. Bin and bin value pairs for all the bins. The output can optionally be normalized
  4. Entropy
  5. Mode
  6. 25% quantile
  7. 50% quantile or median
  8. 75% quantile
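Since the entropy value is used for filtering later, here is a small sketch of how entropy could be computed from the binned totals, assuming a map from week index to total amount. Again, this is an illustration, not the actual NumericalAttrDistrStats code.

import java.util.Map;

public class DistrStats {
    // Entropy (in bits) of the normalized bin distribution; lower values mean peakier histograms.
    public static double entropy(Map<Integer, Long> hist) {
        double total = hist.values().stream().mapToLong(Long::longValue).sum();
        double h = 0;
        for (long binValue : hist.values()) {
            if (binValue > 0) {
                double p = binValue / total;            // normalized bin value
                h -= p * (Math.log(p) / Math.log(2));   // convert natural log to log base 2
            }
        }
        return h;
    }
}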

The final map reduce job, SeasonalCycleFinder, detects modes in the histograms of products with significant non-uniformity in their distributions. There are two ways to detect non-uniformity and modes, as follows.

  1. Modes with values some percentage above the average value
  2. Top n modes.

There is also an additional filter based on entropy. Entropy is a measure of randomness in a distribution. Distributions with high non-uniformity and pronounced peaks have low entropy values.
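Combining the first mode identification strategy with the entropy filter, the logic might look roughly like the sketch below. The percentage threshold and entropy limit are hypothetical parameters, and this is not the actual SeasonalCycleFinder code.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ModeFinder {
    // Return the week indexes whose value clears a percentage-of-average threshold,
    // skipping products whose distribution is too uniform, i.e., entropy above the limit.
    public static List<Integer> seasonalWeeks(Map<Integer, Long> hist, double entropy,
                                              double pctOfAvg, double maxEntropy) {
        List<Integer> modes = new ArrayList<>();
        if (entropy > maxEntropy) {
            return modes;    // too close to uniform; no seasonal modes
        }
        double avg = hist.values().stream().mapToLong(Long::longValue).average().orElse(0);
        double threshold = avg * pctOfAvg / 100;    // e.g., 200 percent of average
        for (Map.Entry<Integer, Long> e : hist.entrySet()) {
            if (e.getValue() >= threshold) {
                modes.add(e.getKey());
            }
        }
        return modes;
    }
}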

With the entropy filter on, we apply an upper limit to the entropy. With the second strategy for mode identification, it is important to apply the entropy based filter, because the top n modes exist even for a nearly uniform distribution. Here is some sample output. We have used the first strategy for mode identification, with the mode threshold set to 200% of the average.

23,7P6U253E,2008,45I15DS7,923
24,JTQFOU27,1400,ILCJW987,498,04KLY51P,1303,8L07Q41R,1456
25,GV6DZTX9,1027,WNUB5122,726,N48JP2IT,1629,SAWW6PBM,979,N603Z8V8,2778,939XB2F7,657
26,34109709,1916,962NH84D,1950,N5U0ZPYI,2090,FOPC4URK,1142
27,E069MWVG,611,TWL5XG9G,1943,C4JQZCBI,2618
28,RR44R650,4739,53I098F5,6452,Q5VTTNJJ,4660,ZWTG9FZP,1300,9B8187AP,1419,TN5G0U75,4140,9ZJF3LFT,5801,ON727318,6829,J6H6T2BB,4751,65XZLTQ7,934,FVQHQGEY,4378,LYN5B152,2404,HM99YZ4W,2074
29,J6H6T2BB,11283,9ZJF3LFT,12901,ODH5UK99,3408,RR44R650,8668,LYN5B152,5504,HM99YZ4W,4749,65XZLTQ7,1783,53I098F5,14876,TN5G0U75,6915,29UO8078,5036,9B8187AP,2576,0G0GB7OW,7148,ZWTG9FZP,2907,ON727318,9979,FVQHQGEY,8535,Q5VTTNJJ,10505,8W9AFV83,16016
30,TN5G0U75,5298,LYN5B152,5552,9ZJF3LFT,12947,0G0GB7OW,5753,53I098F5,17555,ON727318,13936,J6H6T2BB,10821,FVQHQGEY,6747,HM99YZ4W,3559,8W9AFV83,17845,65XZLTQ7,1852,RR44R650,8049,29UO8078,4411,Q5VTTNJJ,10507,ZWTG9FZP,2259,9B8187AP,2618,ODH5UK99,4830
31,ZWTG9FZP,2588,9ZJF3LFT,13496,29UO8078,4283,LYN5B152,6043,9B8187AP,3345,0G0GB7OW,4795,65XZLTQ7,2141,ODH5UK99,3942,RR44R650,10101,Q5VTTNJJ,11365,J6H6T2BB,7045,TN5G0U75,7791,8W9AFV83,14827,53I098F5,17611,HM99YZ4W,4223,ON727318,12560,FVQHQGEY,7899
32,65XZLTQ7,2044,8W9AFV83,18309,53I098F5,17405,9B8187AP,2883,9ZJF3LFT,11801,FVQHQGEY,8683,HM99YZ4W,5587,LYN5B152,4354,29UO8078,4506,ODH5UK99,3633,ON727318,13263,Q5VTTNJJ,11611,RR44R650,8102,TN5G0U75,8278,0G0GB7OW,6908,ZWTG9FZP,2562,J6H6T2BB,9449
33,9ZJF3LFT,10660,0G0GB7OW,6827,LYN5B152,5577,ON727318,14155,ZWTG9FZP,2426,TN5G0U75,8942,8W9AFV83,10899,29UO8078,4656,RR44R650,8841,J6H6T2BB,9586,65XZLTQ7,1656,53I098F5,14147,FVQHQGEY,9102,Q5VTTNJJ,10916,9B8187AP,2742,ODH5UK99,3882,HM99YZ4W,5836
34,J6H6T2BB,5486,LYN5B152,2972,FVQHQGEY,4074,ODH5UK99,2209,ON727318,7773,9ZJF3LFT,7188,9B8187AP,1680,HM99YZ4W,2643,65XZLTQ7,1568,Q5VTTNJJ,5153,53I098F5,11457,8W9AFV83,9942,ZWTG9FZP,2054,29UO8078,3602,RR44R650,4346,0G0GB7OW,5115,TN5G0U75,6474
35,B8S0ZDTA,2864,67MENHLU,1430,7X9R4BB2,2532,X577IAFI,2877
36,9GC7AM3X,2462,YY46E21M,2595,7U7258G5,914,U9OZQNHC,1622
37,H8EFF5CG,1493,248WCFGF,2225,VD7B26W5,1243,JTQFOU27,1501
38,PGE7LB8Z,964,99SHYNCN,2332,09YGVQDT,1026,OM710FQA,1514,U9OZQNHC,1846
39,28U770B0,1858,R9B95265,1070,P24SYA92,389
40,90967MZP,2463,C4JQZCBI,2987

The fields in the output are as follows. There is one record for each week of the year.

  1. Week of the year index
  2. List of product ID and weekly sales amount pairs for the products

We notice a sharp rise in the number of products from week 28 through week 34. That range of weeks happens to be the pre-school sales season.

Product Recommendation

Although not the main focus of this post, I will briefly discuss the technique for including seasonal products in the core product recommendation list. Typically, a personalized recommendation engine will produce output with the following components.

  1. User ID
  2. Product ID
  3. Score

We have to convert the seasonal product list to a similar format. We can convert the amount, which reflects demand, to a score on the same scale as the recommendation list. We also have to replicate the list for each user by prepending the user IDs.
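For example, the weekly amounts could be min-max scaled to the recommender's score range and replicated for each user, along these lines. The record layout and the scaling choice are assumptions; this is just a sketch.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SeasonalScoreList {
    // Min-max scale the weekly amounts to the recommender's score range and
    // replicate the list per user as (user ID, product ID, score) records.
    public static List<String> toScoredList(Map<String, Long> weekAmounts,
                                            List<String> userIds, double maxScore) {
        long min = weekAmounts.values().stream().mapToLong(Long::longValue).min().orElse(0);
        long max = weekAmounts.values().stream().mapToLong(Long::longValue).max().orElse(1);
        List<String> records = new ArrayList<>();
        for (String user : userIds) {
            for (Map.Entry<String, Long> e : weekAmounts.entrySet()) {
                double score = maxScore * (e.getValue() - min) / Math.max(1, max - min);
                records.add(user + "," + e.getKey() + "," + score);
            }
        }
        return records;
    }
}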

Once we have both ranked lists, we can use the map reduce job RankAggregator to consolidate the two lists and generate a net recommendation list. The scores from the two lists can be combined in several ways, including average, weighted average, etc.
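As one possible combination scheme, a weighted average of the two score lists could look like the sketch below; it stands in for the general idea rather than reproducing RankAggregator.

import java.util.HashMap;
import java.util.Map;

public class ScoreCombiner {
    // Weighted average of recommender scores and seasonal scores for one user.
    // A product present in only one list keeps its weighted score from that list.
    public static Map<String, Double> combine(Map<String, Double> recScores,
                                              Map<String, Double> seasonalScores,
                                              double recWeight) {
        Map<String, Double> combined = new HashMap<>();
        recScores.forEach((prod, score) -> combined.put(prod, recWeight * score));
        seasonalScores.forEach((prod, score) ->
            combined.merge(prod, (1 - recWeight) * score, Double::sum));
        return combined;
    }
}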

Summing Up

We have gone through a Hadoop map reduce based solution for identifying seasonal products from sales data. To execute the steps in this use case, please use the tutorial.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains, for early stage startups, large corporations and everything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solutions.
