Model Drift Detection with the Kolmogorov-Smirnov Statistic on Spark


In the retail business, you may be using various business solutions based on product demand data, e.g., inventory management, or tracking how a newly introduced product performs over time. The buying behavior model may change with time, rendering those solutions ineffective. It may be necessary to periodically tune the system, i.e., to detect any drift in the behavior model. If the drift is significant, appropriate changes should be made to the business solutions that depend on the demand distribution.

In this post we will find out how to use a statistic called the Kolmogorov-Smirnov statistic (KS statistic) to measure product demand model drift. The Spark implementation of the KS statistic is available in my open source projects chombo and avenir. The implementation uses a heavy dose of configuration and is application agnostic.

Model Drift and the Kolmogorov-Smirnov Statistic

The non-stationary nature of data causes model drift. Data is non-stationary when its statistical properties change with time. Model drift has various ramifications; for classification and regression problems, there will be a significant drop in accuracy.

The Kolmogorov-Smirnov statistic (KS statistic) detects differences between data distributions. Classification algorithms can broadly be divided into two categories: data distribution based and discrimination based. Discrimination-based algorithms like Decision Tree and SVM are based on the class boundaries. For such algorithms, detecting drift in the data distribution may not suffice.

For the particular use case we are discussing, the model is defined by the probability distributions of the time elapsed since the last transaction and of the quantity sold.

The Kolmogorov-Smirnov statistic is based on the maximum difference between the cumulative distribution functions of a reference distribution and a recent distribution. There are other techniques for detecting distribution drift, but the KS statistic has a solid statistical foundation because it is based on a significance test.
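
Concretely, in the two-sample form (a standard formulation, stated here for reference rather than taken from the chombo source), with Fref and Frec denoting the empirical cumulative distribution functions of the reference and recent samples:

D = max over x of | Fref(x) - Frec(x) |

The drift is judged significant at level alpha when D > c(alpha) * sqrt((m + n) / (m * n)), where m and n are the two sample sizes and c(0.05) is approximately 1.36.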

When a model is built for deployment, a distribution is calculated for the data set and saved; this serves as the reference distribution. At some future time, the distribution of the recent data can be calculated and used along with the reference distribution to compute the KS statistic.

Sales Data

Our use case involves retail sales data consisting of a product ID, transaction time, and sold quantity. Our goal is to detect any drift in the distributions of the time between transactions and of the quantity sold.

The Spark pipeline contains three Spark jobs, listed below, for data preprocessing and calculation of the KS statistic.

  • Convert time stamp to elapsed time since last transaction
  • Calculate empirical probability distribution for elapsed time and quantity
  • Calculate KS statistic

The first two jobs need to run twice, once for the reference data and once for the recent data. Finally, the last job takes the output for the reference and recent data and computes the KS statistic.
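
As a minimal sketch of this run order in Scala, with hypothetical wrapper functions standing in for the actual job submissions (the real jobs are separate Spark applications driven by chombo and avenir configuration):

// hypothetical wrappers; they only trace the sequence of the pipeline
def timeIntervalGenerator(in: String, out: String): Unit =
  println(s"TimeIntervalGenerator: $in -> $out")
def numericalAttrDistrStats(in: String, out: String): Unit =
  println(s"NumericalAttrDistrStats: $in -> $out")
def kolmogorovSmirnovModelDrift(ref: String, recent: String, out: String): Unit =
  println(s"KolmogorovSmirnovModelDrift: $ref, $recent -> $out")

// the first two jobs run once per data window
for ((label, input) <- Seq("ref" -> "refSales", "recent" -> "recentSales")) {
  timeIntervalGenerator(input, s"$label/intervals")
  numericalAttrDistrStats(s"$label/intervals", s"$label/distr")
}
// the final job compares the two sets of distributions
kolmogorovSmirnovModelDrift("ref/distr", "recent/distr", "ks/output")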

Spark Pipeline

The Spark job TimeIntervalGenerator converts time stamps to time intervals.
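
Here is a minimal Spark sketch of what this step does, assuming input rows of (productId, timestamp, quantity); the column names and file path are illustrative, not chombo's actual configuration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val spark = SparkSession.builder.appName("timeIntervalSketch").getOrCreate()

// hypothetical input: product ID, epoch time stamp, quantity per row
val sales = spark.read.csv("sales.csv")
  .toDF("productId", "timestamp", "quantity")
  .withColumn("timestamp", col("timestamp").cast("long"))

// elapsed time since the previous transaction of the same product
val byProduct = Window.partitionBy("productId").orderBy("timestamp")
val intervals = sales
  .withColumn("interval", col("timestamp") - lag("timestamp", 1).over(byProduct))
  .na.drop(Seq("interval"))  // the first transaction of each product has no interval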

The next Spark job NumericalAttrDistrStats calculates histograms for the time interval and the sold quantity. There is seasonality in the sales data: transactions are segmented based on whether purchases are made during the day or the evening hours. Separate histograms are calculated for each product and each of the two seasonal indexes. Here is some sample output.

94L0FMWN6LJG,nightDayHourOfDay,0,3,1.000,5,1,217,2,123,3,64,4,22,5,8,434,783.000,1837.000
MDFFQUJNPJ3A,nightDayHourOfDay,0,3,1.000,6,1,164,2,148,3,149,4,87,5,42,6,8,598,1513.000,4827.000
MDFFQUJNPJ3A,nightDayHourOfDay,1,2,10.000,128,25,4,26,37,27,50,28,127,29,194,30,332,31,491,32,703,33,986,34,1278,35,1506,36,1698,37,1863,38,1849,39,1721,40,1534,41,1285,42,1054,43,742,44,568,45,381,46,231,47,120,48,86,49,48,50,12,277,1,288,2,294,1,319,1,329,1,336,1,339,1,345,1,368,1,370,1,371,1,372,1,373,1,380,1,381,1,384,1,388,1,392,1,396,1,401,1,406,1,407,1,408,1,411,1,418,1,423,1,425,1,427,1,429,1,436,2,441,1,443,1,445,2,450,1,452,1,453,1,455,1,456,1,457,2,458,1,465,2,466,2,468,1,471,1,478,1,479,1,480,1,483,1,486,1,492,1,493,1,498,1,500,1,501,1,511,1,515,2,516,1,519,1,524,1,529,1,530,1,532,3,535,2,537,2,539,1,541,1,545,1,557,1,558,1,559,1,561,1,566,2,567,1,568,1,571,1,573,1,576,1,578,1,580,1,590,1,591,1,597,1,600,1,605,1,613,2,620,1,621,1,623,1,628,1,629,1,635,1,636,1,638,1,669,1,672,1,682,1,684,1,687,1,689,1,691,2,697,1,705,1,19016,7777244.000,5811194616.000
94L0FMWN6LJG,nightDayHourOfDay,1,2,10.000,135,39,11,40,23,41,34,42,91,43,112,44,145,45,240,46,331,47,450,48,577,49,746,50,830,51,959,52,1031,53,1017,54,1086,55,1003,56,866,57,860,58,671,59,571,60,464,61,361,62,253,63,193,64,117,65,79,66,41,67,24,68,10,396,1,406,1,411,1,422,1,447,1,457,1,468,1,472,1,481,1,490,1,493,2,511,1,514,1,515,2,518,1,532,1,535,1,536,1,546,1,549,1,559,1,562,2,563,1,579,1,583,2,589,1,592,1,593,1,595,1,599,1,604,1,607,1,611,1,622,1,627,2,631,1,632,1,634,1,635,1,636,1,641,1,646,1,654,1,655,1,657,1,658,1,664,2,665,1,675,1,678,1,680,1,681,1,683,2,686,1,689,1,692,1,694,1,698,1,705,1,709,1,710,1,712,1,717,1,726,1,732,1,735,1,756,1,758,1,760,1,764,1,783,1,792,1,796,1,801,1,805,1,809,1,813,2,820,1,822,1,842,1,846,1,850,1,851,1,856,1,857,1,861,1,862,2,873,1,874,1,876,1,877,1,881,1,882,1,884,2,900,1,921,1,927,1,936,1,937,1,939,1,948,1,969,1,973,1,983,2,1042,1,13312,7933436.000,9808878546.000
51742QL0GZ14,nightDayHourOfDay,0,3,1.000,5,1,385,2,102,3,31,4,7,5,1,526,715.000,1209.000
51742QL0GZ14,nightDayHourOfDay,1,2,10.000,125,33,19,34,34,35,83,36,138,37,242,38,401,39,621,40,855,41,1055,42,1370,43,1512,44,1603,45,1681,46,1563,47,1276,48,1123,49,825,50,570,51,410,52,261,53,164,54,79,55,47,56,16,285,1,316,1,317,1,331,1,337,1,352,1,357,1,364,1,374,1,375,2,380,1,388,1,404,1,408,1,414,1,415,1,417,1,419,1,424,1,427,1,432,1,442,1,444,1,445,1,450,1,459,1,464,1,465,2,468,1,471,1,474,2,478,1,481,1,485,1,486,1,487,1,488,1,493,2,494,1,497,1,498,2,500,1,507,1,514,1,519,1,524,1,527,1,530,2,532,1,535,1,539,2,540,1,543,1,548,1,549,1,560,1,563,3,564,2,568,2,574,1,576,1,579,1,582,2,591,1,595,1,597,1,608,2,613,1,615,1,616,1,617,1,621,1,623,1,624,1,633,1,638,1,645,1,646,2,647,1,656,1,658,2,662,1,663,1,664,1,669,1,672,1,674,1,675,1,686,1,692,1,697,1,710,1,713,1,725,1,734,1,742,1,748,1,775,1,781,1,801,1,837,1,16064,7802060.000,6839062226.000
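
The histogram step can be sketched as a simple binned group-by over the intervals DataFrame from the previous sketch; the bin width of 10 is an assumption for illustration, and the seasonal segmentation and chombo's output format are omitted:

// bin counts per product; dividing each count by the per-product total
// yields the empirical probability distribution
val binWidth = 10.0
val histogram = intervals
  .withColumn("bin", floor(col("interval") / binWidth))
  .groupBy("productId", "bin")
  .count()
  .orderBy("productId", "bin")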

The last Spark job KolmogorovSmirnovModelDrift calculates the KS statistic from the distributions of the reference and recent data. Here is some sample output.

P0XT64H5J36E,nightDayHourOfDay,1,2,0.012,false
51742QL0GZ14,nightDayHourOfDay,1,3,0.015,false
IEMDIEX12CFL,nightDayHourOfDay,1,3,0.016,false
6TWQH176QQ6G,nightDayHourOfDay,1,2,0.020,false
G58T046F9X1P,nightDayHourOfDay,1,3,0.004,false
4ALR756SE52E,nightDayHourOfDay,1,3,0.721,true
3494Y1WV4K6D,nightDayHourOfDay,1,3,0.011,false
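
Here is a minimal plain Scala sketch of the KS computation for one product and one attribute, assuming each histogram is available as a map from bin index to count; it mirrors the logic, not the actual KolmogorovSmirnovModelDrift implementation:

// KS statistic: maximum difference between the two cumulative distributions
def ksStatistic(ref: Map[Long, Long], recent: Map[Long, Long]): Double = {
  val bins = (ref.keySet ++ recent.keySet).toSeq.sorted
  val refTotal = ref.values.sum.toDouble
  val recTotal = recent.values.sum.toDouble
  var refCum = 0.0
  var recCum = 0.0
  var maxDiff = 0.0
  for (b <- bins) {
    refCum += ref.getOrElse(b, 0L) / refTotal
    recCum += recent.getOrElse(b, 0L) / recTotal
    maxDiff = math.max(maxDiff, math.abs(refCum - recCum))
  }
  maxDiff
}

// significance test: compare D against c(alpha) * sqrt((m + n) / (m * n)),
// where m and n are the sample sizes and c(0.05) is about 1.36
def drifted(d: Double, m: Long, n: Long, cAlpha: Double = 1.36): Boolean =
  d > cAlpha * math.sqrt((m + n).toDouble / (m.toDouble * n.toDouble))

Presumably the boolean in the last column of the output above is the result of such a significance test.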

We find that the product 4ALR756SE52E has a significant deviation in the distribution of the quantity sold.

Drift in Demand Distribution

The demand distribution is defined by the distributions of the quantity sold and the time elapsed between sales. A smaller quantity sold and/or a larger time between sales implies less demand. Significant drift in the demand distribution can be used in various ways in business. Here are some examples:

  • Inventory management
  • To decide whether a product should be discontinued
  • To decide whether more marketing money should be spent on a product
  • To decide whether a product with increasing demand is cannibalizing a similar product

Wrapping Up

We have seen how the KS statistic can be used to detect significant drift in a probability distribution. To run the use case in this post, please follow the steps in the tutorial document.

You may be wondering why we need Spark to calculate the KS statistic. It is only necessary if you want to do the computation at scale, as in the example in this post.
