Normal Distribution Fitness Test with Chi Square on Spark


Many Machine Learning models are based on certain assumptions about the data. For example, ZScore based anomaly detection assumes that the data has a normal distribution. Your Machine Learning model will only be as good as how well those assumptions hold true. In this post, we will go over a Spark based implementation of the Chi Square test for checking the assumed distribution of a data set.

The implementation is available in my open source project chombo on github. Like all my other Spark projects, the implementation is metadata driven and completely decoupled from any specific data set. Our use case will involve validating the normal distribution assumption of the data.

Chi Square Fitness Test

To test any hypothesis about data, a statistic is calculated from the data set. Using the critical value table for the statistic, we check whether the statistic falls within the rejection region. If so, the null hypothesis is rejected.

Along the same lines, here are the steps for the Chi Square fitness test; a small code sketch follows the list. The Chi Square statistic is based on the squared difference between the expected count and the observed count in each cell.

  1. We hypothesize about the data, stating the null and alternative hypotheses. In our use case, the null hypothesis is that the data distribution is normal.
  2. We compute the expected frequency count in each cell according to the assumed distribution, and the observed frequency count in each cell from the data.
  3. For each cell, we take the square of the difference between the expected and observed frequencies and divide it by the expected frequency. We sum these up over all cells. This is the Chi Square statistic.
  4. If the statistic falls within the rejection region for some chosen significance level, we reject the null hypothesis.
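
To make steps 2 through 4 concrete, here is a minimal Python sketch, using made up observed and expected counts and a 95% confidence level purely for illustration; it is not the chombo implementation.

from scipy import stats
# hypothetical observed and expected counts for a handful of cells
observed = [12, 45, 88, 120, 85, 40, 10]
expected = [10.2, 48.5, 90.1, 118.3, 88.0, 36.4, 8.5]
# step 3: sum of (observed - expected)^2 / expected over all cells
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# step 4: compare with the critical value; 2 degrees of freedom are lost
# for the estimated mean and standard deviation of the normal distribution
dof = len(observed) - 1 - 2
critical = stats.chi2.ppf(0.95, dof)
print('statistic %.3f critical %.3f normal %s' % (chi_sq, critical, chi_sq < critical))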

Next, we will discuss how all these come into play for a real life use case with customer transaction data.

We will build a non parametric distribution of the data. The non parametric distribution, i.e. the histogram, along with the normal distribution, will be used for the Chi Square fitness test.
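
As an illustration of how the two distributions could be obtained for one cluster, here is a rough Python sketch; the sample values and bucket width are made up, and this is not how chombo implements it.

import numpy as np
from scipy import stats
# hypothetical transaction amounts for one cluster (illustrative only)
amounts = np.random.normal(loc=180.0, scale=39.4, size=14594)
# bucket width for the cluster, estimated separately as discussed later
width = 3.313
edges = np.arange(amounts.min(), amounts.max() + width, width)
# observed counts i.e. the non parametric distribution or histogram
observed, _ = np.histogram(amounts, bins=edges)
# expected counts under a normal distribution with estimated mean and std deviation
mu, sigma = amounts.mean(), amounts.std()
expected = np.diff(stats.norm.cdf(edges, loc=mu, scale=sigma)) * len(amounts)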

eCommerce Customer Transaction Data

A smart group of Data Scientists in a company that had successfully built an eCommerce operation built a customer segmentation model to enable more targeted marketing. After running the kNN clustering algorithm multiple times, the optimum number of clusters was found to be 4.

Another group of Data Scientists had been thinking about building a fraud detection system. They thought they could leverage the clusters already built and train an anomaly detection model for each cluster or segment. They decided to use the ZScore algorithm. They might have used the Spark based implementation of ZScore based anomaly detection in beymani, or something else.
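
For reference, a bare bones ZScore anomaly detector for one cluster might look like the following Python sketch; it is only a sketch of the general technique, not the beymani implementation.

import numpy as np
def zscore_outliers(amounts, threshold=3.0):
    # flag transactions whose absolute z score exceeds the threshold
    amounts = np.asarray(amounts, dtype=float)
    zscores = np.abs(amounts - amounts.mean()) / amounts.std()
    return zscores > threshold
# hypothetical monetary amounts for one cluster
print(zscore_outliers([349.43, 195.81, 45.80, 180.68, 226.88, 1250.00]))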

After deploying the anomaly detection solution, they found that too many false alarms were being generated. They decided to investigate the validity of the normal distribution assumption of the data. This brings us back to the main topic of this post. Here is some sample input data

3,19A2ULG1DA,40WV61X060WCK5UY,1548476924,349.43
3,A4T9Q86NHX,O8503O5RPA332ZOR,1548477464,195.81
0,M669EF46CF,O508K80Q80S9KDI1,1548478064,45.80
2,036O4X0UU0,1DBWDXWDSBLQ1LWR,1548478604,180.68
2,9QXSJS79I7,36748D5WGZ7X51C6,1548479024,226.88
1,1NDZF139KH,1EU868Y6I35IKVPL,1548479384,58.23
0,71G145M9LA,68QEJ8WQ2090P038,1548479924,74.25
2,27D025RU09,8QIL8O9IWN3W587N,1548480404,168.20
2,77AZVZ9KJI,603297E1G34LZGL3,1548481004,203.61

The fields are cluster ID, customer ID, transaction ID, transaction time and monetary amount.

Preparatory Steps

Since the mean and standard deviation of the normal distribution are not known, they are estimated from samples. This is done through the Spark job NumericalAttrStats, details of which are available elsewhere. Here is the output.

2,4,$,2625537.790000,495013582.037300,14594,179.905289,1553.066104,39.408960,61.010000,299.570000
0,4,$,874028.560000,53799442.768600,14580,59.947089,96.294565,9.812979,30.050000,89.730000
1,4,$,1075069.000000,89349948.997200,13396,80.252986,229.355899,15.144501,44.410000,127.920000
3,4,$,3885728.760000,1213051919.350399,12943,300.218555,3591.442873,59.928648,120.760000,479.700000

The first 2 fields identify the key. The remaining fields are various statistics. For the non parametric distribution, we need to define the bin or bucket width. The Chi Square fitness test requires at least 5 samples in each bucket.
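
As a quick sanity check of the output, the following Python snippet recomputes the mean, variance and standard deviation for the first line above, assuming the numeric fields after the key are sum, sum of squares, count, mean, variance, standard deviation, min and max; that layout is inferred from the sample output, not from the chombo documentation.

import math
line = '2,4,$,2625537.790000,495013582.037300,14594,179.905289,1553.066104,39.408960,61.010000,299.570000'
fields = line.split(',')
total, sum_sq, count = float(fields[3]), float(fields[4]), int(fields[5])
mean = total / count                      # about 179.905
variance = sum_sq / count - mean * mean   # about 1553.066
std_dev = math.sqrt(variance)             # about 39.409
print(mean, variance, std_dev)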

The last 2 fields in each line of input are the min and max. We use a python script that uses these fields to estimate the bucket width. The script takes the average number of samples per bucket as one of its inputs. Here is the output of the script; the last field is the bucket width.

2,4,3.313
0,4,0.829
1,4,1.265
3,4,5.608
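
One plausible way the bucket width could be estimated is to divide the min to max range by the number of buckets, with the number of buckets derived from the sample count and the desired average samples per bucket; the actual script may use a different heuristic, and the 200 samples per bucket below is just a guess.

def bucket_width(min_val, max_val, count, avg_per_bucket=200):
    # number of buckets needed to get roughly avg_per_bucket samples in each
    num_buckets = max(1, count // avg_per_bucket)
    return (max_val - min_val) / num_buckets
# min, max and count for cluster 2, taken from the stats output above
print(round(bucket_width(61.01, 299.57, 14594), 3))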

Spark Job for Histogram and Fitness Test

The Spark job NumericalAttrDistrStats essentially builds the histogram. Additionally, depending on how it is configured, it can perform the Chi Square or other fitness tests. Here is some sample output.

0,4,0.829,73,36,5,37,12,38,4,39,12,40,21,41,20,42,14,43,24,44,28,45,41,46,52,47,53,48,76,49,70,50,83,51,98,52,128,53,121,54,174,55,202,56,197,57,264,58,253,59,265,60,290,61,314,62,355,63,354,64,397,65,425,66,434,67,452,68,512,69,481,70,443,71,482,72,513,73,481,74,478,75,464,76,423,77,471,78,403,79,411,80,376,81,376,82,317,83,325,84,303,85,272,86,225,87,206,88,184,89,165,90,149,91,131,92,136,93,110,94,78,95,85,96,59,97,54,98,41,99,39,100,36,101,30,102,20,103,18,104,23,105,6,106,11,107,2,108,3,14580,874028.560,53799442.769,92.793,100.425,true
2,4,3.313,73,18,5,19,11,20,6,21,15,22,13,23,19,24,21,25,36,26,25,27,29,28,52,29,51,30,71,31,92,32,99,33,106,34,124,35,140,36,173,37,159,38,185,39,242,40,274,41,293,42,319,43,282,44,359,45,359,46,416,47,412,48,454,49,449,50,464,51,472,52,496,53,480,54,472,55,497,56,463,57,471,58,413,59,425,60,421,61,373,62,369,63,346,64,367,65,298,66,331,67,282,68,236,69,233,70,189,71,194,72,148,73,156,74,123,75,63,76,97,77,79,78,68,79,56,80,53,81,38,82,23,83,25,84,26,85,19,86,8,87,9,88,14,89,4,90,2,14594,2625537.790,495013582.037,119.162,100.425,false
3,4,5.608,65,21,5,22,10,23,11,24,16,25,23,26,17,27,26,28,32,29,45,30,36,31,60,32,70,33,97,34,97,35,123,36,127,37,145,38,182,39,226,40,247,41,248,42,274,43,314,44,332,45,389,46,399,47,405,48,434,49,451,50,440,51,472,52,452,53,470,54,445,55,508,56,463,57,459,58,395,59,417,60,415,61,378,62,349,63,316,64,285,65,275,66,231,67,181,68,190,69,140,70,142,71,112,72,108,73,76,74,81,75,60,76,50,77,40,78,46,79,23,80,29,81,21,82,12,83,10,84,7,85,4,12943,3885728.760,1213051919.350,59.042,90.802,true
1,4,1.265,67,35,5,36,5,37,11,38,62,39,86,40,115,41,99,42,135,43,120,44,152,45,168,46,199,47,207,48,224,49,283,50,285,51,303,52,297,53,330,54,317,55,401,56,391,57,383,58,430,59,423,60,417,61,413,62,424,63,467,64,413,65,366,66,396,67,388,68,382,69,380,70,342,71,349,72,314,73,299,74,276,75,245,76,249,77,224,78,204,79,196,80,194,81,153,82,113,83,112,84,120,85,75,86,71,87,53,88,53,89,51,90,44,91,35,92,29,93,21,94,20,95,21,96,16,97,8,98,17,99,8,100,5,101,2,13396,1075069.000,89349948.997,298.755,93.217,false

The first 2 fields are the key, the first field being the cluster ID. The remaining fields are related to the histogram, except for the last 3 fields.

The 3rd field from the end is the calculated Chi Square statistic. The next field is the critical value of the Chi Square statistic. The last field is true when the calculated statistic is less than the critical value, implying that the null hypothesis is accepted.
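
The critical values in the sample output are consistent with a 99% confidence level and degrees of freedom equal to the number of buckets minus 3 (one constraint plus the two estimated parameters); the 4th field in each output line appears to be the number of buckets. The following Python snippet reproduces the decision for two of the clusters; these conventions are inferred from the sample output, not taken from the chombo configuration.

from scipy import stats
def fitness_decision(num_buckets, chi_sq_statistic, confidence=0.99):
    dof = num_buckets - 1 - 2   # 2 estimated parameters: mean and std deviation
    critical = stats.chi2.ppf(confidence, dof)
    return chi_sq_statistic < critical, critical
# cluster 0: 73 buckets, statistic 92.793, critical value about 100.425, accepted
print(fitness_decision(73, 92.793))
# cluster 1: 67 buckets, statistic 298.755, critical value about 93.217, rejected
print(fitness_decision(67, 298.755))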

We find that the cluster with ID 2 violates the assumption of normal distribution somewhat, since its Chi Square statistic (119.162) is marginally inside the rejection region beyond the critical value (100.425). The cluster with ID 1 violates the assumption significantly, since its statistic (298.755) is well past the critical value (93.217).

The conclusion that the Data Science team reaches based on these findings is that ZScore is not the appropriate algorithm.

Summing Up

Many Machine Learning models are based on certain assumptions about the data. It’s important to verify the validity of such assumptions before proceeding. In this post, we have gone through the steps for checking normal distribution fitness with the Chi Square test. The tutorial document provides the details of the steps.

Support

For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud including installation, configuration and testing.
