Many Machine Learning models are based on certain assumptions about the data. For example, ZScore based anomaly detection assumes that the data is normally distributed. Your Machine Learning model will only be as good as how well those assumptions hold. In this post, we will go over a Spark based implementation of the Chi Square test for validating distribution assumptions about a data set.

The implementation is available in my open source project *chombo* on GitHub. Like all my other Spark projects, the implementation is metadata driven and completely decoupled from any specific data set. Our use case will involve validating the normal distribution assumption for the data.

## Chi Square Fitness Test

To test any hypothesis about data, a statistic is calculated from the data set. Using the critical value table for the statistic, we check whether the statistic falls within the rejection region. If so, the null hypothesis is rejected.

Along the same lines, here are the steps for the Chi Square fitness test. The Chi Square statistic is based on the squared difference between the expected count and the observed count in each cell.

- We hypothesize about the data, stating the null and alternative hypotheses. In our use case, the null hypothesis is that the data distribution is normal.
- We compute the expected frequency counts from the assumed distribution and the observed frequency counts from the data.
- For each cell, we take the square of the difference between the expected and observed frequencies and divide by the expected frequency. Summing over all cells gives the Chi Square statistic.
- If the statistic value falls within the rejection region for the chosen critical value, we reject the null hypothesis.
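The steps above can be sketched in plain Python as below. This is a minimal illustration, not chombo code; the function names are mine, and the hard-coded critical value would in practice come from a Chi Square table (or a library) for the chosen significance level and degrees of freedom.

```python
def chi_square_statistic(observed, expected):
    # sum over all cells of squared (observed - expected) divided by expected
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def fitness_test(observed, expected, critical_value):
    # critical_value comes from a Chi Square table for the chosen significance
    # level and degrees of freedom (cells - 1 - number of estimated parameters)
    stat = chi_square_statistic(observed, expected)
    # the null hypothesis is retained when the statistic is below the critical value
    return stat, stat < critical_value

# observed vs expected counts for 4 buckets; 7.815 is the 5% critical
# value for 3 degrees of freedom
stat, normal = fitness_test([50, 48, 52, 50], [50, 50, 50, 50], 7.815)
```

With observed counts this close to the expected ones, the statistic stays well under the critical value and the null hypothesis is retained.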

Next, we will discuss how all of these come into play for a real-life use case with customer transaction data.

We will build a non-parametric distribution of the data. The non-parametric distribution, i.e. the histogram, along with the normal distribution, will be used for the Chi Square fitness test.

## eCommerce Customer Transaction Data

A smart group of Data Scientists in a company that had successfully built an eCommerce operation created a customer segmentation model, to enable marketing to do more targeted marketing. After running the KMeans clustering algorithm multiple times, the optimum number of clusters was found to be 4.

Another group of Data Scientists was thinking about building a fraud detection system. They thought they could leverage the clusters already built and train an anomaly detection model for each cluster or segment. They decided to use the ZScore algorithm. They might have used the Spark based implementation of ZScore based anomaly detection in *beymani*, or something else.

After deploying the anomaly detection solution, they found that too many false alarms were being generated. They decided to investigate the validity of the normal distribution assumption of the data. This brings us back to the main topic of this post. Here is some sample input data:

3,19A2ULG1DA,40WV61X060WCK5UY,1548476924,349.43
3,A4T9Q86NHX,O8503O5RPA332ZOR,1548477464,195.81
0,M669EF46CF,O508K80Q80S9KDI1,1548478064,45.80
2,036O4X0UU0,1DBWDXWDSBLQ1LWR,1548478604,180.68
2,9QXSJS79I7,36748D5WGZ7X51C6,1548479024,226.88
1,1NDZF139KH,1EU868Y6I35IKVPL,1548479384,58.23
0,71G145M9LA,68QEJ8WQ2090P038,1548479924,74.25
2,27D025RU09,8QIL8O9IWN3W587N,1548480404,168.20
2,77AZVZ9KJI,603297E1G34LZGL3,1548481004,203.61

The fields are cluster ID, customer ID, transaction ID, transaction time and monetary amount.

## Preparatory Steps

Since the mean and standard deviation of the normal distribution are not known, they are estimated from the samples. This is done through the Spark job *NumericalAttrStats*, the details of which are available elsewhere. Here is the output:

2,4,$,2625537.790000,495013582.037300,14594,179.905289,1553.066104,39.408960,61.010000,299.570000
0,4,$,874028.560000,53799442.768600,14580,59.947089,96.294565,9.812979,30.050000,89.730000
1,4,$,1075069.000000,89349948.997200,13396,80.252986,229.355899,15.144501,44.410000,127.920000
3,4,$,3885728.760000,1213051919.350399,12943,300.218555,3591.442873,59.928648,120.760000,479.700000
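Although the fields above come straight from the Spark job, it may help to see how the mean, variance and standard deviation follow from the running sum, sum of squares and count that a distributed stats job accumulates. This is a minimal sketch, not chombo code, and it uses the population variance formula (chombo may apply the sample variance correction):

```python
def stats_from_sums(total, total_sq, count):
    # recover mean and (population) variance from the running sum,
    # sum of squares and count accumulated across partitions
    mean = total / count
    variance = total_sq / count - mean * mean
    return mean, variance, variance ** 0.5
```

Plugging in the sum, sum of squares and count from any output line above reproduces the reported mean and standard deviation, up to the population vs sample variance correction.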

The first 2 fields identify the key. The remaining fields are various statistics. For the non-parametric distribution, we need to define the bin or bucket width. The Chi Square fitness test requires at least 5 samples in each bucket.

The last 2 fields in each line are the min and max. We use a python script that uses these fields to estimate the bucket width. The script takes the average number of samples per bucket as one of its inputs. Here is the output of the script. The last field is the bucket width.

2,4,3.313
0,4,0.829
1,4,1.265
3,4,5.608
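The bucket width estimation can be approximated as below. This is a hypothetical re-creation of the script's core logic, not the actual script; the function name and the default samples-per-bucket value are assumptions.

```python
def bucket_width(min_val, max_val, count, avg_per_bucket=200):
    # divide the value range into roughly count / avg_per_bucket buckets,
    # where avg_per_bucket is the desired average number of samples per bucket
    num_buckets = max(1, count // avg_per_bucket)
    return (max_val - min_val) / num_buckets
```

For example, calling it with the min, max and count from a stats output line yields a width in the same ballpark as the script output above.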

## Spark Job for Histogram and Fitness Test

The Spark job *NumericalAttrDistrStats* essentially builds the histogram. Additionally, depending on how it’s configured, it can perform the Chi Square or other fitness tests. Here is some sample output.

0,4,0.829,73,36,5,37,12,38,4,39,12,40,21,41,20,42,14,43,24,44,28,45,41,46,52,47,53,48,76,49,70,50,83,51,98,52,128,53,121,54,174,55,202,56,197,57,264,58,253,59,265,60,290,61,314,62,355,63,354,64,397,65,425,66,434,67,452,68,512,69,481,70,443,71,482,72,513,73,481,74,478,75,464,76,423,77,471,78,403,79,411,80,376,81,376,82,317,83,325,84,303,85,272,86,225,87,206,88,184,89,165,90,149,91,131,92,136,93,110,94,78,95,85,96,59,97,54,98,41,99,39,100,36,101,30,102,20,103,18,104,23,105,6,106,11,107,2,108,3,14580,874028.560,53799442.769,92.793,100.425,true
2,4,3.313,73,18,5,19,11,20,6,21,15,22,13,23,19,24,21,25,36,26,25,27,29,28,52,29,51,30,71,31,92,32,99,33,106,34,124,35,140,36,173,37,159,38,185,39,242,40,274,41,293,42,319,43,282,44,359,45,359,46,416,47,412,48,454,49,449,50,464,51,472,52,496,53,480,54,472,55,497,56,463,57,471,58,413,59,425,60,421,61,373,62,369,63,346,64,367,65,298,66,331,67,282,68,236,69,233,70,189,71,194,72,148,73,156,74,123,75,63,76,97,77,79,78,68,79,56,80,53,81,38,82,23,83,25,84,26,85,19,86,8,87,9,88,14,89,4,90,2,14594,2625537.790,495013582.037,119.162,100.425,false
3,4,5.608,65,21,5,22,10,23,11,24,16,25,23,26,17,27,26,28,32,29,45,30,36,31,60,32,70,33,97,34,97,35,123,36,127,37,145,38,182,39,226,40,247,41,248,42,274,43,314,44,332,45,389,46,399,47,405,48,434,49,451,50,440,51,472,52,452,53,470,54,445,55,508,56,463,57,459,58,395,59,417,60,415,61,378,62,349,63,316,64,285,65,275,66,231,67,181,68,190,69,140,70,142,71,112,72,108,73,76,74,81,75,60,76,50,77,40,78,46,79,23,80,29,81,21,82,12,83,10,84,7,85,4,12943,3885728.760,1213051919.350,59.042,90.802,true
1,4,1.265,67,35,5,36,5,37,11,38,62,39,86,40,115,41,99,42,135,43,120,44,152,45,168,46,199,47,207,48,224,49,283,50,285,51,303,52,297,53,330,54,317,55,401,56,391,57,383,58,430,59,423,60,417,61,413,62,424,63,467,64,413,65,366,66,396,67,388,68,382,69,380,70,342,71,349,72,314,73,299,74,276,75,245,76,249,77,224,78,204,79,196,80,194,81,153,82,113,83,112,84,120,85,75,86,71,87,53,88,53,89,51,90,44,91,35,92,29,93,21,94,20,95,21,96,16,97,8,98,17,99,8,100,5,101,2,13396,1075069.000,89349948.997,298.755,93.217,false

The first 2 fields are the key, the first field being the cluster ID. The remaining fields, up to the last 3, are related to the histogram.

The 3rd field from the end is the calculated Chi Square statistic. The next field is the critical value of the Chi Square statistic. The last field is true when the calculated statistic is less than the critical value, implying that the null hypothesis is accepted.
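For downstream processing, the last 3 fields of each output line can be picked off as below. This is a hypothetical helper for illustration, not part of chombo:

```python
def parse_fitness(line):
    # the last 3 fields of a NumericalAttrDistrStats output line are:
    # Chi Square statistic, critical value, and whether the null
    # hypothesis was accepted
    fields = line.strip().split(",")
    return float(fields[-3]), float(fields[-2]), fields[-1] == "true"
```

Applied to the output above, it reports which clusters pass the normality test.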

We find that the cluster with ID 2 violates the assumption of normal distribution somewhat, since its Chi Square statistic is marginally into the rejection region. The cluster with ID 1 violates the assumption significantly, since its Chi Square statistic is well into the rejection region.

The conclusion the Data Science team reaches based on these findings is that ZScore is not the appropriate algorithm.

## Summing Up

Many Machine Learning models are based on certain assumptions about the data. It’s important to verify the validity of such assumptions before proceeding. In this post, we have gone through the steps to check normal distribution fitness with the Chi Square test. The tutorial document provides the details of the steps.

## Support

For commercial support for any solution in my GitHub repositories, please talk to ThirdEye Data Science Services. Support is available for *Hadoop* and Spark deployment on the cloud, including installation, configuration and testing.