It’s a costly mistake to jump straight into building machine learning models before gaining good insight into your data. I have made that mistake and paid the price. Since then I have made it a resolution to learn as much as possible about the data before taking the next step. While exploring data, I always found myself using multiple Python libraries and doing a plethora of imports of various Python modules.
That experience motivated me to consolidate all the common Python data exploration functions into one Python class, making them easier to use. As an added feature, I have also provided a workspace-like interface, with which you can register multiple data sets, each under a user-provided name. You can refer to the data sets by name and perform various operations on them. The Python implementation is available in my open source project avenir on GitHub.
API Usage
Most of the data exploration functions are implemented on top of existing Python libraries. Very few are implemented from scratch. The following Python libraries are used.
- numpy
- scipy
- pandas
- statsmodels
- scikit-learn
There are two kinds of API. Most are for data exploration; the rest are for workspace management. The API usage for data exploration has the following pattern.
- Create an instance of DataExplorer
- Register multiple data sets. A data set is essentially a 1D array with a name. The data source can be a file, a pandas data frame, a numpy array or a list. For example, you can pass a CSV file, specifying the columns you want to use; for each column a data set will be registered.
- Call any of the various data exploration APIs among the 66 available. The names of one or more data sets are always passed as arguments
- The result is always returned as a Python dictionary
- By default, there is always console output. It can be disabled by setting the argument verbose to False in the constructor
- You can add notes for any registered data set as you are exploring it.
- The whole workspace can be saved and restored, if you want to continue your exploration session later. The workspace consists of a dictionary holding all the data sets and a dictionary holding the metadata for all data sets
The source code has comments on the input arguments for each function. For further details, it’s best to refer to the documentation of the base python library used for any particular function.
Workspace Management API
Here are the functions for workspace management. Through these, you can load data from various sources and save and restore the workspace. You need to use these to register data sets before you can operate on them, although the data exploration API also allows you to pass any unregistered list or numpy array.
Function | Comment |
-------- | ------- |
save(filePath) | save workspace |
restore(filePath) | restore workspace |
queryFileData(filePath, *columns) | query data types for file columns |
queryDataFrameData(df, *columns) | query data types for data frame columns |
getDataType(col) | query data type for a data set (numeric, binary, categorical) |
addFileNumericData(filePath, *columns) | add numeric columns from file |
addFileBinaryData(filePath, *columns) | add binary columns from file |
addDataFrameNumericData(df, *columns) | add numeric columns from data frame |
addDataFrameBinaryData(df, *columns) | add binary columns from data frame |
addListNumericData(ds, name) | add numeric data from list |
addListBinaryData(ds, name) | add binary data from list |
addFileCatData(filePath, *columns) | add categorical columns from file |
addDataFrameCatData(df, *columns) | add categorical columns from data frame |
addCatListData(ds, name) | add categorical data from list |
remData(ds) | remove data set |
addNote(ds, note) | add note for a data set |
getNotes(ds) | get notes for a data set |
getNumericData(ds) | get numeric data for a data set |
getCatData(ds) | get categorical data for a data set |
showNames() | get list of names of registered data sets |
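The save and restore functions persist the two workspace dictionaries (data sets and metadata) to a file. Here is a minimal sketch of how such a round trip can be implemented with Python’s pickle module; the `MiniWorkspace` class and its fields below are hypothetical stand-ins for illustration, not the actual avenir code:

```python
import pickle

class MiniWorkspace:
    """Toy stand-in for a DataExplorer-style workspace: named data sets plus notes."""
    def __init__(self):
        self.dataSets = {}   # name -> list of values
        self.metaData = {}   # name -> list of notes

    def addListNumericData(self, ds, name):
        self.dataSets[name] = list(ds)

    def addNote(self, name, note):
        self.metaData.setdefault(name, []).append(note)

    def save(self, filePath):
        # persist both dictionaries together in one pickle file
        with open(filePath, "wb") as f:
            pickle.dump((self.dataSets, self.metaData), f)

    def restore(self, filePath):
        with open(filePath, "rb") as f:
            self.dataSets, self.metaData = pickle.load(f)

ws = MiniWorkspace()
ws.addListNumericData([3, 1, 4, 1, 5], "demand")
ws.addNote("demand", "raw daily demand, unscaled")
ws.save("workspace.pkl")

ws2 = MiniWorkspace()
ws2.restore("workspace.pkl")
print(ws2.dataSets["demand"])
print(ws2.metaData["demand"])
```

Both the data and the notes survive the round trip, which is what lets you resume an exploration session later.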
Data Exploration API
The rest of the sections list all the data exploration functions. They are split into two separate sections: 1) Summary Statistics and 2) Test Statistics. The API has the following characteristics.
- Since we are exploring data and learning insights, the functions don’t mutate the data
- The functions for adding data are data type aware. If you try to use an invalid data type for a function, e.g. cross correlation with categorical data, it will be detected and an AssertionError will be raised
- You will normally pass one or more names of data sets already registered. However, you may also pass any unregistered list or numpy array.
- If the underlying library returns a p-value, the output will indicate whether the null hypothesis is accepted or rejected, based on the critical value passed
- Some of the functions take two data sets and require that the data sets be of the same size. In such cases, the size check is done
- By default, there is always console output. To disable console output, set verbose to False in the constructor.
The following data types are supported
- Numerical (integer, float)
- Binary (integer with values 0 and 1)
- Categorical (integer, string)
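To make the p-value convention above concrete, here is a sketch using a plain scipy call; the dictionary keys and the accept/reject wording are modeled on the API’s output style, not taken from the library code:

```python
import numpy as np
from scipy import stats

# null hypothesis of the Shapiro-Wilk test: the sample comes from a normal distribution
np.random.seed(42)
sample = np.random.normal(loc=10.0, scale=2.0, size=200)
stat, pvalue = stats.shapiro(sample)

sigLev = 0.05
result = {
    "stat": float(stat),
    "pvalue": float(pvalue),
    # same convention as the API: reject the null hypothesis when pvalue < sigLev
    "nullHyp": "accepted" if pvalue >= sigLev else "rejected",
}
print(result)
```

The critical value sigLev plays the role of the sigLev argument in the test functions listed later.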
Summary Statistics API
The functions listed below belong to 3 sub categories. Most functions return their result wrapped in a dictionary. Some do plotting.
- Basic summary statistics
- Frequency related statistics
- Correlation
Function | Comment |
-------- | ------- |
queryFileData(filePath, *columns) | query column data types from a data file |
queryDataFrameData(df, *columns) | query column data types from a data frame |
plot(ds, yscale=None) | line plot |
scatterPlot(ds1, ds2) | scatter plot |
print(ds) | prints size of data set and first 50 elements |
plotHist(ds, cumulative, density, nbins=None) | plots histogram or cumulative distribution |
isMonotonicallyChanging(ds) | checks if data is monotonically increasing or decreasing |
getFeqDistr(ds, nbins=10) | gets frequency distribution or histogram |
getCumFreqDistr(ds, nbins=10) | gets cumulative frequency distribution |
getEntropy(ds, nbins=10) | gets entropy |
getRelEntropy(ds1, ds2, nbins=10) | gets relative entropy |
getMutualInfo(ds1, ds2, nbins=10) | gets mutual information |
getPercentile(ds, value) | gets percentile |
getValueAtPercentile(ds, percent) | gets value at percentile |
getUniqueValueCounts(ds, maxCnt=10) | gets unique values and counts |
getCatUniqueValueCounts(ds, maxCnt=10) | gets categorical data unique values and counts |
getStats(ds, nextreme=5) | gets summary statistics |
getDifference(ds, order) | gets difference of given order |
getTrend(ds, doPlot=False) | gets trend |
deTrend(ds, trend, doPlot=False) | gets trend removed data |
getTimeSeriesComponents(ds, model, freq, summaryOnly, doPlot=False) | gets trend, cycle and residue components of time series |
getOutliersWithIsoForest(contamination, *dsl) | gets outliers with isolation forest |
getOutliersWithLocalFactor(contamination, *dsl) | gets outliers with local outlier factor |
getOutliersWithSupVecMach(nu, *dsl) | gets outliers using one class SVM |
fitLinearReg(ds, doPlot=False) | gets linear regression coefficients |
fitSiegelRobustLinearReg(ds, doPlot=False) | gets Siegel robust linear regression coefficients based on median |
fitTheilSenRobustLinearReg(ds, doPlot=False) | gets Theil-Sen robust linear regression coefficients based on median |
plotRegFit(x, y, slope, intercept) | plots regression fitted line |
getCovar(*dsl) | gets covariance |
getPearsonCorr(ds1, ds2, sigLev=.05) | gets Pearson correlation coefficient |
getSpearmanRankCorr(ds1, ds2, sigLev=.05) | gets Spearman rank correlation coefficient |
getKendalRankCorr(ds1, ds2, sigLev=.05) | gets Kendall's tau correlation for ordinal data |
getPointBiserialCorr(ds1, ds2, sigLev=.05) | gets point biserial correlation between binary and numeric data |
getConTab(ds1, ds2) | gets contingency table for categorical data pair |
getChiSqCorr(ds1, ds2, sigLev=.05) | gets chi square correlation for categorical data |
getAnovaCorr(ds1, ds2, grByCol, sigLev=.05) | gets ANOVA correlation for numerical and categorical data |
plotAutoCorr(ds, lags, alpha, diffOrder=0) | plots auto correlation |
getAutoCorr(ds, lags, alpha=.05) | gets auto correlation |
plotParAcf(ds, lags, alpha) | plots partial auto correlation |
getParAutoCorr(ds, lags, alpha=.05) | gets partial auto correlation |
plotCrossCorr(ds1, ds2, normed, lags) | plots cross correlation |
getCrossCorr(ds1, ds2) | gets cross correlation |
getFourierTransform(ds) | gets fast Fourier transform |
getNullCount(ds) | gets count of null (None, NaN) values |
getValueRangePercentile(ds, value1, value2) | gets percentile difference for value range |
getLessThanValues(ds, cvalue) | gets values less than given value |
getGreaterThanValues(ds, cvalue) | gets values greater than given value |
getGausianMixture(ncomp, cvType, ninit, *dsl) | gets parameters of Gaussian mixture components |
getKmeansCluster(nclust, ninit, *dsl) | gets cluster parameters with k-means clustering |
getOutliersWithCovarDeterminant(contamination, *dsl) | gets outliers using covariance determinant |
getPrincComp(ncomp, *dsl) | gets principal components |
getDiffSdNoisiness(ds) | gets noisiness based on std dev of first order difference |
getMaRmseNoisiness(ds, wsize) | gets noisiness based on RMSE with moving average |
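Several of the functions above, such as getEntropy and getRelEntropy, are histogram based. The standard binned-entropy recipe they rest on can be sketched as follows; `binnedEntropy` is a hypothetical helper for illustration, not the avenir implementation:

```python
import numpy as np

def binnedEntropy(ds, nbins=10):
    """Shannon entropy of numeric data, estimated from a histogram (in nats)."""
    counts, _ = np.histogram(ds, bins=nbins)
    probs = counts / counts.sum()
    probs = probs[probs > 0]          # treat 0 * log(0) as 0
    return float(-np.sum(probs * np.log(probs)))

np.random.seed(0)
uniform = np.random.uniform(0.0, 1.0, 5000)    # spread out, entropy near log(nbins)
peaked = np.random.normal(0.5, 0.02, 5000)     # concentrated relative to its bin grid
print(binnedEntropy(uniform), binnedEntropy(peaked))
```

The spread-out sample approaches the maximum entropy log(nbins), while the concentrated one falls well below it.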
The function getStats() packs a lot of statistics on the data into its return value, as below.
- Data size
- Min value
- Max value
- Smallest n values
- Largest n values
- Mean
- Median
- Mode
- Mode count
- Std deviation
- Skew
- Kurtosis
- Median absolute deviation
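Most of these quantities can be reproduced directly with numpy and scipy. Here is a rough, hypothetical equivalent of getStats(), with key names modeled on the example output shown later; this is a sketch, not the library code:

```python
import numpy as np
from collections import Counter
from scipy import stats

def basicStats(ds, nextreme=5):
    """Summary statistics for a numeric data set, packed in a dictionary."""
    ds = np.asarray(ds, dtype=float)
    srt = np.sort(ds)
    # most frequent value and its count
    modeVal, modeCnt = Counter(srt.tolist()).most_common(1)[0]
    return {
        "length": len(ds),
        "min": float(srt[0]),
        "max": float(srt[-1]),
        "n smallest": srt[:nextreme].tolist(),
        "n largest": srt[-nextreme:][::-1].tolist(),
        "mean": float(np.mean(ds)),
        "median": float(np.median(ds)),
        "mode": modeVal,
        "mode count": modeCnt,
        "std": float(np.std(ds)),
        "skew": float(stats.skew(ds)),
        "kurtosis": float(stats.kurtosis(ds)),
        "mad": float(stats.median_abs_deviation(ds)),
    }

print(basicStats([3521, 4185, 10350, 10350, 11011, 18912]))
```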
Test Statistics API
These functions perform tests for various statistical properties, as below.
- Fitness tests for various distributions
- Stationarity tests
- Two sample statistic tests
Function | Comment |
-------- | ------- |
testStationaryAdf(ds, regression, autolag, sigLev=.05) | ADF stationarity test |
testStationaryKpss(ds, regression, nlags, sigLev=.05) | KPSS stationarity test |
testNormalJarqBera(ds, sigLev=.05) | Jarque-Bera normality test |
testNormalShapWilk(ds, sigLev=.05) | Shapiro-Wilk normality test |
testNormalDagast(ds, sigLev=.05) | D’Agostino’s K squared normality test |
testDistrAnderson(ds, dist, sigLev=.05) | Anderson test for normal, expon, logistic, gumbel, gumbel_l, gumbel_r |
testSkew(ds, sigLev=.05) | tests skew against normal distribution |
testTwoSampleStudent(ds1, ds2, sigLev=.05) | Student's t 2 sample test |
testTwoSampleKs(ds1, ds2, sigLev=.05) | Kolmogorov-Smirnov 2 sample statistic test |
testTwoSampleMw(ds1, ds2, sigLev=.05) | Mann-Whitney 2 sample statistic test |
testTwoSampleWilcox(ds1, ds2, sigLev=.05) | Wilcoxon signed-rank 2 sample statistic test |
testTwoSampleKw(ds1, ds2, sigLev=.05) | Kruskal-Wallis 2 sample statistic test |
testTwoSampleFriedman(ds1, ds2, ds3, sigLev=.05) | Friedman statistic test for 3 samples |
testTwoSampleEs(ds1, ds2, sigLev=.05) | Epps-Singleton 2 sample statistic test |
testTwoSampleAnderson(ds1, ds2, sigLev=.05) | Anderson 2 sample statistic test |
testTwoSampleScaleAb(ds1, ds2, sigLev=.05) | Ansari-Bradley 2 sample scale statistic test |
testTwoSampleScaleMood(ds1, ds2, sigLev=.05) | Mood 2 sample scale statistic test |
testTwoSampleVarBartlet(ds1, ds2, sigLev=.05) | Bartlett 2 sample variance statistic test |
testTwoSampleVarLevene(ds1, ds2, sigLev=.05) | Levene 2 sample variance statistic test |
testTwoSampleVarFk(ds1, ds2, sigLev=.05) | Fligner-Killeen 2 sample variance statistic test |
testTwoSampleMedMood(ds1, ds2, sigLev=.05) | Mood 2 sample median statistic test |
testTwoSampleZc(ds1, ds2, sigLev=.05) | Zhang-C 2 sample statistic test |
testTwoSampleZa(ds1, ds2, sigLev=.05) | Zhang-A 2 sample statistic test |
testTwoSampleZk(ds1, ds2, sigLev=.05) | Zhang-K 2 sample statistic test |
testTwoSampleCvm(ds1, ds2, sigLev=.05) | Cramér-von Mises 2 sample statistic test |
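As an illustration of the two sample tests listed above, here is the direct scipy call underlying the Kolmogorov-Smirnov variant, wrapped in the same accept/reject convention; the synthetic data and the dictionary layout are for illustration only:

```python
import numpy as np
from scipy import stats

np.random.seed(7)
ds1 = np.random.normal(0.0, 1.0, 300)
ds2 = np.random.normal(0.8, 1.0, 300)   # shifted mean, so the same-distribution null should fail

stat, pvalue = stats.ks_2samp(ds1, ds2)
sigLev = 0.05
print({"stat": float(stat), "pvalue": float(pvalue),
       "nullHyp": "accepted" if pvalue >= sigLev else "rejected"})
```

With a 0.8 standard deviation shift and 300 samples per side, the null hypothesis of identical distributions is rejected decisively.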
Usage Examples
In this section, we will go through examples of API usage. For each, I will provide the example code and the result. Please refer to the tutorial for more examples.
The first one is summary statistics, as below. It adds 2 data sets corresponding to 2 columns in a file containing supply chain demand data and then calls getStats().
```python
sys.path.append(os.path.abspath("../mlextra"))
from daexp import *

exp = DataExplorer()
exp.addFileNumericData("bord.txt", 0, 1, "pdemand", "demand")
exp.getStats("pdemand")
```

Output:

```
== adding numeric columns from a file ==
done
== getting summary statistics for data sets pdemand ==
{   'kurtosis': -0.12152386739702337,
    'length': 1000,
    'mad': 2575.2762,
    'max': 18912,
    'mean': 10920.908,
    'median': 11011.5,
    'min': 3521,
    'mode': 10350,
    'mode count': 3,
    'n largest': [18912, 18894, 17977, 17811, 17805],
    'n smallest': [3521, 3802, 4185, 4473, 4536],
    'skew': -0.009681701835865877,
    'std': 2569.1597609989144}
```
In the next example, we will analyze retail daily sales data. The data has weekly seasonality, so in the auto correlation we expect to find a large peak at lag 7. Let’s find out.
```python
sys.path.append(os.path.abspath("../mlextra"))
from daexp import *

exp = DataExplorer()
exp.addFileNumericData("sale.txt", 0, "sale")
exp.getAutoCorr("sale", 20)
```

Output:

```
== adding numeric columns from a file ==
done
== getting auto correlation for data sets sale ==
result details:
{ 'autoCorr': array([ 1.        ,  0.5738174 , -0.20129608, -0.82667856, -0.82392299,
       -0.20331679,  0.56991343,  0.91427488,  0.5679168 , -0.20108015,
       -0.81710428, -0.8175842 , -0.20391004,  0.56864915,  0.90936982,
        0.56528676, -0.20657182, -0.81111562, -0.81204275, -0.1970099 ,
        0.56175539]),
  'confIntv': array([[ 1.        ,  1.        ],
       [ 0.5118379 ,  0.6357969 ],
       [-0.28111578, -0.12147637],
       [-0.90842511, -0.74493201],
       [-0.93316119, -0.71468479],
       [-0.33426918, -0.07236441],
       [ 0.43775398,  0.70207288],
       [ 0.77298956,  1.0555602 ],
       [ 0.40548625,  0.73034734],
       [-0.37096731, -0.03119298],
       [-0.98790327, -0.64630529],
       [-1.00279183, -0.63237657],
       [-0.40249873, -0.00532136],
       [ 0.36925779,  0.76804052],
       [ 0.70384298,  1.11489665],
       [ 0.34484471,  0.7857288 ],
       [-0.43251377,  0.01937013],
       [-1.03778192, -0.58444933],
       [-1.04959751, -0.57448798],
       [-0.44499878,  0.05097898],
       [ 0.313166  ,  0.81034477]])}
```
As expected, the largest peak is at lag 0. The next largest peak is at lag 7, with a value of 0.91427488.
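The lag 7 peak can be reproduced on synthetic weekly data with a short numpy computation; the data and the `autoCorr` helper below are hypothetical, for illustration:

```python
import numpy as np

np.random.seed(3)
n = 365
t = np.arange(n)
# synthetic daily sales: weekly (period 7) sinusoidal pattern plus noise
sale = 1000 + 100 * np.sin(2 * np.pi * t / 7) + np.random.normal(0, 10, n)

def autoCorr(ds, maxLag):
    """Sample autocorrelation for lags 0..maxLag."""
    ds = np.asarray(ds, dtype=float)
    ds = ds - ds.mean()
    denom = np.dot(ds, ds)
    return np.array([np.dot(ds[:len(ds) - k], ds[k:]) / denom for k in range(maxLag + 1)])

ac = autoCorr(sale, 20)
peak = int(np.argmax(ac[1:]) + 1)   # strongest peak, ignoring the trivial one at lag 0
print(peak, float(ac[7]))
```

The strongest non-trivial peak lands at lag 7, matching the weekly seasonality.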
Finally, with the knowledge of the seasonal period, we can extract the time series components as below.
```python
# code same as in the last example
exp.getTimeSeriesComponents("sale", "additive", 7, True, False)
```

Output:

```
== adding numeric columns from a file ==
done
== extracting trend, cycle and residue components of time series for data sets sale ==
result details:
{ 'residueMean': 0.022420235699977295,
  'residueStdDev': 19.14825253159541,
  'seasonalAmp': 98.22786720321932,
  'trendMean': 1004.9323081345215,
  'trendSlope': -0.0048913825348870996}
```
The average value of the series is in the trend mean. The trend has a small negative slope. The seasonal component has an amplitude of 98.228. The residue has a near-zero mean and a small standard deviation. Because we set the 4th argument (summaryOnly) to True, we got a summary of the time series components. If it had been False, the function would have returned the actual values of the 3 components.
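The decomposition behind this function is the classical trend/seasonal/residue split, which statsmodels also implements. A minimal numpy sketch of the additive version, under the assumption of a period-7 sinusoidal seasonality (synthetic data and a hypothetical helper, not the avenir code):

```python
import numpy as np

def additiveDecompose(ds, period):
    """Classical additive decomposition: trend from a centered moving average,
    seasonal from per-phase means of the detrended series, residue as the rest."""
    ds = np.asarray(ds, dtype=float)
    half = period // 2
    trend = np.full(len(ds), np.nan)                  # edges have no full window
    trend[half:len(ds) - half] = np.convolve(ds, np.ones(period) / period, mode="valid")
    detrended = ds - trend
    phase = np.array([np.nanmean(detrended[p::period]) for p in range(period)])
    seasonal = np.tile(phase, len(ds) // period + 1)[:len(ds)]
    residue = ds - trend - seasonal
    return trend, seasonal, residue

# synthetic daily sales: level 1000, weekly sinusoid of amplitude 100, small noise
np.random.seed(1)
t = np.arange(364)
sale = 1000 + 100 * np.sin(2 * np.pi * t / 7) + np.random.normal(0, 5, 364)
trend, seasonal, residue = additiveDecompose(sale, 7)
print(np.nanmean(trend), seasonal.max() - seasonal.min())
```

The recovered trend mean sits near the level 1000 and the seasonal swing near the injected amplitude, mirroring the trendMean and seasonalAmp fields in the summary above.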
Wrapping Up
We have gone through a Python data exploration API with close to 70 functions. It should be easy to build a web application based on this API. Please refer to the tutorial document for more examples of how to use the API. I hope you find it useful. If you have any suggestions for new functions in the API, please let me know.
Update
More functions have been added; now there are close to 100. This module, along with several other Python modules for general and statistical utility functions, has been published on TestPyPI as a Python package called matumizi. The API doc is available in the GitHub project wiki.