Learn about Your Data with about a Hundred Data Exploration Functions, All in One Python Class


It’s a costly mistake to jump straight into building machine learning models before getting good insight into your data. I have made that mistake and paid the price. Since then, I have made it a resolution to learn as much as possible about the data before taking the next step. While exploring data, I always found myself using multiple Python libraries and doing a plethora of imports for various Python modules.

That experience motivated me to consolidate all the common Python data exploration functions in one Python class to make them easier to use. As an added feature, I have also provided a workspace-like interface with which you can register multiple data sets, each with a user-provided name. You can refer to the data sets by name and perform various operations. The Python implementation is available in my open source project avenir on GitHub.

API Usage

Most of the data exploration functions are implemented on top of existing Python libraries; very few are implemented from scratch. The following Python libraries have been used.

  1. numpy
  2. scipy
  3. pandas
  4. statsmodels
  5. scikit-learn

There are two kinds of API. Most functions are for data exploration; the rest are for workspace management. The API usage for data exploration has the following pattern.

  1. Create an instance of DataExplorer
  2. Register multiple data sets. A data set is essentially a 1D array with a name. A data source can be a file, a pandas data frame, a numpy array or a list. For example, you can pass a CSV file, specifying the columns you want to use; for each column it will register a data set.
  3. Call various data exploration APIs among the 66 available. Names of one or more data sets are always passed as arguments (a minimal sketch follows this list).
  4. The result is always returned as a Python dictionary.
  5. By default, there is always console output. It can be disabled by setting the argument verbose to False in the constructor.
  6. You can add notes for any data set registered as you are exploring it.
  7. The whole workspace can be saved and restored, if you want to continue your exploration session later. The workspace consists of a dictionary holding all the data sets and a dictionary holding the metadata for all the data sets.
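
Here is a minimal sketch of that pattern. It is only an illustration: the data values and the data set name are made up, and the import lines assume the module layout used in the examples later in this post.

import os
import sys
sys.path.append(os.path.abspath("../mlextra"))
from daexp import *

# create an explorer instance; pass verbose=False to disable console output
exp = DataExplorer()

# register a data set from a plain Python list under the name "demand"
exp.addListNumericData([3521, 4185, 10350, 11011, 18912, 10350, 9875], "demand")

# call an exploration function, referring to the data set by name;
# the result comes back as a Python dictionary
stats = exp.getStats("demand")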

The source code has comments on the input arguments for each function. For further details, it’s best to refer to the documentation of the base python library used for any particular function.

Workspace Management API

Here are the functions for workspace management. Through these, you can load data from various sources and save and restore the workspace. You need to use these to register data sets before you can operate on them, although the data exploration API also accepts any unregistered list or numpy array.

Function Comment
save(filePath) save workspace
restore(filePath) restore workspace
queryFileData(filePath, *columns) query data types for file columns
queryDataFrameData(df, *columns) query data types for data frame columns
getDataType(col) query data type for a data set (numeric, binary, categorical)
addFileNumericData(filePath, *columns) add numeric columns from file
addFileBinaryData(filePath, *columns) add binary columns from file
addDataFrameNumericData(df, *columns) add numeric columns from data frame
addDataFrameBinaryData(df, *columns) add binary columns from data frame
addListNumericData(ds, name) add numeric data from list
addListBinaryData(ds, name) add binary data from list
addFileCatData(filePath, *columns) add categorical columns from file
addDataFrameCatData(df, *columns) add categorical columns from data frame
addCatListData(ds, name) add categorical data from list
remData(ds) remove data set
addNote(ds, note) add note for a data set
getNotes(ds) get notes for a data set
getNumericData(ds) get numeric data for a data set
getCatData(ds) get categorical data for a data set
showNames() get list of names for datasets registered
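As a quick, hedged sketch of how a few of these fit together (continuing with the explorer instance created earlier; the data values, note text and file name are made up):

# register a categorical data set from a list and attach a note to it
exp.addCatListData(["red", "green", "blue", "green"], "color")
exp.addNote("color", "colors reported in the customer survey")
exp.getNotes("color")

# list the names of all registered data sets
exp.showNames()

# save the whole workspace to a file and restore it in a later session
exp.save("myexp.ws")
exp.restore("myexp.ws")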

Data Exploration API

The rest of the sections list all the data exploration functions. They are split into two separate sections: 1) Summary Statistics and 2) Test Statistics. The API has the following characteristics.

  • Since we are exploring data and gaining insight, the functions don’t mutate the data
  • The functions for adding data are data type aware. If you try to use an invalid data type with a function, e.g. cross correlation with categorical data, it will be detected and an AssertionError will be raised, as sketched after this list
  • You will always pass one or more names of data sets already registered. However, you may also pass any unregistered list or numpy array
  • If the underlying library returns a p value, the output will indicate whether the null hypothesis is accepted or rejected, based on the significance level passed
  • Some of the functions take two data sets and require that the data sets be of the same size. In such cases, the size check is done
  • By default there is always console output. To disable console output, you should set verbose to False in the constructor.
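
The sketch below illustrates the data type check and the p value based output described above. It is only a sketch: the sample values are made up, and the failure path shows what the API is expected to do per the description, using Pearson correlation as the numeric-only function.

# two numeric data sets of the same size and one categorical data set
exp.addListNumericData([2.1, 3.4, 2.9, 3.8, 4.1, 3.3], "xvals")
exp.addListNumericData([1.9, 3.6, 3.1, 3.5, 4.3, 3.0], "yvals")
exp.addCatListData(["a", "b", "a", "c", "b", "a"], "labels")

# the output indicates whether the null hypothesis is accepted or rejected
# at the significance level passed in sigLev
exp.getPearsonCorr("xvals", "yvals", sigLev=.05)

# passing categorical data to a numeric only function should be detected
try:
    exp.getPearsonCorr("xvals", "labels")
except AssertionError:
    print("invalid data type for this function")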

The following data types are supported.

  1. Numerical (integer, float)
  2. Binary (integer with values 0 and 1)
  3. Categorical (integer, string)

Summary Statistics API

The functions listed below belong to 3 subcategories. Most functions return some result wrapped in a dictionary. Some do plotting.

  1. Basic summary statistics
  2. Frequency related statistics
  3. Correlation
Function Comment
queryFileData(filePath, *columns) query column data type from a data file
queryDataFrameData(df, *columns) query column data type from a data frame
plot(ds, yscale=None) line plot
scatterPlot(ds1, ds2) scatter plot
print(ds) prints size of data set and first 50 elements
plotHist(ds, cumulative, density, nbins=None) plots histogram or cumulative distribution
isMonotonicallyChanging(ds) checks if data is monotonically increasing or decreasing
getFeqDistr(ds, nbins=10) gets frequency distribution or histogram
getCumFreqDistr(ds, nbins=10) gets cumulative frequency distribution
getEntropy(ds, nbins=10) gets entropy
getRelEntropy(ds1, ds2, nbins=10) gets relative entropy
getMutualInfo(ds1, ds2, nbins=10) gets mutual information
getPercentile(ds, value) gets percentile
getValueAtPercentile(ds, percent) gets value at percentile
getUniqueValueCounts(ds, maxCnt=10) gets unique values and counts
getCatUniqueValueCounts(ds, maxCnt=10) gets categorical data unique values and counts
getStats(ds, nextreme=5) gets summary statistics
getDifference(ds, order) gets difference of given order
getTrend(ds, doPlot=False) gets trend
deTrend(ds, trend, doPlot=False) gets trend removed data
getTimeSeriesComponents(ds, model, freq, summaryOnly, doPlot=False) gets trend, cycle and residue components of time series
getOutliersWithIsoForest(contamination, *dsl) gets outliers with isolation forest
getOutliersWithLocalFactor(contamination, *dsl) gets outliers with local factor
getOutliersWithSupVecMach(nu, *dsl) gets outliers using one class svm
fitLinearReg(ds, doPlot=False) get linear regression coefficients
fitSiegelRobustLinearReg(ds, doPlot=False) gets Siegel robust linear regression coefficients based on median
fitTheilSenRobustLinearReg(ds, doPlot=False) gets Theil-Sen robust linear regression coefficients based on median
plotRegFit(x, y, slope, intercept) plots regression fitted line
getCovar(*dsl) gets covariance
getPearsonCorr(ds1, ds2, sigLev=.05) gets Pearson correlation coefficient
getSpearmanRankCorr(ds1, ds2, sigLev=.05) gets Spearman rank correlation coefficient
getKendalRankCorr(ds1, ds2, sigLev=.05) gets Kendall’s tau, correlation for ordinal data
getPointBiserialCorr(ds1, ds2, sigLev=.05) gets point biserial correlation between binary and numeric data
getConTab(ds1, ds2) gets contingency table for categorical data pair
getChiSqCorr(ds1, ds2, sigLev=.05) gets chi square correlation for categorical data
getAnovaCorr(ds1, ds2, grByCol, sigLev=.05) gets ANOVA correlation for numerical and categorical data
plotAutoCorr(ds, lags, alpha, diffOrder=0) plots auto correlation
getAutoCorr(ds, lags, alpha=.05) gets auto correlation
plotParAcf(ds, lags, alpha) plots partial auto correlation
getParAutoCorr(ds, lags, alpha=.05) gets partial auto correlation
plotCrossCorr(ds1, ds2, normed, lags) plots cross correlation
getCrossCorr(ds1, ds2) gets cross correlation
getFourierTransform(ds) gets fast Fourier transform
getNullCount(ds) gets count of null (None, nan) values
getValueRangePercentile(ds, value1, value2) gets percentile difference for value range
getLessThanValues(ds, cvalue) gets values less than given value
getGreaterThanValues(ds, cvalue) gets values greater than given value
getGausianMixture(ncomp, cvType, ninit, *dsl) gets parameters of Gaussian mixture components
getKmeansCluster(nclust, ninit, *dsl) gets cluster parameters with kmeans clustering
getOutliersWithCovarDeterminant(contamination, *dsl) gets outliers using covariance determinant
getPrincComp(ncomp, *dsl) gets principal components
getDiffSdNoisiness(ds) gets noisiness based on std dev of first order difference
getMaRmseNoisiness(ds, wsize) gets noisiness based on RMSE with moving average
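
Here is a short, hedged sketch of a few of the frequency related functions, continuing with the same explorer instance. The values are made up, and the last call passes the raw list directly, relying on the point above that unregistered lists and numpy arrays are also accepted.

# made up values registered under the name "sales"
vals = [96.0, 104.0, 99.0, 110.0, 93.0, 101.0, 107.0, 95.0, 103.0, 98.0]
exp.addListNumericData(vals, "sales")

# frequency distribution and entropy
exp.getFeqDistr("sales", nbins=5)
exp.getEntropy("sales", nbins=5)

# an unregistered list can also be passed in place of a registered name
exp.getPercentile(vals, 100.0)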

The function getStats() packs a lot of statistics about the data into its return value, as listed below.

  1. Data size
  2. Min value
  3. Max value
  4. Smallest n values
  5. Largest n values
  6. Mean
  7. Median
  8. Mode
  9. Mode count
  10. Std deviation
  11. Skew
  12. Kurtosis
  13. Median absolute deviation

Test Statistics API

These functions perform tests for various statistical properties as below.

  1. Fitness test for various distributions
  2. Stationary test
  3. Two sample statistic test
Function Comment
testStationaryAdf(ds, regression, autolag, sigLev=.05) ADF stationary test
testStationaryKpss(ds, regression, nlags, sigLev=.05) KPSS stationary test
testNormalJarqBera(ds, sigLev=.05) Jarque-Bera normalcy test
testNormalShapWilk(ds, sigLev=.05) Shapiro-Wilk normalcy test
testNormalDagast(ds, sigLev=.05) D’Agostino’s K square normalcy test
testDistrAnderson(ds, dist, sigLev=.05) Anderson test for normal, expon, logistic, gumbel, gumbel_l, gumbel_r
testSkew(ds, sigLev=.05) test skew for normal distr
testTwoSampleStudent(ds1, ds2, sigLev=.05) Student t 2 sample test
testTwoSampleKs(ds1, ds2, sigLev=.05) Kolmogorov-Smirnov 2 sample statistic test
testTwoSampleMw(ds1, ds2, sigLev=.05) Mann-Whitney 2 sample statistic test
testTwoSampleWilcox(ds1, ds2, sigLev=.05) Wilcoxon Signed-Rank 2 sample statistic test
testTwoSampleKw(ds1, ds2, sigLev=.05) Kruskal-Wallis 2 sample statistic test
testTwoSampleFriedman(ds1, ds2, ds3, sigLev=.05) Friedman statistic test (takes 3 data sets)
testTwoSampleEs(ds1, ds2, sigLev=.05) Epps Singleton 2 sample statistic test
testTwoSampleAnderson(ds1, ds2, sigLev=.05) Anderson 2 sample statistic test
testTwoSampleScaleAb(ds1, ds2, sigLev=.05) Ansari Bradley 2 sample scale statistic test
testTwoSampleScaleMood(ds1, ds2, sigLev=.05) Mood 2 sample scale statistic test
testTwoSampleVarBartlet(ds1, ds2, sigLev=.05) Bartlett 2 sample variance statistic test
testTwoSampleVarLevene(ds1, ds2, sigLev=.05) Levene 2 sample variance statistic test
testTwoSampleVarFk(ds1, ds2, sigLev=.05) Fligner-Killeen 2 sample variance statistic test
testTwoSampleMedMood(ds1, ds2, sigLev=.05) Mood 2 sample median statistic test
testTwoSampleZc(ds1, ds2, sigLev=.05) Zhang-C 2 sample statistic test
testTwoSampleZa(ds1, ds2, sigLev=.05) Zhang-A 2 sample statistic test
testTwoSampleZk(ds1, ds2, sigLev=.05) Zhang-K 2 sample statistic test
testTwoSampleCvm(ds1, ds2, sigLev=.05) Cramer-von Mises (CVM) 2 sample statistic test
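
A brief, hedged sketch of a couple of these tests, again with made up samples and continuing with the same explorer instance:

# two made up numeric samples of the same size
exp.addListNumericData([5.1, 4.8, 5.4, 5.0, 4.9, 5.2, 5.3, 4.7], "batch1")
exp.addListNumericData([5.6, 5.9, 5.5, 6.0, 5.8, 5.7, 5.4, 6.1], "batch2")

# normalcy test on one sample; the null hypothesis is accepted or rejected at sigLev
exp.testNormalShapWilk("batch1", sigLev=.05)

# two sample test of whether both samples come from the same distribution
exp.testTwoSampleKs("batch1", "batch2", sigLev=.05)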

Usage Examples

In this section, we will go through examples of API usage. For each I will provide the example code and the result. Please refer to the tutorial for more examples.

The first one is summary statistics, as below. It adds two data sets corresponding to two columns in a file containing supply chain demand data and then calls getStats().

import os
import sys
sys.path.append(os.path.abspath("../mlextra"))
from daexp import *

exp = DataExplorer()
exp.addFileNumericData("bord.txt", 0, 1, "pdemand", "demand")
exp.getStats("pdemand")

output:
== adding numeric columns from a file ==
done

== getting summary statistics for data sets pdemand ==
{   'kurtosis': -0.12152386739702337,
    'length': 1000,
    'mad': 2575.2762,
    'max': 18912,
    'mean': 10920.908,
    'median': 11011.5,
    'min': 3521,
    'mode': 10350,
    'mode count': 3,
    'n largest': [18912, 18894, 17977, 17811, 17805],
    'n smallest': [3521, 3802, 4185, 4473, 4536],
    'skew': -0.009681701835865877,
    'std': 2569.1597609989144}

In the next example, we will analyze retail daily sales data. The data has weekly seasonality. In the auto correlation, we expect to find a large peak at lag 7. Let’s find out.

import os
import sys
sys.path.append(os.path.abspath("../mlextra"))
from daexp import *

exp = DataExplorer()
exp.addFileNumericData("sale.txt", 0, "sale")
exp.getAutoCorr("sale", 20)

output:
== adding numeric columns from a file ==
done

== getting auto correlation for data sets sale ==
result details:
{   'autoCorr': array([ 1.        ,  0.5738174 , -0.20129608, -0.82667856, -0.82392299,
       -0.20331679,  0.56991343,  0.91427488,  0.5679168 , -0.20108015,
       -0.81710428, -0.8175842 , -0.20391004,  0.56864915,  0.90936982,
        0.56528676, -0.20657182, -0.81111562, -0.81204275, -0.1970099 ,
        0.56175539]),
    'confIntv': array([[ 1.        ,  1.        ],
       [ 0.5118379 ,  0.6357969 ],
       [-0.28111578, -0.12147637],
       [-0.90842511, -0.74493201],
       [-0.93316119, -0.71468479],
       [-0.33426918, -0.07236441],
       [ 0.43775398,  0.70207288],
       [ 0.77298956,  1.0555602 ],
       [ 0.40548625,  0.73034734],
       [-0.37096731, -0.03119298],
       [-0.98790327, -0.64630529],
       [-1.00279183, -0.63237657],
       [-0.40249873, -0.00532136],
       [ 0.36925779,  0.76804052],
       [ 0.70384298,  1.11489665],
       [ 0.34484471,  0.7857288 ],
       [-0.43251377,  0.01937013],
       [-1.03778192, -0.58444933],
       [-1.04959751, -0.57448798],
       [-0.44499878,  0.05097898],
       [ 0.313166  ,  0.81034477]])}

As expected, the largest peak is at lag 0. The next largest peak is at lag 7, with a value of 0.91427488.

Finally, with the knowledge of the seasonal period, we can extract the time series components as below.

#code same as in the last example 
exp.getTimeSeriesComponents("sale","additive", 7, True, False)

output:
== adding numeric columns from a file ==
done

== extracting trend, cycle and residue components of time series for data sets sale ==
result details:
{   'residueMean': 0.022420235699977295,
    'residueStdDev': 19.14825253159541,
    'seasonalAmp': 98.22786720321932,
    'trendMean': 1004.9323081345215,
    'trendSlope': -0.0048913825348870996}

The average level of the series is captured in the trend mean. The trend has a small negative slope. The seasonal component has an amplitude of about 98.23. The residue mean and standard deviation are also reported. Because we set the 4th argument to True, we got a summary of the time series components. If it were False, the function would have returned the actual values of the 3 components.
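
For example, a hedged one line sketch of the non-summary call, based on the description above:

# with the 4th argument set to False, the actual values of the 3 components
# are expected to be returned instead of the summary
comps = exp.getTimeSeriesComponents("sale", "additive", 7, False, False)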

Wrapping Up

We have gone through a Python data exploration API with close to 70 functions. It should be easy to build a web application based on the API. Please refer to the tutorial document for more examples of how to use the API. I hope you find it useful. If you have any suggestions for new functions in the API, please let me know.

Update

More functions have been added. Now there are close to 100 functions. This module, along with several other Python modules for general and statistical utility functions, has been published on TestPyPI as a Python package called matumizi. The API doc is available in the GitHub project wiki.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay Area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms, in various business domains, for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green, sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have a neat closed form solution.