Data quality is critical for the healthy operation of any data driven enterprise. There are various kinds of data quality metrics. In this post, the focus will be on the completeness of data. Data quality from a completeness point of view will be expressed as metrics in different levels of granularity. Additionally, data completeness will be assessed within the context of a consuming process and hence it is contextual.
The Spark implementation is available in my open source project chombo. The implementation is agnostic to the specific data set. It’s configuration and meta data driven.
We are all familiar with the adage “garbage in garbage out“. If the source data data is plagued with data quality issues, the output of analysis based on the data will be of poor quality and unreliable.
Al though our focus is on data completeness, there are various data quality metrics. Here are some examples. I will address them in future posts
- Data validity
- Data accuracy
- Data timeliness
A field in a record is considered incomplete if it’s missing or null. There are two important aspects of the completeness metric granularity of the metric and context of the metric.
First, we will be evaluating data completeness metric at the following 3 levels of granularity. Completeness metric at data set level can be used for an overall assessment of the completeness of the data and accordingly a decision can be made whether to use a data set at all. Completeness metric at the record level may be used to filter out records that have low completeness score.
- Data Set
Second, any data quality metric, including completeness, is only meaningful within the context of a consuming application. For example, if retail transaction data is going to be used for demand forecasting, any incomplete field in the customer related fields in the data set is irrelevant.
The data set for retail transactions was created synthetically with a script. Here are the different fields
- Transaction ID
- Customer ID
- Store ID
- Zip code
- Time stamp
- Product ID
- Monetary amount
There are two steps for creating the data set. In the first step, the complete data set is created. Next, based sampling some pre defined distributions, a record and a set of fields within the record is selected. Those field values are either removed or replaced with null values. Here is some sample input
2C9FUCNK775N,B02XHP6N9Z,FX971N,98901,2019-08-11 17:23:50,QR25ZEA7,3,USD 84.62 2C9FUCNK775N,B02XHP6N9Z,FX971N,98901,2019-08-11 17:23:50,AA3LP54D,5,USD 358.78 2C9FUCNK775N,B02XHP6N9Z,FX971N,98901,2019-08-11 17:23:50,86I4YY30,3,USD 264.01 BZND2MSF910F,FHOY5110JI,QF011Q,null,null,8172T00O,1,USD 71.48 0YG7G145NJH7,BIOI6Q82DC,V8EH78,91483,2019-08-11 17:34:45,MXIQ52OS,2,USD 196.30 67FRR9197P82,YSCK8PYJ4R,200C17,98122,2019-08-11 17:35:15,JL9N1BM3,5,USD 487.08
Incompleteness is indicated either with missing values or an especially designated string for null. The null indicating string can be be defined in the configuration.
Data Completeness Spark Job
The Spark job is implemented by the scala object MissingValueMetric. To make the metric calculation contextual, a set of weights are specified for all the fields. Higher the weight more relevant the field is. If a field is completely irrelevant, the corresponding weight should be set to 0. One simple strategy would be to assign weight of 1 for all important and relevant fields and 0 for the rest. Here is some sample output.
1ME7YD7Q7G51,Y0ENO067YL,D2NR29,98122,2019-08-13 04:04:24,53D0YKAX,1,USD 47.97,8,1.000 1ME7YD7Q7G51,Y0ENO067YL,D2NR29,98122,2019-08-13 04:04:24,CRKSL56C,3,USD 88.65,8,1.000 J44439NM3Z70,Z11USBT1XF,FX971N,98901,2019-08-13 04:12:16,744UXNAN,3,USD 276.61,8,1.000 J44439NM3Z70,Z11USBT1XF,FX971N,98901,2019-08-13 04:12:16,Q044HSHJ,3,USD 45.29,8,1.000 null,Z11USBT1XF,FX971N,98901,2019-08-13 04:12:16,OEC3B72X,4,null,6,0.870 89P39QE94117,W3S22T6679,1G0EJ6,98901,2019-08-13 04:19:02,04HS4B8I,3,USD 188.85,8,1.000 88JEAXF009A0,BCY5OTFEM4,QF011Q,91483,2019-08-13 04:20:45,PFHPZAQA,4,USD 41.78,8,1.000
In the output, we see one record that in incomplete. The last field is the completeness score. The completeness score value is between 0 and 1. The second field from the end is the actual count of number of complete fields. Completeness score is based on the weight associated with each field. Here are some statistics about the completeness score of the whole data set.
The output is bit cryptic. The first part is the histogram. The first field is the number of bins (14) in the histogram. It’s followed by the distribution of the score. We find that fraction of data that is complete is 0.901.
The second part consists of various summary statistic. We find that mean completeness score is 0.974 and the std deviation is 0.090. All these statistic values reflect the overall completeness of the whole data set.
Handling Incomplete Data
Completeness score can be computed row wise or column wise. So far all the results have been shown for row wise completeness. For column wise completeness, number of missing fields and fraction of rows with missing values in a column are calculated.
There are two ways of handling incomplete data, one based on filtering and the other based on imputation. One reasonable strategy for filtering will be to filter out rows with low completeness score and then columns with low completeness score.
With imputation, the missing values are re constructed. There are various algorithms for imputation e.g. nearest neighbor and regression.
We had a walk through for a Spark based implementation of contextual data completeness analysis. The result of the analysis is a completeness metric at various levels of granularity e.g record level or data set level. Please follow the steps in the tutorial if you are interested in running the use case.