Missing values are just part of life in the data processing world. In most cases you can not simply ignore the missing values as it may adversely affect whatever analytic processing you are going to do. Broadly speaking, handling missing data consists of two steps, gaining some insight on missing fields in the data and then taking some actions based on the insight gained from the first step. In this post the focus will be primarily on the first step.
The Hadoop based implementation is available in my OSS project chombo on github. In future I will provide Spark port for the implementation.
Missing Data Mechanism
There are 3 underlying mechanisms behind missing data as follows. Knowledge of the underlying mechanism helps the choice of missing value processing techniques to be used.
- Missing Completely at Random (MCAR) : Missing values for a field in record are completely random and the probability of having missing values does not depend upon the known values of other fields or the missing field value
- Missing at Random (MAR): Missing values are random, but the probability of a missing values in a field depends on the values in other fields.
- Missing Not at Random (MNAR): The probability of having a missing value in a field depends on the value for the missing field.
With MCAR, probability of missing value for a field is same for all records i.e, it is completely random.
With MAR, the probability of a missing value in a field depends on the known values in other fields in the same record. For example, in a survey if the education level college, the probability of the earning field being missing is higher. The probability of a missing value may also depend on a hidden variable that’s not part of the record.
With MNAR, the probability of the missing value depends on the missing value itself. For example, people with higher income are more likely not to provide their income information.
Handling Missing Values
Unless having missing values doesn’t have adverse consequence, which is rare, you have to do something about the missing values. Here is a workflow that could be used to handle missing values.
- Do exploratory analysis to get statistics on missing values, row wise and column wise. The insight gained could be used in the next steps
- Filter out row and columns with too many missing based on results obtained in the previous step.
- Accept the data data obtained from step 2 with some missing values. Depending on the further analysis that’s going to be performed, this may be viable approach
- If missing values are not acceptable replace the missing values with values predicted through various statistical techniques. These techniques are known as missing value imputation.
For our use case, we will follow the steps 1,2 and 4 in the workflow above. Focus of this post is on steps 1 and 2. Missing value imputation as in step 4 will be covered a future post. There two drawback to filtering as follows
- If the underlying mechanism for missing values is MAR or MNAR, then the data resulting from the filtering may not conform to the underlying probability distribution of the data and it becomes biased.
- If there are too many missing fields, there may not be enough data left after filtering and may not be useful the analysis to be performed on the data
Customer Service Survey
We will use customer service survey data along with some data collected from customer calls as the use case. The fields in the survey data are as follows.
- Customer ID
- Customer type
- Call hold time
- Number of previous calls on the same issue
- Number of customer service re routes in the call
- Score for customer service friendliness
- Score for customer identity verification process
- Whether issue got resolved
- Customer service representative satisfaction score
- Over all satisfaction score
Here is some sample input data.
APE4EAYQYU6K,residence,632,0,0,4,,T,9,9 4P85H8286NAV,residence,536,1,3,7,,F,3,4 XA7I0QPT7QJ4,residence,378,0,0,5,,F,5,4 3BGGP08JCJ6A,residence,262,2,2,5,,T,9,5 THW42R8BATHE,residence,282,2,2,8,,T,7,5 084RK82T23DN,residence,641,1,0,5,5,T,9,9 N7S8W4GBX4QE,business,297,0,2,5,,F,2,5 NHRN0G0U02EL,business,601,0,1,6,,T,9,10 C2U9ZZ0PHJR7,residence,545,5,3,8,5,T,7,6 78D7K8R0KY48,business,542,0,2,6,,T,8,10
Missing Value Statistics
Missing value statistics is calculated using the Map Reduce class MissingValueCounter. It operates in the 3 different modes based on the parameter mvc.counting.operation.
When set to row, it calculates row wise missing field counts as below. The output contains the row unique key followed by the count of missing fields, with the count in the descending order.
04A0COFHM2DQ,3 11YR7UVNMAB8,3 09K0AW8UEFPR,3 CL2TY6051M66,3 V3U7SC3MQQLT,3
When set to col, it calculates column wise missing field counts as below. It shows column index followed by the count of the missing fields with the counts in descending order.
6,709 5,52 9,7 7,4 8,4
There is also another setting distr, which allows you to get distribution of counts row wise and column wise. Here is some output
row,1,664 row,2,48 row,3,5 col,4,2 col,52,1 col,6,1 col,709,1
The second row in the output tells us that there are 48 rows with 2 missing fields. Similarly, the fifth record tells us that there 1 column with 52 missing fields.
Filtering Out Rows and Columns
The next step is to filter out row and columns from the data using the output of the last MapReduce as a guide. For filtering we will use a Map Reduce class called Projection, which is essentially an implementation of SQL select query. You can set the projection fields and predicates for select clause. It also supports some UDF for predicates.
Based on the findings from the previous Map Reduce job, this is how we are going to filter the data
- Filter out rows with more than 2 missing fields
- Filter out any row with the last field missing
- Filter out the 7th column which has most of the values missing
Overall customer satisfaction score is mandatory. That’s why we filter out any rows that have this field missing, as indicated by the 2nd filter condition. Here is some sample output from the filtering operation
SQF48TC1DY0G,residence,141,0,3,7,T,9,9 TK36Z2P8489U,residence,264,2,0,4,F,2,2 60T2OHJOL4O5,business,692,2,0,8,T,10,5 8UUAYNE7FF39,residence,478,3,2,5,T,6,7 VHMEHBH46141,residence,339,0,2,5,T,9,8 Q6T6A26UG2UP,business,581,1,0,,F,2,3 QY36W6OTO9W0,residence,709,0,1,,T,6,8 SHSPOV41BC2X,residence,259,1,1,4,T,9,8 FCFWXD120MBT,business,625,0,1,7,T,8,7
The output will still have some missing values. However the most offending rows and columns have been removed.
Missing Value Imputation
Imputation is essentially prediction of missing values. Available data can used to predict the missing values using various statistical techniques.
Any missing field in the data after filter operation can be filled in with imputed values. This will be the topic of a future post.
Imputation also has the same risk as filtering. After imputation, the data may not any longer correspond to the underlying probability distribution that generated the data set.
In this post we have gone through a Hadoop based solution for handling missing values. The steps to to execute this use case can be found in this tutorial document.