Missing values are just part of life in the data processing world. In most cases you cannot simply ignore them, as they may adversely affect whatever analytic processing you are going to do. Broadly speaking, handling missing data consists of two steps: gaining some insight into the missing fields in the data, and then taking action based on that insight. In this post the focus will be primarily on the first step.
The Hadoop based implementation is available in my OSS project chombo on github. In the future I will provide a Spark port of the implementation.
Missing Data Mechanism
There are 3 underlying mechanisms behind missing data, as follows. Knowledge of the underlying mechanism helps in choosing the missing value processing techniques to be used.
- Missing Completely at Random (MCAR): Missing values for a field in a record are completely random, and the probability of having a missing value does not depend on the known values of other fields or on the missing field value itself.
- Missing at Random (MAR): Missing values are random, but the probability of a missing value in a field depends on the values of other fields.
- Missing Not at Random (MNAR): The probability of having a missing value in a field depends on the value of the missing field itself.
With MCAR, the probability of a missing value for a field is the same for all records, i.e., it is completely random.
With MAR, the probability of a missing value in a field depends on the known values of other fields in the same record. For example, in a survey, if the education level is college, the probability of the earnings field being missing is higher. The probability of a missing value may also depend on a hidden variable that’s not part of the record.
With MNAR, the probability of the missing value depends on the missing value itself. For example, people with higher income are more likely not to provide their income information.
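The three mechanisms can be made concrete with a small simulation. This is an illustrative sketch, not part of the chombo implementation; the field names, probabilities, and income range are all assumed purely for the example.

```python
import random

random.seed(42)

def make_record():
    # A toy survey record with an education level and an income
    education = random.choice(["high_school", "college", "graduate"])
    income = random.randint(20, 200) * 1000
    return {"education": education, "income": income}

def mcar(record, p=0.2):
    # MCAR: income goes missing with the same probability for every record
    if random.random() < p:
        record["income"] = None
    return record

def mar(record):
    # MAR: probability of missing income depends on another observed field
    p = 0.4 if record["education"] == "college" else 0.1
    if random.random() < p:
        record["income"] = None
    return record

def mnar(record):
    # MNAR: probability of missing income depends on the income value itself
    p = 0.5 if record["income"] > 100000 else 0.05
    if random.random() < p:
        record["income"] = None
    return record

records = [mnar(make_record()) for _ in range(1000)]
missing = sum(1 for r in records if r["income"] is None)
print(f"{missing} of {len(records)} income values missing under MNAR")
```

Swapping `mnar` for `mcar` or `mar` in the last step shows how the same data set ends up with missing values under each mechanism.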
Handling Missing Values
Unless missing values have no adverse consequence, which is rare, you have to do something about them. Here is a workflow that could be used to handle missing values.
- Do exploratory analysis to get statistics on missing values, row wise and column wise. The insight gained can be used in the next steps.
- Filter out rows and columns with too many missing values, based on the results obtained in the previous step.
- Accept the data obtained from step 2 with some missing values. Depending on the further analysis that’s going to be performed, this may be a viable approach.
- If missing values are not acceptable, replace them with values predicted through various statistical techniques. These techniques are known as missing value imputation.
For our use case, we will follow steps 1, 2, and 4 of the workflow above. The focus of this post is on steps 1 and 2. Missing value imputation, as in step 4, will be covered in a future post. There are two drawbacks to filtering, as follows.
- If the underlying mechanism for missing values is MAR or MNAR, the data resulting from the filtering may not conform to the underlying probability distribution of the data, i.e., it becomes biased.
- If there are too many missing fields, there may not be enough data left after filtering to be useful for the analysis to be performed on the data.
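The first drawback can be illustrated with a quick simulation: under MNAR, dropping records with missing values shifts the statistics of what remains. The income range and missingness probabilities below are assumed purely for illustration.

```python
import random

random.seed(7)

# Synthetic incomes, uniformly spread between 20,000 and 200,000
incomes = [random.randint(20, 200) * 1000 for _ in range(10000)]

# MNAR: high earners withhold their income far more often
observed = [x for x in incomes
            if random.random() >= (0.6 if x > 100000 else 0.05)]

true_mean = sum(incomes) / len(incomes)
filtered_mean = sum(observed) / len(observed)

# The mean of the surviving records is biased low, because the dropped
# records came disproportionately from the high income group
print(round(true_mean), round(filtered_mean))
```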
Customer Service Survey
We will use customer service survey data along with some data collected from customer calls as the use case. The fields in the survey data are as follows.
- Customer ID
- Customer type
- Call hold time
- Number of previous calls on the same issue
- Number of customer service re-routes in the call
- Score for customer service friendliness
- Score for customer identity verification process
- Whether issue got resolved
- Customer service representative satisfaction score
- Overall satisfaction score
Here is some sample input data.
APE4EAYQYU6K,residence,632,0,0,4,,T,9,9
4P85H8286NAV,residence,536,1,3,7,,F,3,4
XA7I0QPT7QJ4,residence,378,0,0,5,,F,5,4
3BGGP08JCJ6A,residence,262,2,2,5,,T,9,5
THW42R8BATHE,residence,282,2,2,8,,T,7,5
084RK82T23DN,residence,641,1,0,5,5,T,9,9
N7S8W4GBX4QE,business,297,0,2,5,,F,2,5
NHRN0G0U02EL,business,601,0,1,6,,T,9,10
C2U9ZZ0PHJR7,residence,545,5,3,8,5,T,7,6
78D7K8R0KY48,business,542,0,2,6,,T,8,10
Missing Value Statistics
Missing value statistics are calculated using the Map Reduce class MissingValueCounter. It operates in 3 different modes, based on the parameter mvc.counting.operation.
When set to row, it calculates row wise missing field counts as below. The output contains each row’s unique key followed by its count of missing fields, with the counts in descending order.
04A0COFHM2DQ,3
11YR7UVNMAB8,3
09K0AW8UEFPR,3
CL2TY6051M66,3
V3U7SC3MQQLT,3
When set to col, it calculates column wise missing field counts as below. It shows the column index followed by the count of missing fields, with the counts in descending order.
6,709
5,52
9,7
7,4
8,4
There is also a third setting, distr, which gives the distribution of missing field counts, row wise and column wise. Here is some sample output.
row,1,664
row,2,48
row,3,5
col,4,2
col,52,1
col,6,1
col,709,1
The second record in the output tells us that there are 48 rows with 2 missing fields each. Similarly, the fifth record tells us that there is 1 column with 52 missing fields.
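For illustration, the three counting modes can be sketched locally in a few lines of Python. This is a stand-in for what the MissingValueCounter job computes, not the chombo code itself; the sample rows below are made up in the survey record format described above.

```python
from collections import Counter

# Made-up sample rows in the survey record format
rows = [
    "APE4EAYQYU6K,residence,632,0,0,4,,T,9,9",
    "4P85H8286NAV,residence,536,1,3,7,,F,3,4",
    "084RK82T23DN,residence,641,1,0,5,5,T,9,9",
    "Q6T6A26UG2UP,business,581,1,0,,,F,2,",
]

records = [r.split(",") for r in rows]

# row mode: each row's unique key with its missing field count
row_counts = {rec[0]: sum(1 for f in rec if f == "") for rec in records}

# col mode: column index with its missing field count
col_counts = Counter(i for rec in records for i, f in enumerate(rec) if f == "")

# distr mode: distribution of missing field counts, row wise and column wise
row_distr = Counter(c for c in row_counts.values() if c > 0)
col_distr = Counter(c for c in col_counts.values() if c > 0)

print(sorted(row_counts.items(), key=lambda kv: -kv[1]))
print(col_counts.most_common())
```

In the actual Map Reduce job the counting is distributed across mappers and aggregated in reducers, but the statistics produced are of the same shape.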
Filtering Out Rows and Columns
The next step is to filter out rows and columns from the data, using the output of the previous Map Reduce job as a guide. For filtering we will use a Map Reduce class called Projection, which is essentially an implementation of a SQL select query. You can set the projection fields and the predicates for the select clause. It also supports some UDFs for the predicates.
Based on the findings from the previous Map Reduce job, this is how we are going to filter the data.
- Filter out rows with more than 2 missing fields
- Filter out any row with the last field missing
- Filter out the 7th column, which has most of its values missing
The overall customer satisfaction score is mandatory. That’s why we filter out any row that has this field missing, as indicated by the 2nd filter condition. Here is some sample output from the filtering operation.
SQF48TC1DY0G,residence,141,0,3,7,T,9,9
TK36Z2P8489U,residence,264,2,0,4,F,2,2
60T2OHJOL4O5,business,692,2,0,8,T,10,5
8UUAYNE7FF39,residence,478,3,2,5,T,6,7
VHMEHBH46141,residence,339,0,2,5,T,9,8
Q6T6A26UG2UP,business,581,1,0,,F,2,3
QY36W6OTO9W0,residence,709,0,1,,T,6,8
SHSPOV41BC2X,residence,259,1,1,4,T,9,8
FCFWXD120MBT,business,625,0,1,7,T,8,7
The output will still have some missing values. However, the worst offending rows and columns have been removed.
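The three filter conditions can be sketched locally as follows. This is again a stand-in for the Projection job, not its actual code; the sample rows are made up in the survey format.

```python
# Made-up sample rows in the survey record format (10 fields each)
rows = [
    "SQF48TC1DY0G,residence,141,0,3,7,,T,9,9",
    "TK36Z2P8489U,residence,264,2,0,4,,F,2,",   # last field missing
    "60T2OHJOL4O5,business,692,,,8,,T,,5",      # 4 missing fields
    "Q6T6A26UG2UP,business,581,1,0,,5,F,2,3",
]

filtered = []
for row in rows:
    fields = row.split(",")
    if sum(1 for f in fields if f == "") > 2:   # more than 2 missing fields
        continue
    if fields[-1] == "":                        # overall score is mandatory
        continue
    del fields[6]                               # drop the mostly-missing column
    filtered.append(",".join(fields))

for row in filtered:
    print(row)
```

As in the sample output above, the surviving rows have 9 fields and may still carry the occasional missing value in the remaining columns.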
Missing Value Imputation
Imputation is essentially prediction of missing values. The available data can be used to predict the missing values using various statistical techniques.
Any missing field in the data after the filtering operation can be filled in with imputed values. This will be the topic of a future post.
Imputation carries the same risk as filtering. After imputation, the data may no longer correspond to the underlying probability distribution that generated the data set.
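As a preview, here is a minimal sketch of one of the simplest imputation techniques: replacing a missing numeric field with the column mean. The choice of technique and the sample rows are assumptions for illustration; the statistical techniques to be covered in the future post may be more sophisticated.

```python
# Made-up sample rows in the post-filtering survey format
rows = [
    ["SQF48TC1DY0G", "residence", "141", "0", "3", "7", "T", "9", "9"],
    ["Q6T6A26UG2UP", "business", "581", "1", "0", "", "F", "2", "3"],
    ["QY36W6OTO9W0", "residence", "709", "0", "1", "", "T", "6", "8"],
    ["SHSPOV41BC2X", "residence", "259", "1", "1", "4", "T", "9", "8"],
]

COL = 5  # friendliness score column, missing in some rows

# Mean of the known values in the column: (7 + 4) / 2 = 5.5
known = [float(r[COL]) for r in rows if r[COL] != ""]
mean = sum(known) / len(known)

# Replace each missing value with the column mean
for r in rows:
    if r[COL] == "":
        r[COL] = str(mean)
```

Mean imputation keeps the column mean unchanged but shrinks its variance, which is one concrete way the imputed data can drift from the underlying distribution.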
In this post we have gone through a Hadoop based solution for handling missing values. The steps to execute this use case can be found in this tutorial document.
For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud including installation, configuration and testing.