Missing values are a common problem in many real-world data sets. There are various techniques for imputing them. We will use a kind of neural network called a Restricted Boltzmann Machine (RBM). RBMs are stochastic neural networks used for probabilistic graphical modeling. We will use a customer survey data set with missing income fields to show how to use an RBM to impute missing values.
The Python implementation is available in my open source project avenir on GitHub. It provides a user-friendly wrapper around the RBM implementation in the scikit-learn Python ML library. It allows you to use RBM through appropriate settings in a properties configuration file. There is very little coding involved except to call the train and prediction API.
Restricted Boltzmann Machine
Restricted Boltzmann Machines have two layers: a visible (input) layer and a hidden layer. The bidirectional connections run only between the visible layer and the hidden layer. There are no connections among the visible units or among the hidden units. The foundation of RBM is rooted in statistical physics.
Just as three-layer feedforward networks are known as universal approximators for any function, RBM networks are universal approximators for any probability distribution over binary vectors, given enough hidden units.
Among the many uses of RBM, an important one is as an unsupervised pre-training layer for feature extraction in neural networks with deep architecture. Here are the different ways RBM gets used:
- Estimating probability density p(x), given input x
- Estimating the correct input for noisy data
- Estimating a missing value
- Generating additional samples, given some input x
- Making recommendations
- Serving as an unsupervised pre-trained layer of a deep network
- Extracting features for use with any classification algorithm, e.g. SVM
- Anomaly detection
The difference between the expected correlation of the visible and hidden units in the clamped condition (visible units clamped to training data) and in the free running condition is used to learn the weights of the network. In other words, once a model is trained, the expected correlation between a visible unit and a hidden unit is the same under the clamped condition and the free running condition.
Calculating the expected correlation under the free running condition is computationally expensive. It requires alternating block Gibbs sampling of the hidden and visible units until convergence, a process known as burning in the Markov chain. A technique called Contrastive Divergence is generally used to expedite the process: it starts the chain at the training data and runs only a few Gibbs steps instead of waiting for convergence.
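The weight update described above can be sketched in NumPy as a single-step Contrastive Divergence (CD-1) update. This is only an illustration under stated assumptions: the function name `cd1_update`, the toy data and the layer sizes are made up for the example, and it is plain CD-1, which is not necessarily the variant the scikit-learn implementation uses internally.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, lr, rng):
    """One CD-1 step on a batch of binary visible vectors v0 (batch x n_vis)."""
    # Clamped phase: hidden probabilities with visible units fixed to the data
    h0_prob = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # A single block Gibbs step stands in for the free running phase
    v1_prob = sigmoid(h0 @ W.T + b_vis)
    v1 = (rng.random(v1_prob.shape) < v1_prob).astype(float)
    h1_prob = sigmoid(v1 @ W + b_hid)
    # Update with the difference of the visible-hidden correlations
    n = v0.shape[0]
    W = W + lr * (v0.T @ h0_prob - v1.T @ h1_prob) / n
    b_vis = b_vis + lr * (v0 - v1).mean(axis=0)
    b_hid = b_hid + lr * (h0_prob - h1_prob).mean(axis=0)
    return W, b_vis, b_hid

# Toy usage: 20 random binary records, 6 visible units, 4 hidden units
rng = np.random.default_rng(0)
n_vis, n_hid = 6, 4
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_vis = np.zeros(n_vis)
b_hid = np.zeros(n_hid)
data = rng.integers(0, 2, size=(20, n_vis)).astype(float)
for _ in range(10):
    W, b_vis, b_hid = cd1_update(data, W, b_vis, b_hid, lr=0.05, rng=rng)
```

When the two correlation terms match on average, the update vanishes, which is the trained condition described above.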
Missing Value Imputation
There are various techniques for missing value imputation, ranging from simple ones such as using the mean or median value all the way to K Nearest Neighbor (KNN) and Multivariate Imputation by Chained Equations (MICE).
RBM is another way of doing missing value imputation. We take all the records without missing values and train an RBM model. Then we use the trained model to predict missing values for the records that have them. One of the important configuration parameters is the number of hidden units. As a guideline, it should be of the same order as the number of unique records in the data set.
Because the input data is biased and follows a certain distribution, far fewer hidden units are needed in practice. For the data set I used, about 100 hidden units was found to be the optimum, although the guideline called for about 400.
Once we have the trained model, to find the missing value of a field, we put some random value in the missing field and regenerate the input. Since the regenerated value is a sample drawn from the distribution of the visible units conditioned on the sampled hidden units, we repeat the regeneration process many times and take the expected value from the empirical distribution of the missing field.
The process of getting a sample of the missing value is through the reconstruction process, which internally works as follows
- Given the input record, we get a distribution of the hidden units conditioned on the values of the input units
- The conditional hidden units distribution is sampled to get hidden unit values
- The visible unit distribution conditioned on the hidden unit values from the previous step is sampled, to generate visible unit values.
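The reconstruction steps above correspond to one Gibbs step, which `BernoulliRBM.gibbs` in scikit-learn performs. Here is a minimal sketch of repeated sampling for a missing income field. It is not the avenir wrapper code: the helper names `make_records` and `impute_income`, the 12-column one-hot layout, the toy data generator and the choice to keep the known fields clamped between iterations are all assumptions made for illustration.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

# Assumed layout: sex(1) marital(1) age(3) income(3) ethnicity(4) = 12 columns
INCOME = slice(5, 8)

def make_records(n, rng):
    """Toy one-hot records in which income mostly follows age (illustration only)."""
    age = rng.integers(0, 3, n)
    income = np.where(rng.random(n) < 0.8, age, rng.integers(0, 3, n))
    X = np.zeros((n, 12))
    X[:, 0] = rng.integers(0, 2, n)
    X[:, 1] = rng.integers(0, 2, n)
    X[np.arange(n), 2 + age] = 1
    X[np.arange(n), 5 + income] = 1
    X[np.arange(n), 8 + rng.integers(0, 4, n)] = 1
    return X

def impute_income(rbm, record, n_iter=200, seed=0):
    """Estimate the income category by repeated reconstruction."""
    rng = np.random.default_rng(seed)
    v = record.astype(float).copy()
    # Put a random category in the missing field
    v[INCOME] = 0.0
    v[INCOME.start + rng.integers(3)] = 1.0
    counts = np.zeros(3)
    for _ in range(n_iter):
        # One Gibbs step: sample hidden given visible, then visible given hidden
        sample = rbm.gibbs(v.reshape(1, -1))[0].astype(float)
        counts += sample[INCOME]
        # Known fields stay clamped; only the income bits are carried over
        v[INCOME] = sample[INCOME]
    freq = counts / n_iter  # empirical frequency of each income bit
    return int(np.argmax(freq)), freq

rng = np.random.default_rng(42)
rbm = BernoulliRBM(n_components=20, learning_rate=0.05, batch_size=10,
                   n_iter=20, random_state=0)
rbm.fit(make_records(500, rng))
record = make_records(1, rng)[0]
category, freq = impute_income(rbm, record)
```

The most frequent income bit is taken as the imputed category. Averaging the sampled bits is one way to read "expected value from the empirical distribution"; other summaries are possible.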
Customer Survey Data
We will use customer survey data as the use case. The data contains the following fields, which are either binary or categorical.
- Sex (binary)
- Marital status (binary)
- Age (categorical with cardinality 3)
- Income (categorical with cardinality 3)
- Ethnicity (categorical with cardinality 4)
Since the RBM used here expects binary inputs, all the categorical values are converted to binary using one-hot encoding. Here is some sample data before encoding:
1,1,O,M,WH
1,0,O,H,WH
1,1,O,M,SA
0,0,M,M,WH
1,1,M,M,BL
0,1,Y,L,WH
0,0,O,M,BL
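As an illustration of the encoding step, here is a small sketch that turns one such record into a 12-element binary vector. The field order (sex, marital status, age, income, ethnicity) is read off the field list above. The ordering of categories within each field is an assumption, and `OT` is a hypothetical placeholder for the fourth ethnicity value, which does not appear in the sample.

```python
# Category orderings are assumptions made for this sketch
AGE = ["Y", "M", "O"]
INCOME = ["L", "M", "H"]
ETHNICITY = ["WH", "SA", "BL", "OT"]  # "OT" is a hypothetical fourth value

def one_hot(value, categories):
    """Return a list with 1 at the position of value and 0 elsewhere."""
    return [1 if value == c else 0 for c in categories]

def encode(line):
    """Encode one comma separated survey record as a binary vector."""
    sex, married, age, income, ethnicity = line.strip().split(",")
    return ([int(sex), int(married)]
            + one_hot(age, AGE)
            + one_hot(income, INCOME)
            + one_hot(ethnicity, ETHNICITY))

encoded = encode("1,1,O,M,WH")
print(encoded)  # [1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0]
```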
The data is synthetically generated by sampling from a pre-specified distribution. The income distribution is conditional on age. Two sets of data are generated, one for training and the other for testing and validation.
Real World Data
For real-world data with missing values in one or more fields, the following steps need to be followed.
- Separate out data with no missing fields
- From the data from the step above set aside 80% for training
- Use the remaining 20% for validation
- Make a copy of the validation data set and, in the fields that can be missing, replace the values with realistic random values. We will call this the test set
- Use the records with missing fields, separated out in the first step, for prediction of missing values. This is the prediction set
You can also use k-fold cross validation. During testing and validation, all the RBM parameters will be tuned. In a copy of the validation data set, you have to simulate missing values, i.e. replace some fields with realistic random values. This data set will be used for prediction during validation, and the predictions are compared with the validation data set for error calculation.
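The data preparation steps above can be sketched in plain Python. This is a hedged illustration, not code from avenir: the function names, the use of an empty string to mark a missing income, the income codes L, M and H, and the toy records are all assumptions.

```python
import random

def split_records(records, income_idx=3, train_frac=0.8, seed=0):
    """Split records into training, validation, test and prediction sets."""
    rng = random.Random(seed)
    # Step 1: separate complete records from those with a missing income
    complete = [r for r in records if r[income_idx] != ""]
    prediction = [r for r in records if r[income_idx] == ""]
    # Steps 2 and 3: 80% for training, the remaining 20% for validation
    rng.shuffle(complete)
    cut = int(train_frac * len(complete))
    train, validation = complete[:cut], complete[cut:]
    # Step 4: copy the validation set and overwrite income with a random value
    test = []
    for r in validation:
        masked = list(r)
        masked[income_idx] = rng.choice(["L", "M", "H"])
        test.append(masked)
    return train, validation, test, prediction

def error_rate(predicted, validation, income_idx=3):
    """Fraction of records whose predicted income differs from the true one."""
    wrong = sum(p[income_idx] != v[income_idx]
                for p, v in zip(predicted, validation))
    return wrong / len(validation)

# Toy records: ten complete, two with a missing income field
incomes = ["L", "M", "H", "M", "L", "H", "M", "M", "L", "H"]
records = [["1", "1", "O", inc, "WH"] for inc in incomes]
records += [["0", "1", "Y", "", "WH"], ["1", "0", "M", "", "BL"]]
train, validation, test, prediction = split_records(records)
```

After the model fills in the income field of the test set, `error_rate` compares those predictions with the untouched validation set.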
RBM Training and Prediction
Training and tuning go hand in hand just like in any other Machine Learning model building process. It goes through the following cycle
- Train model
- Make prediction with test set
- Compare with validation set and calculate error
- Change parameters and repeat all the steps until you are happy with the error rate.
The Python wrapper class for RBM makes it easy to use. All that is required is proper population of the properties configuration file. Here are some important training related parameters, after the model has been tuned.
train.num.components=100
train.learning.rate=0.05
train.batch.size=10
train.num.iter=50
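To show how such settings could drive the underlying model, here is a sketch that parses the properties and builds a scikit-learn `BernoulliRBM`. The mapping from property names to constructor arguments is my reading of the names, not code taken from the avenir wrapper, and `parse_props` is a hypothetical helper.

```python
from sklearn.neural_network import BernoulliRBM

PROPS = """
train.num.components=100
train.learning.rate=0.05
train.batch.size=10
train.num.iter=50
"""

def parse_props(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    return props

props = parse_props(PROPS)
# Assumed mapping of property names to BernoulliRBM arguments
rbm = BernoulliRBM(
    n_components=int(props["train.num.components"]),
    learning_rate=float(props["train.learning.rate"]),
    batch_size=int(props["train.batch.size"]),
    n_iter=int(props["train.num.iter"]),
)
```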
All the tuning was done manually, and I got an accuracy of 60%. The model could instead have been tuned automatically using parameter search and optimization algorithms. Here are some of them:
- Grid Search
- Random Search
- Simulated Annealing
- Genetic Programming
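As an example of the first option, a grid search over two parameters can be sketched with scikit-learn's `ParameterGrid`. The parameter values and the random toy data are made up for the illustration. The score used here is the mean pseudo-likelihood from `score_samples` on held-out data, which is a stand-in: in this use case you would score each candidate by the imputation error on the test set instead.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid
from sklearn.neural_network import BernoulliRBM

# Toy binary data standing in for the one-hot encoded survey records
rng = np.random.default_rng(0)
X = (rng.random((300, 12)) < 0.4).astype(float)
X_train, X_val = X[:240], X[240:]

grid = ParameterGrid({
    "n_components": [10, 50],
    "learning_rate": [0.01, 0.05],
})

results = []
for params in grid:
    rbm = BernoulliRBM(batch_size=10, n_iter=5, random_state=0, **params)
    rbm.fit(X_train)
    # Higher pseudo-likelihood on held-out data is taken as better
    score = rbm.score_samples(X_val).mean()
    results.append((score, params))

best_score, best_params = max(results, key=lambda r: r[0])
print(best_params)
```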
Prediction is the easy part. As alluded to earlier, since the prediction is sampling based, it needs to be repeated many times and the expected value then taken. The iteration count is set by the parameter analyze.missing.iter.count.
We have gone through a technique for missing value imputation using a Restricted Boltzmann Machine. The tutorial document can be used for step-by-step instructions on how to execute the use case.
For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud including installation, configuration and testing.