Is a Neural Network Better Off with Big Data?

How does a neural network, or for that matter any machine learning model, relate to Big Data? Do we get a better quality learning model with bigger data? That’s what we will explore in this post. We will explore sample complexity, i.e. the way model performance varies with training sample size. This will be particularly interesting from a Big Data point of view. We will also look at model complexity, i.e. how model performance varies with the complexity of the model.

Although I have used a multilayer neural network for my experiments, the findings should apply to any machine learning algorithm.

Neural Network

Supervised machine learning is all about discovering functions that transform input to output. The model learns from training data, which provides each input as a vector and the corresponding output, the class or label, as a scalar. The functions can be linear or nonlinear, depending on the underlying hypothesis in the training data.

Let’s find out how learning happens in a neural network. Consider a learning problem with 2 dimensional input data x and output y, which is either class A or class B. Let’s assume that we already know the hypothesis to be learned by the neural network model. Consider the following hyperplane (a line) in 2 dimensional space

a x₁ + b x₂ + c = 0

If a training data point lies above this line it belongs to class A, otherwise class B. The network that solves this problem consists of 3 input elements, corresponding to x₁, x₂ and 1, and one output element. The third input, called the bias, is always 1.

The 3 connecting edges from the 3 input elements to the output element correspond to the 3 coefficients a, b and c that we want to learn by training the network. We will call a, b and c the weights of the 3 edges.

You can find the details of back propagation learning in any machine learning textbook, but here are the steps. We are using batch learning here.

1. Assign random values to a, b and c
2. Multiply each input by its edge weight and sum the results, which is essentially calculating ax₁ + bx₂ + c
3. Transform the value from step 2 through a nonlinear function that bounds the output between 0 (class B) and 1 (class A)
4. Take the difference between the output and the actual target output, which is the error
5. Repeat steps 2 through 4 for all the input data points and find the aggregate error
6. Back propagate the error to update the 3 weights
7. Repeat from step 2 until convergence

Essentially, we have assumed that the separating hyperplane is linear, and by training the network we have learned the parameters of that hyperplane.
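The steps above can be sketched in a few lines of numpy. This is only an illustration under some assumptions of mine, not the code used in the experiments: the nonlinear function is a sigmoid, the weight update uses the cross entropy gradient, training runs for a fixed number of iterations instead of testing for convergence, and the function name train_linear_unit and the toy data are made up for the example.

```python
import numpy as np

def train_linear_unit(X, y, learning_rate=0.5, iterations=2000, seed=0):
    """Batch-train a single sigmoid unit; returns the weights [a, b, c]."""
    rng = np.random.default_rng(seed)
    # Append the constant bias input 1 so each row is (x1, x2, 1).
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = rng.normal(scale=0.1, size=3)        # random initial a, b, c
    for _ in range(iterations):              # fixed count stands in for a convergence test
        z = Xb @ w                           # a*x1 + b*x2 + c for every data point
        out = 1.0 / (1.0 + np.exp(-z))       # sigmoid squashes the value into (0, 1)
        err = out - y                        # output minus target, per data point
        grad = Xb.T @ err / len(y)           # error aggregated over the whole batch
        w -= learning_rate * grad            # update the 3 weights
    return w

# Toy data (assumed): class A (label 1) lies above the line x2 = x1, class B (label 0) below.
data_rng = np.random.default_rng(1)
X = data_rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 1] > X[:, 0]).astype(float)

w = train_linear_unit(X, y)
pred = (np.hstack([X, np.ones((200, 1))]) @ w > 0).astype(float)
accuracy = (pred == y).mean()
print("weights a, b, c:", w)
print("training accuracy:", accuracy)
```

Because the toy data is linearly separable, the single unit is enough here; the learned a, b and c describe a line close to x₂ = x₁.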

If the separating hyperplane were nonlinear instead, the training would not have converged to a low error. Consider a triangle in the 2 dimensional input plane. Let’s assume that a point inside the triangle belongs to class A and any other point belongs to class B.

To solve this problem involving a nonlinear separating hyperplane, you need a 3 layer network that includes a hidden layer. The hidden layer will have 3 units and a bias unit. Each unit of the hidden layer corresponds to one side of the triangle we just discussed.

The solution will be similar to the algorithm already outlined, except that the forward propagation and the backward propagation have to traverse multiple layers.
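To make the role of the 3 hidden units concrete, here is a small sketch in which the weights are set by hand rather than learned. The triangle with vertices (0,0), (1,0) and (0,1), the hard threshold activation and the function name in_triangle are all assumptions chosen for readability; a trained network would use a smooth activation and find its own weights.

```python
import numpy as np

def step(z):
    """Hard threshold: 1 where z >= 0, else 0."""
    return (z >= 0).astype(float)

# Hidden layer: one row per side of the triangle with vertices (0,0), (1,0), (0,1).
# Columns are the weights on x1, x2 and the bias input.
W_hidden = np.array([
    [ 1.0,  0.0, 0.0],   # fires when x1 >= 0           (left side)
    [ 0.0,  1.0, 0.0],   # fires when x2 >= 0           (bottom side)
    [-1.0, -1.0, 1.0],   # fires when 1 - x1 - x2 >= 0  (slanted side)
])

# Output unit: fires only when all 3 hidden units fire (a logical AND).
# The last weight belongs to the hidden layer's bias unit.
w_out = np.array([1.0, 1.0, 1.0, -2.5])

def in_triangle(points):
    """Return 1.0 for points in class A (inside the triangle), 0.0 for class B."""
    points = np.atleast_2d(np.asarray(points, dtype=float))
    xb = np.hstack([points, np.ones((len(points), 1))])   # add bias input
    hidden = step(xb @ W_hidden.T)                         # forward through hidden layer
    hb = np.hstack([hidden, np.ones((len(hidden), 1))])    # add hidden bias unit
    return step(hb @ w_out)                                # forward through output unit

labels = in_triangle([[0.2, 0.2], [0.9, 0.9], [-0.1, 0.5]])
print(labels)
```

Each hidden unit on its own is just a linear separator like the one in the previous section; it is the combination in the output unit that produces the nonlinear boundary.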

Example Neural Network

The neural network used is a 3 layer network with the following characteristics for a binary classification problem.

Had 3 input units including 1 bias input
Number of hidden units was a variable in the tests
Used tanh function for hidden layer activation function
Used softmax function for output unit activation
Used cross entropy error function
Used batch learning back propagation algorithm
Used regularization to penalize large weights
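The forward pass and the error function of such a network can be sketched as follows. This is an illustration under my own assumptions rather than the exact code used in the tests: the bias is kept in separate vectors b1 and b2 (equivalent to a bias input that is always 1), the regularization is an L2 penalty on the weights, and the names forward, loss and reg_lambda are hypothetical.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """Return class probabilities for each row of X."""
    hidden = np.tanh(X @ W1 + b1)                   # tanh hidden layer
    scores = hidden @ W2 + b2                       # raw output scores
    scores -= scores.max(axis=1, keepdims=True)     # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)   # softmax

def loss(X, y, W1, b1, W2, b2, reg_lambda=0.01):
    """Average cross entropy error plus an L2 penalty on the weights."""
    probs = forward(X, W1, b1, W2, b2)
    # Cross entropy: negative log probability assigned to the correct class.
    data_loss = -np.log(probs[np.arange(len(y)), y]).sum()
    # Regularization term grows with the squared size of the weights.
    data_loss += reg_lambda / 2 * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return data_loss / len(y)

# Usage with 4 hidden units and small random weights on made-up data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 2))
y_demo = np.array([0, 1, 1, 0, 1])
W1, b1 = rng.normal(scale=0.1, size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(scale=0.1, size=(4, 2)), np.zeros(2)
print(loss(X_demo, y_demo, W1, b1, W2, b2))
```

Batch back propagation then computes the gradient of this loss with respect to W1, b1, W2 and b2 over the whole training set and moves each of them a small step against the gradient.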

I took a Python implementation of a neural network from this excellent post and made some modifications, e.g. passing many parameters as command line arguments. It uses the numpy and scikit-learn Python libraries. My version of the code is on GitHub. It takes the following parameters from the command line

Number of hidden units
Data set size
Noise level in the data
Number of iterations
Learning rate
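One way to accept these five parameters is Python's standard argparse module. The flag names and default values below are hypothetical, chosen only for this sketch; they are not necessarily the ones used in my code.

```python
import argparse

def build_parser():
    """Build a parser for the five experiment parameters (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Train a 3 layer neural network")
    parser.add_argument("--hidden-units", type=int, default=3,
                        help="number of units in the hidden layer")
    parser.add_argument("--data-size", type=int, default=200,
                        help="total number of generated data points")
    parser.add_argument("--noise", type=float, default=0.2,
                        help="noise level in the generated data")
    parser.add_argument("--iterations", type=int, default=20000,
                        help="number of batch training iterations")
    parser.add_argument("--learning-rate", type=float, default=0.01,
                        help="step size for the weight updates")
    return parser

# Parse an explicit argument list here; a script would call parse_args() with no arguments.
args = build_parser().parse_args(["--hidden-units", "4", "--data-size", "600"])
print(args.hidden_units, args.data_size, args.noise, args.iterations, args.learning_rate)
```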

Each input is a 2 dimensional vector. Data was generated using the scikit-learn data generation library. The separating hyperplane in the data set is nonlinear, which made it necessary to use a network with hidden units. Once the data was generated, 20% of it was set aside for validation. This is one of the changes I made to the original implementation.
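Data generation and the 20% hold-out can be sketched as below. The choice of make_moons is an assumption made for this example, since it is one scikit-learn generator that produces 2 dimensional, two class data with a nonlinear boundary and a noise parameter; the helper name make_data and the default values are also hypothetical.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

def make_data(data_size=200, noise=0.2, seed=0):
    """Generate noisy 2 dimensional, two class data and hold out 20% for validation."""
    # make_moons is an assumed choice of generator; noise is the standard
    # deviation of Gaussian noise added to the points.
    X, y = make_moons(n_samples=data_size, noise=noise, random_state=seed)
    # test_size=0.2 sets aside 20% of the data for validation.
    return train_test_split(X, y, test_size=0.2, random_state=seed)

X_train, X_val, y_train, y_val = make_data()
print(X_train.shape, X_val.shape)
```

With a data set size of 200, this leaves 160 points for training and 40 for validation.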

Generalization Error

Here is the bound on generalization error, which is probably the most important formula in machine learning theory. We can gain a lot of valuable insight from this formula.

E_g <= E_t + sqrt((8/N) * ln(4 ((2N)^dvc + 1) / δ))
E_g = generalization error
E_t = training error
N = training sample size
dvc = VC dimension
δ = probability that the bound does not hold

Generalization error is within the bound specified above with a probability of (1 – δ) or more. Model complexity is characterized by the VC dimension. A more complex model will have a higher VC dimension. In our case, VC dimension should increase with an increasing number of hidden units in the network. As per the expression above, the bound on generalization error consists of 2 components

Training error, which is the first term
Error due to model complexity and sample size, which is the second term
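The second term is easy to evaluate numerically, which helps in seeing how it reacts to N and dvc. The sketch below is a direct transcription of the formula; the function name complexity_term, the choice δ = 0.05 and the sample values of N and dvc are assumptions for illustration.

```python
import math

def complexity_term(n, dvc, delta=0.05):
    """Second term of the bound: sqrt((8/N) * ln(4 ((2N)^dvc + 1) / delta))."""
    growth = (2 * n) ** dvc + 1      # exact Python integer, so no float overflow
    log_arg = math.log(4) + math.log(growth) - math.log(delta)
    return math.sqrt(8.0 / n * log_arg)

# Rows: sample size N. Columns: VC dimension 3 and 10.
for n in (160, 1280, 100000):
    print(n, [round(complexity_term(n, d), 3) for d in (3, 10)])
```

The bound is a worst case guarantee, so the absolute values are loose and can even exceed 1 for small N. What matters here is the trend: the term shrinks as N grows and grows as dvc grows.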

Model Complexity

We will look at how error behaves with model complexity while holding training data size constant. Training error goes down with a more complex model, as the learner finds more ways to fit the model to the training data. However, the complexity term goes up with increasing model complexity.

As a result, we get a convex shaped error curve. At low model complexity, training error dominates and we get high generalization error, called error due to bias. As the complexity grows, the second term becomes dominant and we again get high generalization error, called error due to variance.

Sample Complexity

To study sample complexity we will look at how error changes with sample size N, while holding the model complexity, i.e. dvc, constant. Training error is assumed not to change, since we are holding model complexity constant. However, the complexity (second) term steadily decreases with increasing sample size.

So, overall, with increasing training sample size we should expect decreasing generalization error. You might want to draw the premature conclusion that we can compensate for an inferior model with lots of data. Unfortunately, as we will find out later, that’s not necessarily true.

Tests with Noisy Data

Here are the test results with noisy data. Fortunately, scikit-learn allows us to control the noise level in the generated data. Error rates were recorded once they had visibly converged after many iterations.

Rows correspond to different training sample sizes (N). Columns correspond to the number of hidden units (H), which dictates the model complexity.

	H = 2	H = 3	H = 4	H = 5	H = 6
N = 160	0.257	0.193	0.171	0.173	0.170
N = 320	0.221	0.097	0.094	0.091	0.092
N = 480	0.263	0.056	0.055	0.050	0.051
N = 640	0.346	0.094	0.078	0.098	0.096
N = 960	0.258	0.052	0.052	0.051	0.051
N = 1280	0.318	0.129	0.133	0.131	0.134

Here are the observations from the test results and how they compare with generalization error theory.

Error goes down with increasing model complexity, i.e. more hidden units, with most of the drop occurring between 2 and 3 hidden units. The rate of error drop varies with training sample size. There is no significant increase in error with very high model complexity. Maybe I didn’t have test cases with a large enough number of hidden units.
Contrary to what theory predicts, error does not go down monotonically with sample size. For 3 or more hidden units, the error drops up to N = 480, fluctuates after that, and is clearly higher again at N = 1280
The lowest error (0.050) is for a training sample size of 480 with 5 hidden units, and the results for N = 960 are nearly identical. We get comparably low errors with 3 or more hidden units.

The increase in error with large training sample size is intriguing. My speculation is that with noisy data and a large sample size we introduce more local minima in the error function, and the solution gets stuck in some suboptimal local minimum.

While researching this issue, I came across a concept called self efficiency that’s related to this problem.

Tests with Insignificant Noise in Data

In the next set of tests we used the same parameters as before, except that the noise level in the data was kept very low. Here are the results.

	H = 2	H = 3	H = 4	H = 5	H = 6
N = 160	0.369	0.027	0.023	0.020	0.023
N = 320	0.207	0.009	0.009	0.009	0.009
N = 480	0.208	0.007	0.007	0.007	0.007
N = 640	0.293	0.006	0.006	0.005	0.005
N = 960	0.225	0.004	0.004	0.004	0.004
N = 1280	0.267	0.003	0.003	0.003	0.003

Here are the observations from the results. They are more in line with generalization error theory.

Results are in general significantly better than the results with noisy training data
Error goes down with increasing model complexity as before. There is no increase in error with high model complexity, probably because we have not entered the very high complexity region in our tests
Error goes down monotonically with training sample size for 3 or more hidden units, as generalization error theory suggests. With 2 hidden units the error stays high and does not follow a clear trend
The best results are for the highest training sample size of 1280 with 3 or more hidden units
The lowest error (0.003) is about 6% of the lowest error with noisy data (0.050)

We see a significant difference in results depending on whether there is noise in the data or not.
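The two tables are small enough to check programmatically. The sketch below copies the numbers from the tables above and, for each number of hidden units, tests whether the error never increases as N grows; it also computes the ratio of the lowest errors. The helper names column and is_non_increasing are made up for this example.

```python
sizes = [160, 320, 480, 640, 960, 1280]       # training sample sizes N (rows)
hidden = [2, 3, 4, 5, 6]                       # hidden units H (columns)

noisy = [
    [0.257, 0.193, 0.171, 0.173, 0.170],
    [0.221, 0.097, 0.094, 0.091, 0.092],
    [0.263, 0.056, 0.055, 0.050, 0.051],
    [0.346, 0.094, 0.078, 0.098, 0.096],
    [0.258, 0.052, 0.052, 0.051, 0.051],
    [0.318, 0.129, 0.133, 0.131, 0.134],
]
clean = [
    [0.369, 0.027, 0.023, 0.020, 0.023],
    [0.207, 0.009, 0.009, 0.009, 0.009],
    [0.208, 0.007, 0.007, 0.007, 0.007],
    [0.293, 0.006, 0.006, 0.005, 0.005],
    [0.225, 0.004, 0.004, 0.004, 0.004],
    [0.267, 0.003, 0.003, 0.003, 0.003],
]

def column(table, h):
    """Errors for a fixed number of hidden units, ordered by increasing N."""
    return [row[hidden.index(h)] for row in table]

def is_non_increasing(values):
    """True if the error never goes up from one sample size to the next."""
    return all(b <= a for a, b in zip(values, values[1:]))

for name, table in (("noisy", noisy), ("low noise", clean)):
    for h in hidden:
        print(name, "H =", h, "non-increasing with N:", is_non_increasing(column(table, h)))

# Ratio of the lowest low-noise error to the lowest noisy error.
ratio = min(min(row) for row in clean) / min(min(row) for row in noisy)
print("lowest error ratio:", ratio)
```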

Final Thoughts

Big Data is not a panacea. You get a better quality model with more data, provided the data is more or less noise free. Even in this ideal scenario of noise free training data, you might reach an acceptable error rate with a moderate amount of training data, and the error rate may not drop significantly with a further increase in data size.

With noisy data, more training data seems to harm the model, at least in my experiment. If your model is very complex, you will require a lot of training data. However, you will be able to build a good complex model only if the training data is relatively error free. To summarize, you generally don’t need a very large training data set to build a learning model.