Is Bigger Data Better for Machine Learning

I have seen the topic of data size as it relates to machine learning being discussed often. But they are mostly opinions, views and innuendos, not backed by any rational explanation. Comments like we need “we need meaningful data not big data” or “insightful data, not big data” abound which often leave me puzzled. How does Big Data preclude meaning or insight in the data. Being motivated to find concrete answers to the question of data size as it relates to learning, I decided to explore theories of Machine Learning to get to the heart of the issue. I found some interesting and insightful answers in Computational Learning Theory.

In this post I will summarize my findings. As part of this exercise, I have implemented a python script to estimate the minimum training set size required by a learner, based on PAC theory. The script takes as input the cardinality of the feature variables, the cardinality of the class variable, type of hypothesis space and relevant parameters for the hypothesis space type chosen.

Questions

The questions I had asked myself are following, before setting out for my journey into Computational Learning theory.

How does the training data size relate to error rate
How does the training data size relate to problem size or complexity
Given a problem, how do I find minimum training data size

These are questions almost everyone building machine learning model faces. In rest of this post, I will go over the details of the answers to these questions.

Computational Learning Theory

In machine learning problem we have a set of n feature variables. The variables could be numerical. categorical or boolean. A hypothesis is a function of the feature variables, which when evaluated with an instance of the training data generates the class or output variable value. For classification problem, the class variables have discrete values. If they have two values, we have a binary classification problem.

The purpose of a training a model is to find the actual target hypothesis. If the correct hypothesis is found, it will have low training error and generalization error. Generalization error is the prediction error rate for input instances drawn randomly from a sample distribution. which were not part of the training sample.

The set of all hypothesis considered by a learner is the hypothesis space. Every machine learning algorithm has an implied hypothesis space it works on For complex problems, you may need a learner that can explore a large hypothesis space.

Loan Approval Example

We will do our analysis for a loan approval example. It’s characterized by the following feature variables. Since we will be dealing with only discrete valued feature variables, we will assume that any numerical variable has been appropriately discretized. The class variable will have two values. So it’s binary classification problem.

Feature Variable	Cardinality	Comment
Sex	2
Marital status	2
Number of years in current home	3	Dicretized
History of past loan payments	3
Amount of current loan	5	Discretized
Type of employment	10
Number of years in current job	3	Discretized
Number of working years	3	Discretized
Family yearly income	5	Discretized
Loan amount	5	Discretized

In my example, all the variable values are discrete, because the learning theory I will be discussing is valid only if the size of the hypothesis space size is bounded. Only if the feature variable values are discrete, we can have a bound on the hypothesis space size.

Minimum Training Sample Size

We will find the minimum training sample size for the loan approval problem. The minimum sample size according to Probably Approximate Correct (PAC) learning theory is as follows, if we accept some generalization error rate with probability below some threshold.

A hypothesis is Probably Approximately Correct (PAC), if the probability that it is approximately correct is greater than 1 – δ, where δ is a confidence parameter.

m = ln(H/δ)/ε
where
m = Lower bound on training sample size
H = Hypothesis space size
ε = Acceptable generalization error rate
δ = Probability threshold for error

Here some observations we can make from the relation above. Intuitively, they all make sense.

As Hypothesis space size and complexity increases, the number training samples required goes up.
As the generalization error goes down, the number training samples required goes up.
As the the probability of generalization error threshold goes down, the number training samples required goes up.

The relation above holds only under some assumptions as below. Other options are available when these assumptions are not true.

Hypothesis space size is bounded. This is generally true as long as real valued feature variables are not included.
Training error is zero for the hypothesis learnt by a learner.

All Terms Hypothesis Space

In this hypothesis space family, we take conjunction of all feature variables comprising of all possible values of each feature variable. This is the simplest of the hypothesis space families we will be considering.

The hypothesis space size is essentially the multiplication of all cardinality values, except that we have to add 1 to each feature variable cardinality before multiplication to account for any value of a feature variable. Here is the output from the the script from the loan approval problem

10948608,0.010,0.010,2081
10948608,0.010,0.020,2012
10948608,0.010,0.030,1971
10948608,0.010,0.040,1942
10948608,0.010,0.050,1920
10948608,0.020,0.010,1040
10948608,0.020,0.020,1006
10948608,0.020,0.030,985
10948608,0.020,0.040,971
10948608,0.020,0.050,960
10948608,0.030,0.010,693
10948608,0.030,0.020,670
10948608,0.030,0.030,657
10948608,0.030,0.040,647
10948608,0.030,0.050,640
10948608,0.040,0.010,520
10948608,0.040,0.020,503
10948608,0.040,0.030,492
10948608,0.040,0.040,485
10948608,0.040,0.050,480
10948608,0.050,0.010,416
10948608,0.050,0.020,402
10948608,0.050,0.030,394
10948608,0.050,0.040,388
10948608,0.050,0.050,384

The fields in the output are as follows. The number of hypothesis is based on all 10 feature variables and the class variable.

Number of hypothesis
Maximum error threshold
Maximum error probability threshold
Minimum number of training samples

As expected, the number of samples go down as the error threshold and the error probability threshold go up.

K Term DNF Hypothesis Space

In this hypothesis space, we take conjunction of all variables as before and then take all possible disjunctions of the conjunctions, where each disjunction consists of K conjunctions. With K as 3, here is the output.

22143210975270000,0.010,0.010,4224
22143210975270000,0.010,0.020,4154
22143210975270000,0.010,0.030,4114
22143210975270000,0.010,0.040,4085
22143210975270000,0.010,0.050,4063
22143210975270000,0.020,0.010,2112
22143210975270000,0.020,0.020,2077
22143210975270000,0.020,0.030,2057
22143210975270000,0.020,0.040,2042
22143210975270000,0.020,0.050,2031
22143210975270000,0.030,0.010,1408
22143210975270000,0.030,0.020,1384
22143210975270000,0.030,0.030,1371
22143210975270000,0.030,0.040,1361
22143210975270000,0.030,0.050,1354
22143210975270000,0.040,0.010,1056
22143210975270000,0.040,0.020,1038
22143210975270000,0.040,0.030,1028
22143210975270000,0.040,0.040,1021
22143210975270000,0.040,0.050,1015
22143210975270000,0.050,0.010,844
22143210975270000,0.050,0.020,830
22143210975270000,0.050,0.030,822
22143210975270000,0.050,0.040,817
22143210975270000,0.050,0.050,812

This is a more complex hypothesis space and as expected there is a significant increase in the number of hypothesis. The required number of training samples have almost doubled compared to the earlier case.

With this hypothesis space, we have set out to build a more complex learning model, which explains the increase the number of training samples required.

K CNF Hypothesis Space

First, we construct all possible disjunctions comprising of K variables. Then we construct the conjunctions by taking all possible subsets of the disjunctions. This is the most complex Hypothesis space causing an exponential increase of the number of hypothesis. Here is the output with 3 as the value of K.

0.010,0.010,4101396
0.010,0.020,4101327
0.010,0.030,4101286
0.010,0.040,4101257
0.010,0.050,4101235
0.020,0.010,2050698
0.020,0.020,2050663
0.020,0.030,2050643
0.020,0.040,2050628
0.020,0.050,2050617
0.030,0.010,1367132
0.030,0.020,1367109
0.030,0.030,1367095
0.030,0.040,1367085
0.030,0.050,1367078
0.040,0.010,1025349
0.040,0.020,1025331
0.040,0.030,1025321
0.040,0.040,1025314
0.040,0.050,1025308
0.050,0.010,820279
0.050,0.020,820265
0.050,0.030,820257
0.050,0.040,820251
0.050,0.050,820247

I am not showing the number of hypothesis, because it’s too large. Remarkably, the number of samples required is almost 2000 times more than what is required for the simple all terms hypothesis space. With increasing complexity, there is an exponential growth in minimum sample size required.

K DNF Hypothesis Space

This is similar to K CNF, except that we start with conjunctions. We construct conjunctions comprising of K variables. Next we construct the disjunctions by taking all possible subsets of the conjunctions.

K CNF and K DNF can be shown to be equivalent to each other. I am not showing any result, because it’s identical to the K CNF case.

Infinite Hypothesis Space

If the data contains numerical feature variables, the hypothesis space becomes unbounded and the the estimate for number of training samples we have used so far is not valid any more. Instead of hypothesis space size, we have to consider something called Vapnik-Chervonenkis (VC) dimension.

The VC dimension of an hypothesis space is the largest subset of data that can be shattered by the hypothesis space. A set of training data instances S is shattered by the hypothesis space H, if for dichotomy of S, there exists some hypothesis in H consistent with the dichotomy.

In terms of VC dimension, minimum training data size is as follows. The biggest challenge is here is finding the VC dimension. For some special cases , VC dimension estimates are available.

m = (4 log(2/δ) + 8 VC log(13/ε) / ε
where
VC = VC dimension
Other parameters are as defined earlier

Summing Up

The Machine Learning book by Mitchell is an excellent resource for the theoretical details of computational learning theory. The python script used is available here.

Going back to the question of whether bigger data helps, generally the answer is yes. When the problem is complex and the learner is operating with a bigger hypothesis space, the benefit of having large amount of training data is more pronounced.

For simple learning problems, smaller data size will suffice. However, with increasing complexity, brought on by increasing number of feature variables or more complex hypothesis space, there is an exponential growth of minimum training data size required.