I have seen the topic of data size as it relates to machine learning being discussed often. But they are mostly opinions, views and innuendos, not backed by any rational explanation. Comments like we need “we need meaningful data not big data” or “insightful data, not big data” abound which often leave me puzzled. How does Big Data preclude meaning or insight in the data. Being motivated to find concrete answers to the question of data size as it relates to learning, I decided to explore theories of Machine Learning to get to the heart of the issue. I found some interesting and insightful answers in Computational Learning Theory.
In this post I will summarize my findings. As part of this exercise, I have implemented a python script to estimate the minimum training set size required by a learner, based on PAC theory. The script takes as input the cardinality of the feature variables, the cardinality of the class variable, type of hypothesis space and relevant parameters for the hypothesis space type chosen.
The questions I had asked myself are following, before setting out for my journey into Computational Learning theory.
- How does the training data size relate to error rate
- How does the training data size relate to problem size or complexity
- Given a problem, how do I find minimum training data size
These are questions almost everyone building machine learning model faces. In rest of this post, I will go over the details of the answers to these questions.
Computational Learning Theory
In machine learning problem we have a set of n feature variables. The variables could be numerical. categorical or boolean. A hypothesis is a function of the feature variables, which when evaluated with an instance of the training data generates the class or output variable value. For classification problem, the class variables have discrete values. If they have two values, we have a binary classification problem.
The purpose of a training a model is to find the actual target hypothesis. If the correct hypothesis is found, it will have low training error and generalization error. Generalization error is the prediction error rate for input instances drawn randomly from a sample distribution. which were not part of the training sample.
The set of all hypothesis considered by a learner is the hypothesis space. Every machine learning algorithm has an implied hypothesis space it works on For complex problems, you may need a learner that can explore a large hypothesis space.
Loan Approval Example
We will do our analysis for a loan approval example. It’s characterized by the following feature variables. Since we will be dealing with only discrete valued feature variables, we will assume that any numerical variable has been appropriately discretized. The class variable will have two values. So it’s binary classification problem.
|Number of years in current home||3||Dicretized|
|History of past loan payments||3|
|Amount of current loan||5||Discretized|
|Type of employment||10|
|Number of years in current job||3||Discretized|
|Number of working years||3||Discretized|
|Family yearly income||5||Discretized|
In my example, all the variable values are discrete, because the learning theory I will be discussing is valid only if the size of the hypothesis space size is bounded. Only if the feature variable values are discrete, we can have a bound on the hypothesis space size.
Minimum Training Sample Size
We will find the minimum training sample size for the loan approval problem. The minimum sample size according to Probably Approximate Correct (PAC) learning theory is as follows, if we accept some generalization error rate with probability below some threshold.
A hypothesis is Probably Approximately Correct (PAC), if the probability that it is approximately correct is greater than 1 – δ, where δ is a confidence parameter.
Here some observations we can make from the relation above. Intuitively, they all make sense.
- As Hypothesis space size and complexity increases, the number training samples required goes up.
- As the generalization error goes down, the number training samples required goes up.
- As the the probability of generalization error threshold goes down, the number training samples required goes up.
The relation above holds only under some assumptions as below. Other options are available when these assumptions are not true.
- Hypothesis space size is bounded. This is generally true as long as real valued feature variables are not included.
- Training error is zero for the hypothesis learnt by a learner.
All Terms Hypothesis Space
In this hypothesis space family, we take conjunction of all feature variables comprising of all possible values of each feature variable. This is the simplest of the hypothesis space families we will be considering.
The hypothesis space size is essentially the multiplication of all cardinality values, except that we have to add 1 to each feature variable cardinality before multiplication to account for any value of a feature variable. Here is the output from the the script from the loan approval problem
10948608,0.010,0.010,2081 10948608,0.010,0.020,2012 10948608,0.010,0.030,1971 10948608,0.010,0.040,1942 10948608,0.010,0.050,1920 10948608,0.020,0.010,1040 10948608,0.020,0.020,1006 10948608,0.020,0.030,985 10948608,0.020,0.040,971 10948608,0.020,0.050,960 10948608,0.030,0.010,693 10948608,0.030,0.020,670 10948608,0.030,0.030,657 10948608,0.030,0.040,647 10948608,0.030,0.050,640 10948608,0.040,0.010,520 10948608,0.040,0.020,503 10948608,0.040,0.030,492 10948608,0.040,0.040,485 10948608,0.040,0.050,480 10948608,0.050,0.010,416 10948608,0.050,0.020,402 10948608,0.050,0.030,394 10948608,0.050,0.040,388 10948608,0.050,0.050,384
The fields in the output are as follows. The number of hypothesis is based on all 10 feature variables and the class variable.
- Number of hypothesis
- Maximum error threshold
- Maximum error probability threshold
- Minimum number of training samples
As expected, the number of samples go down as the error threshold and the error probability threshold go up.
K Term DNF Hypothesis Space
In this hypothesis space, we take conjunction of all variables as before and then take all possible disjunctions of the conjunctions, where each disjunction consists of K conjunctions. With K as 3, here is the output.
22143210975270000,0.010,0.010,4224 22143210975270000,0.010,0.020,4154 22143210975270000,0.010,0.030,4114 22143210975270000,0.010,0.040,4085 22143210975270000,0.010,0.050,4063 22143210975270000,0.020,0.010,2112 22143210975270000,0.020,0.020,2077 22143210975270000,0.020,0.030,2057 22143210975270000,0.020,0.040,2042 22143210975270000,0.020,0.050,2031 22143210975270000,0.030,0.010,1408 22143210975270000,0.030,0.020,1384 22143210975270000,0.030,0.030,1371 22143210975270000,0.030,0.040,1361 22143210975270000,0.030,0.050,1354 22143210975270000,0.040,0.010,1056 22143210975270000,0.040,0.020,1038 22143210975270000,0.040,0.030,1028 22143210975270000,0.040,0.040,1021 22143210975270000,0.040,0.050,1015 22143210975270000,0.050,0.010,844 22143210975270000,0.050,0.020,830 22143210975270000,0.050,0.030,822 22143210975270000,0.050,0.040,817 22143210975270000,0.050,0.050,812
This is a more complex hypothesis space and as expected there is a significant increase in the number of hypothesis. The required number of training samples have almost doubled compared to the earlier case.
With this hypothesis space, we have set out to build a more complex learning model, which explains the increase the number of training samples required.
K CNF Hypothesis Space
First, we construct all possible disjunctions comprising of K variables. Then we construct the conjunctions by taking all possible subsets of the disjunctions. This is the most complex Hypothesis space causing an exponential increase of the number of hypothesis. Here is the output with 3 as the value of K.
0.010,0.010,4101396 0.010,0.020,4101327 0.010,0.030,4101286 0.010,0.040,4101257 0.010,0.050,4101235 0.020,0.010,2050698 0.020,0.020,2050663 0.020,0.030,2050643 0.020,0.040,2050628 0.020,0.050,2050617 0.030,0.010,1367132 0.030,0.020,1367109 0.030,0.030,1367095 0.030,0.040,1367085 0.030,0.050,1367078 0.040,0.010,1025349 0.040,0.020,1025331 0.040,0.030,1025321 0.040,0.040,1025314 0.040,0.050,1025308 0.050,0.010,820279 0.050,0.020,820265 0.050,0.030,820257 0.050,0.040,820251 0.050,0.050,820247
I am not showing the number of hypothesis, because it’s too large. Remarkably, the number of samples required is almost 2000 times more than what is required for the simple all terms hypothesis space. With increasing complexity, there is an exponential growth in minimum sample size required.
K DNF Hypothesis Space
This is similar to K CNF, except that we start with conjunctions. We construct conjunctions comprising of K variables. Next we construct the disjunctions by taking all possible subsets of the conjunctions.
K CNF and K DNF can be shown to be equivalent to each other. I am not showing any result, because it’s identical to the K CNF case.
Infinite Hypothesis Space
If the data contains numerical feature variables, the hypothesis space becomes unbounded and the the estimate for number of training samples we have used so far is not valid any more. Instead of hypothesis space size, we have to consider something called Vapnik-Chervonenkis (VC) dimension.
The VC dimension of an hypothesis space is the largest subset of data that can be shattered by the hypothesis space. A set of training data instances S is shattered by the hypothesis space H, if for dichotomy of S, there exists some hypothesis in H consistent with the dichotomy.
In terms of VC dimension, minimum training data size is as follows. The biggest challenge is here is finding the VC dimension. For some special cases , VC dimension estimates are available.
Going back to the question of whether bigger data helps, generally the answer is yes. When the problem is complex and the learner is operating with a bigger hypothesis space, the benefit of having large amount of training data is more pronounced.
For simple learning problems, smaller data size will suffice. However, with increasing complexity, brought on by increasing number of feature variables or more complex hypothesis space, there is an exponential growth of minimum training data size required.
For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud including installation, configuration and testing,