Lack of access to a good training data set is a serious impediment to building supervised Machine Learning models. Such data is scarce, and when it is available, its quality may be questionable. Even if a good quality data set is available, you may prefer to use synthetic data for various reasons, which we will get to later.

In this post we will go through an Ancestral Sampling based solution for generating synthetic training data. The implementation can easily be adapted for other classification problems. The ancestral sampling Python implementation, along with sample code showing how to use it, is available in my open source project *avenir*.

## Why Synthetic Data

Here are some of the reasons for using synthetic data. The last one is critical and will require some further elaboration.

1. You have no choice. You don’t have access to any relevant data set, or if you do, you are not sure about the quality of the data.
2. You have a data set, but there is not enough data. There is a way to marry the available data with the synthetic data generation process to generate more data.
3. With synthetic data generation, you have complete control over the data generation process. For example, you may want to include an additional feature in your prediction model that the available data set does not include.
4. With synthetic data, since you know the noise level injected into the data, it’s easier to tune your model with parameter search.

The second point needs further elaboration. With the limited data you have, you can build the various distributions empirically and use the solution outlined in the rest of this post to generate as much data as you need. If you have enough data but the classes are imbalanced, you can use the same approach to generate more samples of the minority class.
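As a rough illustration of building distributions empirically, the sketch below counts class and feature values in a small labeled sample and then draws extra minority class records in proportion to those counts. The toy data, the single categorical feature, and the function names `fit_empirical` and `sample_feature` are assumptions made for this example; they are not part of *avenir*.

```python
import random
from collections import Counter, defaultdict

def fit_empirical(records):
    """Build class counts and per-class histograms for one categorical feature.

    records: list of (feature_value, class_value) pairs from the small real data set.
    """
    prior = Counter(c for _, c in records)
    cond = defaultdict(Counter)
    for value, c in records:
        cond[c][value] += 1
    return prior, cond

def sample_feature(cond, c, rng):
    """Draw one feature value for class c in proportion to the observed counts."""
    values = list(cond[c].keys())
    weights = list(cond[c].values())
    return rng.choices(values, weights=weights, k=1)[0]

# Hypothetical toy data: (smoker status, class label)
small_data = [("SM", 1), ("SM", 1), ("NS", 1), ("NS", 0),
              ("NS", 0), ("SM", 0), ("NS", 0)]
prior, cond = fit_empirical(small_data)

rng = random.Random(42)
# Generate additional samples for the minority class (label 1) only
extra_minority = [(sample_feature(cond, 1, rng), 1) for _ in range(5)]
```

The counts are used directly as sampling weights, so they do not need to be normalized first.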

Regarding the fourth point, we have to delve into some details of machine learning generalization error, i.e., the error we get on data sets other than the training data set. The goal of parameter tuning is to find the combination of parameter values for the algorithm being used that yields the minimum generalization error.

As you explore the parameter search space, you never know whether you have reached the global minimum of the generalization error, even if you cast a very wide net. However, with synthetic data this problem is alleviated, since you know the noise in the data. Let’s take a look at the following error expression.

$e_o = e_m + e_n$, where

- $e_m$ = generalization error attributed to the model
- $e_o$ = generalization error as observed from tests
- $e_n$ = generalization error attributed to noise in the data

Our goal in parameter tuning is to drive $e_o$ as close as possible to $e_n$, because the lower bound for $e_o$ is $e_n$. With a real data set, since we don’t know $e_n$, all we can do is minimize $e_o$, not knowing where to stop. However, with synthetic data, we know where to stop while minimizing $e_o$, because we know the global minimum, which is $e_n$.

This argument is valid only under the assumption that your model is not overfitted by parameter tuning, i.e., it has not learnt to model the noise in the data.
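To make the stopping argument concrete, here is a minimal sketch of a parameter search that stops once the observed error is close enough to the known noise floor. It assumes that the error attributed to noise can be estimated from the noise level you injected; the function name `search_until_noise_floor`, the `evaluate` callable, the `tolerance` parameter, and the error values are all hypothetical.

```python
def search_until_noise_floor(candidates, evaluate, noise_error, tolerance=0.01):
    """Try parameter settings in order and stop once the best observed error
    is within `tolerance` of the known noise floor.

    evaluate: callable returning the observed generalization error for a setting.
    noise_error: error attributed to noise, assumed known for synthetic data.
    """
    best_params, best_error = None, float("inf")
    for params in candidates:
        observed = evaluate(params)
        if observed < best_error:
            best_params, best_error = params, observed
        # The remaining gap is the part of the error attributed to the model
        if best_error - noise_error <= tolerance:
            break
    return best_params, best_error

# Made-up observed errors for four hypothetical parameter settings
fake_errors = {1: 0.30, 2: 0.18, 3: 0.105, 4: 0.102}
params, err = search_until_noise_floor([1, 2, 3, 4], fake_errors.get,
                                       noise_error=0.10)
```

With these made-up numbers the search stops at the third setting, because its error is already within the tolerance of the noise floor, and the fourth setting is never evaluated.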

You should always strive to make the distribution models as close as possible to the underlying data generating process of the real problem you have in mind, using domain expertise or any other information you have access to.

## Ancestral Sampling

To generate training data for a classification problem, we need to sample from the joint distribution $\Pr(f_i, c)$, where $f_i$ is the $i^{th}$ feature and $c$ is the class value for all features. We can make our life simpler by leveraging the following probability product rule.

$\Pr(f_i, c) = \Pr(f_i \mid c) \times \Pr(c)$, where

- $\Pr(f_i, c)$ = joint probability of feature $f_i$ and class variable $c$
- $\Pr(f_i \mid c)$ = class conditional probability of feature $f_i$, a.k.a. likelihood
- $\Pr(c)$ = probability of class variable $c$

Now instead of sampling from the joint distribution, we can sample from 2 different distributions sequentially, one of them being conditional.

The number of probability distributions you have to define is $k \times d + 1$, where $k$ is the number of class values (2 for binary classification) and $d$ is the number of features. The training data generation process consists of the following steps.

1. Define the class prior distribution $\Pr(c)$.
2. For all features, define the class conditional probability distributions $\Pr(f_i \mid c)$.
3. To generate a sample training data record, first sample a class value from $\Pr(c)$. Then, for the sampled class value, sample a value for each feature from the conditional distribution $\Pr(f_i \mid c)$.

The underlying assumption in the sampling technique above is that conditioned on a class value, features are independent of each other.
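The sampling steps can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the *avenir* implementation; the two features, the distribution parameters, and the names `CLASS_PRIOR`, `AGE_GIVEN_CLASS`, `SMOKER_GIVEN_CLASS`, and `sample_record` are made up for the example. With 2 classes and 2 features, it defines 2 x 2 + 1 = 5 distributions.

```python
import random

# Hypothetical distributions, for illustration only
CLASS_PRIOR = {"0": 0.8, "1": 0.2}
# One Gaussian (mean, std dev) per class for a numeric feature
AGE_GIVEN_CLASS = {"0": (45.0, 10.0), "1": (60.0, 8.0)}
# One categorical distribution per class
SMOKER_GIVEN_CLASS = {
    "0": {"NS": 0.7, "SM": 0.3},
    "1": {"NS": 0.3, "SM": 0.7},
}

def sample_categorical(dist, rng):
    """Draw one key from a {value: weight} dictionary."""
    values = list(dist)
    weights = [dist[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]

def sample_record(rng):
    # First sample the class value from the class prior
    c = sample_categorical(CLASS_PRIOR, rng)
    # Then sample each feature independently, conditioned on the class
    mean, std = AGE_GIVEN_CLASS[c]
    age = round(rng.gauss(mean, std))
    smoker = sample_categorical(SMOKER_GIVEN_CLASS[c], rng)
    return age, smoker, c

rng = random.Random(7)
records = [sample_record(rng) for _ in range(1000)]
```

Because each feature is drawn separately once the class is fixed, the sketch bakes in the conditional independence assumption stated above.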

The term Ancestral Sampling comes from Bayesian Networks. In terms of a Bayesian or Probabilistic Network, we can think of the class variable as a parent node and each feature node as a child of the parent node. The edge from the parent class node to a child feature node represents the conditional distribution $\Pr(f_i \mid c)$. Parent nodes are sampled before their children, hence the name.

How do we do the sampling? Various Python sampling implementations are available in *avenir*. The probability distributions supported for the different data types are as follows.

| Data Type | Distribution | Comment |
|---|---|---|
| numeric | gaussian | defined by mean and std dev |
| numeric | gaussian mixture | defined by mean and std dev for each gaussian and a distribution for mixture coefficient |
| numeric | non parametric | defined as a histogram by providing bin width and values for each bin |
| categorical | non parametric | defined as a histogram |

All the sampling is done with the Rejection Sampling algorithm. The non parametric distribution values do not have to be normalized, as long as the relative values reflect the desired distribution.
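To show why unnormalized values are sufficient, here is a minimal sketch of rejection sampling from a numeric histogram. It is a generic textbook version, not the code in *avenir*; the function name `rejection_sample_histogram`, its parameters, and the example bin heights are assumptions for illustration.

```python
import random

def rejection_sample_histogram(x_min, bin_width, bin_values, rng):
    """Draw one value from a histogram with unnormalized bin heights.

    Proposal: uniform over [x_min, x_min + bin_width * len(bin_values)].
    A proposed x is accepted with probability height(x) / max height.
    """
    x_max = x_min + bin_width * len(bin_values)
    peak = max(bin_values)
    while True:
        x = rng.uniform(x_min, x_max)
        # Clamp the index in case the proposal lands exactly on x_max
        index = min(int((x - x_min) / bin_width), len(bin_values) - 1)
        if rng.uniform(0.0, peak) < bin_values[index]:
            return x

rng = random.Random(1)
# Hypothetical histogram: 4 bins of width 20 starting at 100
heights = [1, 4, 3, 0]
samples = [rejection_sample_histogram(100, 20, heights, rng)
           for _ in range(2000)]
```

Only the ratio of each bin height to the tallest bin enters the acceptance test, so scaling all heights by a constant leaves the sampled distribution unchanged. A bin with height 0 is never sampled.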

## Generating Disease Prediction Data

We will use a medical use case as an example. This hypothetical data set is about predicting the onset of heart disease within the next year. These are the feature variables.

- sex \*
- age
- patient weight
- systolic blood pressure
- diastolic blood pressure
- smoker \*
- diet \*
- physical activity per week
- number of years of education
- ethnicity \*

The features marked * are categorical variables. The class variable is categorical with 2 values, because it’s a binary classification problem. The code for generating the data set for this use case is available. Here is some sample generated data. Each record consists of a record ID, the 10 feature values in the order listed above, and the class value.

    Q2Z8T84CF75I,M,44,184,156,105,SM,BA,6,10,BL,1
    5ERIOUHTNOJ3,M,67,106,112,61,SM,GO,13,15,BL,0
    0F3S4ZKA6HSX,M,61,131,100,97,NS,AV,14,21,SA,0
    13W9J1KXB16J,M,52,138,113,97,NS,AV,10,14,WH,0
    R4QO7F44FKJV,M,45,184,160,103,SM,BA,5,7,WH,1
    Y8RI3NOJYZ6X,F,70,148,142,97,SS,GO,17,16,EA,0
    XEU88GUIB661,M,52,129,143,108,NS,AV,14,11,BL,0
    08B39ROP211J,F,40,113,140,102,SS,AV,13,16,EA,0
    YR529L5OW2JC,F,58,130,120,70,SS,AV,14,17,WH,0

Python code for Ancestral sampling implementation and the example code to use it to generate this data set are available in *avenir*.

## Injecting Noise into Data

To generate a realistic data set, you would want to inject noise into it. This can be done in 2 ways. The first technique is implicit. To have more noise, you can allow more overlap between the *k* conditional distributions for any feature, where *k* is the number of classes. You can repeat this for as many features as you want. This will result in class boundary overlap.

The second technique is explicit, with a specified noise level (*n*). The noise level can be specified as a small number, e.g., 0.1. For numerical features, noise is added as below.

$v_n = v_s \times (1.0 + e_n)$, where

- $v_n$ = feature value after adding noise
- $v_s$ = feature value as generated through sampling
- $e_n$ = noise generated by sampling from a Gaussian distribution $N(0, n)$

The noise ($e_n$) is sampled from a Gaussian distribution with mean 0 and standard deviation equal to the specified noise level ($n$).

For categorical features and class variables, with a probability equal to the specified noise level (*n*), the existing value is replaced with a value sampled uniformly from the list of values for that variable. It’s important to keep in mind that the actual noise will be less than specified, because the random selection will sometimes end up choosing the existing value.
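Both kinds of explicit noise can be sketched as follows. The function names are hypothetical and this is not the *avenir* code. The last function works out how much lower the actual categorical noise is: if a variable has *m* possible values, the uniform draw returns the existing value 1/*m* of the time, so the value really changes with probability *n* x (1 - 1/*m*).

```python
import random

def add_numeric_noise(value, noise_level, rng):
    """Multiplicative noise: value * (1 + e), with e drawn from a Gaussian
    with mean 0 and standard deviation noise_level."""
    return value * (1.0 + rng.gauss(0.0, noise_level))

def add_categorical_noise(value, choices, noise_level, rng):
    """With probability noise_level, replace value by a uniform draw from choices."""
    if rng.random() < noise_level:
        return rng.choice(choices)
    return value

def effective_noise(noise_level, num_values):
    """Probability that a categorical value actually changes."""
    return noise_level * (1.0 - 1.0 / num_values)

rng = random.Random(3)
diets = ["BA", "AV", "GO", "EX"]  # hypothetical list of 4 category values
trials = 20000
changed = sum(
    1 for _ in range(trials)
    if add_categorical_noise("AV", diets, 0.2, rng) != "AV"
)
observed_rate = changed / trials            # close to 0.15
expected_rate = effective_noise(0.2, len(diets))  # 0.2 * (1 - 1/4)
```

For a binary class variable and a specified noise level of 0.1, only about 5% of the labels are actually flipped, which is worth accounting for when you treat the noise level as known.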

The biggest impact is caused by adding noise to the class variable. Noise added to feature variables has much less impact, especially for discriminant based classification algorithms, e.g., *Random Forest* and *SVM*.

## When to Use

Here are some scenarios where synthetic data generated with Ancestral Sampling could be used.

1. Models trained with synthetic data should never be deployed in production. You will need real data to train the model to be deployed in production.
2. You have a small data problem, i.e., not enough data to build the model you want. As alluded to earlier, you can build distributions using the limited data and then sample as described in this post. With this approach, the model can be deployed in production.
3. If you want to find out the amount of real data needed for training a model, you can create data sets of different sizes and observe how the generalization error drops with increasing training data size.
4. You are just exploring different classification algorithms to judge their relative merits and you do not intend to deploy the model in production.

For small data problems there are other solutions as well, e.g., Bootstrapping, which is essentially sampling with replacement. The post I just referred to provides other solutions to alleviate the small data set problem.

Regarding the training data size issue, for a model of given complexity the generalization error drops with increasing data size until it reaches a plateau, beyond which there is no added benefit from a bigger data set. The point at which the error rate flattens out is the minimum size of the data needed.
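One simple way to locate the plateau is sketched below. It assumes you have already measured the generalization error at several training set sizes; the function name `minimum_training_size`, the `tolerance` threshold, and the error values are hypothetical.

```python
def minimum_training_size(sizes, errors, tolerance=0.005):
    """Return the smallest size after which no larger size improves the
    error by more than `tolerance`.

    sizes: training set sizes in increasing order.
    errors: generalization error measured at each size.
    """
    for i, size in enumerate(sizes):
        later_errors = errors[i + 1:]
        if all(errors[i] - e <= tolerance for e in later_errors):
            return size
    return None  # only reached when sizes is empty

# Made-up learning curve measured on synthetic data sets of growing size
sizes = [500, 1000, 2000, 4000, 8000]
errors = [0.30, 0.22, 0.17, 0.168, 0.167]
needed = minimum_training_size(sizes, errors)
```

With these made-up numbers the result is 2000, since doubling the data beyond that point improves the error by less than the tolerance. The tolerance is a judgment call and should reflect how much error reduction is worth the cost of collecting more data.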

## Wrapping Up

We have gone through the Ancestral Sampling technique for generating synthetic data for classification problems. You can use the tutorial to generate data for the use case in this post. It also has information on how to modify the example Python code to generate data for other problems and use cases.

## Support

For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for *Hadoop* or Spark deployment on cloud including installation, configuration and testing.