In this post, I will venture into the medical domain and show how big data analytics can play a crucial role in the complex and daunting world of health care. There is a kind of cancer that affects the male population above a certain age. There are also other important contributing factors, like race and family history.

In this post, I will provide a Hadoop based machine learning solution to predict that threshold age. My focus will be on only one attribute of the patient data, i.e., the age. The goal is to mine the data to extract a simple rule like *if age > x then patient should take test y*. The doctor may use this rule to order a specific test for a patient.

As we work our way through the post, we will find that the solution is one of the critical steps in building a full blown decision tree. A decision tree typically includes predicates involving multiple attributes. You can think of our solution as a single level decision tree, i.e. a tree with a root node and one or more leaf nodes. This implementation is part of my open source project *avenir*.

## Quick Tour of Decision Tree

The fundamental idea in a decision tree is to choose an attribute and split it in such a way that the sub population of the data within each split is as homogeneous as possible. Referring back to our problem, we want to find an age x such that below that age a patient is significantly less likely to get the disease than someone above that age.

In building a decision tree, there are many choices. There are many attributes to choose from, and each attribute can be split in many ways. Essentially there is a combinatorial explosion. A certain statistical quantity can be calculated for each candidate attribute, split pair. The best attribute, split pair is chosen according to that statistical criterion.

The next step in building a decision tree is to partition the data set based on the attribute, split selected and then to recursively apply the attribute splitting logic within the sub populations. Once the tree is built, it can be used to make a prediction by navigating down from the root, applying the condition at each node, until we get to a leaf node. Whatever is the predominant class attribute value in the sub population of the leaf node is the predicted class attribute value.

## Splitting Attribute

Our focus here is not on a full blown decision tree, so let's put that aside for now. We simply want to find the split points for the *age* attribute. Two popular algorithms for optimal splitting of an attribute are based on information theory. We want to split an attribute in such a way that each sub population is as homogeneous as possible.

Once an attribute is chosen, the next decision is how many splits to use. We can have a binary split, e.g. age = 50. We could have a three way split, e.g. age = 40 and age = 60. In our solution, the maximum number of split segments is specified in the metadata through the parameter *maxSplit*. Choosing too high a value for this is not a good idea, as it will result in an overfitted model. We are using a value of 3 in our example.

For each case, there could be many combinations of split point values. As we can see, there is a combinatorial explosion unfolding here. We will leverage Hadoop to explore all the candidate combinations and calculate the statistical splitting criterion for each candidate split, which will enable us to select the best split.

## Entropy

Entropy is a measure of how homogeneous the data is with respect to the class attribute value. For a given split segment, *entropy* is calculated as follows.

*en(s(i)) = sum(-p(c(j)) * log(p(c(j))))*

*where*

*en(s(i)) is the entropy for the i th split segment.*

*p(c(j)) is the fraction of the records having the class attribute value c(j) within the split segment.*

*the sum is over all the possible class attribute values.*

The lower the value of *entropy*, the more homogeneous the corresponding subset of the data is. In our case, it means that the patients in that age group mostly either have the disease or do not have it. Splits should be chosen in such a way that the entropy is as low as possible within each split segment.

The next step is to calculate a weighted average of the entropy across all the split segments that a split contains. It's as follows.

*info = sum(en(s(i)) * count(s(i))) / totalCount*

*where*

*count(s(i)) is the number of records in the i th split segment.*

*totalCount is the total number of records.*

We calculate this quantity for all the splits and choose the one that has the lowest value of *info*.
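To make the calculation concrete, here is a small Python sketch of the entropy and weighted *info* computation (the actual implementation in *avenir* is Hadoop / Java; this is only an illustration, and the class counts are made up):

```python
import math

def entropy(class_counts):
    """Entropy of one split segment from its per-class record counts."""
    total = sum(class_counts)
    ent = 0.0
    for count in class_counts:
        if count > 0:
            p = count / total
            ent -= p * math.log2(p)
    return ent

def weighted_info(segments):
    """Weighted average of segment entropies, weighted by segment size."""
    total = sum(sum(seg) for seg in segments)
    return sum(entropy(seg) * sum(seg) / total for seg in segments)

# Two candidate binary splits of the same 20 records; counts are [No, Yes]
pure_split = [[10, 0], [2, 8]]    # segments fairly homogeneous
mixed_split = [[6, 4], [6, 4]]    # segments as mixed as the whole set
print(weighted_info(pure_split))  # lower info, so the better split
print(weighted_info(mixed_split))
```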

## Gini Index

This is another popular measure of the purity of data. The *Gini index* is defined as below for any split segment.

*gi(s(i)) = 1 – sum(p(c(j)) * p(c(j)))*

As you can see, if all the records belong to one class then p(c(j)) = 1 for a certain class j and 0 for all others. This will result in the *gini index* being 0. So when the data is completely skewed, i.e. belongs to one class only, the *gini index* has the lowest possible value.

The next step, finding the weighted aggregate of the *gini index*, is the same as for *entropy* and is as below.

*info = sum (gi(s(i)) * count(s(i))) / totalCount*
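A matching Python sketch of the *gini index* calculation, again with made-up class counts purely for illustration:

```python
def gini(class_counts):
    """Gini index of one split segment: 1 minus sum of squared class fractions."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def weighted_gini(segments):
    """Weighted average of segment gini indexes, weighted by segment size."""
    total = sum(sum(seg) for seg in segments)
    return sum(gini(seg) * sum(seg) / total for seg in segments)

print(gini([10, 0]))  # completely skewed segment -> 0.0
print(gini([5, 5]))   # evenly mixed segment -> 0.5
```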

## Class Imbalance Problem

Unfortunately, when I tried these two statistical measures, they did not produce very good results. For our example, we have the so called class imbalance problem. In our data, the class attribute value is mostly negative, i.e., most patients don't have the disease. This is good news for the patients, but not so good news for these information theory based algorithms.

This is a nagging problem for most predictive analytics algorithms. In most real life problems, class imbalance is not that uncommon. The class attribute value we are interested in predicting is under represented in the training sample. One common trick to address this problem is to under sample the over represented data and / or over sample the under represented data. Instead of taking this route, I went out looking for algorithms that address this problem directly.

On further research, I found two algorithms that address this specific problem. They are both based on the basic premise of normalizing the count of records belonging to a certain class attribute value in a split segment by dividing it by the number of records belonging to that class attribute value in the whole data set. It essentially levels the playing field for all the class attribute values and neutralizes over representation and under representation.

## Hellinger Distance

This algorithm normalizes the population count for a certain class attribute value within a split segment with the total count of the data belonging to that class attribute value. The *Hellinger distance* is defined as below.

*he = sqrt(sum((sqrt(r(s(i), c1)) – sqrt(r(s(i), c2))) * (sqrt(r(s(i), c1)) – sqrt(r(s(i), c2)))))*

*where*

*r(s(i), c1) is the ratio of the count of samples with class value c1 in the i th split segment to the total number of samples with class value c1.*

*the sum is taken over all the split segments.*

Since the *Hellinger* distance is based on the distance between two probability distributions, it only works for binary valued class attributes.
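A Python sketch of the definition above, with hypothetical counts mimicking an imbalanced data set (90 negative, 10 positive records split into two segments):

```python
import math

def hellinger(seg_counts_c1, seg_counts_c2):
    """Hellinger distance for a candidate split of a binary class attribute.

    seg_counts_c1[i] / seg_counts_c2[i] are the counts of class-1 / class-2
    records in the i-th split segment."""
    total_c1 = sum(seg_counts_c1)
    total_c2 = sum(seg_counts_c2)
    s = 0.0
    for n1, n2 in zip(seg_counts_c1, seg_counts_c2):
        # d is the difference of the square roots of the normalized ratios
        d = math.sqrt(n1 / total_c1) - math.sqrt(n2 / total_c2)
        s += d * d
    return math.sqrt(s)

# A split that concentrates the rare positives in one segment scores
# higher (better) than one that spreads both classes evenly:
print(hellinger([80, 10], [1, 9]))
print(hellinger([45, 45], [5, 5]))  # identical distributions -> 0.0
```

Note that unlike *entropy* or the *gini index*, a higher *Hellinger* distance indicates a better split.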

## Class Confidence Proportion

Class confidence proportion also normalizes the sample count for a specific class attribute value. It's used in the same way as *entropy*, except that p(c(j)) for a split segment is defined as below.

*p(c(j)) = cc(j) / sum(cc(j))*

*where*

*cc(j) = count(s(i), c(j)) / count(c(j))*

*count(s(i), c(j)) is the count of samples with class value c(j) in the split segment s(i).*

*count(c(j)) is the total count of samples with class value c(j).*

It's similar to entropy, but instead of using the ratio of sample counts directly, we first normalize the count within a split segment with the total count for a given class attribute value across all the splits.
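A Python sketch of the class confidence proportion version of entropy, using the same hypothetical 90 / 10 imbalanced counts as before:

```python
import math

def ccp_entropy(segment_counts, total_counts):
    """Entropy of one split segment using class confidence proportions.

    segment_counts[j]: records of class j in this segment;
    total_counts[j]: records of class j in the whole data set."""
    # normalize each class count by the global count for that class
    cc = [s / t for s, t in zip(segment_counts, total_counts)]
    cc_sum = sum(cc)
    ent = 0.0
    for c in cc:
        if c > 0:
            p = c / cc_sum
            ent -= p * math.log2(p)
    return ent

# A segment with 45 of the 90 negatives and 5 of the 10 positives holds
# the same *proportion* of each class, so CCP treats it as maximally mixed
print(ccp_entropy([45, 5], [90, 10]))
```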

## Map Reduce for Attribute Splitting

I think we have had enough theory. Let's dive into a map reduce solution for splitting. Since we are considering many possible splits, each record in the input participates in each of these split calculations. On the reducer side, we calculate the statistical quantities for each attribute, split pair. Based on the reducer results we choose the optimum split. If there are n input records and p possible splits across all attributes, the mapper will output n * p records. The map reduce implementation is in the class *ClassPartitionGenerator*. All the statistical calculations related to the splitting logic are in the class *AttributeSplitStat*.

The enumeration of the splits in the mapper goes something like this:

    for each feature attribute
        for each split
            find the split segment the record belongs to, depending on the attribute value

The mapper key consists of the attribute, the split, the split segment and the class attribute value.

The mapper output value is just a count of 1. In our implementation, for efficiency's sake, we have combiners to reduce the load on the shuffle. On the reducer side, the statistics calculation is based on aggregation of the counts for each attribute, split pair.

The implication is that we need to ensure the data for the same attribute, split pair ends up with the same reducer. We are using a custom partitioner to facilitate that. Essentially, the custom partitioner peels off the first two elements of the mapper key, i.e. the attribute and the split, and hashes them.
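The idea can be sketched in a few lines of Python (this is an illustration of the key layout and partitioning scheme, not *avenir*'s actual Java code; the record and split values are taken from the example below):

```python
def map_record(record, attr_ordinal, splits):
    """Emit one (key, 1) pair per candidate split for one input record."""
    fields = record.split(",")
    value = int(fields[attr_ordinal])
    class_val = fields[-1]
    pairs = []
    for split in splits:                       # e.g. split = [50, 60]
        # index of the split segment this record's value falls into
        segment = sum(1 for pt in split if value > pt)
        key = (attr_ordinal, tuple(split), segment, class_val)
        pairs.append((key, 1))
    return pairs

def partition(key, num_reducers):
    """Hash only attribute and split, ignoring segment and class value,
    so one reducer sees all data for an attribute, split pair."""
    attr, split, _, _ = key
    return hash((attr, split)) % num_reducers

record = "H86W5YTH242I,42,AFA,177,REG,NFH,DP,No"
for key, count in map_record(record, 1, [[50], [50, 60]]):
    print(key, count)
```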

## Results

Here is some sample input training data. The last field is the class attribute value. It's a boolean value of Yes or No, depending on whether the patient has the disease or not.

    H86W5YTH242I,42,AFA,177,REG,NFH,DP,No
    1ZZ7X9VG2ZD0,47,EUA,132,REG,NFH,DP,No
    DYCT8R50Z3NS,40,EUA,163,REG,NFH,DP,No
    P19PLPNT3IVO,59,EUA,137,HF,NFH,S,No
    W3ZR57D5958B,65,EUA,206,REG,NFH,DP,No
    Q31Q94HV53ON,33,EUA,237,HF,NFH,DP,No
    57JRFQXC50RO,55,EUA,134,LF,NFH,DP,Yes

The fields of the input record are: patient ID, age, race, weight, diet, family history, domestic life and whether the patient has the disease. Although there are many attributes in the patient data, in our example we are only using the age attribute.

Here is some sample output. For brevity, I have not included all the attribute, split pairs in the output. The output needs some explaining. The first field is the attribute, identified by its ordinal value in the record. The second field is the split, with the different split points separated by semicolons.

For example, the split 50;60 tells us that there are two split points, one at 50 and one at 60, dividing the attribute domain into 3 split segments. The third field is the statistical quantity for this attribute, split pair. I have used the *Hellinger distance* as my statistical criterion for splits. It's defined through the configuration parameter *split.algorithm*. The list of attributes to be analyzed is defined through the configuration parameter *split.attributes*.

    1,25;50,0.07727437540664325
    1,45;70,0.0741231625202242
    1,55;60,0.08572333745708177
    1,40;70,0.06856087477900381
    1,40;75,0.06794201351915066
    1,65,0.07605202338611025
    1,30;50,0.07723661976268037
    1,70;75,0.05149182786034924
    1,60,0.08550661761031225
    1,30;55,0.07554129558323494
    1,55;65,0.08264991140055443
    1,45,0.06957381127582622

For this particular example training data set, the split at 60 seems to be the best choice, as the corresponding *Hellinger* distance is near the maximum value. Based on this finding, we conclude that the threshold age we were looking for is 60.

The splits are generated by shifting the split points by an amount specified by *bucketWidth* between the limits *min* and *max*, as in the JSON metadata specification for the *age* attribute shown below. The *maxSplit* parameter specifies the maximum number of split segments in a split.

    {
        "name" : "age",
        "ordinal" : 1,
        "dataType" : "int",
        "feature" : true,
        "min" : 20,
        "max" : 80,
        "maxSplit" : 3,
        "bucketWidth" : 5
    }
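Here is a Python sketch of how the candidate splits could be enumerated from this metadata (an illustration of the scheme, not the actual *avenir* code): split points move in *bucketWidth* steps between *min* and *max*, and a split with k points yields k + 1 segments, so *maxSplit* = 3 allows at most 2 split points.

```python
from itertools import combinations

def candidate_splits(min_val, max_val, bucket_width, max_split):
    """Enumerate all candidate splits with up to max_split - 1 split points."""
    # possible split point locations, strictly inside (min, max)
    points = list(range(min_val + bucket_width, max_val, bucket_width))
    splits = []
    for num_points in range(1, max_split):
        splits.extend(combinations(points, num_points))
    return splits

splits = candidate_splits(20, 80, 5, 3)
print(len(splits))  # all single and double split point candidates
print(splits[0])
```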

## Categorical Attributes

So far, all my discussion has centered around numerical attributes. How do we implement the splitting logic for categorical attributes? If a categorical attribute has n possible values, we have to split them into all possible 2 subsets, all possible 3 subsets etc. Each of these groupings is comparable to a split for numerical attributes.
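As an illustration, here is a Python sketch that enumerates the 2 subset (binary) groupings of a categorical attribute; 3 way and higher groupings extend the same idea. The value "OTH" is hypothetical, added alongside the "AFA" and "EUA" race codes from the sample data:

```python
from itertools import combinations

def binary_partitions(values):
    """All ways of grouping the values into two non-empty subsets."""
    parts = []
    for size in range(1, len(values) // 2 + 1):
        for subset in combinations(values, size):
            rest = tuple(v for v in values if v not in subset)
            # when both halves have equal size, each partition is generated
            # twice; keep only one orientation
            if size == len(values) - size and subset > rest:
                continue
            parts.append((subset, rest))
    return parts

for left, right in binary_partitions(("AFA", "EUA", "OTH")):
    print(left, "|", right)
```

For n values this yields 2^(n-1) - 1 partitions, which is the combinatorial explosion mentioned earlier in its categorical form.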

## Wrapping Up

The attribute splitting solution described here is a key component of a full decision tree implementation, which will be available in *avenir* in the near future.

Here is a tutorial document for running the example in this post. The meta data file is available in the same resource directory.

For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for *Hadoop* or *Spark* deployment on the cloud, including installation, configuration and testing.
