Nobody likes hospital readmission soon after discharge, whether it’s the patient or the insurance company. Predictive analytic techniques have been used to predict the likelihood of hospital readmission, using the various medical, personal and demographic input or feature attributes. However, some problems including the one we are discussing has a very large input feature set.

Before we plow ahead with building a predictive model, it’s important to pause and ask ourselves what features are really important, especially with a problem like this with a very large set of input features.

Machine learning algorithms generally work better if the dimensionality i.e. the number of feature attributes is lowered. One of the techniques for lowering the dimensionality is to select a subset of the original feature set, known as feature subset selection. Even if a predictive model is not built, having insight on what features are important is valuable.

The focus of this post is feature subset selection using certain statistical techniques based on entropy and mutual information. The solution is part of my open source Hadoop based machine learning project *avenir*.

## Entropy and Mutual Information

The solution is based some information theory concepts, which we will cover. The entropy of a random variable indicates the level of randomness and is defined as below.

*H(x) = – Σ (p(x) log(p(x))) where*

* H(x) = entropy*

* p(x) = probability distribution*

Entropy is maximum when the distribution p(x) is uniform and minimum when x always has one value.

The conditional entropy of x given y is the uncertainty remaining about x, given that we know y. It’s defined as below

*H(x|y) = – Σ(p(y) Σ(p(x|y) log(p(x|y)))) where*

* H(x|y) = conditional entropy*

* p(y) = probability distribution of y*

* p(x|y) = conditional probability distribution of x*

Mutual information of x and y is the difference between the two entropy and also can be thought of as the amount of uncertainty of x that is removed based on the knowledge of y. It’s defined as

*I(x;y) = Σ Σ p(xy) log(p(xy) / (p(x) p(y))) where*

* p(xy) = joint probability distribution of x and y*

## Relevance Score

Feature attributes are assigned relevant scores which are calculated based on mutual information. The scores indicate how an important a feature attribute is in a predictive model. Based on the relevance score, a sub set of features is selected and the rest ignored.

This paper by Brown lists 17 such scores that are based on mutual information. We will be focusing on some of the important ones. Most of the scores are functions relevance of a feature with respect to the output or class attribute and the redundancy of a feature with respect to other features. An important feature will simultaneously meet the following conditions

*Maximum relevance to class or output attribute**Minimum redundancy with all other feature attributes*

**Mutual Information Maximization (MIM)**

The relevance score (RS) for a feature x(i) according to this criteria is as follows. It’s based only on the relevance of the feature attribute with respect the output attribute. It treats each feature independently and does not take into account inter feature redundancy.

*RS(x(i)) = I(x(i); y) where*

* I(x(i);y) = Mutual information between feature attribute x(i) and output attribute y*

**Mutual Information Feature Selection (MIFS)**

This score takes into inter feature redundancy and is defined as below. The parameter β controls the influence inter feature redundancy in the final score. It’s value is usually between 0 and 1. The sum is taken over the features already selected.

*RS(x(i) = I(x(i); y) – β Σ(I(x(i);x(j))) where*

* I(x(i);x(j)) = Mutual information between feature x(i) and feature x(j)*

**Joint Mutual Information (JMI)**

This score is based on mutual information between feature attribute pair and the output attribute. Each feature is paired with features already selected. The sum is taken over features already selected

*RS(x(i) = Σ I(x(i)x(j);y) where*

* I(x(i)x(j);y) = Mutual information between feature pair x(i),x(j) and output y*

**Double Input Symmetrical Relevance (DISR)**

This score is similar to JMI except it’s normalized with feature pair, class entropy. The definition is as follows

*RS(x(i) = Σ I(x(i)x(j);y) / H(x(i) x(j) y) where*

* I(x(i)x(j);y) = Mutual information between feature pair x(i),x(j) and output y
H(x(i)x(j)y) = Entropy of feature pair and class *

**Min Redundancy Max Relevance (MRMR)**

This score is similar to MIFS and defined as follows. The sum is taken over currently selected features.

*RS(x(i)) = I(x(i);y) – sum I(x(i); x(j)) / S where*

* I(x(i);y) = Mutual Information between feature x(i) and output y*

* I(x(i); x(j)) = Mutual Information between feature x(i) and x(j)*

* S = Size of the currently selected feature set*

## Map Reduce for Relevance Score

The computation consists various univariate and multivariate distributions. The distributions are used to compute various mutual information values. Finally, the relevance scores are computed using the mutual information values.

There is one reducer to do the aggregates and calculate all the quantities listed above. I have leaned on combiners heavily to do partial aggregates and reduce the shuffle load. Here is the implementation.

## Input

The input consists of 10 medical, personal and socio-economic attributes as below. I am not a doctor. I am just playing one. So don’t blame me if you disagree with the feature set. I have made my best guess as far as selecting relevant attributes.

*Age**Weight**Height**Employment status**Family status**Diet**Exercise**Followup**Smoking**Alcohol consumption*

The data has been synthetically generated with a script. Certain hypothesis regarding relevancy and redundancy has been coded into the script. Here is some sample input

MIG2ST4OBL7O,84,156,54,retired,alone,average,average,average,smoker,average,Y WKELAC02JSOU,66,239,63,employed,with partner,good,low,high,non smoker,low,N 64N3LGAEMH8H,77,165,52,retired,with partner,average,low,low,non smoker,average,Y HQC0JEU10220,43,215,61,employed,with partner,poor,average,average,smoker,average,N EZWE8EWD046I,52,144,56,employed,alone,average,low,low,non smoker,low,Y SX3SUF143JLY,58,148,67,employed,alone,average,low,average,non smoker,high,N OM48O83TE3Q1,87,197,52,employed,alone,average,high,low,non smoker,low,N TK0ZSUGA2TPJ,75,164,66,employed,with partner,poor,average,low,non smoker,low,N

## Output

There are various relevance score algorithms implemented in avenir. I have used the Min Redundancy Max relevance (MRMR) and Joint Mutual Info (JMI) algorithm. Here is the output. Each line contains the feature attribute ordinal value followed by the score.

mutualInformationScoreAlgorithm: joint.mutual.info 1,0.0549192827980888 5,0.05209545668766983 8,0.047606114277359196 4,0.041213766013011924 2,0.02983300395890561 9,0.02843161878952145 10,0.027937614098402116 7,0.02392661618820073 3,0.022036868529608306 6,0.021552273477405435 mutualInformationScoreAlgorithm: min.redundancy.max.relevance 1,0.004257812730107618 5,0.003883857398456544 8,0.003242146705535658 9,0.0010947244851423456 10,9.234817989504589E-4 7,3.7413897659935465E-4 2,7.403796958212693E-5 6,-4.7722734258504626E-5 3,-2.769439670059089E-4 4,-0.02233201002304927

If we compare the top 4 relevant features, there is partial agreement. Here are the top 4 relevant attributes according JMI and MRMR side by side ordered by relevance score. They have first 3 feature attributes in common appearing in the same order. Intuitively, the findings seem to be correct.

JMI | MRMR |

age | age |

family status | family status |

follow up | follow up |

employment status | alcohol consumption |

## Summing Up

We have used some statistical techniques for feature sub set selection. As we saw the different scoring algorithms don’t produce identical results.

Only way to choose the right algorithm is to build different predictive models using different feature sub sets and then the choose the right relevance score algorithm based on the accuracy of the predictive model.

My solution draws heavily on this paper, which serves as the reference material. That paper has references to other papers with detail of the specific algorithms listed there.

I will be adding implementation of some more relevance score algorithms. Here is a tutorial to run the example.

Pingback: Nearest Flunking Neighbors | Mawazo

Pingback: Customer Conversion Prediction with Markov Chain Classifier | Mawazo