Using Mutual Information to Find Critical Factors in Hospital Readmission

Nobody likes hospital readmission soon after discharge, whether it’s the patient or the insurance company. Predictive analytic techniques have been used to predict the likelihood of hospital readmission, using the various medical, personal and demographic input or feature attributes. However, some problems including the one we are discussing has a very large input feature set.

Before we plow  ahead with building a predictive model, it’s important to pause and ask ourselves what features are really important, especially with a problem like this with a very large set of input features.

Machine learning algorithms generally work better if the dimensionality i.e. the number of feature attributes is lowered. One of the techniques for lowering the dimensionality is to select a subset of the original feature set, known as feature subset selection. Even if a predictive model is not built, having  insight on what features are important is valuable.

The focus of this post is feature subset selection using certain statistical techniques based on entropy and mutual information. The solution is part of my open source  Hadoop based machine learning project avenir.

Entropy and Mutual Information

The solution is based some information theory concepts, which we will cover. The entropy of a random variable indicates the level of randomness and is defined as below.

H(x) = – Σ (p(x) log(p(x))) where
H(x) = entropy
p(x) = probability distribution

Entropy is maximum when the distribution p(x) is uniform and minimum when x always has one value.

The conditional entropy of x given y is the uncertainty remaining about x, given that we know y. It’s defined as below

H(x|y) = – Σ(p(y)  Σ(p(x|y) log(p(x|y)))) where
H(x|y) = conditional entropy
p(y) = probability distribution of y
p(x|y) = conditional probability distribution of x

Mutual information of x and y is the difference between the two entropy and also can be thought of as the amount of uncertainty of x that is removed based on the knowledge of y. It’s defined as

I(x;y) = Σ Σ p(xy) log(p(xy) / (p(x) p(y))) where
p(xy) = joint probability distribution of x and y

Relevance Score

Feature attributes are assigned relevant scores which are calculated based on mutual information. The scores indicate how an important a feature attribute is in a predictive model.  Based on the relevance score, a sub set of features is selected and the rest ignored.

This paper by Brown lists 17 such scores that are based on mutual information. We will be focusing on some of the important ones. Most of the scores are functions relevance of a feature with respect to the output or class attribute and the redundancy of a feature with respect to other features. An important feature will simultaneously meet the following conditions

  1. Maximum relevance to class or output attribute
  2. Minimum redundancy with all other feature attributes
  • Mutual Information Maximization (MIM)

The relevance score (RS) for a feature x(i) according to this criteria is as follows. It’s based only on the relevance of the feature attribute with respect the output attribute. It treats each feature independently and does not take into account inter feature redundancy.

RS(x(i)) = I(x(i); y)   where
I(x(i);y) = Mutual information between feature attribute x(i) and output attribute y

  • Mutual Information Feature Selection (MIFS)

This score takes into inter feature redundancy and is defined as below. The parameter β controls the influence inter feature redundancy in the final score. It’s value is usually between 0 and 1. The sum is taken over the features already selected.

RS(x(i) = I(x(i); y) – β Σ(I(x(i);x(j)))   where
I(x(i);x(j)) = Mutual information between feature x(i) and feature x(j)

  • Joint Mutual Information (JMI)

This score is based on mutual information between feature attribute pair and the output attribute. Each feature is paired with features already selected. The sum is taken over features already selected

RS(x(i) = Σ I(x(i)x(j);y) where
I(x(i)x(j);y) = Mutual information between feature pair x(i),x(j) and output y

  • Double Input Symmetrical Relevance (DISR)

This score is similar to JMI except it’s normalized with feature pair, class entropy. The definition is as follows

RS(x(i) = Σ I(x(i)x(j);y) / H(x(i) x(j) y)  where
I(x(i)x(j);y) = Mutual information between feature pair x(i),x(j) and output y
H(x(i)x(j)y) = Entropy of feature pair and class

  • Min Redundancy Max Relevance (MRMR)

This score is  similar to MIFS and defined as follows. The sum is taken over currently selected features.

RS(x(i)) = I(x(i);y) – sum I(x(i); x(j))  / S where
I(x(i);y) = Mutual Information between feature x(i) and output y
I(x(i); x(j)) = Mutual Information between feature x(i) and x(j)
S = Size of the currently selected feature set

Map Reduce for Relevance Score

The computation consists various univariate and multivariate distributions. The distributions are used to compute various mutual information values. Finally, the relevance scores are computed using the mutual information values.

There is one reducer to do the aggregates and calculate all the quantities listed above. I have leaned on combiners heavily  to do partial aggregates and reduce the shuffle load. Here is the implementation.


The input consists of 10 medical, personal and socio-economic attributes as below. I am not a doctor. I am just playing one. So don’t blame me if you disagree with the feature set. I have made my best guess as far as selecting relevant attributes.

  1. Age
  2. Weight
  3. Height
  4. Employment status
  5. Family status
  6. Diet
  7. Exercise
  8. Followup
  9. Smoking
  10. Alcohol consumption

The data has been synthetically generated with a script. Certain hypothesis regarding relevancy and redundancy has been coded into the script. Here is some sample input

WKELAC02JSOU,66,239,63,employed,with partner,good,low,high,non smoker,low,N
64N3LGAEMH8H,77,165,52,retired,with partner,average,low,low,non smoker,average,Y
HQC0JEU10220,43,215,61,employed,with partner,poor,average,average,smoker,average,N
EZWE8EWD046I,52,144,56,employed,alone,average,low,low,non smoker,low,Y
SX3SUF143JLY,58,148,67,employed,alone,average,low,average,non smoker,high,N
OM48O83TE3Q1,87,197,52,employed,alone,average,high,low,non smoker,low,N
TK0ZSUGA2TPJ,75,164,66,employed,with partner,poor,average,low,non smoker,low,N


There are various relevance score algorithms implemented in avenir. I have used the Min Redundancy Max relevance (MRMR) and Joint Mutual Info (JMI) algorithm. Here is the output. Each line contains the feature attribute ordinal value followed by the score.

mutualInformationScoreAlgorithm: min.redundancy.max.relevance

If we compare the top 4 relevant features, there is partial agreement. Here are the top 4 relevant attributes according JMI and MRMR side by side ordered by relevance score. They have first 3 feature attributes in common appearing in the same order. Intuitively, the findings seem to be correct.

age age
family status family status
follow up follow up
employment status alcohol consumption

Summing Up

We have used some statistical techniques for feature sub set selection. As we saw the different scoring algorithms don’t produce identical results.

Only way to choose the right algorithm is to build different predictive models using different feature sub sets and then the choose the right relevance score algorithm based on the accuracy of the predictive model.

My solution  draws heavily on this paper, which serves as the reference material. That paper has references to other papers with detail of the specific algorithms listed there.

I will be adding implementation of some more relevance score algorithms. Here is a tutorial to run the example.

For commercial support for this solution or other solutions in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud including installation, configuration and testing,

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Big Data, Correlation, Data Mining, Data Science, Hadoop and Map Reduce, Healthcare Analytic, Predictive Analytic and tagged , , , . Bookmark the permalink.

5 Responses to Using Mutual Information to Find Critical Factors in Hospital Readmission

  1. Pingback: Nearest Flunking Neighbors | Mawazo

  2. Pingback: Customer Conversion Prediction with Markov Chain Classifier | Mawazo

  3. Pingback: Profiling Big Data | Mawazo

  4. Pingback: Gaining Insight by Mining Simple Rules from Customer Service Call Data | Mawazo

  5. Pingback: traiteur rabat

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s