In this post, I will focus on a time honored machine learning technique called Fisher Discriminant Analysis and use it for customer segmentation for an online music store. The store offers music of different genres for download. When there is a new release in a certain genre, the store wants to do targeted email marketing for customers in the age group most likely to be interested in that genre.

The store has divided the different genres offered into two groups. One group tends to be preferred by younger customers, the other by older customers. Based on past purchase and download history, the store wants to build a predictive model for the age that separates the younger from the older customers. This is where Fisher Discriminant Analysis comes into the picture. The Hadoop based implementation of the solution is available as part of my project *avenir* in github.

## Fisher Discriminant Analysis

Discriminant analysis belongs to a set of machine learning classification techniques where the focus is on the boundary between populations of different classes. It’s all about predicting the hyperplane separating the populations of different classes, rather than predicting the class membership of data points directly.

By Bayes’ theorem, the probability of a data point or feature x belonging to class c is as follows. The data point could be a scalar or a vector; for our example, it’s a scalar, i.e., customer age.

*p(c|x) = p(c) p(x|c) / p(x)*

*where*

* p(c|x) = class posterior probability
* p(x|c) = class likelihood
* p(c) = class prior probability
* p(x) = evidence, i.e., the marginal probability of data point x

Considering binary classification with two classes c1 and c2, the discriminating hyperplane in the feature space will correspond to the condition p(c1|x) = p(c2|x). Taking logarithms, it can be expressed as

*ln(p(c1|x) / p(c2|x)) = 0*

The following assumptions are made for Fisher Discriminant Analysis (FDA).

* Class likelihood p(x|c), i.e., the class conditional density of the feature, is Gaussian
* The variances of the two class conditional feature populations are the same
* The two classes are linearly separable

Using these assumptions and some mathematical juggling of the equation, we arrive at the following solution for the feature boundary value (i.e., customer age in our example).

*x = (m1 + m2) /2 – ln(n1 / n2) * var / (m1 – m2)*

*where*

* m1 = the mean of the data points belonging to class c1, and m2 for class c2
* n1 = the number of data points belonging to class c1, and n2 for class c2
* var = variance of each sub population of data points

For our example problem, here is the translation.

* m1 = average age of customers who have downloaded music category 1
* m2 = average age of customers who have downloaded music category 2
* n1 = number of customers who have downloaded music category 1
* n2 = number of customers who have downloaded music category 2
* var = age variance of customers who have downloaded music categories 1 and 2

If the likelihoods of customers preferring the two music categories are the same, the second term disappears. Then the discriminating age lies exactly halfway between the average ages of the two customer sub populations, which makes a lot of sense intuitively.
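As a quick sanity check, the boundary formula can be evaluated with plain Python. This is only a local sketch, not the project's Hadoop implementation; the input figures are the class statistics reported in the results section of this post, rounded for readability.

```python
import math

def fisher_boundary(m1, m2, n1, n2, var):
    """Class boundary for univariate Fisher Discriminant Analysis
    with equal (or pooled) variance, per the formula above."""
    return (m1 + m2) / 2 - math.log(n1 / n2) * var / (m1 - m2)

# Figures taken from the sample output shown later in the post
m_n, m_y = 54.797, 32.220    # mean ages of the older (N) and younger (Y) groups
n_n, n_y = 9684, 8224        # customer counts per group
pooled_var = 100.006         # pooled age variance

print(round(fisher_boundary(m_n, m_y, n_n, n_y, pooled_var), 1))  # ~42.8
```

The result agrees with the discriminating age in the MapReduce output below.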

## Fisher Discriminant Map Reduce

Enough of theory. Let’s get down to some computation details. Our input consists of the following fields

* *CustomerID*. Irrelevant for our processing
* *Age*
* *Date*. Irrelevant for our processing
* *Download boolean flag*. Value is Y if the customer has downloaded music category 1 and N otherwise

The customer population is divided into two segments, based on the value of the last field i.e., download flag. For each sub population we need to compute the following

* Customer count
* Average age
* Variance of age

The generic Map Reduce class *NumericalAttrStatsManager* computes these statistics, among other things. The mapper output is keyed by the tuple (attribute ordinal, conditioning attribute value). A combiner is used to compute various partial sums and to optimize the map reduce shuffle.
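To make the computation concrete, here is a small local sketch of the same partial-sum logic. The actual *NumericalAttrStatsManager* runs as a distributed MapReduce job; this single-process version only illustrates what the mapper/combiner (sum, sum of squares, count per class) and reducer (count, mean, variance per class) compute.

```python
from collections import defaultdict

def class_stats(records):
    # Partial sums keyed by the download flag, mirroring the combiner output:
    # [sum of ages, sum of squared ages, count]
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for cust_id, age, date, flag in records:
        s = sums[flag]
        s[0] += age
        s[1] += age * age
        s[2] += 1
    # Reduce the partial sums to (count, mean, variance) per class
    stats = {}
    for flag, (sx, sxx, n) in sums.items():
        mean = sx / n
        var = sxx / n - mean * mean
        stats[flag] = (n, mean, var)
    return stats

# A few records in the input format described above
records = [
    ("41BAJX7ZRI", 30, "2012-03-01", "Y"),
    ("6SCTZKDV58", 57, "2012-03-01", "N"),
    ("TMH20834WC", 69, "2012-03-01", "N"),
    ("0213UJ6VD0", 23, "2012-03-01", "Y"),
]
print(class_stats(records))
```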

The Map Reduce class *FisherDiscriminant* is a simple extension of the previous class. It uses the statistics generated by the first map reduce to calculate the discriminating customer age separating the young from the old.

The assumption of the variance being equal in the two sub populations will not hold in many cases. We use the pooled variance as below; it’s essentially a weighted average of the two variance values.

*var = (var1 * n1 + var2 * n2) / (n1 + n2)*
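Plugging in the per-class variances and counts from the sample output later in the post, the formula can be checked directly (a one-off sketch, not the MapReduce code):

```python
def pooled_variance(var1, n1, var2, n2):
    # Weighted average of the two class variances, per the formula above
    return (var1 * n1 + var2 * n2) / (n1 + n2)

# Age variances and counts of the N and Y groups from the sample output
print(round(pooled_variance(105.90325166419507, 9684,
                            93.06252446987742, 8224), 2))  # ~100.01
```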

## Here Comes Some Results

Some sample input is shown below. We only care about the 2nd and the 4th fields, as far as our analysis goes.

```
41BAJX7ZRI,30,2012-03-01,Y
6SCTZKDV58,57,2012-03-01,N
TMH20834WC,69,2012-03-01,N
0213UJ6VD0,23,2012-03-01,Y
2VP1PV2SO3,71,2012-03-01,N
CY3TLBC2QE,69,2012-03-01,N
2XNVBIY0CE,53,2012-03-01,Y
9XSQVPLB5V,25,2012-03-01,Y
3T4IPME99C,47,2012-03-01,Y
```

Here is the output

```
1,0,795636.0,3.940716E7,17908,44.42908197453652,226.5905145246627,15.052923786582548,15.0,74.0
1,N,530657.0,3.0104135E7,9684,54.79729450640232,105.90325166419507,10.290930553851535,31.0,74.0
1,Y,264979.0,9303025.0,8224,32.22020914396887,93.06252446987742,9.646891959065231,15.0,55.0
1,0.1634183303630028,100.00632624281533,42.78488217602655
```

The first column of each line is the ordinal of the attribute. The first line contains the mean, variance and other statistical parameters for the whole population. The second and the third lines contain the conditional statistics, i.e., mean, variance etc. of the two sub populations. The second column contains the class attribute value, which is Y or N for our example.

The last line contains the class boundary value, i.e., the discriminating age for our example. The last field of this line contains this value, which is about 42.8. The other fields contain some intermediate results; for example, the 3rd field is the pooled variance.
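A downstream consumer of this job would typically just parse that last line. A minimal sketch, assuming the field layout shown in the sample output (ordinal, log prior ratio, pooled variance, boundary age):

```python
line = "1,0.1634183303630028,100.00632624281533,42.78488217602655"
fields = line.split(",")

log_prior_ratio = float(fields[1])   # ln(n1 / n2), the prior correction term
pooled_var = float(fields[2])        # pooled age variance
boundary_age = float(fields[3])      # discriminating age

print(round(boundary_age, 1))  # 42.8
```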

## Multivariate Discriminant Analysis

So far we have done simple univariate discriminant analysis. It’s been a somewhat naive example, since our input of customer age is just a scalar. If we also included the number of past downloads as another input parameter, our input feature would have been a vector instead of a scalar.

The same analysis technique still holds. However, for a feature vector we would have to use a mean vector and a covariance matrix instead of scalar statistics. The class boundary would have been a line in a 2 dimensional space, instead of a point. The customer population on one side of the line would correspond to music category 1, and the other side to category 2.
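For two classes sharing a covariance matrix, the boundary line is w . x = b with w = cov⁻¹(m1 − m2). Here is a minimal sketch for the 2 dimensional case; the mean vectors of (age, past downloads), the covariance matrix and the counts are made-up figures for illustration, not output of the project.

```python
import math

def inv2x2(a, b, c, d):
    # Inverse of the 2x2 matrix [[a, b], [c, d]]
    det = a * d - b * c
    return (d / det, -b / det, -c / det, a / det)

def lda_boundary_2d(m1, m2, cov, n1, n2):
    """Linear boundary w . x = b for two classes sharing covariance cov.
    Classify a point as class 1 when w . x > b."""
    ia, ib, ic, id_ = inv2x2(cov[0][0], cov[0][1], cov[1][0], cov[1][1])
    d0, d1 = m1[0] - m2[0], m1[1] - m2[1]
    w = (ia * d0 + ib * d1, ic * d0 + id_ * d1)      # w = cov^-1 (m1 - m2)
    mid = ((m1[0] + m2[0]) / 2, (m1[1] + m2[1]) / 2)
    # Midpoint projection, shifted by the class prior ratio as in the 1-D case
    b = w[0] * mid[0] + w[1] * mid[1] - math.log(n1 / n2)
    return w, b

# Hypothetical (mean age, mean past downloads) per music category
m_cat1 = (32.2, 25.0)                  # younger group
m_cat2 = (54.8, 10.0)                  # older group
cov = [[100.0, 5.0], [5.0, 40.0]]      # shared covariance matrix
w, b = lda_boundary_2d(m_cat1, m_cat2, cov, 8224, 9684)

x = (30.0, 20.0)                       # a 30-year-old with 20 past downloads
print("category 1" if w[0] * x[0] + w[1] * x[1] > b else "category 2")
```

With these figures the 30-year-old heavy downloader lands on the category 1 side of the line, as one would expect.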

## Quadratic Discriminant Analysis

If we let go of the assumption of an equal variance or covariance matrix, the solution becomes more complex and we end up with a non linear separating surface between the classes. But in many cases the additional complexity may not be worth the effort, as the result may not be significantly different from the linear case.

## Wrapping Up

Fisher Discriminant Analysis is a simple but powerful predictive modeling technique. Its accuracy will depend upon the extent to which the assumptions listed earlier hold for a given data set.