## Customer Segmentation with Fisher Discriminant Analysis

In this post, I will focus on a time honored machine learning technique called Fisher Discriminant Analysis and will use it for  customer segmentation for on an line music store customers.  The store offers music of different genres for download.  When there is a new release in certain genre, the store wants to do targeted email marketing for customers in the age group that are most likely to be to be interested in that genre.

The store has divided the different genres offered into two groups. One group tends to be preferred by younger customers, the other by the older customers. Based on past past purchase and  download history, the store wants to build a predictive model for predicting the age that separates the younger from the older customers. This is where Fisher Discriminant Analysis comes into the picture. The Hadoop based implementation of the solution is available as part my project avenir in github.

## Fisher Discriminant Analysis

Discriminant analysis belongs to set of machine learning classification techniques where the focus is on boundary between between population of different classes. It’s all about predicting the hyper plane separating the population of different classes, and not  predicting the class membership of data points directly.

By Baye’s theorem, the probability of a data point or feature  x belonging to class c is as follows. The data point could be a scalar or a vector. For our example, it’s a scalar i.e., customer age.

p(c|x) = p(c) p(x|c) / p(x)
where
p(c|x) = class posterior probability
p(x|c) = class likelihood
p(c) = class prior probability
p(x) = evidence or marginal probability of data point x

Considering binary classification, with two classes c1 and c2, the the discriminating hyper plane in the feature space will correspond to the condition when p(c1|x) = p(c2|x). Using logarithm, it can be expressed as

ln(p(c1|x) / p(c2|x)) = 0

The following assumptions are made for Fisher Discriminant Analysis (FDA).

• Class likelihood p(x|c) i.e. the class conditional density of feature is Gaussian
• The two variances of the class conditional feature population are same
• The two classes are linearly separable

Using these two assumptions and some mathematical juggling of the equation we arrive at the following solution of the feature boundary value (i.e. customer age in our example).

x = (m1 + m2) /2 – ln(n1 / n2) * var / (m1 – m2)
where
m1 = the mean of the data points belonging to class c1 and m2 for class c2
n1 = the number of data points belonging to class c1 and n2 for class c2
var = variance of each sub population of data points

For our example problem, here is the translation.

• m1 = average age of customer who have downloaded music category 1
• m2 = average age of customer who have downloaded music category 2
• n1 = number of customer who have downloaded music category 1
• n2 = number of customer who have downloaded music category 2
• var = age variance of customers who have downloaded music category 1 and 2

If the likelihood of customers preferring the two music categories are same,   the second term disappears. Then, the discrimination age lies exactly half way between the average age of the the two customer sub populations, which makes lot of sense intuitively.

## Fisher Discriminant Map Reduce

Enough of theory. Let’s get down to some computation details. Our input consists of the following fields

• CustomerID. Irrelevant for our processing
• Age
• Date Irrelevant for our processing

The customer population is divided into two segments, based on the value of the last field i.e., download flag. For each sub population we need to compute the following

• Customer count
• Average age
• Variance of age

The generic Map Reduce class NumericalAttrStatsManager computes these statistics among other things.  The mapper output is keyed by the tuple (attribute ordinal, conditioning attribute value). A combiner is used compute various partial sums and to optimize the map reduce shuffle.

The Map Reduce class FisherDiscriminant is a simple extension of the previous class. It uses the statistics generated by the first map reduce  to calculate the discrimination customer age separating the young from the old.

The assumption of variance being equal in the two sub population will not hold in many cases. We use pooled variance as below. It’s essentially a weighted average of the two variance values

var = (var1 * n1 + var2 * n2) / (n1 + n2)

## Here Comes Some Results

Some sample input is shown below. We  only care about  about the 2nd and the 4th field, as far as our analysis goes.

```41BAJX7ZRI,30,2012-03-01,Y
6SCTZKDV58,57,2012-03-01,N
TMH20834WC,69,2012-03-01,N
0213UJ6VD0,23,2012-03-01,Y
2VP1PV2SO3,71,2012-03-01,N
CY3TLBC2QE,69,2012-03-01,N
2XNVBIY0CE,53,2012-03-01,Y
9XSQVPLB5V,25,2012-03-01,Y
3T4IPME99C,47,2012-03-01,Y```

Here is the oputput

```1,0,795636.0,3.940716E7,17908,44.42908197453652,226.5905145246627,15.052923786582548,15.0,74.0
1,N,530657.0,3.0104135E7,9684,54.79729450640232,105.90325166419507,10.290930553851535,31.0,74.0
1,Y,264979.0,9303025.0,8224,32.22020914396887,93.06252446987742,9.646891959065231,15.0,55.0
1,0.1634183303630028,100.00632624281533,42.78488217602655```

The first column each is the ordinal of the attribute. The first line contains  mean, variance and other statistical parameters for the whole population. The second and the and the third lines contain the conditional statistics i.e., mean, variance etc. of the two sub populations. The second column contains the class attribute values, which is Y or N for our example.

The last line contains the class boundary value i.e. the discriminating age for our example. The last field of this line contains this value, which is 42. The other fields contain some intermediate results. For example, the the 3rd field is the pooled variance.

## Multivariate Discriminant Analysis

So far we have done simple single variate discriminant analysis. It’s been a somewhat naive example, since our input of customer age is just a simple scalar. If we also included the number of past downloads as another input parameter, our input feature would have been a vector instead of scalar.

The same analysis technique still holds. However for feature vector, we had to use mean vector, co variance matrix, instead of scalar statistics. The class boundary would have been a line in a 2 dimensional space, instead of a point. The customer population on one side of the line would have corresponded to music category 1 and vice versa.