Although the goal for most predictive analytic problem is to make prediction, sometimes we are more interested in the model learnt by the learning algorithm. If the learnt model could be expressed as s set of rules, then those rules could be source of valuable insight into the data. For most Machine Learning algorithms, the learnt model is a black box and can not be interpreted in an intuitive way. However, Decision Tree is an exception. Decision Tree result is essentially a set of rules. There are few other rule extraction algorithms.
In this post I will cover solutions for mining simple rules based on one attribute only. It’s quick and dirty way of gaining insight into data. The solution is part of my open source project avenir. The implementation is in Hadoop. I will also discuss more complex rule mining algorithms based on Decision Tree and other algorithms.
The solution is two fold. First, we want to find how relevant the feature attributes are. A feature is relevant to prediction if it’s highly correlated to the class variable and weakly correlated with other feature variables. The features relevance is calculated based on entropy and mutual information. All the technical details can be found in my earlier post.
Second, once we have determined the relevance of the feature attributes we want to identify the values of the feature attributes that correspond to the positive or negative value of class attribute value. In other words we want to find out how particular feature variable value discriminates between positive and negative class variable outcome
Various statistical scores are calculated based conditional distribution of the feature variables. They are as follows.
- Odds ratio
- Distribution difference
- Minimum risk
- KL divergence
The underlying theme for all these techniques are the same. We take two conditional distribution values for the value of feature variable, one conditioned on the positive class variable value. and the other based on the negative class variable value.
Then we calculate a parameter, based on the the one of the 4 parameters listed above, which reflects the difference between the conditional distribution values. I would call this discrimination score. The more extreme the value of the score, the more discriminative power the particular feature variable value has in predicting the class variable. A high value corresponds to positive class variable and a low value corresponds to negative class variable.
Here are the detailed definitions of the statistical parameters listed above. They are defined for some feature variable xi. The class variable has the values t and f.
A rule has the following components. A rule is based on a set of numerical and categorical attributes.
- A rule has an antecedent and a precedent
- A rule antecedent is a conjunction of predicates.
- A predicate contains an operand, operator and a value. In the rule age le 30, age is the operand, le (less than) is an operator and 30 is the value.
- A rule precedent is a class attribute value
In the context of a Decision Tree, a rule is a path from the root to the leaf of the tree. All the nodes up to the leaf node constitute the antecedent of the rule. Each internal node corresponds to a predicate in the rule. The leaf node is the consequent of the rule, which contains the class attribute value.
Since we are working with rules involving only one feature attribute, our rules will have only one predicate.
Call Center Data
I will use call center call data and the related prediction problem is to predict whether the customer will hangup the call after the call is put on hold. The prediction could be used in various ways to improve the customer service experience. For example, if the customer is high value, the customer call could be re routed, so that more expedited service could be provided.
Here are the attributes of the call data. Customer ID is not used in the analysis. The last variable hungup is a binary variable which is the class variable.
- Customer ID
- Customer type
- Area code
- Tome of day
- Hold Time
Here is some sample input data. The data is for the call center of a fictitious cable and internet service provider company.
IFVGJ804H5,business,980,other,PM,505,T 3885XEU12L,residence,620,other,PM,514,T 1EYX5YR6H8,business,336,internet,AM,446,F 0J5ZPW52PE,business,770,internet,AM,566,F MP73W8O26O,residence,646,billing,PM,413,T 816JV2EA4R,business,615,internet,AM,367,F K1B904R0RX,residence,248,cable,AM,578,T 2FOVXNDS88,business,703,internet,AM,473,T ON42TG16ZF,residence,206,cable,PM,389,T XE7D8832WB,residence,510,cable,AM,491,T
The last field is cal variable i.e whether the customer hung up or not. The first field which is customeID is ignored in the analysis.
Here are the results from relevance analysis. There are various algorithms for relevance score. I have used something called Joint Mutual Information. The first field is the attribute column index. The second field is the relevance score. Higher score indicates higher relevance.
3,0.02847212413615168 2,0.061586810354280756 5,0.13537058600774948 1,0.08248495820454346 4,0.08850483575860568
Here is list of the feature attributes sorted by the relevance score in descending order. As expected, hold time seems to be the most important predictor for call hang up. Next is the time of the day for the call. The people may behave differently depending on whether the call came in the morning or afternoon.
|time of day||0.088|
The output of relevance analysis has the following 3 parts. Distribution and mutual information are secondary output.
- Distribution and conditional distribution
- Mutual Information
- Relevance score
Class conditional distribution data is used as input to the discrimination analysis map reduce input.
In the next phase, we calculate the discrimination score for different values for each feature attribute. The Map Reduce implementation class is CategoricalClassAffinity. All the results are based on odds ratio. Let’s take a look at the feature variable holding time .
5,8,1.673708784596871 5,9,1.5978124439662904 5,10,1.5365251727541953 5,11,1.0605464777273035 5,7,1.0164399092970524 5,12,0.8243052284503062 5,3,0.8243052284503062 5,6,0.7486576341998027 5,5,0.5964716478696741 5,4,0.35769469278073485
The first field is the feature attribute index. The second field contains the value of the feature variable, which is hold time in minutes and the last field is the discrimination score. Hold time of 8 minutes seems to be the strongest predictor for hang up. Interestingly higher hold time of 9 or 10 minutes are weaker predictors.
This is surprising, since one would think that a higher wait time will make it more likely for the customer to hangup. But let’s not forget that there are other variables involved and we are trying to make prediction based on one feature variable. Maybe some customers may be willing to hold for longer time and not hang up, because of the issue they have.
Let’s look at another feature variable customer type which is either residential or business. Here is the result.
Here we find that residential customers more likely to hangup. Business customers are willing to stay on the line, possibly because resolving a service issue is more critical for a business.
The solution described so far only extracts rules based on one feature attribute. For mining rules based on multiple feature attributes as a conjunction of predicates, here are some of the popular solutions. I have plans to implement them.
- Decision tree
Decision trees follow a strategy of generalization to specialization. Any path from the root of the tree to the leaf is a rule. There will be as many rules as the number of paths.
With CN2, one decision path at a time is built. After a path is created, all the data that match with the rule corresponding to the path are are removed and the process is repeated to find the next path.
Rise follows a strategy of specialization to generalization. Initially each data point is a rule. A rule is generalized is a such a way that the data point nearest to a rule is matched by the generalized rule.
I want to reiterate that we have gone through a solution of for singe attribute based rule mining. It’s a quick and dirty way to gain some insight on data. As I alluded to earlier, for more rigorous rule mining. one of the algorithms listed earlier should be used. There is a tutorial available that documents all the steps necessary to run this use case.