## Predicting Call Hangup in Customer Service Calls with Decision Tree and Random Forest

When customers hangup after a long wait in a call, it’s money wasted for the company. Moreover, it leaves the customer with a poor experience. It would have been nice, if we could predict in real time while the customer is on hold, how likely is the customer to hangup and based on the prediction give queue priority to the customer or take some other action.

In this post, we will discuss a  Decision Tree and Random Forest based solution implemented on Hadoop. Random forest is just an ensemble of decision trees. The Hadoop based implementation of decision tree and random forest is part of open source project avenir.

## Decision Tree

Decision tree is an widely used machine learning algorithm, mostly used for classification. It can be used for regression also. It’s powerful and popular because of the following reasons.

1. Works for non linearly separable classification problems
2. Results are in the form of if then rules and comprehensible by humans
3. Can handle both numerical and categorical attributes
4. Tolerant to errors and missing values in training data

In decision tree, each intermediate node splits the features space into 2 or more sub spaces base on some feature selected at that node.  The child node defines a (n-1) dimensional region, where the parent node defines a n dimensional region.

There are various criteria that can be used to select the feature and the split along the feature dimension. A split is a mutually exclusive set of partitions that cover the domain of the feature. The tree is grown recursively. There are various strategies for stopping the tree growth. Here are the steps in the algorithm.

1. Select a feature attributes and select optimum split points to  partition the feature
2. Partition the data as per step 1
3. Repeat form step 1 for each of the partitions from step 2, except for  the partitions that have met the stopping criteria

Each path is the final decision  tree represents a rule. For a data point, a predicate for each  node in the path evaluates to true e.g age < 25. Taking all the nodes  together we have conjunction of predicates i.e the IF part of the rule.

The leaf node is associated with the class or target attribute value i.e. the THEN part of the rule. This value is the class attribute value of the majority of the sub population  of training data in the leaf node, in case of  classification.

The leaf node value could also be associated with a probability of class attribute value. For binary classification problem, the probability is the ratio of the count of training sample with  positive class attribute value and count of training sample with  negative class attribute value.

For prediction with an input instance, we start from the root and follow a path where the data evaluates to true for the predicate associated with each  node in the path up to the leaf node. The prediction is the class attribute value associated with the leaf node.

An internal node is associated with the following

1. Feature attribute
2. One of the partitions of it’s parent node
3. Set of partitions, each pointing to a child node

A leaf node is associated with the following

1. Feature attribute
2. One of the partitions of it’s parent node
3. Class attribute value and optionally associated probability

## Customer Call Data

Past customer call data is used as the training data for the decision tree. The data is for a fictitious internet service provider and it contains the following attributes

1. Customer ID
2. Customer type
3. Area code
4. Customer issue
5. Time of day
6. Hold time
7. Hung up

The first attributes is not used. The last attribute is the binary class or target attribute. The remaining are feature attributes. Time of day and hold time are numerical feature attributes. The rest of the feature attributes are categorical. Here is some sample data.

```V14SIXU4FO,residence,571,cable,PM,431,T
W1DU70XB45,residence,615,billing,PM,502,T
```

The meta data about the the attributes is defined in in JSON file is saved in HDFS, which gets used by the Hadoop Map Reduce job. There are some meta data worth making a note of.

## Partitioning

The attribute maxSplit defines maximum number of partitions to use. If it’s a numerical attribute, it’s the maximum number of partitions to use when splitting . For example if it’s set to 3 , we have to consider all possible partition combinations when there 2 partitions, and then all possible partition combinations when there 3 partitions. When set to 2, we get a binary decision tree.

For real numbers there infinite number of partition combinations. To address this problem and to create a finite number of partition sets, partition generation is controlled by another parameter called splitScanInterval.

For categorical attributes with n discrete values  maxSplit defines the maximum number of sub sets, the set of values  is going to be split into. Assuming maxSplit is set to 3, ,  we consider all possible ways the n values can be split into 2 sub sets and then all possible ways the n values can be split into 3 sub sets.

The maximum  number of partitions defined by maxSplit should be set based on the complexity of the regions in the feature space that are to be part of the predictive model. Consider a classification problem in the healthcare domain. We want to predict healthcare need which has the binary value low and high.

Based on the prior knowledge that healthcare needs are high at young age and old age, we should set  maxSplit to 3 for the feature attribute age. If we set the parameter to a lower  value of 2, then we will end up using the feature age multiple times while recursively building the tree.

## Tree Builder Map Reduce

The Map Reduce job runs iteratively. Each iteration grows the tree by one level. The mapper selects a set of  feature attributes to try. The selection of the set is based on some strategy e.g attributes not used so far for the data partition under consideration.

In the mapper for each such attributes, it generates all possible partition set combinations.  Each such (attribute, partition set) combination  is a candidate node for the next tree level. The total number of candidate child nodes generated is as follows.

c = Σ1 to n si    where
c = number of child nodes
n = number of features
si = number of splits for ith feature, where a split is a set of partitions

Here is an example node for the feature attribute issue type (issue type, ((internet, cable), (billing, other))). In this example, the attribute is categorical and the partition set contains 2 partitions.

The record being processed by the mapper is associated with each  (feature, partition) combination it belongs to  and emitted.  Here is some sample mapper output after 3 iterations.

```\$root;5:3 in cable:billing:other;10:5 gt 480,CE91IMCI3M,residence,507,billing,AM,535,T
```

Actual data is preceded by a set of partitions separated by semi colon. Each partition start with an internally generated partition ID. The partition has the syntax attrubuteIndex operator valueSet for categorical attributes and attrubuteIndex operator value for numerical attribute.

The reducer selects the child node among all the candidate child nodes.. The reducers writes out all the paths grown so far into a JSON file. The mapper uses this information to filter out records unsuccessful  partitions and associated records while emitting.  We want to grow the tree only from the child node selected in the previous iteration.

The reducer calculates a score for each candidate child node and selects the child node based on some strategy e.g. the best score. Generally the score is information content gain compared to the parent node.

Information content is measured by either Entropy or Gini Index. They measure how homogeneous all the data under a partition is. A data set is more homogeneous if they have predominantly one value of the class or target attribute.

After the tree is expanded, the expanded tree is written in a JSON file. The file essentially contains the tree model definition, which includes all the paths through the tree i.e all the rules. This file gets used by the mapper as mentioned earlier.

There are various configuration parameters. Here are are some of the important parameters. Some of them are used for Random Forest generation.

 dtb.split.attribute.selection.strategy Feature selection strategy for generating partitions. Options are all, notUsedYet, ransomAll, randomNotUsedYet dtb.split.algorithm Information content calculation algorithm. Options are entropy, giniIndex dtb.path.stopping.strategy Tree growing stopping strategy. Options are maxDepth, minPopulation, minInfoGain dtb.sub.sampling.strategy Training data sampling strategy. Options are none, withoutReplace, withReplace dtb.split.select.strategy Split or child node selection strategy. Options are best, randomAmongTop dtb.custom.base.attributes Feature attributes to be used. Sub set of all available feature attributes.

The prediction from the decision tree model could be discrete i.e. a target attribute value or probabilistic i.e. probability of predicted target attribute value.

## Random Forest

Decision trees tend to over fit causing high generalization error due to variance. One way to overcome the over fitting problem is to use an ensemble of decision trees a.k.a Random Forest.

Prediction is made based aggregating the prediction from each decision tree model in the ensemble. One common way to to aggregate is to take majority vote of individual tree predictions.

An ensemble of decision trees can be built from the same training data set  by introducing randomness. randomness can be introduced in the following ways.

1. Training data
2. Input features
3. Algorithm parameters

Here is a list of ensemble techniques for building Random Forest. They belong to one of the 3 categories listed above.

1. For each of the k trees in the ensemble, sample n data instances from the original  n data instances with replacement a.k.a bagging.
2. For each of the k trees in the ensemble, select a random subset of q features from the original p features.
3. Boost each training data instance with weight depending on classification accuracy. For the next tree to be built among the k trees, sample from the weighted sample by weight.
4. While growing the tree use a random subset of features randomly. Build k trees with the same training data.
5. While growing the tree, use a random sub set of partition sets i.e among candidate child candidate nodes for a feature, use only a random sub set. Build k trees with the same training data.
6. While choosing among the candidate child nodes for a feature attribute, select top m among n child nodes for a feature. Choose one of the m child nodes randomly. Build k trees with the same training data.
7. Use new features by taking random linear combinations of existing features. Number of features to combine and which features to combine to create a new feature is random. Build k trees with the same training data.

Currently, most of the  ensemble techniques are supported in avenir. I have plans to implement most of them. Some of the techniques where the randomization logic resides on the mapper side, will not work if the data set is large enough to create multiple mappers e.g. number 4 in the list above.

To generate Random Forest, you should set proper configurations for the randomization technique to be used and then generate multiple decision trees with the same training data.

The prediction for Random Forest could be discrete based on the majority of the prediction of target attribute. It could also be probabilistic by converting the vote count to a probability or by taking average of probabilities if the individual decision trees make probabilistic prediction.

## Model Deployment and Prediction

For this use case, since the prediction needs to be made in near real time, the model should be deployed in Spark Streaming, Flink or Storm. The call status data could collected every 30 sec or so and fed to the streaming engine through a Kafka topic.

The prediction output could be written to another Kafka topic. Another process could read this topic and take some action e.g moving callers in the call queue.

## Wrapping Up

We have gone through a Hadoop based implementation of Decision Tree and Random Forest. We have used it for a call center use case involving customer call data. Here is a tutorial document with step by step instructions for this use case.