Customer segmentation or clustering is useful in various ways. It could be used for targeted marketing. Sometimes when building predictive model, it’s more effective to cluster the data and build a separate predictive model for each cluster. In this post, we will segment customers based on their online behavior in an eCommerce web site.
The focus of this post is on solving a specific problem and interpret the results and not a broad overview of clustering techniques. We will use python scikit-learn machine learning library. The python implementation can be found in my open source project avenir.
Customer Online Behavior Data
Clustering or segmentation is a way of grouping data, such that all the data elements within a group are maximally similar to each other and minimally similar to data elements in other clusters, based on some measure of similarity. It’s an unsupervised machine learning technique.
Customers can be segmented along various aspects and dimensions. Our focus is on online behavior related customer data. Customers’s online behavior is captured through the following attributes.They are based on historical data collected over some time horizon
- Number of visits
- Average visit duration
- Hour of the visit
- Number of transactions
- Average amount spent per transaction
The data is generated with a python script which creates 3 clusters. Later on we will see if the clustering algorithm finds all the clusters and the quality of the result. Here is some sample input data.
1000016,22,9,3,13,335.976 1000017,22,11,3,14,366.891 1000018,14,12,2,7,239.211 1000019,4,25,3,1,120.155 1000020,4,29,3,2,54.039 1000021,18,12,2,9,309.805 1000022,18,21,1,7,99.919 1000023,9,19,3,4,380.363 1000024,16,14,2,9,378.058
The first field is the customer ID. Rest of the fields are as described earlier. The field hour of visit requires some explanation. A 24 hour time window is split into 4 sub windows, each sub window spanning 6 hours. The field contains index of the hour sub window e.g. 2 corresponds to 6 AM to 12 PM.
Do We Have Clusters
That’s is legitimate question to ask before we embark on actual clustering analysis. Any clustering algorithm will dutifully always find clusters when fed with data, whether the data has good quality clusters or not. But fortunately, there are some measures of cluster quality which can be used to validate the results.
The answer to the question of whether a data set has clusters is not binary as true or false. The algorithm we will use to determine presence of clusters will generate a numerical score that reflects the extent of clustering in the data.
We will discuss 2 algorithms. Both of them test the randomness of the distribution of data points in n dimensional space. Both are implemented in cluster.py.
We will compute a statistic called Hopkins statistic. It’s value lies between 0 and 1. A smaller value indicates highly clustered and well separated data. The steps are as follows
- Generate data set of size k where data elements are randomly distributed in an n dimensional space. Find the nearest neighbor distance with respect to the original data set for each element in this data set. We will call the sum of distances r.
- Generate another data set of size k by sampling from the original data set. Find the nearest neighbor distance with respect to the original data set for each element in this data set. We will call the sum of distances s.
- Hopkins statistics is defined as h = s / (r + s). The previous 2 steps could be repeated and average Hopkins statistic could be calculated.
Intuitively, it’s clear that if data is clustered and well separated s will tend to be small and r will tend to be large, making h small.
If we find a large Hopkins statistics for a data set it indicates lack of clusters in the data and it may be futile to do cluster analysis on the data set.
Here is the partial result for Hopkins statistic. One of the parameters for the python script that generates input data is the percentage of data points that are randomly generated and don’t necessarily belong to any cluster.
The random data set for Hopkins statistic was generated by setting the random percentage parameter to 100. We have used 10 iterations to calculate the statistic. The last output line shows the average Hopkins statistic.
----------------- random sum 38.969 split sum 15.732 hopkins stats 0.288 random sum 38.220 split sum 13.553 hopkins stats 0.262 random sum 39.106 split sum 13.218 hopkins stats 0.253 random sum 38.977 split sum 12.309 hopkins stats 0.240 average hopkins stat 0.267
Based on the Hopkins statistic, we can conclude that the quality of clusters in the data is average and not necessarily the best.
Nearest Neighbor Distance
With this technique, we find the distance to kth nearest distance of all data points. We sort the distances. If the data is clustered and well separated, there will a sharp increase in the distance to the kth neighbor for some value of k. This procedure needs to be repeated for different values of k.
On the same token, if the data was randomly distributed in the n dimensional space, the distance to the kth neighbor will remain more or less same without any noticeable sharp increase.
Entropy is another way to detect presence of clusters in data. Presence of clusters in data causes entropy to drop.
Entropy will be highest for random data. The maximum entropy values is a function of the number of attributes and the range of values for each attribute. At the other extreme, entropy will be 0 when all the data points are identical i.e. they converge to 1 cluster.
Extent of clustering in the data depends on where the actual entropy value falls in that range of entropy.
How Many Clusters
In many clustering algorithms, you have to specify the number of clusters. But how do we know how many clusters to use. This the second question to ask ourselves. Although, the data naturally may have certain number of clusters, it is unknown to us.
Various techniques have been proposed for identifying the number of clusters, without actually clustering the data. Typically you have to do cluster analysis for various number of clusters in a range. Using some measure of the quality of the clusters, the actual number of clusters can be found from that range.
K Means Clustering
Clustering is an well researched and well established area. There are many clustering techniques with their pros and cons. Any good text book will have wealth of information. This book is a good start. Here is a good clustering survey paper.
We will be using K Means clustering technique, which is a popular and well known algorithm. Here is the core K Means algorithm.
- For K = N to M
- Select K initial cluster centroids, wich could be random K data points
- Assign each data point to nearest cluster
- Calculate new cluster centroids
- Repeat from step 3, until centroids don’t change significantly
- Repeat for step2 to generate multiple clustering result for different random initial clusters
- Repeat from step 1 for different number of clusters
The critical question to ask is what is the correct number of clusters K. Actually we don’t know and we try different values of K between N and M. The right value of K is identified based on certain error values.
We have done clustering for number of clusters in the range of 2,3,4 and 5. Here is the complete output for all 4 cases along with various cluster quality measures. The data is scaled i.e. for each attribute, mean is subtracted and then divided by standard deviation.
Since clustering involves finding similarity, which requires distance calculation, attributes with large values will dominate the calculated distance. Scaling is a way to mitigate this problem, so that all the attributes contribute equally to the distance calculation.
starting with num of clusters 2 ... starting kmeans clustering... [[-1.03627278 0.92199485 0.4243233 -0.96961262 -0.3488038 ] [ 0.55310265 -0.49210768 -0.22647931 0.5175233 0.18617135]] cohesion: 3.310 [ 2.72894043] min inter cluster distance: 2.729 partition measure (3.3102508769423524, 0.73288517982895462) starting with num of clusters 3 ... starting kmeans clustering... [[-1.08143727 1.08889441 0.6752688 -1.00844926 -0.18799157] [ 1.17321314 -0.48495096 0.56952466 1.26140784 0.95217464] [ 0.04158759 -0.44223792 -0.77460269 -0.05745153 -0.42296058]] cohesion: 2.279 [ 2.58210873 2.59219329] min inter cluster distance: 2.582 partition measure (2.2790902859408284, 1.1618410816877616) starting with num of clusters 4 ... starting kmeans clustering... [[ 0.27300247 1.48042407 -3.08280661 0.3435035 -0.86061828] [ 0.0436233 -0.59060533 -0.47675267 -0.05318986 -0.34397085] [-1.09396736 1.07373027 0.71470365 -1.02615107 -0.16474849] [ 1.25662996 -0.49227607 0.61029852 1.34656622 1.04793469]] cohesion: 1.859 [ 3.39964482 2.54213843 3.88445258] min inter cluster distance: 2.542 partition measure (1.8585527924149532, 1.573478436187997) starting with num of clusters 5 ... starting kmeans clustering... [[ 1.26566129 -0.5036571 0.61029852 1.35440855 1.01642539] [-1.28945283 1.19441055 0.67842944 -1.22407402 -0.95453381] [-0.86360688 0.90521963 0.75639191 -0.78863294 0.76758807] [ 0.27300247 1.48042407 -3.08280661 0.3435035 -0.86061828] [ 0.04343482 -0.58587737 -0.48516047 -0.05526562 -0.34287934]] cohesion: 1.620 [ 2.55646606 1.85104484 2.52160916 3.39074311] min inter cluster distance: 1.851 partition measure (1.6204303414516343, 2.7011771367271082) validity index num cluster 2 validity 1.000 num cluster 3 validity 0.608 num cluster 4 validity 0.568 num cluster 5 validity 1.000
The output requires lot of explanation. For any case with some number of clusters, the output consists of
- Cluster centroids
- Cluster cohesion, which is average distance of data points to their respective cluster centroid
- Minimum distance between clusters
- Validity index
Dissecting a Cluster in Plain English
Let’s try interpreting one of the clusters in plain english. Assuming we have accepted the 3 cluster solution, let’s take a close look at the first cluster as below
-1.08143727 1.08889441 0.6752688 -1.00844926 -0.18799157
Because of scaling the attributes values look very different from input data. The cluster could be described in plain english as follows in relative terms compared to the other 2 clusters.
- The number of visits is low
- Time spent during visit is high
- They visit later during the day
- Number of transactions low
- Amount spent is medium
Some general observations can be made about the customers in this cluster. Although they are infrequent visitor, they tend to spend more time per visit. Perhaps they are buying diverse range of products or high price ticket products which require more online browsing and research.
The amount spent per transaction is high, indicating that they are probably purchasing higher price ticket items, which perhaps explains why they spend more time browsing and evaluating products.
Validity index helps us determine the right number of clusters. According to validity index, it seems to be toss up between 3 and 4 for our data. Incidentally, we artificially introduced 3 clusters in the input data.
Validity index depends on the following quantities. Here are the details of the algorithm.
- Under partition measure which is the mean of distance between data points and cluster centroid across all clusters. This has large value when the number of clusters is less than the actual number of clusters.
- Over partition measure, which is number of clusters divided by the minimum inter cluster distance. This has large value when the number of clusters is more than the actual number of clusters
The two vectors are added to get the validity index vector. The number of clusters that correspond to the smallest validity index is the desired number of clusters.
Squared Sum Error Elbow
A popular technique for finding the number of clusters is based on the sum of squared error (SSE). For small number of clusters SSE is high. It falls rapidly as the number of clusters increases.
For specific number of clusters, the rate of drop SSE decreases significantly becoming almost flat, creating an elbow where the second derivative of SSE is maximum. The number of clusters corresponding to the elbow is the actual number of clusters.
The steps are as follows. Cluster analysis is done for a range of number of clusters where the ideal number clusters lies somewhere in that range
- Start with the smallest number of clusters.
- Find the clusters and calculate squared sum error which is sum of the square of the distance between a data point and cluster centroid across all clusters.
- Increase number of clusters by 1 and repeat step 2, until the maximum number of clusters has been tried.
- Identify the elbow in SSE where the second derivative is the highest. This best done through a plot of SSE against number of clusters.
Silhouette coefficient is another way to find the right number of clusters, after clustering analysis has been performed for a range of cluster numbers. The calculation steps are as follows
- For a data point calculate average distance (a) to all other data points in the cluster
- For the same data point calculate average distance to all data points belonging to another cluster.
- Repeat 2 for all other clusters and take the minimum all the average distances (b) to other clusters.
- Silhouette coefficient for the data point is defined as (b -a) / max(a,b)
- Repeat from 1 for other data points
- Take the average of the silhouette coefficient for all data points
- Repeat from 1 for next number of clusters
When plotted against the number of clusters, the number of clusters corresponding to the maximum values of the Silhouette coefficient is the actual number of clusters.
Outliers adversely affect the quality of clusters, particularly for K Means clustering. One option is to remove outliers before clustering.
A better option is to identify and track outliers in the context clusters as we find clusters and remove them. Here are the steps
- In any K Means iteration, after the data elements have been assigned to respective clusters, calculate outlier score of all data points.
- Define outlier score as the ratio of the distance of a data point to the centroid to the mean or median distance of all cluster members to the centroid of the cluster
- Either remove the data element with highest outlier score or remove all the data elements with outlier scores above some threshold.
Unfortunately K Mean clustering in scikit-learn, does not offer the option of removing outliers as described.
I have skipped lot of technical details on clustering, because there is enough material on clustering in books and on line content. This blog is not meant to be a survey of clustering techniques.
My goal is this post has been to take a specific problem and solve it with a specific clustering algorithm and going through all the steps necessary for a complete solution of the problem.
The core clustering problem was solved with scikit-learn. Rest were implemented from scratch. Here is a tutorial document with the steps to to run this use case.