The bounce rate for a page on a web site is the proportion of sessions that contain only that page. This post will show how to calculate bounce rate in real time with Storm using web log data. We will use a sampling and window based algorithm called Biased Reservoir Sampling from my OSS project hoidla, which is a collection of stream processing algorithms. The Storm implementation is part of my web analytic OSS project visitante.
We will see later how this analysis can feed into a reinforcement learning based approach for web site optimization. To be more specific, consider multiple competing home pages for a web site. We expect the reinforcement learning engine to tell us the winning design. You may want to refer to this earlier post of mine for more on this topic.
Before we dive into the main topic, let’s take a digression into windowing techniques for processing real time data. Any data stream with a sequential context is called a sequence. When the sequential context is time, it’s called a time series.
With most sequence data, and particularly with time series data, we are generally interested in recent data, and that’s where windowing comes into the picture.
There are generally two broad categories of windows, as described in the next two sections. For each category, there are different window types based on how the window sliding works. All the window types described are implemented in hoidla.
Size Bound Window
The window is defined by a fixed size. For time series data, it implies that the window will hold data of varying time span, depending on the real time stream velocity. There are different flavors of size bound window, depending on how the incoming and expired data are handled.
- Continuous sliding: As a new data point is added to a full window, the oldest one is discarded. In other words, the sliding step size is 1
- Discrete sliding: When a new data point is added to a full window, the n oldest data points are discarded, where n is the sliding step
- Full window sliding: When a new data point is added to a full window, all data points in the window are discarded. The sliding step size is the same as the window size. This kind of sliding is also known as tumbling
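The three flavors differ only in how many old points are discarded when the window is full. Here is a minimal sketch in Python to make that concrete (the actual hoidla implementation is in Java; the class name and API here are illustrative, not hoidla's):

```python
from collections import deque

class SizeBoundWindow:
    """Illustrative size bound window.
    step=1: continuous sliding; 1<step<size: discrete sliding;
    step=size: full window (tumbling) sliding."""
    def __init__(self, size, step=1):
        assert 1 <= step <= size
        self.size, self.step = size, step
        self.data = deque()

    def add(self, value):
        if len(self.data) == self.size:
            for _ in range(self.step):   # discard the step oldest points
                self.data.popleft()
        self.data.append(value)

continuous = SizeBoundWindow(size=3, step=1)
tumbling = SizeBoundWindow(size=3, step=3)
for v in [1, 2, 3, 4]:
    continuous.add(v)
    tumbling.add(v)
# continuous.data -> deque([2, 3, 4]); tumbling.data -> deque([4])
```

The fourth add triggers the slide: the continuous window drops only the oldest point, while the tumbling window flushes everything first.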
Time Span Bound Window
The window size is defined by a fixed time span. For a stream with varying velocity, the number of data points in the window will vary over time. The different flavors are as follows:
- Continuous sliding: As a new data point is added to a full window, data points older than the window time span are discarded
- Discrete sliding: When a new data point is added to a full window, the oldest data points spanning a time span t are discarded. We can think of t as the sliding step in time units
- Full window sliding: When a new data point is added to a full window, all data points in the window are discarded. The sliding step is the same as the window time span.
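A time span bound window can be sketched the same way, with the sliding step expressed in time units rather than a point count (again an illustrative Python sketch, not the hoidla API):

```python
from collections import deque

class TimeSpanWindow:
    """Illustrative window bounded by a time span; step (in time units)
    is the slide: a small step approximates continuous sliding,
    step == span gives full window (tumbling) sliding."""
    def __init__(self, span, step):
        self.span, self.step = span, step
        self.data = deque()                 # (timestamp, value) pairs

    def add(self, ts, value):
        while self.data and ts - self.data[0][0] > self.span:
            # discard the oldest points covering one sliding step
            limit = self.data[0][0] + self.step
            while self.data and self.data[0][0] < limit:
                self.data.popleft()
        self.data.append((ts, value))

w = TimeSpanWindow(span=10, step=10)        # tumbling over 10 time units
for ts in [0, 3, 7, 12]:
    w.add(ts, "page")
# w.data -> deque([(12, 'page')])
```

The point at time 12 exceeds the 10-unit span, so the whole window is flushed before it is added; with a smaller step, only the oldest points would have been dropped.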
Biased Reservoir Sampling Window
A pure sliding window can be extreme in many cases, since past data is completely lost. In reservoir sampling, the focus is on retaining recent data while not completely losing track of past data. The transition from old data to new data in the window happens more gradually.
In a biased reservoir sampled window, we maintain a window of a specified maximum size. As new data arrives it is included, while old data is discarded in a probabilistic way. We will be using this kind of window for our analysis, instead of the others listed above. The algorithm works as follows.
- For a partially filled window, as new data arrives, we flip a coin with success probability f(t), where f(t) is the fraction of the window that is filled at time t. If the coin flip is a success, we choose an existing data point randomly in the window and replace it with the new data point. Otherwise, we add the new data point to the window, increasing the window size by 1
- For a full window, we choose an existing data point randomly in the window and replace it with the new data point, with the window size remaining the same
In the beginning the window will fill up fast. The filling then slows down as the window size approaches its maximum size, since it becomes more likely for an existing data point to be replaced by the newly arriving data.
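The two rules above translate directly into code. Here is a minimal Python sketch of the algorithm (the actual BiasedReservoirWindow in hoidla is in Java; this is a simplified stand-in):

```python
import random

class BiasedReservoirWindow:
    """Sketch of the biased reservoir sampling algorithm described above."""
    def __init__(self, max_size, rng=None):
        self.max_size = max_size
        self.data = []
        self.rng = rng or random.Random(42)

    def add(self, value):
        filled = len(self.data) / self.max_size     # f(t): fraction filled
        if len(self.data) == self.max_size or self.rng.random() < filled:
            # replace a randomly chosen existing data point
            self.data[self.rng.randrange(len(self.data))] = value
        else:
            self.data.append(value)                 # window grows by 1

window = BiasedReservoirWindow(100)
for i in range(1000):
    window.add(i)
# the window holds mostly recent points, with some older survivors
```

Note that an empty window always accepts the new point (f(t) = 0), and a full window always replaces, so the window never exceeds its maximum size.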
With this algorithm, the probability of the r-th data point remaining in a window of maximum size n at time t is approximately p(r, t) = e^(-(t - r)/n).
Here are some key observations. The older the data point, the less likely it is to remain in the window. With a bigger window, an old data point is more likely to remain in the window.
As I mentioned earlier, our use case involves analyzing web log files to identify sessions with only one page visited, to ultimately calculate the bounce rate. Our topology consists of a spout that reads log lines from a Redis queue and two layers of bolts. The log format is a simplified version of the W3C log format. Here is some sample input.
2014-10-04 05:34:47 18.104.22.168 POST /placeOrder?sessionid=c78c022e-4bc2-11e4-bb7a-c42c030f8af1 200
2014-10-04 05:34:47 22.214.171.124 POST /confirmShipping?sessionid=c064dba6-4bc2-11e4-94e9-c42c030f8af1 200
2014-10-04 05:34:47 126.96.36.199 GET /onSale?sessionid=ceb34542-4bc2-11e4-80ea-c42c030f8af1 200
2014-10-04 05:34:47 188.8.131.52 POST /addToCart?sessionid=b3478ef8-4bc2-11e4-8b73-c42c030f8af1 200
2014-10-04 05:34:48 184.108.40.206 GET /home2?sessionid=d410c430-4bc2-11e4-934d-c42c030f8af1 200
2014-10-04 05:34:51 220.127.116.11 POST /confirmShipping?sessionid=c064dba6-4bc2-11e4-94e9-c42c030f8af1 200
2014-10-04 05:34:51 18.104.22.168 POST /placeOrder?sessionid=b3478ef8-4bc2-11e4-8b73-c42c030f8af1 200
The first layer consists of multiple VisitSessionBolt bolt instances for load balancing and uses field grouping on sessionID, so that all data for a given session lands on the same bolt instance.
Whenever a session end is detected, whether through an explicit logout page or session expiry, a tuple (pageID, session depth) is emitted to the next layer. The pageID is a unique identifier for a given home page, extracted from the URL with a regex filter. Expired sessions are detected by a tick tuple that arrives at regular intervals.
The second layer consists of one VisitDepthBolt bolt instance. It maintains one BiasedReservoirWindow instance for each pageID. When a normal tuple arrives, it’s added to the corresponding window. When a tick tuple arrives, we count the number of sessions with only one page and calculate the bounce rate. This process is repeated for all pageIDs. The bounce rates are written to a Redis queue. This layer contains only one bolt and is hooked up to the previous bolt layer with shuffle grouping. Here is some output.
home1:10 home2:16
home1:15 home2:16
home1:15 home2:11
home1:15 home2:11
home1:15 home2:10
home1:15 home2:10
Instead of shuffle grouping, we could have done field grouping based on pageID with multiple bolt instances. However, on closer look this seems unnecessary, because the data volume shrinks significantly between the first and the second layer of bolts. Whereas the first layer collectively receives a message for each page visited in each session, the second layer receives only one message per session.
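The per-page calculation on each tick is simple: count the depth-1 sessions among the window contents. A sketch, where a plain list of session depths stands in for one pageID's reservoir window:

```python
def bounce_rate(depths):
    """Percent of sampled sessions with depth 1, i.e. bounced sessions."""
    if not depths:
        return 0
    bounced = sum(1 for d in depths if d == 1)
    return round(100 * bounced / len(depths))

# 5 of the 9 sampled sessions visited only one page
print("home1:%d" % bounce_rate([1, 3, 1, 2, 5, 1, 1, 2, 1]))   # home1:56
```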
Session Depth Distribution
Instead of the bounce rate, the output could also be a distribution of session counts by session depth. The type of output desired can be chosen by setting the configuration parameter result.type appropriately. The choices are bounceRate, depthDistr and avDepth.
The output consists of a list of pairs of session depth and corresponding count. The session depth distribution could be used to track the stickiness of a web site. Increasing skewness to the right indicates higher stickiness.
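Both alternative outputs fall out of the same window contents. A sketch of the two calculations, again with a plain list of session depths standing in for the window:

```python
from collections import Counter

def depth_distribution(depths):
    """(session depth, session count) pairs, sorted by depth."""
    return sorted(Counter(depths).items())

def average_depth(depths):
    """Mean session depth of the sampled sessions."""
    return sum(depths) / len(depths) if depths else 0.0

dist = depth_distribution([1, 3, 1, 2, 5, 1, 2])
# dist -> [(1, 3), (2, 2), (3, 1), (5, 1)]
```

A stickier site would show the counts shifting toward the higher depths on the right of this list.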
Site Optimization with Reinforcement Learning
As mentioned earlier, the bounce rate analysis can play a key role when a web site is optimized with a reinforcement learning engine. The goal of the optimization could be to select the optimal page among n candidate home pages. A reinforcement learning engine runs iteratively and has two essential components, as below.
- The learning system, which in a given iteration chooses one among n options (in our case, the n different home page designs), based on its algorithm and the reward data available
- A reward gathering system (the reward in our case being the inverse of the bounce rate), which, given the decision by the learning engine, finds the reward for that decision and feeds the reward data back into the learning engine
The analysis in this post corresponds to the second piece of the puzzle. For a given home page, the lower the bounce rate, the higher the reward.
For the first part, any of the algorithms implemented in my OSS project avenir could be used. They are in the package org.avenir.reinforce.
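To make the learner/reward loop concrete, here is a minimal epsilon-greedy sketch in Python. It is only one possible learning algorithm and is not the avenir implementation; the reward, as suggested above, is taken as the inverse of the bounce rate (here simply 100 minus the bounce rate percentage, an assumption):

```python
import random

class EpsilonGreedy:
    """Minimal epsilon-greedy learner over candidate home pages."""
    def __init__(self, pages, epsilon=0.1, rng=None):
        self.pages = list(pages)
        self.epsilon = epsilon
        self.rng = rng or random.Random(7)
        self.total = {p: 0.0 for p in self.pages}
        self.count = {p: 0 for p in self.pages}

    def choose(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.pages)              # explore
        return max(self.pages,                              # exploit
                   key=lambda p: self.total[p] / max(self.count[p], 1))

    def reward(self, page, bounce_rate):
        # lower bounce rate -> higher reward (assumed reward shaping)
        self.count[page] += 1
        self.total[page] += 100.0 - bounce_rate

learner = EpsilonGreedy(["home1", "home2"], epsilon=0.0)
learner.reward("home1", 15)      # bounce rates fed back from the topology
learner.reward("home2", 40)
# with no exploration, choose() picks the lower-bounce page
print(learner.choose())          # -> home1
```

In a live setup, each iteration would serve the chosen page, read that page's bounce rate back from the Redis queue, and call reward() with it.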
Instead of bounce rate, we could have used average session depth for each home page design. For average session depth, the parameter result.type needs to be set to avDepth.
We have gone through a Storm based implementation for real time bounce rate analysis. Although the solution has been discussed in the context of reinforcement learning based web site optimization, there could be simpler use cases.
You may want to track bounce rate after launching a new home page. If you find that the new design is having an adverse effect based on real time tracking, you may want to revert to the existing design.
A tutorial document with step by step instructions on how to run the example is also available.