When I go to a web site to download a white paper or a product data sheet, I often hit the back button if presented with a form asking for lots of personal data. Any user who bounces out is a potential lost lead. Perhaps a redesigned page, asking for fewer personal details, would have prevented the problem.
In this post we will work on a solution using reinforcement learning algorithms. We will have multiple candidate page designs, and a reinforcement learning algorithm will find the optimum page, i.e., the one with the highest click through rate. The algorithm will run in real time, deployed on Storm, and process page requests from users. This implementation is part of my project avenir.
These algorithms make decisions in an uncertain environment and, based on the feedback, i.e., the reward obtained from the decisions made, eventually converge on the optimal decision. These algorithms have two different phases: exploration and exploitation. During exploration, the focus is on learning from the environment, and the algorithm may make decisions that are sub optimal.
During the exploitation phase, it exploits the knowledge learned from the environment to make optimal decisions. The algorithm starts with the exploration phase and then transitions over to the exploitation phase. However, the transition may not always be crisp, depending on the algorithm being used. My earlier post contains a survey of different reinforcement learning algorithms.
High Level Architecture
We will be using Storm and Redis for the implementation. A page request consists of a session ID and a counter which increments with every new request. The request is fed through a Redis queue. The learning engine decides which page to display. The response consists of the session ID and a page ID, and is written to another Redis queue.
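The request/response flow can be sketched in a few lines of Python. This is an in-memory stand-in for the two Redis queues; the queue variable names and field names are made up for illustration, and with a real Redis instance these would be lpush/rpop calls:

```python
import json
from collections import deque

# In-memory stand-ins for the two Redis queues (names are hypothetical)
request_queue = deque()
response_queue = deque()

def send_page_request(session_id, counter):
    # A request carries the session ID and a per-session request counter
    request_queue.append(json.dumps({"sessionID": session_id, "count": counter}))

def learning_engine_step(choose_page):
    # The engine reads one request, decides which page to show,
    # and writes (session ID, page ID) to the response queue
    req = json.loads(request_queue.popleft())
    response_queue.append("%s,%s" % (req["sessionID"], choose_page()))

send_page_request("cea24d0a-799c-11e3", 1)
learning_engine_step(lambda: "page2")
print(response_queue[0])   # sessionID,pageID pair
```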
After enough impression data has accumulated for a given page, a Python script simulates the click through rate (CTR), which is essentially the reward for the decisions made by the learning engine. I have assumed a Gaussian distribution for the click through rate, with a different mean and variance for each page.
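A minimal version of such a simulator, with made-up (mean, standard deviation) pairs per page, could look like this:

```python
import random

# Hypothetical Gaussian CTR parameters per page: (mean, std dev)
CTR_PARAMS = {"page1": (0.05, 0.01), "page2": (0.10, 0.05), "page3": (0.20, 0.02)}

def simulate_ctr(page_id):
    mean, sd = CTR_PARAMS[page_id]
    # Draw from a Gaussian and clamp to the valid [0, 1] range
    return min(max(random.gauss(mean, sd), 0.0), 1.0)

for page in CTR_PARAMS:
    print(page, round(simulate_ctr(page), 3))
```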
The click through rate data is fed into the learning engine through another Redis queue. The Storm bolt reads this queue and builds an in-memory reward distribution (a histogram) for each page, which is then used by the learning algorithm.
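A sketch of the per-page histogram a bolt might build; the bucket width and method names are my own choices, not necessarily what avenir uses:

```python
from collections import defaultdict

class RewardHistogram:
    """In-memory reward distribution for one page, bucketed by CTR value."""
    def __init__(self, bucket_width=0.01):
        self.bucket_width = bucket_width
        self.counts = defaultdict(int)
        self.total = 0

    def add(self, ctr):
        # Map the CTR sample to a bucket and count it
        self.counts[int(ctr / self.bucket_width)] += 1
        self.total += 1

    def mean(self):
        # Bucket mid-points weighted by their counts
        s = sum((b + 0.5) * self.bucket_width * n for b, n in self.counts.items())
        return s / self.total if self.total else 0.0

hist = RewardHistogram()
for ctr in (0.18, 0.20, 0.22):
    hist.add(ctr)
print(round(hist.mean(), 3))
```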
The particular reinforcement learning algorithm we are using is called Interval Estimation, and it uses second order statistics of the reward distribution to choose the action. Here is the implementation. For a given confidence interval, the upper confidence bound is found for the reward distribution of each action. The action whose reward distribution has the highest upper confidence bound is chosen.
Initially, the confidence interval is high (e.g., .95 or .90). A high confidence interval promotes more exploration: a distribution with a high mean or a high variance is likely to have a high upper confidence bound. The confidence interval is gradually reduced as the algorithm makes progress and transitions from the exploration phase into the exploitation phase.
With a lower confidence interval, only a distribution with a high mean will have a high upper confidence bound. So, as the algorithm goes well into the exploitation phase, it will choose the action with the highest mean reward, e.g., the page with the highest click through rate in our example.
When the learning engine starts, there is not enough data in the reward distributions. Until a certain minimum number of data points is available for each reward distribution, the algorithm chooses an action randomly. This is a pure exploration phase.
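Putting the last few paragraphs together, the selection logic can be sketched as follows. I approximate the upper confidence bound as mean plus z times the standard deviation under a Gaussian assumption; the actual implementation may compute the bound directly from the histogram, and all the numbers here are made up:

```python
import random
import statistics

MIN_SAMPLES = 5   # below this, a page is picked at random (pure exploration)

def upper_conf_bound(rewards, conf):
    # Gaussian approximation: upper bound = mean + z * std dev,
    # where z is the one-sided quantile for the confidence level
    z = statistics.NormalDist().inv_cdf(conf)
    return statistics.mean(rewards) + z * statistics.stdev(rewards)

def choose_page(rewards_by_page, conf):
    # Pure exploration until every page has enough reward samples
    if any(len(r) < MIN_SAMPLES for r in rewards_by_page.values()):
        return random.choice(list(rewards_by_page))
    return max(rewards_by_page,
               key=lambda p: upper_conf_bound(rewards_by_page[p], conf))

# Made-up CTR (reward) samples per page
rewards = {
    "page1": [0.04, 0.05, 0.06, 0.05, 0.05],   # low mean
    "page2": [0.00, 0.20, 0.02, 0.18, 0.10],   # medium mean, high variance
    "page3": [0.19, 0.20, 0.21, 0.20, 0.20],   # high mean, low variance
}
print(choose_page(rewards, 0.95))   # page2: high variance still wins
print(choose_page(rewards, 0.55))   # page3: the mean now dominates
```

Note how lowering the confidence level flips the choice from the high-variance page to the high-mean page, which is exactly the exploration to exploitation transition described above.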
The Storm topology consists of a Redis spout followed by multiple bolts. The bolts are linked to the spout through shuffle grouping, which balances load across the bolt instances. Here is the bolt implementation. All Storm configuration parameters (e.g., the number of bolts) and algorithm related parameters are provided through a properties file when deploying the Storm topology.
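The properties file might look something like this; the key names below are hypothetical, for illustration only, and not avenir's actual configuration keys:

```properties
# Hypothetical configuration keys, for illustration only
redis.server.host=localhost
redis.server.port=6379
num.request.bolts=4
learner.algorithm=intervalEstimator
learner.confidence.start=0.95
learner.min.reward.samples=20
```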
The different reinforcement learning algorithm implementations are decoupled from the bolt implementation. All reinforcement learning algorithms implement a common interface. An algorithm ID is provided to Storm, and based on the ID, a reinforcement learner factory class creates the learner instance. A big advantage of the decoupling is that the learners can be reused in other cluster computing frameworks, e.g., Hadoop.
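The decoupling can be sketched like this; the class and function names are illustrative, not avenir's actual API:

```python
import random

class ReinforcementLearner:
    """Common interface implemented by every learning algorithm."""
    def select_action(self):
        raise NotImplementedError
    def set_reward(self, action, reward):
        raise NotImplementedError

class RandomLearner(ReinforcementLearner):
    """Trivial learner, used here just to exercise the factory."""
    def __init__(self, actions):
        self.actions = actions
    def select_action(self):
        return random.choice(self.actions)
    def set_reward(self, action, reward):
        pass   # a real learner would update its reward distribution here

def create_learner(algorithm_id, actions):
    # The bolt only knows the algorithm ID from configuration;
    # the factory maps it to a concrete learner class
    registry = {"random": RandomLearner}
    return registry[algorithm_id](actions)

learner = create_learner("random", ["page1", "page2", "page3"])
print(learner.select_action())
```

Because the bolt talks only to the interface, the same learner classes could be driven from a Hadoop job instead of a Storm bolt.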
Since each bolt needs the CTR distribution for each page, it reads the queue containing the CTR data for the different pages and builds an in-memory cache of the CTR distributions. Since all bolt instances read the same Redis queue containing the CTR data, the queue is read in an immutable way: an item is not removed from the queue after it is read.
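The difference between a destructive and an immutable read is easy to illustrate with a plain list standing in for the Redis queue; with Redis, this corresponds to using the non-destructive LRANGE command instead of LPOP:

```python
# Made-up CTR data items as they might sit on the queue
ctr_queue = ["page1,0.05", "page2,0.11", "page3,0.20"]

def read_immutable(queue):
    # Equivalent of LRANGE 0 -1: items stay on the queue after being read
    return list(queue)

bolt_a = read_immutable(ctr_queue)
bolt_b = read_immutable(ctr_queue)
print(len(ctr_queue))   # still 3: both bolt instances saw every item
```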
In our example, we have 3 candidate pages. The first page has a low mean CTR. The second page has a medium mean CTR, but a high variance. The third page has a high mean CTR. We expect to see the second and the third pages selected during the early stage. However, as the algorithm goes deeper into the exploitation phase, the selection will converge on the third page.
Here is some output from the early phase of the algorithm. We see both page2 and page3 in the output.
cea24d0a-799c-11e3-8789-1c659d492701,page2
cea26498-799c-11e3-8789-1c659d492701,page2
cea272bc-799c-11e3-8789-1c659d492701,page2
cea28054-799c-11e3-8789-1c659d492701,page2
cea29292-799c-11e3-8789-1c659d492701,page2
cea29ff8-799c-11e3-8789-1c659d492701,page2
cea2ba24-799c-11e3-8789-1c659d492701,page2
cea2ca3c-799c-11e3-8789-1c659d492701,page3
cea2d7d4-799c-11e3-8789-1c659d492701,page2
cea2e5d0-799c-11e3-8789-1c659d492701,page3
cea2f39a-799c-11e3-8789-1c659d492701,page2
cea30128-799c-11e3-8789-1c659d492701,page3
Here is some sample output from well into the exploitation phase. We find the solution converging on page3.
cea3858a-799c-11e3-8789-1c659d492701,page3
cea390ac-799c-11e3-8789-1c659d492701,page3
cea3a60a-799c-11e3-8789-1c659d492701,page3
cea39be2-799c-11e3-8789-1c659d492701,page3
cea3c0ae-799c-11e3-8789-1c659d492701,page3
cea3aff6-799c-11e3-8789-1c659d492701,page3
cea3ceaa-799c-11e3-8789-1c659d492701,page3
cea3db8e-799c-11e3-8789-1c659d492701,page3
The solution can be used for various other problems, e.g., ad serving or content selection on a news web site. Here is a tutorial on how to get everything running for our example.