Boost Lead Generation with Online Reinforcement Learning


When I go to a web site to download a white paper or a product data sheet, I often hit the back button if I am presented with a form asking for lots of personal data. Any user that bounces out is a potential lost lead. Perhaps a redesigned page, asking for fewer personal details, would have prevented the problem.

In this post we will work on a solution using reinforcement learning algorithms. We will have multiple candidate page designs and have a reinforcement learning algorithm find the optimal page, i.e., the one with the highest click through rate. The algorithm will run in real time, deployed on Storm, and process page requests from users. This implementation is part of my project avenir.

Reinforcement Learning

These algorithms make decisions in an uncertain environment and, based on the feedback, i.e., the reward obtained from the decisions made, eventually converge on the optimal decision. These algorithms have two different phases, exploration and exploitation. During exploration, the focus is on learning from the environment, and the algorithm will make decisions that may be sub optimal.

During the exploitation phase, it exploits the knowledge learned from the environment to make the optimal decision. The algorithm starts with the exploration phase and then transitions over to the exploitation phase. However, the transition may not always be crisp, depending on the algorithm being used. My earlier post contains a survey of different reinforcement learning algorithms.

High Level Architecture

We will be using Storm and Redis for the implementation. A page request consists of a session ID and a counter which increments with every new request. The request is fed through a Redis queue. The learning engine decides which page to display. The response consists of the session ID and the page ID, and is written to another Redis queue.
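To make the data flow concrete, here is a minimal sketch, using the Jedis client, of how a request could be pushed to the request queue and a response read from the response queue. The queue names and the comma separated message format are assumptions for illustration; the actual names come from the configuration.

import redis.clients.jedis.Jedis;

public class RequestResponseDemo {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost", 6379);

        // push a page request: sessionID,counter (queue name assumed)
        jedis.lpush("pageRequestQueue", "cea24d0a-799c-11e3-8789-1c659d492701,1");

        // the learning engine consumes the request and writes sessionID,pageID
        // to the response queue; a client would poll it like this
        String response = jedis.rpop("pageResponseQueue");
        if (response != null) {
            String[] items = response.split(",");
            System.out.println("session: " + items[0] + "  page: " + items[1]);
        }
        jedis.close();
    }
}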

After enough impression data for a given page has accumulated, a Python script simulates the click through rate (CTR), which is essentially the reward for the decisions made by the learning engine. I have assumed a Gaussian distribution for the click through rate, with a different mean and variance for each page.
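The actual simulation in the project is a Python script; the fragment below is just a small Java sketch of the same idea, drawing a CTR sample from a Gaussian with a per-page mean and standard deviation. The class name and the example numbers are made up for illustration.

import java.util.Random;

public class CtrSimulator {
    private final Random random = new Random();

    // draw one CTR sample for a page; mean and stdDev are per-page parameters
    public double simulateCtr(double mean, double stdDev) {
        double ctr = mean + stdDev * random.nextGaussian();
        return Math.max(0.0, ctr);   // CTR cannot be negative
    }

    public static void main(String[] args) {
        CtrSimulator sim = new CtrSimulator();
        // e.g., a page with 4% mean CTR and 1% standard deviation
        System.out.println("simulated CTR: " + sim.simulateCtr(0.04, 0.01));
    }
}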

The click through rate data is fed into the learning engine through another Redis queue. The Storm bolt reads this queue and builds an in-memory reward distribution (a histogram) for each page. This distribution data is what the learning algorithm uses.
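Here is a minimal sketch of the kind of in-memory reward distribution a bolt might maintain, as a simple histogram keyed by page ID. The class name and the bin width parameter are assumptions for illustration, not the actual avenir classes.

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// simple per-page reward histogram; bin width is an assumed parameter
public class RewardHistogram {
    private final double binWidth;
    private final Map<String, TreeMap<Integer, Integer>> histograms =
        new HashMap<String, TreeMap<Integer, Integer>>();

    public RewardHistogram(double binWidth) {
        this.binWidth = binWidth;
    }

    // add one CTR sample for a page
    public void add(String pageId, double ctr) {
        TreeMap<Integer, Integer> hist = histograms.get(pageId);
        if (hist == null) {
            hist = new TreeMap<Integer, Integer>();
            histograms.put(pageId, hist);
        }
        int bin = (int) (ctr / binWidth);
        Integer count = hist.get(bin);
        hist.put(bin, count == null ? 1 : count + 1);
    }

    public TreeMap<Integer, Integer> get(String pageId) {
        return histograms.get(pageId);
    }
}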

Interval Estimate

The particular reinforcement learning algorithm we are using is called Interval Estimation, and it uses second order statistics of the reward distribution to choose the action. Here is the implementation. For a given confidence interval, the upper confidence bound is found for the reward distribution of each action. The action with the highest upper confidence bound in its reward distribution is chosen.

Initially, the confidence interval is high (e.g., 0.95 or 0.90). A high confidence interval promotes more exploration: with a high confidence interval, a distribution with a high mean or a high variance is likely to have a high upper confidence bound. The confidence interval is gradually reduced as the algorithm makes progress, and it transitions from the exploration phase into the exploitation phase.

With a lower confidence interval, a distribution with a high mean will have a high upper confidence bound. So, as the algorithm goes well into the exploitation phase, it will choose the action with the highest mean reward, e.g., the page with the highest click through rate in our example.

When the learning engine starts, there is not enough data in the reward distributions. Until a certain minimum number of data points is available for each reward distribution, the algorithm chooses an action randomly. This is a pure exploration phase.
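Below is a minimal sketch of the selection logic described above: a random choice until every page has a minimum number of reward samples, otherwise the page whose reward distribution has the highest upper confidence bound, with the confidence limit shrinking over time. The use of an empirical percentile over the samples and all parameter names are assumptions for illustration, not the actual avenir implementation.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class IntervalEstimator {
    private final int minSamples;          // samples needed per page before leaving pure exploration
    private double confidenceLimit;        // e.g., starts at 0.95
    private final double reductionStep;    // amount to reduce the limit per decision
    private final double minConfidenceLimit;
    private final Random random = new Random();

    public IntervalEstimator(int minSamples, double confidenceLimit,
                             double reductionStep, double minConfidenceLimit) {
        this.minSamples = minSamples;
        this.confidenceLimit = confidenceLimit;
        this.reductionStep = reductionStep;
        this.minConfidenceLimit = minConfidenceLimit;
    }

    // rewards: page ID -> observed CTR samples for that page
    public String selectPage(Map<String, List<Double>> rewards) {
        List<String> pages = new ArrayList<String>(rewards.keySet());

        // pure exploration: pick a random page until every page has enough data
        for (List<Double> samples : rewards.values()) {
            if (samples.size() < minSamples) {
                return pages.get(random.nextInt(pages.size()));
            }
        }

        // otherwise pick the page with the highest upper confidence bound
        String best = null;
        double bestBound = Double.NEGATIVE_INFINITY;
        for (String page : pages) {
            double bound = upperConfidenceBound(rewards.get(page));
            if (bound > bestBound) {
                bestBound = bound;
                best = page;
            }
        }

        // shrink the confidence limit, moving from exploration toward exploitation
        confidenceLimit = Math.max(minConfidenceLimit, confidenceLimit - reductionStep);
        return best;
    }

    // empirical percentile of the samples at the current confidence limit
    private double upperConfidenceBound(List<Double> samples) {
        List<Double> sorted = new ArrayList<Double>(samples);
        Collections.sort(sorted);
        int index = (int) Math.ceil(confidenceLimit * sorted.size()) - 1;
        return sorted.get(Math.max(0, index));
    }
}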

Storm Topology

The Storm topology consists of a Redis spout followed by multiple bolts. The bolts are linked to the spout through shuffle grouping, which balances the load across the bolt instances. Here is the bolt implementation. All Storm configuration parameters (e.g., the number of bolts) and algorithm related parameters are provided through a properties file when deploying the Storm topology.
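A minimal sketch of what the topology wiring might look like is shown below. The spout and bolt class names, the component IDs, and the property key are assumptions standing in for the actual avenir classes, and the package names are from the Storm 0.9.x era of this post.

import java.io.FileInputStream;
import java.util.Properties;

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class PageSelectionTopology {
    public static void main(String[] args) throws Exception {
        // Storm and algorithm parameters come from a properties file
        Properties props = new Properties();
        props.load(new FileInputStream(args[0]));
        int numBolts = Integer.parseInt(props.getProperty("num.bolts", "2"));

        TopologyBuilder builder = new TopologyBuilder();
        // Redis spout emits page requests read from the request queue
        builder.setSpout("requestSpout", new RedisSpout(props), 1);
        // shuffle grouping balances requests across the bolt instances
        builder.setBolt("learnerBolt", new PageSelectionBolt(props), numBolts)
               .shuffleGrouping("requestSpout");

        Config conf = new Config();
        StormSubmitter.submitTopology("pageSelection", conf, builder.createTopology());
    }
}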

The different reinforcement learning algorithm implementations are decoupled from the bolt implementation. All reinforcement learning algorithms implement a common interface. An algorithm ID is provided to Storm, and based on the ID, a reinforcement learner factory class creates the learner instance. A big advantage of the decoupling is that the learners can be reused in other cluster computing frameworks, e.g., Hadoop.
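Here is a minimal sketch of that decoupling: a common learner interface and a factory that creates a learner from the algorithm ID. The interface, method, and class names are assumptions for illustration, not the actual avenir API.

// common contract all learners implement (interface and method names assumed)
interface ReinforcementLearner {
    String selectAction();                         // e.g., return a page ID
    void setReward(String action, double reward);  // e.g., feed back an observed CTR
}

// an interval estimation learner would implement the interface (logic elided)
class IntervalEstimationLearner implements ReinforcementLearner {
    public String selectAction() { /* interval estimation logic goes here */ return "page1"; }
    public void setReward(String action, double reward) { /* update reward distribution */ }
}

// creates a learner instance from the algorithm ID supplied in the properties file
class ReinforcementLearnerFactory {
    public static ReinforcementLearner create(String algorithmId) {
        if ("intervalEstimation".equals(algorithmId)) {
            return new IntervalEstimationLearner();
        }
        throw new IllegalArgumentException("unknown algorithm: " + algorithmId);
    }
}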

Since each bolt needs the CTR distribution for each page, it reads the queue containing the CTR data for the different pages and builds an in-memory cache of the CTR distributions. Since all bolt instances read the same Redis queue containing the CTR data, the queue is read in a non-destructive way: items are not removed from the queue after they are read.
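A minimal sketch of the non-destructive read is shown below: each bolt instance scans the CTR queue with lrange, which leaves the items in place, instead of popping them. The queue name and the message format are assumptions, and RewardHistogram refers to the histogram sketch from earlier.

import java.util.List;
import redis.clients.jedis.Jedis;

public class CtrQueueReader {
    // read all CTR records without removing them from the queue
    public void refresh(Jedis jedis, RewardHistogram histograms) {
        // lrange does not consume items, so every bolt instance sees the same data
        List<String> records = jedis.lrange("ctrQueue", 0, -1);
        for (String record : records) {
            String[] items = record.split(",");   // pageID,ctr (format assumed)
            histograms.add(items[0], Double.parseDouble(items[1]));
        }
    }
}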

Example

In our example, we have 3 candidate pages. The first page has a low mean CTR. The second page has a medium mean CTR, but high variance. The third page has a high mean CTR. We expect to see both the second and the third page selected during the early stage. However, as the algorithm moves further into the exploitation phase, the selection will converge on the third page.

Here is some output from the early phase of the algorithm. We see both page2 and page3 in the output.

cea24d0a-799c-11e3-8789-1c659d492701,page2
cea26498-799c-11e3-8789-1c659d492701,page2
cea272bc-799c-11e3-8789-1c659d492701,page2
cea28054-799c-11e3-8789-1c659d492701,page2
cea29292-799c-11e3-8789-1c659d492701,page2
cea29ff8-799c-11e3-8789-1c659d492701,page2
cea2ba24-799c-11e3-8789-1c659d492701,page2
cea2ca3c-799c-11e3-8789-1c659d492701,page3
cea2d7d4-799c-11e3-8789-1c659d492701,page2
cea2e5d0-799c-11e3-8789-1c659d492701,page3
cea2f39a-799c-11e3-8789-1c659d492701,page2
cea30128-799c-11e3-8789-1c659d492701,page3

Here is some sample output from when the algorithm is well into the exploitation phase. We find the selection converging on page3.

cea3858a-799c-11e3-8789-1c659d492701,page3
cea390ac-799c-11e3-8789-1c659d492701,page3
cea3a60a-799c-11e3-8789-1c659d492701,page3
cea39be2-799c-11e3-8789-1c659d492701,page3
cea3c0ae-799c-11e3-8789-1c659d492701,page3
cea3aff6-799c-11e3-8789-1c659d492701,page3
cea3ceaa-799c-11e3-8789-1c659d492701,page3
cea3db8e-799c-11e3-8789-1c659d492701,page3

Summing Up

The solution can be used for various other problems, e.g., ad serving, content selection on a news web site, etc. Here is a tutorial on how to get everything running for our example.

For commercial support for this solution or other solutions in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment in the cloud, including installation, configuration and testing.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains, for early stage startups, large corporations and everything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming Languages. I am fascinated by problems that don't have a neat closed form solution.