Online Advertisement Placement with Contextual Bandit.


There is a class of algorithms called Multi Arm Bandit (MAB) applicable to Reinforcement Learning problems without state. Sometimes there is side information available for each action. MAB algorithms that take this side information, aka context, into account are called Contextual Bandits. The decision made by a Contextual Bandit at any given time is a function of the context in addition to past rewards. In this post we will use a Contextual Bandit algorithm called Linear Payoff Upper Confidence Bound (LinUCB) to make online ad placement decisions.

The implementation is available in my Reinforcement Learning Python package called qinisa. Here is the GitHub repo.

Contextual Bandit

In Reinforcement Learning (RL), an agent takes an action in a given state of the environment; at a later time the agent gets a reward from the environment and the state changes. The goal of RL is to select actions so as to maximize the long term cumulative reward. Multi Arm Bandit (MAB) applies to a class of RL problems without state. In a Contextual Multi Arm Bandit there is a context based on the agent and the environment. The context is not exactly the same as the state in RL.

A contextual bandit is a continuous decision making process under uncertain conditions. Its goal, as in any Reinforcement Learning system, is to maximize the reward over a time horizon. It runs in a loop as follows, making decisions when asked to and getting rewards for past decisions.

  • Pass the context for each action to the model and get the action it selects. The action is taken.
  • For some past action taken, when the reward is available from the environment, pass the action and reward to the model. The model updates itself based on the reward feedback.

The context is dynamic. The context could be the same for all actions or different for each action. The context could change with time. The context is generally represented as a vector.

There are many contextual bandit algorithms. Here are some examples:

  • Linear Upper Confidence Bound
  • Thompson Sampling with Linear Payoff
  • Epoch Greedy

In LinUCB, the expected reward for an action is assumed to be a linear function of the feature vector for the action. Based on the actual rewards received, a linear regression problem is solved. It uses a ridge regularizer, aka L2 regularizer. LinUCB also calculates an upper confidence bound for the reward and selects the action with the highest estimated upper confidence bound.
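To make the selection and update steps concrete, here is a minimal sketch of disjoint LinUCB in Python with numpy. This is a generic illustration of the algorithm as described above, not the qinisa implementation; the class name, the method names and the exploration parameter alpha are my own choices.

import numpy as np

class LinUCB:
    """Minimal disjoint LinUCB sketch: one ridge regression model per action."""

    def __init__(self, actions, num_features, alpha=1.0):
        self.alpha = alpha  # width of the confidence bound, i.e. amount of exploration
        # A = D^T D + I (ridge) and b = D^T rewards, maintained per action
        self.A = {a: np.eye(num_features) for a in actions}
        self.b = {a: np.zeros(num_features) for a in actions}

    def select(self, contexts):
        # contexts: dict mapping action -> feature vector for that action
        best_action, best_ucb = None, -np.inf
        for action, x in contexts.items():
            x = np.asarray(x, dtype=float)
            A_inv = np.linalg.inv(self.A[action])
            theta = A_inv @ self.b[action]  # ridge regression estimate of the reward weights
            ucb = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)  # upper confidence bound
            if ucb > best_ucb:
                best_action, best_ucb = action, ucb
        return best_action

    def update(self, action, x, reward):
        # incorporate the observed reward for the action taken with context x
        x = np.asarray(x, dtype=float)
        self.A[action] += np.outer(x, x)
        self.b[action] += reward * x

A larger alpha widens the confidence bound and makes the model try less explored actions more often, trading off exploration against exploitation.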

Advertisement Placement

In online advertisement, there is a platform through which advertisers and content publishers collaborate. The crux of the problem is matching content with advertisements to maximize click through. There is a set of advertisements and a set of pages on various websites where the advertisements are placed, based on contracts between advertisers and content publishers. When a page is about to come up in response to a user request, a decision needs to be made about which advertisement to place on the page. The advertisements are the actions for our problem.

The context for an advertisement is based on the similarity between the advertisement and the page, calculated along the 4 dimensions listed below; an example context vector follows the list. So the feature vector size is 4 for our example problem. Each similarity value is between 0 and 1, with 1 being maximum similarity. The context vector for this problem is different for each action and it also varies with time.

  • Content
  • Time of day
  • Device type
  • Geo location
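For illustration, the context for one site and advertisement pair is simply a vector of the four similarity values, something like the following (the numbers are made up):

# similarity scores for one (site, advertisement) pair, each in [0, 1]
# order: content, time of day, device type, geo location
context = [0.92, 0.55, 0.80, 0.31]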

Advertisement placement optimization is simulated in a loop as below; a code sketch of the loop follows the list. Please take a look at adpl.py for details. The bandit model starts with a clean slate and learns incrementally as it receives reward feedback. Learning is online.

  • Choose a site randomly
  • Take the feature vectors for the combination of that site and all advertisements. A feature vector for an advertisement and site combination consists of 4 similarity values
  • Change the geo location similarity randomly. You could also change the device type similarity. The time of day similarity could also be changed randomly every few hours
  • Pass the feature vectors of all advertisements to the contextual bandit model. It returns the advertisement to run. That's the decision made by the LinUCB model.
  • Simulate a reward for that advertisement and site combination. Pass the reward to the contextual bandit model. The model updates itself based on the reward feedback. This is where learning takes place
  • At the end of a session you have the option to checkpoint the model and then restore the checkpointed model
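Here is a sketch of that simulation loop, assuming the LinUCB class from the earlier sketch is in scope. The sites, advertisements, base similarities and the simulated reward are made up stand-ins for what adpl.py actually does.

import numpy as np

rng = np.random.default_rng(42)
sites = ["site_" + str(i) for i in range(1, 9)]
ads = ["adv_" + str(i) for i in range(1, 6)]

# fixed base similarities (content, time of day, device type, geo location)
# for every site and advertisement combination, each in [0, 1]
base_features = {(s, a): rng.uniform(0.3, 1.0, 4) for s in sites for a in ads}

model = LinUCB(ads, num_features=4, alpha=1.0)   # class from the sketch above

for step in range(1000):
    site = rng.choice(sites)                     # choose a site randomly
    contexts = {}
    for ad in ads:
        x = base_features[(site, ad)].copy()
        x[3] = rng.uniform(0.3, 1.0)             # change geo location similarity randomly
        contexts[ad] = x

    ad = model.select(contexts)                  # decision made by the LinUCB model

    # simulate a reward: click through roughly proportional to overall similarity plus noise
    x = contexts[ad]
    reward = float(np.clip(x.mean() + rng.normal(0, 0.05), 0.0, 1.0))
    model.update(ad, x, reward)                  # this is where learning takes place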

Please take a look at the tutorial document for details on how to run it. The implementation of LinUCB is on GitHub in case you are curious.

Here is some console output. Each row shows the advertisement selected by the LinUCB model for a given site along with the actual reward obtained.

site site_1   adv adv_5  features 0.981 0.700 0.475 0.464  reward 0.560
site site_6   adv adv_2  features 0.594 0.914 0.461 0.779  reward 0.620
site site_3   adv adv_4  features 0.976 0.813 0.460 0.857  reward 0.802
site site_8   adv adv_1  features 0.973 0.843 0.667 0.645  reward 0.609

You can get more detailed output in the log. Here is a sample. It shows the expected reward upper bound, based on which the model selects the action.

2023-07-26 12:12:04,965 - qinisa.cmab - INFO - play count 174  action adv_5  reward upper bound 2.034
2023-07-26 12:12:04,965 - qinisa.cmab - INFO - action adv_5  feature 0.894 0.691 0.460 0.707  actual reward 0.697
2023-07-26 12:12:04,966 - qinisa.cmab - INFO - play count 175  action adv_4  reward upper bound 0.463
2023-07-26 12:12:04,966 - qinisa.cmab - INFO - action adv_4  feature 0.750 0.599 0.678 0.634  actual reward 0.694
2023-07-26 12:12:04,966 - qinisa.cmab - INFO - play count 176  action adv_3  reward upper bound 2.722
2023-07-26 12:12:04,966 - qinisa.cmab - INFO - action adv_3  feature 0.892 0.912 0.425 0.421  actual reward 0.579
2023-07-26 12:12:04,967 - qinisa.cmab - INFO - play count 177  action adv_2  reward upper bound 1.538
2023-07-26 12:12:04,967 - qinisa.cmab - INFO - action adv_2  feature 0.594 0.914 0.461 0.806  actual reward 0.670
2023-07-26 12:12:04,967 - qinisa.cmab - INFO - play count 178  action adv_5  reward upper bound 2.159
2023-07-26 12:12:04,967 - qinisa.cmab - INFO - action adv_5  feature 0.894 0.691 0.460 0.474  actual reward 0.552
2023-07-26 12:12:04,968 - qinisa.cmab - INFO - play count 179  action adv_5  reward upper bound 2.365
2023-07-26 12:12:04,968 - qinisa.cmab - INFO - action adv_5  feature 0.981 0.700 0.475 0.835  actual reward 0.784

Concluding Thoughts

Contextual Bandit is a powerful RL technique suitable for many applications such as recommendation, search and advertisement.

About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay Area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with a myriad of technologies and platforms in various business domains, for early stage startups, large corporations and everything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NoSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have a neat closed form solution.
