## A Learning but Greedy Gambler

In multi-armed bandit (MAB)  problem, a gambler must decide which arm of K slot machines to pull in sequence of N rounds of pulls to maximize the overall return. Many real life optimization and decision making problems can be modelled this way. Some example are deciding which advertisement content  to serve among several alternatives,  which web page to serve among many design alternatives, what sub set of products to sell among many and the optimum price for a product.

The algorithms to solve multi-armed bandit problem come under the umbrella of Reinforcement Learning (RL). In Reinforcement Learning an agent learns by trial and error experiments with a dynamic environment. The agent makes a decision, gets reward from the environment. The reward obtained is factored into future decision making.

In this post I will do an overview of various MAB algorithms. It will be followed by several post on actual  Hadoop map reduce implementations of them. The implementations will be part of my open source project avenir. I will  be using merchandising and pricing as the use cases.

## Multi Armed Bandit

Like any RL problem, MAB algorithms are characterized by exploration and exploitation. The dilemma is to find the balance between exploring actions that have not been taken before or have been tried very few times to further knowledge and exploiting knowledge already acquired from actions that have been already taken multiple times.

• exploitation: Maximization of expected reward, although it can not be done with certainty, because we only have estimates of rewards for actions already taken
• exploration: The environment is explored, in order to make estimates of rewards for action taken more accurate

If you explore too much, you might be losing out for not exploiting an action that is already proved to be highly rewarding. With too much exploitation,   you  are likely to choose a sub optimal action because of the lack of exploration and resulting poor  knowledge of reward from actions.

The various algorithms are characterized by  the way exploration and exploitation phases are combined and how actions are chosen in each round. In some cases, there is a distinct exploration phase when the agent acquires knowledge based on the rewards obtained from decisions taken. It ‘s followed by an exploitation phase when the acquired knowledge is exploited to make optimum decision.

In other cases, exploration and exploitation are blended  in. To start with they are  predominantly exploratory  and as the number of rounds go up they become more exploitative.

The ultimate goal of these algorithms is to minimize regret. Regret in is context is defined as the cumulative reward from  a sequence of optimum actions and the cumulative rewards from the sequence of actions as dictated by the algorithms.

There are lots of reinforcement learning and multi armed bandit algorithms in the literature. I will be covering only the important and popular ones in this post. These algorithms come under the following  broad categories as below.

## Pricing Engine Use Case

I will be citing the example of optimum pricing engine while exploring through the algorithms. For a pricing engine, a product has a set of prices.  We may try all the prices until we find the one that gives us the highest return.  We may let the algorithm select a price and we use that for a sales cycle of a week or two weeks. The return we get for the price goes back to the engine and the engine uses that information for the selection of the next price for the next sales cycle.

Optimum price may have  a temporal aspect to it. The price that is optimum today may not be so 6 months down. When it is suspected that the price we have been using may not be the optimum any more, it may be necessary to reactivate the engine and find a new optimum price.

## Greedy Strategy

An action is chosen with a small probability (e). Otherwise the action that has so far returned the highest average reward is chosen. This is a very simple algorithm called epsilon-greedy. The probability is chosen by the user. A higher probability indicates more exploration. In case of a pricing engine, the algorithms will choose either a price at random with small probability or it will choose a price that has given us the highest average return so far.

A variant of this algorithm is to decrease the probability as the number of rounds (t) goes up. It’s called epsilon-reducing-greedy algorithm. Intuitively, it makes lot of sense. As the number of rounds go up, the probability of choosing an action randomly goes down i.e., there is less exploration and more exploitation.  One way to reduce the probability is as below

e(t) = min (1, e*a /t)
where
e(t) = probability at round t
e = initial probability at round 1
a = user chosen constant
t = current round number

Another way to reduce the probability which has more statistical grounding is shown below. It is a function of the difference in rewards between the two highest rewarding actions, total number of actions and the current round number.

e(t) = min(1,c*K/d*d*t)
where
c = user chosen constant
K = total number of actions
d = difference in rewards between the two highest rewarding actions
t = current round number

For small d, the probability is high. The implication is that when there is no clear winner between the actions, more exploratory trials are necessary.

## Upper Confidence Bound

These algorithms keep track number of times each items has been tried. They implement optimism in the face of uncertainty.

The first algorithm  called UCB1, combines exploration and exploitation in a way described below. It chooses the next action based on rewards obtained in past and the number of times action has been chosen relative to the total number of rounds.

s(j) = m(j) + sqrt(2 * ln(t) / t(j))
where
m(j) = average reward of action j
t = total number of rounds
t(j) = number of rounds for action j

The action with highest s(j) is chosen. Generally the action with highest average reward will be chosen. However as an action gets  chosen less frequently t(j) drops relative to t and the second term becomes more dominant and causes action to be chosen over action with higher reward.

The next algorithms called UCB1-Tuned, takes into account the variance of reward. The choice of the action for the next round is based on the highest value of s(j).

s(j) = m(j) + sqrt( (ln(t) / t(j)) * min(1/4, V(t(j)))
where
V(t(j)) = var(j) + sqrt(2 * ln(t) / t(j))
m(j) = mean reward for action j
var(j) = variance of reward for action j
t = total number of rounds
t(j) = number of rounds for action j

## Exploration First Greedy Strategy

Another variation of the greedy strategy, has a distinct exploration phase in the beginning followed by exploitation in the later rounds. The question is how may times to try each action  during the exploration phase, that will give us enough reward data that can be leveraged during the exploitation phase.

The simplest approach is to try each action some user decided number of times. This is commonly know as A/B testing, which is popular technique in optimizing web page design and marketing.  The dilemma here is deciding how many times to explore. If there is clear winner among the actions, a limited number of exploration trial will suffice. But, we don’t have any prior knowledge of that.

A more advanced approach consists of estimation of this count  based on Probably Approximate Correct (PAC) framework. To set the context we need to define a d-optimal action. A d-optimal action is an action whose reward is at most d away from the reward of an optimal action with probability (1-p). To find such action the number of trials needed for action during the exploration phase is as follows

t = 4/(d*d)  * ln (2*K /p)
where
t = the number of trials
d = difference in reward as mentioned earlier
K = number of actions
p = probability as mentioned earlier

## SoftMax Strategy

In this strategy an action is chosen randomly  according to Gibbs (or Boltzmann) distribution. Gibbs distribution is used to express the distribution of states in a system. it’s expressed as below in the context of our problem. It’s root goes back to statistical mechanics.

p(i) = exp(r(i)/T) / sum(exp(r(j)/T))
where
p(i) = probability of action i
r(i) = average reward for action i
T = parameter called temperature

As you can tell, higher the average reward obtained for an action, more likely is it to be chosen in the next round. But it does not rule out choice of other non optimal action. The parameter T is chosen by the user. A large value for T indicates more exploration, as it tends to flatten the probability distribution.

As in the greedy strategy, the parameter T could be decrease as the number of rounds go up, to give more weight to exploitation. It could be decreased linearly with the number of rounds as below. It’s called decreasing SoftMax.

T(t) = T * c / t
where
T(t) = value of T after t rounds
c = user defined constant
t = current round number

Another SoftMax strategy that is more complex is called exp3. According this algorithm the probability of choosing an action i in round t is as follows

p(i,t) =  (1 – g) * w(i,t) / sum(w(j,t))  + g /K
where
p(i,t) = probability of choosing action i at round t
g = user defined constant
K = number of actions

w(j,t+1) = w(j,t) * exp(g * r(j,t) / p(j,t) * K) if action j was chosen at round t other wise w(j,t+1) = w(j,t)
where
r(j,t) = reward from action j at round t
p(j,t) = probability of choosing action j at round t

The distribution is a mixture of uniform distribution and  and a distribution based on the exponential of the reward for an action. The uniform part of the distribution ensures that all actions get a fair chance. The parameter g controls the relative weight between uniform distribution and the exponential based distribution.

## Interval Estimation Strategy

This class algorithms are based on second order information about the mean and variance of rewards of actions. An action is chosen based on the upper bound of a 100 * (1-a) % confidence interval of the reward probability of each action.

The action with the highest upper bound is chosen. Smaller values of the parameter a encourages more exploration, as it tends to include actions with higher reward variance.

If the reward distribution for an action is normal, then there is closed form solution for the upper bound.  For example, here is the reward upper bound  value for an action when the true mean and variance are not known.

u = m(n) + s(n) * t(a,n) / sqrt(n)
where
u = upper bound at (100-a)%  confidence interval
m(n) = estimated mean reward from n trials
s(n) =  estimated std dev of reward from n trials
t(a,n) = student’s t function at confidence level a/2 with n-1 degree of freedom
n = number of trials for the action

If the normal distribution assumption for reward is invalid, then one has to resort to non parametric estimation based on available reward data for an action.

To implement the algorithms, the Hadoop MR job needs to run iteratively. For many problems there is a user defined fixed time gap between successive iteration, which we could call trial period. For the Pricing Engine we could run the the algorithms to select price then try that price, say for 2 weeks. Then the trial period is 2 weeks. In each iteration we increment the trial or round counter and use the reward data from the past trials as an input.

Sometimes the the gap between the successive iterations is random and based on events in the external world and Hadoop batch processing may not be appropriate and serious consideration should be given for Storm.  Consider the use case where an advertiser is running. The advertiser has multiple content alternatives and  these algorithms are being used to find the optimal one. Whenever the target web page gets loaded, one among the many alternative content needs to be chosen by the algorithm. The reward is defined in terms of click through rate. For this use case, real time response is needed and Storm may be more appropriate.

## References

My post draws on material from several technical papers. Here are some of the important ones.

1. Multi-Armed Bandit Algorithms and Empirical Evaluations   by Joannes Vermorel and Mehryar Mohri
2. Finite-time Analysis of the Mutiarmed Bandit Problem  by Peter Auer, Nicolo Cesa-Bianchi and Paul Fischer
3. Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems by Eyal Even-Dar, Shie Mannor and Yishay Mansour
4. Exploration of Multi-State Environments: Local Measures and Back-Propagation of Uncertainty by Nicolas Meuleau and Paul Bourgine

For commercial support for any solution in my github repositories, please talk to ThirdEye Data Science Services. Support is available for Hadoop or Spark deployment on cloud including installation, configuration and testing,