In an earlier post, I did a survey of a class of reinforcement learning algorithms, known as *Multi Arm Bandit(MAB)* . Essentially, these algorithms make decisions and learn from rewards received from the environment. You can also think of them as experimental learning or trial and error learning.

Now it’s time put them into practice. So, in this post, my focus is on using some of these algorithms for a real life use case. Our use case is price optimization. When selling a product, how do you choose the optimum price, to maximize profit. It’s an ubiquitous problem faced by any one selling anything.The map reduce implementation is available in my open source project* avenir,* hosted on github.

If a price is too low, net profit could be low, because of the low per unit margin. On the other hand, if the price is too high, the net profit could be low, because of low sell volume. Some where in between, lies the optimum price. In this post, we will find out how to find the optimum price using Hadoop based implementation of some of the MAB algorithms, discussed in my earlier post. We will try two different algorithms.

## Pricing Optimization

Let’s consider a eCommerce company, selling hundreds of products. It’s difficult for the company to decide on the optimum price. Moreover, optimum price is very dynamic and changes with time. The company being very much data and analytic driven, decides to use the MAB algorithms as a guide to decide the optimum price. Optimum price determination is a continuous process that runs periodically e.g. every 2 weeks

This is how the solution works

*Each product has a set of candidate prices, out of which the algorithm selects one, in every round.**Pricing optimizer has a cycle of e.g., 2 weeks i.e., once every 2 weeks it determines the optimum price for all products to be effective for the next 2 weeks, based on among other things, the profit data available so far.*

The profit data for the last 2 weeks is fed into the pricing optimizer, which will take it into consideration when recommending price for a product for the next round. As explained in my earlier post, these algorithms judiciously balances exploration (e.g., try a price that has not been tried before or not tried enough) and exploitation (e.g, choosing a price that has yielded maximum profit from recent trials).

After enough trials, the algorithms converge on an optimum price for a product, and the ratio of exploitation to exploration goes up during the process. However, even after such convergence, it may be premature to think that our job is done.

As time goes, the current optimum price may not be valid any more because of changes in market condition, economy or competing products. In such cases, it is necessary to run the algorithms with appropriate parameters to induce more exploration to handle changes and find a new optimum price.

In our case, changing conditions will be manifested in terms of lower profit data. Many algorithms will automatically lean more toward exploration and find a new optimum price. For some algorithms, we may need to change some parameters to nudge the algorithm more towards exploration. For example, for epsilon-greedy, we may increase the random selection probability threshold.

## Price for Learning

You may wonder about the potential losses for not selecting the optimum prices as we go through the rounds of trials. *Well, that is the price to pay to find the optimum price.*

Fortunately, most of the MAB algorithms guarantee an upper bound on regret. Regret is defined as the aggregate loss for not choosing the optimum price in the different round of trials. A good MAB algorithm will always try to minimize the regret.

## Input

The input for all the MAB map reduce jobs have the following format. The source for data in field 3, 4 and 5 are sells and profit data. I have simulated the profit data with a python script.

*Group ID (e.g. product ID)**Item (e.g., candidate price)**Number of trials (e.g., number of times a price has been tried)**Total reward (e.g., total profit)**Average reward (e.g. average profit)*

In real life, the source of the data for columns 3, 4 and 5 are sales data for a given price of a product.

## Random Greedy

The first algorithm we will use is called *epsilon-greedy*. The optimizer selects a price randomly with a probability below some specified level. The price selected may or may not have been tried before. Otherwise, it selects the best price so far based on profit data for a product and price. With each round, the threshold probability is reduced linearly, causing less exploration and more exploitation. Details of the algorithm can be found in this MAB survey paper.

Here are the steps to select the price for a product for the next sales cycle. It involves two map reduce jobs.

*First, the MR class GreedyRandomBandit takes the input as outlined earlier.It generates price for each product, based on the algorithm selected.*

*Next, we need the profit data for the price selected for each product. I am using a python script to simulate the profit data, based on some pre defined distribution of profit for different prices.**Next, we need to incorporate the profit data into the current input to generate new set of average profit data for different prices for all products. The MR class RunningAggregator is used for this purpose.**Finally, the output of RunningAggregator goes as input to GreedyRandomBandit for the next round of price selection and the cycle continues.*

For every new cycle, the configuration parameter *current.round.num* is incremented. In the linear probability reduction algorithms we are using, the probability of choosing a price for a product randomly, decreases linearly with this parameter. As the round number goes up, the probability decreases and we transition out of the exploration into the exploitation phase.

Here is some sample input data for round 3 of the pricing engine. The rows with non zero data in column 3 onward, are prices that have been tried.

2438359,60,0,0,0 2438359,63,0,0,0 2438359,66,1,29628,29628 2438359,69,0,0,0 2438359,72,1,32219,32219 2438359,75,0,0,0 2438359,78,0,0,0 2438359,81,0,0,0 2438359,84,0,0,0 2438359,87,0,0,0

Here are sample prices selected for some products. For the product ID 2438359, the price selected (87) was not tried before as we can tell from the data above. The exploratory aspect of the algorithm triggered for this product in this round to select 87.

2187255,74 2365792,78 2438359,87 2449414,20 2586794,95 2672704,68

The MR class GreedyRandomBandiy implements following 3 algorithms. We are using the first one. The first two are different flavors of the epsilon-greedy algorithm.

*Random greedy with linear probability reduction**Random greedy with log linear probability reduction.**Auer random greedy*

## Auer Deterministic

The next algorithm we will use is deterministic in nature. It will generally choose the price that yielded the highest profit. However as we go through the rounds items not tried in recent past get more priority.

The price selected is a function of return from that price and how many times that price has been tried compared to total number of trials. The details are in my earlier post. The implementation is in the *AuerDeterministic* MR class.. All prices for a product are tried at least once, before the algorithm triggers. Additional theoretical details of the algorithm are available in the original paper by Auer.

After several rounds, here is the profit data for a particular product. After all the prices have been tried once before, the algorithm picks the the price that had the highest return in the current round.

4028189,79,1,23576,23576 4028189,82,1,23217,23217 4028189,85,2,48784,24392 4028189,88,1,23046,23046 4028189,91,1,22692,22692

in the next round, the algorithm may pick the price 79, because it’s the next best price and also because it’s been tried less than the price 85. That’s exactly what we find as in the result for the next round, as shown below.

4028189,79,2,45533,22766 4028189,82,1,23217,23217 4028189,85,2,48784,24392 4028189,88,1,23046,23046 4028189,91,1,22692,22692

It seems like, in the next round 85 will be chosen again. Eventually. After enough rounds the algorithm will converge with an optimum price.

## Wrapping Up

There are many MAB algorithms is literature. I have implemented some of the popular ones. As I implement more, I will be blogging about them in future. Here is a tutorial that contains details of running the map reduce jobs for price optimization.

For commercial support for this solution or other solutions in my github repositories, please talk to ThirdEye Data Science Services. Support is available for *Hadoop* or Spark deployment on cloud including installation, configuration and testing,

Hi Pranab, Thanks for nice cleaner post. This is very useful for understanding and Implementation purpose. I have one question – If I don’t have candidate prices for product instead of that I am having only single price value of product with candidate sales for various months(daily data). Then how can I utilize MAB (Multi Arm Bendint ) algorithm over my data. ?

Vignesh

I am not sure how your problem relates to this algorithm. You can think of these algorithms as experimental learning. The algorithms try different values of a parameter. The rewards from the choices are used by the algorithm to learn and make better decision to optimize reward.

Hi Pranab,

I’m able to run AuerDeterministic, GreedyRandomBandit, RandomFirstGreedyBandit and SoftMax Bandit . I also see that you have in-depth implementation of other aspects of Multi-Arm Bandit (Sampson sampling, Reward comparisons,etc ), a rough document to use these capabilities would be of great help.

Regards

Riaz

I don’t have documentation. I will be publishing a blog on Sampson Sampling as implemented in avenir along with an use case soon