Bandits Know the Best Product Price

In an earlier post, I did a survey of a class of reinforcement learning  algorithms, known as Multi Arm Bandit(MAB) . Essentially, these algorithms make decisions and learn from rewards received from the environment. You can also think of them as experimental learning or trial and error learning.

Now it’s time put them into practice. So, in this post, my focus is on using some of these algorithms for a real life use case. Our use case is  price optimization.  When selling a product, how do you choose the optimum price, to maximize profit. It’s an ubiquitous problem faced by any one selling anything.The map reduce implementation is available in my open source project avenir, hosted on github.

If a price is too low, net profit could be low, because of the low per unit margin. On the other hand, if the price is too high, the net profit could be low, because of low sell volume. Some where in between, lies the optimum price.  In this post, we will find out how to find the optimum price using Hadoop based implementation of some of the MAB algorithms, discussed in my  earlier post. We will try two different algorithms.

Pricing Optimization

Let’s consider a eCommerce company, selling hundreds of products. It’s difficult for the company to decide on the optimum price. Moreover, optimum price is very dynamic and changes with time. The company being very much data and analytic driven, decides to use the MAB algorithms as a guide to decide the optimum price. Optimum price determination is a continuous process that runs periodically e.g. every 2 weeks

This is how the solution works

  1. Each product has a set of candidate prices, out of which the algorithm selects one, in every round. 
  2. Pricing optimizer has a cycle of  e.g., 2 weeks i.e., once every 2 weeks it determines the optimum price for all products to be effective for the next 2 weeks, based on among other things, the profit data available so far.

The profit data for the last 2 weeks is fed into the pricing optimizer, which will take it into consideration when recommending price for a product for the next round. As explained in my earlier post, these algorithms judiciously balances exploration (e.g., try a price that has not been tried before or not tried enough) and exploitation (e.g, choosing a price that has  yielded maximum profit from recent trials).  

After enough trials, the algorithms converge on an optimum price for a product, and the ratio of exploitation to exploration goes up during the process. However, even after such convergence, it may be premature to think that our job is done.

As time goes, the current optimum price may not be valid any more because of changes in market condition, economy or competing products. In such cases, it is necessary to run the algorithms with appropriate parameters to induce more exploration to handle changes and find a new optimum price.

In our case, changing conditions will be manifested in terms of lower profit data. Many algorithms will automatically lean more toward exploration and find a new optimum price. For some algorithms, we may need to change some parameters to nudge the algorithm more towards exploration. For example, for epsilon-greedy, we may increase the random selection probability threshold.

Price for Learning

You may wonder about the potential losses for not selecting the optimum prices as we go through the rounds of  trials. Well, that is the price to pay to find the optimum price.

Fortunately, most of the MAB algorithms guarantee an upper bound on regret. Regret is defined as the aggregate loss for not choosing the optimum price in the different round of trials. A good MAB algorithm will always try to minimize the regret.


The input for all the MAB map reduce jobs have the following format. The source for data in field 3, 4 and 5 are sells and profit data. I have simulated the profit data with a python script.

  1. Group ID (e.g. product ID)
  2. Item   (e.g., candidate price)
  3. Number of trials (e.g., number of times a price has been tried)
  4. Total reward (e.g., total profit)
  5. Average reward (e.g. average profit)

In real life, the source of the data for columns 3, 4 and 5 are sales data for a given price of a product.

Random Greedy

The first algorithm we will use is called epsilon-greedy. The optimizer selects a price randomly with a probability below some specified level.  The price selected may or may not have been tried before. Otherwise, it selects the best price so far based on profit data for a product and price. With each round, the  threshold probability is reduced linearly, causing less exploration and more exploitation. Details of the algorithm can be found in this MAB survey paper.

Here are the steps to select the price for a product for the next sales cycle. It involves two map reduce jobs.

  • First, the MR class GreedyRandomBandit takes the input as outlined earlier.It generates price for each product, based on the algorithm selected.
  • Next, we need the profit data for the price selected for each product. I am using a python script to simulate  the profit data, based on some pre defined distribution of profit for different prices.
  • Next, we need to incorporate the profit data into the current input to generate new set of average profit data for different prices for all products. The MR class RunningAggregator is used for this purpose.
  • Finally, the output of RunningAggregator goes as input to GreedyRandomBandit for the next round of price selection and the cycle continues.

For every new cycle, the configuration parameter current.round.num is incremented. In the linear probability reduction algorithms we are using, the probability of choosing a price for a product randomly, decreases linearly with this parameter. As the round number goes up, the probability decreases and we transition out of the exploration into the exploitation phase.

Here is some sample input data for round 3 of the pricing engine. The rows with non zero data in column 3 onward, are prices that have been tried.


Here are sample prices selected for some products. For the product ID 2438359, the price selected (87) was not tried before as we can tell from the data above. The exploratory aspect of the algorithm triggered for this product in this round to select 87.


The MR class GreedyRandomBandiy implements following 3 algorithms. We are using the first one. The first two are different flavors of the epsilon-greedy algorithm.

  1. Random greedy with linear probability reduction
  2. Random greedy with log linear probability reduction.
  3. Auer random greedy

Auer Deterministic

The next algorithm we will use is deterministic in nature. It will generally choose the price that yielded the highest profit. However as we go through the rounds items not tried in recent past get more priority.

The price  selected is a function of return from that price and how many times that price has been tried compared to total number of trials. The details are in my earlier post. The implementation is in the AuerDeterministic MR class.. All prices for a product are tried at least once, before the algorithm triggers. Additional theoretical details  of the algorithm are available in  the original paper by Auer.

After several rounds, here is the profit data for a particular product. After all the prices have been tried once before, the algorithm picks the the price that had the highest return in the current round.


in the next round, the algorithm may pick the price 79, because it’s the next best  price  and also because it’s been tried less than the price 85. That’s exactly what we find as in the result for the next round, as shown below.


It seems like, in the next round 85 will be chosen again. Eventually. After enough rounds the algorithm will  converge with an optimum price.

Wrapping Up

There are many MAB algorithms is literature. I have implemented some of the popular ones. As I implement more, I will be blogging about them in future. Here is a tutorial that contains details of running the map reduce jobs for price optimization.


About Pranab

I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source project owner. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Machine Learning and Programming languages. I am fascinated by problems that don't have neat closed form solution.
This entry was posted in Big Data, Data Science, Hadoop and Map Reduce, Marketing Analytic, Optimizatiom, Optimization, Reinforcement Learning and tagged , , , . Bookmark the permalink.

2 Responses to Bandits Know the Best Product Price

  1. Vignesh Prajapati says:

    Hi Pranab, Thanks for nice cleaner post. This is very useful for understanding and Implementation purpose. I have one question – If I don’t have candidate prices for product instead of that I am having only single price value of product with candidate sales for various months(daily data). Then how can I utilize MAB (Multi Arm Bendint ) algorithm over my data. ?

  2. Pranab says:

    I am not sure how your problem relates to this algorithm. You can think of these algorithms as experimental learning. The algorithms try different values of a parameter. The rewards from the choices are used by the algorithm to learn and make better decision to optimize reward.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s