Route Planning with DynaQ Reinforcement Learning


The purpose of simple route planning is to find the optimal route from one location to another, minimizing travel distance or time. Generally, various heuristic optimization techniques are used to find the optimal route. In this post we will use a Reinforcement Learning (RL) algorithm called DynaQ to solve the routing problem. DynaQ belongs to the Temporal Difference (TD) learning family of RL algorithms. The DynaQ implementation is available in my RL Python package qinisa. The code is in my GitHub repo whakapai.

The purpose of RL algorithms is to learn which action to take from a given state of the environment to maximize cumulative reward. RL algorithms can be characterized along the following dimensions; DynaQ is value based, off policy and model based.

  • Value based or policy based
  • On policy or off policy
  • Model free or model based
  • With or without goal

While the Q value function is being learnt, the agent uses that same Q value function to decide which action to take from a given state. The algorithm is nevertheless off policy, because the Q update assumes the greedy action will be taken in the next state, whereas the behavior policy that actually selects actions, for example Epsilon Greedy, may pick an exploratory action instead.

An RL algorithm runs until there is convergence or a certain number of iterations have been performed. When training is completed, a policy is extracted. The extracted policy can then be used for decision making, just like a trained predictive model is used to make inferences.

A policy defines the optimum action to take from a given state. The policy can be deterministic or probabilistic, and it is what gets used in the decision making process. For a problem without a specific goal, the policy defines the action to take from every state. For a problem with a goal state, the policy includes actions only for the states in the trajectory from the initial state to the goal state.
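To make the goal driven policy idea concrete, here is a minimal sketch of how a deterministic policy could be read off a learned Q table by greedily following it from the start state to the goal. The dictionary based Q table and the function name are illustrative assumptions, not the qinisa API; in the routing problem the action is simply the next location to go to.

def extract_policy(qtable, actions, start, goal, max_steps=100):
    """qtable: dict keyed by (state, action) -> Q value; actions: dict state -> list of actions"""
    policy, state = [], start
    for _ in range(max_steps):
        if state == goal:
            break
        # pick the action with the highest learned Q value for this state
        best = max(actions[state], key=lambda a: qtable.get((state, a), float("-inf")))
        policy.append((state, best))
        state = best    # in the routing problem the action is the next location
    return policy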

DynaQ is a variation of the well known Q Learning algorithm that adds a model of the environment and a planning, or indirect reinforcement learning, loop. In Q Learning the Q values for state, action pairs are learnt. The Q value represents the value of taking a certain action from a certain state.

The basic Q Learning loop is as follows; the extra step of learning a model (step 4) is what DynaQ adds. The model gets used in an inner loop for planning or indirect RL. A code sketch follows the steps.

repeat
  1) for the current state s, choose action a such that Q(s,a) is maximum, or choose an exploratory action a
  2) get reward r and next state ns from the environment
  3) update Q(s,a) based on r and the Q value for next state ns and next action na that corresponds to the maximum of Q(ns, na)
  4) perform incremental model learning with input (s,a) and output (r,ns)
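The four steps above can be captured in a few lines of Python. This is a minimal sketch, not the qinisa implementation; the environment is assumed to be a callable that returns (reward, next state), and the Q table and model are plain dictionaries.

import random

def dynaq_direct_step(qtable, model, env, actions, state, alpha=0.1, gamma=0.9, epsilon=0.1):
    """One direct RL step: pick an action, observe (r, ns), update Q and the model."""
    # step 1: epsilon greedy action selection on the current Q values
    if random.random() < epsilon:
        action = random.choice(actions[state])
    else:
        action = max(actions[state], key=lambda a: qtable.get((state, a), 0.0))
    # step 2: interact with the environment
    reward, next_state = env(state, action)
    # step 3: move Q(s,a) toward r + gamma * max_na Q(ns, na)
    best_next = max((qtable.get((next_state, a), 0.0) for a in actions.get(next_state, [])), default=0.0)
    old = qtable.get((state, action), 0.0)
    qtable[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    # step 4: incremental model learning, here just a lookup table entry
    model[(state, action)] = (reward, next_state)
    return next_state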

In step 1, to promote exploration, some exploration inducing technique should be used. The following techniques are currently supported in my implementation:

  • Epsilon Greedy
  • Softmax Distribution
  • Upper Confidence Bound

For our use case I have used the Epsilon Greedy exploration technique with log linear reduction of the epsilon parameter. Most of the time the action is chosen based on max(Q(s,a)); the rest of the time the action is chosen randomly.
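Here is a hedged sketch of what such a schedule could look like. "Log linear" is read here as epsilon decreasing linearly in the log of the step count; the function name, parameter names and defaults are illustrative, not those used in qinisa.

import math
import random

def epsilon_greedy_action(qtable, actions, state, step, eps0=1.0, decay=0.2, eps_min=0.05):
    """Epsilon greedy selection with an epsilon that shrinks linearly in log(step)."""
    epsilon = max(eps_min, eps0 - decay * math.log(1 + step))
    if random.random() < epsilon:
        return random.choice(actions[state])                                   # explore
    return max(actions[state], key=lambda a: qtable.get((state, a), 0.0))      # exploit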

Step 4, model learning, is what facilitates planning or indirect RL. In our case the state and action spaces are discrete and finite, so the model is a simple lookup table. When the state and action spaces are continuous, a proper ML predictive model should be used, which could be Deep Learning or something else.
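For the discrete case, such a lookup table model could look like the following sketch. The class and method names are illustrative, not those of the reinfl module.

import random

class TabularModel:
    """Lookup table model for a discrete, finite state and action space."""

    def __init__(self):
        self.table = {}     # (state, action) -> (reward, next state)

    def learn(self, state, action, reward, next_state):
        self.table[(state, action)] = (reward, next_state)

    def sample(self):
        # return a previously seen (state, action) pair together with its stored outcome
        state, action = random.choice(list(self.table.keys()))
        reward, next_state = self.table[(state, action)]
        return state, action, reward, next_state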

DynaQ has an inner loop that uses the model m to simulate the environment and update Q values. This step is called planning or indirect RL. Since the term planning can be confusing, I prefer indirect RL. Here is the inner loop; a code sketch follows the steps.

repeat
  1) choose a state s randomly from previously visited states
  2) choose an action a randomly from actions previously selected from state s
  3) get reward r and next state ns from the model m with (s,a) as input
  4) update Q(s,a) based on r and the Q value for next state ns and next action na that corresponds to the maximum of Q(ns, na)
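Continuing the illustrative TabularModel sketch above, the planning loop could be written as follows. Again, this is a sketch of the idea, not the actual reinfl code.

def planning_steps(qtable, model, actions, nsteps=20, alpha=0.1, gamma=0.9):
    """Indirect RL: replay simulated experience drawn from the learned model."""
    for _ in range(nsteps):
        # steps 1 and 2: a previously visited state and a previously taken action from it
        state, action, reward, next_state = model.sample()
        # step 3 came from the model; step 4 is the usual Q update
        best_next = max((qtable.get((next_state, a), 0.0) for a in actions.get(next_state, [])), default=0.0)
        old = qtable.get((state, action), 0.0)
        qtable[(state, action)] = old + alpha * (reward + gamma * best_next - old)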

Model based planning or indirect RL has the following advantages:

  • In normal Q Learning, the state, action space is navigated by following a trajectory. Model based indirect RL allows wider navigation of the state, action space in any direction
  • It enables faster Q Learning and convergence
  • Interaction with the real environment may be limited for various reasons; planning lets the agent keep learning from simulated experience

The problem is to find the optimum route from some start location to another location. The final location is the goal. Goal driven RL suffers from reward sparsity, i.e. reward is received only when the goal state is reached. Such problems take longer to converge, because the reward received at the goal state has to be propagated back through the state, action space.

To alleviate the reward sparsity problem, the simulated environment returns an immediate reward based on a heuristic that uses the straight line distance from the start to the target location and the straight line distance from the next location to the target location. Here is a map showing the various locations. The start location is A and the goal location is G. The optimum path found by DynaQ is A - C - D - G, which is the true optimal route and is shown as a dotted line. Sometimes it found A - B - D - G as the optimal route.
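One way such a distance based heuristic reward could be computed is sketched below, assuming each location has known coordinates. The formula and names are illustrative; the exact heuristic used in the simulated environment may differ.

import math

def shaped_reward(coords, start, next_loc, goal):
    """Immediate reward that grows as the next location gets closer to the goal,
    scaled by the straight line distance from the start to the goal."""
    def dist(a, b):
        (x1, y1), (x2, y2) = coords[a], coords[b]
        return math.hypot(x2 - x1, y2 - y1)
    # fraction of the start to goal straight line distance already covered
    return 1.0 - dist(next_loc, goal) / dist(start, goal)

For example, with hypothetical coordinates coords = {'A': (0, 0), 'C': (4, 1), 'G': (10, 0)}, moving from A toward C yields a positive reward, since C lies closer to G than A does.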

Here is the policy output from DynaQ. The first field is the state (location), the second is the action (go to location X).

policy
A C
C D
D G

The next plot shows the convergence of the model during training. The y axis shows the Q value updates as training progresses, which approach 0 as the model converges. The updates shown are across all state, action pairs.

To run the use case, please follow the instructions in the tutorial. The DynaQ implementation is in the Python module reinfl.

Environments can be deterministic or stochastic. Stochastic environments can be stationary or non stationary. For our use case the environment is deterministic, since the rewards are based on static distances. If we had used travel time instead, the environment would have been stochastic, and possibly non stationary too.

One way to handle non stationarity would be to track the rewards obtained from the environment and detect any drift. There are various ways to detect drift, such as two sample statistics. Once drift is detected, we could induce more exploration of the state, action space.
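As an illustration of the two sample idea, a drift check could compare an older window of observed rewards with a recent one, for example with a Kolmogorov-Smirnov test. The windowing policy, threshold and function name are assumptions, not part of the current implementation.

from scipy.stats import ks_2samp

def reward_drift_detected(old_rewards, recent_rewards, significance=0.05):
    """Two sample KS test between an older and a recent window of observed rewards."""
    stat, pvalue = ks_2samp(old_rewards, recent_rewards)
    return pvalue < significance    # a small p value suggests the reward distribution has shifted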

For the Epsilon Greedy exploration strategy, we could reset epsilon to its original value. For the Softmax Distribution based exploration strategy, we could reset the temperature parameter to its original high value, making the distribution flatter. I am planning to enhance the implementation for non stationary stochastic environments.

We have used a model based, value function learning RL algorithm called DynaQ to find the optimum route. Because the state and action spaces are finite and discrete, it is easy to learn both the model and the value function. Learning the value function boils down to learning the values in the Q table, and the model trivially becomes a simple lookup table keyed by the state, action pair. If the state and action spaces were continuous and large, you would have to use Deep Learning for both the Q function and the model.

