Topics: Policy Improvement Method for MDPs (Markov Decision Processes)


The policy improvement method can be used to find the optimal policy for a given MDP.

When modelling certain phenomena, it may prove useful to take a discount factor into consideration when determining the optimal policy. Say, for instance, we want to take into account the devaluation of a currency over time.

For these cases, we can follow a procedure that is practically identical to the one for the policy improvement method, with just a few slight changes in the equations and expressions that are used. The resulting method is called the discounted policy improvement method.

Compared to the standard method, this one only uses the expected total costs starting from each state $i$, denoted $V_i(R)$, though this time they are a discounted variation thereof.
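As a point of reference, and assuming the usual notation ($V_i(R)$ for the value of state $i$ under policy $R$, $\alpha$ for the discount factor introduced in Step 0 below, and $C_t$ for the cost incurred at epoch $t$), this discounted expected total cost can be written as:

$$V_i(R) = \mathbb{E}\!\left[\left.\sum_{t=0}^{\infty} \alpha^{t}\, C_t \;\right|\; X_0 = i,\ R\right]$$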

Algorithm

This method is an iterative algorithm. We will use $n$ as the iteration number.

Step 0

Before we can formally begin, we need to arbitrarily choose a viable policy $R_1$ as our starting policy ($n = 1$, first iteration).

Of course, we also have to define a discount factor $\alpha$ (with $0 < \alpha < 1$). We can alternatively define an interest rate $r$, having:

$$\alpha = \frac{1}{1 + r}$$

Step 1

The first step is to solve the following linear equation system:

$$V_i(R_n) = C_{ik} + \alpha \sum_{j} p_{ij}(k)\, V_j(R_n) \qquad \text{for every state } i$$

…where $C_{ik}$ is the cost of making the decision $k$ in the state $i$ (as defined by the policy $R_n$) and $p_{ij}(k)$ is the transition probability from $i$ to $j$ when making the decision $k$ (as defined by the policy $R_n$).

We’ll thus obtain every $V_i(R_n)$, the expected total (discounted) cost starting from a state $i$ given the policy $R_n$.
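To make Step 1 concrete, here is a minimal sketch in Python, assuming a small hypothetical MDP with two states and two decisions (all numbers are illustrative, not taken from the text above). It builds the system $(I - \alpha P_R)V = C_R$ and solves it with NumPy:

```python
import numpy as np

# Hypothetical 2-state, 2-decision MDP (illustrative numbers only).
alpha = 0.9                                    # discount factor
C = np.array([[2.0, 4.0],                      # C[i, k]: cost of decision k in state i
              [5.0, 1.0]])
P = np.array([[[0.7, 0.3],                     # P[k, i, j]: transition probability i -> j
               [0.4, 0.6]],                    #             when making decision k
              [[0.2, 0.8],
               [0.9, 0.1]]])

policy = np.array([0, 1])                      # R_n: decision prescribed in each state

# Step 1: V_i = C_{i,R(i)} + alpha * sum_j p_ij(R(i)) * V_j
# i.e. (I - alpha * P_R) V = C_R, solved as a linear system.
P_R = np.array([P[policy[i], i] for i in range(len(policy))])
C_R = np.array([C[i, policy[i]] for i in range(len(policy))])
V = np.linalg.solve(np.eye(len(policy)) - alpha * P_R, C_R)
print(V)  # expected total discounted cost starting from each state under R_n
```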

Step 2

The second step consists in finding an alternative policy $R_{n+1}$ such that, in every state $i$, the decision it prescribes is the optimal one to make. To do so, we will use the values $V_j(R_n)$ we previously obtained.

That is, for every state $i$, we will plug these values into the expressions:

$$C_{ik} + \alpha \sum_{j} p_{ij}(k)\, V_j(R_n)$$

…having one for every decision $k$ that can be made in $i$. We will then pick the decision that yields the optimal result (the smallest if we deal with true costs, the largest if we deal with earnings). This will result in a new alternative policy $R_{n+1}$.
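Continuing the same hypothetical example, Step 2 can be sketched as follows (the $V_j(R_n)$ values below are placeholders standing in for the solution obtained in Step 1):

```python
import numpy as np

# Same hypothetical 2-state, 2-decision MDP as in the Step 1 sketch.
alpha = 0.9
C = np.array([[2.0, 4.0], [5.0, 1.0]])
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.2, 0.8], [0.9, 0.1]]])
V = np.array([26.0, 24.0])   # placeholder V_j(R_n); in practice, use the Step 1 solution

# Evaluate C_ik + alpha * sum_j p_ij(k) * V_j for every state i and decision k,
# then keep the cheapest decision in each state (we are minimising costs here).
Q = C + alpha * np.einsum('kij,j->ik', P, V)   # Q[i, k]
new_policy = Q.argmin(axis=1)                  # R_{n+1}
print(Q)
print(new_policy)
```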

Step 3

The third and last step is to determine whether we’ve obtained the optimal policy or not, continuing with another iteration if it is not the case.

If $R_{n+1} = R_n$, then we have the optimal policy. If not, then we shall make another iteration ($n \leftarrow n + 1$ and back to step 1).
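Putting the three steps together, a self-contained sketch of the whole discounted policy improvement loop might look like this (again using the illustrative toy MDP from the earlier snippets, and assuming costs are to be minimised):

```python
import numpy as np

def discounted_policy_improvement(C, P, alpha, initial_policy):
    """Sketch of the discounted policy improvement method (cost minimisation).

    C[i, k]   : cost of making decision k in state i
    P[k, i, j]: transition probability from i to j under decision k
    alpha     : discount factor, 0 < alpha < 1
    """
    n_states = C.shape[0]
    policy = np.array(initial_policy)
    while True:
        # Step 1: policy evaluation -- solve (I - alpha * P_R) V = C_R.
        P_R = np.array([P[policy[i], i] for i in range(n_states)])
        C_R = np.array([C[i, policy[i]] for i in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - alpha * P_R, C_R)

        # Step 2: policy improvement -- pick the cheapest decision in every state.
        Q = C + alpha * np.einsum('kij,j->ik', P, V)
        new_policy = Q.argmin(axis=1)

        # Step 3: stop once the policy no longer changes (R_{n+1} = R_n).
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

# Illustrative toy MDP from the previous sketches.
C = np.array([[2.0, 4.0], [5.0, 1.0]])
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.2, 0.8], [0.9, 0.1]]])
print(discounted_policy_improvement(C, P, alpha=0.9, initial_policy=[0, 0]))
```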