Topics: Policy Improvement Method for MDPs (Markov Decision Processes)


The policy improvement method can be used to find the optimal policy for a given MDP.

When modelling certain phenomena, it may prove useful to take a discount factor into consideration when determining the optimal policy. Say, for instance, we want to take into account the devaluation of a currency over time.

For these cases, we can follow a procedure that is practically identical to the one for the policy improvement method, with just a few slight changes in the equations and expressions that are used. The resulting method is called the discounted policy improvement method.

Compared to the standard method, this one only uses the expected total costs starting from each state $i$, denoted $V_i(R)$, though this time they are a discounted variation thereof.
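As a point of reference, and assuming the usual notation ($V_i(R)$ for the value of state $i$ under policy $R$, $\alpha$ for the discount factor introduced in Step 0 below, and $C_t$ for the cost incurred at epoch $t$), this discounted expected total cost can be written as:

$$V_i(R) = \mathbb{E}\!\left[\left.\sum_{t=0}^{\infty} \alpha^{t}\, C_t \;\right|\; X_0 = i,\ R\right]$$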

Algorithm

This method is an iterative algorithm. We will use $n$ as the iteration number.

Step 0

Before we can formally begin, we need to arbitrarily choose a viable policy $R_1$ as our starting policy ($n = 1$, first iteration).

Of course, we also have to define a discount factor $\alpha$ (with $0 < \alpha < 1$). We can alternatively define an interest rate $r$, having:

$$\alpha = \frac{1}{1 + r}$$

Step 1

The first step is to solve the following linear equation system:

$$V_i(R_n) = C_{ik} + \alpha \sum_{j} p_{ij}(k)\, V_j(R_n) \qquad \text{for every state } i$$

…where $C_{ik}$ is the cost of making the decision $k$ in the state $i$ (as defined by the policy $R_n$) and $p_{ij}(k)$ is the transition probability from $i$ to $j$ when making the decision $k$ (as defined by the policy $R_n$).

We’ll thus obtain every $V_i(R_n)$, the expected total (discounted) cost starting from a state $i$ given the policy $R_n$.
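To make Step 1 concrete, here is a minimal sketch in Python, assuming a small hypothetical MDP with two states and two decisions (all numbers are illustrative, not taken from the text above). It builds the system $(I - \alpha P_R)V = C_R$ and solves it with NumPy:

```python
import numpy as np

# Hypothetical 2-state, 2-decision MDP (illustrative numbers only).
alpha = 0.9                                    # discount factor
C = np.array([[2.0, 4.0],                      # C[i, k]: cost of decision k in state i
              [5.0, 1.0]])
P = np.array([[[0.7, 0.3],                     # P[k, i, j]: transition probability i -> j
               [0.4, 0.6]],                    #             when making decision k
              [[0.2, 0.8],
               [0.9, 0.1]]])

policy = np.array([0, 1])                      # R_n: decision prescribed in each state

# Step 1: V_i = C_{i,R(i)} + alpha * sum_j p_ij(R(i)) * V_j
# i.e. (I - alpha * P_R) V = C_R, solved as a linear system.
P_R = np.array([P[policy[i], i] for i in range(len(policy))])
C_R = np.array([C[i, policy[i]] for i in range(len(policy))])
V = np.linalg.solve(np.eye(len(policy)) - alpha * P_R, C_R)
print(V)  # expected total discounted cost starting from each state under R_n
```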

Step 2

The second step consists in finding an alternative policy $R_{n+1}$ such that, in every state $i$, the decision it prescribes is the optimal one to make. To do so, we will use the values $V_j(R_n)$ we previously obtained.

That is, for every state $i$, we will plug these values into the expressions:

$$C_{ik} + \alpha \sum_{j} p_{ij}(k)\, V_j(R_n)$$

…having one for every decision $k$ that can be made in $i$. We will then pick the decision that yields the optimal result (the smallest if we deal with true costs, the largest if we deal with earnings). This will result in a new alternative policy $R_{n+1}$.
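Continuing the same hypothetical example, Step 2 can be sketched as follows (the $V_j(R_n)$ values below are placeholders standing in for the solution obtained in Step 1):

```python
import numpy as np

# Same hypothetical 2-state, 2-decision MDP as in the Step 1 sketch.
alpha = 0.9
C = np.array([[2.0, 4.0], [5.0, 1.0]])
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.2, 0.8], [0.9, 0.1]]])
V = np.array([26.0, 24.0])   # placeholder V_j(R_n); in practice, use the Step 1 solution

# Evaluate C_ik + alpha * sum_j p_ij(k) * V_j for every state i and decision k,
# then keep the cheapest decision in each state (we are minimising costs here).
Q = C + alpha * np.einsum('kij,j->ik', P, V)   # Q[i, k]
new_policy = Q.argmin(axis=1)                  # R_{n+1}
print(Q)
print(new_policy)
```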

Step 3

The third and last step is to determine whether we’ve obtained the optimal policy or not, continuing with another iteration if it is not the case.

If $R_{n+1} = R_n$, then we have the optimal policy. If not, then we shall make another iteration ($n \leftarrow n + 1$ and back to step 1).
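Putting the three steps together, a self-contained sketch of the whole discounted policy improvement loop might look like this (again using the illustrative toy MDP from the earlier snippets, and assuming costs are to be minimised):

```python
import numpy as np

def discounted_policy_improvement(C, P, alpha, initial_policy):
    """Sketch of the discounted policy improvement method (cost minimisation).

    C[i, k]   : cost of making decision k in state i
    P[k, i, j]: transition probability from i to j under decision k
    alpha     : discount factor, 0 < alpha < 1
    """
    n_states = C.shape[0]
    policy = np.array(initial_policy)
    while True:
        # Step 1: policy evaluation -- solve (I - alpha * P_R) V = C_R.
        P_R = np.array([P[policy[i], i] for i in range(n_states)])
        C_R = np.array([C[i, policy[i]] for i in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - alpha * P_R, C_R)

        # Step 2: policy improvement -- pick the cheapest decision in every state.
        Q = C + alpha * np.einsum('kij,j->ik', P, V)
        new_policy = Q.argmin(axis=1)

        # Step 3: stop once the policy no longer changes (R_{n+1} = R_n).
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

# Illustrative toy MDP from the previous sketches.
C = np.array([[2.0, 4.0], [5.0, 1.0]])
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.2, 0.8], [0.9, 0.1]]])
print(discounted_policy_improvement(C, P, alpha=0.9, initial_policy=[0, 0]))
```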