Topics: Markov Decision Process - Optimisation


Given an MDP, we can find the optimal policy with the exhaustive enumeration method, one of the simplest methods for this purpose.

Algorithm

The exhaustive enumeration method consists of the following steps.

Step 1

Enumerate exhaustively all of the viable policies for the MDP, i.e. every possible way of assigning a feasible decision to each state.
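
As a minimal sketch of this step (the state labels and the feasible_decisions table below are illustrative assumptions, not part of the original text), enumerating every policy amounts to taking the Cartesian product of the per-state sets of feasible decisions:

```python
from itertools import product

# Hypothetical example: three states, each with its own set of feasible decisions.
feasible_decisions = {
    0: [1, 2],   # decisions allowed in state 0
    1: [1, 3],   # decisions allowed in state 1
    2: [2, 3],   # decisions allowed in state 2
}

states = sorted(feasible_decisions)

# A policy assigns one feasible decision to each state, so the set of all
# policies is the Cartesian product of the per-state decision sets.
policies = [
    dict(zip(states, choice))
    for choice in product(*(feasible_decisions[s] for s in states))
]

for policy in policies:
    print(policy)   # e.g. {0: 1, 1: 1, 2: 2}, {0: 1, 1: 1, 2: 3}, ...
```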

Step 2

Calculate the expected long-term cost for every policy $R$:

$$E[C_R] = \sum_{i} C_{i\,d_i}\,\pi_i$$

…where the sum runs over all states, $C_{i\,d_i}$ denotes the cost incurred for making the decision $d_i$ (defined according to the policy) in the state $i$, while $\pi_i$ denotes the steady-state probability of the state $i$ given the policy.

Don’t forget to take the policy into consideration in this step. It defines the decision $d_i$ we’ll be taking in each state $i$, as well as the transition matrix that is used to obtain the steady-state probabilities. We can obtain the latter by (1) selecting, for each state $i$, the row that corresponds to the state $i$ in the transition matrix of the decision $d_i$, and then (2) stacking these rows together into a new matrix.
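
Here is a sketch of this step under assumed inputs: P_by_decision, a hypothetical mapping from each decision k to its row-stochastic transition matrix, and cost, a hypothetical table with cost[i][k] the cost of taking decision k in state i, for states labelled 0..n-1. It builds the policy's transition matrix, solves for the steady-state probabilities, and returns the expected cost:

```python
import numpy as np

def steady_state(P):
    """Steady-state distribution of a row-stochastic matrix P: solves pi P = pi, sum(pi) = 1."""
    n = P.shape[0]
    # Stack the balance equations (P^T - I) pi = 0 with the normalisation sum(pi) = 1
    # and solve the resulting over-determined system by least squares.
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def expected_cost(policy, P_by_decision, cost):
    """Expected long-run cost sum_i cost[i][d_i] * pi_i under a policy {state: decision}."""
    states = sorted(policy)  # assumes states are labelled 0..n-1
    # Build the policy's transition matrix: for each state i, take row i of the
    # transition matrix of the decision the policy prescribes in state i.
    P_policy = np.array([P_by_decision[policy[i]][i] for i in states])
    pi = steady_state(P_policy)
    return sum(cost[i][policy[i]] * pi[i] for i in states)
```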

Step 3

Select the optimal policy based on its cost.

If we’re dealing with classical costs, then we’ll select the policy that has the smallest expected cost. If our “costs” are actually earnings, then we’ll select the policy that has the largest expected “cost” (earning).
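
To tie the steps together, here is an end-to-end sketch on a made-up two-state example (the numbers, state labels and decision labels are all illustrative assumptions, and expected_cost is the helper sketched in Step 2): it enumerates every policy and keeps the one with the smallest expected cost, switching to max(...) when the “costs” are earnings.

```python
import numpy as np
from itertools import product

# Hypothetical two-state, two-decision example (all numbers are illustrative).
# P_by_decision[k][i, j] = probability of going from state i to state j
# when decision k is taken in state i.
P_by_decision = {
    1: np.array([[0.5, 0.5],
                 [0.4, 0.6]]),
    2: np.array([[0.8, 0.2],
                 [0.7, 0.3]]),
}
# cost[i][k] = cost of taking decision k in state i.
cost = {0: {1: 2.0, 2: 5.0},
        1: {1: 6.0, 2: 3.0}}

states = [0, 1]
policies = [dict(zip(states, c)) for c in product([1, 2], repeat=len(states))]

# expected_cost() is the helper sketched in Step 2 above.
best = min(policies, key=lambda p: expected_cost(p, P_by_decision, cost))
print("optimal policy:", best)
print("expected cost :", expected_cost(best, P_by_decision, cost))
# For earnings instead of costs, replace min(...) with max(...).
```

Note that the number of policies grows as the product of the per-state decision counts, which is why exhaustive enumeration is only practical for small MDPs.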