Topics: Markov Decision Process - Optimisation
Given an MDP, we can find an optimal policy with the exhaustive enumeration method, one of the simplest approaches for this purpose.
Algorithm
The exhaustive enumeration method consists of the following steps.
Step 1
Enumerate exhaustively all of the viable (stationary, deterministic) policies for the MDP.
- When every decision $k$ (out of $K$ in total) can be made in every state $i$ (out of $M$ in total), the total number of viable policies is $K^M$ (permutations with repetition).
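The enumeration in Step 1 can be sketched as a Cartesian product. The sizes ($M = 2$ states, $K = 2$ decisions) below are hypothetical, chosen only for illustration:

```python
from itertools import product

states = [0, 1]     # M = 2 states (hypothetical)
decisions = [0, 1]  # K = 2 decisions available in every state (hypothetical)

# A policy assigns one decision to each state, so enumerating all
# viable policies is the Cartesian product: K**M = 2**2 = 4 candidates.
policies = list(product(decisions, repeat=len(states)))
print(policies)  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The count matches the permutations-with-repetition formula: `len(policies) == len(decisions) ** len(states)`.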
Step 2
Calculating the expected long-term cost for every policy $R$:

$$E[C_R] = \sum_{i=1}^{M} C_{i\,d_i}\,\pi_i$$

…where $C_{i\,d_i}$ denotes the cost incurred for making the decision $d_i$ (defined according to the policy $R$) in the state $i$, while $\pi_i$ denotes the steady state probability of the state $i$ given the policy.
Don’t forget to take the policy into consideration in this step. It defines the decision $d_i$ we’ll be taking in each state $i$, as well as the transition matrix that is used to obtain the steady state probabilities. Assuming row-stochastic matrices (row $i$ holds the transition probabilities out of state $i$), we can obtain the latter by (1) selecting the row that corresponds to the state $i$ in the transition matrix for the decision $d_i$, and then (2) stacking these rows together into a new matrix.
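Step 2 can be sketched for a single policy as follows. The two-state MDP below is entirely hypothetical: `P[k]` is the row-stochastic transition matrix under decision `k`, and `C[i][k]` is the cost of decision `k` in state `i`; all numbers are made up for illustration:

```python
# Hypothetical transition matrices, one per decision k.
P = {
    0: [[0.9, 0.1], [0.6, 0.4]],
    1: [[0.5, 0.5], [0.2, 0.8]],
}
# Hypothetical costs: C[i][k] = cost of decision k in state i.
C = [[1.0, 4.0], [3.0, 2.0]]

policy = (0, 1)  # decision d_i taken in state i

# Build the policy's transition matrix: row i comes from the
# transition matrix of the decision the policy prescribes in state i.
P_R = [P[policy[i]][i] for i in range(2)]

# Steady state of a 2-state chain, closed form:
# pi_0 = p_{10} / (p_{01} + p_{10}).
pi0 = P_R[1][0] / (P_R[0][1] + P_R[1][0])
pi = [pi0, 1 - pi0]

# Expected long-term cost: sum over states of C[i][d_i] * pi_i.
expected_cost = sum(C[i][policy[i]] * pi[i] for i in range(2))
print(round(expected_cost, 4))  # 1.3333
```

With more than two states, the closed form is replaced by solving the linear system $\pi = \pi P_R$, $\sum_i \pi_i = 1$.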
Step 3
Select the optimal policy based on its cost.
If we’re dealing with ordinary costs, we select the policy with the smallest expected cost. If our “costs” are actually earnings, we select the policy with the largest expected “cost” (i.e. earnings).
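The three steps can be combined into one end-to-end sketch. The two-state, two-decision MDP below is hypothetical (all probabilities and costs are made up for illustration):

```python
from itertools import product

# Hypothetical data: P[k] is the row-stochastic transition matrix
# under decision k, C[i][k] the cost of decision k in state i.
P = {0: [[0.9, 0.1], [0.6, 0.4]],
     1: [[0.5, 0.5], [0.2, 0.8]]}
C = [[1.0, 4.0], [3.0, 2.0]]

def expected_cost(policy):
    # Row i of the policy's transition matrix comes from the
    # matrix of the decision d_i prescribed in state i.
    P_R = [P[policy[i]][i] for i in range(2)]
    pi0 = P_R[1][0] / (P_R[0][1] + P_R[1][0])  # 2-state closed form
    pi = [pi0, 1 - pi0]
    return sum(C[i][policy[i]] * pi[i] for i in range(2))

# Step 1: enumerate all K**M policies; Step 2: cost each one;
# Step 3: pick the cheapest (use max instead of min for earnings).
policies = list(product([0, 1], repeat=2))
best = min(policies, key=expected_cost)
print(best, round(expected_cost(best), 4))  # (0, 0) 1.2857
```

Swapping `min` for `max` in the last step is the only change needed when the “costs” are earnings.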