Topics: Markov Decision Process
Given an MDP, we can obtain an approximation to the optimal policy by using the successive approximations method.
The basic idea behind this method is to find an optimal policy for the process when only $n$ periods remain, starting from $n = 1$ (one period away from finishing) and continuing onwards. As $n$ grows, the optimal policies that are found converge to the optimal policy for the MDP (i.e. the optimal policy when "infinite" periods remain). Thus, the optimal policies found for each $n$ are successive approximations to the optimal policy.
This method mainly uses the expected total costs $V_n(i)$ of starting from a state $i$, though we'll be using the ones that correspond to the optimal policy among all possible policies, instead of the ones for a single given policy.
Algorithm
Evidently, the method is iterative. We will iterate over $n$, the number of remaining periods.
Since the method finds approximations and, depending on the number of iterations we perform, we're not always guaranteed to find the optimal policy, we will define $N$, the maximum number of iterations, as well as $\varepsilon$, the tolerated error.
For the first iteration ($n = 1$), we will determine, for every state $i$:

$$V_1(i) = \min_{k} C_{ik}$$
…having one element to compare for every decision $k$ that is viable in state $i$.
We will then obtain the optimal policy (when $n = 1$ period remains) by setting $d_1(i)$ to the $k$ that corresponds to the optimal cost.
We continue by setting $n = 2$.
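To make the first iteration concrete, here is a minimal Python sketch. The two states, the decision labels, and the cost values are hypothetical, chosen only for illustration:

```python
# Hypothetical immediate costs C[i][k]: cost of taking decision k in state i.
C = {
    0: {"a": 4.0, "b": 6.0},
    1: {"a": 3.0, "c": 1.0},
}

# First iteration (n = 1): V_1(i) = min over viable k of C[i][k].
V1 = {}   # optimal expected cost with one period remaining
d1 = {}   # decision that attains it, i.e. the policy d_1(i)
for i, costs in C.items():
    best_k = min(costs, key=costs.get)   # one candidate per viable decision k
    d1[i] = best_k
    V1[i] = costs[best_k]

print(V1)   # {0: 4.0, 1: 1.0}
print(d1)   # {0: 'a', 1: 'c'}
```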
For the second iteration ($n \ge 2$) and onwards, we will determine, for every state $i$:

$$V_n(i) = \min_{k} \left[ C_{ik} + \alpha \sum_{j} p_{ij}(k)\, V_{n-1}(j) \right]$$

…where $\alpha$ is the discount factor and $p_{ij}(k)$ is the probability of moving from state $i$ to state $j$ when decision $k$ is made…
…having one element to compare for every decision $k$ that is viable in state $i$.
We will then obtain the optimal policy by setting $d_n(i)$ to the $k$ that corresponds to the optimal cost.
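A sketch of this recursion in code, reusing the hypothetical costs `C` from the sketch above; the transition probabilities `p[i][k][j]` and the discount factor `alpha` are likewise assumed values, not taken from the notes:

```python
# Assumed transition probabilities p[i][k][j] and discount factor for the
# same hypothetical two-state process used above.
alpha = 0.9
p = {
    0: {"a": {0: 0.7, 1: 0.3}, "b": {0: 0.2, 1: 0.8}},
    1: {"a": {0: 0.5, 1: 0.5}, "c": {0: 0.1, 1: 0.9}},
}

def update(V_prev):
    """One step of the recursion: build V_n and d_n from V_{n-1}."""
    V_new, d_new = {}, {}
    for i, costs in C.items():
        # one candidate per viable decision k in state i
        candidates = {
            k: costs[k] + alpha * sum(p[i][k][j] * V_prev[j] for j in V_prev)
            for k in costs
        }
        best_k = min(candidates, key=candidates.get)
        d_new[i] = best_k
        V_new[i] = candidates[best_k]
    return V_new, d_new
```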
Stopping Condition
If $n = N$, then we stop iterating. If $n < N$, then, for every state $i$, we test the inequality:

$$\left| V_n(i) - V_{n-1}(i) \right| \le \varepsilon$$
If the inequality is satisfied for every $i$, we stop iterating. Otherwise, we set $n = n + 1$ and continue iterating.
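Putting the pieces together, the sketch below runs the whole method using the `update` function and the hypothetical cost data from the previous sketches; the iteration cap `N = 100` and tolerance `eps = 1e-6` are illustrative values, not prescribed by the notes:

```python
def successive_approximations(N=100, eps=1e-6):
    # First iteration (n = 1): V_1(i) = min over viable k of C[i][k].
    V = {i: min(costs.values()) for i, costs in C.items()}
    d = {i: min(costs, key=costs.get) for i, costs in C.items()}
    for n in range(2, N + 1):
        V_new, d = update(V)
        # Stop if |V_n(i) - V_{n-1}(i)| <= eps for every state i.
        if all(abs(V_new[i] - V[i]) <= eps for i in V):
            return V_new, d, n
        V = V_new
    return V, d, N   # reached the maximum number of iterations

V_opt, policy, n_used = successive_approximations()
print(policy, n_used)
```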