Topics: Policy Improvement Method for MDPs - Markov Decision Process
(definition)
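In the context of an MDP, let $V_i^n(R)$ be the expected total cost of a system that starts in state $i$ and evolves over $n$ periods, given a specific policy $R$. This will be:
$$V_i^n(R) = C_{ik} + \alpha \sum_{j} p_{ij}(k)\, V_j^{n-1}(R)$$
…where $C_{ik}$ is the cost of making the decision $k$ in the state $i$ (as defined by the policy $R$), $p_{ij}(k)$ is the probability of moving from state $i$ to state $j$ when decision $k$ is made, and $\alpha$ is the discount factor. If there is no discount, then $\alpha = 1$, as is the case when using the standard policy improvement method (cf. the discounted version).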
Explanation
Notice that this sum is basically the sum of:
- $C_{ik}$, the cost of the first period, when the system goes from $i$ to another state $j$.
- $\alpha \sum_{j} p_{ij}(k)\, V_j^{n-1}(R)$, the weighted sum of the total costs that could be incurred by continuing from another state $j$ (and still following the same policy $R$), where each weight $p_{ij}(k)$ is the probability of moving from $i$ to $j$ under decision $k$.
Do not forget that these calculations take into consideration a given policy $R$, which defines which decision $k$ is made in which state $i$.
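The recursion described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `expected_total_cost`, the array layout (`P[k][i, j]` for transition probabilities, `C[i, k]` for costs, `policy[i]` for the decision in state `i`) and the numbers in the example are all assumptions made for this illustration, not part of the original definition. It also assumes that the cost over zero periods is 0, so that the one-period cost is just the immediate cost of the decision.

```python
import numpy as np


def expected_total_cost(P, C, policy, n, alpha=1.0):
    """Expected total cost over n periods from every state, under a fixed policy.

    P[k][i, j] : probability of moving from state i to state j under decision k
    C[i, k]    : cost of making decision k in state i
    policy[i]  : decision made in state i by the policy
    alpha      : discount factor (1.0 means no discounting)
    """
    P = np.asarray(P, dtype=float)
    C = np.asarray(C, dtype=float)
    policy = np.asarray(policy, dtype=int)
    states = np.arange(len(policy))

    cost = C[states, policy]        # immediate cost of the chosen decision in each state
    trans = P[policy, states, :]    # row i holds the transition probabilities out of state i

    V = np.zeros(len(policy))       # assumption: total cost over 0 periods is 0
    for _ in range(n):
        # first-period cost + discounted, probability-weighted cost of continuing
        V = cost + alpha * trans @ V
    return V


# Hypothetical two-state, two-decision example.
P = [
    [[0.5, 0.5], [0.2, 0.8]],   # transition matrix under decision 0
    [[1.0, 0.0], [0.9, 0.1]],   # transition matrix under decision 1
]
C = [[1.0, 4.0],
     [3.0, 2.0]]
policy = [0, 1]                 # decision 0 in state 0, decision 1 in state 1

print(expected_total_cost(P, C, policy, n=2))
print(expected_total_cost(P, C, policy, n=2, alpha=0.9))
```

Working the undiscounted two-period case by hand for this made-up data gives 1 + (0.5·1 + 0.5·2) = 2.5 from state 0 and 2 + (0.9·1 + 0.1·2) = 3.1 from state 1, which is what the loop computes up to floating-point rounding.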