A Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision-making under uncertainty. It provides a formal way to describe an agent interacting with an environment where outcomes are partly random and partly controlled by the agent's choices.
An MDP is defined by:
- a set of states S,
- a set of actions A,
- transition probabilities P(s' | s, a), the probability of moving to state s' after taking action a in state s,
- a reward function R(s, a, s') for each such transition, and
- a discount factor γ ∈ [0, 1] that trades off immediate against future reward.
The Action Value Function (Q-Function):
One of the most important concepts in reinforcement learning is the action value function, denoted Q(s, a), which estimates the expected cumulative discounted reward obtained by starting in state s, taking action a, and thereafter following a given policy.
When given an existing estimate of the state value function V(s), the action value for a particular state-action pair can be computed using the Bellman expectation equation:
$$Q(s, a) = \sum_{s' \in S} P(s' | s, a) \cdot \left[ R(s, a, s') + \gamma \cdot V(s') \right]$$
This equation expresses the intuition that the value of taking action a in state s equals the expected sum of:
- the immediate reward R(s, a, s'), and
- the discounted value γ · V(s') of the successor state.
The expectation is computed by averaging over all possible successor states, weighted by their transition probabilities.
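As a concrete sketch of this backup (the function name `expected_q` is my own, and the nested-dict layout of `P` and `R` with string state keys is an assumption matching the example inputs below):

```python
def expected_q(state, action, P, R, V, gamma):
    """One-step Bellman backup:
    Q(s, a) = sum over s' of P(s'|s,a) * (R(s,a,s') + gamma * V(s'))."""
    q = 0.0
    # Sum over every successor state s' reachable from (state, action),
    # weighting each term by its transition probability.
    for next_state, prob in P[str(state)][action].items():
        reward = R[str(state)][action][next_state]
        q += prob * (reward + gamma * V[int(next_state)])
    return q
```

The loop visits only the successors that `P` lists for the given state-action pair, so unreachable states contribute nothing.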
Your Task:
Implement a function that computes the expected action value Q(s, a) for a given state-action pair in an MDP. The function should:
- accept the state, the action, the transition probabilities P, the rewards R, the current value estimates V, and the discount factor gamma, and
- return the expected action value Q(s, a).
state = 0
action = "a"
P = {
"0": {"a": {"0": 0.5, "1": 0.5}, "b": {"0": 1.0}},
"1": {"a": {"1": 1.0}, "b": {"0": 0.7, "1": 0.3}}
}
R = {
"0": {"a": {"0": 5, "1": 10}, "b": {"0": 2}},
"1": {"a": {"1": 0}, "b": {"0": -1, "1": 3}}
}
V = [1.0, 2.0]
gamma = 0.9

Expected output: 8.85

For state 0 and action "a", we need to consider all possible successor states with their probabilities:
Successor State 0: 0.5 × (5 + 0.9 × 1.0) = 2.95
Successor State 1: 0.5 × (10 + 0.9 × 2.0) = 5.90
Total Q(0, a) = 2.95 + 5.90 = 8.85
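The two terms above can be checked directly:

```python
# Successor state 0: probability 0.5, reward 5, V(0) = 1.0
term_0 = 0.5 * (5 + 0.9 * 1.0)
# Successor state 1: probability 0.5, reward 10, V(1) = 2.0
term_1 = 0.5 * (10 + 0.9 * 2.0)
# term_0 + term_1 sums to 8.85 (up to float rounding)
q = term_0 + term_1
```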
state = 0
action = "a"
P = {
"0": {"a": {"1": 1.0}},
"1": {"a": {"0": 1.0}}
}
R = {
"0": {"a": {"1": 10}},
"1": {"a": {"0": 5}}
}
V = [0.0, 0.0]
gamma = 0.9

Expected output: 10.0

This is a deterministic transition scenario where action "a" from state 0 always leads to state 1.
Successor State 1: 1.0 × (10 + 0.9 × 0.0) = 10.0
Since V(1) = 0, the expected value equals just the immediate reward.
Total Q(0, a) = 10.0
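With a single successor, the backup reduces to one term:

```python
# Only successor is state 1: probability 1.0, reward 10, V(1) = 0.0
q = 1.0 * (10 + 0.9 * 0.0)  # 10.0
```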
state = 1
action = "a"
P = {
"0": {"a": {"0": 0.5, "1": 0.5}, "b": {"1": 0.5, "2": 0.5}},
"1": {"a": {"0": 0.5, "2": 0.5}, "b": {"0": 1.0}},
"2": {"a": {"2": 1.0}, "b": {"0": 0.33, "1": 0.33, "2": 0.34}}
}
R = {
"0": {"a": {"0": 1, "1": 2}, "b": {"1": 3, "2": 4}},
"1": {"a": {"0": 5, "2": 6}, "b": {"0": 7}},
"2": {"a": {"2": 0}, "b": {"0": 1, "1": 2, "2": 3}}
}
V = [1.0, 2.0, 3.0]
gamma = 0.95

Expected output: 7.4

For state 1 and action "a", we analyze transitions to two possible successor states:
Successor State 0: 0.5 × (5 + 0.95 × 1.0) = 2.975
Successor State 2: 0.5 × (6 + 0.95 × 3.0) = 4.425
Total Q(1, a) = 2.975 + 4.425 = 7.4
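The same backup can be written as a weighted sum over the two successors, with the probabilities and rewards pulled from the inputs above:

```python
gamma = 0.95
V = [1.0, 2.0, 3.0]
# P(s' | s=1, a="a") and R(s=1, a="a", s') from the example inputs
transitions = {"0": 0.5, "2": 0.5}
rewards = {"0": 5, "2": 6}
# Weighted sum of (reward + discounted successor value) over successors
q = sum(p * (rewards[s2] + gamma * V[int(s2)]) for s2, p in transitions.items())
# q is approximately 7.4
```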
Constraints