A Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision-making under uncertainty. It provides a formal way to describe an agent interacting with an environment where outcomes are partly random and partly controlled by the agent's choices.
An MDP is defined by:
- a set of states S,
- a set of actions A,
- transition probabilities P(s' | s, a), the probability of moving to state s' after taking action a in state s,
- a reward function R(s, a, s') for each such transition, and
- a discount factor γ ∈ [0, 1] that trades off immediate against future reward.
The Action Value Function (Q-Function):
One of the most important concepts in reinforcement learning is the action value function, denoted Q(s, a), which estimates the expected cumulative discounted reward obtained by starting in state s, taking action a, and thereafter following a given policy.
When given an existing estimate of the state value function V(s), the action value for a particular state-action pair can be computed using the Bellman expectation equation:
$$Q(s, a) = \sum_{s' \in S} P(s' | s, a) \cdot \left[ R(s, a, s') + \gamma \cdot V(s') \right]$$
This equation expresses the intuition that the value of taking action a in state s equals the expected sum of:
- the immediate reward R(s, a, s'), and
- the discounted value γ · V(s') of the successor state.
The expectation is computed by averaging over all possible successor states, weighted by their transition probabilities.
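As a concrete sketch of this backup (the function name `expected_q` is my own, and the nested-dict layout of `P` and `R` with string state keys is an assumption matching the example inputs below):

```python
def expected_q(state, action, P, R, V, gamma):
    """One-step Bellman backup:
    Q(s, a) = sum over s' of P(s'|s,a) * (R(s,a,s') + gamma * V(s'))."""
    q = 0.0
    # Sum over every successor state s' reachable from (state, action),
    # weighting each term by its transition probability.
    for next_state, prob in P[str(state)][action].items():
        reward = R[str(state)][action][next_state]
        q += prob * (reward + gamma * V[int(next_state)])
    return q
```

The loop visits only the successors that `P` lists for the given state-action pair, so unreachable states contribute nothing.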
Your Task:
Implement a function that computes the expected action value Q(s, a) for a given state-action pair in an MDP. The function should:
- accept the state, the action, the transition probabilities P, the rewards R, the current value estimates V, and the discount factor gamma, and
- return the expected action value Q(s, a).
state = 0
action = "a"
P = {
"0": {"a": {"0": 0.5, "1": 0.5}, "b": {"0": 1.0}},
"1": {"a": {"1": 1.0}, "b": {"0": 0.7, "1": 0.3}}
}
R = {
"0": {"a": {"0": 5, "1": 10}, "b": {"0": 2}},
"1": {"a": {"1": 0}, "b": {"0": -1, "1": 3}}
}
V = [1.0, 2.0]
gamma = 0.9

Expected output: 8.85

For state 0 and action "a", we need to consider all possible successor states with their probabilities:
Successor State 0: 0.5 × (5 + 0.9 × 1.0) = 2.95
Successor State 1: 0.5 × (10 + 0.9 × 2.0) = 5.90
Total Q(0, a) = 2.95 + 5.90 = 8.85
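The two terms above can be checked directly:

```python
# Successor state 0: probability 0.5, reward 5, V(0) = 1.0
term_0 = 0.5 * (5 + 0.9 * 1.0)
# Successor state 1: probability 0.5, reward 10, V(1) = 2.0
term_1 = 0.5 * (10 + 0.9 * 2.0)
# term_0 + term_1 sums to 8.85 (up to float rounding)
q = term_0 + term_1
```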
state = 0
action = "a"
P = {
"0": {"a": {"1": 1.0}},
"1": {"a": {"0": 1.0}}
}
R = {
"0": {"a": {"1": 10}},
"1": {"a": {"0": 5}}
}
V = [0.0, 0.0]
gamma = 0.9

Expected output: 10.0

This is a deterministic transition scenario where action "a" from state 0 always leads to state 1.
Successor State 1: 1.0 × (10 + 0.9 × 0.0) = 10.0
Since V(1) = 0, the expected value equals just the immediate reward.
Total Q(0, a) = 10.0
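With a single successor, the backup reduces to one term:

```python
# Only successor is state 1: probability 1.0, reward 10, V(1) = 0.0
q = 1.0 * (10 + 0.9 * 0.0)  # 10.0
```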
state = 1
action = "a"
P = {
"0": {"a": {"0": 0.5, "1": 0.5}, "b": {"1": 0.5, "2": 0.5}},
"1": {"a": {"0": 0.5, "2": 0.5}, "b": {"0": 1.0}},
"2": {"a": {"2": 1.0}, "b": {"0": 0.33, "1": 0.33, "2": 0.34}}
}
R = {
"0": {"a": {"0": 1, "1": 2}, "b": {"1": 3, "2": 4}},
"1": {"a": {"0": 5, "2": 6}, "b": {"0": 7}},
"2": {"a": {"2": 0}, "b": {"0": 1, "1": 2, "2": 3}}
}
V = [1.0, 2.0, 3.0]
gamma = 0.95

Expected output: 7.4

For state 1 and action "a", we analyze transitions to two possible successor states:
Successor State 0: 0.5 × (5 + 0.95 × 1.0) = 2.975
Successor State 2: 0.5 × (6 + 0.95 × 3.0) = 4.425
Total Q(1, a) = 2.975 + 4.425 = 7.4
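The same backup can be written as a weighted sum over the two successors, with the probabilities and rewards pulled from the inputs above:

```python
gamma = 0.95
V = [1.0, 2.0, 3.0]
# P(s' | s=1, a="a") and R(s=1, a="a", s') from the example inputs
transitions = {"0": 0.5, "2": 0.5}
rewards = {"0": 5, "2": 6}
# Weighted sum of (reward + discounted successor value) over successors
q = sum(p * (rewards[s2] + gamma * V[int(s2)]) for s2, p in transitions.items())
# q is approximately 7.4
```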
Constraints