In Reinforcement Learning, an agent learns to make optimal decisions by interacting with an environment modeled as a Markov Decision Process (MDP). A fundamental component of solving MDPs is computing the state-value function V(s), which quantifies the expected cumulative reward an agent can obtain starting from state s and following an optimal policy.
The Bellman Optimality Equation provides a recursive relationship that defines the optimal value function. The key insight is that the optimal value of any state equals the best action's immediate reward plus the discounted value of the resulting state. Mathematically:
$$V(s) = \max_{a} \sum_{s', r} P(s', r | s, a) \cdot [r + \gamma \cdot V(s') \cdot (1 - \text{terminal})]$$
Where:
- P(s', r | s, a) is the probability of transitioning to state s' with reward r after taking action a in state s
- γ (gamma) is the discount factor
- the terminal indicator is 1 when the transition ends the episode, so no discounted future value is added
Value Iteration Algorithm: Value iteration is an iterative algorithm that computes optimal state values by repeatedly applying the Bellman update to all states until convergence. One iteration (or "sweep") updates every state's value based on the current value estimates of all states.
MDP Transition Structure:
The environment is specified as a list of transition dictionaries. For each state s, transitions[s] is a dictionary mapping each available action to a list of outcome tuples of the form:
(probability, next_state, reward, is_terminal)
Each tuple represents a possible outcome of taking that action: the probability of that outcome occurring, the resulting state, the immediate reward received, and whether this transition ends the episode.
Your Task: Implement a function that performs one complete iteration of value iteration using the Bellman equation. Given the current state-value estimates V, transition dynamics, and discount factor γ, compute and return the updated value function.
Important Notes:
- For a terminal transition (is_terminal = True), the future discounted value should be 0 (the episode ends, so no future rewards can be collected).
- For stochastic actions, the action value is the probability-weighted sum over all possible outcomes.

Example 1:
V = np.array([0.0, 0.0])
transitions = [
{0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
{0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]}
]
gamma = 0.9

Expected Output: [1.0, 1.0]

State 0 Analysis:
- Action 0: 1.0 · (0.0 + 0.9 · V[0]) = 0.0
- Action 1: 1.0 · (1.0 + 0.9 · V[1]) = 1.0
- Updated value: max(0.0, 1.0) = 1.0

State 1 Analysis:
- Action 0: 1.0 · (0.0 + 0.9 · V[0]) = 0.0
- Action 1: terminal transition, so 1.0 · (1.0 + 0) = 1.0
- Updated value: max(0.0, 1.0) = 1.0
The updated value function is [1.0, 1.0].
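One sweep of the update described above can be sketched in Python. The function name and exact signature are assumptions for illustration; the problem statement does not fix them.

```python
import numpy as np

def value_iteration_sweep(V, transitions, gamma):
    """Apply one Bellman optimality update to every state; return new values."""
    V_new = np.zeros_like(V, dtype=float)
    for s, actions in enumerate(transitions):
        action_values = []
        for a, outcomes in actions.items():
            # Probability-weighted sum over all outcomes of this action.
            q = 0.0
            for prob, next_state, reward, is_terminal in outcomes:
                # Terminal transitions contribute no discounted future value.
                future = 0.0 if is_terminal else gamma * V[next_state]
                q += prob * (reward + future)
            action_values.append(q)
        V_new[s] = max(action_values)
    return V_new

# Example 1 from above:
V = np.array([0.0, 0.0])
transitions = [
    {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
    {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
]
print(value_iteration_sweep(V, transitions, 0.9))  # → [1. 1.]
```

Note that the sweep reads only the old values in `V` and writes into a fresh `V_new` (a "synchronous" update), which matches the worked examples: every state is updated from the same snapshot of estimates.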
Example 2:
V = np.array([0.0, 0.0, 0.0])
transitions = [
{0: [(1.0, 1, 1.0, False)], 1: [(1.0, 2, 2.0, False)]},
{0: [(1.0, 2, 1.0, False)], 1: [(1.0, 0, 0.0, False)]},
{0: [(1.0, 2, 5.0, True)], 1: [(1.0, 0, 0.0, False)]}
]
gamma = 0.9

Expected Output: [2.0, 1.0, 5.0]

State 0 Analysis:
- Action 0: 1.0 · (1.0 + 0.9 · V[1]) = 1.0
- Action 1: 1.0 · (2.0 + 0.9 · V[2]) = 2.0
- Updated value: max(1.0, 2.0) = 2.0

State 1 Analysis:
- Action 0: 1.0 · (1.0 + 0.9 · V[2]) = 1.0
- Action 1: 1.0 · (0.0 + 0.9 · V[0]) = 0.0
- Updated value: max(1.0, 0.0) = 1.0

State 2 Analysis:
- Action 0: terminal transition, so 1.0 · (5.0 + 0) = 5.0
- Action 1: 1.0 · (0.0 + 0.9 · V[0]) = 0.0
- Updated value: max(5.0, 0.0) = 5.0
The updated value function is [2.0, 1.0, 5.0].
Example 3:
V = np.array([0.0, 0.0])
transitions = [
{0: [(0.5, 0, 1.0, False), (0.5, 1, 2.0, False)]},
{0: [(1.0, 1, 10.0, True)]}
]
gamma = 0.9

Expected Output: [1.5, 10.0]

State 0 Analysis (Stochastic Transition):
- Action 0: 0.5 · (1.0 + 0.9 · V[0]) + 0.5 · (2.0 + 0.9 · V[1]) = 0.5 + 1.0 = 1.5
- Updated value: 1.5

State 1 Analysis:
- Action 0: terminal transition, so 1.0 · (10.0 + 0) = 10.0
- Updated value: 10.0
This example demonstrates stochastic transitions where an action can lead to multiple possible outcomes with different probabilities. The expected value is the probability-weighted sum of all possible outcomes.
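The probability-weighted sum for the stochastic action can be checked directly with a few lines of arithmetic (variable names here are illustrative, and the current values V are all zero as in the example):

```python
gamma = 0.9
# Outcomes of the single action in state 0: (prob, next_state, reward, is_terminal)
outcomes = [(0.5, 0, 1.0, False), (0.5, 1, 2.0, False)]
# With V[s'] = 0 for all states, the discounted future term vanishes.
expected = sum(p * (r + (0.0 if term else gamma * 0.0))
               for p, s_next, r, term in outcomes)
print(expected)  # → 1.5
```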
Constraints