In Reinforcement Learning, an agent learns to make optimal decisions by interacting with an environment modeled as a Markov Decision Process (MDP). A fundamental component of solving MDPs is computing the state-value function V(s), which quantifies the expected cumulative reward an agent can obtain starting from state s and following an optimal policy.
The Bellman Optimality Equation provides a recursive relationship that defines the optimal value function. The key insight is that the optimal value of any state equals the best action's immediate reward plus the discounted value of the resulting state. Mathematically:
$$V(s) = \max_{a} \sum_{s', r} P(s', r | s, a) \cdot [r + \gamma \cdot V(s') \cdot (1 - \text{terminal})]$$
Where:
- P(s', r | s, a) is the probability of transitioning to state s' with reward r after taking action a in state s
- γ (gamma) is the discount factor
- the terminal indicator is 1 when the transition ends the episode, so no discounted future value is added
Value Iteration Algorithm: Value iteration is an iterative algorithm that computes optimal state values by repeatedly applying the Bellman update to all states until convergence. One iteration (or "sweep") updates every state's value based on the current value estimates of all states.
MDP Transition Structure:
The environment is specified as a list of transition dictionaries. For each state s, transitions[s] is a dictionary mapping each available action to a list of outcome tuples of the form:
(probability, next_state, reward, is_terminal)
Each tuple represents a possible outcome of taking that action: the probability of that outcome occurring, the resulting state, the immediate reward received, and whether this transition ends the episode.
Your Task: Implement a function that performs one complete iteration of value iteration using the Bellman equation. Given the current state-value estimates V, transition dynamics, and discount factor γ, compute and return the updated value function.
Important Notes:
- For a terminal transition (is_terminal = True), the future discounted value should be 0 (the episode ends, so no future rewards can be collected).
- For stochastic actions, the action value is the probability-weighted sum over all possible outcomes.

Example 1:
V = np.array([0.0, 0.0])
transitions = [
{0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
{0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]}
]
gamma = 0.9

Expected Output: [1.0, 1.0]

State 0 Analysis:
- Action 0: 1.0 · (0.0 + 0.9 · V[0]) = 0.0
- Action 1: 1.0 · (1.0 + 0.9 · V[1]) = 1.0
- Updated value: max(0.0, 1.0) = 1.0

State 1 Analysis:
- Action 0: 1.0 · (0.0 + 0.9 · V[0]) = 0.0
- Action 1: terminal transition, so 1.0 · (1.0 + 0) = 1.0
- Updated value: max(0.0, 1.0) = 1.0
The updated value function is [1.0, 1.0].
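One sweep of the update described above can be sketched in Python. The function name and exact signature are assumptions for illustration; the problem statement does not fix them.

```python
import numpy as np

def value_iteration_sweep(V, transitions, gamma):
    """Apply one Bellman optimality update to every state; return new values."""
    V_new = np.zeros_like(V, dtype=float)
    for s, actions in enumerate(transitions):
        action_values = []
        for a, outcomes in actions.items():
            # Probability-weighted sum over all outcomes of this action.
            q = 0.0
            for prob, next_state, reward, is_terminal in outcomes:
                # Terminal transitions contribute no discounted future value.
                future = 0.0 if is_terminal else gamma * V[next_state]
                q += prob * (reward + future)
            action_values.append(q)
        V_new[s] = max(action_values)
    return V_new

# Example 1 from above:
V = np.array([0.0, 0.0])
transitions = [
    {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
    {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
]
print(value_iteration_sweep(V, transitions, 0.9))  # → [1. 1.]
```

Note that the sweep reads only the old values in `V` and writes into a fresh `V_new` (a "synchronous" update), which matches the worked examples: every state is updated from the same snapshot of estimates.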
Example 2:
V = np.array([0.0, 0.0, 0.0])
transitions = [
{0: [(1.0, 1, 1.0, False)], 1: [(1.0, 2, 2.0, False)]},
{0: [(1.0, 2, 1.0, False)], 1: [(1.0, 0, 0.0, False)]},
{0: [(1.0, 2, 5.0, True)], 1: [(1.0, 0, 0.0, False)]}
]
gamma = 0.9

Expected Output: [2.0, 1.0, 5.0]

State 0 Analysis:
- Action 0: 1.0 · (1.0 + 0.9 · V[1]) = 1.0
- Action 1: 1.0 · (2.0 + 0.9 · V[2]) = 2.0
- Updated value: max(1.0, 2.0) = 2.0

State 1 Analysis:
- Action 0: 1.0 · (1.0 + 0.9 · V[2]) = 1.0
- Action 1: 1.0 · (0.0 + 0.9 · V[0]) = 0.0
- Updated value: max(1.0, 0.0) = 1.0

State 2 Analysis:
- Action 0: terminal transition, so 1.0 · (5.0 + 0) = 5.0
- Action 1: 1.0 · (0.0 + 0.9 · V[0]) = 0.0
- Updated value: max(5.0, 0.0) = 5.0
The updated value function is [2.0, 1.0, 5.0].
Example 3:
V = np.array([0.0, 0.0])
transitions = [
{0: [(0.5, 0, 1.0, False), (0.5, 1, 2.0, False)]},
{0: [(1.0, 1, 10.0, True)]}
]
gamma = 0.9

Expected Output: [1.5, 10.0]

State 0 Analysis (Stochastic Transition):
- Action 0: 0.5 · (1.0 + 0.9 · V[0]) + 0.5 · (2.0 + 0.9 · V[1]) = 0.5 + 1.0 = 1.5
- Updated value: 1.5

State 1 Analysis:
- Action 0: terminal transition, so 1.0 · (10.0 + 0) = 10.0
- Updated value: 10.0
This example demonstrates stochastic transitions where an action can lead to multiple possible outcomes with different probabilities. The expected value is the probability-weighted sum of all possible outcomes.
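The probability-weighted sum for the stochastic action can be checked directly with a few lines of arithmetic (variable names here are illustrative, and the current values V are all zero as in the example):

```python
gamma = 0.9
# Outcomes of the single action in state 0: (prob, next_state, reward, is_terminal)
outcomes = [(0.5, 0, 1.0, False), (0.5, 1, 2.0, False)]
# With V[s'] = 0 for all states, the discounted future term vanishes.
expected = sum(p * (r + (0.0 if term else gamma * 0.0))
               for p, s_next, r, term in outcomes)
print(expected)  # → 1.5
```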
Constraints