In reinforcement learning, one of the most fundamental challenges is determining how valuable it is to be in a particular state when following a given behavioral strategy. This problem introduces the concept of iterative value estimation, a cornerstone technique used to evaluate policies in sequential decision-making environments.
Consider a 5×5 navigation grid where an autonomous agent must traverse from various starting positions. The environment has the following characteristics:

- The grid has 25 cells indexed (row, column), from (0,0) at the top-left to (4,4) at the bottom-right.
- The four corner cells are terminal states: episodes end there, and their values remain 0.
- From every non-terminal cell the agent can move up, down, left, or right.
- A move that would leave the grid keeps the agent in its current cell.
- Every transition out of a non-terminal state incurs a reward of -1.
A policy defines the agent's behavioral strategy by mapping each state to a probability distribution over actions. For each non-terminal state, the policy specifies the probability of taking each of the four possible actions. These probabilities must sum to 1.0 for each state.
The state-value function $V^\pi(s)$ represents the expected cumulative discounted reward when starting from state $s$ and following policy $\pi$ thereafter. For a given policy, the value of each state satisfies the Bellman expectation equation:
$$V^\pi(s) = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]$$
Where:

- $\pi(a|s)$ is the probability that the policy selects action $a$ in state $s$,
- $P(s'|s,a)$ is the probability of transitioning to state $s'$ after taking action $a$ in state $s$ (here the dynamics are deterministic, so each action leads to exactly one successor),
- $R(s,a,s')$ is the reward received on that transition,
- $\gamma$ is the discount factor.
Starting with initial values of 0 for all states, repeatedly update each non-terminal state's value using the Bellman equation until the largest absolute change across all states in a single iteration falls below the given convergence threshold.
Implement a function that computes the converged state-value function for the given policy. Your implementation should:

- initialize the value of every state to 0,
- leave terminal states at value 0,
- sweep over all non-terminal states, applying the Bellman expectation update to each,
- track the largest absolute value change during each sweep,
- stop once that change falls below the convergence threshold and return the resulting 5×5 value grid.
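The procedure above can be sketched as follows. This is a minimal sketch, not the reference solution: the dynamics (four terminal corners, −1 reward per move, off-grid moves leave the agent in place) are inferred from the worked examples below, and names like `evaluate_policy` are chosen here for illustration.

```python
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
TERMINALS = {(0, 0), (0, 4), (4, 0), (4, 4)}  # assumed terminal corners

def evaluate_policy(policy, gamma, threshold, size=5):
    """Iterative policy evaluation; returns the converged value grid."""
    V = [[0.0] * size for _ in range(size)]
    while True:
        delta = 0.0
        new_V = [row[:] for row in V]  # synchronous full-sweep update
        for i in range(size):
            for j in range(size):
                if (i, j) in TERMINALS:
                    continue  # terminal states keep value 0
                value = 0.0
                for action, prob in policy[f"({i},{j})"].items():
                    di, dj = ACTIONS[action]
                    ni, nj = i + di, j + dj
                    if not (0 <= ni < size and 0 <= nj < size):
                        ni, nj = i, j  # off-grid move: stay in place
                    # Bellman backup term: deterministic move, reward -1
                    value += prob * (-1.0 + gamma * V[ni][nj])
                new_V[i][j] = value
                delta = max(delta, abs(value - V[i][j]))
        V = new_V
        if delta < threshold:
            return V
```

Under these assumed dynamics, the uniform random policy with γ = 0.9 and threshold 0.001 reproduces the value table in the first example below.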
policy = {
"(0,0)": {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25},
"(0,1)": {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25},
... # All 25 states with uniform random policy
}
gamma = 0.9
threshold = 0.001

Output:

[
[0.0, -4.8645, -6.079, -4.8645, 0.0],
[-4.8645, -6.2339, -6.7677, -6.2339, -4.8645],
[-6.079, -6.7677, -7.0902, -6.7677, -6.079],
[-4.8645, -6.2339, -6.7677, -6.2339, -4.8645],
[0.0, -4.8645, -6.079, -4.8645, 0.0]
]

With a uniform random policy (equal 25% probability for each action), the agent has no directional preference and wanders randomly. The center cell (2,2) has the worst value, approximately -7.09, because it is farthest from every terminal corner and requires the most expected steps to escape. The symmetry of the values reflects the symmetry of both the grid and the policy. Corner states remain at 0 because they are terminal, while states adjacent to corners have better (less negative) values since they can reach a terminal state quickly.
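These numbers can be sanity-checked with a single Bellman backup at one state. Taking (0,1) and its converged neighbors from the table, under the assumed dynamics (−1 per move, wall bumps stay in place):

```python
gamma = 0.9
# Converged values read off the table: V(0,1), V(1,1), V(0,2), V(0,0)
v_01, v_11, v_02, v_corner = -4.8645, -6.2339, -6.079, 0.0
# From (0,1): "up" hits the wall (stay at (0,1)), "down" -> (1,1),
# "left" -> terminal (0,0), "right" -> (0,2), each with probability 0.25
backup = -1.0 + gamma * 0.25 * (v_01 + v_11 + v_corner + v_02)
# backup is about -4.865, matching the table's -4.8645 to within the threshold
```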
policy = {
"(i,j)": {"up": 0.0, "down": 1.0, "left": 0.0, "right": 0.0}
... # All 25 states with deterministic "always go down" policy
}
gamma = 0.9
threshold = 0.001

Output:

[
[0.0, -9.9914, -9.9914, -9.9914, 0.0],
[-2.71, -9.9914, -9.9914, -9.9914, -2.71],
[-1.9, -9.9914, -9.9914, -9.9914, -1.9],
[-1.0, -9.9914, -9.9914, -9.9914, -1.0],
[0.0, -9.9914, -9.9914, -9.9914, 0.0]
]

With a deterministic "always move down" policy, cells in the leftmost and rightmost columns walk straight down to the bottom terminal corners, accumulating a -1 penalty per step. For example, (3,0) has value -1 because one step down reaches terminal (4,0); (2,0) has value -1.9 = -1 + 0.9×(-1); and (1,0) and (1,4) have value -2.71 = -1 + 0.9×(-1.9). Cells in the middle columns (columns 1, 2, 3), however, are trapped: they descend to the bottom row and then bump against the bottom boundary forever without reaching a corner, so their values approach the geometric-series limit $\sum_{k=0}^{\infty} -\gamma^k = -\frac{1}{1-\gamma} = -10$.
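The chain of values down an outer column, and the trapped-cell limit, can be verified with a few lines of arithmetic (under the same assumed dynamics):

```python
gamma = 0.9
# Left column under "always down": (3,0) reaches terminal (4,0) in one step
v_30 = -1.0
v_20 = -1.0 + gamma * v_30        # -1.9
v_10 = -1.0 + gamma * v_20        # -2.71
# A trapped cell satisfies v = -1 + gamma * v, hence v = -1 / (1 - gamma)
v_trapped = -1.0 / (1.0 - gamma)  # -10.0
```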
policy = {
"(i,j)": {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25}
... # All 25 states with uniform random policy
}
gamma = 0.5
threshold = 0.01

Output:

[
[0.0, -1.6855, -1.902, -1.6855, 0.0],
[-1.6855, -1.9091, -1.9597, -1.9091, -1.6855],
[-1.902, -1.9597, -1.9772, -1.9597, -1.902],
[-1.6855, -1.9091, -1.9597, -1.9091, -1.6855],
[0.0, -1.6855, -1.902, -1.6855, 0.0]
]

With the same uniform policy but a lower discount factor (γ = 0.5 instead of 0.9), the state values are far less negative. Future penalties are discounted more heavily: a reward or penalty received k steps in the future is worth $\gamma^k$ times its face value, so with γ = 0.5 a penalty five steps away retains only 0.5⁵ = 0.03125 of its original magnitude. This makes the long-term cost of being trapped or far from a terminal state much smaller. The looser threshold (0.01) also lets the iteration stop after fewer sweeps, at marginally lower precision.
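The impact of the discount factor can be quantified directly. Since every step costs at most −1 under the assumed dynamics, any state's value is bounded below by $-1/(1-\gamma)$, and that bound shrinks sharply as γ drops (an illustrative check):

```python
# Discount weights: how much a penalty k steps away still counts
weights_09 = [0.9 ** k for k in range(6)]  # 0.9**5 ≈ 0.59
weights_05 = [0.5 ** k for k in range(6)]  # 0.5**5 = 0.03125
# Worst-case (trapped forever) value under each discount factor
worst_09 = -1.0 / (1.0 - 0.9)  # about -10
worst_05 = -1.0 / (1.0 - 0.5)  # -2.0
```

Consistent with this bound, the most negative entry in the γ = 0.5 table above is -1.9772, just above -2.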
Constraints