In reinforcement learning, one of the most fundamental challenges is determining how valuable it is to be in a particular state when following a given behavioral strategy. This problem introduces the concept of iterative value estimation, a cornerstone technique used to evaluate policies in sequential decision-making environments.
Consider a 5×5 navigation grid where an autonomous agent must traverse from various starting positions. The environment has the following characteristics:

- The grid has 25 cells indexed (row, column), from (0,0) at the top-left to (4,4) at the bottom-right.
- The four corner cells are terminal states: episodes end there, and their values remain 0.
- From every non-terminal cell the agent can move up, down, left, or right.
- A move that would leave the grid keeps the agent in its current cell.
- Every transition out of a non-terminal state incurs a reward of -1.
A policy defines the agent's behavioral strategy by mapping each state to a probability distribution over actions. For each non-terminal state, the policy specifies the probability of taking each of the four possible actions. These probabilities must sum to 1.0 for each state.
The state-value function $V^\pi(s)$ represents the expected cumulative discounted reward when starting from state $s$ and following policy $\pi$ thereafter. For a given policy, the value of each state satisfies the Bellman expectation equation:
$$V^\pi(s) = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]$$
Where:

- $\pi(a|s)$ is the probability that the policy selects action $a$ in state $s$,
- $P(s'|s,a)$ is the probability of transitioning to state $s'$ after taking action $a$ in state $s$ (here the dynamics are deterministic, so each action leads to exactly one successor),
- $R(s,a,s')$ is the reward received on that transition,
- $\gamma$ is the discount factor.
Starting with initial values of 0 for all states, repeatedly update each non-terminal state's value using the Bellman equation until the largest absolute change across all states in a single iteration falls below the given convergence threshold.
Implement a function that computes the converged state-value function for the given policy. Your implementation should:

- initialize the value of every state to 0,
- leave terminal states at value 0,
- sweep over all non-terminal states, applying the Bellman expectation update to each,
- track the largest absolute value change during each sweep,
- stop once that change falls below the convergence threshold and return the resulting 5×5 value grid.
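The procedure above can be sketched as follows. This is a minimal sketch, not the reference solution: the dynamics (four terminal corners, −1 reward per move, off-grid moves leave the agent in place) are inferred from the worked examples below, and names like `evaluate_policy` are chosen here for illustration.

```python
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
TERMINALS = {(0, 0), (0, 4), (4, 0), (4, 4)}  # assumed terminal corners

def evaluate_policy(policy, gamma, threshold, size=5):
    """Iterative policy evaluation; returns the converged value grid."""
    V = [[0.0] * size for _ in range(size)]
    while True:
        delta = 0.0
        new_V = [row[:] for row in V]  # synchronous full-sweep update
        for i in range(size):
            for j in range(size):
                if (i, j) in TERMINALS:
                    continue  # terminal states keep value 0
                value = 0.0
                for action, prob in policy[f"({i},{j})"].items():
                    di, dj = ACTIONS[action]
                    ni, nj = i + di, j + dj
                    if not (0 <= ni < size and 0 <= nj < size):
                        ni, nj = i, j  # off-grid move: stay in place
                    # Bellman backup term: deterministic move, reward -1
                    value += prob * (-1.0 + gamma * V[ni][nj])
                new_V[i][j] = value
                delta = max(delta, abs(value - V[i][j]))
        V = new_V
        if delta < threshold:
            return V
```

Under these assumed dynamics, the uniform random policy with γ = 0.9 and threshold 0.001 reproduces the value table in the first example below.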
policy = {
"(0,0)": {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25},
"(0,1)": {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25},
... # All 25 states with uniform random policy
}
gamma = 0.9
threshold = 0.001

Output:

[
[0.0, -4.8645, -6.079, -4.8645, 0.0],
[-4.8645, -6.2339, -6.7677, -6.2339, -4.8645],
[-6.079, -6.7677, -7.0902, -6.7677, -6.079],
[-4.8645, -6.2339, -6.7677, -6.2339, -4.8645],
[0.0, -4.8645, -6.079, -4.8645, 0.0]
]

With a uniform random policy (equal 25% probability for each action), the agent has no directional preference and wanders randomly. The center cell (2,2) has the worst value, approximately -7.09, because it is farthest from every terminal corner and requires the most expected steps to escape. The symmetry of the values reflects the symmetry of both the grid and the policy. Corner states remain at 0 because they are terminal, while states adjacent to corners have better (less negative) values since they can reach a terminal state quickly.
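These numbers can be sanity-checked with a single Bellman backup at one state. Taking (0,1) and its converged neighbors from the table, under the assumed dynamics (−1 per move, wall bumps stay in place):

```python
gamma = 0.9
# Converged values read off the table: V(0,1), V(1,1), V(0,2), V(0,0)
v_01, v_11, v_02, v_corner = -4.8645, -6.2339, -6.079, 0.0
# From (0,1): "up" hits the wall (stay at (0,1)), "down" -> (1,1),
# "left" -> terminal (0,0), "right" -> (0,2), each with probability 0.25
backup = -1.0 + gamma * 0.25 * (v_01 + v_11 + v_corner + v_02)
# backup is about -4.865, matching the table's -4.8645 to within the threshold
```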
policy = {
"(i,j)": {"up": 0.0, "down": 1.0, "left": 0.0, "right": 0.0}
... # All 25 states with deterministic "always go down" policy
}
gamma = 0.9
threshold = 0.001

Output:

[
[0.0, -9.9914, -9.9914, -9.9914, 0.0],
[-2.71, -9.9914, -9.9914, -9.9914, -2.71],
[-1.9, -9.9914, -9.9914, -9.9914, -1.9],
[-1.0, -9.9914, -9.9914, -9.9914, -1.0],
[0.0, -9.9914, -9.9914, -9.9914, 0.0]
]

With a deterministic "always move down" policy, cells in the leftmost and rightmost columns walk straight down to the bottom terminal corners, accumulating a -1 penalty per step. For example, (3,0) has value -1 because one step down reaches terminal (4,0); (2,0) has value -1.9 = -1 + 0.9×(-1); and (1,0) and (1,4) have value -2.71 = -1 + 0.9×(-1.9). Cells in the middle columns (columns 1, 2, 3), however, are trapped: they descend to the bottom row and then bump against the bottom boundary forever without reaching a corner, so their values approach the geometric-series limit $\sum_{k=0}^{\infty} -\gamma^k = -\frac{1}{1-\gamma} = -10$.
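The chain of values down an outer column, and the trapped-cell limit, can be verified with a few lines of arithmetic (under the same assumed dynamics):

```python
gamma = 0.9
# Left column under "always down": (3,0) reaches terminal (4,0) in one step
v_30 = -1.0
v_20 = -1.0 + gamma * v_30        # -1.9
v_10 = -1.0 + gamma * v_20        # -2.71
# A trapped cell satisfies v = -1 + gamma * v, hence v = -1 / (1 - gamma)
v_trapped = -1.0 / (1.0 - gamma)  # -10.0
```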
policy = {
"(i,j)": {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25}
... # All 25 states with uniform random policy
}
gamma = 0.5
threshold = 0.01

Output:

[
[0.0, -1.6855, -1.902, -1.6855, 0.0],
[-1.6855, -1.9091, -1.9597, -1.9091, -1.6855],
[-1.902, -1.9597, -1.9772, -1.9597, -1.902],
[-1.6855, -1.9091, -1.9597, -1.9091, -1.6855],
[0.0, -1.6855, -1.902, -1.6855, 0.0]
]

With the same uniform policy but a lower discount factor (γ = 0.5 instead of 0.9), the state values are far less negative. Future penalties are discounted more heavily: a reward or penalty received k steps in the future is worth $\gamma^k$ times its face value, so with γ = 0.5 a penalty five steps away retains only 0.5⁵ = 0.03125 of its original magnitude. This makes the long-term cost of being trapped or far from a terminal state much smaller. The looser threshold (0.01) also lets the iteration stop after fewer sweeps, at marginally lower precision.
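The impact of the discount factor can be quantified directly. Since every step costs at most −1 under the assumed dynamics, any state's value is bounded below by $-1/(1-\gamma)$, and that bound shrinks sharply as γ drops (an illustrative check):

```python
# Discount weights: how much a penalty k steps away still counts
weights_09 = [0.9 ** k for k in range(6)]  # 0.9**5 ≈ 0.59
weights_05 = [0.5 ** k for k in range(6)]  # 0.5**5 = 0.03125
# Worst-case (trapped forever) value under each discount factor
worst_09 = -1.0 / (1.0 - 0.9)  # about -10
worst_05 = -1.0 / (1.0 - 0.5)  # -2.0
```

Consistent with this bound, the most negative entry in the γ = 0.5 table above is -1.9772, just above -2.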
Constraints