In reinforcement learning (RL), agents learn optimal behavior by iteratively refining their estimates of how valuable different states are. A cornerstone of this learning process is the Temporal Difference (TD) error, which quantifies the discrepancy between the agent's current value estimate and a more informed estimate derived from actual experience.
The TD error serves as a critical learning signal that drives the update of value function estimates. When an agent transitions from state s to state s' and receives reward r, it can compute a better estimate of the value of state s by combining the immediate reward with the discounted value of the next state.
Mathematical Formulation:
The TD error (δ) is defined as:
$$\delta = r + \gamma \cdot V(s') - V(s)$$
Where:
• r is the immediate reward received on the transition
• γ ∈ [0, 1] is the discount factor
• V(s) is the current value estimate of state s
• V(s') is the current value estimate of the successor state s'
The term r + γ · V(s') is called the TD target — it represents what the value of state s "should be" based on the experience of transitioning to s'.
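As a quick numeric illustration of the two quantities just defined (the values below are made up, not from the exercise's examples):

```python
# Illustrative values for a single continuing-episode transition
r, gamma = 0.5, 0.9        # immediate reward and discount factor
v_s, v_s_next = 2.0, 3.0   # current estimates V(s) and V(s')

td_target = r + gamma * v_s_next  # what V(s) "should be": 0.5 + 2.7 = 3.2
delta = td_target - v_s           # TD error: 3.2 - 2.0 = 1.2
```

A positive δ nudges V(s) upward on the next update; a negative δ nudges it downward.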
Handling Terminal States:
When the episode terminates (i.e., the agent reaches a terminal state), there is no future to bootstrap from. In this case, the next state has no value beyond the immediate reward, so the TD error simplifies to:
$$\delta = r - V(s)$$
Your Task: Implement a function that computes the TD error for a single state transition. The function should correctly handle both continuing and terminal episodes.
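One possible implementation is sketched below. The function name and parameter order are assumptions; adapt them to whatever signature your exercise template specifies.

```python
def compute_td_error(v_s: float, reward: float, v_s_prime: float,
                     gamma: float, done: bool) -> float:
    """Return the TD error delta for a single transition.

    At a terminal transition (done=True) there is no future to
    bootstrap from, so V(s') is ignored and the target is just r.
    """
    td_target = reward if done else reward + gamma * v_s_prime
    return td_target - v_s
```

On the worked examples that follow, this returns 5.0, 2.0, and -3.25 respectively.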
Input: v_s = 5.0, reward = 1.0, v_s_prime = 10.0, gamma = 0.9, done = False
Output: 5.0

Since the episode has not terminated (done = False), we compute the full TD error:
• TD Target = reward + gamma × V(s') = 1.0 + 0.9 × 10.0 = 1.0 + 9.0 = 10.0
• TD Error = TD Target - V(s) = 10.0 - 5.0 = 5.0
The positive TD error of 5.0 indicates that the current value estimate V(s) = 5.0 was too pessimistic. The actual experienced transition suggests state s is more valuable than previously estimated.
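The arithmetic for this continuing-episode case can be checked directly:

```python
# Example values: V(s)=5.0, r=1.0, V(s')=10.0, gamma=0.9, done=False
td_target = 1.0 + 0.9 * 10.0   # 10.0
delta = td_target - 5.0        # 5.0 (positive: V(s) was too pessimistic)
```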
Input: v_s = 8.0, reward = 10.0, v_s_prime = 100.0, gamma = 0.9, done = True
Output: 2.0

Since this is a terminal transition (done = True), the next state s' has no future value — the episode ends here. We ignore V(s') entirely:
• TD Target = reward (no discounted future since episode ends) = 10.0
• TD Error = TD Target - V(s) = 10.0 - 8.0 = 2.0
Note that even though v_s_prime = 100.0 is provided, it is ignored because at terminal states there is no future to consider. The agent only received the final reward of 10.0.
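A small self-contained check makes the branching explicit: because done is True, v_s_prime never enters the computation.

```python
# Example values: V(s)=8.0, r=10.0, V(s')=100.0, gamma=0.9, done=True
done = True
reward, v_s, v_s_prime, gamma = 10.0, 8.0, 100.0, 0.9
td_target = reward if done else reward + gamma * v_s_prime  # 10.0
delta = td_target - v_s                                     # 2.0
```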
Input: v_s = 10.0, reward = 2.0, v_s_prime = 5.0, gamma = 0.95, done = False
Output: -3.25

Computing the TD error for a continuing episode:
• TD Target = reward + gamma × V(s') = 2.0 + 0.95 × 5.0 = 2.0 + 4.75 = 6.75
• TD Error = TD Target - V(s) = 6.75 - 10.0 = -3.25
The negative TD error of -3.25 indicates that the current value estimate V(s) = 10.0 was too optimistic. Based on the actual transition, state s appears to be less valuable than the agent thought.
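This case can also be verified in a couple of lines:

```python
# Example values: V(s)=10.0, r=2.0, V(s')=5.0, gamma=0.95, done=False
td_target = 2.0 + 0.95 * 5.0   # 6.75
delta = td_target - 10.0       # -3.25 (negative: V(s) was too optimistic)
```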
Constraints