In reinforcement learning (RL), agents learn optimal behavior by iteratively refining their estimates of how valuable different states are. A cornerstone of this learning process is the Temporal Difference (TD) error, which quantifies the discrepancy between the agent's current value estimate and a more informed estimate derived from actual experience.
The TD error serves as a critical learning signal that drives the update of value function estimates. When an agent transitions from state s to state s' and receives reward r, it can compute a better estimate of the value of state s by combining the immediate reward with the discounted value of the next state.
Mathematical Formulation:
The TD error (δ) is defined as:
$$\delta = r + \gamma \cdot V(s') - V(s)$$
Where:
• r is the immediate reward received on the transition
• γ ∈ [0, 1] is the discount factor
• V(s) is the current value estimate of state s
• V(s') is the current value estimate of the successor state s'
The term r + γ · V(s') is called the TD target — it represents what the value of state s "should be" based on the experience of transitioning to s'.
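As a quick numeric illustration of the two quantities just defined (the values below are made up, not from the exercise's examples):

```python
# Illustrative values for a single continuing-episode transition
r, gamma = 0.5, 0.9        # immediate reward and discount factor
v_s, v_s_next = 2.0, 3.0   # current estimates V(s) and V(s')

td_target = r + gamma * v_s_next  # what V(s) "should be": 0.5 + 2.7 = 3.2
delta = td_target - v_s           # TD error: 3.2 - 2.0 = 1.2
```

A positive δ nudges V(s) upward on the next update; a negative δ nudges it downward.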
Handling Terminal States:
When the episode terminates (i.e., the agent reaches a terminal state), there is no future to bootstrap from. In this case, the next state has no value beyond the immediate reward, so the TD error simplifies to:
$$\delta = r - V(s)$$
Your Task: Implement a function that computes the TD error for a single state transition. The function should correctly handle both continuing and terminal episodes.
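One possible implementation is sketched below. The function name and parameter order are assumptions; adapt them to whatever signature your exercise template specifies.

```python
def compute_td_error(v_s: float, reward: float, v_s_prime: float,
                     gamma: float, done: bool) -> float:
    """Return the TD error delta for a single transition.

    At a terminal transition (done=True) there is no future to
    bootstrap from, so V(s') is ignored and the target is just r.
    """
    td_target = reward if done else reward + gamma * v_s_prime
    return td_target - v_s
```

On the worked examples that follow, this returns 5.0, 2.0, and -3.25 respectively.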
Input: v_s = 5.0, reward = 1.0, v_s_prime = 10.0, gamma = 0.9, done = False
Output: 5.0

Since the episode has not terminated (done = False), we compute the full TD error:
• TD Target = reward + gamma × V(s') = 1.0 + 0.9 × 10.0 = 1.0 + 9.0 = 10.0
• TD Error = TD Target - V(s) = 10.0 - 5.0 = 5.0
The positive TD error of 5.0 indicates that the current value estimate V(s) = 5.0 was too pessimistic. The actual experienced transition suggests state s is more valuable than previously estimated.
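The arithmetic for this continuing-episode case can be checked directly:

```python
# Example values: V(s)=5.0, r=1.0, V(s')=10.0, gamma=0.9, done=False
td_target = 1.0 + 0.9 * 10.0   # 10.0
delta = td_target - 5.0        # 5.0 (positive: V(s) was too pessimistic)
```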
Input: v_s = 8.0, reward = 10.0, v_s_prime = 100.0, gamma = 0.9, done = True
Output: 2.0

Since this is a terminal transition (done = True), the next state s' has no future value — the episode ends here. We ignore V(s') entirely:
• TD Target = reward (no discounted future since episode ends) = 10.0
• TD Error = TD Target - V(s) = 10.0 - 8.0 = 2.0
Note that even though v_s_prime = 100.0 is provided, it is ignored because at terminal states there is no future to consider. The agent only received the final reward of 10.0.
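A small self-contained check makes the branching explicit: because done is True, v_s_prime never enters the computation.

```python
# Example values: V(s)=8.0, r=10.0, V(s')=100.0, gamma=0.9, done=True
done = True
reward, v_s, v_s_prime, gamma = 10.0, 8.0, 100.0, 0.9
td_target = reward if done else reward + gamma * v_s_prime  # 10.0
delta = td_target - v_s                                     # 2.0
```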
Input: v_s = 10.0, reward = 2.0, v_s_prime = 5.0, gamma = 0.95, done = False
Output: -3.25

Computing the TD error for a continuing episode:
• TD Target = reward + gamma × V(s') = 2.0 + 0.95 × 5.0 = 2.0 + 4.75 = 6.75
• TD Error = TD Target - V(s) = 6.75 - 10.0 = -3.25
The negative TD error of -3.25 indicates that the current value estimate V(s) = 10.0 was too optimistic. Based on the actual transition, state s appears to be less valuable than the agent thought.
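This case can also be verified in a couple of lines:

```python
# Example values: V(s)=10.0, r=2.0, V(s')=5.0, gamma=0.95, done=False
td_target = 2.0 + 0.95 * 5.0   # 6.75
delta = td_target - 10.0       # -3.25 (negative: V(s) was too optimistic)
```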
Constraints