In reinforcement learning, an agent learns to make optimal decisions by interacting with an environment and receiving feedback in the form of rewards. One of the most influential algorithms in this domain is On-Policy Temporal Difference (TD) Learning, which updates value estimates based on the actions the agent actually takes while following its current policy.
The algorithm you will implement is a foundational on-policy control method that learns action-value functions (Q-values). Unlike Monte Carlo methods that wait until the end of an episode to update estimates, temporal difference methods update values after each step, making them more efficient for long or continuous episodes.
The core insight of TD learning is the bootstrapping principle: we update our estimate of one state-action pair's value based on the estimate of another. The update rule follows this pattern:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \cdot [r + \gamma \cdot Q(s',a') - Q(s,a)]$$
Where:
- $Q(s,a)$ is the current value estimate for taking action $a$ in state $s$
- $\alpha$ is the learning rate, controlling how far each update moves the estimate
- $\gamma$ is the discount factor applied to future value
- $r$ is the reward received for the transition
- $s'$ and $a'$ are the next state and the action actually chosen there under the current policy
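The update rule translates directly into code. This small helper (the name `sarsa_update` is illustrative, not part of the required interface) performs one such update:

```python
def sarsa_update(q_sa, reward, q_next, alpha, gamma):
    # Move Q(s,a) a fraction alpha toward the TD target r + gamma * Q(s',a').
    return q_sa + alpha * (reward + gamma * q_next - q_sa)

# First update of Example 1 below: Q(A,left) with all estimates still at 0.
print(sarsa_update(0.0, 5.0, 0.0, alpha=0.1, gamma=0.9))  # 0.5
```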
Implement the compute_action_values function that runs the on-policy TD learning algorithm across all episodes and returns the final Q-value estimates for each state-action pair.
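One possible shape for this function, assuming the input format shown in the examples below (a transitions dict keyed by "state,action" strings, a list of episode start states, and scalar alpha, gamma, max_steps), is sketched here. It is one way to realize the described behavior, not a reference solution:

```python
def compute_action_values(transitions, initial_states, alpha, gamma, max_steps):
    """On-policy TD (SARSA-style) control with a greedy, alphabetical-tie-break policy."""
    q = {key: 0.0 for key in transitions}  # one estimate per "state,action" key

    def greedy(state):
        # Actions available in `state`, in alphabetical order; max() keeps the
        # first maximal element, so ties break alphabetically.
        acts = sorted(k.split(",")[1] for k in transitions if k.split(",")[0] == state)
        return max(acts, key=lambda a: q[f"{state},{a}"])

    for start in initial_states:
        state, action = start, greedy(start)
        for _ in range(max_steps):
            key = f"{state},{action}"
            reward = transitions[key]["reward"]
            next_state = transitions[key]["next_state"]
            if next_state == "terminal":
                # Terminal states have no actions, so the bootstrap term is 0.
                q[key] += alpha * (reward - q[key])
                break
            next_action = greedy(next_state)
            q[key] += alpha * (reward + gamma * q[f"{next_state},{next_action}"] - q[key])
            state, action = next_state, next_action
    return q
```

Note that the next action is selected *before* its Q-value is used in the update; that is what makes the method on-policy.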
transitions = {
"A,left": {"reward": 5.0, "next_state": "B"},
"A,right": {"reward": 1.0, "next_state": "C"},
"B,left": {"reward": 2.0, "next_state": "A"},
"B,right": {"reward": 0.0, "next_state": "C"},
"C,down": {"reward": 1.0, "next_state": "terminal"}
}
initial_states = ["A", "B"]
alpha = 0.1
gamma = 0.9
max_steps = 10

Expected output:
{"A,left": 4.2181, "A,right": 0.0, "B,left": 2.7901, "B,right": 0.0, "C,down": 0.0}

Episode 1 (starting from A): With all Q-values at 0, both actions from A are tied. Alphabetically, "left" comes before "right", so the agent chooses "left".
Step 1: (A, left) → reward=5.0, next_state=B. From B, the agent again chooses "left" (alphabetical tie-break).
Q(A,left) ← 0 + 0.1 × [5.0 + 0.9 × Q(B,left) - 0] = 0.1 × 5.0 = 0.5

Step 2: (B, left) → reward=2.0, next_state=A.
Q(B,left) ← 0 + 0.1 × [2.0 + 0.9 × Q(A,left) - 0] = 0.1 × [2.0 + 0.9 × 0.5] = 0.245
The agent continues alternating between A and B, building up Q-values through the cycle.
Episode 2 (starting from B): Now Q(A,left) and Q(B,left) have positive values from Episode 1. The greedy policy leverages these estimates, and updates continue to refine the Q-values.
After both episodes complete (or hit max_steps), Q(A,left) ≈ 4.2181 and Q(B,left) ≈ 2.7901 reflect the cumulative learning. Q(A,right), Q(B,right), and Q(C,down) remain 0.0: the greedy policy never selects "right" from A or B, so state C is never visited and its action is never updated.
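Because the greedy policy only ever picks "left" here, the whole run reduces to an A↔B cycle over the two Q-values involved. This standalone sketch (variable names illustrative) replays the cycle as a numerical check of the trace above:

```python
alpha, gamma = 0.1, 0.9
q_a = q_b = 0.0  # Q(A,left) and Q(B,left)

# Episode 1 starts at A, episode 2 at B; each runs for max_steps = 10.
# Each update bootstraps from the other pair's current estimate.
for start in ("A", "B"):
    state = start
    for _ in range(10):
        if state == "A":
            q_a += alpha * (5.0 + gamma * q_b - q_a)
            state = "B"
        else:
            q_b += alpha * (2.0 + gamma * q_a - q_b)
            state = "A"

print(round(q_a, 4), round(q_b, 4))  # 4.2181 2.7901
```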
transitions = {
"S,go": {"reward": 10.0, "next_state": "terminal"}
}
initial_states = ["S"]
alpha = 0.5
gamma = 0.9
max_steps = 5

Expected output:
{"S,go": 5.0}

This is a single-state, single-action environment. Starting from state S, the only available action is "go", which yields a reward of 10.0 and transitions to the terminal state.
Episode (starting from S):
Step 1: (S, go) → reward=10.0, next_state=terminal. Since the terminal state has no actions, Q(s',a') = 0.
Q(S,go) ← 0 + 0.5 × [10.0 + 0.9 × 0 - 0] = 0.5 × 10.0 = 5.0
The episode ends immediately. With only one episode, the final Q-value is 5.0.
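The arithmetic above can be reproduced in a couple of lines (`q_sg` is just an illustrative variable name):

```python
alpha, gamma = 0.5, 0.9
q_sg = 0.0
# The terminal state bootstraps to 0, so the TD target is just the reward.
td_target = 10.0 + gamma * 0.0
q_sg += alpha * (td_target - q_sg)
print(q_sg)  # 5.0
```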
transitions = {
"A,right": {"reward": 1.0, "next_state": "B"},
"B,right": {"reward": 2.0, "next_state": "C"},
"C,right": {"reward": 3.0, "next_state": "terminal"}
}
initial_states = ["A", "A", "B"]
alpha = 0.2
gamma = 0.95
max_steps = 10

Expected output:
{"A,right": 0.436, "B,right": 1.2724, "C,right": 1.464}

This is a linear chain environment: A → B → C → terminal. Each state has only one action ("right"), so the greedy policy always selects it.
Episode 1 (starting from A):
Step 1: (A, right) → reward=1.0, next_state=B.
Q(A,right) ← 0 + 0.2 × [1.0 + 0.95 × Q(B,right) - 0] = 0.2 × 1.0 = 0.2

Step 2: (B, right) → reward=2.0, next_state=C.
Q(B,right) ← 0 + 0.2 × [2.0 + 0.95 × Q(C,right) - 0] = 0.2 × 2.0 = 0.4

Step 3: (C, right) → reward=3.0, next_state=terminal.
Q(C,right) ← 0 + 0.2 × [3.0 + 0.95 × 0 - 0] = 0.2 × 3.0 = 0.6
Episode 2 (starting from A again): Values continue to propagate backward. Each visit updates Q-values toward the true discounted return.
Step 1: Q(A,right) ← 0.2 + 0.2 × [1.0 + 0.95 × 0.4 - 0.2] = 0.2 + 0.2 × 1.18 = 0.436
Step 2: Q(B,right) ← 0.4 + 0.2 × [2.0 + 0.95 × 0.6 - 0.4] = 0.4 + 0.2 × 2.17 = 0.834
Step 3: Q(C,right) ← 0.6 + 0.2 × [3.0 + 0 - 0.6] = 0.6 + 0.2 × 2.4 = 1.08
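These hand computations can be checked numerically. The sketch below (variable names illustrative) replays all three episodes of this chain, exploiting the fact that each episode simply walks forward to the terminal state:

```python
alpha, gamma = 0.2, 0.95
chain = ["A,right", "B,right", "C,right"]
rewards = {"A,right": 1.0, "B,right": 2.0, "C,right": 3.0}
q = {key: 0.0 for key in chain}

# Episodes start at A, A, B, i.e. chain indices 0, 0, 1.
for start in (0, 0, 1):
    for i in range(start, len(chain)):
        key = chain[i]
        # Bootstrap from the next pair in the chain, or 0 at the terminal state.
        q_next = q[chain[i + 1]] if i + 1 < len(chain) else 0.0
        q[key] += alpha * (rewards[key] + gamma * q_next - q[key])

print({k: round(v, 4) for k, v in q.items()})
```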
Episode 3 (starting from B): Starts partway through the chain, updating only Q(B,right) and Q(C,right).
After all episodes: Q(A,right)=0.436, Q(B,right)=1.2724, Q(C,right)=1.464.
Constraints