In reinforcement learning, an agent learns to make optimal decisions by interacting with an environment and receiving feedback in the form of rewards. One of the most influential algorithms in this domain is On-Policy Temporal Difference (TD) Learning, which updates value estimates based on the actions the agent actually takes while following its current policy.
The algorithm you will implement is a foundational on-policy control method that learns action-value functions (Q-values). Unlike Monte Carlo methods that wait until the end of an episode to update estimates, temporal difference methods update values after each step, making them more efficient for long or continuous episodes.
The core insight of TD learning is the bootstrapping principle: we update our estimate of one state-action pair's value based on the estimate of another. The update rule follows this pattern:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \cdot [r + \gamma \cdot Q(s',a') - Q(s,a)]$$
Where:
- $Q(s,a)$ is the current value estimate for taking action $a$ in state $s$
- $\alpha$ is the learning rate, controlling how far each update moves the estimate
- $\gamma$ is the discount factor applied to future value
- $r$ is the reward received for the transition
- $s'$ and $a'$ are the next state and the action actually chosen there under the current policy
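The update rule translates directly into code. This small helper (the name `sarsa_update` is illustrative, not part of the required interface) performs one such update:

```python
def sarsa_update(q_sa, reward, q_next, alpha, gamma):
    # Move Q(s,a) a fraction alpha toward the TD target r + gamma * Q(s',a').
    return q_sa + alpha * (reward + gamma * q_next - q_sa)

# First update of Example 1 below: Q(A,left) with all estimates still at 0.
print(sarsa_update(0.0, 5.0, 0.0, alpha=0.1, gamma=0.9))  # 0.5
```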
Implement the compute_action_values function that runs the on-policy TD learning algorithm across all episodes and returns the final Q-value estimates for each state-action pair.
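One possible shape for this function, assuming the input format shown in the examples below (a transitions dict keyed by "state,action" strings, a list of episode start states, and scalar alpha, gamma, max_steps), is sketched here. It is one way to realize the described behavior, not a reference solution:

```python
def compute_action_values(transitions, initial_states, alpha, gamma, max_steps):
    """On-policy TD (SARSA-style) control with a greedy, alphabetical-tie-break policy."""
    q = {key: 0.0 for key in transitions}  # one estimate per "state,action" key

    def greedy(state):
        # Actions available in `state`, in alphabetical order; max() keeps the
        # first maximal element, so ties break alphabetically.
        acts = sorted(k.split(",")[1] for k in transitions if k.split(",")[0] == state)
        return max(acts, key=lambda a: q[f"{state},{a}"])

    for start in initial_states:
        state, action = start, greedy(start)
        for _ in range(max_steps):
            key = f"{state},{action}"
            reward = transitions[key]["reward"]
            next_state = transitions[key]["next_state"]
            if next_state == "terminal":
                # Terminal states have no actions, so the bootstrap term is 0.
                q[key] += alpha * (reward - q[key])
                break
            next_action = greedy(next_state)
            q[key] += alpha * (reward + gamma * q[f"{next_state},{next_action}"] - q[key])
            state, action = next_state, next_action
    return q
```

Note that the next action is selected *before* its Q-value is used in the update; that is what makes the method on-policy.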
transitions = {
"A,left": {"reward": 5.0, "next_state": "B"},
"A,right": {"reward": 1.0, "next_state": "C"},
"B,left": {"reward": 2.0, "next_state": "A"},
"B,right": {"reward": 0.0, "next_state": "C"},
"C,down": {"reward": 1.0, "next_state": "terminal"}
}
initial_states = ["A", "B"]
alpha = 0.1
gamma = 0.9
max_steps = 10

Expected output:
{"A,left": 4.2181, "A,right": 0.0, "B,left": 2.7901, "B,right": 0.0, "C,down": 0.0}

Episode 1 (starting from A): With all Q-values at 0, both actions from A are tied. Alphabetically, "left" comes before "right", so the agent chooses "left".
Step 1: (A, left) → reward=5.0, next_state=B. From B, the agent again chooses "left" (alphabetical tie-break).
Q(A,left) ← 0 + 0.1 × [5.0 + 0.9 × Q(B,left) - 0] = 0.1 × 5.0 = 0.5

Step 2: (B, left) → reward=2.0, next_state=A.
Q(B,left) ← 0 + 0.1 × [2.0 + 0.9 × Q(A,left) - 0] = 0.1 × [2.0 + 0.9 × 0.5] = 0.245
The agent continues alternating between A and B, building up Q-values through the cycle.
Episode 2 (starting from B): Now Q(A,left) and Q(B,left) have positive values from Episode 1. The greedy policy leverages these estimates, and updates continue to refine the Q-values.
After both episodes complete (or hit max_steps), Q(A,left) ≈ 4.2181 and Q(B,left) ≈ 2.7901 reflect the cumulative learning. Q(A,right), Q(B,right), and Q(C,down) remain 0.0: the greedy policy never selects "right" from A or B, so state C is never visited and its action is never updated.
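Because the greedy policy only ever picks "left" here, the whole run reduces to an A↔B cycle over the two Q-values involved. This standalone sketch (variable names illustrative) replays the cycle as a numerical check of the trace above:

```python
alpha, gamma = 0.1, 0.9
q_a = q_b = 0.0  # Q(A,left) and Q(B,left)

# Episode 1 starts at A, episode 2 at B; each runs for max_steps = 10.
# Each update bootstraps from the other pair's current estimate.
for start in ("A", "B"):
    state = start
    for _ in range(10):
        if state == "A":
            q_a += alpha * (5.0 + gamma * q_b - q_a)
            state = "B"
        else:
            q_b += alpha * (2.0 + gamma * q_a - q_b)
            state = "A"

print(round(q_a, 4), round(q_b, 4))  # 4.2181 2.7901
```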
transitions = {
"S,go": {"reward": 10.0, "next_state": "terminal"}
}
initial_states = ["S"]
alpha = 0.5
gamma = 0.9
max_steps = 5

Expected output:
{"S,go": 5.0}

This is a single-state, single-action environment. Starting from state S, the only available action is "go", which yields a reward of 10.0 and transitions to the terminal state.
Episode (starting from S):
Step 1: (S, go) → reward=10.0, next_state=terminal. Since the terminal state has no actions, Q(s',a') = 0.
Q(S,go) ← 0 + 0.5 × [10.0 + 0.9 × 0 - 0] = 0.5 × 10.0 = 5.0
The episode ends immediately. With only one episode, the final Q-value is 5.0.
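The arithmetic above can be reproduced in a couple of lines (`q_sg` is just an illustrative variable name):

```python
alpha, gamma = 0.5, 0.9
q_sg = 0.0
# The terminal state bootstraps to 0, so the TD target is just the reward.
td_target = 10.0 + gamma * 0.0
q_sg += alpha * (td_target - q_sg)
print(q_sg)  # 5.0
```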
transitions = {
"A,right": {"reward": 1.0, "next_state": "B"},
"B,right": {"reward": 2.0, "next_state": "C"},
"C,right": {"reward": 3.0, "next_state": "terminal"}
}
initial_states = ["A", "A", "B"]
alpha = 0.2
gamma = 0.95
max_steps = 10

Expected output:
{"A,right": 0.436, "B,right": 1.2724, "C,right": 1.464}

This is a linear chain environment: A → B → C → terminal. Each state has only one action ("right"), so the greedy policy always selects it.
Episode 1 (starting from A):
Step 1: (A, right) → reward=1.0, next_state=B.
Q(A,right) ← 0 + 0.2 × [1.0 + 0.95 × Q(B,right) - 0] = 0.2 × 1.0 = 0.2

Step 2: (B, right) → reward=2.0, next_state=C.
Q(B,right) ← 0 + 0.2 × [2.0 + 0.95 × Q(C,right) - 0] = 0.2 × 2.0 = 0.4

Step 3: (C, right) → reward=3.0, next_state=terminal.
Q(C,right) ← 0 + 0.2 × [3.0 + 0.95 × 0 - 0] = 0.2 × 3.0 = 0.6
Episode 2 (starting from A again): Values continue to propagate backward. Each visit updates Q-values toward the true discounted return.
Step 1: Q(A,right) ← 0.2 + 0.2 × [1.0 + 0.95 × 0.4 - 0.2] = 0.2 + 0.2 × 1.18 = 0.436
Step 2: Q(B,right) ← 0.4 + 0.2 × [2.0 + 0.95 × 0.6 - 0.4] = 0.4 + 0.2 × 2.17 = 0.834
Step 3: Q(C,right) ← 0.6 + 0.2 × [3.0 + 0 - 0.6] = 0.6 + 0.2 × 2.4 = 1.08
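These hand computations can be checked numerically. The sketch below (variable names illustrative) replays all three episodes of this chain, exploiting the fact that each episode simply walks forward to the terminal state:

```python
alpha, gamma = 0.2, 0.95
chain = ["A,right", "B,right", "C,right"]
rewards = {"A,right": 1.0, "B,right": 2.0, "C,right": 3.0}
q = {key: 0.0 for key in chain}

# Episodes start at A, A, B, i.e. chain indices 0, 0, 1.
for start in (0, 0, 1):
    for i in range(start, len(chain)):
        key = chain[i]
        # Bootstrap from the next pair in the chain, or 0 at the terminal state.
        q_next = q[chain[i + 1]] if i + 1 < len(chain) else 0.0
        q[key] += alpha * (rewards[key] + gamma * q_next - q[key])

print({k: round(v, 4) for k, v in q.items()})
```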
Episode 3 (starting from B): Starts partway through the chain, updating only Q(B,right) and Q(C,right).
After all episodes: Q(A,right)=0.436, Q(B,right)=1.2724, Q(C,right)=1.464.
Constraints