Temporal Difference (TD) Learning is a foundational approach in reinforcement learning that enables agents to learn optimal decision-making strategies through direct interaction with an environment. Unlike supervised learning where correct outputs are provided, TD methods learn from experience by updating value estimates based on the difference between predicted and observed outcomes.
A Markov Decision Process (MDP) provides a mathematical framework for modeling sequential decision-making. It consists of:
• A set of states S
• A set of actions A
• A transition function P(s′ | s, a) giving the probability of landing in state s′ after taking action a in state s
• A reward function R(s, a) giving the immediate reward for taking action a in state s
The Q-function (or action-value function) represents the expected cumulative discounted reward of taking action a in state s and then following an optimal policy:
$$Q(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$$
where γ (gamma) is the discount factor that determines how much future rewards are valued compared to immediate rewards.
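As a quick numeric illustration of the discount factor (the reward sequence below is an assumption for illustration, not part of the problem):

```python
# Discounted return for an assumed reward sequence r_t = [1.0, 0.0, 1.0]
# with gamma = 0.9: G = 1.0 + 0.9 * 0.0 + 0.81 * 1.0 = 1.81
gamma = 0.9
rewards = [1.0, 0.0, 1.0]
G = sum(gamma**t * r for t, r in enumerate(rewards))
```

A reward two steps in the future is worth only γ² ≈ 0.81 of the same reward received immediately.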
The core of TD learning is the update rule that refines value estimates after each experience:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
Where:
• α (alpha) is the learning rate, controlling how far each update moves the estimate
• r is the immediate reward received after taking action a in state s
• s′ is the resulting next state, and max_{a′} Q(s′, a′) is the value of the best action available from it
The term in brackets is called the Temporal Difference Error — the discrepancy between the current estimate and the observed outcome.
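The update rule can be sketched as a single function on a NumPy Q-table (the function name and the sample transition below are assumptions for illustration):

```python
import numpy as np

def td_update(Q, s, a, r, s_next, alpha, gamma):
    """One temporal-difference (Q-learning) update on a Q-table."""
    td_target = r + gamma * np.max(Q[s_next])  # observed outcome estimate
    td_error = td_target - Q[s, a]             # temporal difference error
    Q[s, a] += alpha * td_error
    return Q

Q = np.zeros((2, 2))
td_update(Q, s=0, a=0, r=1.0, s_next=1, alpha=0.1, gamma=0.9)
# Q[0, 0] moves from 0.0 toward the target 1.0 by a step of alpha: now 0.1
```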
To balance exploration (trying new actions) with exploitation (choosing the best known action), the agent uses an ε-greedy policy: with probability ε it selects a uniformly random action, and otherwise it selects the action with the highest current Q-value.
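The ε-greedy selection rule can be sketched as follows (the helper name is an assumption):

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon):
    """Return a random action with probability epsilon, else the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.choice(Q.shape[1])  # explore: uniform random action
    return int(np.argmax(Q[state]))          # exploit: best known action
```

With ε = 0 this is purely greedy; with ε = 1 it is purely random.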
Implement a function that trains a Q-table using temporal difference learning over multiple episodes. The algorithm should:
• Initialize the Q-table to zeros
• For each episode, start from a randomly sampled state
• Select actions with an ε-greedy policy, sample next states from the transition probabilities P, and apply the TD update rule after each step
• End an episode when a terminal state is reached, and return the final Q-table
Important Implementation Detail: Use np.random.seed(42) at the beginning of your function to ensure reproducible results. Use np.random.choice() for sampling states and actions.
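Putting the pieces together, one possible shape of the training loop is sketched below. The exact order of random draws determines the numeric results, so treat this as a sketch of the structure under the stated conventions (seed 42, np.random.choice), not a guaranteed match for the expected outputs; the function and parameter names are assumptions.

```python
import numpy as np

def td_learning(num_states, num_actions, P, R, terminal_states,
                alpha, gamma, epsilon, num_episodes):
    """Sketch of tabular TD (Q-learning) training on a finite MDP."""
    np.random.seed(42)
    Q = np.zeros((num_states, num_actions))
    for _ in range(num_episodes):
        s = np.random.choice(num_states)          # random start state
        while s not in terminal_states:
            if np.random.rand() < epsilon:        # epsilon-greedy action
                a = np.random.choice(num_actions)
            else:
                a = int(np.argmax(Q[s]))
            r = R[s][a]                           # immediate reward
            s_next = np.random.choice(num_states, p=P[s][a])
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```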
num_states = 2
num_actions = 2
P = [[[0, 1], [1, 0]], [[1, 0], [1, 0]]]
R = [[1, 0], [0, 0]]
terminal_states = [1]
alpha = 0.1
gamma = 0.9
epsilon = 0.1
num_episodes = 10

Expected output:
[[0.65132156, 0.052902], [0.0, 0.0]]

This is a simple 2-state MDP where state 1 is terminal:
• State 0 (non-terminal): action 0 gives reward 1 and moves deterministically to terminal state 1; action 1 gives reward 0 and returns to state 0
• State 1 (terminal): Q-values remain 0 as no actions are taken from terminal states
Over 10 episodes, the agent learns that action 0 is clearly preferable in state 0 (Q ≈ 0.65 versus Q ≈ 0.05 for action 1).
The Q-values for state 1 remain [0, 0] since it's a terminal state.
num_states = 3
num_actions = 2
P = [[[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]], [[0.3, 0.4, 0.3], [0.2, 0.3, 0.5]], [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]]
R = [[1.0, 0.0], [0.5, 0.5], [0.0, 0.0]]
terminal_states = [2]
alpha = 0.2
gamma = 0.95
epsilon = 0.2
num_episodes = 20

Expected output:
[[3.98498385, 0.18733587], [2.8834386, 1.64292912], [0.0, 0.0]]

A 3-state environment with stochastic transitions and state 2 as terminal:
• State 0: Action 0 gives reward 1.0 with equal chance to stay or move to state 1. Action 1 can reach terminal state 2.
• State 1: Both actions give reward 0.5 with varying transition probabilities toward the terminal state.
• State 2 (terminal): No further actions, Q-values stay at 0.
After 20 episodes with learning rate 0.2 and higher exploration (ε = 0.2), the agent values action 0 in state 0 most highly (Q ≈ 3.98) and has built up substantial estimates for both actions in state 1, while the terminal state's Q-values remain at 0.
num_states = 3
num_actions = 3
P = [[[0.496, 0.062, 0.442], [0.271, 0.270, 0.458], [0.300, 0.312, 0.388]], [[0.334, 0.452, 0.214], [0.563, 0.149, 0.287], [0.344, 0.339, 0.317]], [[0.432, 0.143, 0.425], [0.095, 0.312, 0.594], [0.145, 0.372, 0.483]]]
R = [[1, 2, 3], [0, 1, 0], [0, 0, 0]]
terminal_states = [2]
alpha = 0.1
gamma = 0.9
epsilon = 0.15
num_episodes = 15

Expected output:
[[2.19963675, 0.0, 0.67575589], [0.68363551, 0.0, 0.0], [0.0, 0.0, 0.0]]

A more complex 3-state, 3-action environment with varied rewards:
• State 0: Actions yield rewards 1, 2, and 3 respectively with stochastic transitions
• State 1: Mixed rewards with complex transition dynamics
• State 2 (terminal): Episode ends here
With only 15 episodes and a conservative learning rate (α = 0.1), several state-action pairs are never explored and keep their initial value of 0, while the visited pairs (such as action 0 in state 0, Q ≈ 2.20) have only partially converged.
Constraints