Temporal Difference (TD) Learning is a foundational approach in reinforcement learning that enables agents to learn optimal decision-making strategies through direct interaction with an environment. Unlike supervised learning where correct outputs are provided, TD methods learn from experience by updating value estimates based on the difference between predicted and observed outcomes.
A Markov Decision Process (MDP) provides a mathematical framework for modeling sequential decision-making. It consists of:
• A set of states S
• A set of actions A
• A transition function P(s′ | s, a) giving the probability of landing in state s′ after taking action a in state s
• A reward function R(s, a) giving the immediate reward for taking action a in state s
The Q-function (or action-value function) represents the expected cumulative discounted reward of taking action a in state s and then following an optimal policy:
$$Q(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$$
where γ (gamma) is the discount factor that determines how much future rewards are valued compared to immediate rewards.
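As a quick numeric illustration of the discount factor (the reward sequence below is an assumption for illustration, not part of the problem):

```python
# Discounted return for an assumed reward sequence r_t = [1.0, 0.0, 1.0]
# with gamma = 0.9: G = 1.0 + 0.9 * 0.0 + 0.81 * 1.0 = 1.81
gamma = 0.9
rewards = [1.0, 0.0, 1.0]
G = sum(gamma**t * r for t, r in enumerate(rewards))
```

A reward two steps in the future is worth only γ² ≈ 0.81 of the same reward received immediately.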
The core of TD learning is the update rule that refines value estimates after each experience:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
Where:
• α (alpha) is the learning rate, controlling how far each update moves the estimate
• r is the immediate reward received after taking action a in state s
• s′ is the resulting next state, and max_{a′} Q(s′, a′) is the value of the best action available from it
The term in brackets is called the Temporal Difference Error — the discrepancy between the current estimate and the observed outcome.
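The update rule can be sketched as a single function on a NumPy Q-table (the function name and the sample transition below are assumptions for illustration):

```python
import numpy as np

def td_update(Q, s, a, r, s_next, alpha, gamma):
    """One temporal-difference (Q-learning) update on a Q-table."""
    td_target = r + gamma * np.max(Q[s_next])  # observed outcome estimate
    td_error = td_target - Q[s, a]             # temporal difference error
    Q[s, a] += alpha * td_error
    return Q

Q = np.zeros((2, 2))
td_update(Q, s=0, a=0, r=1.0, s_next=1, alpha=0.1, gamma=0.9)
# Q[0, 0] moves from 0.0 toward the target 1.0 by a step of alpha: now 0.1
```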
To balance exploration (trying new actions) with exploitation (choosing the best known action), the agent uses an ε-greedy policy: with probability ε it selects a uniformly random action, and otherwise it selects the action with the highest current Q-value.
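The ε-greedy selection rule can be sketched as follows (the helper name is an assumption):

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon):
    """Return a random action with probability epsilon, else the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.choice(Q.shape[1])  # explore: uniform random action
    return int(np.argmax(Q[state]))          # exploit: best known action
```

With ε = 0 this is purely greedy; with ε = 1 it is purely random.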
Implement a function that trains a Q-table using temporal difference learning over multiple episodes. The algorithm should:
• Initialize the Q-table to zeros
• For each episode, start from a randomly sampled state
• Select actions with an ε-greedy policy, sample next states from the transition probabilities P, and apply the TD update rule after each step
• End an episode when a terminal state is reached, and return the final Q-table
Important Implementation Detail: Use np.random.seed(42) at the beginning of your function to ensure reproducible results. Use np.random.choice() for sampling states and actions.
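Putting the pieces together, one possible shape of the training loop is sketched below. The exact order of random draws determines the numeric results, so treat this as a sketch of the structure under the stated conventions (seed 42, np.random.choice), not a guaranteed match for the expected outputs; the function and parameter names are assumptions.

```python
import numpy as np

def td_learning(num_states, num_actions, P, R, terminal_states,
                alpha, gamma, epsilon, num_episodes):
    """Sketch of tabular TD (Q-learning) training on a finite MDP."""
    np.random.seed(42)
    Q = np.zeros((num_states, num_actions))
    for _ in range(num_episodes):
        s = np.random.choice(num_states)          # random start state
        while s not in terminal_states:
            if np.random.rand() < epsilon:        # epsilon-greedy action
                a = np.random.choice(num_actions)
            else:
                a = int(np.argmax(Q[s]))
            r = R[s][a]                           # immediate reward
            s_next = np.random.choice(num_states, p=P[s][a])
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```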
num_states = 2
num_actions = 2
P = [[[0, 1], [1, 0]], [[1, 0], [1, 0]]]
R = [[1, 0], [0, 0]]
terminal_states = [1]
alpha = 0.1
gamma = 0.9
epsilon = 0.1
num_episodes = 10

Expected output:
[[0.65132156, 0.052902], [0.0, 0.0]]

This is a simple 2-state MDP where state 1 is terminal:
• State 0 (non-terminal): action 0 gives reward 1 and moves deterministically to terminal state 1; action 1 gives reward 0 and returns to state 0
• State 1 (terminal): Q-values remain 0 as no actions are taken from terminal states
Over 10 episodes, the agent learns that action 0 is clearly preferable in state 0 (Q ≈ 0.65 versus Q ≈ 0.05 for action 1).
The Q-values for state 1 remain [0, 0] since it's a terminal state.
num_states = 3
num_actions = 2
P = [[[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]], [[0.3, 0.4, 0.3], [0.2, 0.3, 0.5]], [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]]
R = [[1.0, 0.0], [0.5, 0.5], [0.0, 0.0]]
terminal_states = [2]
alpha = 0.2
gamma = 0.95
epsilon = 0.2
num_episodes = 20

Expected output:
[[3.98498385, 0.18733587], [2.8834386, 1.64292912], [0.0, 0.0]]

A 3-state environment with stochastic transitions and state 2 as terminal:
• State 0: Action 0 gives reward 1.0 with equal chance to stay or move to state 1. Action 1 can reach terminal state 2.
• State 1: Both actions give reward 0.5 with varying transition probabilities toward the terminal state.
• State 2 (terminal): No further actions, Q-values stay at 0.
After 20 episodes with learning rate 0.2 and higher exploration (ε = 0.2), the agent values action 0 in state 0 most highly (Q ≈ 3.98) and has built up substantial estimates for both actions in state 1, while the terminal state's Q-values remain at 0.
num_states = 3
num_actions = 3
P = [[[0.496, 0.062, 0.442], [0.271, 0.270, 0.458], [0.300, 0.312, 0.388]], [[0.334, 0.452, 0.214], [0.563, 0.149, 0.287], [0.344, 0.339, 0.317]], [[0.432, 0.143, 0.425], [0.095, 0.312, 0.594], [0.145, 0.372, 0.483]]]
R = [[1, 2, 3], [0, 1, 0], [0, 0, 0]]
terminal_states = [2]
alpha = 0.1
gamma = 0.9
epsilon = 0.15
num_episodes = 15

Expected output:
[[2.19963675, 0.0, 0.67575589], [0.68363551, 0.0, 0.0], [0.0, 0.0, 0.0]]

A more complex 3-state, 3-action environment with varied rewards:
• State 0: Actions yield rewards 1, 2, and 3 respectively with stochastic transitions
• State 1: Mixed rewards with complex transition dynamics
• State 2 (terminal): Episode ends here
With only 15 episodes and a conservative learning rate (α = 0.1), several state-action pairs are never explored and keep their initial value of 0, while the visited pairs (such as action 0 in state 0, Q ≈ 2.20) have only partially converged.
Constraints