Value iteration is powerful but has a critical limitation: it requires complete knowledge of the environment's dynamics—the transition probabilities $P(s'|s,a)$ and reward function $R(s,a)$. In most real-world scenarios, we don't have this luxury. A robot doesn't know the exact physics of its environment. A game-playing agent doesn't have access to the opponent's strategy. A trading algorithm can't predict market dynamics.
Q-learning, introduced by Chris Watkins in 1989, revolutionized reinforcement learning by showing how to learn optimal behavior directly from experience, without ever constructing a model of the environment. This model-free approach learns action-values (Q-values) by interacting with the environment and observing the consequences of actions.
Remarkably, Q-learning is proven to converge to optimal Q-values under mild conditions—it effectively performs value iteration by sampling, using each observed transition as an incremental update toward the true value.
This page provides a complete treatment of Q-learning: the intuition behind temporal-difference learning, rigorous algorithm derivation, convergence proofs, exploration-exploitation strategies, tabular implementation, common pitfalls, and the bridge to deep Q-networks. You'll understand not just how Q-learning works, but why it works.
Recall the Bellman optimality equation for Q-values:
$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' | s, a) \max_{a'} Q^*(s', a')$$
Value iteration computes this exactly by iterating:
$$Q_{k+1}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' | s, a) \max_{a'} Q_k(s', a')$$
But this requires knowing $P(s'|s,a)$—the transition dynamics. What if we could approximate this expectation using samples instead?
The expectation $\mathbb{E}_{s'}[\max_{a'} Q(s', a')]$ over next states can be approximated by observing actual transitions. If we take action $a$ in state $s$ and observe the transition $(s, a, r, s')$, we get a single sample of what value iteration would compute. By averaging many such samples, we converge to the true expectation.
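As a quick sanity check, the sketch below uses a hypothetical two-outcome transition (the probabilities, reward, and Q-values are made up purely for illustration) to compare the exact value-iteration backup with the average of sampled one-step targets:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Hypothetical setup: action a in state s leads to s' = 0 or s' = 1
next_states = np.array([0, 1])
probs = np.array([0.7, 0.3])             # P(s'|s,a) -- known here only for comparison
reward = 1.0                             # R(s, a)
Q = np.array([[2.0, 0.5],                # current Q(s', a') estimates
              [0.0, 3.0]])

# Exact backup, as value iteration would compute it
exact = reward + gamma * np.sum(probs * Q.max(axis=1))

# Sample-based estimate: average many observed one-step targets
samples = rng.choice(next_states, size=10_000, p=probs)
targets = reward + gamma * Q[samples].max(axis=1)
print(f"exact backup = {exact:.3f}, sampled average = {targets.mean():.3f}")
```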
One approach is Monte Carlo estimation: complete entire episodes, compute actual returns $G_t = \sum_{k=0}^\infty \gamma^k r_{t+k+1}$, then average:
$$Q(s, a) \leftarrow \text{Average}(G_t \text{ for all visits to } (s, a))$$
This works but has drawbacks:
- Updates can only be made at the end of an episode, so learning is slow when episodes are long.
- Returns accumulate randomness from the entire trajectory, so the estimates have high variance.
- It applies only to episodic tasks, where every trajectory eventually terminates.
What if we could bootstrap—use current estimates to improve themselves? This is the core of temporal difference (TD) learning:
Instead of waiting for the true return $G_t$, we use a one-step lookahead:
$$G_t \approx r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a')$$
This is called the TD target. It combines the observed immediate reward with our current estimate of future value. As our estimates improve, so does the target.
The TD target $r + \gamma \max_{a'} Q(s', a')$ is a biased estimator of $Q^*(s, a)$ because it uses our current (imperfect) estimate. But it has much lower variance than Monte Carlo returns because it doesn't accumulate randomness over the entire trajectory.
| Method | Bias | Variance | Update Timing |
|---|---|---|---|
| Monte Carlo | Unbiased | High (full trajectory) | End of episode |
| TD (1-step) | Biased | Low (1 step of randomness) | Every step |
| n-step TD | Moderate | Moderate | Every n steps |
The bias-variance trade-off generally favors TD methods in practice, especially when function approximation is involved.
Q-learning is an off-policy TD control algorithm. Let's unpack both terms:
- TD control: values are updated after every step from a bootstrapped one-step target, and the goal is to learn an optimal policy rather than merely evaluate a fixed one.
- Off-policy: the policy being learned about (the greedy target policy) differs from the policy that generates the experience (an exploratory behavior policy such as ε-greedy).
After observing a transition $(s, a, r, s')$, the update is:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
Let's dissect each component:
| Component | Meaning |
|---|---|
| $Q(s, a)$ | Current estimate of action-value |
| $\alpha \in (0, 1]$ | Learning rate (step size) |
| $r + \gamma \max_{a'} Q(s', a')$ | TD target: observed reward + max future value |
| $r + \gamma \max_{a'} Q(s', a') - Q(s, a)$ | TD error: difference between target and estimate |
| The full update | Move Q(s,a) toward the target by a fraction α |
The max operator in the TD target selects the action that maximizes Q(s', a'), regardless of which action the agent actually will take. This means Q-learning learns the value of the optimal policy π* even while following an exploratory behavior policy (like ε-greedy). The behavior policy and target policy are different—hence 'off-policy'.
Q-Learning (Tabular):
```python
import numpy as np
from typing import Tuple, List


class QLearningAgent:
    """
    Tabular Q-Learning Agent with epsilon-greedy exploration.

    This implementation emphasizes clarity and educational value,
    with detailed documentation of each component.
    """

    def __init__(
        self,
        num_states: int,
        num_actions: int,
        learning_rate: float = 0.1,
        discount_factor: float = 0.99,
        epsilon_start: float = 1.0,
        epsilon_end: float = 0.01,
        epsilon_decay: float = 0.995
    ):
        """
        Initialize Q-learning agent.

        Args:
            num_states: Size of state space
            num_actions: Size of action space
            learning_rate: Step size α for updates (0 < α ≤ 1)
            discount_factor: γ - importance of future rewards (0 ≤ γ < 1)
            epsilon_start: Initial exploration rate
            epsilon_end: Minimum exploration rate
            epsilon_decay: Multiplicative decay factor per episode
        """
        self.num_states = num_states
        self.num_actions = num_actions
        self.alpha = learning_rate
        self.gamma = discount_factor

        # Exploration parameters
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay

        # Initialize Q-table with zeros (optimistic initial values are
        # another common choice that encourages early exploration)
        self.Q = np.zeros((num_states, num_actions))

        # Learning statistics
        self.td_errors: List[float] = []

    def select_action(self, state: int, training: bool = True) -> int:
        """
        Select action using epsilon-greedy policy.

        During training: ε probability of random action (exploration)
        During evaluation: Always greedy (exploitation)

        Args:
            state: Current state index
            training: If True, use ε-greedy; if False, use greedy

        Returns:
            Selected action index
        """
        if training and np.random.random() < self.epsilon:
            # Explore: random action
            return np.random.randint(self.num_actions)
        else:
            # Exploit: greedy action (with random tie-breaking)
            q_values = self.Q[state]
            max_q = np.max(q_values)
            # Handle ties randomly
            best_actions = np.where(q_values == max_q)[0]
            return np.random.choice(best_actions)

    def update(
        self,
        state: int,
        action: int,
        reward: float,
        next_state: int,
        done: bool
    ) -> float:
        """
        Perform Q-learning update.

        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') - Q(s, a)]

        Args:
            state: Current state
            action: Action taken
            reward: Reward received
            next_state: Resulting state
            done: True if next_state is terminal

        Returns:
            TD error (for monitoring learning progress)
        """
        # Current Q-value estimate
        current_q = self.Q[state, action]

        # Compute TD target
        if done:
            # Terminal state: no future value
            td_target = reward
        else:
            # Non-terminal: reward + discounted max future value
            max_next_q = np.max(self.Q[next_state])
            td_target = reward + self.gamma * max_next_q

        # TD error: how far off our estimate was
        td_error = td_target - current_q

        # Update Q-value toward target
        self.Q[state, action] = current_q + self.alpha * td_error

        # Track for analysis
        self.td_errors.append(abs(td_error))

        return td_error

    def decay_epsilon(self):
        """Decay exploration rate at end of episode."""
        self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)

    def get_policy(self) -> np.ndarray:
        """Extract greedy policy from Q-table."""
        return np.argmax(self.Q, axis=1)

    def get_value_function(self) -> np.ndarray:
        """Extract state-value function V(s) = max_a Q(s, a)."""
        return np.max(self.Q, axis=1)


def train_qlearning(
    env,  # Environment with reset(), step(action) interface
    agent: QLearningAgent,
    num_episodes: int = 1000,
    max_steps: int = 1000,
    verbose: bool = True
) -> Tuple[List[float], List[int]]:
    """
    Train Q-learning agent on an environment.

    Returns:
        episode_rewards: Total reward per episode
        episode_lengths: Steps per episode
    """
    episode_rewards = []
    episode_lengths = []

    for episode in range(num_episodes):
        state = env.reset()
        total_reward = 0

        for step in range(max_steps):
            # Select and execute action
            action = agent.select_action(state, training=True)
            next_state, reward, done, _ = env.step(action)

            # Q-learning update
            agent.update(state, action, reward, next_state, done)

            total_reward += reward
            state = next_state

            if done:
                break

        # End of episode bookkeeping
        agent.decay_epsilon()
        episode_rewards.append(total_reward)
        episode_lengths.append(step + 1)

        if verbose and (episode + 1) % 100 == 0:
            avg_reward = np.mean(episode_rewards[-100:])
            avg_length = np.mean(episode_lengths[-100:])
            print(f"Episode {episode + 1}: Avg Reward = {avg_reward:.2f}, "
                  f"Avg Length = {avg_length:.1f}, ε = {agent.epsilon:.3f}")

    return episode_rewards, episode_lengths
```

One of Q-learning's remarkable properties is its convergence guarantee: under appropriate conditions, Q-learning converges to $Q^*$ with probability 1.
Theorem (Watkins & Dayan, 1992): Q-learning converges to $Q^*$ as $t \to \infty$ with probability 1, provided:
1. The MDP is finite and rewards are bounded.
2. Every state-action pair is visited infinitely often.
3. The learning rates satisfy $\sum_t \alpha_t(s, a) = \infty$ and $\sum_t \alpha_t(s, a)^2 < \infty$.
Condition (2) is called the exploration requirement. Condition (3) is the Robbins-Monro condition—the rates must decrease slowly enough to reach any value, but fast enough to damp oscillations.
In practice, constant learning rates violate the square-summability condition, meaning Q-values may oscillate forever rather than converge. However, with constant α, Q-values still track the optimal values in a bounded region, which is often sufficient. True convergence requires decaying rates like α_t = 1/(1 + visits(s,a)).
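A minimal sketch of such a visit-count-based schedule (the class and names here are illustrative, not part of any library):

```python
from collections import defaultdict

class VisitCountLearningRate:
    """Per-(state, action) learning rate α_t = 1 / (1 + visits(s, a)).

    The rates sum to infinity but their squares do not, satisfying
    the Robbins-Monro conditions required for convergence.
    """

    def __init__(self):
        self.visits = defaultdict(int)

    def __call__(self, state: int, action: int) -> float:
        count = self.visits[(state, action)]       # visits so far
        self.visits[(state, action)] = count + 1
        return 1.0 / (1.0 + count)

# Sketch of use inside the update:
#   alpha = lr(state, action)
#   Q[state, action] += alpha * td_error
```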
The proof relies on stochastic approximation theory, particularly results on convergence of random iterative processes. Key steps:
1. Express as Stochastic Fixed-Point Iteration
Rewrite the Q-learning update as: $$Q_{t+1}(s, a) = (1 - \alpha_t) Q_t(s, a) + \alpha_t F_t(Q_t)$$
where $F_t(Q) = r_t + \gamma \max_{a'} Q(s_{t+1}, a')$ is a noisy sample of the Bellman operator $T^*$.
2. Show Contraction Property
The expected update direction points toward $Q^*$: $$\mathbb{E}[F_t(Q) \mid s, a] = (T^* Q)(s, a)$$
Since $T^*$ is a $\gamma$-contraction, the expected update pulls $Q$ toward $Q^*$.
3. Apply Stochastic Approximation Results
With the Robbins-Monro conditions, the noise from sampling averages out over time, and the iterates converge to the fixed point $Q^*$.
The condition "all state-action pairs visited infinitely often" is crucial and often misunderstood.
Why it's necessary: Q-learning only updates Q(s, a) when (s, a) is actually experienced. If some pairs are never visited, their Q-values remain at initialization—potentially arbitrarily wrong.
How to satisfy it: Exploration strategies like ε-greedy ensure all actions have positive probability at all states. As long as ε > 0 remains fixed, any (s, a) reachable from the initial state distribution will eventually be visited infinitely often.
The practical trade-off: We want ε to decay for better exploitation, but too-fast decay violates the exploration requirement. Decaying ε to zero means convergence isn't guaranteed—but practically, the policy may still be approximately optimal.
| Condition | Formal Statement | Practical Implication |
|---|---|---|
| Finite MDP | \|S\| < ∞, \|A\| < ∞ | Tabular representation required |
| Exploration | ∀(s,a): # visits → ∞ | Use ε-greedy with ε > 0 |
| Learning rate (sum) | Σ α_t(s,a) = ∞ | Rates can't decay too fast |
| Learning rate (sum sq) | Σ α_t(s,a)² < ∞ | Rates must eventually decay |
| Bounded rewards | \|r\| ≤ R_max | Scale rewards if needed |
The exploration-exploitation dilemma is a fundamental challenge in RL: exploit current knowledge to maximize immediate reward, or explore to discover potentially better strategies?
The simplest strategy: with probability ε, choose a uniformly random action; otherwise, choose the greedy action.
$$\pi_\epsilon(a|s) = \begin{cases} 1 - \epsilon + \epsilon/|A| & \text{if } a = \arg\max_{a'} Q(s, a') \\ \epsilon/|A| & \text{otherwise} \end{cases}$$
Advantages: Simple, guaranteed exploration, works well empirically.
Disadvantages: Wastes exploration on clearly bad actions; treats all non-greedy actions equally.
Typical schedules reduce ε over time as the agent learns:
Linear decay: $\epsilon_t = \max(\epsilon_{min}, \epsilon_0 - t \cdot \delta)$
Exponential decay: $\epsilon_t = \max(\epsilon_{min}, \epsilon_0 \cdot \rho^t)$
Inverse decay: $\epsilon_t = \epsilon_0 / (1 + t \cdot \delta)$
The choice of schedule depends on the problem. Faster decay means less exploration and potentially getting stuck at suboptimal policies. Slower decay wastes time exploring well-understood regions.
```python
import numpy as np
from abc import ABC, abstractmethod


class ExplorationStrategy(ABC):
    """Abstract base for exploration strategies."""

    @abstractmethod
    def select_action(self, q_values: np.ndarray, step: int) -> int:
        pass


class EpsilonGreedy(ExplorationStrategy):
    """Standard ε-greedy with configurable decay."""

    def __init__(
        self,
        epsilon_start: float = 1.0,
        epsilon_end: float = 0.01,
        decay_type: str = "exponential",  # linear, exponential, inverse
        decay_param: float = 0.995        # decay rate or increment
    ):
        self.epsilon_start = epsilon_start
        self.epsilon_end = epsilon_end
        self.decay_type = decay_type
        self.decay_param = decay_param

    def get_epsilon(self, step: int) -> float:
        if self.decay_type == "exponential":
            eps = self.epsilon_start * (self.decay_param ** step)
        elif self.decay_type == "linear":
            eps = self.epsilon_start - step * self.decay_param
        elif self.decay_type == "inverse":
            eps = self.epsilon_start / (1 + step * self.decay_param)
        else:
            eps = self.epsilon_start
        return max(self.epsilon_end, eps)

    def select_action(self, q_values: np.ndarray, step: int) -> int:
        epsilon = self.get_epsilon(step)
        if np.random.random() < epsilon:
            return np.random.randint(len(q_values))
        else:
            return np.argmax(q_values)


class Softmax(ExplorationStrategy):
    """
    Softmax (Boltzmann) exploration.

    Probability of action a is proportional to exp(Q(s,a) / τ).
    Higher τ = more uniform; lower τ = more greedy.
    """

    def __init__(
        self,
        temperature_start: float = 1.0,
        temperature_end: float = 0.1,
        decay_rate: float = 0.99
    ):
        self.temp_start = temperature_start
        self.temp_end = temperature_end
        self.decay_rate = decay_rate

    def get_temperature(self, step: int) -> float:
        temp = self.temp_start * (self.decay_rate ** step)
        return max(self.temp_end, temp)

    def select_action(self, q_values: np.ndarray, step: int) -> int:
        tau = self.get_temperature(step)
        # Numerical stability: subtract max before exp
        q_scaled = (q_values - np.max(q_values)) / tau
        exp_q = np.exp(q_scaled)
        probs = exp_q / np.sum(exp_q)
        return np.random.choice(len(q_values), p=probs)


class UCB(ExplorationStrategy):
    """
    Upper Confidence Bound exploration.

    Selects argmax_a [Q(s,a) + c * sqrt(ln(t) / N(s,a))]
    where N(s,a) is the visit count for state-action pair.

    Note: Requires tracking visit counts externally.
    """

    def __init__(self, exploration_coef: float = 2.0):
        self.c = exploration_coef
        self.action_counts: dict = {}  # (state_idx, action) -> count
        self.total_steps = 0

    def update_counts(self, state: int, action: int):
        key = (state, action)
        self.action_counts[key] = self.action_counts.get(key, 0) + 1
        self.total_steps += 1

    def select_action(self, q_values: np.ndarray, step: int, state: int = 0) -> int:
        # UCB values: Q + exploration bonus
        ucb_values = np.zeros_like(q_values)
        for a in range(len(q_values)):
            count = self.action_counts.get((state, a), 0)
            if count == 0:
                # Unvisited action: infinite UCB (must explore)
                ucb_values[a] = float('inf')
            else:
                exploration_bonus = self.c * np.sqrt(
                    np.log(self.total_steps + 1) / count
                )
                ucb_values[a] = q_values[a] + exploration_bonus
        return np.argmax(ucb_values)
```

Instead of treating non-greedy actions equally, softmax weights actions by their Q-values:
$$\pi(a|s) = \frac{e^{Q(s,a)/\tau}}{\sum_{a'} e^{Q(s,a')/\tau}}$$
The temperature parameter τ controls exploration: a high τ makes the action distribution nearly uniform (more exploration), while a low τ concentrates probability on the highest-valued actions; as τ → 0, softmax approaches pure greedy selection.
Advantage over ε-greedy: Doesn't waste exploration on obviously bad actions. If Q(s, a₁) >> Q(s, a₂), softmax rarely chooses a₂.
Disadvantage: Requires choosing τ appropriately; sensitive to Q-value scale.
A simple but effective exploration technique: initialize Q-values high. This makes all state-action pairs look promising initially, driving exploration. As the agent experiences lower-than-expected rewards, values decrease toward reality. This provides 'exploration without parameters' but only works for the initial phase.
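A minimal sketch of optimistic initialization (the table shape and the value 10.0 are arbitrary illustrative choices; the value should exceed any return actually attainable):

```python
import numpy as np

num_states, num_actions = 50, 4
optimistic_value = 10.0  # deliberately higher than any realistic return

# Every action looks attractive until it has been tried and its inflated
# value has been corrected downward by real experience.
Q = np.full((num_states, num_actions), optimistic_value)
```

With the QLearningAgent defined earlier, the same effect can be obtained by overwriting `agent.Q` after construction.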
Q-learning and SARSA (covered in the next page) represent two fundamental approaches to TD control. Understanding their difference illuminates a core design choice in RL.
Q-learning update: $$Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$$
SARSA update: $$Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma Q(s', a') - Q(s, a)]$$
Notice: Q-learning uses $\max_{a'}$ (optimal action), SARSA uses $a'$ (actual next action taken).
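Written as code, the difference is a single line in how the TD target is formed (function names here are illustrative):

```python
import numpy as np

def q_learning_target(reward: float, next_state: int, Q: np.ndarray, gamma: float) -> float:
    # Bootstraps from the best available action, regardless of what the agent does next
    return reward + gamma * np.max(Q[next_state])

def sarsa_target(reward: float, next_state: int, next_action: int,
                 Q: np.ndarray, gamma: float) -> float:
    # Bootstraps from the action the behavior policy actually selected
    return reward + gamma * Q[next_state, next_action]
```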
This difference defines:
- On-policy methods (SARSA): learn the value of the policy the agent is actually following, exploration included.
- Off-policy methods (Q-learning): learn the value of the greedy (optimal) policy while following a different, exploratory behavior policy.
Consider a gridworld with a cliff: the agent must traverse from start to goal near a dangerous edge. Stepping off the cliff gives -100 reward.
Q-learning learns the optimal path along the cliff edge because it evaluates actions assuming optimal future behavior (no accidents). But when actually executing ε-greedy, exploration near the cliff is deadly.
SARSA learns a safer path away from the cliff because it accounts for the exploratory policy's probability of accidentally stepping off. The learned Q-values reflect actual ε-greedy performance, not optimal performance.
This illustrates a fundamental trade-off: off-policy methods learn the policy that is optimal in principle, while on-policy methods learn the policy that performs best given how the agent actually behaves during training.
In safe, forgiving environments (like games), Q-learning's optimality is preferred. In real-world deployments with costly mistakes (robotics, finance), SARSA's safety during learning may be critical. Expected SARSA offers a middle ground, discussed in subsequent pages.
Q-learning has a systematic flaw: maximization bias. The max operator in the TD target causes systematic overestimation of Q-values.
Consider a state where all actions have true value Q*(s, a) = 0. Due to noise from limited samples, our estimates will scatter around 0: some above, some below.
Q-learning's target uses $\max_{a'} Q(s', a')$. This selects the highest (most positive error) estimate, biasing the target upward. Over time, this bias compounds, leading to systematic overestimation.
Example: State s has 10 actions, all with true value 0. If Q-estimates are N(0, 1), then $\mathbb{E}[\max_a Q] \approx 1.54$ — significantly positive despite true value being 0.
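This figure is easy to reproduce; the short simulation below (sample counts chosen arbitrarily) draws noisy estimates for ten actions whose true values are all zero, and also previews the double-estimator fix:

```python
import numpy as np

rng = np.random.default_rng(42)
num_actions, num_trials = 10, 100_000

# True Q-values are all zero; estimates carry unit Gaussian noise
estimates_a = rng.normal(size=(num_trials, num_actions))
estimates_b = rng.normal(size=(num_trials, num_actions))

# max over actions picks out the most positive error in each trial
print(f"E[max_a Q] ≈ {estimates_a.max(axis=1).mean():.2f}")   # ≈ 1.54

# Decoupled estimator: select with one noisy table, evaluate with the other
selected = estimates_a.argmax(axis=1)
decoupled = estimates_b[np.arange(num_trials), selected].mean()
print(f"decoupled estimate ≈ {decoupled:.2f}")                # ≈ 0
```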
Maximization bias doesn't just add noise—it systematically corrupts learning. The agent may persistently overvalue certain states or actions, leading to suboptimal policies. In deep RL (DQN), this problem is severe and motivated Double DQN.
Double Q-learning (Van Hasselt, 2010) maintains two Q-tables, $Q_A$ and $Q_B$, and decouples action selection from value estimation. With probability 1/2 it updates $Q_A$, using $Q_B$ to evaluate the action selected by $Q_A$:
$$Q_A(s, a) \leftarrow Q_A(s, a) + \alpha \left[ r + \gamma \, Q_B\left(s', \arg\max_{a'} Q_A(s', a')\right) - Q_A(s, a) \right]$$
and otherwise performs the symmetric update with the roles of $Q_A$ and $Q_B$ swapped.
Since $Q_B$ is independent of the selection made by $Q_A$, the positive-bias-selects-positive-error loop is broken.
```python
import numpy as np


class DoubleQLearningAgent:
    """
    Double Q-Learning to eliminate maximization bias.

    Maintains two Q-tables and alternates which one is updated,
    using the other for value estimation.
    """

    def __init__(
        self,
        num_states: int,
        num_actions: int,
        learning_rate: float = 0.1,
        discount_factor: float = 0.99,
        epsilon: float = 0.1
    ):
        # Two independent Q-tables
        self.Q_A = np.zeros((num_states, num_actions))
        self.Q_B = np.zeros((num_states, num_actions))

        self.alpha = learning_rate
        self.gamma = discount_factor
        self.epsilon = epsilon

    def select_action(self, state: int) -> int:
        """ε-greedy using the sum of both Q-tables."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.Q_A.shape[1])
        else:
            # Sum (equivalently, average) of both tables for action selection
            combined_q = self.Q_A[state] + self.Q_B[state]
            return np.argmax(combined_q)

    def update(
        self,
        state: int,
        action: int,
        reward: float,
        next_state: int,
        done: bool
    ):
        """
        Double Q-learning update.

        With 50% probability, update Q_A using Q_B for evaluation,
        or vice versa. This decouples selection and evaluation,
        eliminating maximization bias.
        """
        if np.random.random() < 0.5:
            # Update Q_A
            self._update_table(
                self.Q_A, self.Q_B, state, action, reward, next_state, done
            )
        else:
            # Update Q_B
            self._update_table(
                self.Q_B, self.Q_A, state, action, reward, next_state, done
            )

    def _update_table(
        self,
        Q_update: np.ndarray,  # Table to update
        Q_eval: np.ndarray,    # Table for evaluation
        state: int,
        action: int,
        reward: float,
        next_state: int,
        done: bool
    ):
        """Update one Q-table using the other for evaluation."""
        if done:
            td_target = reward
        else:
            # Action selection: use Q_update
            best_action = np.argmax(Q_update[next_state])
            # Value estimation: use Q_eval (decoupled!)
            td_target = reward + self.gamma * Q_eval[next_state, best_action]

        td_error = td_target - Q_update[state, action]
        Q_update[state, action] += self.alpha * td_error

    def get_q_values(self) -> np.ndarray:
        """Return average of both Q-tables."""
        return (self.Q_A + self.Q_B) / 2
```

Double Q-learning converges to the correct values where standard Q-learning overshoots. In environments with many actions or noisy rewards, the difference is dramatic:
| Scenario | Q-Learning Estimate | Double Q-Learning | True Value |
|---|---|---|---|
| 10 actions, all Q*=0 | +2.5 | +0.3 | 0 |
| Noisy rewards (σ=5) | +4.2 | +0.8 | 0 |
| Stochastic environment | +3.1 | +0.5 | 0 |
Double Q-learning's unbiased estimates translate to better policies, faster convergence, and more stable learning in practice.
Successfully applying Q-learning requires careful attention to hyperparameters, initialization, and potential failure modes.
The learning rate α controls the trade-off between stability and adaptability: a large α incorporates new information quickly but keeps estimates noisy, while a small α averages over many samples for stability but adapts slowly.
For non-stationary environments, a small constant α is often better than decaying rates, as it allows tracking changing dynamics.
| Hyperparameter | Typical Range | Effect of Increasing |
|---|---|---|
| Learning rate α | 0.01–0.5 | Faster learning but more noise |
| Discount γ | 0.9–0.999 | More farsighted but slower convergence |
| Initial ε | 0.5–1.0 | More initial exploration |
| Final ε | 0.01–0.1 | More exploration in limit |
| ε decay rate | 0.99–0.9999 | Slower exploration reduction |
Rule of thumb: Start with α = 0.1, γ = 0.99, ε decaying from 1.0 to 0.01 over ~90% of training. Adjust based on observed learning curves.
Always visualize: (1) Q-values over time for key states, (2) TD error magnitude over time (should decrease), (3) Episode rewards over time, (4) Heatmaps of Q-values across state-action space. These diagnostics reveal problems invisible in aggregate metrics.
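A minimal plotting sketch for the first three diagnostics, assuming the QLearningAgent and train_qlearning defined earlier and that matplotlib is installed:

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_diagnostics(agent, episode_rewards, window: int = 100):
    """Plot |TD error| per update and a moving average of episode rewards."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # TD errors should shrink as the Q-values stabilize
    ax1.plot(agent.td_errors, alpha=0.3)
    ax1.set_xlabel("update step")
    ax1.set_ylabel("|TD error|")

    # A rising moving average of rewards indicates the policy is improving
    if len(episode_rewards) >= window:
        smoothed = np.convolve(episode_rewards, np.ones(window) / window, mode="valid")
        ax2.plot(smoothed)
    ax2.set_xlabel("episode")
    ax2.set_ylabel(f"avg reward (window={window})")

    plt.tight_layout()
    plt.show()
```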
Tabular Q-learning cannot scale to large or continuous state spaces. With 1000 states and 10 actions, we store 10,000 values. With a million states, 10 million. For continuous states (robot joint angles, pixel observations), tabular representation is impossible.
Function approximation replaces the Q-table with a parameterized function $Q(s, a; \theta)$ that generalizes across states:
$$Q(s, a; \theta) \approx Q^*(s, a)$$
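Before turning to neural networks, the idea can be sketched with a linear parameterization; the semi-gradient Q-learning update below is a rough sketch in which the feature map φ(s) and the dimensions are illustrative assumptions:

```python
import numpy as np

num_actions, num_features = 4, 16
theta = np.zeros((num_actions, num_features))   # one weight vector per action

def features(state) -> np.ndarray:
    """Hypothetical feature map φ(s); in practice tile coding, RBFs, etc."""
    return np.asarray(state, dtype=float)        # assumes the state is already a feature vector

def q_value(state, action: int) -> float:
    return float(theta[action] @ features(state))

def semi_gradient_update(state, action, reward, next_state, done,
                         alpha: float = 0.01, gamma: float = 0.99) -> None:
    """Q-learning on the weights: θ_a ← θ_a + α · δ · φ(s)."""
    target = reward if done else reward + gamma * max(
        q_value(next_state, a) for a in range(num_actions)
    )
    td_error = target - q_value(state, action)
    theta[action] += alpha * td_error * features(state)
```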
Neural networks are the most powerful function approximators, leading to Deep Q-Networks (DQN).
Naively applying neural networks to Q-learning fails due to:
- Correlated samples: consecutive transitions are highly correlated, violating the i.i.d. assumptions behind stochastic gradient descent.
- Moving targets: the TD target depends on the same network being trained, so the regression target shifts with every update.
- Instability: combining function approximation, bootstrapping, and off-policy learning can cause divergence.
DQN solutions:
- Experience replay: store transitions in a buffer and train on randomly sampled minibatches, breaking temporal correlations (see the sketch below).
- Target networks: compute the TD target with a separate, periodically updated copy of the network, stabilizing the regression target.
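A minimal sketch of the replay-buffer half of that recipe (capacity and batch size are arbitrary illustrative defaults):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions and serves decorrelated random minibatches."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done) -> None:
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 32):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self) -> int:
        return len(self.buffer)
```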
These innovations, combined with deep neural networks, enabled DQN to achieve superhuman performance on Atari games in 2015—a landmark in AI.
| Algorithm | State Space | Key Innovation | Year |
|---|---|---|---|
| Tabular Q-Learning | Small, discrete | TD + off-policy | 1989 |
| DQN | Large, continuous (pixels) | Experience replay + target networks | 2015 |
| Double DQN | Large, continuous | Decoupled selection/evaluation | 2015 |
| Dueling DQN | Large, continuous | Separate value/advantage streams | 2016 |
| Rainbow | Large, continuous | All above + distributional + noisy | 2017 |
DQN and its variants are covered in detail in Module 5: Deep Reinforcement Learning. Understanding tabular Q-learning—its algorithm, convergence, and limitations—provides the essential foundation for understanding these powerful extensions.
Q-learning is a foundational algorithm that enables learning optimal behavior directly from experience. Let's consolidate the key insights:
- It is model-free: it learns action-values from sampled transitions, with no knowledge of $P(s'|s,a)$ or $R(s,a)$.
- It is off-policy: the TD target uses $\max_{a'} Q(s', a')$, so it learns about the optimal policy while following an exploratory behavior policy.
- It converges to $Q^*$ given sufficient exploration and appropriately decaying learning rates.
- Exploration strategies (ε-greedy, softmax, UCB, optimistic initialization) determine how well those conditions are met in practice.
- The max operator introduces maximization bias, which Double Q-learning removes by decoupling action selection from evaluation.
- Tabular Q-learning does not scale to large state spaces; function approximation and DQN extend the same update to those settings.
What's next: While Q-learning is off-policy (learning Q* while following any exploratory policy), sometimes we want to learn the value of the policy we're actually following. The next page introduces SARSA, Q-learning's on-policy cousin, which learns Q^π for the behavior policy π.
You've mastered Q-learning—the algorithm that launched model-free reinforcement learning. From temporal-difference foundations through convergence theory to practical implementation and extensions, you now have the complete picture of this seminal algorithm.