Value iteration is powerful but has a critical limitation: it requires complete knowledge of the environment's dynamics—the transition probabilities $P(s'|s,a)$ and reward function $R(s,a)$. In most real-world scenarios, we don't have this luxury. A robot doesn't know the exact physics of its environment. A game-playing agent doesn't have access to the opponent's strategy. A trading algorithm can't predict market dynamics.
Q-learning, introduced by Chris Watkins in 1989, revolutionized reinforcement learning by showing how to learn optimal behavior directly from experience, without ever constructing a model of the environment. This model-free approach learns action-values (Q-values) by interacting with the environment and observing the consequences of actions.
Remarkably, Q-learning is proven to converge to optimal Q-values under mild conditions—it effectively performs value iteration by sampling, using each observed transition as an incremental update toward the true value.
This page provides a complete treatment of Q-learning: the intuition behind temporal-difference learning, rigorous algorithm derivation, convergence proofs, exploration-exploitation strategies, tabular implementation, common pitfalls, and the bridge to deep Q-networks. You'll understand not just how Q-learning works, but why it works.
Recall the Bellman optimality equation for Q-values:
$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' | s, a) \max_{a'} Q^*(s', a')$$
Value iteration computes this exactly by iterating:
$$Q_{k+1}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' | s, a) \max_{a'} Q_k(s', a')$$
But this requires knowing $P(s'|s,a)$—the transition dynamics. What if we could approximate this expectation using samples instead?
The expectation $\mathbb{E}_{s'}[\max_{a'} Q(s', a')]$ over next states can be approximated by observing actual transitions. If we take action $a$ in state $s$ and observe the transition $(s, a, r, s')$, we get a single sample of what value iteration would compute. By averaging many such samples, we converge to the true expectation.
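As a quick sanity check, the sketch below uses a hypothetical two-outcome transition (the probabilities, reward, and Q-values are made up purely for illustration) to compare the exact value-iteration backup with the average of sampled one-step targets:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Hypothetical setup: action a in state s leads to s' = 0 or s' = 1
next_states = np.array([0, 1])
probs = np.array([0.7, 0.3])             # P(s'|s,a) -- known here only for comparison
reward = 1.0                             # R(s, a)
Q = np.array([[2.0, 0.5],                # current Q(s', a') estimates
              [0.0, 3.0]])

# Exact backup, as value iteration would compute it
exact = reward + gamma * np.sum(probs * Q.max(axis=1))

# Sample-based estimate: average many observed one-step targets
samples = rng.choice(next_states, size=10_000, p=probs)
targets = reward + gamma * Q[samples].max(axis=1)
print(f"exact backup = {exact:.3f}, sampled average = {targets.mean():.3f}")
```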
One approach is Monte Carlo estimation: complete entire episodes, compute actual returns $G_t = \sum_{k=0}^\infty \gamma^k r_{t+k+1}$, then average:
$$Q(s, a) \leftarrow \text{Average}(G_t \text{ for all visits to } (s, a))$$
This works but has drawbacks:
- Updates can only be made at the end of an episode, so learning is slow when episodes are long.
- Returns accumulate randomness from the entire trajectory, so the estimates have high variance.
- It applies only to episodic tasks, where every trajectory eventually terminates.
What if we could bootstrap—use current estimates to improve themselves? This is the core of temporal difference (TD) learning:
Instead of waiting for the true return $G_t$, we use a one-step lookahead:
$$G_t \approx r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a')$$
This is called the TD target. It combines the observed immediate reward with our current estimate of future value. As our estimates improve, so does the target.
The TD target $r + \gamma \max_{a'} Q(s', a')$ is a biased estimator of $Q^*(s, a)$ because it uses our current (imperfect) estimate. But it has much lower variance than Monte Carlo returns because it doesn't accumulate randomness over the entire trajectory.
| Method | Bias | Variance | Update Timing |
|---|---|---|---|
| Monte Carlo | Unbiased | High (full trajectory) | End of episode |
| TD (1-step) | Biased | Low (1 step of randomness) | Every step |
| n-step TD | Moderate | Moderate | Every n steps |
The bias-variance trade-off generally favors TD methods in practice, especially when function approximation is involved.
Q-learning is an off-policy TD control algorithm. Let's unpack both terms:
- TD control: values are updated after every step from a bootstrapped one-step target, and the goal is to learn an optimal policy rather than merely evaluate a fixed one.
- Off-policy: the policy being learned about (the greedy target policy) differs from the policy that generates the experience (an exploratory behavior policy such as ε-greedy).
After observing a transition $(s, a, r, s')$, the update is:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
Let's dissect each component:
| Component | Meaning |
|---|---|
| $Q(s, a)$ | Current estimate of action-value |
| $\alpha \in (0, 1]$ | Learning rate (step size) |
| $r + \gamma \max_{a'} Q(s', a')$ | TD target: observed reward + max future value |
| $r + \gamma \max_{a'} Q(s', a') - Q(s, a)$ | TD error: difference between target and estimate |
| The full update | Move Q(s,a) toward the target by a fraction α |
The max operator in the TD target selects the action that maximizes Q(s', a'), regardless of which action the agent actually will take. This means Q-learning learns the value of the optimal policy π* even while following an exploratory behavior policy (like ε-greedy). The behavior policy and target policy are different—hence 'off-policy'.
Q-Learning (Tabular):
```python
import numpy as np
from typing import Tuple, List


class QLearningAgent:
    """
    Tabular Q-Learning Agent with epsilon-greedy exploration.

    This implementation emphasizes clarity and educational value,
    with detailed documentation of each component.
    """

    def __init__(
        self,
        num_states: int,
        num_actions: int,
        learning_rate: float = 0.1,
        discount_factor: float = 0.99,
        epsilon_start: float = 1.0,
        epsilon_end: float = 0.01,
        epsilon_decay: float = 0.995
    ):
        """
        Initialize Q-learning agent.

        Args:
            num_states: Size of state space
            num_actions: Size of action space
            learning_rate: Step size α for updates (0 < α ≤ 1)
            discount_factor: γ - importance of future rewards (0 ≤ γ < 1)
            epsilon_start: Initial exploration rate
            epsilon_end: Minimum exploration rate
            epsilon_decay: Multiplicative decay factor per episode
        """
        self.num_states = num_states
        self.num_actions = num_actions
        self.alpha = learning_rate
        self.gamma = discount_factor

        # Exploration parameters
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay

        # Initialize Q-table with zeros (optimistic initial values are
        # another common choice that encourages early exploration)
        self.Q = np.zeros((num_states, num_actions))

        # Learning statistics
        self.td_errors: List[float] = []

    def select_action(self, state: int, training: bool = True) -> int:
        """
        Select action using epsilon-greedy policy.

        During training: ε probability of random action (exploration)
        During evaluation: Always greedy (exploitation)

        Args:
            state: Current state index
            training: If True, use ε-greedy; if False, use greedy

        Returns:
            Selected action index
        """
        if training and np.random.random() < self.epsilon:
            # Explore: random action
            return np.random.randint(self.num_actions)
        else:
            # Exploit: greedy action (with random tie-breaking)
            q_values = self.Q[state]
            max_q = np.max(q_values)
            # Handle ties randomly
            best_actions = np.where(q_values == max_q)[0]
            return np.random.choice(best_actions)

    def update(
        self,
        state: int,
        action: int,
        reward: float,
        next_state: int,
        done: bool
    ) -> float:
        """
        Perform Q-learning update.

        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') - Q(s, a)]

        Args:
            state: Current state
            action: Action taken
            reward: Reward received
            next_state: Resulting state
            done: True if next_state is terminal

        Returns:
            TD error (for monitoring learning progress)
        """
        # Current Q-value estimate
        current_q = self.Q[state, action]

        # Compute TD target
        if done:
            # Terminal state: no future value
            td_target = reward
        else:
            # Non-terminal: reward + discounted max future value
            max_next_q = np.max(self.Q[next_state])
            td_target = reward + self.gamma * max_next_q

        # TD error: how far off our estimate was
        td_error = td_target - current_q

        # Update Q-value toward target
        self.Q[state, action] = current_q + self.alpha * td_error

        # Track for analysis
        self.td_errors.append(abs(td_error))

        return td_error

    def decay_epsilon(self):
        """Decay exploration rate at end of episode."""
        self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)

    def get_policy(self) -> np.ndarray:
        """Extract greedy policy from Q-table."""
        return np.argmax(self.Q, axis=1)

    def get_value_function(self) -> np.ndarray:
        """Extract state-value function V(s) = max_a Q(s, a)."""
        return np.max(self.Q, axis=1)


def train_qlearning(
    env,  # Environment with reset(), step(action) interface
    agent: QLearningAgent,
    num_episodes: int = 1000,
    max_steps: int = 1000,
    verbose: bool = True
) -> Tuple[List[float], List[int]]:
    """
    Train Q-learning agent on an environment.

    Returns:
        episode_rewards: Total reward per episode
        episode_lengths: Steps per episode
    """
    episode_rewards = []
    episode_lengths = []

    for episode in range(num_episodes):
        state = env.reset()
        total_reward = 0

        for step in range(max_steps):
            # Select and execute action
            action = agent.select_action(state, training=True)
            next_state, reward, done, _ = env.step(action)

            # Q-learning update
            agent.update(state, action, reward, next_state, done)

            total_reward += reward
            state = next_state

            if done:
                break

        # End of episode bookkeeping
        agent.decay_epsilon()
        episode_rewards.append(total_reward)
        episode_lengths.append(step + 1)

        if verbose and (episode + 1) % 100 == 0:
            avg_reward = np.mean(episode_rewards[-100:])
            avg_length = np.mean(episode_lengths[-100:])
            print(f"Episode {episode + 1}: Avg Reward = {avg_reward:.2f}, "
                  f"Avg Length = {avg_length:.1f}, ε = {agent.epsilon:.3f}")

    return episode_rewards, episode_lengths
```

One of Q-learning's remarkable properties is its convergence guarantee: under appropriate conditions, Q-learning converges to $Q^*$ with probability 1.
Theorem (Watkins & Dayan, 1992): Q-learning converges to $Q^*$ as $t \to \infty$ with probability 1, provided:
1. The MDP is finite and rewards are bounded.
2. Every state-action pair is visited infinitely often.
3. The learning rates satisfy $\sum_t \alpha_t(s, a) = \infty$ and $\sum_t \alpha_t(s, a)^2 < \infty$.
Condition (2) is called the exploration requirement. Condition (3) is the Robbins-Monro condition—the rates must decrease slowly enough to reach any value, but fast enough to damp oscillations.
In practice, constant learning rates violate the square-summability condition, meaning Q-values may oscillate forever rather than converge. However, with constant α, Q-values still track the optimal values in a bounded region, which is often sufficient. True convergence requires decaying rates like α_t = 1/(1 + visits(s,a)).
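A minimal sketch of such a visit-count-based schedule (the class and names here are illustrative, not part of any library):

```python
from collections import defaultdict

class VisitCountLearningRate:
    """Per-(state, action) learning rate α_t = 1 / (1 + visits(s, a)).

    The rates sum to infinity but their squares do not, satisfying
    the Robbins-Monro conditions required for convergence.
    """

    def __init__(self):
        self.visits = defaultdict(int)

    def __call__(self, state: int, action: int) -> float:
        count = self.visits[(state, action)]       # visits so far
        self.visits[(state, action)] = count + 1
        return 1.0 / (1.0 + count)

# Sketch of use inside the update:
#   alpha = lr(state, action)
#   Q[state, action] += alpha * td_error
```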
The proof relies on stochastic approximation theory, particularly results on convergence of random iterative processes. Key steps:
1. Express as Stochastic Fixed-Point Iteration
Rewrite the Q-learning update as: $$Q_{t+1}(s, a) = (1 - \alpha_t) Q_t(s, a) + \alpha_t F_t(Q_t)$$
where $F_t(Q) = r_t + \gamma \max_{a'} Q(s_{t+1}, a')$ is a noisy sample of the Bellman operator $T^*$.
2. Show Contraction Property
The expected update direction points toward $Q^*$: $$\mathbb{E}[F_t(Q) \mid s, a] = (T^* Q)(s, a)$$
Since $T^*$ is a $\gamma$-contraction, the expected update pulls $Q$ toward $Q^*$.
3. Apply Stochastic Approximation Results
With the Robbins-Monro conditions, the noise from sampling averages out over time, and the iterates converge to the fixed point $Q^*$.
The condition "all state-action pairs visited infinitely often" is crucial and often misunderstood.
Why it's necessary: Q-learning only updates Q(s, a) when (s, a) is actually experienced. If some pairs are never visited, their Q-values remain at initialization—potentially arbitrarily wrong.
How to satisfy it: Exploration strategies like ε-greedy ensure all actions have positive probability at all states. As long as ε > 0 remains fixed, any (s, a) reachable from the initial state distribution will eventually be visited infinitely often.
The practical trade-off: We want ε to decay for better exploitation, but too-fast decay violates the exploration requirement. Decaying ε to zero means convergence isn't guaranteed—but practically, the policy may still be approximately optimal.
| Condition | Formal Statement | Practical Implication |
|---|---|---|
| Finite MDP | \|S\| < ∞, \|A\| < ∞ | Tabular representation required |
| Exploration | ∀(s,a): # visits → ∞ | Use ε-greedy with ε > 0 |
| Learning rate (sum) | Σ α_t(s,a) = ∞ | Rates can't decay too fast |
| Learning rate (sum sq) | Σ α_t(s,a)² < ∞ | Rates must eventually decay |
| Bounded rewards | \|r\| ≤ R_max | Scale rewards if needed |
The exploration-exploitation dilemma is a fundamental challenge in RL: exploit current knowledge to maximize immediate reward, or explore to discover potentially better strategies?
The simplest strategy: with probability ε, choose a uniformly random action; otherwise, choose the greedy action.
$$\pi_\epsilon(a|s) = \begin{cases} 1 - \epsilon + \epsilon/|A| & \text{if } a = \arg\max_{a'} Q(s, a') \\ \epsilon/|A| & \text{otherwise} \end{cases}$$
Advantages: Simple, guaranteed exploration, works well empirically.
Disadvantages: Wastes exploration on clearly bad actions; treats all non-greedy actions equally.
Typical schedules reduce ε over time as the agent learns:
Linear decay: $\epsilon_t = \max(\epsilon_{min}, \epsilon_0 - t \cdot \delta)$
Exponential decay: $\epsilon_t = \max(\epsilon_{min}, \epsilon_0 \cdot \rho^t)$
Inverse decay: $\epsilon_t = \epsilon_0 / (1 + t \cdot \delta)$
The choice of schedule depends on the problem. Faster decay means less exploration and potentially getting stuck at suboptimal policies. Slower decay wastes time exploring well-understood regions.
```python
import numpy as np
from abc import ABC, abstractmethod


class ExplorationStrategy(ABC):
    """Abstract base for exploration strategies."""

    @abstractmethod
    def select_action(self, q_values: np.ndarray, step: int) -> int:
        pass


class EpsilonGreedy(ExplorationStrategy):
    """Standard ε-greedy with configurable decay."""

    def __init__(
        self,
        epsilon_start: float = 1.0,
        epsilon_end: float = 0.01,
        decay_type: str = "exponential",  # linear, exponential, inverse
        decay_param: float = 0.995        # decay rate or increment
    ):
        self.epsilon_start = epsilon_start
        self.epsilon_end = epsilon_end
        self.decay_type = decay_type
        self.decay_param = decay_param

    def get_epsilon(self, step: int) -> float:
        if self.decay_type == "exponential":
            eps = self.epsilon_start * (self.decay_param ** step)
        elif self.decay_type == "linear":
            eps = self.epsilon_start - step * self.decay_param
        elif self.decay_type == "inverse":
            eps = self.epsilon_start / (1 + step * self.decay_param)
        else:
            eps = self.epsilon_start
        return max(self.epsilon_end, eps)

    def select_action(self, q_values: np.ndarray, step: int) -> int:
        epsilon = self.get_epsilon(step)
        if np.random.random() < epsilon:
            return np.random.randint(len(q_values))
        else:
            return np.argmax(q_values)


class Softmax(ExplorationStrategy):
    """
    Softmax (Boltzmann) exploration.

    Probability of action a is proportional to exp(Q(s,a) / τ).
    Higher τ = more uniform; lower τ = more greedy.
    """

    def __init__(
        self,
        temperature_start: float = 1.0,
        temperature_end: float = 0.1,
        decay_rate: float = 0.99
    ):
        self.temp_start = temperature_start
        self.temp_end = temperature_end
        self.decay_rate = decay_rate

    def get_temperature(self, step: int) -> float:
        temp = self.temp_start * (self.decay_rate ** step)
        return max(self.temp_end, temp)

    def select_action(self, q_values: np.ndarray, step: int) -> int:
        tau = self.get_temperature(step)
        # Numerical stability: subtract max before exp
        q_scaled = (q_values - np.max(q_values)) / tau
        exp_q = np.exp(q_scaled)
        probs = exp_q / np.sum(exp_q)
        return np.random.choice(len(q_values), p=probs)


class UCB(ExplorationStrategy):
    """
    Upper Confidence Bound exploration.

    Selects argmax_a [Q(s,a) + c * sqrt(ln(t) / N(s,a))]
    where N(s,a) is the visit count for state-action pair.

    Note: Requires tracking visit counts externally.
    """

    def __init__(self, exploration_coef: float = 2.0):
        self.c = exploration_coef
        self.action_counts: dict = {}  # (state_idx, action) -> count
        self.total_steps = 0

    def update_counts(self, state: int, action: int):
        key = (state, action)
        self.action_counts[key] = self.action_counts.get(key, 0) + 1
        self.total_steps += 1

    def select_action(self, q_values: np.ndarray, step: int, state: int = 0) -> int:
        # UCB values: Q + exploration bonus
        ucb_values = np.zeros_like(q_values)
        for a in range(len(q_values)):
            count = self.action_counts.get((state, a), 0)
            if count == 0:
                # Unvisited action: infinite UCB (must explore)
                ucb_values[a] = float('inf')
            else:
                exploration_bonus = self.c * np.sqrt(
                    np.log(self.total_steps + 1) / count
                )
                ucb_values[a] = q_values[a] + exploration_bonus
        return np.argmax(ucb_values)
```

Instead of treating non-greedy actions equally, softmax weights actions by their Q-values:
$$\pi(a|s) = \frac{e^{Q(s,a)/\tau}}{\sum_{a'} e^{Q(s,a')/\tau}}$$
The temperature parameter τ controls exploration: a high τ makes the action distribution nearly uniform (more exploration), while a low τ concentrates probability on the highest-valued actions; as τ → 0, softmax approaches pure greedy selection.
Advantage over ε-greedy: Doesn't waste exploration on obviously bad actions. If Q(s, a₁) >> Q(s, a₂), softmax rarely chooses a₂.
Disadvantage: Requires choosing τ appropriately; sensitive to Q-value scale.
A simple but effective exploration technique: initialize Q-values high. This makes all state-action pairs look promising initially, driving exploration. As the agent experiences lower-than-expected rewards, values decrease toward reality. This provides 'exploration without parameters' but only works for the initial phase.
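A minimal sketch of optimistic initialization (the table shape and the value 10.0 are arbitrary illustrative choices; the value should exceed any return actually attainable):

```python
import numpy as np

num_states, num_actions = 50, 4
optimistic_value = 10.0  # deliberately higher than any realistic return

# Every action looks attractive until it has been tried and its inflated
# value has been corrected downward by real experience.
Q = np.full((num_states, num_actions), optimistic_value)
```

With the QLearningAgent defined earlier, the same effect can be obtained by overwriting `agent.Q` after construction.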
Q-learning and SARSA (covered in the next page) represent two fundamental approaches to TD control. Understanding their difference illuminates a core design choice in RL.
Q-learning update: $$Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$$
SARSA update: $$Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma Q(s', a') - Q(s, a)]$$
Notice: Q-learning uses $\max_{a'}$ (optimal action), SARSA uses $a'$ (actual next action taken).
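Written as code, the difference is a single line in how the TD target is formed (function names here are illustrative):

```python
import numpy as np

def q_learning_target(reward: float, next_state: int, Q: np.ndarray, gamma: float) -> float:
    # Bootstraps from the best available action, regardless of what the agent does next
    return reward + gamma * np.max(Q[next_state])

def sarsa_target(reward: float, next_state: int, next_action: int,
                 Q: np.ndarray, gamma: float) -> float:
    # Bootstraps from the action the behavior policy actually selected
    return reward + gamma * Q[next_state, next_action]
```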
This difference defines:
- On-policy methods (SARSA): learn the value of the policy the agent is actually following, exploration included.
- Off-policy methods (Q-learning): learn the value of the greedy (optimal) policy while following a different, exploratory behavior policy.
Consider a gridworld with a cliff: the agent must traverse from start to goal near a dangerous edge. Stepping off the cliff gives -100 reward.
Q-learning learns the optimal path along the cliff edge because it evaluates actions assuming optimal future behavior (no accidents). But when actually executing ε-greedy, exploration near the cliff is deadly.
SARSA learns a safer path away from the cliff because it accounts for the exploratory policy's probability of accidentally stepping off. The learned Q-values reflect actual ε-greedy performance, not optimal performance.
This illustrates a fundamental trade-off: off-policy methods learn the policy that is optimal in principle, while on-policy methods learn the policy that performs best given how the agent actually behaves during training.
In safe, forgiving environments (like games), Q-learning's optimality is preferred. In real-world deployments with costly mistakes (robotics, finance), SARSA's safety during learning may be critical. Expected SARSA offers a middle ground, discussed in subsequent pages.
Q-learning has a systematic flaw: maximization bias. The max operator in the TD target causes systematic overestimation of Q-values.
Consider a state where all actions have true value Q*(s, a) = 0. Due to noise from limited samples, our estimates will scatter around 0: some above, some below.
Q-learning's target uses $\max_{a'} Q(s', a')$. This selects the highest (most positive error) estimate, biasing the target upward. Over time, this bias compounds, leading to systematic overestimation.
Example: State s has 10 actions, all with true value 0. If Q-estimates are N(0, 1), then $\mathbb{E}[\max_a Q] \approx 1.54$ — significantly positive despite true value being 0.
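This figure is easy to reproduce; the short simulation below (sample counts chosen arbitrarily) draws noisy estimates for ten actions whose true values are all zero, and also previews the double-estimator fix:

```python
import numpy as np

rng = np.random.default_rng(42)
num_actions, num_trials = 10, 100_000

# True Q-values are all zero; estimates carry unit Gaussian noise
estimates_a = rng.normal(size=(num_trials, num_actions))
estimates_b = rng.normal(size=(num_trials, num_actions))

# max over actions picks out the most positive error in each trial
print(f"E[max_a Q] ≈ {estimates_a.max(axis=1).mean():.2f}")   # ≈ 1.54

# Decoupled estimator: select with one noisy table, evaluate with the other
selected = estimates_a.argmax(axis=1)
decoupled = estimates_b[np.arange(num_trials), selected].mean()
print(f"decoupled estimate ≈ {decoupled:.2f}")                # ≈ 0
```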
Maximization bias doesn't just add noise—it systematically corrupts learning. The agent may persistently overvalue certain states or actions, leading to suboptimal policies. In deep RL (DQN), this problem is severe and motivated Double DQN.
Double Q-learning (Van Hasselt, 2010) maintains two Q-tables, $Q_A$ and $Q_B$, and decouples action selection from value estimation. With probability 1/2 it updates $Q_A$, using $Q_B$ to evaluate the action selected by $Q_A$:
$$Q_A(s, a) \leftarrow Q_A(s, a) + \alpha \left[ r + \gamma \, Q_B\left(s', \arg\max_{a'} Q_A(s', a')\right) - Q_A(s, a) \right]$$
and otherwise performs the symmetric update with the roles of $Q_A$ and $Q_B$ swapped.
Since $Q_B$ is independent of the selection made by $Q_A$, the positive-bias-selects-positive-error loop is broken.
```python
import numpy as np


class DoubleQLearningAgent:
    """
    Double Q-Learning to eliminate maximization bias.

    Maintains two Q-tables and alternates which one is updated,
    using the other for value estimation.
    """

    def __init__(
        self,
        num_states: int,
        num_actions: int,
        learning_rate: float = 0.1,
        discount_factor: float = 0.99,
        epsilon: float = 0.1
    ):
        # Two independent Q-tables
        self.Q_A = np.zeros((num_states, num_actions))
        self.Q_B = np.zeros((num_states, num_actions))

        self.alpha = learning_rate
        self.gamma = discount_factor
        self.epsilon = epsilon

    def select_action(self, state: int) -> int:
        """ε-greedy using the sum of both Q-tables."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.Q_A.shape[1])
        else:
            # Sum (equivalently, average) of both tables for action selection
            combined_q = self.Q_A[state] + self.Q_B[state]
            return np.argmax(combined_q)

    def update(
        self,
        state: int,
        action: int,
        reward: float,
        next_state: int,
        done: bool
    ):
        """
        Double Q-learning update.

        With 50% probability, update Q_A using Q_B for evaluation,
        or vice versa. This decouples selection and evaluation,
        eliminating maximization bias.
        """
        if np.random.random() < 0.5:
            # Update Q_A
            self._update_table(
                self.Q_A, self.Q_B, state, action, reward, next_state, done
            )
        else:
            # Update Q_B
            self._update_table(
                self.Q_B, self.Q_A, state, action, reward, next_state, done
            )

    def _update_table(
        self,
        Q_update: np.ndarray,  # Table to update
        Q_eval: np.ndarray,    # Table for evaluation
        state: int,
        action: int,
        reward: float,
        next_state: int,
        done: bool
    ):
        """Update one Q-table using the other for evaluation."""
        if done:
            td_target = reward
        else:
            # Action selection: use Q_update
            best_action = np.argmax(Q_update[next_state])
            # Value estimation: use Q_eval (decoupled!)
            td_target = reward + self.gamma * Q_eval[next_state, best_action]

        td_error = td_target - Q_update[state, action]
        Q_update[state, action] += self.alpha * td_error

    def get_q_values(self) -> np.ndarray:
        """Return average of both Q-tables."""
        return (self.Q_A + self.Q_B) / 2
```

Double Q-learning converges to the correct values where standard Q-learning overshoots. In environments with many actions or noisy rewards, the difference is dramatic:
| Scenario | Q-Learning Estimate | Double Q-Learning | True Value |
|---|---|---|---|
| 10 actions, all Q*=0 | +2.5 | +0.3 | 0 |
| Noisy rewards (σ=5) | +4.2 | +0.8 | 0 |
| Stochastic environment | +3.1 | +0.5 | 0 |
Double Q-learning's unbiased estimates translate to better policies, faster convergence, and more stable learning in practice.
Successfully applying Q-learning requires careful attention to hyperparameters, initialization, and potential failure modes.
The learning rate α controls the trade-off between stability and adaptability: a large α incorporates new information quickly but keeps estimates noisy, while a small α averages over many samples for stability but adapts slowly.
For non-stationary environments, a small constant α is often better than decaying rates, as it allows tracking changing dynamics.
| Hyperparameter | Typical Range | Effect of Increasing |
|---|---|---|
| Learning rate α | 0.01–0.5 | Faster learning but more noise |
| Discount γ | 0.9–0.999 | More farsighted but slower convergence |
| Initial ε | 0.5–1.0 | More initial exploration |
| Final ε | 0.01–0.1 | More exploration in limit |
| ε decay rate | 0.99–0.9999 | Slower exploration reduction |
Rule of thumb: Start with α = 0.1, γ = 0.99, ε decaying from 1.0 to 0.01 over ~90% of training. Adjust based on observed learning curves.
Always visualize: (1) Q-values over time for key states, (2) TD error magnitude over time (should decrease), (3) Episode rewards over time, (4) Heatmaps of Q-values across state-action space. These diagnostics reveal problems invisible in aggregate metrics.
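A minimal plotting sketch for the first three diagnostics, assuming the QLearningAgent and train_qlearning defined earlier and that matplotlib is installed:

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_diagnostics(agent, episode_rewards, window: int = 100):
    """Plot |TD error| per update and a moving average of episode rewards."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # TD errors should shrink as the Q-values stabilize
    ax1.plot(agent.td_errors, alpha=0.3)
    ax1.set_xlabel("update step")
    ax1.set_ylabel("|TD error|")

    # A rising moving average of rewards indicates the policy is improving
    if len(episode_rewards) >= window:
        smoothed = np.convolve(episode_rewards, np.ones(window) / window, mode="valid")
        ax2.plot(smoothed)
    ax2.set_xlabel("episode")
    ax2.set_ylabel(f"avg reward (window={window})")

    plt.tight_layout()
    plt.show()
```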
Tabular Q-learning cannot scale to large or continuous state spaces. With 1000 states and 10 actions, we store 10,000 values. With a million states, 10 million. For continuous states (robot joint angles, pixel observations), tabular representation is impossible.
Function approximation replaces the Q-table with a parameterized function $Q(s, a; \theta)$ that generalizes across states:
$$Q(s, a; \theta) \approx Q^*(s, a)$$
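Before turning to neural networks, the idea can be sketched with a linear parameterization; the semi-gradient Q-learning update below is a rough sketch in which the feature map φ(s) and the dimensions are illustrative assumptions:

```python
import numpy as np

num_actions, num_features = 4, 16
theta = np.zeros((num_actions, num_features))   # one weight vector per action

def features(state) -> np.ndarray:
    """Hypothetical feature map φ(s); in practice tile coding, RBFs, etc."""
    return np.asarray(state, dtype=float)        # assumes the state is already a feature vector

def q_value(state, action: int) -> float:
    return float(theta[action] @ features(state))

def semi_gradient_update(state, action, reward, next_state, done,
                         alpha: float = 0.01, gamma: float = 0.99) -> None:
    """Q-learning on the weights: θ_a ← θ_a + α · δ · φ(s)."""
    target = reward if done else reward + gamma * max(
        q_value(next_state, a) for a in range(num_actions)
    )
    td_error = target - q_value(state, action)
    theta[action] += alpha * td_error * features(state)
```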
Neural networks are the most powerful function approximators, leading to Deep Q-Networks (DQN).
Naively applying neural networks to Q-learning fails due to:
- Correlated samples: consecutive transitions are highly correlated, violating the i.i.d. assumptions behind stochastic gradient descent.
- Moving targets: the TD target depends on the same network being trained, so the regression target shifts with every update.
- Instability: combining function approximation, bootstrapping, and off-policy learning can cause divergence.
DQN solutions:
- Experience replay: store transitions in a buffer and train on randomly sampled minibatches, breaking temporal correlations (see the sketch below).
- Target networks: compute the TD target with a separate, periodically updated copy of the network, stabilizing the regression target.
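A minimal sketch of the replay-buffer half of that recipe (capacity and batch size are arbitrary illustrative defaults):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions and serves decorrelated random minibatches."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done) -> None:
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 32):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self) -> int:
        return len(self.buffer)
```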
These innovations, combined with deep neural networks, enabled DQN to achieve superhuman performance on Atari games in 2015—a landmark in AI.
| Algorithm | State Space | Key Innovation | Year |
|---|---|---|---|
| Tabular Q-Learning | Small, discrete | TD + off-policy | 1989 |
| DQN | Large, continuous (pixels) | Experience replay + target networks | 2015 |
| Double DQN | Large, continuous | Decoupled selection/evaluation | 2015 |
| Dueling DQN | Large, continuous | Separate value/advantage streams | 2016 |
| Rainbow | Large, continuous | All above + distributional + noisy | 2017 |
DQN and its variants are covered in detail in Module 5: Deep Reinforcement Learning. Understanding tabular Q-learning—its algorithm, convergence, and limitations—provides the essential foundation for understanding these powerful extensions.
Q-learning is a foundational algorithm that enables learning optimal behavior directly from experience. Let's consolidate the key insights:
- It is model-free: it learns action-values from sampled transitions, with no knowledge of $P(s'|s,a)$ or $R(s,a)$.
- It is off-policy: the TD target uses $\max_{a'} Q(s', a')$, so it learns about the optimal policy while following an exploratory behavior policy.
- It converges to $Q^*$ given sufficient exploration and appropriately decaying learning rates.
- Exploration strategies (ε-greedy, softmax, UCB, optimistic initialization) determine how well those conditions are met in practice.
- The max operator introduces maximization bias, which Double Q-learning removes by decoupling action selection from evaluation.
- Tabular Q-learning does not scale to large state spaces; function approximation and DQN extend the same update to those settings.
What's next: While Q-learning is off-policy (learning Q* while following any exploratory policy), sometimes we want to learn the value of the policy we're actually following. The next page introduces SARSA, Q-learning's on-policy cousin, which learns Q^π for the behavior policy π.
You've mastered Q-learning—the algorithm that launched model-free reinforcement learning. From temporal-difference foundations through convergence theory to practical implementation and extensions, you now have the complete picture of this seminal algorithm.