Q-learning asks: What's the best possible future from here? But there's a fundamentally different question an agent might ask: What future will I actually experience if I keep behaving as I am?
SARSA (State-Action-Reward-State-Action) answers this second question. Unlike Q-learning, which optimistically assumes optimal future behavior, SARSA learns the value of states and actions under the current policy. If your policy explores randomly 10% of the time, SARSA accounts for that uncertainty; Q-learning pretends you'll always act optimally.
This distinction—on-policy vs off-policy learning—has profound implications. SARSA is more conservative during training, respects the actual behavior being executed, and naturally incorporates the costs of exploration. Understanding when each approach is appropriate is essential for any RL practitioner.
This page provides comprehensive coverage of SARSA: the on-policy TD control algorithm, its mathematical formulation, convergence properties, the crucial differences from Q-learning, Expected SARSA as a unified approach, practical implementation, and scenarios where on-policy learning is the right choice.
Before diving into SARSA's mechanics, let's establish the conceptual distinction that defines it.
In any RL scenario, there are conceptually two policies:
On-policy methods: Behavior = Target ($b = \pi$). We learn about the policy we're following.
Off-policy methods: Behavior ≠ Target. We can learn about one policy while following another.
The distinction affects what value function we learn:
| | On-Policy (SARSA) | Off-Policy (Q-Learning) |
|---|---|---|
| Learns | $Q^\pi$ (current policy's value) | $Q^*$ (optimal value) |
| Target | $r + \gamma Q(s', a')$ where $a' \sim \pi$ | $r + \gamma \max_{a'} Q(s', a')$ |
| Assumes | Continues following $\pi$ | Acts optimally from $s'$ |
| During exploration | Accounts for exploration costs | Ignores them |
Off-policy (Q-learning) is more ambitious: it learns about the optimal policy while exploring. On-policy (SARSA) is more honest: it learns about what will actually happen given how the agent actually behaves. In safe environments, off-policy is more sample-efficient. In dangerous environments, on-policy is safer during training.
Consider an ε-greedy policy with ε = 0.1. When the agent is near a cliff:
Q-learning thinks: "The optimal action at s' is STAY_AWAY, giving value +10."
SARSA thinks: "With ε = 0.1, there's a 10% chance of a random action. Suppose, for illustration, that the random action steps off the cliff (-100). The expected value is then 0.1 × (-100) + 0.9 × (+10) = -1."
SARSA's Q-values reflect the actual expected return under the ε-greedy policy, including the risk from exploration. This makes SARSA naturally risk-averse to states where exploration is dangerous.
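To make the comparison concrete, here is the same arithmetic in code. The +10 safe value, the -100 cliff penalty, and the simplification that the exploratory action always steps off the cliff are taken from the illustrative example above, not from a learned Q-table:

```python
epsilon = 0.1
safe_value, cliff_value = 10.0, -100.0   # illustrative values from the example

# Q-learning's target assumes the greedy (safe) action will be taken at s'
q_learning_view = safe_value

# SARSA's view: with probability ε the exploratory action steps off the cliff,
# otherwise the greedy action is taken (the simplification used in the text)
sarsa_view = epsilon * cliff_value + (1 - epsilon) * safe_value

print(q_learning_view, sarsa_view)  # 10.0 -1.0
```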
The name SARSA comes from the quintuple of values used in each update: $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$.
After observing transition $(s, a, r, s')$ and selecting next action $a'$ from $s'$:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]$$
Compare to the Q-learning update:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
The critical difference: SARSA uses the actual next action $a'$ drawn from the current policy, not the best action.
SARSA (Tabular, On-Policy TD Control):

1. Initialize $Q(s, a)$ arbitrarily (e.g., to zeros), with $Q(\text{terminal}, \cdot) = 0$.
2. For each episode: initialize $s$; choose $a$ from $s$ using the policy derived from $Q$ (e.g., ε-greedy).
3. For each step of the episode: take action $a$, observe $r$ and $s'$; choose $a'$ from $s'$ using the policy derived from $Q$; update $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]$; then set $s \leftarrow s'$, $a \leftarrow a'$.
4. Repeat until $s$ is terminal.
Notice $a'$ must be chosen before the update (it's part of the TD target) and kept as the action for the next step (on-policy consistency).
```python
import numpy as np
from typing import Tuple, List


class SARSAAgent:
    """
    SARSA (State-Action-Reward-State-Action) Agent.

    On-policy TD control: learns Q^π for the behavior policy π.
    The key difference from Q-learning is using the actual next
    action in the TD target, not the max.
    """

    def __init__(
        self,
        num_states: int,
        num_actions: int,
        learning_rate: float = 0.1,
        discount_factor: float = 0.99,
        epsilon: float = 0.1
    ):
        self.num_states = num_states
        self.num_actions = num_actions
        self.alpha = learning_rate
        self.gamma = discount_factor
        self.epsilon = epsilon

        # Initialize Q-table
        self.Q = np.zeros((num_states, num_actions))

    def select_action(self, state: int) -> int:
        """ε-greedy action selection."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.num_actions)
        else:
            # Greedy with random tie-breaking
            q_values = self.Q[state]
            max_q = np.max(q_values)
            best_actions = np.where(q_values == max_q)[0]
            return np.random.choice(best_actions)

    def update(
        self,
        state: int,
        action: int,
        reward: float,
        next_state: int,
        next_action: int,  # Key difference: we need the actual next action!
        done: bool
    ) -> float:
        """
        SARSA update: Q(s,a) ← Q(s,a) + α[r + γQ(s',a') - Q(s,a)]

        Note: next_action is the action that WILL be taken, selected
        according to the current policy. This is what makes SARSA on-policy.
        """
        if done:
            td_target = reward  # No future value at terminal
        else:
            # Use actual next action, not max
            td_target = reward + self.gamma * self.Q[next_state, next_action]

        td_error = td_target - self.Q[state, action]
        self.Q[state, action] += self.alpha * td_error

        return td_error


def train_sarsa(
    env,
    agent: SARSAAgent,
    num_episodes: int = 1000,
    max_steps: int = 1000
) -> Tuple[List[float], List[int]]:
    """
    Train SARSA agent.

    Note the subtle difference in the training loop: we must select
    the next action BEFORE the update, then use it as the action
    for the next step.
    """
    episode_rewards = []
    episode_lengths = []

    for episode in range(num_episodes):
        state = env.reset()
        # Select first action BEFORE entering the loop
        action = agent.select_action(state)

        total_reward = 0
        for step in range(max_steps):
            # Execute action
            next_state, reward, done, _ = env.step(action)

            if done:
                # Terminal: update with no next action
                agent.update(state, action, reward, next_state, 0, done=True)
                total_reward += reward
                break
            else:
                # Select next action (this is the a' in SARSA)
                next_action = agent.select_action(next_state)

                # Update Q-value using the actual next action
                agent.update(state, action, reward, next_state, next_action, done=False)

                total_reward += reward

                # Transition: use the SAME next_action we used in the update
                # This maintains on-policy consistency
                state = next_state
                action = next_action

        episode_rewards.append(total_reward)
        episode_lengths.append(step + 1)

    return episode_rewards, episode_lengths
```

A common bug is to select a' for the update but then select a different action when actually stepping. This breaks on-policy consistency—you're updating toward one policy but following another. The action used in Q(s', a') MUST be the action executed in the next step.
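A usage sketch for the class and training function above. It assumes a tabular environment exposing the classic four-tuple step API that train_sarsa expects (newer gym/gymnasium versions return five values and need a small adapter); the environment id is a placeholder:

```python
import gym  # assumption: an installed gym-compatible package with the classic API
import numpy as np

env = gym.make("CliffWalking-v0")  # placeholder id; any discrete-state env works
agent = SARSAAgent(
    num_states=env.observation_space.n,
    num_actions=env.action_space.n,
    learning_rate=0.1,
    discount_factor=0.99,
    epsilon=0.1,
)

rewards, lengths = train_sarsa(env, agent, num_episodes=500)
print(f"Mean reward, last 100 episodes: {np.mean(rewards[-100:]):.1f}")

# A greedy policy can be read off the learned Q-table for deployment
greedy_policy = np.argmax(agent.Q, axis=1)
```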
Let's rigorously examine how SARSA and Q-learning differ in behavior, learned values, and appropriate use cases.
Both algorithms update Q(s, a) toward a target:
Q-learning: Target = $r + \gamma \max_{a'} Q(s', a')$ = $r + \gamma V^*(s')$
SARSA: Target = $r + \gamma Q(s', a')$ where $a' \sim \pi$
In expectation over policy $\pi$: $$\mathbb{E}_{a' \sim \pi}[r + \gamma Q(s', a')] = r + \gamma \sum_{a'} \pi(a'|s') Q(s', a') = r + \gamma V^\pi(s')$$
So:

- The Q-learning target bootstraps from $V^*(s')$ (with respect to the current $Q$): the value of acting greedily from $s'$.
- The SARSA target bootstraps, in expectation, from $V^\pi(s')$: the value of continuing to follow $\pi$ from $s'$.
This is why Q-learning learns $Q^*$ and SARSA learns $Q^\pi$.
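A small numeric check of this identity. The Q-values, reward, and ε below are made up for the demonstration:

```python
import numpy as np

q_next = np.array([2.0, 5.0, -1.0])   # illustrative Q(s', ·)
epsilon, gamma, r = 0.1, 0.99, 1.0

# ε-greedy probabilities π(a'|s')
probs = np.full(len(q_next), epsilon / len(q_next))
probs[np.argmax(q_next)] += 1 - epsilon

# On-policy target in expectation: r + γ Σ_a' π(a'|s') Q(s', a') = r + γ V^π(s')
expected_sarsa_target = r + gamma * probs @ q_next

# Off-policy (Q-learning) target: r + γ max_a' Q(s', a')
q_learning_target = r + gamma * q_next.max()

# Averaging many sampled SARSA targets r + γ Q(s', a'), a' ~ π, recovers the expectation
rng = np.random.default_rng(0)
sampled_actions = rng.choice(len(q_next), size=100_000, p=probs)
sarsa_sample_mean = (r + gamma * q_next[sampled_actions]).mean()

print(expected_sarsa_target, sarsa_sample_mean, q_learning_target)
```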
The Cliff Walking environment is the canonical illustration of the SARSA/Q-learning difference.
Environment: a small gridworld (4×12 in the classic formulation). The agent starts at the bottom-left corner and must reach the bottom-right goal; the cells between them along the bottom row form a cliff. Every step costs -1, and stepping into the cliff gives -100 and sends the agent back to the start.
Results with ε-greedy (ε = 0.1):
| Algorithm | Path Learned | Average Reward | Behavior |
|---|---|---|---|
| Q-learning | Along cliff edge | ~-40 | Sometimes falls off during training |
| SARSA | One row above cliff | ~-20 | Avoids cliff entirely |
Q-learning learns the optimal path but can't safely execute it (exploration causes falls). SARSA learns a suboptimal path but executes it safely.
Q-learning values the cliff-edge path highly because max Q(s', a') assumes you'll take the optimal (STAY) action. But ε-greedy has a 10% chance of random action → potential cliff fall. The policy is optimal but the execution isn't safe with exploration.
SARSA values near-cliff states lower because a' is drawn from ε-greedy, so the expected value includes the 10% chance of a bad random action. This naturally avoids the cliff, leading to safer training-time behavior.
Prefer Q-learning when:

- Training happens in a simulator where exploratory mistakes are free.
- The deployed policy will be greedy (ε = 0), so $Q^*$ is the quantity you actually need.
- You want off-policy flexibility, such as experience replay or learning from logged data.
Prefer SARSA when:

- Training happens in the real world, where exploratory mistakes have real costs.
- The deployed policy will keep exploring (for safety or continued learning), so $Q^\pi$ is the relevant value.
- Exploration rates are high enough that Q-learning's "act optimally from here on" assumption is unrealistic.
There's a middle ground between SARSA's sample-based on-policy update and Q-learning's off-policy max: Expected SARSA, which takes the expectation over policy actions instead of sampling.
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \sum_{a'} \pi(a'|s') Q(s', a') - Q(s, a) \right]$$
Instead of sampling $a' \sim \pi$ (SARSA) or taking max (Q-learning), Expected SARSA computes the exact expected value under $\pi$.
SARSA has variance from sampling the next action: $$\text{Var}_{a' \sim \pi}\left[Q(s', a')\right] > 0$$ whenever $\pi$ is stochastic and the Q-values at $s'$ differ across actions.
Expected SARSA eliminates this source of variance: $$\sum_{a'} \pi(a'|s') Q(s', a') = \mathbb{E}_{a' \sim \pi}[Q(s', a')]$$
Lower variance means more stable updates and often faster convergence.
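A quick illustration of this variance gap, with made-up Q-values and ε (purely for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
q_next = np.array([0.0, 4.0, 9.0, -3.0])   # illustrative Q(s', ·)
epsilon = 0.2

probs = np.full(len(q_next), epsilon / len(q_next))
probs[np.argmax(q_next)] += 1 - epsilon

# SARSA bootstraps from a sampled Q(s', a'), a' ~ π: nonzero spread across updates
sampled = q_next[rng.choice(len(q_next), size=50_000, p=probs)]
print("sampled bootstrap mean/std:", sampled.mean(), sampled.std())

# Expected SARSA bootstraps from the exact expectation: no sampling noise at all
print("exact expectation:", probs @ q_next)
```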
```python
import numpy as np


class ExpectedSARSAAgent:
    """
    Expected SARSA: uses expected value under policy instead of sampling.

    Lower variance than SARSA, can be on-policy or off-policy
    depending on the policy used in the expectation.
    """

    def __init__(
        self,
        num_states: int,
        num_actions: int,
        learning_rate: float = 0.1,
        discount_factor: float = 0.99,
        epsilon: float = 0.1
    ):
        self.num_states = num_states
        self.num_actions = num_actions
        self.alpha = learning_rate
        self.gamma = discount_factor
        self.epsilon = epsilon

        self.Q = np.zeros((num_states, num_actions))

    def get_policy_probs(self, state: int) -> np.ndarray:
        """
        Return action probabilities for ε-greedy policy at given state.
        This is π(a|s) needed for expected value computation.
        """
        q_values = self.Q[state]
        probs = np.ones(self.num_actions) * (self.epsilon / self.num_actions)

        # Handle ties in max Q
        max_q = np.max(q_values)
        best_actions = np.where(q_values == max_q)[0]

        # Distribute the (1 - ε) greedy probability among best actions
        greedy_boost = (1 - self.epsilon) / len(best_actions)
        probs[best_actions] += greedy_boost

        return probs

    def select_action(self, state: int) -> int:
        """Sample action from ε-greedy policy."""
        probs = self.get_policy_probs(state)
        return np.random.choice(self.num_actions, p=probs)

    def update(
        self,
        state: int,
        action: int,
        reward: float,
        next_state: int,
        done: bool
    ) -> float:
        """
        Expected SARSA update.

        Uses E_{a'~π}[Q(s',a')] instead of sampled Q(s',a') or max Q(s',a').
        """
        if done:
            td_target = reward
        else:
            # Compute expected value under current policy
            policy_probs = self.get_policy_probs(next_state)
            expected_q = np.dot(policy_probs, self.Q[next_state])
            td_target = reward + self.gamma * expected_q

        td_error = td_target - self.Q[state, action]
        self.Q[state, action] += self.alpha * td_error

        return td_error

    def expected_value(self, state: int) -> float:
        """Compute V^π(s) = Σ_a π(a|s) Q(s,a)."""
        probs = self.get_policy_probs(state)
        return np.dot(probs, self.Q[state])
```

Expected SARSA generalizes both algorithms:

- With $\pi$ set to the behavior policy (e.g., ε-greedy), it is the expected, lower-variance form of SARSA and remains on-policy.
- With $\pi$ set to the greedy policy, the expectation collapses to the max and the update becomes Q-learning.
To see that Q-learning is Expected SARSA with a greedy target: $$\sum_{a'} \pi_{\text{greedy}}(a'|s')\, Q(s', a') = Q\big(s', \arg\max_{a'} Q(s', a')\big) = \max_{a'} Q(s', a')$$
This unification reveals that the core difference between algorithms is the target policy used in the expectation.
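The reduction is easy to verify numerically (Q-values illustrative):

```python
import numpy as np

q_next = np.array([1.0, 7.0, 3.0])
greedy_probs = np.eye(len(q_next))[np.argmax(q_next)]     # π_greedy(a'|s'): all mass on argmax
assert np.isclose(greedy_probs @ q_next, q_next.max())    # expectation under greedy == max
```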
| Algorithm | Target Policy | Next Action Handling | Variance |
|---|---|---|---|
| SARSA | Behavior π | Sample a' ~ π | Higher (sampling) |
| Expected SARSA | Any π | E_{a'~π}[Q(s',a')] | Lower (exact) |
| Q-learning | Greedy (w.r.t. current Q) | max Q(s',a') | None (deterministic max) |
Expected SARSA is often the best default choice for tabular problems: it matches Q-learning's low target variance while remaining on-policy when desired. The computational overhead (summing over actions) is negligible for small action spaces. For large action spaces, sampling (SARSA) may be necessary.
SARSA's convergence theory differs from Q-learning because SARSA learns about a changing target: as the policy improves, $Q^\pi$ changes.
Theorem (Singh et al., 2000): SARSA converges to $Q^*$ with probability 1 under the following conditions:

1. Every state-action pair is visited infinitely often.
2. The learning rates satisfy the usual stochastic-approximation conditions: $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.
3. The policy is GLIE: Greedy in the Limit with Infinite Exploration.
GLIE Explained: The policy must explore forever (ensuring all Q-values are updated) but eventually become greedy (ensuring convergence to optimal, not just any consistent policy).
Example GLIE Policy: ε-greedy with $\epsilon_t = 1/t$. The exploration rate decays to zero, so the policy becomes greedy in the limit, while it stays positive at every finite $t$ (and $\sum_t \epsilon_t = \infty$), so exploration never stops prematurely.
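A sketch of such a schedule. The 1/t form matches the example above; the per-state variant in the comment is a common refinement, not a requirement:

```python
def glie_epsilon(t: int) -> float:
    """ε_t = 1/t: decays to zero (greedy in the limit) while remaining
    positive at every finite t, so exploration never stops entirely."""
    return 1.0 / max(t, 1)

# A per-state variant, ε_t(s) = 1 / N(s) with N(s) the visit count of s,
# lets rarely visited states keep exploring longer.
```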
Q-learning converges to Q* regardless of the behavior policy (as long as all state-actions are visited). SARSA converges to Q* only if the policy itself converges to optimal. With constant ε > 0, SARSA converges to Q^{π_ε}, not Q*.
With constant ε (not decaying to 0), SARSA converges to $Q^{\pi_\epsilon}$—the value function of the ε-greedy policy.
This is NOT the same as $Q^*$! For any ε > 0: $$Q^{\pi_\epsilon}(s, a) \leq Q^*(s, a)$$
The gap reflects the "cost of exploration"—the expected loss from random actions.
Practical implication: If you plan to deploy with ε = 0, Q-learning gives you the right values. If you plan to deploy with ε > 0 (for continued learning or safety), SARSA gives you the right values.
Comparing convergence rates is complex and environment-dependent:
| Factor | SARSA | Q-learning |
|---|---|---|
| Variance per update | Higher (samples a') | Lower (deterministic max) |
| Bias | Unbiased for Q^π | Biased by maximization |
| Moving target | Policy changes during learning | Fixed target Q* |
| Stability | Generally more stable | Can oscillate near cliff-like states |
Expected SARSA often has the best empirical convergence because it combines:

- SARSA's on-policy target, which avoids the bias introduced by the off-policy max, with
- Q-learning's low target variance, since the expectation over next actions is computed exactly rather than sampled.
One-step SARSA bootstraps from the immediate next state. But we could wait longer—accumulate more real rewards before bootstrapping. This gives n-step SARSA, which provides a spectrum between TD(0) and Monte Carlo.
The n-step return from time $t$: $$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})$$
This uses $n$ actual rewards plus a bootstrap from $Q$ at step $t+n$.
Special cases:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q(S_t, A_t) \right]$$
This update can only occur at time $t + n$ when $G_{t:t+n}$ is known.
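Before the full agent (which buffers these quantities itself), here is a small helper that computes $G_{t:t+n}$ straight from the definition; the numbers in the example are illustrative:

```python
def n_step_return(rewards, gamma, bootstrap_q=None):
    """G_{t:t+n} from rewards [R_{t+1}, ..., R_{t+n}] plus an optional
    bootstrap Q(S_{t+n}, A_{t+n}); omit bootstrap_q if the episode ended."""
    G = sum((gamma ** i) * r for i, r in enumerate(rewards))
    if bootstrap_q is not None:
        G += (gamma ** len(rewards)) * bootstrap_q
    return G

# 3-step return with rewards [-1, -1, -1], γ = 0.9, bootstrap Q = 5.0:
# -1 - 0.9 - 0.81 + 0.729 * 5.0 = 0.935
print(n_step_return([-1, -1, -1], 0.9, bootstrap_q=5.0))
```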
```python
import numpy as np
from collections import deque


class NStepSARSAAgent:
    """
    n-step SARSA: bridges one-step SARSA and Monte Carlo.

    Uses n actual rewards plus a bootstrap from Q(S_{t+n}, A_{t+n}).
    Larger n = lower bias, higher variance.
    """

    def __init__(
        self,
        num_states: int,
        num_actions: int,
        n_steps: int = 4,
        learning_rate: float = 0.1,
        discount_factor: float = 0.99,
        epsilon: float = 0.1
    ):
        self.num_states = num_states
        self.num_actions = num_actions
        self.n = n_steps
        self.alpha = learning_rate
        self.gamma = discount_factor
        self.epsilon = epsilon

        self.Q = np.zeros((num_states, num_actions))

        # Buffer to store the last n+1 transitions
        self.state_buffer = deque(maxlen=n_steps + 1)
        self.action_buffer = deque(maxlen=n_steps + 1)
        self.reward_buffer = deque(maxlen=n_steps)

    def select_action(self, state: int) -> int:
        if np.random.random() < self.epsilon:
            return np.random.randint(self.num_actions)
        return np.argmax(self.Q[state])

    def start_episode(self, state: int, action: int):
        """Initialize buffers at start of episode."""
        self.state_buffer.clear()
        self.action_buffer.clear()
        self.reward_buffer.clear()
        self.state_buffer.append(state)
        self.action_buffer.append(action)

    def step(
        self,
        next_state: int,
        reward: float,
        next_action: int,
        done: bool,
        t: int  # Current timestep
    ):
        """
        Process one step, potentially making an n-step update.
        Returns True if an update was made.
        """
        # Store transition
        self.state_buffer.append(next_state)
        self.action_buffer.append(next_action)
        self.reward_buffer.append(reward)

        # Can only update once we have n steps of data
        # Update is for time tau = t - n + 1
        tau = t - self.n + 1
        if tau >= 0:
            # Compute n-step return G_{tau:tau+n}
            G = 0.0
            for i, r in enumerate(self.reward_buffer):
                G += (self.gamma ** i) * r

            # Bootstrap from Q(S_{tau+n}, A_{tau+n}) if not terminal
            if not done:
                bootstrap_state = self.state_buffer[-1]
                bootstrap_action = self.action_buffer[-1]
                G += (self.gamma ** self.n) * self.Q[bootstrap_state, bootstrap_action]

            # Update Q(S_tau, A_tau)
            update_state = self.state_buffer[0]
            update_action = self.action_buffer[0]
            td_error = G - self.Q[update_state, update_action]
            self.Q[update_state, update_action] += self.alpha * td_error
            return True

        return False

    def end_episode(self, T: int):
        """
        Make remaining updates at end of episode.
        T is the terminal timestep.
        """
        rewards = list(self.reward_buffer)
        # If the buffer is full, its oldest state was already updated by the
        # final call to step(); skip it to avoid a double update.
        start = 1 if len(rewards) == self.n else 0
        for remaining in range(start, len(rewards)):
            # Compute return for the remaining transitions
            # These don't bootstrap (episode ended)
            G = 0.0
            for i, r in enumerate(rewards[remaining:]):
                G += (self.gamma ** i) * r

            update_state = self.state_buffer[remaining]
            update_action = self.action_buffer[remaining]
            td_error = G - self.Q[update_state, update_action]
            self.Q[update_state, update_action] += self.alpha * td_error
```

As $n$ increases:
| Aspect | Effect | Why |
|---|---|---|
| Bias | Decreases | More real rewards, less reliance on Q estimates |
| Variance | Increases | More random rewards accumulated, more stochasticity |
| Delayed updates | Increases | Must wait n steps to update |
| Credit assignment | More direct | Rewards closer to responsible actions |
Optimal n: Problem-dependent. Short horizons with high reward variance favor small $n$. Long horizons with sparse rewards favor larger $n$. Empirically, $n \in [4, 8]$ often works well.
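For completeness, a sketch of a training loop that drives the NStepSARSAAgent class above, assuming the same gym-style (state, reward, done, info) step API used earlier:

```python
def train_n_step_sarsa(env, agent: NStepSARSAAgent,
                       num_episodes: int = 500, max_steps: int = 1000):
    """Minimal driver for NStepSARSAAgent (assumes the classic gym step API)."""
    for _ in range(num_episodes):
        state = env.reset()
        action = agent.select_action(state)
        agent.start_episode(state, action)

        for t in range(max_steps):
            next_state, reward, done, _ = env.step(action)
            # Choose a' before calling step() and reuse it as the next action
            next_action = agent.select_action(next_state) if not done else 0
            agent.step(next_state, reward, next_action, done, t)
            state, action = next_state, next_action
            if done:
                agent.end_episode(t + 1)  # flush the remaining truncated returns
                break
```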
Rather than choosing a fixed n, eligibility traces (TD(λ)) provide a weighted average over ALL n-step returns. This is covered in the next page and is often more elegant than explicitly implementing n-step methods.
Successfully implementing SARSA requires attention to subtle details that can make the difference between correct learning and puzzling failures.
The most common SARSA bug is incorrect action handling. The flow must be:
```python
state = env.reset()
action = select_action(state)
done = False

while not done:
    next_state, reward, done = env.step(action)
    next_action = select_action(next_state)
    update(state, action, reward, next_state, next_action)
    state, action = next_state, next_action  # SAME action!
```
The next_action used in the update MUST be the action taken in the next step. Any deviation breaks on-policy consistency.
| Hyperparameter | SARSA Guidance | Notes |
|---|---|---|
| Learning rate α | 0.1–0.5 | SARSA often tolerates higher α than Q-learning |
| Discount γ | 0.9–0.99 | Standard; higher for long-horizon tasks |
| ε (exploration) | 0.1–0.3 typical | Remember: SARSA learns Q^{π_ε}, not Q* |
| ε decay | Slower than Q-learning | On-policy needs exploration to evaluate all actions |
| n (for n-step) | 2–8 | Larger for sparse rewards |
Key insight: Because SARSA is on-policy, you need continued exploration to update Q-values for non-greedy actions. Decay ε too fast and those Q-values become stale.
A good sanity check: track the empirical average reward and compare it to V^π(start_state). With enough training, these should converge to similar values. If V^π significantly exceeds actual performance, you may have an off-policy leak in your implementation.
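A minimal version of this check, reusing the agent's ε-greedy probabilities. The start_state index and the episode_rewards list are placeholders that would come from your own training run:

```python
import numpy as np

def v_pi(agent, state: int) -> float:
    """V^π(s) = Σ_a π(a|s) Q(s, a) for the agent's ε-greedy policy."""
    q = agent.Q[state]
    probs = np.full(agent.num_actions, agent.epsilon / agent.num_actions)
    probs[np.argmax(q)] += 1 - agent.epsilon
    return float(probs @ q)

# Placeholders from your own run:
# print(v_pi(agent, start_state), np.mean(episode_rewards[-100:]))
```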
While Q-learning often gets more attention, on-policy methods like SARSA are preferred in several important domains.
Physical systems can't reset like simulators—mistakes have real consequences.
Example: a robotic arm learning manipulation has to explore, but exploratory actions near joint limits or obstacles can damage hardware. SARSA's values account for that exploration risk, so during training the agent learns to keep a safety margin rather than assuming it will always act optimally.
Practice: Use SARSA during initial training when exploration is high, then optionally switch to Q-learning for fine-tuning after establishing safe policies.
Trading systems operate in real markets where training-time behavior has real costs.
Why SARSA fits:

- Exploratory trades have real monetary cost during training, and those costs should show up in the learned values.
- The values should describe the strategy as it is actually executed, including its exploration, not an idealized always-optimal execution.
Contrast with Q-learning: Q-learning might value a high-risk trade highly (optimal if executed perfectly), but the strategy is only safe when exploration is disabled.
Conversational agents learn while interacting with real users.
The challenge: Exploratory (random) responses annoy users. SARSA naturally penalizes states where random exploration leads to poor outcomes (user frustration, conversation termination).
SARSA's advantage: its Q-values already price in the chance of an occasional exploratory (and possibly poor) response, so the agent steers toward conversation states that remain acceptable even while it keeps exploring.
| Scenario | Recommendation | Rationale |
|---|---|---|
| Simulator training, perfect deployment | Q-learning | Learn Q*, deploy greedy π* |
| Real-world training | SARSA | Safe training-time behavior |
| Continued learning after deployment | SARSA/Expected SARSA | Account for ongoing exploration |
| High exploration (ε > 0.2) | SARSA | Q-learning values unrealistic |
| Experience replay desired | Q-learning | Off-policy requirement |
| Low exploration, stable Q-values | Expected SARSA | Lower variance, any policy |
If you're training in simulation and deploying a greedy policy: Q-learning. If you're training in the real world, or your deployed policy will explore: SARSA or Expected SARSA. When in doubt, Expected SARSA offers stability with flexibility.
SARSA represents the on-policy approach to TD control—learning the value of the policy we're actually following, not an idealized optimal policy. Let's consolidate the key insights:

- SARSA updates toward $r + \gamma Q(s', a')$ with $a'$ drawn from the current policy, so it learns $Q^\pi$; Q-learning updates toward $r + \gamma \max_{a'} Q(s', a')$ and learns $Q^*$.
- Because its values include the cost of exploration, SARSA behaves more conservatively during training, as the cliff-walking example shows.
- Expected SARSA replaces the sampled next action with its expectation under the policy, reducing variance and unifying SARSA and Q-learning as special cases of one update.
- With GLIE exploration and appropriately decaying step sizes, SARSA converges to $Q^*$; with constant ε it converges to $Q^{\pi_\epsilon}$.
- n-step SARSA trades bias for variance by using more real rewards before bootstrapping.
- On-policy methods are the right choice when training-time behavior matters: real-world training, continued exploration after deployment, or high exploration rates.
What's next: Both Q-learning and SARSA focus on immediate (one-step) TD updates. But what if rewards are sparse and distant from the actions that caused them? The next page introduces Temporal Difference Learning in depth, covering the theory of TD errors and how they propagate value information through the state space.
You've mastered SARSA—the on-policy counterpart to Q-learning. You understand when each approach is appropriate, how Expected SARSA unifies them, and how n-step methods extend the basic algorithm. This completes your toolkit for tabular TD control methods.