Q-learning asks: What's the best possible future from here? But there's a fundamentally different question an agent might ask: What future will I actually experience if I keep behaving as I am?
SARSA (State-Action-Reward-State-Action) answers this second question. Unlike Q-learning, which optimistically assumes optimal future behavior, SARSA learns the value of states and actions under the current policy. If your policy explores randomly 10% of the time, SARSA accounts for that uncertainty; Q-learning pretends you'll always act optimally.
This distinction—on-policy vs off-policy learning—has profound implications. SARSA is more conservative during training, respects the actual behavior being executed, and naturally incorporates the costs of exploration. Understanding when each approach is appropriate is essential for any RL practitioner.
This page provides comprehensive coverage of SARSA: the on-policy TD control algorithm, its mathematical formulation, convergence properties, the crucial differences from Q-learning, Expected SARSA as a unified approach, practical implementation, and scenarios where on-policy learning is the right choice.
Before diving into SARSA's mechanics, let's establish the conceptual distinction that defines it.
In any RL scenario, there are conceptually two policies:
On-policy methods: Behavior = Target ($b = \pi$). We learn about the policy we're following.
Off-policy methods: Behavior ≠ Target. We can learn about one policy while following another.
The distinction affects what value function we learn:
| | On-Policy (SARSA) | Off-Policy (Q-Learning) |
|---|---|---|
| Learns | $Q^\pi$ (current policy's value) | $Q^*$ (optimal value) |
| Target | $r + \gamma Q(s', a')$ where $a' \sim \pi$ | $r + \gamma \max_{a'} Q(s', a')$ |
| Assumes | Continues following $\pi$ | Acts optimally from $s'$ |
| During exploration | Accounts for exploration costs | Ignores them |
Off-policy (Q-learning) is more ambitious: it learns about the optimal policy while exploring. On-policy (SARSA) is more honest: it learns about what will actually happen given how the agent actually behaves. In safe environments, off-policy is more sample-efficient. In dangerous environments, on-policy is safer during training.
Consider an ε-greedy policy with ε = 0.1. When the agent is near a cliff:
Q-learning thinks: "The optimal action at s' is STAY_AWAY, giving value +10."
SARSA thinks: "With ε = 0.1, there's a 10% chance of a random action. Suppose, for illustration, that the random action steps off the cliff (-100). The expected value is then 0.1 × (-100) + 0.9 × (+10) = -1."
SARSA's Q-values reflect the actual expected return under the ε-greedy policy, including the risk from exploration. This makes SARSA naturally risk-averse to states where exploration is dangerous.
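To make the comparison concrete, here is the same arithmetic in code. The +10 safe value, the -100 cliff penalty, and the simplification that the exploratory action always steps off the cliff are taken from the illustrative example above, not from a learned Q-table:

```python
epsilon = 0.1
safe_value, cliff_value = 10.0, -100.0   # illustrative values from the example

# Q-learning's target assumes the greedy (safe) action will be taken at s'
q_learning_view = safe_value

# SARSA's view: with probability ε the exploratory action steps off the cliff,
# otherwise the greedy action is taken (the simplification used in the text)
sarsa_view = epsilon * cliff_value + (1 - epsilon) * safe_value

print(q_learning_view, sarsa_view)  # 10.0 -1.0
```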
The name SARSA comes from the quintuple of values used in each update: $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$.
After observing transition $(s, a, r, s')$ and selecting next action $a'$ from $s'$:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]$$
Compare to the Q-learning update:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
The critical difference: SARSA uses the actual next action $a'$ drawn from the current policy, not the best action.
SARSA (Tabular, On-Policy TD Control):

1. Initialize $Q(s, a)$ arbitrarily (e.g., to zeros), with $Q(\text{terminal}, \cdot) = 0$.
2. For each episode: initialize $s$; choose $a$ from $s$ using the policy derived from $Q$ (e.g., ε-greedy).
3. For each step of the episode: take action $a$, observe $r$ and $s'$; choose $a'$ from $s'$ using the policy derived from $Q$; update $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]$; then set $s \leftarrow s'$, $a \leftarrow a'$.
4. Repeat until $s$ is terminal.
Notice $a'$ must be chosen before the update (it's part of the TD target) and kept as the action for the next step (on-policy consistency).
```python
import numpy as np
from typing import Tuple, List


class SARSAAgent:
    """
    SARSA (State-Action-Reward-State-Action) Agent.

    On-policy TD control: learns Q^π for the behavior policy π.
    The key difference from Q-learning is using the actual next
    action in the TD target, not the max.
    """

    def __init__(
        self,
        num_states: int,
        num_actions: int,
        learning_rate: float = 0.1,
        discount_factor: float = 0.99,
        epsilon: float = 0.1
    ):
        self.num_states = num_states
        self.num_actions = num_actions
        self.alpha = learning_rate
        self.gamma = discount_factor
        self.epsilon = epsilon

        # Initialize Q-table
        self.Q = np.zeros((num_states, num_actions))

    def select_action(self, state: int) -> int:
        """ε-greedy action selection."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.num_actions)
        else:
            # Greedy with random tie-breaking
            q_values = self.Q[state]
            max_q = np.max(q_values)
            best_actions = np.where(q_values == max_q)[0]
            return np.random.choice(best_actions)

    def update(
        self,
        state: int,
        action: int,
        reward: float,
        next_state: int,
        next_action: int,  # Key difference: we need the actual next action!
        done: bool
    ) -> float:
        """
        SARSA update: Q(s,a) ← Q(s,a) + α[r + γQ(s',a') - Q(s,a)]

        Note: next_action is the action that WILL be taken, selected
        according to the current policy. This is what makes SARSA on-policy.
        """
        if done:
            td_target = reward  # No future value at terminal
        else:
            # Use actual next action, not max
            td_target = reward + self.gamma * self.Q[next_state, next_action]

        td_error = td_target - self.Q[state, action]
        self.Q[state, action] += self.alpha * td_error

        return td_error


def train_sarsa(
    env,
    agent: SARSAAgent,
    num_episodes: int = 1000,
    max_steps: int = 1000
) -> Tuple[List[float], List[int]]:
    """
    Train SARSA agent.

    Note the subtle difference in the training loop: we must select
    the next action BEFORE the update, then use it as the action
    for the next step.
    """
    episode_rewards = []
    episode_lengths = []

    for episode in range(num_episodes):
        state = env.reset()
        # Select first action BEFORE entering the loop
        action = agent.select_action(state)

        total_reward = 0
        for step in range(max_steps):
            # Execute action
            next_state, reward, done, _ = env.step(action)

            if done:
                # Terminal: update with no next action
                agent.update(state, action, reward, next_state, 0, done=True)
                total_reward += reward
                break
            else:
                # Select next action (this is the a' in SARSA)
                next_action = agent.select_action(next_state)

                # Update Q-value using the actual next action
                agent.update(state, action, reward, next_state, next_action, done=False)

                total_reward += reward

                # Transition: use the SAME next_action we used in the update
                # This maintains on-policy consistency
                state = next_state
                action = next_action

        episode_rewards.append(total_reward)
        episode_lengths.append(step + 1)

    return episode_rewards, episode_lengths
```

A common bug is to select a' for the update but then select a different action when actually stepping. This breaks on-policy consistency—you're updating toward one policy but following another. The action used in Q(s', a') MUST be the action executed in the next step.
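A usage sketch for the class and training function above. It assumes a tabular environment exposing the classic four-tuple step API that train_sarsa expects (newer gym/gymnasium versions return five values and need a small adapter); the environment id is a placeholder:

```python
import gym  # assumption: an installed gym-compatible package with the classic API
import numpy as np

env = gym.make("CliffWalking-v0")  # placeholder id; any discrete-state env works
agent = SARSAAgent(
    num_states=env.observation_space.n,
    num_actions=env.action_space.n,
    learning_rate=0.1,
    discount_factor=0.99,
    epsilon=0.1,
)

rewards, lengths = train_sarsa(env, agent, num_episodes=500)
print(f"Mean reward, last 100 episodes: {np.mean(rewards[-100:]):.1f}")

# A greedy policy can be read off the learned Q-table for deployment
greedy_policy = np.argmax(agent.Q, axis=1)
```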
Let's rigorously examine how SARSA and Q-learning differ in behavior, learned values, and appropriate use cases.
Both algorithms update Q(s, a) toward a target:
Q-learning: Target = $r + \gamma \max_{a'} Q(s', a')$ = $r + \gamma V^*(s')$
SARSA: Target = $r + \gamma Q(s', a')$ where $a' \sim \pi$
In expectation over policy $\pi$: $$\mathbb{E}_{a' \sim \pi}[r + \gamma Q(s', a')] = r + \gamma \sum_{a'} \pi(a'|s') Q(s', a') = r + \gamma V^\pi(s')$$
So:

- The Q-learning target bootstraps from $V^*(s')$ (with respect to the current $Q$): the value of acting greedily from $s'$.
- The SARSA target bootstraps, in expectation, from $V^\pi(s')$: the value of continuing to follow $\pi$ from $s'$.
This is why Q-learning learns $Q^*$ and SARSA learns $Q^\pi$.
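A small numeric check of this identity. The Q-values, reward, and ε below are made up for the demonstration:

```python
import numpy as np

q_next = np.array([2.0, 5.0, -1.0])   # illustrative Q(s', ·)
epsilon, gamma, r = 0.1, 0.99, 1.0

# ε-greedy probabilities π(a'|s')
probs = np.full(len(q_next), epsilon / len(q_next))
probs[np.argmax(q_next)] += 1 - epsilon

# On-policy target in expectation: r + γ Σ_a' π(a'|s') Q(s', a') = r + γ V^π(s')
expected_sarsa_target = r + gamma * probs @ q_next

# Off-policy (Q-learning) target: r + γ max_a' Q(s', a')
q_learning_target = r + gamma * q_next.max()

# Averaging many sampled SARSA targets r + γ Q(s', a'), a' ~ π, recovers the expectation
rng = np.random.default_rng(0)
sampled_actions = rng.choice(len(q_next), size=100_000, p=probs)
sarsa_sample_mean = (r + gamma * q_next[sampled_actions]).mean()

print(expected_sarsa_target, sarsa_sample_mean, q_learning_target)
```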
The Cliff Walking environment is the canonical illustration of the SARSA/Q-learning difference.
Environment: a small gridworld (4×12 in the classic formulation). The agent starts at the bottom-left corner and must reach the bottom-right goal; the cells between them along the bottom row form a cliff. Every step costs -1, and stepping into the cliff gives -100 and sends the agent back to the start.
Results with ε-greedy (ε = 0.1):
| Algorithm | Path Learned | Average Reward | Behavior |
|---|---|---|---|
| Q-learning | Along cliff edge | ~-40 | Sometimes falls off during training |
| SARSA | One row above cliff | ~-20 | Avoids cliff entirely |
Q-learning learns the optimal path but can't safely execute it (exploration causes falls). SARSA learns a suboptimal path but executes it safely.
Q-learning values the cliff-edge path highly because max Q(s', a') assumes you'll take the optimal (STAY) action. But ε-greedy has a 10% chance of random action → potential cliff fall. The policy is optimal but the execution isn't safe with exploration.
SARSA values near-cliff states lower because a' is drawn from ε-greedy, so the expected value includes the 10% chance of a bad random action. This naturally avoids the cliff, leading to safer training-time behavior.
Prefer Q-learning when:

- Training happens in a simulator where exploratory mistakes are free.
- The deployed policy will be greedy (ε = 0), so $Q^*$ is the quantity you actually need.
- You want off-policy flexibility, such as experience replay or learning from logged data.
Prefer SARSA when:

- Training happens in the real world, where exploratory mistakes have real costs.
- The deployed policy will keep exploring (for safety or continued learning), so $Q^\pi$ is the relevant value.
- Exploration rates are high enough that Q-learning's "act optimally from here on" assumption is unrealistic.
There's a middle ground between SARSA's sample-based on-policy update and Q-learning's off-policy max: Expected SARSA, which takes the expectation over policy actions instead of sampling.
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \sum_{a'} \pi(a'|s') Q(s', a') - Q(s, a) \right]$$
Instead of sampling $a' \sim \pi$ (SARSA) or taking max (Q-learning), Expected SARSA computes the exact expected value under $\pi$.
SARSA has variance from sampling the next action: $$\text{Var}_{a' \sim \pi}\left[Q(s', a')\right] > 0$$ whenever $\pi$ is stochastic and the Q-values at $s'$ differ across actions.
Expected SARSA eliminates this source of variance: $$\sum_{a'} \pi(a'|s') Q(s', a') = \mathbb{E}_{a' \sim \pi}[Q(s', a')]$$
Lower variance means more stable updates and often faster convergence.
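A quick illustration of this variance gap, with made-up Q-values and ε (purely for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
q_next = np.array([0.0, 4.0, 9.0, -3.0])   # illustrative Q(s', ·)
epsilon = 0.2

probs = np.full(len(q_next), epsilon / len(q_next))
probs[np.argmax(q_next)] += 1 - epsilon

# SARSA bootstraps from a sampled Q(s', a'), a' ~ π: nonzero spread across updates
sampled = q_next[rng.choice(len(q_next), size=50_000, p=probs)]
print("sampled bootstrap mean/std:", sampled.mean(), sampled.std())

# Expected SARSA bootstraps from the exact expectation: no sampling noise at all
print("exact expectation:", probs @ q_next)
```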
```python
import numpy as np


class ExpectedSARSAAgent:
    """
    Expected SARSA: uses expected value under policy instead of sampling.

    Lower variance than SARSA, can be on-policy or off-policy
    depending on the policy used in the expectation.
    """

    def __init__(
        self,
        num_states: int,
        num_actions: int,
        learning_rate: float = 0.1,
        discount_factor: float = 0.99,
        epsilon: float = 0.1
    ):
        self.num_states = num_states
        self.num_actions = num_actions
        self.alpha = learning_rate
        self.gamma = discount_factor
        self.epsilon = epsilon

        self.Q = np.zeros((num_states, num_actions))

    def get_policy_probs(self, state: int) -> np.ndarray:
        """
        Return action probabilities for ε-greedy policy at given state.
        This is π(a|s) needed for expected value computation.
        """
        q_values = self.Q[state]
        probs = np.ones(self.num_actions) * (self.epsilon / self.num_actions)

        # Handle ties in max Q
        max_q = np.max(q_values)
        best_actions = np.where(q_values == max_q)[0]

        # Distribute the (1 - ε) greedy probability among best actions
        greedy_boost = (1 - self.epsilon) / len(best_actions)
        probs[best_actions] += greedy_boost

        return probs

    def select_action(self, state: int) -> int:
        """Sample action from ε-greedy policy."""
        probs = self.get_policy_probs(state)
        return np.random.choice(self.num_actions, p=probs)

    def update(
        self,
        state: int,
        action: int,
        reward: float,
        next_state: int,
        done: bool
    ) -> float:
        """
        Expected SARSA update.

        Uses E_{a'~π}[Q(s',a')] instead of sampled Q(s',a') or max Q(s',a').
        """
        if done:
            td_target = reward
        else:
            # Compute expected value under current policy
            policy_probs = self.get_policy_probs(next_state)
            expected_q = np.dot(policy_probs, self.Q[next_state])
            td_target = reward + self.gamma * expected_q

        td_error = td_target - self.Q[state, action]
        self.Q[state, action] += self.alpha * td_error

        return td_error

    def expected_value(self, state: int) -> float:
        """Compute V^π(s) = Σ_a π(a|s) Q(s,a)."""
        probs = self.get_policy_probs(state)
        return np.dot(probs, self.Q[state])
```

Expected SARSA generalizes both algorithms:

- With $\pi$ set to the behavior policy (e.g., ε-greedy), it is the expected, lower-variance form of SARSA and remains on-policy.
- With $\pi$ set to the greedy policy, the expectation collapses to the max and the update becomes Q-learning.
To see that Q-learning is Expected SARSA with a greedy target: $$\sum_{a'} \pi_{\text{greedy}}(a'|s')\, Q(s', a') = Q\big(s', \arg\max_{a'} Q(s', a')\big) = \max_{a'} Q(s', a')$$
This unification reveals that the core difference between algorithms is the target policy used in the expectation.
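The reduction is easy to verify numerically (Q-values illustrative):

```python
import numpy as np

q_next = np.array([1.0, 7.0, 3.0])
greedy_probs = np.eye(len(q_next))[np.argmax(q_next)]     # π_greedy(a'|s'): all mass on argmax
assert np.isclose(greedy_probs @ q_next, q_next.max())    # expectation under greedy == max
```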
| Algorithm | Target Policy | Next Action Handling | Variance |
|---|---|---|---|
| SARSA | Behavior π | Sample a' ~ π | Higher (sampling) |
| Expected SARSA | Any π | E_{a'~π}[Q(s',a')] | Lower (exact) |
| Q-learning | Greedy (w.r.t. current Q) | max Q(s',a') | None (deterministic max) |
Expected SARSA is often the best default choice for tabular problems: it matches Q-learning's low target variance while remaining on-policy when desired. The computational overhead (summing over actions) is negligible for small action spaces. For large action spaces, sampling (SARSA) may be necessary.
SARSA's convergence theory differs from Q-learning because SARSA learns about a changing target: as the policy improves, $Q^\pi$ changes.
Theorem (Singh et al., 2000): SARSA converges to $Q^*$ with probability 1 under the following conditions:

1. Every state-action pair is visited infinitely often.
2. The learning rates satisfy the usual stochastic-approximation conditions: $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.
3. The policy is GLIE: Greedy in the Limit with Infinite Exploration.
GLIE Explained: The policy must explore forever (ensuring all Q-values are updated) but eventually become greedy (ensuring convergence to optimal, not just any consistent policy).
Example GLIE Policy: ε-greedy with $\epsilon_t = 1/t$. The exploration rate decays to zero, so the policy becomes greedy in the limit, while it stays positive at every finite $t$ (and $\sum_t \epsilon_t = \infty$), so exploration never stops prematurely.
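A sketch of such a schedule. The 1/t form matches the example above; the per-state variant in the comment is a common refinement, not a requirement:

```python
def glie_epsilon(t: int) -> float:
    """ε_t = 1/t: decays to zero (greedy in the limit) while remaining
    positive at every finite t, so exploration never stops entirely."""
    return 1.0 / max(t, 1)

# A per-state variant, ε_t(s) = 1 / N(s) with N(s) the visit count of s,
# lets rarely visited states keep exploring longer.
```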
Q-learning converges to Q* regardless of the behavior policy (as long as all state-actions are visited). SARSA converges to Q* only if the policy itself converges to optimal. With constant ε > 0, SARSA converges to Q^{π_ε}, not Q*.
With constant ε (not decaying to 0), SARSA converges to $Q^{\pi_\epsilon}$—the value function of the ε-greedy policy.
This is NOT the same as $Q^*$! For any ε > 0: $$Q^{\pi_\epsilon}(s, a) \leq Q^*(s, a)$$
The gap reflects the "cost of exploration"—the expected loss from random actions.
Practical implication: If you plan to deploy with ε = 0, Q-learning gives you the right values. If you plan to deploy with ε > 0 (for continued learning or safety), SARSA gives you the right values.
Comparing convergence rates is complex and environment-dependent:
| Factor | SARSA | Q-learning |
|---|---|---|
| Variance per update | Higher (samples a') | Lower (deterministic max) |
| Bias | Unbiased for Q^π | Biased by maximization |
| Moving target | Policy changes during learning | Fixed target Q* |
| Stability | Generally more stable | Can oscillate near cliff-like states |
Expected SARSA often has the best empirical convergence because it combines:

- SARSA's on-policy target, which avoids the bias introduced by the off-policy max, with
- Q-learning's low target variance, since the expectation over next actions is computed exactly rather than sampled.
One-step SARSA bootstraps from the immediate next state. But we could wait longer—accumulate more real rewards before bootstrapping. This gives n-step SARSA, which provides a spectrum between TD(0) and Monte Carlo.
The n-step return from time $t$: $$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})$$
This uses $n$ actual rewards plus a bootstrap from $Q$ at step $t+n$.
Special cases:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q(S_t, A_t) \right]$$
This update can only occur at time $t + n$ when $G_{t:t+n}$ is known.
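Before the full agent (which buffers these quantities itself), here is a small helper that computes $G_{t:t+n}$ straight from the definition; the numbers in the example are illustrative:

```python
def n_step_return(rewards, gamma, bootstrap_q=None):
    """G_{t:t+n} from rewards [R_{t+1}, ..., R_{t+n}] plus an optional
    bootstrap Q(S_{t+n}, A_{t+n}); omit bootstrap_q if the episode ended."""
    G = sum((gamma ** i) * r for i, r in enumerate(rewards))
    if bootstrap_q is not None:
        G += (gamma ** len(rewards)) * bootstrap_q
    return G

# 3-step return with rewards [-1, -1, -1], γ = 0.9, bootstrap Q = 5.0:
# -1 - 0.9 - 0.81 + 0.729 * 5.0 = 0.935
print(n_step_return([-1, -1, -1], 0.9, bootstrap_q=5.0))
```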
```python
import numpy as np
from collections import deque


class NStepSARSAAgent:
    """
    n-step SARSA: bridges one-step SARSA and Monte Carlo.

    Uses n actual rewards plus a bootstrap from Q(S_{t+n}, A_{t+n}).
    Larger n = lower bias, higher variance.
    """

    def __init__(
        self,
        num_states: int,
        num_actions: int,
        n_steps: int = 4,
        learning_rate: float = 0.1,
        discount_factor: float = 0.99,
        epsilon: float = 0.1
    ):
        self.num_states = num_states
        self.num_actions = num_actions
        self.n = n_steps
        self.alpha = learning_rate
        self.gamma = discount_factor
        self.epsilon = epsilon

        self.Q = np.zeros((num_states, num_actions))

        # Buffer to store the last n+1 transitions
        self.state_buffer = deque(maxlen=n_steps + 1)
        self.action_buffer = deque(maxlen=n_steps + 1)
        self.reward_buffer = deque(maxlen=n_steps)

    def select_action(self, state: int) -> int:
        if np.random.random() < self.epsilon:
            return np.random.randint(self.num_actions)
        return np.argmax(self.Q[state])

    def start_episode(self, state: int, action: int):
        """Initialize buffers at start of episode."""
        self.state_buffer.clear()
        self.action_buffer.clear()
        self.reward_buffer.clear()
        self.state_buffer.append(state)
        self.action_buffer.append(action)

    def step(
        self,
        next_state: int,
        reward: float,
        next_action: int,
        done: bool,
        t: int  # Current timestep
    ):
        """
        Process one step, potentially making an n-step update.
        Returns True if an update was made.
        """
        # Store transition
        self.state_buffer.append(next_state)
        self.action_buffer.append(next_action)
        self.reward_buffer.append(reward)

        # Can only update once we have n steps of data
        # Update is for time tau = t - n + 1
        tau = t - self.n + 1
        if tau >= 0:
            # Compute n-step return G_{tau:tau+n}
            G = 0.0
            for i, r in enumerate(self.reward_buffer):
                G += (self.gamma ** i) * r

            # Bootstrap from Q(S_{tau+n}, A_{tau+n}) if not terminal
            if not done:
                bootstrap_state = self.state_buffer[-1]
                bootstrap_action = self.action_buffer[-1]
                G += (self.gamma ** self.n) * self.Q[bootstrap_state, bootstrap_action]

            # Update Q(S_tau, A_tau)
            update_state = self.state_buffer[0]
            update_action = self.action_buffer[0]
            td_error = G - self.Q[update_state, update_action]
            self.Q[update_state, update_action] += self.alpha * td_error
            return True

        return False

    def end_episode(self, T: int):
        """
        Make remaining updates at end of episode.
        T is the terminal timestep.
        """
        rewards = list(self.reward_buffer)
        # If the buffer is full, its oldest state was already updated by the
        # final call to step(); skip it to avoid a double update.
        start = 1 if len(rewards) == self.n else 0
        for remaining in range(start, len(rewards)):
            # Compute return for the remaining transitions
            # These don't bootstrap (episode ended)
            G = 0.0
            for i, r in enumerate(rewards[remaining:]):
                G += (self.gamma ** i) * r

            update_state = self.state_buffer[remaining]
            update_action = self.action_buffer[remaining]
            td_error = G - self.Q[update_state, update_action]
            self.Q[update_state, update_action] += self.alpha * td_error
```

As $n$ increases:
| Aspect | Effect | Why |
|---|---|---|
| Bias | Decreases | More real rewards, less reliance on Q estimates |
| Variance | Increases | More random rewards accumulated, more stochasticity |
| Delayed updates | Increases | Must wait n steps to update |
| Credit assignment | More direct | Rewards closer to responsible actions |
Optimal n: Problem-dependent. Short horizons with high reward variance favor small $n$. Long horizons with sparse rewards favor larger $n$. Empirically, $n \in [4, 8]$ often works well.
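For completeness, a sketch of a training loop that drives the NStepSARSAAgent class above, assuming the same gym-style (state, reward, done, info) step API used earlier:

```python
def train_n_step_sarsa(env, agent: NStepSARSAAgent,
                       num_episodes: int = 500, max_steps: int = 1000):
    """Minimal driver for NStepSARSAAgent (assumes the classic gym step API)."""
    for _ in range(num_episodes):
        state = env.reset()
        action = agent.select_action(state)
        agent.start_episode(state, action)

        for t in range(max_steps):
            next_state, reward, done, _ = env.step(action)
            # Choose a' before calling step() and reuse it as the next action
            next_action = agent.select_action(next_state) if not done else 0
            agent.step(next_state, reward, next_action, done, t)
            state, action = next_state, next_action
            if done:
                agent.end_episode(t + 1)  # flush the remaining truncated returns
                break
```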
Rather than choosing a fixed n, eligibility traces (TD(λ)) provide a weighted average over ALL n-step returns. This is covered in the next page and is often more elegant than explicitly implementing n-step methods.
Successfully implementing SARSA requires attention to subtle details that can make the difference between correct learning and puzzling failures.
The most common SARSA bug is incorrect action handling. The flow must be:
```python
state = env.reset()
action = select_action(state)
done = False

while not done:
    next_state, reward, done = env.step(action)
    next_action = select_action(next_state)
    update(state, action, reward, next_state, next_action)
    state, action = next_state, next_action  # SAME action!
```
The next_action used in the update MUST be the action taken in the next step. Any deviation breaks on-policy consistency.
| Hyperparameter | SARSA Guidance | Notes |
|---|---|---|
| Learning rate α | 0.1–0.5 | SARSA often tolerates higher α than Q-learning |
| Discount γ | 0.9–0.99 | Standard; higher for long-horizon tasks |
| ε (exploration) | 0.1–0.3 typical | Remember: SARSA learns Q^{π_ε}, not Q* |
| ε decay | Slower than Q-learning | On-policy needs exploration to evaluate all actions |
| n (for n-step) | 2–8 | Larger for sparse rewards |
Key insight: Because SARSA is on-policy, you need continued exploration to update Q-values for non-greedy actions. Decay ε too fast and those Q-values become stale.
A good sanity check: track the empirical average reward and compare it to V^π(start_state). With enough training, these should converge to similar values. If V^π significantly exceeds actual performance, you may have an off-policy leak in your implementation.
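A minimal version of this check, reusing the agent's ε-greedy probabilities. The start_state index and the episode_rewards list are placeholders that would come from your own training run:

```python
import numpy as np

def v_pi(agent, state: int) -> float:
    """V^π(s) = Σ_a π(a|s) Q(s, a) for the agent's ε-greedy policy."""
    q = agent.Q[state]
    probs = np.full(agent.num_actions, agent.epsilon / agent.num_actions)
    probs[np.argmax(q)] += 1 - agent.epsilon
    return float(probs @ q)

# Placeholders from your own run:
# print(v_pi(agent, start_state), np.mean(episode_rewards[-100:]))
```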
While Q-learning often gets more attention, on-policy methods like SARSA are preferred in several important domains.
Physical systems can't reset like simulators—mistakes have real consequences.
Example: a robotic arm learning manipulation has to explore, but exploratory actions near joint limits or obstacles can damage hardware. SARSA's values account for that exploration risk, so during training the agent learns to keep a safety margin rather than assuming it will always act optimally.
Practice: Use SARSA during initial training when exploration is high, then optionally switch to Q-learning for fine-tuning after establishing safe policies.
Trading systems operate in real markets where training-time behavior has real costs.
Why SARSA fits:

- Exploratory trades have real monetary cost during training, and those costs should show up in the learned values.
- The values should describe the strategy as it is actually executed, including its exploration, not an idealized always-optimal execution.
Contrast with Q-learning: Q-learning might value a high-risk trade highly (optimal if executed perfectly), but the strategy is only safe when exploration is disabled.
Conversational agents learn while interacting with real users.
The challenge: Exploratory (random) responses annoy users. SARSA naturally penalizes states where random exploration leads to poor outcomes (user frustration, conversation termination).
SARSA's advantage: its Q-values already price in the chance of an occasional exploratory (and possibly poor) response, so the agent steers toward conversation states that remain acceptable even while it keeps exploring.
| Scenario | Recommendation | Rationale |
|---|---|---|
| Simulator training, perfect deployment | Q-learning | Learn Q*, deploy greedy π* |
| Real-world training | SARSA | Safe training-time behavior |
| Continued learning after deployment | SARSA/Expected SARSA | Account for ongoing exploration |
| High exploration (ε > 0.2) | SARSA | Q-learning values unrealistic |
| Experience replay desired | Q-learning | Off-policy requirement |
| Low exploration, stable Q-values | Expected SARSA | Lower variance, any policy |
If you're training in simulation and deploying a greedy policy: Q-learning. If you're training in the real world, or your deployed policy will explore: SARSA or Expected SARSA. When in doubt, Expected SARSA offers stability with flexibility.
SARSA represents the on-policy approach to TD control—learning the value of the policy we're actually following, not an idealized optimal policy. Let's consolidate the key insights:

- SARSA updates toward $r + \gamma Q(s', a')$ with $a'$ drawn from the current policy, so it learns $Q^\pi$; Q-learning updates toward $r + \gamma \max_{a'} Q(s', a')$ and learns $Q^*$.
- Because its values include the cost of exploration, SARSA behaves more conservatively during training, as the cliff-walking example shows.
- Expected SARSA replaces the sampled next action with its expectation under the policy, reducing variance and unifying SARSA and Q-learning as special cases of one update.
- With GLIE exploration and appropriately decaying step sizes, SARSA converges to $Q^*$; with constant ε it converges to $Q^{\pi_\epsilon}$.
- n-step SARSA trades bias for variance by using more real rewards before bootstrapping.
- On-policy methods are the right choice when training-time behavior matters: real-world training, continued exploration after deployment, or high exploration rates.
What's next: Both Q-learning and SARSA focus on immediate (one-step) TD updates. But what if rewards are sparse and distant from the actions that caused them? The next page introduces Temporal Difference Learning in depth, covering the theory of TD errors and how they propagate value information through the state space.
You've mastered SARSA—the on-policy counterpart to Q-learning. You understand when each approach is appropriate, how Expected SARSA unifies them, and how n-step methods extend the basic algorithm. This completes your toolkit for tabular TD control methods.