Suppose you could receive $100 today or $100 one year from now. Most people prefer the money today—not just because of inflation or investment opportunity, but because the future is uncertain. A dollar today is worth more than a dollar tomorrow.
This intuition lies at the heart of the discount factor $\gamma \in [0, 1]$—a single parameter that fundamentally shapes how reinforcement learning agents value future rewards relative to immediate ones.
The discount factor appears simple: just a scalar between 0 and 1. But its implications are profound.
By the end of this page, you will understand the discount factor from multiple perspectives—mathematical necessity, economic interpretation, probabilistic meaning, computational implications, and practical selection guidelines. You'll be able to choose appropriate γ values and predict how they affect agent behavior.
Consider an infinite-horizon MDP with positive rewards. Without discounting, the return would be:
$$G = \sum_{t=0}^{\infty} R_t$$
If rewards are bounded below by some $R_{min} > 0$, this sum diverges to infinity. How can we compare two policies when both achieve infinite return?
The discount factor bounds the return:
$$G = \sum_{t=0}^{\infty} \gamma^t R_t$$
If rewards are bounded by $|R_t| \leq R_{max}$, then:
$$|G| \leq \sum_{t=0}^{\infty} \gamma^t R_{max} = \frac{R_{max}}{1 - \gamma}$$
For any $\gamma < 1$, the geometric series converges. The return is finite, and we can meaningfully compare policies.
Example: with $R_{max} = 1$ and $\gamma = 0.99$, the return can never exceed $\frac{1}{1 - 0.99} = 100$; with $\gamma = 0.9$, the bound tightens to $10$.
Undiscounted returns (γ = 1) are valid in episodic tasks where episodes terminate in finite time. Since the sum is finite, no discounting is needed for convergence. However, even in episodic tasks, discounting is often used for numerical stability and to encourage efficiency (reaching goals faster).
The discounted return from time $t$ is defined as:
$$G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k}$$
This can be written recursively (a key property for dynamic programming):
$$G_t = R_t + \gamma G_{t+1}$$
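In practice, this recursion is exactly how returns are computed from a recorded trajectory: sweep backward through the rewards, reusing $G_{t+1}$ at each step. A minimal sketch (the helper name `discounted_returns` is ours, not from any particular library):

```python
from typing import List

def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Compute G_t for every t using the recursion G_t = R_t + γ·G_{t+1}."""
    returns = [0.0] * len(rewards)
    G = 0.0
    # Sweep backward so each G_t reuses the already-computed G_{t+1}
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

# Example: three rewards of 1.0 with γ = 0.5 → G_0 = 1 + 0.5·(1 + 0.5·1) = 1.75
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))
```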
The discount factor $\gamma$ controls how quickly future rewards decay:
- Immediate reward: weight $\gamma^0 = 1$ (full weight)
- 1 step ahead: weight $\gamma^1 = \gamma$
- 10 steps ahead: weight $\gamma^{10}$
- 100 steps ahead: weight $\gamma^{100}$
For $\gamma = 0.99$: $\gamma^{10} \approx 0.90$ and $\gamma^{100} \approx 0.37$, so rewards 100 steps away still carry meaningful weight.
For $\gamma = 0.9$: $\gamma^{10} \approx 0.35$ and $\gamma^{100} \approx 3 \times 10^{-5}$, so rewards beyond a few dozen steps are effectively invisible.
```python
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Tuple


def discount_weight(gamma: float, steps: int) -> float:
    """Weight given to reward 'steps' in the future."""
    return gamma ** steps


def effective_horizon(gamma: float, threshold: float = 0.01) -> int:
    """
    Number of steps until discount weight falls below threshold.
    This is the 'effective planning horizon'.

        γ^n < threshold
        n > log(threshold) / log(γ)
    """
    if gamma >= 1.0:
        return float('inf')
    return int(np.ceil(np.log(threshold) / np.log(gamma)))


def compare_gammas():
    """Compare how different γ values affect planning horizon."""
    gammas = [0.5, 0.9, 0.95, 0.99, 0.999]

    print("Effective horizons (1% threshold):")
    print("-" * 40)
    for g in gammas:
        horizon = effective_horizon(g, threshold=0.01)
        max_return = 1 / (1 - g) if g < 1 else float('inf')
        print(f"γ={g:.3f}: horizon={horizon:>6} steps, max_return={max_return:.1f}")

    print("" + "=" * 50)
    print("Discount weights at specific timesteps:")
    print("-" * 50)
    timesteps = [1, 10, 50, 100, 500]
    header = "γ\t" + "\t".join([f"t={t}" for t in timesteps])
    print(header)
    for g in gammas:
        weights = [f"{discount_weight(g, t):.4f}" for t in timesteps]
        print(f"{g:.3f}\t" + "\t".join(weights))


compare_gammas()

# ================================================
# Visualization of Discount Curves
# ================================================
def plot_discount_curves():
    """Visualize how reward weight decays over time."""
    gammas = [0.5, 0.8, 0.9, 0.95, 0.99]
    steps = np.arange(100)

    plt.figure(figsize=(10, 6))
    for g in gammas:
        weights = g ** steps
        plt.plot(steps, weights, label=f'γ = {g}', linewidth=2)

    plt.xlabel('Steps into future', fontsize=12)
    plt.ylabel('Discount weight (γ^t)', fontsize=12)
    plt.title('How Discount Factor Affects Future Reward Weights', fontsize=14)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.ylim(0, 1.05)

    # Add horizontal line at 1% threshold
    plt.axhline(y=0.01, color='r', linestyle='--', alpha=0.5, label='1% threshold')

    plt.tight_layout()
    plt.savefig('discount_curves.png', dpi=150)
    print("\nSaved discount_curves.png")


# plot_discount_curves()  # Uncomment to generate plot

# ================================================
# Return Computation with Different γ
# ================================================
def compute_returns_with_gammas(
    rewards: List[float],
    gammas: List[float]
) -> dict:
    """
    Compute returns for same reward sequence with different γ.
    Shows how γ affects the total perceived value.
    """
    results = {}
    for gamma in gammas:
        G = 0.0
        for t, r in enumerate(rewards):
            G += (gamma ** t) * r
        results[gamma] = G
    return results


def demo_gamma_effects():
    """
    Demonstrate how γ changes evaluation of reward sequences.
    """
    # Scenario 1: Constant rewards (R=1 every step)
    constant_rewards = [1.0] * 100

    # Scenario 2: Delayed big reward (R=0 for 50 steps, then R=100)
    delayed_rewards = [0.0] * 50 + [100.0]

    # Scenario 3: Early rewards that decay
    decaying_rewards = [10.0 / (t + 1) for t in range(100)]

    gammas = [0.5, 0.9, 0.95, 0.99]

    print("\n=== Effect of γ on Different Reward Patterns ===")
    for name, rewards in [
        ("Constant (R=1)", constant_rewards),
        ("Delayed (R=100 at t=50)", delayed_rewards),
        ("Decaying (R=10/(t+1))", decaying_rewards)
    ]:
        print(f"\n{name}:")
        returns = compute_returns_with_gammas(rewards, gammas)
        for g, G in returns.items():
            print(f"  γ={g}: G = {G:.2f}")


demo_gamma_effects()
```

The discount factor can be understood from multiple perspectives, each offering different intuitions:
Interpretation 1: Time Preference
Like economic discounting, $\gamma$ represents the agent's preference for immediate versus delayed reward. A low $\gamma$ means the agent is "impatient"—it strongly prefers rewards now over rewards later. This interpretation aligns with behavioral economics and explains phenomena like hyperbolic discounting in humans.
Interpretation 2: Probability of Continuation
Imagine that at each timestep, the episode terminates with probability $1 - \gamma$. Then:
$$\mathbb{E}[\text{total reward}] = \sum_{t=0}^{\infty} \gamma^t R_t$$
where $\gamma^t$ is the probability of surviving to timestep $t$. Under this view, $\gamma$ models uncertainty about whether the future will even exist—the agent should weight future rewards less because it might not be around to collect them.
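A small Monte Carlo sketch can illustrate this equivalence, using made-up numbers (a constant reward of 1 per step and $\gamma = 0.9$): terminating each step with probability $1 - \gamma$ and summing *undiscounted* rewards gives, on average, the same value as the discounted sum.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, reward, horizon = 0.9, 1.0, 1000

# Discounted return of a constant reward stream: sum_t γ^t · R ≈ R / (1 - γ) = 10
discounted = sum(gamma ** t * reward for t in range(horizon))

# Same stream, undiscounted, but the episode terminates with probability 1 - γ each step
totals = []
for _ in range(100_000):
    total, t = 0.0, 0
    while t < horizon:
        total += reward              # reward at step t is collected with probability γ^t
        t += 1
        if rng.random() > gamma:     # terminate with probability 1 - γ
            break
    totals.append(total)

print(f"Discounted return:             {discounted:.3f}")
print(f"Mean return with random death: {np.mean(totals):.3f}")
```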
Interpretation 3: Mathematical Regularization
Purely technically, $\gamma < 1$ is a regularizer that ensures the Bellman equations have a unique fixed point. Without discounting, value iteration may not converge. From this view, $\gamma$ is a mathematical expedient rather than a meaningful quantity.
| Interpretation | Meaning of Low γ | Meaning of High γ |
|---|---|---|
| Time Preference | Impatient, wants immediate reward | Patient, willing to wait for bigger payoffs |
| Termination Probability | High chance of episode ending | Confident interaction continues |
| Mathematical Regularization | Strong regularization, stable but short-sighted | Weak regularization, considers far future |
Use the interpretation that best matches your problem domain. For robotics with safety concerns, termination probability makes sense. For economic optimization, time preference is natural. For pure algorithm design, the mathematical view suffices. The formalism is the same regardless of interpretation.
The choice of $\gamma$ dramatically affects what policies agents learn:
Low γ (e.g., 0.5): The agent is myopic. It optimizes for rewards only a few steps ahead, learns quickly, but will ignore large payoffs that arrive later.
Medium γ (e.g., 0.9-0.95): A balance between near-term and long-term reward, with an effective horizon of roughly a few dozen steps; a common choice for short-horizon tasks.
High γ (e.g., 0.99+): The agent is far-sighted and will accept short-term losses for long-term gains, but value estimates converge more slowly and carry more variance.
Figure: the same gridworld solved with different γ values yields different optimal paths.

This is why γ is not just a hyperparameter—it fundamentally defines what 'optimal' means in the MDP, as the short sketch below illustrates.
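A minimal sketch with made-up numbers (a +5 reward available immediately versus a +20 reward arriving after a 10-step delay) shows how the preferred choice flips as $\gamma$ changes:

```python
def discounted_return(rewards, gamma):
    """Return sum_t γ^t · R_t for a finite reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

small_now = [5.0]                  # +5 immediately
big_later = [0.0] * 10 + [20.0]    # nothing for 10 steps, then +20

for gamma in [0.5, 0.9, 0.99]:
    g_now = discounted_return(small_now, gamma)
    g_later = discounted_return(big_later, gamma)
    choice = "take +5 now" if g_now > g_later else "wait for +20"
    print(f"γ={gamma}: G_now={g_now:.2f}, G_later={g_later:.2f} → {choice}")
```

With $\gamma = 0.5$ the delayed reward is worth only about $0.02$, so the myopic option wins; at $\gamma = 0.9$ and above, waiting becomes optimal.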
The discount factor affects not just learned behavior but also computational aspects of RL algorithms:
Contraction in Bellman Operators:
The Bellman operator $\mathcal{T}$ is a $\gamma$-contraction:
$$\|\mathcal{T}V_1 - \mathcal{T}V_2\|_\infty \leq \gamma \|V_1 - V_2\|_\infty$$
This guarantees convergence of value iteration. Lower $\gamma$ means faster contraction (faster convergence) but potentially worse solutions.
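One direct consequence of the contraction property: after $k$ sweeps of value iteration, the error shrinks geometrically,

$$\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty$$

so reaching accuracy $\epsilon$ (relative to a unit-scale initial error) takes on the order of $k \geq \log(\epsilon) / \log(\gamma)$ iterations. For $\epsilon = 0.01$, that is roughly 44 iterations at $\gamma = 0.9$ but about 459 at $\gamma = 0.99$.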
Credit Assignment:
When a reward is received, the agent must determine which past actions caused it. With high $\gamma$, credit propagates far back in time:
$$\frac{\partial G_0}{\partial R_t} = \gamma^t$$
For $t = 100$ with $\gamma = 0.99$, credit has decayed to $\gamma^{100} \approx 0.37$. With $\gamma = 0.9$, it would be $\gamma^{100} \approx 3 \times 10^{-5}$—essentially no credit assignment.
Variance in Policy Gradients:
High $\gamma$ increases variance in return estimates because trajectories include more future randomness. This requires more samples or variance reduction techniques (baselines, GAE).
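As a stylized illustration of this effect (assuming i.i.d. unit-variance rewards, which is a simplification rather than a realistic MDP), the variance of the Monte Carlo return grows like $\sum_t \gamma^{2t} = 1/(1-\gamma^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_rollouts = 1000, 5000
rewards = rng.standard_normal((n_rollouts, T))   # i.i.d. R_t ~ N(0, 1)

for gamma in [0.5, 0.9, 0.99]:
    weights = gamma ** np.arange(T)
    returns = rewards @ weights                  # Monte Carlo return of each rollout
    print(f"γ={gamma}: empirical Var[G] = {returns.var():.2f}, "
          f"theory 1/(1-γ²) = {1 / (1 - gamma**2):.2f}")
```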
```python
import numpy as np
from typing import Tuple


def value_iteration_convergence(
    P: np.ndarray,            # Transition matrix [n_states, n_actions, n_states]
    R: np.ndarray,            # Reward [n_states, n_actions]
    gamma: float,
    epsilon: float = 1e-6,
    max_iters: int = 10000
) -> Tuple[np.ndarray, int]:
    """
    Run value iteration and count iterations to convergence.
    Shows how γ affects convergence rate.
    """
    n_states = P.shape[0]
    n_actions = P.shape[1]
    V = np.zeros(n_states)

    for iteration in range(max_iters):
        V_new = np.zeros(n_states)
        for s in range(n_states):
            # Bellman optimality backup
            q_values = []
            for a in range(n_actions):
                q = R[s, a] + gamma * np.dot(P[s, a, :], V)
                q_values.append(q)
            V_new[s] = max(q_values)

        # Check convergence
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < epsilon:
            return V, iteration + 1

    return V, max_iters


def demo_convergence_rates():
    """
    Show how γ affects value iteration convergence.
    """
    # Simple 5-state chain MDP
    # State 0 <- State 1 <- ... <- State 4 (goal)
    n_states = 5
    n_actions = 2  # Left, Right

    # Transition dynamics
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        # Action 0 (Left): move left with 0.9 prob
        left_target = max(0, s - 1)
        P[s, 0, left_target] = 0.9
        P[s, 0, s] += 0.1  # Slip: stay (+= so boundary states still sum to 1)

        # Action 1 (Right): move right with 0.9 prob
        right_target = min(n_states - 1, s + 1)
        P[s, 1, right_target] = 0.9
        P[s, 1, s] += 0.1  # Slip: stay

    # Rewards: +10 at goal (state 4), 0 elsewhere
    R = np.zeros((n_states, n_actions))
    R[3, 1] = 10  # Reward for reaching goal from state 3

    # Test different gamma values
    gammas = [0.5, 0.8, 0.9, 0.95, 0.99, 0.999]

    print("Value Iteration Convergence:")
    print("-" * 50)
    print(f"{'γ':>8} | {'Iterations':>12} | {'V(s=0)':>10} | {'V(s=4)':>10}")
    print("-" * 50)
    for gamma in gammas:
        V, iters = value_iteration_convergence(P, R, gamma)
        print(f"{gamma:>8.3f} | {iters:>12} | {V[0]:>10.3f} | {V[4]:>10.3f}")
    print("-" * 50)
    print("Note: Higher γ → More iterations but higher values")
    print("      because agent plans further ahead")


demo_convergence_rates()

# ================================================
# Credit Assignment Visualization
# ================================================
def credit_assignment_decay(gamma: float, max_steps: int = 100):
    """
    Show how credit decays backward in time.
    """
    steps = np.arange(max_steps)
    credit = gamma ** steps

    # Find effective horizon (1% credit remaining)
    effective_horizon = np.searchsorted(-credit, -0.01)

    return credit, effective_horizon


def demo_credit_assignment():
    """Compare credit assignment across γ values."""
    gammas = [0.9, 0.95, 0.99, 0.999]

    print("\n=== Credit Assignment Analysis ===")
    print("How much credit does action at t=0 receive from reward at t=T?")
    print("-" * 60)

    timesteps = [10, 50, 100, 500]
    print(f"{'γ':>6} | " + " | ".join([f"t={t:>4}" for t in timesteps]) + " | horizon")
    print("-" * 60)

    for gamma in gammas:
        # max_steps=5000 so the t=500 column and the horizon aren't clipped at 100 steps
        credit, horizon = credit_assignment_decay(gamma, max_steps=5000)
        credits = [f"{credit[min(t, len(credit)-1)]:.2e}" for t in timesteps]
        print(f"{gamma:>6.3f} | " + " | ".join([f"{c:>6}" for c in credits]) + f" | {horizon}")


demo_credit_assignment()
```

An advanced technique is to vary $\gamma$ during training—starting with a low value for fast initial learning, then increasing it for long-term optimality:
Curriculum over γ: Begin training with a relatively low value (e.g., $\gamma = 0.9$), then gradually anneal it toward the final target (e.g., $\gamma = 0.99$) as learning stabilizes.
Why this works:
With low $\gamma$, the effective horizon is short. The agent learns to handle immediate challenges without getting overwhelmed by long-term complexity. As $\gamma$ increases, the horizon expands, and the agent builds on its foundation to plan further.
Caution:
Changing $\gamma$ changes the objective function. The optimal policy for $\gamma = 0.9$ differs from that for $\gamma = 0.99$. This is acceptable for training but means the "optimal policy" is target-dependent.
```python
import numpy as np
from typing import Callable


class GammaScheduler:
    """
    Schedule gamma to increase during training.
    Start myopic, gradually become far-sighted.
    """

    @staticmethod
    def constant(gamma_final: float) -> Callable[[int], float]:
        """No scheduling, constant gamma."""
        return lambda step: gamma_final

    @staticmethod
    def linear(
        gamma_start: float,
        gamma_final: float,
        warmup_steps: int
    ) -> Callable[[int], float]:
        """Linear increase from start to final over warmup_steps."""
        def schedule(step: int) -> float:
            if step >= warmup_steps:
                return gamma_final
            alpha = step / warmup_steps
            return gamma_start + alpha * (gamma_final - gamma_start)
        return schedule

    @staticmethod
    def exponential(
        gamma_start: float,
        gamma_final: float,
        half_life: int
    ) -> Callable[[int], float]:
        """Exponential approach toward final gamma."""
        def schedule(step: int) -> float:
            decay = 0.5 ** (step / half_life)
            return gamma_final - (gamma_final - gamma_start) * decay
        return schedule

    @staticmethod
    def step_schedule(
        milestones: list,  # List of (step, gamma) pairs
    ) -> Callable[[int], float]:
        """Discrete increases at milestones."""
        milestones = sorted(milestones, key=lambda x: x[0])

        def schedule(step: int) -> float:
            gamma = milestones[0][1]
            for milestone_step, milestone_gamma in milestones:
                if step >= milestone_step:
                    gamma = milestone_gamma
            return gamma
        return schedule


def demo_scheduling():
    """Compare different gamma schedules at key training steps."""
    schedules = {
        "Constant (0.99)": GammaScheduler.constant(0.99),
        "Linear (0.9→0.99, 50k steps)": GammaScheduler.linear(0.9, 0.99, 50000),
        "Exponential (0.9→0.99, τ=20k)": GammaScheduler.exponential(0.9, 0.99, 20000),
        "Step ([10k:0.95, 50k:0.99])": GammaScheduler.step_schedule([
            (0, 0.9), (10000, 0.95), (50000, 0.99)
        ]),
    }

    print("Gamma Schedules at Key Steps:")
    print("-" * 70)
    key_steps = [0, 10000, 25000, 50000, 75000, 100000]
    header = f"{'Schedule':<35} | " + " | ".join([f"{s//1000:>4}k" for s in key_steps])
    print(header)
    print("-" * 70)

    for name, schedule in schedules.items():
        values = [f"{schedule(s):.3f}" for s in key_steps]
        print(f"{name:<35} | " + " | ".join(values))


demo_scheduling()
```

Selecting the right $\gamma$ requires understanding your problem's characteristics:
Question 1: What is the natural planning horizon?
How many steps ahead does the agent need to consider to make optimal decisions?
Question 2: What is the episode length?
Rule of thumb: $\gamma$ should allow meaningful credit assignment over the episode.
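One common heuristic (a rule of thumb, not a guarantee) is to match the effective horizon $1/(1-\gamma)$ to the episode length $T$, i.e. pick $\gamma \approx 1 - 1/T$:

```python
# Heuristic mapping from episode length T to a candidate γ (an assumption, not a rule)
for T in [10, 100, 1000, 10000]:
    gamma = 1 - 1 / T
    print(f"T ≈ {T:>5} steps → γ ≈ {gamma:.4f}  (effective horizon 1/(1-γ) = {1 / (1 - gamma):.0f})")
```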
Question 3: How delayed are the rewards?
Sparse, delayed rewards require high $\gamma$ to propagate value back to early decisions.
Question 4: What is your computational budget?
Higher $\gamma$ means slower convergence. If training time is limited, you may need to accept a lower $\gamma$.
| Domain | Typical γ Range | Reasoning |
|---|---|---|
| Arcade games (Atari) | 0.99 | ~1000 frame episodes, dense rewards |
| Robot control | 0.99-0.995 | Continuous tasks, moderate horizons |
| Board games (Chess/Go) | 0.999+ | Long games, sparse end-of-game reward |
| Dialogue systems | 0.9-0.95 | Relatively short conversations |
| Financial trading | 0.99-0.999 | Long-term portfolio growth matters |
| Recommendation | 0.9-0.95 | User engagement is relatively immediate |
Start with γ = 0.99 as a reasonable default for most problems. If learning is too slow, reduce it. If the agent is too myopic (ignores important future consequences), increase it. Treat γ as a hyperparameter to tune, not a fixed constant.
For continuing tasks where episodes never end, an alternative to discounting is the average reward formulation:
$$\rho^\pi = \lim_{T \rightarrow \infty} \frac{1}{T} \mathbb{E}\left[\sum_{t=1}^{T} R_t | \pi\right]$$
The objective is to maximize the long-run average reward per timestep, rather than discounted cumulative reward.
Advantages: No discount factor to tune, every timestep is weighted equally, and the objective directly matches the natural goal of many continuing tasks (reward per step, throughput).
Disadvantages: The theory is more delicate (it typically requires ergodicity or unichain assumptions), far fewer algorithms and libraries support it, and the average reward $\rho^\pi$ must itself be estimated.
Differential Value Function:
Instead of $V^\pi(s)$, we use: $$V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} (R_t - \rho^\pi) | S_0 = s, \pi\right]$$
This represents the advantage of starting in state $s$ relative to the average reward.
As γ → 1, the discounted and average reward formulations become equivalent (under regularity conditions). The optimal policy that maximizes discounted return with very high γ also maximizes average reward. This justifies using high γ as an approximation to the average reward setting.
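A small numerical sketch (using a hypothetical 3-state Markov chain under a fixed policy, with made-up transition probabilities and rewards) illustrates this: as $\gamma \to 1$, the scaled discounted values $(1-\gamma)V^\pi(s)$ approach the average reward $\rho^\pi$ in every state.

```python
import numpy as np

# Hypothetical 3-state chain under a fixed policy π
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.0, 0.8]])       # P[s, s'] = transition probability under π
r = np.array([1.0, 0.0, 2.0])         # expected reward in each state under π

# Average reward ρ^π = d·r, where d is the stationary distribution of P
eigvals, eigvecs = np.linalg.eig(P.T)
d = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
d = d / d.sum()
print(f"Average reward ρ^π = {d @ r:.4f}")

# Discounted values solve V = r + γ P V, i.e. V = (I - γP)^(-1) r
for gamma in [0.9, 0.99, 0.999, 0.9999]:
    V = np.linalg.solve(np.eye(3) - gamma * P, r)
    print(f"γ={gamma}: (1-γ)·V = {np.round((1 - gamma) * V, 4)}")
```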
We have thoroughly examined the discount factor—a single parameter with far-reaching implications for what agents learn and how efficiently they learn it.
What's Next:
With all MDP components defined (states, actions, transitions, rewards, discount), we turn to the crown jewel of MDP theory: the Bellman equations. These recursive relationships form the mathematical backbone of value-based reinforcement learning algorithms.
You now have a deep understanding of the discount factor—from its mathematical role in ensuring convergence to its practical impact on agent behavior. This knowledge enables you to make informed choices about γ in your own RL projects and understand why existing systems use the values they do.