Suppose you could receive $100 today or $100 one year from now. Most people prefer the money today—not just because of inflation or investment opportunity, but because the future is uncertain. A dollar today is worth more than a dollar tomorrow.
This intuition lies at the heart of the discount factor $\gamma \in [0, 1]$—a single parameter that fundamentally shapes how reinforcement learning agents value future rewards relative to immediate ones.
The discount factor appears simple: just a scalar between 0 and 1. But its implications are profound.
By the end of this page, you will understand the discount factor from multiple perspectives—mathematical necessity, economic interpretation, probabilistic meaning, computational implications, and practical selection guidelines. You'll be able to choose appropriate γ values and predict how they affect agent behavior.
Consider an infinite-horizon MDP with positive rewards. Without discounting, the return would be:
$$G = \sum_{t=0}^{\infty} R_t$$
If rewards are bounded below by some $R_{min} > 0$, this sum diverges to infinity. How can we compare two policies when both achieve infinite return?
The discount factor bounds the return:
$$G = \sum_{t=0}^{\infty} \gamma^t R_t$$
If rewards are bounded by $|R_t| \leq R_{max}$, then:
$$|G| \leq \sum_{t=0}^{\infty} \gamma^t R_{max} = \frac{R_{max}}{1 - \gamma}$$
For any $\gamma < 1$, the geometric series converges. The return is finite, and we can meaningfully compare policies.
Example: with $R_{max} = 1$ and $\gamma = 0.99$, the return can never exceed $\frac{1}{1 - 0.99} = 100$; with $\gamma = 0.9$, the bound tightens to $10$.
Undiscounted returns (γ = 1) are valid in episodic tasks where episodes terminate in finite time. Since the sum is finite, no discounting is needed for convergence. However, even in episodic tasks, discounting is often used for numerical stability and to encourage efficiency (reaching goals faster).
The discounted return from time $t$ is defined as:
$$G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k}$$
This can be written recursively (a key property for dynamic programming):
$$G_t = R_t + \gamma G_{t+1}$$
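In practice, this recursion is exactly how returns are computed from a recorded trajectory: sweep backward through the rewards, reusing $G_{t+1}$ at each step. A minimal sketch (the helper name `discounted_returns` is ours, not from any particular library):

```python
from typing import List

def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Compute G_t for every t using the recursion G_t = R_t + γ·G_{t+1}."""
    returns = [0.0] * len(rewards)
    G = 0.0
    # Sweep backward so each G_t reuses the already-computed G_{t+1}
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

# Example: three rewards of 1.0 with γ = 0.5 → G_0 = 1 + 0.5·(1 + 0.5·1) = 1.75
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))
```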
The discount factor $\gamma$ controls how quickly future rewards decay:
- Immediate reward: weight $\gamma^0 = 1$ (full weight)
- 1 step ahead: weight $\gamma^1 = \gamma$
- 10 steps ahead: weight $\gamma^{10}$
- 100 steps ahead: weight $\gamma^{100}$
For $\gamma = 0.99$: $\gamma^{10} \approx 0.90$ and $\gamma^{100} \approx 0.37$, so rewards 100 steps away still carry meaningful weight.
For $\gamma = 0.9$: $\gamma^{10} \approx 0.35$ and $\gamma^{100} \approx 3 \times 10^{-5}$, so rewards beyond a few dozen steps are effectively invisible.
```python
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Tuple


def discount_weight(gamma: float, steps: int) -> float:
    """Weight given to reward 'steps' in the future."""
    return gamma ** steps


def effective_horizon(gamma: float, threshold: float = 0.01) -> int:
    """
    Number of steps until discount weight falls below threshold.
    This is the 'effective planning horizon'.

        γ^n < threshold
        n > log(threshold) / log(γ)
    """
    if gamma >= 1.0:
        return float('inf')
    return int(np.ceil(np.log(threshold) / np.log(gamma)))


def compare_gammas():
    """Compare how different γ values affect planning horizon."""
    gammas = [0.5, 0.9, 0.95, 0.99, 0.999]

    print("Effective horizons (1% threshold):")
    print("-" * 40)
    for g in gammas:
        horizon = effective_horizon(g, threshold=0.01)
        max_return = 1 / (1 - g) if g < 1 else float('inf')
        print(f"γ={g:.3f}: horizon={horizon:>6} steps, max_return={max_return:.1f}")

    print("" + "=" * 50)
    print("Discount weights at specific timesteps:")
    print("-" * 50)
    timesteps = [1, 10, 50, 100, 500]
    header = "γ\t" + "\t".join([f"t={t}" for t in timesteps])
    print(header)
    for g in gammas:
        weights = [f"{discount_weight(g, t):.4f}" for t in timesteps]
        print(f"{g:.3f}\t" + "\t".join(weights))


compare_gammas()

# ================================================
# Visualization of Discount Curves
# ================================================
def plot_discount_curves():
    """Visualize how reward weight decays over time."""
    gammas = [0.5, 0.8, 0.9, 0.95, 0.99]
    steps = np.arange(100)

    plt.figure(figsize=(10, 6))
    for g in gammas:
        weights = g ** steps
        plt.plot(steps, weights, label=f'γ = {g}', linewidth=2)

    plt.xlabel('Steps into future', fontsize=12)
    plt.ylabel('Discount weight (γ^t)', fontsize=12)
    plt.title('How Discount Factor Affects Future Reward Weights', fontsize=14)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.ylim(0, 1.05)

    # Add horizontal line at 1% threshold
    plt.axhline(y=0.01, color='r', linestyle='--', alpha=0.5, label='1% threshold')

    plt.tight_layout()
    plt.savefig('discount_curves.png', dpi=150)
    print("\nSaved discount_curves.png")


# plot_discount_curves()  # Uncomment to generate plot

# ================================================
# Return Computation with Different γ
# ================================================
def compute_returns_with_gammas(
    rewards: List[float],
    gammas: List[float]
) -> dict:
    """
    Compute returns for same reward sequence with different γ.
    Shows how γ affects the total perceived value.
    """
    results = {}
    for gamma in gammas:
        G = 0.0
        for t, r in enumerate(rewards):
            G += (gamma ** t) * r
        results[gamma] = G
    return results


def demo_gamma_effects():
    """
    Demonstrate how γ changes evaluation of reward sequences.
    """
    # Scenario 1: Constant rewards (R=1 every step)
    constant_rewards = [1.0] * 100

    # Scenario 2: Delayed big reward (R=0 for 50 steps, then R=100)
    delayed_rewards = [0.0] * 50 + [100.0]

    # Scenario 3: Early rewards that decay
    decaying_rewards = [10.0 / (t + 1) for t in range(100)]

    gammas = [0.5, 0.9, 0.95, 0.99]

    print("\n=== Effect of γ on Different Reward Patterns ===")
    for name, rewards in [
        ("Constant (R=1)", constant_rewards),
        ("Delayed (R=100 at t=50)", delayed_rewards),
        ("Decaying (R=10/(t+1))", decaying_rewards)
    ]:
        print(f"\n{name}:")
        returns = compute_returns_with_gammas(rewards, gammas)
        for g, G in returns.items():
            print(f"  γ={g}: G = {G:.2f}")


demo_gamma_effects()
```

The discount factor can be understood from multiple perspectives, each offering different intuitions:
Interpretation 1: Time Preference
Like economic discounting, $\gamma$ represents the agent's preference for immediate versus delayed reward. A low $\gamma$ means the agent is "impatient"—it strongly prefers rewards now over rewards later. This interpretation aligns with behavioral economics and explains phenomena like hyperbolic discounting in humans.
Interpretation 2: Probability of Continuation
Imagine that at each timestep, the episode terminates with probability $1 - \gamma$. Then:
$$\mathbb{E}[\text{total reward}] = \sum_{t=0}^{\infty} \gamma^t R_t$$
where $\gamma^t$ is the probability of surviving to timestep $t$. Under this view, $\gamma$ models uncertainty about whether the future will even exist—the agent should weight future rewards less because it might not be around to collect them.
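A small Monte Carlo sketch can illustrate this equivalence, using made-up numbers (a constant reward of 1 per step and $\gamma = 0.9$): terminating each step with probability $1 - \gamma$ and summing *undiscounted* rewards gives, on average, the same value as the discounted sum.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, reward, horizon = 0.9, 1.0, 1000

# Discounted return of a constant reward stream: sum_t γ^t · R ≈ R / (1 - γ) = 10
discounted = sum(gamma ** t * reward for t in range(horizon))

# Same stream, undiscounted, but the episode terminates with probability 1 - γ each step
totals = []
for _ in range(100_000):
    total, t = 0.0, 0
    while t < horizon:
        total += reward              # reward at step t is collected with probability γ^t
        t += 1
        if rng.random() > gamma:     # terminate with probability 1 - γ
            break
    totals.append(total)

print(f"Discounted return:             {discounted:.3f}")
print(f"Mean return with random death: {np.mean(totals):.3f}")
```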
Interpretation 3: Mathematical Regularization
Purely technically, $\gamma < 1$ is a regularizer that ensures the Bellman equations have a unique fixed point. Without discounting, value iteration may not converge. From this view, $\gamma$ is a mathematical expedient rather than a meaningful quantity.
| Interpretation | Meaning of Low γ | Meaning of High γ |
|---|---|---|
| Time Preference | Impatient, wants immediate reward | Patient, willing to wait for bigger payoffs |
| Termination Probability | High chance of episode ending | Confident interaction continues |
| Mathematical Regularization | Strong regularization, stable but short-sighted | Weak regularization, considers far future |
Use the interpretation that best matches your problem domain. For robotics with safety concerns, termination probability makes sense. For economic optimization, time preference is natural. For pure algorithm design, the mathematical view suffices. The formalism is the same regardless of interpretation.
The choice of $\gamma$ dramatically affects what policies agents learn:
Low γ (e.g., 0.5): The agent is myopic. It optimizes for rewards only a few steps ahead, learns quickly, but will ignore large payoffs that arrive later.
Medium γ (e.g., 0.9-0.95): A balance between near-term and long-term reward, with an effective horizon of roughly a few dozen steps; a common choice for short-horizon tasks.
High γ (e.g., 0.99+): The agent is far-sighted and will accept short-term losses for long-term gains, but value estimates converge more slowly and carry more variance.
Figure: the same gridworld solved with different γ values yields different optimal paths.

This is why γ is not just a hyperparameter—it fundamentally defines what 'optimal' means in the MDP, as the short sketch below illustrates.
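A minimal sketch with made-up numbers (a +5 reward available immediately versus a +20 reward arriving after a 10-step delay) shows how the preferred choice flips as $\gamma$ changes:

```python
def discounted_return(rewards, gamma):
    """Return sum_t γ^t · R_t for a finite reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

small_now = [5.0]                  # +5 immediately
big_later = [0.0] * 10 + [20.0]    # nothing for 10 steps, then +20

for gamma in [0.5, 0.9, 0.99]:
    g_now = discounted_return(small_now, gamma)
    g_later = discounted_return(big_later, gamma)
    choice = "take +5 now" if g_now > g_later else "wait for +20"
    print(f"γ={gamma}: G_now={g_now:.2f}, G_later={g_later:.2f} → {choice}")
```

With $\gamma = 0.5$ the delayed reward is worth only about $0.02$, so the myopic option wins; at $\gamma = 0.9$ and above, waiting becomes optimal.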
The discount factor affects not just learned behavior but also computational aspects of RL algorithms:
Contraction in Bellman Operators:
The Bellman operator $\mathcal{T}$ is a $\gamma$-contraction:
$$\|\mathcal{T}V_1 - \mathcal{T}V_2\|_\infty \leq \gamma \|V_1 - V_2\|_\infty$$
This guarantees convergence of value iteration. Lower $\gamma$ means faster contraction (faster convergence) but potentially worse solutions.
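One direct consequence of the contraction property: after $k$ sweeps of value iteration, the error shrinks geometrically,

$$\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty$$

so reaching accuracy $\epsilon$ (relative to a unit-scale initial error) takes on the order of $k \geq \log(\epsilon) / \log(\gamma)$ iterations. For $\epsilon = 0.01$, that is roughly 44 iterations at $\gamma = 0.9$ but about 459 at $\gamma = 0.99$.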
Credit Assignment:
When a reward is received, the agent must determine which past actions caused it. With high $\gamma$, credit propagates far back in time:
$$\frac{\partial G_0}{\partial R_t} = \gamma^t$$
For $t = 100$ with $\gamma = 0.99$, credit has decayed to $\gamma^{100} \approx 0.37$. With $\gamma = 0.9$, it would be $\gamma^{100} \approx 3 \times 10^{-5}$—essentially no credit assignment.
Variance in Policy Gradients:
High $\gamma$ increases variance in return estimates because trajectories include more future randomness. This requires more samples or variance reduction techniques (baselines, GAE).
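As a stylized illustration of this effect (assuming i.i.d. unit-variance rewards, which is a simplification rather than a realistic MDP), the variance of the Monte Carlo return grows like $\sum_t \gamma^{2t} = 1/(1-\gamma^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_rollouts = 1000, 5000
rewards = rng.standard_normal((n_rollouts, T))   # i.i.d. R_t ~ N(0, 1)

for gamma in [0.5, 0.9, 0.99]:
    weights = gamma ** np.arange(T)
    returns = rewards @ weights                  # Monte Carlo return of each rollout
    print(f"γ={gamma}: empirical Var[G] = {returns.var():.2f}, "
          f"theory 1/(1-γ²) = {1 / (1 - gamma**2):.2f}")
```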
```python
import numpy as np
from typing import Tuple


def value_iteration_convergence(
    P: np.ndarray,            # Transition matrix [n_states, n_actions, n_states]
    R: np.ndarray,            # Reward [n_states, n_actions]
    gamma: float,
    epsilon: float = 1e-6,
    max_iters: int = 10000
) -> Tuple[np.ndarray, int]:
    """
    Run value iteration and count iterations to convergence.
    Shows how γ affects convergence rate.
    """
    n_states = P.shape[0]
    n_actions = P.shape[1]
    V = np.zeros(n_states)

    for iteration in range(max_iters):
        V_new = np.zeros(n_states)
        for s in range(n_states):
            # Bellman optimality backup
            q_values = []
            for a in range(n_actions):
                q = R[s, a] + gamma * np.dot(P[s, a, :], V)
                q_values.append(q)
            V_new[s] = max(q_values)

        # Check convergence
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < epsilon:
            return V, iteration + 1

    return V, max_iters


def demo_convergence_rates():
    """
    Show how γ affects value iteration convergence.
    """
    # Simple 5-state chain MDP
    # State 0 <- State 1 <- ... <- State 4 (goal)
    n_states = 5
    n_actions = 2  # Left, Right

    # Transition dynamics
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        # Action 0 (Left): move left with 0.9 prob
        left_target = max(0, s - 1)
        P[s, 0, left_target] = 0.9
        P[s, 0, s] += 0.1  # Slip: stay (+= so boundary states still sum to 1)

        # Action 1 (Right): move right with 0.9 prob
        right_target = min(n_states - 1, s + 1)
        P[s, 1, right_target] = 0.9
        P[s, 1, s] += 0.1  # Slip: stay

    # Rewards: +10 at goal (state 4), 0 elsewhere
    R = np.zeros((n_states, n_actions))
    R[3, 1] = 10  # Reward for reaching goal from state 3

    # Test different gamma values
    gammas = [0.5, 0.8, 0.9, 0.95, 0.99, 0.999]

    print("Value Iteration Convergence:")
    print("-" * 50)
    print(f"{'γ':>8} | {'Iterations':>12} | {'V(s=0)':>10} | {'V(s=4)':>10}")
    print("-" * 50)
    for gamma in gammas:
        V, iters = value_iteration_convergence(P, R, gamma)
        print(f"{gamma:>8.3f} | {iters:>12} | {V[0]:>10.3f} | {V[4]:>10.3f}")
    print("-" * 50)
    print("Note: Higher γ → More iterations but higher values")
    print("      because agent plans further ahead")


demo_convergence_rates()

# ================================================
# Credit Assignment Visualization
# ================================================
def credit_assignment_decay(gamma: float, max_steps: int = 100):
    """
    Show how credit decays backward in time.
    """
    steps = np.arange(max_steps)
    credit = gamma ** steps

    # Find effective horizon (1% credit remaining)
    effective_horizon = np.searchsorted(-credit, -0.01)

    return credit, effective_horizon


def demo_credit_assignment():
    """Compare credit assignment across γ values."""
    gammas = [0.9, 0.95, 0.99, 0.999]

    print("\n=== Credit Assignment Analysis ===")
    print("How much credit does action at t=0 receive from reward at t=T?")
    print("-" * 60)

    timesteps = [10, 50, 100, 500]
    print(f"{'γ':>6} | " + " | ".join([f"t={t:>4}" for t in timesteps]) + " | horizon")
    print("-" * 60)

    for gamma in gammas:
        # max_steps=5000 so the t=500 column and the horizon aren't clipped at 100 steps
        credit, horizon = credit_assignment_decay(gamma, max_steps=5000)
        credits = [f"{credit[min(t, len(credit)-1)]:.2e}" for t in timesteps]
        print(f"{gamma:>6.3f} | " + " | ".join([f"{c:>6}" for c in credits]) + f" | {horizon}")


demo_credit_assignment()
```

An advanced technique is to vary $\gamma$ during training—starting with a low value for fast initial learning, then increasing it for long-term optimality:
Curriculum over γ: Begin training with a relatively low value (e.g., $\gamma = 0.9$), then gradually anneal it toward the final target (e.g., $\gamma = 0.99$) as learning stabilizes.
Why this works:
With low $\gamma$, the effective horizon is short. The agent learns to handle immediate challenges without getting overwhelmed by long-term complexity. As $\gamma$ increases, the horizon expands, and the agent builds on its foundation to plan further.
Caution:
Changing $\gamma$ changes the objective function. The optimal policy for $\gamma = 0.9$ differs from that for $\gamma = 0.99$. This is acceptable for training but means the "optimal policy" is target-dependent.
```python
import numpy as np
from typing import Callable


class GammaScheduler:
    """
    Schedule gamma to increase during training.
    Start myopic, gradually become far-sighted.
    """

    @staticmethod
    def constant(gamma_final: float) -> Callable[[int], float]:
        """No scheduling, constant gamma."""
        return lambda step: gamma_final

    @staticmethod
    def linear(
        gamma_start: float,
        gamma_final: float,
        warmup_steps: int
    ) -> Callable[[int], float]:
        """Linear increase from start to final over warmup_steps."""
        def schedule(step: int) -> float:
            if step >= warmup_steps:
                return gamma_final
            alpha = step / warmup_steps
            return gamma_start + alpha * (gamma_final - gamma_start)
        return schedule

    @staticmethod
    def exponential(
        gamma_start: float,
        gamma_final: float,
        half_life: int
    ) -> Callable[[int], float]:
        """Exponential approach toward final gamma."""
        def schedule(step: int) -> float:
            decay = 0.5 ** (step / half_life)
            return gamma_final - (gamma_final - gamma_start) * decay
        return schedule

    @staticmethod
    def step_schedule(
        milestones: list,  # List of (step, gamma) pairs
    ) -> Callable[[int], float]:
        """Discrete increases at milestones."""
        milestones = sorted(milestones, key=lambda x: x[0])

        def schedule(step: int) -> float:
            gamma = milestones[0][1]
            for milestone_step, milestone_gamma in milestones:
                if step >= milestone_step:
                    gamma = milestone_gamma
            return gamma
        return schedule


def demo_scheduling():
    """Compare different gamma schedules at key training steps."""
    schedules = {
        "Constant (0.99)": GammaScheduler.constant(0.99),
        "Linear (0.9→0.99, 50k steps)": GammaScheduler.linear(0.9, 0.99, 50000),
        "Exponential (0.9→0.99, τ=20k)": GammaScheduler.exponential(0.9, 0.99, 20000),
        "Step ([10k:0.95, 50k:0.99])": GammaScheduler.step_schedule([
            (0, 0.9), (10000, 0.95), (50000, 0.99)
        ]),
    }

    print("Gamma Schedules at Key Steps:")
    print("-" * 70)
    key_steps = [0, 10000, 25000, 50000, 75000, 100000]
    header = f"{'Schedule':<35} | " + " | ".join([f"{s//1000:>4}k" for s in key_steps])
    print(header)
    print("-" * 70)

    for name, schedule in schedules.items():
        values = [f"{schedule(s):.3f}" for s in key_steps]
        print(f"{name:<35} | " + " | ".join(values))


demo_scheduling()
```

Selecting the right $\gamma$ requires understanding your problem's characteristics:
Question 1: What is the natural planning horizon?
How many steps ahead does the agent need to consider to make optimal decisions?
Question 2: What is the episode length?
Rule of thumb: $\gamma$ should allow meaningful credit assignment over the episode.
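One common heuristic (a rule of thumb, not a guarantee) is to match the effective horizon $1/(1-\gamma)$ to the episode length $T$, i.e. pick $\gamma \approx 1 - 1/T$:

```python
# Heuristic mapping from episode length T to a candidate γ (an assumption, not a rule)
for T in [10, 100, 1000, 10000]:
    gamma = 1 - 1 / T
    print(f"T ≈ {T:>5} steps → γ ≈ {gamma:.4f}  (effective horizon 1/(1-γ) = {1 / (1 - gamma):.0f})")
```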
Question 3: How delayed are the rewards?
Sparse, delayed rewards require high $\gamma$ to propagate value back to early decisions.
Question 4: What is your computational budget?
Higher $\gamma$ means slower convergence. If training time is limited, you may need to accept a lower $\gamma$.
| Domain | Typical γ Range | Reasoning |
|---|---|---|
| Arcade games (Atari) | 0.99 | ~1000 frame episodes, dense rewards |
| Robot control | 0.99-0.995 | Continuous tasks, moderate horizons |
| Board games (Chess/Go) | 0.999+ | Long games, sparse end-of-game reward |
| Dialogue systems | 0.9-0.95 | Relatively short conversations |
| Financial trading | 0.99-0.999 | Long-term portfolio growth matters |
| Recommendation | 0.9-0.95 | User engagement is relatively immediate |
Start with γ = 0.99 as a reasonable default for most problems. If learning is too slow, reduce it. If the agent is too myopic (ignores important future consequences), increase it. Treat γ as a hyperparameter to tune, not a fixed constant.
For continuing tasks where episodes never end, an alternative to discounting is the average reward formulation:
$$\rho^\pi = \lim_{T \rightarrow \infty} \frac{1}{T} \mathbb{E}\left[\sum_{t=1}^{T} R_t | \pi\right]$$
The objective is to maximize the long-run average reward per timestep, rather than discounted cumulative reward.
Advantages: No discount factor to tune, every timestep is weighted equally, and the objective directly matches the natural goal of many continuing tasks (reward per step, throughput).
Disadvantages: The theory is more delicate (it typically requires ergodicity or unichain assumptions), far fewer algorithms and libraries support it, and the average reward $\rho^\pi$ must itself be estimated.
Differential Value Function:
Instead of $V^\pi(s)$, we use: $$V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} (R_t - \rho^\pi) | S_0 = s, \pi\right]$$
This represents the advantage of starting in state $s$ relative to the average reward.
As γ → 1, the discounted and average reward formulations become equivalent (under regularity conditions). The optimal policy that maximizes discounted return with very high γ also maximizes average reward. This justifies using high γ as an approximation to the average reward setting.
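A small numerical sketch (using a hypothetical 3-state Markov chain under a fixed policy, with made-up transition probabilities and rewards) illustrates this: as $\gamma \to 1$, the scaled discounted values $(1-\gamma)V^\pi(s)$ approach the average reward $\rho^\pi$ in every state.

```python
import numpy as np

# Hypothetical 3-state chain under a fixed policy π
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.0, 0.8]])       # P[s, s'] = transition probability under π
r = np.array([1.0, 0.0, 2.0])         # expected reward in each state under π

# Average reward ρ^π = d·r, where d is the stationary distribution of P
eigvals, eigvecs = np.linalg.eig(P.T)
d = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
d = d / d.sum()
print(f"Average reward ρ^π = {d @ r:.4f}")

# Discounted values solve V = r + γ P V, i.e. V = (I - γP)^(-1) r
for gamma in [0.9, 0.99, 0.999, 0.9999]:
    V = np.linalg.solve(np.eye(3) - gamma * P, r)
    print(f"γ={gamma}: (1-γ)·V = {np.round((1 - gamma) * V, 4)}")
```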
We have thoroughly examined the discount factor—a single parameter with far-reaching implications for what agents learn and how efficiently they learn it.
What's Next:
With all MDP components defined (states, actions, transitions, rewards, discount), we turn to the crown jewel of MDP theory: the Bellman equations. These recursive relationships form the mathematical backbone of value-based reinforcement learning algorithms.
You now have a deep understanding of the discount factor—from its mathematical role in ensuring convergence to its practical impact on agent behavior. This knowledge enables you to make informed choices about γ in your own RL projects and understand why existing systems use the values they do.