In reinforcement learning, an agent interacts with an environment by taking actions and receiving rewards over time. A fundamental concept for evaluating the quality of a decision-making policy is the cumulative discounted reward, often denoted as Gₜ (the return starting from time step t).
The cumulative discounted reward captures the total value of rewards collected along a trajectory, with future rewards being progressively discounted by a factor γ (gamma). This discounting mechanism reflects the intuition that immediate rewards are typically more valuable than distant future rewards—a principle rooted in both economic theory (time value of money) and practical considerations (uncertainty about the future).
Mathematical Definition:
For a sequence of rewards [R₁, R₂, R₃, ..., Rₙ] received at consecutive timesteps, the cumulative discounted reward is computed as:
$$G_t = \sum_{k=0}^{n-1} \gamma^k \cdot R_{t+k+1}$$

which, for a trajectory starting at t = 0, expands to

$$G_0 = R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 R_4 + \ldots + \gamma^{n-1} R_n$$
Where:
• Gₜ is the return (cumulative discounted reward) starting from timestep t
• Rₖ is the reward received at timestep k
• γ is the discount factor
• n is the number of rewards in the sequence
Understanding the Discount Factor γ:
The discount factor satisfies 0 ≤ γ ≤ 1. When γ is close to 0, the agent is myopic and values immediate rewards almost exclusively; when γ = 1, all rewards are weighted equally and the return reduces to a plain sum. Intermediate values trade off short-term against long-term reward.
This quantity is central to defining the state-value function Vπ(s) and the action-value function Qπ(s, a) in reinforcement learning, which estimate the expected return from a given state or state-action pair under a policy π.
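For reference, these value functions are standard definitions written as expectations of the return under policy π (not specific to this exercise):

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s\right], \qquad Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s, A_t = a\right]$$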
Your Task: Write a Python function that computes the cumulative discounted reward for a given sequence of rewards and a discount factor. Use NumPy for efficient computation.
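One possible sketch of such a function (the name and signature below are my own choices, not prescribed by the exercise):

```python
import numpy as np

def cumulative_discounted_reward(rewards, gamma):
    """Return sum over k of gamma**k * rewards[k], for k = 0..n-1."""
    rewards = np.asarray(rewards, dtype=float)
    # Vector of discount weights: gamma^0, gamma^1, ..., gamma^(n-1)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))
```

For example, `cumulative_discounted_reward([1, 2, 3, 4], 0.9)` evaluates to approximately 8.146, matching the first worked example below.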
rewards = [1, 2, 3, 4]
gamma = 0.9

Expected output: 8.146

We compute the cumulative discounted reward as follows:
• Timestep 0: 1 × (0.9)⁰ = 1 × 1 = 1.000
• Timestep 1: 2 × (0.9)¹ = 2 × 0.9 = 1.800
• Timestep 2: 3 × (0.9)² = 3 × 0.81 = 2.430
• Timestep 3: 4 × (0.9)³ = 4 × 0.729 = 2.916
Total: Gₜ = 1.000 + 1.800 + 2.430 + 2.916 = 8.146
Notice how later rewards contribute less due to the discount factor. The final reward of 4 only contributes 2.916 to the total, despite being the largest individual reward.
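The bullet arithmetic above can be checked directly with a plain-Python loop (no NumPy needed for a check this small):

```python
rewards = [1, 2, 3, 4]
gamma = 0.9

# Each term r * gamma^k mirrors one bullet of the worked example
terms = [r * gamma ** k for k, r in enumerate(rewards)]
total = sum(terms)
print([round(t, 3) for t in terms])  # [1.0, 1.8, 2.43, 2.916]
print(round(total, 3))               # 8.146
```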
rewards = [1, 2, 3]
gamma = 1.0

Expected output: 6.0

With γ = 1.0 (no discounting), all rewards are weighted equally:
• Timestep 0: 1 × (1.0)⁰ = 1
• Timestep 1: 2 × (1.0)¹ = 2
• Timestep 2: 3 × (1.0)² = 3
Total: Gₜ = 1 + 2 + 3 = 6.0
This is simply the sum of all rewards, as there is no discounting applied to future rewards.
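That γ = 1.0 reduces the formula to a plain sum is easy to confirm with the same term-by-term computation:

```python
rewards = [1, 2, 3]
gamma = 1.0

# Every discount weight gamma**k equals 1, so the return is just the sum
g = sum(r * gamma ** k for k, r in enumerate(rewards))
print(g)  # 6.0
```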
rewards = [5]
gamma = 0.9

Expected output: 5.0

With only a single reward in the trajectory:
• Timestep 0: 5 × (0.9)⁰ = 5 × 1 = 5.0
Total: Gₜ = 5.0
For a single-step trajectory, the discount factor has no effect since only γ⁰ = 1 is applied.
Constraints