In reinforcement learning and sequential decision-making, agents collect rewards over time as they interact with an environment. However, immediate rewards are typically valued more than future rewards—a fundamental concept known as temporal discounting. This principle reflects the uncertainty of future outcomes and the preference for immediate gratification.
The cumulative reward with temporal decay (also known as the discounted return or discounted cumulative reward) quantifies the total value of a sequence of rewards, where each future reward is progressively diminished by a decay factor.
Mathematical Formulation:
Given a sequence of rewards ( R = [r_0, r_1, r_2, ..., r_{T-1}] ) and a decay factor ( \gamma ) (gamma), where ( 0 < \gamma \leq 1 ), the cumulative decayed reward ( G ) is computed as:
$$G = \sum_{t=0}^{T-1} \gamma^t \cdot r_t = r_0 + \gamma \cdot r_1 + \gamma^2 \cdot r_2 + ... + \gamma^{T-1} \cdot r_{T-1}$$
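The formula can be checked term by term with a plain loop before worrying about vectorization (the function name here is illustrative, not part of the task):

```python
def discounted_return_loop(rewards, gamma):
    """Compute G = sum over t of gamma**t * r_t with an explicit loop."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r  # each reward weighted by gamma^t
    return total

print(discounted_return_loop([1, 1, 1], 0.5))  # 1.75
```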
Understanding the Decay Factor (γ):
The decay factor controls how strongly the agent values the future. A gamma close to 0 makes the agent myopic—only immediate rewards matter—while a gamma close to 1 values future rewards nearly as much as immediate ones. At exactly gamma = 1, no decay is applied and all rewards count equally.
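To build intuition, here is a small sketch that evaluates the same reward sequence under several decay factors (the values chosen are illustrative):

```python
rewards = [1, 1, 1]

# Sweep gamma from fully myopic (0.0) to no decay (1.0)
for gamma in (0.0, 0.5, 0.9, 1.0):
    g = sum(gamma**t * r for t, r in enumerate(rewards))
    print(f"gamma={gamma}: G={g:.2f}")
```

As gamma grows from 0 to 1, the return for `[1, 1, 1]` rises from 1.00 (only the first reward counts) to 3.00 (all rewards count fully).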
Your Task: Write a Python function that computes the cumulative reward with temporal decay given a list of rewards and a decay factor gamma. The function should use NumPy for efficient computation and return the scalar value representing the total decayed cumulative reward.
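One possible NumPy-based sketch of such a function (the name and signature are assumptions, not prescribed by the task):

```python
import numpy as np

def cumulative_decayed_reward(rewards, gamma):
    """Vectorized discounted return: G = sum_t gamma**t * r_t.

    rewards: sequence of numeric rewards [r_0, ..., r_{T-1}]
    gamma:   decay factor, 0 < gamma <= 1
    """
    rewards = np.asarray(rewards, dtype=float)
    # Discount weights [gamma^0, gamma^1, ..., gamma^(T-1)]
    discounts = gamma ** np.arange(rewards.size)
    # Dot product gives the weighted sum as a scalar
    return float(np.dot(discounts, rewards))

print(cumulative_decayed_reward([1, 1, 1], 0.5))  # 1.75
```

Building the discount vector with `np.arange` and taking a dot product avoids an explicit Python loop, which matters for long reward sequences.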
rewards = [1, 1, 1]
gamma = 0.5
Expected output: 1.75
With gamma = 0.5, each subsequent reward is multiplied by an increasing power of 0.5:
• Time step 0: 1 × (0.5)⁰ = 1 × 1 = 1.0
• Time step 1: 1 × (0.5)¹ = 1 × 0.5 = 0.5
• Time step 2: 1 × (0.5)² = 1 × 0.25 = 0.25
Total cumulative decayed reward: 1.0 + 0.5 + 0.25 = 1.75
rewards = [1, 2, 3, 4, 5]
gamma = 1.0
Expected output: 15.0
When gamma = 1.0, no temporal decay is applied. Each reward contributes its full value:
• 1 × (1.0)⁰ + 2 × (1.0)¹ + 3 × (1.0)² + 4 × (1.0)³ + 5 × (1.0)⁴
• = 1 + 2 + 3 + 4 + 5
• = 15.0
This is equivalent to a simple sum of all rewards, representing an agent that values all future rewards equally.
rewards = [10, 5, 2]
gamma = 0.9
Expected output: 16.12
With gamma = 0.9 (a typical value in RL), future rewards are discounted but still significant:
• Time step 0: 10 × (0.9)⁰ = 10 × 1 = 10.0
• Time step 1: 5 × (0.9)¹ = 5 × 0.9 = 4.5
• Time step 2: 2 × (0.9)² = 2 × 0.81 = 1.62
Total: 10.0 + 4.5 + 1.62 = 16.12
Note how the immediate large reward (10) contributes most significantly, while the smaller future rewards (5 and 2) are progressively diminished.
Constraints