In reinforcement learning, an agent interacts with an environment by taking actions and receiving rewards over time. A fundamental concept for evaluating the quality of a decision-making policy is the cumulative discounted reward, often denoted as Gₜ (the return starting from time step t).
The cumulative discounted reward captures the total value of rewards collected along a trajectory, with future rewards being progressively discounted by a factor γ (gamma). This discounting mechanism reflects the intuition that immediate rewards are typically more valuable than distant future rewards—a principle rooted in both economic theory (time value of money) and practical considerations (uncertainty about the future).
Mathematical Definition:
For a sequence of rewards [R₁, R₂, R₃, ..., Rₙ] received at consecutive timesteps, the cumulative discounted reward is computed as:
$$G_t = \sum_{k=0}^{n-1} \gamma^k \cdot R_{t+k+1}$$

which, for a trajectory starting at t = 0, expands to

$$G_0 = R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 R_4 + \ldots + \gamma^{n-1} R_n$$
Where:
• Gₜ is the return (cumulative discounted reward) starting from timestep t
• Rₖ is the reward received at timestep k
• γ is the discount factor
• n is the number of rewards in the sequence
Understanding the Discount Factor γ:
The discount factor satisfies 0 ≤ γ ≤ 1. When γ is close to 0, the agent is myopic and values immediate rewards almost exclusively; when γ = 1, all rewards are weighted equally and the return reduces to a plain sum. Intermediate values trade off short-term against long-term reward.
This quantity is central to defining the state-value function Vπ(s) and the action-value function Qπ(s, a) in reinforcement learning, which estimate the expected return from a given state or state-action pair under a policy π.
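For reference, these value functions are standard definitions written as expectations of the return under policy π (not specific to this exercise):

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s\right], \qquad Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s, A_t = a\right]$$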
Your Task: Write a Python function that computes the cumulative discounted reward for a given sequence of rewards and a discount factor. Use NumPy for efficient computation.
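One possible sketch of such a function (the name and signature below are my own choices, not prescribed by the exercise):

```python
import numpy as np

def cumulative_discounted_reward(rewards, gamma):
    """Return sum over k of gamma**k * rewards[k], for k = 0..n-1."""
    rewards = np.asarray(rewards, dtype=float)
    # Vector of discount weights: gamma^0, gamma^1, ..., gamma^(n-1)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))
```

For example, `cumulative_discounted_reward([1, 2, 3, 4], 0.9)` evaluates to approximately 8.146, matching the first worked example below.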
rewards = [1, 2, 3, 4]
gamma = 0.9

Expected output: 8.146

We compute the cumulative discounted reward as follows:
• Timestep 0: 1 × (0.9)⁰ = 1 × 1 = 1.000
• Timestep 1: 2 × (0.9)¹ = 2 × 0.9 = 1.800
• Timestep 2: 3 × (0.9)² = 3 × 0.81 = 2.430
• Timestep 3: 4 × (0.9)³ = 4 × 0.729 = 2.916
Total: Gₜ = 1.000 + 1.800 + 2.430 + 2.916 = 8.146
Notice how later rewards contribute less due to the discount factor. The final reward of 4 only contributes 2.916 to the total, despite being the largest individual reward.
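The bullet arithmetic above can be checked directly with a plain-Python loop (no NumPy needed for a check this small):

```python
rewards = [1, 2, 3, 4]
gamma = 0.9

# Each term r * gamma^k mirrors one bullet of the worked example
terms = [r * gamma ** k for k, r in enumerate(rewards)]
total = sum(terms)
print([round(t, 3) for t in terms])  # [1.0, 1.8, 2.43, 2.916]
print(round(total, 3))               # 8.146
```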
rewards = [1, 2, 3]
gamma = 1.0

Expected output: 6.0

With γ = 1.0 (no discounting), all rewards are weighted equally:
• Timestep 0: 1 × (1.0)⁰ = 1
• Timestep 1: 2 × (1.0)¹ = 2
• Timestep 2: 3 × (1.0)² = 3
Total: Gₜ = 1 + 2 + 3 = 6.0
This is simply the sum of all rewards, as there is no discounting applied to future rewards.
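That γ = 1.0 reduces the formula to a plain sum is easy to confirm with the same term-by-term computation:

```python
rewards = [1, 2, 3]
gamma = 1.0

# Every discount weight gamma**k equals 1, so the return is just the sum
g = sum(r * gamma ** k for k, r in enumerate(rewards))
print(g)  # 6.0
```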
rewards = [5]
gamma = 0.9

Expected output: 5.0

With only a single reward in the trajectory:
• Timestep 0: 5 × (0.9)⁰ = 5 × 1 = 5.0
Total: Gₜ = 5.0
For a single-step trajectory, the discount factor has no effect since only γ⁰ = 1 is applied.
Constraints