In reinforcement learning, one fundamental challenge is estimating how valuable it is to be in a particular state—known as the state value function. The Initial Encounter Value Estimation method is a model-free technique that learns these values directly from complete episodes of experience, without requiring knowledge of the environment's dynamics.
The core principle is elegantly simple: to estimate the value of a state, we observe what happens after visiting that state and compute the cumulative discounted return—the sum of all future rewards, with distant rewards weighted less than immediate ones.
The Initial Encounter Rule: When a state appears multiple times within a single episode, we only consider the return from its first appearance. Subsequent visits to the same state within that episode are ignored for value estimation purposes. This approach provides unbiased estimates and is particularly useful when states can be revisited during an episode.
Mathematical Foundation:
For each episode, given a trajectory of state-reward pairs (S₀, R₀), (S₁, R₁), ..., (S_T, R_T), the return from time step k to the end of the episode is:
$$G_k = R_k + \gamma R_{k+1} + \gamma^2 R_{k+2} + ... + \gamma^{T-k} R_T = \sum_{i=0}^{T-k} \gamma^i R_{k+i}$$
Where:
• γ ∈ [0, 1] is the discount factor
• T is the final time step of the episode
• Rᵢ is the reward received at time step i
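The return formula can be evaluated with a single backward pass, so each reward is discounted the right number of times without computing powers of γ explicitly. A minimal sketch (the function name is illustrative):

```python
def compute_return(rewards, gamma):
    """Discounted return G_0 for a reward sequence R_0, ..., R_T."""
    g = 0.0
    # Horner-style backward pass: G_k = R_k + gamma * G_{k+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With γ = 1.0 this reduces to a plain sum of rewards; with γ < 1 later rewards contribute less.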
Value Estimation Process:
The estimated value of state s is:
$$V(s) = \frac{1}{N(s)} \sum_{\text{first visits to } s} G_{t_s}$$
Where N(s) is the number of episodes containing a visit to state s, and t_s is the time step of the first visit to s within each such episode.
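The averaging step follows directly from this formula: keep a running sum and count of first-visit returns per state. A hedged sketch, assuming episodes are lists of [state, reward] pairs as in the task below (all names are illustrative):

```python
def first_visit_values(episodes, gamma):
    """Average the return from each state's first visit across episodes."""
    totals, counts = {}, {}
    for episode in episodes:
        # Backward pass: returns[t] = R_t + gamma * returns[t+1].
        returns = [0.0] * (len(episode) + 1)
        for t in range(len(episode) - 1, -1, -1):
            returns[t] = episode[t][1] + gamma * returns[t + 1]
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:  # only the first visit counts
                seen.add(state)
                totals[state] = totals.get(state, 0.0) + returns[t]
                counts[state] = counts.get(state, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}
```

Note that a state visited twice in one episode contributes one return to its average, while a state visited in two episodes contributes two.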
Your Task: Implement a function that processes a collection of episodes and computes the estimated value for each state encountered. The output should be formatted as a string showing each state's value rounded to 2 decimal places, sorted by state number.
Example 1:
Input:
episodes = [[[0, 1], [1, 1], [0, 1], [2, 0]]]
gamma = 1.0
Output: "V[0]=3.0, V[1]=2.0, V[2]=0.0"
Explanation: We process a single episode with four time steps. Let's trace through the calculation:
Episode trajectory: State 0 → State 1 → State 0 (revisit) → State 2
Identifying first visits:
• State 0: first encountered at t=0 (the second visit at t=2 is ignored)
• State 1: first encountered at t=1
• State 2: first encountered at t=3
Computing returns (with γ = 1.0, so no discounting):
• V[0]: Return from t=0 = R₀ + R₁ + R₂ + R₃ = 1 + 1 + 1 + 0 = 3.0
• V[1]: Return from t=1 = R₁ + R₂ + R₃ = 1 + 1 + 0 = 2.0
• V[2]: Return from t=3 = R₃ = 0.0
Key insight: Although state 0 appears again at t=2, we only use the return from its first visit at t=0. This is the defining characteristic of the initial encounter method.
Example 2:
Input:
episodes = [[[0, 1], [1, 2]], [[1, 3], [2, 4]]]
gamma = 1.0
Output: "V[0]=3.0, V[1]=4.5, V[2]=4.0"
Explanation: We now have two episodes to process and average:
Episode 1: (0, 1) → (1, 2)
• State 0 first visited at t=0: Return = 1 + 2 = 3.0
• State 1 first visited at t=1: Return = 2.0
Episode 2: (1, 3) → (2, 4)
• State 1 first visited at t=0: Return = 3 + 4 = 7.0
• State 2 first visited at t=1: Return = 4.0
Averaging across episodes:
• V[0] = 3.0 (only appeared in Episode 1)
• V[1] = (2.0 + 7.0) / 2 = 4.5 (appeared in both episodes)
• V[2] = 4.0 (only appeared in Episode 2)
This demonstrates how the method aggregates experience across multiple episodes to form stable value estimates.
Example 3:
Input:
episodes = [[[0, 1], [1, 1], [2, 1]]]
gamma = 0.9
Output: "V[0]=2.71, V[1]=1.9, V[2]=1.0"
Explanation: This example demonstrates the effect of discounting with γ = 0.9:
Episode trajectory: State 0 → State 1 → State 2
Computing discounted returns: • V[0]: Return from t=0 = R₀ + γR₁ + γ²R₂ = 1 + (0.9)(1) + (0.81)(1) = 1 + 0.9 + 0.81 = 2.71
• V[1]: Return from t=1 = R₁ + γR₂ = 1 + (0.9)(1) = 1 + 0.9 = 1.9
• V[2]: Return from t=2 = R₂ = 1.0
The discount factor effect: Notice how the value of state 0 (2.71) is less than the undiscounted sum (3.0 with γ=1). The discount factor γ < 1 causes future rewards to contribute less to a state's value, reflecting the intuition that immediate rewards are often preferable to delayed ones. This is crucial in real-world scenarios where there's uncertainty about the future or a preference for immediate gratification.
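All three worked examples can be reproduced with a compact sketch of the full task. This is one possible solution, assuming Python; `estimate_values` is an illustrative name:

```python
def estimate_values(episodes, gamma):
    """First-visit value estimates, formatted as the task specifies."""
    totals, counts = {}, {}
    for episode in episodes:
        # Backward pass over the episode: G_t = R_t + gamma * G_{t+1}.
        returns = [0.0] * (len(episode) + 1)
        for t in range(len(episode) - 1, -1, -1):
            returns[t] = episode[t][1] + gamma * returns[t + 1]
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:  # initial-encounter rule
                seen.add(state)
                totals[state] = totals.get(state, 0.0) + returns[t]
                counts[state] = counts.get(state, 0) + 1
    # Average per state, round to 2 decimals, sort by state number.
    return ", ".join(f"V[{s}]={round(totals[s] / counts[s], 2)}"
                     for s in sorted(totals))

print(estimate_values([[[0, 1], [1, 1], [0, 1], [2, 0]]], 1.0))
# V[0]=3.0, V[1]=2.0, V[2]=0.0
```

Computing returns backwards makes each episode a single O(T) pass instead of re-summing the tail for every first visit.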
Constraints