In reinforcement learning, one fundamental challenge is estimating how valuable it is to be in a particular state—known as the state value function. The Initial Encounter Value Estimation method is a model-free technique that learns these values directly from complete episodes of experience, without requiring knowledge of the environment's dynamics.
The core principle is elegantly simple: to estimate the value of a state, we observe what happens after visiting that state and compute the cumulative discounted return—the sum of all future rewards, with distant rewards weighted less than immediate ones.
The Initial Encounter Rule: When a state appears multiple times within a single episode, we only consider the return from its first appearance. Subsequent visits to the same state within that episode are ignored for value estimation purposes. This approach provides unbiased estimates and is particularly useful when states can be revisited during an episode.
Mathematical Foundation:
For each episode, given a trajectory of state-reward pairs (S₀, R₀), (S₁, R₁), ..., (S_T, R_T), the return from time step k to the end of the episode is:
$$G_k = R_k + \gamma R_{k+1} + \gamma^2 R_{k+2} + ... + \gamma^{T-k} R_T = \sum_{i=0}^{T-k} \gamma^i R_{k+i}$$
Where:
• γ ∈ [0, 1] is the discount factor
• T is the final time step of the episode
• Rᵢ is the reward received at time step i
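The return formula can be evaluated with a single backward pass, so each reward is discounted the right number of times without computing powers of γ explicitly. A minimal sketch (the function name is illustrative):

```python
def compute_return(rewards, gamma):
    """Discounted return G_0 for a reward sequence R_0, ..., R_T."""
    g = 0.0
    # Horner-style backward pass: G_k = R_k + gamma * G_{k+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With γ = 1.0 this reduces to a plain sum of rewards; with γ < 1 later rewards contribute less.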
Value Estimation Process:
The estimated value of state s is:
$$V(s) = \frac{1}{N(s)} \sum_{\text{first visits to } s} G_{t_s}$$
Where N(s) is the number of episodes containing a visit to state s, and t_s is the time step of the first visit to s within each such episode.
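The averaging step follows directly from this formula: keep a running sum and count of first-visit returns per state. A hedged sketch, assuming episodes are lists of [state, reward] pairs as in the task below (all names are illustrative):

```python
def first_visit_values(episodes, gamma):
    """Average the return from each state's first visit across episodes."""
    totals, counts = {}, {}
    for episode in episodes:
        # Backward pass: returns[t] = R_t + gamma * returns[t+1].
        returns = [0.0] * (len(episode) + 1)
        for t in range(len(episode) - 1, -1, -1):
            returns[t] = episode[t][1] + gamma * returns[t + 1]
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:  # only the first visit counts
                seen.add(state)
                totals[state] = totals.get(state, 0.0) + returns[t]
                counts[state] = counts.get(state, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}
```

Note that a state visited twice in one episode contributes one return to its average, while a state visited in two episodes contributes two.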
Your Task: Implement a function that processes a collection of episodes and computes the estimated value for each state encountered. The output should be formatted as a string showing each state's value rounded to 2 decimal places, sorted by state number.
Example 1:
Input:
episodes = [[[0, 1], [1, 1], [0, 1], [2, 0]]]
gamma = 1.0
Output: "V[0]=3.0, V[1]=2.0, V[2]=0.0"
Explanation: We process a single episode with four time steps. Let's trace through the calculation:
Episode trajectory: State 0 → State 1 → State 0 (revisit) → State 2
Identifying first visits:
• State 0: first encountered at t=0 (the second visit at t=2 is ignored)
• State 1: first encountered at t=1
• State 2: first encountered at t=3
Computing returns (with γ = 1.0, so no discounting):
• V[0]: Return from t=0 = R₀ + R₁ + R₂ + R₃ = 1 + 1 + 1 + 0 = 3.0
• V[1]: Return from t=1 = R₁ + R₂ + R₃ = 1 + 1 + 0 = 2.0
• V[2]: Return from t=3 = R₃ = 0.0
Key insight: Although state 0 appears again at t=2, we only use the return from its first visit at t=0. This is the defining characteristic of the initial encounter method.
Example 2:
Input:
episodes = [[[0, 1], [1, 2]], [[1, 3], [2, 4]]]
gamma = 1.0
Output: "V[0]=3.0, V[1]=4.5, V[2]=4.0"
Explanation: We now have two episodes to process and average:
Episode 1: (0, 1) → (1, 2)
• State 0 first visited at t=0: Return = 1 + 2 = 3.0
• State 1 first visited at t=1: Return = 2.0
Episode 2: (1, 3) → (2, 4)
• State 1 first visited at t=0: Return = 3 + 4 = 7.0
• State 2 first visited at t=1: Return = 4.0
Averaging across episodes:
• V[0] = 3.0 (only appeared in Episode 1)
• V[1] = (2.0 + 7.0) / 2 = 4.5 (appeared in both episodes)
• V[2] = 4.0 (only appeared in Episode 2)
This demonstrates how the method aggregates experience across multiple episodes to form stable value estimates.
Example 3:
Input:
episodes = [[[0, 1], [1, 1], [2, 1]]]
gamma = 0.9
Output: "V[0]=2.71, V[1]=1.9, V[2]=1.0"
Explanation: This example demonstrates the effect of discounting with γ = 0.9:
Episode trajectory: State 0 → State 1 → State 2
Computing discounted returns: • V[0]: Return from t=0 = R₀ + γR₁ + γ²R₂ = 1 + (0.9)(1) + (0.81)(1) = 1 + 0.9 + 0.81 = 2.71
• V[1]: Return from t=1 = R₁ + γR₂ = 1 + (0.9)(1) = 1 + 0.9 = 1.9
• V[2]: Return from t=2 = R₂ = 1.0
The discount factor effect: Notice how the value of state 0 (2.71) is less than the undiscounted sum (3.0 with γ=1). The discount factor γ < 1 causes future rewards to contribute less to a state's value, reflecting the intuition that immediate rewards are often preferable to delayed ones. This is crucial in real-world scenarios where there's uncertainty about the future or a preference for immediate gratification.
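All three worked examples can be reproduced with a compact sketch of the full task. This is one possible solution, assuming Python; `estimate_values` is an illustrative name:

```python
def estimate_values(episodes, gamma):
    """First-visit value estimates, formatted as the task specifies."""
    totals, counts = {}, {}
    for episode in episodes:
        # Backward pass over the episode: G_t = R_t + gamma * G_{t+1}.
        returns = [0.0] * (len(episode) + 1)
        for t in range(len(episode) - 1, -1, -1):
            returns[t] = episode[t][1] + gamma * returns[t + 1]
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:  # initial-encounter rule
                seen.add(state)
                totals[state] = totals.get(state, 0.0) + returns[t]
                counts[state] = counts.get(state, 0) + 1
    # Average per state, round to 2 decimals, sort by state number.
    return ", ".join(f"V[{s}]={round(totals[s] / counts[s], 2)}"
                     for s in sorted(totals))

print(estimate_values([[[0, 1], [1, 1], [0, 1], [2, 0]]], 1.0))
# V[0]=3.0, V[1]=2.0, V[2]=0.0
```

Computing returns backwards makes each episode a single O(T) pass instead of re-summing the tail for every first visit.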
Constraints