Imagine trying to teach a child to ride a bicycle using only a book of instructions. You could describe the physics of balance, the mechanics of pedaling, the principles of steering—yet the child would still fall the first dozen times they actually tried. Some things can only be learned through experience.
This observation lies at the heart of Reinforcement Learning (RL)—a paradigm that diverges fundamentally from supervised and unsupervised learning. Where supervised learning requires a teacher providing correct answers, and unsupervised learning discovers patterns in static data, reinforcement learning learns through trial, error, and feedback.
The agent-environment interaction loop is the conceptual foundation upon which all of reinforcement learning is built. Understanding this loop deeply—not just superficially—is essential for grasping why RL algorithms work, when they succeed, and why they sometimes fail spectacularly.
By the end of this page, you will understand the agent-environment interaction paradigm at a level suitable for implementing RL systems and reasoning about their behavior. You'll grasp the mathematical formalism, the philosophical underpinnings, and the practical implications that distinguish RL from other machine learning approaches.
Reinforcement Learning posits a deceptively simple world model: there exists an agent and an environment, and they interact through a continuous loop of perception and action.
The Agent is the learner and decision-maker. It perceives the world, chooses actions, and seeks to maximize some notion of cumulative reward. The agent embodies the algorithm we're designing—it's the entity that learns.
The Environment is everything external to the agent. It receives actions, transitions between states, and generates rewards. Crucially, the environment is treated as a black box: the agent may not know the environment's rules, dynamics, or internal structure.
This dichotomy may seem arbitrary—where exactly does the agent end and the environment begin? In practice, this boundary is a modeling choice that profoundly affects algorithm design.
The agent-environment boundary isn't physical. In a robotic system, should the robot's motors be part of the agent or the environment? If the agent includes the motors, it must model motor dynamics; if the environment includes them, the agent's actions become higher-level commands. The 'right' boundary depends on what you want the agent to learn and what you can reliably model.
Why a Dichotomy?
This separation serves multiple purposes:
Abstraction: It allows us to reason about learning algorithms independently of specific domains. The same Q-learning algorithm can play Atari games, control robots, or optimize data center cooling.
Modularity: We can swap environments (simulation vs. real world) without changing the agent, enabling sim-to-real transfer and curriculum learning.
Formalism: The dichotomy enables precise mathematical treatment. We can define objectives, prove convergence theorems, and analyze sample complexity.
Generalization: By treating the environment as unknown, we design agents that generalize rather than memorize—essential for deployment in novel situations.
| Aspect | Agent | Environment |
|---|---|---|
| Nature | Learner, decision-maker | External world, simulator |
| Knowledge | Learns from experience | Follows fixed (unknown) rules |
| Control | Chooses actions | Determines state transitions and rewards |
| Observability | May have partial information | Has complete internal state |
| Goal | Maximize cumulative reward | No goal (just dynamics) |
| Modifiability | We design/train this | Given (or simulated) |
At each discrete time step t, a precise sequence of events unfolds:
The agent observes the current state $s_t$ (or observation $o_t$ in partially observable settings)
The agent selects an action $a_t$ based on its policy $\pi(a_t|s_t)$
The environment transitions to a new state $s_{t+1}$ according to its dynamics $P(s_{t+1}|s_t, a_t)$
The environment emits a reward $r_{t+1} = R(s_t, a_t, s_{t+1})$
Time advances: $t \leftarrow t + 1$, and the loop repeats
This loop continues until a terminal state is reached (in episodic tasks) or indefinitely (in continuing tasks).
The Critical Insight: Delayed Consequences
What makes this loop fundamentally different from supervised learning is the credit assignment problem. When the agent wins or loses a game, which of the hundreds of preceding actions was responsible? When a robot falls over, was it the action taken milliseconds before, or a strategic error made minutes ago?
Unlike supervised learning, where each input has an immediate correct output, RL must attribute success or failure to actions that may have occurred arbitrarily far in the past. This temporal credit assignment is one of the central challenges in RL.
At each step, the agent faces a fundamental trade-off: should it exploit current knowledge (choose the best-known action) or explore uncertain alternatives (try something new that might be better)? This dilemma has no universal solution and is central to many RL algorithms. Too much exploitation leads to suboptimal local maxima; too much exploration wastes time on unpromising alternatives.
The agent-environment interaction is formalized mathematically as a trajectory (also called a rollout or episode):
$$\tau = (s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots)$$
This sequence of states, actions, and rewards constitutes the agent's experience. From this raw data, the agent must learn a policy that maximizes expected cumulative reward.
Notation Convention
There are two common conventions for indexing rewards:
- Convention 1 (Sutton & Barto): the reward that follows action $a_t$ in state $s_t$ is written $r_{t+1}$, emphasizing that it arrives together with $s_{t+1}$.
- Convention 2: the same reward is written $r_t$, indexed by the step on which the action was taken.

We will use Convention 1 (Sutton & Barto notation), where $r_{t+1}$ is the reward received after taking action $a_t$ in state $s_t$.
```python
import numpy as np
from typing import Tuple, List, Any


class InteractionLoop:
    """
    Implements the fundamental agent-environment interaction loop.
    This is the core abstraction underlying all RL algorithms.
    """

    def __init__(self, env, agent, max_steps: int = 1000):
        self.env = env
        self.agent = agent
        self.max_steps = max_steps

    def run_episode(self) -> Tuple[float, List[Tuple]]:
        """
        Execute one complete episode of agent-environment interaction.

        Returns:
            total_reward: Sum of all rewards received in the episode
            trajectory: List of (state, action, reward, next_state, done) tuples
        """
        trajectory = []
        total_reward = 0.0

        # Step 1: Environment provides initial state
        state = self.env.reset()

        for step in range(self.max_steps):
            # Step 2: Agent selects action based on current state
            action = self.agent.select_action(state)

            # Step 3 & 4: Environment transitions and emits reward
            next_state, reward, done, info = self.env.step(action)

            # Store experience for learning
            trajectory.append((state, action, reward, next_state, done))
            total_reward += reward

            # Agent may learn from this transition (online learning)
            self.agent.learn(state, action, reward, next_state, done)

            # Step 5: Advance time
            state = next_state

            if done:
                break  # Terminal state reached

        return total_reward, trajectory

    def run_training(self, num_episodes: int) -> List[float]:
        """
        Run multiple episodes for training.

        Returns:
            episode_rewards: List of total rewards per episode
        """
        episode_rewards = []

        for episode in range(num_episodes):
            total_reward, trajectory = self.run_episode()
            episode_rewards.append(total_reward)

            # Agent may perform batch learning after episode
            self.agent.end_episode(trajectory)

            if episode % 100 == 0:
                avg_reward = np.mean(episode_rewards[-100:])
                print(f"Episode {episode}, Avg Reward (last 100): {avg_reward:.2f}")

        return episode_rewards
```

The Return: Quantifying Long-Term Success
The agent's objective is not to maximize immediate reward, but cumulative discounted reward (the return):
$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
where $\gamma \in [0, 1]$ is the discount factor that determines the present value of future rewards.
Discounting serves multiple purposes: (1) It ensures the infinite sum converges for continuing tasks. (2) It handles uncertainty about the future—distant rewards are less certain. (3) It models economic time preference—a reward now is worth more than the same reward later. (4) It enables recursive Bellman equations that form the basis of dynamic programming solutions.
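As a concrete illustration (a toy four-step episode chosen here for exposition, not taken from the text above), discounting makes a distant reward worth less today:

```python
gamma = 0.9
rewards = [1.0, 0.0, 0.0, 10.0]   # r_1, r_2, r_3, r_4 of a hypothetical episode

# G_0 = r_1 + gamma*r_2 + gamma^2*r_3 + gamma^3*r_4
G0 = sum(gamma**k * r for k, r in enumerate(rewards))
print(G0)  # 8.29 -- the reward of 10 three steps away contributes only 0.9^3 * 10 = 7.29
```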
A defining characteristic of RL is that the agent does not know the environment's dynamics. This is profoundly different from optimal control theory, where system dynamics are assumed known.
What the Agent Knows:
- The current state (or observation) it receives at each step
- The set of actions available to it
- The rewards and transitions it has experienced so far

What the Agent Does NOT Know:
- The transition dynamics $P(s_{t+1}|s_t, a_t)$
- The reward function $R(s_t, a_t, s_{t+1})$
- Which states and actions will ultimately lead to high return
This ignorance is both a weakness and a strength. It's a weakness because the agent must learn everything from experience, which can be sample-inefficient. It's a strength because algorithms designed under this assumption can handle environments that are too complex to model analytically.
The Simulation Hypothesis
A crucial implication of treating environments as black boxes is that simulation and reality become interchangeable at the interface level. Whether the environment is a video game, a physics simulator, or the real world, the agent only sees (state, action, reward, next_state) tuples.
This enables:
- Training in cheap, fast, resettable simulators before touching expensive or fragile hardware
- Collecting experience from many environment instances in parallel
- Practicing dangerous or rare situations safely
However, it also creates the sim-to-real gap: simulators are imperfect models, and policies that work in simulation may fail catastrophically in the real world.
A robot trained in simulation to walk may fail in the real world because the simulator didn't model friction, motor dynamics, or sensor noise accurately. Modern approaches like domain randomization (training with varied simulation parameters) and system identification (adjusting the simulator to match reality) help bridge this gap, but it remains a fundamental challenge.
Environments vary along several important dimensions that affect which algorithms are applicable and how difficult learning becomes:
1. Episodic vs. Continuing
Episodic environments have natural endpoints—games end, tasks complete, robots reach goals or fail. Each episode is independent, allowing easy reset and parallel sampling.
Continuing environments have no natural termination—process control, stock trading, long-running systems. The agent must balance immediate performance with long-term learning, and there's no reset to recover from mistakes.
2. Fully Observable vs. Partially Observable
Fully observable environments provide the complete state at each step. The agent sees everything relevant to predicting future states and rewards.
Partially observable environments provide only partial information—the agent's observation is a noisy or incomplete function of the true state. The agent must maintain beliefs about unobserved state variables.
| Dimension | Type A | Type B | Algorithm Implications |
|---|---|---|---|
| Termination | Episodic | Continuing | Episodic allows natural return computation; continuing requires discounting or average reward formulations |
| Observability | Fully Observable (MDP) | Partially Observable (POMDP) | POMDPs require memory/belief states; much harder to solve |
| Determinism | Deterministic | Stochastic | Stochastic environments require expected value reasoning and more samples |
| Time | Discrete | Continuous | Continuous time needs differential equations or fine discretization |
| Players | Single-agent | Multi-agent | Multi-agent adds game-theoretic complexity; Nash equilibria vs. optimal policies |
| Stationarity | Stationary | Non-stationary | Non-stationary environments require online adaptation; past learning may become invalid |
3. Deterministic vs. Stochastic
Deterministic environments have predictable dynamics—the same action in the same state always produces the same next state. Planning is easier because outcomes are certain.
Stochastic environments have probabilistic dynamics—actions may lead to different next states with different probabilities. The agent must reason about expected values and may need to handle rare but important events.
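To make the distinction concrete, here is a minimal sketch of sampling from an explicit transition table (the two-state "slippery" dynamics below are an illustrative assumption, not an environment defined on this page):

```python
import numpy as np

# P[s, a] is a probability distribution over next states.
P = np.array([
    [[0.8, 0.2],    # state 0, action 0: usually stay in 0
     [0.2, 0.8]],   # state 0, action 1: usually move to 1
    [[0.9, 0.1],    # state 1, action 0: usually move to 0
     [0.1, 0.9]],   # state 1, action 1: usually stay in 1
])

def sample_next_state(state: int, action: int) -> int:
    """Sample s' ~ P(.|s, a); a deterministic environment would return a fixed s'."""
    return np.random.choice(P.shape[-1], p=P[state, action])

print(sample_next_state(0, 1))  # returns 1 about 80% of the time
```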
4. Discrete vs. Continuous
Both states and actions can be discrete or continuous:
- Discrete states and actions (gridworlds, board games, Atari button presses) admit tabular representations and simple argmax action selection.
- Continuous states (joint angles, velocities, images) require function approximation, since the agent cannot visit every state.
- Continuous actions (torques, steering angles) make the maximization over actions itself non-trivial and motivate policy-based and actor-critic methods.

A short sketch of how the two cases are typically declared follows.
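A minimal illustration using the Gymnasium space objects (the specific sizes and bounds are assumptions chosen for the example):

```python
import numpy as np
import gymnasium as gym

# Discrete action space: 4 possible moves (e.g., up/down/left/right)
discrete_actions = gym.spaces.Discrete(4)

# Continuous action space: 2 torques, each bounded in [-1, 1]
continuous_actions = gym.spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

print(discrete_actions.sample())    # e.g., 2
print(continuous_actions.sample())  # e.g., [ 0.37 -0.81]
```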
```python
from abc import ABC, abstractmethod
from typing import Tuple, Any, Optional
import numpy as np


class Environment(ABC):
    """
    Abstract base class defining the environment interface.
    All RL environments should implement this interface.
    """

    @abstractmethod
    def reset(self) -> Any:
        """
        Reset environment to initial state.

        Returns:
            initial_state: The starting state of a new episode
        """
        pass

    @abstractmethod
    def step(self, action: Any) -> Tuple[Any, float, bool, dict]:
        """
        Execute one environment step.

        Args:
            action: The action taken by the agent

        Returns:
            next_state: The resulting state
            reward: The scalar reward signal
            done: Whether the episode has terminated
            info: Additional diagnostic information
        """
        pass

    @property
    @abstractmethod
    def state_space(self) -> Any:
        """Description of the state space."""
        pass

    @property
    @abstractmethod
    def action_space(self) -> Any:
        """Description of the action space."""
        pass


class EpisodicEnvironment(Environment):
    """Environment with natural episode termination."""

    def __init__(self, max_episode_length: int = 1000):
        self.max_episode_length = max_episode_length
        self.current_step = 0

    def reset(self) -> Any:
        self.current_step = 0
        return self._get_initial_state()

    def step(self, action: Any) -> Tuple[Any, float, bool, dict]:
        self.current_step += 1
        next_state, reward = self._transition(action)

        # Check for natural or forced termination
        natural_done = self._is_terminal(next_state)
        timeout = self.current_step >= self.max_episode_length
        done = natural_done or timeout

        info = {
            'natural_termination': natural_done,
            'timeout': timeout,
            'step': self.current_step
        }

        return next_state, reward, done, info


class PartiallyObservableEnvironment(Environment):
    """
    Environment where the agent receives observations, not true states.
    Implements a POMDP interface.
    """

    def __init__(self):
        self._true_state = None

    def reset(self) -> Any:
        self._true_state = self._get_initial_state()
        return self._observe(self._true_state)

    def step(self, action: Any) -> Tuple[Any, float, bool, dict]:
        # Transition operates on true state
        next_true_state, reward = self._transition(self._true_state, action)
        done = self._is_terminal(next_true_state)
        self._true_state = next_true_state

        # Agent only sees an observation
        observation = self._observe(next_true_state)

        info = {'true_state': next_true_state}  # For debugging only

        return observation, reward, done, info

    @abstractmethod
    def _observe(self, true_state: Any) -> Any:
        """Generate observation from true state (may be noisy/partial)."""
        pass
```

At each time step, the agent must select an action. This selection is governed by the agent's policy, but the mechanism of selection has profound implications for learning.
Deterministic vs. Stochastic Policies
A deterministic policy $\mu: S \rightarrow A$ maps each state to a single action: $$a = \mu(s)$$
A stochastic policy $\pi: S \times A \rightarrow [0,1]$ specifies a probability distribution over actions: $$a \sim \pi(\cdot|s)$$
Stochastic policies are essential for exploration—if the agent always takes the same action in each state, it can never discover better alternatives. They also enable on-policy algorithms where the same policy is used for both acting and learning.
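A minimal sketch of the two policy types for a small discrete problem (the tables below are illustrative assumptions, not values derived anywhere on this page):

```python
import numpy as np

n_states, n_actions = 3, 2

# Deterministic policy mu: S -> A, stored as one action index per state.
mu = np.array([0, 1, 1])
def act_deterministic(state: int) -> int:
    return int(mu[state])

# Stochastic policy pi(a|s), stored as one probability distribution per state.
pi = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
])
def act_stochastic(state: int) -> int:
    return int(np.random.choice(n_actions, p=pi[state]))
```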
Exploration Strategies
The exploration-exploitation dilemma manifests concretely in action selection:
ε-Greedy: With probability $\epsilon$, select a random action; otherwise, select the greedy action
Boltzmann/Softmax Exploration: Select actions with probability proportional to exponentiated value estimates $$\pi(a|s) = \frac{\exp(Q(s,a)/\tau)}{\sum_{a'} \exp(Q(s,a')/\tau)}$$
Upper Confidence Bounds (UCB): Add bonus for uncertainty to encourage trying uncertain actions $$a = \arg\max_a \left[ Q(s,a) + c\sqrt{\frac{\ln t}{N(s,a)}} \right]$$
Entropy Regularization: Add entropy bonus to the objective, encouraging diverse actions
```python
import numpy as np
from typing import Callable


class ExplorationStrategy:
    """Base class for action selection strategies."""

    def select_action(self, q_values: np.ndarray, **kwargs) -> int:
        raise NotImplementedError


class EpsilonGreedy(ExplorationStrategy):
    """
    ε-Greedy exploration: random action with probability ε,
    greedy action otherwise.
    """

    def __init__(self, epsilon: float = 0.1,
                 epsilon_decay: float = 0.999,
                 epsilon_min: float = 0.01):
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min

    def select_action(self, q_values: np.ndarray, **kwargs) -> int:
        if np.random.random() < self.epsilon:
            # Explore: random action
            action = np.random.randint(len(q_values))
        else:
            # Exploit: greedy action (break ties randomly)
            max_q = np.max(q_values)
            best_actions = np.where(q_values == max_q)[0]
            action = np.random.choice(best_actions)
        return action

    def decay(self):
        """Decay epsilon after each episode."""
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)


class BoltzmannExploration(ExplorationStrategy):
    """
    Softmax exploration: actions selected proportionally to
    exponentiated Q-values, controlled by temperature.
    """

    def __init__(self, temperature: float = 1.0,
                 temperature_decay: float = 0.99,
                 temperature_min: float = 0.1):
        self.temperature = temperature
        self.temperature_decay = temperature_decay
        self.temperature_min = temperature_min

    def select_action(self, q_values: np.ndarray, **kwargs) -> int:
        # Numerical stability: subtract max before exp
        q_scaled = (q_values - np.max(q_values)) / self.temperature
        exp_q = np.exp(q_scaled)
        probabilities = exp_q / np.sum(exp_q)
        return np.random.choice(len(q_values), p=probabilities)

    def decay(self):
        """Decay temperature after each episode."""
        self.temperature = max(self.temperature_min,
                               self.temperature * self.temperature_decay)


class UCBExploration(ExplorationStrategy):
    """
    Upper Confidence Bound exploration: adds bonus for less-visited
    actions to encourage exploration.
    """

    def __init__(self, confidence: float = 2.0):
        self.confidence = confidence
        self.total_steps = 0
        self.action_counts = None

    def select_action(self, q_values: np.ndarray, state: int = None, **kwargs) -> int:
        n_actions = len(q_values)

        # Initialize action counts if needed
        if self.action_counts is None:
            self.action_counts = np.ones(n_actions)  # Start at 1 to avoid div by zero

        self.total_steps += 1

        # UCB formula: Q(a) + c * sqrt(ln(t) / N(a))
        exploration_bonus = self.confidence * np.sqrt(
            np.log(self.total_steps) / self.action_counts
        )
        ucb_values = q_values + exploration_bonus

        action = np.argmax(ucb_values)
        self.action_counts[action] += 1

        return action
```

Understanding the agent-environment loop requires careful analysis of what information flows in each direction and when.
From Environment to Agent:
- The current state $s_t$ (or observation $o_t$)
- The scalar reward $r_t$ produced by the previous transition
- A termination flag (done) and, in practice, an auxiliary info dictionary

From Agent to Environment:
- A single action $a_t$ — nothing more
Note the asymmetry: the agent sends simple actions but receives rich, structured information. This reflects the reality that agents are embedded in complex worlds.
Timing Matters: The Causality Structure
The temporal structure of the loop enforces causality:
$$s_t \rightarrow a_t \rightarrow (r_{t+1}, s_{t+1}) \rightarrow a_{t+1} \rightarrow \ldots$$
Critical implications:
- The action $a_t$ can depend only on information available up to time $t$ — never on $s_{t+1}$ or $r_{t+1}$
- The reward $r_{t+1}$ must be credited to $a_t$ (and earlier actions), not to anything that happens afterward
- Changing the policy changes which states are visited, so the data distribution itself depends on the agent's behavior
This causal structure is why on-policy and off-policy methods differ, and why temporal difference learning is non-trivial.
A scalar reward carries surprisingly little information—just one number per step. This sparsity is why reward shaping, intrinsic motivation, and curiosity-driven exploration are active research areas. The reward signal is often insufficient for efficient learning, and augmenting it (carefully) can dramatically accelerate training.
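As one example of augmenting the signal carefully, here is a sketch of potential-based reward shaping, which preserves the optimal policy; the potential function `phi` is a hypothetical designer-supplied "progress" heuristic, not something defined on this page:

```python
def shaped_reward(r: float, s, s_next, phi, gamma: float = 0.99, done: bool = False) -> float:
    """Potential-based shaping: r + gamma * phi(s') - phi(s).

    Using this specific form (rather than arbitrary bonuses) adds guidance
    without changing which policy is optimal.
    """
    next_potential = 0.0 if done else phi(s_next)
    return r + gamma * next_potential - phi(s)
```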
| Time | Agent Receives | Agent Computes | Agent Sends |
|---|---|---|---|
| t=0 | Initial state $s_0$ | Policy $\pi(a|s_0)$ | Action $a_0$ |
| t=1 | $(r_1, s_1, done)$ | Update estimates; $\pi(a|s_1)$ | Action $a_1$ |
| t=2 | $(r_2, s_2, done)$ | Update estimates; $\pi(a|s_2)$ | Action $a_2$ |
| ... | ... | ... | ... |
| t=T | $(r_T, s_T, done=True)$ | Final updates; episode complete | — |
The agent-environment interaction exists to serve a purpose: finding a policy that maximizes expected cumulative reward. This objective can be formalized in several equivalent ways:
Episodic Finite-Horizon Objective
For episodic tasks with a fixed horizon $T$:
$$J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{T-1} r_{t+1} \right] = \mathbb{E}_{\tau \sim \pi} [G_0]$$
Discounted Infinite-Horizon Objective
For continuing tasks or variable-length episodes with discount factor $\gamma < 1$:
$$J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \right] = \mathbb{E}_{\tau \sim \pi} [G_0]$$
Average Reward Objective
For continuing tasks where we care about long-run average performance:
$$J(\pi) = \lim_{T \rightarrow \infty} \frac{1}{T} \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{T-1} r_{t+1} \right]$$
The Expectation is Over Trajectories
The notation $\tau \sim \pi$ means that trajectories are sampled according to the policy and environment dynamics:
$$P(\tau|\pi) = p(s_0) \prod_{t=0}^{T-1} \pi(a_t|s_t) P(s_{t+1}|s_t, a_t)$$
where:
- $p(s_0)$ is the initial state distribution
- $\pi(a_t|s_t)$ is the policy's probability of choosing $a_t$ in $s_t$
- $P(s_{t+1}|s_t, a_t)$ is the environment's transition dynamics
The objective thus depends on both the policy (which we control) and the environment dynamics (which we don't).
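A small sketch of this factorization for a tabular problem; the arrays `pi`, `P`, and `p0` are assumed toy inputs (policy table, transition table, and initial distribution), not objects defined earlier:

```python
import numpy as np

def trajectory_log_prob(states, actions, pi, P, p0) -> float:
    """log P(tau|pi) = log p0(s_0) + sum_t [log pi(a_t|s_t) + log P(s_{t+1}|s_t, a_t)]."""
    logp = np.log(p0[states[0]])
    for t in range(len(actions)):
        logp += np.log(pi[states[t], actions[t]])          # policy term (what we control)
        logp += np.log(P[states[t], actions[t], states[t + 1]])  # dynamics term (what we don't)
    return logp
```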
The optimal policy $\pi^*$ achieves the maximum expected return from every state. Remarkably, this doesn't require looking at all possible trajectories—the Bellman optimality principle states that an optimal policy consists of optimal actions at every step, regardless of what happened before. This recursive structure underlies dynamic programming and enables efficient algorithms.
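For reference (stated here in its standard form, anticipating later pages), the Bellman optimality equation expresses this recursive structure for the optimal state-value function:

$$V^*(s) = \max_a \sum_{s'} P(s'|s, a) \left[ R(s, a, s') + \gamma V^*(s') \right]$$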
```python
import numpy as np
from typing import List, Tuple


def compute_episode_return(
    rewards: List[float],
    gamma: float = 0.99
) -> float:
    """
    Compute the discounted return for an episode.

    Args:
        rewards: List of rewards [r_1, r_2, ..., r_T]
        gamma: Discount factor

    Returns:
        G_0: The return from the start of the episode
    """
    G = 0.0
    for reward in reversed(rewards):
        G = reward + gamma * G
    return G


def compute_all_returns(
    rewards: List[float],
    gamma: float = 0.99
) -> np.ndarray:
    """
    Compute returns for all time steps (for policy gradient methods).

    Returns G_0, G_1, ..., G_{T-1} where:
        G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ...
    """
    T = len(rewards)
    returns = np.zeros(T)
    G = 0.0
    for t in reversed(range(T)):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns


def estimate_policy_value(
    run_episode_fn,
    num_episodes: int = 100,
    gamma: float = 0.99
) -> Tuple[float, float]:
    """
    Monte Carlo estimation of policy value J(π).

    Args:
        run_episode_fn: Function that runs one episode with the policy,
            returning the list of rewards
        num_episodes: Number of episodes to sample
        gamma: Discount factor

    Returns:
        mean_return: Estimated J(π)
        std_return: Standard deviation of estimate
    """
    returns = []
    for _ in range(num_episodes):
        rewards = run_episode_fn()
        G = compute_episode_return(rewards, gamma)
        returns.append(G)

    mean_return = np.mean(returns)
    std_return = np.std(returns)
    std_error = std_return / np.sqrt(num_episodes)

    print(f"Estimated J(π) = {mean_return:.2f} ± {std_error:.2f}")
    print(f"(Based on {num_episodes} episodes)")

    return mean_return, std_return


def compare_policies(
    policies: List[Tuple[str, callable]],
    env,
    num_episodes: int = 100,
    gamma: float = 0.99
):
    """Compare multiple policies by their estimated values."""
    results = []
    for name, policy in policies:
        def run_episode():
            rewards = []
            state = env.reset()
            done = False
            while not done:
                action = policy(state)
                state, reward, done, _ = env.step(action)
                rewards.append(reward)
            return rewards

        mean, std = estimate_policy_value(run_episode, num_episodes, gamma)
        results.append((name, mean, std))

    # Rank policies
    results.sort(key=lambda x: -x[1])  # Descending by mean
    print("=== Policy Ranking ===")
    for i, (name, mean, std) in enumerate(results):
        print(f"{i+1}. {name}: {mean:.2f} ± {std/np.sqrt(num_episodes):.2f}")
```

Implementing the agent-environment interaction in practice requires attention to several engineering concerns:
1. Environment Vectorization
Modern RL frameworks (Stable Baselines 3, RLlib, CleanRL) run multiple environment instances in parallel. This:
- increases data throughput (more transitions per second of wall-clock time),
- decorrelates the samples within each training batch, and
- keeps hardware busy by batching policy inference across environments.

A minimal synchronous sketch is shown below.
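This sketch assumes Gym-style environments and a batched `policy` callable; real frameworks add subprocess workers, automatic resets, and shared-memory transport:

```python
import numpy as np

class SyncVectorLoop:
    """Step a list of independent environments in lockstep (synchronous vectorization)."""

    def __init__(self, envs, policy):
        self.envs = envs          # list of independent environment instances
        self.policy = policy      # maps a batch of states to a batch of actions

    def step_all(self, states):
        actions = self.policy(np.asarray(states))
        results = [env.step(a) for env, a in zip(self.envs, actions)]
        next_states, rewards, dones, infos = map(list, zip(*results))
        # Auto-reset finished environments so the batch stays full
        for i, done in enumerate(dones):
            if done:
                next_states[i] = self.envs[i].reset()
        return next_states, rewards, dones, infos
```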
2. Frame Stacking and History
For environments where single observations are insufficient (e.g., velocity from static images), agents often receive stacks of recent observations: $$o_t = (s_{t-k+1}, s_{t-k+2}, \ldots, s_t)$$
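A minimal deque-based frame stacker, assuming array-valued observations (an illustrative sketch, not any particular library's wrapper):

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keep the k most recent observations and expose them as one stacked array."""

    def __init__(self, k: int):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_obs: np.ndarray) -> np.ndarray:
        # Repeat the first observation so the stack is full from step 0
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_obs)
        return np.stack(self.frames)

    def push(self, obs: np.ndarray) -> np.ndarray:
        self.frames.append(obs)
        return np.stack(self.frames)
```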
3. Reward Preprocessing
Raw rewards are often transformed:
- Clipping to a fixed range (e.g., $[-1, 1]$ in the original Atari DQN setup) so that wildly different score scales don't destabilize learning
- Scaling or normalizing by a running estimate of the reward's standard deviation
- Sign-only rewards that keep the direction of feedback but discard its magnitude

A sketch of the first two transforms follows.
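The clipping bounds and epsilon below are illustrative defaults, not values prescribed by this page:

```python
import numpy as np

def clip_reward(r: float, low: float = -1.0, high: float = 1.0) -> float:
    """Clip rewards to a fixed range (as popularized by Atari DQN training)."""
    return float(np.clip(r, low, high))

class RunningRewardNormalizer:
    """Normalize rewards by a running standard-deviation estimate (Welford's algorithm)."""

    def __init__(self, eps: float = 1e-8):
        self.count, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def __call__(self, r: float) -> float:
        # Online update of mean and sum of squared deviations
        self.count += 1
        delta = r - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (r - self.mean)
        if self.count < 2:
            return r  # not enough data yet to estimate a scale
        std = np.sqrt(self.m2 / (self.count - 1))
        return r / (std + self.eps)
```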
Common implementation pitfalls:
- Off-by-one reward indexing: is $r_t$ the reward for arriving at state $s_t$ or for leaving it? Be consistent.
- Information leakage: if the environment exposes privileged information (e.g., the true state in the info dict), the agent can 'cheat' during training but fail at test time.

The Gym interface (reset() -> state, step(action) -> (state, reward, done, info)) has become the de facto standard for RL environments. Understanding and implementing this interface correctly is essential for using modern RL libraries. The newer Gymnasium library adds truncated to distinguish timeouts from natural terminations, which is important for correct value estimation.
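A minimal loop against the Gymnasium API showing why the distinction matters; CartPole-v1 and the random policy are stand-ins for illustration:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random policy, purely for illustration
    obs, reward, terminated, truncated, info = env.step(action)
    # terminated: true end of the task  -> bootstrap target treats V(s') as 0
    # truncated:  artificial timeout    -> V(s') should still be bootstrapped
    done = terminated or truncated
env.close()
```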
The agent-environment interaction loop is the conceptual foundation upon which all of reinforcement learning is built. Let's consolidate the essential insights:
- The agent-environment boundary is a modeling choice, not a physical fact, and it shapes what the agent must learn.
- At every step the agent observes a state, selects an action, and receives a reward and a next state; everything an RL algorithm learns comes from these tuples.
- The objective is expected cumulative (usually discounted) reward, not immediate reward, which makes credit assignment and exploration central challenges.
- The agent does not know the environment's dynamics or reward function; it must learn from experience, which is why RL can handle problems we cannot model analytically.
Looking Ahead
With the interaction paradigm established, we can now examine its components in detail. The next page dives into states, actions, and rewards—the concrete quantities that flow through the agent-environment interface. We'll see how the nature of these quantities determines which algorithms are applicable and how difficult learning becomes.
Understanding the interaction loop is understanding what RL is. The following pages will develop how RL algorithms leverage this structure to learn effective policies.
You now understand the agent-environment interaction paradigm—the conceptual foundation of all reinforcement learning. This mental model will guide your understanding as we explore increasingly sophisticated algorithms built upon this simple but powerful abstraction.