Imagine trying to teach a child to ride a bicycle using only a book of instructions. You could describe the physics of balance, the mechanics of pedaling, the principles of steering—yet the child would still fall the first dozen times they actually tried. Some things can only be learned through experience.
This observation lies at the heart of Reinforcement Learning (RL)—a paradigm that diverges fundamentally from supervised and unsupervised learning. Where supervised learning requires a teacher providing correct answers, and unsupervised learning discovers patterns in static data, reinforcement learning learns through trial, error, and feedback.
The agent-environment interaction loop is the conceptual foundation upon which all of reinforcement learning is built. Understanding this loop deeply—not just superficially—is essential for grasping why RL algorithms work, when they succeed, and why they sometimes fail spectacularly.
By the end of this page, you will understand the agent-environment interaction paradigm at a level suitable for implementing RL systems and reasoning about their behavior. You'll grasp the mathematical formalism, the philosophical underpinnings, and the practical implications that distinguish RL from other machine learning approaches.
Reinforcement Learning posits a deceptively simple world model: there exists an agent and an environment, and they interact through a continuous loop of perception and action.
The Agent is the learner and decision-maker. It perceives the world, chooses actions, and seeks to maximize some notion of cumulative reward. The agent embodies the algorithm we're designing—it's the entity that learns.
The Environment is everything external to the agent. It receives actions, transitions between states, and generates rewards. Crucially, the environment is treated as a black box: the agent may not know the environment's rules, dynamics, or internal structure.
This dichotomy may seem arbitrary—where exactly does the agent end and the environment begin? In practice, this boundary is a modeling choice that profoundly affects algorithm design.
The agent-environment boundary isn't physical. In a robotic system, should the robot's motors be part of the agent or the environment? If the agent includes the motors, it must model motor dynamics; if the environment includes them, the agent's actions become higher-level commands. The 'right' boundary depends on what you want the agent to learn and what you can reliably model.
Why a Dichotomy?
This separation serves multiple purposes:
Abstraction: It allows us to reason about learning algorithms independently of specific domains. The same Q-learning algorithm can play Atari games, control robots, or optimize data center cooling.
Modularity: We can swap environments (simulation vs. real world) without changing the agent, enabling sim-to-real transfer and curriculum learning.
Formalism: The dichotomy enables precise mathematical treatment. We can define objectives, prove convergence theorems, and analyze sample complexity.
Generalization: By treating the environment as unknown, we design agents that generalize rather than memorize—essential for deployment in novel situations.
| Aspect | Agent | Environment |
|---|---|---|
| Nature | Learner, decision-maker | External world, simulator |
| Knowledge | Learns from experience | Follows fixed (unknown) rules |
| Control | Chooses actions | Determines state transitions and rewards |
| Observability | May have partial information | Has complete internal state |
| Goal | Maximize cumulative reward | No goal (just dynamics) |
| Modifiability | We design/train this | Given (or simulated) |
At each discrete time step t, a precise sequence of events unfolds:
The agent observes the current state $s_t$ (or observation $o_t$ in partially observable settings)
The agent selects an action $a_t$ based on its policy $\pi(a_t|s_t)$
The environment transitions to a new state $s_{t+1}$ according to its dynamics $P(s_{t+1}|s_t, a_t)$
The environment emits a reward $r_{t+1} = R(s_t, a_t, s_{t+1})$
Time advances: $t \leftarrow t + 1$, and the loop repeats
This loop continues until a terminal state is reached (in episodic tasks) or indefinitely (in continuing tasks).
The Critical Insight: Delayed Consequences
What makes this loop fundamentally different from supervised learning is the credit assignment problem. When the agent wins or loses a game, which of the hundreds of preceding actions was responsible? When a robot falls over, was it the action taken milliseconds before, or a strategic error made minutes ago?
Unlike supervised learning, where each input has an immediate correct output, RL must attribute success or failure to actions that may have occurred arbitrarily far in the past. This temporal credit assignment is one of the central challenges in RL.
At each step, the agent faces a fundamental trade-off: should it exploit current knowledge (choose the best-known action) or explore uncertain alternatives (try something new that might be better)? This dilemma has no universal solution and is central to many RL algorithms. Too much exploitation leads to suboptimal local maxima; too much exploration wastes time on unpromising alternatives.
The agent-environment interaction is formalized mathematically as a trajectory (also called a rollout or episode):
$$\tau = (s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots)$$
This sequence of states, actions, and rewards constitutes the agent's experience. From this raw data, the agent must learn a policy that maximizes expected cumulative reward.
Notation Convention
There are two common conventions for indexing rewards:
- Convention 1 (Sutton & Barto): the reward that follows action $a_t$ in state $s_t$ is written $r_{t+1}$, emphasizing that it arrives together with $s_{t+1}$.
- Convention 2: the same reward is written $r_t$, indexed by the step on which the action was taken.

We will use Convention 1 (Sutton & Barto notation), where $r_{t+1}$ is the reward received after taking action $a_t$ in state $s_t$.
```python
import numpy as np
from typing import Tuple, List, Any


class InteractionLoop:
    """
    Implements the fundamental agent-environment interaction loop.
    This is the core abstraction underlying all RL algorithms.
    """

    def __init__(self, env, agent, max_steps: int = 1000):
        self.env = env
        self.agent = agent
        self.max_steps = max_steps

    def run_episode(self) -> Tuple[float, List[Tuple]]:
        """
        Execute one complete episode of agent-environment interaction.

        Returns:
            total_reward: Sum of all rewards received in the episode
            trajectory: List of (state, action, reward, next_state, done) tuples
        """
        trajectory = []
        total_reward = 0.0

        # Step 1: Environment provides initial state
        state = self.env.reset()

        for step in range(self.max_steps):
            # Step 2: Agent selects action based on current state
            action = self.agent.select_action(state)

            # Step 3 & 4: Environment transitions and emits reward
            next_state, reward, done, info = self.env.step(action)

            # Store experience for learning
            trajectory.append((state, action, reward, next_state, done))
            total_reward += reward

            # Agent may learn from this transition (online learning)
            self.agent.learn(state, action, reward, next_state, done)

            # Step 5: Advance time
            state = next_state

            if done:
                break  # Terminal state reached

        return total_reward, trajectory

    def run_training(self, num_episodes: int) -> List[float]:
        """
        Run multiple episodes for training.

        Returns:
            episode_rewards: List of total rewards per episode
        """
        episode_rewards = []

        for episode in range(num_episodes):
            total_reward, trajectory = self.run_episode()
            episode_rewards.append(total_reward)

            # Agent may perform batch learning after episode
            self.agent.end_episode(trajectory)

            if episode % 100 == 0:
                avg_reward = np.mean(episode_rewards[-100:])
                print(f"Episode {episode}, Avg Reward (last 100): {avg_reward:.2f}")

        return episode_rewards
```

The Return: Quantifying Long-Term Success
The agent's objective is not to maximize immediate reward, but cumulative discounted reward (the return):
$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
where $\gamma \in [0, 1]$ is the discount factor that determines the present value of future rewards.
Discounting serves multiple purposes: (1) It ensures the infinite sum converges for continuing tasks. (2) It handles uncertainty about the future—distant rewards are less certain. (3) It models economic time preference—a reward now is worth more than the same reward later. (4) It enables recursive Bellman equations that form the basis of dynamic programming solutions.
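As a concrete illustration (a toy four-step episode chosen here for exposition, not taken from the text above), discounting makes a distant reward worth less today:

```python
gamma = 0.9
rewards = [1.0, 0.0, 0.0, 10.0]   # r_1, r_2, r_3, r_4 of a hypothetical episode

# G_0 = r_1 + gamma*r_2 + gamma^2*r_3 + gamma^3*r_4
G0 = sum(gamma**k * r for k, r in enumerate(rewards))
print(G0)  # 8.29 -- the reward of 10 three steps away contributes only 0.9^3 * 10 = 7.29
```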
A defining characteristic of RL is that the agent does not know the environment's dynamics. This is profoundly different from optimal control theory, where system dynamics are assumed known.
What the Agent Knows:
- The current state (or observation) it receives at each step
- The set of actions available to it
- The rewards and transitions it has experienced so far

What the Agent Does NOT Know:
- The transition dynamics $P(s_{t+1}|s_t, a_t)$
- The reward function $R(s_t, a_t, s_{t+1})$
- Which states and actions will ultimately lead to high return
This ignorance is both a weakness and a strength. It's a weakness because the agent must learn everything from experience, which can be sample-inefficient. It's a strength because algorithms designed under this assumption can handle environments that are too complex to model analytically.
The Simulation Hypothesis
A crucial implication of treating environments as black boxes is that simulation and reality become interchangeable at the interface level. Whether the environment is a video game, a physics simulator, or the real world, the agent only sees (state, action, reward, next_state) tuples.
This enables:
- Training in cheap, fast, resettable simulators before touching expensive or fragile hardware
- Collecting experience from many environment instances in parallel
- Practicing dangerous or rare situations safely
However, it also creates the sim-to-real gap: simulators are imperfect models, and policies that work in simulation may fail catastrophically in the real world.
A robot trained in simulation to walk may fail in the real world because the simulator didn't model friction, motor dynamics, or sensor noise accurately. Modern approaches like domain randomization (training with varied simulation parameters) and system identification (adjusting the simulator to match reality) help bridge this gap, but it remains a fundamental challenge.
Environments vary along several important dimensions that affect which algorithms are applicable and how difficult learning becomes:
1. Episodic vs. Continuing
Episodic environments have natural endpoints—games end, tasks complete, robots reach goals or fail. Each episode is independent, allowing easy reset and parallel sampling.
Continuing environments have no natural termination—process control, stock trading, long-running systems. The agent must balance immediate performance with long-term learning, and there's no reset to recover from mistakes.
2. Fully Observable vs. Partially Observable
Fully observable environments provide the complete state at each step. The agent sees everything relevant to predicting future states and rewards.
Partially observable environments provide only partial information—the agent's observation is a noisy or incomplete function of the true state. The agent must maintain beliefs about unobserved state variables.
| Dimension | Type A | Type B | Algorithm Implications |
|---|---|---|---|
| Termination | Episodic | Continuing | Episodic allows natural return computation; continuing requires discounting or average reward formulations |
| Observability | Fully Observable (MDP) | Partially Observable (POMDP) | POMDPs require memory/belief states; much harder to solve |
| Determinism | Deterministic | Stochastic | Stochastic environments require expected value reasoning and more samples |
| Time | Discrete | Continuous | Continuous time needs differential equations or fine discretization |
| Players | Single-agent | Multi-agent | Multi-agent adds game-theoretic complexity; Nash equilibria vs. optimal policies |
| Stationarity | Stationary | Non-stationary | Non-stationary environments require online adaptation; past learning may become invalid |
3. Deterministic vs. Stochastic
Deterministic environments have predictable dynamics—the same action in the same state always produces the same next state. Planning is easier because outcomes are certain.
Stochastic environments have probabilistic dynamics—actions may lead to different next states with different probabilities. The agent must reason about expected values and may need to handle rare but important events.
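To make the distinction concrete, here is a minimal sketch of sampling from an explicit transition table (the two-state "slippery" dynamics below are an illustrative assumption, not an environment defined on this page):

```python
import numpy as np

# P[s, a] is a probability distribution over next states.
P = np.array([
    [[0.8, 0.2],    # state 0, action 0: usually stay in 0
     [0.2, 0.8]],   # state 0, action 1: usually move to 1
    [[0.9, 0.1],    # state 1, action 0: usually move to 0
     [0.1, 0.9]],   # state 1, action 1: usually stay in 1
])

def sample_next_state(state: int, action: int) -> int:
    """Sample s' ~ P(.|s, a); a deterministic environment would return a fixed s'."""
    return np.random.choice(P.shape[-1], p=P[state, action])

print(sample_next_state(0, 1))  # returns 1 about 80% of the time
```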
4. Discrete vs. Continuous
Both states and actions can be discrete or continuous:
- Discrete states and actions (gridworlds, board games, Atari button presses) admit tabular representations and simple argmax action selection.
- Continuous states (joint angles, velocities, images) require function approximation, since the agent cannot visit every state.
- Continuous actions (torques, steering angles) make the maximization over actions itself non-trivial and motivate policy-based and actor-critic methods.

A short sketch of how the two cases are typically declared follows.
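A minimal illustration using the Gymnasium space objects (the specific sizes and bounds are assumptions chosen for the example):

```python
import numpy as np
import gymnasium as gym

# Discrete action space: 4 possible moves (e.g., up/down/left/right)
discrete_actions = gym.spaces.Discrete(4)

# Continuous action space: 2 torques, each bounded in [-1, 1]
continuous_actions = gym.spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

print(discrete_actions.sample())    # e.g., 2
print(continuous_actions.sample())  # e.g., [ 0.37 -0.81]
```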
```python
from abc import ABC, abstractmethod
from typing import Tuple, Any, Optional
import numpy as np


class Environment(ABC):
    """
    Abstract base class defining the environment interface.
    All RL environments should implement this interface.
    """

    @abstractmethod
    def reset(self) -> Any:
        """
        Reset environment to initial state.

        Returns:
            initial_state: The starting state of a new episode
        """
        pass

    @abstractmethod
    def step(self, action: Any) -> Tuple[Any, float, bool, dict]:
        """
        Execute one environment step.

        Args:
            action: The action taken by the agent

        Returns:
            next_state: The resulting state
            reward: The scalar reward signal
            done: Whether the episode has terminated
            info: Additional diagnostic information
        """
        pass

    @property
    @abstractmethod
    def state_space(self) -> Any:
        """Description of the state space."""
        pass

    @property
    @abstractmethod
    def action_space(self) -> Any:
        """Description of the action space."""
        pass


class EpisodicEnvironment(Environment):
    """Environment with natural episode termination."""

    def __init__(self, max_episode_length: int = 1000):
        self.max_episode_length = max_episode_length
        self.current_step = 0

    def reset(self) -> Any:
        self.current_step = 0
        return self._get_initial_state()

    def step(self, action: Any) -> Tuple[Any, float, bool, dict]:
        self.current_step += 1
        next_state, reward = self._transition(action)

        # Check for natural or forced termination
        natural_done = self._is_terminal(next_state)
        timeout = self.current_step >= self.max_episode_length
        done = natural_done or timeout

        info = {
            'natural_termination': natural_done,
            'timeout': timeout,
            'step': self.current_step
        }

        return next_state, reward, done, info


class PartiallyObservableEnvironment(Environment):
    """
    Environment where the agent receives observations, not true states.
    Implements a POMDP interface.
    """

    def __init__(self):
        self._true_state = None

    def reset(self) -> Any:
        self._true_state = self._get_initial_state()
        return self._observe(self._true_state)

    def step(self, action: Any) -> Tuple[Any, float, bool, dict]:
        # Transition operates on true state
        next_true_state, reward = self._transition(self._true_state, action)
        done = self._is_terminal(next_true_state)
        self._true_state = next_true_state

        # Agent only sees an observation
        observation = self._observe(next_true_state)

        info = {'true_state': next_true_state}  # For debugging only

        return observation, reward, done, info

    @abstractmethod
    def _observe(self, true_state: Any) -> Any:
        """Generate observation from true state (may be noisy/partial)."""
        pass
```

At each time step, the agent must select an action. This selection is governed by the agent's policy, but the mechanism of selection has profound implications for learning.
Deterministic vs. Stochastic Policies
A deterministic policy $\mu: S \rightarrow A$ maps each state to a single action: $$a = \mu(s)$$
A stochastic policy $\pi: S \times A \rightarrow [0,1]$ specifies a probability distribution over actions: $$a \sim \pi(\cdot|s)$$
Stochastic policies are essential for exploration—if the agent always takes the same action in each state, it can never discover better alternatives. They also enable on-policy algorithms where the same policy is used for both acting and learning.
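A minimal sketch of the two policy types for a small discrete problem (the tables below are illustrative assumptions, not values derived anywhere on this page):

```python
import numpy as np

n_states, n_actions = 3, 2

# Deterministic policy mu: S -> A, stored as one action index per state.
mu = np.array([0, 1, 1])
def act_deterministic(state: int) -> int:
    return int(mu[state])

# Stochastic policy pi(a|s), stored as one probability distribution per state.
pi = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
])
def act_stochastic(state: int) -> int:
    return int(np.random.choice(n_actions, p=pi[state]))
```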
Exploration Strategies
The exploration-exploitation dilemma manifests concretely in action selection:
ε-Greedy: With probability $\epsilon$, select a random action; otherwise, select the greedy action
Boltzmann/Softmax Exploration: Select actions with probability proportional to exponentiated value estimates $$\pi(a|s) = \frac{\exp(Q(s,a)/\tau)}{\sum_{a'} \exp(Q(s,a')/\tau)}$$
Upper Confidence Bounds (UCB): Add bonus for uncertainty to encourage trying uncertain actions $$a = \arg\max_a \left[ Q(s,a) + c\sqrt{\frac{\ln t}{N(s,a)}} \right]$$
Entropy Regularization: Add entropy bonus to the objective, encouraging diverse actions
```python
import numpy as np
from typing import Callable


class ExplorationStrategy:
    """Base class for action selection strategies."""

    def select_action(self, q_values: np.ndarray, **kwargs) -> int:
        raise NotImplementedError


class EpsilonGreedy(ExplorationStrategy):
    """
    ε-Greedy exploration: random action with probability ε,
    greedy action otherwise.
    """

    def __init__(self, epsilon: float = 0.1,
                 epsilon_decay: float = 0.999,
                 epsilon_min: float = 0.01):
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min

    def select_action(self, q_values: np.ndarray, **kwargs) -> int:
        if np.random.random() < self.epsilon:
            # Explore: random action
            action = np.random.randint(len(q_values))
        else:
            # Exploit: greedy action (break ties randomly)
            max_q = np.max(q_values)
            best_actions = np.where(q_values == max_q)[0]
            action = np.random.choice(best_actions)
        return action

    def decay(self):
        """Decay epsilon after each episode."""
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)


class BoltzmannExploration(ExplorationStrategy):
    """
    Softmax exploration: actions selected proportionally to
    exponentiated Q-values, controlled by temperature.
    """

    def __init__(self, temperature: float = 1.0,
                 temperature_decay: float = 0.99,
                 temperature_min: float = 0.1):
        self.temperature = temperature
        self.temperature_decay = temperature_decay
        self.temperature_min = temperature_min

    def select_action(self, q_values: np.ndarray, **kwargs) -> int:
        # Numerical stability: subtract max before exp
        q_scaled = (q_values - np.max(q_values)) / self.temperature
        exp_q = np.exp(q_scaled)
        probabilities = exp_q / np.sum(exp_q)
        return np.random.choice(len(q_values), p=probabilities)

    def decay(self):
        """Decay temperature after each episode."""
        self.temperature = max(self.temperature_min,
                               self.temperature * self.temperature_decay)


class UCBExploration(ExplorationStrategy):
    """
    Upper Confidence Bound exploration: adds bonus for less-visited
    actions to encourage exploration.
    """

    def __init__(self, confidence: float = 2.0):
        self.confidence = confidence
        self.total_steps = 0
        self.action_counts = None

    def select_action(self, q_values: np.ndarray, state: int = None, **kwargs) -> int:
        n_actions = len(q_values)

        # Initialize action counts if needed
        if self.action_counts is None:
            self.action_counts = np.ones(n_actions)  # Start at 1 to avoid div by zero

        self.total_steps += 1

        # UCB formula: Q(a) + c * sqrt(ln(t) / N(a))
        exploration_bonus = self.confidence * np.sqrt(
            np.log(self.total_steps) / self.action_counts
        )
        ucb_values = q_values + exploration_bonus

        action = np.argmax(ucb_values)
        self.action_counts[action] += 1

        return action
```

Understanding the agent-environment loop requires careful analysis of what information flows in each direction and when.
From Environment to Agent:
- The current state $s_t$ (or observation $o_t$)
- The scalar reward $r_t$ produced by the previous transition
- A termination flag (done) and, in practice, an auxiliary info dictionary

From Agent to Environment:
- A single action $a_t$ — nothing more
Note the asymmetry: the agent sends simple actions but receives rich, structured information. This reflects the reality that agents are embedded in complex worlds.
Timing Matters: The Causality Structure
The temporal structure of the loop enforces causality:
$$s_t \rightarrow a_t \rightarrow (r_{t+1}, s_{t+1}) \rightarrow a_{t+1} \rightarrow \ldots$$
Critical implications:
- The action $a_t$ can depend only on information available up to time $t$ — never on $s_{t+1}$ or $r_{t+1}$
- The reward $r_{t+1}$ must be credited to $a_t$ (and earlier actions), not to anything that happens afterward
- Changing the policy changes which states are visited, so the data distribution itself depends on the agent's behavior
This causal structure is why on-policy and off-policy methods differ, and why temporal difference learning is non-trivial.
A scalar reward carries surprisingly little information—just one number per step. This sparsity is why reward shaping, intrinsic motivation, and curiosity-driven exploration are active research areas. The reward signal is often insufficient for efficient learning, and augmenting it (carefully) can dramatically accelerate training.
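As one example of augmenting the signal carefully, here is a sketch of potential-based reward shaping, which preserves the optimal policy; the potential function `phi` is a hypothetical designer-supplied "progress" heuristic, not something defined on this page:

```python
def shaped_reward(r: float, s, s_next, phi, gamma: float = 0.99, done: bool = False) -> float:
    """Potential-based shaping: r + gamma * phi(s') - phi(s).

    Using this specific form (rather than arbitrary bonuses) adds guidance
    without changing which policy is optimal.
    """
    next_potential = 0.0 if done else phi(s_next)
    return r + gamma * next_potential - phi(s)
```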
| Time | Agent Receives | Agent Computes | Agent Sends |
|---|---|---|---|
| t=0 | Initial state $s_0$ | Policy $\pi(a|s_0)$ | Action $a_0$ |
| t=1 | $(r_1, s_1, done)$ | Update estimates; $\pi(a|s_1)$ | Action $a_1$ |
| t=2 | $(r_2, s_2, done)$ | Update estimates; $\pi(a|s_2)$ | Action $a_2$ |
| ... | ... | ... | ... |
| t=T | $(r_T, s_T, done=True)$ | Final updates; episode complete | — |
The agent-environment interaction exists to serve a purpose: finding a policy that maximizes expected cumulative reward. This objective can be formalized in several equivalent ways:
Episodic Finite-Horizon Objective
For episodic tasks with a fixed horizon $T$:
$$J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{T-1} r_{t+1} \right] = \mathbb{E}_{\tau \sim \pi} [G_0]$$
Discounted Infinite-Horizon Objective
For continuing tasks or variable-length episodes with discount factor $\gamma < 1$:
$$J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \right] = \mathbb{E}_{\tau \sim \pi} [G_0]$$
Average Reward Objective
For continuing tasks where we care about long-run average performance:
$$J(\pi) = \lim_{T \rightarrow \infty} \frac{1}{T} \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{T-1} r_{t+1} \right]$$
The Expectation is Over Trajectories
The notation $\tau \sim \pi$ means that trajectories are sampled according to the policy and environment dynamics:
$$P(\tau|\pi) = p(s_0) \prod_{t=0}^{T-1} \pi(a_t|s_t) P(s_{t+1}|s_t, a_t)$$
where:
- $p(s_0)$ is the initial state distribution
- $\pi(a_t|s_t)$ is the policy's probability of choosing $a_t$ in $s_t$
- $P(s_{t+1}|s_t, a_t)$ is the environment's transition dynamics
The objective thus depends on both the policy (which we control) and the environment dynamics (which we don't).
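A small sketch of this factorization for a tabular problem; the arrays `pi`, `P`, and `p0` are assumed toy inputs (policy table, transition table, and initial distribution), not objects defined earlier:

```python
import numpy as np

def trajectory_log_prob(states, actions, pi, P, p0) -> float:
    """log P(tau|pi) = log p0(s_0) + sum_t [log pi(a_t|s_t) + log P(s_{t+1}|s_t, a_t)]."""
    logp = np.log(p0[states[0]])
    for t in range(len(actions)):
        logp += np.log(pi[states[t], actions[t]])          # policy term (what we control)
        logp += np.log(P[states[t], actions[t], states[t + 1]])  # dynamics term (what we don't)
    return logp
```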
The optimal policy $\pi^*$ achieves the maximum expected return from every state. Remarkably, this doesn't require looking at all possible trajectories—the Bellman optimality principle states that an optimal policy consists of optimal actions at every step, regardless of what happened before. This recursive structure underlies dynamic programming and enables efficient algorithms.
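For reference (stated here in its standard form, anticipating later pages), the Bellman optimality equation expresses this recursive structure for the optimal state-value function:

$$V^*(s) = \max_a \sum_{s'} P(s'|s, a) \left[ R(s, a, s') + \gamma V^*(s') \right]$$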
```python
import numpy as np
from typing import List, Tuple


def compute_episode_return(
    rewards: List[float],
    gamma: float = 0.99
) -> float:
    """
    Compute the discounted return for an episode.

    Args:
        rewards: List of rewards [r_1, r_2, ..., r_T]
        gamma: Discount factor

    Returns:
        G_0: The return from the start of the episode
    """
    G = 0.0
    for reward in reversed(rewards):
        G = reward + gamma * G
    return G


def compute_all_returns(
    rewards: List[float],
    gamma: float = 0.99
) -> np.ndarray:
    """
    Compute returns for all time steps (for policy gradient methods).

    Returns G_0, G_1, ..., G_{T-1} where:
        G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ...
    """
    T = len(rewards)
    returns = np.zeros(T)
    G = 0.0
    for t in reversed(range(T)):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns


def estimate_policy_value(
    run_episode_fn,
    num_episodes: int = 100,
    gamma: float = 0.99
) -> Tuple[float, float]:
    """
    Monte Carlo estimation of policy value J(π).

    Args:
        run_episode_fn: Function that runs one episode with the policy,
            returning the list of rewards
        num_episodes: Number of episodes to sample
        gamma: Discount factor

    Returns:
        mean_return: Estimated J(π)
        std_return: Standard deviation of estimate
    """
    returns = []
    for _ in range(num_episodes):
        rewards = run_episode_fn()
        G = compute_episode_return(rewards, gamma)
        returns.append(G)

    mean_return = np.mean(returns)
    std_return = np.std(returns)
    std_error = std_return / np.sqrt(num_episodes)

    print(f"Estimated J(π) = {mean_return:.2f} ± {std_error:.2f}")
    print(f"(Based on {num_episodes} episodes)")

    return mean_return, std_return


def compare_policies(
    policies: List[Tuple[str, callable]],
    env,
    num_episodes: int = 100,
    gamma: float = 0.99
):
    """Compare multiple policies by their estimated values."""
    results = []
    for name, policy in policies:
        def run_episode():
            rewards = []
            state = env.reset()
            done = False
            while not done:
                action = policy(state)
                state, reward, done, _ = env.step(action)
                rewards.append(reward)
            return rewards

        mean, std = estimate_policy_value(run_episode, num_episodes, gamma)
        results.append((name, mean, std))

    # Rank policies
    results.sort(key=lambda x: -x[1])  # Descending by mean
    print("=== Policy Ranking ===")
    for i, (name, mean, std) in enumerate(results):
        print(f"{i+1}. {name}: {mean:.2f} ± {std/np.sqrt(num_episodes):.2f}")
```

Implementing the agent-environment interaction in practice requires attention to several engineering concerns:
1. Environment Vectorization
Modern RL frameworks (Stable Baselines 3, RLlib, CleanRL) run multiple environment instances in parallel. This:
- increases data throughput (more transitions per second of wall-clock time),
- decorrelates the samples within each training batch, and
- keeps hardware busy by batching policy inference across environments.

A minimal synchronous sketch is shown below.
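This sketch assumes Gym-style environments and a batched `policy` callable; real frameworks add subprocess workers, automatic resets, and shared-memory transport:

```python
import numpy as np

class SyncVectorLoop:
    """Step a list of independent environments in lockstep (synchronous vectorization)."""

    def __init__(self, envs, policy):
        self.envs = envs          # list of independent environment instances
        self.policy = policy      # maps a batch of states to a batch of actions

    def step_all(self, states):
        actions = self.policy(np.asarray(states))
        results = [env.step(a) for env, a in zip(self.envs, actions)]
        next_states, rewards, dones, infos = map(list, zip(*results))
        # Auto-reset finished environments so the batch stays full
        for i, done in enumerate(dones):
            if done:
                next_states[i] = self.envs[i].reset()
        return next_states, rewards, dones, infos
```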
2. Frame Stacking and History
For environments where single observations are insufficient (e.g., velocity from static images), agents often receive stacks of recent observations: $$o_t = (s_{t-k+1}, s_{t-k+2}, \ldots, s_t)$$
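A minimal deque-based frame stacker, assuming array-valued observations (an illustrative sketch, not any particular library's wrapper):

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keep the k most recent observations and expose them as one stacked array."""

    def __init__(self, k: int):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_obs: np.ndarray) -> np.ndarray:
        # Repeat the first observation so the stack is full from step 0
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_obs)
        return np.stack(self.frames)

    def push(self, obs: np.ndarray) -> np.ndarray:
        self.frames.append(obs)
        return np.stack(self.frames)
```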
3. Reward Preprocessing
Raw rewards are often transformed:
- Clipping to a fixed range (e.g., $[-1, 1]$ in the original Atari DQN setup) so that wildly different score scales don't destabilize learning
- Scaling or normalizing by a running estimate of the reward's standard deviation
- Sign-only rewards that keep the direction of feedback but discard its magnitude

A sketch of the first two transforms follows.
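The clipping bounds and epsilon below are illustrative defaults, not values prescribed by this page:

```python
import numpy as np

def clip_reward(r: float, low: float = -1.0, high: float = 1.0) -> float:
    """Clip rewards to a fixed range (as popularized by Atari DQN training)."""
    return float(np.clip(r, low, high))

class RunningRewardNormalizer:
    """Normalize rewards by a running standard-deviation estimate (Welford's algorithm)."""

    def __init__(self, eps: float = 1e-8):
        self.count, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def __call__(self, r: float) -> float:
        # Online update of mean and sum of squared deviations
        self.count += 1
        delta = r - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (r - self.mean)
        if self.count < 2:
            return r  # not enough data yet to estimate a scale
        std = np.sqrt(self.m2 / (self.count - 1))
        return r / (std + self.eps)
```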
Common implementation pitfalls:
- Off-by-one reward indexing: is $r_t$ the reward for arriving at state $s_t$ or for leaving it? Be consistent.
- Information leakage: if the environment exposes privileged information (e.g., the true state in the info dict), the agent can 'cheat' during training but fail at test time.

The Gym interface (reset() -> state, step(action) -> (state, reward, done, info)) has become the de facto standard for RL environments. Understanding and implementing this interface correctly is essential for using modern RL libraries. The newer Gymnasium library adds truncated to distinguish timeouts from natural terminations, which is important for correct value estimation.
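A minimal loop against the Gymnasium API showing why the distinction matters; CartPole-v1 and the random policy are stand-ins for illustration:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random policy, purely for illustration
    obs, reward, terminated, truncated, info = env.step(action)
    # terminated: true end of the task  -> bootstrap target treats V(s') as 0
    # truncated:  artificial timeout    -> V(s') should still be bootstrapped
    done = terminated or truncated
env.close()
```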
The agent-environment interaction loop is the conceptual foundation upon which all of reinforcement learning is built. Let's consolidate the essential insights:
- The agent-environment boundary is a modeling choice, not a physical fact, and it shapes what the agent must learn.
- At every step the agent observes a state, selects an action, and receives a reward and a next state; everything an RL algorithm learns comes from these tuples.
- The objective is expected cumulative (usually discounted) reward, not immediate reward, which makes credit assignment and exploration central challenges.
- The agent does not know the environment's dynamics or reward function; it must learn from experience, which is why RL can handle problems we cannot model analytically.
Looking Ahead
With the interaction paradigm established, we can now examine its components in detail. The next page dives into states, actions, and rewards—the concrete quantities that flow through the agent-environment interface. We'll see how the nature of these quantities determines which algorithms are applicable and how difficult learning becomes.
Understanding the interaction loop is understanding what RL is. The following pages will develop how RL algorithms leverage this structure to learn effective policies.
You now understand the agent-environment interaction paradigm—the conceptual foundation of all reinforcement learning. This mental model will guide your understanding as we explore increasingly sophisticated algorithms built upon this simple but powerful abstraction.