In the previous module, we explored reinforcement learning from an intuitive perspective—agents interacting with environments, collecting rewards, and learning policies. But intuition, while valuable, cannot guide the construction of algorithms or provide guarantees about convergence and optimality.
To build rigorous RL algorithms, we need a mathematical framework that precisely specifies what agents can observe, what actions they can take, how the environment responds, and what objectives they pursue. This framework is the Markov Decision Process (MDP).
The MDP is not merely a formalism for its own sake. It is the foundation upon which every major RL algorithm is built—from Q-learning to policy gradients, from actor-critic methods to modern deep RL. Understanding MDPs deeply is a prerequisite for understanding any of these methods beyond a surface level.
By the end of this page, you will understand the formal 5-tuple definition of MDPs, the role of each component, the critical Markov property, and how this framework captures the essence of sequential decision-making under uncertainty. You'll be able to formulate real-world problems as MDPs and reason about them mathematically.
Why do we need a formal mathematical framework at all? Consider the practical questions that arise when designing RL systems: What must a state contain for the agent to act well? How should immediate rewards be weighed against future ones? Under what conditions is a learning algorithm guaranteed to converge to an optimal policy?
Without a formal framework, these questions remain unanswerable. We would be reduced to empirical trial-and-error, with no principled way to debug failures or predict behavior on new problems.
The MDP formalism provides exactly this: a precise specification of states, actions, dynamics, rewards, and objectives, and with it a principled basis for reasoning about convergence and optimality.
Markov Decision Processes were formalized by Richard Bellman in the 1950s as part of his work on dynamic programming at RAND Corporation. The name honors Andrey Markov's foundational work on stochastic processes. Remarkably, Bellman's framework remains the primary lens through which we understand modern deep RL—a testament to its mathematical elegance and generality.
A Markov Decision Process is formally defined as a 5-tuple:
$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$$
where the five components are summarized in the table below. Let's examine each component with the rigor it deserves.
| Component | Symbol | Type | Description |
|---|---|---|---|
| State Space | $\mathcal{S}$ | Set (finite or infinite) | All possible configurations of the environment |
| Action Space | $\mathcal{A}$ | Set (finite or infinite) | All possible actions available to the agent |
| Transition Function | $P(s'|s, a)$ | Probability distribution | Probability of reaching state $s'$ from state $s$ after action $a$ |
| Reward Function | $R(s, a, s')$ | Real-valued function | Immediate reward received for a transition |
| Discount Factor | $\gamma$ | Scalar in $[0, 1]$ | Weight given to future rewards relative to immediate rewards |
Different textbooks use different notations. Some write $R(s, a)$ for expected reward, some include the initial state distribution $\mu_0$ making it a 6-tuple, and some use $T$ instead of $P$ for transitions. The core concepts remain identical—always check the notation conventions of your source.
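To make the tuple concrete, here is a minimal sketch of a finite MDP stored as arrays; the class name `FiniteMDP` and its field layout are illustrative assumptions, not a standard API:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class FiniteMDP:
    """A finite MDP M = (S, A, P, R, gamma), stored as dense arrays.

    States and actions are indexed 0..n_states-1 and 0..n_actions-1.
    """
    n_states: int
    n_actions: int
    P: np.ndarray   # shape (S, A, S): P[s, a, s'] = P(s' | s, a)
    R: np.ndarray   # shape (S, A, S): R[s, a, s'] = immediate reward
    gamma: float    # discount factor in [0, 1]

    def validate(self) -> None:
        """Check that each P[s, a, :] is a probability distribution."""
        assert self.P.shape == (self.n_states, self.n_actions, self.n_states)
        assert np.allclose(self.P.sum(axis=2), 1.0), "transition rows must sum to 1"
        assert 0.0 <= self.gamma <= 1.0

# Example: a trivial 2-state, 2-action MDP
P = np.full((2, 2, 2), 0.5)   # every action: 50/50 chance of either next state
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0              # reward 1 for landing in state 1
mdp = FiniteMDP(n_states=2, n_actions=2, P=P, R=R, gamma=0.9)
mdp.validate()
```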
The state space $\mathcal{S}$ represents all possible configurations of the environment. Each state $s \in \mathcal{S}$ is a complete description of the environment at a given moment—containing all information relevant to predicting future behavior.
Types of State Spaces:
Finite state spaces contain a finite number of states, such as the cells of a gridworld or the positions of a board game (Example 1 in the code below).
Continuous state spaces contain uncountably many states, such as the angle and angular velocity of an inverted pendulum (Example 2 below).
Hybrid state spaces mix discrete and continuous components, such as a mobile robot's discrete room combined with its continuous battery level (Example 3 below):
```python
import numpy as np
from dataclasses import dataclass
from enum import Enum
from typing import Tuple

# ================================================
# Example 1: Discrete State Space - Gridworld
# ================================================
class GridWorldState:
    """
    State representation for a discrete gridworld.
    State is fully determined by (row, col) position.
    """
    def __init__(self, rows: int, cols: int):
        self.rows = rows
        self.cols = cols
        # Total number of states
        self.n_states = rows * cols

    def state_to_idx(self, row: int, col: int) -> int:
        """Convert (row, col) to a single state index."""
        return row * self.cols + col

    def idx_to_state(self, idx: int) -> Tuple[int, int]:
        """Convert state index to (row, col)."""
        return idx // self.cols, idx % self.cols

    def is_valid_state(self, row: int, col: int) -> bool:
        """Check if a position is within the grid."""
        return 0 <= row < self.rows and 0 <= col < self.cols

# ================================================
# Example 2: Continuous State Space - Pendulum
# ================================================
@dataclass
class PendulumState:
    """
    State representation for inverted pendulum.
    Continuous state: (angle, angular_velocity)
    """
    theta: float      # Angle from vertical (radians), θ ∈ [-π, π]
    theta_dot: float  # Angular velocity, θ̇ ∈ [-8, 8]

    @property
    def as_vector(self) -> np.ndarray:
        """Return state as numpy array for function approximation."""
        return np.array([self.theta, self.theta_dot])

    @classmethod
    def from_vector(cls, vec: np.ndarray) -> 'PendulumState':
        """Construct state from numpy array."""
        return cls(theta=vec[0], theta_dot=vec[1])

# ================================================
# Example 3: Hybrid State Space - Robot with Battery
# ================================================
class RoomType(Enum):
    HALLWAY = 0
    OFFICE = 1
    KITCHEN = 2
    CHARGING_STATION = 3

@dataclass
class RobotState:
    """
    Hybrid state: discrete room + continuous battery.
    The state space is the Cartesian product:
    S = {HALLWAY, OFFICE, KITCHEN, CHARGING_STATION} × [0, 100]
    """
    room: RoomType        # Discrete: which room
    battery_level: float  # Continuous: battery percentage

    def is_critical_battery(self) -> bool:
        """Check if battery is critically low."""
        return self.battery_level < 15.0

    def can_perform_task(self) -> bool:
        """Check if robot has enough battery for a task."""
        return self.battery_level >= 20.0
```

The state must satisfy the Markov property: it must contain all information needed to predict the future, independent of the past. If your state representation violates this—e.g., the agent needs to remember what happened 10 steps ago—you need to either augment the state (e.g., stack frames) or use architectures designed for partial observability (POMDPs, recurrent networks).
The action space $\mathcal{A}$ defines all actions available to the agent. In general MDPs, the action space may depend on the current state: $\mathcal{A}(s)$ denotes actions available in state $s$.
Discrete Action Spaces:
$$\mathcal{A} = \{a_1, a_2, \ldots, a_n\}$$
Examples include gridworld moves (UP, DOWN, LEFT, RIGHT), Atari button presses, and moves in board games.
Continuous Action Spaces:
$$\mathcal{A} \subseteq \mathbb{R}^d$$
Examples include joint torques for a robot arm and other low-level control signals in physics simulation.
Multi-Discrete and Mixed Spaces:
Some problems have factored action spaces—multiple discrete choices or combinations of discrete and continuous components.
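Before the fuller discrete and continuous examples below, here is a minimal sketch of one way to represent a factored (multi-discrete) action space; the class and field names are illustrative assumptions:

```python
import numpy as np
from dataclasses import dataclass
from typing import Tuple

@dataclass
class MultiDiscreteActionSpace:
    """Factored action space: one independent discrete choice per component.

    Example: a game controller with a 5-way d-pad and a 2-way fire button
    can be modeled as MultiDiscreteActionSpace(nvec=(5, 2)).
    """
    nvec: Tuple[int, ...]  # number of options in each component

    def sample(self) -> np.ndarray:
        """Sample one choice per component, uniformly at random."""
        return np.array([np.random.randint(n) for n in self.nvec])

    def contains(self, action: np.ndarray) -> bool:
        """Check that each component lies within its own range."""
        return len(action) == len(self.nvec) and all(
            0 <= a < n for a, n in zip(action, self.nvec)
        )

# Usage
controller = MultiDiscreteActionSpace(nvec=(5, 2))
print(controller.sample())  # e.g. array([3, 1])
```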
```python
import numpy as np
from typing import List, Tuple
from dataclasses import dataclass

# ================================================
# Example 1: Discrete Action Space
# ================================================
class DiscreteActionSpace:
    """
    Discrete action space with finite actions.
    Used in: Gridworld, Atari, Board Games
    """
    def __init__(self, actions: List[str]):
        self.actions = actions
        self.n = len(actions)
        self.action_to_idx = {a: i for i, a in enumerate(actions)}
        self.idx_to_action = {i: a for i, a in enumerate(actions)}

    def sample(self) -> int:
        """Sample a random action (for exploration)."""
        return np.random.randint(0, self.n)

    def contains(self, action: int) -> bool:
        """Check if action is valid."""
        return 0 <= action < self.n

# Usage
gridworld_actions = DiscreteActionSpace(["UP", "DOWN", "LEFT", "RIGHT"])
print(f"Number of actions: {gridworld_actions.n}")  # 4
print(f"Random action: {gridworld_actions.sample()}")

# ================================================
# Example 2: Continuous Action Space (Box)
# ================================================
@dataclass
class ContinuousActionSpace:
    """
    Box-shaped continuous action space.
    Action is a vector a ∈ [low, high] ⊂ ℝ^d
    Used in: Robotics, Control, Physics Simulation
    """
    low: np.ndarray   # Lower bounds for each dimension
    high: np.ndarray  # Upper bounds for each dimension

    @property
    def shape(self) -> Tuple[int, ...]:
        return self.low.shape

    @property
    def dim(self) -> int:
        return self.low.shape[0]

    def sample(self) -> np.ndarray:
        """Sample uniformly from the action space."""
        return np.random.uniform(self.low, self.high)

    def clip(self, action: np.ndarray) -> np.ndarray:
        """Clip action to lie within bounds."""
        return np.clip(action, self.low, self.high)

    def contains(self, action: np.ndarray) -> bool:
        """Check if action is within bounds."""
        return np.all(action >= self.low) and np.all(action <= self.high)

# Usage: Robot arm with 7 torque outputs
robot_actions = ContinuousActionSpace(
    low=np.array([-2.0] * 7),   # -2 Nm minimum torque
    high=np.array([2.0] * 7)    # +2 Nm maximum torque
)
print(f"Action dimension: {robot_actions.dim}")  # 7
print(f"Sample action: {robot_actions.sample()}")

# ================================================
# Example 3: State-Dependent Action Space
# ================================================
class ChessActionSpace:
    """
    Action space that depends on current state.
    In chess, legal moves depend on piece positions.
    """
    def get_legal_actions(self, board_state: 'ChessBoardState') -> List[str]:
        """
        Return legal moves in algebraic notation.
        This would call a chess engine in practice.
        """
        # Placeholder - real implementation uses chess rules
        # The key point: A(s) varies with s
        if board_state.is_check:
            # Fewer legal moves when in check
            return self._get_check_escapes(board_state)
        return self._get_all_legal_moves(board_state)

    def _get_check_escapes(self, state) -> List[str]:
        # Return only moves that escape check
        pass

    def _get_all_legal_moves(self, state) -> List[str]:
        # Return all legal moves
        pass
```

The defining characteristic of MDPs is the Markov property (also called the Markov assumption or memorylessness). Formally:
$$P(S_{t+1} | S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0, A_0) = P(S_{t+1} | S_t, A_t)$$
In words: the future is conditionally independent of the past, given the present state. The current state $S_t$ is a sufficient statistic for predicting everything about the future—no historical information beyond $S_t$ adds predictive power.
Why is this property so important? Because it lets value functions and policies be defined over the current state alone, and it gives the problem the recursive structure that dynamic programming (and nearly every RL algorithm built on it) exploits.
If your problem violates the Markov property, you either need to augment the state to make it Markov, or use Partially Observable MDPs (POMDPs) which explicitly model the agent's uncertainty about the true state.
Consider an Atari game with a moving ball: a single frame reveals the ball's position but not its velocity, so it is not Markov on its own. Stacking recent frames restores the missing information: with `state = [frame_t-3, frame_t-2, frame_t-1, frame_t]`, both ball position AND velocity are inferrable from the frame sequence. This pattern is ubiquitous in deep RL: when single observations don't satisfy Markov, we augment with history (frame stacking) or use recurrent architectures that maintain hidden memory.
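A minimal sketch of the frame-stacking idea, written as a hypothetical wrapper not tied to any particular environment API:

```python
from collections import deque
import numpy as np

class FrameStack:
    """Maintain the last k observations and expose them as one state.

    Stacking restores (approximately) the Markov property when a single
    observation omits dynamic information such as velocity.
    """
    def __init__(self, k: int = 4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_obs: np.ndarray) -> np.ndarray:
        """Fill the buffer with copies of the first observation."""
        for _ in range(self.k):
            self.frames.append(first_obs)
        return self.state

    def step(self, obs: np.ndarray) -> np.ndarray:
        """Append the newest observation and return the stacked state."""
        self.frames.append(obs)
        return self.state

    @property
    def state(self) -> np.ndarray:
        # Stack along a new leading axis: shape (k, *obs_shape)
        return np.stack(self.frames, axis=0)

# Usage with dummy 2x2 "frames"
stack = FrameStack(k=4)
s = stack.reset(np.zeros((2, 2)))
s = stack.step(np.ones((2, 2)))
print(s.shape)  # (4, 2, 2)
```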
Many real-world problems are not naturally Markov: a robot might need to remember where it was to infer where it is (localization), a dialogue agent needs conversational history, a trading agent needs past price trends. Recognizing when your problem violates Markov—and designing state representations that restore it—is a critical RL engineering skill.
Let's put everything together with a complete MDP specification for a classic problem: the Frozen Lake environment.
Problem Description:
An agent navigates a 4×4 frozen lake. The surface has a start cell (S), frozen cells (F) that are safe to stand on, holes (H) that end the episode with no reward, and a goal cell (G) that ends the episode with a reward of 1.
The catch: the ice is slippery! When the agent chooses an action, it moves in the intended direction only with probability 1/3, and slips to each of the two perpendicular directions with probability 1/3.
```python
import numpy as np
from typing import Dict, Tuple, List
from enum import IntEnum
from dataclasses import dataclass

"""
Complete MDP Specification: Frozen Lake

The 4x4 grid:
    S F F F
    F H F H
    F F F H
    H F F G

S = Start, F = Frozen (safe), H = Hole (terminal, reward=0), G = Goal (terminal, reward=1)
"""

# ================================================
# STATE SPACE: S = {0, 1, 2, ..., 15}
# ================================================
# Each cell is a state, indexed row-major:
#  0  1  2  3
#  4  5  6  7
#  8  9 10 11
# 12 13 14 15

N_STATES = 16
GRID_SIZE = 4

# Terminal states
HOLE_STATES = {5, 7, 11, 12}  # States where episode ends (failure)
GOAL_STATE = 15               # State where episode ends (success)
TERMINAL_STATES = HOLE_STATES | {GOAL_STATE}

# ================================================
# ACTION SPACE: A = {0, 1, 2, 3} = {LEFT, DOWN, RIGHT, UP}
# ================================================
class Action(IntEnum):
    LEFT = 0
    DOWN = 1
    RIGHT = 2
    UP = 3

N_ACTIONS = 4
ACTIONS = list(Action)

# ================================================
# TRANSITION FUNCTION: P(s' | s, a)
# ================================================
# Due to slippery ice:
# - P(intended direction | s, a) = 1/3
# - P(slip left | s, a) = 1/3
# - P(slip right | s, a) = 1/3

def get_next_state(state: int, action: Action) -> int:
    """Compute next state given current state and action (deterministic step)."""
    if state in TERMINAL_STATES:
        return state  # Terminal states are absorbing

    row, col = state // GRID_SIZE, state % GRID_SIZE

    if action == Action.LEFT:
        col = max(0, col - 1)
    elif action == Action.RIGHT:
        col = min(GRID_SIZE - 1, col + 1)
    elif action == Action.UP:
        row = max(0, row - 1)
    elif action == Action.DOWN:
        row = min(GRID_SIZE - 1, row + 1)

    return row * GRID_SIZE + col

def get_slip_actions(action: Action) -> List[Action]:
    """Get the three possible outcomes: intended, slip-left, slip-right."""
    # Perpendicular actions based on intended direction
    slip_map = {
        Action.LEFT: [Action.LEFT, Action.UP, Action.DOWN],
        Action.RIGHT: [Action.RIGHT, Action.UP, Action.DOWN],
        Action.UP: [Action.UP, Action.LEFT, Action.RIGHT],
        Action.DOWN: [Action.DOWN, Action.LEFT, Action.RIGHT],
    }
    return slip_map[action]

def build_transition_matrix() -> Dict[Tuple[int, int, int], float]:
    """
    Build the complete transition probability table.
    Returns P[s, a, s'] = P(s' | s, a) for all valid transitions.
    """
    P = {}  # (s, a, s') -> probability

    for s in range(N_STATES):
        for a in ACTIONS:
            if s in TERMINAL_STATES:
                # Terminal states are absorbing (stay with prob 1)
                P[(s, a, s)] = 1.0
            else:
                # Slippery: 1/3 each for intended and two perpendicular
                slip_actions = get_slip_actions(a)
                for slip_a in slip_actions:
                    s_prime = get_next_state(s, slip_a)
                    key = (s, a, s_prime)
                    P[key] = P.get(key, 0.0) + 1/3

    return P

# ================================================
# REWARD FUNCTION: R(s, a, s')
# ================================================
def reward_function(s: int, a: int, s_prime: int) -> float:
    """
    Reward function R(s, a, s').

    Reward structure:
    - Reaching goal (s' = 15): R = 1.0
    - Falling in hole: R = 0.0
    - All other transitions: R = 0.0
    """
    if s_prime == GOAL_STATE:
        return 1.0
    return 0.0

# Alternative: Step penalty to encourage faster solutions
def reward_with_step_penalty(s: int, a: int, s_prime: int) -> float:
    """Variant with step penalty to discourage long paths."""
    if s_prime == GOAL_STATE:
        return 1.0
    elif s_prime in HOLE_STATES:
        return -1.0
    else:
        return -0.01  # Small penalty per step

# ================================================
# DISCOUNT FACTOR: γ = 0.99
# ================================================
GAMMA = 0.99

# ================================================
# COMPLETE MDP OBJECT
# ================================================
@dataclass
class FrozenLakeMDP:
    """Complete MDP specification as a data object."""
    n_states: int = N_STATES
    n_actions: int = N_ACTIONS
    transition_probs: Dict = None  # Built lazily
    gamma: float = GAMMA
    terminal_states: set = None

    def __post_init__(self):
        self.terminal_states = TERMINAL_STATES
        self.transition_probs = build_transition_matrix()

    def P(self, s: int, a: int, s_prime: int) -> float:
        """Transition probability P(s' | s, a)."""
        return self.transition_probs.get((s, a, s_prime), 0.0)

    def R(self, s: int, a: int, s_prime: int) -> float:
        """Reward R(s, a, s')."""
        return reward_function(s, a, s_prime)

    def is_terminal(self, s: int) -> bool:
        """Check if state is terminal."""
        return s in self.terminal_states

# Create the MDP
frozen_lake = FrozenLakeMDP()

# Verify: probabilities from state 0 with action RIGHT
print("Transitions from s=0, a=RIGHT:")
for s_prime in range(N_STATES):
    p = frozen_lake.P(0, Action.RIGHT, s_prime)
    if p > 0:
        print(f"  P(s'={s_prime} | s=0, a=RIGHT) = {p:.4f}")

# Output:
# Transitions from s=0, a=RIGHT:
#   P(s'=0 | s=0, a=RIGHT) = 0.3333  (slip UP, hit wall)
#   P(s'=1 | s=0, a=RIGHT) = 0.3333  (intended direction)
#   P(s'=4 | s=0, a=RIGHT) = 0.3333  (slip DOWN)
```

Notice how every component is precisely specified: states are numbered 0-15, actions are the 4 directions, transitions are stochastic (1/3 probability each outcome), rewards are sparse (1 at goal, 0 elsewhere), and γ=0.99. Given this specification, we can apply any MDP algorithm—value iteration, policy iteration, Q-learning—without ambiguity.
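As one illustration, a minimal value-iteration sketch can run directly against the `frozen_lake` object above; it reuses `ACTIONS` and the `P`/`R` methods from that listing, and the convergence tolerance and the convention that terminal states carry zero future value are choices made here, not part of the specification:

```python
import numpy as np

def value_iteration(mdp, tol: float = 1e-8) -> np.ndarray:
    """Compute optimal state values V*(s) by repeated Bellman backups."""
    V = np.zeros(mdp.n_states)
    while True:
        V_new = np.zeros_like(V)
        for s in range(mdp.n_states):
            if mdp.is_terminal(s):
                # Convention: terminal states contribute no future return.
                continue
            # Q(s, a) = sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V(s')]
            V_new[s] = max(
                sum(
                    mdp.P(s, a, sp) * (mdp.R(s, a, sp) + mdp.gamma * V[sp])
                    for sp in range(mdp.n_states)
                )
                for a in ACTIONS
            )
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

V_star = value_iteration(frozen_lake)
print(np.round(V_star.reshape(4, 4), 3))  # optimal values laid out on the grid
```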
An MDP's horizon refers to the length of the decision-making process:
Finite Horizon MDPs: the process terminates after a fixed number of steps $T$. The best action can depend on how many steps remain, so the optimal policy is generally time-dependent.
Infinite Horizon MDPs: the process continues indefinitely (or until a terminal state is reached). With discounting, the optimal policy is stationary, depending only on the state.
Why does discounting matter for infinite horizons?
Without discounting ($\gamma = 1$) in infinite horizon settings, the sum of rewards could diverge to infinity, making comparison meaningless. The discount factor ensures:
$$\sum_{t=0}^{\infty} \gamma^t R_t \leq \sum_{t=0}^{\infty} \gamma^t R_{\max} = \frac{R_{\max}}{1 - \gamma} < \infty$$
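A quick numerical sanity check of this bound (the horizon values chosen here are arbitrary illustrations):

```python
# Discounted sum of a constant reward R_max approaches R_max / (1 - gamma)
gamma, R_max = 0.99, 1.0
partial_sums = [sum(gamma**t * R_max for t in range(T)) for T in (10, 100, 1000, 10000)]
print([round(s, 3) for s in partial_sums])  # [9.562, 63.397, 99.996, 100.0]
print(R_max / (1 - gamma))                  # 100.0, the limit of the series
```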
| Property | Finite Horizon | Infinite Horizon |
|---|---|---|
| Episode length | Fixed $T$ steps | Unbounded or until terminal |
| Discounting | Optional (often $\gamma=1$) | Required ($\gamma < 1$) |
| Optimal policy | Time-dependent $\pi^*_t(s)$ | Stationary $\pi^*(s)$ |
| Value function | $V^\pi_t(s)$ depends on time | $V^\pi(s)$ time-independent |
| Computation | Backward induction from $T$ | Fixed point iteration |
| Applications | Games, planning with deadline | Control, ongoing optimization |
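To make the backward-induction row concrete, here is a minimal finite-horizon sketch; it reuses the `frozen_lake` object and `ACTIONS` from the Frozen Lake listing above, and the horizon `T = 10` is an arbitrary illustration:

```python
import numpy as np

def backward_induction(mdp, T: int = 10) -> np.ndarray:
    """Finite-horizon optimal values V_t(s) for t = 0..T, computed backward.

    V_T(s) = 0 by definition; each earlier step applies one Bellman backup.
    The resulting optimal policy may depend on the time step t.
    """
    V = np.zeros((T + 1, mdp.n_states))  # V[t, s]
    for t in range(T - 1, -1, -1):
        for s in range(mdp.n_states):
            if mdp.is_terminal(s):
                continue  # terminal states keep value 0
            V[t, s] = max(
                sum(
                    mdp.P(s, a, sp) * (mdp.R(s, a, sp) + mdp.gamma * V[t + 1, sp])
                    for sp in range(mdp.n_states)
                )
                for a in ACTIONS
            )
    return V

V = backward_induction(frozen_lake, T=10)
print(np.round(V[0].reshape(4, 4), 3))  # values at t=0, i.e. with 10 steps remaining
```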
A separate but related distinction concerns whether the task naturally breaks into episodes:
Episodic Tasks: the interaction naturally breaks into episodes that end in a terminal state (a game is won or lost, the agent reaches the goal or falls into a hole), after which the environment resets.
Continuing Tasks: the interaction goes on indefinitely with no natural endpoint (ongoing control or optimization), so the discounted return is needed to keep the objective finite.
Unified Notation (Sutton & Barto):
To handle both cases elegantly, we can treat episode termination as a transition to an absorbing state $s_{\text{terminal}}$ with $R=0$ and self-loops. This makes episodic tasks a special case of continuing tasks, simplifying algorithm design.
In practice, most RL implementations use episodic formulations even for continuing tasks by artificially truncating episodes after T steps, then bootstrapping from the value estimate at truncation. This gives the learning benefits of episodic resets while approximating the continuing objective.
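A minimal sketch of that truncate-and-bootstrap target (the function name and signature are illustrative assumptions):

```python
def td_target(reward: float, next_value: float, gamma: float,
              terminated: bool) -> float:
    """One-step bootstrap target.

    `terminated` must be True only for real terminal states. An episode that
    is merely truncated after T steps should pass terminated=False so that we
    still bootstrap from the value estimate at the cutoff state.
    """
    if terminated:
        return reward                   # no future reward past a real terminal
    return reward + gamma * next_value  # ordinary step or truncation

# At a truncation point: bootstrap from V(s') even though the episode was cut off
print(td_target(reward=0.0, next_value=0.8, gamma=0.99, terminated=False))  # 0.792
# At a real terminal (e.g., the goal): do not bootstrap
print(td_target(reward=1.0, next_value=0.0, gamma=0.99, terminated=True))   # 1.0
```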
We have established the formal framework that will support every algorithm and concept in reinforcement learning. Let's consolidate the key insights:

- An MDP is the 5-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$: states, actions, transition dynamics, rewards, and the discount factor.
- The Markov property makes the current state a sufficient statistic for the future; when it fails, augment the state (e.g., frame stacking) or move to POMDPs.
- The discount factor keeps infinite-horizon returns finite and controls how heavily future rewards are weighed.
- The horizon and episode structure (finite vs. infinite, episodic vs. continuing) determine whether optimal policies are stationary and how algorithms organize their computation.
What's Next:
With the MDP structure established, we turn next to state transitions—examining how the transition probability function $P(s'|s,a)$ governs environment dynamics, and how agents can reason about uncertainty in outcomes.
You now understand the formal mathematical framework of Markov Decision Processes—the foundation upon which all reinforcement learning rests. This 5-tuple formalism transforms intuitive notions of sequential decision-making into precise mathematical objects amenable to rigorous analysis and algorithm design.