In the previous module, we explored reinforcement learning from an intuitive perspective—agents interacting with environments, collecting rewards, and learning policies. But intuition, while valuable, cannot guide the construction of algorithms or provide guarantees about convergence and optimality.
To build rigorous RL algorithms, we need a mathematical framework that precisely specifies what agents can observe, what actions they can take, how the environment responds, and what objectives they pursue. This framework is the Markov Decision Process (MDP).
The MDP is not merely a formalism for its own sake. It is the foundation upon which every major RL algorithm is built—from Q-learning to policy gradients, from actor-critic methods to modern deep RL. Understanding MDPs deeply is a prerequisite for understanding any of these methods beyond a surface level.
By the end of this page, you will understand the formal 5-tuple definition of MDPs, the role of each component, the critical Markov property, and how this framework captures the essence of sequential decision-making under uncertainty. You'll be able to formulate real-world problems as MDPs and reason about them mathematically.
Why do we need a formal mathematical framework at all? Consider the practical questions that arise when designing RL systems: What must a state contain for the agent to act well? How should immediate rewards be weighed against future ones? Under what conditions is a learning algorithm guaranteed to converge to an optimal policy?
Without a formal framework, these questions remain unanswerable. We would be reduced to empirical trial-and-error, with no principled way to debug failures or predict behavior on new problems.
The MDP formalism provides exactly this: a precise specification of states, actions, dynamics, rewards, and objectives, and with it a principled basis for reasoning about convergence and optimality.
Markov Decision Processes were formalized by Richard Bellman in the 1950s as part of his work on dynamic programming at RAND Corporation. The name honors Andrey Markov's foundational work on stochastic processes. Remarkably, Bellman's framework remains the primary lens through which we understand modern deep RL—a testament to its mathematical elegance and generality.
A Markov Decision Process is formally defined as a 5-tuple:
$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$$
where the five components are summarized in the table below. Let's examine each component with the rigor it deserves.
| Component | Symbol | Type | Description |
|---|---|---|---|
| State Space | $\mathcal{S}$ | Set (finite or infinite) | All possible configurations of the environment |
| Action Space | $\mathcal{A}$ | Set (finite or infinite) | All possible actions available to the agent |
| Transition Function | $P(s'|s, a)$ | Probability distribution | Probability of reaching state $s'$ from state $s$ after action $a$ |
| Reward Function | $R(s, a, s')$ | Real-valued function | Immediate reward received for a transition |
| Discount Factor | $\gamma$ | Scalar in $[0, 1]$ | Weight given to future rewards relative to immediate rewards |
Different textbooks use different notations. Some write $R(s, a)$ for expected reward, some include the initial state distribution $\mu_0$ making it a 6-tuple, and some use $T$ instead of $P$ for transitions. The core concepts remain identical—always check the notation conventions of your source.
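To make the tuple concrete, here is a minimal sketch of a finite MDP stored as arrays; the class name `FiniteMDP` and its field layout are illustrative assumptions, not a standard API:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class FiniteMDP:
    """A finite MDP M = (S, A, P, R, gamma), stored as dense arrays.

    States and actions are indexed 0..n_states-1 and 0..n_actions-1.
    """
    n_states: int
    n_actions: int
    P: np.ndarray   # shape (S, A, S): P[s, a, s'] = P(s' | s, a)
    R: np.ndarray   # shape (S, A, S): R[s, a, s'] = immediate reward
    gamma: float    # discount factor in [0, 1]

    def validate(self) -> None:
        """Check that each P[s, a, :] is a probability distribution."""
        assert self.P.shape == (self.n_states, self.n_actions, self.n_states)
        assert np.allclose(self.P.sum(axis=2), 1.0), "transition rows must sum to 1"
        assert 0.0 <= self.gamma <= 1.0

# Example: a trivial 2-state, 2-action MDP
P = np.full((2, 2, 2), 0.5)   # every action: 50/50 chance of either next state
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0              # reward 1 for landing in state 1
mdp = FiniteMDP(n_states=2, n_actions=2, P=P, R=R, gamma=0.9)
mdp.validate()
```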
The state space $\mathcal{S}$ represents all possible configurations of the environment. Each state $s \in \mathcal{S}$ is a complete description of the environment at a given moment—containing all information relevant to predicting future behavior.
Types of State Spaces:
Finite state spaces contain a finite number of states, such as the cells of a gridworld or the positions of a board game (Example 1 in the code below).
Continuous state spaces contain uncountably many states, such as the angle and angular velocity of an inverted pendulum (Example 2 below).
Hybrid state spaces mix discrete and continuous components, such as a mobile robot's discrete room combined with its continuous battery level (Example 3 below):
```python
import numpy as np
from dataclasses import dataclass
from enum import Enum
from typing import Tuple

# ================================================
# Example 1: Discrete State Space - Gridworld
# ================================================
class GridWorldState:
    """
    State representation for a discrete gridworld.
    State is fully determined by (row, col) position.
    """
    def __init__(self, rows: int, cols: int):
        self.rows = rows
        self.cols = cols
        # Total number of states
        self.n_states = rows * cols

    def state_to_idx(self, row: int, col: int) -> int:
        """Convert (row, col) to a single state index."""
        return row * self.cols + col

    def idx_to_state(self, idx: int) -> Tuple[int, int]:
        """Convert state index to (row, col)."""
        return idx // self.cols, idx % self.cols

    def is_valid_state(self, row: int, col: int) -> bool:
        """Check if a position is within the grid."""
        return 0 <= row < self.rows and 0 <= col < self.cols

# ================================================
# Example 2: Continuous State Space - Pendulum
# ================================================
@dataclass
class PendulumState:
    """
    State representation for inverted pendulum.
    Continuous state: (angle, angular_velocity)
    """
    theta: float      # Angle from vertical (radians), θ ∈ [-π, π]
    theta_dot: float  # Angular velocity, θ̇ ∈ [-8, 8]

    @property
    def as_vector(self) -> np.ndarray:
        """Return state as numpy array for function approximation."""
        return np.array([self.theta, self.theta_dot])

    @classmethod
    def from_vector(cls, vec: np.ndarray) -> 'PendulumState':
        """Construct state from numpy array."""
        return cls(theta=vec[0], theta_dot=vec[1])

# ================================================
# Example 3: Hybrid State Space - Robot with Battery
# ================================================
class RoomType(Enum):
    HALLWAY = 0
    OFFICE = 1
    KITCHEN = 2
    CHARGING_STATION = 3

@dataclass
class RobotState:
    """
    Hybrid state: discrete room + continuous battery.
    The state space is the Cartesian product:
    S = {HALLWAY, OFFICE, KITCHEN, CHARGING_STATION} × [0, 100]
    """
    room: RoomType        # Discrete: which room
    battery_level: float  # Continuous: battery percentage

    def is_critical_battery(self) -> bool:
        """Check if battery is critically low."""
        return self.battery_level < 15.0

    def can_perform_task(self) -> bool:
        """Check if robot has enough battery for a task."""
        return self.battery_level >= 20.0
```

The state must satisfy the Markov property: it must contain all information needed to predict the future, independent of the past. If your state representation violates this—e.g., the agent needs to remember what happened 10 steps ago—you need to either augment the state (e.g., stack frames) or use architectures designed for partial observability (POMDPs, recurrent networks).
The action space $\mathcal{A}$ defines all actions available to the agent. In general MDPs, the action space may depend on the current state: $\mathcal{A}(s)$ denotes actions available in state $s$.
Discrete Action Spaces:
$$\mathcal{A} = \{a_1, a_2, \ldots, a_n\}$$
Examples include gridworld moves (UP, DOWN, LEFT, RIGHT), Atari button presses, and moves in board games.
Continuous Action Spaces:
$$\mathcal{A} \subseteq \mathbb{R}^d$$
Examples include joint torques for a robot arm and other low-level control signals in physics simulation.
Multi-Discrete and Mixed Spaces:
Some problems have factored action spaces—multiple discrete choices or combinations of discrete and continuous components.
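Before the fuller discrete and continuous examples below, here is a minimal sketch of one way to represent a factored (multi-discrete) action space; the class and field names are illustrative assumptions:

```python
import numpy as np
from dataclasses import dataclass
from typing import Tuple

@dataclass
class MultiDiscreteActionSpace:
    """Factored action space: one independent discrete choice per component.

    Example: a game controller with a 5-way d-pad and a 2-way fire button
    can be modeled as MultiDiscreteActionSpace(nvec=(5, 2)).
    """
    nvec: Tuple[int, ...]  # number of options in each component

    def sample(self) -> np.ndarray:
        """Sample one choice per component, uniformly at random."""
        return np.array([np.random.randint(n) for n in self.nvec])

    def contains(self, action: np.ndarray) -> bool:
        """Check that each component lies within its own range."""
        return len(action) == len(self.nvec) and all(
            0 <= a < n for a, n in zip(action, self.nvec)
        )

# Usage
controller = MultiDiscreteActionSpace(nvec=(5, 2))
print(controller.sample())  # e.g. array([3, 1])
```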
```python
import numpy as np
from typing import List, Tuple
from dataclasses import dataclass

# ================================================
# Example 1: Discrete Action Space
# ================================================
class DiscreteActionSpace:
    """
    Discrete action space with finite actions.
    Used in: Gridworld, Atari, Board Games
    """
    def __init__(self, actions: List[str]):
        self.actions = actions
        self.n = len(actions)
        self.action_to_idx = {a: i for i, a in enumerate(actions)}
        self.idx_to_action = {i: a for i, a in enumerate(actions)}

    def sample(self) -> int:
        """Sample a random action (for exploration)."""
        return np.random.randint(0, self.n)

    def contains(self, action: int) -> bool:
        """Check if action is valid."""
        return 0 <= action < self.n

# Usage
gridworld_actions = DiscreteActionSpace(["UP", "DOWN", "LEFT", "RIGHT"])
print(f"Number of actions: {gridworld_actions.n}")  # 4
print(f"Random action: {gridworld_actions.sample()}")

# ================================================
# Example 2: Continuous Action Space (Box)
# ================================================
@dataclass
class ContinuousActionSpace:
    """
    Box-shaped continuous action space.
    Action is a vector a ∈ [low, high] ⊂ ℝ^d
    Used in: Robotics, Control, Physics Simulation
    """
    low: np.ndarray   # Lower bounds for each dimension
    high: np.ndarray  # Upper bounds for each dimension

    @property
    def shape(self) -> Tuple[int, ...]:
        return self.low.shape

    @property
    def dim(self) -> int:
        return self.low.shape[0]

    def sample(self) -> np.ndarray:
        """Sample uniformly from the action space."""
        return np.random.uniform(self.low, self.high)

    def clip(self, action: np.ndarray) -> np.ndarray:
        """Clip action to lie within bounds."""
        return np.clip(action, self.low, self.high)

    def contains(self, action: np.ndarray) -> bool:
        """Check if action is within bounds."""
        return np.all(action >= self.low) and np.all(action <= self.high)

# Usage: Robot arm with 7 torque outputs
robot_actions = ContinuousActionSpace(
    low=np.array([-2.0] * 7),   # -2 Nm minimum torque
    high=np.array([2.0] * 7)    # +2 Nm maximum torque
)
print(f"Action dimension: {robot_actions.dim}")  # 7
print(f"Sample action: {robot_actions.sample()}")

# ================================================
# Example 3: State-Dependent Action Space
# ================================================
class ChessActionSpace:
    """
    Action space that depends on current state.
    In chess, legal moves depend on piece positions.
    """
    def get_legal_actions(self, board_state: 'ChessBoardState') -> List[str]:
        """
        Return legal moves in algebraic notation.
        This would call a chess engine in practice.
        """
        # Placeholder - real implementation uses chess rules
        # The key point: A(s) varies with s
        if board_state.is_check:
            # Fewer legal moves when in check
            return self._get_check_escapes(board_state)
        return self._get_all_legal_moves(board_state)

    def _get_check_escapes(self, state) -> List[str]:
        # Return only moves that escape check
        pass

    def _get_all_legal_moves(self, state) -> List[str]:
        # Return all legal moves
        pass
```

The defining characteristic of MDPs is the Markov property (also called the Markov assumption or memorylessness). Formally:
$$P(S_{t+1} | S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0, A_0) = P(S_{t+1} | S_t, A_t)$$
In words: the future is conditionally independent of the past, given the present state. The current state $S_t$ is a sufficient statistic for predicting everything about the future—no historical information beyond $S_t$ adds predictive power.
Why is this property so important? Because it lets value functions and policies be defined over the current state alone, and it gives the problem the recursive structure that dynamic programming (and nearly every RL algorithm built on it) exploits.
If your problem violates the Markov property, you either need to augment the state to make it Markov, or use Partially Observable MDPs (POMDPs) which explicitly model the agent's uncertainty about the true state.
Consider an Atari game with a moving ball: a single frame reveals the ball's position but not its velocity, so it is not Markov on its own. Stacking recent frames restores the missing information: with `state = [frame_t-3, frame_t-2, frame_t-1, frame_t]`, both ball position AND velocity are inferrable from the frame sequence. This pattern is ubiquitous in deep RL: when single observations don't satisfy Markov, we augment with history (frame stacking) or use recurrent architectures that maintain hidden memory.
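A minimal sketch of the frame-stacking idea, written as a hypothetical wrapper not tied to any particular environment API:

```python
from collections import deque
import numpy as np

class FrameStack:
    """Maintain the last k observations and expose them as one state.

    Stacking restores (approximately) the Markov property when a single
    observation omits dynamic information such as velocity.
    """
    def __init__(self, k: int = 4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_obs: np.ndarray) -> np.ndarray:
        """Fill the buffer with copies of the first observation."""
        for _ in range(self.k):
            self.frames.append(first_obs)
        return self.state

    def step(self, obs: np.ndarray) -> np.ndarray:
        """Append the newest observation and return the stacked state."""
        self.frames.append(obs)
        return self.state

    @property
    def state(self) -> np.ndarray:
        # Stack along a new leading axis: shape (k, *obs_shape)
        return np.stack(self.frames, axis=0)

# Usage with dummy 2x2 "frames"
stack = FrameStack(k=4)
s = stack.reset(np.zeros((2, 2)))
s = stack.step(np.ones((2, 2)))
print(s.shape)  # (4, 2, 2)
```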
Many real-world problems are not naturally Markov: a robot might need to remember where it was to infer where it is (localization), a dialogue agent needs conversational history, a trading agent needs past price trends. Recognizing when your problem violates Markov—and designing state representations that restore it—is a critical RL engineering skill.
Let's put everything together with a complete MDP specification for a classic problem: the Frozen Lake environment.
Problem Description:
An agent navigates a 4×4 frozen lake. The surface has a start cell (S), frozen cells (F) that are safe to stand on, holes (H) that end the episode with no reward, and a goal cell (G) that ends the episode with a reward of 1.
The catch: the ice is slippery! When the agent chooses an action, it moves in the intended direction only with probability 1/3, and slips to each of the two perpendicular directions with probability 1/3.
```python
import numpy as np
from typing import Dict, Tuple, List
from enum import IntEnum
from dataclasses import dataclass

"""
Complete MDP Specification: Frozen Lake

The 4x4 grid:
    S F F F
    F H F H
    F F F H
    H F F G

S = Start, F = Frozen (safe), H = Hole (terminal, reward=0), G = Goal (terminal, reward=1)
"""

# ================================================
# STATE SPACE: S = {0, 1, 2, ..., 15}
# ================================================
# Each cell is a state, indexed row-major:
#  0  1  2  3
#  4  5  6  7
#  8  9 10 11
# 12 13 14 15

N_STATES = 16
GRID_SIZE = 4

# Terminal states
HOLE_STATES = {5, 7, 11, 12}  # States where episode ends (failure)
GOAL_STATE = 15               # State where episode ends (success)
TERMINAL_STATES = HOLE_STATES | {GOAL_STATE}

# ================================================
# ACTION SPACE: A = {0, 1, 2, 3} = {LEFT, DOWN, RIGHT, UP}
# ================================================
class Action(IntEnum):
    LEFT = 0
    DOWN = 1
    RIGHT = 2
    UP = 3

N_ACTIONS = 4
ACTIONS = list(Action)

# ================================================
# TRANSITION FUNCTION: P(s' | s, a)
# ================================================
# Due to slippery ice:
# - P(intended direction | s, a) = 1/3
# - P(slip left | s, a) = 1/3
# - P(slip right | s, a) = 1/3

def get_next_state(state: int, action: Action) -> int:
    """Compute next state given current state and action (deterministic step)."""
    if state in TERMINAL_STATES:
        return state  # Terminal states are absorbing

    row, col = state // GRID_SIZE, state % GRID_SIZE

    if action == Action.LEFT:
        col = max(0, col - 1)
    elif action == Action.RIGHT:
        col = min(GRID_SIZE - 1, col + 1)
    elif action == Action.UP:
        row = max(0, row - 1)
    elif action == Action.DOWN:
        row = min(GRID_SIZE - 1, row + 1)

    return row * GRID_SIZE + col

def get_slip_actions(action: Action) -> List[Action]:
    """Get the three possible outcomes: intended, slip-left, slip-right."""
    # Perpendicular actions based on intended direction
    slip_map = {
        Action.LEFT: [Action.LEFT, Action.UP, Action.DOWN],
        Action.RIGHT: [Action.RIGHT, Action.UP, Action.DOWN],
        Action.UP: [Action.UP, Action.LEFT, Action.RIGHT],
        Action.DOWN: [Action.DOWN, Action.LEFT, Action.RIGHT],
    }
    return slip_map[action]

def build_transition_matrix() -> Dict[Tuple[int, int, int], float]:
    """
    Build the complete transition probability table.
    Returns P[s, a, s'] = P(s' | s, a) for all valid transitions.
    """
    P = {}  # (s, a, s') -> probability

    for s in range(N_STATES):
        for a in ACTIONS:
            if s in TERMINAL_STATES:
                # Terminal states are absorbing (stay with prob 1)
                P[(s, a, s)] = 1.0
            else:
                # Slippery: 1/3 each for intended and two perpendicular
                slip_actions = get_slip_actions(a)
                for slip_a in slip_actions:
                    s_prime = get_next_state(s, slip_a)
                    key = (s, a, s_prime)
                    P[key] = P.get(key, 0.0) + 1/3

    return P

# ================================================
# REWARD FUNCTION: R(s, a, s')
# ================================================
def reward_function(s: int, a: int, s_prime: int) -> float:
    """
    Reward function R(s, a, s').

    Reward structure:
    - Reaching goal (s' = 15): R = 1.0
    - Falling in hole: R = 0.0
    - All other transitions: R = 0.0
    """
    if s_prime == GOAL_STATE:
        return 1.0
    return 0.0

# Alternative: Step penalty to encourage faster solutions
def reward_with_step_penalty(s: int, a: int, s_prime: int) -> float:
    """Variant with step penalty to discourage long paths."""
    if s_prime == GOAL_STATE:
        return 1.0
    elif s_prime in HOLE_STATES:
        return -1.0
    else:
        return -0.01  # Small penalty per step

# ================================================
# DISCOUNT FACTOR: γ = 0.99
# ================================================
GAMMA = 0.99

# ================================================
# COMPLETE MDP OBJECT
# ================================================
@dataclass
class FrozenLakeMDP:
    """Complete MDP specification as a data object."""
    n_states: int = N_STATES
    n_actions: int = N_ACTIONS
    transition_probs: Dict = None  # Built lazily
    gamma: float = GAMMA
    terminal_states: set = None

    def __post_init__(self):
        self.terminal_states = TERMINAL_STATES
        self.transition_probs = build_transition_matrix()

    def P(self, s: int, a: int, s_prime: int) -> float:
        """Transition probability P(s' | s, a)."""
        return self.transition_probs.get((s, a, s_prime), 0.0)

    def R(self, s: int, a: int, s_prime: int) -> float:
        """Reward R(s, a, s')."""
        return reward_function(s, a, s_prime)

    def is_terminal(self, s: int) -> bool:
        """Check if state is terminal."""
        return s in self.terminal_states

# Create the MDP
frozen_lake = FrozenLakeMDP()

# Verify: probabilities from state 0 with action RIGHT
print("Transitions from s=0, a=RIGHT:")
for s_prime in range(N_STATES):
    p = frozen_lake.P(0, Action.RIGHT, s_prime)
    if p > 0:
        print(f"  P(s'={s_prime} | s=0, a=RIGHT) = {p:.4f}")

# Output:
# Transitions from s=0, a=RIGHT:
#   P(s'=0 | s=0, a=RIGHT) = 0.3333  (slip UP, hit wall)
#   P(s'=1 | s=0, a=RIGHT) = 0.3333  (intended direction)
#   P(s'=4 | s=0, a=RIGHT) = 0.3333  (slip DOWN)
```

Notice how every component is precisely specified: states are numbered 0-15, actions are the 4 directions, transitions are stochastic (1/3 probability each outcome), rewards are sparse (1 at goal, 0 elsewhere), and γ=0.99. Given this specification, we can apply any MDP algorithm—value iteration, policy iteration, Q-learning—without ambiguity.
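As one illustration, a minimal value-iteration sketch can run directly against the `frozen_lake` object above; it reuses `ACTIONS` and the `P`/`R` methods from that listing, and the convergence tolerance and the convention that terminal states carry zero future value are choices made here, not part of the specification:

```python
import numpy as np

def value_iteration(mdp, tol: float = 1e-8) -> np.ndarray:
    """Compute optimal state values V*(s) by repeated Bellman backups."""
    V = np.zeros(mdp.n_states)
    while True:
        V_new = np.zeros_like(V)
        for s in range(mdp.n_states):
            if mdp.is_terminal(s):
                # Convention: terminal states contribute no future return.
                continue
            # Q(s, a) = sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V(s')]
            V_new[s] = max(
                sum(
                    mdp.P(s, a, sp) * (mdp.R(s, a, sp) + mdp.gamma * V[sp])
                    for sp in range(mdp.n_states)
                )
                for a in ACTIONS
            )
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

V_star = value_iteration(frozen_lake)
print(np.round(V_star.reshape(4, 4), 3))  # optimal values laid out on the grid
```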
An MDP's horizon refers to the length of the decision-making process:
Finite Horizon MDPs: the process terminates after a fixed number of steps $T$. The best action can depend on how many steps remain, so the optimal policy is generally time-dependent.
Infinite Horizon MDPs: the process continues indefinitely (or until a terminal state is reached). With discounting, the optimal policy is stationary, depending only on the state.
Why does discounting matter for infinite horizons?
Without discounting ($\gamma = 1$) in infinite horizon settings, the sum of rewards could diverge to infinity, making comparison meaningless. The discount factor ensures:
$$\sum_{t=0}^{\infty} \gamma^t R_t \leq \sum_{t=0}^{\infty} \gamma^t R_{\max} = \frac{R_{\max}}{1 - \gamma} < \infty$$
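A quick numerical sanity check of this bound (the horizon values chosen here are arbitrary illustrations):

```python
# Discounted sum of a constant reward R_max approaches R_max / (1 - gamma)
gamma, R_max = 0.99, 1.0
partial_sums = [sum(gamma**t * R_max for t in range(T)) for T in (10, 100, 1000, 10000)]
print([round(s, 3) for s in partial_sums])  # [9.562, 63.397, 99.996, 100.0]
print(R_max / (1 - gamma))                  # 100.0, the limit of the series
```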
| Property | Finite Horizon | Infinite Horizon |
|---|---|---|
| Episode length | Fixed $T$ steps | Unbounded or until terminal |
| Discounting | Optional (often $\gamma=1$) | Required ($\gamma < 1$) |
| Optimal policy | Time-dependent $\pi^*_t(s)$ | Stationary $\pi^*(s)$ |
| Value function | $V^\pi_t(s)$ depends on time | $V^\pi(s)$ time-independent |
| Computation | Backward induction from $T$ | Fixed point iteration |
| Applications | Games, planning with deadline | Control, ongoing optimization |
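To make the backward-induction row concrete, here is a minimal finite-horizon sketch; it reuses the `frozen_lake` object and `ACTIONS` from the Frozen Lake listing above, and the horizon `T = 10` is an arbitrary illustration:

```python
import numpy as np

def backward_induction(mdp, T: int = 10) -> np.ndarray:
    """Finite-horizon optimal values V_t(s) for t = 0..T, computed backward.

    V_T(s) = 0 by definition; each earlier step applies one Bellman backup.
    The resulting optimal policy may depend on the time step t.
    """
    V = np.zeros((T + 1, mdp.n_states))  # V[t, s]
    for t in range(T - 1, -1, -1):
        for s in range(mdp.n_states):
            if mdp.is_terminal(s):
                continue  # terminal states keep value 0
            V[t, s] = max(
                sum(
                    mdp.P(s, a, sp) * (mdp.R(s, a, sp) + mdp.gamma * V[t + 1, sp])
                    for sp in range(mdp.n_states)
                )
                for a in ACTIONS
            )
    return V

V = backward_induction(frozen_lake, T=10)
print(np.round(V[0].reshape(4, 4), 3))  # values at t=0, i.e. with 10 steps remaining
```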
A separate but related distinction concerns whether the task naturally breaks into episodes:
Episodic Tasks: the interaction naturally breaks into episodes that end in a terminal state (a game is won or lost, the agent reaches the goal or falls into a hole), after which the environment resets.
Continuing Tasks: the interaction goes on indefinitely with no natural endpoint (ongoing control or optimization), so the discounted return is needed to keep the objective finite.
Unified Notation (Sutton & Barto):
To handle both cases elegantly, we can treat episode termination as a transition to an absorbing state $s_{\text{terminal}}$ with $R=0$ and self-loops. This makes episodic tasks a special case of continuing tasks, simplifying algorithm design.
In practice, most RL implementations use episodic formulations even for continuing tasks by artificially truncating episodes after T steps, then bootstrapping from the value estimate at truncation. This gives the learning benefits of episodic resets while approximating the continuing objective.
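A minimal sketch of that truncate-and-bootstrap target (the function name and signature are illustrative assumptions):

```python
def td_target(reward: float, next_value: float, gamma: float,
              terminated: bool) -> float:
    """One-step bootstrap target.

    `terminated` must be True only for real terminal states. An episode that
    is merely truncated after T steps should pass terminated=False so that we
    still bootstrap from the value estimate at the cutoff state.
    """
    if terminated:
        return reward                   # no future reward past a real terminal
    return reward + gamma * next_value  # ordinary step or truncation

# At a truncation point: bootstrap from V(s') even though the episode was cut off
print(td_target(reward=0.0, next_value=0.8, gamma=0.99, terminated=False))  # 0.792
# At a real terminal (e.g., the goal): do not bootstrap
print(td_target(reward=1.0, next_value=0.0, gamma=0.99, terminated=True))   # 1.0
```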
We have established the formal framework that will support every algorithm and concept in reinforcement learning. Let's consolidate the key insights:

- An MDP is the 5-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$: states, actions, transition dynamics, rewards, and the discount factor.
- The Markov property makes the current state a sufficient statistic for the future; when it fails, augment the state (e.g., frame stacking) or move to POMDPs.
- The discount factor keeps infinite-horizon returns finite and controls how heavily future rewards are weighed.
- The horizon and episode structure (finite vs. infinite, episodic vs. continuing) determine whether optimal policies are stationary and how algorithms organize their computation.
What's Next:
With the MDP structure established, we turn next to state transitions—examining how the transition probability function $P(s'|s,a)$ governs environment dynamics, and how agents can reason about uncertainty in outcomes.
You now understand the formal mathematical framework of Markov Decision Processes—the foundation upon which all reinforcement learning rests. This 5-tuple formalism transforms intuitive notions of sequential decision-making into precise mathematical objects amenable to rigorous analysis and algorithm design.