In the previous page, we established that an MDP is defined by a 5-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$. Now we focus on arguably the most complex component: the transition probability function $P$.
The transition function encapsulates how the world works—how states evolve in response to actions. These are the environment dynamics, the rules governing what happens when an agent acts. Understanding transitions deeply is essential: they determine which algorithms apply and how tractable a problem is.
By the end of this page, you will understand the transition probability function in depth—from its mathematical definition to its representation as transition matrices, from deterministic to stochastic dynamics, and from known to unknown models. You'll be able to reason about environment dynamics and understand why some problems are tractable while others require millions of samples.
The transition probability function (or transition kernel, or dynamics model) is denoted:
$$P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$$
$$P(s' | s, a) = \Pr(S_{t+1} = s' | S_t = s, A_t = a)$$
This reads: "The probability of transitioning to state $s'$, given that the agent is currently in state $s$ and takes action $a$."
Key properties that must hold:
Non-negativity: $P(s' | s, a) \geq 0$ for all $s, a, s'$
Normalization: For each $(s, a)$ pair, the probabilities over next states sum to 1: $$\sum_{s' \in \mathcal{S}} P(s' | s, a) = 1$$
Markov property: The distribution depends only on $(s, a)$, not on any prior history
These properties ensure $P(\cdot | s, a)$ defines a valid probability distribution over next states.
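The first two properties can be checked mechanically from a stored transition array; the Markov property is a modeling assumption and cannot be verified from the kernel alone. A minimal sketch, using an invented two-state example:

```python
import numpy as np

def is_valid_transition_kernel(P: np.ndarray, tol: float = 1e-9) -> bool:
    """Check that P[s, a, :] is a probability distribution for every (s, a).

    P has shape (n_states, n_actions, n_states), P[s, a, s2] = P(s2 | s, a).
    """
    non_negative = bool(np.all(P >= -tol))                         # property 1
    normalized = bool(np.allclose(P.sum(axis=-1), 1.0, atol=tol))  # property 2
    return non_negative and normalized

# Two states, one action: stay with prob 0.3, switch with prob 0.7
P_good = np.array([[[0.3, 0.7]],
                   [[0.7, 0.3]]])
P_bad = np.array([[[0.5, 0.7]],    # row sums to 1.2: not a distribution
                  [[0.7, 0.3]]])

print(is_valid_transition_kernel(P_good))  # True
print(is_valid_transition_kernel(P_bad))   # False
```

Running such a check whenever a transition table is constructed by hand catches normalization bugs early.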
You may encounter different notations: $T(s, a, s')$, $p(s' | s, a)$, or $\mathcal{P}^a_{ss'}$. Some sources use $P_a(s, s')$ to emphasize the action-indexed family of matrices. The meaning is always the same: the probability of the next state given current state and action.
Transition dynamics fall into two fundamental categories based on whether outcomes are certain or uncertain:
Deterministic Dynamics:
For each $(s, a)$ pair, there is exactly one possible next state:
$$P(s' | s, a) = \begin{cases} 1 & \text{if } s' = f(s, a) \\ 0 & \text{otherwise} \end{cases}$$
where $f: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$ is a deterministic transition function.
Examples: Chess (moves have exact outcomes), Rubik's cube, deterministic gridworld, many board games.
Stochastic Dynamics:
Each $(s, a)$ pair induces a probability distribution over possible next states:
$$0 < P(s' | s, a) < 1 \text{ for at least two states } s'$$
Examples: Dice games, slippery gridworld, real-world robotics (noise), card games with shuffled decks.
```python
import numpy as np
from typing import Tuple, Dict
from enum import IntEnum

class Action(IntEnum):
    LEFT = 0
    DOWN = 1
    RIGHT = 2
    UP = 3

# ================================================
# DETERMINISTIC TRANSITIONS: Standard Gridworld
# ================================================
class DeterministicGridworld:
    """
    4x4 gridworld with deterministic dynamics.
    Action always produces the intended movement (unless hitting a wall).
    """
    def __init__(self, grid_size: int = 4):
        self.size = grid_size
        self.n_states = grid_size * grid_size
        self.n_actions = 4

    def transition(self, state: int, action: int) -> int:
        """
        Deterministic transition function: s' = f(s, a).
        Returns exactly one next state.
        """
        row, col = state // self.size, state % self.size
        if action == Action.UP:
            row = max(0, row - 1)
        elif action == Action.DOWN:
            row = min(self.size - 1, row + 1)
        elif action == Action.LEFT:
            col = max(0, col - 1)
        elif action == Action.RIGHT:
            col = min(self.size - 1, col + 1)
        return row * self.size + col

    def transition_prob(self, state: int, action: int, next_state: int) -> float:
        """
        P(s' | s, a) for a deterministic environment.
        Returns 1.0 if s' = f(s, a), else 0.0.
        """
        if next_state == self.transition(state, action):
            return 1.0
        return 0.0

# ================================================
# STOCHASTIC TRANSITIONS: Slippery Gridworld
# ================================================
class StochasticGridworld:
    """
    4x4 gridworld with stochastic (slippery) dynamics.
    Agent moves in the intended direction with prob 0.8,
    slips to a perpendicular direction with prob 0.1 each.
    """
    def __init__(self, grid_size: int = 4, slip_prob: float = 0.1):
        self.size = grid_size
        self.n_states = grid_size * grid_size
        self.n_actions = 4
        self.slip_prob = slip_prob
        self.intended_prob = 1.0 - 2 * slip_prob  # 0.8 by default

    def _move(self, state: int, action: int) -> int:
        """Helper: compute next state for a given action."""
        row, col = state // self.size, state % self.size
        if action == Action.UP:
            row = max(0, row - 1)
        elif action == Action.DOWN:
            row = min(self.size - 1, row + 1)
        elif action == Action.LEFT:
            col = max(0, col - 1)
        elif action == Action.RIGHT:
            col = min(self.size - 1, col + 1)
        return row * self.size + col

    def get_perpendicular_actions(self, action: int) -> Tuple[int, int]:
        """Get the two perpendicular slip directions."""
        if action in [Action.UP, Action.DOWN]:
            return Action.LEFT, Action.RIGHT
        else:
            return Action.UP, Action.DOWN

    def transition_probs(self, state: int, action: int) -> Dict[int, float]:
        """
        Return the full distribution P(s' | s, a) as
        {next_state: probability} for all reachable states.
        """
        probs = {}
        # Intended direction (prob = 0.8)
        intended_next = self._move(state, action)
        probs[intended_next] = probs.get(intended_next, 0) + self.intended_prob
        # Slip directions (prob = 0.1 each)
        slip_left, slip_right = self.get_perpendicular_actions(action)
        next_slip_left = self._move(state, slip_left)
        probs[next_slip_left] = probs.get(next_slip_left, 0) + self.slip_prob
        next_slip_right = self._move(state, slip_right)
        probs[next_slip_right] = probs.get(next_slip_right, 0) + self.slip_prob
        return probs

    def sample_transition(self, state: int, action: int) -> int:
        """
        Sample a next state according to the transition probabilities.
        This is what happens during actual environment interaction.
        """
        probs = self.transition_probs(state, action)
        states = list(probs.keys())
        probabilities = list(probs.values())
        return np.random.choice(states, p=probabilities)

# ================================================
# DEMONSTRATION
# ================================================
if __name__ == "__main__":
    det = DeterministicGridworld()
    stoch = StochasticGridworld()

    state = 5  # Middle of grid
    action = Action.RIGHT

    print("=== Deterministic Gridworld ===")
    next_det = det.transition(state, action)
    print(f"State {state}, Action RIGHT → Always state {next_det}")
    print(f"P(s'={next_det} | s={state}, a=RIGHT) = {det.transition_prob(state, action, next_det)}")

    print("=== Stochastic Gridworld ===")
    probs = stoch.transition_probs(state, action)
    print(f"State {state}, Action RIGHT → Distribution:")
    for s_prime, p in sorted(probs.items()):
        print(f"  P(s'={s_prime} | s={state}, a=RIGHT) = {p:.2f}")

    # Simulate 1000 transitions to verify probabilities
    samples = [stoch.sample_transition(state, action) for _ in range(1000)]
    empirical = {s: samples.count(s) / 1000 for s in set(samples)}
    print(f"Empirical frequencies (1000 samples): {empirical}")
```

For finite MDPs (finite state and action spaces), we can represent the transition function as a collection of transition matrices. For each action $a \in \mathcal{A}$, we define:
$$\mathbf{P}^a \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}$$
where the $(i, j)$ entry is:
$$\mathbf{P}^a_{ij} = P(s_j | s_i, a)$$
This matrix has important properties:

Stochastic matrix properties: every entry is non-negative, and every row sums to 1; row $i$ is exactly the distribution $P(\cdot \mid s_i, a)$.
A matrix satisfying these properties is called a row-stochastic matrix (or right stochastic matrix).
Why the matrix view matters:

Matrix representation enables vectorized expectation computations (e.g., $\mathbf{p}' = \mathbf{p}\mathbf{P}^a$ propagates a distribution over states one step forward), multi-step reasoning via matrix powers, and analysis of long-run behavior with standard linear algebra.
```python
import numpy as np

class TransitionMatrices:
    """
    Represents MDP transitions as a collection of stochastic matrices,
    one matrix P^a for each action a.
    """
    def __init__(self, n_states: int, n_actions: int):
        self.n_states = n_states
        self.n_actions = n_actions
        # Initialize with zeros; will be populated.
        # Shape: (n_actions, n_states, n_states)
        # P[a, i, j] = P(s_j | s_i, a)
        self.P = np.zeros((n_actions, n_states, n_states))

    def set_transition(self, action: int, from_state: int,
                       to_state: int, prob: float):
        """Set P(to_state | from_state, action) = prob"""
        self.P[action, from_state, to_state] = prob

    def get_matrix(self, action: int) -> np.ndarray:
        """Get the transition matrix for a specific action."""
        return self.P[action]

    def verify_stochastic(self) -> bool:
        """
        Verify all matrices are valid row-stochastic matrices.
        Returns True if valid, raises an error otherwise.
        """
        for a in range(self.n_actions):
            Pa = self.P[a]
            # Check non-negativity
            if np.any(Pa < 0):
                raise ValueError(f"Matrix P^{a} has negative entries")
            # Check row sums
            row_sums = Pa.sum(axis=1)
            if not np.allclose(row_sums, 1.0):
                raise ValueError(
                    f"Matrix P^{a} rows don't sum to 1: {row_sums}"
                )
        return True

    def expected_next_state_distribution(
        self, state_dist: np.ndarray, action: int
    ) -> np.ndarray:
        """
        Given a distribution over current states and an action,
        compute the distribution over next states.

        p(s') = Σ_s p(s) · P(s' | s, a)
        In matrix form: p' = p @ P^a
        """
        return state_dist @ self.P[action]

    def n_step_transition(self, action: int, n_steps: int) -> np.ndarray:
        """
        Compute the n-step transition matrix for a repeated action.
        (P^a)^n gives the probability of reaching s' from s in n steps.
        """
        return np.linalg.matrix_power(self.P[action], n_steps)

# ================================================
# Example: Simple 3-State MDP
# ================================================
"""
States: {0: Start, 1: Middle, 2: Goal}
Actions: {0: Stay, 1: Advance}

Transition dynamics:
- Stay: 90% stay in place, 10% slip forward
- Advance: 70% move forward, 20% stay, 10% slip backward
"""

mdp = TransitionMatrices(n_states=3, n_actions=2)

# Action 0: Stay
#     To:    0    1    2
# From 0: [0.9, 0.1, 0.0]
# From 1: [0.0, 0.9, 0.1]
# From 2: [0.0, 0.0, 1.0]  (Goal is absorbing)
mdp.P[0] = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0]
])

# Action 1: Advance
#     To:    0    1    2
# From 0: [0.2, 0.7, 0.1]
# From 1: [0.1, 0.2, 0.7]
# From 2: [0.0, 0.0, 1.0]  (Goal is absorbing)
mdp.P[1] = np.array([
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.0, 0.0, 1.0]
])

# Verify matrices are valid
mdp.verify_stochastic()
print("✓ All transition matrices are valid row-stochastic matrices")

# Example: Start in state 0, take the Advance action
print("Starting in state 0, action=Advance:")
print(f"  P(next=0) = {mdp.P[1, 0, 0]:.1f}")
print(f"  P(next=1) = {mdp.P[1, 0, 1]:.1f}")
print(f"  P(next=2) = {mdp.P[1, 0, 2]:.1f}")

# Multi-step transitions: P(reach goal in exactly 2 advances from start)
P_advance = mdp.P[1]
P_2step = P_advance @ P_advance  # Matrix multiplication
print("2-step transition probs from state 0 with Advance:")
print(f"  P(s'=0 in 2 steps) = {P_2step[0, 0]:.3f}")
print(f"  P(s'=1 in 2 steps) = {P_2step[0, 1]:.3f}")
print(f"  P(s'=2 in 2 steps) = {P_2step[0, 2]:.3f}")
```

A key insight: $(\mathbf{P}^a)^n_{ij}$ gives the probability of reaching state $j$ from state $i$ in exactly $n$ steps, all taking action $a$. This follows from the Chapman-Kolmogorov equations and enables efficient computation of long-horizon probabilities.
For efficient computation, we often store all transition matrices in a single 3D tensor:
$$\mathbf{P} \in \mathbb{R}^{|\mathcal{A}| \times |\mathcal{S}| \times |\mathcal{S}|}$$
where $\mathbf{P}[a, s, s'] = P(s' | s, a)$.
Memory considerations:

For an MDP with $|\mathcal{S}|$ states and $|\mathcal{A}|$ actions, the dense tensor stores $|\mathcal{A}| \cdot |\mathcal{S}|^2$ probabilities ($4|\mathcal{A}||\mathcal{S}|^2$ bytes in float32), which grows quadratically with the number of states.
Sparse representations:
Many real MDPs have sparse transitions—each $(s, a)$ pair leads to only a few possible next states. Sparse matrix formats can reduce storage from $O(|\mathcal{S}|^2)$ to $O(k \cdot |\mathcal{S}|)$ where $k$ is the average number of reachable states.
| States | Actions | Dense (float32) | Sparse k=4 |
|---|---|---|---|
| 100 | 4 | 160 KB | 6.4 KB |
| 1,000 | 4 | 16 MB | 64 KB |
| 10,000 | 4 | 1.6 GB | 640 KB |
| 100,000 | 4 | 160 GB | 6.4 MB |
| 1,000,000 | 4 | 16 TB | 64 MB |
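The table's figures follow from simple arithmetic. A sketch reproducing a few rows, assuming 4-byte float32 entries and counting only the $k$ stored probabilities per $(s, a)$ row (real sparse formats also store index arrays, which this estimate ignores):

```python
def dense_bytes(n_states: int, n_actions: int, bytes_per_entry: int = 4) -> int:
    # One |S| x |S| matrix per action, every entry stored explicitly
    return n_actions * n_states * n_states * bytes_per_entry

def sparse_bytes(n_states: int, n_actions: int, k: int = 4,
                 bytes_per_entry: int = 4) -> int:
    # Only the k reachable next states per (s, a) pair
    return n_actions * n_states * k * bytes_per_entry

print(dense_bytes(1_000, 4))                # 16_000_000 bytes = 16 MB
print(sparse_bytes(1_000, 4))               # 64_000 bytes = 64 KB
print(dense_bytes(1_000_000, 4) // 10**12)  # 16 (TB)
```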
For continuous state spaces, explicit transition matrices are impossible. Instead, we work with transition dynamics functions $f(s, a, \epsilon)$ where $\epsilon$ represents noise, or we learn approximate models. This is a fundamental limitation of tabular methods.
In practice, the transition function may be known exactly, learned from data, or left unmodeled entirely. Let's examine each scenario:
Known Dynamics:

When $P(s'|s,a)$ is known for all $(s, a, s')$, we can plan directly: dynamic programming methods such as value iteration compute optimal policies without ever interacting with the environment.
Learned Dynamics (World Models):

When transitions must be learned from experience, we fit an approximate model $\hat{P}$ to observed $(s, a, s')$ samples, by frequency counting in tabular settings or with function approximation in large or continuous spaces, and then plan against the learned model.
Unknown Dynamics (Model-Free):

When we don't explicitly model transitions, algorithms such as Q-learning estimate values directly from sampled experience; the environment itself plays the role of the model.
```python
import numpy as np
from collections import defaultdict

# ================================================
# APPROACH 1: Exact Model (Known Dynamics)
# ================================================
class KnownDynamics:
    """
    When we have the true transition probabilities.
    Example: a game with known rules.
    """
    def __init__(self, transition_matrix: np.ndarray):
        # P[a, s, s'] = P(s' | s, a)
        self.P = transition_matrix

    def get_probs(self, state: int, action: int) -> np.ndarray:
        """Return the exact distribution P(· | s, a)."""
        return self.P[action, state, :]

    def expected_next_value(
        self, state: int, action: int, values: np.ndarray
    ) -> float:
        """
        Compute E[V(s') | s, a] = Σ_{s'} P(s'|s,a) V(s').
        This is used in the Bellman backup.
        """
        return np.dot(self.P[action, state, :], values)

# ================================================
# APPROACH 2: Learned Model (Tabular MLE)
# ================================================
class LearnedTabularDynamics:
    """
    Learn transition probabilities from experience.
    Uses maximum likelihood estimation (frequency counting).
    """
    def __init__(self, n_states: int, n_actions: int):
        self.n_states = n_states
        self.n_actions = n_actions
        # Count transitions: N[a][s][s'] = count of (s,a) → s'
        self.counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        # Total transitions from each (s, a)
        self.totals = defaultdict(lambda: defaultdict(int))

    def update(self, state: int, action: int, next_state: int):
        """
        Update the model with an observed transition.
        Called after each environment step.
        """
        self.counts[action][state][next_state] += 1
        self.totals[action][state] += 1

    def get_probs(self, state: int, action: int) -> np.ndarray:
        """
        Return estimated P(· | s, a) using MLE:
        P̂(s' | s, a) = N(s, a, s') / N(s, a)
        """
        probs = np.zeros(self.n_states)
        total = self.totals[action][state]
        if total == 0:
            # No data: assume uniform (or use a prior)
            return np.ones(self.n_states) / self.n_states
        for s_prime, count in self.counts[action][state].items():
            probs[s_prime] = count / total
        return probs

    def get_confidence(self, state: int, action: int) -> float:
        """
        Return confidence in the estimate (based on sample count).
        More samples = higher confidence.
        """
        return self.totals[action][state]

# ================================================
# APPROACH 3: Learned Model (Neural Network)
# ================================================
class NeuralDynamicsModel:
    """
    Learn dynamics with function approximation,
    for continuous/large state spaces.

    Predicts: s_{t+1} = f_θ(s_t, a_t) + ε
    """
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        import torch
        import torch.nn as nn
        self.model = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            # Output: mean and log-variance of next state
            nn.Linear(hidden_dim, state_dim * 2)  # [μ, log σ²]
        )
        self.state_dim = state_dim
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=1e-3)

    def predict(self, state: np.ndarray, action: np.ndarray):
        """
        Predict the next-state distribution.
        Returns (mean, std) of the Gaussian prediction.
        """
        import torch
        x = np.concatenate([state, action])
        x = torch.FloatTensor(x).unsqueeze(0)
        out = self.model(x)
        mean = out[0, :self.state_dim]
        log_var = out[0, self.state_dim:]
        std = torch.exp(0.5 * log_var)
        return mean.detach().numpy(), std.detach().numpy()

    def train_step(
        self, states: np.ndarray, actions: np.ndarray, next_states: np.ndarray
    ) -> float:
        """
        Train on a batch of transitions.
        Minimizes the Gaussian NLL: -log p(s' | s, a; θ)
        """
        import torch
        # Concatenate inputs
        x = np.concatenate([states, actions], axis=1)
        x = torch.FloatTensor(x)
        targets = torch.FloatTensor(next_states)
        # Forward pass
        out = self.model(x)
        mean = out[:, :self.state_dim]
        log_var = out[:, self.state_dim:]
        # Gaussian NLL loss (up to the constant 0.5·log 2π):
        # -log p(y|x) = 0.5 * [log(2π) + log σ² + (y-μ)²/σ²]
        var = torch.exp(log_var)
        nll = 0.5 * (log_var + (targets - mean)**2 / var)
        loss = nll.mean()
        # Backward pass
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()

# ================================================
# Demonstration: Model Learning
# ================================================
def demo_model_learning():
    """Simulate an environment and learn its dynamics."""
    # True (unknown to the learner) transition probabilities
    # for a simple 3-state MDP
    true_P = np.array([
        # Action 0
        [[0.7, 0.2, 0.1],
         [0.1, 0.7, 0.2],
         [0.0, 0.1, 0.9]],
        # Action 1
        [[0.1, 0.8, 0.1],
         [0.2, 0.1, 0.7],
         [0.1, 0.1, 0.8]]
    ])

    # Learner's model
    learner = LearnedTabularDynamics(n_states=3, n_actions=2)

    # Collect experience
    n_episodes = 100
    for _ in range(n_episodes):
        state = 0
        for _ in range(10):
            action = np.random.randint(2)
            # Sample from true dynamics
            next_state = np.random.choice(3, p=true_P[action, state])
            # Update the learner's model
            learner.update(state, action, next_state)
            state = next_state

    # Compare learned vs true
    print("Learned vs True P(s'|s=0, a=0):")
    learned = learner.get_probs(0, 0)
    true = true_P[0, 0]
    for s in range(3):
        print(f"  s'={s}: learned={learned[s]:.3f}, true={true[s]:.3f}")

demo_model_learning()
```

The presence or absence of a transition model defines two major branches of RL:
Model-Based Reinforcement Learning: learns or is given an explicit model $P(s'|s,a)$ and uses it to plan, simulating transitions without touching the real environment.

Model-Free Reinforcement Learning: never estimates $P$ explicitly; it learns value functions or policies directly from sampled transitions.
| Aspect | Model-Based | Model-Free |
|---|---|---|
| Transition model | Explicit $P(s'|s,a)$ | No model |
| Sample efficiency | High (simulate cheaply) | Low (real samples only) |
| Computational cost | High (planning) | Low (no planning) |
| Asymptotic performance | Limited by model accuracy | Can reach optimal Q* |
| Generalization | Structured, can transfer | Task-specific |
| When to use | Expensive real interaction | Cheap interaction, complex dynamics |
In practice, hybrid approaches often work best. The Dyna architecture (Sutton, 1990) interleaves real experience collection with model-based simulated experience. This captures benefits of both: real experience corrects model errors, while simulated experience accelerates learning.
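A minimal tabular Dyna-Q sketch of this interleaving; the two-action chain environment and all hyperparameters below are invented for illustration, not taken from Sutton's paper:

```python
import random
from collections import defaultdict

def dyna_q(env_step, n_actions, episodes=200, horizon=20,
           alpha=0.1, gamma=0.9, eps=0.1, planning_steps=10):
    """Dyna-Q: every real step does a Q-learning update, records the
    observed outcome in a table model, then replays `planning_steps`
    simulated transitions drawn from that model."""
    Q = defaultdict(float)   # Q[(state, action)]
    model = {}               # model[(state, action)] = (reward, next_state)
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda a_: Q[(s, a_)])
            r, s2 = env_step(s, a)
            # (1) Direct RL update from real experience
            best_next = max(Q[(s2, a_)] for a_ in range(n_actions))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            # (2) Model learning: remember the last observed outcome
            model[(s, a)] = (r, s2)
            # (3) Planning: Q-learning updates on remembered transitions
            for _ in range(planning_steps):
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))
                best = max(Q[(ps2, a_)] for a_ in range(n_actions))
                Q[(ps, pa)] += alpha * (pr + gamma * best - Q[(ps, pa)])
            s = s2
    return Q

# Toy 3-state chain: action 1 advances toward state 2, action 0 stays;
# reaching (or remaining in) state 2 pays reward 1.
def env_step(s, a):
    s2 = min(2, s + 1) if a == 1 else s
    return (1.0 if s2 == 2 else 0.0), s2

random.seed(0)
Q = dyna_q(env_step, n_actions=2)
print(Q[(0, 1)] > Q[(0, 0)])  # advancing should look better from the start
```

The planning loop reuses each real transition many times, which is exactly where the sample-efficiency gain of model-based methods comes from.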
Understanding why transitions are stochastic helps in designing better state representations and algorithms. Common sources:
1. Inherent Environmental Randomness: some processes are genuinely random from the agent's perspective, such as dice rolls, shuffled decks, sensor noise, or other agents' unpredictable choices.

2. Partial Observability: if the state omits variables that influence the outcome, a deterministic underlying process appears stochastic; the hidden variables' effects show up as noise.

3. Modeling Simplification: even when fine-grained dynamics are deterministic, we often aggregate complex details (friction, turbulence, crowd behavior) into a noise term to keep the model tractable.

4. Discretization Effects: binning a continuous state space means different underlying states map to the same discrete state, so the same discrete $(s, a)$ can lead to different discrete next states.
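The partial-observability source deserves a concrete illustration. Below is a sketch of a fully deterministic two-bit system, invented for this example, that looks stochastic once the second bit is hidden:

```python
from collections import Counter

# Full state: (visible, hidden). Fully deterministic dynamics:
# the visible bit flips exactly when the hidden bit is 1,
# and the hidden bit toggles every step.
def full_step(visible: int, hidden: int):
    return visible ^ hidden, 1 - hidden

# An observer who sees only `visible` cannot predict the next
# observation from visible=0: it depends on the hidden bit.
outcomes = Counter()
for hidden0 in (0, 1):
    next_visible, _ = full_step(0, hidden0)
    outcomes[next_visible] += 1

print(dict(outcomes))  # both next observations occur: looks stochastic
```

From the observer's point of view, the "transition probabilities" from visible=0 are just the distribution of the hidden bit, which is why enriching the state representation can remove apparent stochasticity.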
Many problems have special structure in their transition dynamics that can be exploited for efficient solution:
Factored MDPs:
State is a vector of features $s = (x_1, \ldots, x_n)$. Transitions factor: $$P(s' | s, a) = \prod_{i=1}^{n} P(x'_i | \text{parents}(x_i), a)$$
This compact representation enables dynamic Bayesian network methods.
Linear Dynamics:
In continuous control, dynamics may be linear: $$s_{t+1} = As_t + Ba_t + \epsilon$$
LQR (Linear Quadratic Regulator) then yields the optimal policy in closed form.
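A sketch of that closed form via the finite-horizon backward Riccati recursion; the double-integrator matrices and cost weights below are illustrative choices, not canonical values:

```python
import numpy as np

def lqr_finite_horizon(A, B, Q, R, T):
    """Backward Riccati recursion for the cost sum of x'Qx + u'Ru over
    T steps. Returns time-indexed gains K_t for the policy u_t = -K_t x_t."""
    P = Q.copy()   # cost-to-go matrix, initialized at the terminal cost
    gains = []
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return list(reversed(gains))   # gains[0] applies at the first time step

# Double integrator: state = [position, velocity], control = acceleration
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2)            # penalize deviation of position and velocity
R = np.array([[0.1]])    # penalize control effort
gains = lqr_finite_horizon(A, B, Q, R, T=80)

# Closed-loop rollout from x0 = [1, 0]: the gains drive the state toward 0
x = np.array([[1.0], [0.0]])
for K in gains:
    u = -K @ x
    x = A @ x + B @ u
print(np.linalg.norm(x))
```

Because the recursion only involves matrix algebra, the optimal controller is computed without any sampling, which is exactly what known linear dynamics buy you.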
Locality:
Each action only affects a small "neighborhood" of the state space. Common in spatial domains (robotics, games with local moves). $$P(s' | s, a) = 0 \text{ unless } s' \text{ is "close" to } s$$
Reversibility:
For every transition $s \rightarrow s'$, there exists a reverse transition $s' \rightarrow s$. Enables efficient exploration via backtracking.
```python
import numpy as np

# ================================================
# Example: Factored MDP (Inventory Management)
# ================================================
"""
State: (inventory_A, inventory_B) ∈ {0..10} × {0..10}
Action: (order_A, order_B) ∈ {0..5} × {0..5}

Factored dynamics:
- inventory'_A depends only on inventory_A, order_A, demand_A
- inventory'_B depends only on inventory_B, order_B, demand_B
(The two products are independent.)

Full state space: 11 × 11 = 121 states
Factored representation: 2 independent 11-state problems
"""

class FactoredInventoryMDP:
    def __init__(self, max_inventory: int = 10):
        self.max_inv = max_inventory
        # Demand distribution for each product (demand of 0-5 units)
        self.demand_probs = np.array([0.1, 0.2, 0.3, 0.2, 0.1, 0.1])

    def single_product_transition(
        self, inventory: int, order: int
    ) -> np.ndarray:
        """
        P(inventory' | inventory, order) for a single product.
        inventory' = max(0, min(max_inv, inventory + order - demand))
        """
        probs = np.zeros(self.max_inv + 1)
        for demand, p_demand in enumerate(self.demand_probs):
            new_inv = max(0, min(self.max_inv, inventory + order - demand))
            probs[new_inv] += p_demand
        return probs

    def factored_transition(
        self, state: tuple, action: tuple
    ) -> dict:
        """
        P(s' | s, a) using the factored structure.
        Returns a sparse dict {state: probability}.

        Complexity: O(n_vals^2) instead of O(n_vals^4) for the full model.
        """
        inv_a, inv_b = state
        order_a, order_b = action
        probs_a = self.single_product_transition(inv_a, order_a)
        probs_b = self.single_product_transition(inv_b, order_b)
        result = {}
        for new_inv_a, p_a in enumerate(probs_a):
            for new_inv_b, p_b in enumerate(probs_b):
                if p_a > 0 and p_b > 0:
                    result[(new_inv_a, new_inv_b)] = p_a * p_b
        return result

# ================================================
# Example: Linear Dynamics (Simple Robot)
# ================================================
"""
State: [position, velocity] ∈ ℝ²
Action: acceleration ∈ [-1, 1]

Linear dynamics:
s_{t+1} = A @ s_t + B @ a_t + noise

where
  A = [[1, dt],   # position += velocity * dt
       [0, 1]]    # velocity stays (without action)
  B = [[0],       # action doesn't directly affect position
       [dt]]      # action affects velocity
"""

class LinearRobotDynamics:
    def __init__(self, dt: float = 0.1, noise_std: float = 0.01):
        self.dt = dt
        self.noise_std = noise_std
        # State transition matrix
        self.A = np.array([
            [1.0, dt],
            [0.0, 1.0]
        ])
        # Control matrix
        self.B = np.array([
            [0.0],
            [dt]
        ])

    def transition(
        self, state: np.ndarray, action: float
    ) -> tuple:
        """
        Sample next state: s' = As + Ba + ε.
        Returns (mean, sampled state).
        """
        action_vec = np.array([[action]])
        mean = self.A @ state + (self.B @ action_vec).flatten()
        noise = np.random.normal(0, self.noise_std, size=2)
        sampled = mean + noise
        return mean, sampled

    def transition_covariance(self) -> np.ndarray:
        """Return the covariance matrix of the transition noise."""
        return np.eye(2) * self.noise_std**2

# Demo
robot = LinearRobotDynamics()
state = np.array([0.0, 0.0])  # At origin, zero velocity
action = 1.0  # Accelerate

print("Linear dynamics demo:")
for t in range(5):
    mean, state = robot.transition(state, action)
    print(f"  t={t+1}: position={state[0]:.3f}, velocity={state[1]:.3f}")
```

We have deeply explored the transition probability function—the dynamic heart of any MDP.
This component determines how actions affect the world and fundamentally shapes which algorithms can solve a problem efficiently.
What's Next:
With state transitions understood, we turn to the reward signal—the scalar feedback that defines what the agent should optimize. The reward function transforms the dynamics into a decision problem with clear objectives.
You now have a deep understanding of transition dynamics in MDPs—from formal definitions to practical implementations, from deterministic to stochastic, from model-based to model-free paradigms. This knowledge is essential for understanding how RL algorithms learn to navigate uncertain environments.