In the previous page, we established that an MDP is defined by a 5-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$. Now we focus on arguably the most complex component: the transition probability function $P$.
The transition function encapsulates how the world works—how states evolve in response to actions. These are the environment dynamics, the rules governing what happens when an agent acts. Understanding transitions deeply is essential: they determine which algorithms apply and how tractable a problem is.
By the end of this page, you will understand the transition probability function in depth—from its mathematical definition to its representation as transition matrices, from deterministic to stochastic dynamics, and from known to unknown models. You'll be able to reason about environment dynamics and understand why some problems are tractable while others require millions of samples.
The transition probability function (or transition kernel, or dynamics model) is denoted:
$$P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$$
$$P(s' | s, a) = \Pr(S_{t+1} = s' | S_t = s, A_t = a)$$
This reads: "The probability of transitioning to state $s'$, given that the agent is currently in state $s$ and takes action $a$."
Key properties that must hold:
Non-negativity: $P(s' | s, a) \geq 0$ for all $s, a, s'$
Normalization: For each $(s, a)$ pair, the probabilities over next states sum to 1: $$\sum_{s' \in \mathcal{S}} P(s' | s, a) = 1$$
Markov property: The distribution depends only on $(s, a)$, not on any prior history
These properties ensure $P(\cdot | s, a)$ defines a valid probability distribution over next states.
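The first two properties can be checked mechanically from a stored transition array; the Markov property is a modeling assumption and cannot be verified from the kernel alone. A minimal sketch, using an invented two-state example:

```python
import numpy as np

def is_valid_transition_kernel(P: np.ndarray, tol: float = 1e-9) -> bool:
    """Check that P[s, a, :] is a probability distribution for every (s, a).

    P has shape (n_states, n_actions, n_states), P[s, a, s2] = P(s2 | s, a).
    """
    non_negative = bool(np.all(P >= -tol))                         # property 1
    normalized = bool(np.allclose(P.sum(axis=-1), 1.0, atol=tol))  # property 2
    return non_negative and normalized

# Two states, one action: stay with prob 0.3, switch with prob 0.7
P_good = np.array([[[0.3, 0.7]],
                   [[0.7, 0.3]]])
P_bad = np.array([[[0.5, 0.7]],    # row sums to 1.2: not a distribution
                  [[0.7, 0.3]]])

print(is_valid_transition_kernel(P_good))  # True
print(is_valid_transition_kernel(P_bad))   # False
```

Running such a check whenever a transition table is constructed by hand catches normalization bugs early.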
You may encounter different notations: $T(s, a, s')$, $p(s' | s, a)$, or $\mathcal{P}^a_{ss'}$. Some sources use $P_a(s, s')$ to emphasize the action-indexed family of matrices. The meaning is always the same: the probability of the next state given current state and action.
Transition dynamics fall into two fundamental categories based on whether outcomes are certain or uncertain:
Deterministic Dynamics:
For each $(s, a)$ pair, there is exactly one possible next state:
$$P(s' | s, a) = \begin{cases} 1 & \text{if } s' = f(s, a) \\ 0 & \text{otherwise} \end{cases}$$
where $f: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$ is a deterministic transition function.
Examples: Chess (moves have exact outcomes), Rubik's cube, deterministic gridworld, many board games.
Stochastic Dynamics:
Each $(s, a)$ pair induces a probability distribution over possible next states:
$$0 < P(s' | s, a) < 1 \text{ for at least two states } s'$$
Examples: Dice games, slippery gridworld, real-world robotics (noise), card games with shuffled decks.
```python
import numpy as np
from typing import Tuple, Dict
from enum import IntEnum

class Action(IntEnum):
    LEFT = 0
    DOWN = 1
    RIGHT = 2
    UP = 3

# ================================================
# DETERMINISTIC TRANSITIONS: Standard Gridworld
# ================================================
class DeterministicGridworld:
    """
    4x4 gridworld with deterministic dynamics.
    Action always produces the intended movement (unless hitting a wall).
    """
    def __init__(self, grid_size: int = 4):
        self.size = grid_size
        self.n_states = grid_size * grid_size
        self.n_actions = 4

    def transition(self, state: int, action: int) -> int:
        """
        Deterministic transition function: s' = f(s, a).
        Returns exactly one next state.
        """
        row, col = state // self.size, state % self.size
        if action == Action.UP:
            row = max(0, row - 1)
        elif action == Action.DOWN:
            row = min(self.size - 1, row + 1)
        elif action == Action.LEFT:
            col = max(0, col - 1)
        elif action == Action.RIGHT:
            col = min(self.size - 1, col + 1)
        return row * self.size + col

    def transition_prob(self, state: int, action: int, next_state: int) -> float:
        """
        P(s' | s, a) for a deterministic environment.
        Returns 1.0 if s' = f(s, a), else 0.0.
        """
        if next_state == self.transition(state, action):
            return 1.0
        return 0.0

# ================================================
# STOCHASTIC TRANSITIONS: Slippery Gridworld
# ================================================
class StochasticGridworld:
    """
    4x4 gridworld with stochastic (slippery) dynamics.
    Agent moves in the intended direction with prob 0.8,
    slips to a perpendicular direction with prob 0.1 each.
    """
    def __init__(self, grid_size: int = 4, slip_prob: float = 0.1):
        self.size = grid_size
        self.n_states = grid_size * grid_size
        self.n_actions = 4
        self.slip_prob = slip_prob
        self.intended_prob = 1.0 - 2 * slip_prob  # 0.8 by default

    def _move(self, state: int, action: int) -> int:
        """Helper: compute next state for a given action."""
        row, col = state // self.size, state % self.size
        if action == Action.UP:
            row = max(0, row - 1)
        elif action == Action.DOWN:
            row = min(self.size - 1, row + 1)
        elif action == Action.LEFT:
            col = max(0, col - 1)
        elif action == Action.RIGHT:
            col = min(self.size - 1, col + 1)
        return row * self.size + col

    def get_perpendicular_actions(self, action: int) -> Tuple[int, int]:
        """Get the two perpendicular slip directions."""
        if action in [Action.UP, Action.DOWN]:
            return Action.LEFT, Action.RIGHT
        else:
            return Action.UP, Action.DOWN

    def transition_probs(self, state: int, action: int) -> Dict[int, float]:
        """
        Return the full distribution P(s' | s, a) as
        {next_state: probability} for all reachable states.
        """
        probs = {}
        # Intended direction (prob = 0.8)
        intended_next = self._move(state, action)
        probs[intended_next] = probs.get(intended_next, 0) + self.intended_prob
        # Slip directions (prob = 0.1 each)
        slip_left, slip_right = self.get_perpendicular_actions(action)
        next_slip_left = self._move(state, slip_left)
        probs[next_slip_left] = probs.get(next_slip_left, 0) + self.slip_prob
        next_slip_right = self._move(state, slip_right)
        probs[next_slip_right] = probs.get(next_slip_right, 0) + self.slip_prob
        return probs

    def sample_transition(self, state: int, action: int) -> int:
        """
        Sample a next state according to the transition probabilities.
        This is what happens during actual environment interaction.
        """
        probs = self.transition_probs(state, action)
        states = list(probs.keys())
        probabilities = list(probs.values())
        return np.random.choice(states, p=probabilities)

# ================================================
# DEMONSTRATION
# ================================================
if __name__ == "__main__":
    det = DeterministicGridworld()
    stoch = StochasticGridworld()

    state = 5  # Middle of grid
    action = Action.RIGHT

    print("=== Deterministic Gridworld ===")
    next_det = det.transition(state, action)
    print(f"State {state}, Action RIGHT → Always state {next_det}")
    print(f"P(s'={next_det} | s={state}, a=RIGHT) = {det.transition_prob(state, action, next_det)}")

    print("=== Stochastic Gridworld ===")
    probs = stoch.transition_probs(state, action)
    print(f"State {state}, Action RIGHT → Distribution:")
    for s_prime, p in sorted(probs.items()):
        print(f"  P(s'={s_prime} | s={state}, a=RIGHT) = {p:.2f}")

    # Simulate 1000 transitions to verify probabilities
    samples = [stoch.sample_transition(state, action) for _ in range(1000)]
    empirical = {s: samples.count(s) / 1000 for s in set(samples)}
    print(f"Empirical frequencies (1000 samples): {empirical}")
```

For finite MDPs (finite state and action spaces), we can represent the transition function as a collection of transition matrices. For each action $a \in \mathcal{A}$, we define:
$$\mathbf{P}^a \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}$$
where the $(i, j)$ entry is:
$$\mathbf{P}^a_{ij} = P(s_j | s_i, a)$$
This matrix has important properties:

Stochastic matrix properties: every entry is non-negative, and every row sums to 1; row $i$ is exactly the distribution $P(\cdot \mid s_i, a)$.
A matrix satisfying these properties is called a row-stochastic matrix (or right stochastic matrix).
Why the matrix view matters:

Matrix representation enables vectorized expectation computations (e.g., $\mathbf{p}' = \mathbf{p}\mathbf{P}^a$ propagates a distribution over states one step forward), multi-step reasoning via matrix powers, and analysis of long-run behavior with standard linear algebra.
```python
import numpy as np

class TransitionMatrices:
    """
    Represents MDP transitions as a collection of stochastic matrices,
    one matrix P^a for each action a.
    """
    def __init__(self, n_states: int, n_actions: int):
        self.n_states = n_states
        self.n_actions = n_actions
        # Initialize with zeros; will be populated.
        # Shape: (n_actions, n_states, n_states)
        # P[a, i, j] = P(s_j | s_i, a)
        self.P = np.zeros((n_actions, n_states, n_states))

    def set_transition(self, action: int, from_state: int,
                       to_state: int, prob: float):
        """Set P(to_state | from_state, action) = prob"""
        self.P[action, from_state, to_state] = prob

    def get_matrix(self, action: int) -> np.ndarray:
        """Get the transition matrix for a specific action."""
        return self.P[action]

    def verify_stochastic(self) -> bool:
        """
        Verify all matrices are valid row-stochastic matrices.
        Returns True if valid, raises an error otherwise.
        """
        for a in range(self.n_actions):
            Pa = self.P[a]
            # Check non-negativity
            if np.any(Pa < 0):
                raise ValueError(f"Matrix P^{a} has negative entries")
            # Check row sums
            row_sums = Pa.sum(axis=1)
            if not np.allclose(row_sums, 1.0):
                raise ValueError(
                    f"Matrix P^{a} rows don't sum to 1: {row_sums}"
                )
        return True

    def expected_next_state_distribution(
        self, state_dist: np.ndarray, action: int
    ) -> np.ndarray:
        """
        Given a distribution over current states and an action,
        compute the distribution over next states.

        p(s') = Σ_s p(s) · P(s' | s, a)
        In matrix form: p' = p @ P^a
        """
        return state_dist @ self.P[action]

    def n_step_transition(self, action: int, n_steps: int) -> np.ndarray:
        """
        Compute the n-step transition matrix for a repeated action.
        (P^a)^n gives the probability of reaching s' from s in n steps.
        """
        return np.linalg.matrix_power(self.P[action], n_steps)

# ================================================
# Example: Simple 3-State MDP
# ================================================
"""
States: {0: Start, 1: Middle, 2: Goal}
Actions: {0: Stay, 1: Advance}

Transition dynamics:
- Stay: 90% stay in place, 10% slip forward
- Advance: 70% move forward, 20% stay, 10% slip backward
"""

mdp = TransitionMatrices(n_states=3, n_actions=2)

# Action 0: Stay
#     To:    0    1    2
# From 0: [0.9, 0.1, 0.0]
# From 1: [0.0, 0.9, 0.1]
# From 2: [0.0, 0.0, 1.0]  (Goal is absorbing)
mdp.P[0] = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0]
])

# Action 1: Advance
#     To:    0    1    2
# From 0: [0.2, 0.7, 0.1]
# From 1: [0.1, 0.2, 0.7]
# From 2: [0.0, 0.0, 1.0]  (Goal is absorbing)
mdp.P[1] = np.array([
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.0, 0.0, 1.0]
])

# Verify matrices are valid
mdp.verify_stochastic()
print("✓ All transition matrices are valid row-stochastic matrices")

# Example: Start in state 0, take the Advance action
print("Starting in state 0, action=Advance:")
print(f"  P(next=0) = {mdp.P[1, 0, 0]:.1f}")
print(f"  P(next=1) = {mdp.P[1, 0, 1]:.1f}")
print(f"  P(next=2) = {mdp.P[1, 0, 2]:.1f}")

# Multi-step transitions: P(reach goal in exactly 2 advances from start)
P_advance = mdp.P[1]
P_2step = P_advance @ P_advance  # Matrix multiplication
print("2-step transition probs from state 0 with Advance:")
print(f"  P(s'=0 in 2 steps) = {P_2step[0, 0]:.3f}")
print(f"  P(s'=1 in 2 steps) = {P_2step[0, 1]:.3f}")
print(f"  P(s'=2 in 2 steps) = {P_2step[0, 2]:.3f}")
```

A key insight: $(\mathbf{P}^a)^n_{ij}$ gives the probability of reaching state $j$ from state $i$ in exactly $n$ steps, all taking action $a$. This follows from the Chapman-Kolmogorov equations and enables efficient computation of long-horizon probabilities.
For efficient computation, we often store all transition matrices in a single 3D tensor:
$$\mathbf{P} \in \mathbb{R}^{|\mathcal{A}| \times |\mathcal{S}| \times |\mathcal{S}|}$$
where $\mathbf{P}[a, s, s'] = P(s' | s, a)$.
Memory considerations:

For an MDP with $|\mathcal{S}|$ states and $|\mathcal{A}|$ actions, the dense tensor stores $|\mathcal{A}| \cdot |\mathcal{S}|^2$ probabilities ($4|\mathcal{A}||\mathcal{S}|^2$ bytes in float32), which grows quadratically with the number of states.
Sparse representations:
Many real MDPs have sparse transitions—each $(s, a)$ pair leads to only a few possible next states. Sparse matrix formats can reduce storage from $O(|\mathcal{S}|^2)$ to $O(k \cdot |\mathcal{S}|)$ where $k$ is the average number of reachable states.
| States | Actions | Dense (float32) | Sparse k=4 |
|---|---|---|---|
| 100 | 4 | 160 KB | 6.4 KB |
| 1,000 | 4 | 16 MB | 64 KB |
| 10,000 | 4 | 1.6 GB | 640 KB |
| 100,000 | 4 | 160 GB | 6.4 MB |
| 1,000,000 | 4 | 16 TB | 64 MB |
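The table's figures follow from simple arithmetic. A sketch reproducing a few rows, assuming 4-byte float32 entries and counting only the $k$ stored probabilities per $(s, a)$ row (real sparse formats also store index arrays, which this estimate ignores):

```python
def dense_bytes(n_states: int, n_actions: int, bytes_per_entry: int = 4) -> int:
    # One |S| x |S| matrix per action, every entry stored explicitly
    return n_actions * n_states * n_states * bytes_per_entry

def sparse_bytes(n_states: int, n_actions: int, k: int = 4,
                 bytes_per_entry: int = 4) -> int:
    # Only the k reachable next states per (s, a) pair
    return n_actions * n_states * k * bytes_per_entry

print(dense_bytes(1_000, 4))                # 16_000_000 bytes = 16 MB
print(sparse_bytes(1_000, 4))               # 64_000 bytes = 64 KB
print(dense_bytes(1_000_000, 4) // 10**12)  # 16 (TB)
```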
For continuous state spaces, explicit transition matrices are impossible. Instead, we work with transition dynamics functions $f(s, a, \epsilon)$ where $\epsilon$ represents noise, or we learn approximate models. This is a fundamental limitation of tabular methods.
In practice, the transition function may be known exactly, learned from data, or left unmodeled entirely. Let's examine each scenario:
Known Dynamics:

When $P(s'|s,a)$ is known for all $(s, a, s')$, we can plan directly: dynamic programming methods such as value iteration compute optimal policies without ever interacting with the environment.
Learned Dynamics (World Models):

When transitions must be learned from experience, we fit an approximate model $\hat{P}$ to observed $(s, a, s')$ samples, by frequency counting in tabular settings or with function approximation in large or continuous spaces, and then plan against the learned model.
Unknown Dynamics (Model-Free):

When we don't explicitly model transitions, algorithms such as Q-learning estimate values directly from sampled experience; the environment itself plays the role of the model.
```python
import numpy as np
from collections import defaultdict

# ================================================
# APPROACH 1: Exact Model (Known Dynamics)
# ================================================
class KnownDynamics:
    """
    When we have the true transition probabilities.
    Example: a game with known rules.
    """
    def __init__(self, transition_matrix: np.ndarray):
        # P[a, s, s'] = P(s' | s, a)
        self.P = transition_matrix

    def get_probs(self, state: int, action: int) -> np.ndarray:
        """Return the exact distribution P(· | s, a)."""
        return self.P[action, state, :]

    def expected_next_value(
        self, state: int, action: int, values: np.ndarray
    ) -> float:
        """
        Compute E[V(s') | s, a] = Σ_{s'} P(s'|s,a) V(s').
        This is used in the Bellman backup.
        """
        return np.dot(self.P[action, state, :], values)

# ================================================
# APPROACH 2: Learned Model (Tabular MLE)
# ================================================
class LearnedTabularDynamics:
    """
    Learn transition probabilities from experience.
    Uses maximum likelihood estimation (frequency counting).
    """
    def __init__(self, n_states: int, n_actions: int):
        self.n_states = n_states
        self.n_actions = n_actions
        # Count transitions: N[a][s][s'] = count of (s,a) → s'
        self.counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        # Total transitions from each (s, a)
        self.totals = defaultdict(lambda: defaultdict(int))

    def update(self, state: int, action: int, next_state: int):
        """
        Update the model with an observed transition.
        Called after each environment step.
        """
        self.counts[action][state][next_state] += 1
        self.totals[action][state] += 1

    def get_probs(self, state: int, action: int) -> np.ndarray:
        """
        Return estimated P(· | s, a) using MLE:
        P̂(s' | s, a) = N(s, a, s') / N(s, a)
        """
        probs = np.zeros(self.n_states)
        total = self.totals[action][state]
        if total == 0:
            # No data: assume uniform (or use a prior)
            return np.ones(self.n_states) / self.n_states
        for s_prime, count in self.counts[action][state].items():
            probs[s_prime] = count / total
        return probs

    def get_confidence(self, state: int, action: int) -> float:
        """
        Return confidence in the estimate (based on sample count).
        More samples = higher confidence.
        """
        return self.totals[action][state]

# ================================================
# APPROACH 3: Learned Model (Neural Network)
# ================================================
class NeuralDynamicsModel:
    """
    Learn dynamics with function approximation,
    for continuous/large state spaces.

    Predicts: s_{t+1} = f_θ(s_t, a_t) + ε
    """
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        import torch
        import torch.nn as nn
        self.model = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            # Output: mean and log-variance of next state
            nn.Linear(hidden_dim, state_dim * 2)  # [μ, log σ²]
        )
        self.state_dim = state_dim
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=1e-3)

    def predict(self, state: np.ndarray, action: np.ndarray):
        """
        Predict the next-state distribution.
        Returns (mean, std) of the Gaussian prediction.
        """
        import torch
        x = np.concatenate([state, action])
        x = torch.FloatTensor(x).unsqueeze(0)
        out = self.model(x)
        mean = out[0, :self.state_dim]
        log_var = out[0, self.state_dim:]
        std = torch.exp(0.5 * log_var)
        return mean.detach().numpy(), std.detach().numpy()

    def train_step(
        self, states: np.ndarray, actions: np.ndarray, next_states: np.ndarray
    ) -> float:
        """
        Train on a batch of transitions.
        Minimizes the Gaussian NLL: -log p(s' | s, a; θ)
        """
        import torch
        # Concatenate inputs
        x = np.concatenate([states, actions], axis=1)
        x = torch.FloatTensor(x)
        targets = torch.FloatTensor(next_states)
        # Forward pass
        out = self.model(x)
        mean = out[:, :self.state_dim]
        log_var = out[:, self.state_dim:]
        # Gaussian NLL loss (up to the constant 0.5·log 2π):
        # -log p(y|x) = 0.5 * [log(2π) + log σ² + (y-μ)²/σ²]
        var = torch.exp(log_var)
        nll = 0.5 * (log_var + (targets - mean)**2 / var)
        loss = nll.mean()
        # Backward pass
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()

# ================================================
# Demonstration: Model Learning
# ================================================
def demo_model_learning():
    """Simulate an environment and learn its dynamics."""
    # True (unknown to the learner) transition probabilities
    # for a simple 3-state MDP
    true_P = np.array([
        # Action 0
        [[0.7, 0.2, 0.1],
         [0.1, 0.7, 0.2],
         [0.0, 0.1, 0.9]],
        # Action 1
        [[0.1, 0.8, 0.1],
         [0.2, 0.1, 0.7],
         [0.1, 0.1, 0.8]]
    ])

    # Learner's model
    learner = LearnedTabularDynamics(n_states=3, n_actions=2)

    # Collect experience
    n_episodes = 100
    for _ in range(n_episodes):
        state = 0
        for _ in range(10):
            action = np.random.randint(2)
            # Sample from true dynamics
            next_state = np.random.choice(3, p=true_P[action, state])
            # Update the learner's model
            learner.update(state, action, next_state)
            state = next_state

    # Compare learned vs true
    print("Learned vs True P(s'|s=0, a=0):")
    learned = learner.get_probs(0, 0)
    true = true_P[0, 0]
    for s in range(3):
        print(f"  s'={s}: learned={learned[s]:.3f}, true={true[s]:.3f}")

demo_model_learning()
```

The presence or absence of a transition model defines two major branches of RL:
Model-Based Reinforcement Learning: learns or is given an explicit model $P(s'|s,a)$ and uses it to plan, simulating transitions without touching the real environment.

Model-Free Reinforcement Learning: never estimates $P$ explicitly; it learns value functions or policies directly from sampled transitions.
| Aspect | Model-Based | Model-Free |
|---|---|---|
| Transition model | Explicit $P(s'|s,a)$ | No model |
| Sample efficiency | High (simulate cheaply) | Low (real samples only) |
| Computational cost | High (planning) | Low (no planning) |
| Asymptotic performance | Limited by model accuracy | Can reach optimal Q* |
| Generalization | Structured, can transfer | Task-specific |
| When to use | Expensive real interaction | Cheap interaction, complex dynamics |
In practice, hybrid approaches often work best. The Dyna architecture (Sutton, 1990) interleaves real experience collection with model-based simulated experience. This captures benefits of both: real experience corrects model errors, while simulated experience accelerates learning.
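A minimal tabular Dyna-Q sketch of this interleaving; the two-action chain environment and all hyperparameters below are invented for illustration, not taken from Sutton's paper:

```python
import random
from collections import defaultdict

def dyna_q(env_step, n_actions, episodes=200, horizon=20,
           alpha=0.1, gamma=0.9, eps=0.1, planning_steps=10):
    """Dyna-Q: every real step does a Q-learning update, records the
    observed outcome in a table model, then replays `planning_steps`
    simulated transitions drawn from that model."""
    Q = defaultdict(float)   # Q[(state, action)]
    model = {}               # model[(state, action)] = (reward, next_state)
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda a_: Q[(s, a_)])
            r, s2 = env_step(s, a)
            # (1) Direct RL update from real experience
            best_next = max(Q[(s2, a_)] for a_ in range(n_actions))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            # (2) Model learning: remember the last observed outcome
            model[(s, a)] = (r, s2)
            # (3) Planning: Q-learning updates on remembered transitions
            for _ in range(planning_steps):
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))
                best = max(Q[(ps2, a_)] for a_ in range(n_actions))
                Q[(ps, pa)] += alpha * (pr + gamma * best - Q[(ps, pa)])
            s = s2
    return Q

# Toy 3-state chain: action 1 advances toward state 2, action 0 stays;
# reaching (or remaining in) state 2 pays reward 1.
def env_step(s, a):
    s2 = min(2, s + 1) if a == 1 else s
    return (1.0 if s2 == 2 else 0.0), s2

random.seed(0)
Q = dyna_q(env_step, n_actions=2)
print(Q[(0, 1)] > Q[(0, 0)])  # advancing should look better from the start
```

The planning loop reuses each real transition many times, which is exactly where the sample-efficiency gain of model-based methods comes from.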
Understanding why transitions are stochastic helps in designing better state representations and algorithms. Common sources:
1. Inherent Environmental Randomness: some processes are genuinely random from the agent's perspective, such as dice rolls, shuffled decks, sensor noise, or other agents' unpredictable choices.

2. Partial Observability: if the state omits variables that influence the outcome, a deterministic underlying process appears stochastic; the hidden variables' effects show up as noise.

3. Modeling Simplification: even when fine-grained dynamics are deterministic, we often aggregate complex details (friction, turbulence, crowd behavior) into a noise term to keep the model tractable.

4. Discretization Effects: binning a continuous state space means different underlying states map to the same discrete state, so the same discrete $(s, a)$ can lead to different discrete next states.
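The partial-observability source deserves a concrete illustration. Below is a sketch of a fully deterministic two-bit system, invented for this example, that looks stochastic once the second bit is hidden:

```python
from collections import Counter

# Full state: (visible, hidden). Fully deterministic dynamics:
# the visible bit flips exactly when the hidden bit is 1,
# and the hidden bit toggles every step.
def full_step(visible: int, hidden: int):
    return visible ^ hidden, 1 - hidden

# An observer who sees only `visible` cannot predict the next
# observation from visible=0: it depends on the hidden bit.
outcomes = Counter()
for hidden0 in (0, 1):
    next_visible, _ = full_step(0, hidden0)
    outcomes[next_visible] += 1

print(dict(outcomes))  # both next observations occur: looks stochastic
```

From the observer's point of view, the "transition probabilities" from visible=0 are just the distribution of the hidden bit, which is why enriching the state representation can remove apparent stochasticity.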
Many problems have special structure in their transition dynamics that can be exploited for efficient solution:
Factored MDPs:
State is a vector of features $s = (x_1, \ldots, x_n)$. Transitions factor: $$P(s' | s, a) = \prod_{i=1}^{n} P(x'_i | \text{parents}(x_i), a)$$
This compact representation enables dynamic Bayesian network methods.
Linear Dynamics:
In continuous control, dynamics may be linear: $$s_{t+1} = As_t + Ba_t + \epsilon$$
LQR (Linear Quadratic Regulator) then yields the optimal policy in closed form.
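A sketch of that closed form via the finite-horizon backward Riccati recursion; the double-integrator matrices and cost weights below are illustrative choices, not canonical values:

```python
import numpy as np

def lqr_finite_horizon(A, B, Q, R, T):
    """Backward Riccati recursion for the cost sum of x'Qx + u'Ru over
    T steps. Returns time-indexed gains K_t for the policy u_t = -K_t x_t."""
    P = Q.copy()   # cost-to-go matrix, initialized at the terminal cost
    gains = []
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return list(reversed(gains))   # gains[0] applies at the first time step

# Double integrator: state = [position, velocity], control = acceleration
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2)            # penalize deviation of position and velocity
R = np.array([[0.1]])    # penalize control effort
gains = lqr_finite_horizon(A, B, Q, R, T=80)

# Closed-loop rollout from x0 = [1, 0]: the gains drive the state toward 0
x = np.array([[1.0], [0.0]])
for K in gains:
    u = -K @ x
    x = A @ x + B @ u
print(np.linalg.norm(x))
```

Because the recursion only involves matrix algebra, the optimal controller is computed without any sampling, which is exactly what known linear dynamics buy you.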
Locality:
Each action only affects a small "neighborhood" of the state space. Common in spatial domains (robotics, games with local moves). $$P(s' | s, a) = 0 \text{ unless } s' \text{ is "close" to } s$$
Reversibility:
For every transition $s \rightarrow s'$, there exists a reverse transition $s' \rightarrow s$. Enables efficient exploration via backtracking.
```python
import numpy as np

# ================================================
# Example: Factored MDP (Inventory Management)
# ================================================
"""
State: (inventory_A, inventory_B) ∈ {0..10} × {0..10}
Action: (order_A, order_B) ∈ {0..5} × {0..5}

Factored dynamics:
- inventory'_A depends only on inventory_A, order_A, demand_A
- inventory'_B depends only on inventory_B, order_B, demand_B
(The two products are independent.)

Full state space: 11 × 11 = 121 states
Factored representation: 2 independent 11-state problems
"""

class FactoredInventoryMDP:
    def __init__(self, max_inventory: int = 10):
        self.max_inv = max_inventory
        # Demand distribution for each product (demand of 0-5 units)
        self.demand_probs = np.array([0.1, 0.2, 0.3, 0.2, 0.1, 0.1])

    def single_product_transition(
        self, inventory: int, order: int
    ) -> np.ndarray:
        """
        P(inventory' | inventory, order) for a single product.
        inventory' = max(0, min(max_inv, inventory + order - demand))
        """
        probs = np.zeros(self.max_inv + 1)
        for demand, p_demand in enumerate(self.demand_probs):
            new_inv = max(0, min(self.max_inv, inventory + order - demand))
            probs[new_inv] += p_demand
        return probs

    def factored_transition(
        self, state: tuple, action: tuple
    ) -> dict:
        """
        P(s' | s, a) using the factored structure.
        Returns a sparse dict {state: probability}.

        Complexity: O(n_vals^2) instead of O(n_vals^4) for the full model.
        """
        inv_a, inv_b = state
        order_a, order_b = action
        probs_a = self.single_product_transition(inv_a, order_a)
        probs_b = self.single_product_transition(inv_b, order_b)
        result = {}
        for new_inv_a, p_a in enumerate(probs_a):
            for new_inv_b, p_b in enumerate(probs_b):
                if p_a > 0 and p_b > 0:
                    result[(new_inv_a, new_inv_b)] = p_a * p_b
        return result

# ================================================
# Example: Linear Dynamics (Simple Robot)
# ================================================
"""
State: [position, velocity] ∈ ℝ²
Action: acceleration ∈ [-1, 1]

Linear dynamics:
s_{t+1} = A @ s_t + B @ a_t + noise

where
  A = [[1, dt],   # position += velocity * dt
       [0, 1]]    # velocity stays (without action)
  B = [[0],       # action doesn't directly affect position
       [dt]]      # action affects velocity
"""

class LinearRobotDynamics:
    def __init__(self, dt: float = 0.1, noise_std: float = 0.01):
        self.dt = dt
        self.noise_std = noise_std
        # State transition matrix
        self.A = np.array([
            [1.0, dt],
            [0.0, 1.0]
        ])
        # Control matrix
        self.B = np.array([
            [0.0],
            [dt]
        ])

    def transition(
        self, state: np.ndarray, action: float
    ) -> tuple:
        """
        Sample next state: s' = As + Ba + ε.
        Returns (mean, sampled state).
        """
        action_vec = np.array([[action]])
        mean = self.A @ state + (self.B @ action_vec).flatten()
        noise = np.random.normal(0, self.noise_std, size=2)
        sampled = mean + noise
        return mean, sampled

    def transition_covariance(self) -> np.ndarray:
        """Return the covariance matrix of the transition noise."""
        return np.eye(2) * self.noise_std**2

# Demo
robot = LinearRobotDynamics()
state = np.array([0.0, 0.0])  # At origin, zero velocity
action = 1.0  # Accelerate

print("Linear dynamics demo:")
for t in range(5):
    mean, state = robot.transition(state, action)
    print(f"  t={t+1}: position={state[0]:.3f}, velocity={state[1]:.3f}")
```

We have deeply explored the transition probability function—the dynamic heart of any MDP.
This component determines how actions affect the world and fundamentally shapes which algorithms can solve a problem efficiently.
What's Next:
With state transitions understood, we turn to the reward signal—the scalar feedback that defines what the agent should optimize. The reward function transforms the dynamics into a decision problem with clear objectives.
You now have a deep understanding of transition dynamics in MDPs—from formal definitions to practical implementations, from deterministic to stochastic, from model-based to model-free paradigms. This knowledge is essential for understanding how RL algorithms learn to navigate uncertain environments.