How much history does an agent need to remember? In principle, the entire past could matter—every state visited, every action taken, every reward received. But this would make decision-making computationally intractable.
The Markov property provides a powerful simplification: the future depends only on the present, not on how we got there. If the current state contains all relevant information, we can forget the past and still make optimal decisions.
This property is simultaneously an assumption, a design goal, and a mathematical foundation. Understanding it deeply—when it holds, when it fails, and how to work around failures—is essential for building effective RL systems.
By the end of this page, you'll understand the formal definition of the Markov property, its role in making RL tractable, how to recognize Markov violations, the distinction between MDPs and POMDPs, and practical strategies for handling non-Markovian domains.
A stochastic process has the Markov property if the conditional probability distribution of future states depends only on the present state, not on the sequence of events that preceded it.
Formal Statement
A state $S_t$ is Markov if and only if:
$$P(S_{t+1} | S_t, A_t) = P(S_{t+1} | S_0, A_0, S_1, A_1, \ldots, S_t, A_t)$$
Equivalently, for rewards: $$P(R_{t+1} | S_t, A_t) = P(R_{t+1} | S_0, A_0, \ldots, S_t, A_t)$$
In words: given the present state and action, the future is independent of the past. The present state is a sufficient statistic for the history.
Information-Theoretic View
The Markov property means the present state contains all information from the past that's relevant to predicting the future:
$$I(S_{t+1}; S_0, \ldots, S_{t-1} | S_t, A_t) = 0$$
No mutual information between the future and the past given the present.
The Markov state compresses all relevant history into a fixed representation. No matter how long the episode has been running, the state contains everything needed for optimal decision-making. This compression is what makes dynamic programming and RL algorithms efficient.
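A minimal simulation (toy dynamics invented for illustration) makes this concrete: a particle whose velocity occasionally flips sign is Markov in the pair (position, velocity), but position alone is not, because the next position depends on a quantity that must be inferred from the past.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_steps=10000, flip_prob=0.1):
    """Particle with velocity in {-1, +1} that flips sign with small probability.

    The full state (position, velocity) is Markov; position alone is not.
    """
    xs, vs = [0], [1]
    for _ in range(n_steps):
        v = vs[-1] if rng.random() > flip_prob else -vs[-1]
        xs.append(xs[-1] + v)
        vs.append(v)
    return np.array(xs), np.array(vs)

xs, vs = simulate()

# Given only the current position, the probability that x moves up next
# depends strongly on the PREVIOUS step's direction: a Markov violation.
steps = np.diff(xs)
up_prev_up = np.mean(steps[1:][steps[:-1] > 0] > 0)      # close to 0.9
up_prev_down = np.mean(steps[1:][steps[:-1] < 0] > 0)    # close to 0.1
print(f"P(up | prev up)   ≈ {up_prev_up:.2f}")
print(f"P(up | prev down) ≈ {up_prev_down:.2f}")
```

Conditioning on the velocity (the compressed, relevant history) removes the dependence on earlier positions entirely.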
Examples

Markov States:
- A chess board position: it fully determines which continuations are possible, regardless of move order.
- A pendulum's angle and angular velocity: together they determine the future trajectory.
- A robot's complete joint angles and joint velocities.

Non-Markov Observations:
- A single Atari frame: positions are visible, but velocities must be inferred from earlier frames.
- A camera image with occluded objects: hidden state must be remembered.
- The current price of a stock when trends and momentum carry information.
A Markov Decision Process is the mathematical framework for decision-making with Markov states. It's defined by a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
- $\mathcal{S}$: the set of states
- $\mathcal{A}$: the set of actions
- $P(s' \mid s, a)$: the transition function, giving the probability of reaching $s'$ from $s$ under action $a$
- $R(s, a, s')$: the reward function
- $\gamma \in [0, 1)$: the discount factor
Key Properties of MDPs
```python
import numpy as np
from dataclasses import dataclass
from collections import defaultdict
from typing import Tuple, List, Optional


@dataclass
class MDP:
    """
    Markov Decision Process specification.

    Finite MDP with known dynamics (for educational purposes).
    """
    n_states: int
    n_actions: int
    transitions: np.ndarray  # P[s, a, s'] = P(s'|s, a)
    rewards: np.ndarray      # R[s, a, s'] = R(s, a, s')
    gamma: float = 0.99
    initial_state_dist: Optional[np.ndarray] = None

    def __post_init__(self):
        # Validate dimensions
        assert self.transitions.shape == (
            self.n_states, self.n_actions, self.n_states
        ), "Transition matrix shape mismatch"

        # Validate probability distributions
        for s in range(self.n_states):
            for a in range(self.n_actions):
                assert np.isclose(
                    self.transitions[s, a].sum(), 1.0
                ), f"P(·|s={s}, a={a}) doesn't sum to 1"

        # Default uniform initial state distribution
        if self.initial_state_dist is None:
            self.initial_state_dist = np.ones(self.n_states) / self.n_states

    def step(self, state: int, action: int) -> Tuple[int, float]:
        """
        Sample next state and reward from MDP dynamics.

        This is how the MDP presents itself to the agent:
        as a black box that can be queried.
        """
        # Sample next state from transition distribution
        next_state = np.random.choice(
            self.n_states, p=self.transitions[state, action]
        )
        # Get reward
        reward = self.rewards[state, action, next_state]
        return next_state, reward

    def get_transition_matrix(self, policy: np.ndarray) -> np.ndarray:
        """
        Compute transition matrix under a given policy.

        P^π[s, s'] = sum_a π(a|s) P(s'|s, a)

        Useful for theoretical analysis and policy evaluation.
        """
        # policy.shape = (n_states, n_actions)
        # Result shape = (n_states, n_states)
        P_pi = np.zeros((self.n_states, self.n_states))
        for s in range(self.n_states):
            for a in range(self.n_actions):
                P_pi[s] += policy[s, a] * self.transitions[s, a]
        return P_pi

    def check_markov_property(self, trajectories: List[List[Tuple]]) -> dict:
        """
        Empirically verify Markov property from trajectory data.

        For each (s, a) pair, check if P(s'|s, a) is consistent
        regardless of what came before.

        Returns statistics on Markov consistency.
        """
        # (s, a) -> {history_hash -> [next_states]}
        transitions_by_history = defaultdict(lambda: defaultdict(list))

        for trajectory in trajectories:
            for t, (s, a, r, s_next) in enumerate(trajectory):
                # Hash the history (simplified: just length and prev state)
                if t == 0:
                    history_hash = "start"
                else:
                    history_hash = f"len_{t}_prev_{trajectory[t-1][0]}"
                transitions_by_history[(s, a)][history_hash].append(s_next)

        # Compare distributions across histories
        results = {
            'state_action_pairs': len(transitions_by_history),
            'consistent_pairs': 0,
            'inconsistent_pairs': 0,
            'details': []
        }

        for (s, a), histories in transitions_by_history.items():
            if len(histories) < 2:
                continue

            # Compare transition distributions
            all_next_states = []
            for hist, next_states in histories.items():
                all_next_states.extend(next_states)

            # Simple consistency check: similar distributions?
            # (A full test would use statistical hypothesis testing)
            results['consistent_pairs'] += 1  # Simplified

        return results


def value_iteration(mdp: MDP, tolerance: float = 1e-6) -> Tuple[np.ndarray, np.ndarray]:
    """
    Compute optimal value function and policy for known MDP.

    V*(s) = max_a sum_{s'} P(s'|s,a) [R(s,a,s') + γV*(s')]

    This is only possible because the MDP is fully specified.
    """
    V = np.zeros(mdp.n_states)

    while True:
        V_new = np.zeros(mdp.n_states)
        for s in range(mdp.n_states):
            # Compute Q(s, a) for each action
            q_values = np.zeros(mdp.n_actions)
            for a in range(mdp.n_actions):
                for s_next in range(mdp.n_states):
                    p = mdp.transitions[s, a, s_next]
                    r = mdp.rewards[s, a, s_next]
                    q_values[a] += p * (r + mdp.gamma * V[s_next])
            # V*(s) = max_a Q*(s, a)
            V_new[s] = np.max(q_values)

        # Check convergence
        if np.max(np.abs(V_new - V)) < tolerance:
            break
        V = V_new

    # Extract optimal policy
    policy = np.zeros((mdp.n_states, mdp.n_actions))
    for s in range(mdp.n_states):
        q_values = np.zeros(mdp.n_actions)
        for a in range(mdp.n_actions):
            for s_next in range(mdp.n_states):
                p = mdp.transitions[s, a, s_next]
                r = mdp.rewards[s, a, s_next]
                q_values[a] += p * (r + mdp.gamma * V[s_next])
        best_action = np.argmax(q_values)
        policy[s, best_action] = 1.0

    return V, policy
```

The Markov property isn't just a mathematical convenience—it's the foundation that makes RL tractable. Without it, our standard algorithms and theory would not apply.
1. Enables Optimal Stationary Policies
In an MDP, there exists a stationary optimal policy: a single mapping from states to actions that's optimal regardless of when or how we reached the state.
Without Markov: The optimal action might depend on the entire history. Policy space explodes to infinite dimensions.
2. Enables Recursive Value Functions
The Bellman equation relates $V(s)$ to $V(s')$—value of current state to values of next states. This recursive structure enables:
- Dynamic programming (value iteration, policy iteration)
- Temporal-difference learning, which bootstraps from one-step targets
- Efficient computation: solving for $|\mathcal{S}|$ values instead of one value per possible history
Without Markov: Value would depend on history: $V(s_0, a_0, s_1, \ldots, s_t)$. No recursion, no efficient computation.
| Property | With Markov | Without Markov |
|---|---|---|
| Optimal policy form | Stationary: $\pi(a \mid s)$ | History-dependent: $\pi(a \mid s_0, \ldots, s_t)$ |
| Value function | $V(s)$: finite table/function | $V(h)$: infinite over histories |
| Policy search space | Finite (for finite $\mathcal{S}$, $\mathcal{A}$) | Infinite (all history mappings) |
| Dynamic programming | Bellman recursion works | No recursive structure |
| Sample complexity | Depends on $\|\mathcal{S}\|$, $\|\mathcal{A}\|$ | Can be unbounded |
| Convergence guarantees | Strong (VI, PI, Q-learning) | Weaker or no guarantees |
3. Enables Off-Policy Learning
Q-learning's ability to learn about the optimal policy while following any behavior policy relies on the Markov property. The update: $$Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$$
works because the value of $(s', a')$ doesn't depend on how we got to $s'$.
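A minimal tabular sketch of this update (illustrative values, not a full training loop):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One off-policy Q-learning update.

    The bootstrap target r + γ max_a' Q(s', a') is valid no matter which
    behavior policy produced this transition, but only because the Markov
    property guarantees Q(s', a') does not depend on how s' was reached.
    """
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy usage: 3 states, 2 actions, Q initialized to zero
Q = np.zeros((3, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
# target = 1.0 + 0.99 * 0 = 1.0, so Q[0, 1] moves to 0.1
```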
4. Enables Experience Replay
Storing transitions $(s, a, r, s')$ and replaying them out of order assumes that these transitions would look the same regardless of when they occurred. This is only true under the Markov property.
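A minimal replay buffer sketch (class name and capacity are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Store (s, a, r, s', done) tuples and sample them out of order.

    Replaying old transitions is sound only if the state is Markov:
    each stored tuple must mean the same thing whenever it is replayed.
    """
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=10)
for i in range(15):
    buf.push(i, 0, 1.0, i + 1, False)
batch = buf.sample(4)  # 4 transitions drawn out of order
```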
5. Enables Generalization
When we approximate $V(s)$ with a neural network, we're assuming similar states have similar values. This generalization relies on states being self-contained—their value doesn't depend on unobserved history.
When RL algorithms fail in practice, Markov violations are a common culprit. The algorithm is correct, but the state representation doesn't satisfy the assumption the algorithm requires. Always ask: does my state contain enough information?
Many real-world problems violate the Markov property. When the agent receives observations rather than true states, we enter the realm of Partially Observable MDPs (POMDPs).
POMDP Definition
A POMDP extends an MDP with:
- An observation space $\Omega$
- An observation function $O(o \mid s)$, the probability of observing $o$ in state $s$
The agent doesn't see state $s$ directly—only observation $o \sim O(\cdot | s)$.
Common Sources of Partial Observability
- Sensor limitations: noise, limited range, occlusion
- Latent variables: hidden system modes, other agents' internal states
- Temporal information: velocities and trends not visible in a single snapshot
The Belief State
One approach to POMDPs: maintain a belief state—a probability distribution over possible true states given the observation history:
$$b(s) = P(S_t = s | o_0, a_0, \ldots, o_t)$$
The belief state IS Markov: updated via Bayes' rule using the observation model. But the belief space is continuous and high-dimensional (one dimension per state), making exact solutions intractable for large state spaces.
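The Bayes-rule update can be written in a few lines. This is a sketch for a tiny discrete POMDP with made-up numbers; `T` and `O` follow the conventions used above:

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Exact Bayes-filter update of a POMDP belief state.

    b: current belief over states, shape (n_states,)
    a: action taken;  o: observation received
    T: transition tensor, T[s, a, s'] = P(s' | s, a)
    O: observation matrix, O[s, o] = P(o | s)

    Returns b'(s') ∝ O[s', o] * sum_s T[s, a, s'] * b(s).
    """
    predicted = b @ T[:, a, :]   # predict: sum_s b(s) P(s'|s, a)
    new_b = O[:, o] * predicted  # correct with observation likelihood
    return new_b / new_b.sum()   # normalize

# Tiny 2-state example: sticky dynamics, 80%-accurate sensor
T = np.zeros((2, 1, 2))
T[:, 0, :] = [[0.9, 0.1], [0.1, 0.9]]
O = np.array([[0.8, 0.2], [0.2, 0.8]])

b = np.array([0.5, 0.5])
b = belief_update(b, a=0, o=0, T=T, O=O)
print(b)  # observing o=0 shifts belief toward state 0: [0.8, 0.2]
```

The updated belief depends only on the previous belief, the action, and the observation, which is exactly why the belief state is Markov.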
Practical POMDP Approaches
Pure POMDPs are computationally hard, but most problems are 'almost Markov'—some history helps, but not infinite history. Adding a few frames of memory or a small RNN often captures the relevant information. The art is finding the minimal history length that makes the problem approximately Markov.
How do you know if your state representation is Markov? Here are practical diagnostic techniques:
1. Transition Consistency Test
For each (state, action) pair, examine the distribution of next states. If this distribution varies depending on what came before, the state isn't Markov.
Formally: Compare $P(s' | s, a, h_1)$ vs $P(s' | s, a, h_2)$ for different histories $h_1, h_2$.
2. Value Prediction Test
Train a value function $V(s)$. If the Bellman error $\delta = r + \gamma V(s') - V(s)$ is systematically correlated with history features not in $s$, those features should be included.
3. Policy Performance Gap
Compare a memoryless policy $\pi(a|s)$ to a history-dependent policy $\pi(a|s_{t-k:t})$. If the history-dependent policy significantly outperforms, the state is missing information.
4. Mutual Information Analysis
Measure $I(s_{t+1}; s_{t-k} | s_t, a_t)$. Non-zero mutual information between future and past given present indicates Markov violation.
```python
import numpy as np
from collections import defaultdict
from typing import List, Tuple, Dict
from scipy import stats


def test_transition_consistency(
    trajectories: List[List[Tuple]],
    state_discretizer=None,
    significance_level: float = 0.05
) -> Dict:
    """
    Test Markov property by checking if transition distributions
    are consistent across different histories.

    For each (s, a) pair, collect transitions under different
    historical contexts and test if distributions differ significantly.

    Returns:
        Dictionary with test results and problematic state-action pairs
    """
    # Group transitions by (state, action) and recent history
    # (s, a) -> {history_context -> [next_states]}
    transitions = defaultdict(lambda: defaultdict(list))

    for trajectory in trajectories:
        for t, (s, a, r, s_next) in enumerate(trajectory):
            # Create history context (e.g., previous state if t > 0)
            if t > 0:
                prev_s = trajectory[t - 1][0]
                context = f"prev_{prev_s}"
            else:
                context = "start"

            # Discretize continuous states if needed
            if state_discretizer:
                s = state_discretizer(s)
                s_next = state_discretizer(s_next)

            transitions[(s, a)][context].append(s_next)

    # Test each (s, a) pair for consistency
    results = {
        'num_pairs_tested': 0,
        'num_violations': 0,
        'violation_rate': 0.0,
        'violations': []
    }

    for (s, a), contexts in transitions.items():
        if len(contexts) < 2:
            continue  # Need at least 2 contexts to compare

        results['num_pairs_tested'] += 1

        # Get samples from different contexts
        context_samples = list(contexts.values())

        # Chi-square test for discrete, KS test for continuous
        # Here we use chi-square assuming discrete next states
        try:
            # Pool all next states to get categories
            all_next = [ns for samples in context_samples for ns in samples]
            unique_next = list(set(all_next))

            if len(unique_next) < 2:
                continue

            # Count frequencies per context
            counts = []
            for samples in context_samples:
                ctx_counts = [samples.count(ns) for ns in unique_next]
                if sum(ctx_counts) >= 5:  # Min samples for chi-square
                    counts.append(ctx_counts)

            if len(counts) >= 2:
                # Chi-square test of homogeneity
                _, p_value, _, _ = stats.chi2_contingency(counts)

                if p_value < significance_level:
                    results['num_violations'] += 1
                    results['violations'].append({
                        'state': s,
                        'action': a,
                        'p_value': p_value,
                        'contexts': list(contexts.keys())
                    })
        except Exception:
            pass  # Handle edge cases gracefully

    if results['num_pairs_tested'] > 0:
        results['violation_rate'] = (
            results['num_violations'] / results['num_pairs_tested']
        )

    return results


def test_value_prediction_residuals(
    value_fn,
    trajectories: List[List[Tuple]],
    gamma: float = 0.99
) -> Dict:
    """
    Check if TD errors correlate with history features.

    If V(s) is accurate and state is Markov, TD errors should be
    zero-mean noise uncorrelated with history. If TD errors correlate
    with past states/actions, those features should be included
    in the state.
    """
    td_errors = []
    history_features = []

    for trajectory in trajectories:
        for t, (s, a, r, s_next) in enumerate(trajectory):
            # Compute TD error
            v_s = value_fn(s)
            v_s_next = value_fn(s_next)
            td_error = r + gamma * v_s_next - v_s
            td_errors.append(td_error)

            # Extract history features
            if t > 0:
                prev_s, prev_a, _, _ = trajectory[t - 1]
                # Simple features: was previous state different?
                features = {
                    'prev_state_diff': float(prev_s != s),
                    'episode_position': t
                }
            else:
                features = {
                    'prev_state_diff': 0.0,
                    'episode_position': 0
                }
            history_features.append(features)

    # Analyze correlations
    td_errors = np.array(td_errors)
    results = {
        'mean_td_error': np.mean(td_errors),
        'std_td_error': np.std(td_errors),
        'correlations': {}
    }

    for feature_name in history_features[0].keys():
        feature_values = np.array([f[feature_name] for f in history_features])
        correlation = np.corrcoef(td_errors, feature_values)[0, 1]
        results['correlations'][feature_name] = correlation

    # Flag significant correlations
    results['potential_issues'] = [
        name for name, corr in results['correlations'].items()
        if abs(corr) > 0.1
    ]

    return results


def compare_memoryless_vs_history_policy(
    env,
    memoryless_policy,
    history_policy,
    n_episodes: int = 100
) -> Dict:
    """
    Compare performance of memoryless vs history-dependent policy.

    Large gap suggests state is missing important information.
    """
    def evaluate_policy(policy, uses_history: bool):
        returns = []
        for _ in range(n_episodes):
            state = env.reset()
            history = [state] if uses_history else None
            episode_return = 0
            done = False

            while not done:
                if uses_history:
                    action = policy(history)
                else:
                    action = policy(state)

                next_state, reward, done, _ = env.step(action)
                episode_return += reward

                if uses_history:
                    history.append(action)
                    history.append(next_state)
                state = next_state

            returns.append(episode_return)
        return np.mean(returns), np.std(returns)

    memoryless_mean, memoryless_std = evaluate_policy(
        memoryless_policy, uses_history=False
    )
    history_mean, history_std = evaluate_policy(
        history_policy, uses_history=True
    )

    return {
        'memoryless_return': memoryless_mean,
        'memoryless_std': memoryless_std,
        'history_return': history_mean,
        'history_std': history_std,
        'improvement': history_mean - memoryless_mean,
        'significant': (history_mean - memoryless_mean) > 2 * memoryless_std
    }
```

When the environment isn't fully observable, we need strategies to recover approximate Markovianity.
1. State Augmentation
Add features to the state until it becomes sufficiently Markov:
- Derivatives (velocity, acceleration) computed from recent observations
- Recent actions, if their effects are delayed
- Counters and timers (time since last event, steps remaining)
This is the simplest approach: expand the state until the problem becomes approximately Markov.
2. Frame Stacking
For sequential observations (especially images), stack recent frames: $$o_t^{\text{stacked}} = [o_{t-k+1}, \ldots, o_t]$$
DQN's classic approach: stack 4 Atari frames. This allows inferring velocity from position changes.
3. Recurrent Neural Networks
Use LSTM or GRU to process observation history: $$h_t = \text{RNN}(o_t, h_{t-1})$$
The hidden state $h_t$ serves as an approximate belief state, encoding relevant history.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205
```python
import torch
import torch.nn as nn
from collections import deque
import numpy as np
from typing import Tuple


class FrameStack:
    """
    Stack recent observations to create pseudo-Markov state.

    Classic approach in Atari: 4-frame stack allows inferring
    velocity from position changes across frames.
    """

    def __init__(self, n_frames: int = 4):
        self.n_frames = n_frames
        self.frames = deque(maxlen=n_frames)

    def reset(self, initial_obs: np.ndarray):
        """Initialize with copies of first observation."""
        self.frames.clear()
        for _ in range(self.n_frames):
            self.frames.append(initial_obs)
        return self._get_stacked()

    def add(self, obs: np.ndarray) -> np.ndarray:
        """Add new observation, return stacked state."""
        self.frames.append(obs)
        return self._get_stacked()

    def _get_stacked(self) -> np.ndarray:
        """Stack frames along channel dimension."""
        return np.concatenate(list(self.frames), axis=-1)


class RecurrentPolicy(nn.Module):
    """
    LSTM-based policy for POMDPs.

    The hidden state maintains a learned compression of relevant
    history, approximating a belief state.

    Key insight: LSTM learns WHAT to remember, not fixed history.
    """

    def __init__(self,
                 obs_dim: int,
                 action_dim: int,
                 hidden_dim: int = 128,
                 n_layers: int = 1):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers

        # Observation encoder
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
        )

        # Recurrent core
        self.lstm = nn.LSTM(
            input_size=hidden_dim,
            hidden_size=hidden_dim,
            num_layers=n_layers,
            batch_first=True
        )

        # Policy head
        self.policy_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

    def forward(self,
                obs: torch.Tensor,
                hidden: Tuple[torch.Tensor, torch.Tensor] = None
                ) -> Tuple[torch.Tensor, Tuple]:
        """
        Process observation and update hidden state.

        Args:
            obs: Current observation [batch, obs_dim] or [batch, seq, obs_dim]
            hidden: (h, c) LSTM hidden state

        Returns:
            logits: Action logits
            hidden: Updated (h, c) hidden state
        """
        # Initialize hidden state if not provided
        if hidden is None:
            batch_size = obs.shape[0]
            hidden = self.init_hidden(batch_size)

        # Encode observation
        if obs.dim() == 2:
            obs = obs.unsqueeze(1)  # Add sequence dimension
        x = self.encoder(obs)

        # Process through LSTM
        lstm_out, hidden = self.lstm(x, hidden)

        # Get last timestep output
        if lstm_out.dim() == 3:
            lstm_out = lstm_out[:, -1, :]

        # Produce action logits
        logits = self.policy_head(lstm_out)

        return logits, hidden

    def init_hidden(self, batch_size: int) -> Tuple[torch.Tensor, torch.Tensor]:
        """Initialize LSTM hidden state to zeros."""
        device = next(self.parameters()).device
        h = torch.zeros(self.n_layers, batch_size, self.hidden_dim, device=device)
        c = torch.zeros(self.n_layers, batch_size, self.hidden_dim, device=device)
        return (h, c)

    def act(self,
            obs: torch.Tensor,
            hidden: Tuple[torch.Tensor, torch.Tensor] = None,
            deterministic: bool = False
            ) -> Tuple[int, Tuple]:
        """Select action and return updated hidden state."""
        with torch.no_grad():
            logits, hidden = self.forward(obs.unsqueeze(0), hidden)
            if deterministic:
                action = logits.argmax(dim=-1)
            else:
                action = torch.distributions.Categorical(logits=logits).sample()
        return action.item(), hidden


class TransformerPolicy(nn.Module):
    """
    Transformer-based policy for POMDPs.

    Attends over the full observation history (or a window).
    More flexible than RNN but higher computational cost.

    Used in Decision Transformer and similar architectures.
    """

    def __init__(self,
                 obs_dim: int,
                 action_dim: int,
                 hidden_dim: int = 128,
                 n_heads: int = 4,
                 n_layers: int = 2,
                 max_seq_len: int = 100):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.max_seq_len = max_seq_len

        # Observation embedding
        self.obs_embedding = nn.Linear(obs_dim, hidden_dim)

        # Positional encoding
        self.pos_encoding = nn.Parameter(
            torch.randn(1, max_seq_len, hidden_dim) * 0.1
        )

        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim,
            nhead=n_heads,
            dim_feedforward=hidden_dim * 4,
            batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, n_layers)

        # Policy head
        self.policy_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, obs_sequence: torch.Tensor) -> torch.Tensor:
        """
        Process observation sequence, output action for last timestep.

        Args:
            obs_sequence: [batch, seq_len, obs_dim]

        Returns:
            logits: Action logits for last timestep [batch, action_dim]
        """
        seq_len = obs_sequence.shape[1]

        # Embed observations
        x = self.obs_embedding(obs_sequence)

        # Add positional encoding
        x = x + self.pos_encoding[:, :seq_len, :]

        # Generate causal mask (can only attend to past)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, device=x.device) * float('-inf'),
            diagonal=1
        )

        # Transform
        x = self.transformer(x, mask=mask)

        # Policy from last position
        logits = self.policy_head(x[:, -1, :])

        return logits
```

4. Attention Mechanisms
Transformers can attend to any part of the history, learning which past observations are relevant: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V$$
More flexible than RNNs (no fixed-length compression) but quadratic in sequence length.
5. Predictive State Representations
Learn state features that predict future observations: $$\phi(o_{0:t}) \text{ such that } P(o_{t+1:t+k} | o_{0:t}) \approx P(o_{t+1:t+k} | \phi(o_{0:t}))$$
If the representation predicts the future well, it captures the relevant past.
Choosing an Approach
- State augmentation: when you know which features are missing
- Frame stacking: when a short, fixed window of history suffices (e.g., inferring velocity)
- RNNs: when the relevant history has variable length and what to remember must be learned
- Attention/transformers: when long-range dependencies matter and the compute budget allows
- Predictive state representations: when you want features grounded in predictive power
Designing a good state representation is one of the most important decisions in RL system design. Here are principles for creating approximately Markov states:
Principle 1: Include Derivatives
If position matters, velocity probably matters too. If position and velocity matter, acceleration might matter.
$$s = [x, \dot{x}, \ddot{x}, \ldots]$$
Finite differences can approximate derivatives from history.
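A sketch of this augmentation using `np.gradient` (the constant-velocity trajectory is made up for the demo):

```python
import numpy as np

def augment_with_derivatives(positions, dt=1.0):
    """Build [x, x_dot, x_ddot] states from a position-only observation
    stream via finite differences.

    A simple illustrative augmentation; in an online setting a short
    warm-up of past observations is needed before estimates are meaningful.
    """
    x = np.asarray(positions, dtype=float)
    v = np.gradient(x, dt)   # velocity ≈ first difference
    a = np.gradient(v, dt)   # acceleration ≈ second difference
    return np.stack([x, v, a], axis=-1)

# Constant-velocity trajectory x = 2t: velocity 2 everywhere, acceleration 0
states = augment_with_derivatives([0.0, 2.0, 4.0, 6.0, 8.0])
print(states.shape)  # (5, 3)
```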
Principle 2: Include Context
State should include any contextual information that affects dynamics:
- The current mode or task phase of the environment
- Time-varying parameters (time of day, remaining budget)
- Relevant properties of other agents
Principle 3: Include Relevant History
If recent actions affect future outcomes (actuator lag, cooldowns, momentum), include them: $$s_t = [o_t, a_{t-1}, a_{t-2}, \ldots, a_{t-k}]$$
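One way to sketch this is a wrapper that appends the last $k$ actions, one-hot encoded, to each observation. This assumes a gym-style `reset()`/`step()` interface with discrete actions; `ActionHistoryWrapper` and `_DummyEnv` are hypothetical names invented for the demo:

```python
from collections import deque
import numpy as np

class ActionHistoryWrapper:
    """Append the last k actions (one-hot) to each observation.

    Useful when actuator lag or cooldowns make recent actions part of
    the information needed to predict the next state.
    """
    def __init__(self, env, n_actions, k=2):
        self.env = env
        self.n_actions = n_actions
        self.k = k
        self.history = deque(maxlen=k)

    def _augment(self, obs):
        onehots = [np.eye(self.n_actions)[a] for a in self.history]
        return np.concatenate([np.atleast_1d(obs), *onehots])

    def reset(self):
        self.history.clear()
        for _ in range(self.k):
            self.history.append(0)  # pad with a default action
        return self._augment(self.env.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.history.append(action)
        return self._augment(obs), reward, done, info


class _DummyEnv:
    """Trivial stand-in environment for demonstration."""
    def reset(self):
        return np.array([0.0])
    def step(self, a):
        return np.array([float(a)]), 0.0, False, {}

env = ActionHistoryWrapper(_DummyEnv(), n_actions=3, k=2)
s = env.reset()            # 1 obs dim + 2 one-hot actions of size 3 -> length 7
s2, _, _, _ = env.step(2)  # last 3 entries now encode action 2: [0, 0, 1]
```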
Ask yourself: If I were the agent, with only the current state, could I make a good decision? Or would I need to ask 'what happened before?' If the latter, that information should be in the state.
In practice, the Markov property is a matter of degree, not binary. Here's how to think about it empirically:
Approximate Markov Is Often Enough
RL algorithms are robust to mild Markov violations. If the optimal action is 'usually' determinable from state, algorithms will 'usually' find good policies.
Trade-off: State Size vs. Markovianity
Larger states are more likely to be Markov but:
- Require more data to cover the state space (higher sample complexity)
- Slow learning and inflate model size
- Risk including distracting, irrelevant features
Find the sweet spot: enough information to be approximately Markov, not so much that learning is impractical.
Symptoms of Markov Violations
| Symptom | Likely Cause | Treatment |
|---|---|---|
| Policy oscillates in 'same' state | State aliasing | Add distinguishing features |
| Can't predict next state from current | Missing dynamics info | Add velocities/derivatives |
| Performance depends on episode history | Missing context | Include context features or use RNN |
| Works in some episodes, not others | Hidden modes | Identify and include mode indicators |
| Reward prediction error correlates with past | Missing reward-relevant info | Analyze what past info predicts reward |
As you add more history to the state, you approach Markovianity, but state dimensionality grows with it: a k-step history of observations o ∈ ℝ^n creates a state in ℝ^(n×k), and the data needed to cover that space grows exponentially with dimension. Be strategic about what history to include.
The Markov Property as an Idealization
No finite representation is perfectly Markov for a complex physical world. The true state of the universe is unfathomably rich. The Markov property should be understood as:
- An idealization we engineer toward, not a property we verify absolutely
- A design goal for state representations
- An approximation that is good enough when the remaining history dependence is negligible
Relationship to Sufficient Statistics
A Markov state is a sufficient statistic for the future given the past. This connects to core statistical concepts and explains why finding Markov representations is related to learning compact, useful features.
State Abstraction
MDP theory studies when multiple true states can be aggregated without losing optimality. If two states have the same optimal action and transition to the same abstract states under that action, they can be merged. This enables state compression while preserving Markovianity.
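A sketch of exact aggregation for a known finite MDP. The grouping criterion here is deliberately simple (identical transition rows and rewards); `aggregate_equivalent_states` is an illustrative helper, and a full treatment must also require that merged states map into the same abstract successors:

```python
import numpy as np
from collections import defaultdict

def aggregate_equivalent_states(transitions, rewards, decimals=8):
    """Group states that are behaviorally identical: same transition row
    and same reward for every action.

    transitions: P[s, a, s'];  rewards: R[s, a]
    Returns groups of states that are candidates for merging.
    """
    groups = defaultdict(list)
    for s in range(transitions.shape[0]):
        key = (
            tuple(np.round(transitions[s], decimals).ravel()),
            tuple(np.round(rewards[s], decimals).ravel()),
        )
        groups[key].append(s)
    return list(groups.values())

# 3 states, 1 action: states 0 and 1 behave identically; state 2 differs in reward
P = np.zeros((3, 1, 3))
P[0, 0] = [0.0, 0.0, 1.0]
P[1, 0] = [0.0, 0.0, 1.0]
P[2, 0] = [0.0, 0.0, 1.0]
R = np.array([[1.0], [1.0], [0.0]])

groups_out = aggregate_equivalent_states(P, R)
print(groups_out)  # [[0, 1], [2]]
```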
Model-Based RL and Markov
Model-based RL learns $P(s'|s, a)$ and $R(s, a)$. If the state isn't Markov, these functions become ill-defined:
- The same $(s, a)$ pair yields different next-state distributions depending on hidden history
- The learned model averages over those hidden contexts, producing biased predictions that compound during planning
Model-based methods are thus particularly sensitive to Markov violations.
The Information State
In POMDPs, the belief state $b_t$ (distribution over true states given observation history) is always Markov. The challenge is that $b_t$ lives in a continuous space of dimension $|\mathcal{S}| - 1$, which is intractable for large state spaces.
Recurrent neural networks implicitly learn a compressed belief state. The hidden state $h_t$ is an approximation to $b_t$ that's computationally tractable.
A process is 'Markov' (or 'has the Markov property') if the future is independent of the past given the present. A decision is 'Markovian' if it's based only on current state, not history. Optimal policies in MDPs are Markovian; optimal policies in POMDPs may not be.
The Markov property is the mathematical assumption that makes reinforcement learning tractable. It asserts that the present contains all information relevant to the future—enabling elegant algorithms, strong convergence guarantees, and efficient computation.
Module Complete: RL Fundamentals
With this page, we've completed Module 1: RL Fundamentals. You now understand:
- The agent-environment interaction paradigm
- The quantities that flow through it: states, actions, rewards, and returns
- The Markov property and the mathematical assumptions that make RL tractable
These concepts form the bedrock upon which all RL algorithms are built. The next module will formalize these ideas further as Markov Decision Processes and introduce the algorithmic machinery for finding optimal policies.
Congratulations! You've completed RL Fundamentals—the conceptual foundation of reinforcement learning. You understand the agent-environment paradigm, the quantities that flow through it, and the mathematical assumptions that make RL tractable. You're now ready to study MDPs in depth and learn the algorithms that solve them.