How much history does an agent need to remember? In principle, the entire past could matter—every state visited, every action taken, every reward received. But this would make decision-making computationally intractable.
The Markov property provides a powerful simplification: the future depends only on the present, not on how we got there. If the current state contains all relevant information, we can forget the past and still make optimal decisions.
This property is simultaneously an assumption, a design goal, and a mathematical foundation. Understanding it deeply—when it holds, when it fails, and how to work around failures—is essential for building effective RL systems.
By the end of this page, you'll understand the formal definition of the Markov property, its role in making RL tractable, how to recognize Markov violations, the distinction between MDPs and POMDPs, and practical strategies for handling non-Markovian domains.
A stochastic process has the Markov property if the conditional probability distribution of future states depends only on the present state, not on the sequence of events that preceded it.
Formal Statement
A state $S_t$ is Markov if and only if:
$$P(S_{t+1} | S_t, A_t) = P(S_{t+1} | S_0, A_0, S_1, A_1, \ldots, S_t, A_t)$$
Equivalently, for rewards: $$P(R_{t+1} | S_t, A_t) = P(R_{t+1} | S_0, A_0, \ldots, S_t, A_t)$$
In words: given the present state and action, the future is independent of the past. The present state is a sufficient statistic for the history.
Information-Theoretic View
The Markov property means the present state contains all information from the past that's relevant to predicting the future:
$$I(S_{t+1}; S_0, \ldots, S_{t-1} | S_t, A_t) = 0$$
No mutual information between the future and the past given the present.
The Markov state compresses all relevant history into a fixed representation. No matter how long the episode has been running, the state contains everything needed for optimal decision-making. This compression is what makes dynamic programming and RL algorithms efficient.
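A minimal simulation (toy dynamics invented for illustration) makes this concrete: a particle whose velocity occasionally flips sign is Markov in the pair (position, velocity), but position alone is not, because the next position depends on a quantity that must be inferred from the past.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_steps=10000, flip_prob=0.1):
    """Particle with velocity in {-1, +1} that flips sign with small probability.

    The full state (position, velocity) is Markov; position alone is not.
    """
    xs, vs = [0], [1]
    for _ in range(n_steps):
        v = vs[-1] if rng.random() > flip_prob else -vs[-1]
        xs.append(xs[-1] + v)
        vs.append(v)
    return np.array(xs), np.array(vs)

xs, vs = simulate()

# Given only the current position, the probability that x moves up next
# depends strongly on the PREVIOUS step's direction: a Markov violation.
steps = np.diff(xs)
up_prev_up = np.mean(steps[1:][steps[:-1] > 0] > 0)      # close to 0.9
up_prev_down = np.mean(steps[1:][steps[:-1] < 0] > 0)    # close to 0.1
print(f"P(up | prev up)   ≈ {up_prev_up:.2f}")
print(f"P(up | prev down) ≈ {up_prev_down:.2f}")
```

Conditioning on the velocity (the compressed, relevant history) removes the dependence on earlier positions entirely.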
Examples

Markov States:
- A chess board position: it fully determines which continuations are possible, regardless of move order.
- A pendulum's angle and angular velocity: together they determine the future trajectory.
- A robot's complete joint angles and joint velocities.

Non-Markov Observations:
- A single Atari frame: positions are visible, but velocities must be inferred from earlier frames.
- A camera image with occluded objects: hidden state must be remembered.
- The current price of a stock when trends and momentum carry information.
A Markov Decision Process is the mathematical framework for decision-making with Markov states. It's defined by a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
- $\mathcal{S}$: the set of states
- $\mathcal{A}$: the set of actions
- $P(s' \mid s, a)$: the transition function, giving the probability of reaching $s'$ from $s$ under action $a$
- $R(s, a, s')$: the reward function
- $\gamma \in [0, 1)$: the discount factor
Key Properties of MDPs
```python
import numpy as np
from dataclasses import dataclass
from collections import defaultdict
from typing import Tuple, List, Optional


@dataclass
class MDP:
    """
    Markov Decision Process specification.

    Finite MDP with known dynamics (for educational purposes).
    """
    n_states: int
    n_actions: int
    transitions: np.ndarray  # P[s, a, s'] = P(s'|s, a)
    rewards: np.ndarray      # R[s, a, s'] = R(s, a, s')
    gamma: float = 0.99
    initial_state_dist: Optional[np.ndarray] = None

    def __post_init__(self):
        # Validate dimensions
        assert self.transitions.shape == (
            self.n_states, self.n_actions, self.n_states
        ), "Transition matrix shape mismatch"

        # Validate probability distributions
        for s in range(self.n_states):
            for a in range(self.n_actions):
                assert np.isclose(
                    self.transitions[s, a].sum(), 1.0
                ), f"P(·|s={s}, a={a}) doesn't sum to 1"

        # Default uniform initial state distribution
        if self.initial_state_dist is None:
            self.initial_state_dist = np.ones(self.n_states) / self.n_states

    def step(self, state: int, action: int) -> Tuple[int, float]:
        """
        Sample next state and reward from MDP dynamics.

        This is how the MDP presents itself to the agent:
        as a black box that can be queried.
        """
        # Sample next state from transition distribution
        next_state = np.random.choice(
            self.n_states, p=self.transitions[state, action]
        )
        # Get reward
        reward = self.rewards[state, action, next_state]
        return next_state, reward

    def get_transition_matrix(self, policy: np.ndarray) -> np.ndarray:
        """
        Compute transition matrix under a given policy.

        P^π[s, s'] = sum_a π(a|s) P(s'|s, a)

        Useful for theoretical analysis and policy evaluation.
        """
        # policy.shape = (n_states, n_actions)
        # Result shape = (n_states, n_states)
        P_pi = np.zeros((self.n_states, self.n_states))
        for s in range(self.n_states):
            for a in range(self.n_actions):
                P_pi[s] += policy[s, a] * self.transitions[s, a]
        return P_pi

    def check_markov_property(self, trajectories: List[List[Tuple]]) -> dict:
        """
        Empirically verify Markov property from trajectory data.

        For each (s, a) pair, check if P(s'|s, a) is consistent
        regardless of what came before.

        Returns statistics on Markov consistency.
        """
        # (s, a) -> {history_hash -> [next_states]}
        transitions_by_history = defaultdict(lambda: defaultdict(list))

        for trajectory in trajectories:
            for t, (s, a, r, s_next) in enumerate(trajectory):
                # Hash the history (simplified: just length and prev state)
                if t == 0:
                    history_hash = "start"
                else:
                    history_hash = f"len_{t}_prev_{trajectory[t-1][0]}"
                transitions_by_history[(s, a)][history_hash].append(s_next)

        # Compare distributions across histories
        results = {
            'state_action_pairs': len(transitions_by_history),
            'consistent_pairs': 0,
            'inconsistent_pairs': 0,
            'details': []
        }

        for (s, a), histories in transitions_by_history.items():
            if len(histories) < 2:
                continue

            # Compare transition distributions
            all_next_states = []
            for hist, next_states in histories.items():
                all_next_states.extend(next_states)

            # Simple consistency check: similar distributions?
            # (A full test would use statistical hypothesis testing)
            results['consistent_pairs'] += 1  # Simplified

        return results


def value_iteration(mdp: MDP, tolerance: float = 1e-6) -> Tuple[np.ndarray, np.ndarray]:
    """
    Compute optimal value function and policy for known MDP.

    V*(s) = max_a sum_{s'} P(s'|s,a) [R(s,a,s') + γV*(s')]

    This is only possible because the MDP is fully specified.
    """
    V = np.zeros(mdp.n_states)

    while True:
        V_new = np.zeros(mdp.n_states)
        for s in range(mdp.n_states):
            # Compute Q(s, a) for each action
            q_values = np.zeros(mdp.n_actions)
            for a in range(mdp.n_actions):
                for s_next in range(mdp.n_states):
                    p = mdp.transitions[s, a, s_next]
                    r = mdp.rewards[s, a, s_next]
                    q_values[a] += p * (r + mdp.gamma * V[s_next])
            # V*(s) = max_a Q*(s, a)
            V_new[s] = np.max(q_values)

        # Check convergence
        if np.max(np.abs(V_new - V)) < tolerance:
            break
        V = V_new

    # Extract optimal policy
    policy = np.zeros((mdp.n_states, mdp.n_actions))
    for s in range(mdp.n_states):
        q_values = np.zeros(mdp.n_actions)
        for a in range(mdp.n_actions):
            for s_next in range(mdp.n_states):
                p = mdp.transitions[s, a, s_next]
                r = mdp.rewards[s, a, s_next]
                q_values[a] += p * (r + mdp.gamma * V[s_next])
        best_action = np.argmax(q_values)
        policy[s, best_action] = 1.0

    return V, policy
```

The Markov property isn't just a mathematical convenience—it's the foundation that makes RL tractable. Without it, our standard algorithms and theory would not apply.
1. Enables Optimal Stationary Policies
In an MDP, there exists a stationary optimal policy: a single mapping from states to actions that's optimal regardless of when or how we reached the state.
Without Markov: The optimal action might depend on the entire history. Policy space explodes to infinite dimensions.
2. Enables Recursive Value Functions
The Bellman equation relates $V(s)$ to $V(s')$—value of current state to values of next states. This recursive structure enables:
- Dynamic programming (value iteration, policy iteration)
- Temporal-difference learning, which bootstraps from one-step targets
- Efficient computation: solving for $|\mathcal{S}|$ values instead of one value per possible history
Without Markov: Value would depend on history: $V(s_0, a_0, s_1, \ldots, s_t)$. No recursion, no efficient computation.
| Property | With Markov | Without Markov |
|---|---|---|
| Optimal policy form | Stationary: $\pi(a \mid s)$ | History-dependent: $\pi(a \mid s_0, \ldots, s_t)$ |
| Value function | $V(s)$: finite table/function | $V(h)$: infinite over histories |
| Policy search space | Finite (for finite $\mathcal{S}$, $\mathcal{A}$) | Infinite (all history mappings) |
| Dynamic programming | Bellman recursion works | No recursive structure |
| Sample complexity | Depends on $\|\mathcal{S}\|$, $\|\mathcal{A}\|$ | Can be unbounded |
| Convergence guarantees | Strong (VI, PI, Q-learning) | Weaker or no guarantees |
3. Enables Off-Policy Learning
Q-learning's ability to learn about the optimal policy while following any behavior policy relies on the Markov property. The update: $$Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$$
works because the value of $(s', a')$ doesn't depend on how we got to $s'$.
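A minimal tabular sketch of this update (illustrative values, not a full training loop):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One off-policy Q-learning update.

    The bootstrap target r + γ max_a' Q(s', a') is valid no matter which
    behavior policy produced this transition, but only because the Markov
    property guarantees Q(s', a') does not depend on how s' was reached.
    """
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy usage: 3 states, 2 actions, Q initialized to zero
Q = np.zeros((3, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
# target = 1.0 + 0.99 * 0 = 1.0, so Q[0, 1] moves to 0.1
```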
4. Enables Experience Replay
Storing transitions $(s, a, r, s')$ and replaying them out of order assumes that these transitions would look the same regardless of when they occurred. This is only true under the Markov property.
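A minimal replay buffer sketch (class name and capacity are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Store (s, a, r, s', done) tuples and sample them out of order.

    Replaying old transitions is sound only if the state is Markov:
    each stored tuple must mean the same thing whenever it is replayed.
    """
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=10)
for i in range(15):
    buf.push(i, 0, 1.0, i + 1, False)
batch = buf.sample(4)  # 4 transitions drawn out of order
```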
5. Enables Generalization
When we approximate $V(s)$ with a neural network, we're assuming similar states have similar values. This generalization relies on states being self-contained—their value doesn't depend on unobserved history.
When RL algorithms fail in practice, Markov violations are a common culprit. The algorithm is correct, but the state representation doesn't satisfy the assumption the algorithm requires. Always ask: does my state contain enough information?
Many real-world problems violate the Markov property. When the agent receives observations rather than true states, we enter the realm of Partially Observable MDPs (POMDPs).
POMDP Definition
A POMDP extends an MDP with:
- An observation space $\Omega$
- An observation function $O(o \mid s)$, the probability of observing $o$ in state $s$
The agent doesn't see state $s$ directly—only observation $o \sim O(\cdot | s)$.
Common Sources of Partial Observability
- Sensor limitations: noise, limited range, occlusion
- Latent variables: hidden system modes, other agents' internal states
- Temporal information: velocities and trends not visible in a single snapshot
The Belief State
One approach to POMDPs: maintain a belief state—a probability distribution over possible true states given the observation history:
$$b(s) = P(S_t = s | o_0, a_0, \ldots, o_t)$$
The belief state IS Markov: updated via Bayes' rule using the observation model. But the belief space is continuous and high-dimensional (one dimension per state), making exact solutions intractable for large state spaces.
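The Bayes-rule update can be written in a few lines. This is a sketch for a tiny discrete POMDP with made-up numbers; `T` and `O` follow the conventions used above:

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Exact Bayes-filter update of a POMDP belief state.

    b: current belief over states, shape (n_states,)
    a: action taken;  o: observation received
    T: transition tensor, T[s, a, s'] = P(s' | s, a)
    O: observation matrix, O[s, o] = P(o | s)

    Returns b'(s') ∝ O[s', o] * sum_s T[s, a, s'] * b(s).
    """
    predicted = b @ T[:, a, :]   # predict: sum_s b(s) P(s'|s, a)
    new_b = O[:, o] * predicted  # correct with observation likelihood
    return new_b / new_b.sum()   # normalize

# Tiny 2-state example: sticky dynamics, 80%-accurate sensor
T = np.zeros((2, 1, 2))
T[:, 0, :] = [[0.9, 0.1], [0.1, 0.9]]
O = np.array([[0.8, 0.2], [0.2, 0.8]])

b = np.array([0.5, 0.5])
b = belief_update(b, a=0, o=0, T=T, O=O)
print(b)  # observing o=0 shifts belief toward state 0: [0.8, 0.2]
```

The updated belief depends only on the previous belief, the action, and the observation, which is exactly why the belief state is Markov.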
Practical POMDP Approaches
Pure POMDPs are computationally hard, but most problems are 'almost Markov'—some history helps, but not infinite history. Adding a few frames of memory or a small RNN often captures the relevant information. The art is finding the minimal history length that makes the problem approximately Markov.
How do you know if your state representation is Markov? Here are practical diagnostic techniques:
1. Transition Consistency Test
For each (state, action) pair, examine the distribution of next states. If this distribution varies depending on what came before, the state isn't Markov.
Formally: Compare $P(s' | s, a, h_1)$ vs $P(s' | s, a, h_2)$ for different histories $h_1, h_2$.
2. Value Prediction Test
Train a value function $V(s)$. If the Bellman error $\delta = r + \gamma V(s') - V(s)$ is systematically correlated with history features not in $s$, those features should be included.
3. Policy Performance Gap
Compare a memoryless policy $\pi(a|s)$ to a history-dependent policy $\pi(a|s_{t-k:t})$. If the history-dependent policy significantly outperforms, the state is missing information.
4. Mutual Information Analysis
Measure $I(s_{t+1}; s_{t-k} | s_t, a_t)$. Non-zero mutual information between future and past given present indicates Markov violation.
```python
import numpy as np
from collections import defaultdict
from typing import List, Tuple, Dict
from scipy import stats


def test_transition_consistency(
    trajectories: List[List[Tuple]],
    state_discretizer=None,
    significance_level: float = 0.05
) -> Dict:
    """
    Test Markov property by checking if transition distributions
    are consistent across different histories.

    For each (s, a) pair, collect transitions under different
    historical contexts and test if distributions differ significantly.

    Returns:
        Dictionary with test results and problematic state-action pairs
    """
    # Group transitions by (state, action) and recent history
    # (s, a) -> {history_context -> [next_states]}
    transitions = defaultdict(lambda: defaultdict(list))

    for trajectory in trajectories:
        for t, (s, a, r, s_next) in enumerate(trajectory):
            # Create history context (e.g., previous state if t > 0)
            if t > 0:
                prev_s = trajectory[t - 1][0]
                context = f"prev_{prev_s}"
            else:
                context = "start"

            # Discretize continuous states if needed
            if state_discretizer:
                s = state_discretizer(s)
                s_next = state_discretizer(s_next)

            transitions[(s, a)][context].append(s_next)

    # Test each (s, a) pair for consistency
    results = {
        'num_pairs_tested': 0,
        'num_violations': 0,
        'violation_rate': 0.0,
        'violations': []
    }

    for (s, a), contexts in transitions.items():
        if len(contexts) < 2:
            continue  # Need at least 2 contexts to compare

        results['num_pairs_tested'] += 1

        # Get samples from different contexts
        context_samples = list(contexts.values())

        # Chi-square test for discrete, KS test for continuous
        # Here we use chi-square assuming discrete next states
        try:
            # Pool all next states to get categories
            all_next = [ns for samples in context_samples for ns in samples]
            unique_next = list(set(all_next))

            if len(unique_next) < 2:
                continue

            # Count frequencies per context
            counts = []
            for samples in context_samples:
                ctx_counts = [samples.count(ns) for ns in unique_next]
                if sum(ctx_counts) >= 5:  # Min samples for chi-square
                    counts.append(ctx_counts)

            if len(counts) >= 2:
                # Chi-square test of homogeneity
                _, p_value, _, _ = stats.chi2_contingency(counts)

                if p_value < significance_level:
                    results['num_violations'] += 1
                    results['violations'].append({
                        'state': s,
                        'action': a,
                        'p_value': p_value,
                        'contexts': list(contexts.keys())
                    })
        except Exception:
            pass  # Handle edge cases gracefully

    if results['num_pairs_tested'] > 0:
        results['violation_rate'] = (
            results['num_violations'] / results['num_pairs_tested']
        )

    return results


def test_value_prediction_residuals(
    value_fn,
    trajectories: List[List[Tuple]],
    gamma: float = 0.99
) -> Dict:
    """
    Check if TD errors correlate with history features.

    If V(s) is accurate and state is Markov, TD errors should be
    zero-mean noise uncorrelated with history. If TD errors correlate
    with past states/actions, those features should be included
    in the state.
    """
    td_errors = []
    history_features = []

    for trajectory in trajectories:
        for t, (s, a, r, s_next) in enumerate(trajectory):
            # Compute TD error
            v_s = value_fn(s)
            v_s_next = value_fn(s_next)
            td_error = r + gamma * v_s_next - v_s
            td_errors.append(td_error)

            # Extract history features
            if t > 0:
                prev_s, prev_a, _, _ = trajectory[t - 1]
                # Simple features: was previous state different?
                features = {
                    'prev_state_diff': float(prev_s != s),
                    'episode_position': t
                }
            else:
                features = {
                    'prev_state_diff': 0.0,
                    'episode_position': 0
                }
            history_features.append(features)

    # Analyze correlations
    td_errors = np.array(td_errors)
    results = {
        'mean_td_error': np.mean(td_errors),
        'std_td_error': np.std(td_errors),
        'correlations': {}
    }

    for feature_name in history_features[0].keys():
        feature_values = np.array([f[feature_name] for f in history_features])
        correlation = np.corrcoef(td_errors, feature_values)[0, 1]
        results['correlations'][feature_name] = correlation

    # Flag significant correlations
    results['potential_issues'] = [
        name for name, corr in results['correlations'].items()
        if abs(corr) > 0.1
    ]

    return results


def compare_memoryless_vs_history_policy(
    env,
    memoryless_policy,
    history_policy,
    n_episodes: int = 100
) -> Dict:
    """
    Compare performance of memoryless vs history-dependent policy.

    Large gap suggests state is missing important information.
    """
    def evaluate_policy(policy, uses_history: bool):
        returns = []
        for _ in range(n_episodes):
            state = env.reset()
            history = [state] if uses_history else None
            episode_return = 0
            done = False

            while not done:
                if uses_history:
                    action = policy(history)
                else:
                    action = policy(state)

                next_state, reward, done, _ = env.step(action)
                episode_return += reward

                if uses_history:
                    history.append(action)
                    history.append(next_state)
                state = next_state

            returns.append(episode_return)
        return np.mean(returns), np.std(returns)

    memoryless_mean, memoryless_std = evaluate_policy(
        memoryless_policy, uses_history=False
    )
    history_mean, history_std = evaluate_policy(
        history_policy, uses_history=True
    )

    return {
        'memoryless_return': memoryless_mean,
        'memoryless_std': memoryless_std,
        'history_return': history_mean,
        'history_std': history_std,
        'improvement': history_mean - memoryless_mean,
        'significant': (history_mean - memoryless_mean) > 2 * memoryless_std
    }
```

When the environment isn't fully observable, we need strategies to recover approximate Markovianity.
1. State Augmentation
Add features to the state until it becomes sufficiently Markov:
- Derivatives (velocity, acceleration) computed from recent observations
- Recent actions, if their effects are delayed
- Counters and timers (time since last event, steps remaining)
This is the simplest approach: expand the state until the problem becomes approximately Markov.
2. Frame Stacking
For sequential observations (especially images), stack recent frames: $$o_t^{\text{stacked}} = [o_{t-k+1}, \ldots, o_t]$$
DQN's classic approach: stack 4 Atari frames. This allows inferring velocity from position changes.
3. Recurrent Neural Networks
Use LSTM or GRU to process observation history: $$h_t = \text{RNN}(o_t, h_{t-1})$$
The hidden state $h_t$ serves as an approximate belief state, encoding relevant history.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205
```python
import torch
import torch.nn as nn
from collections import deque
import numpy as np
from typing import Tuple


class FrameStack:
    """
    Stack recent observations to create pseudo-Markov state.

    Classic approach in Atari: 4-frame stack allows inferring
    velocity from position changes across frames.
    """

    def __init__(self, n_frames: int = 4):
        self.n_frames = n_frames
        self.frames = deque(maxlen=n_frames)

    def reset(self, initial_obs: np.ndarray):
        """Initialize with copies of first observation."""
        self.frames.clear()
        for _ in range(self.n_frames):
            self.frames.append(initial_obs)
        return self._get_stacked()

    def add(self, obs: np.ndarray) -> np.ndarray:
        """Add new observation, return stacked state."""
        self.frames.append(obs)
        return self._get_stacked()

    def _get_stacked(self) -> np.ndarray:
        """Stack frames along channel dimension."""
        return np.concatenate(list(self.frames), axis=-1)


class RecurrentPolicy(nn.Module):
    """
    LSTM-based policy for POMDPs.

    The hidden state maintains a learned compression of relevant
    history, approximating a belief state.

    Key insight: LSTM learns WHAT to remember, not fixed history.
    """

    def __init__(self,
                 obs_dim: int,
                 action_dim: int,
                 hidden_dim: int = 128,
                 n_layers: int = 1):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers

        # Observation encoder
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
        )

        # Recurrent core
        self.lstm = nn.LSTM(
            input_size=hidden_dim,
            hidden_size=hidden_dim,
            num_layers=n_layers,
            batch_first=True
        )

        # Policy head
        self.policy_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

    def forward(self,
                obs: torch.Tensor,
                hidden: Tuple[torch.Tensor, torch.Tensor] = None
                ) -> Tuple[torch.Tensor, Tuple]:
        """
        Process observation and update hidden state.

        Args:
            obs: Current observation [batch, obs_dim] or [batch, seq, obs_dim]
            hidden: (h, c) LSTM hidden state

        Returns:
            logits: Action logits
            hidden: Updated (h, c) hidden state
        """
        # Initialize hidden state if not provided
        if hidden is None:
            batch_size = obs.shape[0]
            hidden = self.init_hidden(batch_size)

        # Encode observation
        if obs.dim() == 2:
            obs = obs.unsqueeze(1)  # Add sequence dimension
        x = self.encoder(obs)

        # Process through LSTM
        lstm_out, hidden = self.lstm(x, hidden)

        # Get last timestep output
        if lstm_out.dim() == 3:
            lstm_out = lstm_out[:, -1, :]

        # Produce action logits
        logits = self.policy_head(lstm_out)

        return logits, hidden

    def init_hidden(self, batch_size: int) -> Tuple[torch.Tensor, torch.Tensor]:
        """Initialize LSTM hidden state to zeros."""
        device = next(self.parameters()).device
        h = torch.zeros(self.n_layers, batch_size, self.hidden_dim, device=device)
        c = torch.zeros(self.n_layers, batch_size, self.hidden_dim, device=device)
        return (h, c)

    def act(self,
            obs: torch.Tensor,
            hidden: Tuple[torch.Tensor, torch.Tensor] = None,
            deterministic: bool = False
            ) -> Tuple[int, Tuple]:
        """Select action and return updated hidden state."""
        with torch.no_grad():
            logits, hidden = self.forward(obs.unsqueeze(0), hidden)
            if deterministic:
                action = logits.argmax(dim=-1)
            else:
                action = torch.distributions.Categorical(logits=logits).sample()
        return action.item(), hidden


class TransformerPolicy(nn.Module):
    """
    Transformer-based policy for POMDPs.

    Attends over the full observation history (or a window).
    More flexible than RNN but higher computational cost.

    Used in Decision Transformer and similar architectures.
    """

    def __init__(self,
                 obs_dim: int,
                 action_dim: int,
                 hidden_dim: int = 128,
                 n_heads: int = 4,
                 n_layers: int = 2,
                 max_seq_len: int = 100):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.max_seq_len = max_seq_len

        # Observation embedding
        self.obs_embedding = nn.Linear(obs_dim, hidden_dim)

        # Positional encoding
        self.pos_encoding = nn.Parameter(
            torch.randn(1, max_seq_len, hidden_dim) * 0.1
        )

        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim,
            nhead=n_heads,
            dim_feedforward=hidden_dim * 4,
            batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, n_layers)

        # Policy head
        self.policy_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, obs_sequence: torch.Tensor) -> torch.Tensor:
        """
        Process observation sequence, output action for last timestep.

        Args:
            obs_sequence: [batch, seq_len, obs_dim]

        Returns:
            logits: Action logits for last timestep [batch, action_dim]
        """
        seq_len = obs_sequence.shape[1]

        # Embed observations
        x = self.obs_embedding(obs_sequence)

        # Add positional encoding
        x = x + self.pos_encoding[:, :seq_len, :]

        # Generate causal mask (can only attend to past)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, device=x.device) * float('-inf'),
            diagonal=1
        )

        # Transform
        x = self.transformer(x, mask=mask)

        # Policy from last position
        logits = self.policy_head(x[:, -1, :])

        return logits
```

4. Attention Mechanisms
Transformers can attend to any part of the history, learning which past observations are relevant: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V$$
More flexible than RNNs (no fixed-length compression) but quadratic in sequence length.
5. Predictive State Representations
Learn state features that predict future observations: $$\phi(o_{0:t}) \text{ such that } P(o_{t+1:t+k} | o_{0:t}) \approx P(o_{t+1:t+k} | \phi(o_{0:t}))$$
If the representation predicts the future well, it captures the relevant past.
Choosing an Approach
- State augmentation: when you know which features are missing
- Frame stacking: when a short, fixed window of history suffices (e.g., inferring velocity)
- RNNs: when the relevant history has variable length and what to remember must be learned
- Attention/transformers: when long-range dependencies matter and the compute budget allows
- Predictive state representations: when you want features grounded in predictive power
Designing a good state representation is one of the most important decisions in RL system design. Here are principles for creating approximately Markov states:
Principle 1: Include Derivatives
If position matters, velocity probably matters too. If position and velocity matter, acceleration might matter.
$$s = [x, \dot{x}, \ddot{x}, \ldots]$$
Finite differences can approximate derivatives from history.
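A sketch of this augmentation using `np.gradient` (the constant-velocity trajectory is made up for the demo):

```python
import numpy as np

def augment_with_derivatives(positions, dt=1.0):
    """Build [x, x_dot, x_ddot] states from a position-only observation
    stream via finite differences.

    A simple illustrative augmentation; in an online setting a short
    warm-up of past observations is needed before estimates are meaningful.
    """
    x = np.asarray(positions, dtype=float)
    v = np.gradient(x, dt)   # velocity ≈ first difference
    a = np.gradient(v, dt)   # acceleration ≈ second difference
    return np.stack([x, v, a], axis=-1)

# Constant-velocity trajectory x = 2t: velocity 2 everywhere, acceleration 0
states = augment_with_derivatives([0.0, 2.0, 4.0, 6.0, 8.0])
print(states.shape)  # (5, 3)
```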
Principle 2: Include Context
State should include any contextual information that affects dynamics:
- The current mode or task phase of the environment
- Time-varying parameters (time of day, remaining budget)
- Relevant properties of other agents
Principle 3: Include Relevant History
If recent actions affect future outcomes (actuator lag, cooldowns, momentum), include them: $$s_t = [o_t, a_{t-1}, a_{t-2}, \ldots, a_{t-k}]$$
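One way to sketch this is a wrapper that appends the last $k$ actions, one-hot encoded, to each observation. This assumes a gym-style `reset()`/`step()` interface with discrete actions; `ActionHistoryWrapper` and `_DummyEnv` are hypothetical names invented for the demo:

```python
from collections import deque
import numpy as np

class ActionHistoryWrapper:
    """Append the last k actions (one-hot) to each observation.

    Useful when actuator lag or cooldowns make recent actions part of
    the information needed to predict the next state.
    """
    def __init__(self, env, n_actions, k=2):
        self.env = env
        self.n_actions = n_actions
        self.k = k
        self.history = deque(maxlen=k)

    def _augment(self, obs):
        onehots = [np.eye(self.n_actions)[a] for a in self.history]
        return np.concatenate([np.atleast_1d(obs), *onehots])

    def reset(self):
        self.history.clear()
        for _ in range(self.k):
            self.history.append(0)  # pad with a default action
        return self._augment(self.env.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.history.append(action)
        return self._augment(obs), reward, done, info


class _DummyEnv:
    """Trivial stand-in environment for demonstration."""
    def reset(self):
        return np.array([0.0])
    def step(self, a):
        return np.array([float(a)]), 0.0, False, {}

env = ActionHistoryWrapper(_DummyEnv(), n_actions=3, k=2)
s = env.reset()            # 1 obs dim + 2 one-hot actions of size 3 -> length 7
s2, _, _, _ = env.step(2)  # last 3 entries now encode action 2: [0, 0, 1]
```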
Ask yourself: If I were the agent, with only the current state, could I make a good decision? Or would I need to ask 'what happened before?' If the latter, that information should be in the state.
In practice, the Markov property is a matter of degree, not binary. Here's how to think about it empirically:
Approximate Markov Is Often Enough
RL algorithms are robust to mild Markov violations. If the optimal action is 'usually' determinable from state, algorithms will 'usually' find good policies.
Trade-off: State Size vs. Markovianity
Larger states are more likely to be Markov but:
- Require more data to cover the state space (higher sample complexity)
- Slow learning and inflate model size
- Risk including distracting, irrelevant features
Find the sweet spot: enough information to be approximately Markov, not so much that learning is impractical.
Symptoms of Markov Violations
| Symptom | Likely Cause | Treatment |
|---|---|---|
| Policy oscillates in 'same' state | State aliasing | Add distinguishing features |
| Can't predict next state from current | Missing dynamics info | Add velocities/derivatives |
| Performance depends on episode history | Missing context | Include context features or use RNN |
| Works in some episodes, not others | Hidden modes | Identify and include mode indicators |
| Reward prediction error correlates with past | Missing reward-relevant info | Analyze what past info predicts reward |
As you add more history to the state, you approach Markovianity, but state dimensionality grows with it: a k-step history of observations o ∈ ℝ^n creates a state in ℝ^(n×k), and the data needed to cover that space grows exponentially with dimension. Be strategic about what history to include.
The Markov Property as an Idealization
No finite representation is perfectly Markov for a complex physical world. The true state of the universe is unfathomably rich. The Markov property should be understood as:
- An idealization we engineer toward, not a property we verify absolutely
- A design goal for state representations
- An approximation that is good enough when the remaining history dependence is negligible
Relationship to Sufficient Statistics
A Markov state is a sufficient statistic for the future given the past. This connects to core statistical concepts and explains why finding Markov representations is related to learning compact, useful features.
State Abstraction
MDP theory studies when multiple true states can be aggregated without losing optimality. If two states have the same optimal action and transition to the same abstract states under that action, they can be merged. This enables state compression while preserving Markovianity.
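A sketch of exact aggregation for a known finite MDP. The grouping criterion here is deliberately simple (identical transition rows and rewards); `aggregate_equivalent_states` is an illustrative helper, and a full treatment must also require that merged states map into the same abstract successors:

```python
import numpy as np
from collections import defaultdict

def aggregate_equivalent_states(transitions, rewards, decimals=8):
    """Group states that are behaviorally identical: same transition row
    and same reward for every action.

    transitions: P[s, a, s'];  rewards: R[s, a]
    Returns groups of states that are candidates for merging.
    """
    groups = defaultdict(list)
    for s in range(transitions.shape[0]):
        key = (
            tuple(np.round(transitions[s], decimals).ravel()),
            tuple(np.round(rewards[s], decimals).ravel()),
        )
        groups[key].append(s)
    return list(groups.values())

# 3 states, 1 action: states 0 and 1 behave identically; state 2 differs in reward
P = np.zeros((3, 1, 3))
P[0, 0] = [0.0, 0.0, 1.0]
P[1, 0] = [0.0, 0.0, 1.0]
P[2, 0] = [0.0, 0.0, 1.0]
R = np.array([[1.0], [1.0], [0.0]])

groups_out = aggregate_equivalent_states(P, R)
print(groups_out)  # [[0, 1], [2]]
```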
Model-Based RL and Markov
Model-based RL learns $P(s'|s, a)$ and $R(s, a)$. If the state isn't Markov, these functions become ill-defined:
- The same $(s, a)$ pair yields different next-state distributions depending on hidden history
- The learned model averages over those hidden contexts, producing biased predictions that compound during planning
Model-based methods are thus particularly sensitive to Markov violations.
The Information State
In POMDPs, the belief state $b_t$ (distribution over true states given observation history) is always Markov. The challenge is that $b_t$ lives in a continuous space of dimension $|\mathcal{S}| - 1$, which is intractable for large state spaces.
Recurrent neural networks implicitly learn a compressed belief state. The hidden state $h_t$ is an approximation to $b_t$ that's computationally tractable.
A process is 'Markov' (or 'has the Markov property') if the future is independent of the past given the present. A decision is 'Markovian' if it's based only on current state, not history. Optimal policies in MDPs are Markovian; optimal policies in POMDPs may not be.
The Markov property is the mathematical assumption that makes reinforcement learning tractable. It asserts that the present contains all information relevant to the future—enabling elegant algorithms, strong convergence guarantees, and efficient computation.
Module Complete: RL Fundamentals
With this page, we've completed Module 1: RL Fundamentals. You now understand:
- The agent-environment interaction paradigm
- The quantities that flow through it: states, actions, rewards, and returns
- The Markov property and the mathematical assumptions that make RL tractable
These concepts form the bedrock upon which all RL algorithms are built. The next module will formalize these ideas further as Markov Decision Processes and introduce the algorithmic machinery for finding optimal policies.
Congratulations! You've completed RL Fundamentals—the conceptual foundation of reinforcement learning. You understand the agent-environment paradigm, the quantities that flow through it, and the mathematical assumptions that make RL tractable. You're now ready to study MDPs in depth and learn the algorithms that solve them.