The essence of sequential modeling lies in capturing dependencies—the statistical relationships that bind elements of a sequence together. When we observe that 'the' is likely followed by a noun, that a rising stock price often accelerates, or that a specific DNA motif predicts gene activation, we are recognizing dependencies.
But dependencies in sequences present a fundamental challenge: variable-length conditioning. To predict $x_t$, we ideally want to condition on the entire history $x_1, x_2, \ldots, x_{t-1}$. As $t$ grows, this history becomes unbounded—yet we must compress it into a fixed computational representation usable by our model.
This page develops the theoretical frameworks for understanding and modeling sequential dependencies. We begin with the simplest assumption—the Markov property—and progressively relax it, building towards the need for architectures that can maintain and utilize long-range information.
By the end of this page, you will: (1) Understand the Markov assumption and its limitations, (2) Formalize higher-order Markov models, (3) Grasp the concept of sufficient statistics for histories, (4) See why fixed-length context is insufficient, and (5) Motivate the need for learned, variable-length representations.
Let us begin with the most general formulation of sequential modeling. Given a sequence $\mathbf{x} = (x_1, x_2, \ldots, x_T)$, we wish to model the joint probability distribution:
$$P(\mathbf{x}) = P(x_1, x_2, \ldots, x_T)$$
Using the chain rule of probability, this can always be decomposed as:
$$P(x_1, \ldots, x_T) = P(x_1) \cdot P(x_2|x_1) \cdot P(x_3|x_1, x_2) \cdot \ldots \cdot P(x_T|x_1, \ldots, x_{T-1})$$
$$= \prod_{t=1}^{T} P(x_t | \mathbf{x}_{<t})$$
where $\mathbf{x}_{<t} = (x_1, \ldots, x_{t-1})$ denotes the history up to time $t-1$.
The chain rule factorization is not an approximation—it is an exact identity that holds for any joint distribution. The modeling choices come in HOW we represent or approximate each conditional $P(x_t | \mathbf{x}_{<t})$. Different assumptions about the form of these conditionals lead to different model classes.
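To make the decomposition concrete, here is a minimal sketch (with made-up conditional probabilities) showing that a sequence log-probability is just the sum of the per-step conditional log-probabilities, however those conditionals are obtained:

```python
import numpy as np

def sequence_log_prob(step_log_probs):
    """
    Chain rule: log P(x_1, ..., x_T) = sum_t log P(x_t | x_{<t}).
    step_log_probs[t] is assumed to hold log P(x_t | x_{<t}),
    whether it came from counts, a table, or a neural network.
    """
    return float(np.sum(step_log_probs))

# Hypothetical conditionals for a 4-token sequence.
log_probs = np.log([0.2, 0.5, 0.1, 0.4])  # P(x_1), P(x_2|x_1), P(x_3|x_<3), P(x_4|x_<4)
print(sequence_log_prob(log_probs))       # log(0.2 * 0.5 * 0.1 * 0.4) ≈ -5.52
```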
The Representation Problem:
In principle, each conditional $P(x_t | \mathbf{x}_{<t})$ is a function of the entire history $\mathbf{x}_{<t}$. For a sequence of length $T$ with each $x_t$ taking $K$ possible values, the number of possible histories grows as $K^{t-1}$.
At $t = 10$ with vocabulary $K = 10{,}000$ (typical for word-level text), there are $10{,}000^9 = 10^{36}$ possible histories—already far too many to enumerate explicitly. By $t = 21$, the count reaches the number of atoms in the observable universe ($\approx 10^{80}$).
Clearly, we cannot explicitly enumerate all histories. We need assumptions that constrain the conditional distributions to be learnable from finite data.
Three Fundamental Approaches:
Independence assumption (i.i.d.): $P(x_t | \mathbf{x}_{<t}) = P(x_t)$ — the trivial model that ignores all dependencies.
Finite history (Markov): $P(x_t | \mathbf{x}_{<t}) = P(x_t | x_{t-k}, \ldots, x_{t-1})$ — only recent history matters.
Learned compression: Encode history $\mathbf{x}_{<t}$ into a fixed-size representation $h_t$ such that $P(x_t | \mathbf{x}_{<t}) \approx P(x_t | h_t)$.
The third approach is the foundation of recurrent neural networks. But to appreciate its necessity and elegance, we must first understand why the simpler approaches are insufficient.
The Markov assumption (or Markov property) is the simplest non-trivial dependency structure. It states that the future depends on the past only through the present—all relevant information from history is captured in the current state.
First-Order Markov Property:
$$P(x_t | x_1, x_2, \ldots, x_{t-1}) = P(x_t | x_{t-1})$$
Given the immediately preceding element $x_{t-1}$, all earlier elements $x_1, \ldots, x_{t-2}$ provide no additional information about $x_t$. The sequence 'forgets' everything except the most recent observation.
Think of a first-order Markov process as a system with 'one step memory.' It knows where it is now, but not how it got there. Tomorrow's weather given today's is independent of yesterday's—or so the model assumes. This is a strong simplification that makes many problems tractable.
Markov Chains:
A Markov chain is a sequence of random variables satisfying the first-order Markov property. For discrete state spaces with $K$ states, the chain is fully characterized by:
Initial distribution: $\pi_i = P(x_1 = i)$ for $i = 1, \ldots, K$
Transition probabilities: $T_{ij} = P(x_t = j | x_{t-1} = i)$ for all state pairs
The transition matrix $\mathbf{T} \in \mathbb{R}^{K \times K}$ has rows summing to 1 (each row is a probability distribution). This representation requires only $O(K^2)$ parameters—independent of sequence length.
Joint probability under Markov assumption:
$$P(x_1, \ldots, x_T) = P(x_1) \prod_{t=2}^{T} P(x_t | x_{t-1}) = \pi_{x_1} \prod_{t=2}^{T} T_{x_{t-1}, x_t}$$
Example: transition matrix for a three-state weather model (each row sums to 1):

| From \ To | Sunny | Cloudy | Rainy |
|---|---|---|---|
| Sunny | 0.6 | 0.3 | 0.1 |
| Cloudy | 0.3 | 0.4 | 0.3 |
| Rainy | 0.2 | 0.3 | 0.5 |
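As a small illustration, the sketch below samples from this weather chain and scores a sequence with the factorization above; the uniform initial distribution is an assumption added for the example, not part of the table.

```python
import numpy as np

states = ["Sunny", "Cloudy", "Rainy"]
pi = np.array([1/3, 1/3, 1/3])            # assumed uniform initial distribution
T = np.array([[0.6, 0.3, 0.1],            # rows: from-state, columns: to-state
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

def sample_chain(rng, length):
    """Sample a state sequence from the first-order Markov chain (pi, T)."""
    seq = [rng.choice(3, p=pi)]
    for _ in range(length - 1):
        seq.append(rng.choice(3, p=T[seq[-1]]))
    return seq

def chain_log_prob(seq):
    """log P(x_1..x_T) = log pi[x_1] + sum_t log T[x_{t-1}, x_t]."""
    lp = np.log(pi[seq[0]])
    for prev, cur in zip(seq[:-1], seq[1:]):
        lp += np.log(T[prev, cur])
    return lp

rng = np.random.default_rng(0)
seq = sample_chain(rng, 7)
print([states[s] for s in seq], chain_log_prob(seq))
```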
Where Markov Chains Work:
Markov chains work well when the current observation genuinely summarizes the past: short-range physical processes like the weather example above, random walks, and other settings where dependencies really are one step deep.
Where Markov Chains Fail:
Consider the sentence: 'The keys on the table that the man bought at the antique shop last weekend are missing.'
To predict 'are' (plural), the model needs 'keys' (the subject), not 'weekend' (the most recent noun). First-order Markov models fail catastrophically on such long-range subject-verb agreement—a fundamental pattern in language.
A natural extension of the first-order Markov assumption is to condition on more than one previous element. An $n$th-order Markov model conditions on the previous $n$ elements:
$$P(x_t | x_1, \ldots, x_{t-1}) = P(x_t | x_{t-n}, \ldots, x_{t-1})$$
This is equivalent to an $(n+1)$-gram model in NLP terminology: unigram ($n=0$), bigram ($n=1$), trigram ($n=2$), etc.
N-Gram Language Models:
For decades, $n$-gram models were the workhorses of statistical NLP. Given a vocabulary of $V$ words, an $n$-gram model estimates:
$$P(w_t | w_{t-n+1}, \ldots, w_{t-1})$$
These probabilities are estimated from corpus counts:
$$\hat{P}(w_t | w_{t-n+1:t-1}) = \frac{\text{count}(w_{t-n+1}, \ldots, w_t)}{\text{count}(w_{t-n+1}, \ldots, w_{t-1})}$$
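A minimal sketch of this count-based estimation on a toy bigram example; the corpus and function name are illustrative choices:

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()

unigram = Counter(corpus[:-1])                    # count(w_{t-1})
bigram = Counter(zip(corpus[:-1], corpus[1:]))    # count(w_{t-1}, w_t)

def p_mle(prev, word):
    """Maximum-likelihood estimate: count(prev, word) / count(prev)."""
    return bigram[(prev, word)] / unigram[prev]

print(p_mle("the", "cat"))   # 0.25: "the" is followed by cat, mat, dog, rug
print(p_mle("the", "fish"))  # 0.0: never seen, even though it is a valid continuation
```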
The exponential explosion problem:
An $n$-gram model has $O(V^n)$ parameters, where $n$ here counts the full gram (context plus predicted word). With vocabulary $V = 50{,}000$:
| Model | Parameters | Storage (32-bit floats) |
|---|---|---|
| Unigram | 50K | 200 KB |
| Bigram | 2.5 billion | 10 GB |
| Trigram | 125 trillion | 500 TB |
| 4-gram | $6.25 \times 10^{18}$ | 25 exabytes |
Beyond trigrams, explicit storage becomes impossible. Even with aggressive pruning and smoothing, higher-order $n$-grams hit fundamental limits.
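The storage figures in the table follow directly from $V^n$ parameters at 4 bytes each, as this quick check shows:

```python
V = 50_000
for n, name in [(1, "unigram"), (2, "bigram"), (3, "trigram"), (4, "4-gram")]:
    params = V ** n
    bytes_needed = params * 4  # 32-bit (4-byte) floats
    print(f"{name:8s} {params:.3e} params  ~{bytes_needed:.3e} bytes")
```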
Even with enormous corpora, most n-grams never appear in training data. This 'sparse data' problem means that raw counts give probability 0 to never-seen-but-valid sequences. Smoothing techniques (Kneser-Ney, Good-Turing) help distribute probability mass, but cannot fundamentally overcome sparsity for long contexts.
Limitations of Fixed-Order Models:
No matter how large $n$ is, fixed-order Markov models have fundamental limitations:
Bounded memory: They cannot capture dependencies longer than $n$ steps, no matter how important.
No generalization across contexts: The learned statistics for 'the cat sat' don't inform predictions for 'the dog sat'—each context is separate.
No semantic compositionality: The meaning of a phrase isn't derived from component meanings—each n-gram is atomic.
Wasteful representation: Storing all possible contexts ignores that most contexts never occur and many are similar.
```python
# Demonstration of n-gram limitations

def demonstrate_ngram_limits():
    """
    Sentences whose correct continuation depends on a subject
    far outside any fixed n-gram window.
    """
    examples = [
        # The trigram context "the man saw" is identical in the first and
        # third sentences, but the verb must agree with a subject that sits
        # arbitrarily far back.
        ("The dog that the man saw ___", "was"),    # subject 5 words back
        ("The dog that bit me ___", "was"),         # subject 4 words back
        ("The dogs that the man saw ___", "were"),  # plural subject, same trigram
    ]

    # A trigram model only sees P(word | "man", "saw"):
    # it cannot know whether the subject was singular or plural.
    for sentence, continuation in examples:
        print(f"{sentence:40s} -> {continuation}")

    # The gap between subject and verb can be arbitrarily long:
    context = ("The programmer who wrote the complex algorithm that "
               "solved the problem that had stumped the team for months ___")
    # Subject "programmer" is about 17 words back; correct: "was", not "were".
    print("\n" + context)

    print("\nN-gram order required for subject-verb agreement:")
    print("  Simple sentence:       n >= 4-5")
    print("  Complex sentence:      n >= 15-20")
    print("  Legal/academic prose:  n >= 30+")
    print("\nConclusion: no fixed n works for natural language")

demonstrate_ngram_limits()
```

The limitations of explicit n-gram models point toward a more powerful idea: instead of conditioning on raw history, we condition on a compressed representation of history. This representation—a hidden state or summary statistic—captures everything from the past that is relevant for predicting the future.
Formally, we seek a function $h: \mathcal{X}^* \to \mathcal{H}$ that maps variable-length histories to a fixed-dimensional space $\mathcal{H}$:
$$h_t = h(x_1, x_2, \ldots, x_{t-1})$$
such that:
$$P(x_t | x_1, \ldots, x_{t-1}) = P(x_t | h_t)$$
If this equality holds exactly, $h_t$ is called a sufficient statistic for predicting $x_t$.
In classical statistics, a sufficient statistic captures all information in data relevant for estimating a parameter. The sample mean is sufficient for estimating the mean of a Gaussian. Here, we extend this idea dynamically: $h_t$ should capture all information in the history relevant for predicting the next observation.
The Existence Question:
Does a compact sufficient statistic always exist? Not in general. A useful stress test is the parity sequence: predicting the running parity $x_1 \oplus x_2 \oplus \ldots \oplus x_{t-1}$ requires information from every previous bit, so no fixed window of recent observations suffices. A single running-parity bit is a sufficient statistic here, but only if it is updated recursively at every step, which is exactly what fixed-context models cannot do.
In general, the minimal sufficient statistic for a sequence may be the sequence itself. However, for many natural sequences, good approximations exist that capture most predictive information in bounded dimensions.
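To make the parity example concrete, the sketch below contrasts one particular fixed-window predictor (window size 5, an arbitrary choice) with a one-bit recursive state on a random bit stream:

```python
import numpy as np

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=10_000)

# Target: running parity y_t = x_1 XOR ... XOR x_t.
parity = np.bitwise_xor.accumulate(bits)

# (1) One-bit recursive state h_t = h_{t-1} XOR x_t reproduces the target exactly.
h = 0
recursive_pred = []
for x in bits:
    h ^= int(x)
    recursive_pred.append(h)
print("recursive state accuracy:", np.mean(recursive_pred == parity))  # 1.0

# (2) A predictor that only sees the k most recent bits: the parity of the
#     unseen prefix is a fair coin, so its accuracy hovers around 50%.
k = 5
window_pred = [int(np.bitwise_xor.reduce(bits[t - k + 1:t + 1]))
               for t in range(k, len(bits))]
print("window (k=5) accuracy:", np.mean(window_pred == parity[k:]))    # ~0.5
```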
Hidden Markov Models (HMMs):
HMMs introduce hidden states that DO satisfy the Markov property, even when observations don't:
$$P(z_t | z_1, \ldots, z_{t-1}) = P(z_t | z_{t-1})$$
$$P(x_t | z_1, \ldots, z_t, x_1, \ldots, x_{t-1}) = P(x_t | z_t)$$
The key insight: while observations may have complex long-range dependencies, these can sometimes arise from simple Markovian dynamics in a latent space. The hidden state $z_t$ captures everything needed to generate future observations.
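A minimal generative sketch of this structure; the two-state transition and emission tables below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-state HMM: hidden states follow a Markov chain,
# each observation depends only on the current hidden state.
pi = np.array([0.5, 0.5])               # P(z_1)
A = np.array([[0.9, 0.1],               # P(z_t | z_{t-1})
              [0.2, 0.8]])
B = np.array([[0.7, 0.2, 0.1],          # P(x_t | z_t) over 3 observation symbols
              [0.1, 0.3, 0.6]])

def sample_hmm(T):
    z = rng.choice(2, p=pi)
    zs, xs = [z], [rng.choice(3, p=B[z])]
    for _ in range(T - 1):
        z = rng.choice(2, p=A[z])
        zs.append(z)
        xs.append(rng.choice(3, p=B[z]))
    return zs, xs

hidden, observed = sample_hmm(10)
print("hidden:  ", hidden)
print("observed:", observed)
# The observed sequence is generally NOT first-order Markov,
# even though the hidden sequence is.
```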
Limitations of HMMs:
The hidden state is a single discrete variable, so capturing rich histories requires a rapidly growing number of states, and the transition and emission dynamics must be expressed as explicit probability tables. The bottleneck is the discreteness itself.
Toward Neural Hidden States:
What if, instead of discrete hidden states, we used continuous vectors that could flexibly encode arbitrary information? And what if, instead of hand-designed transition dynamics, we learned how to update these vectors from data?
This is precisely the idea behind recurrent neural networks:
$$h_t = f_\theta(h_{t-1}, x_t)$$
where $h_t \in \mathbb{R}^d$ is a continuous hidden state, $x_t$ is the current input, and $f_\theta$ is a learned update function with parameters $\theta$.
The hidden state $h_t$ serves as a learned, continuous compression of the entire history $(x_1, \ldots, x_t)$.
Before designing models, it's valuable to categorize the types of dependencies we aim to capture. Different dependency structures suggest different architectural choices.
Mutual Information Analysis:
The dependency structure can be probed empirically by computing mutual information between $x_t$ and $x_{t-k}$ for varying lags $k$:
$$I(X_t; X_{t-k}) = H(X_t) - H(X_t | X_{t-k})$$
This measures how much knowing $x_{t-k}$ reduces uncertainty about $x_t$. Plotting $I(X_t; X_{t-k})$ vs. $k$ reveals the dependency structure: rapid, exponential decay indicates short-range (approximately Markovian) structure, while slow or power-law decay signals long-range dependencies.
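For discrete sequences, this curve can be estimated with a simple plug-in estimator; the sketch below uses a toy self-repeating sequence and applies no bias correction:

```python
import numpy as np
from collections import Counter

def mutual_information(seq, k):
    """Plug-in estimate of I(X_t; X_{t-k}) from empirical pair frequencies."""
    pairs = list(zip(seq[:-k], seq[k:]))
    n = len(pairs)
    p_xy = {pair: c / n for pair, c in Counter(pairs).items()}
    p_x = Counter(a for a, _ in pairs)
    p_y = Counter(b for _, b in pairs)
    return sum(p * np.log2(p / ((p_x[a] / n) * (p_y[b] / n)))
               for (a, b), p in p_xy.items())

# Toy sequence with short-range structure: each symbol tends to repeat.
rng = np.random.default_rng(0)
seq = [0]
for _ in range(20_000):
    seq.append(int(seq[-1]) if rng.random() < 0.9 else int(rng.integers(0, 3)))

for k in (1, 2, 5, 10, 50):
    print(f"I(X_t; X_t-{k}) ≈ {mutual_information(seq, k):.3f} bits")  # decays with lag
```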
Understanding dependency structure informs model selection. For pure local dependencies, CNNs or short-context models may suffice. For long-range dependencies, LSTMs, attention mechanisms, or state-space models become necessary. For very long sequences with sparse dependencies, specialized architectures like Transformers with sparse attention or retrieval augmentation may be required.
A powerful theoretical lens for understanding sequential dependencies comes from predictive information theory. This framework quantifies what a sequence remembers about its past and how that memory supports future prediction.
Excess Entropy (Predictive Information):
The excess entropy $E$ measures the total mutual information between past and future:
$$E = I(\mathbf{X}_{\text{past}}; \mathbf{X}_{\text{future}}) = H(\mathbf{X}_{\text{future}}) - H(\mathbf{X}_{\text{future}} | \mathbf{X}_{\text{past}})$$
This quantifies how much knowing the entire past reduces uncertainty about the entire future. It captures the 'complexity' of a sequence in terms of its long-range structure.
Statistical Complexity and Minimal Representations:
The statistical complexity $C_\mu$ represents the minimum memory required to optimally predict future observations. It equals the entropy of the causal states—equivalence classes of histories that lead to identical future predictions.
Formally, two histories $\mathbf{x}_{<t}$ and $\mathbf{x}'_{<t}$ are equivalent if:
$$P(\mathbf{X}_{\geq t} | \mathbf{X}_{<t} = \mathbf{x}_{<t}) = P(\mathbf{X}_{\geq t} | \mathbf{X}_{<t} = \mathbf{x}'_{<t})$$
The statistical complexity $C_\mu$ satisfies:
$$E \leq C_\mu$$
with equality only for non-cryptic processes. The gap $C_\mu - E$, known as the crypticity, measures how much 'hidden' information the past contains that doesn't directly show up in future predictions.
Statistical complexity gives a lower bound on hidden state dimensionality: any model that optimally predicts must have at least $C_\mu$ bits of hidden state. This provides a target for model capacity—too little capacity means suboptimal prediction; too much may overfit. RNN hidden dimensions are often chosen empirically, but this theory suggests principled targets.
The Memory-Prediction Tradeoff:
An important insight from this framework is the tradeoff between memory and prediction accuracy. With limited memory (hidden state dimensionality), we cannot maintain perfect representations of history: we must choose which aspects of the past to retain, and accept some loss of predictive accuracy for whatever we discard.
This is exactly what recurrent neural networks learn to do: they develop representations that maximize predictive power within fixed capacity constraints. The hidden state dynamics encode a compression scheme that retains predictively relevant information and discards the rest.
Before turning to neural approaches, let's survey classical methods for modeling sequential dependencies. These form the intellectual foundation upon which RNNs build.
Autoregressive (AR) Models:
For continuous-valued sequences, autoregressive models express $x_t$ as a linear combination of past values plus noise:
$$x_t = c + \sum_{i=1}^{p} \phi_i x_{t-i} + \epsilon_t$$
where $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$ is white noise.
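A quick simulation of this recursion for an AR(2) process, with illustrative coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ar(phi, c=0.0, sigma=1.0, T=500):
    """Simulate x_t = c + sum_i phi_i * x_{t-i} + eps_t, with eps_t ~ N(0, sigma^2)."""
    p = len(phi)
    x = np.zeros(T + p)                        # zero-padded warm-up values
    for t in range(p, T + p):
        # x[t-p:t][::-1] = [x_{t-1}, ..., x_{t-p}], matching phi = [phi_1, ..., phi_p]
        x[t] = c + np.dot(phi, x[t - p:t][::-1]) + rng.normal(0.0, sigma)
    return x[p:]

series = simulate_ar(phi=[0.6, 0.3])           # illustrative stationary AR(2) coefficients
print(series[:5])
```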
Properties:
AR($p$) models are linear, require only $p + 2$ parameters ($c$, $\phi_1, \ldots, \phi_p$, $\sigma^2$), can be fit by least squares or the Yule-Walker equations, and have well-understood stationarity conditions.
Limitations:
They capture only linear dependencies, only within a fixed window of $p$ steps, and only for continuous values; nonlinear, long-range, or discrete structure (such as language) is out of reach.
We have now surveyed the landscape of classical sequence models and their limitations. The case for neural approaches emerges from several converging arguments: the parameter explosion and data sparsity of $n$-grams, the bounded memory of any fixed-order model, the rigidity of hand-designed hidden-state dynamics, and the restriction of classical autoregressive models to linear dependencies.
The Fundamental Insight:
Recurrent neural networks address the dependency modeling problem by maintaining a learned hidden state that evolves as the sequence progresses:
$$h_t = f_\theta(h_{t-1}, x_t)$$
This simple recursion has profound implications:
Variable-length history: Unlike n-grams with fixed context, $h_t$ can theoretically carry information from arbitrarily far back.
Learned compression: The network learns what to remember and what to forget, based on what improves predictions.
Shared parameters: The same function $f_\theta$ is applied at every timestep, enabling generalization and reducing parameters.
Differentiable: The entire computation from $x_{1:T}$ to predictions is differentiable, enabling gradient-based learning.
This is the architecture we will develop in the coming pages, starting with the autoregressive formulation that makes training tractable.
At each timestep, an RNN cell (i) takes the current input and previous hidden state, (ii) computes a new hidden state via learned transformations, and (iii) optionally produces an output. By unfolding this recursion, we can view the RNN as a very deep network with shared weights across layers—making training challenging but powerful.
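A minimal numpy sketch of such a cell; the tanh update, the dimensions, and the random initialization are illustrative choices, not a prescription:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3              # illustrative sizes

# Parameters theta, shared across all timesteps.
W_xh = rng.normal(scale=0.1, size=(d_hidden, d_in))
W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
W_hy = rng.normal(scale=0.1, size=(d_out, d_hidden))
b_h = np.zeros(d_hidden)
b_y = np.zeros(d_out)

def rnn_step(h_prev, x_t):
    """One cell update: h_t = f_theta(h_{t-1}, x_t), plus an optional output y_t."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

# Unfolding the recursion over a sequence: the same weights are applied at every step.
xs = rng.normal(size=(10, d_in))             # a length-10 input sequence
h = np.zeros(d_hidden)                       # initial hidden state
for x_t in xs:
    h, y = rnn_step(h, x_t)
print("final hidden state summarizes the whole sequence:", h.round(3))
```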
This page has established the theoretical foundations for understanding sequential dependencies—the core challenge that recurrent neural networks address.
What's Next:
Having established the need for learned hidden state representations, we now turn to Autoregressive Models—the fundamental framework for neural sequence generation. This framework provides the training objective and evaluation methodology for recurrent neural networks.
You now understand why sequence modeling is challenging and why recurrent neural networks were developed. The key insight: we need learned, continuous representations that can flexibly encode variable-length histories while remaining computationally tractable. This is precisely what RNNs provide.