The essence of sequential modeling lies in capturing dependencies—the statistical relationships that bind elements of a sequence together. When we observe that 'the' is likely followed by a noun, that a rising stock price often accelerates, or that a specific DNA motif predicts gene activation, we are recognizing dependencies.
But dependencies in sequences present a fundamental challenge: variable-length conditioning. To predict $x_t$, we ideally want to condition on the entire history $x_1, x_2, \ldots, x_{t-1}$. As $t$ grows, this history becomes unbounded—yet we must compress it into a fixed computational representation usable by our model.
This page develops the theoretical frameworks for understanding and modeling sequential dependencies. We begin with the simplest assumption—the Markov property—and progressively relax it, building towards the need for architectures that can maintain and utilize long-range information.
By the end of this page, you will: (1) Understand the Markov assumption and its limitations, (2) Formalize higher-order Markov models, (3) Grasp the concept of sufficient statistics for histories, (4) See why fixed-length context is insufficient, and (5) Motivate the need for learned, variable-length representations.
Let us begin with the most general formulation of sequential modeling. Given a sequence $\mathbf{x} = (x_1, x_2, \ldots, x_T)$, we wish to model the joint probability distribution:
$$P(\mathbf{x}) = P(x_1, x_2, \ldots, x_T)$$
Using the chain rule of probability, this can always be decomposed as:
$$P(x_1, \ldots, x_T) = P(x_1) \cdot P(x_2|x_1) \cdot P(x_3|x_1, x_2) \cdot \ldots \cdot P(x_T|x_1, \ldots, x_{T-1})$$
$$= \prod_{t=1}^{T} P(x_t | \mathbf{x}_{<t})$$
where $\mathbf{x}_{<t} = (x_1, \ldots, x_{t-1})$ denotes the history up to time $t-1$.
The chain rule factorization is not an approximation—it is an exact identity that holds for any joint distribution. The modeling choices come in HOW we represent or approximate each conditional $P(x_t | \mathbf{x}_{<t})$. Different assumptions about the form of these conditionals lead to different model classes.
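To make the decomposition concrete, here is a minimal sketch (with made-up conditional probabilities) showing that a sequence log-probability is just the sum of the per-step conditional log-probabilities, however those conditionals are obtained:

```python
import numpy as np

def sequence_log_prob(step_log_probs):
    """
    Chain rule: log P(x_1, ..., x_T) = sum_t log P(x_t | x_{<t}).
    step_log_probs[t] is assumed to hold log P(x_t | x_{<t}),
    whether it came from counts, a table, or a neural network.
    """
    return float(np.sum(step_log_probs))

# Hypothetical conditionals for a 4-token sequence.
log_probs = np.log([0.2, 0.5, 0.1, 0.4])  # P(x_1), P(x_2|x_1), P(x_3|x_<3), P(x_4|x_<4)
print(sequence_log_prob(log_probs))       # log(0.2 * 0.5 * 0.1 * 0.4) ≈ -5.52
```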
The Representation Problem:
In principle, each conditional $P(x_t | \mathbf{x}_{<t})$ is a function of the entire history $\mathbf{x}_{<t}$. For a sequence of length $T$ with each $x_t$ taking $K$ possible values, the number of possible histories grows as $K^{t-1}$.
At $t = 10$ with vocabulary $K = 10{,}000$ (typical for word-level text), there are $10{,}000^9 = 10^{36}$ possible histories—already far too many to enumerate explicitly. By $t = 21$, the count reaches the number of atoms in the observable universe ($\approx 10^{80}$).
Clearly, we cannot explicitly enumerate all histories. We need assumptions that constrain the conditional distributions to be learnable from finite data.
Three Fundamental Approaches:
Independence assumption (i.i.d.): $P(x_t | \mathbf{x}_{<t}) = P(x_t)$ — the trivial model that ignores all dependencies.
Finite history (Markov): $P(x_t | \mathbf{x}_{<t}) = P(x_t | x_{t-k}, \ldots, x_{t-1})$ — only recent history matters.
Learned compression: Encode history $\mathbf{x}_{<t}$ into a fixed-size representation $h_t$ such that $P(x_t | \mathbf{x}_{<t}) \approx P(x_t | h_t)$.
The third approach is the foundation of recurrent neural networks. But to appreciate its necessity and elegance, we must first understand why the simpler approaches are insufficient.
The Markov assumption (or Markov property) is the simplest non-trivial dependency structure. It states that the future depends on the past only through the present—all relevant information from history is captured in the current state.
First-Order Markov Property:
$$P(x_t | x_1, x_2, \ldots, x_{t-1}) = P(x_t | x_{t-1})$$
Given the immediately preceding element $x_{t-1}$, all earlier elements $x_1, \ldots, x_{t-2}$ provide no additional information about $x_t$. The sequence 'forgets' everything except the most recent observation.
Think of a first-order Markov process as a system with 'one step memory.' It knows where it is now, but not how it got there. Tomorrow's weather given today's is independent of yesterday's—or so the model assumes. This is a strong simplification that makes many problems tractable.
Markov Chains:
A Markov chain is a sequence of random variables satisfying the first-order Markov property. For discrete state spaces with $K$ states, the chain is fully characterized by:
Initial distribution: $\pi_i = P(x_1 = i)$ for $i = 1, \ldots, K$
Transition probabilities: $T_{ij} = P(x_t = j | x_{t-1} = i)$ for all state pairs
The transition matrix $\mathbf{T} \in \mathbb{R}^{K \times K}$ has rows summing to 1 (each row is a probability distribution). This representation requires only $O(K^2)$ parameters—independent of sequence length.
Joint probability under Markov assumption:
$$P(x_1, \ldots, x_T) = P(x_1) \prod_{t=2}^{T} P(x_t | x_{t-1}) = \pi_{x_1} \prod_{t=2}^{T} T_{x_{t-1}, x_t}$$
Example: transition matrix for a three-state weather model (each row sums to 1):

| From \ To | Sunny | Cloudy | Rainy |
|---|---|---|---|
| Sunny | 0.6 | 0.3 | 0.1 |
| Cloudy | 0.3 | 0.4 | 0.3 |
| Rainy | 0.2 | 0.3 | 0.5 |
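As a small illustration, the sketch below samples from this weather chain and scores a sequence with the factorization above; the uniform initial distribution is an assumption added for the example, not part of the table.

```python
import numpy as np

states = ["Sunny", "Cloudy", "Rainy"]
pi = np.array([1/3, 1/3, 1/3])            # assumed uniform initial distribution
T = np.array([[0.6, 0.3, 0.1],            # rows: from-state, columns: to-state
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

def sample_chain(rng, length):
    """Sample a state sequence from the first-order Markov chain (pi, T)."""
    seq = [rng.choice(3, p=pi)]
    for _ in range(length - 1):
        seq.append(rng.choice(3, p=T[seq[-1]]))
    return seq

def chain_log_prob(seq):
    """log P(x_1..x_T) = log pi[x_1] + sum_t log T[x_{t-1}, x_t]."""
    lp = np.log(pi[seq[0]])
    for prev, cur in zip(seq[:-1], seq[1:]):
        lp += np.log(T[prev, cur])
    return lp

rng = np.random.default_rng(0)
seq = sample_chain(rng, 7)
print([states[s] for s in seq], chain_log_prob(seq))
```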
Where Markov Chains Work:
Markov chains work well when the current observation genuinely summarizes the past: short-range physical processes like the weather example above, random walks, and other settings where dependencies really are one step deep.
Where Markov Chains Fail:
Consider the sentence: 'The keys on the table that the man bought at the antique shop last weekend are missing.'
To predict 'are' (plural), the model needs 'keys' (the subject), not 'weekend' (the most recent noun). First-order Markov models fail catastrophically on such long-range subject-verb agreement—a fundamental pattern in language.
A natural extension of the first-order Markov assumption is to condition on more than one previous element. An $n$th-order Markov model conditions on the previous $n$ elements:
$$P(x_t | x_1, \ldots, x_{t-1}) = P(x_t | x_{t-n}, \ldots, x_{t-1})$$
This is equivalent to an $(n+1)$-gram model in NLP terminology: unigram ($n=0$), bigram ($n=1$), trigram ($n=2$), etc.
N-Gram Language Models:
For decades, $n$-gram models were the workhorses of statistical NLP. Given a vocabulary of $V$ words, an $n$-gram model estimates:
$$P(w_t | w_{t-n+1}, \ldots, w_{t-1})$$
These probabilities are estimated from corpus counts:
$$\hat{P}(w_t | w_{t-n+1:t-1}) = \frac{\text{count}(w_{t-n+1}, \ldots, w_t)}{\text{count}(w_{t-n+1}, \ldots, w_{t-1})}$$
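A minimal sketch of this count-based estimation on a toy bigram example; the corpus and function name are illustrative choices:

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()

unigram = Counter(corpus[:-1])                    # count(w_{t-1})
bigram = Counter(zip(corpus[:-1], corpus[1:]))    # count(w_{t-1}, w_t)

def p_mle(prev, word):
    """Maximum-likelihood estimate: count(prev, word) / count(prev)."""
    return bigram[(prev, word)] / unigram[prev]

print(p_mle("the", "cat"))   # 0.25: "the" is followed by cat, mat, dog, rug
print(p_mle("the", "fish"))  # 0.0: never seen, even though it is a valid continuation
```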
The exponential explosion problem:
An $n$-gram model has $O(V^n)$ parameters, where $n$ here counts the full gram (context plus predicted word). With vocabulary $V = 50{,}000$:
| Model | Parameters | Storage (32-bit floats) |
|---|---|---|
| Unigram | 50K | 200 KB |
| Bigram | 2.5 billion | 10 GB |
| Trigram | 125 trillion | 500 TB |
| 4-gram | $6.25 \times 10^{18}$ | 25 exabytes |
Beyond trigrams, explicit storage becomes impossible. Even with aggressive pruning and smoothing, higher-order $n$-grams hit fundamental limits.
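The storage figures in the table follow directly from $V^n$ parameters at 4 bytes each, as this quick check shows:

```python
V = 50_000
for n, name in [(1, "unigram"), (2, "bigram"), (3, "trigram"), (4, "4-gram")]:
    params = V ** n
    bytes_needed = params * 4  # 32-bit (4-byte) floats
    print(f"{name:8s} {params:.3e} params  ~{bytes_needed:.3e} bytes")
```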
Even with enormous corpora, most n-grams never appear in training data. This 'sparse data' problem means that raw counts give probability 0 to never-seen-but-valid sequences. Smoothing techniques (Kneser-Ney, Good-Turing) help distribute probability mass, but cannot fundamentally overcome sparsity for long contexts.
Limitations of Fixed-Order Models:
No matter how large $n$ is, fixed-order Markov models have fundamental limitations:
Bounded memory: They cannot capture dependencies longer than $n$ steps, no matter how important.
No generalization across contexts: The learned statistics for 'the cat sat' don't inform predictions for 'the dog sat'—each context is separate.
No semantic compositionality: The meaning of a phrase isn't derived from component meanings—each n-gram is atomic.
Wasteful representation: Storing all possible contexts ignores that most contexts never occur and many are similar.
```python
# Demonstration of n-gram limitations

def demonstrate_ngram_limits():
    """
    Sentences whose correct continuation depends on a subject
    far outside any fixed n-gram window.
    """
    examples = [
        # The trigram context "the man saw" is identical in the first and
        # third sentences, but the verb must agree with a subject that sits
        # arbitrarily far back.
        ("The dog that the man saw ___", "was"),    # subject 5 words back
        ("The dog that bit me ___", "was"),         # subject 4 words back
        ("The dogs that the man saw ___", "were"),  # plural subject, same trigram
    ]

    # A trigram model only sees P(word | "man", "saw"):
    # it cannot know whether the subject was singular or plural.
    for sentence, continuation in examples:
        print(f"{sentence:40s} -> {continuation}")

    # The gap between subject and verb can be arbitrarily long:
    context = ("The programmer who wrote the complex algorithm that "
               "solved the problem that had stumped the team for months ___")
    # Subject "programmer" is about 17 words back; correct: "was", not "were".
    print("\n" + context)

    print("\nN-gram order required for subject-verb agreement:")
    print("  Simple sentence:       n >= 4-5")
    print("  Complex sentence:      n >= 15-20")
    print("  Legal/academic prose:  n >= 30+")
    print("\nConclusion: no fixed n works for natural language")

demonstrate_ngram_limits()
```

The limitations of explicit n-gram models point toward a more powerful idea: instead of conditioning on raw history, we condition on a compressed representation of history. This representation—a hidden state or summary statistic—captures everything from the past that is relevant for predicting the future.
Formally, we seek a function $h: \mathcal{X}^* \to \mathcal{H}$ that maps variable-length histories to a fixed-dimensional space $\mathcal{H}$:
$$h_t = h(x_1, x_2, \ldots, x_{t-1})$$
such that:
$$P(x_t | x_1, \ldots, x_{t-1}) = P(x_t | h_t)$$
If this equality holds exactly, $h_t$ is called a sufficient statistic for predicting $x_t$.
In classical statistics, a sufficient statistic captures all information in data relevant for estimating a parameter. The sample mean is sufficient for estimating the mean of a Gaussian. Here, we extend this idea dynamically: $h_t$ should capture all information in the history relevant for predicting the next observation.
The Existence Question:
Does a compact sufficient statistic always exist? Not in general. A useful stress test is the parity sequence: predicting the running parity $x_1 \oplus x_2 \oplus \ldots \oplus x_{t-1}$ requires information from every previous bit, so no fixed window of recent observations suffices. A single running-parity bit is a sufficient statistic here, but only if it is updated recursively at every step, which is exactly what fixed-context models cannot do.
In general, the minimal sufficient statistic for a sequence may be the sequence itself. However, for many natural sequences, good approximations exist that capture most predictive information in bounded dimensions.
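To make the parity example concrete, the sketch below contrasts one particular fixed-window predictor (window size 5, an arbitrary choice) with a one-bit recursive state on a random bit stream:

```python
import numpy as np

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=10_000)

# Target: running parity y_t = x_1 XOR ... XOR x_t.
parity = np.bitwise_xor.accumulate(bits)

# (1) One-bit recursive state h_t = h_{t-1} XOR x_t reproduces the target exactly.
h = 0
recursive_pred = []
for x in bits:
    h ^= int(x)
    recursive_pred.append(h)
print("recursive state accuracy:", np.mean(recursive_pred == parity))  # 1.0

# (2) A predictor that only sees the k most recent bits: the parity of the
#     unseen prefix is a fair coin, so its accuracy hovers around 50%.
k = 5
window_pred = [int(np.bitwise_xor.reduce(bits[t - k + 1:t + 1]))
               for t in range(k, len(bits))]
print("window (k=5) accuracy:", np.mean(window_pred == parity[k:]))    # ~0.5
```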
Hidden Markov Models (HMMs):
HMMs introduce hidden states that DO satisfy the Markov property, even when observations don't:
$$P(z_t | z_1, \ldots, z_{t-1}) = P(z_t | z_{t-1})$$
$$P(x_t | z_1, \ldots, z_t, x_1, \ldots, x_{t-1}) = P(x_t | z_t)$$
The key insight: while observations may have complex long-range dependencies, these can sometimes arise from simple Markovian dynamics in a latent space. The hidden state $z_t$ captures everything needed to generate future observations.
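A minimal generative sketch of this structure; the two-state transition and emission tables below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-state HMM: hidden states follow a Markov chain,
# each observation depends only on the current hidden state.
pi = np.array([0.5, 0.5])               # P(z_1)
A = np.array([[0.9, 0.1],               # P(z_t | z_{t-1})
              [0.2, 0.8]])
B = np.array([[0.7, 0.2, 0.1],          # P(x_t | z_t) over 3 observation symbols
              [0.1, 0.3, 0.6]])

def sample_hmm(T):
    z = rng.choice(2, p=pi)
    zs, xs = [z], [rng.choice(3, p=B[z])]
    for _ in range(T - 1):
        z = rng.choice(2, p=A[z])
        zs.append(z)
        xs.append(rng.choice(3, p=B[z]))
    return zs, xs

hidden, observed = sample_hmm(10)
print("hidden:  ", hidden)
print("observed:", observed)
# The observed sequence is generally NOT first-order Markov,
# even though the hidden sequence is.
```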
Limitations of HMMs:
The hidden state is a single discrete variable, so capturing rich histories requires a rapidly growing number of states, and the transition and emission dynamics must be expressed as explicit probability tables. The bottleneck is the discreteness itself.
Toward Neural Hidden States:
What if, instead of discrete hidden states, we used continuous vectors that could flexibly encode arbitrary information? And what if, instead of hand-designed transition dynamics, we learned how to update these vectors from data?
This is precisely the idea behind recurrent neural networks:
$$h_t = f_\theta(h_{t-1}, x_t)$$
where $h_t \in \mathbb{R}^d$ is a continuous hidden state, $x_t$ is the current input, and $f_\theta$ is a learned update function with parameters $\theta$.
The hidden state $h_t$ serves as a learned, continuous compression of the entire history $(x_1, \ldots, x_t)$.
Before designing models, it's valuable to categorize the types of dependencies we aim to capture. Different dependency structures suggest different architectural choices.
Mutual Information Analysis:
The dependency structure can be probed empirically by computing mutual information between $x_t$ and $x_{t-k}$ for varying lags $k$:
$$I(X_t; X_{t-k}) = H(X_t) - H(X_t | X_{t-k})$$
This measures how much knowing $x_{t-k}$ reduces uncertainty about $x_t$. Plotting $I(X_t; X_{t-k})$ vs. $k$ reveals the dependency structure: rapid, exponential decay indicates short-range (approximately Markovian) structure, while slow or power-law decay signals long-range dependencies.
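For discrete sequences, this curve can be estimated with a simple plug-in estimator; the sketch below uses a toy self-repeating sequence and applies no bias correction:

```python
import numpy as np
from collections import Counter

def mutual_information(seq, k):
    """Plug-in estimate of I(X_t; X_{t-k}) from empirical pair frequencies."""
    pairs = list(zip(seq[:-k], seq[k:]))
    n = len(pairs)
    p_xy = {pair: c / n for pair, c in Counter(pairs).items()}
    p_x = Counter(a for a, _ in pairs)
    p_y = Counter(b for _, b in pairs)
    return sum(p * np.log2(p / ((p_x[a] / n) * (p_y[b] / n)))
               for (a, b), p in p_xy.items())

# Toy sequence with short-range structure: each symbol tends to repeat.
rng = np.random.default_rng(0)
seq = [0]
for _ in range(20_000):
    seq.append(int(seq[-1]) if rng.random() < 0.9 else int(rng.integers(0, 3)))

for k in (1, 2, 5, 10, 50):
    print(f"I(X_t; X_t-{k}) ≈ {mutual_information(seq, k):.3f} bits")  # decays with lag
```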
Understanding dependency structure informs model selection. For pure local dependencies, CNNs or short-context models may suffice. For long-range dependencies, LSTMs, attention mechanisms, or state-space models become necessary. For very long sequences with sparse dependencies, specialized architectures like Transformers with sparse attention or retrieval augmentation may be required.
A powerful theoretical lens for understanding sequential dependencies comes from predictive information theory. This framework quantifies what a sequence remembers about its past and how that memory supports future prediction.
Excess Entropy (Predictive Information):
The excess entropy $E$ measures the total mutual information between past and future:
$$E = I(\mathbf{X}_{\text{past}}; \mathbf{X}_{\text{future}}) = H(\mathbf{X}_{\text{future}}) - H(\mathbf{X}_{\text{future}} | \mathbf{X}_{\text{past}})$$
This quantifies how much knowing the entire past reduces uncertainty about the entire future. It captures the 'complexity' of a sequence in terms of its long-range structure.
Statistical Complexity and Minimal Representations:
The statistical complexity $C_\mu$ represents the minimum memory required to optimally predict future observations. It equals the entropy of the causal states—equivalence classes of histories that lead to identical future predictions.
Formally, two histories $\mathbf{x}_{<t}$ and $\mathbf{x}'_{<t}$ are equivalent if:
$$P(\mathbf{X}_{\geq t} | \mathbf{X}_{<t} = \mathbf{x}_{<t}) = P(\mathbf{X}_{\geq t} | \mathbf{X}_{<t} = \mathbf{x}'_{<t})$$
The statistical complexity $C_\mu$ satisfies:
$$E \leq C_\mu$$
with equality only for non-cryptic processes. The gap $C_\mu - E$, known as the crypticity, measures how much 'hidden' information the past contains that doesn't directly show up in future predictions.
Statistical complexity gives a lower bound on hidden state dimensionality: any model that optimally predicts must have at least $C_\mu$ bits of hidden state. This provides a target for model capacity—too little capacity means suboptimal prediction; too much may overfit. RNN hidden dimensions are often chosen empirically, but this theory suggests principled targets.
The Memory-Prediction Tradeoff:
An important insight from this framework is the tradeoff between memory and prediction accuracy. With limited memory (hidden state dimensionality), we cannot maintain perfect representations of history: we must choose which aspects of the past to retain, and accept some loss of predictive accuracy for whatever we discard.
This is exactly what recurrent neural networks learn to do: they develop representations that maximize predictive power within fixed capacity constraints. The hidden state dynamics encode a compression scheme that retains predictively relevant information and discards the rest.
Before turning to neural approaches, let's survey classical methods for modeling sequential dependencies. These form the intellectual foundation upon which RNNs build.
Autoregressive (AR) Models:
For continuous-valued sequences, autoregressive models express $x_t$ as a linear combination of past values plus noise:
$$x_t = c + \sum_{i=1}^{p} \phi_i x_{t-i} + \epsilon_t$$
where $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$ is white noise.
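A quick simulation of this recursion for an AR(2) process, with illustrative coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ar(phi, c=0.0, sigma=1.0, T=500):
    """Simulate x_t = c + sum_i phi_i * x_{t-i} + eps_t, with eps_t ~ N(0, sigma^2)."""
    p = len(phi)
    x = np.zeros(T + p)                        # zero-padded warm-up values
    for t in range(p, T + p):
        # x[t-p:t][::-1] = [x_{t-1}, ..., x_{t-p}], matching phi = [phi_1, ..., phi_p]
        x[t] = c + np.dot(phi, x[t - p:t][::-1]) + rng.normal(0.0, sigma)
    return x[p:]

series = simulate_ar(phi=[0.6, 0.3])           # illustrative stationary AR(2) coefficients
print(series[:5])
```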
Properties:
AR($p$) models are linear, require only $p + 2$ parameters ($c$, $\phi_1, \ldots, \phi_p$, $\sigma^2$), can be fit by least squares or the Yule-Walker equations, and have well-understood stationarity conditions.
Limitations:
They capture only linear dependencies, only within a fixed window of $p$ steps, and only for continuous values; nonlinear, long-range, or discrete structure (such as language) is out of reach.
We have now surveyed the landscape of classical sequence models and their limitations. The case for neural approaches emerges from several converging arguments: the parameter explosion and data sparsity of $n$-grams, the bounded memory of any fixed-order model, the rigidity of hand-designed hidden-state dynamics, and the restriction of classical autoregressive models to linear dependencies.
The Fundamental Insight:
Recurrent neural networks address the dependency modeling problem by maintaining a learned hidden state that evolves as the sequence progresses:
$$h_t = f_\theta(h_{t-1}, x_t)$$
This simple recursion has profound implications:
Variable-length history: Unlike n-grams with fixed context, $h_t$ can theoretically carry information from arbitrarily far back.
Learned compression: The network learns what to remember and what to forget, based on what improves predictions.
Shared parameters: The same function $f_\theta$ is applied at every timestep, enabling generalization and reducing parameters.
Differentiable: The entire computation from $x_{1:T}$ to predictions is differentiable, enabling gradient-based learning.
This is the architecture we will develop in the coming pages, starting with the autoregressive formulation that makes training tractable.
At each timestep, an RNN cell (i) takes the current input and previous hidden state, (ii) computes a new hidden state via learned transformations, and (iii) optionally produces an output. By unfolding this recursion, we can view the RNN as a very deep network with shared weights across layers—making training challenging but powerful.
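A minimal numpy sketch of such a cell; the tanh update, the dimensions, and the random initialization are illustrative choices, not a prescription:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3              # illustrative sizes

# Parameters theta, shared across all timesteps.
W_xh = rng.normal(scale=0.1, size=(d_hidden, d_in))
W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
W_hy = rng.normal(scale=0.1, size=(d_out, d_hidden))
b_h = np.zeros(d_hidden)
b_y = np.zeros(d_out)

def rnn_step(h_prev, x_t):
    """One cell update: h_t = f_theta(h_{t-1}, x_t), plus an optional output y_t."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

# Unfolding the recursion over a sequence: the same weights are applied at every step.
xs = rng.normal(size=(10, d_in))             # a length-10 input sequence
h = np.zeros(d_hidden)                       # initial hidden state
for x_t in xs:
    h, y = rnn_step(h, x_t)
print("final hidden state summarizes the whole sequence:", h.round(3))
```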
This page has established the theoretical foundations for understanding sequential dependencies—the core challenge that recurrent neural networks address.
What's Next:
Having established the need for learned hidden state representations, we now turn to Autoregressive Models—the fundamental framework for neural sequence generation. This framework provides the training objective and evaluation methodology for recurrent neural networks.
You now understand why sequence modeling is challenging and why recurrent neural networks were developed. The key insight: we need learned, continuous representations that can flexibly encode variable-length histories while remaining computationally tractable. This is precisely what RNNs provide.