Consider the difference between a photograph and a video. The photograph captures a single instant—a frozen slice of reality with spatial structure but no temporal extent. The video, by contrast, is a sequence: a succession of frames whose meaning emerges not just from individual images but from their ordering, their transitions, and their evolution over time.
This distinction lies at the heart of sequence modeling. While much of classical machine learning focuses on independent and identically distributed (i.i.d.) data—where each sample stands alone—an enormous class of real-world problems involves data where order matters fundamentally. The word 'dog' differs from 'god' not in its letters but in their arrangement. A stock price today depends on prices yesterday and last month. A medical diagnosis unfolds through a progression of symptoms over time.
In this page, we develop a rigorous taxonomy of sequential data types, understanding their structural properties, the assumptions they encode, and the challenges they pose for machine learning systems. This foundation is essential before we can design neural architectures that capture temporal dependencies effectively.
By the end of this page, you will: (1) Distinguish between i.i.d. and sequential data paradigms, (2) Classify sequential data by structural properties, (3) Understand the unique challenges sequences pose, and (4) Recognize sequential patterns across diverse application domains.
Most classical machine learning rests on a comfortable foundation: the assumption that training examples are independent and identically distributed (i.i.d.). Under this assumption, each data point $(x_i, y_i)$ is drawn from the same underlying distribution $P(X, Y)$, and knowing one data point tells us nothing additional about any other.
This assumption enables powerful theoretical guarantees. It allows us to bound generalization error, prove convergence of learning algorithms, and treat the training set as an unbiased sample of the true distribution. Most supervised learning theory—from VC dimension to PAC learning—depends critically on the i.i.d. assumption.
A sequence of random variables $X_1, X_2, \ldots, X_n$ is independent and identically distributed if: (1) Each $X_i$ follows the same probability distribution $P(X)$, and (2) For any subset of indices, $P(X_{i_1}, \ldots, X_{i_k}) = P(X_{i_1}) \cdot \ldots \cdot P(X_{i_k})$. The joint distribution factorizes completely.
Where I.I.D. Works Well:
The i.i.d. assumption is reasonable for many practical problems: classifying independently collected images, filtering spam emails from unrelated senders, or predicting outcomes for unrelated patients from single snapshots of measurements.
In these settings, treating samples as independent introduces minimal modeling error, and the mathematical convenience is well worth the approximation.
Where I.I.D. Breaks Down:
However, vast domains of data violate the i.i.d. assumption fundamentally—not as a technical inconvenience but as a structural feature of the data itself:
Text: Words in a sentence are not independent. 'The cat sat on the ___' restricts the next word far more than knowing individual word frequencies.
Speech: Phonemes are influenced by surrounding sounds (coarticulation). The sound of 'k' in 'skip' differs from 'k' in 'ski' because of what follows.
Time Series: A stock price at time $t$ depends on prices at times $t-1, t-2, \ldots$. Temperature today correlates with temperature yesterday.
Video: Each frame depends on previous frames. Object trajectories, camera motion, and scene dynamics all create temporal structure.
Genomic Sequences: DNA bases follow statistical patterns. Promoter regions, codons, and regulatory elements all create non-random sequential structure.
User Behavior Logs: A user's next action depends on their previous actions. Session context, browsing history, and past purchases all inform predictions.
Treating sequential data as i.i.d. doesn't just reduce accuracy—it can lead to fundamentally wrong conclusions. A language model trained on shuffled text would learn that 'the' is common, but not that it precedes nouns. A time series model ignoring temporal order would miss trends, periodicities, and dependencies that define the signal. The structure IS the information.
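To make this concrete, here is a minimal sketch in plain Python using a made-up toy sentence: shuffling the words leaves the unigram counts untouched but destroys the bigram statistics that carry the ordering information.

```python
import random
from collections import Counter

# Made-up toy sentence; any text would do.
text = "the cat sat on the mat because the cat was tired".split()

random.seed(0)
shuffled = text[:]
random.shuffle(shuffled)

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

# Unigram (word-frequency) statistics survive shuffling ...
print(Counter(text) == Counter(shuffled))            # True
# ... but bigram statistics do not: the ordering carried the information.
print(Counter(bigrams(text)).most_common(2))
print(Counter(bigrams(shuffled)).most_common(2))
```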
A sequence is an ordered collection of elements, typically indexed by a discrete or continuous variable representing position or time. Formally:
$$\mathbf{x} = (x_1, x_2, \ldots, x_T)$$
where $T$ is the sequence length (which may vary across examples or even be infinite), and each $x_t$ belongs to some observation space $\mathcal{X}$.
The critical property distinguishing sequences from sets or bags is that the index $t$ carries semantic meaning. Permuting a sequence generally produces a different object with different meaning:
$$(x_1, x_2, x_3) \neq (x_3, x_1, x_2)$$
This contrasts with a set $\{x_1, x_2, x_3\}$, which equals $\{x_3, x_1, x_2\}$ by definition.
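In code terms, this is the difference between a tuple and a set (a trivial Python check, included only for illustration):

```python
# Ordered: permuting a tuple yields a different object.
print((1, 2, 3) == (3, 1, 2))   # False

# Unordered: a set is equal under any permutation of its elements.
print({1, 2, 3} == {3, 1, 2})   # True
```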
| Variable | Description | Definition / Domain |
|---|---|---|
| $\mathbf{x}$ | Complete sequence | $(x_1, x_2, \ldots, x_T) \in \mathcal{X}^T$ |
| $x_t$ | Element at position/time $t$ | $x_t \in \mathcal{X}$ (reals, vectors, symbols) |
| $T$ | Sequence length | $T \in \mathbb{Z}^+$ (variable or fixed) |
| $t$ | Index variable | Time, position, step number |
| $\mathbf{x}_{<t}$ | Prefix up to position $t-1$ | $(x_1, \ldots, x_{t-1})$ |
| $\mathbf{x}_{t:t'}$ | Slice from $t$ to $t'$ | $(x_t, x_{t+1}, \ldots, x_{t'})$ |
The Probabilistic View:
From a probabilistic perspective, a sequence is a realization of a stochastic process—a collection of random variables indexed by time or position. The joint distribution of a sequence of length $T$ is:
$$P(\mathbf{x}) = P(x_1, x_2, \ldots, x_T)$$
Unlike i.i.d. data where this factorizes into independent terms, sequential data exhibits complex dependencies. The chain rule of probability always allows us to write:
$$P(x_1, x_2, \ldots, x_T) = P(x_1) \cdot P(x_2|x_1) \cdot P(x_3|x_1, x_2) \cdot \ldots \cdot P(x_T|x_1, \ldots, x_{T-1})$$
$$= \prod_{t=1}^{T} P(x_t | x_1, \ldots, x_{t-1}) = \prod_{t=1}^{T} P(x_t | \mathbf{x}_{<t})$$
This factorization is central to autoregressive modeling, which we will explore in detail. The key insight is that the conditional distributions $P(x_t | \mathbf{x}_{<t})$ capture how each element depends on its history.
If all $x_t$ were independent, then $P(x_t | \mathbf{x}_{<t}) = P(x_t)$, and the chain rule would collapse to a simple product of marginals. The deviation from this—how much $P(x_t | \mathbf{x}_{<t})$ differs from $P(x_t)$—quantifies the sequential structure. Highly structured sequences (like meaningful text) have strong dependencies; noise sequences have weak dependencies.
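As a minimal illustration, the sketch below uses a hypothetical two-state Markov chain (the initial and transition probabilities are made-up numbers) and evaluates the joint probability of a sequence through exactly this chain-rule factorization, where each conditional happens to depend only on the previous element:

```python
import numpy as np

# Made-up two-state Markov chain over the symbols {0, 1}.
p_init = np.array([0.5, 0.5])        # P(x_1)
p_trans = np.array([[0.9, 0.1],      # P(x_t | x_{t-1} = 0)
                    [0.2, 0.8]])     # P(x_t | x_{t-1} = 1)

def joint_prob(seq):
    """P(x_1, ..., x_T) = P(x_1) * prod_t P(x_t | x_{t-1})."""
    prob = p_init[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= p_trans[prev, cur]
    return prob

# A 'sticky' sequence is far more probable than a rapidly alternating one.
print(joint_prob([0, 0, 0, 1, 1]))   # 0.5 * 0.9 * 0.9 * 0.1 * 0.8 = 0.0324
print(joint_prob([0, 1, 0, 1, 0]))   # 0.5 * 0.1 * 0.2 * 0.1 * 0.2 = 0.0002
```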
Sequential data manifests in remarkably diverse forms across application domains. Understanding this diversity is crucial because different structural properties demand different modeling approaches. We classify sequences along several orthogonal dimensions:
| Domain | Sequence Type | Element Type | Typical Length |
|---|---|---|---|
| Natural Language | Text (sentences, documents) | Discrete tokens | 10-10,000+ tokens |
| Speech | Audio waveform | Continuous (sampled at 16kHz+) | 10,000-1,000,000+ samples |
| Computer Vision | Video frames | High-dim continuous | 30-10,000+ frames |
| Genomics | DNA/RNA sequences | Discrete (4-letter alphabet) | 100-3 billion bases |
| Finance | Price time series | Continuous multivariate | 100-100,000+ ticks |
| Healthcare | Clinical records | Mixed (events + values) | 10-10,000+ events |
| Robotics | Sensor/action trajectories | Continuous multivariate | 100-10,000+ steps |
| Music | Audio or symbolic notes | Discrete (notes) or continuous (audio) | 100-1,000,000+ notes/samples |
Beyond the basic taxonomy, sequences exhibit several important structural properties that influence modeling choices. Understanding these properties helps us select appropriate architectures and anticipate potential challenges.
A sequence is stationary if its statistical properties do not change over time. Formally, for a stationary process, the joint distribution of $(x_t, x_{t+1}, \ldots, x_{t+k})$ is identical for all $t$—only the relative positions matter, not the absolute time.
Examples of stationarity: white noise, sensor readings from a physical system in equilibrium, and the letter statistics of a long text written in a single language and style are all approximately stationary.
Examples of non-stationarity: stock prices with trends and regime changes, temperature series with seasonal cycles and long-term climate drift, and user engagement metrics that shift as a product and its audience evolve.
Non-stationarity complicates modeling because a model trained on one part of the sequence may not generalize to another. Techniques like differencing, normalization, or explicit time-aware mechanisms help address non-stationarity.
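For example, here is a minimal NumPy sketch, assuming a synthetic series with a linear trend: first differencing turns the drifting mean into a roughly constant one, which is one of the remedies mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(200)

# Synthetic non-stationary series: linear trend plus i.i.d. noise.
x = 0.05 * t + rng.normal(scale=1.0, size=t.size)

# First differencing: d_t = x_t - x_{t-1}; the trend becomes a constant offset.
d = np.diff(x)

# The mean of the raw series drifts between halves; the differenced one does not.
print(x[:100].mean(), x[100:].mean())   # clearly different (roughly 2.5 vs 7.5)
print(d[:100].mean(), d[100:].mean())   # both close to the slope, 0.05
```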
Sequences differ in how far back dependencies extend:
Short-range dependencies: $x_t$ depends primarily on the recent past—$x_{t-1}, x_{t-2}, \ldots, x_{t-k}$ for small $k$. Many physical processes and Markovian systems exhibit short-range dependencies.
Long-range dependencies: $x_t$ may depend on elements far back in history. Examples include a pronoun whose antecedent appeared many sentences earlier, a closing bracket in source code that must match an opening bracket far upstream, and seasonal effects in a time series that echo values from a full year before.
The challenge: Capturing long-range dependencies is one of the central challenges in sequence modeling. Vanilla RNNs struggle with this due to vanishing gradients—a topic we will explore in depth later. Architectures like LSTM, GRU, and especially Transformers were developed specifically to address long-range dependency modeling.
Many real-world sequences exhibit periodic patterns—regular cycles at one or more frequencies: daily and weekly cycles in web traffic and electricity demand, yearly seasonality in retail sales and temperature, and circadian rhythms in physiological signals.
Periodicities can be explicitly modeled (via Fourier features, seasonal decomposition) or left for the model to discover. The choice depends on whether periodicity is known a priori and how strongly it dominates other patterns.
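If the period is known a priori, Fourier features are easy to construct by hand. The sketch below is one possible encoding (NumPy; the function name and the 24-step daily cycle are illustrative assumptions, not a fixed recipe):

```python
import numpy as np

def fourier_features(t, period, n_harmonics=2):
    """Encode a time index with sin/cos features of a known period."""
    t = np.asarray(t, dtype=float)
    feats = []
    for k in range(1, n_harmonics + 1):
        angle = 2.0 * np.pi * k * t / period
        feats.append(np.sin(angle))
        feats.append(np.cos(angle))
    return np.stack(feats, axis=-1)

# Hourly time steps with an assumed 24-step (daily) cycle.
X = fourier_features(np.arange(48), period=24, n_harmonics=2)
print(X.shape)   # (48, 4)
```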
Some sequences have compositional structure—they are built from reusable parts combined according to rules: morphemes compose into words, words into phrases, and phrases into sentences; notes form motifs, motifs form musical phrases, and phrases form sections; tokens form expressions, statements, and functions in source code.
Compositionality suggests that models benefit from hierarchical representations—capturing structure at multiple levels of abstraction. This insight motivates architectures with attention across levels or explicit hierarchical processing.
The most effective sequence models leverage known structural properties. For periodic data, incorporate Fourier features or seasonal indices. For compositional data, use hierarchical or attention-based models. For short-range dependencies, simpler models may suffice. Understanding your data's structure before selecting an architecture is crucial.
Information theory provides a powerful lens for understanding sequential data and quantifying its structure. The central concept is entropy—a measure of uncertainty or information content.
For a single random variable $X$ with distribution $P$, the entropy is:
$$H(X) = -\sum_x P(x) \log P(x)$$
For sequences, we care about conditional entropy—the remaining uncertainty about $x_t$ after observing all previous elements:
$$H(X_t | X_{<t}) = -\sum_{x_{<t}} P(x_{<t}) \sum_{x_t} P(x_t|x_{<t}) \log P(x_t|x_{<t})$$
Entropy Rate:
For a stationary process, the entropy rate measures the average uncertainty per symbol as sequence length grows infinitely:
$$H(\mathcal{X}) = \lim_{T \to \infty} \frac{1}{T} H(X_1, X_2, \ldots, X_T) = \lim_{t \to \infty} H(X_t | X_{<t})$$
The entropy rate is a fundamental measure of sequence complexity: a low rate means each new element is largely predictable from its history, while a rate near $\log_2 |\mathcal{X}|$ means the sequence is close to incompressible noise.
For English text, the entropy rate has been estimated at approximately 1.2-1.5 bits per character—far below the ~4.7 bits/character that would occur if characters were uniformly distributed. This gap represents the redundancy in language that models can exploit.
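This gap can be estimated empirically, at least roughly. The sketch below (plain Python, a tiny repetitive toy corpus, simple plug-in estimates) compares the unconditional character entropy $H(X_t)$ with the first-order conditional entropy $H(X_t \mid X_{t-1})$; the absolute numbers are not representative of real English, but the drop from the first to the second illustrates the redundancy that sequence models exploit.

```python
import math
from collections import Counter

# Tiny, repetitive toy corpus; realistic estimates need large amounts of text.
text = "the quick brown fox jumps over the lazy dog " * 50

def entropy(counts):
    """Entropy in bits of the empirical distribution given by a Counter."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Unconditional character entropy H(X_t).
h_marginal = entropy(Counter(text))

# Conditional entropy H(X_t | X_{t-1}) = sum_prev P(prev) * H(X_t | X_{t-1}=prev).
pairs = Counter(zip(text, text[1:]))
prev_counts = Counter(text[:-1])
n_pairs = len(text) - 1
h_conditional = sum(
    (n_prev / n_pairs)
    * entropy(Counter({nxt: c for (p, nxt), c in pairs.items() if p == prev}))
    for prev, n_prev in prev_counts.items()
)

print(f"H(X_t)         ~ {h_marginal:.2f} bits/char")
print(f"H(X_t | X_t-1) ~ {h_conditional:.2f} bits/char")
```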
Shannon's source coding theorem tells us that the entropy rate equals the minimum achievable compression rate. This creates a deep connection: a good sequence model IS a good compressor, and vice versa. The better a model predicts $P(x_t|x_{<t})$, the more efficiently it can encode the sequence. This insight has been used to evaluate language models via perplexity, a measure directly related to compression efficiency.
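As a small back-of-the-envelope sketch (the bits-per-character figure is an illustrative assumption, not a measured result), cross-entropy translates directly into both a perplexity and an estimated code length:

```python
# Assumed cross-entropy of a character-level model on English text.
bits_per_char = 1.4

perplexity = 2 ** bits_per_char                        # effective branching factor
bytes_per_million_chars = 1_000_000 * bits_per_char / 8

print(f"per-character perplexity ~ {perplexity:.2f}")
print(f"~{bytes_per_million_chars:,.0f} bytes to encode 1,000,000 characters")
```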
Mutual Information and Dependency Quantification:
The mutual information between a current element and its history quantifies the sequential structure:
$$I(X_t; X_{<t}) = H(X_t) - H(X_t | X_{<t})$$
High mutual information indicates strong dependencies—knowing the history significantly reduces uncertainty about the current element. For i.i.d. sequences, mutual information is zero. For highly structured sequences like meaningful text, it is substantial.
We can also examine mutual information at different lags:
$$I(X_t; X_{t-k})$$
This reveals how far back dependencies extend. For Markovian processes, $I(X_t; X_{t-k})$ decays exponentially with lag $k$. For long-range dependent processes, decay is slower—potentially polynomial or even constant.
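Such lag-wise mutual information can be estimated from symbol counts. Below is a minimal plug-in estimator (plain Python, applied to a made-up 'sticky' binary Markov chain); plug-in estimates are biased for small samples, so the exact values should be treated with care, but the decay with lag is visible.

```python
import math
import random
from collections import Counter

random.seed(0)

# Made-up 'sticky' binary Markov chain: repeat the previous symbol with prob 0.9.
seq = [0]
for _ in range(20000):
    seq.append(seq[-1] if random.random() < 0.9 else 1 - seq[-1])

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# I(X_t; X_{t-k}) decays roughly exponentially with the lag k for a Markov chain.
for k in (1, 2, 5, 10, 20):
    print(k, round(mutual_information(seq[k:], seq[:-k]), 4))
```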
To solidify our understanding, let's examine specific application domains in detail, highlighting the unique characteristics and challenges of their sequential data.
Natural Language Processing (NLP)
Text represents a paradigmatic example of sequential data with rich structure:
Tokenization levels: text can be modeled as a sequence of characters, subword units (such as byte-pair encodings), or whole words, trading vocabulary size against sequence length.
Key properties: discrete symbols drawn from a large vocabulary, highly variable sequence lengths, strong long-range dependencies, and hierarchical, compositional structure.
Sequential patterns in language: syntactic agreement between distant words, coreference between pronouns and their antecedents, and discourse structure that spans paragraphs and documents.
Sequential data presents several challenges that don't arise—or are less severe—with i.i.d. data. Understanding these challenges is essential for designing effective models and training procedures.
The Causality Constraint:
A particularly important constraint in sequence modeling is causality: at prediction time, we only have access to the past. When predicting $x_{t+1}$, we cannot use information from $x_{t+2}, x_{t+3}, \ldots$—those haven't occurred yet.
This seems obvious, but it has profound implications:
Training vs. inference asymmetry: During training, we typically have access to the entire sequence. We must deliberately mask future information to avoid learning to 'cheat'.
Autoregressive bottleneck: At inference time, we must generate one element at a time, each conditioned on previously generated elements. This is inherently sequential and cannot be parallelized.
Error accumulation: Small errors in prediction compound. If we predict incorrectly at step $t$, all predictions from step $t+1$ onward may be influenced by this error.
Note that in some applications (e.g., bidirectional text encoding for classification), we DO have access to the full sequence. But for generation tasks—language generation, forecasting, planning—causality is fundamental.
A common training technique is 'teacher forcing'—using ground-truth history during training. This speeds convergence but creates a mismatch: during training, the model sees perfect history; during generation, it sees its own (imperfect) predictions. This 'exposure bias' can lead to cascading errors at inference time. We'll discuss mitigation strategies in later modules.
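The asymmetry is easiest to see side by side. The following architecture-agnostic sketch assumes a hypothetical `model` callable that maps a history to a prediction for the next element; only the history that gets fed back differs between the two regimes.

```python
def teacher_forced_predictions(model, sequence):
    """Training-style: every prediction is conditioned on the ground-truth prefix."""
    return [model(sequence[:t]) for t in range(1, len(sequence))]

def free_running_generation(model, prefix, n_steps):
    """Inference-style: each prediction is appended to the history and fed back,
    so an early mistake can influence everything that follows (exposure bias)."""
    history = list(prefix)
    for _ in range(n_steps):
        history.append(model(history))
    return history[len(prefix):]

# Dummy stand-in 'model': a persistence baseline that predicts the last element.
persist = lambda history: history[-1]
print(teacher_forced_predictions(persist, [1, 2, 3, 4]))   # [1, 2, 3]
print(free_running_generation(persist, [1, 2, 3], 3))      # [3, 3, 3]
```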
We have developed a comprehensive framework for understanding sequential data—the foundation upon which all recurrent neural network architectures are built. The key insights: (1) sequential data violates the i.i.d. assumption because the ordering itself carries information; (2) the chain rule factorizes any joint distribution into conditionals on the history, which is the basis of autoregressive modeling; (3) structural properties such as stationarity, dependency range, periodicity, and compositionality should guide the choice of architecture; (4) information-theoretic quantities such as entropy rate and mutual information quantify how much structure a sequence contains; and (5) causality, the training/inference asymmetry, and error accumulation make sequence modeling fundamentally harder than learning from i.i.d. data.
What's Next:
Having established what sequential data is and why it requires specialized treatment, we now turn to the question of how to model the dependencies within sequences. The next page introduces Modeling Dependencies—the core challenge of capturing how earlier elements in a sequence influence later ones, and why this is both critically important and surprisingly difficult.
You now have a rigorous understanding of sequential data types, their properties, and the challenges they present. This foundation is essential for appreciating why recurrent architectures were developed and how they address—and sometimes fail to address—the fundamental challenges of sequence modeling.