Standard mixture models treat observations as independently and identically distributed (i.i.d.)—each data point is generated by first selecting a component, then sampling from that component's distribution. This assumption is fundamental to tractable inference but profoundly limiting for sequential data.
Consider speech recognition: an audio signal consists of a sequence of acoustic frames corresponding to phonemes. Crucially, phonemes don't appear randomly—the phoneme at time $t$ strongly depends on previous phonemes. A 'k' sound is likely followed by a vowel, not by silence. This temporal structure carries essential information that i.i.d. models discard.
Hidden Markov Models (HMMs) extend mixture models by introducing Markov dependencies between the latent component indicators. Instead of independent component selection, the sequence of latent states $z_1, z_2, \ldots, z_T$ forms a Markov chain: the current state depends only on the previous state. This seemingly simple modification enables HMMs to capture rich temporal structure while remaining computationally tractable.
HMMs are foundational in machine learning, having driven breakthroughs in speech recognition, computational biology, handwriting recognition, and financial modeling. Understanding HMMs provides the conceptual foundation for more advanced sequential models like conditional random fields and recurrent neural networks.
By the end of this page, you will understand: (1) The HMM graphical model and its factorization; (2) The three fundamental HMM problems—evaluation, decoding, and learning; (3) The forward-backward algorithm for computing likelihoods and posteriors; (4) The Viterbi algorithm for finding optimal state sequences; (5) The Baum-Welch algorithm (EM for HMMs); (6) Connections to reinforcement learning and modern sequence models.
An HMM jointly models a sequence of observations $\mathbf{x} = (x_1, x_2, \ldots, x_T)$ and a corresponding sequence of hidden states $\mathbf{z} = (z_1, z_2, \ldots, z_T)$.
1. State space: Discrete states $z_t \in \{1, 2, \ldots, K\}$
2. Initial state distribution $\boldsymbol{\pi}$: $$\pi_k = p(z_1 = k)$$ Vector of length $K$ summing to 1.
3. Transition matrix $\mathbf{A}$: $$A_{jk} = p(z_t = k | z_{t-1} = j)$$ $K \times K$ matrix where each row sums to 1. Entry $A_{jk}$ is the probability of transitioning from state $j$ to state $k$.
4. Emission distributions $\{p(x | z = k)\}_{k=1}^K$: The probability of observing $x$ given state $k$. Common choices include a categorical distribution over discrete symbols, a single Gaussian per state, or a Gaussian mixture per state.
An HMM generates a sequence as follows: sample the initial state $z_1 \sim \boldsymbol{\pi}$; for each $t = 2, \ldots, T$, sample $z_t$ from row $z_{t-1}$ of $\mathbf{A}$; and at every step, emit an observation $x_t \sim p(x | z_t)$.
The joint probability factorizes as:
$$p(\mathbf{x}, \mathbf{z}) = p(z_1) \prod_{t=2}^T p(z_t | z_{t-1}) \prod_{t=1}^T p(x_t | z_t)$$
$$= \pi_{z_1} \prod_{t=2}^T A_{z_{t-1}, z_t} \prod_{t=1}^T p(x_t | z_t)$$
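For a fully observed pair $(\mathbf{x}, \mathbf{z})$, this factorization can be evaluated directly. A minimal sketch, assuming categorical emissions stored as a $K \times V$ matrix `B` (the function name is illustrative):

```python
import numpy as np

def joint_log_prob(x, z, pi, A, B):
    """log p(x, z) = log pi_{z_1} + sum_t log A_{z_{t-1}, z_t} + sum_t log B_{z_t, x_t}."""
    lp = np.log(pi[z[0]]) + np.log(B[z[0], x[0]])
    for t in range(1, len(x)):
        lp += np.log(A[z[t-1], z[t]]) + np.log(B[z[t], x[t]])
    return lp

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])
# p(x=(0,1), z=(0,0)) = 0.5 * 0.7 * 0.9 * 0.3 = 0.0945
print(np.exp(joint_log_prob([0, 1], [0, 0], pi, A, B)))
```

Working in log-space here mirrors the inference algorithms below, where it becomes essential.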
Markov property on states: $z_t \perp z_{1:t-2} | z_{t-1}$ (future is independent of past given present)
Conditional independence of observations: $x_t \perp x_{-t}, z_{-t} | z_t$ (observation depends only on its corresponding state)
D-separation: Given the hidden states, all observations are independent. Given observations, states are generally not independent—there's "explaining away."
A standard mixture model is an HMM with $T = 1$; the transition matrix is irrelevant for a single observation. Alternatively, an HMM with $\mathbf{A} = \mathbf{1}\boldsymbol{\pi}^T$ (all rows identical to $\boldsymbol{\pi}$) selects components i.i.d.—recovering the standard mixture.
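The special case is easy to verify numerically: when every row of the transition matrix equals $\boldsymbol{\pi}$, the next-state distribution is $\boldsymbol{\pi}$ regardless of where the chain currently is, so component selection is i.i.d. A quick check:

```python
import numpy as np

pi = np.array([0.5, 0.3, 0.2])
K = len(pi)
# Every row of A equals pi: transitions ignore the previous state.
A = np.tile(pi, (K, 1))

# Whatever the state distribution v at time t-1, the distribution at
# time t is exactly pi -- the chain has no memory.
v = np.array([0.1, 0.1, 0.8])
assert np.allclose(v @ A, pi)
```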
Working with HMMs requires solving three fundamental computational problems:
Question: Given model parameters $\boldsymbol{\theta} = (\boldsymbol{\pi}, \mathbf{A}, \{\text{emission params}\})$ and observation sequence $\mathbf{x} = (x_1, \ldots, x_T)$, compute $p(\mathbf{x} | \boldsymbol{\theta})$.
Challenge: Naive computation requires summing over all $K^T$ possible state sequences: $$p(\mathbf{x}) = \sum_{z_1, \ldots, z_T} p(\mathbf{x}, \mathbf{z})$$
For $T = 100$ and $K = 50$, this is $50^{100} \approx 10^{170}$ terms—utterly intractable.
Solution: The forward algorithm computes this in $O(TK^2)$ time using dynamic programming.
Applications: Model selection, anomaly detection, speech recognition confidence.
Question: Given $\boldsymbol{\theta}$ and $\mathbf{x}$, find the most probable state sequence: $$\mathbf{z}^* = \arg\max_{\mathbf{z}} p(\mathbf{z} | \mathbf{x}, \boldsymbol{\theta})$$
Challenge: Same combinatorial explosion—$K^T$ candidate sequences.
Solution: The Viterbi algorithm finds the global optimum in $O(TK^2)$ using dynamic programming.
Applications: Speech-to-text transcription, part-of-speech tagging, gene finding.
Question: Given observation sequences (potentially multiple), find parameters that maximize likelihood: $$\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} p(\mathbf{x} | \boldsymbol{\theta})$$
Challenge: Hidden states make the likelihood non-convex. Direct optimization is difficult.
Solution: The Baum-Welch algorithm (EM for HMMs) iteratively improves parameters.
Applications: Training speech recognizers, building biological sequence models.
| Problem | Algorithm | Time Complexity | Space Complexity | Output |
|---|---|---|---|---|
| Evaluation | Forward | O(TK²) | O(TK) | p(x₁:T) |
| Decoding | Viterbi | O(TK²) | O(TK) | argmax z₁:T |
| Learning | Baum-Welch | O(TK² · iterations) | O(TK) | θ* |
The forward-backward algorithm efficiently computes the likelihood $p(\mathbf{x})$ and the posterior marginals $p(z_t | \mathbf{x})$. It consists of two passes: a forward pass that accumulates evidence from the past, and a backward pass that accumulates evidence from the future.
Define the forward variable $\alpha_t(k)$ as the joint probability of observing the first $t$ observations and ending in state $k$:
$$\alpha_t(k) = p(x_1, \ldots, x_t, z_t = k)$$
Base case ($t = 1$): $$\alpha_1(k) = \pi_k \cdot p(x_1 | z_1 = k)$$
Recursion ($t > 1$): $$\alpha_t(k) = p(x_t | z_t = k) \sum_{j=1}^K \alpha_{t-1}(j) \cdot A_{jk}$$
The sum accumulates all ways to reach state $k$ at time $t$: for each previous state $j$, multiply the probability of being in $j$ at $t-1$ by the transition probability to $k$.
Termination: The total likelihood is: $$p(\mathbf{x}) = \sum_{k=1}^K \alpha_T(k)$$
Define the backward variable $\beta_t(k)$ as the probability of observing the remaining sequence given that we're in state $k$ at time $t$:
$$\beta_t(k) = p(x_{t+1}, \ldots, x_T | z_t = k)$$
Base case ($t = T$): $$\beta_T(k) = 1$$
Recursion ($t < T$): $$\beta_t(k) = \sum_{j=1}^K A_{kj} \cdot p(x_{t+1} | z_{t+1} = j) \cdot \beta_{t+1}(j)$$
Combining forward and backward variables yields the posterior:
$$\gamma_t(k) = p(z_t = k | \mathbf{x}) = \frac{\alpha_t(k) \cdot \beta_t(k)}{p(\mathbf{x})} = \frac{\alpha_t(k) \cdot \beta_t(k)}{\sum_j \alpha_t(j) \cdot \beta_t(j)}$$
Pairwise posterior (needed for learning): $$\xi_t(j, k) = p(z_t = j, z_{t+1} = k | \mathbf{x}) = \frac{\alpha_t(j) \cdot A_{jk} \cdot p(x_{t+1} | k) \cdot \beta_{t+1}(k)}{p(\mathbf{x})}$$
```python
import numpy as np

def forward_backward(observations, pi, A, emission_probs):
    """
    Forward-backward algorithm for HMMs.

    Args:
        observations: Sequence of observation indices (T,)
        pi: Initial distribution (K,)
        A: Transition matrix (K, K)
        emission_probs: Emission probabilities (K, V) for V observations

    Returns:
        alpha: Forward variables (T, K)
        beta: Backward variables (T, K)
        gamma: State posteriors (T, K)
        xi: Transition posteriors (T-1, K, K)
        log_likelihood: log p(observations)
    """
    T = len(observations)
    K = len(pi)

    # Get emission probabilities for observed sequence
    B = emission_probs[:, observations].T  # (T, K)

    # Forward pass (log-space for numerical stability)
    log_alpha = np.zeros((T, K))
    log_alpha[0] = np.log(pi + 1e-10) + np.log(B[0] + 1e-10)
    for t in range(1, T):
        for k in range(K):
            log_alpha[t, k] = (np.log(B[t, k] + 1e-10) +
                               np.logaddexp.reduce(log_alpha[t-1] + np.log(A[:, k] + 1e-10)))

    # Log-likelihood
    log_likelihood = np.logaddexp.reduce(log_alpha[T-1])

    # Backward pass
    log_beta = np.zeros((T, K))
    log_beta[T-1] = 0  # log(1) = 0
    for t in range(T-2, -1, -1):
        for k in range(K):
            log_beta[t, k] = np.logaddexp.reduce(
                np.log(A[k] + 1e-10) + np.log(B[t+1] + 1e-10) + log_beta[t+1]
            )

    # State posteriors
    log_gamma = log_alpha + log_beta - log_likelihood
    gamma = np.exp(log_gamma)

    # Transition posteriors
    xi = np.zeros((T-1, K, K))
    for t in range(T-1):
        for j in range(K):
            for k in range(K):
                xi[t, j, k] = np.exp(
                    log_alpha[t, j] + np.log(A[j, k] + 1e-10) +
                    np.log(B[t+1, k] + 1e-10) + log_beta[t+1, k] -
                    log_likelihood
                )

    alpha = np.exp(log_alpha)
    beta = np.exp(log_beta)

    return alpha, beta, gamma, xi, log_likelihood
```

The α and β values become exponentially small for long sequences due to repeated multiplication of probabilities < 1. Always work in log-space, using log-sum-exp for additions. Alternatively, use scaling factors that normalize α at each step (maintaining a running log of the scale factors).
The Viterbi algorithm finds the most likely state sequence—the global maximum a posteriori (MAP) path—using dynamic programming. It exploits the same Markov structure as forward-backward but uses max instead of sum.
The principle of optimality: the best path to state $k$ at time $t$ must pass through some state $j$ at time $t-1$, and that prefix must itself be an optimal path to $j$.
Define $\delta_t(k)$ as the probability of the best path ending in state $k$ at time $t$:
$$\delta_t(k) = \max_{z_1, \ldots, z_{t-1}} p(z_1, \ldots, z_{t-1}, z_t = k, x_1, \ldots, x_t)$$
Base case: $$\delta_1(k) = \pi_k \cdot p(x_1 | k)$$
Recursion: $$\delta_t(k) = p(x_t | k) \cdot \max_j \left[ \delta_{t-1}(j) \cdot A_{jk} \right]$$
Backpointers: Store the $\arg\max$ at each step: $$\psi_t(k) = \arg\max_j \left[ \delta_{t-1}(j) \cdot A_{jk} \right]$$
After completing the forward pass, select the final state $z_T^* = \arg\max_k \delta_T(k)$, then recover the rest of the path by following the backpointers: $z_t^* = \psi_{t+1}(z_{t+1}^*)$ for $t = T-1, \ldots, 1$.
| Aspect | Forward Algorithm | Viterbi Algorithm |
|---|---|---|
| Operation | Sum over previous states | Max over previous states |
| Output | Total probability p(x) | Best path probability |
| Backtracking | Not needed | Needed to recover path |
| Semantics | Marginalizes states | Finds MAP states |
Forward/Posterior: When you need uncertainty—"what's the probability of state $k$ at time $t$?"
Viterbi/MAP: When you need a single best explanation—"what sequence of states best explains the observations?"
```python
import numpy as np

def viterbi(observations, pi, A, emission_probs):
    """
    Viterbi algorithm for finding the most likely state sequence.

    Args:
        observations: Sequence of observation indices (T,)
        pi: Initial distribution (K,)
        A: Transition matrix (K, K)
        emission_probs: Emission probabilities (K, V)

    Returns:
        best_path: Most likely state sequence (T,)
        best_prob: Probability of best path
    """
    T = len(observations)
    K = len(pi)

    # Emission probabilities for observed sequence
    B = emission_probs[:, observations].T  # (T, K)

    # Log-space for numerical stability
    log_pi = np.log(pi + 1e-10)
    log_A = np.log(A + 1e-10)
    log_B = np.log(B + 1e-10)

    # Viterbi variables
    log_delta = np.zeros((T, K))
    psi = np.zeros((T, K), dtype=int)

    # Base case
    log_delta[0] = log_pi + log_B[0]

    # Recursion
    for t in range(1, T):
        for k in range(K):
            # Find best previous state
            scores = log_delta[t-1] + log_A[:, k]
            psi[t, k] = np.argmax(scores)
            log_delta[t, k] = log_B[t, k] + scores[psi[t, k]]

    # Backtrack
    best_path = np.zeros(T, dtype=int)
    best_path[T-1] = np.argmax(log_delta[T-1])
    best_log_prob = log_delta[T-1, best_path[T-1]]
    for t in range(T-2, -1, -1):
        best_path[t] = psi[t+1, best_path[t+1]]

    return best_path, np.exp(best_log_prob)

# Example usage
np.random.seed(42)
K, V, T = 3, 4, 10

pi = np.array([0.6, 0.3, 0.1])
A = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.3, 0.3, 0.4]
])
emission_probs = np.random.dirichlet(np.ones(V), size=K)

observations = np.random.choice(V, size=T)
path, prob = viterbi(observations, pi, A, emission_probs)
print(f"Best path: {path}")
print(f"Path probability: {prob:.6f}")
```

The Baum-Welch algorithm is EM specialized to HMMs. It iteratively updates parameters using expected sufficient statistics computed via forward-backward.
Run forward-backward to compute the state posteriors $\gamma_t(k) = p(z_t = k | \mathbf{x})$ and the pairwise posteriors $\xi_t(j, k) = p(z_t = j, z_{t+1} = k | \mathbf{x})$.
Initial distribution: $$\hat{\pi}_k = \gamma_1(k)$$
The expected count of starting in state $k$.
Transition matrix: $$\hat{A}_{jk} = \frac{\sum_{t=1}^{T-1} \xi_t(j, k)}{\sum_{t=1}^{T-1} \gamma_t(j)}$$
Numerator: expected count of $j \to k$ transitions. Denominator: expected count of being in state $j$ (when a transition is possible).
Emission parameters (discrete case): $$\hat{B}_{kv} = p(x = v | z = k) = \frac{\sum_{t: x_t = v} \gamma_t(k)}{\sum_{t=1}^T \gamma_t(k)}$$
Expected count of emitting $v$ from state $k$, divided by expected time in state $k$.
For Gaussian emission $p(x | z = k) = \mathcal{N}(x | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$:
$$\hat{\boldsymbol{\mu}}_k = \frac{\sum_{t=1}^T \gamma_t(k) \mathbf{x}_t}{\sum_{t=1}^T \gamma_t(k)}$$
$$\hat{\boldsymbol{\Sigma}}_k = \frac{\sum_{t=1}^T \gamma_t(k) (\mathbf{x}_t - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_t - \hat{\boldsymbol{\mu}}_k)^T}{\sum_{t=1}^T \gamma_t(k)}$$
Weighted mean and covariance, weighted by the posterior probability of being in state $k$ at each time.
Given multiple independent sequences $\{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(M)}\}$, run forward-backward on each sequence separately and sum the expected counts across sequences before normalizing.
For example: $$\hat{A}_{jk} = \frac{\sum_{m=1}^M \sum_{t=1}^{T_m-1} \xi_t^{(m)}(j, k)}{\sum_{m=1}^M \sum_{t=1}^{T_m-1} \gamma_t^{(m)}(j)}$$
Like all EM algorithms, Baum-Welch guarantees monotonic likelihood improvement and convergence to a local maximum—but not necessarily the global maximum. Run from multiple random initializations and select the solution with highest likelihood. Domain-specific initialization (e.g., from uniform segmentation) often helps.
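The E-step and M-step above can be combined into a single training loop. Below is a compact single-sequence sketch with randomly initialized parameters; the function name and the plain-probability E-step are illustrative (for long sequences, use the log-space forward-backward shown earlier instead). The monotonic likelihood improvement can be checked directly on the returned history:

```python
import numpy as np

def baum_welch_discrete(x, K, V, n_iter=50, seed=0):
    """EM for a discrete-emission HMM on one sequence of symbol indices x.

    Uses a plain-probability forward-backward pass (fine for short
    sequences only). Returns (pi, A, B, log-likelihood history).
    """
    x = np.asarray(x)
    rng = np.random.default_rng(seed)
    T = len(x)
    pi = rng.dirichlet(np.ones(K))
    A = rng.dirichlet(np.ones(K), size=K)
    B = rng.dirichlet(np.ones(V), size=K)
    ll_hist = []
    for _ in range(n_iter):
        # E-step: forward-backward
        alpha = np.zeros((T, K))
        beta = np.ones((T, K))
        alpha[0] = pi * B[:, x[0]]
        for t in range(1, T):
            alpha[t] = B[:, x[t]] * (alpha[t-1] @ A)
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, x[t+1]] * beta[t+1])
        lik = alpha[-1].sum()
        ll_hist.append(np.log(lik))
        gamma = alpha * beta / lik                              # (T, K)
        xi = (alpha[:-1, :, None] * A[None] *
              (B[:, x[1:]].T * beta[1:])[:, None, :]) / lik    # (T-1, K, K)
        # M-step: ratios of expected counts
        pi = gamma[0]
        A = xi.sum(0) / gamma[:-1].sum(0)[:, None]
        B = np.zeros((K, V))
        for v in range(V):
            B[:, v] = gamma[x == v].sum(0)
        B /= gamma.sum(0)[:, None]
    return pi, A, B, ll_hist

x = np.array([0, 1, 0, 0, 1, 2, 2, 1, 0, 1, 0, 0])
pi, A, B, ll = baum_welch_discrete(x, K=2, V=3, n_iter=30)
print("Log-likelihood went from", ll[0], "to", ll[-1])
```

Each iteration costs one forward-backward pass, so the loop matches the $O(TK^2 \cdot \text{iterations})$ complexity in the table above.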
The basic HMM can be extended in many directions to handle different application requirements.
For sequential processes (speech, handwriting), states represent progression through a pattern. Constrain transitions to only allow forward movement:
$$A_{jk} = 0 \text{ for } k < j$$
The model can only stay in the current state or advance. This is standard in speech recognition phoneme models.
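Constructing a left-right transition matrix amounts to zeroing the lower triangle and renormalizing each row. A small sketch (the random scores are just placeholders for learned values):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
# Start from arbitrary positive scores, zero out backward moves
# (A[j, k] = 0 for k < j), then renormalize each row.
A = rng.random((K, K))
A = np.triu(A)
A /= A.sum(axis=1, keepdims=True)

# The last state can only self-loop: it is absorbing.
print(A)
```

Baum-Welch preserves these zeros automatically: a transition with zero probability accumulates zero expected counts, so the left-right structure survives re-estimation.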
Replace discrete emissions with continuous distributions—most commonly a single Gaussian per state, or a Gaussian mixture model (GMM) per state for multimodal data.
GMM-HMMs were the dominant approach in speech recognition before deep learning.
Share Gaussian components across states, with state-specific mixture weights: $$p(x|k) = \sum_m w_{km} \mathcal{N}(x | \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m)$$
Reduces parameters while allowing state-specific distributions.
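The tied-mixture emission density is straightforward to evaluate: the $M$ Gaussian components are computed once and reweighted per state. A minimal one-dimensional sketch (function names are illustrative):

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """Univariate Gaussian density, vectorized over mu and var."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def tied_mixture_emission(x, w, mus, vars_):
    """p(x | k) = sum_m w[k, m] * N(x; mus[m], vars_[m]).

    The M Gaussians (mus, vars_) are shared across all K states;
    only the mixing weights w (K, M) are state-specific.
    """
    comp = gauss_pdf(x, mus, vars_)  # (M,) shared component densities
    return w @ comp                  # (K,) one emission density per state

w = np.array([[0.7, 0.3],   # state 0 favors the component at 0
              [0.2, 0.8]])  # state 1 favors the component at 3
mus = np.array([0.0, 3.0])
vars_ = np.array([1.0, 1.0])
p = tied_mixture_emission(0.0, w, mus, vars_)
print(p)  # state 0 assigns higher density to x = 0 than state 1
```

With $K$ states and $M$ shared components, this needs $M$ Gaussians plus $KM$ weights instead of $KM$ full Gaussians.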
| Variant | Emissions | Transitions | Use Case |
|---|---|---|---|
| Discrete HMM | Categorical | Full | Discrete sequences (DNA, text) |
| Gaussian HMM | Single Gaussian | Full | Simple continuous data |
| GMM-HMM | Gaussian mixture | Full | Complex continuous (speech) |
| Left-Right | Any | Constrained (≥) | Sequential processes |
| Factorial HMM | Conditional on all | Independent chains | Multiple sources |
| Input-Output HMM | Depends on input | May depend on input | Conditional sequences |
Multiple independent Markov chains $\mathbf{z}^{(1)}, \ldots, \mathbf{z}^{(F)}$, with observations depending on all chains:
$$p(x_t | z_t^{(1)}, \ldots, z_t^{(F)})$$
This models multiple independent "causes" contributing to observations (e.g., different speakers in audio).
Condition transitions and/or emissions on external inputs: $$p(z_t | z_{t-1}, \mathbf{u}_t)$$ $$p(x_t | z_t, \mathbf{u}_t)$$
This bridges HMMs with supervised learning and connects to recurrent neural networks.
HMMs have driven major advances across diverse application domains.
Speech recognition is the application that defined HMM research for decades.
GMM-HMMs achieved commercially viable speech recognition, later enhanced by deep neural networks replacing GMMs.
Gene finding: Identify protein-coding regions in DNA
Protein structure: Profile HMMs for sequence alignment
Assign grammatical categories to words, with hidden states as part-of-speech tags and words as emissions.
RNNs and Transformers have largely superseded HMMs for complex sequence modeling (speech, language). However, HMMs remain valuable for: (1) Small data regimes where generative models excel; (2) Problems requiring interpretable state sequences; (3) Explicit uncertainty over hidden states; (4) Structured domains with known state semantics (biology).
This page has provided a comprehensive treatment of Hidden Markov Models, from fundamental concepts through practical algorithms.
What's Next: Infinite Mixtures (Dirichlet Process Mixture Models)
Both standard mixture models and HMMs require specifying the number of components $K$ in advance. But often we don't know how many components exist—we want the model complexity to grow with the data. The next page introduces Dirichlet Process Mixture Models (DPMMs), which extend mixture models with a nonparametric Bayesian prior that allows for a potentially infinite number of components, automatically inferring appropriate complexity from data.
You now understand Hidden Markov Models as mixture models extended to sequential data. The dynamic programming algorithms—forward-backward, Viterbi, and Baum-Welch—make inference and learning tractable despite exponential state spaces. Next, we'll explore how to let the number of mixture components be determined by the data itself.