Consider a fundamental question: Given an HMM with known parameters, what is the probability that it generated a specific observation sequence? This seemingly simple question—computing $P(\mathbf{x} | \boldsymbol{\theta})$—lies at the heart of many practical applications, from scoring candidate word models in speech recognition to comparing alternative models for a biological sequence.
The naive approach would enumerate all possible hidden state sequences, but with $N$ states and $T$ time steps, there are $N^T$ possibilities—an astronomical number even for moderate sequences. A 100-state HMM processing a 50-frame audio segment would require $100^{50} = 10^{100}$ computations, exceeding the number of atoms in the observable universe!
The Forward Algorithm provides an elegant solution: a dynamic programming approach that computes the exact likelihood in $O(TN^2)$ time by exploiting the Markov structure of HMMs.
By the end of this page, you will:
• Understand why naive enumeration is intractable
• Master the forward variable definition and recursive computation
• Implement the Forward Algorithm in both probability and log-probability space
• Understand the algorithm's connection to message passing in graphical models
• Know when and why to use scaling or log-space modifications
We formalize the evaluation problem as follows:
Given: an HMM with known parameters $\boldsymbol{\theta} = (\boldsymbol{\pi}, \mathbf{A}, \mathbf{B})$ and an observation sequence $\mathbf{x} = (x_1, x_2, \ldots, x_T)$.
Compute: $$P(\mathbf{x} | \boldsymbol{\theta}) = P(x_1, x_2, \ldots, x_T | \boldsymbol{\theta})$$
The probability of observing exactly this sequence under the model.
The observations depend on hidden states, which we don't observe. To compute the observation probability, we must marginalize over all possible hidden state sequences:
$$P(\mathbf{x} | \boldsymbol{\theta}) = \sum_{\mathbf{z}} P(\mathbf{x}, \mathbf{z} | \boldsymbol{\theta})$$
where the sum is over all $N^T$ possible sequences $\mathbf{z} = (z_1, \ldots, z_T)$.
Using the HMM factorization from the previous page:
$$P(\mathbf{x} | \boldsymbol{\theta}) = \sum_{z_1=1}^{N} \sum_{z_2=1}^{N} \cdots \sum_{z_T=1}^{N} \pi_{z_1} \prod_{t=2}^{T} A_{z_{t-1}, z_t} \prod_{t=1}^{T} B_{z_t}(x_t)$$
Direct computation requires $N^T$ terms in the sum. Each term involves $O(T)$ multiplications. Total complexity: $O(T \cdot N^T)$.
For a modest HMM with $N=10$ states and $T=100$ time steps:
• Number of sequences: $10^{100}$
• Even at $10^{15}$ operations/second, this would take on the order of $10^{77}$ years!
Clearly, we need a smarter approach.
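To make the intractability concrete, here is a minimal brute-force evaluator that enumerates every hidden path of a toy 2-state HMM (the same parameters as the worked example later on this page). It is only feasible because $T = 3$; the Forward Algorithm gets the identical answer without enumeration:

```python
import itertools
import numpy as np

# Toy 2-state HMM (matches the worked example later on this page)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])            # A[i, j] = P(z_t = j | z_{t-1} = i)
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])       # B[i, k] = P(x_t = k | z_t = i)
obs = [0, 1, 2]

# Enumerate all N^T = 2^3 hidden paths and sum the joint probabilities
total = 0.0
for path in itertools.product(range(2), repeat=len(obs)):
    p = pi[path[0]] * B[path[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
    total += p

print(f"{total:.6f}")  # 0.036280 -- exactly what the Forward Algorithm returns
```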
The breakthrough comes from recognizing that we can rearrange the computation by pushing sums inside products. Consider a simplified example:
$$\sum_{a}\sum_{b}\sum_{c} f(a) \cdot g(a,b) \cdot h(b,c) \cdot k(c)$$
Naively, this sums over all $|A| \times |B| \times |C|$ combinations. But we can factor:
$$= \sum_{c} k(c) \sum_{b} h(b,c) \sum_{a} f(a) \cdot g(a,b)$$
Now the innermost sum is computed first and reused, reducing complexity.
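A quick numerical check of this rearrangement, using random tables for $f$, $g$, $h$, $k$ (the array names below are just this sketch's stand-ins for the functions above):

```python
import numpy as np

rng = np.random.default_rng(0)
nA, nB, nC = 4, 5, 6
f = rng.random(nA)          # f(a)
g = rng.random((nA, nB))    # g(a, b)
h = rng.random((nB, nC))    # h(b, c)
k = rng.random(nC)          # k(c)

# Naive: iterate over all |A| * |B| * |C| combinations
naive = sum(f[a] * g[a, b] * h[b, c] * k[c]
            for a in range(nA) for b in range(nB) for c in range(nC))

# Factored: push each sum inward and reuse it (a chain of mat-vec products)
factored = ((f @ g) @ h) @ k

assert np.isclose(naive, factored)   # same value, far fewer operations
```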
For HMMs, the Markov structure means each transition term $A_{z_{t-1}, z_t}$ only connects adjacent time steps. This allows us to process the sequence left-to-right, aggregating partial results as we go.
The forward variable $\alpha_t(i)$ is defined as the joint probability of observing the first $t$ observations and being in state $i$ at time $t$:
$$\alpha_t(i) = P(x_1, x_2, \ldots, x_t, z_t = i | \boldsymbol{\theta})$$
This is a partial observation probability: we've seen $x_1$ through $x_t$ and we know we're in state $i$ at time $t$.
Once we compute $\alpha_T(i)$ for all states $i$, we can obtain the full observation probability by summing over the final state:
$$P(\mathbf{x} | \boldsymbol{\theta}) = \sum_{i=1}^{N} \alpha_T(i) = \sum_{i=1}^{N} P(x_1, \ldots, x_T, z_T = i)$$
The sum over final states marginalizes out the hidden state at time $T$, giving us the marginal probability of observations only.
Think of $\alpha_t(i)$ as a message that accumulates all the ways the system could have arrived at state $i$ at time $t$ while producing observations $x_1, \ldots, x_t$. Instead of tracking $N^t$ individual paths, we only track $N$ aggregate "probability masses"—one for each possible current state.
The forward computation is often visualized as a trellis diagram—a grid where rows index the $N$ states, columns index the $T$ time steps, and the cell $(i, t)$ holds the forward variable $\alpha_t(i)$.
We fill the trellis left-to-right, computing each column from the previous one. This is the essence of dynamic programming: solve subproblems (earlier time steps) and combine their solutions to solve larger problems (later time steps).
At $t=1$, there's no previous state—we're at the beginning of the sequence. The forward variable is simply the probability of starting in state $i$ and emitting $x_1$:
$$\alpha_1(i) = P(x_1, z_1 = i) = P(z_1 = i) \cdot P(x_1 | z_1 = i) = \pi_i \cdot B_i(x_1)$$
Interpretation: For each state $i$, multiply the prior probability of starting there ($\pi_i$) by the probability of emitting the first observation from that state ($B_i(x_1)$).
For $t > 1$, we derive the recursive relationship:
$$\begin{align} \alpha_t(j) &= P(x_1, \ldots, x_t, z_t = j) \\ &= \sum_{i=1}^{N} P(x_1, \ldots, x_t, z_{t-1} = i, z_t = j) \quad \text{(marginalize over } z_{t-1}\text{)} \\ &= \sum_{i=1}^{N} P(x_1, \ldots, x_{t-1}, z_{t-1} = i) \cdot P(z_t = j | z_{t-1} = i) \cdot P(x_t | z_t = j) \\ &= \sum_{i=1}^{N} \alpha_{t-1}(i) \cdot A_{ij} \cdot B_j(x_t) \end{align}$$
Rearranging for clarity:
$$\boxed{\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i) \cdot A_{ij} \right] \cdot B_j(x_t)}$$
The recursion has two conceptual parts:
1. Transition aggregation: $\sum_{i=1}^{N} \alpha_{t-1}(i) \cdot A_{ij}$
Sum over all states $i$ we could have been in at time $t-1$, weighted by: • How much probability mass was in state $i$: $\alpha_{t-1}(i)$ • The probability of transitioning from $i$ to $j$: $A_{ij}$
2. Emission incorporation: multiply by $B_j(x_t)$
Weight by the probability of observing $x_t$ from state $j$.
After processing all $T$ observations, the total observation probability is:
$$P(\mathbf{x} | \boldsymbol{\theta}) = \sum_{i=1}^{N} \alpha_T(i)$$
This sums the probability of all paths ending in each possible state.
Forward Algorithm
Initialize: For $i = 1, \ldots, N$: $$\alpha_1(i) = \pi_i \cdot B_i(x_1)$$
Recurse: For $t = 2, \ldots, T$ and $j = 1, \ldots, N$: $$\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i) \cdot A_{ij} \right] \cdot B_j(x_t)$$
Terminate: $$P(\mathbf{x} | \boldsymbol{\theta}) = \sum_{i=1}^{N} \alpha_T(i)$$
Time: $O(TN^2)$
Space: $O(N)$ if we only keep the current and previous columns; $O(TN)$ if we store the full trellis (needed for Baum-Welch)
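When only the likelihood is needed (no Baum-Welch), the $O(N)$-memory variant can be sketched by overwriting a single trellis column in place. This is an illustrative sketch, not this page's reference implementation:

```python
import numpy as np

def forward_prob_low_memory(A, B, pi, observations):
    """Return P(x | theta) while keeping only one trellis column: O(N) memory.

    Sufficient for the evaluation problem alone; Baum-Welch needs the
    full (T, N) trellis and should use the standard implementation.
    """
    alpha = pi * B[:, observations[0]]               # column at t = 1
    for t in range(1, len(observations)):
        alpha = (alpha @ A) * B[:, observations[t]]  # overwrite previous column
    return alpha.sum()
```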
Let's implement the Forward Algorithm with careful attention to numerical stability and code clarity.
```python
import numpy as np
from typing import Tuple


def forward_algorithm(
    A: np.ndarray,            # (N, N) transition matrix
    B: np.ndarray,            # (N, M) emission matrix
    pi: np.ndarray,           # (N,) initial distribution
    observations: np.ndarray  # (T,) observation indices
) -> Tuple[np.ndarray, float]:
    """
    Compute forward variables and observation probability.

    This is the basic implementation in probability space.
    WARNING: Underflows for long sequences! Use log-space version.

    Parameters
    ----------
    A : np.ndarray of shape (N, N)
        Transition matrix, A[i,j] = P(z_t=j | z_{t-1}=i)
    B : np.ndarray of shape (N, M)
        Emission matrix, B[i,k] = P(x_t=k | z_t=i)
    pi : np.ndarray of shape (N,)
        Initial state distribution
    observations : np.ndarray of shape (T,)
        Sequence of observation indices

    Returns
    -------
    alpha : np.ndarray of shape (T, N)
        Forward variables, alpha[t, i] = P(x_1:t, z_t=i)
    prob : float
        P(observations | model) = sum of alpha[T-1, :]
    """
    T = len(observations)
    N = len(pi)

    # Initialize alpha matrix to store all forward variables
    alpha = np.zeros((T, N))

    # Step 1: Initialization (t=0 in 0-indexed)
    # α_1(i) = π_i * B_i(x_1)
    alpha[0] = pi * B[:, observations[0]]

    # Step 2: Recursion (t=1 to T-1 in 0-indexed)
    # α_t(j) = [Σ_i α_{t-1}(i) * A_{i,j}] * B_j(x_t)
    for t in range(1, T):
        for j in range(N):
            # Sum over all previous states
            alpha[t, j] = np.sum(alpha[t-1] * A[:, j]) * B[j, observations[t]]

    # Step 3: Termination
    # P(x) = Σ_i α_T(i)
    prob = np.sum(alpha[-1])

    return alpha, prob


def forward_algorithm_vectorized(
    A: np.ndarray,
    B: np.ndarray,
    pi: np.ndarray,
    observations: np.ndarray
) -> Tuple[np.ndarray, float]:
    """
    Vectorized implementation of Forward Algorithm.

    Same result as forward_algorithm(), but uses matrix operations
    for better performance on larger models.
    """
    T = len(observations)
    N = len(pi)

    alpha = np.zeros((T, N))

    # Initialization
    alpha[0] = pi * B[:, observations[0]]

    # Recursion - vectorized
    for t in range(1, T):
        # alpha[t-1] @ A gives transition-weighted sums for each target state
        # Element-wise multiply by emission probabilities
        alpha[t] = (alpha[t-1] @ A) * B[:, observations[t]]

    prob = np.sum(alpha[-1])

    return alpha, prob


# Example: Weather HMM
def demo_forward_algorithm():
    """Demonstrate Forward Algorithm on weather example."""
    # States: 0=Sunny, 1=Cloudy, 2=Rainy
    # Observations: 0=Walk, 1=Shop, 2=Clean
    A = np.array([
        [0.7, 0.2, 0.1],  # From Sunny
        [0.3, 0.4, 0.3],  # From Cloudy
        [0.2, 0.3, 0.5],  # From Rainy
    ])
    B = np.array([
        [0.7, 0.2, 0.1],  # Sunny: prefer Walk
        [0.3, 0.4, 0.3],  # Cloudy: mixed
        [0.1, 0.2, 0.7],  # Rainy: prefer Clean
    ])
    pi = np.array([0.6, 0.3, 0.1])

    # Observation sequence: Walk, Shop, Clean, Clean
    observations = np.array([0, 1, 2, 2])

    alpha, prob = forward_algorithm(A, B, pi, observations)

    print("Forward Variables (α):")
    print("-" * 50)
    states = ["Sunny", "Cloudy", "Rainy"]
    obs_names = ["Walk", "Shop", "Clean", "Clean"]
    for t in range(len(observations)):
        print(f"t={t+1}, obs={obs_names[t]:<6}:", end=" ")
        for i, s in enumerate(states):
            print(f"α({s})={alpha[t,i]:.6f}", end=" ")
        print(f"| sum={np.sum(alpha[t]):.6f}")

    print(f"P(observations) = {prob:.8f}")
    print(f"Log probability = {np.log(prob):.4f}")


demo_forward_algorithm()
```

The basic Forward Algorithm multiplies many probabilities together. For long sequences, this causes numerical underflow—the numbers become so small that they round to zero in floating-point representation.
Consider: if each emission probability averages 0.1 and we have 100 time steps: $$0.1^{100} = 10^{-100} $$
A value of $10^{-100}$ is still representable—the smallest positive normal double-precision float is about $10^{-308}$—but doubling the sequence length, or combining emissions with small transition probabilities, quickly pushes intermediate values below that threshold, where they underflow to zero.
Work in log-probability space throughout. Define: $$\tilde{\alpha}_t(i) = \log \alpha_t(i) = \log P(x_1, \ldots, x_t, z_t = i)$$
The challenge is that the recursion involves a sum inside the log: $$\alpha_t(j) = \left[ \sum_{i} \alpha_{t-1}(i) \cdot A_{ij} \right] \cdot B_j(x_t)$$
In log space: $$\tilde{\alpha}_t(j) = \log\left[ \sum_{i} \exp\left(\tilde{\alpha}_{t-1}(i) + \log A_{ij}\right) \right] + \log B_j(x_t)$$
We use the log-sum-exp trick for numerical stability.
To compute $\log(e^{a_1} + e^{a_2} + \cdots + e^{a_n})$ stably, factor out the largest exponent:
$$\log \sum_{i=1}^{n} e^{a_i} = a_{\max} + \log \sum_{i=1}^{n} e^{a_i - a_{\max}}, \quad \text{where } a_{\max} = \max_i a_i$$
After subtracting $a_{\max}$, the largest exponent is 0 (so $e^0 = 1$), and other terms are small but positive. This prevents both underflow (tiny exponents) and overflow (huge exponents).
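The trick in code (a hand-rolled sketch; in practice `scipy.special.logsumexp` implements the same idea):

```python
import numpy as np

def logsumexp_stable(a):
    """log(sum_i exp(a_i)) computed via the max-shift trick."""
    a = np.asarray(a, dtype=float)
    a_max = np.max(a)
    if np.isinf(a_max):              # all -inf: total mass is zero
        return a_max
    # After the shift the largest exponent is 0, so exp() cannot overflow
    return a_max + np.log(np.sum(np.exp(a - a_max)))

vals = np.array([1000.0, 1001.0, 999.0])
# np.log(np.sum(np.exp(vals))) would overflow to inf; the shifted version is fine
print(logsumexp_stable(vals))        # ≈ 1001.4076
```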
```python
import numpy as np
from scipy.special import logsumexp  # Numerically stable log-sum-exp


def forward_algorithm_log(
    log_A: np.ndarray,   # (N, N) log transition matrix
    log_B: np.ndarray,   # (N, M) log emission matrix
    log_pi: np.ndarray,  # (N,) log initial distribution
    observations: np.ndarray
) -> tuple[np.ndarray, float]:
    """
    Forward Algorithm in log-probability space.

    Numerically stable for arbitrarily long sequences.
    All inputs should be log-probabilities.

    Returns
    -------
    log_alpha : np.ndarray of shape (T, N)
        Log forward variables
    log_prob : float
        Log P(observations | model)
    """
    T = len(observations)
    N = len(log_pi)

    log_alpha = np.zeros((T, N))

    # Initialization: log α_1(i) = log π_i + log B_i(x_1)
    log_alpha[0] = log_pi + log_B[:, observations[0]]

    # Recursion in log space
    for t in range(1, T):
        for j in range(N):
            # log [Σ_i α_{t-1}(i) * A_{i,j}]
            # = log Σ_i exp(log α_{t-1}(i) + log A_{i,j})
            log_alpha[t, j] = logsumexp(
                log_alpha[t-1] + log_A[:, j]
            ) + log_B[j, observations[t]]

    # Termination: log P(x) = log Σ_i α_T(i)
    log_prob = logsumexp(log_alpha[-1])

    return log_alpha, log_prob


def forward_algorithm_log_vectorized(
    log_A: np.ndarray,
    log_B: np.ndarray,
    log_pi: np.ndarray,
    observations: np.ndarray
) -> tuple[np.ndarray, float]:
    """
    Vectorized log-space Forward Algorithm.

    Uses broadcasting for efficient computation.
    """
    T = len(observations)
    N = len(log_pi)

    log_alpha = np.zeros((T, N))
    log_alpha[0] = log_pi + log_B[:, observations[0]]

    for t in range(1, T):
        # (N, 1) + (N, N) = (N, N), then logsumexp over axis 0
        # Gives the log of sum of alpha[t-1, i] * A[i, j] for each j
        log_alpha[t] = logsumexp(
            log_alpha[t-1, :, np.newaxis] + log_A, axis=0
        ) + log_B[:, observations[t]]

    log_prob = logsumexp(log_alpha[-1])

    return log_alpha, log_prob


# Comparison: probability space vs log space
def compare_implementations():
    """Show underflow in probability space vs stability in log space."""
    # Simple 2-state HMM
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.9, 0.1], [0.2, 0.8]])
    pi = np.array([0.6, 0.4])

    # Generate longer sequences to trigger underflow
    np.random.seed(42)
    T_values = [10, 50, 100, 200, 500]

    print("Sequence Length | Prob Space        | Log Space")
    print("-" * 55)
    for T in T_values:
        # Generate random observations
        obs = np.random.choice(2, size=T)

        # Probability space (will underflow)
        _, prob = forward_algorithm_vectorized(A, B, pi, obs)

        # Log space (stable)
        log_A = np.log(A)
        log_B = np.log(B)
        log_pi = np.log(pi)
        _, log_prob = forward_algorithm_log_vectorized(
            log_A, log_B, log_pi, obs
        )

        prob_str = f"{prob:.2e}" if prob > 0 else "UNDERFLOW (0.0)"
        print(f"T = {T:4d}       | {prob_str:17s} | {log_prob:.4f}")


compare_implementations()
```

An alternative approach is scaling, where we normalize the forward variables at each time step and track the scaling factors:
$$\hat{\alpha}_t(i) = \frac{\alpha_t(i)}{c_t}, \quad c_t = \sum_i \alpha_t(i)$$
The scaled variables satisfy $\sum_i \hat{\alpha}_t(i) = 1$ at each time step, preventing underflow.
The log probability is recovered as: $$\log P(\mathbf{x}) = \sum_{t=1}^{T} \log c_t$$
Scaling is equivalent to log-space computation but can be more natural in some implementations. The same scaling factors are used in the Backward pass (for Baum-Welch), ensuring consistency.
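A minimal sketch of the scaled forward pass (a hypothetical helper, not this page's reference code), which recovers $\log P(\mathbf{x})$ as the running sum of $\log c_t$:

```python
import numpy as np

def forward_scaled(A, B, pi, observations):
    """Forward pass with per-step scaling: returns (alpha_hat, log P(x)).

    Each column is normalized to sum to 1, so no value ever underflows;
    the product of the scaling factors c_t reconstructs the likelihood.
    """
    T, N = len(observations), len(pi)
    alpha_hat = np.zeros((T, N))
    log_prob = 0.0

    alpha = pi * B[:, observations[0]]
    for t in range(T):
        if t > 0:
            alpha = (alpha_hat[t - 1] @ A) * B[:, observations[t]]
        c = alpha.sum()              # scaling factor c_t
        alpha_hat[t] = alpha / c     # normalized column sums to 1
        log_prob += np.log(c)
    return alpha_hat, log_prob
```

On the worked example below (2 states, observations (0, 1, 2)), `np.exp(log_prob)` reproduces the unscaled likelihood.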
| Aspect | Log-Space | Scaling |
|---|---|---|
| Numerical stability | Excellent | Excellent |
| Implementation | Use logsumexp | Track scale factors |
| Combined with Backward | Separate log-forward and log-backward | Same scale factors for both |
| Interpretation | Log probabilities throughout | Normalized pseudo-probabilities |
| Memory | Same as basic | Extra O(T) for scale factors |
Let's trace through the Forward Algorithm step by step on a concrete example.
HMM Parameters:
$$\boldsymbol{\pi} = \begin{pmatrix} 0.6 \\ 0.4 \end{pmatrix}, \quad \mathbf{A} = \begin{pmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{pmatrix}, \quad \mathbf{B} = \begin{pmatrix} 0.5 & 0.4 & 0.1 \\ 0.1 & 0.3 & 0.6 \end{pmatrix}$$
Observation sequence: $\mathbf{x} = (0, 1, 2)$ (length $T=3$). The two hidden states are labeled H and L.
$$\alpha_1(H) = \pi_H \cdot B_H(0) = 0.6 \times 0.5 = 0.30$$ $$\alpha_1(L) = \pi_L \cdot B_L(0) = 0.4 \times 0.1 = 0.04$$
Sanity check: the sum is $0.34 = P(x_1 = 0)$, the marginal probability of the first observation.
For state H: $$\alpha_2(H) = \left[ \alpha_1(H) \cdot A_{HH} + \alpha_1(L) \cdot A_{LH} \right] \cdot B_H(1)$$ $$= [0.30 \times 0.7 + 0.04 \times 0.4] \times 0.4$$ $$= [0.21 + 0.016] \times 0.4 = 0.226 \times 0.4 = 0.0904$$
For state L: $$\alpha_2(L) = \left[ \alpha_1(H) \cdot A_{HL} + \alpha_1(L) \cdot A_{LL} \right] \cdot B_L(1)$$ $$= [0.30 \times 0.3 + 0.04 \times 0.6] \times 0.3$$ $$= [0.09 + 0.024] \times 0.3 = 0.114 \times 0.3 = 0.0342$$
For state H: $$\alpha_3(H) = [0.0904 \times 0.7 + 0.0342 \times 0.4] \times 0.1$$ $$= [0.06328 + 0.01368] \times 0.1 = 0.00770$$
For state L: $$\alpha_3(L) = [0.0904 \times 0.3 + 0.0342 \times 0.6] \times 0.6$$ $$= [0.02712 + 0.02052] \times 0.6 = 0.02858$$
$$P(\mathbf{x} = (0,1,2)) = \alpha_3(H) + \alpha_3(L) = 0.00770 + 0.02858 = 0.03628$$
| Time | Observation | α(H) | α(L) | Sum |
|---|---|---|---|---|
| t=1 | 0 | 0.3000 | 0.0400 | 0.3400 |
| t=2 | 1 | 0.0904 | 0.0342 | 0.1246 |
| t=3 | 2 | 0.0077 | 0.0286 | 0.0363 |
Notice how the sum decreases at each step—this is the "cumulative penalty" of observing each symbol. After observing $x_3=2$, most probability mass is in state L (0.0286 vs 0.0077), reflecting that state L prefers emitting symbol 2 (probability 0.6) while state H rarely emits it (probability 0.1).
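The hand computation can be double-checked in a few lines by replaying the vectorized recursion from the implementation section on these parameters:

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 1, 2]

alpha = np.zeros((3, 2))
alpha[0] = pi * B[:, obs[0]]                      # (0.30, 0.04)
for t in (1, 2):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

print(np.round(alpha, 4))           # rows match the table for t=1, t=2, t=3
print(round(alpha[-1].sum(), 5))    # 0.03628
```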
The Forward Algorithm is a special case of a more general framework: belief propagation (message passing) in graphical models.
In the HMM graphical model, we can view the forward variable as a message passed from left to right:
$$\alpha_t(z_t) = \text{"message from past telling } z_t \text{ what's been observed"}$$
At each time step, we take the incoming message from the past, weight it by the transition probabilities, sum out the previous state, and multiply in the local evidence $B_{z_t}(x_t)$ from the current observation.
The Forward Algorithm implements the sum-product algorithm on the HMM factor graph:
$$\alpha_t(z_t) = \sum_{z_{t-1}} \underbrace{\alpha_{t-1}(z_{t-1})}_{\text{incoming message}} \cdot \underbrace{A_{z_{t-1}, z_t}}_{\text{transition factor}} \cdot \underbrace{B_{z_t}(x_t)}_{\text{observation factor}}$$
This perspective generalizes: the same sum-product message passing performs exact inference in any tree-structured graphical model; the HMM chain is simply its most familiar special case.
Understanding the message-passing interpretation:
• Reveals the unity between different inference algorithms • Shows why HMMs are a special case of general graphical model inference • Explains why the Backward pass (for Baum-Welch) has the same structure • Generalizes to more complex models (factorial HMMs, DBNs)
Just as we pass messages forward, we can pass them backward:
$$\beta_t(i) = P(x_{t+1}, \ldots, x_T | z_t = i)$$
The backward variable asks: "Given we're in state $i$ at time $t$, what's the probability of all future observations?"
The recursion runs right-to-left: $$\beta_t(i) = \sum_{j=1}^{N} A_{ij} \cdot B_j(x_{t+1}) \cdot \beta_{t+1}(j)$$
Combining forward and backward gives us: $$P(z_t = i | \mathbf{x}) = \frac{\alpha_t(i) \cdot \beta_t(i)}{P(\mathbf{x})}$$
This is the posterior probability of state $i$ at time $t$, used in the Baum-Welch algorithm for learning. We'll develop this fully when covering parameter learning.
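A preview sketch of the backward recursion (an assumed helper, to be developed fully in the learning section). A useful invariant: for every $t$, $\sum_i \alpha_t(i)\,\beta_t(i) = P(\mathbf{x})$, which makes a handy unit test:

```python
import numpy as np

def backward_algorithm(A, B, observations):
    """Backward pass (sketch): beta[t, i] = P(x_{t+1:T} | z_t = i)."""
    T, N = len(observations), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0   # base case: the empty future has probability 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j A[i, j] * B[j, x_{t+1}] * beta_{t+1}(j)
        beta[t] = A @ (B[:, observations[t + 1]] * beta[t + 1])
    return beta
```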
The Forward Algorithm is the first of the three core HMM algorithms, solving the evaluation problem efficiently.
Next Page: The Viterbi Algorithm
While the Forward Algorithm computes the probability of observations (summing over all state sequences), the Viterbi Algorithm finds the single most likely state sequence. Instead of summing, we maximize—using dynamic programming structure that closely parallels the Forward Algorithm but answers a fundamentally different question.
You now have a thorough understanding of the Forward Algorithm—the mathematical derivation, efficient implementation, numerical stability considerations, and its role in the broader framework of graphical model inference. This algorithm is foundational for HMM applications and forms the basis for more advanced algorithms like Forward-Backward (for posterior decoding) and Baum-Welch (for learning).