Consider a fundamental question: Given an HMM with known parameters, what is the probability that it generated a specific observation sequence? This seemingly simple question—computing $P(\mathbf{x} | \boldsymbol{\theta})$—lies at the heart of many practical applications, from scoring candidate word models in speech recognition to comparing alternative models for a biological sequence.
The naive approach would enumerate all possible hidden state sequences, but with $N$ states and $T$ time steps, there are $N^T$ possibilities—an astronomical number even for moderate sequences. A 100-state HMM processing a 50-frame audio segment would require $100^{50} = 10^{100}$ computations, exceeding the number of atoms in the observable universe!
The Forward Algorithm provides an elegant solution: a dynamic programming approach that computes the exact likelihood in $O(TN^2)$ time by exploiting the Markov structure of HMMs.
By the end of this page, you will:
• Understand why naive enumeration is intractable
• Master the forward variable definition and recursive computation
• Implement the Forward Algorithm in both probability and log-probability space
• Understand the algorithm's connection to message passing in graphical models
• Know when and why to use scaling or log-space modifications
We formalize the evaluation problem as follows:
Given: an HMM with known parameters $\boldsymbol{\theta} = (\boldsymbol{\pi}, \mathbf{A}, \mathbf{B})$ and an observation sequence $\mathbf{x} = (x_1, x_2, \ldots, x_T)$.
Compute: $$P(\mathbf{x} | \boldsymbol{\theta}) = P(x_1, x_2, \ldots, x_T | \boldsymbol{\theta})$$
The probability of observing exactly this sequence under the model.
The observations depend on hidden states, which we don't observe. To compute the observation probability, we must marginalize over all possible hidden state sequences:
$$P(\mathbf{x} | \boldsymbol{\theta}) = \sum_{\mathbf{z}} P(\mathbf{x}, \mathbf{z} | \boldsymbol{\theta})$$
where the sum is over all $N^T$ possible sequences $\mathbf{z} = (z_1, \ldots, z_T)$.
Using the HMM factorization from the previous page:
$$P(\mathbf{x} | \boldsymbol{\theta}) = \sum_{z_1=1}^{N} \sum_{z_2=1}^{N} \cdots \sum_{z_T=1}^{N} \pi_{z_1} \prod_{t=2}^{T} A_{z_{t-1}, z_t} \prod_{t=1}^{T} B_{z_t}(x_t)$$
Direct computation requires $N^T$ terms in the sum. Each term involves $O(T)$ multiplications. Total complexity: $O(T \cdot N^T)$.
For a modest HMM with $N=10$ states and $T=100$ time steps:
• Number of sequences: $10^{100}$
• Even at $10^{15}$ operations/second, this would take on the order of $10^{77}$ years!
Clearly, we need a smarter approach.
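To make the intractability concrete, here is a minimal brute-force evaluator that enumerates every hidden path of a toy 2-state HMM (the same parameters as the worked example later on this page). It is only feasible because $T = 3$; the Forward Algorithm gets the identical answer without enumeration:

```python
import itertools
import numpy as np

# Toy 2-state HMM (matches the worked example later on this page)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])            # A[i, j] = P(z_t = j | z_{t-1} = i)
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])       # B[i, k] = P(x_t = k | z_t = i)
obs = [0, 1, 2]

# Enumerate all N^T = 2^3 hidden paths and sum the joint probabilities
total = 0.0
for path in itertools.product(range(2), repeat=len(obs)):
    p = pi[path[0]] * B[path[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
    total += p

print(f"{total:.6f}")  # 0.036280 -- exactly what the Forward Algorithm returns
```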
The breakthrough comes from recognizing that we can rearrange the computation by pushing sums inside products. Consider a simplified example:
$$\sum_{a}\sum_{b}\sum_{c} f(a) \cdot g(a,b) \cdot h(b,c) \cdot k(c)$$
Naively, this sums over all $|A| \times |B| \times |C|$ combinations. But we can factor:
$$= \sum_{c} k(c) \sum_{b} h(b,c) \sum_{a} f(a) \cdot g(a,b)$$
Now the innermost sum is computed first and reused, reducing complexity.
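A quick numerical check of this rearrangement, using random tables for $f$, $g$, $h$, $k$ (the array names below are just this sketch's stand-ins for the functions above):

```python
import numpy as np

rng = np.random.default_rng(0)
nA, nB, nC = 4, 5, 6
f = rng.random(nA)          # f(a)
g = rng.random((nA, nB))    # g(a, b)
h = rng.random((nB, nC))    # h(b, c)
k = rng.random(nC)          # k(c)

# Naive: iterate over all |A| * |B| * |C| combinations
naive = sum(f[a] * g[a, b] * h[b, c] * k[c]
            for a in range(nA) for b in range(nB) for c in range(nC))

# Factored: push each sum inward and reuse it (a chain of mat-vec products)
factored = ((f @ g) @ h) @ k

assert np.isclose(naive, factored)   # same value, far fewer operations
```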
For HMMs, the Markov structure means each transition term $A_{z_{t-1}, z_t}$ only connects adjacent time steps. This allows us to process the sequence left-to-right, aggregating partial results as we go.
The forward variable $\alpha_t(i)$ is defined as the joint probability of observing the first $t$ observations and being in state $i$ at time $t$:
$$\alpha_t(i) = P(x_1, x_2, \ldots, x_t, z_t = i | \boldsymbol{\theta})$$
This is a partial observation probability: we've seen $x_1$ through $x_t$ and we know we're in state $i$ at time $t$.
Once we compute $\alpha_T(i)$ for all states $i$, we can obtain the full observation probability by summing over the final state:
$$P(\mathbf{x} | \boldsymbol{\theta}) = \sum_{i=1}^{N} \alpha_T(i) = \sum_{i=1}^{N} P(x_1, \ldots, x_T, z_T = i)$$
The sum over final states marginalizes out the hidden state at time $T$, giving us the marginal probability of observations only.
Think of $\alpha_t(i)$ as a message that accumulates all the ways the system could have arrived at state $i$ at time $t$ while producing observations $x_1, \ldots, x_t$. Instead of tracking $N^t$ individual paths, we only track $N$ aggregate "probability masses"—one for each possible current state.
The forward computation is often visualized as a trellis diagram—a grid where rows index the $N$ states, columns index the $T$ time steps, and the cell $(i, t)$ holds the forward variable $\alpha_t(i)$.
We fill the trellis left-to-right, computing each column from the previous one. This is the essence of dynamic programming: solve subproblems (earlier time steps) and combine their solutions to solve larger problems (later time steps).
At $t=1$, there's no previous state—we're at the beginning of the sequence. The forward variable is simply the probability of starting in state $i$ and emitting $x_1$:
$$\alpha_1(i) = P(x_1, z_1 = i) = P(z_1 = i) \cdot P(x_1 | z_1 = i) = \pi_i \cdot B_i(x_1)$$
Interpretation: For each state $i$, multiply the prior probability of starting there ($\pi_i$) by the probability of emitting the first observation from that state ($B_i(x_1)$).
For $t > 1$, we derive the recursive relationship:
$$\begin{align} \alpha_t(j) &= P(x_1, \ldots, x_t, z_t = j) \\ &= \sum_{i=1}^{N} P(x_1, \ldots, x_t, z_{t-1} = i, z_t = j) \quad \text{(marginalize over } z_{t-1}\text{)} \\ &= \sum_{i=1}^{N} P(x_1, \ldots, x_{t-1}, z_{t-1} = i) \cdot P(z_t = j | z_{t-1} = i) \cdot P(x_t | z_t = j) \\ &= \sum_{i=1}^{N} \alpha_{t-1}(i) \cdot A_{ij} \cdot B_j(x_t) \end{align}$$
Rearranging for clarity:
$$\boxed{\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i) \cdot A_{ij} \right] \cdot B_j(x_t)}$$
The recursion has two conceptual parts:
1. Transition aggregation: $\sum_{i=1}^{N} \alpha_{t-1}(i) \cdot A_{ij}$
Sum over all states $i$ we could have been in at time $t-1$, weighted by: • How much probability mass was in state $i$: $\alpha_{t-1}(i)$ • The probability of transitioning from $i$ to $j$: $A_{ij}$
2. Emission incorporation: multiply by $B_j(x_t)$
Weight by the probability of observing $x_t$ from state $j$.
After processing all $T$ observations, the total observation probability is:
$$P(\mathbf{x} | \boldsymbol{\theta}) = \sum_{i=1}^{N} \alpha_T(i)$$
This sums the probability of all paths ending in each possible state.
Forward Algorithm
Initialize: For $i = 1, \ldots, N$: $$\alpha_1(i) = \pi_i \cdot B_i(x_1)$$
Recurse: For $t = 2, \ldots, T$ and $j = 1, \ldots, N$: $$\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i) \cdot A_{ij} \right] \cdot B_j(x_t)$$
Terminate: $$P(\mathbf{x} | \boldsymbol{\theta}) = \sum_{i=1}^{N} \alpha_T(i)$$
Time: $O(TN^2)$
Space: $O(N)$ if we only keep the current and previous columns; $O(TN)$ if we store the full trellis (needed for Baum-Welch)
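When only the likelihood is needed (no Baum-Welch), the $O(N)$-memory variant can be sketched by overwriting a single trellis column in place. This is an illustrative sketch, not this page's reference implementation:

```python
import numpy as np

def forward_prob_low_memory(A, B, pi, observations):
    """Return P(x | theta) while keeping only one trellis column: O(N) memory.

    Sufficient for the evaluation problem alone; Baum-Welch needs the
    full (T, N) trellis and should use the standard implementation.
    """
    alpha = pi * B[:, observations[0]]               # column at t = 1
    for t in range(1, len(observations)):
        alpha = (alpha @ A) * B[:, observations[t]]  # overwrite previous column
    return alpha.sum()
```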
Let's implement the Forward Algorithm with careful attention to numerical stability and code clarity.
```python
import numpy as np
from typing import Tuple


def forward_algorithm(
    A: np.ndarray,            # (N, N) transition matrix
    B: np.ndarray,            # (N, M) emission matrix
    pi: np.ndarray,           # (N,) initial distribution
    observations: np.ndarray  # (T,) observation indices
) -> Tuple[np.ndarray, float]:
    """
    Compute forward variables and observation probability.

    This is the basic implementation in probability space.
    WARNING: Underflows for long sequences! Use log-space version.

    Parameters
    ----------
    A : np.ndarray of shape (N, N)
        Transition matrix, A[i,j] = P(z_t=j | z_{t-1}=i)
    B : np.ndarray of shape (N, M)
        Emission matrix, B[i,k] = P(x_t=k | z_t=i)
    pi : np.ndarray of shape (N,)
        Initial state distribution
    observations : np.ndarray of shape (T,)
        Sequence of observation indices

    Returns
    -------
    alpha : np.ndarray of shape (T, N)
        Forward variables, alpha[t, i] = P(x_1:t, z_t=i)
    prob : float
        P(observations | model) = sum of alpha[T-1, :]
    """
    T = len(observations)
    N = len(pi)

    # Initialize alpha matrix to store all forward variables
    alpha = np.zeros((T, N))

    # Step 1: Initialization (t=0 in 0-indexed)
    # α_1(i) = π_i * B_i(x_1)
    alpha[0] = pi * B[:, observations[0]]

    # Step 2: Recursion (t=1 to T-1 in 0-indexed)
    # α_t(j) = [Σ_i α_{t-1}(i) * A_{i,j}] * B_j(x_t)
    for t in range(1, T):
        for j in range(N):
            # Sum over all previous states
            alpha[t, j] = np.sum(alpha[t-1] * A[:, j]) * B[j, observations[t]]

    # Step 3: Termination
    # P(x) = Σ_i α_T(i)
    prob = np.sum(alpha[-1])

    return alpha, prob


def forward_algorithm_vectorized(
    A: np.ndarray,
    B: np.ndarray,
    pi: np.ndarray,
    observations: np.ndarray
) -> Tuple[np.ndarray, float]:
    """
    Vectorized implementation of Forward Algorithm.

    Same result as forward_algorithm(), but uses matrix operations
    for better performance on larger models.
    """
    T = len(observations)
    N = len(pi)

    alpha = np.zeros((T, N))

    # Initialization
    alpha[0] = pi * B[:, observations[0]]

    # Recursion - vectorized
    for t in range(1, T):
        # alpha[t-1] @ A gives transition-weighted sums for each target state
        # Element-wise multiply by emission probabilities
        alpha[t] = (alpha[t-1] @ A) * B[:, observations[t]]

    prob = np.sum(alpha[-1])

    return alpha, prob


# Example: Weather HMM
def demo_forward_algorithm():
    """Demonstrate Forward Algorithm on weather example."""
    # States: 0=Sunny, 1=Cloudy, 2=Rainy
    # Observations: 0=Walk, 1=Shop, 2=Clean
    A = np.array([
        [0.7, 0.2, 0.1],  # From Sunny
        [0.3, 0.4, 0.3],  # From Cloudy
        [0.2, 0.3, 0.5],  # From Rainy
    ])
    B = np.array([
        [0.7, 0.2, 0.1],  # Sunny: prefer Walk
        [0.3, 0.4, 0.3],  # Cloudy: mixed
        [0.1, 0.2, 0.7],  # Rainy: prefer Clean
    ])
    pi = np.array([0.6, 0.3, 0.1])

    # Observation sequence: Walk, Shop, Clean, Clean
    observations = np.array([0, 1, 2, 2])

    alpha, prob = forward_algorithm(A, B, pi, observations)

    print("Forward Variables (α):")
    print("-" * 50)
    states = ["Sunny", "Cloudy", "Rainy"]
    obs_names = ["Walk", "Shop", "Clean", "Clean"]
    for t in range(len(observations)):
        print(f"t={t+1}, obs={obs_names[t]:<6}:", end=" ")
        for i, s in enumerate(states):
            print(f"α({s})={alpha[t,i]:.6f}", end=" ")
        print(f"| sum={np.sum(alpha[t]):.6f}")

    print(f"P(observations) = {prob:.8f}")
    print(f"Log probability = {np.log(prob):.4f}")


demo_forward_algorithm()
```

The basic Forward Algorithm multiplies many probabilities together. For long sequences, this causes numerical underflow—the numbers become so small that they round to zero in floating-point representation.
Consider: if each emission probability averages 0.1 and we have 100 time steps: $$0.1^{100} = 10^{-100} $$
A value of $10^{-100}$ is still representable—the smallest positive normal double-precision float is about $10^{-308}$—but doubling the sequence length, or combining emissions with small transition probabilities, quickly pushes intermediate values below that threshold, where they underflow to zero.
Work in log-probability space throughout. Define: $$\tilde{\alpha}_t(i) = \log \alpha_t(i) = \log P(x_1, \ldots, x_t, z_t = i)$$
The challenge is that the recursion involves a sum inside the log: $$\alpha_t(j) = \left[ \sum_{i} \alpha_{t-1}(i) \cdot A_{ij} \right] \cdot B_j(x_t)$$
In log space: $$\tilde{\alpha}_t(j) = \log\left[ \sum_{i} \exp\left(\tilde{\alpha}_{t-1}(i) + \log A_{ij}\right) \right] + \log B_j(x_t)$$
We use the log-sum-exp trick for numerical stability.
To compute $\log(e^{a_1} + e^{a_2} + \cdots + e^{a_n})$ stably, factor out the largest exponent:
$$\log \sum_{i=1}^{n} e^{a_i} = a_{\max} + \log \sum_{i=1}^{n} e^{a_i - a_{\max}}, \quad \text{where } a_{\max} = \max_i a_i$$
After subtracting $a_{\max}$, the largest exponent is 0 (so $e^0 = 1$), and other terms are small but positive. This prevents both underflow (tiny exponents) and overflow (huge exponents).
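The trick in code (a hand-rolled sketch; in practice `scipy.special.logsumexp` implements the same idea):

```python
import numpy as np

def logsumexp_stable(a):
    """log(sum_i exp(a_i)) computed via the max-shift trick."""
    a = np.asarray(a, dtype=float)
    a_max = np.max(a)
    if np.isinf(a_max):              # all -inf: total mass is zero
        return a_max
    # After the shift the largest exponent is 0, so exp() cannot overflow
    return a_max + np.log(np.sum(np.exp(a - a_max)))

vals = np.array([1000.0, 1001.0, 999.0])
# np.log(np.sum(np.exp(vals))) would overflow to inf; the shifted version is fine
print(logsumexp_stable(vals))        # ≈ 1001.4076
```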
```python
import numpy as np
from scipy.special import logsumexp  # Numerically stable log-sum-exp


def forward_algorithm_log(
    log_A: np.ndarray,   # (N, N) log transition matrix
    log_B: np.ndarray,   # (N, M) log emission matrix
    log_pi: np.ndarray,  # (N,) log initial distribution
    observations: np.ndarray
) -> tuple[np.ndarray, float]:
    """
    Forward Algorithm in log-probability space.

    Numerically stable for arbitrarily long sequences.
    All inputs should be log-probabilities.

    Returns
    -------
    log_alpha : np.ndarray of shape (T, N)
        Log forward variables
    log_prob : float
        Log P(observations | model)
    """
    T = len(observations)
    N = len(log_pi)

    log_alpha = np.zeros((T, N))

    # Initialization: log α_1(i) = log π_i + log B_i(x_1)
    log_alpha[0] = log_pi + log_B[:, observations[0]]

    # Recursion in log space
    for t in range(1, T):
        for j in range(N):
            # log [Σ_i α_{t-1}(i) * A_{i,j}]
            # = log Σ_i exp(log α_{t-1}(i) + log A_{i,j})
            log_alpha[t, j] = logsumexp(
                log_alpha[t-1] + log_A[:, j]
            ) + log_B[j, observations[t]]

    # Termination: log P(x) = log Σ_i α_T(i)
    log_prob = logsumexp(log_alpha[-1])

    return log_alpha, log_prob


def forward_algorithm_log_vectorized(
    log_A: np.ndarray,
    log_B: np.ndarray,
    log_pi: np.ndarray,
    observations: np.ndarray
) -> tuple[np.ndarray, float]:
    """
    Vectorized log-space Forward Algorithm.

    Uses broadcasting for efficient computation.
    """
    T = len(observations)
    N = len(log_pi)

    log_alpha = np.zeros((T, N))
    log_alpha[0] = log_pi + log_B[:, observations[0]]

    for t in range(1, T):
        # (N, 1) + (N, N) = (N, N), then logsumexp over axis 0
        # Gives the log of sum of alpha[t-1, i] * A[i, j] for each j
        log_alpha[t] = logsumexp(
            log_alpha[t-1, :, np.newaxis] + log_A, axis=0
        ) + log_B[:, observations[t]]

    log_prob = logsumexp(log_alpha[-1])

    return log_alpha, log_prob


# Comparison: probability space vs log space
def compare_implementations():
    """Show underflow in probability space vs stability in log space."""
    # Simple 2-state HMM
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.9, 0.1], [0.2, 0.8]])
    pi = np.array([0.6, 0.4])

    # Generate longer sequences to trigger underflow
    np.random.seed(42)
    T_values = [10, 50, 100, 200, 500]

    print("Sequence Length | Prob Space        | Log Space")
    print("-" * 55)
    for T in T_values:
        # Generate random observations
        obs = np.random.choice(2, size=T)

        # Probability space (will underflow)
        _, prob = forward_algorithm_vectorized(A, B, pi, obs)

        # Log space (stable)
        log_A = np.log(A)
        log_B = np.log(B)
        log_pi = np.log(pi)
        _, log_prob = forward_algorithm_log_vectorized(
            log_A, log_B, log_pi, obs
        )

        prob_str = f"{prob:.2e}" if prob > 0 else "UNDERFLOW (0.0)"
        print(f"T = {T:4d}       | {prob_str:17s} | {log_prob:.4f}")


compare_implementations()
```

An alternative approach is scaling, where we normalize the forward variables at each time step and track the scaling factors:
$$\hat{\alpha}_t(i) = \frac{\alpha_t(i)}{c_t}, \quad c_t = \sum_i \alpha_t(i)$$
The scaled variables satisfy $\sum_i \hat{\alpha}_t(i) = 1$ at each time step, preventing underflow.
The log probability is recovered as: $$\log P(\mathbf{x}) = \sum_{t=1}^{T} \log c_t$$
Scaling is equivalent to log-space computation but can be more natural in some implementations. The same scaling factors are used in the Backward pass (for Baum-Welch), ensuring consistency.
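A minimal sketch of the scaled forward pass (a hypothetical helper, not this page's reference code), which recovers $\log P(\mathbf{x})$ as the running sum of $\log c_t$:

```python
import numpy as np

def forward_scaled(A, B, pi, observations):
    """Forward pass with per-step scaling: returns (alpha_hat, log P(x)).

    Each column is normalized to sum to 1, so no value ever underflows;
    the product of the scaling factors c_t reconstructs the likelihood.
    """
    T, N = len(observations), len(pi)
    alpha_hat = np.zeros((T, N))
    log_prob = 0.0

    alpha = pi * B[:, observations[0]]
    for t in range(T):
        if t > 0:
            alpha = (alpha_hat[t - 1] @ A) * B[:, observations[t]]
        c = alpha.sum()              # scaling factor c_t
        alpha_hat[t] = alpha / c     # normalized column sums to 1
        log_prob += np.log(c)
    return alpha_hat, log_prob
```

On the worked example below (2 states, observations (0, 1, 2)), `np.exp(log_prob)` reproduces the unscaled likelihood.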
| Aspect | Log-Space | Scaling |
|---|---|---|
| Numerical stability | Excellent | Excellent |
| Implementation | Use logsumexp | Track scale factors |
| Combined with Backward | Separate log-forward and log-backward | Same scale factors for both |
| Interpretation | Log probabilities throughout | Normalized pseudo-probabilities |
| Memory | Same as basic | Extra O(T) for scale factors |
Let's trace through the Forward Algorithm step by step on a concrete example.
HMM Parameters:
$$\boldsymbol{\pi} = \begin{pmatrix} 0.6 \\ 0.4 \end{pmatrix}, \quad \mathbf{A} = \begin{pmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{pmatrix}, \quad \mathbf{B} = \begin{pmatrix} 0.5 & 0.4 & 0.1 \\ 0.1 & 0.3 & 0.6 \end{pmatrix}$$
Observation sequence: $\mathbf{x} = (0, 1, 2)$ (length $T=3$). The two hidden states are labeled H and L.
$$\alpha_1(H) = \pi_H \cdot B_H(0) = 0.6 \times 0.5 = 0.30$$ $$\alpha_1(L) = \pi_L \cdot B_L(0) = 0.4 \times 0.1 = 0.04$$
Sanity check: the sum is $0.34 = P(x_1 = 0)$, the marginal probability of the first observation.
For state H: $$\alpha_2(H) = \left[ \alpha_1(H) \cdot A_{HH} + \alpha_1(L) \cdot A_{LH} \right] \cdot B_H(1)$$ $$= [0.30 \times 0.7 + 0.04 \times 0.4] \times 0.4$$ $$= [0.21 + 0.016] \times 0.4 = 0.226 \times 0.4 = 0.0904$$
For state L: $$\alpha_2(L) = \left[ \alpha_1(H) \cdot A_{HL} + \alpha_1(L) \cdot A_{LL} \right] \cdot B_L(1)$$ $$= [0.30 \times 0.3 + 0.04 \times 0.6] \times 0.3$$ $$= [0.09 + 0.024] \times 0.3 = 0.114 \times 0.3 = 0.0342$$
For state H: $$\alpha_3(H) = [0.0904 \times 0.7 + 0.0342 \times 0.4] \times 0.1$$ $$= [0.06328 + 0.01368] \times 0.1 = 0.00770$$
For state L: $$\alpha_3(L) = [0.0904 \times 0.3 + 0.0342 \times 0.6] \times 0.6$$ $$= [0.02712 + 0.02052] \times 0.6 = 0.02858$$
$$P(\mathbf{x} = (0,1,2)) = \alpha_3(H) + \alpha_3(L) = 0.00770 + 0.02858 = 0.03628$$
| Time | Observation | α(H) | α(L) | Sum |
|---|---|---|---|---|
| t=1 | 0 | 0.3000 | 0.0400 | 0.3400 |
| t=2 | 1 | 0.0904 | 0.0342 | 0.1246 |
| t=3 | 2 | 0.0077 | 0.0286 | 0.0363 |
Notice how the sum decreases at each step—this is the "cumulative penalty" of observing each symbol. After observing $x_3=2$, most probability mass is in state L (0.0286 vs 0.0077), reflecting that state L prefers emitting symbol 2 (probability 0.6) while state H rarely emits it (probability 0.1).
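The hand computation can be double-checked in a few lines by replaying the vectorized recursion from the implementation section on these parameters:

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 1, 2]

alpha = np.zeros((3, 2))
alpha[0] = pi * B[:, obs[0]]                      # (0.30, 0.04)
for t in (1, 2):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

print(np.round(alpha, 4))           # rows match the table for t=1, t=2, t=3
print(round(alpha[-1].sum(), 5))    # 0.03628
```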
The Forward Algorithm is a special case of a more general framework: belief propagation (message passing) in graphical models.
In the HMM graphical model, we can view the forward variable as a message passed from left to right:
$$\alpha_t(z_t) = \text{"message from past telling } z_t \text{ what's been observed"}$$
At each time step, we take the incoming message from the past, weight it by the transition probabilities, sum out the previous state, and multiply in the local evidence $B_{z_t}(x_t)$ from the current observation.
The Forward Algorithm implements the sum-product algorithm on the HMM factor graph:
$$\alpha_t(z_t) = \sum_{z_{t-1}} \underbrace{\alpha_{t-1}(z_{t-1})}_{\text{incoming message}} \cdot \underbrace{A_{z_{t-1}, z_t}}_{\text{transition factor}} \cdot \underbrace{B_{z_t}(x_t)}_{\text{observation factor}}$$
This perspective generalizes: the same sum-product message passing performs exact inference in any tree-structured graphical model; the HMM chain is simply its most familiar special case.
Understanding the message-passing interpretation:
• Reveals the unity between different inference algorithms • Shows why HMMs are a special case of general graphical model inference • Explains why the Backward pass (for Baum-Welch) has the same structure • Generalizes to more complex models (factorial HMMs, DBNs)
Just as we pass messages forward, we can pass them backward:
$$\beta_t(i) = P(x_{t+1}, \ldots, x_T | z_t = i)$$
The backward variable asks: "Given we're in state $i$ at time $t$, what's the probability of all future observations?"
The recursion runs right-to-left: $$\beta_t(i) = \sum_{j=1}^{N} A_{ij} \cdot B_j(x_{t+1}) \cdot \beta_{t+1}(j)$$
Combining forward and backward gives us: $$P(z_t = i | \mathbf{x}) = \frac{\alpha_t(i) \cdot \beta_t(i)}{P(\mathbf{x})}$$
This is the posterior probability of state $i$ at time $t$, used in the Baum-Welch algorithm for learning. We'll develop this fully when covering parameter learning.
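A preview sketch of the backward recursion (an assumed helper, to be developed fully in the learning section). A useful invariant: for every $t$, $\sum_i \alpha_t(i)\,\beta_t(i) = P(\mathbf{x})$, which makes a handy unit test:

```python
import numpy as np

def backward_algorithm(A, B, observations):
    """Backward pass (sketch): beta[t, i] = P(x_{t+1:T} | z_t = i)."""
    T, N = len(observations), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0   # base case: the empty future has probability 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j A[i, j] * B[j, x_{t+1}] * beta_{t+1}(j)
        beta[t] = A @ (B[:, observations[t + 1]] * beta[t + 1])
    return beta
```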
The Forward Algorithm is the first of the three core HMM algorithms, solving the evaluation problem efficiently.
Next Page: The Viterbi Algorithm
While the Forward Algorithm computes the probability of observations (summing over all state sequences), the Viterbi Algorithm finds the single most likely state sequence. Instead of summing, we maximize—using dynamic programming structure that closely parallels the Forward Algorithm but answers a fundamentally different question.
You now have a thorough understanding of the Forward Algorithm—the mathematical derivation, efficient implementation, numerical stability considerations, and its role in the broader framework of graphical model inference. This algorithm is foundational for HMM applications and forms the basis for more advanced algorithms like Forward-Backward (for posterior decoding) and Baum-Welch (for learning).