When we observe a sequence of outputs from a Hidden Markov Model, a natural question arises: What is the most likely sequence of hidden states that generated these observations?
This is the decoding problem, and it arises constantly in practice.
The Viterbi Algorithm, named after Andrew Viterbi who developed it in 1967 for decoding convolutional codes, provides an elegant $O(TN^2)$ solution using dynamic programming. Like the Forward Algorithm, it exploits the Markov structure—but instead of summing over paths, it maximizes.
By the end of this page, you will:
• Formally define the decoding problem and distinguish it from the evaluation problem
• Understand the Viterbi variable and its recursive computation
• Master backtracking to recover the optimal state sequence
• Implement the Viterbi Algorithm in log-space for numerical stability
• Understand when Viterbi decoding differs from posterior decoding
The decoding problem (also called inference or segmentation) is:
Given: an HMM with known parameters $\boldsymbol{\theta} = (\mathbf{A}, \mathbf{B}, \boldsymbol{\pi})$ and an observation sequence $\mathbf{x} = (x_1, x_2, \ldots, x_T)$.
Find: $$\mathbf{z}^* = \arg\max_{\mathbf{z}} P(\mathbf{z} | \mathbf{x}, \boldsymbol{\theta})$$
The sequence of hidden states that is most probable given the observations.
Using Bayes' rule: $$P(\mathbf{z} | \mathbf{x}) = \frac{P(\mathbf{x}, \mathbf{z})}{P(\mathbf{x})}$$
Since $P(\mathbf{x})$ is constant with respect to $\mathbf{z}$, maximizing the posterior is equivalent to maximizing the joint:
$$\mathbf{z}^* = \arg\max_{\mathbf{z}} P(\mathbf{z} | \mathbf{x}) = \arg\max_{\mathbf{z}} P(\mathbf{x}, \mathbf{z})$$
This is convenient because we already know how to compute $P(\mathbf{x}, \mathbf{z})$ from the HMM factorization.
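For reference, the HMM factorization gives this joint in closed form, so any candidate sequence $\mathbf{z}$ can be scored directly:

$$P(\mathbf{x}, \mathbf{z}) = \pi_{z_1} B_{z_1}(x_1) \prod_{t=2}^{T} A_{z_{t-1} z_t} \cdot B_{z_t}(x_t)$$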
As with the evaluation problem, brute force enumeration of all $N^T$ state sequences is intractable. For $N=100$ states and $T=50$ time steps, we'd need to evaluate $100^{50}$ sequences—more than the number of atoms in the universe!
The Viterbi Algorithm exploits a crucial property: the best path to any state at time $t$ must include the best path to some state at time $t-1$.
Formally, if the globally optimal sequence passes through state $j$ at time $t$, then its prefix up to time $t-1$ must itself be the best path to whatever state it occupies at time $t-1$; otherwise we could splice in a better prefix and increase the total probability, contradicting optimality.
This is the optimal substructure property that makes dynamic programming applicable. We don't need to track all $N^t$ paths to time $t$—we only need to track the $N$ best paths, one ending at each state.
The Viterbi variable $\delta_t(i)$ is defined as the probability of the most likely path ending in state $i$ at time $t$, having generated observations $x_1, \ldots, x_t$:
$$\delta_t(i) = \max_{z_1, \ldots, z_{t-1}} P(z_1, \ldots, z_{t-1}, z_t = i, x_1, \ldots, x_t | \boldsymbol{\theta})$$
| Variable | Definition | Operation | Semantics |
|---|---|---|---|
| $\alpha_t(i)$ | $P(x_{1:t}, z_t=i)$ | Sum over all paths to $i$ | Total probability mass |
| $\delta_t(i)$ | $\max_{z_{1:t-1}} P(x_{1:t}, z_{1:t-1}, z_t=i)$ | Max over all paths to $i$ | Best single path |
The key difference: Forward sums (for marginalization), Viterbi maximizes (for optimization).
To reconstruct the optimal path, we need to remember which previous state led to the maximum. Define the backpointer:
$$\psi_t(j) = \arg\max_{i} \left[ \delta_{t-1}(i) \cdot A_{ij} \right]$$
This stores the predecessor state on the best path ending at state $j$ at time $t$.
At the first time step, there's no path history—just the initial state probability and first emission:
$$\delta_1(i) = \pi_i \cdot B_i(x_1)$$ $$\psi_1(i) = 0 \quad \text{(no predecessor at } t=1 \text{)}$$
This is identical to the Forward Algorithm initialization.
For $t > 1$, we derive the recursive relationship:
$$\begin{align} \delta_t(j) &= \max_{z_1, \ldots, z_{t-1}} P(z_1, \ldots, z_{t-1}, z_t = j, x_1, \ldots, x_t) \\ &= \max_{i} \left[ \max_{z_1, \ldots, z_{t-2}} P(z_1, \ldots, z_{t-2}, z_{t-1} = i, x_1, \ldots, x_{t-1}) \cdot P(z_t = j | z_{t-1} = i) \cdot P(x_t | z_t = j) \right] \\ &= \max_{i} \left[ \delta_{t-1}(i) \cdot A_{ij} \right] \cdot B_j(x_t) \end{align}$$
The corresponding backpointer: $$\psi_t(j) = \arg\max_{i} \left[ \delta_{t-1}(i) \cdot A_{ij} \right]$$
Notice the parallel structure:
Forward: $\alpha_t(j) = \left[ \sum_{i} \alpha_{t-1}(i) \cdot A_{ij} \right] \cdot B_j(x_t)$
Viterbi: $\delta_t(j) = \left[ \max_{i} \delta_{t-1}(i) \cdot A_{ij} \right] \cdot B_j(x_t)$
The only difference is sum ↔ max. This is an instance of the sum-product ↔ max-product correspondence in graphical model inference!
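This correspondence can be checked numerically. The sketch below (the helper `forward_like` and all parameter values are mine, for illustration) runs the same recursion twice, once combining with `np.sum` and once with `np.max`, and verifies both results against brute-force enumeration on a tiny HMM:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# A small random HMM: N states, M symbols, T observations (illustrative sizes)
N, M, T = 3, 4, 6
A = rng.dirichlet(np.ones(N), size=N)   # each row sums to 1
B = rng.dirichlet(np.ones(M), size=N)
pi = rng.dirichlet(np.ones(N))
obs = rng.integers(0, M, size=T)

def forward_like(combine):
    """Shared recursion: combine = np.sum gives Forward, np.max gives Viterbi."""
    v = pi * B[:, obs[0]]
    for t in range(1, T):
        # combine over source states i of v[i] * A[i, j], then multiply emission
        v = combine(v[:, None] * A, axis=0) * B[:, obs[t]]
    return combine(v)

total_prob = forward_like(np.sum)   # P(x): probability mass over all paths
best_prob = forward_like(np.max)    # P*: probability of the single best path

# Brute-force check over all N**T state sequences
joints = []
for z in product(range(N), repeat=T):
    p = pi[z[0]] * B[z[0], obs[0]]
    for t in range(1, T):
        p *= A[z[t-1], z[t]] * B[z[t], obs[t]]
    joints.append(p)

assert np.isclose(total_prob, sum(joints))
assert np.isclose(best_prob, max(joints))
print(f"P(x) = {total_prob:.6g}, P* = {best_prob:.6g}")
```

The only line that differs between the two computations is which `combine` is passed in, which is exactly the sum ↔ max substitution.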
After processing all $T$ observations, find the best final state:
$$z_T^* = \arg\max_{i} \delta_T(i)$$
The probability of the best path: $$P^* = \max_{i} \delta_T(i)$$
Recover the complete optimal sequence by tracing backpointers:
$$z_{t-1}^* = \psi_t(z_t^*) \quad \text{for } t = T, T-1, \ldots, 2$$
This gives us $\mathbf{z}^* = (z_1^*, z_2^*, \ldots, z_T^*)$.
Viterbi Algorithm
Initialize: For $i = 1, \ldots, N$: $$\delta_1(i) = \pi_i \cdot B_i(x_1), \quad \psi_1(i) = 0$$
Recurse: For $t = 2, \ldots, T$ and $j = 1, \ldots, N$: $$\delta_t(j) = \max_{i} \left[ \delta_{t-1}(i) \cdot A_{ij} \right] \cdot B_j(x_t)$$ $$\psi_t(j) = \arg\max_{i} \left[ \delta_{t-1}(i) \cdot A_{ij} \right]$$
Terminate: $$P^* = \max_{i} \delta_T(i), \quad z_T^* = \arg\max_{i} \delta_T(i)$$
Backtrack: For $t = T-1, \ldots, 1$: $$z_t^* = \psi_{t+1}(z_{t+1}^*)$$
Let's implement the Viterbi Algorithm with proper numerical stability.
For the same reasons as the Forward Algorithm, we work in log space to prevent underflow. The recursion becomes:
$$\log \delta_t(j) = \max_{i} \left[ \log \delta_{t-1}(i) + \log A_{ij} \right] + \log B_j(x_t)$$
Note that unlike the Forward Algorithm, we don't need log-sum-exp—the max operation works directly in log space!
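A quick sanity check of this point (illustrative values): because log is monotonic, max and argmax commute with it, while sum does not.

```python
import numpy as np

p = np.array([1e-5, 3e-7, 2e-4])               # some tiny path probabilities
assert np.argmax(p) == np.argmax(np.log(p))    # same winning path either way
assert np.isclose(np.log(p).max(), np.log(p.max()))
# A sum does NOT commute with log, which is why Forward needs log-sum-exp:
assert not np.isclose(np.log(p).sum(), np.log(p.sum()))
print("max commutes with log; sum does not")
```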
```python
import numpy as np
from typing import Tuple

def viterbi_algorithm(
    A: np.ndarray,             # (N, N) transition matrix
    B: np.ndarray,             # (N, M) emission matrix
    pi: np.ndarray,            # (N,) initial distribution
    observations: np.ndarray   # (T,) observation indices
) -> Tuple[np.ndarray, float, np.ndarray]:
    """
    Find the most likely state sequence using the Viterbi algorithm.

    Works in log-probability space for numerical stability.

    Parameters
    ----------
    A : np.ndarray of shape (N, N)
        Transition matrix, A[i,j] = P(z_t=j | z_{t-1}=i)
    B : np.ndarray of shape (N, M)
        Emission matrix, B[i,k] = P(x_t=k | z_t=i)
    pi : np.ndarray of shape (N,)
        Initial state distribution
    observations : np.ndarray of shape (T,)
        Sequence of observation indices

    Returns
    -------
    best_path : np.ndarray of shape (T,)
        Most likely state sequence
    log_prob : float
        Log probability of the best path
    delta : np.ndarray of shape (T, N)
        Viterbi variables (log probabilities)
    """
    T = len(observations)
    N = len(pi)

    # Convert to log space
    log_A = np.log(A + 1e-300)   # Add small constant to avoid log(0)
    log_B = np.log(B + 1e-300)
    log_pi = np.log(pi + 1e-300)

    # Initialize
    log_delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)  # Backpointers

    # Step 1: Initialization
    log_delta[0] = log_pi + log_B[:, observations[0]]
    psi[0] = 0  # No predecessor at t=0

    # Step 2: Recursion
    for t in range(1, T):
        for j in range(N):
            # Compute max over all previous states
            candidates = log_delta[t-1] + log_A[:, j]
            psi[t, j] = np.argmax(candidates)
            log_delta[t, j] = candidates[psi[t, j]] + log_B[j, observations[t]]

    # Step 3: Termination
    last_state = np.argmax(log_delta[-1])
    log_prob = log_delta[-1, last_state]

    # Step 4: Backtracking
    best_path = np.zeros(T, dtype=int)
    best_path[-1] = last_state
    for t in range(T-2, -1, -1):
        best_path[t] = psi[t+1, best_path[t+1]]

    return best_path, log_prob, log_delta


def viterbi_algorithm_vectorized(
    A: np.ndarray,
    B: np.ndarray,
    pi: np.ndarray,
    observations: np.ndarray
) -> Tuple[np.ndarray, float, np.ndarray]:
    """
    Vectorized Viterbi implementation.

    Uses numpy broadcasting for better performance.
    """
    T = len(observations)
    N = len(pi)

    log_A = np.log(A + 1e-300)
    log_B = np.log(B + 1e-300)
    log_pi = np.log(pi + 1e-300)

    log_delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)

    # Initialization
    log_delta[0] = log_pi + log_B[:, observations[0]]

    # Recursion (vectorized over target states)
    for t in range(1, T):
        # (N, 1) + (N, N) = (N, N) matrix of candidates
        # candidates[i, j] = log_delta[t-1, i] + log_A[i, j]
        candidates = log_delta[t-1, :, np.newaxis] + log_A
        # Max and argmax over source states (axis 0)
        psi[t] = np.argmax(candidates, axis=0)
        log_delta[t] = np.max(candidates, axis=0) + log_B[:, observations[t]]

    # Termination and backtracking
    best_path = np.zeros(T, dtype=int)
    best_path[-1] = np.argmax(log_delta[-1])
    log_prob = log_delta[-1, best_path[-1]]
    for t in range(T-2, -1, -1):
        best_path[t] = psi[t+1, best_path[t+1]]

    return best_path, log_prob, log_delta


# Example usage
def demo_viterbi():
    """Demonstrate Viterbi on a weather HMM."""
    # States: 0=Sunny, 1=Cloudy, 2=Rainy
    # Observations: 0=Walk, 1=Shop, 2=Clean
    A = np.array([
        [0.7, 0.2, 0.1],
        [0.3, 0.4, 0.3],
        [0.2, 0.3, 0.5],
    ])
    B = np.array([
        [0.7, 0.2, 0.1],  # Sunny: prefer Walk
        [0.3, 0.4, 0.3],  # Cloudy: mixed
        [0.1, 0.2, 0.7],  # Rainy: prefer Clean
    ])
    pi = np.array([0.6, 0.3, 0.1])

    # Observation: Walk, Shop, Clean, Clean
    observations = np.array([0, 1, 2, 2])

    best_path, log_prob, delta = viterbi_algorithm_vectorized(A, B, pi, observations)

    state_names = ["Sunny", "Cloudy", "Rainy"]
    obs_names = ["Walk", "Shop", "Clean", "Clean"]

    print("Viterbi Decoding Results:")
    print("-" * 60)
    print(f"Observations: {obs_names}")
    print(f"Best path:    {[state_names[s] for s in best_path]}")
    print(f"Log prob:     {log_prob:.4f}")
    print(f"Probability:  {np.exp(log_prob):.8f}")

    print("\nViterbi Variables (log δ):")
    for t in range(len(observations)):
        print(f"  t={t+1}: ", end="")
        for i, s in enumerate(state_names):
            print(f"δ({s})={delta[t,i]:.3f}  ", end="")
        print()

demo_viterbi()
```

Let's trace through Viterbi on the same example as the Forward Algorithm to see the differences.
HMM Parameters:
$$\boldsymbol{\pi} = \begin{pmatrix} 0.6 \\ 0.4 \end{pmatrix}, \quad \mathbf{A} = \begin{pmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{pmatrix}, \quad \mathbf{B} = \begin{pmatrix} 0.5 & 0.4 & 0.1 \\ 0.1 & 0.3 & 0.6 \end{pmatrix}$$
Observation sequence: $\mathbf{x} = (0, 1, 2)$
$$\delta_1(H) = \pi_H \cdot B_H(0) = 0.6 \times 0.5 = 0.30$$ $$\delta_1(L) = \pi_L \cdot B_L(0) = 0.4 \times 0.1 = 0.04$$ $$\psi_1(H) = \psi_1(L) = 0 \quad \text{(no predecessor)}$$
At $t=2$, for state H: $$\delta_1(H) \cdot A_{HH} = 0.30 \times 0.7 = 0.210$$ $$\delta_1(L) \cdot A_{LH} = 0.04 \times 0.4 = 0.016$$ $$\max = 0.210 \Rightarrow \psi_2(H) = H$$ $$\delta_2(H) = 0.210 \times B_H(1) = 0.210 \times 0.4 = 0.084$$
For state L: $$\delta_1(H) \cdot A_{HL} = 0.30 \times 0.3 = 0.090$$ $$\delta_1(L) \cdot A_{LL} = 0.04 \times 0.6 = 0.024$$ $$\max = 0.090 \Rightarrow \psi_2(L) = H$$ $$\delta_2(L) = 0.090 \times B_L(1) = 0.090 \times 0.3 = 0.027$$
At $t=3$, for state H: $$\delta_2(H) \cdot A_{HH} = 0.084 \times 0.7 = 0.0588$$ $$\delta_2(L) \cdot A_{LH} = 0.027 \times 0.4 = 0.0108$$ $$\max = 0.0588 \Rightarrow \psi_3(H) = H$$ $$\delta_3(H) = 0.0588 \times 0.1 = 0.00588$$
For state L: $$\delta_2(H) \cdot A_{HL} = 0.084 \times 0.3 = 0.0252$$ $$\delta_2(L) \cdot A_{LL} = 0.027 \times 0.6 = 0.0162$$ $$\max = 0.0252 \Rightarrow \psi_3(L) = H$$ $$\delta_3(L) = 0.0252 \times 0.6 = 0.01512$$
$$z_3^* = \arg\max\{\delta_3(H), \delta_3(L)\} = \arg\max\{0.00588, 0.01512\} = L$$ $$P^* = 0.01512$$
$$z_3^* = L$$ $$z_2^* = \psi_3(L) = H$$ $$z_1^* = \psi_2(H) = H$$
Optimal path: $\mathbf{z}^* = (H, H, L)$
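The hand computation above can be verified in a few lines of numpy. This is a standalone sketch in probability space (fine for $T=3$; the log-space implementation above is what you'd use for long sequences):

```python
import numpy as np

# Two-state HMM from the worked example: states H=0, L=1
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
obs = [0, 1, 2]

# Forward sweep of the Viterbi recursion, recording backpointers
delta = pi * B[:, obs[0]]                  # delta_1 = [0.30, 0.04]
psis = []
for x in obs[1:]:
    cand = delta[:, None] * A              # cand[i, j] = delta(i) * A[i, j]
    psis.append(cand.argmax(axis=0))       # best predecessor for each state j
    delta = cand.max(axis=0) * B[:, x]

# Backtrack from the best final state
path = [int(delta.argmax())]
for psi in reversed(psis):
    path.append(int(psi[path[-1]]))
path.reverse()

names = ['H', 'L']
print([names[s] for s in path], f"P* = {delta.max():.5f}")
```

Running this prints `['H', 'H', 'L'] P* = 0.01512`, matching the table above.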
| Time | Observation | δ(H) | δ(L) | ψ(H) | ψ(L) | Best arriving at H | Best arriving at L |
|---|---|---|---|---|---|---|---|
| t=1 | 0 | 0.3000 | 0.0400 | — | — | — | — |
| t=2 | 1 | 0.0840 | 0.0270 | H | H | H→H | H→L |
| t=3 | 2 | 0.0059 | 0.0151 | H | H | H→H | H→L |
The optimal path is (H, H, L) with probability 0.01512. Notice that even though we end in state L, the best path to get there came from state H at time 2—and the best path to H at time 2 came from H at time 1. The observation $x_3=2$ (which L prefers) pulled the final state to L despite the path through H-states.
There are actually two different ways to decode an HMM, and they can give different answers!
$$\mathbf{z}^* = \arg\max_{\mathbf{z}} P(\mathbf{z} | \mathbf{x})$$
Find the single most probable complete sequence of states.
$$z_t^* = \arg\max_{z_t} P(z_t | \mathbf{x}) \quad \text{for each } t$$
At each time step, independently choose the most probable state given all observations.
They can differ when the globally optimal sequence doesn't use the locally optimal states at each position.
Example: Consider an HMM with states A, B, C, D, where the candidate state sequences have posterior probabilities like these (illustrative numbers):

• A→B→C: 0.20
• A→B→D: 0.15
• A→C→D: 0.12
• B→B→D: 0.10

Viterbi: Returns A→B→C (probability 0.20), the single most probable sequence.

Posterior decoding at position 3: $P(z_3 = C \mid \mathbf{x}) = 0.20$, but $P(z_3 = D \mid \mathbf{x}) = 0.15 + 0.12 + 0.10 = 0.37$, because several moderately probable paths all end in D.

So posterior decoding might choose D at position 3, giving a sequence A→B→D that differs from Viterbi's. Note that the per-position choices need not even form a connected path with nonzero transition probability.
Posterior decoding requires the marginal $P(z_t = i | \mathbf{x})$ at each time step. This is computed using both Forward and Backward passes:
$$P(z_t = i | \mathbf{x}) = \frac{\alpha_t(i) \cdot \beta_t(i)}{P(\mathbf{x})}$$
where $\alpha_t$ comes from the Forward Algorithm and $\beta_t$ from the Backward Algorithm (covered in the Baum-Welch page).
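Posterior decoding can be sketched with one Forward and one Backward pass. The function below (names are mine, not a library API; probability space for brevity) computes the marginals $\gamma_t(i) = P(z_t = i \mid \mathbf{x})$ and takes a per-position argmax:

```python
import numpy as np

def posterior_decode(A, B, pi, obs):
    """Per-position argmax of P(z_t = i | x), via forward-backward."""
    T, N = len(obs), len(pi)

    # Forward pass: alpha[t, i] = P(x_{1:t}, z_t = i)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]

    # Backward pass: beta[t, i] = P(x_{t+1:T} | z_t = i)
    beta = np.ones((T, N))
    for t in range(T-2, -1, -1):
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])

    # Posterior marginals gamma[t, i] = P(z_t = i | x)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma.argmax(axis=1), gamma

# Same two-state HMM as the worked example (H=0, L=1)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
states, gamma = posterior_decode(A, B, pi, [0, 1, 2])
print("per-position states:", states)
print("marginals sum to 1:", np.allclose(gamma.sum(axis=1), 1.0))
```

On this particular example the posterior path is (H, H, L), coinciding with the Viterbi path; the two decodings agree here but need not in general, as the example above shows.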
1. Speech Recognition

Given acoustic features, find the most likely sequence of phonemes or words.

2. Part-of-Speech Tagging

Given a sentence, assign the most likely grammatical tag to each word.

3. Gene Finding in Bioinformatics

Identify coding regions in DNA sequences by decoding the most likely label for each position.

4. Named Entity Recognition

Identify and classify named entities (such as people, organizations, and locations) in text.
```python
import numpy as np

def pos_tagging_example():
    """
    Simple POS tagging example with Viterbi.

    This is a toy example for illustration. Real POS taggers
    use much larger vocabularies and tagsets.
    """
    # Tags (states)
    tags = ['DET', 'NOUN', 'VERB', 'ADJ']  # Determiner, Noun, Verb, Adjective
    tag2idx = {t: i for i, t in enumerate(tags)}

    # Vocabulary (observations)
    words = ['the', 'dog', 'runs', 'big', 'cat', 'sleeps']
    word2idx = {w: i for i, w in enumerate(words)}

    # Transition probabilities (simplified)
    # DET usually followed by NOUN or ADJ
    # NOUN usually followed by VERB
    # VERB usually followed by DET or ADJ
    A = np.array([
        [0.01, 0.50, 0.09, 0.40],  # From DET
        [0.05, 0.10, 0.80, 0.05],  # From NOUN
        [0.30, 0.30, 0.10, 0.30],  # From VERB
        [0.05, 0.80, 0.10, 0.05],  # From ADJ
    ])

    # Emission probabilities (simplified)
    # Each row: P(word | tag) for words [the, dog, runs, big, cat, sleeps]
    B = np.array([
        [0.80, 0.01, 0.01, 0.01, 0.01, 0.01],  # DET: 'the' is likely
        [0.01, 0.40, 0.01, 0.01, 0.40, 0.01],  # NOUN: 'dog', 'cat'
        [0.01, 0.01, 0.45, 0.01, 0.01, 0.45],  # VERB: 'runs', 'sleeps'
        [0.01, 0.01, 0.01, 0.90, 0.01, 0.01],  # ADJ: 'big'
    ])
    # Normalize rows (in case they don't sum to 1)
    B = B / B.sum(axis=1, keepdims=True)

    # Initial probabilities (sentences usually start with DET)
    pi = np.array([0.70, 0.15, 0.10, 0.05])

    # Test sentences
    test_sentences = [
        ['the', 'dog', 'runs'],
        ['the', 'big', 'cat', 'sleeps'],
        ['dog', 'runs'],
    ]

    print("POS Tagging with Viterbi Algorithm")
    print("=" * 50)

    for sentence in test_sentences:
        # Convert words to indices
        obs = np.array([word2idx[w] for w in sentence])

        # Run Viterbi
        best_path, log_prob, _ = viterbi_algorithm(A, B, pi, obs)
        predicted_tags = [tags[i] for i in best_path]

        print(f"\nSentence: {' '.join(sentence)}")
        print(f"Tags:     {' '.join(predicted_tags)}")
        print(f"Log prob: {log_prob:.3f}")

pos_tagging_example()
```

1. Handling Unknown Observations
In practice, you may encounter observations not seen during training. A standard remedy is to map them to a special unknown symbol whose emission probability is smoothed to a small nonzero value under every state, so that $\log B_j(x_t)$ stays finite.
2. Beam Search
For very large state spaces, maintain only the top-K paths at each time step: keep the K highest-scoring states, extend transitions only from those survivors, and prune the rest. The search becomes approximate, but the per-step cost drops from $O(N^2)$ toward $O(KN)$.
3. Online/Streaming Viterbi
For real-time applications, you cannot wait for the end of the sequence before backtracking. A common strategy is to output states with a fixed lag, or to backtrack early once all surviving paths agree on a shared prefix.
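The pruning idea can be sketched as follows. This is a minimal beam-search variant of Viterbi (the function and parameter names are mine, not a standard API): at each step only the `beam_width` best states survive, and paths are extended only from those survivors.

```python
import numpy as np

def viterbi_beam(A, B, pi, obs, beam_width=2):
    """Approximate Viterbi: prune to the `beam_width` best states each step."""
    N = len(pi)
    log_A, log_B = np.log(A), np.log(B)
    score = np.log(pi) + log_B[:, obs[0]]

    # Keep only the beam_width highest-scoring states
    beam = np.argsort(score)[-beam_width:]
    backptr = []
    for x in obs[1:]:
        # Extend only paths that survived pruning; pruned rows stay at -inf
        cand = np.full((N, N), -np.inf)
        cand[beam] = score[beam, None] + log_A[beam]
        backptr.append(cand.argmax(axis=0))
        score = cand.max(axis=0) + log_B[:, x]
        beam = np.argsort(score)[-beam_width:]

    # Backtrack exactly as in full Viterbi
    path = [int(score.argmax())]
    for bp in reversed(backptr):
        path.append(int(bp[path[-1]]))
    path.reverse()
    return path, score.max()
```

With `beam_width = N` this reduces to exact Viterbi; smaller beams trade away the optimality guarantee for speed, which is why beam search is standard in large-vocabulary speech recognition.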
The Viterbi Algorithm solves the decoding problem—finding the most probable hidden state sequence—using dynamic programming.
Next Page: The Baum-Welch Algorithm (EM for HMMs)
The Forward and Viterbi algorithms assume we know the HMM parameters. But in practice, we must learn them from data. The Baum-Welch algorithm uses Expectation-Maximization to iteratively improve parameter estimates, even when the hidden states are never observed.
This requires both Forward and Backward passes to compute expected state occupancies and transition counts—the sufficient statistics for updating $\mathbf{A}$, $\mathbf{B}$, and $\boldsymbol{\pi}$.
You now understand the Viterbi Algorithm—how it finds the most probable state sequence through dynamic programming, the crucial role of backpointers in reconstructing the optimal path, and when to use Viterbi vs. posterior decoding. Combined with the Forward Algorithm, you have the tools for HMM evaluation and decoding.