When we observe a sequence of outputs from a Hidden Markov Model, a natural question arises: What is the most likely sequence of hidden states that generated these observations?
This is the decoding problem, and it arises constantly in practice.
The Viterbi Algorithm, named after Andrew Viterbi who developed it in 1967 for decoding convolutional codes, provides an elegant $O(TN^2)$ solution using dynamic programming. Like the Forward Algorithm, it exploits the Markov structure—but instead of summing over paths, it maximizes.
By the end of this page, you will:
• Formally define the decoding problem and distinguish it from the evaluation problem
• Understand the Viterbi variable and its recursive computation
• Master backtracking to recover the optimal state sequence
• Implement the Viterbi Algorithm in log-space for numerical stability
• Understand when Viterbi decoding differs from posterior decoding
The decoding problem (also called inference or segmentation) is:
Given: an HMM with known parameters $\boldsymbol{\theta} = (\mathbf{A}, \mathbf{B}, \boldsymbol{\pi})$ and an observation sequence $\mathbf{x} = (x_1, x_2, \ldots, x_T)$.
Find: $$\mathbf{z}^* = \arg\max_{\mathbf{z}} P(\mathbf{z} | \mathbf{x}, \boldsymbol{\theta})$$
The sequence of hidden states that is most probable given the observations.
Using Bayes' rule: $$P(\mathbf{z} | \mathbf{x}) = \frac{P(\mathbf{x}, \mathbf{z})}{P(\mathbf{x})}$$
Since $P(\mathbf{x})$ is constant with respect to $\mathbf{z}$, maximizing the posterior is equivalent to maximizing the joint:
$$\mathbf{z}^* = \arg\max_{\mathbf{z}} P(\mathbf{z} | \mathbf{x}) = \arg\max_{\mathbf{z}} P(\mathbf{x}, \mathbf{z})$$
This is convenient because we already know how to compute $P(\mathbf{x}, \mathbf{z})$ from the HMM factorization.
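For reference, the HMM factorization gives this joint in closed form, so any candidate sequence $\mathbf{z}$ can be scored directly:

$$P(\mathbf{x}, \mathbf{z}) = \pi_{z_1} B_{z_1}(x_1) \prod_{t=2}^{T} A_{z_{t-1} z_t} \cdot B_{z_t}(x_t)$$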
As with the evaluation problem, brute force enumeration of all $N^T$ state sequences is intractable. For $N=100$ states and $T=50$ time steps, we'd need to evaluate $100^{50}$ sequences—more than the number of atoms in the universe!
The Viterbi Algorithm exploits a crucial property: the best path to any state at time $t$ must include the best path to some state at time $t-1$.
Formally, if the globally optimal sequence passes through state $j$ at time $t$, then its prefix up to time $t-1$ must itself be the best path to whatever state it occupies at time $t-1$; otherwise we could splice in a better prefix and increase the total probability, contradicting optimality.
This is the optimal substructure property that makes dynamic programming applicable. We don't need to track all $N^t$ paths to time $t$—we only need to track the $N$ best paths, one ending at each state.
The Viterbi variable $\delta_t(i)$ is defined as the probability of the most likely path ending in state $i$ at time $t$, having generated observations $x_1, \ldots, x_t$:
$$\delta_t(i) = \max_{z_1, \ldots, z_{t-1}} P(z_1, \ldots, z_{t-1}, z_t = i, x_1, \ldots, x_t | \boldsymbol{\theta})$$
| Variable | Definition | Operation | Semantics |
|---|---|---|---|
| $\alpha_t(i)$ | $P(x_{1:t}, z_t=i)$ | Sum over all paths to $i$ | Total probability mass |
| $\delta_t(i)$ | $\max_{z_{1:t-1}} P(x_{1:t}, z_{1:t-1}, z_t=i)$ | Max over all paths to $i$ | Best single path |
The key difference: Forward sums (for marginalization), Viterbi maximizes (for optimization).
To reconstruct the optimal path, we need to remember which previous state led to the maximum. Define the backpointer:
$$\psi_t(j) = \arg\max_{i} \left[ \delta_{t-1}(i) \cdot A_{ij} \right]$$
This stores the predecessor state on the best path ending at state $j$ at time $t$.
At the first time step, there's no path history—just the initial state probability and first emission:
$$\delta_1(i) = \pi_i \cdot B_i(x_1)$$ $$\psi_1(i) = 0 \quad \text{(no predecessor at } t=1 \text{)}$$
This is identical to the Forward Algorithm initialization.
For $t > 1$, we derive the recursive relationship:
$$\begin{align} \delta_t(j) &= \max_{z_1, \ldots, z_{t-1}} P(z_1, \ldots, z_{t-1}, z_t = j, x_1, \ldots, x_t) \\ &= \max_{i} \left[ \max_{z_1, \ldots, z_{t-2}} P(z_1, \ldots, z_{t-2}, z_{t-1} = i, x_1, \ldots, x_{t-1}) \cdot P(z_t = j | z_{t-1} = i) \cdot P(x_t | z_t = j) \right] \\ &= \max_{i} \left[ \delta_{t-1}(i) \cdot A_{ij} \right] \cdot B_j(x_t) \end{align}$$
The corresponding backpointer: $$\psi_t(j) = \arg\max_{i} \left[ \delta_{t-1}(i) \cdot A_{ij} \right]$$
Notice the parallel structure:
Forward: $\alpha_t(j) = \left[ \sum_{i} \alpha_{t-1}(i) \cdot A_{ij} \right] \cdot B_j(x_t)$
Viterbi: $\delta_t(j) = \left[ \max_{i} \delta_{t-1}(i) \cdot A_{ij} \right] \cdot B_j(x_t)$
The only difference is sum ↔ max. This is an instance of the sum-product ↔ max-product correspondence in graphical model inference!
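This correspondence can be checked numerically. The sketch below (the helper `forward_like` and all parameter values are mine, for illustration) runs the same recursion twice, once combining with `np.sum` and once with `np.max`, and verifies both results against brute-force enumeration on a tiny HMM:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# A small random HMM: N states, M symbols, T observations (illustrative sizes)
N, M, T = 3, 4, 6
A = rng.dirichlet(np.ones(N), size=N)   # each row sums to 1
B = rng.dirichlet(np.ones(M), size=N)
pi = rng.dirichlet(np.ones(N))
obs = rng.integers(0, M, size=T)

def forward_like(combine):
    """Shared recursion: combine = np.sum gives Forward, np.max gives Viterbi."""
    v = pi * B[:, obs[0]]
    for t in range(1, T):
        # combine over source states i of v[i] * A[i, j], then multiply emission
        v = combine(v[:, None] * A, axis=0) * B[:, obs[t]]
    return combine(v)

total_prob = forward_like(np.sum)   # P(x): probability mass over all paths
best_prob = forward_like(np.max)    # P*: probability of the single best path

# Brute-force check over all N**T state sequences
joints = []
for z in product(range(N), repeat=T):
    p = pi[z[0]] * B[z[0], obs[0]]
    for t in range(1, T):
        p *= A[z[t-1], z[t]] * B[z[t], obs[t]]
    joints.append(p)

assert np.isclose(total_prob, sum(joints))
assert np.isclose(best_prob, max(joints))
print(f"P(x) = {total_prob:.6g}, P* = {best_prob:.6g}")
```

The only line that differs between the two computations is which `combine` is passed in, which is exactly the sum ↔ max substitution.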
After processing all $T$ observations, find the best final state:
$$z_T^* = \arg\max_{i} \delta_T(i)$$
The probability of the best path: $$P^* = \max_{i} \delta_T(i)$$
Recover the complete optimal sequence by tracing backpointers:
$$z_{t-1}^* = \psi_t(z_t^*) \quad \text{for } t = T, T-1, \ldots, 2$$
This gives us $\mathbf{z}^* = (z_1^*, z_2^*, \ldots, z_T^*)$.
Viterbi Algorithm
Initialize: For $i = 1, \ldots, N$: $$\delta_1(i) = \pi_i \cdot B_i(x_1), \quad \psi_1(i) = 0$$
Recurse: For $t = 2, \ldots, T$ and $j = 1, \ldots, N$: $$\delta_t(j) = \max_{i} \left[ \delta_{t-1}(i) \cdot A_{ij} \right] \cdot B_j(x_t)$$ $$\psi_t(j) = \arg\max_{i} \left[ \delta_{t-1}(i) \cdot A_{ij} \right]$$
Terminate: $$P^* = \max_{i} \delta_T(i), \quad z_T^* = \arg\max_{i} \delta_T(i)$$
Backtrack: For $t = T-1, \ldots, 1$: $$z_t^* = \psi_{t+1}(z_{t+1}^*)$$
Let's implement the Viterbi Algorithm with proper numerical stability.
For the same reasons as the Forward Algorithm, we work in log space to prevent underflow. The recursion becomes:
$$\log \delta_t(j) = \max_{i} \left[ \log \delta_{t-1}(i) + \log A_{ij} \right] + \log B_j(x_t)$$
Note that unlike the Forward Algorithm, we don't need log-sum-exp—the max operation works directly in log space!
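A quick sanity check of this point (illustrative values): because log is monotonic, max and argmax commute with it, while sum does not.

```python
import numpy as np

p = np.array([1e-5, 3e-7, 2e-4])               # some tiny path probabilities
assert np.argmax(p) == np.argmax(np.log(p))    # same winning path either way
assert np.isclose(np.log(p).max(), np.log(p.max()))
# A sum does NOT commute with log, which is why Forward needs log-sum-exp:
assert not np.isclose(np.log(p).sum(), np.log(p.sum()))
print("max commutes with log; sum does not")
```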
```python
import numpy as np
from typing import Tuple

def viterbi_algorithm(
    A: np.ndarray,             # (N, N) transition matrix
    B: np.ndarray,             # (N, M) emission matrix
    pi: np.ndarray,            # (N,) initial distribution
    observations: np.ndarray   # (T,) observation indices
) -> Tuple[np.ndarray, float, np.ndarray]:
    """
    Find the most likely state sequence using the Viterbi algorithm.

    Works in log-probability space for numerical stability.

    Parameters
    ----------
    A : np.ndarray of shape (N, N)
        Transition matrix, A[i,j] = P(z_t=j | z_{t-1}=i)
    B : np.ndarray of shape (N, M)
        Emission matrix, B[i,k] = P(x_t=k | z_t=i)
    pi : np.ndarray of shape (N,)
        Initial state distribution
    observations : np.ndarray of shape (T,)
        Sequence of observation indices

    Returns
    -------
    best_path : np.ndarray of shape (T,)
        Most likely state sequence
    log_prob : float
        Log probability of the best path
    delta : np.ndarray of shape (T, N)
        Viterbi variables (log probabilities)
    """
    T = len(observations)
    N = len(pi)

    # Convert to log space
    log_A = np.log(A + 1e-300)   # Add small constant to avoid log(0)
    log_B = np.log(B + 1e-300)
    log_pi = np.log(pi + 1e-300)

    # Initialize
    log_delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)  # Backpointers

    # Step 1: Initialization
    log_delta[0] = log_pi + log_B[:, observations[0]]
    psi[0] = 0  # No predecessor at t=0

    # Step 2: Recursion
    for t in range(1, T):
        for j in range(N):
            # Compute max over all previous states
            candidates = log_delta[t-1] + log_A[:, j]
            psi[t, j] = np.argmax(candidates)
            log_delta[t, j] = candidates[psi[t, j]] + log_B[j, observations[t]]

    # Step 3: Termination
    last_state = np.argmax(log_delta[-1])
    log_prob = log_delta[-1, last_state]

    # Step 4: Backtracking
    best_path = np.zeros(T, dtype=int)
    best_path[-1] = last_state
    for t in range(T-2, -1, -1):
        best_path[t] = psi[t+1, best_path[t+1]]

    return best_path, log_prob, log_delta


def viterbi_algorithm_vectorized(
    A: np.ndarray,
    B: np.ndarray,
    pi: np.ndarray,
    observations: np.ndarray
) -> Tuple[np.ndarray, float, np.ndarray]:
    """
    Vectorized Viterbi implementation.

    Uses numpy broadcasting for better performance.
    """
    T = len(observations)
    N = len(pi)

    log_A = np.log(A + 1e-300)
    log_B = np.log(B + 1e-300)
    log_pi = np.log(pi + 1e-300)

    log_delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)

    # Initialization
    log_delta[0] = log_pi + log_B[:, observations[0]]

    # Recursion (vectorized over target states)
    for t in range(1, T):
        # (N, 1) + (N, N) = (N, N) matrix of candidates
        # candidates[i, j] = log_delta[t-1, i] + log_A[i, j]
        candidates = log_delta[t-1, :, np.newaxis] + log_A
        # Max and argmax over source states (axis 0)
        psi[t] = np.argmax(candidates, axis=0)
        log_delta[t] = np.max(candidates, axis=0) + log_B[:, observations[t]]

    # Termination and backtracking
    best_path = np.zeros(T, dtype=int)
    best_path[-1] = np.argmax(log_delta[-1])
    log_prob = log_delta[-1, best_path[-1]]
    for t in range(T-2, -1, -1):
        best_path[t] = psi[t+1, best_path[t+1]]

    return best_path, log_prob, log_delta


# Example usage
def demo_viterbi():
    """Demonstrate Viterbi on a weather HMM."""
    # States: 0=Sunny, 1=Cloudy, 2=Rainy
    # Observations: 0=Walk, 1=Shop, 2=Clean
    A = np.array([
        [0.7, 0.2, 0.1],
        [0.3, 0.4, 0.3],
        [0.2, 0.3, 0.5],
    ])
    B = np.array([
        [0.7, 0.2, 0.1],  # Sunny: prefer Walk
        [0.3, 0.4, 0.3],  # Cloudy: mixed
        [0.1, 0.2, 0.7],  # Rainy: prefer Clean
    ])
    pi = np.array([0.6, 0.3, 0.1])

    # Observation: Walk, Shop, Clean, Clean
    observations = np.array([0, 1, 2, 2])

    best_path, log_prob, delta = viterbi_algorithm_vectorized(A, B, pi, observations)

    state_names = ["Sunny", "Cloudy", "Rainy"]
    obs_names = ["Walk", "Shop", "Clean", "Clean"]

    print("Viterbi Decoding Results:")
    print("-" * 60)
    print(f"Observations: {obs_names}")
    print(f"Best path:    {[state_names[s] for s in best_path]}")
    print(f"Log prob:     {log_prob:.4f}")
    print(f"Probability:  {np.exp(log_prob):.8f}")

    print("\nViterbi Variables (log δ):")
    for t in range(len(observations)):
        print(f"  t={t+1}: ", end="")
        for i, s in enumerate(state_names):
            print(f"δ({s})={delta[t,i]:.3f}  ", end="")
        print()

demo_viterbi()
```

Let's trace through Viterbi on the same example as the Forward Algorithm to see the differences.
HMM Parameters:
$$\boldsymbol{\pi} = \begin{pmatrix} 0.6 \\ 0.4 \end{pmatrix}, \quad \mathbf{A} = \begin{pmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{pmatrix}, \quad \mathbf{B} = \begin{pmatrix} 0.5 & 0.4 & 0.1 \\ 0.1 & 0.3 & 0.6 \end{pmatrix}$$
Observation sequence: $\mathbf{x} = (0, 1, 2)$
$$\delta_1(H) = \pi_H \cdot B_H(0) = 0.6 \times 0.5 = 0.30$$ $$\delta_1(L) = \pi_L \cdot B_L(0) = 0.4 \times 0.1 = 0.04$$ $$\psi_1(H) = \psi_1(L) = 0 \quad \text{(no predecessor)}$$
At $t=2$, for state H: $$\delta_1(H) \cdot A_{HH} = 0.30 \times 0.7 = 0.210$$ $$\delta_1(L) \cdot A_{LH} = 0.04 \times 0.4 = 0.016$$ $$\max = 0.210 \Rightarrow \psi_2(H) = H$$ $$\delta_2(H) = 0.210 \times B_H(1) = 0.210 \times 0.4 = 0.084$$
For state L: $$\delta_1(H) \cdot A_{HL} = 0.30 \times 0.3 = 0.090$$ $$\delta_1(L) \cdot A_{LL} = 0.04 \times 0.6 = 0.024$$ $$\max = 0.090 \Rightarrow \psi_2(L) = H$$ $$\delta_2(L) = 0.090 \times B_L(1) = 0.090 \times 0.3 = 0.027$$
At $t=3$, for state H: $$\delta_2(H) \cdot A_{HH} = 0.084 \times 0.7 = 0.0588$$ $$\delta_2(L) \cdot A_{LH} = 0.027 \times 0.4 = 0.0108$$ $$\max = 0.0588 \Rightarrow \psi_3(H) = H$$ $$\delta_3(H) = 0.0588 \times 0.1 = 0.00588$$
For state L: $$\delta_2(H) \cdot A_{HL} = 0.084 \times 0.3 = 0.0252$$ $$\delta_2(L) \cdot A_{LL} = 0.027 \times 0.6 = 0.0162$$ $$\max = 0.0252 \Rightarrow \psi_3(L) = H$$ $$\delta_3(L) = 0.0252 \times 0.6 = 0.01512$$
$$z_3^* = \arg\max\{\delta_3(H), \delta_3(L)\} = \arg\max\{0.00588, 0.01512\} = L$$ $$P^* = 0.01512$$
$$z_3^* = L$$ $$z_2^* = \psi_3(L) = H$$ $$z_1^* = \psi_2(H) = H$$
Optimal path: $\mathbf{z}^* = (H, H, L)$
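The hand computation above can be verified in a few lines of numpy. This is a standalone sketch in probability space (fine for $T=3$; the log-space implementation above is what you'd use for long sequences):

```python
import numpy as np

# Two-state HMM from the worked example: states H=0, L=1
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
obs = [0, 1, 2]

# Forward sweep of the Viterbi recursion, recording backpointers
delta = pi * B[:, obs[0]]                  # delta_1 = [0.30, 0.04]
psis = []
for x in obs[1:]:
    cand = delta[:, None] * A              # cand[i, j] = delta(i) * A[i, j]
    psis.append(cand.argmax(axis=0))       # best predecessor for each state j
    delta = cand.max(axis=0) * B[:, x]

# Backtrack from the best final state
path = [int(delta.argmax())]
for psi in reversed(psis):
    path.append(int(psi[path[-1]]))
path.reverse()

names = ['H', 'L']
print([names[s] for s in path], f"P* = {delta.max():.5f}")
```

Running this prints `['H', 'H', 'L'] P* = 0.01512`, matching the table above.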
| Time | Observation | δ(H) | δ(L) | ψ(H) | ψ(L) | Best arriving at H | Best arriving at L |
|---|---|---|---|---|---|---|---|
| t=1 | 0 | 0.3000 | 0.0400 | — | — | — | — |
| t=2 | 1 | 0.0840 | 0.0270 | H | H | H→H | H→L |
| t=3 | 2 | 0.0059 | 0.0151 | H | H | H→H | H→L |
The optimal path is (H, H, L) with probability 0.01512. Notice that even though we end in state L, the best path to get there came from state H at time 2—and the best path to H at time 2 came from H at time 1. The observation $x_3=2$ (which L prefers) pulled the final state to L despite the path through H-states.
There are actually two different ways to decode an HMM, and they can give different answers!
$$\mathbf{z}^* = \arg\max_{\mathbf{z}} P(\mathbf{z} | \mathbf{x})$$
Find the single most probable complete sequence of states.
$$z_t^* = \arg\max_{z_t} P(z_t | \mathbf{x}) \quad \text{for each } t$$
At each time step, independently choose the most probable state given all observations.
They can differ when the globally optimal sequence doesn't use the locally optimal states at each position.
Example: Consider an HMM with states A, B, C, D, where the candidate state sequences have posterior probabilities like these (illustrative numbers):

• A→B→C: 0.20
• A→B→D: 0.15
• A→C→D: 0.12
• B→B→D: 0.10

Viterbi: Returns A→B→C (probability 0.20), the single most probable sequence.

Posterior decoding at position 3: $P(z_3 = C \mid \mathbf{x}) = 0.20$, but $P(z_3 = D \mid \mathbf{x}) = 0.15 + 0.12 + 0.10 = 0.37$, because several moderately probable paths all end in D.

So posterior decoding might choose D at position 3, giving a sequence A→B→D that differs from Viterbi's. Note that the per-position choices need not even form a connected path with nonzero transition probability.
Posterior decoding requires the marginal $P(z_t = i | \mathbf{x})$ at each time step. This is computed using both Forward and Backward passes:
$$P(z_t = i | \mathbf{x}) = \frac{\alpha_t(i) \cdot \beta_t(i)}{P(\mathbf{x})}$$
where $\alpha_t$ comes from the Forward Algorithm and $\beta_t$ from the Backward Algorithm (covered in the Baum-Welch page).
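Posterior decoding can be sketched with one Forward and one Backward pass. The function below (names are mine, not a library API; probability space for brevity) computes the marginals $\gamma_t(i) = P(z_t = i \mid \mathbf{x})$ and takes a per-position argmax:

```python
import numpy as np

def posterior_decode(A, B, pi, obs):
    """Per-position argmax of P(z_t = i | x), via forward-backward."""
    T, N = len(obs), len(pi)

    # Forward pass: alpha[t, i] = P(x_{1:t}, z_t = i)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]

    # Backward pass: beta[t, i] = P(x_{t+1:T} | z_t = i)
    beta = np.ones((T, N))
    for t in range(T-2, -1, -1):
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])

    # Posterior marginals gamma[t, i] = P(z_t = i | x)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma.argmax(axis=1), gamma

# Same two-state HMM as the worked example (H=0, L=1)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
states, gamma = posterior_decode(A, B, pi, [0, 1, 2])
print("per-position states:", states)
print("marginals sum to 1:", np.allclose(gamma.sum(axis=1), 1.0))
```

On this particular example the posterior path is (H, H, L), coinciding with the Viterbi path; the two decodings agree here but need not in general, as the example above shows.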
1. Speech Recognition

Given acoustic features, find the most likely sequence of phonemes or words.

2. Part-of-Speech Tagging

Given a sentence, assign the most likely grammatical tag to each word.

3. Gene Finding in Bioinformatics

Identify coding regions in DNA sequences by decoding the most likely label for each position.

4. Named Entity Recognition

Identify and classify named entities (such as people, organizations, and locations) in text.
```python
import numpy as np

def pos_tagging_example():
    """
    Simple POS tagging example with Viterbi.

    This is a toy example for illustration. Real POS taggers
    use much larger vocabularies and tagsets.
    """
    # Tags (states)
    tags = ['DET', 'NOUN', 'VERB', 'ADJ']  # Determiner, Noun, Verb, Adjective
    tag2idx = {t: i for i, t in enumerate(tags)}

    # Vocabulary (observations)
    words = ['the', 'dog', 'runs', 'big', 'cat', 'sleeps']
    word2idx = {w: i for i, w in enumerate(words)}

    # Transition probabilities (simplified)
    # DET usually followed by NOUN or ADJ
    # NOUN usually followed by VERB
    # VERB usually followed by DET or ADJ
    A = np.array([
        [0.01, 0.50, 0.09, 0.40],  # From DET
        [0.05, 0.10, 0.80, 0.05],  # From NOUN
        [0.30, 0.30, 0.10, 0.30],  # From VERB
        [0.05, 0.80, 0.10, 0.05],  # From ADJ
    ])

    # Emission probabilities (simplified)
    # Each row: P(word | tag) for words [the, dog, runs, big, cat, sleeps]
    B = np.array([
        [0.80, 0.01, 0.01, 0.01, 0.01, 0.01],  # DET: 'the' is likely
        [0.01, 0.40, 0.01, 0.01, 0.40, 0.01],  # NOUN: 'dog', 'cat'
        [0.01, 0.01, 0.45, 0.01, 0.01, 0.45],  # VERB: 'runs', 'sleeps'
        [0.01, 0.01, 0.01, 0.90, 0.01, 0.01],  # ADJ: 'big'
    ])
    # Normalize rows (in case they don't sum to 1)
    B = B / B.sum(axis=1, keepdims=True)

    # Initial probabilities (sentences usually start with DET)
    pi = np.array([0.70, 0.15, 0.10, 0.05])

    # Test sentences
    test_sentences = [
        ['the', 'dog', 'runs'],
        ['the', 'big', 'cat', 'sleeps'],
        ['dog', 'runs'],
    ]

    print("POS Tagging with Viterbi Algorithm")
    print("=" * 50)

    for sentence in test_sentences:
        # Convert words to indices
        obs = np.array([word2idx[w] for w in sentence])

        # Run Viterbi
        best_path, log_prob, _ = viterbi_algorithm(A, B, pi, obs)
        predicted_tags = [tags[i] for i in best_path]

        print(f"\nSentence: {' '.join(sentence)}")
        print(f"Tags:     {' '.join(predicted_tags)}")
        print(f"Log prob: {log_prob:.3f}")

pos_tagging_example()
```

1. Handling Unknown Observations
In practice, you may encounter observations not seen during training. A standard remedy is to map them to a special unknown symbol whose emission probability is smoothed to a small nonzero value under every state, so that $\log B_j(x_t)$ stays finite.
2. Beam Search
For very large state spaces, maintain only the top-K paths at each time step: keep the K highest-scoring states, extend transitions only from those survivors, and prune the rest. The search becomes approximate, but the per-step cost drops from $O(N^2)$ toward $O(KN)$.
3. Online/Streaming Viterbi
For real-time applications, you cannot wait for the end of the sequence before backtracking. A common strategy is to output states with a fixed lag, or to backtrack early once all surviving paths agree on a shared prefix.
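The pruning idea can be sketched as follows. This is a minimal beam-search variant of Viterbi (the function and parameter names are mine, not a standard API): at each step only the `beam_width` best states survive, and paths are extended only from those survivors.

```python
import numpy as np

def viterbi_beam(A, B, pi, obs, beam_width=2):
    """Approximate Viterbi: prune to the `beam_width` best states each step."""
    N = len(pi)
    log_A, log_B = np.log(A), np.log(B)
    score = np.log(pi) + log_B[:, obs[0]]

    # Keep only the beam_width highest-scoring states
    beam = np.argsort(score)[-beam_width:]
    backptr = []
    for x in obs[1:]:
        # Extend only paths that survived pruning; pruned rows stay at -inf
        cand = np.full((N, N), -np.inf)
        cand[beam] = score[beam, None] + log_A[beam]
        backptr.append(cand.argmax(axis=0))
        score = cand.max(axis=0) + log_B[:, x]
        beam = np.argsort(score)[-beam_width:]

    # Backtrack exactly as in full Viterbi
    path = [int(score.argmax())]
    for bp in reversed(backptr):
        path.append(int(bp[path[-1]]))
    path.reverse()
    return path, score.max()
```

With `beam_width = N` this reduces to exact Viterbi; smaller beams trade away the optimality guarantee for speed, which is why beam search is standard in large-vocabulary speech recognition.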
The Viterbi Algorithm solves the decoding problem—finding the most probable hidden state sequence—using dynamic programming.
Next Page: The Baum-Welch Algorithm (EM for HMMs)
The Forward and Viterbi algorithms assume we know the HMM parameters. But in practice, we must learn them from data. The Baum-Welch algorithm uses Expectation-Maximization to iteratively improve parameter estimates, even when the hidden states are never observed.
This requires both Forward and Backward passes to compute expected state occupancies and transition counts—the sufficient statistics for updating $\mathbf{A}$, $\mathbf{B}$, and $\boldsymbol{\pi}$.
You now understand the Viterbi Algorithm—how it finds the most probable state sequence through dynamic programming, the crucial role of backpointers in reconstructing the optimal path, and when to use Viterbi vs. posterior decoding. Combined with the Forward Algorithm, you have the tools for HMM evaluation and decoding.