Standard mixture models treat observations as independently and identically distributed (i.i.d.)—each data point is generated by first selecting a component, then sampling from that component's distribution. This assumption is fundamental to tractable inference but profoundly limiting for sequential data.
Consider speech recognition: an audio signal consists of a sequence of acoustic frames corresponding to phonemes. Crucially, phonemes don't appear randomly—the phoneme at time $t$ strongly depends on previous phonemes. A 'k' sound is likely followed by a vowel, not by silence. This temporal structure carries essential information that i.i.d. models discard.
Hidden Markov Models (HMMs) extend mixture models by introducing Markov dependencies between the latent component indicators. Instead of independent component selection, the sequence of latent states $z_1, z_2, \ldots, z_T$ forms a Markov chain: the current state depends only on the previous state. This seemingly simple modification enables HMMs to capture rich temporal structure while remaining computationally tractable.
HMMs are foundational in machine learning, having driven breakthroughs in speech recognition, computational biology, handwriting recognition, and financial modeling. Understanding HMMs provides the conceptual foundation for more advanced sequential models like conditional random fields and recurrent neural networks.
By the end of this page, you will understand: (1) The HMM graphical model and its factorization; (2) The three fundamental HMM problems—evaluation, decoding, and learning; (3) The forward-backward algorithm for computing likelihoods and posteriors; (4) The Viterbi algorithm for finding optimal state sequences; (5) The Baum-Welch algorithm (EM for HMMs); (6) Connections to reinforcement learning and modern sequence models.
An HMM jointly models a sequence of observations $\mathbf{x} = (x_1, x_2, \ldots, x_T)$ and a corresponding sequence of hidden states $\mathbf{z} = (z_1, z_2, \ldots, z_T)$.
1. State space: Discrete states $z_t \in \{1, 2, \ldots, K\}$
2. Initial state distribution $\boldsymbol{\pi}$: $$\pi_k = p(z_1 = k)$$ Vector of length $K$ summing to 1.
3. Transition matrix $\mathbf{A}$: $$A_{jk} = p(z_t = k | z_{t-1} = j)$$ $K \times K$ matrix where each row sums to 1. Entry $A_{jk}$ is the probability of transitioning from state $j$ to state $k$.
4. Emission distributions $\{p(x | z = k)\}_{k=1}^K$: The probability of observing $x$ given state $k$. Common choices include a categorical distribution over discrete symbols, a single Gaussian per state, or a Gaussian mixture per state.
An HMM generates a sequence as follows: sample the initial state $z_1 \sim \boldsymbol{\pi}$; for each $t = 2, \ldots, T$, sample $z_t$ from row $z_{t-1}$ of $\mathbf{A}$; and at every step, emit an observation $x_t \sim p(x | z_t)$.
The joint probability factorizes as:
$$p(\mathbf{x}, \mathbf{z}) = p(z_1) \prod_{t=2}^T p(z_t | z_{t-1}) \prod_{t=1}^T p(x_t | z_t)$$
$$= \pi_{z_1} \prod_{t=2}^T A_{z_{t-1}, z_t} \prod_{t=1}^T p(x_t | z_t)$$
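For a fully observed pair $(\mathbf{x}, \mathbf{z})$, this factorization can be evaluated directly. A minimal sketch, assuming categorical emissions stored as a $K \times V$ matrix `B` (the function name is illustrative):

```python
import numpy as np

def joint_log_prob(x, z, pi, A, B):
    """log p(x, z) = log pi_{z_1} + sum_t log A_{z_{t-1}, z_t} + sum_t log B_{z_t, x_t}."""
    lp = np.log(pi[z[0]]) + np.log(B[z[0], x[0]])
    for t in range(1, len(x)):
        lp += np.log(A[z[t-1], z[t]]) + np.log(B[z[t], x[t]])
    return lp

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])
# p(x=(0,1), z=(0,0)) = 0.5 * 0.7 * 0.9 * 0.3 = 0.0945
print(np.exp(joint_log_prob([0, 1], [0, 0], pi, A, B)))
```

Working in log-space here mirrors the inference algorithms below, where it becomes essential.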
Markov property on states: $z_t \perp z_{1:t-2} | z_{t-1}$ (future is independent of past given present)
Conditional independence of observations: $x_t \perp x_{-t}, z_{-t} | z_t$ (observation depends only on its corresponding state)
D-separation: Given the hidden states, all observations are independent. Given observations, states are generally not independent—there's "explaining away."
A standard mixture model is an HMM with $T = 1$; the transition matrix is irrelevant for a single observation. Alternatively, an HMM with $\mathbf{A} = \mathbf{1}\boldsymbol{\pi}^T$ (all rows identical to $\boldsymbol{\pi}$) selects components i.i.d.—recovering the standard mixture.
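The special case is easy to verify numerically: when every row of the transition matrix equals $\boldsymbol{\pi}$, the next-state distribution is $\boldsymbol{\pi}$ regardless of where the chain currently is, so component selection is i.i.d. A quick check:

```python
import numpy as np

pi = np.array([0.5, 0.3, 0.2])
K = len(pi)
# Every row of A equals pi: transitions ignore the previous state.
A = np.tile(pi, (K, 1))

# Whatever the state distribution v at time t-1, the distribution at
# time t is exactly pi -- the chain has no memory.
v = np.array([0.1, 0.1, 0.8])
assert np.allclose(v @ A, pi)
```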
Working with HMMs requires solving three fundamental computational problems:
Question: Given model parameters $\boldsymbol{\theta} = (\boldsymbol{\pi}, \mathbf{A}, \{\text{emission params}\})$ and observation sequence $\mathbf{x} = (x_1, \ldots, x_T)$, compute $p(\mathbf{x} | \boldsymbol{\theta})$.
Challenge: Naive computation requires summing over all $K^T$ possible state sequences: $$p(\mathbf{x}) = \sum_{z_1, \ldots, z_T} p(\mathbf{x}, \mathbf{z})$$
For $T = 100$ and $K = 50$, this is $50^{100} \approx 10^{170}$ terms—utterly intractable.
Solution: The forward algorithm computes this in $O(TK^2)$ time using dynamic programming.
Applications: Model selection, anomaly detection, speech recognition confidence.
Question: Given $\boldsymbol{\theta}$ and $\mathbf{x}$, find the most probable state sequence: $$\mathbf{z}^* = \arg\max_{\mathbf{z}} p(\mathbf{z} | \mathbf{x}, \boldsymbol{\theta})$$
Challenge: Same combinatorial explosion—$K^T$ candidate sequences.
Solution: The Viterbi algorithm finds the global optimum in $O(TK^2)$ using dynamic programming.
Applications: Speech-to-text transcription, part-of-speech tagging, gene finding.
Question: Given observation sequences (potentially multiple), find parameters that maximize likelihood: $$\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} p(\mathbf{x} | \boldsymbol{\theta})$$
Challenge: Hidden states make the likelihood non-convex. Direct optimization is difficult.
Solution: The Baum-Welch algorithm (EM for HMMs) iteratively improves parameters.
Applications: Training speech recognizers, building biological sequence models.
| Problem | Algorithm | Time Complexity | Space Complexity | Output |
|---|---|---|---|---|
| Evaluation | Forward | O(TK²) | O(TK) | p(x₁:T) |
| Decoding | Viterbi | O(TK²) | O(TK) | argmax z₁:T |
| Learning | Baum-Welch | O(TK² · iterations) | O(TK) | θ* |
The forward-backward algorithm efficiently computes the likelihood $p(\mathbf{x})$ and the posterior marginals $p(z_t | \mathbf{x})$. It consists of two passes: a forward pass that accumulates evidence from the past, and a backward pass that accumulates evidence from the future.
Define the forward variable $\alpha_t(k)$ as the joint probability of observing the first $t$ observations and ending in state $k$:
$$\alpha_t(k) = p(x_1, \ldots, x_t, z_t = k)$$
Base case ($t = 1$): $$\alpha_1(k) = \pi_k \cdot p(x_1 | z_1 = k)$$
Recursion ($t > 1$): $$\alpha_t(k) = p(x_t | z_t = k) \sum_{j=1}^K \alpha_{t-1}(j) \cdot A_{jk}$$
The sum accumulates all ways to reach state $k$ at time $t$: for each previous state $j$, multiply the probability of being in $j$ at $t-1$ by the transition probability to $k$.
Termination: The total likelihood is: $$p(\mathbf{x}) = \sum_{k=1}^K \alpha_T(k)$$
Define the backward variable $\beta_t(k)$ as the probability of observing the remaining sequence given that we're in state $k$ at time $t$:
$$\beta_t(k) = p(x_{t+1}, \ldots, x_T | z_t = k)$$
Base case ($t = T$): $$\beta_T(k) = 1$$
Recursion ($t < T$): $$\beta_t(k) = \sum_{j=1}^K A_{kj} \cdot p(x_{t+1} | z_{t+1} = j) \cdot \beta_{t+1}(j)$$
Combining forward and backward variables yields the posterior:
$$\gamma_t(k) = p(z_t = k | \mathbf{x}) = \frac{\alpha_t(k) \cdot \beta_t(k)}{p(\mathbf{x})} = \frac{\alpha_t(k) \cdot \beta_t(k)}{\sum_j \alpha_t(j) \cdot \beta_t(j)}$$
Pairwise posterior (needed for learning): $$\xi_t(j, k) = p(z_t = j, z_{t+1} = k | \mathbf{x}) = \frac{\alpha_t(j) \cdot A_{jk} \cdot p(x_{t+1} | k) \cdot \beta_{t+1}(k)}{p(\mathbf{x})}$$
```python
import numpy as np

def forward_backward(observations, pi, A, emission_probs):
    """
    Forward-backward algorithm for HMMs.

    Args:
        observations: Sequence of observation indices (T,)
        pi: Initial distribution (K,)
        A: Transition matrix (K, K)
        emission_probs: Emission probabilities (K, V) for V observations

    Returns:
        alpha: Forward variables (T, K)
        beta: Backward variables (T, K)
        gamma: State posteriors (T, K)
        xi: Transition posteriors (T-1, K, K)
        log_likelihood: log p(observations)
    """
    T = len(observations)
    K = len(pi)

    # Get emission probabilities for observed sequence
    B = emission_probs[:, observations].T  # (T, K)

    # Forward pass (log-space for numerical stability)
    log_alpha = np.zeros((T, K))
    log_alpha[0] = np.log(pi + 1e-10) + np.log(B[0] + 1e-10)
    for t in range(1, T):
        for k in range(K):
            log_alpha[t, k] = (np.log(B[t, k] + 1e-10) +
                               np.logaddexp.reduce(log_alpha[t-1] + np.log(A[:, k] + 1e-10)))

    # Log-likelihood
    log_likelihood = np.logaddexp.reduce(log_alpha[T-1])

    # Backward pass
    log_beta = np.zeros((T, K))
    log_beta[T-1] = 0  # log(1) = 0
    for t in range(T-2, -1, -1):
        for k in range(K):
            log_beta[t, k] = np.logaddexp.reduce(
                np.log(A[k] + 1e-10) + np.log(B[t+1] + 1e-10) + log_beta[t+1]
            )

    # State posteriors
    log_gamma = log_alpha + log_beta - log_likelihood
    gamma = np.exp(log_gamma)

    # Transition posteriors
    xi = np.zeros((T-1, K, K))
    for t in range(T-1):
        for j in range(K):
            for k in range(K):
                xi[t, j, k] = np.exp(
                    log_alpha[t, j] + np.log(A[j, k] + 1e-10) +
                    np.log(B[t+1, k] + 1e-10) + log_beta[t+1, k] -
                    log_likelihood
                )

    alpha = np.exp(log_alpha)
    beta = np.exp(log_beta)

    return alpha, beta, gamma, xi, log_likelihood
```

The α and β values become exponentially small for long sequences due to repeated multiplication of probabilities < 1. Always work in log-space, using log-sum-exp for additions. Alternatively, use scaling factors that normalize α at each step (maintaining a running log of the scale factors).
The Viterbi algorithm finds the most likely state sequence—the global maximum a posteriori (MAP) path—using dynamic programming. It exploits the same Markov structure as forward-backward but uses max instead of sum.
The principle of optimality: the best path to state $k$ at time $t$ must pass through some state $j$ at time $t-1$, and that prefix must itself be an optimal path to $j$.
Define $\delta_t(k)$ as the probability of the best path ending in state $k$ at time $t$:
$$\delta_t(k) = \max_{z_1, \ldots, z_{t-1}} p(z_1, \ldots, z_{t-1}, z_t = k, x_1, \ldots, x_t)$$
Base case: $$\delta_1(k) = \pi_k \cdot p(x_1 | k)$$
Recursion: $$\delta_t(k) = p(x_t | k) \cdot \max_j \left[ \delta_{t-1}(j) \cdot A_{jk} \right]$$
Backpointers: Store the $\arg\max$ at each step: $$\psi_t(k) = \arg\max_j \left[ \delta_{t-1}(j) \cdot A_{jk} \right]$$
After completing the forward pass, select the final state $z_T^* = \arg\max_k \delta_T(k)$, then recover the rest of the path by following the backpointers: $z_t^* = \psi_{t+1}(z_{t+1}^*)$ for $t = T-1, \ldots, 1$.
| Aspect | Forward Algorithm | Viterbi Algorithm |
|---|---|---|
| Operation | Sum over previous states | Max over previous states |
| Output | Total probability p(x) | Best path probability |
| Backtracking | Not needed | Needed to recover path |
| Semantics | Marginalizes states | Finds MAP states |
Forward/Posterior: When you need uncertainty—"what's the probability of state $k$ at time $t$?"
Viterbi/MAP: When you need a single best explanation—"what sequence of states best explains the observations?"
```python
import numpy as np

def viterbi(observations, pi, A, emission_probs):
    """
    Viterbi algorithm for finding the most likely state sequence.

    Args:
        observations: Sequence of observation indices (T,)
        pi: Initial distribution (K,)
        A: Transition matrix (K, K)
        emission_probs: Emission probabilities (K, V)

    Returns:
        best_path: Most likely state sequence (T,)
        best_prob: Probability of best path
    """
    T = len(observations)
    K = len(pi)

    # Emission probabilities for observed sequence
    B = emission_probs[:, observations].T  # (T, K)

    # Log-space for numerical stability
    log_pi = np.log(pi + 1e-10)
    log_A = np.log(A + 1e-10)
    log_B = np.log(B + 1e-10)

    # Viterbi variables
    log_delta = np.zeros((T, K))
    psi = np.zeros((T, K), dtype=int)

    # Base case
    log_delta[0] = log_pi + log_B[0]

    # Recursion
    for t in range(1, T):
        for k in range(K):
            # Find best previous state
            scores = log_delta[t-1] + log_A[:, k]
            psi[t, k] = np.argmax(scores)
            log_delta[t, k] = log_B[t, k] + scores[psi[t, k]]

    # Backtrack
    best_path = np.zeros(T, dtype=int)
    best_path[T-1] = np.argmax(log_delta[T-1])
    best_log_prob = log_delta[T-1, best_path[T-1]]
    for t in range(T-2, -1, -1):
        best_path[t] = psi[t+1, best_path[t+1]]

    return best_path, np.exp(best_log_prob)

# Example usage
np.random.seed(42)
K, V, T = 3, 4, 10

pi = np.array([0.6, 0.3, 0.1])
A = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.3, 0.3, 0.4]
])
emission_probs = np.random.dirichlet(np.ones(V), size=K)

observations = np.random.choice(V, size=T)
path, prob = viterbi(observations, pi, A, emission_probs)
print(f"Best path: {path}")
print(f"Path probability: {prob:.6f}")
```

The Baum-Welch algorithm is EM specialized to HMMs. It iteratively updates parameters using expected sufficient statistics computed via forward-backward.
Run forward-backward to compute the state posteriors $\gamma_t(k) = p(z_t = k | \mathbf{x})$ and the pairwise posteriors $\xi_t(j, k) = p(z_t = j, z_{t+1} = k | \mathbf{x})$.
Initial distribution: $$\hat{\pi}_k = \gamma_1(k)$$
The expected count of starting in state $k$.
Transition matrix: $$\hat{A}_{jk} = \frac{\sum_{t=1}^{T-1} \xi_t(j, k)}{\sum_{t=1}^{T-1} \gamma_t(j)}$$
Numerator: expected count of $j \to k$ transitions. Denominator: expected count of being in state $j$ (when a transition is possible).
Emission parameters (discrete case): $$\hat{B}_{kv} = p(x = v | z = k) = \frac{\sum_{t: x_t = v} \gamma_t(k)}{\sum_{t=1}^T \gamma_t(k)}$$
Expected count of emitting $v$ from state $k$, divided by expected time in state $k$.
For Gaussian emission $p(x | z = k) = \mathcal{N}(x | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$:
$$\hat{\boldsymbol{\mu}}_k = \frac{\sum_{t=1}^T \gamma_t(k) \mathbf{x}_t}{\sum_{t=1}^T \gamma_t(k)}$$
$$\hat{\boldsymbol{\Sigma}}_k = \frac{\sum_{t=1}^T \gamma_t(k) (\mathbf{x}_t - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_t - \hat{\boldsymbol{\mu}}_k)^T}{\sum_{t=1}^T \gamma_t(k)}$$
Weighted mean and covariance, weighted by the posterior probability of being in state $k$ at each time.
Given multiple independent sequences $\{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(M)}\}$, run forward-backward on each sequence separately and sum the expected counts across sequences before normalizing.
For example: $$\hat{A}_{jk} = \frac{\sum_{m=1}^M \sum_{t=1}^{T_m-1} \xi_t^{(m)}(j, k)}{\sum_{m=1}^M \sum_{t=1}^{T_m-1} \gamma_t^{(m)}(j)}$$
Like all EM algorithms, Baum-Welch guarantees monotonic likelihood improvement and convergence to a local maximum—but not necessarily the global maximum. Run from multiple random initializations and select the solution with highest likelihood. Domain-specific initialization (e.g., from uniform segmentation) often helps.
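The E-step and M-step above can be combined into a single training loop. Below is a compact single-sequence sketch with randomly initialized parameters; the function name and the plain-probability E-step are illustrative (for long sequences, use the log-space forward-backward shown earlier instead). The monotonic likelihood improvement can be checked directly on the returned history:

```python
import numpy as np

def baum_welch_discrete(x, K, V, n_iter=50, seed=0):
    """EM for a discrete-emission HMM on one sequence of symbol indices x.

    Uses a plain-probability forward-backward pass (fine for short
    sequences only). Returns (pi, A, B, log-likelihood history).
    """
    x = np.asarray(x)
    rng = np.random.default_rng(seed)
    T = len(x)
    pi = rng.dirichlet(np.ones(K))
    A = rng.dirichlet(np.ones(K), size=K)
    B = rng.dirichlet(np.ones(V), size=K)
    ll_hist = []
    for _ in range(n_iter):
        # E-step: forward-backward
        alpha = np.zeros((T, K))
        beta = np.ones((T, K))
        alpha[0] = pi * B[:, x[0]]
        for t in range(1, T):
            alpha[t] = B[:, x[t]] * (alpha[t-1] @ A)
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, x[t+1]] * beta[t+1])
        lik = alpha[-1].sum()
        ll_hist.append(np.log(lik))
        gamma = alpha * beta / lik                              # (T, K)
        xi = (alpha[:-1, :, None] * A[None] *
              (B[:, x[1:]].T * beta[1:])[:, None, :]) / lik    # (T-1, K, K)
        # M-step: ratios of expected counts
        pi = gamma[0]
        A = xi.sum(0) / gamma[:-1].sum(0)[:, None]
        B = np.zeros((K, V))
        for v in range(V):
            B[:, v] = gamma[x == v].sum(0)
        B /= gamma.sum(0)[:, None]
    return pi, A, B, ll_hist

x = np.array([0, 1, 0, 0, 1, 2, 2, 1, 0, 1, 0, 0])
pi, A, B, ll = baum_welch_discrete(x, K=2, V=3, n_iter=30)
print("Log-likelihood went from", ll[0], "to", ll[-1])
```

Each iteration costs one forward-backward pass, so the loop matches the $O(TK^2 \cdot \text{iterations})$ complexity in the table above.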
The basic HMM can be extended in many directions to handle different application requirements.
For sequential processes (speech, handwriting), states represent progression through a pattern. Constrain transitions to only allow forward movement:
$$A_{jk} = 0 \text{ for } k < j$$
The model can only stay in the current state or advance. This is standard in speech recognition phoneme models.
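Constructing a left-right transition matrix amounts to zeroing the lower triangle and renormalizing each row. A small sketch (the random scores are just placeholders for learned values):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
# Start from arbitrary positive scores, zero out backward moves
# (A[j, k] = 0 for k < j), then renormalize each row.
A = rng.random((K, K))
A = np.triu(A)
A /= A.sum(axis=1, keepdims=True)

# The last state can only self-loop: it is absorbing.
print(A)
```

Baum-Welch preserves these zeros automatically: a transition with zero probability accumulates zero expected counts, so the left-right structure survives re-estimation.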
Replace discrete emissions with continuous distributions—most commonly a single Gaussian per state, or a Gaussian mixture model (GMM) per state for multimodal data.
GMM-HMMs were the dominant approach in speech recognition before deep learning.
Share Gaussian components across states, with state-specific mixture weights: $$p(x|k) = \sum_m w_{km} \mathcal{N}(x | \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m)$$
Reduces parameters while allowing state-specific distributions.
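The tied-mixture emission density is straightforward to evaluate: the $M$ Gaussian components are computed once and reweighted per state. A minimal one-dimensional sketch (function names are illustrative):

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """Univariate Gaussian density, vectorized over mu and var."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def tied_mixture_emission(x, w, mus, vars_):
    """p(x | k) = sum_m w[k, m] * N(x; mus[m], vars_[m]).

    The M Gaussians (mus, vars_) are shared across all K states;
    only the mixing weights w (K, M) are state-specific.
    """
    comp = gauss_pdf(x, mus, vars_)  # (M,) shared component densities
    return w @ comp                  # (K,) one emission density per state

w = np.array([[0.7, 0.3],   # state 0 favors the component at 0
              [0.2, 0.8]])  # state 1 favors the component at 3
mus = np.array([0.0, 3.0])
vars_ = np.array([1.0, 1.0])
p = tied_mixture_emission(0.0, w, mus, vars_)
print(p)  # state 0 assigns higher density to x = 0 than state 1
```

With $K$ states and $M$ shared components, this needs $M$ Gaussians plus $KM$ weights instead of $KM$ full Gaussians.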
| Variant | Emissions | Transitions | Use Case |
|---|---|---|---|
| Discrete HMM | Categorical | Full | Discrete sequences (DNA, text) |
| Gaussian HMM | Single Gaussian | Full | Simple continuous data |
| GMM-HMM | Gaussian mixture | Full | Complex continuous (speech) |
| Left-Right | Any | Constrained (≥) | Sequential processes |
| Factorial HMM | Conditional on all | Independent chains | Multiple sources |
| Input-Output HMM | Depends on input | May depend on input | Conditional sequences |
Multiple independent Markov chains $\mathbf{z}^{(1)}, \ldots, \mathbf{z}^{(F)}$, with observations depending on all chains:
$$p(x_t | z_t^{(1)}, \ldots, z_t^{(F)})$$
This models multiple independent "causes" contributing to observations (e.g., different speakers in audio).
Condition transitions and/or emissions on external inputs: $$p(z_t | z_{t-1}, \mathbf{u}_t)$$ $$p(x_t | z_t, \mathbf{u}_t)$$
This bridges HMMs with supervised learning and connects to recurrent neural networks.
HMMs have driven major advances across diverse application domains.
Speech recognition is the application that defined HMM research for decades.
GMM-HMMs achieved commercially viable speech recognition, later enhanced by deep neural networks replacing GMMs.
Gene finding: Identify protein-coding regions in DNA
Protein structure: Profile HMMs for sequence alignment
Assign grammatical categories to words, with hidden states as part-of-speech tags and words as emissions.
RNNs and Transformers have largely superseded HMMs for complex sequence modeling (speech, language). However, HMMs remain valuable for: (1) Small data regimes where generative models excel; (2) Problems requiring interpretable state sequences; (3) Explicit uncertainty over hidden states; (4) Structured domains with known state semantics (biology).
This page has provided a comprehensive treatment of Hidden Markov Models, from fundamental concepts through practical algorithms.
What's Next: Infinite Mixtures (Dirichlet Process Mixture Models)
Both standard mixture models and HMMs require specifying the number of components $K$ in advance. But often we don't know how many components exist—we want the model complexity to grow with the data. The next page introduces Dirichlet Process Mixture Models (DPMMs), which extend mixture models with a nonparametric Bayesian prior that allows for a potentially infinite number of components, automatically inferring appropriate complexity from data.
You now understand Hidden Markov Models as mixture models extended to sequential data. The dynamic programming algorithms—forward-backward, Viterbi, and Baum-Welch—make inference and learning tractable despite exponential state spaces. Next, we'll explore how to let the number of mixture components be determined by the data itself.