So far, we've assumed the HMM parameters $(\mathbf{A}, \mathbf{B}, \boldsymbol{\pi})$ are known. But in practice, we must learn them from data. This presents a fundamental challenge: the latent states are never observed!
If we could observe the hidden state sequence alongside the observations, parameter estimation would be trivial—just count transitions and emissions. But we only see the observations. We need to somehow "infer" what the hidden states probably were, and use that to estimate parameters.
This is a classic chicken-and-egg problem: to estimate the parameters we need the hidden states, but to infer the hidden states we need the parameters. We start with neither.
The Baum-Welch algorithm (Baum et al., 1970) elegantly solves this via Expectation-Maximization (EM): iteratively refine our beliefs about hidden states (E-step) and update parameters accordingly (M-step). Each iteration is guaranteed to improve the likelihood, converging to a local optimum.
By the end of this page, you will:
• Understand the learning problem and why direct optimization is hard
• Master the Backward Algorithm and its role in computing posteriors
• Derive the E-step: computing expected sufficient statistics
• Derive the M-step: updating parameters to maximize expected log-likelihood
• Implement the complete Baum-Welch algorithm
• Understand convergence properties, initialization strategies, and practical considerations
Given:
- A dataset $\mathcal{D} = \{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(K)}\}$ of $K$ observation sequences
- The model structure: the number of hidden states $N$ and the number of observation symbols $M$
Find: $$\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} P(\mathcal{D} | \boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \prod_{k=1}^{K} P(\mathbf{x}^{(k)} | \boldsymbol{\theta})$$
We seek the Maximum Likelihood Estimate (MLE) of the parameters.
The log-likelihood involves marginalizing over hidden states:
$$\log P(\mathbf{x} | \boldsymbol{\theta}) = \log \sum_{\mathbf{z}} P(\mathbf{x}, \mathbf{z} | \boldsymbol{\theta})$$
The sum inside the log prevents simple closed-form solutions:
Complete data: If we observed both $\mathbf{x}$ and $\mathbf{z}$, the MLE would be trivial:
$\hat{A}_{ij} = \frac{\text{count}(z_{t-1}=i, z_t=j)}{\text{count}(z_{t-1}=i)}$ — Just count transitions!
$\hat{B}_{jk} = \frac{\text{count}(z_t=j, x_t=k)}{\text{count}(z_t=j)}$ — Just count emissions!
Incomplete data: With $\mathbf{z}$ hidden, we can't count directly. Baum-Welch replaces counts with expected counts under the current parameter estimates.
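To see how easy the complete-data case is, here is a minimal counting sketch (a hypothetical helper, not part of the Baum-Welch code below), assuming a single fully observed state sequence in which every state is visited at least once:

```python
import numpy as np

def supervised_counts_mle(states, observations, n_states, n_symbols):
    """MLE by direct counting when the state sequence is observed."""
    A_counts = np.zeros((n_states, n_states))
    B_counts = np.zeros((n_states, n_symbols))
    for z_prev, z_next in zip(states[:-1], states[1:]):
        A_counts[z_prev, z_next] += 1   # count transitions i -> j
    for z, x in zip(states, observations):
        B_counts[z, x] += 1             # count emissions of symbol x from state z
    A_hat = A_counts / A_counts.sum(axis=1, keepdims=True)
    B_hat = B_counts / B_counts.sum(axis=1, keepdims=True)
    return A_hat, B_hat
```

Baum-Welch will reuse exactly this normalize-the-counts structure, with the hard counts replaced by expected counts.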
Expectation-Maximization provides a principled approach:
E-step (Expectation): Using current parameters $\boldsymbol{\theta}^{\text{old}}$, compute the posterior distribution $P(\mathbf{z} | \mathbf{x}, \boldsymbol{\theta}^{\text{old}})$ over hidden states. Extract expected "counts" of transitions and emissions.
M-step (Maximization): Update parameters to maximize expected complete-data log-likelihood: $$\boldsymbol{\theta}^{\text{new}} = \arg\max_{\boldsymbol{\theta}} \mathbb{E}_{\mathbf{z} | \mathbf{x}, \boldsymbol{\theta}^{\text{old}}} [\log P(\mathbf{x}, \mathbf{z} | \boldsymbol{\theta})]$$
Repeat until convergence.
The key insight: even though we don't observe $\mathbf{z}$, we can compute a soft assignment—the probability of each state at each time step—and use this "expected" information as if it were (fractional) counts.
To compute posterior probabilities over hidden states, we need the Backward Algorithm—the complement to the Forward Algorithm.
The backward variable $\beta_t(i)$ is the probability of observing the future observations $x_{t+1}, \ldots, x_T$ given that we're in state $i$ at time $t$:
$$\beta_t(i) = P(x_{t+1}, x_{t+2}, \ldots, x_T | z_t = i, \boldsymbol{\theta})$$
At the final time step, there are no future observations: $$\beta_T(i) = 1 \quad \text{for all } i$$
Working backwards:
$$\begin{align} \beta_t(i) &= P(x_{t+1}, \ldots, x_T | z_t = i) \\ &= \sum_{j=1}^{N} P(x_{t+1}, \ldots, x_T, z_{t+1} = j | z_t = i) \\ &= \sum_{j=1}^{N} P(z_{t+1} = j | z_t = i) \cdot P(x_{t+1} | z_{t+1} = j) \cdot P(x_{t+2}, \ldots, x_T | z_{t+1} = j) \\ &= \sum_{j=1}^{N} A_{ij} \cdot B_j(x_{t+1}) \cdot \beta_{t+1}(j) \end{align}$$
Compare the structures:
Forward: $\alpha_t(j) = \sum_{i} \alpha_{t-1}(i) \cdot A_{ij} \cdot B_j(x_t)$
Backward: $\beta_t(i) = \sum_{j} A_{ij} \cdot B_j(x_{t+1}) \cdot \beta_{t+1}(j)$
Both propagate information through the chain—Forward from past to present, Backward from future to present. Together, they capture all information about each hidden state.
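As a quick hand-worked example (hypothetical numbers, matching the 2-state parameters used in the demo later): take $A_{11}=0.7$, $A_{12}=0.3$, $A_{21}=0.4$, $A_{22}=0.6$, and $B_1(x_T)=0.9$, $B_2(x_T)=0.2$. One backward step from the base case $\beta_T(i) = 1$ gives:

$$\beta_{T-1}(1) = 0.7 \cdot 0.9 \cdot 1 + 0.3 \cdot 0.2 \cdot 1 = 0.69, \qquad \beta_{T-1}(2) = 0.4 \cdot 0.9 \cdot 1 + 0.6 \cdot 0.2 \cdot 1 = 0.48$$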
```python
import numpy as np
from scipy.special import logsumexp


def backward_algorithm_log(
    log_A: np.ndarray,        # (N, N) log transition matrix
    log_B: np.ndarray,        # (N, M) log emission matrix
    observations: np.ndarray  # (T,) observation indices
) -> np.ndarray:
    """
    Compute backward variables in log-probability space.

    Parameters
    ----------
    log_A : np.ndarray of shape (N, N)
        Log transition matrix
    log_B : np.ndarray of shape (N, M)
        Log emission matrix
    observations : np.ndarray of shape (T,)
        Observation sequence

    Returns
    -------
    log_beta : np.ndarray of shape (T, N)
        Log backward variables
    """
    T = len(observations)
    N = log_A.shape[0]

    log_beta = np.zeros((T, N))

    # Initialization: β_T(i) = 1 => log β_T(i) = 0
    log_beta[-1] = 0.0

    # Recursion: t = T-1 down to 0
    for t in range(T-2, -1, -1):
        for i in range(N):
            # log β_t(i) = log Σ_j A_{ij} * B_j(x_{t+1}) * β_{t+1}(j)
            log_beta[t, i] = logsumexp(
                log_A[i, :] + log_B[:, observations[t+1]] + log_beta[t+1]
            )

    return log_beta


def backward_algorithm_log_vectorized(
    log_A: np.ndarray,
    log_B: np.ndarray,
    observations: np.ndarray
) -> np.ndarray:
    """Vectorized log-space Backward Algorithm."""
    T = len(observations)
    N = log_A.shape[0]

    log_beta = np.zeros((T, N))
    log_beta[-1] = 0.0

    for t in range(T-2, -1, -1):
        # For each source state i, sum over target states j:
        # log_A[i, j] + log_B[j, obs] + log_beta[t+1, j]
        # Shape: (N, N) + (N,) + (N,) = (N, N), then logsumexp over axis 1
        log_beta[t] = logsumexp(
            log_A + log_B[:, observations[t+1]] + log_beta[t+1],
            axis=1
        )

    return log_beta


def verify_forward_backward(log_alpha, log_beta, log_pi, log_B, observations):
    """
    Verify that forward and backward give consistent likelihood.

    At any time t: P(x) = Σ_i α_t(i) * β_t(i)
    This should be the same for all t!
    """
    T = len(observations)
    log_probs = np.zeros(T)

    for t in range(T):
        log_probs[t] = logsumexp(log_alpha[t] + log_beta[t])

    print("Log P(x) computed at each time step:")
    for t in range(T):
        print(f"  t={t}: {log_probs[t]:.6f}")
    print(f"\nAll equal? {np.allclose(log_probs, log_probs[0])}")
```

With both forward and backward variables, we can compute posterior probabilities that form the foundation of the E-step.
Define $\gamma_t(i)$ as the posterior probability of being in state $i$ at time $t$:
$$\gamma_t(i) = P(z_t = i | \mathbf{x}, \boldsymbol{\theta})$$
Using Bayes' rule and the forward-backward variables:
$$\gamma_t(i) = \frac{P(z_t = i, \mathbf{x})}{P(\mathbf{x})} = \frac{\alpha_t(i) \cdot \beta_t(i)}{\sum_{j} \alpha_t(j) \cdot \beta_t(j)}$$
Interpretation: $\alpha_t(i)$ captures "evidence from the past" (observations $x_1, \ldots, x_t$ and being in state $i$). $\beta_t(i)$ captures "evidence from the future" (observations $x_{t+1}, \ldots, x_T$ given state $i$). The product gives the joint probability of all observations and being in state $i$ at time $t$.
Define $\xi_t(i, j)$ as the posterior probability of transitioning from state $i$ to state $j$ at time $t$:
$$\xi_t(i, j) = P(z_t = i, z_{t+1} = j | \mathbf{x}, \boldsymbol{\theta})$$
$$\begin{align} \xi_t(i, j) &= \frac{P(z_t = i, z_{t+1} = j, \mathbf{x})}{P(\mathbf{x})} \\ &= \frac{\alpha_t(i) \cdot A_{ij} \cdot B_j(x_{t+1}) \cdot \beta_{t+1}(j)}{P(\mathbf{x})} \end{align}$$
where $P(\mathbf{x}) = \sum_{i} \alpha_T(i)$.
Decomposition: $\alpha_t(i)$ covers the past observations and arriving in state $i$ at time $t$; $A_{ij}$ is the transition; $B_j(x_{t+1})$ emits the next observation from state $j$; and $\beta_{t+1}(j)$ covers all remaining future observations.
Marginalizing $\xi$ over the target state gives $\gamma$:
$$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)$$
This makes sense: the probability of being in state $i$ at time $t$ equals the sum of probabilities of transitioning from $i$ to any state $j$.
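This identity doubles as a useful unit test. A minimal sketch, assuming the array conventions used in the code later on this page (`gamma` of shape `(T, N)`, `xi` of shape `(T-1, N, N)`):

```python
import numpy as np

def check_gamma_xi_consistency(gamma, xi, atol=1e-8):
    """Check γ_t(i) == Σ_j ξ_t(i, j) for t = 1, ..., T-1.

    Summing xi over its last axis (the target state j) should
    recover the occupancy posterior at the source time step.
    """
    return np.allclose(gamma[:-1], xi.sum(axis=2), atol=atol)
```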
| Quantity | Definition | Meaning | Formula |
|---|---|---|---|
| $\gamma_t(i)$ | $P(z_t=i \mid \mathbf{x})$ | State occupancy at time $t$ | $\frac{\alpha_t(i) \beta_t(i)}{P(\mathbf{x})}$ |
| $\xi_t(i,j)$ | $P(z_t=i, z_{t+1}=j \mid \mathbf{x})$ | Transition $i \to j$ at time $t$ | $\frac{\alpha_t(i) A_{ij} B_j(x_{t+1}) \beta_{t+1}(j)}{P(\mathbf{x})}$ |
In the E-step, we compute the expected sufficient statistics—the expected counts that we would compute if we observed the hidden states, where expectations are taken under the posterior $P(\mathbf{z} | \mathbf{x}, \boldsymbol{\theta}^{\text{old}})$.
$$\mathbb{E}[\text{start in state } i] = \gamma_1(i)$$
The expected number of times the sequence starts in state $i$.
$$\mathbb{E}[\text{transitions } i \to j] = \sum_{t=1}^{T-1} \xi_t(i, j)$$
Sum over all time steps where a transition could occur.
$$\mathbb{E}[\text{transitions from } i] = \sum_{t=1}^{T-1} \gamma_t(i)$$
Expected total transitions out of state $i$.
$$\mathbb{E}[\text{emit } k \text{ from state } j] = \sum_{t: x_t = k} \gamma_t(j)$$
Sum $\gamma_t(j)$ over all time steps where observation $k$ was seen.
$$\mathbb{E}[\text{emissions from state } j] = \sum_{t=1}^{T} \gamma_t(j)$$
Expected total emissions from state $j$.
Unlike hard counts from observed data, these are soft counts—fractional values between 0 and 1. For example, $\gamma_t(i) = 0.7$ means "at time $t$, we're 70% confident the system was in state $i$."
These fractional counts accumulate across time steps and, when normalized, give proper probability estimates.
```python
import numpy as np
from scipy.special import logsumexp


def compute_posteriors(
    log_A: np.ndarray,
    log_B: np.ndarray,
    log_pi: np.ndarray,
    observations: np.ndarray
) -> tuple:
    """
    Compute γ (gamma) and ξ (xi) posterior probabilities.

    Returns
    -------
    gamma : np.ndarray of shape (T, N)
        State occupancy probabilities γ_t(i) = P(z_t=i | x)
    xi : np.ndarray of shape (T-1, N, N)
        Transition probabilities ξ_t(i,j) = P(z_t=i, z_{t+1}=j | x)
    log_likelihood : float
        Log P(observations | model)
    """
    T = len(observations)
    N = len(log_pi)

    # Forward pass
    log_alpha = np.zeros((T, N))
    log_alpha[0] = log_pi + log_B[:, observations[0]]
    for t in range(1, T):
        log_alpha[t] = logsumexp(
            log_alpha[t-1, :, np.newaxis] + log_A, axis=0
        ) + log_B[:, observations[t]]

    # Backward pass
    log_beta = np.zeros((T, N))
    log_beta[-1] = 0.0
    for t in range(T-2, -1, -1):
        log_beta[t] = logsumexp(
            log_A + log_B[:, observations[t+1]] + log_beta[t+1],
            axis=1
        )

    # Log-likelihood: P(x) = Σ_i α_T(i)
    log_likelihood = logsumexp(log_alpha[-1])

    # Gamma: γ_t(i) = P(z_t = i | x)
    # In log space: log γ = log α + log β - log P(x)
    log_gamma = log_alpha + log_beta - log_likelihood
    gamma = np.exp(log_gamma)

    # Xi: ξ_t(i,j) = P(z_t=i, z_{t+1}=j | x)
    # Only defined for t = 0 to T-2
    xi = np.zeros((T-1, N, N))
    for t in range(T-1):
        # log ξ_t(i,j) = log α_t(i) + log A_{ij} + log B_j(x_{t+1})
        #                + log β_{t+1}(j) - log P(x)
        log_xi_t = (log_alpha[t, :, np.newaxis]
                    + log_A
                    + log_B[:, observations[t+1]]
                    + log_beta[t+1, :]
                    - log_likelihood)
        xi[t] = np.exp(log_xi_t)

    return gamma, xi, log_likelihood


def accumulate_statistics(
    gamma: np.ndarray,
    xi: np.ndarray,
    observations: np.ndarray,
    n_observations: int
) -> tuple:
    """
    Accumulate expected sufficient statistics from posteriors.

    Returns
    -------
    gamma_1 : np.ndarray of shape (N,)
        Expected initial state counts (just γ_1)
    transition_counts : np.ndarray of shape (N, N)
        Expected transition counts
    emission_counts : np.ndarray of shape (N, M)
        Expected emission counts per state and observation
    """
    T, N = gamma.shape

    # Initial state: γ_1(i)
    gamma_1 = gamma[0]

    # Transition counts: Σ_t ξ_t(i,j)
    transition_counts = xi.sum(axis=0)  # Sum over time

    # Emission counts: Σ_{t: x_t=k} γ_t(j)
    emission_counts = np.zeros((N, n_observations))
    for t in range(T):
        emission_counts[:, observations[t]] += gamma[t]

    return gamma_1, transition_counts, emission_counts
```

The M-step updates parameters to maximize the expected complete-data log-likelihood (the Q-function):
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) = \mathbb{E}_{\mathbf{z} | \mathbf{x}, \boldsymbol{\theta}^{\text{old}}} [\log P(\mathbf{x}, \mathbf{z} | \boldsymbol{\theta})]$$
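Substituting the HMM joint factorization $P(\mathbf{x}, \mathbf{z} | \boldsymbol{\theta}) = \pi_{z_1} \prod_{t=2}^{T} A_{z_{t-1} z_t} \prod_{t=1}^{T} B_{z_t}(x_t)$ and taking the posterior expectation, the Q-function splits into three independent pieces, one per parameter group:

$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) = \sum_{i=1}^{N} \gamma_1(i) \log \pi_i + \sum_{t=1}^{T-1} \sum_{i,j} \xi_t(i,j) \log A_{ij} + \sum_{t=1}^{T} \sum_{j} \gamma_t(j) \log B_j(x_t)$$

Each piece can be maximized separately under its own sum-to-one constraint (via Lagrange multipliers), which yields the closed-form updates below.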
$$\hat{\pi}_i = \gamma_1(i)$$
The new initial probability of state $i$ is the posterior probability of starting in state $i$. For multiple sequences:
$$\hat{\pi}_i = \frac{1}{K} \sum_{k=1}^{K} \gamma_1^{(k)}(i)$$
$$\hat{A}_{ij} = \frac{\text{Expected transitions } i \to j}{\text{Expected transitions from } i} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$
For multiple sequences:
$$\hat{A}_{ij} = \frac{\sum_{k=1}^{K} \sum_{t=1}^{T_k-1} \xi_t^{(k)}(i, j)}{\sum_{k=1}^{K} \sum_{t=1}^{T_k-1} \gamma_t^{(k)}(i)}$$
Discrete case: $$\hat{B}_{jk} = \frac{\text{Expected emissions of } k \text{ from state } j}{\text{Expected emissions from state } j} = \frac{\sum_{t: x_t = k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$
Gaussian case: If emissions are Gaussian $\mathcal{N}(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$:
$$\hat{\boldsymbol{\mu}}_j = \frac{\sum_{t=1}^{T} \gamma_t(j) \cdot x_t}{\sum_{t=1}^{T} \gamma_t(j)}$$
$$\hat{\boldsymbol{\Sigma}}_j = \frac{\sum_{t=1}^{T} \gamma_t(j) \cdot (x_t - \hat{\boldsymbol{\mu}}_j)(x_t - \hat{\boldsymbol{\mu}}_j)^\top}{\sum_{t=1}^{T} \gamma_t(j)}$$
These are weighted maximum likelihood estimates where $\gamma_t(j)$ serves as the weight for observation $x_t$ being attributed to state $j$.
The M-step formulas are exactly the Maximum Likelihood estimates you would compute if you observed hard counts, but with hard counts replaced by expected (soft) counts from the E-step.
This is why Baum-Welch converges to a local maximum: each iteration increases the expected complete-data log-likelihood, which provably increases the incomplete-data log-likelihood.
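A sketch of the standard EM argument: decomposing the log-likelihood using the posterior under the old parameters gives

$$\log P(\mathbf{x} | \boldsymbol{\theta}) = Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) - \mathbb{E}_{\mathbf{z} | \mathbf{x}, \boldsymbol{\theta}^{\text{old}}}\left[\log P(\mathbf{z} | \mathbf{x}, \boldsymbol{\theta})\right]$$

The second term is a cross-entropy, which by Gibbs' inequality is minimized over $\boldsymbol{\theta}$ at $\boldsymbol{\theta} = \boldsymbol{\theta}^{\text{old}}$, so it can only grow as $\boldsymbol{\theta}$ moves away. Hence

$$\log P(\mathbf{x} | \boldsymbol{\theta}^{\text{new}}) - \log P(\mathbf{x} | \boldsymbol{\theta}^{\text{old}}) \geq Q(\boldsymbol{\theta}^{\text{new}}, \boldsymbol{\theta}^{\text{old}}) - Q(\boldsymbol{\theta}^{\text{old}}, \boldsymbol{\theta}^{\text{old}}) \geq 0$$

where the final inequality holds because the M-step maximizes $Q$.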
```python
import numpy as np


def m_step(
    gamma_1: np.ndarray,
    transition_counts: np.ndarray,
    emission_counts: np.ndarray,
    gamma_sum: np.ndarray,
    gamma_sum_except_last: np.ndarray
) -> tuple:
    """
    Perform M-step: update HMM parameters from expected statistics.

    Parameters
    ----------
    gamma_1 : np.ndarray of shape (N,)
        Expected initial state distribution
    transition_counts : np.ndarray of shape (N, N)
        Expected transition counts
    emission_counts : np.ndarray of shape (N, M)
        Expected emission counts
    gamma_sum : np.ndarray of shape (N,)
        Σ_t γ_t(i) for each state (total expected time in each state)
    gamma_sum_except_last : np.ndarray of shape (N,)
        Σ_{t=1}^{T-1} γ_t(i) for each state (for transition denominator)

    Returns
    -------
    pi : np.ndarray of shape (N,)
        Updated initial distribution
    A : np.ndarray of shape (N, N)
        Updated transition matrix
    B : np.ndarray of shape (N, M)
        Updated emission matrix
    """
    # Initial distribution
    pi = gamma_1 / gamma_1.sum()  # Normalize (should already sum to 1)

    # Transition matrix
    # A[i, j] = (expected i→j transitions) / (expected transitions from i)
    # Add small constant to avoid division by zero
    A = transition_counts / (gamma_sum_except_last[:, np.newaxis] + 1e-300)
    # Normalize rows to ensure proper probability distribution
    A = A / A.sum(axis=1, keepdims=True)

    # Emission matrix
    # B[j, k] = (expected emissions of k from j) / (expected emissions from j)
    B = emission_counts / (gamma_sum[:, np.newaxis] + 1e-300)
    B = B / B.sum(axis=1, keepdims=True)

    return pi, A, B


def m_step_gaussian(
    gamma: np.ndarray,        # (T, N)
    observations: np.ndarray  # (T, D) continuous observations
) -> tuple:
    """
    M-step for Gaussian HMM.

    Returns updated means and covariances for each state.
    """
    T, N = gamma.shape
    D = observations.shape[1]

    # Total expected occupancy of each state (denominator of the updates)
    gamma_sum = gamma.sum(axis=0)  # (N,)

    # Update means: μ_j = Σ_t γ_t(j) * x_t / Σ_t γ_t(j)
    mu = np.zeros((N, D))
    for j in range(N):
        mu[j] = (gamma[:, j, np.newaxis] * observations).sum(axis=0) / gamma_sum[j]

    # Update covariances
    sigma = np.zeros((N, D, D))
    for j in range(N):
        diff = observations - mu[j]  # (T, D)
        # Weighted outer product sum
        for t in range(T):
            sigma[j] += gamma[t, j] * np.outer(diff[t], diff[t])
        sigma[j] /= gamma_sum[j]

    return mu, sigma
```

Putting it all together, here is the complete Baum-Welch algorithm:
Input: Observation sequences $\{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(K)}\}$, number of states $N$, number of observations $M$
Output: Estimated parameters $\hat{\boldsymbol{\theta}} = (\hat{\mathbf{A}}, \hat{\mathbf{B}}, \hat{\boldsymbol{\pi}})$
Algorithm:
Initialize parameters $(\mathbf{A}^{(0)}, \mathbf{B}^{(0)}, \boldsymbol{\pi}^{(0)})$
Repeat until convergence:
E-step: For each sequence $k$, run the Forward and Backward passes, compute the posteriors $\gamma_t^{(k)}(i)$ and $\xi_t^{(k)}(i,j)$, and accumulate the expected initial-state, transition, and emission counts.
M-step: Update parameters by normalizing the accumulated expected counts into $\hat{\boldsymbol{\pi}}$, $\hat{\mathbf{A}}$, and $\hat{\mathbf{B}}$ via the re-estimation formulas above.
Return final parameters
```python
import numpy as np
from scipy.special import logsumexp
from typing import List, Optional


class HMM:
    """Complete HMM with Baum-Welch training."""

    def __init__(self, n_states: int, n_observations: int):
        self.n_states = n_states
        self.n_observations = n_observations
        self.A = None   # Transition matrix
        self.B = None   # Emission matrix
        self.pi = None  # Initial distribution

    def _init_parameters(self, random_state: Optional[int] = None):
        """Initialize parameters with small random perturbations."""
        if random_state is not None:
            np.random.seed(random_state)

        N, M = self.n_states, self.n_observations

        # Random initialization with normalization
        self.A = np.random.rand(N, N) + 0.1
        self.A /= self.A.sum(axis=1, keepdims=True)

        self.B = np.random.rand(N, M) + 0.1
        self.B /= self.B.sum(axis=1, keepdims=True)

        self.pi = np.random.rand(N) + 0.1
        self.pi /= self.pi.sum()

    def _forward(self, observations: np.ndarray) -> tuple:
        """Forward pass in log space."""
        T = len(observations)
        N = self.n_states
        log_A = np.log(self.A + 1e-300)
        log_B = np.log(self.B + 1e-300)
        log_pi = np.log(self.pi + 1e-300)

        log_alpha = np.zeros((T, N))
        log_alpha[0] = log_pi + log_B[:, observations[0]]
        for t in range(1, T):
            log_alpha[t] = logsumexp(
                log_alpha[t-1, :, np.newaxis] + log_A, axis=0
            ) + log_B[:, observations[t]]

        return log_alpha, logsumexp(log_alpha[-1])

    def _backward(self, observations: np.ndarray) -> np.ndarray:
        """Backward pass in log space."""
        T = len(observations)
        N = self.n_states
        log_A = np.log(self.A + 1e-300)
        log_B = np.log(self.B + 1e-300)

        log_beta = np.zeros((T, N))
        log_beta[-1] = 0.0
        for t in range(T-2, -1, -1):
            log_beta[t] = logsumexp(
                log_A + log_B[:, observations[t+1]] + log_beta[t+1],
                axis=1
            )

        return log_beta

    def _e_step(self, observations: np.ndarray) -> tuple:
        """E-step: compute posteriors and sufficient statistics."""
        T = len(observations)
        N = self.n_states

        log_alpha, log_likelihood = self._forward(observations)
        log_beta = self._backward(observations)
        log_A = np.log(self.A + 1e-300)
        log_B = np.log(self.B + 1e-300)

        # Gamma
        log_gamma = log_alpha + log_beta - log_likelihood
        gamma = np.exp(log_gamma)

        # Xi
        xi = np.zeros((T-1, N, N))
        for t in range(T-1):
            log_xi_t = (log_alpha[t, :, np.newaxis]
                        + log_A
                        + log_B[:, observations[t+1]]
                        + log_beta[t+1, :]
                        - log_likelihood)
            xi[t] = np.exp(log_xi_t)

        return gamma, xi, log_likelihood

    def fit(
        self,
        sequences: List[np.ndarray],
        max_iters: int = 100,
        tol: float = 1e-6,
        random_state: Optional[int] = None,
        verbose: bool = True
    ) -> List[float]:
        """
        Train HMM using Baum-Welch algorithm.

        Parameters
        ----------
        sequences : List of observation sequences
        max_iters : Maximum EM iterations
        tol : Convergence tolerance (relative change in log-likelihood)
        random_state : Random seed for initialization
        verbose : Print progress

        Returns
        -------
        log_likelihoods : List of log-likelihoods at each iteration
        """
        self._init_parameters(random_state)

        K = len(sequences)
        N = self.n_states
        M = self.n_observations

        log_likelihoods = []
        prev_ll = float('-inf')

        for iteration in range(max_iters):
            # Accumulate sufficient statistics
            total_gamma_1 = np.zeros(N)
            total_trans_num = np.zeros((N, N))
            total_trans_denom = np.zeros(N)
            total_emit_num = np.zeros((N, M))
            total_emit_denom = np.zeros(N)
            total_ll = 0.0

            # E-step: process each sequence
            for seq in sequences:
                gamma, xi, ll = self._e_step(seq)
                total_ll += ll

                # Accumulate statistics
                total_gamma_1 += gamma[0]
                total_trans_num += xi.sum(axis=0)
                total_trans_denom += gamma[:-1].sum(axis=0)
                for t, obs in enumerate(seq):
                    total_emit_num[:, obs] += gamma[t]
                total_emit_denom += gamma.sum(axis=0)

            log_likelihoods.append(total_ll)

            # Check convergence
            rel_change = (total_ll - prev_ll) / (abs(prev_ll) + 1e-10)
            if verbose and (iteration % 10 == 0 or iteration == max_iters - 1):
                print(f"Iter {iteration}: log-likelihood = {total_ll:.4f}")
            if rel_change < tol and iteration > 0:
                if verbose:
                    print(f"Converged at iteration {iteration}")
                break
            prev_ll = total_ll

            # M-step: update parameters
            self.pi = total_gamma_1 / K
            self.A = total_trans_num / (total_trans_denom[:, np.newaxis] + 1e-300)
            self.A /= self.A.sum(axis=1, keepdims=True)  # Renormalize
            self.B = total_emit_num / (total_emit_denom[:, np.newaxis] + 1e-300)
            self.B /= self.B.sum(axis=1, keepdims=True)  # Renormalize

        return log_likelihoods


# Example usage
def demo_baum_welch():
    """Demonstrate Baum-Welch on synthetic data."""
    # True HMM (for generating data)
    true_A = np.array([[0.7, 0.3],
                       [0.4, 0.6]])
    true_B = np.array([[0.9, 0.1],
                       [0.2, 0.8]])
    true_pi = np.array([0.6, 0.4])

    # Generate synthetic sequences
    def sample_hmm(A, B, pi, T):
        N, M = B.shape
        states = np.zeros(T, dtype=int)
        obs = np.zeros(T, dtype=int)
        states[0] = np.random.choice(N, p=pi)
        obs[0] = np.random.choice(M, p=B[states[0]])
        for t in range(1, T):
            states[t] = np.random.choice(N, p=A[states[t-1]])
            obs[t] = np.random.choice(M, p=B[states[t]])
        return obs

    np.random.seed(42)
    sequences = [sample_hmm(true_A, true_B, true_pi, T=50) for _ in range(20)]

    # Train HMM
    hmm = HMM(n_states=2, n_observations=2)
    log_likelihoods = hmm.fit(sequences, max_iters=50, random_state=123, verbose=True)

    print("\n" + "="*50)
    print("True vs Learned Parameters:")
    print("\nTrue A:")
    print(true_A)
    print("Learned A:")
    print(hmm.A.round(3))
    print("\nTrue B:")
    print(true_B)
    print("Learned B:")
    print(hmm.B.round(3))


demo_baum_welch()
```

Baum-Welch finds a local maximum, not necessarily the global maximum. Initialization matters!
Random Initialization: draw $\mathbf{A}$, $\mathbf{B}$, and $\boldsymbol{\pi}$ from perturbed-uniform random values (as in `_init_parameters` above), run several restarts from different seeds, and keep the run with the highest final log-likelihood; a restart sketch follows this list.
K-means Initialization (for continuous emissions): cluster the observations into $N$ groups and initialize each state's mean and covariance from one cluster; this usually starts EM much closer to a good optimum than random draws.
Uniform Initialization: setting all probabilities exactly equal is a trap; the symmetry makes every state interchangeable, the posteriors come out identical, and EM can sit at this degenerate fixed point without making progress.
Domain-Informed Initialization: when states have intended meanings (e.g., coding vs. non-coding regions in gene finding), seed the emission distributions from labeled or curated examples and let Baum-Welch refine the rest.
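A minimal restart loop, assuming the `HMM` class implemented above (the function name `fit_with_restarts` is ours, not from a library):

```python
def fit_with_restarts(sequences, n_states, n_observations, n_restarts=10):
    """Run Baum-Welch from several random seeds, keep the best local optimum."""
    best_hmm, best_ll = None, float('-inf')
    for seed in range(n_restarts):
        hmm = HMM(n_states, n_observations)
        lls = hmm.fit(sequences, random_state=seed, verbose=False)
        if lls[-1] > best_ll:          # Compare final log-likelihoods
            best_hmm, best_ll = hmm, lls[-1]
    return best_hmm, best_ll
```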
Adding pseudocounts is equivalent to placing Dirichlet priors on parameters. Instead of MLE, we get Maximum A Posteriori (MAP) estimates:
$\hat{A}_{ij} = \frac{\sum_t \xi_t(i,j) + \alpha - 1}{\sum_t \gamma_t(i) + N(\alpha - 1)}$
With $\alpha > 1$ (symmetric Dirichlet prior), this prevents any transition probability from being exactly 0, improving robustness.
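A hedged sketch of the smoothed update (the helper name is ours): adding the pseudocounts $\alpha - 1$ to each expected count before row-normalizing reproduces the MAP formula above, since $\sum_j \sum_t \xi_t(i,j) = \sum_t \gamma_t(i)$.

```python
import numpy as np

def smoothed_m_step(transition_counts, emission_counts, alpha=1.1):
    """MAP updates: add pseudocounts (alpha - 1) before normalizing.

    Equivalent to placing a symmetric Dirichlet(alpha) prior on each
    row of A and B; no probability can collapse exactly to zero.
    """
    A = transition_counts + (alpha - 1.0)
    A /= A.sum(axis=1, keepdims=True)
    B = emission_counts + (alpha - 1.0)
    B /= B.sum(axis=1, keepdims=True)
    return A, B
```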
Per iteration, for each of the $K$ sequences of length $T$:
- Forward pass: $O(TN^2)$
- Backward pass: $O(TN^2)$
- Computing $\gamma$ and $\xi$: $O(TN^2)$
- Parameter updates: $O(N^2 + NM)$
Total per iteration: $O(KTN^2)$
For large datasets or models, Baum-Welch can be slow. Alternatives:
- Viterbi training (hard EM): replace the posterior expectations with hard counts along the single best state path; faster and simpler per iteration, but a cruder approximation (see the sketch below)
- Online / incremental EM: update the sufficient statistics after each sequence or mini-batch rather than after a full pass over the data
- Direct gradient ascent on $\log P(\mathbf{x} | \boldsymbol{\theta})$ computed via the Forward Algorithm, e.g. with automatic differentiation
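To illustrate the first alternative, here is a hedged sketch of one Viterbi-training step, reusing the `HMM` class from above (the function name and structure are ours):

```python
import numpy as np

def viterbi_training_step(hmm, sequences):
    """One hard-EM step: decode each sequence with Viterbi, then count
    transitions and emissions along the best path as if it were observed."""
    N, M = hmm.n_states, hmm.n_observations
    log_A = np.log(hmm.A + 1e-300)
    log_B = np.log(hmm.B + 1e-300)
    log_pi = np.log(hmm.pi + 1e-300)
    A_cnt = np.zeros((N, N))
    B_cnt = np.zeros((N, M))
    pi_cnt = np.zeros(N)
    for obs in sequences:
        T = len(obs)
        delta = np.zeros((T, N))            # best log-prob ending in each state
        psi = np.zeros((T, N), dtype=int)   # backpointers
        delta[0] = log_pi + log_B[:, obs[0]]
        for t in range(1, T):
            scores = delta[t-1][:, None] + log_A   # scores[i, j]: come from i to j
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
        # Backtrack the single best path, then hard-count it
        path = np.empty(T, dtype=int)
        path[-1] = delta[-1].argmax()
        for t in range(T-2, -1, -1):
            path[t] = psi[t+1, path[t+1]]
        pi_cnt[path[0]] += 1
        for t in range(T-1):
            A_cnt[path[t], path[t+1]] += 1
        for t in range(T):
            B_cnt[path[t], obs[t]] += 1
    hmm.pi = pi_cnt / pi_cnt.sum()
    hmm.A = A_cnt / (A_cnt.sum(axis=1, keepdims=True) + 1e-300)
    hmm.B = B_cnt / (B_cnt.sum(axis=1, keepdims=True) + 1e-300)
```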
The Baum-Welch algorithm solves the learning problem—estimating HMM parameters from unlabeled observation sequences—using Expectation-Maximization.
Next Page: Applications
With the complete HMM toolkit—structure, Forward Algorithm, Viterbi Algorithm, and Baum-Welch—we're ready to explore real-world applications. We'll see HMMs in action for speech recognition, part-of-speech tagging, gene finding, and other domains, understanding how the choice of state representation, observation model, and training procedures impact performance.
You now have a deep understanding of the Baum-Welch algorithm—the theoretical foundation in EM, the Forward-Backward Algorithm for computing posteriors, the parameter update formulas, and practical considerations for training. Combined with the Forward and Viterbi algorithms, you have the complete toolkit for building, training, and deploying Hidden Markov Models.