So far, we've assumed the HMM parameters $(\mathbf{A}, \mathbf{B}, \boldsymbol{\pi})$ are known. But in practice, we must learn them from data. This presents a fundamental challenge: the latent states are never observed!
If we could observe the hidden state sequence alongside the observations, parameter estimation would be trivial—just count transitions and emissions. But we only see the observations. We need to somehow "infer" what the hidden states probably were, and use that to estimate parameters.
This is a classic chicken-and-egg problem: to estimate the parameters we need the hidden states, but to infer the hidden states we need the parameters. We start with neither.
The Baum-Welch algorithm (Baum et al., 1970) elegantly solves this via Expectation-Maximization (EM): iteratively refine our beliefs about hidden states (E-step) and update parameters accordingly (M-step). Each iteration is guaranteed to improve the likelihood, converging to a local optimum.
By the end of this page, you will:
• Understand the learning problem and why direct optimization is hard
• Master the Backward Algorithm and its role in computing posteriors
• Derive the E-step: computing expected sufficient statistics
• Derive the M-step: updating parameters to maximize expected log-likelihood
• Implement the complete Baum-Welch algorithm
• Understand convergence properties, initialization strategies, and practical considerations
Given:
- A dataset $\mathcal{D} = \{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(K)}\}$ of $K$ observation sequences
- The model structure: the number of hidden states $N$ and the number of observation symbols $M$
Find: $$\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} P(\mathcal{D} | \boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \prod_{k=1}^{K} P(\mathbf{x}^{(k)} | \boldsymbol{\theta})$$
We seek the Maximum Likelihood Estimate (MLE) of the parameters.
The log-likelihood involves marginalizing over hidden states:
$$\log P(\mathbf{x} | \boldsymbol{\theta}) = \log \sum_{\mathbf{z}} P(\mathbf{x}, \mathbf{z} | \boldsymbol{\theta})$$
The sum inside the log prevents simple closed-form solutions:
Complete data: If we observed both $\mathbf{x}$ and $\mathbf{z}$, the MLE would be trivial:
$\hat{A}_{ij} = \frac{\text{count}(z_{t-1}=i, z_t=j)}{\text{count}(z_{t-1}=i)}$ — Just count transitions!
$\hat{B}_{jk} = \frac{\text{count}(z_t=j, x_t=k)}{\text{count}(z_t=j)}$ — Just count emissions!
Incomplete data: With $\mathbf{z}$ hidden, we can't count directly. Baum-Welch replaces counts with expected counts under the current parameter estimates.
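To see how easy the complete-data case is, here is a minimal counting sketch (a hypothetical helper, not part of the Baum-Welch code below), assuming a single fully observed state sequence in which every state is visited at least once:

```python
import numpy as np

def supervised_counts_mle(states, observations, n_states, n_symbols):
    """MLE by direct counting when the state sequence is observed."""
    A_counts = np.zeros((n_states, n_states))
    B_counts = np.zeros((n_states, n_symbols))
    for z_prev, z_next in zip(states[:-1], states[1:]):
        A_counts[z_prev, z_next] += 1   # count transitions i -> j
    for z, x in zip(states, observations):
        B_counts[z, x] += 1             # count emissions of symbol x from state z
    A_hat = A_counts / A_counts.sum(axis=1, keepdims=True)
    B_hat = B_counts / B_counts.sum(axis=1, keepdims=True)
    return A_hat, B_hat
```

Baum-Welch will reuse exactly this normalize-the-counts structure, with the hard counts replaced by expected counts.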
Expectation-Maximization provides a principled approach:
E-step (Expectation): Using current parameters $\boldsymbol{\theta}^{\text{old}}$, compute the posterior distribution $P(\mathbf{z} | \mathbf{x}, \boldsymbol{\theta}^{\text{old}})$ over hidden states. Extract expected "counts" of transitions and emissions.
M-step (Maximization): Update parameters to maximize expected complete-data log-likelihood: $$\boldsymbol{\theta}^{\text{new}} = \arg\max_{\boldsymbol{\theta}} \mathbb{E}_{\mathbf{z} | \mathbf{x}, \boldsymbol{\theta}^{\text{old}}} [\log P(\mathbf{x}, \mathbf{z} | \boldsymbol{\theta})]$$
Repeat until convergence.
The key insight: even though we don't observe $\mathbf{z}$, we can compute a soft assignment—the probability of each state at each time step—and use this "expected" information as if it were (fractional) counts.
To compute posterior probabilities over hidden states, we need the Backward Algorithm—the complement to the Forward Algorithm.
The backward variable $\beta_t(i)$ is the probability of observing the future observations $x_{t+1}, \ldots, x_T$ given that we're in state $i$ at time $t$:
$$\beta_t(i) = P(x_{t+1}, x_{t+2}, \ldots, x_T | z_t = i, \boldsymbol{\theta})$$
At the final time step, there are no future observations: $$\beta_T(i) = 1 \quad \text{for all } i$$
Working backwards:
$$\begin{align} \beta_t(i) &= P(x_{t+1}, \ldots, x_T | z_t = i) \\ &= \sum_{j=1}^{N} P(x_{t+1}, \ldots, x_T, z_{t+1} = j | z_t = i) \\ &= \sum_{j=1}^{N} P(z_{t+1} = j | z_t = i) \cdot P(x_{t+1} | z_{t+1} = j) \cdot P(x_{t+2}, \ldots, x_T | z_{t+1} = j) \\ &= \sum_{j=1}^{N} A_{ij} \cdot B_j(x_{t+1}) \cdot \beta_{t+1}(j) \end{align}$$
Compare the structures:
Forward: $\alpha_t(j) = \sum_{i} \alpha_{t-1}(i) \cdot A_{ij} \cdot B_j(x_t)$
Backward: $\beta_t(i) = \sum_{j} A_{ij} \cdot B_j(x_{t+1}) \cdot \beta_{t+1}(j)$
Both propagate information through the chain—Forward from past to present, Backward from future to present. Together, they capture all information about each hidden state.
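As a quick hand-worked example (hypothetical numbers, matching the 2-state parameters used in the demo later): take $A_{11}=0.7$, $A_{12}=0.3$, $A_{21}=0.4$, $A_{22}=0.6$, and $B_1(x_T)=0.9$, $B_2(x_T)=0.2$. One backward step from the base case $\beta_T(i) = 1$ gives:

$$\beta_{T-1}(1) = 0.7 \cdot 0.9 \cdot 1 + 0.3 \cdot 0.2 \cdot 1 = 0.69, \qquad \beta_{T-1}(2) = 0.4 \cdot 0.9 \cdot 1 + 0.6 \cdot 0.2 \cdot 1 = 0.48$$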
```python
import numpy as np
from scipy.special import logsumexp


def backward_algorithm_log(
    log_A: np.ndarray,        # (N, N) log transition matrix
    log_B: np.ndarray,        # (N, M) log emission matrix
    observations: np.ndarray  # (T,) observation indices
) -> np.ndarray:
    """
    Compute backward variables in log-probability space.

    Parameters
    ----------
    log_A : np.ndarray of shape (N, N)
        Log transition matrix
    log_B : np.ndarray of shape (N, M)
        Log emission matrix
    observations : np.ndarray of shape (T,)
        Observation sequence

    Returns
    -------
    log_beta : np.ndarray of shape (T, N)
        Log backward variables
    """
    T = len(observations)
    N = log_A.shape[0]

    log_beta = np.zeros((T, N))

    # Initialization: β_T(i) = 1 => log β_T(i) = 0
    log_beta[-1] = 0.0

    # Recursion: t = T-1 down to 0
    for t in range(T-2, -1, -1):
        for i in range(N):
            # log β_t(i) = log Σ_j A_{ij} * B_j(x_{t+1}) * β_{t+1}(j)
            log_beta[t, i] = logsumexp(
                log_A[i, :] + log_B[:, observations[t+1]] + log_beta[t+1]
            )

    return log_beta


def backward_algorithm_log_vectorized(
    log_A: np.ndarray,
    log_B: np.ndarray,
    observations: np.ndarray
) -> np.ndarray:
    """Vectorized log-space Backward Algorithm."""
    T = len(observations)
    N = log_A.shape[0]

    log_beta = np.zeros((T, N))
    log_beta[-1] = 0.0

    for t in range(T-2, -1, -1):
        # For each source state i, sum over target states j:
        # log_A[i, j] + log_B[j, obs] + log_beta[t+1, j]
        # Shape: (N, N) + (N,) + (N,) = (N, N), then logsumexp over axis 1
        log_beta[t] = logsumexp(
            log_A + log_B[:, observations[t+1]] + log_beta[t+1],
            axis=1
        )

    return log_beta


def verify_forward_backward(log_alpha, log_beta, log_pi, log_B, observations):
    """
    Verify that forward and backward give consistent likelihood.

    At any time t: P(x) = Σ_i α_t(i) * β_t(i)
    This should be the same for all t!
    """
    T = len(observations)
    log_probs = np.zeros(T)

    for t in range(T):
        log_probs[t] = logsumexp(log_alpha[t] + log_beta[t])

    print("Log P(x) computed at each time step:")
    for t in range(T):
        print(f"  t={t}: {log_probs[t]:.6f}")
    print(f"\nAll equal? {np.allclose(log_probs, log_probs[0])}")
```

With both forward and backward variables, we can compute posterior probabilities that form the foundation of the E-step.
Define $\gamma_t(i)$ as the posterior probability of being in state $i$ at time $t$:
$$\gamma_t(i) = P(z_t = i | \mathbf{x}, \boldsymbol{\theta})$$
Using Bayes' rule and the forward-backward variables:
$$\gamma_t(i) = \frac{P(z_t = i, \mathbf{x})}{P(\mathbf{x})} = \frac{\alpha_t(i) \cdot \beta_t(i)}{\sum_{j} \alpha_t(j) \cdot \beta_t(j)}$$
Interpretation: $\alpha_t(i)$ captures "evidence from the past" (observations $x_1, \ldots, x_t$ and being in state $i$). $\beta_t(i)$ captures "evidence from the future" (observations $x_{t+1}, \ldots, x_T$ given state $i$). The product gives the joint probability of all observations and being in state $i$ at time $t$.
Define $\xi_t(i, j)$ as the posterior probability of transitioning from state $i$ to state $j$ at time $t$:
$$\xi_t(i, j) = P(z_t = i, z_{t+1} = j | \mathbf{x}, \boldsymbol{\theta})$$
$$\begin{align} \xi_t(i, j) &= \frac{P(z_t = i, z_{t+1} = j, \mathbf{x})}{P(\mathbf{x})} \\ &= \frac{\alpha_t(i) \cdot A_{ij} \cdot B_j(x_{t+1}) \cdot \beta_{t+1}(j)}{P(\mathbf{x})} \end{align}$$
where $P(\mathbf{x}) = \sum_{i} \alpha_T(i)$.
Decomposition: $\alpha_t(i)$ covers the past observations and arriving in state $i$ at time $t$; $A_{ij}$ is the transition; $B_j(x_{t+1})$ emits the next observation from state $j$; and $\beta_{t+1}(j)$ covers all remaining future observations.
Marginalizing $\xi$ over the target state gives $\gamma$:
$$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)$$
This makes sense: the probability of being in state $i$ at time $t$ equals the sum of probabilities of transitioning from $i$ to any state $j$.
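This identity doubles as a useful unit test. A minimal sketch, assuming the array conventions used in the code later on this page (`gamma` of shape `(T, N)`, `xi` of shape `(T-1, N, N)`):

```python
import numpy as np

def check_gamma_xi_consistency(gamma, xi, atol=1e-8):
    """Check γ_t(i) == Σ_j ξ_t(i, j) for t = 1, ..., T-1.

    Summing xi over its last axis (the target state j) should
    recover the occupancy posterior at the source time step.
    """
    return np.allclose(gamma[:-1], xi.sum(axis=2), atol=atol)
```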
| Quantity | Definition | Meaning | Formula |
|---|---|---|---|
| $\gamma_t(i)$ | $P(z_t=i \mid \mathbf{x})$ | State occupancy at time $t$ | $\frac{\alpha_t(i) \beta_t(i)}{P(\mathbf{x})}$ |
| $\xi_t(i,j)$ | $P(z_t=i, z_{t+1}=j \mid \mathbf{x})$ | Transition $i \to j$ at time $t$ | $\frac{\alpha_t(i) A_{ij} B_j(x_{t+1}) \beta_{t+1}(j)}{P(\mathbf{x})}$ |
In the E-step, we compute the expected sufficient statistics—the expected counts that we would compute if we observed the hidden states, where expectations are taken under the posterior $P(\mathbf{z} | \mathbf{x}, \boldsymbol{\theta}^{\text{old}})$.
$$\mathbb{E}[\text{start in state } i] = \gamma_1(i)$$
The expected number of times the sequence starts in state $i$.
$$\mathbb{E}[\text{transitions } i \to j] = \sum_{t=1}^{T-1} \xi_t(i, j)$$
Sum over all time steps where a transition could occur.
$$\mathbb{E}[\text{transitions from } i] = \sum_{t=1}^{T-1} \gamma_t(i)$$
Expected total transitions out of state $i$.
$$\mathbb{E}[\text{emit } k \text{ from state } j] = \sum_{t: x_t = k} \gamma_t(j)$$
Sum $\gamma_t(j)$ over all time steps where observation $k$ was seen.
$$\mathbb{E}[\text{emissions from state } j] = \sum_{t=1}^{T} \gamma_t(j)$$
Expected total emissions from state $j$.
Unlike hard counts from observed data, these are soft counts—fractional values between 0 and 1. For example, $\gamma_t(i) = 0.7$ means "at time $t$, we're 70% confident the system was in state $i$."
These fractional counts accumulate across time steps and, when normalized, give proper probability estimates.
```python
import numpy as np
from scipy.special import logsumexp


def compute_posteriors(
    log_A: np.ndarray,
    log_B: np.ndarray,
    log_pi: np.ndarray,
    observations: np.ndarray
) -> tuple:
    """
    Compute γ (gamma) and ξ (xi) posterior probabilities.

    Returns
    -------
    gamma : np.ndarray of shape (T, N)
        State occupancy probabilities γ_t(i) = P(z_t=i | x)
    xi : np.ndarray of shape (T-1, N, N)
        Transition probabilities ξ_t(i,j) = P(z_t=i, z_{t+1}=j | x)
    log_likelihood : float
        Log P(observations | model)
    """
    T = len(observations)
    N = len(log_pi)

    # Forward pass
    log_alpha = np.zeros((T, N))
    log_alpha[0] = log_pi + log_B[:, observations[0]]
    for t in range(1, T):
        log_alpha[t] = logsumexp(
            log_alpha[t-1, :, np.newaxis] + log_A, axis=0
        ) + log_B[:, observations[t]]

    # Backward pass
    log_beta = np.zeros((T, N))
    log_beta[-1] = 0.0
    for t in range(T-2, -1, -1):
        log_beta[t] = logsumexp(
            log_A + log_B[:, observations[t+1]] + log_beta[t+1],
            axis=1
        )

    # Log-likelihood: P(x) = Σ_i α_T(i)
    log_likelihood = logsumexp(log_alpha[-1])

    # Gamma: γ_t(i) = P(z_t = i | x)
    # In log space: log γ = log α + log β - log P(x)
    log_gamma = log_alpha + log_beta - log_likelihood
    gamma = np.exp(log_gamma)

    # Xi: ξ_t(i,j) = P(z_t=i, z_{t+1}=j | x)
    # Only defined for t = 0 to T-2
    xi = np.zeros((T-1, N, N))
    for t in range(T-1):
        # log ξ_t(i,j) = log α_t(i) + log A_{ij} + log B_j(x_{t+1})
        #                + log β_{t+1}(j) - log P(x)
        log_xi_t = (log_alpha[t, :, np.newaxis]
                    + log_A
                    + log_B[:, observations[t+1]]
                    + log_beta[t+1, :]
                    - log_likelihood)
        xi[t] = np.exp(log_xi_t)

    return gamma, xi, log_likelihood


def accumulate_statistics(
    gamma: np.ndarray,
    xi: np.ndarray,
    observations: np.ndarray,
    n_observations: int
) -> tuple:
    """
    Accumulate expected sufficient statistics from posteriors.

    Returns
    -------
    gamma_1 : np.ndarray of shape (N,)
        Expected initial state counts (just γ_1)
    transition_counts : np.ndarray of shape (N, N)
        Expected transition counts
    emission_counts : np.ndarray of shape (N, M)
        Expected emission counts per state and observation
    """
    T, N = gamma.shape

    # Initial state: γ_1(i)
    gamma_1 = gamma[0]

    # Transition counts: Σ_t ξ_t(i,j)
    transition_counts = xi.sum(axis=0)  # Sum over time

    # Emission counts: Σ_{t: x_t=k} γ_t(j)
    emission_counts = np.zeros((N, n_observations))
    for t in range(T):
        emission_counts[:, observations[t]] += gamma[t]

    return gamma_1, transition_counts, emission_counts
```

The M-step updates parameters to maximize the expected complete-data log-likelihood (the Q-function):
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) = \mathbb{E}_{\mathbf{z} | \mathbf{x}, \boldsymbol{\theta}^{\text{old}}} [\log P(\mathbf{x}, \mathbf{z} | \boldsymbol{\theta})]$$
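Substituting the HMM joint factorization $P(\mathbf{x}, \mathbf{z} | \boldsymbol{\theta}) = \pi_{z_1} \prod_{t=2}^{T} A_{z_{t-1} z_t} \prod_{t=1}^{T} B_{z_t}(x_t)$ and taking the posterior expectation, the Q-function splits into three independent pieces, one per parameter group:

$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) = \sum_{i=1}^{N} \gamma_1(i) \log \pi_i + \sum_{t=1}^{T-1} \sum_{i,j} \xi_t(i,j) \log A_{ij} + \sum_{t=1}^{T} \sum_{j} \gamma_t(j) \log B_j(x_t)$$

Each piece can be maximized separately under its own sum-to-one constraint (via Lagrange multipliers), which yields the closed-form updates below.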
$$\hat{\pi}_i = \gamma_1(i)$$
The new initial probability of state $i$ is the posterior probability of starting in state $i$. For multiple sequences:
$$\hat{\pi}_i = \frac{1}{K} \sum_{k=1}^{K} \gamma_1^{(k)}(i)$$
$$\hat{A}_{ij} = \frac{\text{Expected transitions } i \to j}{\text{Expected transitions from } i} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$
For multiple sequences:
$$\hat{A}_{ij} = \frac{\sum_{k=1}^{K} \sum_{t=1}^{T_k-1} \xi_t^{(k)}(i, j)}{\sum_{k=1}^{K} \sum_{t=1}^{T_k-1} \gamma_t^{(k)}(i)}$$
Discrete case: $$\hat{B}_{jk} = \frac{\text{Expected emissions of } k \text{ from state } j}{\text{Expected emissions from state } j} = \frac{\sum_{t: x_t = k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$
Gaussian case: If emissions are Gaussian $\mathcal{N}(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$:
$$\hat{\boldsymbol{\mu}}_j = \frac{\sum_{t=1}^{T} \gamma_t(j) \cdot x_t}{\sum_{t=1}^{T} \gamma_t(j)}$$
$$\hat{\boldsymbol{\Sigma}}_j = \frac{\sum_{t=1}^{T} \gamma_t(j) \cdot (x_t - \hat{\boldsymbol{\mu}}_j)(x_t - \hat{\boldsymbol{\mu}}_j)^\top}{\sum_{t=1}^{T} \gamma_t(j)}$$
These are weighted maximum likelihood estimates where $\gamma_t(j)$ serves as the weight for observation $x_t$ being attributed to state $j$.
The M-step formulas are exactly the Maximum Likelihood estimates you would compute if you observed hard counts, but with hard counts replaced by expected (soft) counts from the E-step.
This is why Baum-Welch converges to a local maximum: each iteration increases the expected complete-data log-likelihood, which provably increases the incomplete-data log-likelihood.
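A sketch of the standard EM argument: decomposing the log-likelihood using the posterior under the old parameters gives

$$\log P(\mathbf{x} | \boldsymbol{\theta}) = Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) - \mathbb{E}_{\mathbf{z} | \mathbf{x}, \boldsymbol{\theta}^{\text{old}}}\left[\log P(\mathbf{z} | \mathbf{x}, \boldsymbol{\theta})\right]$$

The second term is a cross-entropy, which by Gibbs' inequality is minimized over $\boldsymbol{\theta}$ at $\boldsymbol{\theta} = \boldsymbol{\theta}^{\text{old}}$, so it can only grow as $\boldsymbol{\theta}$ moves away. Hence

$$\log P(\mathbf{x} | \boldsymbol{\theta}^{\text{new}}) - \log P(\mathbf{x} | \boldsymbol{\theta}^{\text{old}}) \geq Q(\boldsymbol{\theta}^{\text{new}}, \boldsymbol{\theta}^{\text{old}}) - Q(\boldsymbol{\theta}^{\text{old}}, \boldsymbol{\theta}^{\text{old}}) \geq 0$$

where the final inequality holds because the M-step maximizes $Q$.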
```python
import numpy as np


def m_step(
    gamma_1: np.ndarray,
    transition_counts: np.ndarray,
    emission_counts: np.ndarray,
    gamma_sum: np.ndarray,
    gamma_sum_except_last: np.ndarray
) -> tuple:
    """
    Perform M-step: update HMM parameters from expected statistics.

    Parameters
    ----------
    gamma_1 : np.ndarray of shape (N,)
        Expected initial state distribution
    transition_counts : np.ndarray of shape (N, N)
        Expected transition counts
    emission_counts : np.ndarray of shape (N, M)
        Expected emission counts
    gamma_sum : np.ndarray of shape (N,)
        Σ_t γ_t(i) for each state (total expected time in each state)
    gamma_sum_except_last : np.ndarray of shape (N,)
        Σ_{t=1}^{T-1} γ_t(i) for each state (for transition denominator)

    Returns
    -------
    pi : np.ndarray of shape (N,)
        Updated initial distribution
    A : np.ndarray of shape (N, N)
        Updated transition matrix
    B : np.ndarray of shape (N, M)
        Updated emission matrix
    """
    # Initial distribution
    pi = gamma_1 / gamma_1.sum()  # Normalize (should already sum to 1)

    # Transition matrix
    # A[i, j] = (expected i→j transitions) / (expected transitions from i)
    # Add small constant to avoid division by zero
    A = transition_counts / (gamma_sum_except_last[:, np.newaxis] + 1e-300)
    # Normalize rows to ensure proper probability distribution
    A = A / A.sum(axis=1, keepdims=True)

    # Emission matrix
    # B[j, k] = (expected emissions of k from j) / (expected emissions from j)
    B = emission_counts / (gamma_sum[:, np.newaxis] + 1e-300)
    B = B / B.sum(axis=1, keepdims=True)

    return pi, A, B


def m_step_gaussian(
    gamma: np.ndarray,        # (T, N)
    observations: np.ndarray  # (T, D) continuous observations
) -> tuple:
    """
    M-step for Gaussian HMM.

    Returns updated means and covariances for each state.
    """
    T, N = gamma.shape
    D = observations.shape[1]

    # Total expected occupancy of each state (denominator of the updates)
    gamma_sum = gamma.sum(axis=0)  # (N,)

    # Update means: μ_j = Σ_t γ_t(j) * x_t / Σ_t γ_t(j)
    mu = np.zeros((N, D))
    for j in range(N):
        mu[j] = (gamma[:, j, np.newaxis] * observations).sum(axis=0) / gamma_sum[j]

    # Update covariances
    sigma = np.zeros((N, D, D))
    for j in range(N):
        diff = observations - mu[j]  # (T, D)
        # Weighted outer product sum
        for t in range(T):
            sigma[j] += gamma[t, j] * np.outer(diff[t], diff[t])
        sigma[j] /= gamma_sum[j]

    return mu, sigma
```

Putting it all together, here is the complete Baum-Welch algorithm:
Input: Observation sequences $\{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(K)}\}$, number of states $N$, number of observations $M$
Output: Estimated parameters $\hat{\boldsymbol{\theta}} = (\hat{\mathbf{A}}, \hat{\mathbf{B}}, \hat{\boldsymbol{\pi}})$
Algorithm:
Initialize parameters $(\mathbf{A}^{(0)}, \mathbf{B}^{(0)}, \boldsymbol{\pi}^{(0)})$
Repeat until convergence:
E-step: For each sequence $k$, run the Forward and Backward passes, compute the posteriors $\gamma_t^{(k)}(i)$ and $\xi_t^{(k)}(i,j)$, and accumulate the expected initial-state, transition, and emission counts.
M-step: Update parameters by normalizing the accumulated expected counts into $\hat{\boldsymbol{\pi}}$, $\hat{\mathbf{A}}$, and $\hat{\mathbf{B}}$ via the re-estimation formulas above.
Return final parameters
```python
import numpy as np
from scipy.special import logsumexp
from typing import List, Optional


class HMM:
    """Complete HMM with Baum-Welch training."""

    def __init__(self, n_states: int, n_observations: int):
        self.n_states = n_states
        self.n_observations = n_observations
        self.A = None   # Transition matrix
        self.B = None   # Emission matrix
        self.pi = None  # Initial distribution

    def _init_parameters(self, random_state: Optional[int] = None):
        """Initialize parameters with small random perturbations."""
        if random_state is not None:
            np.random.seed(random_state)

        N, M = self.n_states, self.n_observations

        # Random initialization with normalization
        self.A = np.random.rand(N, N) + 0.1
        self.A /= self.A.sum(axis=1, keepdims=True)

        self.B = np.random.rand(N, M) + 0.1
        self.B /= self.B.sum(axis=1, keepdims=True)

        self.pi = np.random.rand(N) + 0.1
        self.pi /= self.pi.sum()

    def _forward(self, observations: np.ndarray) -> tuple:
        """Forward pass in log space."""
        T = len(observations)
        N = self.n_states
        log_A = np.log(self.A + 1e-300)
        log_B = np.log(self.B + 1e-300)
        log_pi = np.log(self.pi + 1e-300)

        log_alpha = np.zeros((T, N))
        log_alpha[0] = log_pi + log_B[:, observations[0]]
        for t in range(1, T):
            log_alpha[t] = logsumexp(
                log_alpha[t-1, :, np.newaxis] + log_A, axis=0
            ) + log_B[:, observations[t]]

        return log_alpha, logsumexp(log_alpha[-1])

    def _backward(self, observations: np.ndarray) -> np.ndarray:
        """Backward pass in log space."""
        T = len(observations)
        N = self.n_states
        log_A = np.log(self.A + 1e-300)
        log_B = np.log(self.B + 1e-300)

        log_beta = np.zeros((T, N))
        log_beta[-1] = 0.0
        for t in range(T-2, -1, -1):
            log_beta[t] = logsumexp(
                log_A + log_B[:, observations[t+1]] + log_beta[t+1],
                axis=1
            )

        return log_beta

    def _e_step(self, observations: np.ndarray) -> tuple:
        """E-step: compute posteriors and sufficient statistics."""
        T = len(observations)
        N = self.n_states

        log_alpha, log_likelihood = self._forward(observations)
        log_beta = self._backward(observations)
        log_A = np.log(self.A + 1e-300)
        log_B = np.log(self.B + 1e-300)

        # Gamma
        log_gamma = log_alpha + log_beta - log_likelihood
        gamma = np.exp(log_gamma)

        # Xi
        xi = np.zeros((T-1, N, N))
        for t in range(T-1):
            log_xi_t = (log_alpha[t, :, np.newaxis]
                        + log_A
                        + log_B[:, observations[t+1]]
                        + log_beta[t+1, :]
                        - log_likelihood)
            xi[t] = np.exp(log_xi_t)

        return gamma, xi, log_likelihood

    def fit(
        self,
        sequences: List[np.ndarray],
        max_iters: int = 100,
        tol: float = 1e-6,
        random_state: Optional[int] = None,
        verbose: bool = True
    ) -> List[float]:
        """
        Train HMM using Baum-Welch algorithm.

        Parameters
        ----------
        sequences : List of observation sequences
        max_iters : Maximum EM iterations
        tol : Convergence tolerance (relative change in log-likelihood)
        random_state : Random seed for initialization
        verbose : Print progress

        Returns
        -------
        log_likelihoods : List of log-likelihoods at each iteration
        """
        self._init_parameters(random_state)

        K = len(sequences)
        N = self.n_states
        M = self.n_observations

        log_likelihoods = []
        prev_ll = float('-inf')

        for iteration in range(max_iters):
            # Accumulate sufficient statistics
            total_gamma_1 = np.zeros(N)
            total_trans_num = np.zeros((N, N))
            total_trans_denom = np.zeros(N)
            total_emit_num = np.zeros((N, M))
            total_emit_denom = np.zeros(N)
            total_ll = 0.0

            # E-step: process each sequence
            for seq in sequences:
                gamma, xi, ll = self._e_step(seq)
                total_ll += ll

                # Accumulate statistics
                total_gamma_1 += gamma[0]
                total_trans_num += xi.sum(axis=0)
                total_trans_denom += gamma[:-1].sum(axis=0)
                for t, obs in enumerate(seq):
                    total_emit_num[:, obs] += gamma[t]
                total_emit_denom += gamma.sum(axis=0)

            log_likelihoods.append(total_ll)

            # Check convergence
            rel_change = (total_ll - prev_ll) / (abs(prev_ll) + 1e-10)
            if verbose and (iteration % 10 == 0 or iteration == max_iters - 1):
                print(f"Iter {iteration}: log-likelihood = {total_ll:.4f}")
            if rel_change < tol and iteration > 0:
                if verbose:
                    print(f"Converged at iteration {iteration}")
                break
            prev_ll = total_ll

            # M-step: update parameters
            self.pi = total_gamma_1 / K
            self.A = total_trans_num / (total_trans_denom[:, np.newaxis] + 1e-300)
            self.A /= self.A.sum(axis=1, keepdims=True)  # Renormalize
            self.B = total_emit_num / (total_emit_denom[:, np.newaxis] + 1e-300)
            self.B /= self.B.sum(axis=1, keepdims=True)  # Renormalize

        return log_likelihoods


# Example usage
def demo_baum_welch():
    """Demonstrate Baum-Welch on synthetic data."""
    # True HMM (for generating data)
    true_A = np.array([[0.7, 0.3],
                       [0.4, 0.6]])
    true_B = np.array([[0.9, 0.1],
                       [0.2, 0.8]])
    true_pi = np.array([0.6, 0.4])

    # Generate synthetic sequences
    def sample_hmm(A, B, pi, T):
        N, M = B.shape
        states = np.zeros(T, dtype=int)
        obs = np.zeros(T, dtype=int)
        states[0] = np.random.choice(N, p=pi)
        obs[0] = np.random.choice(M, p=B[states[0]])
        for t in range(1, T):
            states[t] = np.random.choice(N, p=A[states[t-1]])
            obs[t] = np.random.choice(M, p=B[states[t]])
        return obs

    np.random.seed(42)
    sequences = [sample_hmm(true_A, true_B, true_pi, T=50) for _ in range(20)]

    # Train HMM
    hmm = HMM(n_states=2, n_observations=2)
    log_likelihoods = hmm.fit(sequences, max_iters=50, random_state=123, verbose=True)

    print("\n" + "="*50)
    print("True vs Learned Parameters:")
    print("\nTrue A:")
    print(true_A)
    print("Learned A:")
    print(hmm.A.round(3))
    print("\nTrue B:")
    print(true_B)
    print("Learned B:")
    print(hmm.B.round(3))


demo_baum_welch()
```

Baum-Welch finds a local maximum, not necessarily the global maximum. Initialization matters!
Random Initialization: draw $\mathbf{A}$, $\mathbf{B}$, and $\boldsymbol{\pi}$ from perturbed-uniform random values (as in `_init_parameters` above), run several restarts from different seeds, and keep the run with the highest final log-likelihood; a restart sketch follows this list.
K-means Initialization (for continuous emissions): cluster the observations into $N$ groups and initialize each state's mean and covariance from one cluster; this usually starts EM much closer to a good optimum than random draws.
Uniform Initialization: setting all probabilities exactly equal is a trap; the symmetry makes every state interchangeable, the posteriors come out identical, and EM can sit at this degenerate fixed point without making progress.
Domain-Informed Initialization: when states have intended meanings (e.g., coding vs. non-coding regions in gene finding), seed the emission distributions from labeled or curated examples and let Baum-Welch refine the rest.
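A minimal restart loop, assuming the `HMM` class implemented above (the function name `fit_with_restarts` is ours, not from a library):

```python
def fit_with_restarts(sequences, n_states, n_observations, n_restarts=10):
    """Run Baum-Welch from several random seeds, keep the best local optimum."""
    best_hmm, best_ll = None, float('-inf')
    for seed in range(n_restarts):
        hmm = HMM(n_states, n_observations)
        lls = hmm.fit(sequences, random_state=seed, verbose=False)
        if lls[-1] > best_ll:          # Compare final log-likelihoods
            best_hmm, best_ll = hmm, lls[-1]
    return best_hmm, best_ll
```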
Adding pseudocounts is equivalent to placing Dirichlet priors on parameters. Instead of MLE, we get Maximum A Posteriori (MAP) estimates:
$\hat{A}_{ij} = \frac{\sum_t \xi_t(i,j) + \alpha - 1}{\sum_t \gamma_t(i) + N(\alpha - 1)}$
With $\alpha > 1$ (symmetric Dirichlet prior), this prevents any transition probability from being exactly 0, improving robustness.
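A hedged sketch of the smoothed update (the helper name is ours): adding the pseudocounts $\alpha - 1$ to each expected count before row-normalizing reproduces the MAP formula above, since $\sum_j \sum_t \xi_t(i,j) = \sum_t \gamma_t(i)$.

```python
import numpy as np

def smoothed_m_step(transition_counts, emission_counts, alpha=1.1):
    """MAP updates: add pseudocounts (alpha - 1) before normalizing.

    Equivalent to placing a symmetric Dirichlet(alpha) prior on each
    row of A and B; no probability can collapse exactly to zero.
    """
    A = transition_counts + (alpha - 1.0)
    A /= A.sum(axis=1, keepdims=True)
    B = emission_counts + (alpha - 1.0)
    B /= B.sum(axis=1, keepdims=True)
    return A, B
```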
Per iteration, for each of the $K$ sequences of length $T$:
- Forward pass: $O(TN^2)$
- Backward pass: $O(TN^2)$
- Computing $\gamma$ and $\xi$: $O(TN^2)$
- Parameter updates: $O(N^2 + NM)$
Total per iteration: $O(KTN^2)$
For large datasets or models, Baum-Welch can be slow. Alternatives:
- Viterbi training (hard EM): replace the posterior expectations with hard counts along the single best state path; faster and simpler per iteration, but a cruder approximation (see the sketch below)
- Online / incremental EM: update the sufficient statistics after each sequence or mini-batch rather than after a full pass over the data
- Direct gradient ascent on $\log P(\mathbf{x} | \boldsymbol{\theta})$ computed via the Forward Algorithm, e.g. with automatic differentiation
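To illustrate the first alternative, here is a hedged sketch of one Viterbi-training step, reusing the `HMM` class from above (the function name and structure are ours):

```python
import numpy as np

def viterbi_training_step(hmm, sequences):
    """One hard-EM step: decode each sequence with Viterbi, then count
    transitions and emissions along the best path as if it were observed."""
    N, M = hmm.n_states, hmm.n_observations
    log_A = np.log(hmm.A + 1e-300)
    log_B = np.log(hmm.B + 1e-300)
    log_pi = np.log(hmm.pi + 1e-300)
    A_cnt = np.zeros((N, N))
    B_cnt = np.zeros((N, M))
    pi_cnt = np.zeros(N)
    for obs in sequences:
        T = len(obs)
        delta = np.zeros((T, N))            # best log-prob ending in each state
        psi = np.zeros((T, N), dtype=int)   # backpointers
        delta[0] = log_pi + log_B[:, obs[0]]
        for t in range(1, T):
            scores = delta[t-1][:, None] + log_A   # scores[i, j]: come from i to j
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
        # Backtrack the single best path, then hard-count it
        path = np.empty(T, dtype=int)
        path[-1] = delta[-1].argmax()
        for t in range(T-2, -1, -1):
            path[t] = psi[t+1, path[t+1]]
        pi_cnt[path[0]] += 1
        for t in range(T-1):
            A_cnt[path[t], path[t+1]] += 1
        for t in range(T):
            B_cnt[path[t], obs[t]] += 1
    hmm.pi = pi_cnt / pi_cnt.sum()
    hmm.A = A_cnt / (A_cnt.sum(axis=1, keepdims=True) + 1e-300)
    hmm.B = B_cnt / (B_cnt.sum(axis=1, keepdims=True) + 1e-300)
```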
The Baum-Welch algorithm solves the learning problem—estimating HMM parameters from unlabeled observation sequences—using Expectation-Maximization.
Next Page: Applications
With the complete HMM toolkit—structure, Forward Algorithm, Viterbi Algorithm, and Baum-Welch—we're ready to explore real-world applications. We'll see HMMs in action for speech recognition, part-of-speech tagging, gene finding, and other domains, understanding how the choice of state representation, observation model, and training procedures impact performance.
You now have a deep understanding of the Baum-Welch algorithm—the theoretical foundation in EM, the Forward-Backward Algorithm for computing posteriors, the parameter update formulas, and practical considerations for training. Combined with the Forward and Viterbi algorithms, you have the complete toolkit for building, training, and deploying Hidden Markov Models.