Imagine listening to a conversation through a noisy phone line. You hear garbled sounds, but your brain effortlessly reconstructs the words being spoken. Or consider a doctor interpreting symptoms—observable manifestations of underlying diseases that cannot be directly observed. In both cases, we observe emissions from a system whose true internal state remains hidden from us.
Hidden Markov Models (HMMs) provide a principled probabilistic framework for reasoning about such scenarios: systems that evolve through a sequence of unobservable states, where each state produces observable outputs according to some emission distribution. HMMs elegantly combine the simplicity of Markov chains with the power of latent variable models, enabling us to:

- evaluate how probable an observation sequence is under a model,
- infer the hidden state sequence that best explains the observations, and
- learn a model's parameters from observation data alone.
This page establishes the foundational structure of HMMs, laying the groundwork for the inference and learning algorithms that follow.
By the end of this page, you will understand the complete mathematical specification of Hidden Markov Models, including state spaces, transition dynamics, emission distributions, and the key independence assumptions that make HMMs both powerful and tractable. You will be able to formally define an HMM and understand how it relates to observable Markov chains and general latent variable models.
To understand Hidden Markov Models, we must first appreciate the limitations of standard Markov chains and the motivation for introducing hidden states.
A Markov chain models a sequence of random variables $X_1, X_2, \ldots, X_T$ where each variable depends only on its immediate predecessor:
$$P(X_t | X_1, X_2, \ldots, X_{t-1}) = P(X_t | X_{t-1})$$
This Markov property (also called memorylessness) dramatically simplifies modeling by asserting that all relevant information for predicting the future is captured in the current state. The joint probability of a sequence factorizes as:
$$P(X_1, X_2, \ldots, X_T) = P(X_1) \prod_{t=2}^{T} P(X_t | X_{t-1})$$
Standard Markov chains assume that states are directly observable—we see the full sequence $X_1, \ldots, X_T$ and can count transitions to estimate parameters.
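To make this concrete, here is a minimal sketch of both ideas for a fully observed chain: the factorized joint probability and parameter estimation by transition counting. The function names and the integer-coded example sequence are illustrative, not part of the original text.

```python
import numpy as np

def estimate_transitions(sequence: np.ndarray, n_states: int) -> np.ndarray:
    """Estimate A[i,j] = P(X_t=j | X_{t-1}=i) by counting observed transitions.
    Possible only because the states are directly observed."""
    counts = np.zeros((n_states, n_states))
    for prev, curr in zip(sequence[:-1], sequence[1:]):
        counts[prev, curr] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no outgoing transitions fall back to uniform
    return np.where(row_sums > 0, counts / np.maximum(row_sums, 1), 1.0 / n_states)

def chain_joint_probability(sequence, pi, A):
    """P(X_1,...,X_T) = P(X_1) * prod_t P(X_t | X_{t-1})."""
    prob = pi[sequence[0]]
    for prev, curr in zip(sequence[:-1], sequence[1:]):
        prob *= A[prev, curr]
    return prob

# Illustrative observed sequence (e.g., 0=Sunny, 1=Cloudy, 2=Rainy)
seq = np.array([0, 0, 1, 2, 2, 1, 0, 0, 0, 1])
A_hat = estimate_transitions(seq, n_states=3)
print("Estimated transition matrix:\n", np.round(A_hat, 3))
```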
In many real-world scenarios, the underlying state sequence is not directly observable. We only observe noisy, indirect measurements or symptoms of the true state. A patient's underlying health condition (states) is hidden; we only observe symptoms and test results (observations). A speaker's intended phonemes (states) are hidden; we only observe acoustic signals (observations).
Consider modeling daily weather patterns from a person's activity log. We might observe whether they carried an umbrella, but not the actual weather. The weather (sunny, cloudy, rainy) forms a Markov chain, but we only see umbrella usage—a noisy indicator of the true weather.
This motivates the hidden state formulation: the states $z_t$ still evolve as a Markov chain, but they are unobserved, and each $z_t$ instead emits an observation $x_t$. The hidden states form a Markov chain, but we only have access to the observations. This is the essence of a Hidden Markov Model.
| Aspect | Observable Markov Chain | Hidden Markov Model |
|---|---|---|
| States | Directly observed | Hidden (latent) |
| Observations | States themselves | Emissions from states |
| Parameters to estimate | Transition probabilities only | Transitions + emissions + initial |
| Learning approach | Direct counting | EM algorithm (Baum-Welch) |
| Inference | Trivial (states known) | Forward-backward, Viterbi |
| Modeling power | Limited to observable dynamics | Can model hidden generative processes |
A Hidden Markov Model is a doubly stochastic process: a Markov chain over hidden states, coupled with a stochastic emission process at each state. Formally, an HMM is specified by the following components:
1. State Space $\mathcal{S} = \{s_1, s_2, \ldots, s_N\}$
The finite set of $N$ possible hidden states. At each time step $t$, the system occupies exactly one state $z_t \in \mathcal{S}$.
2. Observation Space $\mathcal{O} = \{o_1, o_2, \ldots, o_M\}$ (Discrete Case)
The set of $M$ possible observations. At each time step, the system emits one observation $x_t \in \mathcal{O}$ based on the current hidden state. For continuous observations, $\mathcal{O} = \mathbb{R}^d$ for some dimension $d$.
3. State Transition Matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$
$$A_{ij} = P(z_t = s_j | z_{t-1} = s_i)$$
The probability of transitioning from state $s_i$ to state $s_j$. Each row sums to 1: $\sum_{j=1}^{N} A_{ij} = 1$ for all $i$.
4. Emission Distribution $\mathbf{B}$
For discrete observations: $\mathbf{B} \in \mathbb{R}^{N \times M}$ $$B_{jk} = P(x_t = o_k | z_t = s_j)$$
The probability of observing symbol $o_k$ when in state $s_j$. Each row sums to 1: $\sum_{k=1}^{M} B_{jk} = 1$.
For continuous observations: $B_j(x)$ is a probability density function (commonly Gaussian) $$B_j(x) = \mathcal{N}(x | \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$$ A short sketch of evaluating such a density appears just after this specification.
5. Initial State Distribution $\boldsymbol{\pi} \in \mathbb{R}^N$
$$\pi_i = P(z_1 = s_i)$$
The probability of starting in state $s_i$. Must satisfy: $\sum_{i=1}^{N} \pi_i = 1$.
We denote the complete HMM parameter set as $\boldsymbol{\theta} = (\mathbf{A}, \mathbf{B}, \boldsymbol{\pi})$ or sometimes $\lambda = (\mathbf{A}, \mathbf{B}, \boldsymbol{\pi})$. When writing $P(\cdot | \lambda)$, we mean the probability under the model with these specific parameters. The state and observation spaces $(\mathcal{S}, \mathcal{O})$ are typically fixed by the problem structure.
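The discrete case is implemented in full in the code below. For the continuous case of component 4, evaluating a per-state Gaussian emission density is a one-liner with SciPy; the means and covariances here are made-up illustrative values, and `emission_density` is a hypothetical helper name.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative per-state Gaussian emission parameters for a 2-state HMM
# with 2-dimensional observations (the numbers are invented for this sketch)
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # mu_j for each state j
covs = [np.eye(2), 0.5 * np.eye(2)]                   # Sigma_j for each state j

def emission_density(x: np.ndarray, j: int) -> float:
    """Evaluate B_j(x) = N(x | mu_j, Sigma_j) for hidden state j."""
    return multivariate_normal(mean=means[j], cov=covs[j]).pdf(x)

x = np.array([2.5, 3.2])
for j in range(2):
    print(f"B_{j}(x) = {emission_density(x, j):.4f}")
```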
```python
import numpy as np
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiscreteHMM:
    """
    Hidden Markov Model with discrete observations.

    Parameters
    ----------
    n_states : int
        Number of hidden states (N)
    n_observations : int
        Number of possible observations (M)
    A : np.ndarray of shape (N, N)
        State transition matrix where A[i,j] = P(z_t=j | z_{t-1}=i)
    B : np.ndarray of shape (N, M)
        Emission matrix where B[i,k] = P(x_t=k | z_t=i)
    pi : np.ndarray of shape (N,)
        Initial state distribution where pi[i] = P(z_1=i)
    state_names : Optional list of state names for interpretability
    observation_names : Optional list of observation names
    """
    n_states: int
    n_observations: int
    A: np.ndarray   # Transition matrix (N x N)
    B: np.ndarray   # Emission matrix (N x M)
    pi: np.ndarray  # Initial distribution (N,)
    state_names: Optional[list] = None
    observation_names: Optional[list] = None

    def __post_init__(self):
        """Validate HMM parameters."""
        # Check dimensions
        assert self.A.shape == (self.n_states, self.n_states), \
            f"A must be ({self.n_states}, {self.n_states})"
        assert self.B.shape == (self.n_states, self.n_observations), \
            f"B must be ({self.n_states}, {self.n_observations})"
        assert self.pi.shape == (self.n_states,), \
            f"pi must have length {self.n_states}"

        # Check probability constraints (rows sum to 1)
        assert np.allclose(self.A.sum(axis=1), 1.0), \
            "Each row of A must sum to 1"
        assert np.allclose(self.B.sum(axis=1), 1.0), \
            "Each row of B must sum to 1"
        assert np.isclose(self.pi.sum(), 1.0), \
            "pi must sum to 1"

        # Check non-negativity
        assert np.all(self.A >= 0), "A must be non-negative"
        assert np.all(self.B >= 0), "B must be non-negative"
        assert np.all(self.pi >= 0), "pi must be non-negative"

    def sample(self, T: int, seed: Optional[int] = None) -> tuple:
        """
        Generate a sample sequence from the HMM.

        Returns (observations, hidden_states) of length T.
        """
        if seed is not None:
            np.random.seed(seed)

        states = np.zeros(T, dtype=int)
        observations = np.zeros(T, dtype=int)

        # Sample initial state and its emission
        states[0] = np.random.choice(self.n_states, p=self.pi)
        observations[0] = np.random.choice(
            self.n_observations, p=self.B[states[0]]
        )

        # Sample subsequent states and observations
        for t in range(1, T):
            states[t] = np.random.choice(
                self.n_states, p=self.A[states[t-1]]
            )
            observations[t] = np.random.choice(
                self.n_observations, p=self.B[states[t]]
            )

        return observations, states


# Example: Weather HMM
# States: 0=Sunny, 1=Cloudy, 2=Rainy
# Observations: 0=Walk, 1=Shop, 2=Clean
weather_hmm = DiscreteHMM(
    n_states=3,
    n_observations=3,
    # Transition probabilities (weather dynamics)
    A=np.array([
        [0.7, 0.2, 0.1],  # From Sunny
        [0.3, 0.4, 0.3],  # From Cloudy
        [0.2, 0.3, 0.5],  # From Rainy
    ]),
    # Emission probabilities (activity given weather)
    B=np.array([
        [0.7, 0.2, 0.1],  # Sunny: prefer Walk
        [0.3, 0.4, 0.3],  # Cloudy: mixed
        [0.1, 0.2, 0.7],  # Rainy: prefer Clean
    ]),
    # Initial distribution (summer start)
    pi=np.array([0.6, 0.3, 0.1]),
    state_names=["Sunny", "Cloudy", "Rainy"],
    observation_names=["Walk", "Shop", "Clean"],
)

# Generate a sample sequence
obs, states = weather_hmm.sample(T=10, seed=42)
print("Hidden states:", [weather_hmm.state_names[s] for s in states])
print("Observations: ", [weather_hmm.observation_names[o] for o in obs])
```

HMMs have a natural representation as a directed graphical model (Bayesian network), which makes the conditional independence structure explicit and provides a visual language for the model.
The graphical model for an HMM consists of two parallel chains:
$$\begin{array}{ccccccccc}
Z_1 & \to & Z_2 & \to & Z_3 & \to & \cdots & \to & Z_T \\
\downarrow & & \downarrow & & \downarrow & & & & \downarrow \\
X_1 & & X_2 & & X_3 & & \cdots & & X_T
\end{array}$$
The arrows encode the generative process:

- Horizontal arrows $Z_{t-1} \to Z_t$: state transitions, governed by the transition matrix $\mathbf{A}$
- Vertical arrows $Z_t \to X_t$: emissions, governed by the emission distribution $\mathbf{B}$
The graphical structure encodes powerful conditional independence properties that are crucial for efficient inference:
1. First-Order Markov Property for States $$P(Z_t | Z_1, \ldots, Z_{t-1}) = P(Z_t | Z_{t-1})$$
Given the previous state, the current state is independent of all earlier states.
2. Observation Independence Given States $$P(X_t | Z_1, \ldots, Z_T, X_1, \ldots, X_{t-1}, X_{t+1}, \ldots, X_T) = P(X_t | Z_t)$$
Given the current hidden state, the observation is independent of all other states and observations.
3. D-Separation Properties
Using d-separation:

- Given $Z_t$, the future $(Z_{t+1:T}, X_{t+1:T})$ is independent of the past $(Z_{1:t-1}, X_{1:t-1})$: every path between them passes through the observed node $Z_t$.
- Given $Z_t$, the observation $X_t$ is independent of all other variables in the model.
- Marginally (conditioning on nothing), the observations $X_1, \ldots, X_T$ are not independent: dependence flows between them through the hidden chain.
These conditional independence properties are not just mathematical curiosities—they are the foundation that makes HMM inference computationally tractable. Without them, computing the probability of an observation sequence would require summing over an exponential number of state sequences. The Markov structure allows us to use dynamic programming (Forward algorithm, Viterbi) to perform exact inference in $O(TN^2)$ time instead of $O(N^T)$.
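To see the size of that gap, a quick back-of-the-envelope comparison (the values $N = 10$ states and $T = 100$ steps are chosen arbitrarily):

```python
# Brute-force enumeration sums over N^T state sequences, while the
# forward recursion needs on the order of T * N^2 operations.
N, T = 10, 100
print(f"State sequences to enumerate: {float(N ** T):.1e}")  # 1.0e+100: hopeless
print(f"Dynamic programming operations: {T * N ** 2}")       # 10000: trivial
```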
The graphical structure directly determines how the joint probability over all variables factorizes. Understanding this factorization is essential for deriving all HMM algorithms.
For an HMM with hidden state sequence $\mathbf{z} = (z_1, \ldots, z_T)$ and observation sequence $\mathbf{x} = (x_1, \ldots, x_T)$, the joint probability factorizes as:
$$P(\mathbf{x}, \mathbf{z} | \boldsymbol{\theta}) = P(z_1) \left[\prod_{t=2}^{T} P(z_t | z_{t-1})\right] \left[\prod_{t=1}^{T} P(x_t | z_t)\right]$$
Using our parameter notation:
$$P(\mathbf{x}, \mathbf{z} | \boldsymbol{\theta}) = \pi_{z_1} \left[\prod_{t=2}^{T} A_{z_{t-1}, z_t}\right] \left[\prod_{t=1}^{T} B_{z_t}(x_t)\right]$$
| Factor | Meaning | Source |
|---|---|---|
| $\pi_{z_1}$ | Probability of starting in state $z_1$ | Initial distribution |
| $\prod_{t=2}^{T} A_{z_{t-1}, z_t}$ | Probability of the state sequence transitions | Transition matrix |
| $\prod_{t=1}^{T} B_{z_t}(x_t)$ | Probability of observations given states | Emission distribution |
Consider a simple HMM with:

- Two hidden states: $H$ (High) and $L$ (Low)
- Two observation symbols: $0$ and $1$

With parameters:

- Initial distribution: $\pi_H = 0.6$, $\pi_L = 0.4$
- Transitions: $P(H|H) = 0.7$, $P(L|H) = 0.3$, $P(H|L) = 0.4$, $P(L|L) = 0.6$
- Emissions: $P(x=1|H) = 0.8$, $P(x=0|H) = 0.2$, $P(x=1|L) = 0.2$, $P(x=0|L) = 0.8$

For the sequence $\mathbf{z} = (H, H, L)$ and $\mathbf{x} = (1, 0, 1)$:
$$\begin{align}
P(\mathbf{x}, \mathbf{z}) &= P(z_1=H) \cdot P(z_2=H|z_1=H) \cdot P(z_3=L|z_2=H) \\
&\quad \times P(x_1=1|z_1=H) \cdot P(x_2=0|z_2=H) \cdot P(x_3=1|z_3=L) \\
&= 0.6 \times 0.7 \times 0.3 \times 0.8 \times 0.2 \times 0.2 \\
&= 0.004032
\end{align}$$
```python
import numpy as np


def compute_joint_probability(hmm, observations, states):
    """
    Compute P(observations, states | hmm) for a specific state sequence.

    This is the direct computation from the factorization formula.
    Note: This is for educational purposes. In practice, we never
    enumerate all state sequences - we use dynamic programming.

    Parameters
    ----------
    hmm : DiscreteHMM
        The HMM model
    observations : array-like of length T
        The observation sequence (indices)
    states : array-like of length T
        The hidden state sequence (indices)

    Returns
    -------
    float : The joint probability P(x, z | θ)
    """
    T = len(observations)
    assert len(states) == T, "Observations and states must have same length"

    # Initialize with initial state probability
    prob = hmm.pi[states[0]]

    # Multiply transition probabilities
    for t in range(1, T):
        prob *= hmm.A[states[t-1], states[t]]

    # Multiply emission probabilities
    for t in range(T):
        prob *= hmm.B[states[t], observations[t]]

    return prob


def compute_joint_log_probability(hmm, observations, states):
    """
    Compute log P(observations, states | hmm) for numerical stability.

    When working with long sequences, the joint probability becomes
    vanishingly small. Log-space computation prevents underflow.
    """
    T = len(observations)

    # Log initial probability
    log_prob = np.log(hmm.pi[states[0]])

    # Add log transition probabilities
    for t in range(1, T):
        log_prob += np.log(hmm.A[states[t-1], states[t]])

    # Add log emission probabilities
    for t in range(T):
        log_prob += np.log(hmm.B[states[t], observations[t]])

    return log_prob


# Example computation
example_hmm = DiscreteHMM(
    n_states=2,
    n_observations=2,
    A=np.array([[0.7, 0.3],
                [0.4, 0.6]]),
    B=np.array([[0.2, 0.8],    # State 0 (H): prefers observation 1
                [0.8, 0.2]]),  # State 1 (L): prefers observation 0
    pi=np.array([0.6, 0.4]),
    state_names=["High", "Low"],
    observation_names=["0", "1"],
)

observations = [1, 0, 1]  # x_1=1, x_2=0, x_3=1
states = [0, 0, 1]        # z_1=H, z_2=H, z_3=L

prob = compute_joint_probability(example_hmm, observations, states)
log_prob = compute_joint_log_probability(example_hmm, observations, states)

print(f"State sequence: {[example_hmm.state_names[s] for s in states]}")
print(f"Observation sequence: {observations}")
print(f"Joint probability: {prob:.6f}")
print(f"Log joint probability: {log_prob:.4f}")
print(f"Verified: exp(log_prob) = {np.exp(log_prob):.6f}")
```

The standard HMM formulation makes a critical assumption that significantly simplifies the model: time homogeneity.
In a time-homogeneous (or stationary-parameter) HMM, the transition and emission probabilities do not depend on time:
$$P(z_t = j | z_{t-1} = i) = A_{ij} \quad \text{for all } t$$ $$P(x_t = k | z_t = j) = B_{jk} \quad \text{for all } t$$
This means we use the same matrices $\mathbf{A}$ and $\mathbf{B}$ at every time step, regardless of where we are in the sequence.
Advantages:

- A fixed, small parameter set whose size does not grow with sequence length
- Parameter sharing across time: every observed transition and emission informs the same estimates, improving statistical efficiency
- The same trained model applies to sequences of any length

Limitations:

- Cannot represent dynamics that genuinely change over time, such as seasonality, trends, or regime shifts tied to absolute time
- Position-specific behavior (e.g., near the start or end of a sequence) can only be modeled indirectly, typically by adding states
For applications requiring time-varying parameters, non-homogeneous HMMs or time-varying HMMs can be used. These have $\mathbf{A}^{(t)}$ and $\mathbf{B}^{(t)}$ that change with $t$. However, this dramatically increases the number of parameters and typically requires either explicit functional forms for time-dependence or regularization techniques.
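As a sketch of what such an explicit functional form might look like, the following interpolates $\mathbf{A}^{(t)}$ between two illustrative regime matrices (a "summer" and a "winter" regime). The cosine schedule, the matrices, and the helper name `A_t` are all invented for this example, not a standard construction:

```python
import numpy as np

# Two illustrative 2-state regimes; a convex combination of row-stochastic
# matrices is itself row-stochastic, so A_t(t) is always a valid transition matrix.
A_summer = np.array([[0.8, 0.2], [0.4, 0.6]])
A_winter = np.array([[0.5, 0.5], [0.2, 0.8]])

def A_t(t: int, period: int = 365) -> np.ndarray:
    """Smoothly time-varying transition matrix A^(t)."""
    w = 0.5 * (1 + np.cos(2 * np.pi * t / period))  # 1 mid-summer, 0 mid-winter
    return w * A_summer + (1 - w) * A_winter

rng = np.random.default_rng(0)
state = 0
for t in range(1, 6):
    state = rng.choice(2, p=A_t(t)[state])  # sample next state under A^(t)
    print(f"t={t}: state={state}, A^(t)[state] = {np.round(A_t(t)[state], 3)}")
```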
For an ergodic (irreducible, aperiodic) HMM, the hidden state Markov chain has a stationary distribution $\boldsymbol{\mu}$ satisfying:
$$\boldsymbol{\mu}^\top \mathbf{A} = \boldsymbol{\mu}^\top$$
This is the left eigenvector of $\mathbf{A}$ corresponding to eigenvalue 1. If we start the chain with this distribution ($\boldsymbol{\pi} = \boldsymbol{\mu}$), the marginal distribution over states remains constant over time.
For the weather example with transition matrix: $$\mathbf{A} = \begin{pmatrix} 0.7 & 0.2 & 0.1 \\ 0.3 & 0.4 & 0.3 \\ 0.2 & 0.3 & 0.5 \end{pmatrix}$$
The stationary distribution is approximately $\boldsymbol{\mu} \approx (0.457, 0.283, 0.261)$, meaning that in the long run, about 46% of days are sunny, 28% cloudy, and 26% rainy.
```python
import numpy as np
from scipy import linalg


def compute_stationary_distribution(A: np.ndarray) -> np.ndarray:
    """
    Compute the stationary distribution of a Markov chain.

    The stationary distribution μ satisfies μᵀA = μᵀ, i.e., it's
    the left eigenvector for eigenvalue 1.

    Parameters
    ----------
    A : np.ndarray of shape (N, N)
        Transition matrix (row-stochastic)

    Returns
    -------
    np.ndarray of shape (N,)
        Stationary distribution
    """
    # Solve the eigenvector problem:
    # μᵀA = μᵀ  =>  Aᵀμ = μ  (find right eigenvector of Aᵀ)
    eigenvalues, eigenvectors = linalg.eig(A.T)

    # Find eigenvector corresponding to eigenvalue 1
    idx = np.argmin(np.abs(eigenvalues - 1))
    stationary = np.real(eigenvectors[:, idx])

    # Normalize to make it a probability distribution
    stationary = stationary / stationary.sum()

    return stationary


def verify_stationary(A: np.ndarray, mu: np.ndarray) -> bool:
    """Verify that μ is a stationary distribution for A."""
    result = mu @ A
    return np.allclose(result, mu)


# Weather transition matrix
A_weather = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
])

mu = compute_stationary_distribution(A_weather)
state_names = ["Sunny", "Cloudy", "Rainy"]

print("Weather HMM Stationary Distribution:")
print("-" * 40)
for name, prob in zip(state_names, mu):
    print(f"  {name:8s}: {prob:.3f} ({prob*100:.1f}%)")

print(f"\nVerification: μᵀA = μᵀ? {verify_stationary(A_weather, mu)}")


# Simulate convergence to the stationary distribution
def simulate_convergence(A, pi0, n_steps=50):
    """Show how an initial distribution converges to stationary."""
    pi = pi0.copy()
    history = [pi.copy()]
    for _ in range(n_steps):
        pi = pi @ A
        history.append(pi.copy())
    return np.array(history)


# Start from different initial distributions
pi_sunny = np.array([1.0, 0.0, 0.0])  # Start definitely sunny
pi_rainy = np.array([0.0, 0.0, 1.0])  # Start definitely rainy

convergence_sunny = simulate_convergence(A_weather, pi_sunny, 20)
convergence_rainy = simulate_convergence(A_weather, pi_rainy, 20)

print("\nConvergence from different starts (Sunny probability):")
print("Step | From Sunny | From Rainy | Stationary")
print("-" * 50)
for t in [0, 1, 2, 5, 10, 20]:
    print(f"{t:4d} | {convergence_sunny[t,0]:10.4f} | "
          f"{convergence_rainy[t,0]:10.4f} | {mu[0]:.4f}")
```

Given the HMM structure, there are three fundamental computational problems that arise in practice. Understanding these problems is essential before diving into the algorithms that solve them.
Given: Parameters $\boldsymbol{\theta} = (\mathbf{A}, \mathbf{B}, \boldsymbol{\pi})$ and observation sequence $\mathbf{x} = (x_1, \ldots, x_T)$
Compute: $P(\mathbf{x} | \boldsymbol{\theta})$ — the probability of the observation sequence
Why it matters:

- Model comparison: choose between candidate HMMs by which assigns the data higher likelihood
- Classification: score a sequence under several class-specific HMMs (e.g., one per word in isolated-word speech recognition) and pick the best
- Anomaly detection: flag sequences whose likelihood under the model is unusually low
Challenge: Requires summing over all possible hidden state sequences: $$P(\mathbf{x} | \boldsymbol{\theta}) = \sum_{\mathbf{z}} P(\mathbf{x}, \mathbf{z} | \boldsymbol{\theta})$$
With $N$ states and $T$ time steps, there are $N^T$ possible state sequences. Brute force is intractable.
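For intuition, here is the brute-force sum written out, reusing `compute_joint_probability` and `example_hmm` from the factorization section above (the wrapper name `brute_force_likelihood` is ours). It is only feasible here because $N = 2$ and $T = 3$:

```python
from itertools import product

def brute_force_likelihood(hmm, observations):
    """Compute P(x | theta) by summing P(x, z | theta) over all N^T
    state sequences. Educational only: exponential in T."""
    T = len(observations)
    return sum(
        compute_joint_probability(hmm, observations, list(z))
        for z in product(range(hmm.n_states), repeat=T)  # all N^T sequences
    )

# With 2 states and 3 steps, only 2^3 = 8 sequences need to be summed
print(f"P(x | theta) = {brute_force_likelihood(example_hmm, [1, 0, 1]):.6f}")
```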
Solution: Forward algorithm (dynamic programming) — $O(TN^2)$ time
Given: Parameters $\boldsymbol{\theta}$ and observation sequence $\mathbf{x}$
Compute: The most likely hidden state sequence $$\mathbf{z}^* = \arg\max_{\mathbf{z}} P(\mathbf{z} | \mathbf{x}, \boldsymbol{\theta})$$
Why it matters:

- Part-of-speech tagging: recover the most likely tag sequence behind observed words
- Speech recognition: recover the phoneme or word sequence behind an acoustic signal
- Bioinformatics: segment a DNA sequence into hidden functional regions such as genes
Challenge: Again, $N^T$ possible sequences to consider.
Solution: Viterbi algorithm (dynamic programming) — $O(TN^2)$ time
Note: There are actually two versions of this problem:

- Most likely sequence: $\mathbf{z}^* = \arg\max_{\mathbf{z}} P(\mathbf{z} | \mathbf{x})$, solved by the Viterbi algorithm
- Most likely individual states: $z_t^* = \arg\max_{z_t} P(z_t | \mathbf{x})$ for each $t$ separately (posterior decoding, via forward-backward)

The two can disagree: the sequence of individually most likely states need not be the most likely sequence, and may even contain transitions with zero probability.
Given: Observation sequences $\{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(K)}\}$ (states are hidden!)
Compute: Optimal parameters $\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} P(\mathbf{x}^{(1:K)} | \boldsymbol{\theta})$
Why it matters:

- In real applications the parameters are rarely known in advance; they must be estimated from raw observation sequences
- Learning is what lets an HMM discover hidden structure (regimes, phonemes, states) directly from data
Challenge: No closed-form solution because hidden states are unobserved. The likelihood landscape may be non-convex with multiple local optima.
Solution: Baum-Welch algorithm (Expectation-Maximization) — iteratively improves parameters
Special case: If hidden states are observed during training (supervised learning), parameters can be estimated by direct counting: $$\hat{A}_{ij} = \frac{\text{count}(z_{t-1}=i,\, z_t=j)}{\text{count}(z_{t-1}=i)}$$
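A minimal sketch of this supervised case, assuming fully labeled training sequences; the helper name `estimate_supervised` is hypothetical, and the add-one (Laplace) smoothing is an illustrative choice to avoid zero probabilities:

```python
import numpy as np

def estimate_supervised(state_seqs, obs_seqs, N, M, alpha=1.0):
    """Estimate (A, B, pi) by direct counting from labeled sequences,
    with add-alpha smoothing on every count."""
    A_counts = np.full((N, N), alpha)
    B_counts = np.full((N, M), alpha)
    pi_counts = np.full(N, alpha)
    for states, obs in zip(state_seqs, obs_seqs):
        pi_counts[states[0]] += 1
        for t in range(1, len(states)):          # transition counts
            A_counts[states[t-1], states[t]] += 1
        for z, x in zip(states, obs):            # emission counts
            B_counts[z, x] += 1
    # Normalize counts into probability distributions
    A = A_counts / A_counts.sum(axis=1, keepdims=True)
    B = B_counts / B_counts.sum(axis=1, keepdims=True)
    pi = pi_counts / pi_counts.sum()
    return A, B, pi

# Toy labeled data (hypothetical): two short sequences, N=2 states, M=2 symbols
A, B, pi = estimate_supervised(
    state_seqs=[[0, 0, 1], [1, 1, 0]],
    obs_seqs=[[1, 0, 1], [0, 0, 1]],
    N=2, M=2,
)
print("A =\n", np.round(A, 3))
```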
| Problem | Input | Output | Algorithm | Complexity |
|---|---|---|---|---|
| Evaluation | $\mathbf{x}, \boldsymbol{\theta}$ | $P(\mathbf{x} \mid \boldsymbol{\theta})$ | Forward | $O(TN^2)$ |
| Decoding | $\mathbf{x}, \boldsymbol{\theta}$ | $\mathbf{z}^* = \arg\max_{\mathbf{z}} P(\mathbf{z} \mid \mathbf{x})$ | Viterbi | $O(TN^2)$ |
| Learning | $\{\mathbf{x}^{(k)}\}$ | $\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} P(\mathbf{x} \mid \boldsymbol{\theta})$ | Baum-Welch (EM) | $O(KTN^2)$ per iter |
All three problems share a common theme: replacing exponential-time brute force enumeration with polynomial-time dynamic programming. The key insight is that the Markov property allows us to decompose the global computation into local computations that can be combined efficiently. This is the recurring pattern that makes HMMs practically useful.
The basic HMM framework admits many extensions to handle different data types and structural assumptions.
1. Discrete HMMs (Multinomial Emissions)
Each state emits one of $M$ discrete symbols according to a categorical distribution (a row of $\mathbf{B}$). This is the variant used throughout this page.
2. Gaussian HMMs (Continuous Emissions)
Each state emits a continuous vector drawn from a state-specific Gaussian, $B_j(x) = \mathcal{N}(x | \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$.
3. Gaussian Mixture Model HMMs (GMM-HMM)
Each state's emission density is a mixture of Gaussians, allowing multimodal per-state distributions; long the standard acoustic model in speech recognition.
4. Autoregressive HMMs
The emission at time $t$ depends on previous observations as well as the current state, e.g., $P(x_t | z_t, x_{t-1})$, capturing short-range correlations the basic HMM ignores.
The choice of HMM variant depends on your data and assumptions:
- Discrete data (words, symbols) → Standard discrete HMM
- Continuous univariate/multivariate data → Gaussian HMM or GMM-HMM
- Sequential data with fixed progression → Left-right HMM
- Variable-length segments → Hidden semi-Markov model
- Multiple interacting processes → Factorial or coupled HMMs
Start with the simplest model that captures your data's structure, then add complexity as needed.
We have established the complete mathematical foundation for Hidden Markov Models. Let's consolidate the key concepts:

- An HMM is a doubly stochastic process: a hidden Markov chain over states coupled with a stochastic emission at each step, fully specified by $\boldsymbol{\theta} = (\mathbf{A}, \mathbf{B}, \boldsymbol{\pi})$.
- Two conditional independence assumptions (first-order Markov dynamics for the states; each observation independent of everything else given its state) make the model tractable.
- The joint probability factorizes as $\pi_{z_1} \prod_{t=2}^{T} A_{z_{t-1}, z_t} \prod_{t=1}^{T} B_{z_t}(x_t)$.
- Three fundamental problems arise: evaluation, decoding, and learning, solved by the Forward, Viterbi, and Baum-Welch algorithms respectively.
With the structural foundation in place, we will now develop the algorithms that make HMMs practical:
Next Page: The Forward Algorithm — We'll derive an elegant dynamic programming solution to efficiently compute the probability of an observation sequence, turning an exponential-time problem into a polynomial-time one.
Following Pages:

- The Viterbi Algorithm: decoding the most likely hidden state sequence
- The Baum-Welch Algorithm: learning HMM parameters from observation sequences alone
You now have a rigorous understanding of HMM structure—the mathematical specification, graphical model interpretation, factorization properties, and the three fundamental problems that HMM algorithms solve. This foundation will be essential as we develop the Forward, Viterbi, and Baum-Welch algorithms in the following pages.