Imagine listening to a conversation through a noisy phone line. You hear garbled sounds, but your brain effortlessly reconstructs the words being spoken. Or consider a doctor interpreting symptoms—observable manifestations of underlying diseases that cannot be directly observed. In both cases, we observe emissions from a system whose true internal state remains hidden from us.
Hidden Markov Models (HMMs) provide a principled probabilistic framework for reasoning about such scenarios: systems that evolve through a sequence of unobservable states, where each state produces observable outputs according to some emission distribution. HMMs elegantly combine the simplicity of Markov chains with the power of latent variable models, enabling us to:

- evaluate how probable an observation sequence is under a model,
- infer the hidden state sequence that best explains the observations, and
- learn a model's parameters from observation data alone.
This page establishes the foundational structure of HMMs, laying the groundwork for the inference and learning algorithms that follow.
By the end of this page, you will understand the complete mathematical specification of Hidden Markov Models, including state spaces, transition dynamics, emission distributions, and the key independence assumptions that make HMMs both powerful and tractable. You will be able to formally define an HMM and understand how it relates to observable Markov chains and general latent variable models.
To understand Hidden Markov Models, we must first appreciate the limitations of standard Markov chains and the motivation for introducing hidden states.
A Markov chain models a sequence of random variables $X_1, X_2, \ldots, X_T$ where each variable depends only on its immediate predecessor:
$$P(X_t | X_1, X_2, \ldots, X_{t-1}) = P(X_t | X_{t-1})$$
This Markov property (also called memorylessness) dramatically simplifies modeling by asserting that all relevant information for predicting the future is captured in the current state. The joint probability of a sequence factorizes as:
$$P(X_1, X_2, \ldots, X_T) = P(X_1) \prod_{t=2}^{T} P(X_t | X_{t-1})$$
Standard Markov chains assume that states are directly observable—we see the full sequence $X_1, \ldots, X_T$ and can count transitions to estimate parameters.
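To make this concrete, here is a minimal sketch of both ideas for a fully observed chain: the factorized joint probability and parameter estimation by transition counting. The function names and the integer-coded example sequence are illustrative, not part of the original text.

```python
import numpy as np

def estimate_transitions(sequence: np.ndarray, n_states: int) -> np.ndarray:
    """Estimate A[i,j] = P(X_t=j | X_{t-1}=i) by counting observed transitions.
    Possible only because the states are directly observed."""
    counts = np.zeros((n_states, n_states))
    for prev, curr in zip(sequence[:-1], sequence[1:]):
        counts[prev, curr] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no outgoing transitions fall back to uniform
    return np.where(row_sums > 0, counts / np.maximum(row_sums, 1), 1.0 / n_states)

def chain_joint_probability(sequence, pi, A):
    """P(X_1,...,X_T) = P(X_1) * prod_t P(X_t | X_{t-1})."""
    prob = pi[sequence[0]]
    for prev, curr in zip(sequence[:-1], sequence[1:]):
        prob *= A[prev, curr]
    return prob

# Illustrative observed sequence (e.g., 0=Sunny, 1=Cloudy, 2=Rainy)
seq = np.array([0, 0, 1, 2, 2, 1, 0, 0, 0, 1])
A_hat = estimate_transitions(seq, n_states=3)
print("Estimated transition matrix:\n", np.round(A_hat, 3))
```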
In many real-world scenarios, the underlying state sequence is not directly observable. We only observe noisy, indirect measurements or symptoms of the true state. A patient's underlying health condition (states) is hidden; we only observe symptoms and test results (observations). A speaker's intended phonemes (states) are hidden; we only observe acoustic signals (observations).
Consider modeling daily weather patterns from a person's activity log. We might observe whether they carried an umbrella, but not the actual weather. The weather (sunny, cloudy, rainy) forms a Markov chain, but we only see umbrella usage—a noisy indicator of the true weather.
This motivates the hidden state formulation: the states $z_t$ still evolve as a Markov chain, but they are unobserved, and each $z_t$ instead emits an observation $x_t$. The hidden states form a Markov chain, but we only have access to the observations. This is the essence of a Hidden Markov Model.
| Aspect | Observable Markov Chain | Hidden Markov Model |
|---|---|---|
| States | Directly observed | Hidden (latent) |
| Observations | States themselves | Emissions from states |
| Parameters to estimate | Transition probabilities only | Transitions + emissions + initial |
| Learning approach | Direct counting | EM algorithm (Baum-Welch) |
| Inference | Trivial (states known) | Forward-backward, Viterbi |
| Modeling power | Limited to observable dynamics | Can model hidden generative processes |
A Hidden Markov Model is a doubly stochastic process: a Markov chain over hidden states, coupled with a stochastic emission process at each state. Formally, an HMM is specified by the following components:
1. State Space $\mathcal{S} = \{s_1, s_2, \ldots, s_N\}$
The finite set of $N$ possible hidden states. At each time step $t$, the system occupies exactly one state $z_t \in \mathcal{S}$.
2. Observation Space $\mathcal{O} = \{o_1, o_2, \ldots, o_M\}$ (Discrete Case)
The set of $M$ possible observations. At each time step, the system emits one observation $x_t \in \mathcal{O}$ based on the current hidden state. For continuous observations, $\mathcal{O} = \mathbb{R}^d$ for some dimension $d$.
3. State Transition Matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$
$$A_{ij} = P(z_t = s_j | z_{t-1} = s_i)$$
The probability of transitioning from state $s_i$ to state $s_j$. Each row sums to 1: $\sum_{j=1}^{N} A_{ij} = 1$ for all $i$.
4. Emission Distribution $\mathbf{B}$
For discrete observations: $\mathbf{B} \in \mathbb{R}^{N \times M}$ $$B_{jk} = P(x_t = o_k | z_t = s_j)$$
The probability of observing symbol $o_k$ when in state $s_j$. Each row sums to 1: $\sum_{k=1}^{M} B_{jk} = 1$.
For continuous observations: $B_j(x)$ is a probability density function (commonly Gaussian) $$B_j(x) = \mathcal{N}(x | \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$$ A short sketch of evaluating such a density appears just after this specification.
5. Initial State Distribution $\boldsymbol{\pi} \in \mathbb{R}^N$
$$\pi_i = P(z_1 = s_i)$$
The probability of starting in state $s_i$. Must satisfy: $\sum_{i=1}^{N} \pi_i = 1$.
We denote the complete HMM parameter set as $\boldsymbol{\theta} = (\mathbf{A}, \mathbf{B}, \boldsymbol{\pi})$ or sometimes $\lambda = (\mathbf{A}, \mathbf{B}, \boldsymbol{\pi})$. When writing $P(\cdot | \lambda)$, we mean the probability under the model with these specific parameters. The state and observation spaces $(\mathcal{S}, \mathcal{O})$ are typically fixed by the problem structure.
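The discrete case is implemented in full in the code below. For the continuous case of component 4, evaluating a per-state Gaussian emission density is a one-liner with SciPy; the means and covariances here are made-up illustrative values, and `emission_density` is a hypothetical helper name.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative per-state Gaussian emission parameters for a 2-state HMM
# with 2-dimensional observations (the numbers are invented for this sketch)
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # mu_j for each state j
covs = [np.eye(2), 0.5 * np.eye(2)]                   # Sigma_j for each state j

def emission_density(x: np.ndarray, j: int) -> float:
    """Evaluate B_j(x) = N(x | mu_j, Sigma_j) for hidden state j."""
    return multivariate_normal(mean=means[j], cov=covs[j]).pdf(x)

x = np.array([2.5, 3.2])
for j in range(2):
    print(f"B_{j}(x) = {emission_density(x, j):.4f}")
```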
```python
import numpy as np
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiscreteHMM:
    """
    Hidden Markov Model with discrete observations.

    Parameters
    ----------
    n_states : int
        Number of hidden states (N)
    n_observations : int
        Number of possible observations (M)
    A : np.ndarray of shape (N, N)
        State transition matrix where A[i,j] = P(z_t=j | z_{t-1}=i)
    B : np.ndarray of shape (N, M)
        Emission matrix where B[i,k] = P(x_t=k | z_t=i)
    pi : np.ndarray of shape (N,)
        Initial state distribution where pi[i] = P(z_1=i)
    state_names : Optional list of state names for interpretability
    observation_names : Optional list of observation names
    """
    n_states: int
    n_observations: int
    A: np.ndarray   # Transition matrix (N x N)
    B: np.ndarray   # Emission matrix (N x M)
    pi: np.ndarray  # Initial distribution (N,)
    state_names: Optional[list] = None
    observation_names: Optional[list] = None

    def __post_init__(self):
        """Validate HMM parameters."""
        # Check dimensions
        assert self.A.shape == (self.n_states, self.n_states), \
            f"A must be ({self.n_states}, {self.n_states})"
        assert self.B.shape == (self.n_states, self.n_observations), \
            f"B must be ({self.n_states}, {self.n_observations})"
        assert self.pi.shape == (self.n_states,), \
            f"pi must have length {self.n_states}"

        # Check probability constraints (rows sum to 1)
        assert np.allclose(self.A.sum(axis=1), 1.0), \
            "Each row of A must sum to 1"
        assert np.allclose(self.B.sum(axis=1), 1.0), \
            "Each row of B must sum to 1"
        assert np.isclose(self.pi.sum(), 1.0), \
            "pi must sum to 1"

        # Check non-negativity
        assert np.all(self.A >= 0), "A must be non-negative"
        assert np.all(self.B >= 0), "B must be non-negative"
        assert np.all(self.pi >= 0), "pi must be non-negative"

    def sample(self, T: int, seed: Optional[int] = None) -> tuple:
        """
        Generate a sample sequence from the HMM.

        Returns (observations, hidden_states) of length T.
        """
        if seed is not None:
            np.random.seed(seed)

        states = np.zeros(T, dtype=int)
        observations = np.zeros(T, dtype=int)

        # Sample initial state and its emission
        states[0] = np.random.choice(self.n_states, p=self.pi)
        observations[0] = np.random.choice(
            self.n_observations, p=self.B[states[0]]
        )

        # Sample subsequent states and observations
        for t in range(1, T):
            states[t] = np.random.choice(
                self.n_states, p=self.A[states[t-1]]
            )
            observations[t] = np.random.choice(
                self.n_observations, p=self.B[states[t]]
            )

        return observations, states


# Example: Weather HMM
# States: 0=Sunny, 1=Cloudy, 2=Rainy
# Observations: 0=Walk, 1=Shop, 2=Clean
weather_hmm = DiscreteHMM(
    n_states=3,
    n_observations=3,
    # Transition probabilities (weather dynamics)
    A=np.array([
        [0.7, 0.2, 0.1],  # From Sunny
        [0.3, 0.4, 0.3],  # From Cloudy
        [0.2, 0.3, 0.5],  # From Rainy
    ]),
    # Emission probabilities (activity given weather)
    B=np.array([
        [0.7, 0.2, 0.1],  # Sunny: prefer Walk
        [0.3, 0.4, 0.3],  # Cloudy: mixed
        [0.1, 0.2, 0.7],  # Rainy: prefer Clean
    ]),
    # Initial distribution (summer start)
    pi=np.array([0.6, 0.3, 0.1]),
    state_names=["Sunny", "Cloudy", "Rainy"],
    observation_names=["Walk", "Shop", "Clean"],
)

# Generate a sample sequence
obs, states = weather_hmm.sample(T=10, seed=42)
print("Hidden states:", [weather_hmm.state_names[s] for s in states])
print("Observations: ", [weather_hmm.observation_names[o] for o in obs])
```

HMMs have a natural representation as a directed graphical model (Bayesian network), which makes the conditional independence structure explicit and provides a visual language for the model.
The graphical model for an HMM consists of two parallel chains:
$$\begin{array}{ccccccccc}
Z_1 & \to & Z_2 & \to & Z_3 & \to & \cdots & \to & Z_T \\
\downarrow & & \downarrow & & \downarrow & & & & \downarrow \\
X_1 & & X_2 & & X_3 & & \cdots & & X_T
\end{array}$$
The arrows encode the generative process:

- Horizontal arrows $Z_{t-1} \to Z_t$: state transitions, governed by the transition matrix $\mathbf{A}$
- Vertical arrows $Z_t \to X_t$: emissions, governed by the emission distribution $\mathbf{B}$
The graphical structure encodes powerful conditional independence properties that are crucial for efficient inference:
1. First-Order Markov Property for States $$P(Z_t | Z_1, \ldots, Z_{t-1}) = P(Z_t | Z_{t-1})$$
Given the previous state, the current state is independent of all earlier states.
2. Observation Independence Given States $$P(X_t | Z_1, \ldots, Z_T, X_1, \ldots, X_{t-1}, X_{t+1}, \ldots, X_T) = P(X_t | Z_t)$$
Given the current hidden state, the observation is independent of all other states and observations.
3. D-Separation Properties
Using d-separation:

- Given $Z_t$, the future $(Z_{t+1:T}, X_{t+1:T})$ is independent of the past $(Z_{1:t-1}, X_{1:t-1})$: every path between them passes through the observed node $Z_t$.
- Given $Z_t$, the observation $X_t$ is independent of all other variables in the model.
- Marginally (conditioning on nothing), the observations $X_1, \ldots, X_T$ are not independent: dependence flows between them through the hidden chain.
These conditional independence properties are not just mathematical curiosities—they are the foundation that makes HMM inference computationally tractable. Without them, computing the probability of an observation sequence would require summing over an exponential number of state sequences. The Markov structure allows us to use dynamic programming (Forward algorithm, Viterbi) to perform exact inference in $O(TN^2)$ time instead of $O(N^T)$.
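To see the size of that gap, a quick back-of-the-envelope comparison (the values $N = 10$ states and $T = 100$ steps are chosen arbitrarily):

```python
# Brute-force enumeration sums over N^T state sequences, while the
# forward recursion needs on the order of T * N^2 operations.
N, T = 10, 100
print(f"State sequences to enumerate: {float(N ** T):.1e}")  # 1.0e+100: hopeless
print(f"Dynamic programming operations: {T * N ** 2}")       # 10000: trivial
```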
The graphical structure directly determines how the joint probability over all variables factorizes. Understanding this factorization is essential for deriving all HMM algorithms.
For an HMM with hidden state sequence $\mathbf{z} = (z_1, \ldots, z_T)$ and observation sequence $\mathbf{x} = (x_1, \ldots, x_T)$, the joint probability factorizes as:
$$P(\mathbf{x}, \mathbf{z} | \boldsymbol{\theta}) = P(z_1) \left[\prod_{t=2}^{T} P(z_t | z_{t-1})\right] \left[\prod_{t=1}^{T} P(x_t | z_t)\right]$$
Using our parameter notation:
$$P(\mathbf{x}, \mathbf{z} | \boldsymbol{\theta}) = \pi_{z_1} \left[\prod_{t=2}^{T} A_{z_{t-1}, z_t}\right] \left[\prod_{t=1}^{T} B_{z_t}(x_t)\right]$$
| Factor | Meaning | Source |
|---|---|---|
| $\pi_{z_1}$ | Probability of starting in state $z_1$ | Initial distribution |
| $\prod_{t=2}^{T} A_{z_{t-1}, z_t}$ | Probability of the state sequence transitions | Transition matrix |
| $\prod_{t=1}^{T} B_{z_t}(x_t)$ | Probability of observations given states | Emission distribution |
Consider a simple HMM with:

- Two hidden states: $H$ (High) and $L$ (Low)
- Two observation symbols: $0$ and $1$

With parameters:

- Initial distribution: $\pi_H = 0.6$, $\pi_L = 0.4$
- Transitions: $P(H|H) = 0.7$, $P(L|H) = 0.3$, $P(H|L) = 0.4$, $P(L|L) = 0.6$
- Emissions: $P(x=1|H) = 0.8$, $P(x=0|H) = 0.2$, $P(x=1|L) = 0.2$, $P(x=0|L) = 0.8$

For the sequence $\mathbf{z} = (H, H, L)$ and $\mathbf{x} = (1, 0, 1)$:
$$\begin{align}
P(\mathbf{x}, \mathbf{z}) &= P(z_1=H) \cdot P(z_2=H|z_1=H) \cdot P(z_3=L|z_2=H) \\
&\quad \times P(x_1=1|z_1=H) \cdot P(x_2=0|z_2=H) \cdot P(x_3=1|z_3=L) \\
&= 0.6 \times 0.7 \times 0.3 \times 0.8 \times 0.2 \times 0.2 \\
&= 0.004032
\end{align}$$
```python
import numpy as np


def compute_joint_probability(hmm, observations, states):
    """
    Compute P(observations, states | hmm) for a specific state sequence.

    This is the direct computation from the factorization formula.
    Note: This is for educational purposes. In practice, we never
    enumerate all state sequences - we use dynamic programming.

    Parameters
    ----------
    hmm : DiscreteHMM
        The HMM model
    observations : array-like of length T
        The observation sequence (indices)
    states : array-like of length T
        The hidden state sequence (indices)

    Returns
    -------
    float : The joint probability P(x, z | θ)
    """
    T = len(observations)
    assert len(states) == T, "Observations and states must have same length"

    # Initialize with initial state probability
    prob = hmm.pi[states[0]]

    # Multiply transition probabilities
    for t in range(1, T):
        prob *= hmm.A[states[t-1], states[t]]

    # Multiply emission probabilities
    for t in range(T):
        prob *= hmm.B[states[t], observations[t]]

    return prob


def compute_joint_log_probability(hmm, observations, states):
    """
    Compute log P(observations, states | hmm) for numerical stability.

    When working with long sequences, the joint probability becomes
    vanishingly small. Log-space computation prevents underflow.
    """
    T = len(observations)

    # Log initial probability
    log_prob = np.log(hmm.pi[states[0]])

    # Add log transition probabilities
    for t in range(1, T):
        log_prob += np.log(hmm.A[states[t-1], states[t]])

    # Add log emission probabilities
    for t in range(T):
        log_prob += np.log(hmm.B[states[t], observations[t]])

    return log_prob


# Example computation
example_hmm = DiscreteHMM(
    n_states=2,
    n_observations=2,
    A=np.array([[0.7, 0.3],
                [0.4, 0.6]]),
    B=np.array([[0.2, 0.8],    # State 0 (H): prefers observation 1
                [0.8, 0.2]]),  # State 1 (L): prefers observation 0
    pi=np.array([0.6, 0.4]),
    state_names=["High", "Low"],
    observation_names=["0", "1"],
)

observations = [1, 0, 1]  # x_1=1, x_2=0, x_3=1
states = [0, 0, 1]        # z_1=H, z_2=H, z_3=L

prob = compute_joint_probability(example_hmm, observations, states)
log_prob = compute_joint_log_probability(example_hmm, observations, states)

print(f"State sequence: {[example_hmm.state_names[s] for s in states]}")
print(f"Observation sequence: {observations}")
print(f"Joint probability: {prob:.6f}")
print(f"Log joint probability: {log_prob:.4f}")
print(f"Verified: exp(log_prob) = {np.exp(log_prob):.6f}")
```

The standard HMM formulation makes a critical assumption that significantly simplifies the model: time homogeneity.
In a time-homogeneous (or stationary-parameter) HMM, the transition and emission probabilities do not depend on time:
$$P(z_t = j | z_{t-1} = i) = A_{ij} \quad \text{for all } t$$ $$P(x_t = k | z_t = j) = B_{jk} \quad \text{for all } t$$
This means we use the same matrices $\mathbf{A}$ and $\mathbf{B}$ at every time step, regardless of where we are in the sequence.
Advantages:

- A fixed, small parameter set whose size does not grow with sequence length
- Parameter sharing across time: every observed transition and emission informs the same estimates, improving statistical efficiency
- The same trained model applies to sequences of any length

Limitations:

- Cannot represent dynamics that genuinely change over time, such as seasonality, trends, or regime shifts tied to absolute time
- Position-specific behavior (e.g., near the start or end of a sequence) can only be modeled indirectly, typically by adding states
For applications requiring time-varying parameters, non-homogeneous HMMs or time-varying HMMs can be used. These have $\mathbf{A}^{(t)}$ and $\mathbf{B}^{(t)}$ that change with $t$. However, this dramatically increases the number of parameters and typically requires either explicit functional forms for time-dependence or regularization techniques.
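As a sketch of what such an explicit functional form might look like, the following interpolates $\mathbf{A}^{(t)}$ between two illustrative regime matrices (a "summer" and a "winter" regime). The cosine schedule, the matrices, and the helper name `A_t` are all invented for this example, not a standard construction:

```python
import numpy as np

# Two illustrative 2-state regimes; a convex combination of row-stochastic
# matrices is itself row-stochastic, so A_t(t) is always a valid transition matrix.
A_summer = np.array([[0.8, 0.2], [0.4, 0.6]])
A_winter = np.array([[0.5, 0.5], [0.2, 0.8]])

def A_t(t: int, period: int = 365) -> np.ndarray:
    """Smoothly time-varying transition matrix A^(t)."""
    w = 0.5 * (1 + np.cos(2 * np.pi * t / period))  # 1 mid-summer, 0 mid-winter
    return w * A_summer + (1 - w) * A_winter

rng = np.random.default_rng(0)
state = 0
for t in range(1, 6):
    state = rng.choice(2, p=A_t(t)[state])  # sample next state under A^(t)
    print(f"t={t}: state={state}, A^(t)[state] = {np.round(A_t(t)[state], 3)}")
```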
For an ergodic (irreducible, aperiodic) HMM, the hidden state Markov chain has a stationary distribution $\boldsymbol{\mu}$ satisfying:
$$\boldsymbol{\mu}^\top \mathbf{A} = \boldsymbol{\mu}^\top$$
This is the left eigenvector of $\mathbf{A}$ corresponding to eigenvalue 1. If we start the chain with this distribution ($\boldsymbol{\pi} = \boldsymbol{\mu}$), the marginal distribution over states remains constant over time.
For the weather example with transition matrix: $$\mathbf{A} = \begin{pmatrix} 0.7 & 0.2 & 0.1 \\ 0.3 & 0.4 & 0.3 \\ 0.2 & 0.3 & 0.5 \end{pmatrix}$$
The stationary distribution is approximately $\boldsymbol{\mu} \approx (0.457, 0.283, 0.261)$, meaning that in the long run, about 46% of days are sunny, 28% cloudy, and 26% rainy.
```python
import numpy as np
from scipy import linalg


def compute_stationary_distribution(A: np.ndarray) -> np.ndarray:
    """
    Compute the stationary distribution of a Markov chain.

    The stationary distribution μ satisfies μᵀA = μᵀ, i.e., it's
    the left eigenvector for eigenvalue 1.

    Parameters
    ----------
    A : np.ndarray of shape (N, N)
        Transition matrix (row-stochastic)

    Returns
    -------
    np.ndarray of shape (N,)
        Stationary distribution
    """
    # Solve the eigenvector problem:
    # μᵀA = μᵀ  =>  Aᵀμ = μ  (find right eigenvector of Aᵀ)
    eigenvalues, eigenvectors = linalg.eig(A.T)

    # Find eigenvector corresponding to eigenvalue 1
    idx = np.argmin(np.abs(eigenvalues - 1))
    stationary = np.real(eigenvectors[:, idx])

    # Normalize to make it a probability distribution
    stationary = stationary / stationary.sum()

    return stationary


def verify_stationary(A: np.ndarray, mu: np.ndarray) -> bool:
    """Verify that μ is a stationary distribution for A."""
    result = mu @ A
    return np.allclose(result, mu)


# Weather transition matrix
A_weather = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
])

mu = compute_stationary_distribution(A_weather)
state_names = ["Sunny", "Cloudy", "Rainy"]

print("Weather HMM Stationary Distribution:")
print("-" * 40)
for name, prob in zip(state_names, mu):
    print(f"  {name:8s}: {prob:.3f} ({prob*100:.1f}%)")

print(f"\nVerification: μᵀA = μᵀ? {verify_stationary(A_weather, mu)}")


# Simulate convergence to the stationary distribution
def simulate_convergence(A, pi0, n_steps=50):
    """Show how an initial distribution converges to stationary."""
    pi = pi0.copy()
    history = [pi.copy()]
    for _ in range(n_steps):
        pi = pi @ A
        history.append(pi.copy())
    return np.array(history)


# Start from different initial distributions
pi_sunny = np.array([1.0, 0.0, 0.0])  # Start definitely sunny
pi_rainy = np.array([0.0, 0.0, 1.0])  # Start definitely rainy

convergence_sunny = simulate_convergence(A_weather, pi_sunny, 20)
convergence_rainy = simulate_convergence(A_weather, pi_rainy, 20)

print("\nConvergence from different starts (Sunny probability):")
print("Step | From Sunny | From Rainy | Stationary")
print("-" * 50)
for t in [0, 1, 2, 5, 10, 20]:
    print(f"{t:4d} | {convergence_sunny[t,0]:10.4f} | "
          f"{convergence_rainy[t,0]:10.4f} | {mu[0]:.4f}")
```

Given the HMM structure, there are three fundamental computational problems that arise in practice. Understanding these problems is essential before diving into the algorithms that solve them.
Given: Parameters $\boldsymbol{\theta} = (\mathbf{A}, \mathbf{B}, \boldsymbol{\pi})$ and observation sequence $\mathbf{x} = (x_1, \ldots, x_T)$
Compute: $P(\mathbf{x} | \boldsymbol{\theta})$ — the probability of the observation sequence
Why it matters:

- Model comparison: choose between candidate HMMs by which assigns the data higher likelihood
- Classification: score a sequence under several class-specific HMMs (e.g., one per word in isolated-word speech recognition) and pick the best
- Anomaly detection: flag sequences whose likelihood under the model is unusually low
Challenge: Requires summing over all possible hidden state sequences: $$P(\mathbf{x} | \boldsymbol{\theta}) = \sum_{\mathbf{z}} P(\mathbf{x}, \mathbf{z} | \boldsymbol{\theta})$$
With $N$ states and $T$ time steps, there are $N^T$ possible state sequences. Brute force is intractable.
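For intuition, here is the brute-force sum written out, reusing `compute_joint_probability` and `example_hmm` from the factorization section above (the wrapper name `brute_force_likelihood` is ours). It is only feasible here because $N = 2$ and $T = 3$:

```python
from itertools import product

def brute_force_likelihood(hmm, observations):
    """Compute P(x | theta) by summing P(x, z | theta) over all N^T
    state sequences. Educational only: exponential in T."""
    T = len(observations)
    return sum(
        compute_joint_probability(hmm, observations, list(z))
        for z in product(range(hmm.n_states), repeat=T)  # all N^T sequences
    )

# With 2 states and 3 steps, only 2^3 = 8 sequences need to be summed
print(f"P(x | theta) = {brute_force_likelihood(example_hmm, [1, 0, 1]):.6f}")
```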
Solution: Forward algorithm (dynamic programming) — $O(TN^2)$ time
Given: Parameters $\boldsymbol{\theta}$ and observation sequence $\mathbf{x}$
Compute: The most likely hidden state sequence $$\mathbf{z}^* = \arg\max_{\mathbf{z}} P(\mathbf{z} | \mathbf{x}, \boldsymbol{\theta})$$
Why it matters:

- Part-of-speech tagging: recover the most likely tag sequence behind observed words
- Speech recognition: recover the phoneme or word sequence behind an acoustic signal
- Bioinformatics: segment a DNA sequence into hidden functional regions such as genes
Challenge: Again, $N^T$ possible sequences to consider.
Solution: Viterbi algorithm (dynamic programming) — $O(TN^2)$ time
Note: There are actually two versions of this problem:

- Most likely sequence: $\mathbf{z}^* = \arg\max_{\mathbf{z}} P(\mathbf{z} | \mathbf{x})$, solved by the Viterbi algorithm
- Most likely individual states: $z_t^* = \arg\max_{z_t} P(z_t | \mathbf{x})$ for each $t$ separately (posterior decoding, via forward-backward)

The two can disagree: the sequence of individually most likely states need not be the most likely sequence, and may even contain transitions with zero probability.
Given: Observation sequences $\{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(K)}\}$ (states are hidden!)
Compute: Optimal parameters $\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} P(\mathbf{x}^{(1:K)} | \boldsymbol{\theta})$
Why it matters:

- In real applications the parameters are rarely known in advance; they must be estimated from raw observation sequences
- Learning is what lets an HMM discover hidden structure (regimes, phonemes, states) directly from data
Challenge: No closed-form solution because hidden states are unobserved. The likelihood landscape may be non-convex with multiple local optima.
Solution: Baum-Welch algorithm (Expectation-Maximization) — iteratively improves parameters
Special case: If hidden states are observed during training (supervised learning), parameters can be estimated by direct counting: $$\hat{A}_{ij} = \frac{\text{count}(z_{t-1}=i,\, z_t=j)}{\text{count}(z_{t-1}=i)}$$
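A minimal sketch of this supervised case, assuming fully labeled training sequences; the helper name `estimate_supervised` is hypothetical, and the add-one (Laplace) smoothing is an illustrative choice to avoid zero probabilities:

```python
import numpy as np

def estimate_supervised(state_seqs, obs_seqs, N, M, alpha=1.0):
    """Estimate (A, B, pi) by direct counting from labeled sequences,
    with add-alpha smoothing on every count."""
    A_counts = np.full((N, N), alpha)
    B_counts = np.full((N, M), alpha)
    pi_counts = np.full(N, alpha)
    for states, obs in zip(state_seqs, obs_seqs):
        pi_counts[states[0]] += 1
        for t in range(1, len(states)):          # transition counts
            A_counts[states[t-1], states[t]] += 1
        for z, x in zip(states, obs):            # emission counts
            B_counts[z, x] += 1
    # Normalize counts into probability distributions
    A = A_counts / A_counts.sum(axis=1, keepdims=True)
    B = B_counts / B_counts.sum(axis=1, keepdims=True)
    pi = pi_counts / pi_counts.sum()
    return A, B, pi

# Toy labeled data (hypothetical): two short sequences, N=2 states, M=2 symbols
A, B, pi = estimate_supervised(
    state_seqs=[[0, 0, 1], [1, 1, 0]],
    obs_seqs=[[1, 0, 1], [0, 0, 1]],
    N=2, M=2,
)
print("A =\n", np.round(A, 3))
```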
| Problem | Input | Output | Algorithm | Complexity |
|---|---|---|---|---|
| Evaluation | $\mathbf{x}, \boldsymbol{\theta}$ | $P(\mathbf{x} \mid \boldsymbol{\theta})$ | Forward | $O(TN^2)$ |
| Decoding | $\mathbf{x}, \boldsymbol{\theta}$ | $\mathbf{z}^* = \arg\max_{\mathbf{z}} P(\mathbf{z} \mid \mathbf{x})$ | Viterbi | $O(TN^2)$ |
| Learning | $\{\mathbf{x}^{(k)}\}$ | $\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} P(\mathbf{x} \mid \boldsymbol{\theta})$ | Baum-Welch (EM) | $O(KTN^2)$ per iter |
All three problems share a common theme: replacing exponential-time brute force enumeration with polynomial-time dynamic programming. The key insight is that the Markov property allows us to decompose the global computation into local computations that can be combined efficiently. This is the recurring pattern that makes HMMs practically useful.
The basic HMM framework admits many extensions to handle different data types and structural assumptions.
1. Discrete HMMs (Multinomial Emissions)
Each state emits one of $M$ discrete symbols according to a categorical distribution (a row of $\mathbf{B}$). This is the variant used throughout this page.
2. Gaussian HMMs (Continuous Emissions)
Each state emits a continuous vector drawn from a state-specific Gaussian, $B_j(x) = \mathcal{N}(x | \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$.
3. Gaussian Mixture Model HMMs (GMM-HMM)
Each state's emission density is a mixture of Gaussians, allowing multimodal per-state distributions; long the standard acoustic model in speech recognition.
4. Autoregressive HMMs
The emission at time $t$ depends on previous observations as well as the current state, e.g., $P(x_t | z_t, x_{t-1})$, capturing short-range correlations the basic HMM ignores.
The choice of HMM variant depends on your data and assumptions:
- Discrete data (words, symbols) → Standard discrete HMM
- Continuous univariate/multivariate data → Gaussian HMM or GMM-HMM
- Sequential data with fixed progression → Left-right HMM
- Variable-length segments → Hidden semi-Markov model
- Multiple interacting processes → Factorial or coupled HMMs
Start with the simplest model that captures your data's structure, then add complexity as needed.
We have established the complete mathematical foundation for Hidden Markov Models. Let's consolidate the key concepts:

- An HMM is a doubly stochastic process: a hidden Markov chain over states coupled with a stochastic emission at each step, fully specified by $\boldsymbol{\theta} = (\mathbf{A}, \mathbf{B}, \boldsymbol{\pi})$.
- Two conditional independence assumptions (first-order Markov dynamics for the states; each observation independent of everything else given its state) make the model tractable.
- The joint probability factorizes as $\pi_{z_1} \prod_{t=2}^{T} A_{z_{t-1}, z_t} \prod_{t=1}^{T} B_{z_t}(x_t)$.
- Three fundamental problems arise: evaluation, decoding, and learning, solved by the Forward, Viterbi, and Baum-Welch algorithms respectively.
With the structural foundation in place, we will now develop the algorithms that make HMMs practical:
Next Page: The Forward Algorithm — We'll derive an elegant dynamic programming solution to efficiently compute the probability of an observation sequence, turning an exponential-time problem into a polynomial-time one.
Following Pages:

- The Viterbi Algorithm: decoding the most likely hidden state sequence
- The Baum-Welch Algorithm: learning HMM parameters from observation sequences alone
You now have a rigorous understanding of HMM structure—the mathematical specification, graphical model interpretation, factorization properties, and the three fundamental problems that HMM algorithms solve. This foundation will be essential as we develop the Forward, Viterbi, and Baum-Welch algorithms in the following pages.