Hidden Markov Models aren't just elegant mathematical constructs—they have been the backbone of major technological advances across multiple domains. For decades, HMMs powered the world's best speech recognition systems. They remain fundamental in bioinformatics for gene finding and protein analysis. They enable natural language processing systems for part-of-speech tagging and named entity recognition.
Understanding how HMMs are applied in practice reveals how modeling choices such as the state space, observation model, and topology shape real systems.
This page surveys the most important application domains, providing concrete examples of how the algorithms we've learned translate into working systems.
By the end of this page, you will:
• Understand how HMMs revolutionized speech recognition
• See how sequence labeling tasks (POS tagging, NER) use HMMs
• Explore gene finding and biological sequence analysis
• Understand gesture recognition and handwriting systems
• Know when to use HMMs versus modern deep learning alternatives
Given an acoustic signal (audio waveform), transcribe the spoken words. This involves mapping a continuous, variable-length audio stream onto a discrete sequence of words.
HMMs dominated speech recognition from the 1980s until the deep learning revolution around 2012.
States: Phonemes or sub-phoneme units (typically 3 states per phoneme: beginning, middle, end)
Observations: Acoustic feature vectors, typically MFCCs (Mel-frequency cepstral coefficients) computed over short overlapping frames (on the order of every 10 ms)
Emissions: Gaussian Mixture Models (GMM-HMMs)
Transitions: Left-to-right (Bakis) topology
Training: Baum-Welch (EM) on large corpora of transcribed speech, alternating forced alignment of frames to phoneme states with parameter re-estimation
Decoding: Viterbi search over a composite graph that combines the acoustic HMMs with a pronunciation lexicon and an n-gram language model
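As a concrete illustration of the left-to-right topology above (the state count and probability values here are illustrative, not from a real system), a 3-state phoneme model's transition matrix is upper triangular: each state can stay put or move forward, never backward.

```python
import numpy as np

# Left-to-right (Bakis) transitions for a 3-state phoneme model:
# begin -> middle -> end, with self-loops and no backward transitions.
A = np.array([
    [0.6, 0.4, 0.0],   # begin:  stay or advance to middle
    [0.0, 0.7, 0.3],   # middle: stay or advance to end
    [0.0, 0.0, 1.0],   # end:    stay (or exit to the next phoneme's model)
])
pi = np.array([1.0, 0.0, 0.0])  # always enter at the "begin" state

assert np.allclose(A.sum(axis=1), 1.0)  # rows are valid distributions
```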
Around 2010-2012, Deep Neural Networks replaced GMMs for emission modeling: the network predicts per-frame state posteriors, which are converted into scaled likelihoods and fed into the same HMM decoder (hybrid DNN-HMM systems).
Eventually, end-to-end neural models (CTC, attention-based) replaced HMMs entirely in most modern systems. But HMM principles—sequential modeling, forward-backward inference—still influence these architectures.
Even though pure HMM systems are now rare in production speech recognition, their influence is profound:
• Connectionist Temporal Classification (CTC) is essentially a special HMM
• Alignment concepts from HMMs inform attention mechanisms
• Hybrid systems still use HMM decoder structures
• Understanding HMMs is essential for understanding modern speech systems
Given a sentence, assign a grammatical tag (noun, verb, adjective, ...) to each word.
Example:
Sentence: The dog runs quickly
Tags: DET NOUN VERB ADV
POS tagging is a fundamental NLP task that feeds into parsing, named entity recognition, and other downstream applications.
States: Part-of-speech tags (tagset size ~45 for Penn Treebank)
Observations: Words
Transition Matrix: $P(\text{tag}_t | \text{tag}_{t-1})$
Emission Distribution: $P(\text{word} | \text{tag})$
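For the example sentence above, the joint probability that the tagging HMM assigns factors into initial, transition, and emission terms:

$$P(\text{the dog runs quickly}, \text{DET NOUN VERB ADV}) = \pi(\text{DET})\,P(\text{the}|\text{DET}) \times P(\text{NOUN}|\text{DET})\,P(\text{dog}|\text{NOUN}) \times P(\text{VERB}|\text{NOUN})\,P(\text{runs}|\text{VERB}) \times P(\text{ADV}|\text{VERB})\,P(\text{quickly}|\text{ADV})$$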
```python
import numpy as np
from collections import defaultdict
from typing import List, Tuple


class POSTaggerHMM:
    """
    HMM-based Part-of-Speech tagger.

    Demonstrates supervised training from tagged corpus
    and Viterbi decoding for inference.
    """

    def __init__(self, smoothing: float = 0.1):
        self.smoothing = smoothing
        self.tag2idx = {}
        self.word2idx = {}
        self.idx2tag = {}

        # Parameters (populated during training)
        self.A = None   # Transition probabilities
        self.B = None   # Emission probabilities
        self.pi = None  # Initial distribution

    def train(self, tagged_sentences: List[List[Tuple[str, str]]]):
        """
        Train from labeled data (supervised learning).

        Parameters
        ----------
        tagged_sentences : List of sentences, where each sentence
            is a list of (word, tag) tuples
        """
        # Build vocabularies
        for sentence in tagged_sentences:
            for word, tag in sentence:
                if tag not in self.tag2idx:
                    idx = len(self.tag2idx)
                    self.tag2idx[tag] = idx
                    self.idx2tag[idx] = tag
                if word not in self.word2idx:
                    self.word2idx[word] = len(self.word2idx)

        # Add UNK token
        self.word2idx['<UNK>'] = len(self.word2idx)

        N = len(self.tag2idx)   # Number of tags
        V = len(self.word2idx)  # Vocabulary size

        # Count statistics
        initial_counts = np.zeros(N) + self.smoothing
        transition_counts = np.zeros((N, N)) + self.smoothing
        emission_counts = np.zeros((N, V)) + self.smoothing

        for sentence in tagged_sentences:
            prev_tag_idx = None
            for i, (word, tag) in enumerate(sentence):
                tag_idx = self.tag2idx[tag]
                word_idx = self.word2idx[word]

                if i == 0:
                    initial_counts[tag_idx] += 1
                else:
                    transition_counts[prev_tag_idx, tag_idx] += 1

                emission_counts[tag_idx, word_idx] += 1
                prev_tag_idx = tag_idx

        # Normalize to get probabilities
        self.pi = initial_counts / initial_counts.sum()
        self.A = transition_counts / transition_counts.sum(axis=1, keepdims=True)
        self.B = emission_counts / emission_counts.sum(axis=1, keepdims=True)

    def tag(self, words: List[str]) -> List[str]:
        """
        Tag a sentence using Viterbi decoding.
        """
        T = len(words)
        N = len(self.tag2idx)

        # Map words to indices (use UNK for unknown words)
        obs = [self.word2idx.get(w, self.word2idx['<UNK>']) for w in words]

        # Viterbi in log space
        log_A = np.log(self.A)
        log_B = np.log(self.B)
        log_pi = np.log(self.pi)

        delta = np.zeros((T, N))
        psi = np.zeros((T, N), dtype=int)

        # Initialization
        delta[0] = log_pi + log_B[:, obs[0]]

        # Recursion
        for t in range(1, T):
            for j in range(N):
                candidates = delta[t-1] + log_A[:, j]
                psi[t, j] = np.argmax(candidates)
                delta[t, j] = candidates[psi[t, j]] + log_B[j, obs[t]]

        # Backtracking
        path = np.zeros(T, dtype=int)
        path[-1] = np.argmax(delta[-1])
        for t in range(T-2, -1, -1):
            path[t] = psi[t+1, path[t+1]]

        return [self.idx2tag[i] for i in path]


# Example usage
def demo_pos_tagger():
    # Simple training data
    training_data = [
        [('the', 'DET'), ('dog', 'NOUN'), ('runs', 'VERB')],
        [('a', 'DET'), ('cat', 'NOUN'), ('sleeps', 'VERB')],
        [('the', 'DET'), ('big', 'ADJ'), ('dog', 'NOUN'), ('barks', 'VERB')],
        [('dogs', 'NOUN'), ('run', 'VERB'), ('fast', 'ADV')],
        [('the', 'DET'), ('quick', 'ADJ'), ('fox', 'NOUN'), ('jumps', 'VERB')],
    ]

    tagger = POSTaggerHMM()
    tagger.train(training_data)

    # Test sentences
    test_sentences = [
        ['the', 'cat', 'runs'],
        ['a', 'big', 'dog', 'barks'],
        ['cats', 'sleep'],  # 'cats' is OOV
    ]

    print("POS Tagging Results:")
    print("-" * 50)
    for sentence in test_sentences:
        tags = tagger.tag(sentence)
        print(f"Words: {' '.join(sentence)}")
        print(f"Tags:  {' '.join(tags)}\n")


demo_pos_tagger()
```

A major challenge in HMM POS tagging is out-of-vocabulary (OOV) words—words not seen during training. Solutions include:
• Unknown word token: Map all OOV to <UNK>, trained on rare words
• Suffix features: Words ending in "-ly" likely adverbs, "-tion" likely nouns
• Word shape: Capitalized words may be proper nouns
• Character-level models: Use spelling patterns for emission probability
Modern systems use these features in discriminative models (CRFs) or neural taggers.
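As an illustration of the suffix idea above, here is a minimal sketch (the suffix table is hand-written and purely hypothetical) of backing off to suffix-based emission scores for unseen words instead of a single shared `<UNK>` probability:

```python
import numpy as np

# Hypothetical suffix -> tag preferences; in practice these would be
# estimated from rare training words rather than written by hand.
SUFFIX_TAG_PRIOR = {
    'ly':   {'ADV': 0.8, 'ADJ': 0.2},
    'tion': {'NOUN': 0.9, 'VERB': 0.1},
    's':    {'NOUN': 0.5, 'VERB': 0.5},
}

def oov_emission_logprob(word, tag, fallback_logprob):
    """Log emission score for an unseen word: use a suffix preference
    if a known suffix matches, otherwise fall back to the <UNK> score."""
    for suffix, tag_dist in SUFFIX_TAG_PRIOR.items():
        if word.endswith(suffix):
            return np.log(tag_dist.get(tag, 1e-3))
    return fallback_logprob

print(oov_emission_logprob('quickly', 'ADV', fallback_logprob=np.log(1e-4)))
```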
Identify and classify named entities in text—persons, organizations, locations, dates, etc.
Example:
Text: Apple Inc. announced that Tim Cook will visit London
Labels: B-ORG I-ORG O O B-PER I-PER O O B-LOC
NER uses the BIO label scheme to mark entity boundaries: B-X labels the first token of an entity of type X, I-X labels subsequent tokens inside the same entity, and O labels tokens outside any entity.
This converts the entity recognition problem into a sequence labeling problem that HMMs can handle.
States: BIO tags (e.g., B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, O)
Observations: Words and features (capitalization, word shape)
Transition Constraints: an I-X tag may only follow B-X or I-X of the same entity type; an entity cannot begin mid-way with an I- tag.
These constraints can be enforced by setting the corresponding transition probabilities to zero, as shown in the table below (a short masking sketch follows the table).
| From \ To | B-PER | I-PER | B-ORG | I-ORG | O |
|---|---|---|---|---|---|
| B-PER | ✓ | ✓ | ✓ | ✗ | ✓ |
| I-PER | ✓ | ✓ | ✓ | ✗ | ✓ |
| B-ORG | ✓ | ✗ | ✓ | ✓ | ✓ |
| I-ORG | ✓ | ✗ | ✓ | ✓ | ✓ |
| O | ✓ | ✗ | ✓ | ✗ | ✓ |
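A minimal sketch (the tag set and probability values are illustrative placeholders) of enforcing these constraints: build a validity mask, zero out forbidden transitions, and renormalize the rows.

```python
import numpy as np

tags = ['B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'O']

def is_valid(prev_tag, next_tag):
    """I-X may only follow B-X or I-X of the same entity type."""
    if next_tag.startswith('I-'):
        entity = next_tag[2:]
        return prev_tag in (f'B-{entity}', f'I-{entity}')
    return True  # B-X and O may follow anything

# Start from some estimated transition probabilities (placeholder values here)
A_hat = np.full((len(tags), len(tags)), 0.2)

mask = np.array([[is_valid(p, n) for n in tags] for p in tags], dtype=float)
A = A_hat * mask
A = A / A.sum(axis=1, keepdims=True)  # renormalize each row

print(np.round(A, 2))
```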
Raw words are insufficient for NER—we need rich features:
Word-level features: capitalization, word shape (e.g., Xxxx, XXXX, dddd), prefixes and suffixes, presence of digits or hyphens
Context features: the previous and next words and their capitalization patterns
Gazetteers (external knowledge): lists of known person, organization, and location names that tokens can be matched against
For HMMs, these features must be incorporated into the observation model—either by discretizing features or using Gaussian emissions for continuous features.
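One simple discretization strategy, sketched below with hypothetical helper names, is to map each token to a composite feature symbol and estimate emission probabilities over those symbols rather than raw words.

```python
def token_to_symbol(word: str, in_gazetteer: bool = False) -> str:
    """Map a token to a coarse (shape, gazetteer) observation symbol."""
    if word[0].isupper():
        shape = 'Cap'
    elif word.isdigit():
        shape = 'Digit'
    else:
        shape = 'Lower'
    return f"{shape}|{'GAZ' if in_gazetteer else 'NOGAZ'}"

print(token_to_symbol('London', in_gazetteer=True))  # Cap|GAZ
print(token_to_symbol('announced'))                  # Lower|NOGAZ
```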
HMMs are generative models: they model $P(x, z) = P(x|z)P(z)$. This requires specifying how observations are generated from states.
In NER, we want to use arbitrarily complex features that describe the relationship between words and labels. Conditional Random Fields (CRFs) are the discriminative counterpart: they model $P(z|x)$ directly, allowing any features without generative assumptions.
Modern NER systems typically use neural CRFs or transformer-based models (BERT + token classification), but understanding HMMs provides the foundation for these advances.
Given a DNA sequence (a string over {A, C, G, T}), identify the protein-coding genes it contains.
HMMs excel here because gene structure has clear sequential dependencies: genes have start codons, splice sites, stop codons, and characteristic statistics.
```
─────────────[===EXON===]────INTRON────[===EXON===]──────────
              ATG              ...              TGA
         (start codon)                     (stop codon)
```
States: Gene-structure elements. The simplest models use exon, intron, and intergenic states; realistic gene finders add states for start and stop codons, splice donor/acceptor sites, and reading-frame-specific exons.
Observations: Nucleotides (A, C, G, T)
Key insight: Different regions have different nucleotide statistics. Coding exons show codon-usage bias, introns tend to be AT-rich, and intergenic DNA is closer to uniform, so the emission distributions can tell the regions apart.
```python
import numpy as np

def simple_gene_finder_demo():
    """
    Simplified gene finder HMM demonstration.

    Real gene finders (GENSCAN, Augustus) use much more
    complex state spaces with duration modeling.
    """
    # Simplified states
    states = ['INTERGENIC', 'EXON', 'INTRON']
    state_letter = {'INTERGENIC': 'I', 'EXON': 'E', 'INTRON': 'N'}  # display letters
    n_states = len(states)

    # Observations: nucleotides
    nucleotides = ['A', 'C', 'G', 'T']
    nuc2idx = {n: i for i, n in enumerate(nucleotides)}

    # Transition probabilities (simplified)
    # Exons tend to be followed by exons (self-loop) or introns
    # Intergenic regions are stable or transition to exons (gene start)
    A = np.array([
        [0.95, 0.05, 0.00],  # From INTERGENIC
        [0.01, 0.90, 0.09],  # From EXON
        [0.01, 0.10, 0.89],  # From INTRON
    ])

    # Emission probabilities
    # Coding regions (EXON) have biased nucleotide frequencies
    # due to codon usage preferences
    B = np.array([
        #  A     C     G     T
        # INTERGENIC (roughly uniform)
        [0.30, 0.20, 0.20, 0.30],
        # EXON - biased towards certain nucleotides in codons
        [0.20, 0.30, 0.35, 0.15],
        # INTRON - AT-rich
        [0.35, 0.15, 0.15, 0.35],
    ])

    pi = np.array([0.8, 0.1, 0.1])  # Usually start intergenic

    # Example DNA sequence
    dna_sequence = "ATGCGCGCGATCGATCGATCGATCGATCGATATATATATAAAA"
    obs = [nuc2idx[n] for n in dna_sequence]

    # Viterbi decoding (log space; log(0) = -inf encodes forbidden transitions)
    T = len(obs)
    delta = np.zeros((T, n_states))
    psi = np.zeros((T, n_states), dtype=int)

    with np.errstate(divide='ignore'):
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)

    delta[0] = log_pi + log_B[:, obs[0]]

    for t in range(1, T):
        for j in range(n_states):
            candidates = delta[t-1] + log_A[:, j]
            psi[t, j] = np.argmax(candidates)
            delta[t, j] = candidates[psi[t, j]] + log_B[j, obs[t]]

    # Backtrack
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T-2, -1, -1):
        path[t] = psi[t+1, path[t+1]]

    # Display results
    print("Gene Finding Demo")
    print("=" * 60)
    print(f"DNA:    {dna_sequence}")
    print(f"States: {''.join([state_letter[states[s]] for s in path])}")
    print("\nLegend: I=INTERGENIC, E=EXON, N=INTRON")

    # Show segment summary
    print("\nSegment analysis:")
    current_state = path[0]
    start = 0
    for i in range(1, T+1):
        if i == T or path[i] != current_state:
            print(f"  Positions {start:3d}-{i-1:3d}: {states[current_state]:12s} "
                  f"({dna_sequence[start:i][:20]}{'...' if i-start > 20 else ''})")
            if i < T:
                current_state = path[i]
                start = i

simple_gene_finder_demo()
```

Beyond gene finding, HMMs power several other classic bioinformatics tasks:

Protein Secondary Structure Prediction: hidden states correspond to helix, sheet, and coil, with amino-acid (or profile) emissions
CpG Island Detection: a small HMM separates CG-rich island regions from background sequence, a classic textbook application
Sequence Alignment (Profile HMMs): match, insert, and delete states model a protein family; tools like HMMER use them for alignment and database search
Chromatin State Annotation: tools such as ChromHMM segment the genome into chromatin states based on histone-modification signals
Online handwriting (pen trajectory over time): observations are pen coordinates, velocities, and pen-up/pen-down events; states correspond to letters or stroke segments
Offline handwriting (image-based): sliding-window features extracted across the word image are decoded into character sequences
Sign language recognition: hand position, shape, and motion features extracted from video, typically with one HMM per sign
Activity recognition: accelerometer or other wearable-sensor streams segmented into activities such as walking, sitting, or running (see the sketch after this list)
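As referenced in the activity-recognition item above, here is a minimal sketch (synthetic data, illustrative parameters) of the classify-by-likelihood pattern: train one Gaussian HMM per activity with hmmlearn, then assign a new sequence to the model with the highest log-likelihood.

```python
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)

def synthetic_sequence(mean, n=200):
    """Fake 3-axis accelerometer readings fluctuating around a mean level."""
    return mean + rng.normal(scale=0.5, size=(n, 3))

# Pretend training data: 'walking' is high-energy, 'sitting' is low-energy
train = {
    'walking': synthetic_sequence(mean=2.0),
    'sitting': synthetic_sequence(mean=0.0),
}

# One small Gaussian HMM per activity
models = {}
for activity, X in train.items():
    m = hmm.GaussianHMM(n_components=3, covariance_type='diag',
                        n_iter=50, random_state=0)
    m.fit(X)
    models[activity] = m

# Classify a new sequence by comparing per-model log-likelihoods
test_seq = synthetic_sequence(mean=1.9, n=100)
scores = {a: m.score(test_seq) for a, m in models.items()}
print(max(scores, key=scores.get), scores)
```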
Across all these applications, the pattern is the same: define hidden states that correspond to meaningful units of the domain, pick an observation model for the measured data, train from labeled counts or with Baum-Welch, and decode with Viterbi (or compute posteriors with forward-backward).
The art is in choosing the right state space and observation model for your domain.
With the rise of deep learning, when should you still consider HMMs?
Interpretability: States, transitions, and emissions are explicit parameters that a domain expert can inspect and sanity-check.
Data efficiency: With relatively few parameters, HMMs can be trained on modest datasets where deep models would overfit.
Computational efficiency: Training and inference run comfortably on a CPU, with exact inference in $O(TN^2)$ per sequence.
Theoretical guarantees: Forward-backward yields exact posteriors, and each EM iteration is guaranteed not to decrease the training likelihood.
| Aspect | HMMs | Deep Learning (RNNs, Transformers) |
|---|---|---|
| Data requirements | Works with small data | Needs large datasets |
| Interpretability | High (inspectable parameters) | Low (black box) |
| Flexibility | Limited by model structure | Can learn complex patterns |
| Long-range dependencies | Limited (Markov assumption) | Better with attention |
| Training complexity | EM, local optima | SGD, hyperparameter tuning |
| Inference | Exact, efficient | Forward pass (no exact posteriors) |
| Uncertainty | Principled probabilistic | Requires additional techniques |
| Compute requirements | CPU sufficient | Often needs GPU |
Often the best solution combines both:
• Neural emissions: Use a neural network for $P(x|z)$, HMM structure for the temporal model (a minimal decoding sketch follows below)
• Neural CRFs: CRF layer on top of BiLSTM/BERT for sequence labeling
• Variational Sequential Models: VAEs with latent HMM structure
Understanding HMMs provides the foundation for these advanced architectures.
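To make the neural-emissions bullet concrete, here is a minimal sketch (hypothetical function name, random stand-in scores instead of a real network) showing that any model producing per-frame log emission scores can be decoded with standard HMM Viterbi machinery.

```python
import numpy as np

def viterbi_with_external_scores(log_emissions, log_A, log_pi):
    """Viterbi decoding where log_emissions[t, j] comes from an external
    model (e.g., a neural network) instead of an HMM emission table."""
    T, N = log_emissions.shape
    delta = np.full((T, N), -np.inf)
    psi = np.zeros((T, N), dtype=int)

    delta[0] = log_pi + log_emissions[0]
    for t in range(1, T):
        scores = delta[t-1][:, None] + log_A          # (N, N): from i to j
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(N)] + log_emissions[t]

    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T-2, -1, -1):
        path[t] = psi[t+1, path[t+1]]
    return path

# Toy usage: random "network" scores for 2 states over 5 frames
rng = np.random.default_rng(0)
log_em = np.log(rng.dirichlet([1, 1], size=5))        # stand-in for NN outputs
log_A = np.log(np.array([[0.9, 0.1], [0.2, 0.8]]))
log_pi = np.log(np.array([0.5, 0.5]))
print(viterbi_with_external_scores(log_em, log_A, log_pi))
```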
Python:
• hmmlearn: Sklearn-compatible, Gaussian and GMM emissions
• pomegranate: Flexible distributions, GPU support
• seqlearn: Sequence learning with HMMs and CRFs

Java:
• Jahmm: Pure Java HMM library

Specialty:
• HMMER: Profile HMMs for protein sequences
• HTK: HMM Toolkit for speech recognition
• Kaldi: Speech recognition toolkit (HMM + neural)

Challenge: Standard algorithms are $O(TN^2)$ per sequence
Solutions:
• Beam pruning: at each time step, drop states whose score falls far below the current best (see the sketch below)
• Exploiting sparse or structured transition matrices (e.g., left-to-right topologies)
• Vectorized or batched implementations across states and sequences (NumPy, GPU)
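As a sketch of the beam-pruning idea (the beam width and setup are illustrative assumptions), here is a Viterbi variant that only expands states within a fixed margin of the per-frame best score:

```python
import numpy as np

def viterbi_beam(obs, log_pi, log_A, log_B, beam=10.0):
    """Viterbi with beam pruning: states scoring more than `beam` below
    the frame's best are not expanded at the next step."""
    T, N = len(obs), log_A.shape[0]
    delta = np.full((T, N), -np.inf)
    psi = np.zeros((T, N), dtype=int)

    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        # Keep only states within `beam` of the current best
        active = np.flatnonzero(delta[t-1] >= delta[t-1].max() - beam)
        scores = delta[t-1][active, None] + log_A[active, :]   # (|active|, N)
        best = np.argmax(scores, axis=0)
        psi[t] = active[best]
        delta[t] = scores[best, np.arange(N)] + log_B[:, obs[t]]

    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T-2, -1, -1):
        path[t] = psi[t+1, path[t+1]]
    return path

# usage: path = viterbi_beam(obs, np.log(pi), np.log(A), np.log(B))
```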
For practical work, libraries such as hmmlearn package these algorithms behind a scikit-learn-style API:

```python
from hmmlearn import hmm
import numpy as np

def hmmlearn_example():
    """
    Example using the hmmlearn library for Gaussian HMMs.

    hmmlearn provides efficient implementations of HMM algorithms.
    """
    # Generate synthetic data from a 2-state Gaussian HMM
    np.random.seed(42)

    # True parameters
    true_startprob = np.array([0.6, 0.4])
    true_transmat = np.array([[0.7, 0.3],
                              [0.4, 0.6]])
    true_means = np.array([[0.0], [3.0]])
    true_covars = np.array([[[0.5]], [[1.0]]])

    # Create true model and sample
    true_model = hmm.GaussianHMM(n_components=2, covariance_type="full")
    true_model.startprob_ = true_startprob
    true_model.transmat_ = true_transmat
    true_model.means_ = true_means
    true_model.covars_ = true_covars

    # Generate training data
    X_train, _ = true_model.sample(1000)

    # Create and train a new model
    learned_model = hmm.GaussianHMM(
        n_components=2,
        covariance_type="full",
        n_iter=100,
        random_state=42
    )
    learned_model.fit(X_train)

    print("Learned HMM Parameters:")
    print("=" * 50)
    print("\nTransition matrix:")
    print(learned_model.transmat_.round(3))
    print("\nMeans:")
    print(learned_model.means_.round(3))

    # Decode a test sequence
    # Note: the learned model's state indices may be permuted
    # relative to the true model.
    X_test, true_states = true_model.sample(20)
    log_prob, predicted_states = learned_model.decode(X_test)

    print("\nDecoding Example:")
    print(f"True states:      {true_states.flatten()}")
    print(f"Predicted states: {predicted_states}")
    print(f"Accuracy: {(true_states.flatten() == predicted_states).mean():.2%}")

    # Model selection using BIC
    print("\nModel Selection (BIC):")
    for n in range(1, 5):
        model = hmm.GaussianHMM(n_components=n, n_iter=100, random_state=42)
        model.fit(X_train)
        # Free parameters: (n-1) initial + n(n-1) transition + n*d means
        # + n*d diagonal variances (default covariance_type='diag')
        d = X_train.shape[1]
        k = (n - 1) + n * (n - 1) + n * d + n * d
        bic = -2 * model.score(X_train) + k * np.log(len(X_train))
        print(f"  {n} states: BIC = {bic:.2f}")

hmmlearn_example()
```

Hidden Markov Models have been one of the most successful and widely-applied machine learning techniques, providing the foundation for numerous technologies we use daily.
Congratulations! You have completed the comprehensive study of Hidden Markov Models. You now understand:
• The probabilistic foundations: hidden states, transitions, emissions, and the Markov assumption
• The core algorithms: forward-backward for inference, Viterbi for decoding, and EM (Baum-Welch) for learning
• How these ideas power real applications in speech, language, biology, and activity recognition
• When to reach for HMMs, deep learning, or hybrids of the two
This knowledge provides a solid foundation for understanding more advanced sequential models, including Conditional Random Fields, Recurrent Neural Networks, and modern sequence-to-sequence architectures.
You have mastered Hidden Markov Models—from the mathematical foundations through core algorithms to real-world applications. HMMs remain relevant today as building blocks for understanding sequential data, and their principles permeate modern machine learning. Whether you apply HMMs directly or use them as a conceptual framework for more complex models, this knowledge will serve you well throughout your machine learning journey.