In October 2018, Google AI released a paper that would fundamentally reshape natural language processing: *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*, introducing Bidirectional Encoder Representations from Transformers. Within months of its release, BERT shattered records on virtually every major NLP benchmark, achieving state-of-the-art results on 11 tasks simultaneously.
BERT wasn't just an incremental improvement—it represented a paradigm shift. Before BERT, NLP models were typically trained from scratch for each task, requiring task-specific architectures and large labeled datasets. BERT introduced a new methodology: pre-train deeply on massive unlabeled text, then fine-tune briefly on any downstream task. This transfer learning approach democratized high-performance NLP, enabling practitioners to achieve near-state-of-the-art results with modest computational resources and limited labeled data.
By the end of this page, you will understand BERT's architecture in complete detail—from its bidirectional attention mechanism to its pre-training objectives. You'll learn how BERT processes text, why bidirectionality matters, how fine-tuning works, and how subsequent variants like RoBERTa, ALBERT, and DistilBERT improved upon the original. You'll be equipped to apply BERT effectively and understand its limitations.
The Pre-BERT Landscape:
To appreciate BERT's contribution, we must understand what came before. Prior to BERT, the dominant approaches to NLP fell into several categories:
Feature-based approaches: Using pre-trained word embeddings (Word2Vec, GloVe) as features for downstream task-specific models. These embeddings were context-independent—the word "bank" had the same representation whether referring to a river bank or financial institution.
ELMo (Contextualized Embeddings): A breakthrough in 2018, ELMo used bidirectional LSTMs to generate context-dependent embeddings. However, the bidirectionality was shallow—forward and backward LSTMs were trained separately and combined only at the final layer.
Unidirectional Language Models: OpenAI's GPT (2018) used the transformer architecture with unidirectional (left-to-right) attention for language modeling. While powerful, the unidirectional constraint limited the model's ability to understand context from both directions.
BERT's innovation was achieving deep bidirectionality—every layer jointly conditions on both left and right context, creating richer representations than any previous approach.
BERT uses only the encoder stack from the original transformer architecture. Unlike the full encoder-decoder transformer (designed for sequence-to-sequence tasks like translation), BERT's encoder-only design is optimized for understanding and representing text, not generating it.
Why Encoder-Only?
The encoder's self-attention mechanism allows every token to attend to every other token—perfect for tasks requiring understanding of the entire input sequence. The decoder, designed for autoregressive generation, uses masked self-attention that prevents tokens from attending to future positions. BERT deliberately removes this constraint to achieve full bidirectionality.
| Configuration | Layers (L) | Hidden Size (H) | Attention Heads (A) | Parameters |
|---|---|---|---|---|
| BERT-Base | 12 | 768 | 12 | 110M |
| BERT-Large | 24 | 1024 | 16 | 340M |
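As a sanity check on these figures, the totals can be approximated from L and H alone. The following is a back-of-the-envelope sketch; `bert_param_count` is an illustrative helper that lumps biases and layer norms into rough terms, not an exact accounting:

```python
def bert_param_count(L, H, vocab_size=30522, max_pos=512, num_segments=2, ffn_mult=4):
    """Rough parameter count for a BERT-style encoder."""
    embeddings = (vocab_size + max_pos + num_segments) * H + 2 * H  # tables + embedding LayerNorm
    attention = 4 * (H * H + H)                      # Q, K, V, and output projections
    ffn = 2 * (ffn_mult * H * H) + ffn_mult * H + H  # two linear layers with biases
    norms = 2 * 2 * H                                # two LayerNorms per layer
    return embeddings + L * (attention + ffn + norms)

print(f"BERT-Base:  ~{bert_param_count(12, 768) / 1e6:.0f}M")   # ~109M
print(f"BERT-Large: ~{bert_param_count(24, 1024) / 1e6:.0f}M")  # ~334M
```

The estimates land within a few percent of the published 110M and 340M figures; most of BERT-Base's budget sits in the embedding table (~24M) and the twelve ~7M-parameter layers.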
Architectural Components:
Each layer in BERT consists of:
Multi-Head Self-Attention: Allows each token to gather information from all other tokens in the sequence. With H hidden units and A attention heads, each head operates on H/A dimensional subspaces.
Position-wise Feed-Forward Network: A two-layer fully connected network applied independently to each position:
$$\text{FFN}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2$$
The intermediate dimension is typically 4H (3072 for BERT-Base, 4096 for BERT-Large). Note that BERT replaces the original transformer's ReLU activation with GELU.
Residual Connections and Layer Normalization: each sub-layer is wrapped in a residual connection followed by layer normalization:
$$\text{output} = \text{LayerNorm}(x + \text{SubLayer}(x))$$
Research has shown that different layers in BERT capture different linguistic phenomena. Lower layers tend to capture surface-level features (parts of speech, syntax), middle layers capture syntactic relationships, and upper layers capture more semantic and task-specific information. This hierarchical representation emerges naturally from pre-training.
```python
import torch
import torch.nn as nn


class BERTEmbedding(nn.Module):
    """
    BERT Embedding layer combines three types of embeddings:
    1. Token embeddings (vocabulary)
    2. Segment embeddings (sentence A vs B)
    3. Position embeddings (absolute positions)
    """
    def __init__(self, vocab_size, hidden_size, max_seq_length, num_segments=2, dropout=0.1):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, hidden_size)
        self.segment_embedding = nn.Embedding(num_segments, hidden_size)
        self.position_embedding = nn.Embedding(max_seq_length, hidden_size)
        self.layer_norm = nn.LayerNorm(hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(dropout)

    def forward(self, token_ids, segment_ids):
        seq_length = token_ids.size(1)
        position_ids = torch.arange(seq_length, device=token_ids.device).unsqueeze(0)
        # Sum all three embedding types
        embeddings = (
            self.token_embedding(token_ids)
            + self.segment_embedding(segment_ids)
            + self.position_embedding(position_ids)
        )
        embeddings = self.layer_norm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings


class BERTAttention(nn.Module):
    """
    Multi-head self-attention with scaling and residual connection.
    """
    def __init__(self, hidden_size, num_heads, dropout=0.1):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.scale = self.head_dim ** -0.5
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)
        self.output = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attention_mask=None):
        batch_size, seq_length, hidden_size = x.shape
        # Linear projections and reshape for multi-head
        Q = self.query(x).view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
        K = self.key(x).view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
        V = self.value(x).view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention
        attention_scores = torch.matmul(Q, K.transpose(-2, -1)) * self.scale
        if attention_mask is not None:
            # attention_mask: [batch_size, 1, 1, seq_length]
            attention_scores = attention_scores + attention_mask
        attention_probs = torch.softmax(attention_scores, dim=-1)
        attention_probs = self.dropout(attention_probs)
        # Weighted sum and reshape
        context = torch.matmul(attention_probs, V)
        context = context.transpose(1, 2).contiguous().view(batch_size, seq_length, hidden_size)
        return self.output(context)


class BERTLayer(nn.Module):
    """
    Single BERT layer: Attention + FFN with residual connections and layer norm.
    """
    def __init__(self, hidden_size, num_heads, intermediate_size, dropout=0.1):
        super().__init__()
        self.attention = BERTAttention(hidden_size, num_heads, dropout)
        self.attention_norm = nn.LayerNorm(hidden_size, eps=1e-12)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.GELU(),  # BERT uses GELU activation
            nn.Linear(intermediate_size, hidden_size),
            nn.Dropout(dropout),
        )
        self.ffn_norm = nn.LayerNorm(hidden_size, eps=1e-12)

    def forward(self, x, attention_mask=None):
        # Self-attention with residual
        attention_output = self.attention(x, attention_mask)
        x = self.attention_norm(x + attention_output)
        # FFN with residual
        ffn_output = self.ffn(x)
        x = self.ffn_norm(x + ffn_output)
        return x
```

BERT's input representation is carefully designed to handle various NLP tasks within a unified framework. Understanding this representation is crucial for effective use of BERT.
WordPiece Tokenization:
BERT uses WordPiece tokenization, a subword algorithm that balances vocabulary size against the ability to handle rare and out-of-vocabulary words. The vocabulary is built by starting from individual characters and iteratively merging the symbol pair that most increases the likelihood of the training corpus; at tokenization time, each word is split greedily into the longest matching vocabulary pieces.
This approach handles morphological variations naturally: "playing" might become ["play", "##ing"], where "##" indicates a subword continuation.
Subword tokenization solves the vocabulary dilemma: word-level tokenization struggles with rare words ("OOV problem"), while character-level tokenization creates very long sequences and loses word-level semantics. Subword methods like WordPiece, BPE, and SentencePiece achieve an optimal balance, representing common words as single tokens while breaking rare words into meaningful subunits.
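The greedy longest-match-first segmentation WordPiece applies at inference time can be sketched in a few lines. This is a toy illustration with a tiny vocabulary; `wordpiece_tokenize` is a hypothetical helper, not the production tokenizer:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation into subword pieces."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, shrinking until a match
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a subword continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: fall back to the unknown token
        tokens.append(piece)
        start = end
    return tokens

print(wordpiece_tokenize("playing", {"play", "##ing"}))  # ['play', '##ing']
```

This reproduces the "playing" → ["play", "##ing"] example above; a word with no matching pieces collapses to `[UNK]`, which the real 30K-token vocabulary makes rare.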
Special Tokens:
BERT uses several special tokens with specific purposes:
[CLS]: Classification token, placed at the beginning of every input. Its final hidden state serves as the aggregate sequence representation for classification tasks.
[SEP]: Separator token, used to separate two sentences in pair tasks (e.g., question-answering, natural language inference). Also placed at the end of single-sentence inputs.
[MASK]: Masking token, used during pre-training to replace tokens that the model must predict.
[PAD]: Padding token, used to make all sequences in a batch the same length.
[UNK]: Unknown token, rarely used due to WordPiece's ability to handle most inputs.
| Embedding Type | Purpose | Details |
|---|---|---|
| Token Embedding | Maps vocabulary tokens to vectors | Vocabulary of 30,522 tokens → 768-dim vectors |
| Segment Embedding | Distinguishes sentence A from B | Two learned embeddings for pair tasks |
| Position Embedding | Encodes absolute position | Learned embeddings for positions 0-511 |
Input Construction Example:
For a sentence pair task like Natural Language Inference:
Premise: "A man is playing guitar."
Hypothesis: "Someone is making music."
The input sequence becomes:
```
Tokens:   [CLS] a  man is play ##ing guitar .  [SEP] someone is making music .  [SEP]
Segment:    0   0   0  0   0    0      0    0    0      1     1    1      1   1    1
Position:   0   1   2  3   4    5      6    7    8      9    10   11     12  13   14
```
The final input embedding is the sum of these three components:
$$E_{input} = E_{token} + E_{segment} + E_{position}$$
This embedding is then processed through BERT's encoder layers.
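The assembly of tokens, segment ids, and position ids can be sketched directly. A minimal illustration; `build_bert_input` is a hypothetical helper, and real pipelines also handle truncation and padding:

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Assemble tokens, segment ids, and position ids for single or pair inputs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segments = [0] * len(tokens)          # sentence A (and [CLS], first [SEP]) -> segment 0
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segments += [1] * (len(tokens_b) + 1)  # sentence B and final [SEP] -> segment 1
    positions = list(range(len(tokens)))  # absolute positions 0..n-1
    return tokens, segments, positions

tokens, segments, positions = build_bert_input(
    ["a", "man", "is", "play", "##ing", "guitar", "."],
    ["someone", "is", "making", "music", "."],
)
```

Running this on the NLI example reproduces the 15-token layout shown above: nine segment-0 positions followed by six segment-1 positions.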
BERT's position embeddings are learned and fixed at 512 positions, limiting input sequences to 512 tokens. This constraint has significant implications for tasks involving long documents. Various strategies address this: truncation, sliding window approaches, or using long-context variants like Longformer.
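A sliding-window split can be sketched as follows. The window and stride sizes are assumptions, and real pipelines re-insert `[CLS]`/`[SEP]` per window and merge the per-window predictions afterward:

```python
def sliding_windows(token_ids, max_tokens=510, stride=128):
    """Split a long token sequence into overlapping windows that each fit
    BERT's 512-token limit (510 content tokens + [CLS] and [SEP])."""
    if len(token_ids) <= max_tokens:
        return [token_ids]
    windows, start = [], 0
    while True:
        windows.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break
        start += max_tokens - stride  # advance, keeping `stride` tokens of overlap
    return windows

ws = sliding_windows(list(range(1000)))
print(len(ws), [len(w) for w in ws])
```

The overlap gives each token at least one window where it is not at the very edge, which matters for span tasks like extractive QA.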
BERT's pre-training uses two self-supervised objectives that together enable deep bidirectional understanding: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
The core innovation enabling BERT's bidirectionality is MLM. The challenge with standard language modeling is that bidirectional conditioning would allow words to "see themselves"—making prediction trivial. BERT solves this elegantly:
The MLM Procedure:
1. Randomly select 15% of the input tokens for prediction.
2. Of the selected tokens: replace 80% with [MASK], replace 10% with a random vocabulary token, and leave 10% unchanged.
3. Train the model to predict the original token at every selected position.
The 80-10-10 masking strategy addresses a subtle problem: during fine-tuning, the model never sees [MASK] tokens. If pre-training exclusively used [MASK], there would be a train-fine-tune mismatch. Random replacement and unchanged tokens help the model learn to predict in all contexts. The 80% [MASK] rate still ensures the model can't simply copy the input.
MLM Loss Function:
The MLM loss is standard cross-entropy over the masked positions:
$$\mathcal{L}_{MLM} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\backslash i})$$
where $\mathcal{M}$ is the set of masked positions and $x_{\backslash i}$ represents the sequence with position $i$ masked.
Interpreting MLM as Denoising:
MLM can be viewed as a denoising autoencoder. The input is corrupted (by masking), and the model learns to reconstruct the original. This forces the model to develop rich internal representations that capture linguistic patterns, syntax, and semantics.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple


def create_mlm_mask(
    input_ids: torch.Tensor,
    vocab_size: int,
    mask_token_id: int,
    special_token_ids: set,
    mask_prob: float = 0.15,
) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Create MLM masks following BERT's 80-10-10 strategy.

    Returns:
        masked_input_ids: Input with masking applied
        labels: Original tokens at masked positions (-100 elsewhere)
    """
    labels = input_ids.clone()
    # Probability matrix for masking
    probability_matrix = torch.full(input_ids.shape, mask_prob)
    # Don't mask special tokens
    special_tokens_mask = torch.tensor(
        [[1 if token_id in special_token_ids else 0 for token_id in seq]
         for seq in input_ids.tolist()],
        dtype=torch.bool,
    )
    probability_matrix.masked_fill_(special_tokens_mask, value=0.0)
    # Determine which tokens to mask
    masked_indices = torch.bernoulli(probability_matrix).bool()
    # Labels: -100 for non-masked (ignored in loss), original id for masked
    labels[~masked_indices] = -100
    # 80% of masked: replace with [MASK]
    indices_replaced = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked_indices
    input_ids[indices_replaced] = mask_token_id
    # 10% of masked: replace with random token
    indices_random = (
        torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool()
        & masked_indices
        & ~indices_replaced
    )
    random_words = torch.randint(vocab_size, input_ids.shape, dtype=torch.long)
    input_ids[indices_random] = random_words[indices_random]
    # Remaining 10%: keep original (already done since we cloned)
    return input_ids, labels


class MLMHead(nn.Module):
    """
    Masked Language Modeling head.
    Predicts original tokens from final hidden states.
    """
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.GELU()
        self.layer_norm = nn.LayerNorm(hidden_size, eps=1e-12)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        x = self.dense(hidden_states)
        x = self.activation(x)
        x = self.layer_norm(x)
        return self.decoder(x)


def compute_mlm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Compute MLM loss with -100 label masking."""
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
```

The second pre-training objective teaches BERT to understand relationships between sentences—crucial for tasks like question-answering and natural language inference.
The NSP Procedure:
1. For each training example, pick two segments A and B: 50% of the time B is the segment that actually follows A (label IsNext); 50% of the time B is a random segment from the corpus (label NotNext).
2. The final hidden state of the [CLS] token is passed to a binary classifier that predicts the label.
NSP Loss Function:
$$\mathcal{L}_{NSP} = -[y \log P(\text{IsNext}) + (1-y) \log P(\text{NotNext})]$$
where $y = 1$ if B follows A, else $y = 0$.
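The pair construction behind NSP can be sketched as follows (an illustrative helper, not BERT's actual preprocessing code):

```python
import random

def make_nsp_example(doc_sentences, corpus_sentences, rng=None):
    """Build one NSP training pair: 50% actual next sentence (label 1),
    50% random sentence from elsewhere in the corpus (label 0)."""
    rng = rng or random.Random()
    i = rng.randrange(len(doc_sentences) - 1)
    sent_a = doc_sentences[i]
    if rng.random() < 0.5:
        return sent_a, doc_sentences[i + 1], 1     # IsNext
    return sent_a, rng.choice(corpus_sentences), 0  # NotNext
```

Because the negative is drawn from anywhere in the corpus, it often differs in topic—part of why the task can be solved without true discourse understanding, as the RoBERTa critique below notes.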
Later research (particularly RoBERTa) showed that NSP might not be as beneficial as originally thought. The binary classification task may be too easy—the model can often distinguish sentences by topic alone without learning true discourse coherence. RoBERTa removes NSP entirely with no performance loss, and many subsequent models have followed suit.
Total Pre-training Objective:
BERT's final pre-training loss combines both objectives:
$$\mathcal{L}_{total} = \mathcal{L}_{MLM} + \mathcal{L}_{NSP}$$
Pre-training Data and Compute:
BERT was pre-trained on:
- BooksCorpus (~800 million words)
- English Wikipedia (~2.5 billion words, text passages only)
Total: ~3.3 billion words
Training took 4 days on specialized hardware: BERT-Base on 4 Cloud TPUs (16 TPU chips) and BERT-Large on 16 Cloud TPUs (64 chips).
BERT's most significant practical innovation is its fine-tuning paradigm: a pre-trained model can be adapted to virtually any NLP task with minimal architectural modification and relatively little task-specific training data.
The Fine-tuning Procedure:
1. Initialize the model with the pre-trained BERT weights.
2. Add a minimal task-specific output layer (often a single linear layer).
3. Train all parameters end-to-end on the labeled task data, typically for just 2-4 epochs at a small learning rate.
Why Fine-tuning Works:
Pre-training imbues BERT with rich linguistic knowledge—syntax, semantics, world knowledge, and reasoning patterns. Fine-tuning then adapts these general capabilities to the specific patterns of the target task. This is vastly more data-efficient than training from scratch.
| Task Type | Input Format | Output Layer | Uses |
|---|---|---|---|
| Single Sentence Classification | [CLS] sentence [SEP] | Linear([CLS]) → k classes | Sentiment, topic classification |
| Sentence Pair Classification | [CLS] sent_A [SEP] sent_B [SEP] | Linear([CLS]) → k classes | NLI, paraphrase detection |
| Token Classification | [CLS] tokens... [SEP] | Linear(each token) → k classes | NER, POS tagging |
| Extractive QA | [CLS] question [SEP] context [SEP] | Linear(each token) → start/end | SQuAD-style QA |
Fine-tuning Hyperparameters:
BERT's authors provide recommended hyperparameters that work well across most tasks:
- Batch size: 16 or 32
- Learning rate (Adam): 5e-5, 3e-5, or 2e-5
- Number of epochs: 2-4
These conservative settings prevent overwriting the valuable pre-trained representations while allowing adaptation to the target task.
```python
import torch
import torch.nn as nn
from transformers import BertModel
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR


class BERTForSequenceClassification(nn.Module):
    """
    BERT fine-tuned for sequence classification tasks.
    Uses [CLS] token representation for classification.
    """
    def __init__(self, model_name: str, num_labels: int, dropout: float = 0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )
        # Use [CLS] token representation (first token)
        cls_output = outputs.last_hidden_state[:, 0, :]
        cls_output = self.dropout(cls_output)
        return self.classifier(cls_output)


class BERTForTokenClassification(nn.Module):
    """
    BERT fine-tuned for token classification (NER, POS tagging).
    Classifies each token independently.
    """
    def __init__(self, model_name: str, num_labels: int, dropout: float = 0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )
        # Classify each token
        sequence_output = self.dropout(outputs.last_hidden_state)
        return self.classifier(sequence_output)


class BERTForQuestionAnswering(nn.Module):
    """
    BERT fine-tuned for extractive question answering.
    Predicts start and end positions of answer span.
    """
    def __init__(self, model_name: str):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.qa_outputs = nn.Linear(self.bert.config.hidden_size, 2)  # start, end

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )
        logits = self.qa_outputs(outputs.last_hidden_state)
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)


def get_optimizer_and_scheduler(model, lr=2e-5, warmup_ratio=0.1, total_steps=1000):
    """
    Standard BERT fine-tuning optimizer setup.
    Uses AdamW with weight decay and linear warmup.
    """
    # Separate weight decay for different parameter types
    no_decay = ['bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {
            'params': [p for n, p in model.named_parameters()
                       if not any(nd in n for nd in no_decay)],
            'weight_decay': 0.01
        },
        {
            'params': [p for n, p in model.named_parameters()
                       if any(nd in n for nd in no_decay)],
            'weight_decay': 0.0
        }
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=lr)
    warmup_steps = int(total_steps * warmup_ratio)
    scheduler = LinearLR(
        optimizer,
        start_factor=0.01,
        total_iters=warmup_steps
    )
    return optimizer, scheduler
```

Fine-tuning BERT-Base on a classification task typically requires only 2-4 hours on a single GPU, even for datasets with 100K+ examples. This represents a dramatic reduction from the weeks of training required for task-specific models before BERT.
Following BERT's release, researchers identified numerous ways to improve upon the original design. Three particularly influential variants emerged, each addressing different aspects of BERT's limitations.
Facebook's RoBERTa (2019) demonstrated that BERT was significantly undertrained and that careful optimization of pre-training could yield substantial improvements.
Key Changes from BERT:
- Removes the NSP objective entirely
- Dynamic masking: the mask pattern is re-sampled each time a sequence is seen, rather than fixed once during preprocessing
- ~10x more training data (~160GB of text) and much longer training
- Much larger batches (up to 8K sequences)
- Byte-level BPE tokenizer with a ~50K vocabulary
RoBERTa's success demonstrated that architecture innovations aren't the only path to improvement—careful engineering and sufficient training compute matter enormously. RoBERTa uses the exact same architecture as BERT-Large but consistently outperforms it through better training.
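One of RoBERTa's changes, dynamic masking, re-samples the 80-10-10 pattern every time an example is served instead of freezing it during preprocessing. A pure-Python sketch (illustrative helper; the vocabulary size is an assumption):

```python
import random

def dynamic_mask(token_ids, mask_id, vocab_size=30522, mask_prob=0.15, rng=None):
    """Sample a fresh 80-10-10 MLM pattern; called each time a batch is drawn."""
    rng = rng or random.Random()
    out, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok                         # predict the original token here
            r = rng.random()
            if r < 0.8:
                out[i] = mask_id                    # 80%: [MASK]
            elif r < 0.9:
                out[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token
    return out, labels
```

Because the pattern differs on every epoch, the model sees many maskings of each sequence over a long training run, which matters more as total training is scaled up.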
Google's ALBERT (2019) addressed BERT's massive parameter count with parameter efficiency techniques, enabling larger models with fewer parameters.
Key Innovations:
- Factorized embedding parameterization: the vocabulary is first embedded into a small dimension E and then projected to the hidden size H, reducing embedding parameters from V×H to V×E + E×H
- Cross-layer parameter sharing: all encoder layers share a single set of weights, so depth no longer multiplies the parameter count
- Sentence Order Prediction (SOP): replaces NSP with predicting whether two consecutive segments appear in the correct order—a harder task that targets discourse coherence
| Model | Parameters | Layers | Hidden Size | Embedding Size |
|---|---|---|---|---|
| BERT-Large | 340M | 24 | 1024 | 1024 |
| ALBERT-Base | 12M | 12 | 768 | 128 |
| ALBERT-Large | 18M | 24 | 1024 | 128 |
| ALBERT-xxlarge | 235M | 12 | 4096 | 128 |
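The effect of the factorized embedding is easy to quantify. A quick sketch; `embedding_params` is an illustrative helper, and the ~30K vocabulary size is approximate:

```python
def embedding_params(vocab_size, hidden_size, embed_size=None):
    """Embedding-table parameters: direct V*H (BERT) vs factorized V*E + E*H (ALBERT)."""
    if embed_size is None:
        return vocab_size * hidden_size                     # one big V x H table
    return vocab_size * embed_size + embed_size * hidden_size  # V x E table + E x H projection

bert_style = embedding_params(30000, 1024)          # 30,720,000
albert_style = embedding_params(30000, 1024, 128)   # 3,971,072
```

At H=1024 the factorization cuts embedding parameters nearly 8x, which is why ALBERT-Large (table above) fits in 18M parameters despite matching BERT-Large's depth and hidden size.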
Hugging Face's DistilBERT (2019) used knowledge distillation to create a smaller, faster model that retains most of BERT's capabilities.
Knowledge Distillation Process:
1. A trained teacher (BERT-Base) produces "soft" probability distributions over the vocabulary for each training example.
2. A smaller student (6 layers instead of 12, initialized from alternating teacher layers) is trained to match these soft targets, combined with the standard MLM loss and a cosine loss aligning hidden states.
3. The resulting model is ~40% smaller and ~60% faster at inference while retaining ~97% of BERT's GLUE performance.
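DistilBERT-style training blends a soft-target term (matching the teacher's temperature-softened distribution) with the ordinary hard-label loss. A minimal sketch; the real recipe also adds a cosine hidden-state alignment term, and the temperature and mixing weight shown are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL (temperature T) with hard cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients are comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

The temperature spreads probability mass over the teacher's non-argmax classes, so the student learns which wrong answers the teacher considers plausible—information a one-hot label discards.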
Use RoBERTa when you need maximum accuracy and have compute budget. Use ALBERT when memory is constrained but training time is flexible. Use DistilBERT for production deployment where inference speed matters and you can accept slight accuracy trade-offs.
A rich research literature has emerged analyzing what linguistic knowledge BERT acquires during pre-training. Understanding these learned capabilities helps practitioners leverage BERT effectively.
Syntax and Grammar:
BERT develops strong syntactic representations. Probing experiments show:
- Syntactic parse-tree structure can be recovered from BERT's embeddings with simple linear probes
- Specific attention heads track dependency relations such as subject-verb and verb-object links
- BERT handles long-distance subject-verb agreement well, even across intervening clauses
Semantics and World Knowledge:
BERT encodes significant factual knowledge:
- Fill-in-the-blank probes (e.g., "Paris is the capital of [MASK]") show that BERT recalls many relational facts absorbed during pre-training
- Contextual embeddings separate word senses: distinct senses of the same word form distinct clusters
- Entity types and semantic roles are linearly decodable from upper-layer representations
The Layer Hierarchy:
Different BERT layers capture different linguistic phenomena:
| Layer Range | Information Captured | Probing Evidence |
|---|---|---|
| Layers 1-4 | Surface features, word identity | High performance on word identity probes |
| Layers 5-8 | Syntax, phrase structure | Best for dependency parsing, POS tagging |
| Layers 9-12 | Semantics, task-specific | Best for NER, semantic role labeling |
Attention Head Analysis:
Researchers have identified specific attention heads with interpretable behaviors:
- Heads that attend almost exclusively to the previous or next token
- Heads that track syntactic relations, such as direct objects attending to their verbs
- Heads that link pronouns to likely antecedents, approximating coreference
However, not all heads are equally useful. Many heads appear redundant, which is why pruning approaches like head pruning work without significant performance loss.
BERT isn't perfect. It struggles with: (1) Numerical reasoning and arithmetic, (2) Factual knowledge verification (can confidently produce incorrect facts), (3) Multi-hop reasoning requiring multiple inference steps, (4) Tasks requiring external knowledge not in the training data. Understanding these limitations is crucial for appropriate application.
The Embedding Space:
BERT's embeddings exhibit interesting geometric properties:
Anisotropy: BERT embeddings occupy a narrow cone in the vector space rather than being uniformly distributed. This can affect similarity calculations.
Context sensitivity: The same word in different contexts produces dramatically different vectors. "Bank" in "river bank" vs "bank account" produces distinct representations.
Subspace structure: Semantic categories often form linearly separable subspaces, enabling simple classifiers on frozen BERT features.
BERT has found applications across the entire spectrum of NLP tasks. Here we examine common usage patterns and best practices for practitioners.
Text Classification:
For sentiment analysis, intent detection, or topic classification:
Use the final [CLS] token representation, followed by dropout and a linear classification head.
```python
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2  # Binary classification
)

def classify_sentiment(text: str) -> dict:
    """Classify sentiment of input text."""
    # Tokenize input
    inputs = tokenizer(
        text,
        return_tensors='pt',
        truncation=True,
        max_length=512,
        padding=True
    )
    # Get predictions
    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1)
    return {
        'negative': probs[0][0].item(),
        'positive': probs[0][1].item(),
        'prediction': 'positive' if probs[0][1] > 0.5 else 'negative'
    }

# Example usage
result = classify_sentiment("This movie was absolutely fantastic!")
print(f"Sentiment: {result['prediction']} "
      f"(confidence: {max(result['positive'], result['negative']):.2%})")
```

Named Entity Recognition (NER):
For token-level classification tasks, classify each token's final hidden state independently. Because WordPiece splits words into subwords, labels are usually assigned to the first subword of each word and the remaining subwords are ignored in the loss.
Question Answering:
For extractive QA (finding answer spans in context):
Format the input as [CLS] question [SEP] context [SEP] and predict the start and end positions of the answer span over the context tokens.

Semantic Similarity and Retrieval:
BERT can be used for similarity, but with caveats:
Raw [CLS] embedding similarity often underperforms even simple baselines; mean-pooling the token embeddings, or using a model fine-tuned for similarity such as Sentence-BERT, works substantially better.

You now have a comprehensive understanding of BERT—from its bidirectional architecture and pre-training objectives to fine-tuning methodology and practical applications. BERT established the pre-train/fine-tune paradigm that dominates modern NLP. Next, we'll explore GPT, which takes a fundamentally different approach: unidirectional, autoregressive language modeling for generation.