In October 2018, Google AI released a paper that would fundamentally reshape natural language processing: *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*, introducing Bidirectional Encoder Representations from Transformers. Within months of its release, BERT shattered records on virtually every major NLP benchmark, achieving state-of-the-art results on 11 tasks simultaneously.
BERT wasn't just an incremental improvement—it represented a paradigm shift. Before BERT, NLP models were typically trained from scratch for each task, requiring task-specific architectures and large labeled datasets. BERT introduced a new methodology: pre-train deeply on massive unlabeled text, then fine-tune briefly on any downstream task. This transfer learning approach democratized high-performance NLP, enabling practitioners to achieve near-state-of-the-art results with modest computational resources and limited labeled data.
By the end of this page, you will understand BERT's architecture in complete detail—from its bidirectional attention mechanism to its pre-training objectives. You'll learn how BERT processes text, why bidirectionality matters, how fine-tuning works, and how subsequent variants like RoBERTa, ALBERT, and DistilBERT improved upon the original. You'll be equipped to apply BERT effectively and understand its limitations.
The Pre-BERT Landscape:
To appreciate BERT's contribution, we must understand what came before. Prior to BERT, the dominant approaches to NLP fell into several categories:
Feature-based approaches: Using pre-trained word embeddings (Word2Vec, GloVe) as features for downstream task-specific models. These embeddings were context-independent—the word "bank" had the same representation whether referring to a river bank or financial institution.
ELMo (Contextualized Embeddings): A breakthrough in 2018, ELMo used bidirectional LSTMs to generate context-dependent embeddings. However, the bidirectionality was shallow—forward and backward LSTMs were trained separately and combined only at the final layer.
Unidirectional Language Models: OpenAI's GPT (2018) used the transformer architecture with unidirectional (left-to-right) attention for language modeling. While powerful, the unidirectional constraint limited the model's ability to understand context from both directions.
BERT's innovation was achieving deep bidirectionality—every layer jointly conditions on both left and right context, creating richer representations than any previous approach.
BERT uses only the encoder stack from the original transformer architecture. Unlike the full encoder-decoder transformer (designed for sequence-to-sequence tasks like translation), BERT's encoder-only design is optimized for understanding and representing text, not generating it.
Why Encoder-Only?
The encoder's self-attention mechanism allows every token to attend to every other token—perfect for tasks requiring understanding of the entire input sequence. The decoder, designed for autoregressive generation, uses masked self-attention that prevents tokens from attending to future positions. BERT deliberately removes this constraint to achieve full bidirectionality.
| Configuration | Layers (L) | Hidden Size (H) | Attention Heads (A) | Parameters |
|---|---|---|---|---|
| BERT-Base | 12 | 768 | 12 | 110M |
| BERT-Large | 24 | 1024 | 16 | 340M |
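As a sanity check on these figures, the totals can be approximated from L and H alone. The following is a back-of-the-envelope sketch; `bert_param_count` is an illustrative helper that lumps biases and layer norms into rough terms, not an exact accounting:

```python
def bert_param_count(L, H, vocab_size=30522, max_pos=512, num_segments=2, ffn_mult=4):
    """Rough parameter count for a BERT-style encoder."""
    embeddings = (vocab_size + max_pos + num_segments) * H + 2 * H  # tables + embedding LayerNorm
    attention = 4 * (H * H + H)                      # Q, K, V, and output projections
    ffn = 2 * (ffn_mult * H * H) + ffn_mult * H + H  # two linear layers with biases
    norms = 2 * 2 * H                                # two LayerNorms per layer
    return embeddings + L * (attention + ffn + norms)

print(f"BERT-Base:  ~{bert_param_count(12, 768) / 1e6:.0f}M")   # ~109M
print(f"BERT-Large: ~{bert_param_count(24, 1024) / 1e6:.0f}M")  # ~334M
```

The estimates land within a few percent of the published 110M and 340M figures; most of BERT-Base's budget sits in the embedding table (~24M) and the twelve ~7M-parameter layers.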
Architectural Components:
Each layer in BERT consists of:
Multi-Head Self-Attention: Allows each token to gather information from all other tokens in the sequence. With H hidden units and A attention heads, each head operates on H/A dimensional subspaces.
Position-wise Feed-Forward Network: A two-layer fully connected network applied independently to each position:
$$\text{FFN}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2$$
The intermediate dimension is typically 4H (3072 for BERT-Base, 4096 for BERT-Large). Note that BERT replaces the original transformer's ReLU activation with GELU.
Residual Connections and Layer Normalization: each sub-layer is wrapped in a residual connection followed by layer normalization:
$$\text{output} = \text{LayerNorm}(x + \text{SubLayer}(x))$$
Research has shown that different layers in BERT capture different linguistic phenomena. Lower layers tend to capture surface-level features (parts of speech, syntax), middle layers capture syntactic relationships, and upper layers capture more semantic and task-specific information. This hierarchical representation emerges naturally from pre-training.
```python
import torch
import torch.nn as nn


class BERTEmbedding(nn.Module):
    """
    BERT Embedding layer combines three types of embeddings:
    1. Token embeddings (vocabulary)
    2. Segment embeddings (sentence A vs B)
    3. Position embeddings (absolute positions)
    """
    def __init__(self, vocab_size, hidden_size, max_seq_length, num_segments=2, dropout=0.1):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, hidden_size)
        self.segment_embedding = nn.Embedding(num_segments, hidden_size)
        self.position_embedding = nn.Embedding(max_seq_length, hidden_size)
        self.layer_norm = nn.LayerNorm(hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(dropout)

    def forward(self, token_ids, segment_ids):
        seq_length = token_ids.size(1)
        position_ids = torch.arange(seq_length, device=token_ids.device).unsqueeze(0)
        # Sum all three embedding types
        embeddings = (
            self.token_embedding(token_ids)
            + self.segment_embedding(segment_ids)
            + self.position_embedding(position_ids)
        )
        embeddings = self.layer_norm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings


class BERTAttention(nn.Module):
    """
    Multi-head self-attention with scaling and residual connection.
    """
    def __init__(self, hidden_size, num_heads, dropout=0.1):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.scale = self.head_dim ** -0.5
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)
        self.output = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attention_mask=None):
        batch_size, seq_length, hidden_size = x.shape
        # Linear projections and reshape for multi-head
        Q = self.query(x).view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
        K = self.key(x).view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
        V = self.value(x).view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention
        attention_scores = torch.matmul(Q, K.transpose(-2, -1)) * self.scale
        if attention_mask is not None:
            # attention_mask: [batch_size, 1, 1, seq_length]
            attention_scores = attention_scores + attention_mask
        attention_probs = torch.softmax(attention_scores, dim=-1)
        attention_probs = self.dropout(attention_probs)
        # Weighted sum and reshape
        context = torch.matmul(attention_probs, V)
        context = context.transpose(1, 2).contiguous().view(batch_size, seq_length, hidden_size)
        return self.output(context)


class BERTLayer(nn.Module):
    """
    Single BERT layer: Attention + FFN with residual connections and layer norm.
    """
    def __init__(self, hidden_size, num_heads, intermediate_size, dropout=0.1):
        super().__init__()
        self.attention = BERTAttention(hidden_size, num_heads, dropout)
        self.attention_norm = nn.LayerNorm(hidden_size, eps=1e-12)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.GELU(),  # BERT uses GELU activation
            nn.Linear(intermediate_size, hidden_size),
            nn.Dropout(dropout),
        )
        self.ffn_norm = nn.LayerNorm(hidden_size, eps=1e-12)

    def forward(self, x, attention_mask=None):
        # Self-attention with residual
        attention_output = self.attention(x, attention_mask)
        x = self.attention_norm(x + attention_output)
        # FFN with residual
        ffn_output = self.ffn(x)
        x = self.ffn_norm(x + ffn_output)
        return x
```

BERT's input representation is carefully designed to handle various NLP tasks within a unified framework. Understanding this representation is crucial for effective use of BERT.
WordPiece Tokenization:
BERT uses WordPiece tokenization, a subword algorithm that balances vocabulary size against the ability to handle rare and out-of-vocabulary words. The vocabulary is built by starting from individual characters and iteratively merging the symbol pair that most increases the likelihood of the training corpus; at tokenization time, each word is split greedily into the longest matching vocabulary pieces.
This approach handles morphological variations naturally: "playing" might become ["play", "##ing"], where "##" indicates a subword continuation.
Subword tokenization solves the vocabulary dilemma: word-level tokenization struggles with rare words ("OOV problem"), while character-level tokenization creates very long sequences and loses word-level semantics. Subword methods like WordPiece, BPE, and SentencePiece achieve an optimal balance, representing common words as single tokens while breaking rare words into meaningful subunits.
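The greedy longest-match-first segmentation WordPiece applies at inference time can be sketched in a few lines. This is a toy illustration with a tiny vocabulary; `wordpiece_tokenize` is a hypothetical helper, not the production tokenizer:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation into subword pieces."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, shrinking until a match
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a subword continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: fall back to the unknown token
        tokens.append(piece)
        start = end
    return tokens

print(wordpiece_tokenize("playing", {"play", "##ing"}))  # ['play', '##ing']
```

This reproduces the "playing" → ["play", "##ing"] example above; a word with no matching pieces collapses to `[UNK]`, which the real 30K-token vocabulary makes rare.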
Special Tokens:
BERT uses several special tokens with specific purposes:
[CLS]: Classification token, placed at the beginning of every input. Its final hidden state serves as the aggregate sequence representation for classification tasks.
[SEP]: Separator token, used to separate two sentences in pair tasks (e.g., question-answering, natural language inference). Also placed at the end of single-sentence inputs.
[MASK]: Masking token, used during pre-training to replace tokens that the model must predict.
[PAD]: Padding token, used to make all sequences in a batch the same length.
[UNK]: Unknown token, rarely used due to WordPiece's ability to handle most inputs.
| Embedding Type | Purpose | Details |
|---|---|---|
| Token Embedding | Maps vocabulary tokens to vectors | Vocabulary of 30,522 tokens → 768-dim vectors |
| Segment Embedding | Distinguishes sentence A from B | Two learned embeddings for pair tasks |
| Position Embedding | Encodes absolute position | Learned embeddings for positions 0-511 |
Input Construction Example:
For a sentence pair task like Natural Language Inference:
Premise: "A man is playing guitar."
Hypothesis: "Someone is making music."
The input sequence becomes:
```
Tokens:   [CLS] a  man is play ##ing guitar .  [SEP] someone is making music .  [SEP]
Segment:    0   0   0  0   0    0      0    0    0      1     1    1      1   1    1
Position:   0   1   2  3   4    5      6    7    8      9    10   11     12  13   14
```
The final input embedding is the sum of these three components:
$$E_{input} = E_{token} + E_{segment} + E_{position}$$
This embedding is then processed through BERT's encoder layers.
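The assembly of tokens, segment ids, and position ids can be sketched directly. A minimal illustration; `build_bert_input` is a hypothetical helper, and real pipelines also handle truncation and padding:

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Assemble tokens, segment ids, and position ids for single or pair inputs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segments = [0] * len(tokens)          # sentence A (and [CLS], first [SEP]) -> segment 0
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segments += [1] * (len(tokens_b) + 1)  # sentence B and final [SEP] -> segment 1
    positions = list(range(len(tokens)))  # absolute positions 0..n-1
    return tokens, segments, positions

tokens, segments, positions = build_bert_input(
    ["a", "man", "is", "play", "##ing", "guitar", "."],
    ["someone", "is", "making", "music", "."],
)
```

Running this on the NLI example reproduces the 15-token layout shown above: nine segment-0 positions followed by six segment-1 positions.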
BERT's position embeddings are learned and fixed at 512 positions, limiting input sequences to 512 tokens. This constraint has significant implications for tasks involving long documents. Various strategies address this: truncation, sliding window approaches, or using long-context variants like Longformer.
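A sliding-window split can be sketched as follows. The window and stride sizes are assumptions, and real pipelines re-insert `[CLS]`/`[SEP]` per window and merge the per-window predictions afterward:

```python
def sliding_windows(token_ids, max_tokens=510, stride=128):
    """Split a long token sequence into overlapping windows that each fit
    BERT's 512-token limit (510 content tokens + [CLS] and [SEP])."""
    if len(token_ids) <= max_tokens:
        return [token_ids]
    windows, start = [], 0
    while True:
        windows.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break
        start += max_tokens - stride  # advance, keeping `stride` tokens of overlap
    return windows

ws = sliding_windows(list(range(1000)))
print(len(ws), [len(w) for w in ws])
```

The overlap gives each token at least one window where it is not at the very edge, which matters for span tasks like extractive QA.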
BERT's pre-training uses two self-supervised objectives that together enable deep bidirectional understanding: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
The core innovation enabling BERT's bidirectionality is MLM. The challenge with standard language modeling is that bidirectional conditioning would allow words to "see themselves"—making prediction trivial. BERT solves this elegantly:
The MLM Procedure:
1. Randomly select 15% of the input tokens for prediction.
2. Of the selected tokens: replace 80% with [MASK], replace 10% with a random vocabulary token, and leave 10% unchanged.
3. Train the model to predict the original token at every selected position.
The 80-10-10 masking strategy addresses a subtle problem: during fine-tuning, the model never sees [MASK] tokens. If pre-training exclusively used [MASK], there would be a train-fine-tune mismatch. Random replacement and unchanged tokens help the model learn to predict in all contexts. The 80% [MASK] rate still ensures the model can't simply copy the input.
MLM Loss Function:
The MLM loss is standard cross-entropy over the masked positions:
$$\mathcal{L}_{MLM} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\backslash i})$$
where $\mathcal{M}$ is the set of masked positions and $x_{\backslash i}$ represents the sequence with position $i$ masked.
Interpreting MLM as Denoising:
MLM can be viewed as a denoising autoencoder. The input is corrupted (by masking), and the model learns to reconstruct the original. This forces the model to develop rich internal representations that capture linguistic patterns, syntax, and semantics.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple


def create_mlm_mask(
    input_ids: torch.Tensor,
    vocab_size: int,
    mask_token_id: int,
    special_token_ids: set,
    mask_prob: float = 0.15,
) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Create MLM masks following BERT's 80-10-10 strategy.

    Returns:
        masked_input_ids: Input with masking applied
        labels: Original tokens at masked positions (-100 elsewhere)
    """
    labels = input_ids.clone()
    # Probability matrix for masking
    probability_matrix = torch.full(input_ids.shape, mask_prob)
    # Don't mask special tokens
    special_tokens_mask = torch.tensor(
        [[1 if token_id in special_token_ids else 0 for token_id in seq]
         for seq in input_ids.tolist()],
        dtype=torch.bool,
    )
    probability_matrix.masked_fill_(special_tokens_mask, value=0.0)
    # Determine which tokens to mask
    masked_indices = torch.bernoulli(probability_matrix).bool()
    # Labels: -100 for non-masked (ignored in loss), original id for masked
    labels[~masked_indices] = -100
    # 80% of masked: replace with [MASK]
    indices_replaced = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked_indices
    input_ids[indices_replaced] = mask_token_id
    # 10% of masked: replace with random token
    indices_random = (
        torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool()
        & masked_indices
        & ~indices_replaced
    )
    random_words = torch.randint(vocab_size, input_ids.shape, dtype=torch.long)
    input_ids[indices_random] = random_words[indices_random]
    # Remaining 10%: keep original (already done since we cloned)
    return input_ids, labels


class MLMHead(nn.Module):
    """
    Masked Language Modeling head.
    Predicts original tokens from final hidden states.
    """
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.GELU()
        self.layer_norm = nn.LayerNorm(hidden_size, eps=1e-12)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        x = self.dense(hidden_states)
        x = self.activation(x)
        x = self.layer_norm(x)
        return self.decoder(x)


def compute_mlm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Compute MLM loss with -100 label masking."""
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
```

The second pre-training objective teaches BERT to understand relationships between sentences—crucial for tasks like question-answering and natural language inference.
The NSP Procedure:
1. For each training example, pick two segments A and B: 50% of the time B is the segment that actually follows A (label IsNext); 50% of the time B is a random segment from the corpus (label NotNext).
2. The final hidden state of the [CLS] token is passed to a binary classifier that predicts the label.
NSP Loss Function:
$$\mathcal{L}_{NSP} = -[y \log P(\text{IsNext}) + (1-y) \log P(\text{NotNext})]$$
where $y = 1$ if B follows A, else $y = 0$.
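The pair construction behind NSP can be sketched as follows (an illustrative helper, not BERT's actual preprocessing code):

```python
import random

def make_nsp_example(doc_sentences, corpus_sentences, rng=None):
    """Build one NSP training pair: 50% actual next sentence (label 1),
    50% random sentence from elsewhere in the corpus (label 0)."""
    rng = rng or random.Random()
    i = rng.randrange(len(doc_sentences) - 1)
    sent_a = doc_sentences[i]
    if rng.random() < 0.5:
        return sent_a, doc_sentences[i + 1], 1     # IsNext
    return sent_a, rng.choice(corpus_sentences), 0  # NotNext
```

Because the negative is drawn from anywhere in the corpus, it often differs in topic—part of why the task can be solved without true discourse understanding, as the RoBERTa critique below notes.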
Later research (particularly RoBERTa) showed that NSP might not be as beneficial as originally thought. The binary classification task may be too easy—the model can often distinguish sentences by topic alone without learning true discourse coherence. RoBERTa removes NSP entirely with no performance loss, and many subsequent models have followed suit.
Total Pre-training Objective:
BERT's final pre-training loss combines both objectives:
$$\mathcal{L}_{total} = \mathcal{L}_{MLM} + \mathcal{L}_{NSP}$$
Pre-training Data and Compute:
BERT was pre-trained on:
- BooksCorpus (~800 million words)
- English Wikipedia (~2.5 billion words, text passages only)
Total: ~3.3 billion words
Training took 4 days on specialized hardware: BERT-Base on 4 Cloud TPUs (16 TPU chips) and BERT-Large on 16 Cloud TPUs (64 chips).
BERT's most significant practical innovation is its fine-tuning paradigm: a pre-trained model can be adapted to virtually any NLP task with minimal architectural modification and relatively little task-specific training data.
The Fine-tuning Procedure:
1. Initialize the model with the pre-trained BERT weights.
2. Add a minimal task-specific output layer (often a single linear layer).
3. Train all parameters end-to-end on the labeled task data, typically for just 2-4 epochs at a small learning rate.
Why Fine-tuning Works:
Pre-training imbues BERT with rich linguistic knowledge—syntax, semantics, world knowledge, and reasoning patterns. Fine-tuning then adapts these general capabilities to the specific patterns of the target task. This is vastly more data-efficient than training from scratch.
| Task Type | Input Format | Output Layer | Uses |
|---|---|---|---|
| Single Sentence Classification | [CLS] sentence [SEP] | Linear([CLS]) → k classes | Sentiment, topic classification |
| Sentence Pair Classification | [CLS] sent_A [SEP] sent_B [SEP] | Linear([CLS]) → k classes | NLI, paraphrase detection |
| Token Classification | [CLS] tokens... [SEP] | Linear(each token) → k classes | NER, POS tagging |
| Extractive QA | [CLS] question [SEP] context [SEP] | Linear(each token) → start/end | SQuAD-style QA |
Fine-tuning Hyperparameters:
BERT's authors provide recommended hyperparameters that work well across most tasks:
- Batch size: 16 or 32
- Learning rate (Adam): 5e-5, 3e-5, or 2e-5
- Number of epochs: 2-4
These conservative settings prevent overwriting the valuable pre-trained representations while allowing adaptation to the target task.
```python
import torch
import torch.nn as nn
from transformers import BertModel
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR


class BERTForSequenceClassification(nn.Module):
    """
    BERT fine-tuned for sequence classification tasks.
    Uses [CLS] token representation for classification.
    """
    def __init__(self, model_name: str, num_labels: int, dropout: float = 0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )
        # Use [CLS] token representation (first token)
        cls_output = outputs.last_hidden_state[:, 0, :]
        cls_output = self.dropout(cls_output)
        return self.classifier(cls_output)


class BERTForTokenClassification(nn.Module):
    """
    BERT fine-tuned for token classification (NER, POS tagging).
    Classifies each token independently.
    """
    def __init__(self, model_name: str, num_labels: int, dropout: float = 0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )
        # Classify each token
        sequence_output = self.dropout(outputs.last_hidden_state)
        return self.classifier(sequence_output)


class BERTForQuestionAnswering(nn.Module):
    """
    BERT fine-tuned for extractive question answering.
    Predicts start and end positions of answer span.
    """
    def __init__(self, model_name: str):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.qa_outputs = nn.Linear(self.bert.config.hidden_size, 2)  # start, end

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )
        logits = self.qa_outputs(outputs.last_hidden_state)
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)


def get_optimizer_and_scheduler(model, lr=2e-5, warmup_ratio=0.1, total_steps=1000):
    """
    Standard BERT fine-tuning optimizer setup.
    Uses AdamW with weight decay and linear warmup.
    """
    # Separate weight decay for different parameter types
    no_decay = ['bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {
            'params': [p for n, p in model.named_parameters()
                       if not any(nd in n for nd in no_decay)],
            'weight_decay': 0.01
        },
        {
            'params': [p for n, p in model.named_parameters()
                       if any(nd in n for nd in no_decay)],
            'weight_decay': 0.0
        }
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=lr)
    warmup_steps = int(total_steps * warmup_ratio)
    scheduler = LinearLR(
        optimizer,
        start_factor=0.01,
        total_iters=warmup_steps
    )
    return optimizer, scheduler
```

Fine-tuning BERT-Base on a classification task typically requires only 2-4 hours on a single GPU, even for datasets with 100K+ examples. This represents a dramatic reduction from the weeks of training required for task-specific models before BERT.
Following BERT's release, researchers identified numerous ways to improve upon the original design. Three particularly influential variants emerged, each addressing different aspects of BERT's limitations.
Facebook's RoBERTa (2019) demonstrated that BERT was significantly undertrained and that careful optimization of pre-training could yield substantial improvements.
Key Changes from BERT:
- Removes the NSP objective entirely
- Dynamic masking: the mask pattern is re-sampled each time a sequence is seen, rather than fixed once during preprocessing
- ~10x more training data (~160GB of text) and much longer training
- Much larger batches (up to 8K sequences)
- Byte-level BPE tokenizer with a ~50K vocabulary
RoBERTa's success demonstrated that architecture innovations aren't the only path to improvement—careful engineering and sufficient training compute matter enormously. RoBERTa uses the exact same architecture as BERT-Large but consistently outperforms it through better training.
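One of RoBERTa's changes, dynamic masking, re-samples the 80-10-10 pattern every time an example is served instead of freezing it during preprocessing. A pure-Python sketch (illustrative helper; the vocabulary size is an assumption):

```python
import random

def dynamic_mask(token_ids, mask_id, vocab_size=30522, mask_prob=0.15, rng=None):
    """Sample a fresh 80-10-10 MLM pattern; called each time a batch is drawn."""
    rng = rng or random.Random()
    out, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok                         # predict the original token here
            r = rng.random()
            if r < 0.8:
                out[i] = mask_id                    # 80%: [MASK]
            elif r < 0.9:
                out[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token
    return out, labels
```

Because the pattern differs on every epoch, the model sees many maskings of each sequence over a long training run, which matters more as total training is scaled up.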
Google's ALBERT (2019) addressed BERT's massive parameter count with parameter efficiency techniques, enabling larger models with fewer parameters.
Key Innovations:
- Factorized embedding parameterization: the vocabulary is first embedded into a small dimension E and then projected to the hidden size H, reducing embedding parameters from V×H to V×E + E×H
- Cross-layer parameter sharing: all encoder layers share a single set of weights, so depth no longer multiplies the parameter count
- Sentence Order Prediction (SOP): replaces NSP with predicting whether two consecutive segments appear in the correct order—a harder task that targets discourse coherence
| Model | Parameters | Layers | Hidden Size | Embedding Size |
|---|---|---|---|---|
| BERT-Large | 340M | 24 | 1024 | 1024 |
| ALBERT-Base | 12M | 12 | 768 | 128 |
| ALBERT-Large | 18M | 24 | 1024 | 128 |
| ALBERT-xxlarge | 235M | 12 | 4096 | 128 |
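The effect of the factorized embedding is easy to quantify. A quick sketch; `embedding_params` is an illustrative helper, and the ~30K vocabulary size is approximate:

```python
def embedding_params(vocab_size, hidden_size, embed_size=None):
    """Embedding-table parameters: direct V*H (BERT) vs factorized V*E + E*H (ALBERT)."""
    if embed_size is None:
        return vocab_size * hidden_size                     # one big V x H table
    return vocab_size * embed_size + embed_size * hidden_size  # V x E table + E x H projection

bert_style = embedding_params(30000, 1024)          # 30,720,000
albert_style = embedding_params(30000, 1024, 128)   # 3,971,072
```

At H=1024 the factorization cuts embedding parameters nearly 8x, which is why ALBERT-Large (table above) fits in 18M parameters despite matching BERT-Large's depth and hidden size.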
Hugging Face's DistilBERT (2019) used knowledge distillation to create a smaller, faster model that retains most of BERT's capabilities.
Knowledge Distillation Process:
1. A trained teacher (BERT-Base) produces "soft" probability distributions over the vocabulary for each training example.
2. A smaller student (6 layers instead of 12, initialized from alternating teacher layers) is trained to match these soft targets, combined with the standard MLM loss and a cosine loss aligning hidden states.
3. The resulting model is ~40% smaller and ~60% faster at inference while retaining ~97% of BERT's GLUE performance.
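DistilBERT-style training blends a soft-target term (matching the teacher's temperature-softened distribution) with the ordinary hard-label loss. A minimal sketch; the real recipe also adds a cosine hidden-state alignment term, and the temperature and mixing weight shown are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL (temperature T) with hard cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients are comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

The temperature spreads probability mass over the teacher's non-argmax classes, so the student learns which wrong answers the teacher considers plausible—information a one-hot label discards.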
Use RoBERTa when you need maximum accuracy and have compute budget. Use ALBERT when memory is constrained but training time is flexible. Use DistilBERT for production deployment where inference speed matters and you can accept slight accuracy trade-offs.
A rich research literature has emerged analyzing what linguistic knowledge BERT acquires during pre-training. Understanding these learned capabilities helps practitioners leverage BERT effectively.
Syntax and Grammar:
BERT develops strong syntactic representations. Probing experiments show:
- Syntactic parse-tree structure can be recovered from BERT's embeddings with simple linear probes
- Specific attention heads track dependency relations such as subject-verb and verb-object links
- BERT handles long-distance subject-verb agreement well, even across intervening clauses
Semantics and World Knowledge:
BERT encodes significant factual knowledge:
- Fill-in-the-blank probes (e.g., "Paris is the capital of [MASK]") show that BERT recalls many relational facts absorbed during pre-training
- Contextual embeddings separate word senses: distinct senses of the same word form distinct clusters
- Entity types and semantic roles are linearly decodable from upper-layer representations
The Layer Hierarchy:
Different BERT layers capture different linguistic phenomena:
| Layer Range | Information Captured | Probing Evidence |
|---|---|---|
| Layers 1-4 | Surface features, word identity | High performance on word identity probes |
| Layers 5-8 | Syntax, phrase structure | Best for dependency parsing, POS tagging |
| Layers 9-12 | Semantics, task-specific | Best for NER, semantic role labeling |
Attention Head Analysis:
Researchers have identified specific attention heads with interpretable behaviors:
- Heads that attend almost exclusively to the previous or next token
- Heads that track syntactic relations, such as direct objects attending to their verbs
- Heads that link pronouns to likely antecedents, approximating coreference
However, not all heads are equally useful. Many heads appear redundant, which is why pruning approaches like head pruning work without significant performance loss.
BERT isn't perfect. It struggles with: (1) Numerical reasoning and arithmetic, (2) Factual knowledge verification (can confidently produce incorrect facts), (3) Multi-hop reasoning requiring multiple inference steps, (4) Tasks requiring external knowledge not in the training data. Understanding these limitations is crucial for appropriate application.
The Embedding Space:
BERT's embeddings exhibit interesting geometric properties:
Anisotropy: BERT embeddings occupy a narrow cone in the vector space rather than being uniformly distributed. This can affect similarity calculations.
Context sensitivity: The same word in different contexts produces dramatically different vectors. "Bank" in "river bank" vs "bank account" produces distinct representations.
Subspace structure: Semantic categories often form linearly separable subspaces, enabling simple classifiers on frozen BERT features.
BERT has found applications across the entire spectrum of NLP tasks. Here we examine common usage patterns and best practices for practitioners.
Text Classification:
For sentiment analysis, intent detection, or topic classification:
Use the final [CLS] token representation, followed by dropout and a linear classification head.
```python
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2  # Binary classification
)

def classify_sentiment(text: str) -> dict:
    """Classify sentiment of input text."""
    # Tokenize input
    inputs = tokenizer(
        text,
        return_tensors='pt',
        truncation=True,
        max_length=512,
        padding=True
    )
    # Get predictions
    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1)
    return {
        'negative': probs[0][0].item(),
        'positive': probs[0][1].item(),
        'prediction': 'positive' if probs[0][1] > 0.5 else 'negative'
    }

# Example usage
result = classify_sentiment("This movie was absolutely fantastic!")
print(f"Sentiment: {result['prediction']} "
      f"(confidence: {max(result['positive'], result['negative']):.2%})")
```

Named Entity Recognition (NER):
For token-level classification tasks, classify each token's final hidden state independently. Because WordPiece splits words into subwords, labels are usually assigned to the first subword of each word and the remaining subwords are ignored in the loss.
Question Answering:
For extractive QA (finding answer spans in context):
Format the input as [CLS] question [SEP] context [SEP] and predict the start and end positions of the answer span over the context tokens.

Semantic Similarity and Retrieval:
BERT can be used for similarity, but with caveats:
Raw [CLS] embedding similarity often underperforms even simple baselines; mean-pooling the token embeddings, or using a model fine-tuned for similarity such as Sentence-BERT, works substantially better.

You now have a comprehensive understanding of BERT—from its bidirectional architecture and pre-training objectives to fine-tuning methodology and practical applications. BERT established the pre-train/fine-tune paradigm that dominates modern NLP. Next, we'll explore GPT, which takes a fundamentally different approach: unidirectional, autoregressive language modeling for generation.