Standard recurrent neural networks process sequences in a single direction—typically left-to-right for text or past-to-present for time series. At each timestep $t$, the hidden state $\mathbf{h}_t$ encodes information from all previous inputs $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_t$. This creates a fundamental limitation: the network cannot access future context when making predictions.
Consider the ambiguous sentence: "The bank is on the left." Does "bank" refer to a financial institution or a riverbank? A human reader resolves this ambiguity by reading the entire sentence—including words that come after "bank." A standard RNN, processing left-to-right, must predict the meaning of "bank" before seeing "left."
Bidirectional RNNs (BiRNNs) solve this problem elegantly: they process the sequence in both directions simultaneously, combining forward and backward context to produce richer representations at each position.
By the end of this page, you will understand the mathematical formulation of bidirectional RNNs, their gradient flow properties, when and how to apply them effectively, and the critical distinction between bidirectional training and autoregressive generation.
To appreciate bidirectional processing, we must first rigorously understand what information unidirectional RNNs capture and what they miss.
Unidirectional Hidden States
In a standard forward RNN, the hidden state at timestep $t$ is computed as:
$$\mathbf{h}_t = f(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b}_h)$$
The crucial observation is that $\mathbf{h}_t$ depends only on $\mathbf{h}_{t-1}$, which in turn depends on $\mathbf{h}_{t-2}$, and so on. By induction, $\mathbf{h}_t$ encodes information from the subsequence $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_t$.
What $\mathbf{h}_t$ cannot encode: Any information from $\mathbf{x}_{t+1}, \mathbf{x}_{t+2}, \ldots, \mathbf{x}_T$.
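This dependence is easy to see in code. The following is a minimal sketch of the forward recurrence, assuming a vanilla tanh cell and purely illustrative dimensions; note that the loop never reads an input beyond index $t$.

```python
import torch

# Illustrative dimensions (not from the text)
d_in, d, T = 8, 16, 5
W_xh = torch.randn(d, d_in) * 0.1
W_hh = torch.randn(d, d) * 0.1
b_h = torch.zeros(d)

x = torch.randn(T, d_in)   # the input sequence x_1, ..., x_T
h = torch.zeros(d)         # h_0 = 0

hidden_states = []
for t in range(T):
    # h_t is computed from h_{t-1} and x_t only; nothing after index t
    # has been read, so h_t cannot encode x_{t+1}, ..., x_T.
    h = torch.tanh(W_hh @ h + W_xh @ x[t] + b_h)
    hidden_states.append(h)
```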
Unidirectional RNNs can generate sequences autoregressively (one token at a time, conditioned on previously generated tokens). Bidirectional RNNs cannot—they require the full sequence upfront. This is not a flaw but a design choice suited to different tasks.
Tasks Where Future Context Matters
Many sequence labeling tasks require context from both directions:
| Task | Why Future Context Helps |
|---|---|
| Named Entity Recognition | "Apple announced..." vs "Apple orchards..." |
| Part-of-Speech Tagging | "Time flies like an arrow" vs "Fruit flies like a banana" |
| Speech Recognition | Phoneme disambiguation from surrounding context |
| Sentiment Analysis | "Not good" vs "Not good, but great" |
| Machine Translation | Word alignment depends on full source sentence |
| Slot Filling | "Book a flight to [DEST]" pattern recognition |
In each case, making optimal predictions at position $t$ requires access to both past ($t' < t$) and future ($t' > t$) context.
The bidirectional RNN architecture consists of two independent recurrent networks:
Forward RNN ($\overrightarrow{\text{RNN}}$): Processes the sequence left-to-right, producing hidden states $\overrightarrow{\mathbf{h}}_1, \overrightarrow{\mathbf{h}}_2, \ldots, \overrightarrow{\mathbf{h}}_T$
Backward RNN ($\overleftarrow{\text{RNN}}$): Processes the sequence right-to-left, producing hidden states $\overleftarrow{\mathbf{h}}_T, \overleftarrow{\mathbf{h}}_{T-1}, \ldots, \overleftarrow{\mathbf{h}}_1$
At each timestep $t$, the bidirectional representation combines both:
$$\mathbf{h}_t^{\text{bi}} = [\overrightarrow{\mathbf{h}}_t; \overleftarrow{\mathbf{h}}_t]$$
where $[\cdot; \cdot]$ denotes concatenation along the feature dimension.
Key Architectural Properties
Parameter Independence: The forward and backward RNNs have completely separate parameters. They do not share weights.
Doubled Dimensionality: If each directional hidden state has dimension $d$, the bidirectional hidden state has dimension $2d$.
Full Context at Every Position: Each $\mathbf{h}_t^{\text{bi}}$ encodes information from the entire sequence—both preceding and following context.
Parallel Processing: While there's sequential dependency within each direction, the forward and backward passes can be computed in parallel on modern hardware.
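A quick shape check in PyTorch (with illustrative dimensions) confirms the doubled feature dimension and the one-final-state-per-direction layout:

```python
import torch
import torch.nn as nn

d_in, d = 32, 64
rnn = nn.LSTM(input_size=d_in, hidden_size=d, batch_first=True, bidirectional=True)

x = torch.randn(4, 10, d_in)       # [batch=4, seq_len=10, d_in]
h_bi, (h_n, c_n) = rnn(x)

print(h_bi.shape)  # torch.Size([4, 10, 128]) -> 2*d features at every position
print(h_n.shape)   # torch.Size([2, 4, 64])   -> one final state per direction
```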
Let us formally define the bidirectional RNN for a sequence $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$ where each $\mathbf{x}_t \in \mathbb{R}^{d_{\text{in}}}$.
Forward Pass (Left-to-Right)
$$\overrightarrow{\mathbf{h}}_t = f\left(\overrightarrow{\mathbf{W}}_{hh} \overrightarrow{\mathbf{h}}_{t-1} + \overrightarrow{\mathbf{W}}_{xh} \mathbf{x}_t + \overrightarrow{\mathbf{b}}_h\right)$$
with initial state $\overrightarrow{\mathbf{h}}_0 = \mathbf{0}$ (or learned initialization).
Backward Pass (Right-to-Left)
$$\overleftarrow{\mathbf{h}}_t = f\left(\overleftarrow{\mathbf{W}}_{hh} \overleftarrow{\mathbf{h}}_{t+1} + \overleftarrow{\mathbf{W}}_{xh} \mathbf{x}_t + \overleftarrow{\mathbf{b}}_h\right)$$
with initial state $\overleftarrow{\mathbf{h}}_{T+1} = \mathbf{0}$ (or learned initialization).
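The two recurrences can be written out directly. The sketch below assumes a vanilla tanh cell with hypothetical parameter names (`Wf_*` for the forward direction, `Wb_*` for the backward direction); library layers such as `nn.LSTM(bidirectional=True)` perform the same bookkeeping internally.

```python
import torch

def birnn_forward(x, Wf_xh, Wf_hh, bf, Wb_xh, Wb_hh, bb):
    """Manual bidirectional pass for a vanilla tanh cell (illustrative sketch)."""
    T = x.shape[0]
    d = Wf_hh.shape[0]

    # Forward pass: t = 1 .. T
    h_fwd, h = [], torch.zeros(d)
    for t in range(T):
        h = torch.tanh(Wf_hh @ h + Wf_xh @ x[t] + bf)
        h_fwd.append(h)

    # Backward pass: t = T .. 1
    h_bwd, h = [None] * T, torch.zeros(d)
    for t in reversed(range(T)):
        h = torch.tanh(Wb_hh @ h + Wb_xh @ x[t] + bb)
        h_bwd[t] = h

    # Concatenate per-position states: [T, 2d]
    return torch.stack([torch.cat([f, b]) for f, b in zip(h_fwd, h_bwd)])

# Usage with random illustrative parameters
d_in, d, T = 8, 16, 5
x = torch.randn(T, d_in)
Wf_xh, Wf_hh, bf = torch.randn(d, d_in), torch.randn(d, d), torch.zeros(d)
Wb_xh, Wb_hh, bb = torch.randn(d, d_in), torch.randn(d, d), torch.zeros(d)
h_bi = birnn_forward(x, Wf_xh, Wf_hh, bf, Wb_xh, Wb_hh, bb)  # shape [5, 32]
```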
Parameter Count
| Component | Parameters |
|---|---|
| $\overrightarrow{\mathbf{W}}_{xh}$ | $d \times d_{\text{in}}$ |
| $\overrightarrow{\mathbf{W}}_{hh}$ | $d \times d$ |
| $\overrightarrow{\mathbf{b}}_h$ | $d$ |
| $\overleftarrow{\mathbf{W}}_{xh}$ | $d \times d_{\text{in}}$ |
| $\overleftarrow{\mathbf{W}}_{hh}$ | $d \times d$ |
| $\overleftarrow{\mathbf{b}}_h$ | $d$ |
| Total | $2(d^2 + d \cdot d_{\text{in}} + d)$ |
A bidirectional RNN has exactly twice the parameters of an equivalent unidirectional RNN. The forward and backward networks together learn complementary representations—one capturing left context, one capturing right context.
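As a sanity check, the count can be compared against PyTorch's `nn.RNN`. One caveat: PyTorch stores two bias vectors per direction (`bias_ih` and `bias_hh`), so its total exceeds the single-bias formula above by exactly $2d$. Dimensions below are illustrative.

```python
import torch.nn as nn

d_in, d = 100, 256
rnn = nn.RNN(input_size=d_in, hidden_size=d, bidirectional=True)

formula = 2 * (d * d + d * d_in + d)                       # single bias per direction
pytorch_total = sum(p.numel() for p in rnn.parameters())   # two biases per direction

print(formula)        # 182784
print(pytorch_total)  # 183296 = formula + 2*d (extra bias vectors)
```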
Combining Directional Hidden States
The most common combination strategy is concatenation:
$$\mathbf{h}_t^{\text{bi}} = [\overrightarrow{\mathbf{h}}_t; \overleftarrow{\mathbf{h}}_t] \in \mathbb{R}^{2d}$$
Other combination strategies include:
| Method | Formula | Resulting Dimension |
|---|---|---|
| Concatenation | $[\overrightarrow{\mathbf{h}}_t; \overleftarrow{\mathbf{h}}_t]$ | $2d$ |
| Summation | $\overrightarrow{\mathbf{h}}_t + \overleftarrow{\mathbf{h}}_t$ | $d$ |
| Average | $\frac{1}{2}(\overrightarrow{\mathbf{h}}_t + \overleftarrow{\mathbf{h}}_t)$ | $d$ |
| Element-wise Max | $\max(\overrightarrow{\mathbf{h}}_t, \overleftarrow{\mathbf{h}}_t)$ | $d$ |
| Gated Combination | $\mathbf{g} \odot \overrightarrow{\mathbf{h}}_t + (1-\mathbf{g}) \odot \overleftarrow{\mathbf{h}}_t$ | $d$ |
Concatenation is preferred in most cases because it preserves all information from both directions without information loss. Summation and averaging lose information but reduce dimensionality.
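Given a concatenated output `h_bi` (for example from `nn.LSTM(..., bidirectional=True)`), the alternatives in the table amount to splitting the feature dimension and recombining. A brief sketch with illustrative shapes:

```python
import torch

d = 64
h_bi = torch.randn(4, 10, 2 * d)              # [batch, seq_len, 2d]

h_fwd, h_bwd = h_bi[..., :d], h_bi[..., d:]   # split the feature dimension

concat  = torch.cat([h_fwd, h_bwd], dim=-1)   # 2d (identical to h_bi)
summed  = h_fwd + h_bwd                       # d
average = 0.5 * (h_fwd + h_bwd)               # d
elemmax = torch.maximum(h_fwd, h_bwd)         # d

# Gated combination with an illustrative (untrained) gate
g = torch.sigmoid(torch.randn(d))
gated = g * h_fwd + (1 - g) * h_bwd           # d
```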
Bidirectional RNNs exhibit interesting gradient flow properties during backpropagation through time (BPTT).
Forward RNN Gradient Flow
Gradients flow backward through time (from $t=T$ to $t=1$):
$$\frac{\partial \mathcal{L}}{\partial \overrightarrow{\mathbf{h}}_t} = \frac{\partial \mathcal{L}}{\partial \overrightarrow{\mathbf{h}}_{t+1}} \cdot \overrightarrow{\mathbf{W}}_{hh}^\top \cdot f'(\cdot)$$
Backward RNN Gradient Flow
Gradients flow forward through time (from $t=1$ to $t=T$):
$$\frac{\partial \mathcal{L}}{\partial \overleftarrow{\mathbf{h}}_t} = \frac{\partial \mathcal{L}}{\partial \overleftarrow{\mathbf{h}}_{t-1}} \cdot \overleftarrow{\mathbf{W}}_{hh}^\top \cdot f'(\cdot)$$
Key Insight: The two gradient paths are completely independent. Each direction faces the same vanishing/exploding gradient issues as standard RNNs. LSTM or GRU cells are commonly used within each direction to mitigate these problems.
In practice, bidirectional networks almost always use LSTM or GRU cells rather than vanilla RNN cells. This creates 'BiLSTM' or 'BiGRU' architectures that combine bidirectional context with stable gradient flow through gating mechanisms.
Computational Considerations
| Aspect | Implication |
|---|---|
| Forward Pass | Two independent passes (parallelizable on GPU) |
| Backward Pass | Two independent gradient computations |
| Memory | Must store hidden states for both directions across all timesteps |
| Latency | Cannot begin producing output until full sequence is available |
| Throughput | With parallelization, only ~1.1-1.3× slower than unidirectional |
Gradient Paths to Parameters
For any loss $\mathcal{L}$ that depends on the bidirectional representation $\mathbf{h}_t^{\text{bi}}$:
$$\frac{\partial \mathcal{L}}{\partial \overrightarrow{\mathbf{W}}_{hh}} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}}{\partial \overrightarrow{\mathbf{h}}_t} \cdot \frac{\partial \overrightarrow{\mathbf{h}}_t}{\partial \overrightarrow{\mathbf{W}}_{hh}}$$
$$\frac{\partial \mathcal{L}}{\partial \overleftarrow{\mathbf{W}}_{hh}} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}}{\partial \overleftarrow{\mathbf{h}}_t} \cdot \frac{\partial \overleftarrow{\mathbf{h}}_t}{\partial \overleftarrow{\mathbf{W}}_{hh}}$$
The forward and backward parameter gradients are computed independently and accumulated separately.
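This independence is easy to verify empirically. In the sketch below (illustrative dimensions), the loss depends only on the forward half of the output, so the reverse-direction weights receive exactly zero gradient:

```python
import torch
import torch.nn as nn

d_in, d = 16, 32
rnn = nn.LSTM(input_size=d_in, hidden_size=d, batch_first=True, bidirectional=True)

x = torch.randn(2, 7, d_in)
h_bi, _ = rnn(x)                    # [2, 7, 2d]

loss = h_bi[..., :d].sum()          # depends only on the forward direction
loss.backward()

print(rnn.weight_hh_l0.grad.abs().sum())          # nonzero
print(rnn.weight_hh_l0_reverse.grad.abs().sum())  # zero: no path to the loss
```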
The way bidirectional representations are used depends on the downstream task. Different tasks require different output strategies.
Sequence Labeling (Per-Position Output)
For tasks like NER, POS tagging, or slot filling, we need an output at every position:
$$\mathbf{y}_t = \text{softmax}(\mathbf{W}_y \mathbf{h}_t^{\text{bi}} + \mathbf{b}_y)$$
where $\mathbf{W}_y \in \mathbb{R}^{|V| \times 2d}$ projects the bidirectional hidden state to the output vocabulary.
```python
import torch
import torch.nn as nn


class BiRNNSequenceLabeler(nn.Module):
    """Bidirectional RNN for sequence labeling tasks."""

    def __init__(
        self,
        vocab_size: int,
        embed_dim: int,
        hidden_dim: int,
        num_labels: int,
        num_layers: int = 1,
        dropout: float = 0.1,
        cell_type: str = "lstm"
    ):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embed_dim)

        # Choose cell type
        rnn_class = nn.LSTM if cell_type == "lstm" else nn.GRU

        # Bidirectional RNN
        self.rnn = rnn_class(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,  # Key: enables bidirectionality
            dropout=dropout if num_layers > 1 else 0
        )

        # Output projection: 2*hidden_dim due to bidirectionality
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        input_ids: torch.Tensor,
        lengths: torch.Tensor = None
    ) -> torch.Tensor:
        """
        Args:
            input_ids: [batch, seq_len] - token indices
            lengths: [batch] - actual sequence lengths (for packing)

        Returns:
            logits: [batch, seq_len, num_labels] - per-position predictions
        """
        # Embed input tokens
        x = self.embedding(input_ids)  # [batch, seq_len, embed_dim]
        x = self.dropout(x)

        # Pack sequences if lengths provided (for efficiency)
        if lengths is not None:
            x = nn.utils.rnn.pack_padded_sequence(
                x, lengths.cpu(), batch_first=True, enforce_sorted=False
            )

        # Bidirectional RNN: h_bi has shape [batch, seq_len, 2*hidden_dim]
        h_bi, _ = self.rnn(x)

        # Unpack if packed
        if lengths is not None:
            h_bi, _ = nn.utils.rnn.pad_packed_sequence(h_bi, batch_first=True)

        h_bi = self.dropout(h_bi)

        # Per-position classification
        logits = self.classifier(h_bi)  # [batch, seq_len, num_labels]
        return logits
```

Sequence Classification (Single Output)
For tasks like sentiment analysis or document classification, we need a single representation for the entire sequence:
$$\mathbf{h}_{\text{final}} = f(\overrightarrow{\mathbf{h}}_T, \overleftarrow{\mathbf{h}}_1)$$
Common choices for $f$:
| Method | Formula | Captures |
|---|---|---|
| Last states only | $[\overrightarrow{\mathbf{h}}_T; \overleftarrow{\mathbf{h}}_1]$ | Sequence endpoints with full context |
| Max pooling | $\max_{t}(\mathbf{h}_t^{\text{bi}})$ | Strongest activations across sequence |
| Mean pooling | $\frac{1}{T}\sum_{t=1}^{T}\mathbf{h}_t^{\text{bi}}$ | Average representation |
| Attention pooling | $\sum_{t}\alpha_t \mathbf{h}_t^{\text{bi}}$ | Learned weighted combination |
In unidirectional RNNs, only the final hidden state captures the entire sequence. In bidirectional RNNs, the forward RNN's final state $\overrightarrow{\mathbf{h}}_T$ has read the whole sequence from position 1 to $T$, while the backward RNN's final state $\overleftarrow{\mathbf{h}}_1$ has read the whole sequence from position $T$ down to 1. Concatenating the two therefore summarizes the full sequence from both ends.
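The pooling strategies above reduce to a few tensor operations. The sketch below uses illustrative shapes, ignores padding masks for brevity, and introduces a hypothetical scoring vector `w` for attention pooling.

```python
import torch

batch, T, d = 4, 10, 64
h_bi = torch.randn(batch, T, 2 * d)   # bidirectional outputs [batch, seq_len, 2d]

# Last states: forward final state (position T-1) and backward final state (position 0)
last_states = torch.cat([h_bi[:, -1, :d], h_bi[:, 0, d:]], dim=-1)  # [batch, 2d]

max_pool  = h_bi.max(dim=1).values    # [batch, 2d]
mean_pool = h_bi.mean(dim=1)          # [batch, 2d]

# Attention pooling with a hypothetical (untrained) scoring vector w
w = torch.randn(2 * d)
alpha = torch.softmax(h_bi @ w, dim=1)               # [batch, seq_len]
attn_pool = (alpha.unsqueeze(-1) * h_bi).sum(dim=1)  # [batch, 2d]
```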
Bidirectional processing is powerful but not universally applicable. Understanding when to avoid BiRNNs is as important as knowing when to use them.
Autoregressive Generation
For tasks requiring sequential generation—where the model produces one token at a time, conditioned on previously generated tokens—bidirectional processing is fundamentally incompatible.
You CANNOT use bidirectional RNNs for autoregressive text generation. During generation, future tokens don't exist yet—there's nothing for the backward RNN to process. This is why GPT-style language models use unidirectional (causal) attention.
Tasks Requiring Unidirectional Processing
| Task | Why Bidirectional Fails |
|---|---|
| Language Modeling | Must predict next word from previous words only |
| Text Generation | Future tokens are unknown during decoding |
| Real-time Streaming | Cannot wait for future input |
| Causal Prediction | Future information causes data leakage |
| Time Series Forecasting | Future values are the prediction target |
Streaming and Real-Time Applications
BiRNNs require the complete sequence before producing any output. For applications that demand low-latency, streaming output, this is unacceptable. In such cases, use unidirectional models or a windowing strategy that trades some latency for partial right context.
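One such workaround, sketched below under simplifying assumptions (a fixed window size, a hypothetical `model` that maps a window of features to per-position logits, and stream edges ignored for brevity), is to run the BiRNN over overlapping chunks and keep only the central predictions, where both left and right context are available.

```python
import torch

def windowed_birnn_labels(model, stream, window: int = 64, stride: int = 32):
    """Run a BiRNN labeler over overlapping windows of a long input.

    `model` is assumed to map [1, window, d_in] -> [1, window, num_labels];
    only the central `stride` positions of each window are kept, since they
    have both left and right context. Edge positions are omitted here.
    """
    margin = (window - stride) // 2
    outputs = []
    for start in range(0, stream.size(0) - window + 1, stride):
        chunk = stream[start:start + window].unsqueeze(0)  # [1, window, d_in]
        logits = model(chunk)[0]                           # [window, num_labels]
        outputs.append(logits[margin:margin + stride])     # keep the center only
    return torch.cat(outputs, dim=0)
```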
Implementing bidirectional RNNs correctly requires attention to several practical details that significantly impact training and performance.
Handling Variable-Length Sequences
In batched training with sequences of different lengths, we must handle padding carefully to avoid the backward RNN starting from padding tokens.
```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class MaskedBiRNN(nn.Module):
    """
    Bidirectional RNN with proper handling of variable-length sequences.
    Uses packing/unpacking to ensure the backward RNN doesn't see padding.
    """

    def __init__(self, input_dim: int, hidden_dim: int, num_layers: int = 2):
        super().__init__()
        self.hidden_dim = hidden_dim

        self.rnn = nn.LSTM(
            input_size=input_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
            dropout=0.2
        )

    def forward(
        self,
        x: torch.Tensor,
        lengths: torch.Tensor
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Args:
            x: [batch, max_seq_len, input_dim] - padded sequences
            lengths: [batch] - actual lengths of each sequence

        Returns:
            output: [batch, max_seq_len, 2*hidden_dim] - bidirectional hidden states
            final: [batch, 2*hidden_dim] - final hidden state for classification
        """
        batch_size = x.size(0)

        # Sort by length (required for pack_padded_sequence)
        sorted_lengths, sort_idx = lengths.sort(descending=True)
        sorted_x = x[sort_idx]

        # Pack the sequences - CRUCIAL for correct backward RNN behavior
        packed = pack_padded_sequence(
            sorted_x, sorted_lengths.cpu(), batch_first=True
        )

        # Run bidirectional RNN
        packed_output, (h_n, c_n) = self.rnn(packed)

        # Unpack back to padded form
        output, _ = pad_packed_sequence(packed_output, batch_first=True)

        # Unsort to restore original order
        _, unsort_idx = sort_idx.sort()
        output = output[unsort_idx]

        # Get final hidden states
        # h_n shape: [num_layers*num_directions, batch, hidden_dim]
        # Reshape to separate layers and directions
        h_n = h_n.view(self.rnn.num_layers, 2, batch_size, self.hidden_dim)

        # Take last layer, both directions: [2, batch, hidden_dim]
        h_last = h_n[-1]  # Last layer

        # Concatenate forward and backward final states
        # Forward final: h_last[0] - last timestep of forward pass
        # Backward final: h_last[1] - first timestep of backward pass
        final = torch.cat([h_last[0], h_last[1]], dim=-1)  # [batch, 2*hidden_dim]
        final = final[unsort_idx]  # Restore original order

        return output, final

    def extract_final_states(
        self,
        h_bi: torch.Tensor,
        lengths: torch.Tensor
    ) -> torch.Tensor:
        """
        Alternative: Extract final states from padded output directly.
        Useful when you need per-sequence final states without packing.

        Args:
            h_bi: [batch, max_seq_len, 2*hidden_dim]
            lengths: [batch]

        Returns:
            final: [batch, 2*hidden_dim]
        """
        batch_size = h_bi.size(0)
        device = h_bi.device

        # Split into forward and backward
        forward = h_bi[:, :, :self.hidden_dim]   # [batch, seq_len, hidden_dim]
        backward = h_bi[:, :, self.hidden_dim:]  # [batch, seq_len, hidden_dim]

        # Forward final: at position (length - 1) for each sequence
        forward_final = forward[
            torch.arange(batch_size, device=device), lengths - 1
        ]  # [batch, hidden_dim]

        # Backward final: always at position 0 (starting point of backward pass)
        backward_final = backward[:, 0]  # [batch, hidden_dim]

        return torch.cat([forward_final, backward_final], dim=-1)
```

If you don't use `pack_padded_sequence`, the backward RNN will start from padding tokens and propagate garbage information through the sequence. This silently degrades performance. Always pack variable-length sequences.
For complex tasks, stacking multiple bidirectional layers creates deep bidirectional networks with hierarchical representations.
Architecture Convention
When stacking bidirectional layers, each layer receives the concatenated output from the previous layer:
$$\mathbf{h}_t^{(l)} = [\overrightarrow{\mathbf{h}}_t^{(l)}; \overleftarrow{\mathbf{h}}_t^{(l)}]$$
$$\overrightarrow{\mathbf{h}}_t^{(l+1)} = \text{Forward-RNN}^{(l+1)}(\mathbf{h}_t^{(l)})$$
Dimensionality Growth
| Layer | Input Dimension | Output Dimension |
|---|---|---|
| Layer 1 | $d_{\text{embed}}$ | $2d$ |
| Layer 2 | $2d$ | $2d$ |
| Layer 3 | $2d$ | $2d$ |
| ... | $2d$ | $2d$ |
Modern frameworks like PyTorch handle this automatically when setting bidirectional=True with num_layers > 1.
Residual Connections in Deep BiRNNs
For deep bidirectional networks (3+ layers), residual connections significantly improve training:
$$\mathbf{h}_t^{(l)} = \mathbf{h}_t^{(l-1)} + \text{BiRNN}^{(l)}(\mathbf{h}_t^{(l-1)})$$
This requires matching dimensions—typically achieved by using the same hidden size at each layer or adding projection layers.
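A minimal residual block might look like the following sketch, which assumes the block's input already has dimension $2 \times$ hidden size so the skip connection lines up; the class name is illustrative.

```python
import torch
import torch.nn as nn

class ResidualBiLSTMBlock(nn.Module):
    """One bidirectional layer wrapped with a residual (skip) connection."""

    def __init__(self, hidden_dim: int, dropout: float = 0.1):
        super().__init__()
        self.rnn = nn.LSTM(
            input_size=2 * hidden_dim,   # input is already a bidirectional state
            hidden_size=hidden_dim,
            batch_first=True,
            bidirectional=True,
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [batch, seq_len, 2*hidden_dim]
        out, _ = self.rnn(h)             # [batch, seq_len, 2*hidden_dim]
        return h + self.dropout(out)     # dimensions match, so no projection needed
```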
Empirically, 2-3 bidirectional layers often outperform a single large layer with the same parameter count. Depth enables hierarchical feature learning where lower layers capture local patterns and higher layers capture more abstract, global patterns.
With the rise of Transformer architectures, it's natural to ask: Are bidirectional RNNs still relevant?
The answer is nuanced. Transformers—particularly self-attention—can capture bidirectional context in a single operation:
$$\text{Attention}(\mathbf{x}_t) = \sum_{i=1}^{T} \alpha_{ti} \mathbf{v}_i$$
Every position can directly attend to every other position, eliminating the need for separate forward and backward passes.
Comparative Analysis
| Aspect | BiRNN | Transformer |
|---|---|---|
| Context Access | Full bidirectional via two passes | Full bidirectional in single attention |
| Computational Complexity | O(T) sequential in each direction | O(T²) parallelizable attention |
| Parallelization | Limited (sequential dependency) | Highly parallelizable (all positions at once) |
| Long-Range Dependencies | Degrades over distance despite LSTM/GRU | Direct attention at any distance |
| Training Speed | Slower (sequential) | Faster on GPUs (parallel) |
| Memory Efficiency | More efficient for short sequences | Quadratic memory in sequence length |
| Inductive Bias | Strong sequential/temporal bias | Minimal positional bias (learned embeddings) |
| Short Sequences (< 100 tokens) | Competitive or superior | Comparable, with higher overhead |
When BiRNNs Still Excel
Resource-Constrained Environments: BiRNNs have lower memory footprint for moderate sequence lengths
Strong Sequential Inductive Bias: When data is inherently sequential (speech, music, certain time series), BiRNNs' built-in sequential processing can be beneficial
Small Data Regimes: BiRNNs may generalize better with limited training data due to their constrained architecture
Latency-Sensitive Applications: For inference on CPU or edge devices, BiRNNs can be faster
Hybrid Architectures: Many modern systems use BiRNN encoders with Transformer decoders or attention layers over BiRNN outputs
Bidirectional RNNs (particularly BiLSTMs) dominated NLP from 2015-2018, powering state-of-the-art systems including ELMo, CoNLL NER winners, and production machine translation. Understanding BiRNNs remains essential both for historical understanding and for the many systems still in production that use them.
We have thoroughly explored bidirectional recurrent neural networks: their motivation, architecture, mathematical formulation, gradient flow, and practical considerations. The core takeaway is that running two independent RNNs over the sequence and concatenating their states gives every position access to both past and future context, at the cost of doubled parameters and the requirement that the full input be available up front.
What's Next:
Having mastered bidirectional processing, we'll next explore Deep RNNs—stacking multiple recurrent layers to create hierarchical representations that capture increasingly abstract patterns in sequential data.
You now understand how bidirectional RNNs overcome the limitation of unidirectional processing by combining forward and backward context. This architecture enables powerful sequence understanding for tasks where the complete input is available, forming the foundation for many state-of-the-art NLP systems.