The power of LSTM lies not in any single component, but in the coordinated interplay of three gates—forget, input, and output—that together create a sophisticated memory controller. Each gate learns to recognize specific patterns in the input and hidden state, and responds by either opening (allowing information to flow) or closing (blocking it).
In the previous page, we established what each gate does. Now we explore how gates work at a deeper level—their mathematical properties, learning dynamics, and the emergent behaviors that arise from their interaction. Understanding gates thoroughly is essential for diagnosing issues, designing LSTM variants, and knowing when to apply (or not apply) LSTM architectures.
By the end of this page, you will understand:
• The mathematical properties of sigmoid gating and why it works
• How each gate learns its specific role through gradient signals
• The dynamics of gate saturation and its effects on learning
• How gates coordinate to encode different memory patterns
• Practical techniques for analyzing and improving gate behavior
Gates in LSTM use the sigmoid function to produce values between 0 and 1. This choice is not arbitrary—it has profound mathematical consequences that enable LSTM's memory capabilities.
The Sigmoid Function:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Key Properties:
| Pre-activation (z) | σ(z) | σ'(z) | Gate State |
|---|---|---|---|
| -6 | 0.0025 | 0.0025 | Nearly closed (99.75% blocked) |
| -3 | 0.0474 | 0.0452 | Mostly closed |
| -1 | 0.268 | 0.196 | Partially closed |
| 0 | 0.5 | 0.25 | Half-open (maximum uncertainty) |
| +1 | 0.732 | 0.196 | Partially open |
| +3 | 0.9526 | 0.0452 | Mostly open |
| +6 | 0.9975 | 0.0025 | Nearly open (99.75% passed) |
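A quick numeric check of this table (a minimal sketch in PyTorch; the helper name is illustrative):

```python
import torch

def sigmoid_and_derivative(z: torch.Tensor):
    """Return σ(z) and σ'(z) = σ(z) * (1 - σ(z))."""
    s = torch.sigmoid(z)
    return s, s * (1 - s)

z = torch.tensor([-6.0, -3.0, -1.0, 0.0, 1.0, 3.0, 6.0])
s, ds = sigmoid_and_derivative(z)
for zi, si, dsi in zip(z.tolist(), s.tolist(), ds.tolist()):
    print(f"z={zi:+.0f}  σ(z)={si:.4f}  σ'(z)={dsi:.4f}")
# The derivative peaks at 0.25 (z = 0) and decays toward 0 as |z| grows,
# which is exactly the saturation effect discussed below.
```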
The Gating Operation:
When a gate $g = \sigma(z)$ modulates a signal $x$, the output is $g \cdot x$. This multiplication has crucial properties:

• When $g \approx 0$, the signal is blocked almost entirely; when $g \approx 1$, it passes through nearly unchanged
• Intermediate values scale the signal smoothly, allowing graded, per-dimension control rather than all-or-nothing decisions
• The operation is differentiable with respect to both $g$ and $x$, so gradients reach the gate's parameters as well as the gated signal
Why Not Hard Gates?
One might ask: why use soft sigmoid gates instead of hard 0/1 decisions? The answer is learnability. A hard threshold has zero gradient almost everywhere, so the gate's parameters would receive no training signal about when to open or close. A soft sigmoid gate is a smooth, differentiable approximation of a binary switch: it can be nudged gradually toward open or closed during training, and it can still saturate toward near-binary behavior once the right decision has been learned.
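A small autograd sketch (variable names are illustrative) makes this concrete: the soft gate passes gradient to both the gated signal and the gate's own pre-activation, which is exactly what a hard 0/1 gate cannot do.

```python
import torch

z = torch.tensor(0.5, requires_grad=True)   # gate pre-activation
x = torch.tensor(2.0, requires_grad=True)   # signal being gated

g = torch.sigmoid(z)   # soft gate in (0, 1)
y = g * x              # gated output
y.backward()

print(f"gate value g      = {g.item():.3f}")
print(f"dL/dx = g         = {x.grad.item():.3f}")   # signal gradient, scaled by the gate
print(f"dL/dz = x * σ'(z) = {z.grad.item():.3f}")   # the gate's parameters get gradient too
# A hard 0/1 gate would have zero gradient w.r.t. z almost everywhere,
# so the network could never learn *when* to open or close it.
```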
There's a fundamental tradeoff in gate saturation:
• Saturated gates (near 0 or 1): Make decisive, binary decisions. Good for clear memory patterns but harder to adjust during training (small gradients).
• Unsaturated gates (near 0.5): Make soft, uncertain decisions. Easier to train (larger gradients) but less sharp memory control.
A well-trained LSTM typically has saturated gates for clear situations (sentence boundaries, definite topic changes) and unsaturated gates for ambiguous situations (mid-sentence, gradual context shifts).
The forget gate $f_t$ is perhaps the most critical gate in the LSTM. It directly controls the gradient highway and determines whether information persists or is erased.
Mathematical Definition:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
What the Forget Gate Learns to Detect:
Through training, the forget gate learns to recognize patterns that signal "erase memory", for example:

• Sentence or document boundaries, where previous context is no longer relevant
• Explicit topic changes or scene transitions
• The end of a syntactic unit (a clause or parenthetical) whose internal details are no longer needed

Conversely, it stays open (high $f_t$) when context must persist:

• An entity or subject introduced earlier that later predictions depend on (e.g., subject-verb agreement)
• An ongoing topic or narrative thread that continues across many time steps
• Accumulating quantities, such as a running count or sentiment, that should not be reset

The forget gate also sits directly on the gradient path through time, as the following walkthrough shows:
```
# Gradient flow through forget gate
# Cell state update: c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t

# Gradient w.r.t. c_{t-1}:
∂c_t/∂c_{t-1} = f_t                        # Element-wise

# For gradient from c_T back to c_t:
∂c_T/∂c_t = ∏(k=t+1 to T) f_k              # Product of forget gates

# Example: T=100, t=0, all f_k = 0.99
∂c_T/∂c_0 = 0.99^100 ≈ 0.366               # 36.6% of gradient survives!

# Compare: if f_k = 0.9 (forget gate not quite saturated enough)
∂c_T/∂c_0 = 0.9^100 ≈ 2.66 × 10^-5         # Only 0.003% survives

# The difference between f=0.99 and f=0.9 is DRAMATIC for long sequences.
# This is why forget gate bias initialization matters so much!
```

Forget Gate Bias: The Critical Hyperparameter
The initialization of $b_f$ is one of the most consequential choices in LSTM training:
| Bias Value | Initial $f_t$ (approx.) | Effect |
|---|---|---|
| -2.0 | 0.12 | Aggressive forgetting; may lose long-term info |
| 0.0 | 0.50 | Uncertain; requires significant training to learn memory |
| 1.0 | 0.73 | Conservative; defaults to remembering |
| 2.0 | 0.88 | Very conservative; strong memory retention |
The Jozefowicz et al. Recommendation:
In "An Empirical Exploration of Recurrent Network Architectures" (2015), the authors found that initializing $b_f = 1.0$ consistently improved performance across tasks. Some practitioners use even higher values (1.5 or 2.0) for tasks requiring very long dependencies.
Intuition: At initialization, before the network has learned anything, it's better to err on the side of remembering (gradients flow) than forgetting (gradient death).
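As a concrete sketch, PyTorch's nn.LSTM stores each layer's biases as a single vector concatenating the input, forget, cell, and output gate biases in that order, so the forget-gate slice is the second quarter. One way to apply the recommended initialization (the helper function name is ours):

```python
import torch
import torch.nn as nn

def init_forget_bias(lstm: nn.LSTM, value: float = 1.0):
    """Initialize the forget-gate bias of every layer/direction to `value`.

    Note: nn.LSTM adds bias_ih and bias_hh, so the effective forget bias is
    their sum. Here the full `value` goes into bias_ih and the bias_hh slice
    is zeroed, giving an effective forget bias of exactly `value`.
    """
    for name, param in lstm.named_parameters():
        if "bias" not in name:
            continue
        hidden_size = param.shape[0] // 4
        forget_slice = slice(hidden_size, 2 * hidden_size)  # gate order: i, f, g, o
        with torch.no_grad():
            param[forget_slice] = value if name.startswith("bias_ih") else 0.0

lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2)
init_forget_bias(lstm, value=1.0)
```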
If the forget gate learns to always output low values (f_t ≈ 0), the LSTM degenerates into a memoryless model—each time step is essentially independent. This can happen when:
• Sequences are too short to reward long-term memory
• The task doesn't require temporal dependencies
• Forget bias is initialized too low
• Learning rate is too high early in training, pushing gates to extremes
Monitor forget gate statistics during training. If mean(f_t) < 0.5 consistently, something may be wrong.
The input gate $i_t$ and candidate values $\tilde{c}_t$ work together to control what new information enters the cell state. Their separation into two components is a key design decision.
Mathematical Definitions:
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$
Why Two Separate Components?
Consider the alternative: a single component $\Delta c_t = f(W \cdot [h, x] + b)$ added to the cell state.

The problem is that this conflates two decisions:

• What content to write (the values themselves)
• How much of that content to write (the write strength, from "nothing" to "everything")

By separating these:

• The candidate $\tilde{c}_t$ (tanh, range $[-1, 1]$) is free to represent content, regardless of whether it will actually be stored
• The input gate $i_t$ (sigmoid, range $(0, 1)$) independently decides the write strength per dimension, including writing nothing at all
• The network can prepare potentially useful content at every step and commit it to memory only when appropriate
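A minimal sketch of the two-component write with small, hand-picked dimensions (all tensor names here are illustrative, not from any particular library):

```python
import torch

hidden_dim, input_dim = 4, 3
h_prev = torch.zeros(1, hidden_dim)
x_t = torch.randn(1, input_dim)
hx = torch.cat([h_prev, x_t], dim=1)          # [h_{t-1}, x_t]

# Separate parameter sets for the gate ("how much") and the candidate ("what")
W_i = torch.randn(hidden_dim, hidden_dim + input_dim) * 0.1
b_i = torch.zeros(hidden_dim)
W_c = torch.randn(hidden_dim, hidden_dim + input_dim) * 0.1
b_c = torch.zeros(hidden_dim)

i_t = torch.sigmoid(hx @ W_i.T + b_i)         # write strength in (0, 1)
c_tilde = torch.tanh(hx @ W_c.T + b_c)        # content in (-1, 1)

write = i_t * c_tilde                         # what actually enters the cell state
print("candidate:", c_tilde)
print("gate:     ", i_t)
print("write:    ", write)
```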
The Coordinated Behavior of Input and Forget Gates:
In practice, input and forget gates often develop complementary behaviors:
| Situation | Forget Gate | Input Gate | Net Effect |
|---|---|---|---|
| New topic | $f_t ≈ 0$ | $i_t ≈ 1$ | Reset: erase old, write new |
| Continuing topic | $f_t ≈ 1$ | $i_t ≈ 0$ | Preserve: keep old, ignore new |
| Accumulating info | $f_t ≈ 1$ | $i_t ≈ 1$ | Aggregate: keep old AND add new |
| Holding for later | $f_t ≈ 1$ | $i_t ≈ 0$ | Store: maintain without updating |
This coordination emerges naturally through training—the loss function shapes both gates to work together for the task at hand.
Some LSTM variants couple the input and forget gates: i_t = 1 - f_t. This reduces parameters and ensures c_t doesn't grow unboundedly. The GRU (next module) takes this approach. However, independent gates allow more nuanced behaviors—for instance, both adding new info AND retaining all old info (useful for counting or accumulation tasks).
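A tiny numeric sketch of the difference (illustrative values only): with coupled gates the update is a weighted average of old state and new candidate, so the cell state cannot grow without bound, while independent gates can deliberately accumulate.

```python
import torch

c_prev = torch.tensor([2.0])      # existing cell state
c_tilde = torch.tensor([1.0])     # new candidate (bounded by tanh)

# Coupled gates (i_t = 1 - f_t): weighted average, stays between old and new values
f = torch.tensor([0.9])
c_coupled = f * c_prev + (1 - f) * c_tilde          # 0.9*2.0 + 0.1*1.0 = 1.9

# Independent gates: both can be ≈ 1, so the cell accumulates (useful for counting)
f_ind, i_ind = torch.tensor([1.0]), torch.tensor([1.0])
c_independent = f_ind * c_prev + i_ind * c_tilde    # 2.0 + 1.0 = 3.0

print(c_coupled.item(), c_independent.item())
```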
The output gate $o_t$ serves a fundamentally different purpose from the input and forget gates. While those control storage, the output gate controls visibility—what information from the cell state is exposed to the rest of the network.
Mathematical Definition:
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$ $$h_t = o_t \odot \tanh(c_t)$$
Why Separate Storage from Output?
This separation is crucial for long-term dependencies:

• Information can sit in the cell state for many steps without influencing $h_t$ (and thus the network's outputs) until it is needed
• The hidden state stays free to handle short-term, local processing while long-term content is kept out of view
Example: Subject-Verb Agreement Across a Long Clause
Consider: "The trophy, which the athletes who won the championship game received from the governor, [was/were] very heavy."
The subject is "trophy" (singular), but many words intervene before the verb. An LSTM might:
Without the output gate, the singular marker would constantly influence hidden states, potentially confusing clause-internal processing.
```
# Hidden state: h_t = o_t ⊙ tanh(c_t)

# When o_t ≈ 0:
h_t ≈ 0                  # Cell state hidden from the network
# Gradients through h_t are blocked (∂h_t/∂c_t ≈ 0)
# But c_t still flows independently (not affected!)

# When o_t ≈ 1:
h_t ≈ tanh(c_t)          # Cell state fully visible
# Gradients flow through both paths

# Key insight: the output gate affects the gradient reaching c_t through h_t,
# but the direct c_{t-1} → c_t path (via the forget gate) is unaffected.

# This is why LSTMs can:
# 1. Store information (via the cell state path)
# 2. Use it only when needed (via the output gate)
# 3. Receive gradients at use-time that flow back through storage-time
```

Before the output gate, $c_t$ passes through tanh. This bounds the values to $[-1, 1]$, preventing the hidden state from growing unboundedly even if the cell state accumulates. It also provides additional non-linearity in the output pathway. Note that this tanh is applied before gating—so a closed output gate ($o_t \approx 0$) still means $h_t \approx 0$ regardless of $c_t$'s magnitude.
When all three gates operate together, complex memory behaviors emerge that go beyond what any single gate could achieve. Understanding these emergent patterns helps diagnose LSTM behavior and design better architectures.
The Five Fundamental Memory Operations:
| Operation | Forget | Input | Output | Cell Dynamics | Use Case |
|---|---|---|---|---|---|
| Write | Low (≈0) | High (≈1) | Any | Replace with new | Topic change, new entity |
| Read | High (≈1) | Low (≈0) | High (≈1) | Expose stored content | Use stored info for prediction |
| Carry | High (≈1) | Low (≈0) | Low (≈0) | Preserve hidden | Long-term info storage |
| Update | High (≈1) | High (≈1) | Any | Accumulate | Counting, sentiment accumulation |
| Clear | Low (≈0) | Low (≈0) | Any | Reset to ~0 | End of document, major transition |
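The sketch below plays these five operations through the standard update equations with hand-set gate values (purely illustrative, not learned):

```python
import torch

def lstm_update(c_prev, f, i, o, c_tilde):
    """One cell-state/hidden-state update with explicitly chosen gate values."""
    c = f * c_prev + i * c_tilde
    h = o * torch.tanh(c)
    return c, h

c_prev = torch.tensor([0.8])      # something already stored
c_tilde = torch.tensor([0.5])     # new candidate content

ops = {
    "Write":  (0.0, 1.0, 1.0),    # erase old, store new
    "Read":   (1.0, 0.0, 1.0),    # keep old, expose it
    "Carry":  (1.0, 0.0, 0.0),    # keep old, hide it
    "Update": (1.0, 1.0, 1.0),    # accumulate old + new
    "Clear":  (0.0, 0.0, 0.0),    # reset toward zero
}

for name, (f, i, o) in ops.items():
    c, h = lstm_update(c_prev, torch.tensor([f]), torch.tensor([i]),
                       torch.tensor([o]), c_tilde)
    print(f"{name:6s}  c_t={c.item():+.2f}  h_t={h.item():+.2f}")
```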
Temporal Patterns in Gate Activations:
Well-trained LSTMs exhibit characteristic temporal patterns:
Periodic Gating: For structured data (music, formatted text), gates often show periodic patterns matching the structure
Burst Writing: At semantically rich points (entity mentions, key events), input gates spike while forget gates may simultaneously dip
Sustained Carry: During parenthetical expressions or subclauses, forget gates stay high and input gates stay low—pure memory preservation
Delayed Read: Information stored early may only be read (output gate high) much later when it becomes relevant
Cascading Resets: Major transitions can trigger reset cascades where forget gates go low across multiple dimensions simultaneously
A powerful debugging technique is to visualize gate activations as heatmaps over time:
• x-axis: Time steps (sequence position)
• y-axis: Hidden dimensions
• Color: Gate activation value (blue = 0, red = 1)
Healthy patterns show:

• Forget gate: Mostly red (high) with occasional blue streaks at boundaries
• Input gate: Sparse blue background with targeted red spikes at content words
• Output gate: Task-dependent, but should show clear patterns rather than uniform gray
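The visualization code below assumes the LSTM cell exposes its gate values through a forward_with_gates method. That method is not part of torch.nn.LSTM; here is one hypothetical way to implement a cell that provides it:

```python
import torch
import torch.nn as nn

class LSTMCellWithGates(nn.Module):
    """A standard LSTM cell that also returns its gate activations."""

    def __init__(self, input_dim, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        # One linear map producing all four pre-activations: i, f, c̃, o
        self.linear = nn.Linear(input_dim + hidden_size, 4 * hidden_size)

    def forward_with_gates(self, x, state):
        h_prev, c_prev = state
        z = self.linear(torch.cat([x, h_prev], dim=1))
        z_i, z_f, z_c, z_o = z.chunk(4, dim=1)

        i = torch.sigmoid(z_i)          # input gate
        f = torch.sigmoid(z_f)          # forget gate
        c_tilde = torch.tanh(z_c)       # candidate values
        o = torch.sigmoid(z_o)          # output gate

        c = f * c_prev + i * c_tilde    # cell state update
        h = o * torch.tanh(c)           # hidden state
        return h, c, (f, i, o)
```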
```python
import torch
import matplotlib.pyplot as plt
import numpy as np

def visualize_gates(lstm_layer, input_sequence, tokens):
    """
    Visualize gate activations for an input sequence.

    Args:
        lstm_layer: LSTM layer (modified to expose gate values)
        input_sequence: (seq_len, batch=1, input_dim) tensor
        tokens: List of token strings for labeling
    """
    # Run forward pass collecting gate values
    forget_gates = []
    input_gates = []
    output_gates = []

    h = torch.zeros(1, lstm_layer.hidden_size)
    c = torch.zeros(1, lstm_layer.hidden_size)

    for t in range(input_sequence.size(0)):
        x = input_sequence[t]
        # Get gates (assuming lstm_layer.forward_with_gates exists)
        h, c, (f, i, o) = lstm_layer.forward_with_gates(x, (h, c))
        forget_gates.append(f.squeeze().detach().numpy())
        input_gates.append(i.squeeze().detach().numpy())
        output_gates.append(o.squeeze().detach().numpy())

    # Stack into arrays: (seq_len, hidden_dim)
    F = np.stack(forget_gates)
    I = np.stack(input_gates)
    O = np.stack(output_gates)

    # Create visualization
    fig, axes = plt.subplots(3, 1, figsize=(14, 10), sharex=True)

    for ax, gates, name in zip(axes, [F, I, O],
                               ['Forget Gate', 'Input Gate', 'Output Gate']):
        im = ax.imshow(gates.T, aspect='auto', cmap='RdYlBu_r', vmin=0, vmax=1)
        ax.set_ylabel(f'{name}\n(Hidden Dim)')
        ax.set_title(f'{name} Activations')
        plt.colorbar(im, ax=ax)

    # Set token labels
    axes[-1].set_xticks(range(len(tokens)))
    axes[-1].set_xticklabels(tokens, rotation=45, ha='right')
    axes[-1].set_xlabel('Sequence Position')

    plt.tight_layout()
    plt.savefig('gate_visualization.png', dpi=150)
    plt.show()
```

How do gates learn their specific roles? The answer lies in the gradient signals they receive during training and how the loss function shapes their behavior.
Gradient Flow to Gates:
Considering the cell state update $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$:
$$\frac{\partial L}{\partial f_t} = \frac{\partial L}{\partial c_t} \odot c_{t-1}$$
$$\frac{\partial L}{\partial i_t} = \frac{\partial L}{\partial c_t} \odot \tilde{c}_t$$
Interpretation:
The forget gate receives gradient proportional to how useful the previous cell state is. If $c_{t-1}$ contains information needed for the current loss, $f_t$ is pushed higher.
The input gate receives gradient proportional to how useful the candidate write would be. If $\tilde{c}_t$ contains missing information, $i_t$ is pushed higher.
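These relationships are easy to verify with autograd (a small sketch; using sum(c_t) as the loss makes ∂L/∂c_t equal to 1 in every dimension):

```python
import torch

c_prev = torch.tensor([0.7, -0.3, 1.2])
c_tilde = torch.tensor([0.5, 0.9, -0.4])

f = torch.tensor([0.8, 0.2, 0.6], requires_grad=True)   # forget gate
i = torch.tensor([0.1, 0.9, 0.5], requires_grad=True)   # input gate

c = f * c_prev + i * c_tilde
loss = c.sum()            # so dL/dc_t = 1 for every dimension
loss.backward()

print(f.grad)             # equals dL/dc_t ⊙ c_{t-1} = c_prev
print(i.grad)             # equals dL/dc_t ⊙ c̃_t    = c_tilde
```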
The Credit Assignment Problem:
A key challenge in gate learning is temporal credit assignment. If information written at time $t_1$ enables a correct prediction at time $t_2 \gg t_1$:

• The loss gradient arises at $t_2$, where the information is used
• To reach the input gate $i_{t_1}$, that gradient must travel backward along the cell state path, being multiplied by $f_k$ at every intermediate step
• Only then can the write decision at $t_1$ receive credit for its downstream benefit
The crucial observation: If forget gates maintain high values ($f_k \approx 1$), this gradient survives the journey back through time, and $i_{t_1}$ learns "writing here was useful." If forget gates are too low, the signal is lost, and the network can't learn to write information that will be needed later.
This is why forget gate initialization is so critical—it directly determines whether the network can learn long-range dependencies at all.
Early in training, gates are typically unsaturated (near 0.5) and behave almost uniformly across hidden dimensions. As training progresses:

• Gates gradually saturate, making sharper open/close decisions
• Different hidden dimensions specialize: some hold long-lived information (persistently high forget gates), others track short-term, rapidly changing features
• Input gates become sparser, writing only at informative positions rather than at every step
This specialization is a form of implicit feature learning—the network discovers what information is worth storing.
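Statistics like those in the example log below can be computed directly from collected gate activations. A minimal sketch (the saturation measure, counting activations below 0.1 or above 0.9, is one illustrative choice):

```python
import torch

def gate_statistics(gates: torch.Tensor, threshold: float = 0.1):
    """Summarize gate activations of shape (seq_len, batch, hidden_dim)."""
    saturated = (gates < threshold) | (gates > 1 - threshold)
    return {
        "mean": gates.mean().item(),
        "std": gates.std().item(),
        "saturation": saturated.float().mean().item(),
    }

# Example with random values standing in for real gate activations
fake_forget_gates = torch.rand(100, 32, 256)
print(gate_statistics(fake_forget_gates))
```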
```
Training Epoch: Gate Statistics (averaged across sequence and batch)
---------------------------------------------------------------------------

Epoch 1 (random init with forget bias = 1.0):
  Forget: mean=0.73, std=0.05, saturation=0.02
  Input:  mean=0.50, std=0.04, saturation=0.01
  Output: mean=0.50, std=0.03, saturation=0.01
  → Gates are unsaturated, not yet specialized

Epoch 10:
  Forget: mean=0.78, std=0.15, saturation=0.25
  Input:  mean=0.42, std=0.22, saturation=0.31
  Output: mean=0.55, std=0.18, saturation=0.22
  → Starting to differentiate; some saturation appearing

Epoch 50:
  Forget: mean=0.85, std=0.18, saturation=0.55
  Input:  mean=0.28, std=0.30, saturation=0.58
  Output: mean=0.62, std=0.25, saturation=0.48
  → Clear specialization; input gate learning sparsity

Epoch 100 (converged):
  Forget: mean=0.91, std=0.12, saturation=0.78
  Input:  mean=0.18, std=0.28, saturation=0.72
  Output: mean=0.68, std=0.28, saturation=0.65
  → High saturation, clear gate decisions, learned memory patterns
```

While gates learn their behavior from the task, we can guide them through regularization techniques—either encouraging certain behaviors or preventing pathological ones.
Activation Regularization (AR):
Penalizes hidden state magnitude, indirectly affecting gates: $$L_{AR} = \alpha \sum_t ||h_t||^2$$
This encourages sparser, more selective output gate activations.
Temporal Activation Regularization (TAR):
Penalizes changes in hidden state, encouraging smoothness: $$L_{TAR} = \beta \sum_t ||h_t - h_{t-1}||^2$$
This discourages rapid gate switching, promoting more coherent memory behavior.
Gate Activity Regularization:
Directly regularize gate activations:
$$L_{gate} = \gamma \sum_t \sum_{g \in \{f,i,o\}} ||g_t||^2$$
This pushes gates toward 0, encouraging sparsity in memory operations.
Entropy Regularization:
Gate activations can be viewed as probabilities. High entropy means uncertain, soft gates; low entropy means decisive, hard gates:
$$H(g) = -g \log(g) - (1-g) \log(1-g)$$
We can:

• Penalize entropy to push gates toward decisive (saturated) behavior once training has stabilized
• Add an entropy bonus early in training to keep gates soft and trainable while the network is still exploring
• Monitor entropy as a diagnostic: persistently high entropy suggests the gates never learned clear memory decisions
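A sketch of how these regularizers might be combined with a task loss (the coefficients α, β, γ follow the formulas above; the hidden-state and gate tensors are assumed to have been collected during the forward pass):

```python
import torch

def gate_regularizers(h, f, i, o, alpha=1e-4, beta=1e-4, gamma=1e-5, eps=1e-8):
    """
    h:        hidden states, shape (seq_len, batch, hidden_dim)
    f, i, o:  gate activations, same shape
    Returns AR, TAR, gate-activity, and entropy terms as scalars.
    """
    l_ar = alpha * (h ** 2).sum()                            # activation regularization
    l_tar = beta * ((h[1:] - h[:-1]) ** 2).sum()             # temporal activation regularization
    l_gate = gamma * sum((g ** 2).sum() for g in (f, i, o))  # push gates toward 0 (sparsity)

    # Binary entropy of each gate activation (high = uncertain, low = decisive)
    def entropy(g):
        return -(g * (g + eps).log() + (1 - g) * (1 - g + eps).log())
    l_entropy = entropy(f).mean() + entropy(i).mean() + entropy(o).mean()

    return l_ar, l_tar, l_gate, l_entropy

# Usage: total_loss = task_loss + l_ar + l_tar + l_gate  (and ± a weight on l_entropy)
```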
Over-regularizing gates toward 0 can collapse memory entirely. If forget gates all go to 0 (aggressive forgetting) and input gates all go to 0 (no new input), the cell state becomes meaningless. Always monitor gate statistics when applying regularization:
• mean(f_t) should stay above 0.5 for long-range tasks
• mean(i_t) should be non-zero at semantic content positions
• Balance regularization strength with task requirements
We've deeply explored the three gates that make LSTM a powerful memory-enabled architecture.
Key Insights:

• Sigmoid gating provides soft, differentiable control: gates make graded decisions during learning and saturate toward near-binary decisions once trained
• The forget gate sits directly on the gradient path through time; its bias initialization (e.g., $b_f = 1.0$) largely determines whether long-range dependencies can be learned at all
• Separating the input gate from the candidate values decouples "how much to write" from "what to write"
• The output gate separates storage from visibility, letting information persist silently until it is needed
• Together, the gates implement a small repertoire of memory operations (write, read, carry, update, clear), and their statistics and activation patterns are practical diagnostics during training
Looking Ahead:
We've now understood the individual gates and their dynamics. The next page focuses on the Cell State Highway—how the linear flow of cell state through time creates the constant error carousel, enabling gradient flow and long-term memory that was impossible with vanilla RNNs.
You now have deep mastery of LSTM gate mechanisms—their mathematics, learning dynamics, and practical implications. This understanding is essential for debugging networks, designing variants, and knowing when LSTMs are appropriate for your task.