Hidden layers are the defining innovation that elevates neural networks from simple linear classifiers to universal function approximators. The term "hidden" refers not to secrecy but to the fact that these layers' activations are internal to the network—not directly observed in either input data or output predictions.
Consider the XOR problem that famously stymied single-layer perceptrons. The input patterns $(0,0), (0,1), (1,0), (1,1)$ have outputs $(0, 1, 1, 0)$. No single line can separate the positive from negative examples—they're arranged in a checkerboard pattern. But a hidden layer can transform these points into a new representation where they become linearly separable.
This is the fundamental insight: hidden layers learn representations. They don't just filter inputs to outputs—they transform raw data into increasingly abstract features that make the final prediction task tractable. This page explores the mechanics, mathematics, and intuitions behind this transformative capability.
By the end of this page, you will understand: (1) How hidden layers transform input spaces to make problems linearly separable; (2) The geometry of learned representations; (3) The role of width and depth in representation capacity; (4) Feature hierarchies and compositional learning; (5) Visualization and interpretation of hidden layer activations.
The XOR (exclusive OR) function is the simplest non-linearly separable problem, making it the perfect vehicle to understand hidden layer necessity and mechanics.
XOR Definition:
$$\text{XOR}(x_1, x_2) = \begin{cases} 1 & \text{if } x_1 \neq x_2 \\ 0 & \text{if } x_1 = x_2 \end{cases}$$
Plotting the four input points in 2D and coloring by output:
The classes form a checkerboard—no straight line separates them. A single-layer perceptron computes $\hat{y} = \sigma(w_1 x_1 + w_2 x_2 + b)$, which defines a linear decision boundary. Minsky and Papert proved that no such linear boundary can compute XOR.
The Hidden Layer Solution:
Add one hidden layer with 2 units. The network becomes:
Layer 1 (Hidden): $$h_1 = \sigma(w_{11}x_1 + w_{12}x_2 + b_1)$$ $$h_2 = \sigma(w_{21}x_1 + w_{22}x_2 + b_2)$$
Layer 2 (Output): $$\hat{y} = \sigma(v_1 h_1 + v_2 h_2 + c)$$
One Solution (among infinitely many):
Hidden layer weights (using a step function for clarity): $h_1 = \text{step}(x_1 + x_2 - 0.5)$ acts as an OR gate, $h_2 = \text{step}(x_1 + x_2 - 1.5)$ acts as an AND gate, and the output is $\hat{y} = \text{step}(h_1 - h_2 - 0.5)$.
This transforms the input space: $(0,0) \to (0,0)$, $(0,1) \to (1,0)$, $(1,0) \to (1,0)$, $(1,1) \to (1,1)$.
Now in the $(h_1, h_2)$ space, both positive examples map to the point $(1,0)$, while the negative examples map to $(0,0)$ and $(1,1)$; the classes are linearly separable. The output layer simply learns: fire when $h_1=1$ AND $h_2=0$.
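As a quick sanity check, here is a minimal NumPy sketch of this hand-designed solution using a hard step function (the fuller demo below uses a steep sigmoid as a smooth stand-in for the step):

```python
import numpy as np

def step(z):
    """Heaviside step: 1 if z > 0 else 0."""
    return (z > 0).astype(int)

def xor_by_hand(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # OR gate
    h2 = step(x1 + x2 - 1.5)    # AND gate
    return step(h1 - h2 - 0.5)  # OR AND (NOT AND) = XOR

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(xor_by_hand(X[:, 0], X[:, 1]))  # expected: [0 1 1 0]
```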
```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """Sigmoid activation function."""
    return 1 / (1 + np.exp(-z))

def xor_network(x1, x2, hidden_scale=10):
    """
    XOR network with explicit learned weights.

    This demonstrates how a 2-hidden-unit network solves XOR
    by transforming the input space.
    """
    # Hidden layer weights (manually designed solution)
    # These weights could be learned by gradient descent

    # Hidden unit 1: Fires when (x1 OR x2)
    # Learns: x1 + x2 > 0.5
    h1_pre = hidden_scale * (x1 + x2 - 0.5)
    h1 = sigmoid(h1_pre)

    # Hidden unit 2: Fires when (x1 AND x2)
    # Learns: x1 + x2 > 1.5
    h2_pre = hidden_scale * (x1 + x2 - 1.5)
    h2 = sigmoid(h2_pre)

    # Output layer: Fires when h1 AND NOT h2
    # XOR = (x1 OR x2) AND NOT (x1 AND x2)
    output_pre = hidden_scale * (h1 - h2 - 0.5)
    output = sigmoid(output_pre)

    return output, h1, h2

def visualize_xor_transformation():
    """
    Visualize how hidden layer transforms XOR problem
    from non-separable to separable.
    """
    # XOR inputs and targets
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 1, 1, 0])

    fig, axes = plt.subplots(1, 3, figsize=(14, 4))

    # Original input space
    ax1 = axes[0]
    colors = ['blue' if yi == 0 else 'red' for yi in y]
    ax1.scatter(X[:, 0], X[:, 1], c=colors, s=200, edgecolors='black')
    ax1.set_title('Input Space (NOT Separable)', fontsize=12)
    ax1.set_xlabel('$x_1$')
    ax1.set_ylabel('$x_2$')
    ax1.set_xlim(-0.5, 1.5)
    ax1.set_ylim(-0.5, 1.5)
    ax1.grid(True, alpha=0.3)

    # Compute hidden representations
    H = np.zeros((4, 2))
    for i, (x1, x2) in enumerate(X):
        _, h1, h2 = xor_network(x1, x2)
        H[i] = [h1, h2]

    # Hidden layer space
    ax2 = axes[1]
    ax2.scatter(H[:, 0], H[:, 1], c=colors, s=200, edgecolors='black')
    ax2.set_title('Hidden Space (NOW Separable!)', fontsize=12)
    ax2.set_xlabel('$h_1$ (OR gate)')
    ax2.set_ylabel('$h_2$ (AND gate)')
    ax2.set_xlim(-0.5, 1.5)
    ax2.set_ylim(-0.5, 1.5)

    # Add separating line in hidden space
    x_line = np.linspace(-0.5, 1.5, 100)
    y_line = x_line - 0.5  # h1 - h2 = 0.5
    ax2.plot(x_line, y_line, 'g--', linewidth=2, label='Decision Boundary')
    ax2.legend()
    ax2.grid(True, alpha=0.3)

    # Network output over input space
    ax3 = axes[2]
    xx, yy = np.meshgrid(np.linspace(-0.3, 1.3, 100), np.linspace(-0.3, 1.3, 100))
    Z = np.zeros_like(xx)
    for i in range(xx.shape[0]):
        for j in range(xx.shape[1]):
            Z[i, j], _, _ = xor_network(xx[i, j], yy[i, j])

    ax3.contourf(xx, yy, Z, levels=20, cmap='RdBu_r', alpha=0.8)
    ax3.scatter(X[:, 0], X[:, 1], c=colors, s=200, edgecolors='black')
    ax3.contour(xx, yy, Z, levels=[0.5], colors='green', linewidths=2)
    ax3.set_title('Network Output (Nonlinear Boundary)', fontsize=12)
    ax3.set_xlabel('$x_1$')
    ax3.set_ylabel('$x_2$')

    plt.tight_layout()
    plt.savefig('xor_transformation.png', dpi=150)
    plt.show()

    # Print transformation details
    print("XOR Space Transformation:")
    print("-" * 40)
    for i, (x1, x2) in enumerate(X):
        output, h1, h2 = xor_network(x1, x2)
        print(f"Input ({x1}, {x2}) → Hidden ({h1:.2f}, {h2:.2f}) → Output {output:.2f} (Target: {y[i]})")

if __name__ == "__main__":
    visualize_xor_transformation()
```

The hidden layer doesn't just process information—it transforms the problem. By mapping inputs to a new representation space, it converts an impossible problem (linear separation of XOR) into a trivial one. This is the essence of representation learning: finding the right coordinate system where the problem becomes easy.
Hidden layers perform geometric transformations on data. Each layer's linear transformation followed by nonlinearity implements a series of warps, stretches, and folds that progressively reshape the data distribution.
The Transformation Sequence:
For each hidden layer:
Affine Transformation $(W\mathbf{x} + \mathbf{b})$: rotates, scales, shears, and translates the data. On its own this can reorient space but never bend it; stacking affine maps alone would still yield a single linear map.
Nonlinear Activation $\sigma(\cdot)$: bends, squashes, or folds the space pointwise. Combined with the affine step, it lets each layer warp the data distribution in ways no single linear map can.
Unfolding Data Manifolds:
Real-world data often lies on low-dimensional manifolds embedded in high-dimensional space. Hidden layers learn to "unfold" these manifolds: tangled, curved class regions in the input space are gradually flattened and pulled apart until a simple boundary can separate them.
A key insight from manifold learning: well-trained networks progressively separate class manifolds while preserving within-class structure.
Piecewise Linear Interpretation (for ReLU networks):
A network with ReLU activations is a piecewise linear function. Each hidden unit divides input space with a hyperplane (where pre-activation = 0). The regions where each combination of units is "on" or "off" define a partition of input space into convex polytopes.
For a network with $L$ layers of width $n$, the number of linear regions can grow on the order of $n^L$, compared with only $O(n)$ for a single hidden layer of the same width.
This exponential growth explains depth efficiency: deeper networks can create exponentially more complex decision boundaries with the same parameter count as shallow ones.
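One rough way to see this region structure empirically is to count the distinct ReLU on/off activation patterns a randomly initialized network produces over a grid of inputs; each distinct pattern corresponds to one linear piece. A minimal sketch (the layer sizes and grid resolution below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_patterns(x, weights, biases):
    """Concatenated on/off pattern of every hidden unit for an input batch x."""
    patterns = []
    h = x
    for W, b in zip(weights, biases):
        pre = h @ W + b
        patterns.append(pre > 0)   # which units are "on"
        h = np.maximum(pre, 0)     # ReLU
    return np.concatenate(patterns, axis=1)

# Random 2 → 16 → 16 ReLU network on 2D inputs
widths = [2, 16, 16]
weights = [rng.normal(size=(widths[i], widths[i + 1])) for i in range(len(widths) - 1)]
biases = [rng.normal(size=w) for w in widths[1:]]

# Dense grid over a square of input space
g = np.linspace(-2, 2, 300)
xx, yy = np.meshgrid(g, g)
X = np.column_stack([xx.ravel(), yy.ravel()])

pats = relu_patterns(X, weights, biases)
n_regions = len(np.unique(pats, axis=0))
print(f"Distinct activation patterns (≈ linear regions) seen on the grid: {n_regions}")
```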
The Representation Manifold:
As data flows through the network, early layers capture low-level, local structure; middle layers compose these into parts and patterns; and later layers encode abstract, task-relevant concepts.
This hierarchy emerges naturally from gradient descent—no explicit supervision for intermediate features is required.
A single hidden layer with $n$ units creates at most $O(n)$ linear regions. But $L$ layers of width $n$ can create $O(n^L)$ regions—exponentially more. This is why depth matters: it's not just about having more parameters, but about compositional structure that enables exponentially richer function classes.
The number of hidden units directly controls the network's capacity—its ability to represent complex functions. But the relationship is nuanced.
Width and Function Class:
For a single hidden layer with $h$ units and smooth activation $\sigma$, the universal approximation theorem guarantees that any continuous function on a compact domain can be approximated to arbitrary accuracy as $h \to \infty$. In practice, larger $h$ means a richer function class but also more parameters to fit.
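To make the width-capacity relationship concrete, here is a small, hypothetical experiment: single-hidden-layer networks of increasing width are fit to the same 1D target. The target function, widths, and training settings are illustrative choices, not prescriptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-3, 3, 256).unsqueeze(1)
y = torch.sin(2 * x)  # illustrative oscillatory target

for h in [2, 8, 64]:  # increasing hidden-layer width
    net = nn.Sequential(nn.Linear(1, h), nn.Tanh(), nn.Linear(h, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(2000):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x), y)
        loss.backward()
        opt.step()
    print(f"width={h:3d}  training MSE={loss.item():.5f}")
```

Typically the narrowest network underfits this oscillatory target, while the wider ones drive the training error much lower, in line with the capacity discussion above.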
The Bias-Variance Trade-off:
| Hidden Units | Capacity | Training Error | Test Error | Regime |
|---|---|---|---|---|
| Too few | Low | High | High | Underfitting |
| Just right | Moderate | Low | Low | Good generalization |
| Too many (classical) | High | Very low | High | Overfitting |
| Too many (modern) | Very high | Zero | Low | Double descent |
Modern Overparameterization:
Classical learning theory suggests more parameters = more overfitting. But modern deep learning often uses networks with far more parameters than training examples—and they generalize well!
This "double descent" phenomenon involves:
The implicit regularization of gradient descent (preferring smooth solutions) explains some of this behavior.
Practical Width Selection Guidelines:
Start moderately: Begin with width equal to 1-2x input dimension or a power of 2 (64, 128, 256)
Scale with data: More training examples support wider networks
Match complexity: Simple patterns need fewer units; complex patterns need more
Use validation: Monitor validation loss to detect overfitting
Apply regularization: Dropout, weight decay, and early stopping allow wider networks without overfitting
Dead Neurons Problem:
With ReLU activation, hidden units can "die"—their pre-activation becomes permanently negative for all inputs, so they never activate or update. Wide layers help ensure enough neurons remain active. Batch normalization and proper initialization also mitigate this issue.
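One way to check for dead units in practice is to count hidden units that never activate over a representative batch. A minimal PyTorch sketch using forward hooks (the model and batch below are placeholders):

```python
import torch
import torch.nn as nn

def count_dead_relu_units(model: nn.Module, batch: torch.Tensor) -> dict:
    """Count units whose ReLU output is zero for every example in `batch`."""
    outputs = {}

    def make_hook(name):
        def hook(module, inputs, output):
            outputs[name] = output.detach()
        return hook

    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules() if isinstance(m, nn.ReLU)]
    with torch.no_grad():
        model(batch)
    for h in handles:
        h.remove()

    # A unit is "dead" on this batch if it is zero for all examples
    return {name: int((acts == 0).all(dim=0).sum()) for name, acts in outputs.items()}

# Example with a placeholder MLP and random inputs
mlp = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 3))
print(count_dead_relu_units(mlp, torch.randn(512, 20)))
```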
In practice, hidden layer widths are often powers of 2 (32, 64, 128, 256, 512, 1024) or multiples of GPU warp size (32). This is primarily for computational efficiency—memory alignment and parallel processing are optimized for these sizes—not for any fundamental mathematical reason. But efficiency matters, so follow this convention unless you have specific reasons not to.
While width determines capacity at each layer, depth enables hierarchical composition—building complex features from simpler ones. This compositional structure is key to the success of deep learning.
The Composition Principle:
Consider recognizing a face: pixels combine into edges, edges into contours and simple textures, textures into facial parts such as eyes, noses, and mouths, and parts into whole faces.
Each layer composes features from the previous layer into more abstract representations. This hierarchy is remarkably consistent across trained networks—even without explicit supervision for intermediate features.
Depth Efficiency Theorems (Telgarsky, 2016):
There exist functions expressible by a network of depth $k$ with polynomial width that require exponential width to express with depth $k-1$.
Example function: Highly oscillatory functions like $\sin(2^k \cdot x)$ require exponentially many units in depth-$(k-1)$ networks but only $O(k)$ units in depth-$k$ networks.
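The intuition can be seen numerically: composing a simple "triangle" map with itself (each copy of which a few ReLU units can represent) roughly doubles the number of oscillations at every step, so depth buys oscillations exponentially while each layer stays tiny. A sketch in the spirit of this depth-separation argument; the details are illustrative, not a reproduction of the theorem's construction:

```python
import numpy as np

def triangle(x):
    """Tent map on [0, 1]; expressible with a couple of ReLU units."""
    return 1 - np.abs(2 * x - 1)

x = np.linspace(0, 1, 10001)
for depth in [1, 2, 4, 8]:
    y = x
    for _ in range(depth):
        y = triangle(y)          # compose the map `depth` times
    # Count oscillations via sign changes of the finite-difference slope
    n_turns = np.sum(np.diff(np.sign(np.diff(y))) != 0)
    print(f"depth={depth}:  ~{n_turns} turning points")
```

The turning-point count roughly doubles with each additional composition, whereas a single hidden layer would need on the order of that many units to produce the same number of bends.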
Practical Implications:
| Layer Depth | Receptive Field | Features Learned | Abstraction Level |
|---|---|---|---|
| 1-2 | Small (3x3 to 7x7) | Edges, colors, simple textures | Local, universal |
| 3-5 | Medium (20-50 pixels) | Textures, patterns, part shapes | Mid-level, transferable |
| 6-10 | Large (100+ pixels) | Object parts, semantic regions | High-level, domain-specific |
| 11+ | Entire image | Object categories, scene context | Abstract, task-specific |
Transfer Learning Insight:
The hierarchical structure explains why transfer learning works: early layers learn universal features (edges, textures) that apply across many domains, while later layers specialize to the specific task. When fine-tuning a pretrained network, the early layers are typically frozen or updated with a very small learning rate, while the later layers are re-initialized or trained more aggressively to specialize for the new task.
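As an illustration, a common PyTorch pattern is to freeze everything except the final classification layer of a pretrained backbone; the torchvision model below is just one familiar example of such a backbone:

```python
import torch.nn as nn
from torchvision import models

# Load a pretrained backbone (example choice; other pretrained models work similarly)
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze all existing parameters: early, general-purpose layers stay fixed
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final layer for a new 5-class task; only this layer will be trained
backbone.fc = nn.Linear(backbone.fc.in_features, 5)

trainable = [name for name, p in backbone.named_parameters() if p.requires_grad]
print(trainable)  # expected: ['fc.weight', 'fc.bias']
```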
The Lottery Ticket Hypothesis (Frankle & Carbin, 2019):
Dense networks contain sparse subnetworks ("winning tickets") that, when trained in isolation from initialization, match full network performance. This suggests that depth provides many possible feature compositions, and training discovers which ones are useful.
Depth vs. Width Trade-offs:
| Aspect | Deeper Networks | Wider Networks |
|---|---|---|
| Expressiveness | Exponentially more efficient | Polynomial in width |
| Training difficulty | Harder (gradient issues) | Easier (gradient flow) |
| Memory during training | Higher (store activations) | Lower per layer |
| Computation | More sequential operations | More parallelizable |
| Feature reuse | Features build on each other | Features independent |
Deep networks face a fundamental optimization challenge: gradients shrink exponentially as backpropagation traverses many layers. With sigmoid activations, whose derivative never exceeds 0.25, a 20-layer network can see gradients at the first layer that are many orders of magnitude smaller than at the last ($0.25^{20} \approx 10^{-12}$ from the activation derivatives alone). Solutions include ReLU activations, residual connections, careful initialization, and normalization layers—all essential for training deep networks successfully.
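The effect is easy to measure: build deep MLPs with sigmoid and ReLU activations, backpropagate once, and compare per-layer gradient magnitudes. The layer sizes and random input below are arbitrary illustration choices:

```python
import torch
import torch.nn as nn

def layer_grad_means(activation: nn.Module, depth: int = 20, width: int = 64):
    """Mean absolute weight gradient of each Linear layer after one backward pass."""
    torch.manual_seed(0)
    layers, in_dim = [], 64
    for _ in range(depth):
        layers += [nn.Linear(in_dim, width), activation]
        in_dim = width
    layers.append(nn.Linear(in_dim, 1))
    net = nn.Sequential(*layers)

    out = net(torch.randn(128, 64)).mean()
    out.backward()
    return [m.weight.grad.abs().mean().item() for m in net if isinstance(m, nn.Linear)]

for act in [nn.Sigmoid(), nn.ReLU()]:
    g = layer_grad_means(act)
    print(f"{act.__class__.__name__}:  first layer {g[0]:.2e},  last layer {g[-1]:.2e}")
```

With the sigmoid stack, the first-layer gradients are typically vanishingly small compared with the last layer, while the ReLU stack preserves far more gradient signal.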
Decades of empirical experience have established common patterns for hidden layer configuration. These aren't rigid rules but useful starting points.
Pattern 1: Constant Width. Every hidden layer has the same width (e.g., 256 → 256 → 256 → 256), a simple and common default.
Pattern 2: Funnel/Pyramid. Widths shrink layer by layer (e.g., 256 → 128 → 64 → 32), progressively compressing the representation toward the output.
Pattern 3: Bottleneck (Encoder-Decoder). Widths narrow to a small middle layer and then expand again (e.g., 256 → 128 → 64 → 128 → 256), forcing a compact intermediate representation, as in autoencoders.
Pattern 4: Expanding. Widths grow layer by layer (e.g., 64 → 128 → 256 → 512), the mirror image of the funnel.
```python
import torch
import torch.nn as nn
from typing import List, Literal, Tuple, Type

def create_hidden_layers(
    input_dim: int,
    output_dim: int,
    num_hidden: int,
    base_width: int,
    pattern: Literal["constant", "funnel", "bottleneck", "expanding"],
    activation: Type[nn.Module] = nn.ReLU
) -> Tuple[nn.Sequential, List[int]]:
    """
    Create hidden layers following common architectural patterns.

    Args:
        input_dim: Input feature dimension
        output_dim: Output dimension
        num_hidden: Number of hidden layers
        base_width: Base width for hidden layers
        pattern: Architecture pattern to use
        activation: Activation function class

    Returns:
        Tuple of (nn.Sequential containing the full network, list of hidden widths)
    """
    def get_widths(pattern: str, num_hidden: int, base_width: int) -> List[int]:
        """Generate layer widths based on pattern."""
        if pattern == "constant":
            return [base_width] * num_hidden
        elif pattern == "funnel":
            # Exponential decrease
            widths = []
            w = base_width
            for _ in range(num_hidden):
                widths.append(int(w))
                w = max(w // 2, 32)  # Don't go below 32
            return widths
        elif pattern == "bottleneck":
            if num_hidden < 3:
                return [base_width // 2] * num_hidden
            # Decrease to middle, then increase
            mid = num_hidden // 2
            bottleneck_width = max(base_width // 4, 32)
            widths = []
            for i in range(num_hidden):
                dist_from_mid = abs(i - mid)
                width = bottleneck_width * (2 ** dist_from_mid)
                width = min(width, base_width)
                widths.append(int(width))
            return widths
        elif pattern == "expanding":
            widths = []
            w = base_width // 4
            for _ in range(num_hidden):
                widths.append(int(w))
                w = min(w * 2, base_width * 2)
            return widths
        else:
            raise ValueError(f"Unknown pattern: {pattern}")

    hidden_widths = get_widths(pattern, num_hidden, base_width)

    layers = []
    prev_dim = input_dim
    for width in hidden_widths:
        layers.append(nn.Linear(prev_dim, width))
        layers.append(activation())
        prev_dim = width

    # Output layer (no activation - added externally if needed)
    layers.append(nn.Linear(prev_dim, output_dim))

    return nn.Sequential(*layers), hidden_widths

def demonstrate_patterns():
    """Show examples of each pattern."""
    patterns = ["constant", "funnel", "bottleneck", "expanding"]

    print("Hidden Layer Design Patterns")
    print("=" * 60)

    for pattern in patterns:
        network, widths = create_hidden_layers(
            input_dim=784,
            output_dim=10,
            num_hidden=4,
            base_width=256,
            pattern=pattern
        )

        total_params = sum(p.numel() for p in network.parameters())
        arch_str = " → ".join(["784"] + [str(w) for w in widths] + ["10"])

        print(f"\n{pattern.upper()} Pattern:")
        print(f"  Architecture: {arch_str}")
        print(f"  Parameters: {total_params:,}")
        print(f"  Widths: {widths}")

if __name__ == "__main__":
    demonstrate_patterns()
```

The 'best' architecture depends on the problem. Tabular data with independent features often works well with funnel architectures. Generative models benefit from bottlenecks. High-dimensional compositional data (images, text) leverages deep constant-width or slightly funneling structures. Experimentation and validation remain essential.
Understanding what hidden layers learn is crucial for debugging, interpretation, and scientific inquiry. Several techniques illuminate the black box.
Activation Visualization:
For each hidden unit $j$ in layer $l$, examine its distribution of activations across the dataset, the inputs that activate it most strongly, and how its activation varies with class labels.
Dimensionality Reduction:
Hidden representations are high-dimensional. To visualize them, project the activations down to two or three dimensions with a technique such as PCA or t-SNE.
Apply to the hidden activations $\{\mathbf{a}^{(l)}_i\}$ for all inputs $i$ and color by class to see cluster formation.
Feature Maximization:
Find input $\mathbf{x}^*$ that maximally activates a given hidden unit:
$$\mathbf{x}^* = \arg\max_{\mathbf{x}} a_j^{(l)}(\mathbf{x})$$
Solved via gradient ascent in input space. Reveals what pattern the unit "looks for."
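A bare-bones version of this optimization in PyTorch might look as follows; the model, layer, unit index, step size, and iteration count are all placeholder choices:

```python
import torch
import torch.nn as nn

def maximize_unit(model: nn.Module, layer: nn.Module, unit: int,
                  input_shape=(1, 20), steps: int = 200, lr: float = 0.1) -> torch.Tensor:
    """Gradient ascent on the input to maximize one hidden unit's activation."""
    captured = {}
    handle = layer.register_forward_hook(lambda m, i, o: captured.update(act=o))

    x = torch.randn(input_shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        model(x)
        # Maximize the unit's activation by minimizing its negative
        loss = -captured["act"][0, unit]
        loss.backward()
        opt.step()

    handle.remove()
    return x.detach()

# Placeholder MLP; maximize unit 3 of the first hidden layer's output
mlp = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 3))
x_star = maximize_unit(mlp, layer=mlp[1], unit=3)
print(x_star.shape)  # torch.Size([1, 20])
```

For image models, the same loop is usually combined with regularizers (e.g., norm penalties or blurring) so the optimized input stays interpretable.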
Saliency Maps:
For classification, compute gradient of output w.r.t. input:
$$\text{Saliency} = \left| \frac{\partial f_c(\mathbf{x})}{\partial \mathbf{x}} \right|$$
Highlights which input features most influence the prediction—revealing what the network attends to.
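A minimal saliency computation for a classifier, following the formula above (the model and input are placeholders; for images you would reshape the gradient back to the image dimensions):

```python
import torch
import torch.nn as nn

def saliency(model: nn.Module, x: torch.Tensor, target_class: int) -> torch.Tensor:
    """Absolute gradient of the target-class score with respect to the input."""
    model.eval()
    x = x.clone().requires_grad_(True)
    score = model(x)[0, target_class]  # f_c(x)
    score.backward()
    return x.grad.abs().squeeze(0)

# Placeholder classifier and a random input
clf = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
sal = saliency(clf, torch.randn(1, 20), target_class=1)
print(sal.shape)  # torch.Size([20]): one importance score per input feature
```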
```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

class HiddenRepresentationAnalyzer:
    """
    Comprehensive tools for analyzing hidden layer representations.

    Provides methods for:
    - Extracting activations at any layer
    - Dimensionality reduction for visualization
    - Activation statistics and distributions
    - Layer-wise representation analysis
    """

    def __init__(self, model: nn.Module):
        self.model = model
        self.activations = {}
        self._register_hooks()

    def _register_hooks(self):
        """Register forward hooks to capture activations."""
        def get_activation(name):
            def hook(module, input, output):
                self.activations[name] = output.detach()
            return hook

        for name, module in self.model.named_modules():
            if isinstance(module, (nn.Linear, nn.ReLU)):
                module.register_forward_hook(get_activation(name))

    def get_layer_activations(
        self,
        data: torch.Tensor,
        layer_name: str
    ) -> np.ndarray:
        """Extract activations for a specific layer."""
        self.model.eval()
        with torch.no_grad():
            _ = self.model(data)
        return self.activations[layer_name].numpy()

    def visualize_representation_evolution(
        self,
        data: torch.Tensor,
        labels: np.ndarray,
        layer_names: list
    ):
        """
        Visualize how representations evolve through layers.

        Uses t-SNE to project hidden representations to 2D.
        """
        n_layers = len(layer_names)
        fig, axes = plt.subplots(1, n_layers, figsize=(4 * n_layers, 4))

        for idx, layer_name in enumerate(layer_names):
            acts = self.get_layer_activations(data, layer_name)

            # Apply t-SNE (PCA for very high-dimensional)
            if acts.shape[1] > 50:
                pca = PCA(n_components=50)
                acts = pca.fit_transform(acts)

            tsne = TSNE(n_components=2, random_state=42, perplexity=30)
            embedded = tsne.fit_transform(acts)

            ax = axes[idx] if n_layers > 1 else axes
            scatter = ax.scatter(
                embedded[:, 0], embedded[:, 1],
                c=labels, cmap='tab10', alpha=0.7, s=5
            )
            ax.set_title(f'Layer: {layer_name}')
            ax.set_xticks([])
            ax.set_yticks([])

        plt.tight_layout()
        return fig

    def compute_activation_statistics(
        self,
        data: torch.Tensor,
        layer_names: list
    ) -> dict:
        """
        Compute comprehensive activation statistics per layer.

        Returns:
            Dictionary with statistics for each layer:
            - mean, std of activations
            - sparsity (fraction of zeros for ReLU)
            - saturation (fraction at extreme values)
        """
        stats = {}
        for layer_name in layer_names:
            acts = self.get_layer_activations(data, layer_name)
            stats[layer_name] = {
                'mean': np.mean(acts),
                'std': np.std(acts),
                'min': np.min(acts),
                'max': np.max(acts),
                'sparsity': np.mean(acts == 0),  # For ReLU
                'dead_units': np.mean(np.all(acts == 0, axis=0)),
            }
        return stats

    def analyze_class_separability(
        self,
        data: torch.Tensor,
        labels: np.ndarray,
        layer_names: list
    ) -> dict:
        """
        Measure how well classes separate at each layer.

        Uses ratio of between-class to within-class variance.
        """
        separability = {}
        for layer_name in layer_names:
            acts = self.get_layer_activations(data, layer_name)

            # Compute class centroids
            classes = np.unique(labels)
            centroids = np.array([
                acts[labels == c].mean(axis=0) for c in classes
            ])
            global_mean = acts.mean(axis=0)

            # Between-class variance
            between_var = np.sum([
                np.sum((centroids[i] - global_mean)**2) * np.sum(labels == c)
                for i, c in enumerate(classes)
            ]) / len(labels)

            # Within-class variance
            within_var = np.mean([
                np.mean(np.sum((acts[labels == c] - centroids[i])**2, axis=1))
                for i, c in enumerate(classes)
            ])

            separability[layer_name] = between_var / (within_var + 1e-10)

        return separability


# Example usage with synthetic data
def demonstrate_hidden_analysis():
    """Show hidden layer analysis on a simple classification task."""
    # Create simple MLP
    model = nn.Sequential(
        nn.Linear(20, 64), nn.ReLU(),
        nn.Linear(64, 32), nn.ReLU(),
        nn.Linear(32, 16), nn.ReLU(),
        nn.Linear(16, 3)  # 3 classes
    )

    # Generate synthetic data
    np.random.seed(42)
    n_samples = 500
    X = np.random.randn(n_samples, 20).astype(np.float32)
    y = np.random.randint(0, 3, n_samples)
    # Add class-specific signal
    for c in range(3):
        X[y == c, :5] += c * 0.5

    data = torch.tensor(X)

    # Analyze
    analyzer = HiddenRepresentationAnalyzer(model)
    layer_names = ['0', '2', '4']  # Linear layers

    print("Activation Statistics:")
    stats = analyzer.compute_activation_statistics(data, layer_names)
    for name, s in stats.items():
        print(f"  Layer {name}: mean={s['mean']:.3f}, std={s['std']:.3f}, sparsity={s['sparsity']:.3f}")

    print("\nClass Separability (higher = better):")
    sep = analyzer.analyze_class_separability(data, y, layer_names)
    for name, val in sep.items():
        print(f"  Layer {name}: {val:.3f}")

if __name__ == "__main__":
    demonstrate_hidden_analysis()
```

Well-trained classification networks show monotonically increasing class separability through layers. If separability decreases at some layer, it may indicate that layer is too narrow (losing information), poorly trained, or unnecessary. Monitoring this metric during training can diagnose learning problems.
Hidden layers are the computational engine that transforms neural networks from simple linear models into powerful universal approximators. To consolidate the key concepts: hidden layers learn representations that make previously inseparable problems (like XOR) linearly separable; width sets per-layer capacity while depth enables exponentially richer compositional features; common design patterns (constant, funnel, bottleneck, expanding) provide practical starting points; and activation statistics, dimensionality reduction, feature maximization, and saliency maps let us inspect what those layers have learned.
What's Next:
With the role of hidden layers established, we turn to the computation itself. The next page examines Forward Propagation—the precise algorithm by which information flows from input to output, transforming at each layer according to learned parameters.
You now understand hidden layers as the representational core of neural networks. This foundation is essential for understanding how networks learn: backpropagation, which we'll cover later, adjusts weights to shape these hidden representations. Every modern architecture—CNNs, RNNs, Transformers—is fundamentally about designing hidden layers that capture the right structure for the problem at hand.