The single-layer perceptron, for all its historical significance, possesses a fundamental limitation that nearly killed neural network research for over a decade: it can only learn linearly separable patterns. In 1969, Marvin Minsky and Seymour Papert's famous critique demonstrated this limitation with crystalline clarity through the XOR problem—a simple logical function that no single-layer perceptron can compute.
This limitation isn't merely academic. Real-world data is almost never linearly separable. Images, speech, text, and virtually every complex pattern humans recognize require learning nonlinear decision boundaries. The solution, as Frank Rosenblatt himself had intuited but lacked the tools to fully exploit, is to stack multiple layers of computational units.
The Multi-Layer Perceptron (MLP) represents the first and most fundamental departure from single-layer limitations. By introducing one or more hidden layers between input and output, MLPs can approximate any continuous function to arbitrary precision—a result known as the Universal Approximation Theorem. This page explores the architectural principles that make such extraordinary representational power possible.
By the end of this page, you will understand: (1) The complete anatomy of MLP architecture including layers, units, and connections; (2) The mathematical representation of network topology; (3) Design principles for layer configuration; (4) The relationship between architecture and function class; (5) How different architectural choices affect learning capacity and generalization.
An MLP is a feedforward neural network consisting of multiple layers of computational units (neurons) where information flows in one direction—from input to output—without cycles. Understanding MLP architecture requires precision about its constituent parts.
Definition (Multi-Layer Perceptron): A multi-layer perceptron is a directed acyclic graph where:
Units (nodes) are organized into an ordered sequence of layers: an input layer, one or more hidden layers, and an output layer.
Each unit in layer $l$ receives connections from every unit in layer $l-1$ and from no other units (no intra-layer or backward connections).
Each connection carries a learnable weight, and each non-input unit has a learnable bias and applies an activation function to its weighted input.
The architecture is completely specified by the sequence of layer widths $(n_0, n_1, \ldots, n_L)$ where $n_0$ is the input dimension, $n_L$ is the output dimension, and layers $1$ through $L-1$ are hidden layers.
The terms 'fully connected layer,' 'dense layer,' and 'MLP layer' are essentially synonymous. Each neuron in layer $l$ receives input from every neuron in layer $l-1$. This is in contrast to architectures like CNNs (convolutional neural networks) where connections follow structured sparsity patterns. The fully connected pattern maximizes flexibility but at the cost of parameter count scaling quadratically with layer width.
A precise mathematical specification of MLP architecture enables analysis, implementation, and communication without ambiguity. We develop the notation systematically.
Network Specification:
Let $L$ denote the number of layers (counting hidden + output, excluding input). The network architecture is specified by:
$$\text{Architecture} = (n_0, n_1, \ldots, n_L)$$
where $n_l$ denotes the width (number of units) in layer $l$.
Example: A network for MNIST digit classification with architecture $(784, 256, 128, 10)$ has:
$n_0 = 784$ input units (one per pixel of a $28 \times 28$ image)
$n_1 = 256$ and $n_2 = 128$ units in the first and second hidden layers
$n_3 = 10$ output units (one per digit class)
Parameter Space:
The complete parameter set $\Theta$ consists of all weights and biases:
$$\Theta = \{W^{(1)}, \mathbf{b}^{(1)}, W^{(2)}, \mathbf{b}^{(2)}, \ldots, W^{(L)}, \mathbf{b}^{(L)}\}$$
For layer $l$:
$W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ is the weight matrix connecting layer $l-1$ to layer $l$
$\mathbf{b}^{(l)} \in \mathbb{R}^{n_l}$ is the bias vector of layer $l$
Total Parameter Count:
The total number of learnable parameters is:
$$|\Theta| = \sum_{l=1}^{L} \left( n_l \cdot n_{l-1} + n_l \right) = \sum_{l=1}^{L} n_l(n_{l-1} + 1)$$
For our MNIST example: $(256 \times 785) + (128 \times 257) + (10 \times 129) = 200,960 + 32,896 + 1,290 = 235,146$ parameters.
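The formula is easy to check in code. A minimal sketch in plain Python (the function name is ours; the architectures correspond to rows of the comparison table below):

```python
def count_mlp_parameters(layer_sizes):
    """Total learnable parameters: sum over layers of n_l * (n_{l-1} + 1)."""
    return sum(n_out * (n_in + 1)
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

print(count_mlp_parameters([784, 256, 128, 10]))    # 235146  (MNIST example above)
print(count_mlp_parameters([784, 512, 10]))         # 407050
print(count_mlp_parameters([3072, 1024, 512, 10]))  # 3676682
```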
| Architecture | Application | Hidden Params | Output Params | Total |
|---|---|---|---|---|
| (784, 512, 10) | MNIST Simple | 401,920 | 5,130 | 407,050 |
| (784, 256, 128, 10) | MNIST Deep | 233,856 | 1,290 | 235,146 |
| (3072, 1024, 512, 10) | CIFAR-10 | 3,671,552 | 5,130 | 3,676,682 |
| (768, 3072, 768) | Transformer FFN | 2,362,368 | 2,360,064 | 4,722,432 |
| (4096, 4096, 4096, 1000) | ImageNet FC | 33,562,624 | 4,097,000 | 37,659,624 |
The fully connected architecture creates a quadratic relationship between layer width and parameter count: doubling the width of two adjacent layers quadruples the number of parameters connecting them. This is why modern computer vision models use convolutional layers (which share parameters spatially) rather than fully connected layers for image input—a 224×224 RGB image has 150,528 input dimensions, so even a single hidden layer of moderate width requires hundreds of millions of parameters (roughly 617 million weights for a 4,096-unit hidden layer).
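A quick back-of-the-envelope calculation makes the scaling concrete (a sketch; the 4,096-unit width is an arbitrary but typical choice):

```python
# Parameter count for a single fully connected layer on a flattened 224x224 RGB image
n_in = 224 * 224 * 3            # 150,528 input dimensions
n_hidden = 4096                 # a hidden layer of "moderate" width
params = n_hidden * (n_in + 1)  # weights plus biases
print(f"{params:,}")            # 616,566,784 -- over 600 million parameters for one layer
```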
Understanding the computation at each layer is essential for grasping how information transforms as it flows through the network.
Pre-Activation (Net Input):
For layer $l$, the pre-activation $\mathbf{z}^{(l)}$ is the weighted sum of inputs plus bias:
$$\mathbf{z}^{(l)} = W^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$$
where $\mathbf{a}^{(l-1)}$ is the activation (output) of the previous layer, with $\mathbf{a}^{(0)} = \mathbf{x}$ (the input).
Post-Activation:
The activation of layer $l$ applies the nonlinear activation function element-wise:
$$\mathbf{a}^{(l)} = \sigma^{(l)}(\mathbf{z}^{(l)})$$
Component-wise Expansion:
For unit $j$ in layer $l$:
$$z_j^{(l)} = \sum_{i=1}^{n_{l-1}} W_{ji}^{(l)} a_i^{(l-1)} + b_j^{(l)}$$
$$a_j^{(l)} = \sigma^{(l)}(z_j^{(l)})$$
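The component-wise and vectorized forms describe the same computation; a minimal NumPy check (layer sizes and the random seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_l = 4, 3
W = rng.standard_normal((n_l, n_prev))   # W^(l)
b = rng.standard_normal(n_l)             # b^(l)
a_prev = rng.standard_normal(n_prev)     # a^(l-1)

z_vectorized = W @ a_prev + b            # z^(l) = W^(l) a^(l-1) + b^(l)
z_componentwise = np.array([
    sum(W[j, i] * a_prev[i] for i in range(n_prev)) + b[j]
    for j in range(n_l)
])
print(np.allclose(z_vectorized, z_componentwise))  # True
```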
Complete Forward Pass:
The network function $f: \mathbb{R}^{n_0} \to \mathbb{R}^{n_L}$ is the composition:
$$f(\mathbf{x}; \Theta) = \sigma^{(L)} \circ g^{(L)} \circ \sigma^{(L-1)} \circ g^{(L-1)} \circ \cdots \circ \sigma^{(1)} \circ g^{(1)}(\mathbf{x})$$
where $g^{(l)}(\mathbf{a}) = W^{(l)}\mathbf{a} + \mathbf{b}^{(l)}$ is the affine transformation.
```python
import numpy as np
from typing import List, Tuple, Callable


class MLPArchitecture:
    """
    Complete MLP implementation demonstrating layer-by-layer computation.

    This implementation prioritizes clarity over efficiency to illustrate
    the mathematical concepts precisely.
    """

    def __init__(self,
                 layer_sizes: List[int],
                 hidden_activation: Callable = None,
                 output_activation: Callable = None):
        """
        Initialize MLP with given architecture.

        Args:
            layer_sizes: List of layer widths [n_0, n_1, ..., n_L]
            hidden_activation: Activation function for hidden layers
            output_activation: Activation function for output layer
        """
        self.layer_sizes = layer_sizes
        self.L = len(layer_sizes) - 1  # Number of computational layers

        # Default activations
        self.hidden_activation = hidden_activation or self._relu
        self.output_activation = output_activation or self._identity

        # Initialize parameters
        self.weights = []  # W^(l) for l = 1, ..., L
        self.biases = []   # b^(l) for l = 1, ..., L

        for l in range(1, len(layer_sizes)):
            n_in = layer_sizes[l - 1]
            n_out = layer_sizes[l]

            # Xavier/Glorot initialization
            scale = np.sqrt(2.0 / (n_in + n_out))
            W = np.random.randn(n_out, n_in) * scale
            b = np.zeros(n_out)

            self.weights.append(W)
            self.biases.append(b)

    def forward(self, x: np.ndarray) -> Tuple[np.ndarray, List[np.ndarray], List[np.ndarray]]:
        """
        Compute forward pass through the network.

        Args:
            x: Input vector of shape (n_0,) or batch (batch_size, n_0)

        Returns:
            output: Network output
            activations: List of activations [a^(0), a^(1), ..., a^(L)]
            pre_activations: List of pre-activations [z^(1), ..., z^(L)]
        """
        # Ensure 2D input (batch_size, n_features)
        if x.ndim == 1:
            x = x.reshape(1, -1)

        activations = [x.T]  # a^(0) = x, transposed for matrix ops
        pre_activations = []

        a = x.T  # Current activation: shape (n_l, batch_size)

        for l in range(self.L):
            # Pre-activation: z^(l) = W^(l) @ a^(l-1) + b^(l)
            z = self.weights[l] @ a + self.biases[l].reshape(-1, 1)
            pre_activations.append(z)

            # Select activation function based on layer
            if l < self.L - 1:
                a = self.hidden_activation(z)
            else:
                a = self.output_activation(z)

            activations.append(a)

        return a.T, activations, pre_activations

    def count_parameters(self) -> int:
        """Return total number of learnable parameters."""
        total = 0
        for W, b in zip(self.weights, self.biases):
            total += W.size + b.size
        return total

    def _relu(self, z: np.ndarray) -> np.ndarray:
        """ReLU activation: max(0, z)"""
        return np.maximum(0, z)

    def _identity(self, z: np.ndarray) -> np.ndarray:
        """Identity activation (for regression)"""
        return z

    def describe_architecture(self) -> str:
        """Return human-readable architecture description."""
        lines = [
            f"MLP Architecture: {' → '.join(map(str, self.layer_sizes))}",
            f"Total parameters: {self.count_parameters():,}",
            ""
        ]
        for l in range(self.L):
            n_in = self.layer_sizes[l]
            n_out = self.layer_sizes[l + 1]
            weight_params = n_in * n_out
            bias_params = n_out
            layer_type = "Hidden" if l < self.L - 1 else "Output"
            lines.append(
                f"Layer {l + 1} ({layer_type}): "
                f"{n_in} → {n_out} | "
                f"Weights: {weight_params:,}, Biases: {bias_params:,}"
            )
        return "\n".join(lines)


# Example usage
if __name__ == "__main__":
    # Create MNIST-style architecture
    mlp = MLPArchitecture([784, 256, 128, 64, 10])
    print(mlp.describe_architecture())

    # Forward pass with random input
    x = np.random.randn(5, 784)  # Batch of 5 samples
    output, activations, pre_activations = mlp.forward(x)

    print(f"\nInput shape: {x.shape}")
    print(f"Output shape: {output.shape}")
    print("\nActivation shapes through network:")
    for l, a in enumerate(activations):
        print(f"  Layer {l}: {a.shape}")
```

Viewing the MLP as function composition is powerful for analysis. Each layer performs an affine transformation followed by a pointwise nonlinearity. The composition of these relatively simple operations yields extraordinary representational power—but only because of the nonlinear activations. Remove them, and no matter how many layers you stack, the result is a single affine transformation.
Designing an MLP architecture involves making principled choices about depth, width, and layer configuration. While no universal rules exist, decades of research and practice have revealed guiding principles.
The Width-Depth Trade-off:
Two primary axes define MLP capacity:
Depth: the number of layers $L$. Deeper networks compose more transformations and can build hierarchical features.
Width: the number of units $n_l$ in each layer. Wider layers represent more features in parallel at a given stage.
Theoretical results (Universal Approximation Theorem) guarantee that a single, sufficiently wide hidden layer can approximate any continuous function. However, this may require exponentially many units. Deeper networks can represent certain functions with exponentially fewer parameters than shallow networks with equivalent capacity.
Rule of Thumb: Start with 2-3 hidden layers. Increase depth for complex hierarchical patterns; increase width for high-dimensional input with many independent features.
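To see the trade-off in parameter terms, compare a shallow-wide and a deep-narrow architecture with similar budgets (a sketch; the specific widths are arbitrary):

```python
def count_mlp_parameters(layer_sizes):
    """Sum of n_l * (n_{l-1} + 1) over all computational layers."""
    return sum(n_out * (n_in + 1)
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

shallow_wide = [784, 1024, 10]           # one wide hidden layer
deep_narrow  = [784, 512, 512, 256, 10]  # three narrower hidden layers

print(count_mlp_parameters(shallow_wide))  # 814090
print(count_mlp_parameters(deep_narrow))   # 798474
```

Both networks cost roughly 0.8M parameters, but the deeper one composes three nonlinear stages and can represent certain hierarchical functions far more compactly.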
Layer Width Strategies:
Several heuristics guide layer width selection:
Pyramid/Funnel: Decreasing width as you go deeper (e.g., 512 → 256 → 128). Common for classification where you progressively compress information.
Constant Width: Same width throughout hidden layers (e.g., 256 → 256 → 256). Good default when uncertain; simplifies hyperparameter tuning.
Expanding then Contracting: Width increases then decreases (e.g., 256 → 512 → 256). Useful when intermediate representations need higher dimensionality.
Width Multiple of 8/32/64: For GPU efficiency, widths that are powers of 2 or multiples of warp size (32 for NVIDIA) can significantly accelerate training.
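These heuristics are easy to encode as small helper functions (hypothetical helpers written for this page; the default widths are illustrative, not recommendations):

```python
def pyramid(n_in, n_out, start=512, n_hidden=3):
    """Funnel architecture: each hidden layer halves the width of the previous one."""
    hidden = [max(n_out, start // (2 ** i)) for i in range(n_hidden)]
    return [n_in] + hidden + [n_out]

def constant(n_in, n_out, width=256, n_hidden=3):
    """Constant-width hidden layers."""
    return [n_in] + [width] * n_hidden + [n_out]

print(pyramid(784, 10))   # [784, 512, 256, 128, 10]
print(constant(784, 10))  # [784, 256, 256, 256, 10]
```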
The Input-Output Constraint:
The input width $n_0$ is fixed by the data (one unit per feature), and the output width $n_L$ is fixed by the task ($C$ units for $C$-class classification, one unit per regression target). Only the hidden-layer widths are free design choices.
Practical Starting Point:
For a problem with $n$ input features and $C$ classes:
Set $n_0 = n$ and $n_L = C$.
A reasonable default is two or three hidden layers with widths between $C$ and $n$, typically round values such as powers of two (e.g., $(n, 256, 128, C)$), then adjust based on validation performance.
While the "classic" MLP uses fully connected layers exclusively, understanding connectivity variations illuminates both the design space and modern architectural innovations.
Fully Connected (Dense) Connectivity:
In a standard MLP, layer $l$ has $n_l \times n_{l-1}$ learnable connections (plus $n_l$ biases). Every output unit depends on every input unit. This maximizes flexibility but:
Parameter count grows quadratically with width, increasing memory and compute costs.
It encodes no structural prior (inductive bias), so every relationship must be learned from data.
The large parameter count makes overfitting more likely when training data is limited.
Sparse Connectivity Patterns:
Modern architectures introduce structured sparsity (compared in the table below):
Convolutional layers connect each unit only to a local neighborhood of the previous layer and share weights across positions.
Block-diagonal and low-rank layers restrict or factor the weight matrix to reduce parameters.
Attention layers make the interaction pattern data-dependent rather than fixed.
Skip Connections:
While not traditional MLPs, skip (residual) connections are ubiquitous in modern networks:
$$\mathbf{a}^{(l)} = \sigma\left(W^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\right) + \mathbf{a}^{(l-1)}$$
(The addition requires $n_l = n_{l-1}$, or a projection on the skip path.) Skip connections ease gradient flow in deep networks and are essential in practice for training networks more than roughly 10 layers deep.
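A minimal PyTorch sketch of an MLP layer with a skip connection (the class is ours, written for illustration; it assumes equal input and output width so the addition is well defined):

```python
import torch
import torch.nn as nn

class ResidualMLPBlock(nn.Module):
    """One fully connected layer with an additive skip: a_out = x + sigma(W x + b)."""
    def __init__(self, width: int):
        super().__init__()
        self.linear = nn.Linear(width, width)  # equal in/out width so shapes match
        self.activation = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.activation(self.linear(x))

block = ResidualMLPBlock(256)
x = torch.randn(32, 256)
print(block(x).shape)  # torch.Size([32, 256])
```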
| Pattern | Parameters/Layer | Inductive Bias | Use Case |
|---|---|---|---|
| Fully Connected | O(n²) | None (maximum flexibility) | Tabular data, final layers |
| Convolutional | O(k² × c) | Translation equivariance | Images, signals, sequences |
| Block Diagonal | O(n²/k) | Independent feature groups | Multi-task, factored models |
| Low-Rank | O(nr) | Smooth/low-frequency functions | Compression, regularization |
| Attention | O(d²) in model width (compute is O(n²) in sequence length) | Position-independent interactions | NLP, vision transformers |
Fully connected layers have no inductive bias—they treat every input-output relationship as equally likely a priori. This flexibility is powerful but data-hungry. Specialized connectivity patterns (convolutions for images, attention for sequences) inject domain knowledge that dramatically improves sample efficiency. The art of architecture design is choosing the right inductive biases for your problem.
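As one concrete example from the table, the low-rank pattern replaces a dense $n \times n$ layer with two thin layers of rank $r$; the savings are easy to quantify (a sketch with arbitrary sizes):

```python
import torch.nn as nn

n, r = 1024, 64
dense = nn.Linear(n, n)                                # n*n + n parameters
low_rank = nn.Sequential(nn.Linear(n, r, bias=False),  # n*r parameters
                         nn.Linear(r, n))              # r*n + n parameters

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense))     # 1049600
print(count(low_rank))  # 132096
```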
A fundamental question in neural network architecture is: what functions can a given network represent? The answer involves both theoretical guarantees and practical limitations.
Universal Approximation Theorem (Cybenko, 1989; Hornik, 1991):
A feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of $\mathbb{R}^n$, under mild assumptions on the activation function.
More precisely, for any continuous function $f: K \to \mathbb{R}$ on a compact set $K \subset \mathbb{R}^n$ and any $\epsilon > 0$, there exists an MLP $g$ with one hidden layer such that:
$$\sup_{\mathbf{x} \in K} |f(\mathbf{x}) - g(\mathbf{x})| < \epsilon$$
Critical Nuances:
Existence, not construction: The theorem guarantees the network exists but provides no algorithm to find it.
Width may be exponential: Approximating some functions requires hidden layer width exponential in input dimension.
Doesn't address learning: Finding the right weights through gradient descent is a separate (and often harder) problem.
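The existence claim is easiest to appreciate in one dimension, where a single-hidden-layer ReLU network that interpolates a target function can be written down by hand rather than learned. The construction below (a sketch; the target function and knot placement are arbitrary choices) uses one ReLU unit per knot of a piecewise-linear interpolant:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def interpolating_relu_net(f, knots):
    """One-hidden-layer ReLU net matching the piecewise-linear interpolant of f on [knots[0], knots[-1]]."""
    t = np.asarray(knots, dtype=float)
    y = f(t)
    slopes = np.diff(y) / np.diff(t)       # slope of each linear segment
    deltas = np.diff(slopes, prepend=0.0)  # change of slope at each knot = output-layer weights
    def g(x):
        x = np.asarray(x, dtype=float)
        hidden = relu(x[:, None] - t[:len(deltas)][None, :])  # hidden layer: weights 1, biases -t_i
        return y[0] + hidden @ deltas                         # output layer: weights deltas, bias y[0]
    return g

f = lambda x: np.sin(2 * np.pi * x)
g = interpolating_relu_net(f, np.linspace(0.0, 1.0, 21))  # 20 ReLU units
xs = np.linspace(0.0, 1.0, 200)
print(np.max(np.abs(f(xs) - g(xs))))  # roughly 0.01; shrinks as more knots (hidden units) are added
```

In higher dimensions no such cheap construction exists in general, which is exactly where the exponential-width caveat bites.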
Depth-Efficiency Results (Telgarsky, 2016): there exist functions computable by deep networks with a modest number of units that cannot be approximated by much shallower networks without exponentially many units.
Universal approximation is a beautiful theoretical result but a poor practical guide. It says nothing about how many samples, how much compute, or whether gradient descent will find good solutions. Modern deep learning success relies more on empirical insights—careful initialization, batch normalization, residual connections—than on approximation theory guarantees. Choose architectures based on empirical performance for your problem class, not just theoretical expressiveness.
Translating MLP architecture from mathematics to efficient code requires attention to numerical stability, memory layout, and hardware utilization.
Weight Initialization:
Poor initialization can make training impossible:
Weights that are too small cause activations and gradients to shrink toward zero as they propagate through layers (vanishing signal).
Weights that are too large cause activations and gradients to grow without bound (exploding signal).
Identical weights (e.g., all zeros) make every unit in a layer compute the same function, and gradient descent never breaks this symmetry.
Common initialization schemes:
| Method | Formula | Best For |
|---|---|---|
| Xavier/Glorot | $W \sim \mathcal{U}(-\sqrt{6/(n_{in}+n_{out})}, \sqrt{6/(n_{in}+n_{out})})$ | Tanh, sigmoid activations |
| He/Kaiming | $W \sim \mathcal{N}(0, 2/n_{in})$ | ReLU activations |
| LeCun | $W \sim \mathcal{N}(0, 1/n_{in})$ | SELU activations |
| Orthogonal | $W = Q$ from the QR decomposition $A = QR$, $A_{ij} \sim \mathcal{N}(0,1)$ | RNNs, very deep networks |
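A small simulation shows why these scales matter: with He scaling, the size of the activations stays roughly constant through a deep ReLU stack, while unscaled weights make the signal vanish or explode (a sketch; depth, width, and the comparison scales are arbitrary):

```python
import numpy as np

def final_activation_std(weight_scale, width=512, depth=20, seed=0):
    """Push a random input through `depth` ReLU layers and return the final activation std."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(width)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * weight_scale(width)
        a = np.maximum(0.0, W @ a)  # ReLU layer (biases omitted for simplicity)
    return a.std()

print(final_activation_std(lambda n_in: np.sqrt(2.0 / n_in)))  # He scaling: stays O(1)
print(final_activation_std(lambda n_in: 0.01))                 # too small: shrinks toward zero
print(final_activation_std(lambda n_in: 0.10))                 # too large: grows by orders of magnitude
```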
```python
import torch
import torch.nn as nn
from typing import List, Optional


class FlexibleMLP(nn.Module):
    """
    Production-quality MLP with modern best practices.

    Features:
    - Flexible architecture specification
    - Multiple activation functions
    - Batch normalization option
    - Dropout regularization
    - Proper weight initialization
    """

    def __init__(
        self,
        layer_sizes: List[int],
        activation: str = "relu",
        output_activation: Optional[str] = None,
        use_batch_norm: bool = True,
        dropout_rate: float = 0.0,
        init_method: str = "kaiming"
    ):
        """
        Args:
            layer_sizes: List [n_input, n_hidden_1, ..., n_output]
            activation: Activation for hidden layers
            output_activation: Activation for output layer (None = linear)
            use_batch_norm: Whether to apply batch normalization
            dropout_rate: Dropout probability (0 = no dropout)
            init_method: Weight initialization method
        """
        super().__init__()

        self.layer_sizes = layer_sizes
        self.num_layers = len(layer_sizes) - 1

        # Build layers
        layers = []
        for i in range(self.num_layers):
            # Linear transformation
            in_features = layer_sizes[i]
            out_features = layer_sizes[i + 1]
            layers.append(nn.Linear(in_features, out_features))

            # For non-output layers, add activation and optional components
            if i < self.num_layers - 1:
                # Batch normalization (before activation)
                if use_batch_norm:
                    layers.append(nn.BatchNorm1d(out_features))

                # Activation function
                layers.append(self._get_activation(activation))

                # Dropout
                if dropout_rate > 0:
                    layers.append(nn.Dropout(dropout_rate))

        # Output activation if specified
        if output_activation is not None:
            layers.append(self._get_activation(output_activation))

        self.network = nn.Sequential(*layers)

        # Apply initialization
        self._initialize_weights(init_method)

    def _get_activation(self, name: str) -> nn.Module:
        """Get activation function by name."""
        activations = {
            "relu": nn.ReLU(),
            "leaky_relu": nn.LeakyReLU(0.1),
            "elu": nn.ELU(),
            "gelu": nn.GELU(),
            "selu": nn.SELU(),
            "tanh": nn.Tanh(),
            "sigmoid": nn.Sigmoid(),
            "softmax": nn.Softmax(dim=-1),
        }
        if name not in activations:
            raise ValueError(f"Unknown activation: {name}")
        return activations[name]

    def _initialize_weights(self, method: str):
        """Apply weight initialization scheme."""
        for module in self.modules():
            if isinstance(module, nn.Linear):
                if method == "kaiming":
                    nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
                elif method == "xavier":
                    nn.init.xavier_uniform_(module.weight)
                elif method == "orthogonal":
                    nn.init.orthogonal_(module.weight)
                else:
                    raise ValueError(f"Unknown init method: {method}")
                if module.bias is not None:
                    nn.init.zeros_(module.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass through the network."""
        return self.network(x)

    def count_parameters(self) -> int:
        """Count total trainable parameters."""
        return sum(p.numel() for p in self.parameters() if p.requires_grad)

    def summary(self) -> str:
        """Return architecture summary."""
        lines = [f"FlexibleMLP: {' → '.join(map(str, self.layer_sizes))}"]
        lines.append(f"Total parameters: {self.count_parameters():,}")
        lines.append("")
        for name, module in self.network.named_children():
            lines.append(f"  {name}: {module}")
        return "\n".join(lines)


# Example usage
if __name__ == "__main__":
    # Create MNIST classifier
    model = FlexibleMLP(
        layer_sizes=[784, 512, 256, 128, 10],
        activation="relu",
        output_activation="softmax",
        use_batch_norm=True,
        dropout_rate=0.2,
        init_method="kaiming"
    )

    print(model.summary())

    # Test forward pass
    batch = torch.randn(32, 784)
    output = model(batch)
    print(f"\nInput shape: {batch.shape}")
    print(f"Output shape: {output.shape}")
    print(f"Output sums to 1: {output.sum(dim=1).allclose(torch.ones(32))}")
```

We've established a rigorous understanding of MLP architecture—the foundation upon which modern deep learning is built. Let's consolidate the key insights and preview what follows.
What's Next:
With the architecture established, we turn to the components that give it life: the hidden layers themselves, the activation functions applied within them, and the procedure for learning the parameters $\Theta$ from data.
You now understand the fundamental architecture of multi-layer perceptrons. This architectural foundation is essential for all neural network variants—CNNs, RNNs, Transformers, and beyond all build upon or modify the core MLP principles established here. Next, we explore hidden layers in depth, understanding how they create the internal representations that enable complex learning.