The perceptron learning rule is one of the most elegant algorithms in all of machine learning. Proposed by Frank Rosenblatt in 1957, it answered a question that McCulloch and Pitts left open: How can a neural network learn?
The perceptron algorithm is beautifully simple: make a prediction, check if it's wrong, and if so, adjust the weights to do better next time. Remarkably, for problems it can solve, it's guaranteed to converge to a correct solution in finite time.
Understanding the perceptron learning rule is essential—not because it's used directly in modern deep learning (it isn't), but because it establishes the fundamental principles of neural network training: forward passes, error signals, and weight updates.
By the end of this page, you will understand the perceptron model and its algorithm, derive the update rule geometrically and algebraically, prove the convergence theorem, implement the algorithm from scratch, and appreciate both its elegance and its fundamental limitations.
Rosenblatt's perceptron is a binary classifier—it takes an input and produces one of two outputs. The model consists of:
Input vector: x = [x₁, x₂, ..., xₙ]ᵀ ∈ ℝⁿ
Weight vector: w = [w₁, w₂, ..., wₙ]ᵀ ∈ ℝⁿ
Bias term: b ∈ ℝ (can be absorbed into weights by augmenting x with 1)
Output: y ∈ {-1, +1} or {0, 1} depending on convention
The perceptron computes:
y = sign(wᵀx + b)
Where:

sign(z) = +1 if z ≥ 0
sign(z) = −1 if z < 0
Equivalently, using the Heaviside step function:
y = 1 if wᵀx + b ≥ 0
y = 0 otherwise
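As a quick check of the two output conventions, here is a minimal sketch (the function names are illustrative, not part of any library):

```python
import numpy as np

def perceptron_sign(w, x, b):
    """±1 convention: sign with sign(0) = +1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

def perceptron_step(w, x, b):
    """0/1 convention: Heaviside step with H(0) = 1."""
    return 1 if np.dot(w, x) + b >= 0 else 0

w = np.array([2.0, -1.0])
b = -0.5
x = np.array([1.0, 1.0])          # activation = 2 - 1 - 0.5 = 0.5
print(perceptron_sign(w, x, b))   # positive activation -> +1
print(perceptron_step(w, x, b))   # same decision, 0/1 encoding -> 1
```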
The perceptron implements a linear classifier. The equation:
wᵀx + b = 0
defines a hyperplane in the input space. The perceptron classifies points based on which side of the hyperplane they fall on: points with wᵀx + b ≥ 0 are labeled +1, and points with wᵀx + b < 0 are labeled −1.
The weight vector w is perpendicular to this hyperplane and points toward the positive region.
For mathematical convenience, we often "absorb" the bias into the weight vector by augmenting the input with a constant 1: let x̃ = [x₁, ..., xₙ, 1]ᵀ and w̃ = [w₁, ..., wₙ, b]ᵀ. Then w̃ᵀx̃ = wᵀx + b. This lets us write the perceptron simply as y = sign(wᵀx) where x is understood to include the bias term.
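A one-line numerical check of the bias trick, with arbitrary example values:

```python
import numpy as np

w = np.array([0.5, -2.0])
b = 1.5
x = np.array([3.0, 1.0])

# Augment: append 1 to the input, append b to the weights
x_aug = np.append(x, 1.0)
w_aug = np.append(w, b)

# w̃ᵀx̃ equals wᵀx + b
assert np.isclose(np.dot(w_aug, x_aug), np.dot(w, x) + b)
```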
The perceptron learning algorithm is an online algorithm—it processes one example at a time and updates immediately. Here's the algorithm:
Input: Training set D = {(x⁽¹⁾, t⁽¹⁾), ..., (x⁽ᵐ⁾, t⁽ᵐ⁾)} where t⁽ⁱ⁾ ∈ {-1, +1}
Initialize: w ← 0 (or random small values)
Repeat until convergence:
For each training example (**x**⁽ⁱ⁾, t⁽ⁱ⁾):
1. Compute output: y⁽ⁱ⁾ = sign(**w**ᵀ**x**⁽ⁱ⁾)
2. If y⁽ⁱ⁾ ≠ t⁽ⁱ⁾ (misclassification):
**w** ← **w** + η · t⁽ⁱ⁾ · **x**⁽ⁱ⁾
Output: Trained weight vector w
Where η > 0 is the learning rate (often set to 1).
The update w ← w + η · t⁽ⁱ⁾ · x⁽ⁱ⁾ has an elegant interpretation:
Case 1: False negative (t = +1, y = -1)
The true label is positive, but we predicted negative. This means wᵀx < 0 when it should be ≥ 0. The update adds x to w:
w_new = w_old + ηx
Now:
w_newᵀx = (w_old + ηx)ᵀx = w_oldᵀx + η||x||² > w_oldᵀx
The inner product increases, moving the output toward the positive side. ✓
Case 2: False positive (t = -1, y = +1)
The true label is negative, but we predicted positive. The update subtracts x from w:
w_new = w_old - ηx
Now:
w_newᵀx = (w_old - ηx)ᵀx = w_oldᵀx - η||x||² < w_oldᵀx
The inner product decreases, moving the output toward the negative side. ✓
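Both cases can be verified numerically. This sketch checks the false-negative case; the false-positive case mirrors it with a subtraction (all values here are illustrative):

```python
import numpy as np

eta = 1.0
w = np.array([0.2, -1.0])
x = np.array([1.0, 0.5])     # suppose t = +1, but w.x = -0.3 < 0: false negative

before = np.dot(w, x)
w_new = w + eta * 1 * x      # perceptron update for t = +1
after = np.dot(w_new, x)

# The inner product rises by exactly eta * ||x||^2
print(before, after)
```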
```python
import numpy as np
from typing import List, Tuple


class Perceptron:
    """
    Classic Perceptron classifier with the original learning rule.

    The perceptron learns a linear decision boundary by adjusting
    weights based on misclassified examples.
    """

    def __init__(self, n_features: int, learning_rate: float = 1.0):
        """
        Initialize perceptron with zero weights.

        Args:
            n_features: Number of input features (excluding bias)
            learning_rate: Step size for weight updates (default=1)
        """
        # Weights include bias as the last element (bias trick)
        self.weights = np.zeros(n_features + 1)
        self.learning_rate = learning_rate
        self.n_updates = 0  # Track total updates

    def _augment(self, x: np.ndarray) -> np.ndarray:
        """Append 1 to input for bias term."""
        return np.append(x, 1)

    def predict(self, x: np.ndarray) -> int:
        """
        Predict class label for input x.

        Returns:
            +1 if wᵀx + b ≥ 0, -1 otherwise
        """
        x_aug = self._augment(x)
        activation = np.dot(self.weights, x_aug)
        return 1 if activation >= 0 else -1

    def train_step(self, x: np.ndarray, target: int) -> bool:
        """
        Perform one training step on a single example.

        Args:
            x: Input feature vector
            target: True label (+1 or -1)

        Returns:
            True if update was made (example was misclassified)
        """
        prediction = self.predict(x)
        if prediction != target:
            # Misclassification - update weights
            x_aug = self._augment(x)
            self.weights += self.learning_rate * target * x_aug
            self.n_updates += 1
            return True
        return False

    def fit(self, X: np.ndarray, y: np.ndarray,
            max_epochs: int = 1000) -> List[int]:
        """
        Train perceptron on dataset.

        Args:
            X: Training inputs, shape (n_samples, n_features)
            y: Training labels, shape (n_samples,), values in {-1, +1}
            max_epochs: Maximum passes through the data

        Returns:
            List of misclassification counts per epoch
        """
        n_samples = X.shape[0]
        errors_per_epoch = []

        for epoch in range(max_epochs):
            n_errors = 0
            # Shuffle data each epoch for better convergence
            indices = np.random.permutation(n_samples)
            for i in indices:
                if self.train_step(X[i], y[i]):
                    n_errors += 1
            errors_per_epoch.append(n_errors)

            # Converged if no errors
            if n_errors == 0:
                print(f"Converged after {epoch + 1} epochs, "
                      f"{self.n_updates} total updates")
                break
        else:
            print(f"Did not converge after {max_epochs} epochs")

        return errors_per_epoch


# Example: Learning AND gate
def demo_perceptron():
    # AND gate training data
    X = np.array([
        [0, 0],
        [0, 1],
        [1, 0],
        [1, 1]
    ])
    y = np.array([-1, -1, -1, 1])  # AND outputs

    # Train perceptron
    perceptron = Perceptron(n_features=2)
    errors = perceptron.fit(X, y)

    print(f"\nLearned weights: {perceptron.weights[:-1]}")
    print(f"Learned bias: {perceptron.weights[-1]}")
    print("\nPredictions:")
    for x, target in zip(X, y):
        pred = perceptron.predict(x)
        print(f"  Input: {x}, Target: {target:+d}, Prediction: {pred:+d}")

    return perceptron


demo_perceptron()
```

The perceptron learning rule has a beautiful geometric interpretation that deepens understanding.
Think of the weight vector w as defining a "direction" in input space. Points with positive class should have positive inner product with w; negative class points should have negative inner product.
When a positive example is misclassified:
wᵀx < 0 for a point with t = +1
The update w ← w + x rotates w toward x:
When a negative example is misclassified:
wᵀx > 0 for a point with t = -1
The update w ← w - x rotates w away from x:
Each update rotates the decision hyperplane (perpendicular to w) to correctly classify the misclassified example—while hopefully not disturbing too many previously correct classifications.
The algorithm proceeds by "bumping" the hyperplane around until it settles into a position that correctly separates all training points (if such a position exists).
The margin of a training example is its signed distance to the decision boundary, multiplied by its true label:
γᵢ = tᵢ · (wᵀxᵢ + b) / ||w||
A positive margin means correct classification; a larger margin means the point is further from the boundary (more "confident").
If all training examples have positive margin, the perceptron classifies everything correctly.
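The margin formula is straightforward to compute; a short sketch with illustrative values (the `margins` helper is not from any library):

```python
import numpy as np

def margins(w, b, X, t):
    """Signed distance to the boundary, times the true label, per example."""
    return t * (X @ w + b) / np.linalg.norm(w)

w = np.array([1.0, 1.0])
b = -0.5
X = np.array([[1.0, 1.0],     # activation  1.5, label +1
              [0.0, 0.0]])    # activation -0.5, label -1
t = np.array([1, -1])

g = margins(w, b, X, t)
# Both margins are positive, so both points are correctly classified
print(g)
```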
A dataset is linearly separable if there exists a hyperplane that correctly separates all positive examples from all negative examples—equivalently, if there exist w and b such that:
tᵢ · (wᵀxᵢ + b) > 0 for all training examples
The perceptron algorithm converges if and only if the data is linearly separable.
Examples of linearly separable problems include the logic gates AND, OR, and NOT.
The classic example of a non-linearly separable problem is XOR, which no single hyperplane can split (the subject of the next page).
To visualize perceptron learning in 2D: plot the data points colored by class, then watch the decision line (perpendicular to w) rotate with each update. The line "bounces" off misclassified points until it finds an orientation that separates the classes.
The most important theoretical result about perceptrons:
Theorem (Perceptron Convergence): If the training data is linearly separable, the perceptron learning algorithm will find a separating hyperplane in a finite number of updates.
Moreover, the number of updates is bounded—it doesn't depend on the order of examples, only on properties of the data.
Assumptions (with the bias absorbed into the weights):

The data is linearly separable with margin γ: there exists a unit vector w* such that tᵢ · (w*ᵀxᵢ) ≥ γ > 0 for all i.
The inputs are bounded: ||xᵢ|| ≤ R for all i.
The weights start at w₀ = 0, and η = 1 (the learning rate only rescales w and does not change the update count).
Let wₖ be the weight vector after k updates. We'll prove:
Lower bound on ||wₖ||: growth is at least linear in k.
Upper bound on ||wₖ||²: growth is at most linear in k.
These together bound k.
Part 1: Lower bound
Each update occurs on a misclassified example. If the k-th update is on example (x, t):
wₖ = wₖ₋₁ + tx
Taking inner product with optimal w*:
w*ᵀwₖ = w*ᵀwₖ₋₁ + t(w*ᵀx) ≥ w*ᵀwₖ₋₁ + γ (by the separation assumption)
By induction:
w*ᵀwₖ ≥ kγ
Since w* is a unit vector, Cauchy-Schwarz gives:
||wₖ|| ≥ w*ᵀwₖ ≥ kγ
Part 2: Upper bound
||wₖ||² = ||wₖ₋₁ + tx||² = ||wₖ₋₁||² + 2t(wₖ₋₁ᵀx) + ||x||²
Since we only update on misclassifications, t(wₖ₋₁ᵀx) ≤ 0:
||wₖ||² ≤ ||wₖ₋₁||² + ||x||² ≤ ||wₖ₋₁||² + R²
By induction:
||wₖ||² ≤ kR²
Combining the bounds:
From Part 1: ||wₖ||² ≥ k²γ²
From Part 2: ||wₖ||² ≤ kR²
Therefore:
k²γ² ≤ kR², so k ≤ R²/γ²
Conclusion: The number of updates is at most (R/γ)², which is finite.
The bound tells us that the number of mistakes scales with (R/γ)²: problems with a large margin relative to the data's radius converge quickly, while nearly-degenerate ones may need many updates. Notably, the bound is independent of the input dimension and of the number of training examples.
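The mistake bound can be checked empirically: run the perceptron (bias trick, zero initialization, η = 1) on synthetic separable data, then compare the update count against (R/γ)² computed from a known separator. Everything here is an illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
# Separable 2D data: label = sign of x1 - x2, with a gap enforced around the line
X = rng.uniform(-1, 1, size=(200, 2))
X = X[np.abs(X[:, 0] - X[:, 1]) > 0.2]
t = np.where(X[:, 0] - X[:, 1] > 0, 1, -1)

X_aug = np.hstack([X, np.ones((len(X), 1))])  # bias trick

# Run the perceptron, counting updates; update whenever the example is
# not strictly on the correct side (the convention used in the proof)
w = np.zeros(3)
updates = 0
for _ in range(1000):
    errors = 0
    for x, ti in zip(X_aug, t):
        if ti * np.dot(w, x) <= 0:
            w += ti * x
            updates += 1
            errors += 1
    if errors == 0:
        break

# Bound computed from the known unit separator w* = (1, -1, 0)/sqrt(2)
w_star = np.array([1.0, -1.0, 0.0]) / np.sqrt(2)
gamma = np.min(t * (X_aug @ w_star))          # worst-case margin under w*
R = np.max(np.linalg.norm(X_aug, axis=1))
print(updates, (R / gamma) ** 2)              # updates never exceed the bound
```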
The perceptron finds some separating hyperplane, but not necessarily the best one. Support Vector Machines (SVMs) improve on this by explicitly maximizing the margin—finding the separating hyperplane furthest from all training points. This typically leads to better generalization.
Several variants of the perceptron algorithm address different concerns.
The basic perceptron can oscillate—the final weight vector depends on which examples came last. The voted perceptron keeps track of all weight vectors encountered during training and their "survival times" (how many examples each classified correctly).
At prediction time:
y = sign(Σₖ cₖ · sign(wₖᵀx))
Where cₖ is the survival count. This typically improves generalization by ensemble-like averaging.
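The voted prediction is a one-liner once the weight vectors and survival counts have been stored during training; a sketch with hypothetical history values:

```python
import numpy as np

def voted_predict(weight_list, counts, x):
    """Voted perceptron: each stored weight vector casts c_k votes."""
    votes = sum(c * np.sign(np.dot(w, x)) for w, c in zip(weight_list, counts))
    return 1 if votes >= 0 else -1

# Hypothetical history: three weight vectors with their survival counts
ws = [np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 1.0])]
cs = [1, 5, 2]

# The long-lived vectors dominate the vote
print(voted_predict(ws, cs, np.array([1.0, -2.0])))
```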
A computationally simpler variant: return the average of all weight vectors:
w_avg = (1/T) Σₖ wₖ
Or equivalently, weight each wₖ by how long it survived. The averaged perceptron often performs nearly as well as voted perceptron with much less computation.
For non-separable data, the basic perceptron never converges. The pocket algorithm keeps track of the best weight vector seen so far: after each update, the candidate weights are evaluated on the training set, and whichever weights have made the fewest errors are kept "in the pocket" and returned at the end.
This gives reasonable results even for non-separable data.
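A possible sketch of the pocket algorithm; the function name and the evaluate-after-each-update strategy are one reasonable choice, not a fixed standard:

```python
import numpy as np

def pocket_perceptron(X, t, max_epochs=100, seed=0):
    """Run perceptron updates, but keep ('pocket') the weights
    that made the fewest training errors so far."""
    rng = np.random.default_rng(seed)
    X_aug = np.hstack([X, np.ones((len(X), 1))])  # bias trick
    w = np.zeros(X_aug.shape[1])
    best_w, best_errors = w.copy(), len(X)
    for _ in range(max_epochs):
        for i in rng.permutation(len(X)):
            if t[i] * np.dot(w, X_aug[i]) <= 0:
                w += t[i] * X_aug[i]
                # Evaluate the candidate; pocket it if it beats the best so far
                errors = np.sum(np.where(X_aug @ w >= 0, 1, -1) != t)
                if errors < best_errors:
                    best_w, best_errors = w.copy(), errors
        if best_errors == 0:
            break
    return best_w, best_errors
```

On separable data the pocket reduces to the ordinary perceptron; on noisy data it returns the best hyperplane encountered rather than whatever the last update produced.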
```python
import numpy as np


class AveragedPerceptron:
    """
    Averaged Perceptron for improved generalization.

    Instead of returning the final weights, returns the average of
    all weight vectors seen during training, weighted by how long
    each survived.
    """

    def __init__(self, n_features: int):
        # Current weights
        self.weights = np.zeros(n_features + 1)
        # Accumulated weights (for computing average)
        self.accumulated_weights = np.zeros(n_features + 1)
        # Counter for averaging
        self.survival_count = 0

    def _augment(self, x: np.ndarray) -> np.ndarray:
        return np.append(x, 1)

    def predict(self, x: np.ndarray, use_average: bool = True) -> int:
        """Predict using averaged weights (default) or current weights."""
        x_aug = self._augment(x)
        if use_average:
            w = self.accumulated_weights / max(1, self.survival_count)
        else:
            w = self.weights
        return 1 if np.dot(w, x_aug) >= 0 else -1

    def fit(self, X: np.ndarray, y: np.ndarray,
            max_epochs: int = 100) -> 'AveragedPerceptron':
        """
        Train averaged perceptron.
        """
        n_samples = X.shape[0]

        for epoch in range(max_epochs):
            n_errors = 0
            for i in np.random.permutation(n_samples):
                x_aug = self._augment(X[i])
                prediction = 1 if np.dot(self.weights, x_aug) >= 0 else -1

                if prediction != y[i]:
                    # Update current weights
                    self.weights += y[i] * x_aug
                    n_errors += 1

                # Accumulate current weights (even if no error)
                self.accumulated_weights += self.weights
                self.survival_count += 1

            if n_errors == 0:
                print(f"Converged after {epoch + 1} epochs")
                break

        return self

    def get_average_weights(self) -> np.ndarray:
        """Return the averaged weight vector."""
        return self.accumulated_weights / max(1, self.survival_count)


# The averaged perceptron often generalizes better than standard perceptron,
# especially when training data has noise or is nearly-separable
```

The perceptron extends to multi-class classification using the one-vs-all or multi-class perceptron approach:
One-vs-all: Train K binary perceptrons, one for each class vs. all others. Predict the class whose perceptron gives the highest activation.
Multi-class perceptron: Maintain a weight vector wₖ for each class k. Predict class = argmaxₖ(wₖᵀx). Update rule: if the predicted class ŷ differs from the true class t, set w_t ← w_t + x (pull the true class toward x) and w_ŷ ← w_ŷ − x (push the wrongly predicted class away).
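A minimal sketch of a multi-class perceptron with per-class weight vectors and argmax prediction (class and method names are illustrative):

```python
import numpy as np


class MulticlassPerceptron:
    """Multi-class perceptron: one weight vector per class;
    predict by argmax, update the true and predicted rows on error."""

    def __init__(self, n_classes: int, n_features: int):
        self.W = np.zeros((n_classes, n_features + 1))  # bias trick

    def predict(self, x):
        x_aug = np.append(x, 1.0)
        return int(np.argmax(self.W @ x_aug))

    def fit(self, X, y, epochs=200):
        for _ in range(epochs):
            errors = 0
            for x, c in zip(X, y):
                pred = self.predict(x)
                if pred != c:
                    x_aug = np.append(x, 1.0)
                    self.W[c] += x_aug      # pull true class toward x
                    self.W[pred] -= x_aug   # push predicted class away
                    errors += 1
            if errors == 0:
                break
        return self
```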
The perceptron can be kernelized to learn nonlinear decision boundaries. In the dual form, the weights are an implicit combination w = Σᵢ αᵢ tᵢ φ(xᵢ): keep a mistake counter αᵢ for each training example, predict y = sign(Σᵢ αᵢ · tᵢ · K(xᵢ, x)), and on a mistake on example i, increment αᵢ.
This is conceptually an infinite-dimensional perceptron in feature space.
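A sketch of the dual-form kernel perceptron with an RBF kernel (the kernel choice and `gamma` value are illustrative; any positive-definite kernel works). With an RBF kernel it can solve XOR, which defeats the linear perceptron:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian (RBF) kernel."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_perceptron(X, t, kernel=rbf, epochs=50):
    """Dual-form perceptron: alpha[i] counts mistakes on example i."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)
    for _ in range(epochs):
        errors = 0
        for i in range(n):
            if t[i] * np.sum(alpha * t * K[:, i]) <= 0:
                alpha[i] += 1
                errors += 1
        if errors == 0:
            break
    return alpha

def kernel_predict(alpha, X, t, x, kernel=rbf):
    s = sum(alpha[i] * t[i] * kernel(X[i], x) for i in range(len(X)))
    return 1 if s >= 0 else -1

# XOR: not linearly separable, but separable in RBF feature space
X = np.array([[0.0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([-1, 1, 1, -1])
alpha = kernel_perceptron(X, t)
print([kernel_predict(alpha, X, t, x) for x in X])
```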
The perceptron learning rule is one of several related approaches that emerged in the early neural network era.
The delta rule (Widrow-Hoff rule, LMS algorithm) updates weights based on the continuous error before thresholding:
Perceptron: Δw = η · (t − y) · x where y ∈ {-1, +1} (zero on correct examples; on mistakes, t − y = 2t, the same direction as the η · t · x update)
Delta rule: Δw = η · (t - z) · x where z = wᵀx (pre-threshold)
Key differences:
| Aspect | Perceptron | Delta Rule |
|---|---|---|
| Error signal | Binary (correct/wrong) | Continuous (distance from target) |
| Objective | Any separating hyperplane | Minimize squared error |
| Gradient-based | Not exactly | Yes (gradient of MSE) |
| Guaranteed convergence | Yes (if separable) | Converges to LSE solution |
| Update on correct examples | No | Yes (if not exactly right) |
The delta rule is a precursor to backpropagation—it uses gradient descent on squared error for a single neuron.
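The delta rule is easy to implement. This sketch (the function name and hyperparameters are illustrative) fits the AND data from earlier; unlike the perceptron, it keeps updating whenever the pre-threshold activation z differs from the target, so it converges toward the least-squares solution:

```python
import numpy as np

def delta_rule_fit(X, t, eta=0.02, epochs=2000):
    """Widrow-Hoff / LMS: gradient descent on the squared error of the
    pre-threshold activation z = w.x (bias handled via augmentation)."""
    X_aug = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(X_aug.shape[1])
    for _ in range(epochs):
        for x, ti in zip(X_aug, t):
            z = np.dot(w, x)          # continuous activation, no threshold
            w += eta * (ti - z) * x   # update even when sign(z) is correct
    return w

# AND gate with ±1 targets: least-squares solution is near (1, 1, -1.5)
X = np.array([[0.0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([-1, -1, -1, 1])
w = delta_rule_fit(X, t)
print(w)
```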
Recall Hebb's rule: "Neurons that fire together, wire together."
Mathematically: Δwᵢⱼ ∝ xᵢ · yⱼ
The perceptron update has a Hebbian character:
Δw = η · t · x
But uses the target t instead of the actual output y. This is sometimes called the "teacher signal"—external feedback about what the output should be.
Pure Hebbian learning is unsupervised (no targets); perceptron learning is supervised.
While the perceptron rule predates modern optimization framing, we can understand it through gradients.
Consider the perceptron loss for a single example:
L(w) = max(0, -t · wᵀx)
The subgradient of this loss:
∂L/∂w = 0 if correctly classified
∂L/∂w = −t · x if misclassified
Gradient descent update:
w ← w - η · ∂L/∂w = w + η · t · x (if misclassified)
This is exactly the perceptron update! The perceptron algorithm is gradient descent on the perceptron loss.
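This equivalence is easy to verify numerically; a minimal sketch with illustrative values:

```python
import numpy as np

def perceptron_loss(w, x, t):
    """Hinge-at-zero loss: max(0, -t * w.x)."""
    return max(0.0, -t * np.dot(w, x))

def perceptron_subgrad(w, x, t):
    """Subgradient: zero when correctly classified, -t*x when not."""
    return -t * x if t * np.dot(w, x) <= 0 else np.zeros_like(x)

w = np.array([1.0, -1.0])
x = np.array([0.5, 1.0])
t = 1                    # t * w.x = -0.5, so this example is misclassified
eta = 1.0

# One subgradient-descent step...
w_new = w - eta * perceptron_subgrad(w, x, t)
# ...is identical to the perceptron update w + eta * t * x
print(w_new)
```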
Perceptron finds any separating hyperplane; SVM finds the maximum-margin hyperplane.
SVM objective:
minimize ||w||² subject to tᵢ(wᵀxᵢ + b) ≥ 1 for all i
The perceptron has no preference among separating hyperplanes. SVM's margin maximization typically yields better generalization.
The kernel perceptron and kernel SVM are closely related—both can learn nonlinear boundaries via kernel trick. But SVM's convex optimization is more principled than perceptron's greedy updates.
Despite being superseded by SVMs and deep learning, understanding the perceptron is valuable: (1) It's the simplest learning algorithm with a convergence guarantee; (2) The update rule intuition carries to backpropagation; (3) The averaged perceptron is still used in NLP for efficiency; (4) The convergence proof technique (bounding growth rates) is widely applicable.
Let's implement a complete perceptron with training visualization to see the algorithm in action.
```python
import numpy as np
from typing import List, Tuple, Optional


class VisualizablePerceptron:
    """
    Perceptron that records learning history for visualization.
    """

    def __init__(self, n_features: int = 2, learning_rate: float = 1.0):
        self.n_features = n_features
        self.learning_rate = learning_rate
        self.weights = np.zeros(n_features)
        self.bias = 0.0

        # History for visualization
        self.weight_history: List[Tuple[np.ndarray, float]] = []
        self.update_history: List[Tuple[np.ndarray, int]] = []  # (point, label)

    def predict(self, x: np.ndarray) -> int:
        activation = np.dot(self.weights, x) + self.bias
        return 1 if activation >= 0 else -1

    def predict_batch(self, X: np.ndarray) -> np.ndarray:
        activations = X @ self.weights + self.bias
        return np.where(activations >= 0, 1, -1)

    def fit(self, X: np.ndarray, y: np.ndarray,
            max_epochs: int = 100) -> dict:
        """
        Train perceptron and record history.

        Returns:
            Dictionary with training statistics
        """
        n_samples = X.shape[0]
        self.weight_history = [(self.weights.copy(), self.bias)]
        self.update_history = []

        stats = {
            'converged': False,
            'epochs': 0,
            'updates': 0,
            'errors_per_epoch': []
        }

        for epoch in range(max_epochs):
            n_errors = 0
            for i in range(n_samples):
                prediction = self.predict(X[i])
                if prediction != y[i]:
                    # Record update
                    self.update_history.append((X[i].copy(), y[i]))

                    # Perceptron update
                    self.weights += self.learning_rate * y[i] * X[i]
                    self.bias += self.learning_rate * y[i]

                    # Record new weights
                    self.weight_history.append(
                        (self.weights.copy(), self.bias)
                    )
                    n_errors += 1
                    stats['updates'] += 1

            stats['errors_per_epoch'].append(n_errors)
            stats['epochs'] = epoch + 1

            if n_errors == 0:
                stats['converged'] = True
                break

        return stats

    def decision_boundary(self) -> Optional[Tuple[float, float, float]]:
        """
        Return (slope, intercept, bias) for the decision boundary (2D only).

        Decision boundary: w1*x1 + w2*x2 + b = 0
                           x2 = (-w1*x1 - b) / w2
        """
        if self.n_features != 2:
            return None
        w1, w2 = self.weights
        if abs(w2) < 1e-10:
            return None  # Vertical line
        slope = -w1 / w2
        intercept = -self.bias / w2
        return slope, intercept, self.bias

    def accuracy(self, X: np.ndarray, y: np.ndarray) -> float:
        predictions = self.predict_batch(X)
        return np.mean(predictions == y)


def generate_linearly_separable_data(
        n_samples: int = 100,
        margin: float = 0.5,
        noise: float = 0.0,
        seed: int = 42) -> Tuple[np.ndarray, np.ndarray]:
    """
    Generate linearly separable 2D data.

    Points are separated by the line x2 = x1, with a gap of 2*margin.
    """
    np.random.seed(seed)

    # abs() keeps every positive point strictly above x2 = x1 + margin
    X_positive = np.random.randn(n_samples // 2, 2)
    X_positive[:, 1] = X_positive[:, 0] + margin + np.abs(X_positive[:, 1])

    # ...and every negative point strictly below x2 = x1 - margin
    X_negative = np.random.randn(n_samples // 2, 2)
    X_negative[:, 1] = X_negative[:, 0] - margin - np.abs(X_negative[:, 1])

    # Add noise
    X_positive += noise * np.random.randn(*X_positive.shape)
    X_negative += noise * np.random.randn(*X_negative.shape)

    X = np.vstack([X_positive, X_negative])
    y = np.hstack([np.ones(n_samples // 2), -np.ones(n_samples // 2)])

    # Shuffle
    perm = np.random.permutation(n_samples)
    return X[perm], y[perm].astype(int)


# Example usage
X, y = generate_linearly_separable_data(n_samples=100, margin=1.0)
perceptron = VisualizablePerceptron(n_features=2)
stats = perceptron.fit(X, y)

print("Training statistics:")
print(f"  Converged: {stats['converged']}")
print(f"  Epochs: {stats['epochs']}")
print(f"  Updates: {stats['updates']}")
print(f"  Final accuracy: {perceptron.accuracy(X, y):.2%}")
print(f"  Final weights: {perceptron.weights}")
print(f"  Final bias: {perceptron.bias:.4f}")
```

We've thoroughly examined the perceptron—the first learning algorithm for neural networks.
The Perceptron Update Rule:
If sign(wᵀx + b) ≠ t:
w ← w + η · t · x
b ← b + η · t
This simple rule, proposed in 1957, established the paradigm of neural network learning: compute an output, compare it with the target, and adjust the weights to reduce the error.
This same pattern—forward pass, loss computation, backward adjustment—underlies all of modern deep learning.
What's next:
The next page examines the XOR problem in depth—the limitation that sparked the first AI winter. Understanding why single-layer networks fail on XOR, and how multi-layer networks succeed, is essential for appreciating the power of deep architectures.
You now have a thorough understanding of the perceptron learning rule—its algorithm, geometric intuition, convergence guarantee, variants, and relationship to other learning methods. This foundation is essential for understanding both the power and limitations of neural network learning.