Here's a fundamental truth that often surprises newcomers: When deep learning practitioners say 'convolution', they almost always mean cross-correlation.
Open any deep learning framework—PyTorch, TensorFlow, JAX—and examine the Conv2d operation. Despite its name, it implements cross-correlation, not true mathematical convolution. The kernel is not flipped before sliding across the input.
This isn't a bug or an oversight. It's a deliberate design choice with deep practical justification. Understanding why this choice was made—and why it typically doesn't matter for neural networks—illuminates something fundamental about how CNNs learn.
In this page, we rigorously define cross-correlation, compare it mathematically to convolution, and explain the conditions under which the distinction is irrelevant. By the end, you'll understand exactly what happens when you instantiate a 'convolutional' layer in your favorite framework.
By the end of this page, you will understand the mathematical definition of cross-correlation, clearly distinguish it from true convolution, know why deep learning uses cross-correlation despite calling it 'convolution', and recognize the cases where the distinction does matter.
1D Cross-Correlation:
For a discrete signal f[n] and kernel g[n], the cross-correlation (often denoted with ★ or ⊛ to distinguish from convolution's *) is:
$$(f \star g)[n] = \sum_{k=-\infty}^{\infty} f[k] \cdot g[k - n]$$
Or equivalently, with finite sequences:
$$(f \star g)[n] = \sum_{k=0}^{K-1} f[n + k] \cdot g[k]$$
The Crucial Difference:
Compare to convolution:
$$(f * g)[n] = \sum_{k=-\infty}^{\infty} f[k] \cdot g[n - k]$$
The difference is subtle in the formula but profound in interpretation. In convolution, the kernel is reversed as it slides; in cross-correlation, it maintains its original orientation.
2D Cross-Correlation:
For a 2D input I[m, n] and kernel K[i, j]:
$$(I \star K)[m, n] = \sum_{i=0}^{K_h-1} \sum_{j=0}^{K_w-1} I[m + i, n + j] \cdot K[i, j]$$
Here, the kernel slides across the image without rotation or flipping. Each output position is the dot product of the kernel with the corresponding image patch, aligned directly.
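To make the formula concrete, here is a minimal sketch that translates it directly into code (the function name cross_correlate_2d is ours; it assumes a single channel, 'valid' output size, and no stride or padding):

```python
import numpy as np

def cross_correlate_2d(I: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Direct translation of the 2D cross-correlation formula above ('valid' size)."""
    Kh, Kw = K.shape
    out_h = I.shape[0] - Kh + 1
    out_w = I.shape[1] - Kw + 1
    out = np.zeros((out_h, out_w))
    for m in range(out_h):
        for n in range(out_w):
            # Dot product of the kernel with the patch anchored at (m, n)
            out[m, n] = np.sum(I[m:m + Kh, n:n + Kw] * K)
    return out

I = np.arange(1.0, 10.0).reshape(3, 3)
K = np.array([[1.0, 0.0], [0.0, 2.0]])
print(cross_correlate_2d(I, K))  # matches scipy.signal.correlate2d(I, K, mode='valid')
```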
Relationship to Convolution:
Cross-correlation with kernel K is equivalent to convolution with the flipped kernel K':
$$f \star g = f * g'$$
where g'[k] = g[-k] in 1D, or K'[i, j] = K[K_h - 1 - i, K_w - 1 - j] in 2D (a 180° rotation).
This means any operation achievable with one is achievable with the other, just with a transformed kernel.
```python
import numpy as np
from scipy.signal import convolve2d, correlate2d


def demonstrate_convolution_vs_correlation():
    """
    Demonstrate the mathematical difference between convolution and correlation.
    """
    # Create a simple test image
    image = np.array([
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ], dtype=float)

    # An asymmetric kernel (to see the flip effect clearly)
    kernel = np.array([
        [1, 0],
        [0, 2]
    ], dtype=float)

    # Flipped kernel (180° rotation)
    kernel_flipped = np.array([
        [2, 0],
        [0, 1]
    ], dtype=float)

    print("Image:")
    print(image)
    print("\nKernel:")
    print(kernel)
    print("\nFlipped Kernel (180° rotation):")
    print(kernel_flipped)

    # True convolution (scipy uses the mathematical definition)
    conv_result = convolve2d(image, kernel, mode='valid')

    # Cross-correlation (no flip)
    corr_result = correlate2d(image, kernel, mode='valid')

    # Convolution with flipped kernel = Cross-correlation with original
    conv_flipped = convolve2d(image, kernel_flipped, mode='valid')

    print("\nConvolution (with kernel flip):")
    print(conv_result)
    print("\nCross-correlation (no flip):")
    print(corr_result)
    print("\nConvolution with pre-flipped kernel:")
    print(conv_flipped)

    # Key observation: conv_flipped equals corr_result
    print("\nConv(image, flipped_kernel) == Corr(image, kernel):")
    print(np.allclose(conv_flipped, corr_result))


if __name__ == "__main__":
    demonstrate_convolution_vs_correlation()
```

For symmetric kernels (where K[i, j] = K[K_h - 1 - i, K_w - 1 - j]), convolution and cross-correlation produce identical results. Many classical image processing kernels (Gaussian, Laplacian) are symmetric. But learned CNN kernels are not constrained to symmetry, so the distinction could theoretically matter.
Given that convolution and cross-correlation are distinct operations, why did the deep learning community standardize on cross-correlation while calling it 'convolution'? There are several compelling reasons:
1. Gradient Computation Symmetry:
During backpropagation, the gradient with respect to the input involves correlating the output gradient with the kernel. If the forward pass uses convolution (with flip), the backward pass uses correlation (without flip), and vice versa.
By using cross-correlation in the forward pass, the mathematical structure of backpropagation becomes more symmetric and slightly more intuitive. The same kernel orientation appears in both input gradient and weight gradient computations.
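As a quick sanity check of this symmetry, the sketch below uses PyTorch's F.conv1d (which implements cross-correlation in the forward pass) and verifies that the resulting input gradient equals a full convolution of the upstream gradient with the unflipped kernel; the specific signal and kernel values are arbitrary:

```python
import numpy as np
import torch
import torch.nn.functional as F

# Sketch of the gradient symmetry: forward pass is cross-correlation, so the
# input gradient should be a FULL convolution of the upstream gradient with
# the (unflipped) kernel.
x = torch.arange(1.0, 7.0, requires_grad=True)   # signal of length 6
w = torch.tensor([1.0, 2.0, 3.0])                 # asymmetric kernel

y = F.conv1d(x.view(1, 1, -1), w.view(1, 1, -1)).squeeze()
y.sum().backward()                                # upstream gradient = all ones

upstream = np.ones(y.shape[0])
expected = np.convolve(upstream, w.numpy(), mode='full')

print(x.grad.numpy())                             # [1. 3. 6. 6. 5. 3.]
print(expected)                                   # [1. 3. 6. 6. 5. 3.]
print(np.allclose(x.grad.numpy(), expected))      # True
```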
2. Historical Momentum:
Early CNN implementations (dating to LeCun's 1989 work) used cross-correlation, possibly for implementation simplicity. As the field grew, this choice became entrenched. Changing conventions would break backward compatibility across decades of research.
3. The Learnability Argument (The Killer Insight):
This is the most important reason. Consider what happens during training: suppose some feature detector would minimize the loss, and suppose true convolution (with its flip) would apply it in the 'intended' orientation. Doesn't cross-correlation then apply it backwards?
But the kernel weights are learned!
Backpropagation doesn't know what the 'intended' kernel looks like. It simply adjusts kernel weights to minimize loss. If cross-correlation is used and a flipped kernel is needed, backpropagation will learn the flipped version directly.
In other words: whatever kernel true convolution would have learned, cross-correlation learns its 180° rotation, and vice versa.
The final effect on outputs is identical. The network learns whatever kernel orientation produces the correct output—the flip is absorbed into the learned weights.
Since kernels are learned (not hand-designed), the choice between convolution and cross-correlation is mathematically irrelevant to CNN performance. Whatever feature detector would be learned under convolution, its 180° rotated version is learned under cross-correlation. The network's capacity and final performance are identical.
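The learnability argument is easy to check empirically. The sketch below is a toy regression with arbitrary kernel values and hyperparameters: targets are generated by true convolution with a fixed kernel, a cross-correlation layer is trained on them, and the learned weights converge to the 180°-rotated kernel.

```python
import torch
import torch.nn.functional as F

# Toy experiment (sketch): targets come from TRUE convolution with a fixed
# asymmetric kernel; a cross-correlation layer trained on them learns the
# 180°-rotated kernel. Kernel values and hyperparameters are arbitrary.
torch.manual_seed(0)

true_kernel = torch.tensor([[1.0, -2.0], [3.0, 0.5]])
flipped = torch.flip(true_kernel, dims=[0, 1])      # 180° rotation

# True convolution == cross-correlation with the flipped kernel
images = torch.randn(256, 1, 6, 6)
targets = F.conv2d(images, flipped.view(1, 1, 2, 2))

# Learnable weights, applied via cross-correlation (F.conv2d does not flip)
weight = torch.randn(1, 1, 2, 2, requires_grad=True)
optimizer = torch.optim.SGD([weight], lr=0.05)

for step in range(500):
    optimizer.zero_grad()
    loss = ((F.conv2d(images, weight) - targets) ** 2).mean()
    loss.backward()
    optimizer.step()

print("Target kernel (convolution convention):\n", true_kernel)
print("Learned kernel (cross-correlation convention):\n", weight.detach().squeeze())
print("Learned ≈ 180° rotation of target:",
      torch.allclose(weight.detach().squeeze(), flipped, atol=1e-2))
```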
4. Implementation Simplicity:
Cross-correlation is slightly simpler to implement: there is no kernel-flipping step, and the loop indexing runs forward on both the input and the kernel (input[n + k] paired with kernel[k]) rather than requiring a reversed kernel index.
These micro-simplifications compound across millions of lines of deep learning code.
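The indexing difference is visible in a direct loop implementation. In this sketch (1D, 'valid' output, helper names are ours), the cross-correlation loop walks the input and kernel forward together, while the convolution loop has to reverse the kernel index:

```python
import numpy as np

def correlate1d(x, w):
    """Cross-correlation: input and kernel indexed forward together."""
    K = len(w)
    return np.array([sum(x[n + k] * w[k] for k in range(K))
                     for n in range(len(x) - K + 1)])

def convolve1d_valid(x, w):
    """True convolution: kernel index runs in reverse."""
    K = len(w)
    return np.array([sum(x[n + k] * w[K - 1 - k] for k in range(K))
                     for n in range(len(x) - K + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 2.0, 3.0])
print(correlate1d(x, w))       # [14. 20. 26.]
print(convolve1d_valid(x, w))  # [10. 16. 22.]
```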
While the learnability argument makes the practical distinction moot for CNNs, the mathematical properties differ significantly. Understanding these differences is valuable for theoretical analysis and for applications outside standard CNNs.
Properties of True Convolution: commutative, associative, distributive over addition, with the delta function δ as its identity; under the Fourier transform it becomes pointwise multiplication.
Properties of Cross-Correlation: distributive over addition, but neither commutative nor associative, and with no identity element; under the Fourier transform it becomes multiplication with a complex conjugate.
| Property | Convolution (*) | Cross-Correlation (★) | Impact in Deep Learning |
|---|---|---|---|
| Commutativity | ✓ f * g = g * f | ✗ f ★ g ≠ g ★ f | Rarely exploited in CNNs |
| Associativity | ✓ (f * g) * h = f * (g * h) | ✗ Not associative | Matters for kernel fusion analysis |
| Distributivity | ✓ Distributes over + | ✓ Distributes over + | Both support linear operations |
| Identity | ✓ δ is identity | ✗ No identity | No practical CNN impact |
| Fourier Relation | Multiplication | Multiplication with conjugate | FFT-based convolution uses true conv |
Loss of Associativity: A Deeper Look
The loss of associativity in cross-correlation has subtle implications for theoretical analysis:
With true convolution, stacking two convolutional layers with kernels K₁ and K₂ is theoretically equivalent to a single layer with kernel K₁ * K₂. This is because:
$$(f * K_1) * K_2 = f * (K_1 * K_2)$$
With cross-correlation, this equivalence doesn't hold in general:
$$(I \star K_1) \star K_2 \neq I \star (K_1 \star K_2)$$
However, in practice, this theoretical difference has minimal impact: real networks insert nonlinearities (ReLU, normalization) between layers, so stacked layers are not equivalent to a single linear filter under either convention, and learned kernels absorb whatever orientation is required.
If you're analyzing CNN expressivity or attempting to algebraically combine kernel effects (e.g., for network compression or theoretical capacity analysis), the non-associativity of cross-correlation can cause surprises. Always verify whether a paper uses true convolution or cross-correlation conventions.
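A quick numerical check makes the contrast concrete. This sketch (random 8×8 image and 3×3 kernels, scipy's 'full' mode) verifies that stacked true convolutions collapse into one combined kernel while stacked cross-correlations do not:

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
k1 = rng.standard_normal((3, 3))
k2 = rng.standard_normal((3, 3))

# Associativity of convolution: (I * K1) * K2 == I * (K1 * K2)
conv_stacked = convolve2d(convolve2d(image, k1, mode='full'), k2, mode='full')
conv_combined = convolve2d(image, convolve2d(k1, k2, mode='full'), mode='full')
print("Convolution associative:", np.allclose(conv_stacked, conv_combined))        # True

# Cross-correlation: (I ★ K1) ★ K2 != I ★ (K1 ★ K2) in general
corr_stacked = correlate2d(correlate2d(image, k1, mode='full'), k2, mode='full')
corr_combined = correlate2d(image, correlate2d(k1, k2, mode='full'), mode='full')
print("Cross-correlation associative:", np.allclose(corr_stacked, corr_combined))  # False
```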
Let's build geometric intuition for the difference between convolution and cross-correlation through visualization.
Convolution: Flip and Slide. The kernel is first rotated 180° (reversed in 1D), then slid across the input; each output is the dot product of the flipped kernel with the current window.
Cross-Correlation: Direct Slide. The kernel slides across the input in its original orientation; each output is the dot product of the kernel, as written, with the current window.
Visual Example:
Consider a 1D signal [a, b, c, d, e] and kernel [1, 2, 3].
Cross-Correlation at position 0: 1·a + 2·b + 3·c (the kernel is applied as written).
Convolution at position 0: 3·a + 2·b + 1·c (the kernel is applied reversed, as [3, 2, 1]).
The difference is how kernel weights align with signal positions.
```python
import numpy as np


def visual_comparison_1d():
    """
    Step-by-step trace comparing convolution and cross-correlation.
    """
    signal = np.array([1, 2, 3, 4, 5], dtype=float)
    kernel = np.array([1, 2, 3], dtype=float)
    kernel_flipped = kernel[::-1]  # [3, 2, 1]

    print("Signal:", signal)
    print("Kernel:", kernel)
    print("Flipped Kernel:", kernel_flipped)
    print()

    # Cross-correlation: direct alignment
    print("=== Cross-Correlation (no flip) ===")
    for pos in range(len(signal) - len(kernel) + 1):
        window = signal[pos:pos + len(kernel)]
        result = np.dot(window, kernel)
        print(f"Position {pos}: {window} • {kernel} = {result}")
    print()

    # Convolution: kernel is flipped
    print("=== Convolution (with flip) ===")
    for pos in range(len(signal) - len(kernel) + 1):
        window = signal[pos:pos + len(kernel)]
        result = np.dot(window, kernel_flipped)
        print(f"Position {pos}: {window} • {kernel_flipped} = {result}")
    print()

    print("Notice: Same positions, different results due to kernel orientation")


if __name__ == "__main__":
    visual_comparison_1d()
```

Directional Interpretation:
The kernel flip in convolution can be thought of as 'looking backward'—the kernel's first element (reading left-to-right) aligns with the signal's last element in the window (right-to-left).
Cross-correlation is 'looking forward': the kernel's first element aligns with the signal's first element in the window. Both conventions are internally consistent; they simply index the kernel against the window in opposite directions.
For symmetric kernels (palindromic pattern), both operations yield identical results because flipping doesn't change the kernel.
In Images (2D):
The 180° rotation in 2D means the kernel is flipped both vertically and horizontally: the top-left weight trades places with the bottom-right, and the top-right with the bottom-left.
For an asymmetric kernel, such as an edge detector that responds to dark-to-bright transitions in one direction, the 180° rotation reverses the polarity it responds to. With learned kernels, this just means the network learns whichever orientation works.
Cross-correlation is more intuitive for template matching: the kernel directly represents the pattern you're looking for. With convolution, you'd need to flip the template mentally. This is another reason cross-correlation became the deep learning standard—what you visualize is what gets applied.
Let's examine how major deep learning frameworks implement their 'convolution' operations and verify that they use cross-correlation.
PyTorch nn.Conv2d:
PyTorch's documentation explicitly states that Conv2d computes cross-correlation. From the docs:
This module supports TensorFloat32. The implementation uses cross-correlation...
The forward pass is: $$\text{output}[n, c_{out}, h, w] = \sum_{c_{in}} \sum_{i} \sum_{j} \text{input}[n, c_{in}, h+i, w+j] \times \text{weight}[c_{out}, c_{in}, i, j] + \text{bias}[c_{out}]$$
No flip is applied to the weight tensor.
TensorFlow/Keras Conv2D:
TensorFlow similarly implements cross-correlation:
This layer creates a convolution kernel that is convolved (actually cross-correlated) with the layer input...
JAX/Flax:
JAX's lax.conv_general_dilated provides options for various convolution modes but defaults to cross-correlation semantics.
```python
import torch
import torch.nn as nn
import numpy as np


def verify_pytorch_uses_cross_correlation():
    """
    Empirically verify that PyTorch's Conv2d uses cross-correlation.
    """
    # Create a simple test case
    # Input: 1 batch, 1 channel, 3x3 image
    input_tensor = torch.tensor([
        [1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0],
        [7.0, 8.0, 9.0]
    ]).reshape(1, 1, 3, 3)

    # Asymmetric kernel to make flip visible
    kernel = torch.tensor([
        [1.0, 2.0],
        [3.0, 4.0]
    ]).reshape(1, 1, 2, 2)

    # Create Conv2d layer and set weights manually
    conv = nn.Conv2d(1, 1, 2, bias=False)
    conv.weight.data = kernel

    # PyTorch result
    pytorch_result = conv(input_tensor).squeeze()

    # Manual cross-correlation (no flip)
    input_np = input_tensor.numpy().squeeze()
    kernel_np = kernel.numpy().squeeze()

    correlation_result = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            patch = input_np[i:i+2, j:j+2]
            correlation_result[i, j] = np.sum(patch * kernel_np)

    # Manual convolution (with 180° flip)
    kernel_flipped = kernel_np[::-1, ::-1]
    convolution_result = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            patch = input_np[i:i+2, j:j+2]
            convolution_result[i, j] = np.sum(patch * kernel_flipped)

    print("Input:")
    print(input_np)
    print("\nKernel:")
    print(kernel_np)
    print("\nPyTorch Conv2d result:")
    print(pytorch_result.detach().numpy())
    print("\nManual cross-correlation:")
    print(correlation_result)
    print("\nManual true convolution:")
    print(convolution_result)

    print("\n--- Verification ---")
    print(f"PyTorch matches cross-correlation: {np.allclose(pytorch_result.detach().numpy(), correlation_result)}")
    print(f"PyTorch matches true convolution: {np.allclose(pytorch_result.detach().numpy(), convolution_result)}")


if __name__ == "__main__":
    verify_pytorch_uses_cross_correlation()
```

The naming choice to call cross-correlation 'convolution' dates back to the earliest CNN papers. Changing terminology now would cause immense confusion across decades of literature, papers, and codebases. The deep learning community has collectively decided to live with the terminology overload.
While learnability makes the distinction irrelevant for most CNN training, there are specific scenarios where you must be aware of the difference:
1. Using Pre-defined Kernels:
If you're using hand-crafted kernels (e.g., Sobel, Laplacian, Gabor filters) within a CNN framework, remember that the framework applies cross-correlation. If these kernels were designed with true convolution in mind, you may need to flip them first.
2. FFT-Based Acceleration:
The convolution theorem (ℱ{f * g} = ℱ{f}·ℱ{g}) applies to true convolution, not cross-correlation. If you implement FFT-based 'convolution' for speed, you must either flip the kernel before transforming it, or take the complex conjugate of the kernel's spectrum (the correlation theorem), so that the frequency-domain pipeline reproduces the framework's cross-correlation.
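As a sketch of the second option, the snippet below computes a circular 1D cross-correlation by conjugating the kernel's spectrum and checks it against a direct summation; it assumes real inputs and wrap-around boundaries, which differ from the zero-padded 'valid' behavior of framework conv layers:

```python
import numpy as np

# Circular cross-correlation via FFT: F{f ⋆ g} = F{f} · conj(F{g})
signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
kernel = np.array([1.0, 2.0, 3.0])

N = len(signal)
kernel_padded = np.zeros(N)
kernel_padded[:len(kernel)] = kernel

fft_corr = np.real(np.fft.ifft(np.fft.fft(signal) * np.conj(np.fft.fft(kernel_padded))))

# Direct circular cross-correlation: out[n] = Σ_k signal[(n + k) % N] · kernel[k]
direct = np.array([sum(signal[(n + k) % N] * kernel[k] for k in range(len(kernel)))
                   for n in range(N)])

print(np.allclose(fft_corr, direct))  # True
```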
3. Signal Processing Interoperability:
When interfacing with signal processing libraries (scipy, MATLAB's Signal Processing Toolbox), be aware that:
- scipy.signal.convolve uses true convolution
- scipy.signal.correlate uses cross-correlation

Mixing conventions without awareness causes bugs.
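A quick check of the two scipy conventions with an asymmetric kernel (arbitrary example values):

```python
import numpy as np
from scipy.signal import convolve, correlate

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
k = np.array([1.0, 2.0, 3.0])               # asymmetric on purpose

print(convolve(x, k, mode='valid'))          # true convolution (kernel flipped): [10. 16. 22.]
print(correlate(x, k, mode='valid'))         # cross-correlation (no flip):       [14. 20. 26.]
print(convolve(x, k[::-1], mode='valid'))    # flipping recovers the correlation result
```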
```python
import numpy as np


def sobel_kernel_example():
    """
    Demonstrate the importance of flipping when using pre-defined kernels.
    """
    # Standard Sobel kernel for vertical edge detection
    # Designed with convolution in mind: bright-on-right edges
    sobel_x = np.array([
        [-1, 0, 1],
        [-2, 0, 2],
        [-1, 0, 1]
    ], dtype=np.float32)

    print("Original Sobel-X kernel (designed for convolution):")
    print(sobel_x)

    # For use in PyTorch (which uses cross-correlation):
    # To get the same effect as convolution, flip the kernel
    sobel_x_for_pytorch = np.flip(np.flip(sobel_x, 0), 1)  # 180° rotation
    print("\nFlipped Sobel-X for PyTorch cross-correlation:")
    print(sobel_x_for_pytorch)

    # However, for symmetric kernels like Gaussian, no flip needed
    gaussian_3x3 = np.array([
        [1, 2, 1],
        [2, 4, 2],
        [1, 2, 1]
    ], dtype=np.float32) / 16

    # 180° rotation of a symmetric kernel is the same
    gaussian_flipped = np.flip(np.flip(gaussian_3x3, 0), 1)

    print("\nGaussian kernel:")
    print(gaussian_3x3)
    print("\nFlipped Gaussian (identical):")
    print(gaussian_flipped)
    print("\nSymmetric kernels are flip-invariant:", np.allclose(gaussian_3x3, gaussian_flipped))


if __name__ == "__main__":
    sobel_kernel_example()
```

For standard CNN training, ignore the distinction—learned kernels adapt. For pre-defined kernels, test empirically: apply your kernel to a known input and verify the output matches expectations. For theoretical work, always specify which convention you're using.
In signal processing, cross-correlation has a distinct purpose that differs from convolution. Understanding this original context illuminates why the two operations are structurally similar but conceptually different.
Cross-Correlation as Similarity Measure:
In signal processing, cross-correlation measures how similar two signals are as one is shifted relative to the other:
$$R_{fg}[\tau] = \sum_{n} f[n] \cdot g[n + \tau]$$
The value R_{fg}[τ] tells us how well signal g aligns with signal f when g is shifted by τ samples.
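For example, this sketch recovers the delay between a signal and a shifted, slightly noisy copy of it by locating the peak of their cross-correlation (np.correlate in 'full' mode; the delay and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
f = rng.standard_normal(200)

true_delay = 17
g = np.zeros_like(f)
g[true_delay:] = f[:-true_delay]             # g is f delayed by 17 samples
g += 0.05 * rng.standard_normal(len(g))      # plus a little noise

# Evaluate the correlation at every relative shift and find the peak
corr = np.correlate(g, f, mode='full')
lags = np.arange(-len(f) + 1, len(f))
print("Estimated delay:", lags[np.argmax(corr)])  # expected: 17
```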
Applications of Cross-Correlation: template matching in images, time-delay estimation in radar, sonar, and GPS, aligning audio recordings, and detecting periodicity in a signal via autocorrelation.
Convolution as System Response:
In contrast, convolution computes the output of a linear time-invariant system:
$$y[n] = (x * h)[n]$$
where x is the input and h is the system's impulse response.
The kernel flip in convolution reflects causality in physical systems: the current output is a weighted sum of past inputs, with weights given by the impulse response. Cross-correlation doesn't have this causal interpretation.
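A minimal illustration of the system-response view (an assumed exponentially decaying impulse response):

```python
import numpy as np

# Convolution as the response of a causal LTI system: the output at n mixes
# current and past inputs, weighted by the impulse response h.
h = 0.5 ** np.arange(5)        # impulse response: [1, 0.5, 0.25, ...]
x = np.zeros(20)
x[3] = 1.0                     # a single input impulse at n = 3

y = np.convolve(x, h)          # y[n] = Σ_k x[k] · h[n - k]
print(np.round(y[:10], 3))     # the response starts at n = 3 and decays
```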
| Aspect | Convolution | Cross-Correlation |
|---|---|---|
| Primary Use | System response computation | Similarity/alignment measurement |
| Physical Interpretation | Output of LTI system | Pattern matching score |
| Kernel/Filter | Impulse response | Template/reference signal |
| Symmetry | Commutative, associative | Neither (in general) |
| Causality | Naturally causal | Not inherently causal |
CNNs as Pattern Matchers:
The choice of cross-correlation in CNNs makes sense from the template matching perspective. Each kernel is a learned template, and cross-correlation measures how well each image patch matches the template.
The 'system response' interpretation of convolution is less natural for image classification—we're not computing what happens when an 'image signal' passes through a 'filter system'. We're detecting patterns.
So while the naming is confusing (calling cross-correlation 'convolution'), the operational choice aligns with the intuitive purpose of convolutional layers: pattern detection, not system simulation.
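A small sketch of this pattern-matching view: embed a tiny 'plus'-shaped template in a noisy image, cross-correlate, and the response peaks at the embedding location (template, noise level, and positions are arbitrary):

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(1)
template = np.array([[0.0, 1.0, 0.0],
                     [1.0, 1.0, 1.0],
                     [0.0, 1.0, 0.0]])       # a small "plus" pattern

image = 0.1 * rng.standard_normal((12, 12))  # low-amplitude background noise
image[5:8, 4:7] += template                  # embed the pattern at (5, 4)

# With cross-correlation the template is used as-is: no mental flip required
response = correlate2d(image, template, mode='valid')
peak = np.unravel_index(np.argmax(response), response.shape)
print("Strongest response at:", peak)        # expected: (5, 4)
```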
Signal processors and deep learners often talk past each other because of terminology differences. When reading interdisciplinary papers or implementing hybrid systems, always verify which operation is meant by 'convolution'.
We've thoroughly examined the distinction between convolution and cross-correlation, establishing exactly what deep learning frameworks do under the hood. Let's consolidate the key insights:

- What deep learning calls 'convolution' is cross-correlation: the kernel slides across the input without being flipped.
- The two operations differ only by a 180° rotation of the kernel, so either can express the other with a transformed kernel.
- Because CNN kernels are learned, the flip is absorbed into the weights; the choice does not affect capacity or performance.
- The distinction does matter for hand-crafted kernels, FFT-based implementations, theoretical analyses that rely on commutativity or associativity, and interoperability with signal processing libraries.
What's Next:
With the convolution/correlation distinction clarified, we turn to practical aspects of the operation. The next page covers stride and padding—the control parameters that determine how the kernel moves across the input and how boundaries are handled. These choices directly affect output spatial dimensions and computational cost.
You now understand precisely what 'convolution' means in deep learning contexts—it's cross-correlation, with learned kernels that absorb any flip. This knowledge prevents confusion when reading papers, debugging implementations, or interfacing with signal processing tools.