The 2D convolution operation is one of the most fundamental building blocks in computer vision and deep learning. It enables neural networks to detect spatial patterns, edges, textures, and complex features within images and other grid-structured data.
In a 2D convolution, a small filter (called a kernel) slides across the input matrix, computing element-wise products and summing them at each position. This operation transforms the input into a new representation called a feature map, which highlights specific patterns based on the kernel's values.
The Convolution Process:
Given an input matrix I of dimensions H × W, a kernel K of dimensions kH × kW, padding p, and stride s, the output feature map O is computed as follows:
Padding: First, surround the input matrix with p layers of zeros on all sides, creating a padded input of dimensions (H + 2p) × (W + 2p)
Sliding Window: Starting from the top-left corner, place the kernel over the padded input. At each position (i, j), compute:
$$O_{i,j} = \sum_{m=0}^{kH-1} \sum_{n=0}^{kW-1} K_{m,n} \cdot I_{padded}[i \cdot s + m, j \cdot s + n]$$
Output Dimensions:
The output feature map has dimensions:

$$H_{out} = \left\lfloor \frac{H + 2p - kH}{s} \right\rfloor + 1, \qquad W_{out} = \left\lfloor \frac{W + 2p - kW}{s} \right\rfloor + 1$$
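The dimension formula can be checked directly in a few lines (the helper name `conv_output_dims` is an assumption, not part of the required interface):

```python
def conv_output_dims(H, W, kH, kW, p, s):
    # Integer (floor) division mirrors the floor in the dimension formula.
    return ((H + 2 * p - kH) // s + 1,
            (W + 2 * p - kW) // s + 1)

# The three sample configurations used in this problem:
print(conv_output_dims(4, 4, 2, 2, 1, 2))  # (3, 3)
print(conv_output_dims(3, 3, 2, 2, 0, 1))  # (2, 2)
print(conv_output_dims(4, 4, 3, 3, 1, 1))  # (4, 4)
```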
Your Task: Implement a Python function that performs a 2D convolution operation. The function should properly apply zero-padding, slide the kernel with the specified stride, and compute the element-wise products and sums to produce the output feature map.
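One plain-Python sketch of such a function follows (the name `conv2d` and the list-of-lists input/output format are assumptions about the expected interface):

```python
def conv2d(input_matrix, kernel, padding, stride):
    """2D cross-correlation with zero-padding and stride, per the formula above."""
    H, W = len(input_matrix), len(input_matrix[0])
    kH, kW = len(kernel), len(kernel[0])

    # Step 1: surround the input with `padding` layers of zeros on all sides.
    pH, pW = H + 2 * padding, W + 2 * padding
    padded = [[0.0] * pW for _ in range(pH)]
    for i in range(H):
        for j in range(W):
            padded[i + padding][j + padding] = input_matrix[i][j]

    # Step 2: slide the kernel with the given stride and accumulate
    # element-wise products at each position.
    out_h = (pH - kH) // stride + 1
    out_w = (pW - kW) // stride + 1
    output = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            acc = 0.0
            for m in range(kH):
                for n in range(kW):
                    acc += kernel[m][n] * padded[i * stride + m][j * stride + n]
            output[i][j] = acc
    return output
```

Note that, like the formula above, this slides the kernel without flipping it (cross-correlation), which is the convention used by most deep learning frameworks.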
input_matrix = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0], [13.0, 14.0, 15.0, 16.0]]
kernel = [[1.0, 0.0], [-1.0, 1.0]]
padding = 1
stride = 2

Output: [[1.0, 1.0, -4.0], [9.0, 7.0, -4.0], [0.0, 14.0, 16.0]]

After adding 1 layer of zero-padding, the input becomes 6×6. The 2×2 kernel slides with stride 2, producing a 3×3 output.
• Position (0,0): Covers the padded top-left corner and the first input element → (1×0) + (0×0) + (-1×0) + (1×1) = 1.0
• Position (0,1): With stride 2, the window covers padded columns 2–3 → (1×0) + (0×0) + (-1×2) + (1×3) = 1.0; continuing, position (0,2) yields (1×0) + (0×0) + (-1×4) + (1×0) = -4.0 at the right edge
• The kernel detects diagonal patterns, with the -1 capturing differences between adjacent rows.
The resulting 3×3 feature map captures spatial relationships at half the resolution due to stride 2.
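This example can also be reproduced vectorized in NumPy (an alternative sketch, not the required plain-Python solution; `sliding_window_view` requires NumPy ≥ 1.20):

```python
import numpy as np

I = np.arange(1.0, 17.0).reshape(4, 4)       # the 4x4 sample input
K = np.array([[1.0, 0.0], [-1.0, 1.0]])

padded = np.pad(I, 1)                        # 6x6 zero-padded input
# All 2x2 windows, then keep every second one in each axis (stride 2).
windows = np.lib.stride_tricks.sliding_window_view(padded, K.shape)[::2, ::2]
out = np.einsum('ijmn,mn->ij', windows, K)   # dot each window with the kernel
print(out)                                   # 3x3 feature map from the example
```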
input_matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]
padding = 0
stride = 1

Output: [[6.0, 8.0], [12.0, 14.0]]

With no padding and stride 1, the 2×2 kernel slides across the 3×3 input, producing a 2×2 output.
• Position (0,0): (1×1) + (0×2) + (0×4) + (1×5) = 1 + 5 = 6.0
• Position (0,1): (1×2) + (0×3) + (0×5) + (1×6) = 2 + 6 = 8.0
• Position (1,0): (1×4) + (0×5) + (0×7) + (1×8) = 4 + 8 = 12.0
• Position (1,1): (1×5) + (0×6) + (0×8) + (1×9) = 5 + 9 = 14.0
This diagonal kernel extracts the sum of diagonal elements, useful for detecting diagonal features.
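Because there is no padding and the stride is 1, this case reduces to a compact nested comprehension (a minimal sketch for this example only, hard-coding the 2×2 kernel and 2×2 output):

```python
I = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
K = [[1.0, 0.0], [0.0, 1.0]]

# Valid (no-padding), stride-1 cross-correlation over the 3x3 input.
out = [[sum(K[m][n] * I[i + m][j + n] for m in range(2) for n in range(2))
        for j in range(2)] for i in range(2)]
print(out)  # [[6.0, 8.0], [12.0, 14.0]]
```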
input_matrix = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0], [13.0, 14.0, 15.0, 16.0]]
kernel = [[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]]
padding = 1
stride = 1

Output: [[3.0, 2.0, 1.0, -5.0], [-4.0, 0.0, 0.0, -9.0], [-8.0, 0.0, 0.0, -13.0], [-29.0, -18.0, -19.0, -37.0]]

This kernel is the Laplacian filter, a classic edge detection operator. It approximates the second derivative of the image, highlighting areas of rapid intensity change.
• For interior points (like position (1,1)), the kernel sums the four neighbors and subtracts 4× the center: (2 + 5 + 7 + 10) − 4×6 = 0
• Edge and corner positions show non-zero values due to the zero-padding boundary effects
• The large negative values in the last row reflect the sharp transition from large pixel values to the zero border at the bottom of the image
With padding=1 and stride=1, the output maintains the same 4×4 dimensions as the input, typical for 'same' padding convolutions.
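For an odd kernel size k and stride 1, the padding that preserves the input dimensions is p = (k − 1) / 2, which can be sanity-checked against the dimension formula (the helper name `same_padding` is illustrative):

```python
def same_padding(kernel_size):
    # For odd kernel_size and stride 1: (H + 2p - k) + 1 == H when p == (k-1)//2.
    return (kernel_size - 1) // 2

H = 4
for k in (3, 5, 7):
    p = same_padding(k)
    out_h = (H + 2 * p - k) + 1  # stride 1, so no floor division needed
    print(k, p, out_h)           # output height stays 4 for every odd k
```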
Constraints