In deep learning, volumetric convolution (also known as 3D convolution) extends the concept of traditional 2D convolution to three-dimensional data. This operation is fundamental for processing spatiotemporal data like video sequences, medical imaging scans (CT, MRI), and point cloud representations.
A volumetric convolution works by sliding a 3D kernel (a small cubic or rectangular prism of learnable weights) across an input 3D volume along all three spatial dimensions: depth, height, and width. At each position, the kernel computes a weighted sum (dot product) of the overlapping elements, producing a single output value.
Mathematical Formulation: For an input volume V with dimensions (D × H × W) and a kernel K with dimensions (Kd × Kh × Kw), the output at position (d, h, w) is computed as:
$$O[d, h, w] = \sum_{i=0}^{K_d-1} \sum_{j=0}^{K_h-1} \sum_{k=0}^{K_w-1} V[d \cdot s_d + i, h \cdot s_h + j, w \cdot s_w + k] \cdot K[i, j, k]$$
where $(s_d, s_h, s_w)$ represents the stride in each dimension (the formula above assumes zero padding).
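The triple sum for a single output position translates directly into code. A minimal sketch in plain Python (the function name `conv_at` is illustrative, not part of the problem statement; it assumes the kernel window fits inside the volume at the given position):

```python
# Weighted sum at one output position (d, h, w), zero padding assumed.
def conv_at(V, K, d, h, w, stride=(1, 1, 1)):
    sd, sh, sw = stride
    kd, kh, kw = len(K), len(K[0]), len(K[0][0])
    total = 0.0
    for i in range(kd):
        for j in range(kh):
            for k in range(kw):
                total += V[d * sd + i][h * sh + j][w * sw + k] * K[i][j][k]
    return total
```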
Output Dimensions: Given padding $(p_d, p_h, p_w)$ and stride $(s_d, s_h, s_w)$, the output dimensions are calculated as:
$$D_{out} = \left\lfloor \frac{D + 2p_d - K_d}{s_d} \right\rfloor + 1, \quad H_{out} = \left\lfloor \frac{H + 2p_h - K_h}{s_h} \right\rfloor + 1, \quad W_{out} = \left\lfloor \frac{W + 2p_w - K_w}{s_w} \right\rfloor + 1$$
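The per-dimension output size formula, floor((N + 2p − K) / s) + 1, can be checked with a one-line helper (the name `out_size` is an assumption for illustration):

```python
# Output size along one axis: input size n, kernel size k, stride s, padding p.
def out_size(n, k, s, p):
    return (n + 2 * p - k) // s + 1
```

For example, a size-3 input with a size-2 kernel, stride 1, no padding gives (3 − 2)/1 + 1 = 2, matching the worked examples below.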
Your Task: Implement a function that computes the forward pass of a volumetric convolution. Given an input 3D volume, a 3D kernel, stride values for each dimension, and padding values for each dimension, compute and return the resulting output feature map.
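The whole forward pass can be sketched with nested loops over output positions. This is a plain-Python sketch, not a reference solution: the function name `conv3d` and the use of zero padding are assumptions, and the input is taken as nested lists.

```python
# Forward pass of a single-channel 3D convolution over nested lists.
def conv3d(volume, kernel, stride, padding):
    sd, sh, sw = stride
    pd, ph, pw = padding
    D, H, W = len(volume), len(volume[0]), len(volume[0][0])
    Kd, Kh, Kw = len(kernel), len(kernel[0]), len(kernel[0][0])

    # Zero-pad the volume on all six faces.
    Dp, Hp, Wp = D + 2 * pd, H + 2 * ph, W + 2 * pw
    padded = [[[0.0] * Wp for _ in range(Hp)] for _ in range(Dp)]
    for d in range(D):
        for h in range(H):
            for w in range(W):
                padded[d + pd][h + ph][w + pw] = volume[d][h][w]

    # Output dimensions: floor((N + 2p - K) / s) + 1 per axis.
    Od = (Dp - Kd) // sd + 1
    Oh = (Hp - Kh) // sh + 1
    Ow = (Wp - Kw) // sw + 1

    out = [[[0.0] * Ow for _ in range(Oh)] for _ in range(Od)]
    for d in range(Od):
        for h in range(Oh):
            for w in range(Ow):
                acc = 0.0
                for i in range(Kd):
                    for j in range(Kh):
                        for k in range(Kw):
                            acc += (padded[d * sd + i][h * sh + j][w * sw + k]
                                    * kernel[i][j][k])
                out[d][h][w] = acc
    return out
```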
Key Concepts:
• Kernel: a small 3D block of learnable weights slid across the input volume.
• Stride: the number of positions the kernel moves along each dimension between placements.
• Padding: zeros added around the input volume before convolving.
• Receptive field: the region of the input that contributes to a single output value.
input_volume = [
[[1.0, 2.0], [3.0, 4.0]],
[[5.0, 6.0], [7.0, 8.0]]
]
kernel = [
[[1.0, 0.0], [0.0, 0.0]],
[[0.0, 0.0], [0.0, 0.0]]
]
stride = [1, 1, 1]
padding = [0, 0, 0]

Expected Output: [[[1.0]]]

Explanation: The input is a 2×2×2 volume (depth=2, height=2, width=2), and the kernel is also 2×2×2.
With no padding and stride of 1 in all dimensions, the kernel can only be placed at exactly one position (the origin).
The kernel has a 1.0 at position [0,0,0] and 0.0 everywhere else. This effectively extracts the value at the top-left-front corner of the input.
Computing the dot product:
• (1.0 × 1.0) + (0.0 × 2.0) + (0.0 × 3.0) + (0.0 × 4.0) + (0.0 × 5.0) + (0.0 × 6.0) + (0.0 × 7.0) + (0.0 × 8.0)
• = 1.0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 = 1.0
Output: A 1×1×1 volume containing [[[ 1.0 ]]]
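This example can be checked numerically. Since the kernel is exactly the size of the input, the convolution reduces to a single elementwise product and sum (NumPy used here purely for the check):

```python
import numpy as np

V = np.array([[[1.0, 2.0], [3.0, 4.0]],
              [[5.0, 6.0], [7.0, 8.0]]])
K = np.zeros((2, 2, 2))
K[0, 0, 0] = 1.0  # single 1.0 at the front-top-left corner

out = np.sum(V * K)  # elementwise product, then sum -> 1.0
```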
input_volume = [
[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
[[10.0, 11.0, 12.0], [13.0, 14.0, 15.0], [16.0, 17.0, 18.0]],
[[19.0, 20.0, 21.0], [22.0, 23.0, 24.0], [25.0, 26.0, 27.0]]
]
kernel = [
[[1.0, 0.0], [0.0, 1.0]],
[[0.0, 1.0], [1.0, 0.0]]
]
stride = [1, 1, 1]
padding = [0, 0, 0]

Expected Output: [[[30.0, 34.0], [42.0, 46.0]], [[66.0, 70.0], [78.0, 82.0]]]

Explanation: The input is a 3×3×3 volume, and the kernel is 2×2×2. The diagonal kernel pattern captures anti-diagonal spatial relationships.
Output dimensions: (3-2)/1 + 1 = 2 in each dimension → 2×2×2 output.
For output position [0,0,0]:
• Input patch: V[0:2, 0:2, 0:2] = [[[1,2],[4,5]], [[10,11],[13,14]]]
• Kernel weights: (0,0,0)=1, (0,1,1)=1, (1,0,1)=1, (1,1,0)=1
• Dot product: 1×1 + 5×1 + 11×1 + 13×1 = 1 + 5 + 11 + 13 = 30.0
Similar calculations for all positions yield the 2×2×2 output feature map.
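Those "similar calculations" can be reproduced with explicit loops over the 2×2×2 grid of output positions, slicing out each input patch (a self-contained check using NumPy; stride 1, no padding):

```python
import numpy as np

V = np.arange(1.0, 28.0).reshape(3, 3, 3)  # values 1..27 as in the example
K = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])

out = np.empty((2, 2, 2))
for d in range(2):
    for h in range(2):
        for w in range(2):
            # Patch of the same shape as the kernel, then a full dot product.
            out[d, h, w] = np.sum(V[d:d+2, h:h+2, w:w+2] * K)
```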
input_volume = [
[[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0], [13.0, 14.0, 15.0, 16.0]],
[[17.0, 18.0, 19.0, 20.0], [21.0, 22.0, 23.0, 24.0], [25.0, 26.0, 27.0, 28.0], [29.0, 30.0, 31.0, 32.0]],
[[33.0, 34.0, 35.0, 36.0], [37.0, 38.0, 39.0, 40.0], [41.0, 42.0, 43.0, 44.0], [45.0, 46.0, 47.0, 48.0]],
[[49.0, 50.0, 51.0, 52.0], [53.0, 54.0, 55.0, 56.0], [57.0, 58.0, 59.0, 60.0], [61.0, 62.0, 63.0, 64.0]]
]
kernel = [
[[1.0, 1.0], [1.0, 1.0]],
[[1.0, 1.0], [1.0, 1.0]]
]
stride = [2, 2, 2]
padding = [0, 0, 0]

Expected Output: [[[92.0, 108.0], [156.0, 172.0]], [[348.0, 364.0], [412.0, 428.0]]]

Explanation: The input is a 4×4×4 volume with sequential values 1-64. The kernel is a 2×2×2 all-ones kernel that sums every value in its receptive field.
With stride [2,2,2], the kernel jumps 2 positions in each dimension, performing spatial downsampling.
Output dimensions: (4-2)/2 + 1 = 2 in each dimension → 2×2×2 output.
For output position [0,0,0]:
• Input patch: V[0:2, 0:2, 0:2] containing values 1, 2, 5, 6, 17, 18, 21, 22
• Sum: 1+2+5+6+17+18+21+22 = 92.0
For output position [1,1,1] (the last position):
• Input patch: V[2:4, 2:4, 2:4] containing values 43, 44, 47, 48, 59, 60, 63, 64
• Sum: 43+44+47+48+59+60+63+64 = 428.0
This demonstrates how stride-2 convolution halves the spatial dimensions while capturing local sums.
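The strided placement can be checked the same way: each output index is multiplied by the stride before slicing the patch (self-contained NumPy check; stride 2, no padding):

```python
import numpy as np

V = np.arange(1.0, 65.0).reshape(4, 4, 4)  # values 1..64 as in the example
K = np.ones((2, 2, 2))                     # all-ones summing kernel
s = 2                                      # stride in every dimension

out = np.empty((2, 2, 2))
for d in range(2):
    for h in range(2):
        for w in range(2):
            # The kernel origin jumps by s per output step -> downsampling.
            out[d, h, w] = np.sum(V[d*s:d*s+2, h*s:h*s+2, w*s:w*s+2] * K)
```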
Constraints