In modern large-scale neural network training, memory efficiency is paramount. One powerful technique for reducing memory consumption is low-precision quantization, where high-precision floating-point values are converted to more compact representations while preserving model accuracy.
FP8 (8-bit Floating Point) is a cutting-edge numerical format gaining traction in production systems like those powering massive language models. The E4M3 variant of FP8 uses 1 sign bit, 4 exponent bits, and 3 mantissa bits, providing a dynamic range suitable for neural network activations. The maximum representable value in E4M3 format is 448.
Block-wise Quantization improves upon naïve tensor-level quantization by partitioning the input tensor into smaller segments (blocks) and computing an independent scale factor for each block. Because each scale factor adapts to its block's local value range, a single large outlier cannot degrade the precision of the entire tensor.
Quantization Process: For each block of values:
scale = max_abs_value / 448
quantized_value = round(original_value / scale)

Your Task: Implement a function that performs block-wise FP8 quantization on an input tensor. The function should divide the tensor into blocks, compute scale factors, and return the quantized representation for each block.
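A minimal sketch of such a function in Python. The function name `blockwise_fp8_quantize`, the output schema, and the zero-valued-block guard are assumptions inferred from the examples below, not a required interface:

```python
# Block-wise FP8 (E4M3) quantization sketch.
FP8_E4M3_MAX = 448.0  # maximum representable value in E4M3

def blockwise_fp8_quantize(tensor, block_size):
    """Split `tensor` into blocks of `block_size` and quantize each
    block with its own scale factor. Returns one dict per block."""
    results = []
    for start in range(0, len(tensor), block_size):
        block = tensor[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        if max_abs == 0:
            # Assumed edge case: an all-zero block quantizes to zeros.
            scale, quantized = 0.0, [0] * len(block)
        else:
            scale = max_abs / FP8_E4M3_MAX
            quantized = [round(v / scale) for v in block]
        results.append({
            "block": start // block_size,
            "scale": round(scale, 6),
            "quantized": quantized,
        })
    return results
```

Note that the unrounded scale is used for quantization; the scale is rounded to six decimal places only for display, matching the example outputs.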
tensor = [1.0, 2.0, 100.0, 200.0]
block_size = 2

Output: [{"block": 0, "scale": 0.004464, "quantized": [224, 448]}, {"block": 1, "scale": 0.446429, "quantized": [224, 448]}]

The tensor is divided into two blocks of size 2:
Block 0: [1.0, 2.0] • Maximum absolute value = 2.0 • Scale = 2.0 / 448 = 0.004464 • Quantized values: 1.0 / 0.004464 ≈ 224, 2.0 / 0.004464 ≈ 448
Block 1: [100.0, 200.0] • Maximum absolute value = 200.0 • Scale = 200.0 / 448 = 0.446429 • Quantized values: 100.0 / 0.446429 ≈ 224, 200.0 / 0.446429 ≈ 448
Notice how both blocks produce quantized values that reach the E4M3 maximum of 448, even though the original values differ by 100x. Each block uses the full quantization range, which demonstrates the power of block-wise scaling.
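The arithmetic above can be checked directly by applying the two formulas from the problem to each block, with 448 as the E4M3 maximum:

```python
FP8_E4M3_MAX = 448.0

quantized_blocks = []
for block in ([1.0, 2.0], [100.0, 200.0]):
    scale = max(abs(v) for v in block) / FP8_E4M3_MAX
    quantized_blocks.append([round(v / scale) for v in block])

print(quantized_blocks)  # [[224, 448], [224, 448]]
```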
tensor = [10.0, 10.0, 10.0, 10.0]
block_size = 2

Output: [{"block": 0, "scale": 0.022321, "quantized": [448, 448]}, {"block": 1, "scale": 0.022321, "quantized": [448, 448]}]

With uniform values across the tensor:
Block 0 & Block 1: [10.0, 10.0] • Maximum absolute value = 10.0 for each block • Scale = 10.0 / 448 = 0.022321 • Quantized values: 10.0 / 0.022321 ≈ 448, 10.0 / 0.022321 ≈ 448
Since both blocks contain identical values, they receive identical scale factors and quantized outputs.
tensor = [-5.0, 10.0, -15.0, 20.0, -25.0, 30.0]
block_size = 3

Output: [{"block": 0, "scale": 0.033482, "quantized": [-149, 299, -448]}, {"block": 1, "scale": 0.066964, "quantized": [299, -373, 448]}]

Mixed positive and negative values with block_size = 3:
Block 0: [-5.0, 10.0, -15.0] • Maximum absolute value = 15.0 • Scale = 15.0 / 448 = 0.033482 • Quantized values: -5.0 / 0.033482 ≈ -149, 10.0 / 0.033482 ≈ 299, -15.0 / 0.033482 ≈ -448
Block 1: [20.0, -25.0, 30.0] • Maximum absolute value = 30.0 • Scale = 30.0 / 448 = 0.066964 • Quantized values: 20.0 / 0.066964 ≈ 299, -25.0 / 0.066964 ≈ -373, 30.0 / 0.066964 ≈ 448
The sign of each value is preserved through quantization, and the largest absolute value in each block maps to ±448.
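Although the task only asks for quantization, a round trip illustrates the sign-preservation claim: dequantizing (multiplying back by the scale) recovers every sign, and the block's largest-magnitude value is restored almost exactly, since it maps to ±448 by construction. The helper names below are illustrative, not part of the required interface:

```python
FP8_E4M3_MAX = 448.0

def quantize_block(block):
    scale = max(abs(v) for v in block) / FP8_E4M3_MAX
    return scale, [round(v / scale) for v in block]

def dequantize_block(scale, quantized):
    # Inverse step: rescale the quantized integers back to floats.
    return [q * scale for q in quantized]

scale, q = quantize_block([-5.0, 10.0, -15.0])
restored = dequantize_block(scale, q)
# All signs survive the round trip, and -15.0 (the block maximum in
# magnitude) comes back essentially exact, up to floating-point rounding.
```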
Constraints