In deep learning and model deployment, symmetric 8-bit integer quantization is a critical technique that enables neural networks to run efficiently on resource-constrained devices. This process converts high-precision floating-point values (typically 32-bit) into compact signed 8-bit integers, dramatically reducing memory requirements and accelerating inference computations.
Symmetric quantization maps floating-point values to integers using a scale factor that centers the quantization range around zero. Unlike asymmetric quantization (which uses a separate zero-point offset), symmetric quantization maintains the origin at zero, making it particularly efficient for weights and activations that are naturally centered.
Given a set of floating-point values, the algorithm operates in three key stages:
1. Scale Factor Calculation:
The scale factor s is determined by mapping the maximum absolute value in the data to the largest representable positive integer (127 for signed 8-bit):
$$s = \frac{\text{max}(|x|)}{127}$$
2. Quantization (Float → Integer): Each floating-point value is converted to its integer representation by dividing by the scale factor and rounding to the nearest integer:
$$q_i = \text{round}\left(\frac{x_i}{s}\right)$$
The result is then clipped to the valid range [-127, 127] so it fits in a signed 8-bit integer (the code -128 is left unused to keep the range symmetric).
3. Dequantization (Integer → Float): To reconstruct the original floating-point values (with some loss), multiply the quantized integers by the scale factor:
$$\hat{x}_i = q_i \times s$$
When all input values are zero, use a default scale of 1.0 to avoid division by zero issues.
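The three stages above can be put together in a short pure-Python sketch. The function name `quantize_symmetric` is an assumption for illustration; the dictionary keys follow the expected output format shown in the examples below.

```python
def quantize_symmetric(x):
    """Symmetric 8-bit quantization sketch for a list of floats."""
    abs_max = max((abs(v) for v in x), default=0.0)
    # Default scale of 1.0 when all inputs are zero, to avoid dividing by zero
    scale = abs_max / 127 if abs_max > 0 else 1.0
    # Quantize: divide by the scale, round, and clip to [-127, 127]
    quantized = [max(-127, min(127, round(v / scale))) for v in x]
    # Dequantize: multiply back by the scale factor (lossy reconstruction)
    dequantized = [q * scale for q in quantized]
    return {'quantized': quantized, 'scale': scale, 'dequantized': dequantized}
```

Note that Python 3's `round` rounds halves to even ("banker's rounding"), so a tie such as 63.5 rounds to 64, which matches the worked examples below.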
Your Task: Implement a function that performs symmetric 8-bit integer quantization on a list of floating-point values. Your function should return a dictionary with three keys: 'quantized' (the list of quantized integers), 'scale' (the scale factor), and 'dequantized' (the reconstructed floating-point values).
Example 1:
Input: x = [1.0, -1.0, 0.5, -0.5]
Output: {'quantized': [127, -127, 64, -64], 'scale': 0.007874, 'dequantized': [1.0, -1.0, 0.5039, -0.5039]}
Step 1: Find the maximum absolute value: abs_max = max(|1.0|, |-1.0|, |0.5|, |-0.5|) = 1.0
Step 2: Calculate the scale factor: scale = abs_max / 127 = 1.0 / 127 ≈ 0.007874
Step 3: Quantize each value:
• 1.0 / 0.007874 = 127.0 → 127
• -1.0 / 0.007874 = -127.0 → -127
• 0.5 / 0.007874 = 63.5 → round to 64
• -0.5 / 0.007874 = -63.5 → round to -64
Step 4: Dequantize to verify the reconstruction:
• 127 × 0.007874 ≈ 1.0000
• -127 × 0.007874 ≈ -1.0000
• 64 × 0.007874 ≈ 0.5039
• -64 × 0.007874 ≈ -0.5039
Note the quantization error for 0.5 and -0.5: they reconstruct to ±0.5039 rather than ±0.5, a consequence of the discrete 8-bit representation.
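Because quantization rounds to the nearest code, the reconstruction error can never exceed half a quantization step (scale / 2) when no clipping occurs. A quick check of this bound against the values from the example above:

```python
scale = 1.0 / 127  # scale factor from Example 1
pairs = [(1.0, 127), (-1.0, -127), (0.5, 64), (-0.5, -64)]  # (original, quantized)
for x, q in pairs:
    error = abs(x - q * scale)
    # round-to-nearest guarantees at most half a step of error
    assert error <= scale / 2
    print(f"x={x:+.2f}  reconstructed={q * scale:+.4f}  error={error:.5f}")
```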
Example 2:
Input: x = [0.0, 0.5, 1.0, 1.5, 2.0]
Output: {'quantized': [0, 32, 64, 95, 127], 'scale': 0.015748, 'dequantized': [0.0, 0.5039, 1.0079, 1.4961, 2.0]}
Step 1: Find the maximum absolute value: abs_max = max(|0.0|, |0.5|, |1.0|, |1.5|, |2.0|) = 2.0
Step 2: Calculate the scale factor: scale = abs_max / 127 = 2.0 / 127 ≈ 0.015748
Step 3: Quantize each value:
• 0.0 / 0.015748 = 0.0 → 0
• 0.5 / 0.015748 = 31.75 → round to 32
• 1.0 / 0.015748 = 63.5 → round to 64
• 1.5 / 0.015748 = 95.25 → round to 95
• 2.0 / 0.015748 = 127.0 → 127
Step 4: Dequantize to verify the reconstruction:
• 0 × 0.015748 = 0.0
• 32 × 0.015748 ≈ 0.5039
• 64 × 0.015748 ≈ 1.0079
• 95 × 0.015748 ≈ 1.4961
• 127 × 0.015748 ≈ 2.0
The maximum value (2.0) is perfectly preserved, while intermediate values show minor quantization errors.
Example 3:
Input: x = [-5.0, -2.5, 0.0, 2.5, 5.0]
Output: {'quantized': [-127, -64, 0, 64, 127], 'scale': 0.03937, 'dequantized': [-5.0, -2.5197, 0.0, 2.5197, 5.0]}
Step 1: Find the maximum absolute value: abs_max = max(|-5.0|, |-2.5|, |0.0|, |2.5|, |5.0|) = 5.0
Step 2: Calculate the scale factor: scale = abs_max / 127 = 5.0 / 127 ≈ 0.03937
Step 3: Quantize each value:
• -5.0 / 0.03937 = -127.0 → -127
• -2.5 / 0.03937 = -63.5 → round to -64
• 0.0 / 0.03937 = 0.0 → 0
• 2.5 / 0.03937 = 63.5 → round to 64
• 5.0 / 0.03937 = 127.0 → 127
This demonstrates symmetric quantization: the extreme values ±5.0 map perfectly to ±127, while the midpoints ±2.5 experience slight reconstruction error (becoming ±2.5197).
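In deployment code this operation is usually vectorized over arrays rather than written with Python lists. A NumPy-based sketch (NumPy is assumed to be available; the name `quantize_symmetric_np` is illustrative) that mirrors the three stages:

```python
import numpy as np

def quantize_symmetric_np(x):
    """Vectorized symmetric 8-bit quantization (sketch)."""
    x = np.asarray(x, dtype=np.float64)
    abs_max = np.abs(x).max() if x.size else 0.0
    # Fall back to scale 1.0 when every input is zero
    scale = abs_max / 127 if abs_max > 0 else 1.0
    # Quantize, round half-to-even, clip, and store as signed 8-bit
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return {'quantized': q, 'scale': float(scale), 'dequantized': q * scale}
```

`np.round`, like Python's built-in `round`, rounds halves to even, so ±63.5 maps to ±64 as in the examples. Although `np.int8` can also represent -128, clipping to [-127, 127] keeps the quantized range symmetric around zero.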
Constraints