In the era of large language models (LLMs) with billions of parameters, fine-tuning on consumer hardware presents a significant challenge. Traditional fine-tuning requires loading all model weights in full precision (32-bit floating point), which can consume hundreds of gigabytes of GPU memory. Quantized Low-Rank Adaptation (commonly known as QLoRA) addresses this problem by combining two techniques:

1. Quantization: the frozen base weights are stored as low-precision integers and dequantized on the fly.
2. Low-Rank Adaptation (LoRA): the trainable update is expressed as the product of two small low-rank matrices, B and A.
During inference or fine-tuning, the forward pass combines both paths:
$$\text{output} = x \cdot W_{\text{dequant}} + \frac{\alpha}{r} \cdot x \cdot B \cdot A$$
Where:
- $x$ is the input activation matrix
- $W_{\text{dequant}}$ is the dequantized frozen weight matrix
- $B$ and $A$ are the trainable low-rank adapter matrices
- $\alpha$ (alpha) is the LoRA scaling factor
- $r$ is the adapter rank
The frozen weights are stored in quantized integer format to save memory. Before computation, they must be dequantized back to floating-point:
$$W_{\text{dequant}} = \text{quantized_W} \times \text{scale} + \text{zero_point}$$
This affine transformation maps the quantized integer values back to their approximate original floating-point representations.
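As a minimal sketch of this affine transformation in plain Python (the function name is illustrative, not part of a required interface):

```python
def dequantize(quantized_W, scale, zero_point):
    # Elementwise affine map: W = quantized_W * scale + zero_point
    return [[q * scale + zero_point for q in row] for row in quantized_W]

# dequantize([[0, 10], [10, 0]], 0.1, 0.0) → [[0.0, 1.0], [1.0, 0.0]]
```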
Implement the qlora_forward function that computes the complete forward pass described above.
The rank r should be inferred from the number of columns in matrix B (or equivalently, the number of rows in matrix A).
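One way to implement this in plain Python (a sketch only; the `matmul` helper is illustrative):

```python
def matmul(X, Y):
    # Matrix multiply for nested lists: (m×k) @ (k×n) → (m×n).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def qlora_forward(x, quantized_W, scale, zero_point, B, A, alpha):
    # Step 1: dequantize the frozen weights (affine map).
    W = [[q * scale + zero_point for q in row] for row in quantized_W]
    # Step 2: frozen path, x @ W.
    frozen = matmul(x, W)
    # Step 3: low-rank adapter path, x @ B @ A.
    adapter = matmul(matmul(x, B), A)
    # Step 4: combine, scaling the adapter by alpha / r,
    # where the rank r is the number of columns of B.
    r = len(B[0])
    s = alpha / r
    return [[f + s * a for f, a in zip(frow, arow)]
            for frow, arow in zip(frozen, adapter)]
```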
Example 1:

Input:
x = [[1.0, 2.0]]
quantized_W = [[0, 10], [10, 0]]
scale = 0.1
zero_point = 0.0
B = [[0.5], [0.5]]
A = [[1.0, 1.0]]
alpha = 1.0

Output: [[3.5, 2.5]]

Explanation:

Step 1: Dequantize the frozen weights.
W = quantized_W × scale + zero_point
W = [[0, 10], [10, 0]] × 0.1 + 0.0 = [[0.0, 1.0], [1.0, 0.0]]
Step 2: Compute the frozen path (x @ W).
frozen_output = [[1.0, 2.0]] @ [[0.0, 1.0], [1.0, 0.0]] = [[(1.0×0.0 + 2.0×1.0), (1.0×1.0 + 2.0×0.0)]] = [[2.0, 1.0]]

Step 3: Compute the adapter path (x @ B @ A).
intermediate = [[1.0, 2.0]] @ [[0.5], [0.5]] = [[1.0×0.5 + 2.0×0.5]] = [[1.5]]
adapter_output = [[1.5]] @ [[1.0, 1.0]] = [[1.5, 1.5]]

Step 4: Combine with scaling (rank = 1, alpha = 1.0).
final = frozen_output + (alpha/rank) × adapter_output = [[2.0, 1.0]] + (1.0/1) × [[1.5, 1.5]] = [[2.0 + 1.5, 1.0 + 1.5]] = [[3.5, 2.5]]
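The steps above can be checked with a few lines of NumPy (an illustrative verification, not part of the required solution):

```python
import numpy as np

x = np.array([[1.0, 2.0]])
W = np.array([[0, 10], [10, 0]]) * 0.1 + 0.0                      # Step 1: dequantize
frozen = x @ W                                                     # Step 2: [[2.0, 1.0]]
adapter = x @ np.array([[0.5], [0.5]]) @ np.array([[1.0, 1.0]])    # Step 3: [[1.5, 1.5]]
alpha, r = 1.0, 1
out = frozen + (alpha / r) * adapter                               # Step 4
# out → [[3.5, 2.5]]
```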
Example 2:

Input:
x = [[1.0, 0.0], [0.0, 1.0]]
quantized_W = [[10, 0], [0, 10]]
scale = 0.1
zero_point = 0.0
B = [[0.0], [0.0]]
A = [[0.0, 0.0]]
alpha = 1.0

Output: [[1.0, 0.0], [0.0, 1.0]]

Explanation:

Step 1: Dequantize the frozen weights.
W = [[10, 0], [0, 10]] × 0.1 + 0.0 = [[1.0, 0.0], [0.0, 1.0]]
This is the identity matrix!
Step 2: Compute the frozen path.
frozen_output = x @ W = [[1.0, 0.0], [0.0, 1.0]] @ [[1.0, 0.0], [0.0, 1.0]] = [[1.0, 0.0], [0.0, 1.0]]

Step 3: Compute the adapter path.
Since B and A are all zeros, the adapter contribution is zero: adapter_output = [[0.0, 0.0], [0.0, 0.0]]

Step 4: Combine.
final = [[1.0, 0.0], [0.0, 1.0]] + [[0.0, 0.0], [0.0, 0.0]] = [[1.0, 0.0], [0.0, 1.0]]
This demonstrates that with zero LoRA weights, the output equals the frozen model output.
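This "zero-initialized adapter" property is quick to confirm numerically (an illustrative check with NumPy):

```python
import numpy as np

x = np.eye(2)                                   # identity input batch
W = np.array([[10, 0], [0, 10]]) * 0.1 + 0.0    # dequantizes to the identity matrix
B, A = np.zeros((2, 1)), np.zeros((1, 2))       # zero-initialized adapter
out = x @ W + (1.0 / 1) * (x @ B @ A)
# out equals x @ W: the adapter contributes nothing
```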
Example 3:

Input:
x = [[2.0, 1.0]]
quantized_W = [[5, 5], [5, 5]]
scale = 0.2
zero_point = -0.5
B = [[0.1], [0.1]]
A = [[0.5, 0.5]]
alpha = 2.0

Output: [[1.8, 1.8]]

Explanation:

Step 1: Dequantize with a non-zero offset.
W = [[5, 5], [5, 5]] × 0.2 + (-0.5) = [[1.0, 1.0], [1.0, 1.0]] + [[-0.5, -0.5], [-0.5, -0.5]] = [[0.5, 0.5], [0.5, 0.5]]
Step 2: Compute the frozen path.
frozen_output = [[2.0, 1.0]] @ [[0.5, 0.5], [0.5, 0.5]] = [[(2.0×0.5 + 1.0×0.5), (2.0×0.5 + 1.0×0.5)]] = [[1.5, 1.5]]

Step 3: Compute the adapter path.
intermediate = [[2.0, 1.0]] @ [[0.1], [0.1]] = [[2.0×0.1 + 1.0×0.1]] = [[0.3]]
adapter_output = [[0.3]] @ [[0.5, 0.5]] = [[0.15, 0.15]]

Step 4: Combine with alpha = 2.0, rank = 1.
final = [[1.5, 1.5]] + (2.0/1) × [[0.15, 0.15]] = [[1.5, 1.5]] + [[0.3, 0.3]] = [[1.8, 1.8]]
This example shows how the zero_point shifts the dequantized values and how alpha amplifies the adapter contribution.
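The effect of the non-zero zero_point and the alpha scaling can likewise be verified with NumPy (an illustrative check, not part of the required solution; allow for floating-point rounding):

```python
import numpy as np

x = np.array([[2.0, 1.0]])
W = np.array([[5, 5], [5, 5]]) * 0.2 + (-0.5)                    # [[0.5, 0.5], [0.5, 0.5]]
frozen = x @ W                                                    # [[1.5, 1.5]]
adapter = x @ np.array([[0.1], [0.1]]) @ np.array([[0.5, 0.5]])   # ≈ [[0.15, 0.15]]
out = frozen + (2.0 / 1) * adapter                                # ≈ [[1.8, 1.8]]
```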
Constraints