In the era of large language models (LLMs) with billions of parameters, fine-tuning on consumer hardware presents a significant challenge. Traditional fine-tuning requires loading all model weights in full precision (32-bit floating point), which can consume hundreds of gigabytes of GPU memory. Quantized Low-Rank Adaptation (commonly known as QLoRA) addresses this problem by combining two techniques:

1. Quantization: the frozen base weights are stored as low-precision integers and dequantized on the fly.
2. Low-Rank Adaptation (LoRA): the trainable update is expressed as the product of two small low-rank matrices, B and A.
During inference or fine-tuning, the forward pass combines both paths:
$$\text{output} = x \cdot W_{\text{dequant}} + \frac{\alpha}{r} \cdot x \cdot B \cdot A$$
Where:
- $x$ is the input activation matrix
- $W_{\text{dequant}}$ is the dequantized frozen weight matrix
- $B$ and $A$ are the trainable low-rank adapter matrices
- $\alpha$ (alpha) is the LoRA scaling factor
- $r$ is the adapter rank
The frozen weights are stored in quantized integer format to save memory. Before computation, they must be dequantized back to floating-point:
$$W_{\text{dequant}} = \text{quantized_W} \times \text{scale} + \text{zero_point}$$
This affine transformation maps the quantized integer values back to their approximate original floating-point representations.
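As a minimal sketch of this affine transformation in plain Python (the function name is illustrative, not part of a required interface):

```python
def dequantize(quantized_W, scale, zero_point):
    # Elementwise affine map: W = quantized_W * scale + zero_point
    return [[q * scale + zero_point for q in row] for row in quantized_W]

# dequantize([[0, 10], [10, 0]], 0.1, 0.0) → [[0.0, 1.0], [1.0, 0.0]]
```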
Implement the qlora_forward function that computes the complete forward pass described above.
The rank r should be inferred from the number of columns in matrix B (or equivalently, the number of rows in matrix A).
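One way to implement this in plain Python (a sketch only; the `matmul` helper is illustrative):

```python
def matmul(X, Y):
    # Matrix multiply for nested lists: (m×k) @ (k×n) → (m×n).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def qlora_forward(x, quantized_W, scale, zero_point, B, A, alpha):
    # Step 1: dequantize the frozen weights (affine map).
    W = [[q * scale + zero_point for q in row] for row in quantized_W]
    # Step 2: frozen path, x @ W.
    frozen = matmul(x, W)
    # Step 3: low-rank adapter path, x @ B @ A.
    adapter = matmul(matmul(x, B), A)
    # Step 4: combine, scaling the adapter by alpha / r,
    # where the rank r is the number of columns of B.
    r = len(B[0])
    s = alpha / r
    return [[f + s * a for f, a in zip(frow, arow)]
            for frow, arow in zip(frozen, adapter)]
```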
Example 1:

Input:
x = [[1.0, 2.0]]
quantized_W = [[0, 10], [10, 0]]
scale = 0.1
zero_point = 0.0
B = [[0.5], [0.5]]
A = [[1.0, 1.0]]
alpha = 1.0

Output: [[3.5, 2.5]]

Explanation:

Step 1: Dequantize the frozen weights.
W = quantized_W × scale + zero_point
W = [[0, 10], [10, 0]] × 0.1 + 0.0 = [[0.0, 1.0], [1.0, 0.0]]
Step 2: Compute the frozen path (x @ W).
frozen_output = [[1.0, 2.0]] @ [[0.0, 1.0], [1.0, 0.0]] = [[(1.0×0.0 + 2.0×1.0), (1.0×1.0 + 2.0×0.0)]] = [[2.0, 1.0]]

Step 3: Compute the adapter path (x @ B @ A).
intermediate = [[1.0, 2.0]] @ [[0.5], [0.5]] = [[1.0×0.5 + 2.0×0.5]] = [[1.5]]
adapter_output = [[1.5]] @ [[1.0, 1.0]] = [[1.5, 1.5]]

Step 4: Combine with scaling (rank = 1, alpha = 1.0).
final = frozen_output + (alpha/rank) × adapter_output = [[2.0, 1.0]] + (1.0/1) × [[1.5, 1.5]] = [[2.0 + 1.5, 1.0 + 1.5]] = [[3.5, 2.5]]
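The steps above can be checked with a few lines of NumPy (an illustrative verification, not part of the required solution):

```python
import numpy as np

x = np.array([[1.0, 2.0]])
W = np.array([[0, 10], [10, 0]]) * 0.1 + 0.0                      # Step 1: dequantize
frozen = x @ W                                                     # Step 2: [[2.0, 1.0]]
adapter = x @ np.array([[0.5], [0.5]]) @ np.array([[1.0, 1.0]])    # Step 3: [[1.5, 1.5]]
alpha, r = 1.0, 1
out = frozen + (alpha / r) * adapter                               # Step 4
# out → [[3.5, 2.5]]
```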
Example 2:

Input:
x = [[1.0, 0.0], [0.0, 1.0]]
quantized_W = [[10, 0], [0, 10]]
scale = 0.1
zero_point = 0.0
B = [[0.0], [0.0]]
A = [[0.0, 0.0]]
alpha = 1.0

Output: [[1.0, 0.0], [0.0, 1.0]]

Explanation:

Step 1: Dequantize the frozen weights.
W = [[10, 0], [0, 10]] × 0.1 + 0.0 = [[1.0, 0.0], [0.0, 1.0]]
This is the identity matrix!
Step 2: Compute the frozen path.
frozen_output = x @ W = [[1.0, 0.0], [0.0, 1.0]] @ [[1.0, 0.0], [0.0, 1.0]] = [[1.0, 0.0], [0.0, 1.0]]

Step 3: Compute the adapter path.
Since B and A are all zeros, the adapter contribution is zero: adapter_output = [[0.0, 0.0], [0.0, 0.0]]

Step 4: Combine.
final = [[1.0, 0.0], [0.0, 1.0]] + [[0.0, 0.0], [0.0, 0.0]] = [[1.0, 0.0], [0.0, 1.0]]
This demonstrates that with zero LoRA weights, the output equals the frozen model output.
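This "zero-initialized adapter" property is quick to confirm numerically (an illustrative check with NumPy):

```python
import numpy as np

x = np.eye(2)                                   # identity input batch
W = np.array([[10, 0], [0, 10]]) * 0.1 + 0.0    # dequantizes to the identity matrix
B, A = np.zeros((2, 1)), np.zeros((1, 2))       # zero-initialized adapter
out = x @ W + (1.0 / 1) * (x @ B @ A)
# out equals x @ W: the adapter contributes nothing
```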
Example 3:

Input:
x = [[2.0, 1.0]]
quantized_W = [[5, 5], [5, 5]]
scale = 0.2
zero_point = -0.5
B = [[0.1], [0.1]]
A = [[0.5, 0.5]]
alpha = 2.0

Output: [[1.8, 1.8]]

Explanation:

Step 1: Dequantize with a non-zero offset.
W = [[5, 5], [5, 5]] × 0.2 + (-0.5) = [[1.0, 1.0], [1.0, 1.0]] + [[-0.5, -0.5], [-0.5, -0.5]] = [[0.5, 0.5], [0.5, 0.5]]
Step 2: Compute the frozen path.
frozen_output = [[2.0, 1.0]] @ [[0.5, 0.5], [0.5, 0.5]] = [[(2.0×0.5 + 1.0×0.5), (2.0×0.5 + 1.0×0.5)]] = [[1.5, 1.5]]

Step 3: Compute the adapter path.
intermediate = [[2.0, 1.0]] @ [[0.1], [0.1]] = [[2.0×0.1 + 1.0×0.1]] = [[0.3]]
adapter_output = [[0.3]] @ [[0.5, 0.5]] = [[0.15, 0.15]]

Step 4: Combine with alpha = 2.0, rank = 1.
final = [[1.5, 1.5]] + (2.0/1) × [[0.15, 0.15]] = [[1.5, 1.5]] + [[0.3, 0.3]] = [[1.8, 1.8]]
This example shows how the zero_point shifts the dequantized values and how alpha amplifies the adapter contribution.
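The effect of the non-zero zero_point and the alpha scaling can likewise be verified with NumPy (an illustrative check, not part of the required solution; allow for floating-point rounding):

```python
import numpy as np

x = np.array([[2.0, 1.0]])
W = np.array([[5, 5], [5, 5]]) * 0.2 + (-0.5)                    # [[0.5, 0.5], [0.5, 0.5]]
frozen = x @ W                                                    # [[1.5, 1.5]]
adapter = x @ np.array([[0.1], [0.1]]) @ np.array([[0.5, 0.5]])   # ≈ [[0.15, 0.15]]
out = frozen + (2.0 / 1) * adapter                                # ≈ [[1.8, 1.8]]
```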
Constraints