Parameter-Efficient Fine-Tuning with Low-Rank Decomposition
In modern deep learning, large pretrained models contain billions of parameters that are expensive to fully fine-tune for specific downstream tasks. Low-Rank Adaptation (LRA) is an elegant parameter-efficient fine-tuning technique that addresses this challenge by keeping the pretrained weights frozen and learning a low-rank decomposition of the weight updates.
The key insight behind LRA is that the weight updates during fine-tuning often lie in a low-rank subspace, meaning we can represent them efficiently using two smaller matrices instead of a full-sized update matrix. This dramatically reduces the number of trainable parameters while maintaining model quality.
Mathematical Formulation:
For a pretrained weight matrix W of dimensions (d_in × d_out), instead of computing:
$$h = x \cdot W_{fine-tuned}$$
LRA computes:
$$h = x \cdot W + \frac{\alpha}{r} \cdot (x \cdot B \cdot A)$$
Where:
W is the frozen pretrained weight matrix of shape (d_in × d_out); B (shape d_in × r) and A (shape r × d_out) are the trainable low-rank factors; r is the rank of the adaptation, chosen with r ≪ min(d_in, d_out); and α is a constant scaling hyperparameter.
The term α/r normalizes the adaptation to prevent it from dominating the pretrained weights, ensuring stable training and predictable behavior across different rank choices.
Why This Matters:
The product B · A represents a low-rank approximation to what would otherwise be a full (d_in × d_out) update matrix. By choosing a small rank r, the number of trainable parameters reduces from d_in × d_out to (d_in × r) + (r × d_out), which can be orders of magnitude smaller.
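To make the savings concrete, here is a quick calculation for illustrative dimensions (d_in = d_out = 4096 and r = 8 are assumed values for the sake of the example, not taken from this problem):

```python
d_in, d_out, r = 4096, 4096, 8  # illustrative sizes, not from the problem

full_update = d_in * d_out       # parameters in a dense update matrix
low_rank = d_in * r + r * d_out  # parameters in B (d_in x r) plus A (r x d_out)

print(full_update)             # 16777216
print(low_rank)                # 65536
print(full_update // low_rank) # 256
```

At these sizes the low-rank factorization trains 256× fewer parameters than a full update matrix.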
Your Task:
Implement the forward pass of a Low-Rank Adaptation layer. Given an input matrix x, frozen pretrained weights W, low-rank matrices B and A, and a scaling factor α, compute the output by combining the frozen path with the scaled low-rank adaptation path. The rank r should be inferred from the dimensions of matrices B and A (specifically, the number of columns in B or equivalently the number of rows in A).
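One possible sketch of such a forward pass in NumPy (the function name `lra_forward` and the choice of NumPy are assumptions of this sketch, not requirements of the task):

```python
import numpy as np

def lra_forward(x, W, B, A, alpha):
    """Forward pass of a low-rank adaptation layer.

    x: (batch, d_in) input
    W: (d_in, d_out) frozen pretrained weights
    B: (d_in, r) and A: (r, d_out) trainable low-rank factors
    alpha: scaling hyperparameter; the adaptation is scaled by alpha / r
    """
    x, W, B, A = (np.asarray(m, dtype=float) for m in (x, W, B, A))
    r = B.shape[1]          # rank inferred from the number of columns of B
    frozen = x @ W          # frozen pretrained path
    adapted = (x @ B) @ A   # low-rank path, multiplied left to right
    return frozen + (alpha / r) * adapted
```

Computing `(x @ B)` first keeps the intermediate result at width r rather than materializing the full (d_in × d_out) product B @ A.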
x = [[1.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [1.0]]
A = [[0.5, 0.5]]
alpha = 2.0
Expected output: [[4.0, 5.0]]
Step-by-step computation:
Frozen path (x @ W): [[1.0, 2.0]] @ [[1.0, 0.0], [0.0, 1.0]] = [[1.0, 2.0]]
Low-rank path (x @ B @ A): [[1.0, 2.0]] @ [[1.0], [1.0]] = [[3.0]], then [[3.0]] @ [[0.5, 0.5]] = [[1.5, 1.5]]
Determine rank r: B has 1 column, so r = 1
Compute scaling factor: alpha / r = 2.0 / 1 = 2.0
Combine paths: [[1.0, 2.0]] + 2.0 × [[1.5, 1.5]] = [[1.0, 2.0]] + [[3.0, 3.0]] = [[4.0, 5.0]]
x = [[1.0, 0.0], [0.0, 1.0]]
W = [[2.0, 1.0], [1.0, 2.0]]
B = [[0.5, 0.5], [0.5, 0.5]]
A = [[1.0, 0.0], [0.0, 1.0]]
alpha = 1.0
Expected output: [[2.25, 1.25], [1.25, 2.25]]
Step-by-step computation for batch input:
Frozen path (x @ W): since x is the identity, x @ W = W = [[2.0, 1.0], [1.0, 2.0]]
Low-rank path (x @ B @ A): x @ B = B = [[0.5, 0.5], [0.5, 0.5]], and multiplying by the identity A leaves it unchanged: [[0.5, 0.5], [0.5, 0.5]]
Determine rank r: B has 2 columns, so r = 2
Compute scaling factor: alpha / r = 1.0 / 2 = 0.5
Combine paths: [[2.0, 1.0], [1.0, 2.0]] + 0.5 × [[0.5, 0.5], [0.5, 0.5]] = [[2.25, 1.25], [1.25, 2.25]]
x = [[2.0]]
W = [[3.0]]
B = [[1.0]]
A = [[1.0]]
alpha = 4.0
Expected output: [[14.0]]
Minimal 1D example:
Frozen path: x @ W = [[2.0]] @ [[3.0]] = [[6.0]]
Low-rank path: x @ B @ A = [[2.0]] @ [[1.0]] @ [[1.0]] = [[2.0]]
Rank: r = 1 (B has 1 column)
Scaling: alpha / r = 4.0 / 1 = 4.0
Output: [[6.0]] + 4.0 × [[2.0]] = [[6.0]] + [[8.0]] = [[14.0]]
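All three worked examples above can be checked with a few lines of NumPy (a self-contained sketch; the helper name `forward` is my own choice):

```python
import numpy as np

def forward(x, W, B, A, alpha):
    # h = x @ W + (alpha / r) * (x @ B @ A), with r = number of columns of B
    x, W, B, A = (np.asarray(m, dtype=float) for m in (x, W, B, A))
    return x @ W + (alpha / B.shape[1]) * (x @ B @ A)

# Example 1
print(forward([[1.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]],
              [[1.0], [1.0]], [[0.5, 0.5]], 2.0))        # [[4. 5.]]

# Example 2 (batch input)
print(forward([[1.0, 0.0], [0.0, 1.0]], [[2.0, 1.0], [1.0, 2.0]],
              [[0.5, 0.5], [0.5, 0.5]], [[1.0, 0.0], [0.0, 1.0]], 1.0))
# [[2.25 1.25]
#  [1.25 2.25]]

# Example 3 (minimal 1x1 case)
print(forward([[2.0]], [[3.0]], [[1.0]], [[1.0]], 4.0))  # [[14.]]
```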
This example demonstrates how a larger alpha value amplifies the contribution of the low-rank adaptation, allowing it to significantly modify the output from the frozen pretrained weights.
Constraints