Principal Component Analysis (PCA) is one of the most powerful and widely used techniques in data science and machine learning for dimensionality reduction. It transforms high-dimensional data into a lower-dimensional representation while preserving as much variance (information) as possible.
The core idea behind PCA is to identify the directions (called principal components) along which the data varies the most. These directions are the eigenvectors of the data's covariance matrix, and the eigenvalues indicate how much variance each direction captures.
Given a dataset X with n samples and m features, PCA proceeds as follows:
Step 1: Standardize the Data
Center the data by subtracting the mean of each feature (here, standardization means centering only; features are not divided by their standard deviations): $$X_{standardized} = X - \mu$$
where $\mu$ is the mean vector computed across all samples for each feature.
Step 2: Compute the Covariance Matrix
Calculate the covariance matrix C of the standardized data: $$C = \frac{1}{n-1} X_{standardized}^T X_{standardized}$$
The covariance matrix captures the relationships between all pairs of features.
Step 3: Eigenvalue Decomposition
Find the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$ and corresponding eigenvectors $v_1, v_2, \ldots, v_m$ of the covariance matrix. Each eigenvector represents a principal component direction, and its eigenvalue indicates the variance captured along that direction.
Step 4: Select Top k Components
Sort the eigenvectors by their corresponding eigenvalues in descending order and select the top k eigenvectors as the principal components.
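The four steps can be sketched with NumPy on a small 3×2 dataset (variable names are illustrative):

```python
import numpy as np

# Illustrative walk through Steps 1-4 on a small 3x2 dataset
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Xc = X - X.mean(axis=0)                 # Step 1: center each feature
C = (Xc.T @ Xc) / (X.shape[0] - 1)      # Step 2: covariance matrix
vals, vecs = np.linalg.eigh(C)          # Step 3: eigh gives ascending eigenvalues for symmetric C
order = np.argsort(vals)[::-1]          # Step 4: reorder by descending variance
components = vecs[:, order[:1]]         # top k=1 component, shape (2, 1)
```

Here `np.linalg.eigh` is used rather than `np.linalg.eig` because the covariance matrix is symmetric, which guarantees real eigenvalues and orthonormal eigenvectors.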
Mathematically, if v is an eigenvector, then -v is also a valid eigenvector pointing in the opposite direction. To ensure consistent, reproducible results, apply the following normalization rule:
For each eigenvector, check the first non-zero element. If it is negative, multiply the entire eigenvector by -1 to flip its direction.
This convention ensures that all implementations produce identical results regardless of the underlying numerical library used.
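The steps above, together with the sign convention, can be combined into one function. This is a minimal sketch; the function name `pca` and the rounding to 4 decimal places are assumptions based on the expected outputs shown in the examples:

```python
import numpy as np

def pca(data, k):
    """Return the top-k principal components as columns of an m x k matrix.

    Note: the name `pca` and the 4-decimal rounding are inferred from the
    examples, not specified verbatim in the problem text.
    """
    X = np.asarray(data, dtype=float)
    X = X - X.mean(axis=0)               # Step 1: center each feature
    C = (X.T @ X) / (X.shape[0] - 1)     # Step 2: covariance matrix
    vals, vecs = np.linalg.eigh(C)       # Step 3: eigendecomposition (ascending)
    order = np.argsort(vals)[::-1]       # Step 4: sort by descending eigenvalue
    vecs = vecs[:, order[:k]]
    # Sign convention: make the first non-zero entry of each component positive
    for j in range(vecs.shape[1]):
        col = vecs[:, j]
        nz = np.flatnonzero(np.abs(col) > 1e-12)
        if nz.size and col[nz[0]] < 0:
            vecs[:, j] = -col
    return np.round(vecs, 4).tolist()
```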
Implement a function that performs Principal Component Analysis from scratch. The function should take the dataset and the number of components k, follow the four steps above (centering, covariance computation, eigendecomposition, component selection), apply the sign convention, and return the top k principal components as the columns of an m × k matrix, with values rounded to 4 decimal places.
Input:
data = [[1, 2], [3, 4], [5, 6]]
k = 1
Output:
[[0.7071], [0.7071]]
For this 3×2 dataset:
Step 1: Standardization
Column means: [3.0, 4.0]
Centered data:
[1-3, 2-4] = [-2, -2]
[3-3, 4-4] = [0, 0]
[5-3, 6-4] = [2, 2]
Step 2: Covariance Matrix
Computing the 2×2 covariance matrix yields C = [[4, 4], [4, 4]]: equal variance in both features with perfect correlation.
Step 3: Eigendecomposition
The dominant eigenvector (eigenvalue 8; the other eigenvalue is 0) is [0.7071, 0.7071], which points along the diagonal direction where the data varies most.
Step 4: Result
Since k=1, we return only the first principal component. The first element (0.7071) is positive, so no sign flip is needed.
Input:
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
k = 2
Output:
[[0.5774, 0.0585], [0.5774, 0.6761], [0.5774, -0.7345]]
For this 4×3 dataset:
Step 1: Centering
The data is centered by subtracting column means [5.5, 6.5, 7.5].
Step 2-3: Covariance & Eigendecomposition
Every centered row is a multiple of [1, 1, 1], so the covariance matrix has rank 1. The eigenanalysis finds that the dominant eigenvector is [0.5774, 0.5774, 0.5774] (i.e. [1, 1, 1]/√3), which captures all of the variance; the remaining eigenvalues are zero.
Step 4: Result
With k=2, we return both eigenvectors as columns, forming a 3×2 matrix (3 features × 2 components).
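Because all of the variance lies along one direction here, the second component spans a zero-variance subspace whose exact basis can depend on the numerical library. The sketch below is a hypothetical check that verifies only the library-independent properties of this example (the dominant direction and the orthonormality of the returned columns), not the exact second column:

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]])
Xc = X - X.mean(axis=0)                      # subtract column means [5.5, 6.5, 7.5]
C = (Xc.T @ Xc) / (X.shape[0] - 1)           # rank-1 covariance: every entry is 15
vals, vecs = np.linalg.eigh(C)
top2 = vecs[:, np.argsort(vals)[::-1][:2]]   # columns by descending eigenvalue
if top2[0, 0] < 0:                           # sign convention on the first component
    top2[:, 0] = -top2[:, 0]
# First column is [1, 1, 1]/sqrt(3) ~ 0.5774; both columns are orthonormal.
```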
Input:
data = [[1, 0], [0, 1], [-1, 0], [0, -1]]
k = 2
Output:
[[0.0, 1.0], [1.0, 0.0]]
This dataset contains 4 points forming a cross pattern centered at the origin:
Geometric Interpretation
The points lie along two perpendicular axes. After centering (the mean is already [0, 0]) and covariance computation, the covariance matrix is diagonal with equal entries of 2/3 on the diagonal.
Both eigenvalues are equal (the data has equal spread in both directions). The eigenvectors from eigh() are deterministic and align with the coordinate axes for this diagonal covariance matrix.
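With equal eigenvalues, any orthonormal basis of the plane is a valid set of eigenvectors, so a robust check verifies the eigenvector equation C v = λ v rather than hard-coding one particular basis. A sketch of that check:

```python
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
Xc = X - X.mean(axis=0)                 # already centered: the mean is (0, 0)
C = (Xc.T @ Xc) / (X.shape[0] - 1)      # diagonal: (2/3) * identity
vals, vecs = np.linalg.eigh(C)
# Both eigenvalues equal 2/3; every unit vector is an eigenvector,
# so we verify C @ v == lambda * v column by column instead of
# asserting one particular eigenvector basis.
```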
Constraints