The term "scree" comes from geology—it refers to the loose rocks that accumulate at the base of a cliff. In PCA, the scree plot got its name because its shape often resembles a cliff face with scree at the bottom: steep descent followed by a flat tail of 'rubble.'
The scree plot is arguably the most important diagnostic tool in PCA. It visualizes the eigenvalue spectrum, revealing patterns that determine how many components to retain. A quick glance at a scree plot tells experienced practitioners whether PCA will be effective, how many components make sense, and whether the data has unusual structure.
This page develops a deep understanding of scree plots—from construction fundamentals to sophisticated interpretation techniques and automatic elbow detection algorithms.
By the end of this page, you will know how to construct and interpret scree plots in various formats, understand the mathematical basis for different scree plot patterns, learn formal algorithms for detecting elbows automatically, and recognize when scree plots can mislead and how to guard against misinterpretation.
Let's establish the fundamentals of scree plot construction, including different variants and their respective uses.
The standard scree plot visualizes eigenvalues against component index:
The eigenvalues are always plotted in descending order: $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$.
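As a construction sketch (assuming NumPy and Matplotlib, neither of which the text prescribes; the data here are synthetic):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # non-interactive backend for scripted use
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # synthetic data: n=200, d=10
Xc = X - X.mean(axis=0)                   # PCA assumes centered data

# eigvalsh returns ascending eigenvalues; reverse into descending order
eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]

fig, ax = plt.subplots()
ax.plot(np.arange(1, eigvals.size + 1), eigvals, "o-")
ax.set_xlabel("Component index $j$")
ax.set_ylabel(r"Eigenvalue $\lambda_j$")
ax.set_title("Scree plot")
fig.savefig("scree.png")
```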
Instead of raw eigenvalues, plot the proportion of variance explained:
$$\text{PVE}_j = \frac{\lambda_j}{\sum_{k=1}^{d} \lambda_k}$$
This normalizes the plot to the [0, 1] range, making it easier to interpret and compare across datasets.
For data with eigenvalues spanning many orders of magnitude, a log-scale y-axis is essential:
Use a log scale when eigenvalues span several orders of magnitude or decay exponentially; in log-log coordinates, a power-law decay appears as a straight line.
A highly informative variant shows both individual and cumulative variance:
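A sketch of this combined view, assuming NumPy (the spectrum below is hypothetical): compute individual and cumulative PVE, then read off the smallest $k$ that reaches a variance target.

```python
import numpy as np

eigvals = np.array([5.0, 3.0, 1.0, 0.6, 0.4])   # hypothetical spectrum
pve = eigvals / eigvals.sum()                   # individual variance explained
cum_pve = np.cumsum(pve)                        # cumulative variance explained

# smallest k whose cumulative PVE reaches a 90% target
k_90 = int(np.searchsorted(cum_pve, 0.90) + 1)  # → 3 for this spectrum
```

Plotting `pve` as bars and `cum_pve` as a line on shared axes gives the combined individual-plus-cumulative view described above.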
Always include axis labels and titles. Consider both linear and log scales—each reveals different information. Limit the x-axis range to meaningful components (e.g., first 50 even if d=10,000). Add reference lines at key thresholds (mean eigenvalue, variance targets). Use consistent styling across analyses for comparability.
Different data structures produce characteristic scree plot shapes. Learning to read these patterns is a valuable diagnostic skill.
The ideal scree plot shows a clear elbow—a sharp bend from steep descent to flat plateau:
|*
| *
| *
| * * * * * * * * *
+----------------------→
Interpretation: the first few components capture genuine low-rank structure, while the flat tail of roughly equal eigenvalues reflects noise. Retain the components before the elbow.
Typical sources: Signal-plus-noise models, low-rank structure corrupted by measurement error.
|*
| *
| *
| *
| *
| *
| *
+----------------------→
Interpretation: there is no clear elbow; variance is spread across many components, indicating high intrinsic dimensionality. A cumulative variance threshold is a more defensible retention criterion than hunting for a bend.
Typical sources: Well-designed experiments, truly independent features, complex natural phenomena.
|*
| *
| * *
| *
| * * *
| * * * *
+----------------------→
Interpretation: the eigenvalues fall in distinct steps, suggesting structure at several scales; each plateau may correspond to a group of components of comparable importance.
Typical sources: Multi-resolution data, nested structure, mixtures of processes.
| Pattern | Visual Signature | Data Structure | Action |
|---|---|---|---|
| Sharp Elbow | Cliff then flat | Low-rank + noise | Clear $k$ choice at elbow |
| Gradual Decay | Smooth curve down | High intrinsic dimensionality | Use variance threshold |
| Power Law | Straight line in log-log | Scale-free structure | Context-dependent choice |
| Step Pattern | Plateaus then drops | Repeated eigenvalues | Look within plateaus |
| Multimodal | Multiple elbows | Hierarchical structure | Domain expertise needed |
| Flat | Nearly horizontal | Independent features | PCA not helpful |
The 'elbow method' is widely cited but often poorly understood. Let's dissect what we're actually looking for and why.
Mathematically, an elbow occurs where the second derivative of the eigenvalue sequence changes sign or magnitude significantly. It represents a transition between two different decay regimes:
The elbow is where marginal value drops sharply.
Signal-plus-noise model: If true data lies near a $k$-dimensional subspace but is observed with noise:
$$\mathbf{x}_{\text{observed}} = \mathbf{x}_{\text{signal}} + \boldsymbol{\epsilon}_{\text{noise}}$$
The eigenvalues will be approximately $\lambda_j \approx s_j^2 + \sigma^2$ for the first $k$ components (signal variance plus noise variance) and $\lambda_j \approx \sigma^2$ for $j > k$.
The elbow occurs at $k$, the true signal dimensionality.
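A small simulation makes this concrete (a sketch assuming NumPy; the rank, scales, and noise level below are illustrative): a rank-3 signal in 10 dimensions plus isotropic noise yields three large eigenvalues and a flat tail near the noise variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 10, 3
Z = rng.normal(size=(n, k)) * np.array([4.0, 3.0, 2.0])  # signal scores
W = np.linalg.qr(rng.normal(size=(d, k)))[0]             # orthonormal basis
X = Z @ W.T + rng.normal(scale=0.5, size=(n, d))         # noise variance 0.25

Xc = X - X.mean(axis=0)
lam = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]
# lam[0..2] sit near 16.25, 9.25, 4.25; lam[3..] cluster near 0.25
```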
The elbow might be fuzzy when the weakest signal eigenvalues sit close to the noise floor, when finite-sample variation spreads the noise eigenvalues, or when the signal is only approximately low-rank.
Different analysts often identify different elbows in the same plot. The 'elbow' is a subjective visual judgment unless we specify an algorithm. Two reasonable people can disagree—this is a limitation of the method, not a user error. Always report your reasoning and consider sensitivity to the choice.
To remove subjectivity, we can use algorithmic methods for elbow detection. Several approaches have been developed, each with different assumptions.
Find the point of maximum curvature in the eigenvalue curve. For discrete data, approximate curvature as:
$$\kappa_j \approx |\lambda_{j-1} - 2\lambda_j + \lambda_{j+1}|$$
The elbow is at $k^* = \arg\max_j \kappa_j$.
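A sketch of this detector, assuming NumPy (the example spectrum is hypothetical):

```python
import numpy as np

def elbow_by_curvature(eigvals):
    """Return the 1-based component index with maximum discrete curvature."""
    lam = np.asarray(eigvals, dtype=float)
    # second difference |λ_{j-1} - 2λ_j + λ_{j+1}| at interior points
    kappa = np.abs(lam[:-2] - 2.0 * lam[1:-1] + lam[2:])
    return int(np.argmax(kappa)) + 2    # kappa[0] corresponds to j = 2

spectrum = [10.0, 6.0, 1.0, 0.8, 0.7, 0.6]
elbow = elbow_by_curvature(spectrum)    # → 3: the bend at the third eigenvalue
```

In practice the eigenvalues should be lightly smoothed first, since the second difference amplifies noise.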
Pros: intuitive; matches visual perception.
Cons: sensitive to noise in the eigenvalues; requires smoothing.
Draw a line from the first to the last point. Find the point with maximum perpendicular distance from this line:
Pros: robust; handles various decay patterns.
Cons: depends on normalization; may miss elbows at the edges.
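A sketch of this distance-to-chord method, assuming NumPy (the spectrum is hypothetical):

```python
import numpy as np

def elbow_by_distance(eigvals):
    """1-based index of the point farthest from the first-to-last chord."""
    lam = np.asarray(eigvals, dtype=float)
    n = len(lam)
    x = np.arange(n, dtype=float)
    dx, dy = n - 1.0, lam[-1] - lam[0]            # chord direction
    # perpendicular distance from each (x_j, λ_j) to the chord
    dist = np.abs(dy * x - dx * (lam - lam[0])) / np.hypot(dx, dy)
    return int(np.argmax(dist)) + 1

elbow = elbow_by_distance([10.0, 6.0, 1.0, 0.8, 0.7, 0.6])   # → 3
```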
For power-law decays, analyze the ratio of consecutive eigenvalues:
$$r_j = \frac{\lambda_j}{\lambda_{j+1}}$$
In a pure power-law decay, $r_j$ is constant. The elbow is where $r_j$ changes significantly:
$$k^* = \arg\max_j |r_j - r_{j+1}|$$
Pros: scale-invariant; well suited to power-law data.
Cons: assumes a specific decay structure.
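A sketch of the ratio test, assuming NumPy (spectrum hypothetical):

```python
import numpy as np

def elbow_by_ratio(eigvals):
    """1-based k* = argmax_j |r_j - r_{j+1}| with r_j = λ_j / λ_{j+1}."""
    lam = np.asarray(eigvals, dtype=float)
    r = lam[:-1] / lam[1:]
    return int(np.argmax(np.abs(r[:-1] - r[1:]))) + 1

k_star = elbow_by_ratio([10.0, 6.0, 1.0, 0.8, 0.7, 0.6])   # → 2
```

Here $k^* = 2$: the ratio $r_2 = \lambda_2 / \lambda_3 = 6$ marks the drop, so two components precede it.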
Compare eigenvalues to those expected if total variance were randomly distributed. Under the broken stick model, the expected proportion for component $j$ is:
$$b_j = \frac{1}{d} \sum_{k=j}^{d} \frac{1}{k}$$
Keep components where observed PVE exceeds this null expectation.
Pros: provides a statistical baseline.
Cons: very conservative; may underestimate $k$.
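A sketch of the broken stick rule, assuming NumPy (the spectrum is hypothetical):

```python
import numpy as np

def broken_stick(d):
    """Null PVE b_j = (1/d) * Σ_{k=j}^{d} 1/k for j = 1..d."""
    inv = 1.0 / np.arange(1, d + 1)
    return inv[::-1].cumsum()[::-1] / d   # reversed cumsum gives the tail sums

eigvals = np.array([4.0, 2.0, 1.0, 0.6, 0.4])   # hypothetical spectrum
pve = eigvals / eigvals.sum()
keep = pve > broken_stick(len(eigvals))  # retain where PVE beats the null
```

Note that the $b_j$ sum to 1, so they form a proper null distribution of variance proportions.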
Use multiple methods and check for consistency. If three different algorithms suggest $k = 5$, $k = 6$, and $k = 5$, you can be confident the true elbow is around 5-6. Large disagreements indicate either no clear elbow or the need for domain-specific guidance.
A fundamentally different approach asks: which eigenvalues are statistically distinguishable from noise? This connects scree plot interpretation to hypothesis testing.
Even for completely random data, sampling variation makes the eigenvalues unequal. In the limit of large $n$ and $d$, the eigenvalue spectrum of a random data matrix follows the Marchenko-Pastur distribution.
For a $n \times d$ random matrix with $\gamma = d/n < 1$, the eigenvalue distribution has support:
$$[(1-\sqrt{\gamma})^2, (1+\sqrt{\gamma})^2]$$
Eigenvalues outside this range are significant; those inside could be noise.
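The support endpoints are easy to compute (a sketch assuming NumPy; unit noise variance, as in the formula above):

```python
import numpy as np

def mp_support(n, d):
    """Marchenko-Pastur support for γ = d/n < 1 and unit-variance noise."""
    gamma = d / n
    return (1.0 - np.sqrt(gamma)) ** 2, (1.0 + np.sqrt(gamma)) ** 2

low, high = mp_support(n=1000, d=100)   # γ = 0.1
```

Sample eigenvalues above `high` are candidates for genuine signal; for noise with variance $\sigma^2$, scale both endpoints by $\sigma^2$.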
A simulation-based approach that doesn't require asymptotic formulas:
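One common variant can be sketched as follows, assuming NumPy: permute each column independently, which destroys correlations while preserving marginal distributions, and keep the leading components whose eigenvalues beat the permutation null's 95th percentile. The rank-2 demo data at the bottom are illustrative.

```python
import numpy as np

def parallel_analysis(X, n_sims=100, q=95, seed=0):
    """Number of leading eigenvalues exceeding the permutation null."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    obs = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]
    d = Xc.shape[1]
    null = np.empty((n_sims, d))
    for s in range(n_sims):
        # permute each column independently to build the null spectrum
        Xp = np.column_stack([rng.permutation(Xc[:, j]) for j in range(d)])
        null[s] = np.sort(np.linalg.eigvalsh(np.cov(Xp, rowvar=False)))[::-1]
    thresh = np.percentile(null, q, axis=0)
    k = 0
    while k < d and obs[k] > thresh[k]:
        k += 1
    return k

# demo: two latent factors with evenly spread loadings, plus unit noise
rng = np.random.default_rng(1)
w1 = np.ones(8) / np.sqrt(8)
w2 = np.tile([1.0, -1.0], 4) / np.sqrt(8)
W = np.vstack([w1, w2])                     # orthonormal factor loadings
X = 4.0 * rng.normal(size=(300, 2)) @ W + rng.normal(size=(300, 8))
k = parallel_analysis(X)                    # → 2
```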
This accounts for finite-sample effects that the Marchenko-Pastur approximation might miss.
| Component | Observed $\lambda$ | 95% Null Threshold | Significant? |
|---|---|---|---|
| 1 | 15.3 | 2.31 | Yes ✓ |
| 2 | 8.7 | 2.08 | Yes ✓ |
| 3 | 4.2 | 1.91 | Yes ✓ |
| 4 | 2.1 | 1.77 | Yes ✓ |
| 5 | 1.5 | 1.65 | No ✗ |
| 6 | 1.3 | 1.54 | No ✗ |
Parallel analysis typically yields fewer components than visual elbow detection. It answers 'which components are distinguishable from pure noise?'—a conservative question. For many applications, you might want more components than this suggests, especially if prior knowledge supports them.
Scree plots are powerful but can mislead. Understanding common pitfalls helps you avoid drawing incorrect conclusions.
The same data plotted at different scales can suggest different elbows: a compressed y-axis flattens the curve and hides the bend, while a truncated axis exaggerates it.
Mitigation: Always show the full range and multiple scales. Be explicit about scale choices.
Sometimes the first eigenvalue is so large that it dominates the plot, making all other structure invisible:
|*
|
|
|
|
|* * * * * * * * *
+----------------------→
This can happen when the data was not centered before PCA, when one feature has a much larger scale than the others, or when a single strong common factor dominates all features.
Solutions: check centering, consider standardization, use a log scale, or plot components 2 through $d$ separately.
Scree plots reveal variance structure but not relevance. In supervised learning, the most informative features for classification may have low variance. PCA and scree analysis optimize for variance, which is orthogonal to class separability. Always validate against your actual objective.
Beyond basic scree plots, advanced techniques provide deeper insights for complex situations.
Eigenvalues have sampling uncertainty. Visualize this with bootstrap confidence intervals:
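A sketch assuming NumPy (the data below are an illustrative stand-in): resample rows with replacement and collect the spectrum of each resample.

```python
import numpy as np

def bootstrap_eigvals(X, n_boot=200, seed=0):
    """95% percentile intervals for each eigenvalue via row resampling."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    boots = np.empty((n_boot, d))
    for b in range(n_boot):
        Xb = X[rng.integers(0, n, size=n)]   # resample rows with replacement
        Xb = Xb - Xb.mean(axis=0)
        boots[b] = np.sort(np.linalg.eigvalsh(np.cov(Xb, rowvar=False)))[::-1]
    return np.percentile(boots, 2.5, axis=0), np.percentile(boots, 97.5, axis=0)

X = np.random.default_rng(0).normal(size=(200, 5))   # illustrative data
lo, hi = bootstrap_eigvals(X, n_boot=100)
```

Drawn as error bars on the scree curve, overlapping intervals flag eigenvalues that are not reliably distinguishable from their neighbors.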
This reveals which eigenvalues are distinguishable from each other. If confidence intervals for $\lambda_3$ and $\lambda_4$ overlap, they may represent the same underlying dimensionality.
Check how stable the scree plot is to data perturbations:
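One way to sketch this, assuming NumPy (noise level and data are illustrative): add small Gaussian perturbations and track the spread of each eigenvalue.

```python
import numpy as np

def spectrum_stability(X, noise_scale=0.05, n_reps=50, seed=0):
    """Std of each eigenvalue under small Gaussian perturbations of X."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    sigma = noise_scale * X.std(axis=0)      # per-feature perturbation size
    spectra = np.empty((n_reps, d))
    for r in range(n_reps):
        Xp = X + rng.normal(size=(n, d)) * sigma
        Xp = Xp - Xp.mean(axis=0)
        spectra[r] = np.sort(np.linalg.eigvalsh(np.cov(Xp, rowvar=False)))[::-1]
    return spectra.std(axis=0)

X = np.random.default_rng(0).normal(size=(200, 5))   # illustrative data
stability = spectrum_stability(X)
```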
Unstable eigenvalues (large variance across perturbations) should be treated cautiously—they may be fitting noise rather than signal.
When analyzing multiple related datasets, overlay their scree plots:
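A sketch assuming NumPy and Matplotlib (the two "cohorts" below are synthetic stand-ins): normalize each spectrum to PVE so datasets with different total variance remain comparable.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # non-interactive backend
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
spectra = {}
for name, seed in [("cohort A", 1), ("cohort B", 2)]:
    X = np.random.default_rng(seed).normal(size=(150, 8))  # stand-in data
    X = X - X.mean(axis=0)
    lam = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
    spectra[name] = lam / lam.sum()          # normalize to PVE
    ax.plot(np.arange(1, 9), spectra[name], "o-", label=name)
ax.set_xlabel("Component index")
ax.set_ylabel("Proportion of variance explained")
ax.legend()
fig.savefig("scree_compare.png")
```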
Example use cases: Comparing PCA across different patient cohorts, analyzing variance structure pre- and post-treatment, comparing training and test set structure.
After choosing $k$ components, analyze the residuals:
$$\mathbf{E} = \mathbf{X} - \mathbf{X}_k = \mathbf{X} - \mathbf{X}\mathbf{W}_k\mathbf{W}_k^T$$
Plot the residual variance per feature, the residual norm per sample, and the scree plot of $\mathbf{E}$ itself.
Patterns in residuals suggest additional structure not captured by $k$ components.
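The residual computation above can be sketched as follows, assuming NumPy (the data are illustrative):

```python
import numpy as np

def residual_matrix(X, k):
    """E = X_c - X_c W_k W_k^T, the part a rank-k PCA leaves unexplained."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    Wk = vecs[:, ::-1][:, :k]        # top-k eigenvectors as columns
    return Xc - Xc @ Wk @ Wk.T

X = np.random.default_rng(0).normal(size=(100, 6))   # illustrative data
E2 = residual_matrix(X, 2)
per_feature = E2.var(axis=0)               # residual variance by feature
per_sample = np.linalg.norm(E2, axis=1)    # residual norm by sample
```

As a sanity check, the residual norm shrinks as $k$ grows and vanishes at $k = d$.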
In professional settings, report multiple views: (1) basic scree plot with candidate elbows marked, (2) cumulative variance with thresholds, (3) parallel analysis results, and (4) downstream task performance for different $k$ values. This comprehensive approach demonstrates rigor and supports defensible decisions.
We've developed a comprehensive understanding of scree plots—from construction to interpretation to automated analysis. The essential insights: scree plots visualize the eigenvalue spectrum, and their characteristic shapes diagnose data structure; elbows can be detected algorithmically (curvature, distance to chord, eigenvalue ratios, broken stick) and cross-checked for consistency; parallel analysis separates signal from sampling noise; and scale effects, dominant first components, and the variance-versus-relevance distinction are the main interpretive pitfalls.
You've now mastered the theoretical foundations of Principal Component Analysis.
With this foundation, you're prepared to understand PCA variants (standardized, kernel, sparse, incremental) and apply PCA effectively to real-world problems.
Congratulations! You've completed the PCA Theory module. You now have a deep, principled understanding of what PCA does, why it works, and how to interpret its outputs. The next module will explore PCA variants—extensions and modifications that address specific limitations of standard PCA.