MSE tells you the average squared error. MAE tells you the average absolute error. But neither answers a fundamental question: How much of the variation in outcomes is my model actually explaining?
Enter R-squared (R²), also called the coefficient of determination. R² provides a scale-independent measure that answers: 'Of all the variance in y, what fraction does my model capture?'
An R² of 0.85 means your model explains 85% of the variance in outcomes—a statement that's meaningful regardless of whether you're predicting house prices in dollars or temperatures in Celsius.
By the end of this page, you will understand:

- the mathematical definition and derivation of R²
- the geometric interpretation as projection
- why R² ranges from 0 to 1 (usually), and the edge cases where it can be negative
- the overfitting problem and why adjusted R² exists
- how to compute and use adjusted R²
- the limitations and appropriate use of R²
R-squared is defined as the proportion of variance in the dependent variable that is explained by the model:
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
Where $SS_{res}$ is the residual sum of squares (the squared error of the model's predictions) and $SS_{tot}$ is the total sum of squares (the squared deviation of $y$ from its mean).
Let's unpack this formula to build intuition.
The Three Sum of Squares
We can decompose the total variation in y into two components:
$$SS_{tot} = SS_{reg} + SS_{res}$$
$SS_{tot} = \sum_i (y_i - \bar{y})^2$ — Total Sum of Squares: the variation of $y$ around its mean
$SS_{res} = \sum_i (y_i - \hat{y}_i)^2$ — Residual Sum of Squares: the variation the model fails to explain
$SS_{reg} = \sum_i (\hat{y}_i - \bar{y})^2$ — Regression (Explained) Sum of Squares: the variation captured by the model
```python
import numpy as np

def r_squared_from_scratch(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """
    Compute R² and its components from scratch.
    Returns detailed breakdown of sum of squares.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)

    # Mean of true values (baseline prediction)
    y_mean = np.mean(y_true)

    # Sum of Squares components
    ss_tot = np.sum((y_true - y_mean) ** 2)   # Total variance
    ss_res = np.sum((y_true - y_pred) ** 2)   # Residual (unexplained)
    ss_reg = np.sum((y_pred - y_mean) ** 2)   # Explained by model

    # R-squared
    r_squared = 1 - (ss_res / ss_tot)

    # Alternative formulation (same result for OLS)
    r_squared_alt = ss_reg / ss_tot

    return {
        'ss_tot': ss_tot,
        'ss_res': ss_res,
        'ss_reg': ss_reg,
        'r_squared': r_squared,
        'r_squared_alt': r_squared_alt,
        'variance_explained_pct': r_squared * 100,
        'mse': ss_res / n
    }

# Example: Predicting test scores
y_true = np.array([65, 72, 80, 85, 90, 75, 70, 88, 95, 78])
y_pred = np.array([68, 70, 82, 84, 88, 77, 72, 85, 92, 80])

# Calculate R²
result = r_squared_from_scratch(y_true, y_pred)

print("=== R² Breakdown ===")
print(f"Mean of y: {np.mean(y_true):.2f}")
print(f"\nSum of Squares:")
print(f"  SS_total (variance around mean): {result['ss_tot']:.2f}")
print(f"  SS_residual (unexplained): {result['ss_res']:.2f}")
print(f"  SS_regression (explained): {result['ss_reg']:.2f}")
print(f"\nR² = 1 - (SS_res / SS_tot)")
print(f"R² = 1 - ({result['ss_res']:.2f} / {result['ss_tot']:.2f})")
print(f"R² = {result['r_squared']:.4f}")
print(f"\n→ Model explains {result['variance_explained_pct']:.1f}% of variance in test scores")
```

Notice that $SS_{res} = n \times MSE$, so R² can be written as $R^2 = 1 - \frac{n \times MSE}{SS_{tot}} = 1 - \frac{MSE}{\text{Var}(y)}$. R² normalizes MSE by the total variance, producing a scale-free metric.
R² has an intuitive interpretation: the fraction of variance in y that is 'explained' by the model.
Baseline Comparison
R² implicitly compares your model to a baseline model that predicts the mean for all samples:
$$R^2 = 1 - \frac{\text{Model MSE}}{\text{Baseline MSE}}$$
So R² measures: How much better is my model than just predicting the mean?
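To see this equivalence numerically, here is a minimal sketch (the toy arrays are made up for illustration) that computes R² as one minus the ratio of the model's MSE to the mean-only baseline's MSE and checks it against scikit-learn's `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative values (made up for this sketch)
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5, 10.5])

# Model MSE vs. baseline MSE (always predicting the mean of y_true)
model_mse = np.mean((y_true - y_pred) ** 2)
baseline_mse = np.mean((y_true - y_true.mean()) ** 2)

r2_from_ratio = 1 - model_mse / baseline_mse
print(f"R² from MSE ratio: {r2_from_ratio:.4f}")
print(f"R² from sklearn:   {r2_score(y_true, y_pred):.4f}")  # same value
```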
| R² Value | Interpretation | Model Quality |
|---|---|---|
| 1.0 | Perfect predictions (all variance explained) | Perfect (or overfitting) |
| 0.9-1.0 | Explains 90%+ of variance | Excellent |
| 0.7-0.9 | Explains 70-90% of variance | Good |
| 0.5-0.7 | Explains 50-70% of variance | Moderate |
| 0.3-0.5 | Explains 30-50% of variance | Weak |
| 0.0-0.3 | Explains little variance | Poor (but may still be useful) |
| 0.0 | Same as predicting the mean | No predictive power |
| < 0 | Worse than predicting the mean | Actively harmful |
Important Caveats
R² values must be interpreted in context:
Domain-dependent expectations: In physics, R² > 0.99 is common; in social sciences, R² = 0.3 might be excellent.
Doesn't mean 'accurate': R² = 0.9 doesn't mean predictions are within 10% of true values. It means 90% of variance is explained.
Can be misleading for non-linear patterns: with a linear model, R² only captures the linear component of the relationship, so a fundamentally wrong model can still post a respectable R².
Sensitive to outcome variance: If y has very high variance, even a good model might have moderate R². If y has low variance, even a mediocre model might have high R².
```python
import numpy as np

def r_squared_vs_accuracy():
    """
    Demonstrate that high R² doesn't mean predictions are 'close'.
    """
    np.random.seed(42)

    # Scenario 1: High R² with large absolute errors
    y_true_1 = np.array([100, 200, 300, 400, 500, 600, 700, 800, 900, 1000])
    y_pred_1 = y_true_1 + np.array([-30, 20, -40, 35, -25, 30, -35, 40, -20, 25])

    ss_tot_1 = np.sum((y_true_1 - np.mean(y_true_1)) ** 2)
    ss_res_1 = np.sum((y_true_1 - y_pred_1) ** 2)
    r2_1 = 1 - ss_res_1 / ss_tot_1
    mae_1 = np.mean(np.abs(y_true_1 - y_pred_1))

    # Scenario 2: Lower R² with smaller absolute errors
    y_true_2 = np.array([100, 102, 104, 106, 108, 110, 112, 114, 116, 118])
    y_pred_2 = y_true_2 + np.array([-3, 2, -4, 3, -2, 3, -3, 4, -2, 2])

    ss_tot_2 = np.sum((y_true_2 - np.mean(y_true_2)) ** 2)
    ss_res_2 = np.sum((y_true_2 - y_pred_2) ** 2)
    r2_2 = 1 - ss_res_2 / ss_tot_2
    mae_2 = np.mean(np.abs(y_true_2 - y_pred_2))

    print("=== R² vs Prediction Accuracy ===")
    print("\nScenario 1: High variance target (100 to 1000)")
    print(f"  R²:  {r2_1:.4f}")
    print(f"  MAE: {mae_1:.1f}")
    print(f"  → High R² but predictions off by ~{mae_1:.0f} on average!")
    print("\nScenario 2: Low variance target (100 to 118)")
    print(f"  R²:  {r2_2:.4f}")
    print(f"  MAE: {mae_2:.1f}")
    print(f"  → Lower R² but predictions off by only ~{mae_2:.0f}!")
    print("\n*** Key Insight ***")
    print("R² depends on target variance. Same MAE gives different R²")
    print("depending on how spread out the true values are.")

r_squared_vs_accuracy()
```

Never evaluate a model solely on R². Always report R² alongside MAE/RMSE for absolute error magnitude, residual plots for systematic patterns, and domain-specific metrics for business relevance.
R² has an elegant geometric interpretation in the space of observations.
Vectors in n-Dimensional Space
Consider each quantity as a vector in $\mathbb{R}^n$ (one dimension per data point): the centered observations $\mathbf{y} - \bar{\mathbf{y}}$, the centered predictions $\hat{\mathbf{y}} - \bar{\mathbf{y}}$, and the residuals $\mathbf{y} - \hat{\mathbf{y}}$.
R² as Cosine of Angle
For linear regression (OLS), R² equals the squared correlation between y and $\hat{y}$:
$$R^2 = \text{Corr}(y, \hat{y})^2 = \cos^2(\theta)$$
Where $\theta$ is the angle between the centered vectors $(\mathbf{y} - \bar{\mathbf{y}})$ and $(\hat{\mathbf{y}} - \bar{\mathbf{y}})$.
```python
import numpy as np

def geometric_r_squared(y_true: np.ndarray, y_pred: np.ndarray):
    """
    Demonstrate the geometric interpretation of R².
    R² = cos²(θ) between centered y and y_hat vectors.
    """
    # Center the vectors
    y_centered = y_true - np.mean(y_true)
    y_hat_centered = y_pred - np.mean(y_pred)

    # Compute angle using dot product
    # cos(θ) = (a · b) / (||a|| × ||b||)
    dot_product = np.dot(y_centered, y_hat_centered)
    norm_y = np.linalg.norm(y_centered)
    norm_y_hat = np.linalg.norm(y_hat_centered)

    cos_theta = dot_product / (norm_y * norm_y_hat)
    theta_radians = np.arccos(np.clip(cos_theta, -1, 1))
    theta_degrees = np.degrees(theta_radians)

    # R² from geometry
    r_squared_geometric = cos_theta ** 2

    # R² from standard formula
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    ss_res = np.sum((y_true - y_pred) ** 2)
    r_squared_standard = 1 - ss_res / ss_tot

    # Correlation coefficient
    correlation = np.corrcoef(y_true, y_pred)[0, 1]

    print("=== Geometric R² Interpretation ===")
    print(f"\nVector norms:")
    print(f"  ||y - ȳ|| = {norm_y:.4f}")
    print(f"  ||ŷ - ȳ|| = {norm_y_hat:.4f}")
    print(f"\nAngle between centered vectors:")
    print(f"  cos(θ) = {cos_theta:.4f}")
    print(f"  θ = {theta_degrees:.2f}°")
    print(f"\nR² calculations:")
    print(f"  cos²(θ) = {r_squared_geometric:.4f}")
    print(f"  1 - SS_res/SS_tot = {r_squared_standard:.4f}")
    print(f"  Corr(y, ŷ)² = {correlation**2:.4f}")
    print(f"\n→ cos²(θ) always equals Corr(y, ŷ)²; for an OLS fit it also equals 1 - SS_res/SS_tot")

    return r_squared_geometric

# Example (hand-picked predictions that track y closely; not an exact OLS fit,
# so the first two R² values agree only approximately)
y_true = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y_pred = np.array([1.2, 2.1, 2.8, 4.2, 5.1, 5.9, 7.2, 7.8, 9.1, 10.2])

geometric_r_squared(y_true, y_pred)
```

Projection Interpretation
In linear regression, the predictions $\hat{\mathbf{y}}$ are the orthogonal projection of $\mathbf{y}$ onto the column space of the design matrix $\mathbf{X}$.
By the Pythagorean theorem (since residuals are orthogonal to predictions):
$$||\mathbf{y} - \bar{\mathbf{y}}||^2 = ||\hat{\mathbf{y}} - \bar{\mathbf{y}}||^2 + ||\mathbf{y} - \hat{\mathbf{y}}||^2$$ $$SS_{tot} = SS_{reg} + SS_{res}$$
This decomposition holds exactly only for OLS linear regression with an intercept, which is why $SS_{reg}/SS_{tot} = R^2$ only in that setting.
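A small numerical check of this decomposition, using synthetic data and an OLS fit with an intercept via `np.linalg.lstsq` (the data and coefficients are arbitrary, chosen just for this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 1.5 * x + 2.0 + rng.normal(scale=0.5, size=50)

# OLS with an intercept: design matrix [1, x]
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

ss_tot = np.sum((y - y.mean()) ** 2)
ss_reg = np.sum((y_hat - y.mean()) ** 2)
ss_res = np.sum((y - y_hat) ** 2)

# Residuals are orthogonal to the fitted values, so the squared lengths add up
print(f"SS_tot          = {ss_tot:.4f}")
print(f"SS_reg + SS_res = {ss_reg + ss_res:.4f}")  # matches SS_tot
print(f"R² = SS_reg / SS_tot     = {ss_reg / ss_tot:.4f}")
print(f"R² = 1 - SS_res / SS_tot = {1 - ss_res / ss_tot:.4f}")
```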
A common misconception is that R² must be between 0 and 1. In fact, R² can be negative when the model performs worse than the baseline mean prediction.
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
If $SS_{res} > SS_{tot}$, then $R^2 < 0$.
This means your model's predictions are further from the true values than the mean is!
```python
import numpy as np

def demonstrate_negative_r_squared():
    """
    Show scenarios where R² becomes negative.
    """
    # True values
    y_true = np.array([10, 20, 30, 40, 50])
    y_mean = np.mean(y_true)  # = 30

    print("True values:", y_true)
    print(f"Mean: {y_mean}")
    print(f"\nBaseline (mean) predictions: [{y_mean}] × 5")

    ss_tot = np.sum((y_true - y_mean) ** 2)
    print(f"SS_tot (variance around mean): {ss_tot}")

    # Scenario 1: Good predictions
    y_pred_good = np.array([12, 22, 28, 38, 52])
    ss_res_good = np.sum((y_true - y_pred_good) ** 2)
    r2_good = 1 - ss_res_good / ss_tot
    print(f"\n--- Good Model ---")
    print(f"Predictions: {y_pred_good}")
    print(f"SS_res: {ss_res_good}")
    print(f"R²: {r2_good:.4f} (positive)")

    # Scenario 2: Bad predictions (worse than mean)
    y_pred_bad = np.array([50, 10, 50, 10, 50])  # Wrong direction!
    ss_res_bad = np.sum((y_true - y_pred_bad) ** 2)
    r2_bad = 1 - ss_res_bad / ss_tot
    print(f"\n--- Bad Model ---")
    print(f"Predictions: {y_pred_bad}")
    print(f"SS_res: {ss_res_bad}")
    print(f"R²: {r2_bad:.4f} (NEGATIVE!)")
    print(f"\n→ Model is worse than just predicting the mean!")

    # Scenario 3: Predicting exactly the mean
    y_pred_mean = np.full_like(y_true, y_mean, dtype=float)
    ss_res_mean = np.sum((y_true - y_pred_mean) ** 2)
    r2_mean = 1 - ss_res_mean / ss_tot
    print(f"\n--- Baseline Model (predict mean) ---")
    print(f"Predictions: {y_pred_mean}")
    print(f"SS_res: {ss_res_mean} (equals SS_tot)")
    print(f"R²: {r2_mean:.4f} (exactly zero)")

demonstrate_negative_r_squared()
```

When Does Negative R² Happen?
Model applied to wrong data: Training data distribution differs from test data
Incorrect model specification: Model structure fundamentally wrong for the problem
No intercept term: If linear regression has no intercept and the data doesn't pass through the origin (see the sketch below)
Adversarial or random predictions: Predictions uncorrelated or anti-correlated with truth
Data leakage during training: Model overfit to leaked information that's absent at test time
If you see negative R² on test data, something is seriously wrong. The absolute minimum a reasonable model should achieve is R² = 0 (predict the training mean). Negative R² indicates either a bug, data problem, or fundamental model mismatch.
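To make the no-intercept cause concrete, here is a short sketch (synthetic data, scikit-learn assumed) in which forcing the regression line through the origin yields a strongly negative R² even on the data it was fit to:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 50 + 2 * x.ravel() + rng.normal(scale=1.0, size=50)  # large offset from the origin

# Forcing the fit through the origin when the data has a big intercept
no_intercept = LinearRegression(fit_intercept=False).fit(x, y)
with_intercept = LinearRegression().fit(x, y)

print(f"R² without intercept: {r2_score(y, no_intercept.predict(x)):.3f}")  # strongly negative here
print(f"R² with intercept:    {r2_score(y, with_intercept.predict(x)):.3f}")
```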
R² has a fundamental flaw: it never decreases when you add more predictors, even if those predictors have no real relationship with the outcome.
Mathematically, adding a variable (or increasing model complexity) can only reduce $SS_{res}$—the model can always use the extra degree of freedom to fit the training data better, even if just fitting noise.
This leads to a critical problem: R² on training data is inflated for complex models.
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def demonstrate_r_squared_inflation():
    """
    Show how R² increases with model complexity,
    even for useless random features.
    """
    np.random.seed(42)

    # True relationship: y = 2*x1 + 3*x2 + noise
    n_samples = 100
    X_true = np.random.randn(n_samples, 2)
    y = 2 * X_true[:, 0] + 3 * X_true[:, 1] + np.random.randn(n_samples) * 0.5

    print("True model: y = 2*x1 + 3*x2 + noise")
    print(f"Samples: {n_samples}\n")

    # Progressively add random (useless) features
    r_squared_values = []
    for n_noise_features in range(0, 80, 5):
        # Create dataset with true features + noise features
        X_noise = np.random.randn(n_samples, n_noise_features)
        X_full = np.hstack([X_true, X_noise]) if n_noise_features > 0 else X_true

        # Fit model and compute R²
        model = LinearRegression()
        model.fit(X_full, y)
        y_pred = model.predict(X_full)
        r2 = r2_score(y, y_pred)
        r_squared_values.append((2 + n_noise_features, r2))

        if n_noise_features in [0, 10, 30, 50, 70]:
            print(f"Features: {2 + n_noise_features:2d} ({n_noise_features} noise) → R² = {r2:.4f}")

    print(f"\n*** Key Insight ***")
    print("R² keeps increasing even though noise features add NOTHING!")
    print("With enough features, R² → 1.0 (perfect fit to training noise)")

demonstrate_r_squared_inflation()

# Example output:
# Features:  2 (0 noise) → R² = 0.9721
# Features: 12 (10 noise) → R² = 0.9841
# Features: 32 (30 noise) → R² = 0.9944
# Features: 52 (50 noise) → R² = 0.9982
# Features: 72 (70 noise) → R² = 0.9996
```

The Extreme Case
With $n$ data points and $n$ features, a linear model can fit the training data perfectly ($R^2 = 1$) regardless of the true relationship—it just memorizes each point.
This is why you should always evaluate R² on held-out test data, not training data. Test R² will drop (sometimes dramatically) for overfit models, revealing the true generalization performance, as the sketch below illustrates.
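A minimal sketch of this extreme case (purely random data, scikit-learn assumed): with as many features as training points, the model can memorize the training set, so training R² is essentially perfect while held-out R² collapses:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# 10 training points, 10 purely random features, and a target with NO relationship to them
X_train = rng.normal(size=(10, 10))
y_train = rng.normal(size=10)

# Fresh data from the same (unrelated) process
X_test = rng.normal(size=(100, 10))
y_test = rng.normal(size=100)

model = LinearRegression().fit(X_train, y_train)

print(f"Train R²: {r2_score(y_train, model.predict(X_train)):.4f}")  # ≈ 1.0 (memorized)
print(f"Test  R²: {r2_score(y_test, model.predict(X_test)):.4f}")    # ≈ 0 or negative
```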
Adjusted R² corrects for the inflation problem by penalizing model complexity:
$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$
Where $R^2$ is the ordinary coefficient of determination, $n$ is the number of samples, and $p$ is the number of predictors (excluding the intercept).
How It Works
The adjustment factor $\frac{n-1}{n-p-1}$ is always ≥ 1 (since $p \geq 0$), so:
$$(1 - R^2_{adj}) = (1 - R^2) \times \frac{n-1}{n-p-1} \geq (1 - R^2)$$
Therefore $R^2_{adj} \leq R^2$ — adjusted R² is always less than or equal to regular R².
| Property | R² | Adjusted R² |
|---|---|---|
| Never decreases as features are added | Yes | No |
| Penalizes complexity | No | Yes |
| Can be negative | Only if worse than mean | Yes (if model overly complex) |
| Suitable for model selection | No (training data) | Yes (approximation) |
| Interpretation | Variance explained | Variance explained, adjusted for p |
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def adjusted_r_squared(r2: float, n: int, p: int) -> float:
    """
    Compute adjusted R² from R², sample size, and number of predictors.

    Parameters:
    -----------
    r2 : float - R-squared value
    n : int - Number of samples
    p : int - Number of predictors (excluding intercept)
    """
    if n - p - 1 <= 0:
        raise ValueError("n must be greater than p + 1")
    return 1 - ((1 - r2) * (n - 1) / (n - p - 1))

def compare_r2_adjusted():
    """
    Compare R² and Adjusted R² as features increase.
    """
    np.random.seed(42)
    n_samples = 100
    X_true = np.random.randn(n_samples, 2)
    y = 2 * X_true[:, 0] + 3 * X_true[:, 1] + np.random.randn(n_samples) * 0.5

    print("Effect of adding noise features on R² vs Adjusted R²")
    print(f"{'Features':^10} | {'R²':^10} | {'Adj R²':^10} | {'Difference':^10}")
    print("-" * 48)

    for n_noise in [0, 5, 10, 20, 30, 50, 70, 90]:
        X_noise = np.random.randn(n_samples, max(1, n_noise))
        X = np.hstack([X_true, X_noise]) if n_noise > 0 else X_true
        n_features = X.shape[1]

        model = LinearRegression()
        model.fit(X, y)
        r2 = r2_score(y, model.predict(X))
        adj_r2 = adjusted_r_squared(r2, n_samples, n_features)

        print(f"{n_features:^10} | {r2:^10.4f} | {adj_r2:^10.4f} | {r2 - adj_r2:^10.4f}")

    print("\n*** Key Insight ***")
    print("R² keeps increasing, but Adjusted R² DECREASES when")
    print("useless features are added—correctly penalizing overfitting.")

compare_r2_adjusted()
```

Adjusted R² for Model Selection
When comparing models with different numbers of features on the same dataset, prefer the model with the higher adjusted R²: it rewards an extra predictor only when that predictor improves the fit by more than chance alone would.
However, Adjusted $R^2$ is still computed on training data—it's an approximation, not a substitute for cross-validation on held-out data.
Use Adjusted R² when comparing models with different numbers of features on the same training data. For model evaluation on held-out data, regular R² is fine since you're not fitting to that data. For formal model selection, consider AIC/BIC or cross-validation.
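For reference, here is a rough sketch of the AIC/BIC route under the usual Gaussian-error assumption; `aic_bic_ols` is a hypothetical helper written for this example (the formulas are shown up to additive constants), not a library function:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def aic_bic_ols(y_true, y_pred, n_params):
    """Gaussian-likelihood AIC/BIC for an OLS fit (up to additive constants)."""
    n = len(y_true)
    ss_res = np.sum((y_true - y_pred) ** 2)
    aic = n * np.log(ss_res / n) + 2 * n_params
    bic = n * np.log(ss_res / n) + n_params * np.log(n)
    return aic, bic

# Hypothetical comparison: 2 real features vs. the same 2 plus 20 noise features
rng = np.random.default_rng(0)
n = 100
X_true = rng.normal(size=(n, 2))
y = 2 * X_true[:, 0] + 3 * X_true[:, 1] + rng.normal(scale=0.5, size=n)
X_big = np.hstack([X_true, rng.normal(size=(n, 20))])

for name, X in [("2 features", X_true), ("22 features", X_big)]:
    model = LinearRegression().fit(X, y)
    aic, bic = aic_bic_ols(y, model.predict(X), X.shape[1] + 1)  # +1 for the intercept
    print(f"{name:>12}: AIC = {aic:8.2f}, BIC = {bic:8.2f}")  # lower is better
```

Both criteria penalize parameter count explicitly, so the noise-padded model typically loses here despite its slightly smaller residual sum of squares.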
R² is widely used but often misused. Understanding its limitations prevents interpretation errors.
Major Limitations

- R² says nothing about whether the model's functional form is correct; a linear fit to a clearly non-linear relationship can still post a decent R².
- R² depends on the variance and range of y: restricting the range of the data changes R² dramatically even when model quality is unchanged.
- R² is blind to heteroscedasticity and other systematic patterns in the residuals.
- Training-set R² is inflated by added features, as shown in the overfitting section above.
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def demonstrate_pitfalls():
    """
    Show scenarios where R² is misleading.
    """
    np.random.seed(42)

    # Pitfall 1: High R² with non-linear relationship
    print("=== Pitfall 1: Non-linearity hidden by R² ===")
    x = np.linspace(0, 10, 100)
    y_nonlinear = np.sin(x) * x + np.random.randn(100) * 0.5

    # Fit linear model
    lr = LinearRegression()
    lr.fit(x.reshape(-1, 1), y_nonlinear)
    y_pred = lr.predict(x.reshape(-1, 1))
    r2 = r2_score(y_nonlinear, y_pred)
    print(f"True relationship: y = sin(x) * x")
    print(f"Linear model R²: {r2:.4f}")
    print("→ R² looks decent, but the model is fundamentally wrong!")

    # Pitfall 2: Range restriction
    print("\n=== Pitfall 2: Range restriction ===")
    x_full = np.linspace(0, 100, 1000)
    y_full = 2 * x_full + np.random.randn(1000) * 10

    # Full range
    lr.fit(x_full.reshape(-1, 1), y_full)
    r2_full = r2_score(y_full, lr.predict(x_full.reshape(-1, 1)))

    # Restricted range (50-60 only)
    mask = (x_full >= 50) & (x_full <= 60)
    x_restricted = x_full[mask]
    y_restricted = y_full[mask]
    lr.fit(x_restricted.reshape(-1, 1), y_restricted)
    r2_restricted = r2_score(y_restricted, lr.predict(x_restricted.reshape(-1, 1)))

    print(f"Full range (0-100): R² = {r2_full:.4f}")
    print(f"Restricted (50-60): R² = {r2_restricted:.4f}")
    print("→ Same model quality, but range affects R² dramatically!")

    # Pitfall 3: R² with non-proportional variance
    print("\n=== Pitfall 3: Heteroscedasticity ===")
    x = np.linspace(1, 100, 100)
    # Noise increases with x (heteroscedastic)
    y_hetero = 2 * x + np.random.randn(100) * x * 0.5
    lr.fit(x.reshape(-1, 1), y_hetero)
    r2 = r2_score(y_hetero, lr.predict(x.reshape(-1, 1)))
    print(f"Heteroscedastic data R²: {r2:.4f}")
    print("→ High R², but residuals aren't random—model assumptions violated!")

demonstrate_pitfalls()
```

Best Practices
Here's how to effectively compute and report R² in real-world workflows.
```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from scipy import stats

def comprehensive_regression_report(y_true, y_pred, n_features, model_name="Model"):
    """
    Generate comprehensive regression metrics report including R² analysis.
    """
    n = len(y_true)

    # Basic metrics
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)

    # R² and Adjusted R²
    r2 = r2_score(y_true, y_pred)
    adj_r2 = 1 - ((1 - r2) * (n - 1) / (n - n_features - 1))

    # Additional diagnostics
    residuals = y_true - y_pred
    residual_std = np.std(residuals)

    # Correlation between predictions and actuals
    correlation = np.corrcoef(y_true, y_pred)[0, 1]

    # 95% confidence interval for R² (approximate, using Fisher transformation)
    z = 0.5 * np.log((1 + correlation) / (1 - correlation))
    se = 1 / np.sqrt(n - 3)
    z_lower = z - 1.96 * se
    z_upper = z + 1.96 * se
    r_lower = (np.exp(2 * z_lower) - 1) / (np.exp(2 * z_lower) + 1)
    r_upper = (np.exp(2 * z_upper) - 1) / (np.exp(2 * z_upper) + 1)
    r2_ci = (r_lower**2, r_upper**2)

    print(f"╔══════ {model_name} Regression Report ══════╗")
    print(f"║ Samples: {n}, Features: {n_features}")
    print(f"╠════════════════════════════════════════════╣")
    print(f"║ CORE METRICS")
    print(f"║   MSE:  {mse:.4f}")
    print(f"║   RMSE: {rmse:.4f}")
    print(f"║   MAE:  {mae:.4f}")
    print(f"╠════════════════════════════════════════════╣")
    print(f"║ R² ANALYSIS")
    print(f"║   R²:          {r2:.4f} ({r2*100:.1f}% variance explained)")
    print(f"║   Adjusted R²: {adj_r2:.4f}")
    print(f"║   95% CI:      ({r2_ci[0]:.4f}, {r2_ci[1]:.4f})")
    print(f"╠════════════════════════════════════════════╣")
    print(f"║ QUALITY CHECKS")
    print(f"║   Residual Mean: {np.mean(residuals):.4f} (should be ~0)")
    print(f"║   Residual Std:  {residual_std:.4f}")
    print(f"╚════════════════════════════════════════════╝")

    if adj_r2 < r2 - 0.05:
        print("⚠️ Large gap between R² and Adjusted R² suggests overfitting")

    return {
        'r2': r2,
        'adj_r2': adj_r2,
        'rmse': rmse,
        'mae': mae,
        'r2_ci': r2_ci
    }

def cross_validated_r2(X, y, model, cv=5):
    """
    Compute R² with cross-validation for robust estimation.
    """
    scores = cross_val_score(model, X, y, cv=cv, scoring='r2')

    print(f"\nCross-Validated R² (k={cv}):")
    print(f"  Mean:  {scores.mean():.4f}")
    print(f"  Std:   {scores.std():.4f}")
    print(f"  Range: [{scores.min():.4f}, {scores.max():.4f}]")

    return scores

# Example usage
np.random.seed(42)
n_samples, n_features = 200, 5
X = np.random.randn(n_samples, n_features)
true_coef = np.array([3.0, -2.0, 1.5, 0, 0])  # Last two features are noise
y = X @ true_coef + np.random.randn(n_samples) * 2

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

comprehensive_regression_report(y_test, y_pred, n_features, "Linear Regression")
cross_validated_r2(X, y, LinearRegression())
```

R² provides a scale-independent measure of model explanatory power. Let's consolidate the key insights:

- $R^2 = 1 - SS_{res}/SS_{tot}$: the fraction of variance in y explained relative to a mean-only baseline.
- R² is scale-free, but it says nothing about absolute error size; report it alongside MAE/RMSE and residual plots.
- R² can be negative when a model predicts worse than the mean, and training-set R² never decreases as features are added.
- Adjusted R² penalizes complexity and helps compare models with different numbers of predictors; held-out or cross-validated R² remains the most reliable estimate of generalization.
What's Next
R² and RMSE are absolute metrics. But what if you need error as a percentage of the true value? MAPE (Mean Absolute Percentage Error) and SMAPE provide relative error metrics that are scale-invariant and often preferred in business contexts. We'll explore these next.
You now understand R² deeply—its computation, interpretation, limitations, and when to use adjusted R². You can communicate model quality effectively and avoid common interpretation pitfalls.