You've built a new recommendation model. Offline metrics look promising—NDCG improved by 3%, precision by 2%. You're confident it's an improvement. But here's the uncomfortable truth: offline metrics can lie.
The only way to truly know if your recommendation system improves user experience is to show it to real users and measure their response. This is the domain of A/B testing (also called online controlled experiments)—the gold standard for evaluating changes in production systems.
Companies like Google, Netflix, Amazon, and Meta run thousands of A/B tests simultaneously. Every significant recommendation system change at scale goes through rigorous online experimentation. Understanding A/B testing isn't optional for recommendation practitioners—it's essential.
By the end of this page, you will understand the statistical foundations of A/B testing, design properly powered experiments, calculate and interpret p-values and confidence intervals, avoid common pitfalls that invalidate experiments, and handle the specific challenges of testing recommendation systems.
An A/B test (or A/B/n test with multiple variants) is a randomized controlled experiment where users are randomly assigned to different treatments, and outcomes are compared.
Core Components:
1. Randomization Unit
What entity gets randomized? Usually users, but could be sessions, devices, or geographic regions.
2. Control (A) vs Treatment (B)
The control is the current production system; the treatment is the proposed change being evaluated.
3. Metrics
What you measure: one primary success metric, plus secondary and guardrail metrics.
4. Sample Size & Duration
How many users and how long? Determined by statistical power analysis.
5. Statistical Analysis
P-values, confidence intervals, effect sizes to determine if differences are real.
The Hypothesis Testing Framework:
$$H_0: \mu_{treatment} = \mu_{control} \quad \text{(no difference)}$$ $$H_1: \mu_{treatment} \neq \mu_{control} \quad \text{(there is a difference)}$$
We want to reject $H_0$ with low probability of being wrong (Type I error), while having high probability of detecting real effects (power, 1 - Type II error).
Key Decisions:
| Parameter | Typical Value | Impact |
|---|---|---|
| Significance level (α) | 0.05 | Probability of false positive |
| Power (1-β) | 0.80 | Probability of detecting true effect |
| Minimum detectable effect (MDE) | Varies | Smallest effect worth detecting |
| Traffic split | 50/50 | Power vs risk balance |
A solid grasp of statistics is essential for valid A/B testing. Let's cover the core concepts systematically.
The Central Limit Theorem (CLT)
Regardless of the underlying distribution, the sample mean approaches a normal distribution as sample size increases:
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
This enables using normal-distribution-based statistical tests for most metrics.
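A small simulation (illustrative, not from the text) makes this concrete: even for a heavily skewed distribution, the distribution of sample means is approximately normal with standard deviation $\sigma/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(42)

# Heavily skewed underlying distribution: exponential with mean 1, std 1
n, n_experiments = 500, 2000
samples = rng.exponential(scale=1.0, size=(n_experiments, n))

# Each row is one "experiment"; take its sample mean
sample_means = samples.mean(axis=1)

# CLT prediction: mean of sample means ≈ 1.0, std ≈ sigma / sqrt(n)
predicted_se = 1.0 / np.sqrt(n)
observed_se = sample_means.std()

print(f"predicted SE: {predicted_se:.4f}, observed SE: {observed_se:.4f}")
```

Despite the skewed underlying data, a histogram of `sample_means` would look close to a bell curve, and the observed spread matches the $\sigma/\sqrt{n}$ prediction.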
Standard Error
The variability of the sample mean:
$$SE = \frac{\sigma}{\sqrt{n}}$$
Larger samples → smaller standard error → more precise estimates.
Confidence Intervals
For a 95% CI:
$$CI = \bar{X} \pm 1.96 \cdot SE$$
"We're 95% confident the true population mean lies within this interval."
P-Value Interpretation
The probability of observing results at least as extreme as what we measured, if the null hypothesis were true.
Common Misinterpretations (avoid these!):
- The p-value is not the probability that the null hypothesis is true.
- The p-value is not the probability that the result occurred "by chance."
- $1 - p$ is not the probability that the treatment is better.
- $p > 0.05$ does not prove there is no effect; it may simply mean the test was underpowered.
Two-Sample T-Test
For comparing means between treatment and control:
$$t = \frac{\bar{X}_{treatment} - \bar{X}_{control}}{\sqrt{\frac{s^2_{treatment}}{n_{treatment}} + \frac{s^2_{control}}{n_{control}}}}$$
For proportions (CTR, conversion rate):
$$z = \frac{p_{treatment} - p_{control}}{\sqrt{p(1-p)(\frac{1}{n_{treatment}} + \frac{1}{n_{control}})}}$$
Where $p$ is the pooled proportion.
```python
import numpy as np
from scipy import stats
from typing import Tuple, Dict
from dataclasses import dataclass


@dataclass
class ABTestResult:
    """Container for A/B test statistical results."""
    control_mean: float
    treatment_mean: float
    relative_lift: float
    absolute_difference: float
    p_value: float
    confidence_interval: Tuple[float, float]
    is_significant: bool
    required_sample_size: int = None
    actual_sample_size: int = None


def two_sample_proportions_test(
    control_successes: int,
    control_total: int,
    treatment_successes: int,
    treatment_total: int,
    alpha: float = 0.05
) -> ABTestResult:
    """
    Two-sample proportion test for binary outcomes (CTR, conversion).

    Uses Z-test for proportions with pooled variance.

    Args:
        control_successes: Number of successes (clicks, conversions) in control
        control_total: Total observations in control
        treatment_successes: Number of successes in treatment
        treatment_total: Total observations in treatment
        alpha: Significance level

    Returns:
        ABTestResult with test statistics
    """
    # Proportions
    p_control = control_successes / control_total
    p_treatment = treatment_successes / treatment_total

    # Pooled proportion under null hypothesis
    p_pooled = (control_successes + treatment_successes) / (control_total + treatment_total)

    # Standard error under pooled variance
    se = np.sqrt(
        p_pooled * (1 - p_pooled) * (1/control_total + 1/treatment_total)
    )

    # Z-statistic
    z = (p_treatment - p_control) / se if se > 0 else 0

    # Two-tailed p-value
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))

    # Confidence interval for the difference
    se_diff = np.sqrt(
        p_control * (1 - p_control) / control_total
        + p_treatment * (1 - p_treatment) / treatment_total
    )
    z_critical = stats.norm.ppf(1 - alpha/2)
    ci_low = (p_treatment - p_control) - z_critical * se_diff
    ci_high = (p_treatment - p_control) + z_critical * se_diff

    return ABTestResult(
        control_mean=p_control,
        treatment_mean=p_treatment,
        relative_lift=(p_treatment - p_control) / p_control if p_control > 0 else 0,
        absolute_difference=p_treatment - p_control,
        p_value=p_value,
        confidence_interval=(ci_low, ci_high),
        is_significant=p_value < alpha,
        actual_sample_size=control_total + treatment_total
    )


def two_sample_means_test(
    control_values: np.ndarray,
    treatment_values: np.ndarray,
    alpha: float = 0.05
) -> ABTestResult:
    """
    Two-sample t-test for continuous outcomes (revenue, watch time).

    Uses Welch's t-test (unequal variances assumption).

    Args:
        control_values: Array of continuous values from control
        treatment_values: Array of continuous values from treatment
        alpha: Significance level

    Returns:
        ABTestResult with test statistics
    """
    # Means and standard errors
    control_mean = np.mean(control_values)
    treatment_mean = np.mean(treatment_values)

    # Welch's t-test
    t_stat, p_value = stats.ttest_ind(
        treatment_values,
        control_values,
        equal_var=False  # Welch's t-test
    )

    # Confidence interval for difference
    se_control = stats.sem(control_values)
    se_treatment = stats.sem(treatment_values)
    se_diff = np.sqrt(se_control**2 + se_treatment**2)

    # Using conservative df for t-critical value
    df = min(len(control_values), len(treatment_values)) - 1
    t_critical = stats.t.ppf(1 - alpha/2, df)

    diff = treatment_mean - control_mean
    ci_low = diff - t_critical * se_diff
    ci_high = diff + t_critical * se_diff

    return ABTestResult(
        control_mean=control_mean,
        treatment_mean=treatment_mean,
        relative_lift=(treatment_mean - control_mean) / control_mean if control_mean > 0 else 0,
        absolute_difference=treatment_mean - control_mean,
        p_value=p_value,
        confidence_interval=(ci_low, ci_high),
        is_significant=p_value < alpha,
        actual_sample_size=len(control_values) + len(treatment_values)
    )


def calculate_sample_size(
    baseline_rate: float,
    mde: float,  # Minimum detectable effect (relative)
    alpha: float = 0.05,
    power: float = 0.80
) -> int:
    """
    Calculate required sample size per variant for proportion test.

    Args:
        baseline_rate: Current conversion/CTR rate
        mde: Minimum detectable effect as relative change (e.g., 0.05 for 5% lift)
        alpha: Significance level (Type I error rate)
        power: Statistical power (1 - Type II error rate)

    Returns:
        Required sample size per variant
    """
    # Treatment rate under alternative
    treatment_rate = baseline_rate * (1 + mde)

    # Effect size (Cohen's h for proportions)
    h = 2 * (np.arcsin(np.sqrt(treatment_rate)) - np.arcsin(np.sqrt(baseline_rate)))

    # Z-scores
    z_alpha = stats.norm.ppf(1 - alpha/2)
    z_beta = stats.norm.ppf(power)

    # Sample size formula
    n = 2 * ((z_alpha + z_beta) / h) ** 2

    return int(np.ceil(n))


class ABTestAnalyzer:
    """
    Production-grade A/B test analyzer with multiple metric support
    and correction for multiple comparisons.
    """

    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha

    def analyze_multiple_metrics(
        self,
        control_data: Dict[str, np.ndarray],    # metric_name -> values
        treatment_data: Dict[str, np.ndarray],
        metric_types: Dict[str, str]            # metric_name -> 'proportion' or 'continuous'
    ) -> Dict[str, ABTestResult]:
        """
        Analyze multiple metrics with Bonferroni correction.

        When testing multiple metrics, we need to adjust for multiple
        comparisons to control false positive rate.
        """
        n_metrics = len(control_data)
        adjusted_alpha = self.alpha / n_metrics  # Bonferroni correction

        results = {}
        for metric_name, control_values in control_data.items():
            treatment_values = treatment_data[metric_name]
            metric_type = metric_types.get(metric_name, 'continuous')

            if metric_type == 'proportion':
                result = two_sample_proportions_test(
                    control_successes=int(control_values.sum()),
                    control_total=len(control_values),
                    treatment_successes=int(treatment_values.sum()),
                    treatment_total=len(treatment_values),
                    alpha=adjusted_alpha
                )
            else:
                result = two_sample_means_test(
                    control_values=control_values,
                    treatment_values=treatment_values,
                    alpha=adjusted_alpha
                )

            results[metric_name] = result

        return results

    def compute_power(
        self,
        control_values: np.ndarray,
        treatment_values: np.ndarray,
        mde: float
    ) -> float:
        """
        Compute observed power (post-hoc power analysis).

        Warning: Post-hoc power is controversial in statistics.
        Best used for planning future experiments.
        """
        from statsmodels.stats.power import TTestIndPower

        # Standardized effect size (Cohen's d): absolute effect / std
        effect_size = (mde * np.mean(control_values)) / np.std(control_values)
        n = min(len(control_values), len(treatment_values))

        power_analysis = TTestIndPower()
        power = power_analysis.power(
            effect_size=effect_size,
            nobs1=n,
            ratio=1.0,
            alpha=self.alpha,
            alternative='two-sided'
        )
        return power
```

Running an underpowered experiment is one of the most common mistakes in A/B testing. You waste time and resources, and may incorrectly conclude there's no effect when there actually is one.
What is Statistical Power?
Power is the probability of detecting a true effect when it exists:
$$\text{Power} = P(\text{reject } H_0 | H_1 \text{ is true}) = 1 - \beta$$
Required Sample Size per Variant (α = 0.05, power = 0.80, two-sided test):
| Baseline CTR | 5% Relative Lift | 10% Relative Lift | 20% Relative Lift |
|---|---|---|---|
| 1% | 3,143,168 | 786,368 | 197,264 |
| 2% | 1,541,568 | 386,176 | 97,088 |
| 5% | 593,408 | 148,672 | 37,376 |
| 10% | 280,064 | 70,208 | 17,664 |
| 20% | 124,416 | 31,232 | 7,872 |
Key Insight: The MDE-Sample Size Trade-off
Sample size scales with $1/MDE^2$. Halving the MDE you want to detect quadruples the required sample size.
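This scaling is easy to verify numerically. The sketch below uses the same arcsine (Cohen's h) formula as `calculate_sample_size` above; absolute counts depend on which variance approximation is used, so treat them as illustrative, but the $1/MDE^2$ scaling is robust.

```python
import numpy as np
from scipy import stats

def n_per_variant(baseline_rate: float, mde: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Sample size per variant via Cohen's h (as in calculate_sample_size)."""
    treatment_rate = baseline_rate * (1 + mde)
    h = 2 * (np.arcsin(np.sqrt(treatment_rate)) - np.arcsin(np.sqrt(baseline_rate)))
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return int(np.ceil(2 * ((z_alpha + z_beta) / h) ** 2))

# Halving the MDE (10% relative lift -> 5%) roughly quadruples n
n_mde_10 = n_per_variant(0.03, 0.10)
n_mde_05 = n_per_variant(0.03, 0.05)
ratio = n_mde_05 / n_mde_10
print(f"n(10% lift) = {n_mde_10:,}, n(5% lift) = {n_mde_05:,}, ratio ≈ {ratio:.2f}")
```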
Practical Implications:
- Detecting small lifts on low-baseline metrics (e.g., 1% CTR) requires millions of users; low-traffic products may only be able to detect large effects.
- If you can't reach the required sample size, raise the MDE, choose a more sensitive metric, or apply variance reduction (e.g., CUPED, covered below).
- Routing more traffic into the experiment shortens its duration but increases exposure to a potentially harmful treatment.
Duration Estimation:
$$\text{Days Required} = \frac{2 \times n_{\text{per variant}}}{\text{Daily Traffic} \times \text{Experiment \%}}$$
Example: Need 100K users per variant, 50K daily traffic, 10% in experiment: $$\text{Days} = \frac{2 \times 100,000}{50,000 \times 0.10} = 40 \text{ days}$$
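The same arithmetic as a trivial helper (function and argument names are our own):

```python
def experiment_days(n_per_variant: int, daily_traffic: int,
                    experiment_fraction: float) -> float:
    """Days needed to fill both variants from the eligible traffic slice."""
    return (2 * n_per_variant) / (daily_traffic * experiment_fraction)

# 100K users per variant, 50K daily traffic, 10% of traffic in the experiment
print(experiment_days(100_000, 50_000, 0.10))  # 40.0
```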
Peeking at results and stopping when p < 0.05 massively inflates false positive rates. If you check twice, your Type I error is ~8% instead of 5%. If you check daily, it can exceed 30%. Always run to the predetermined sample size, or use sequential testing methods designed for early stopping.
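The inflation from peeking is easy to demonstrate with a simulated A/A test (no true difference between arms), checking significance after every batch. The batch sizes and number of looks below are illustrative assumptions; exact rates depend on how often you look.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_looks, batch = 1000, 10, 200

reject_any_look = 0   # "peeking": stop the first time p < 0.05
reject_final = 0      # fixed horizon: only test at the end
for _ in range(n_experiments):
    # A/A test: both arms drawn from the same distribution
    a = rng.normal(size=(n_looks, batch))
    b = rng.normal(size=(n_looks, batch))
    significant_at_look = []
    for k in range(1, n_looks + 1):
        _, p = stats.ttest_ind(a[:k].ravel(), b[:k].ravel())
        significant_at_look.append(p < 0.05)
    reject_any_look += any(significant_at_look)
    reject_final += significant_at_look[-1]

fpr_peeking = reject_any_look / n_experiments
fpr_fixed = reject_final / n_experiments
print(f"peeking FPR: {fpr_peeking:.3f}, fixed-horizon FPR: {fpr_fixed:.3f}")
```

With ten looks, the peeking false positive rate typically lands well above the nominal 5%, while the fixed-horizon test stays near it.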
Recommendation systems have unique characteristics that affect experimental design. Understanding these nuances is critical for valid experiments.
Randomization Unit Choices:
| Unit | Pros | Cons |
|---|---|---|
| User | Consistent experience, captures long-term effects | Needs user identification |
| Session | Works for anonymous users | Inconsistent experience across sessions |
| Request | Maximum sample size | Within-session inconsistency |
| Device | Compromise for logged-out | Multiple people may share device |
| Geographic | Tests regional effects | Limited samples, high variance |
Metric Selection for RecSys:
Primary Metrics (OEC candidates):
- Click-through rate (CTR), conversion rate, revenue per user
- Engagement: watch time, session length, items consumed per session
Guardrail Metrics:
- Revenue per user (must not degrade significantly)
- Latency (e.g., p50/p99 response time)
- Error rate
```python
import hashlib
from datetime import datetime, timedelta
from typing import Dict, List, Optional
from dataclasses import dataclass
import numpy as np


@dataclass
class ExperimentConfig:
    """Configuration for a recommendation A/B test."""
    experiment_id: str
    control_name: str
    treatment_name: str
    traffic_percentage: float      # Total experiment traffic (0.0 to 1.0)
    treatment_split: float = 0.5   # Within experiment, how much goes to treatment
    start_date: datetime = None
    end_date: datetime = None
    target_sample_per_variant: int = None

    # Stratification
    user_segments: List[str] = None  # e.g., ["new", "returning", "power_user"]

    # Metric configuration
    primary_metric: str = "conversion_rate"
    secondary_metrics: List[str] = None
    guardrail_metrics: List[str] = None


class ExperimentAssigner:
    """
    Deterministic assignment of users to experiment variants.

    Uses consistent hashing to ensure:
    1. Same user always gets same variant
    2. Assignment is pseudo-random
    3. Reproducible results
    """

    def __init__(self, salt: str = "recsys_experiment"):
        self.salt = salt

    def get_assignment(
        self,
        user_id: str,
        config: ExperimentConfig
    ) -> Optional[str]:
        """
        Assign user to experiment variant.

        Returns:
            'control', 'treatment', or None (not in experiment)
        """
        # Hash user_id + experiment_id + salt for deterministic assignment
        hash_input = f"{user_id}:{config.experiment_id}:{self.salt}"
        hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)

        # Convert to [0, 1) uniformly
        bucket = (hash_value % 10000) / 10000

        # Is user in experiment at all?
        if bucket >= config.traffic_percentage:
            return None  # Not in experiment

        # Within experiment, which variant?
        # Re-hash for independent randomization
        variant_input = f"{user_id}:{config.experiment_id}:variant:{self.salt}"
        variant_hash = int(hashlib.md5(variant_input.encode()).hexdigest(), 16)
        variant_bucket = (variant_hash % 10000) / 10000

        if variant_bucket < config.treatment_split:
            return config.treatment_name
        else:
            return config.control_name

    def simulate_split(
        self,
        n_users: int,
        config: ExperimentConfig
    ) -> Dict[str, int]:
        """
        Simulate experiment split for validation.

        Useful for verifying assignment is approximately correct.
        """
        counts = {
            config.control_name: 0,
            config.treatment_name: 0,
            "not_in_experiment": 0
        }
        for i in range(n_users):
            user_id = f"test_user_{i}"
            assignment = self.get_assignment(user_id, config)
            if assignment is None:
                counts["not_in_experiment"] += 1
            else:
                counts[assignment] += 1
        return counts


class ExperimentGuardrails:
    """
    Monitor guardrail metrics to catch harmful experiments early.

    Even if primary metric improves, we may need to stop if guardrails fail.
    """

    def __init__(
        self,
        degradation_threshold: float = 0.05,  # 5% degradation triggers alert
        latency_increase_threshold_ms: float = 50
    ):
        self.degradation_threshold = degradation_threshold
        self.latency_increase_threshold = latency_increase_threshold_ms

    def check_guardrails(
        self,
        control_metrics: Dict[str, float],
        treatment_metrics: Dict[str, float]
    ) -> Dict[str, Dict]:
        """
        Check all guardrail metrics.

        Returns dict of metric_name -> {passed: bool, message: str}
        """
        results = {}

        # Revenue guardrail (shouldn't decrease significantly)
        if 'revenue_per_user' in control_metrics:
            control_rev = control_metrics['revenue_per_user']
            treatment_rev = treatment_metrics['revenue_per_user']
            relative_change = (treatment_rev - control_rev) / control_rev if control_rev > 0 else 0
            results['revenue'] = {
                'passed': relative_change > -self.degradation_threshold,
                'message': f"Revenue change: {relative_change:.2%}",
                'control': control_rev,
                'treatment': treatment_rev
            }

        # Latency guardrail (shouldn't increase significantly)
        if 'p50_latency_ms' in control_metrics:
            control_lat = control_metrics['p50_latency_ms']
            treatment_lat = treatment_metrics['p50_latency_ms']
            increase = treatment_lat - control_lat
            results['latency'] = {
                'passed': increase < self.latency_increase_threshold,
                'message': f"Latency change: {increase:.1f}ms",
                'control': control_lat,
                'treatment': treatment_lat
            }

        # Error rate guardrail
        if 'error_rate' in control_metrics:
            control_err = control_metrics['error_rate']
            treatment_err = treatment_metrics['error_rate']
            # Error rate shouldn't increase by more than 0.5 percentage points
            increase = treatment_err - control_err
            results['error_rate'] = {
                'passed': increase < 0.005,
                'message': f"Error rate change: {increase:.3%}",
                'control': control_err,
                'treatment': treatment_err
            }

        return results

    def should_stop_experiment(
        self,
        guardrail_results: Dict[str, Dict]
    ) -> bool:
        """
        Determine if experiment should be stopped due to guardrail failures.
        """
        failed_guardrails = [
            name for name, result in guardrail_results.items()
            if not result['passed']
        ]
        if failed_guardrails:
            print(f"⚠️ Guardrail failures: {failed_guardrails}")
            return True
        return False
```

A/B testing seems simple but is filled with subtle pitfalls that can invalidate results. Knowledge of these pitfalls is what separates rigorous experimentation from cargo cult science.
Pitfall 1: Peeking (Multiple Testing Over Time)
Checking results daily and stopping when you see significance inflates false positive rates dramatically.
Why: Under the null hypothesis, p-values fluctuate randomly. If you check 10 times, you have 10 chances to get a false positive.
Solution: Pre-commit to sample size and duration. Or use sequential testing methods (SPRT, Bayesian methods) designed for early stopping.
Pitfall 2: Sample Ratio Mismatch (SRM)
The observed split differs from expected split (e.g., 52% control vs 48% treatment when expecting 50/50).
Why: Indicates a bug in randomization, assignment, or logging. Results are unreliable.
Solution: Always check for SRM before analyzing results. Chi-squared test on observed vs expected counts.
```python
import numpy as np
from scipy import stats
from typing import Dict, Tuple


def check_sample_ratio_mismatch(
    control_count: int,
    treatment_count: int,
    expected_ratio: float = 0.5,
    alpha: float = 0.01  # Use stricter threshold for SRM
) -> Tuple[bool, float]:
    """
    Check for Sample Ratio Mismatch (SRM).

    SRM indicates bugs in randomization/logging. If SRM is detected,
    experiment results are unreliable.

    Uses chi-squared goodness of fit test.

    Returns:
        (has_srm: bool, p_value: float)
    """
    total = control_count + treatment_count
    expected_control = total * expected_ratio
    expected_treatment = total * (1 - expected_ratio)

    observed = [control_count, treatment_count]
    expected = [expected_control, expected_treatment]

    chi2, p_value = stats.chisquare(observed, expected)
    has_srm = p_value < alpha

    if has_srm:
        print("⚠️ SAMPLE RATIO MISMATCH DETECTED!")
        print(f"   Expected: {expected_ratio:.2%} / {1-expected_ratio:.2%}")
        print(f"   Observed: {control_count/total:.2%} / {treatment_count/total:.2%}")
        print(f"   p-value: {p_value:.6f}")
        print("   Results are likely unreliable. Investigate assignment/logging bugs.")

    return has_srm, p_value


def bonferroni_correction(
    p_values: Dict[str, float],
    alpha: float = 0.05
) -> Dict[str, Dict]:
    """
    Apply Bonferroni correction for multiple comparisons.

    When testing multiple metrics, we need stricter thresholds
    to maintain overall false positive rate.
    """
    n_tests = len(p_values)
    adjusted_alpha = alpha / n_tests

    results = {}
    for metric, p_value in p_values.items():
        results[metric] = {
            'original_p': p_value,
            'adjusted_alpha': adjusted_alpha,
            'significant_after_correction': p_value < adjusted_alpha
        }
    return results


def check_simpsons_paradox(
    segment_results: Dict[str, Dict],  # segment -> {control_rate, treatment_rate}
    overall_result: Dict               # {control_rate, treatment_rate}
) -> bool:
    """
    Check if overall result contradicts segment-level results.

    Simpson's Paradox: Treatment beats control in every segment,
    but control beats treatment overall (or vice versa).
    This happens when segment sizes differ between control/treatment.
    """
    overall_treatment_wins = overall_result['treatment_rate'] > overall_result['control_rate']

    segment_directions = []
    for segment, result in segment_results.items():
        treatment_wins = result['treatment_rate'] > result['control_rate']
        segment_directions.append(treatment_wins)

    # Paradox if all segments agree but disagree with overall
    all_segments_agree = len(set(segment_directions)) == 1
    segments_disagree_with_overall = (
        all_segments_agree and segment_directions[0] != overall_treatment_wins
    )

    if segments_disagree_with_overall:
        print("⚠️ SIMPSON'S PARADOX DETECTED!")
        print("   Overall and segment-level results contradict.")
        print("   This often indicates confounding or segment imbalance.")
        return True
    return False


class ExperimentValidator:
    """
    Comprehensive experiment validation before trusting results.
    """

    def __init__(self):
        self.validation_results = {}

    def validate(
        self,
        control_users: int,
        treatment_users: int,
        control_metrics: Dict[str, np.ndarray],
        treatment_metrics: Dict[str, np.ndarray],
        expected_split: float = 0.5
    ) -> Dict[str, bool]:
        """
        Run all validation checks.

        Returns dict of check_name -> passed
        """
        checks = {}

        # 1. SRM check
        has_srm, srm_pvalue = check_sample_ratio_mismatch(
            control_users, treatment_users, expected_split
        )
        checks['no_sample_ratio_mismatch'] = not has_srm

        # 2. Sufficient sample size (at least 1000 per variant for stable stats)
        checks['sufficient_sample'] = min(control_users, treatment_users) >= 1000

        # 3. Metric variance reasonability
        for metric_name, control_values in control_metrics.items():
            treatment_values = treatment_metrics[metric_name]

            # Check for suspiciously identical means (logging bug)
            checks[f'{metric_name}_not_identical'] = (
                np.mean(control_values) != np.mean(treatment_values)
            )

            # Check variance ratio (shouldn't be too different)
            var_ratio = np.var(control_values) / np.var(treatment_values)
            checks[f'{metric_name}_variance_reasonable'] = 0.5 < var_ratio < 2.0

        self.validation_results = checks

        all_passed = all(checks.values())
        if not all_passed:
            failed = [k for k, v in checks.items() if not v]
            print(f"⚠️ Validation failed: {failed}")

        return checks
```

Production A/B testing at scale involves sophisticated techniques beyond basic t-tests. Here are key advanced methods:
1. CUPED (Controlled-experiment Using Pre-Experiment Data)
Use pre-experiment user behavior to reduce variance and detect smaller effects with the same sample size.
$$Y_{adjusted} = Y - \theta (X - \bar{X})$$
Where $X$ is the user's pre-experiment value of the metric and $\theta = \mathrm{cov}(X, Y) / \mathrm{var}(X)$ is chosen to minimize the variance of the adjusted metric.
Benefit: Can reduce required sample size by 50%+ for metrics with strong pre-period correlation.
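A minimal CUPED sketch on synthetic data (the distributions and coefficients below are illustrative assumptions): $\theta$ is the regression slope of $Y$ on $X$, and the adjusted metric keeps the same mean while shrinking variance by the squared pre-period correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Pre-experiment covariate X (e.g., last month's activity) correlated
# with the in-experiment metric Y
x = rng.gamma(shape=2.0, scale=5.0, size=n)
y = 0.8 * x + rng.normal(0, 3.0, size=n)  # strong pre-period correlation

# CUPED: theta = cov(X, Y) / var(X) minimizes the adjusted variance
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
y_cuped = y - theta * (x - x.mean())

reduction = 1 - np.var(y_cuped) / np.var(y)
print(f"variance reduction: {reduction:.1%}")
```

Because the mean of `y_cuped` equals the mean of `y`, treatment-vs-control comparisons are unbiased; only their variance shrinks, which is what buys the smaller required sample size.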
2. Sequential Testing
Methods that allow valid early stopping while controlling error rates.
Benefit: Reach conclusions faster for clear winners/losers.
3. Multi-Armed Bandits
Allocate traffic dynamically, sending more to winning variants.
Benefit: Reduces regret (opportunity cost) during experimentation. Tradeoff: Less statistical clarity; harder to compute precise effect sizes.
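For intuition, here is a minimal Thompson-sampling sketch for two binary-reward variants (the rates and horizon are made-up): each request samples from the arms' Beta posteriors and serves the arm with the highest draw, so traffic shifts toward the winner as evidence accumulates.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two variants with true (unknown to the algorithm) conversion rates
true_rates = [0.05, 0.10]
successes = [0, 0]
failures = [0, 0]

for _ in range(20_000):
    # Thompson sampling: draw from each arm's Beta(successes+1, failures+1)
    # posterior and play the arm with the largest draw
    draws = [rng.beta(successes[i] + 1, failures[i] + 1) for i in range(2)]
    arm = int(np.argmax(draws))
    reward = rng.random() < true_rates[arm]
    if reward:
        successes[arm] += 1
    else:
        failures[arm] += 1

traffic = [successes[i] + failures[i] for i in range(2)]
print(f"traffic split: {traffic}")  # most pulls go to the better arm
```

Note the statistical trade-off from the text: because traffic allocation is adaptive, the losing arm ends up with few observations, making precise effect-size estimates harder than in a fixed 50/50 split.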
Netflix uses interleaving extensively for ranking tests. Microsoft's ExP platform supports thousands of concurrent experiments across its products. Booking.com runs ~1000 concurrent experiments. These companies have published extensively—their papers are excellent resources for advanced methodology.
Getting statistical significance is just the beginning. Translating experimental results into business decisions requires additional analysis and clear communication.
Effect Size vs Statistical Significance:
Statistical significance tells you an effect exists. Effect size tells you if it matters.
A 0.01% lift in CTR might be significant with enough traffic, but is it worth deploying a complex new model?
Practical Significance Threshold:
Define before the experiment: "What is the minimum effect that would justify the complexity/maintenance cost?"
Example: "If the new model improves conversion by less than 0.5%, we won't ship it regardless of significance."
Confidence Interval Interpretation:
CI captures uncertainty better than p-value alone.
| Statistical Significance | Practical Significance | Decision |
|---|---|---|
| Significant (p < 0.05) | Above threshold | Ship — Strong evidence of meaningful improvement |
| Significant (p < 0.05) | Below threshold | Don't ship — Statistically real but not worth it |
| Not significant | Point estimate above threshold | Extend experiment — Might be underpowered |
| Not significant | Point estimate near zero | Don't ship — No evidence of improvement |
| N/A - Guardrails failed | N/A | Stop immediately — Investigate harm |
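The decision table above can be encoded directly. The helper below is a simplified sketch (the function name and return strings are our own); a real decision should also weigh the full confidence interval, not just the point estimate.

```python
def ship_decision(p_value: float, relative_lift: float,
                  practical_threshold: float, guardrails_passed: bool,
                  alpha: float = 0.05) -> str:
    """Map experiment results to a ship/no-ship recommendation."""
    if not guardrails_passed:
        return "stop: investigate guardrail failures"
    significant = p_value < alpha
    meaningful = relative_lift >= practical_threshold
    if significant and meaningful:
        return "ship"
    if significant and not meaningful:
        return "don't ship: real but not worth it"
    if not significant and meaningful:
        return "extend: possibly underpowered"
    return "don't ship: no evidence of improvement"

# 2% lift, p = 0.01, 0.5% practical threshold, guardrails healthy
print(ship_decision(0.01, 0.02, 0.005, True))  # ship
```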
Communicating to Stakeholders:
Do:
Don't:
Standard sections: (1) Hypothesis, (2) Design (metrics, sample size, duration), (3) Validation checks (SRM, logging), (4) Results (tables, CIs), (5) Segment analysis, (6) Guardrail metrics, (7) Interpretation, (8) Recommendation. This structure ensures rigor and completeness.
We've now covered A/B testing methodology for recommendation systems end to end: statistical foundations, power analysis, experiment design and assignment, validation checks, advanced techniques, and decision-making.
Module Complete!
Congratulations! You've completed Module 1: Recommender System Fundamentals—from problem framing and evaluation through online experimentation.
These fundamentals prepare you for the advanced topics ahead: collaborative filtering, matrix factorization, content-based methods, and deep learning approaches.
You now possess a solid foundation in recommendation system fundamentals. You can articulate problems, design experiments, and make rigorous decisions about what to ship. The next modules will build on this foundation with specific algorithmic techniques for collaborative filtering and beyond.