You've built a new recommendation model. Offline metrics look promising—NDCG improved by 3%, precision by 2%. You're confident it's an improvement. But here's the uncomfortable truth: offline metrics can lie.
The only way to truly know if your recommendation system improves user experience is to show it to real users and measure their response. This is the domain of A/B testing (also called online controlled experiments)—the gold standard for evaluating changes in production systems.
Companies like Google, Netflix, Amazon, and Meta run thousands of A/B tests simultaneously. Every significant recommendation system change at scale goes through rigorous online experimentation. Understanding A/B testing isn't optional for recommendation practitioners—it's essential.
By the end of this page, you will understand the statistical foundations of A/B testing, design properly powered experiments, calculate and interpret p-values and confidence intervals, avoid common pitfalls that invalidate experiments, and handle the specific challenges of testing recommendation systems.
An A/B test (or A/B/n test with multiple variants) is a randomized controlled experiment where users are randomly assigned to different treatments, and outcomes are compared.
Core Components:
1. Randomization Unit
What entity gets randomized? Usually users, but could be sessions, devices, or geographic regions.
2. Control (A) vs Treatment (B)
The control is the current production system; the treatment is the proposed change being evaluated.
3. Metrics
What you measure: one primary success metric, plus secondary and guardrail metrics.
4. Sample Size & Duration
How many users and how long? Determined by statistical power analysis.
5. Statistical Analysis
P-values, confidence intervals, effect sizes to determine if differences are real.
The Hypothesis Testing Framework:
$$H_0: \mu_{treatment} = \mu_{control} \quad \text{(no difference)}$$ $$H_1: \mu_{treatment} \neq \mu_{control} \quad \text{(there is a difference)}$$
We want to reject $H_0$ with low probability of being wrong (Type I error), while having high probability of detecting real effects (power, 1 - Type II error).
Key Decisions:
| Parameter | Typical Value | Impact |
|---|---|---|
| Significance level (α) | 0.05 | Probability of false positive |
| Power (1-β) | 0.80 | Probability of detecting true effect |
| Minimum detectable effect (MDE) | Varies | Smallest effect worth detecting |
| Traffic split | 50/50 | Power vs risk balance |
A solid grasp of statistics is essential for valid A/B testing. Let's cover the core concepts systematically.
The Central Limit Theorem (CLT)
Regardless of the underlying distribution, the sample mean approaches a normal distribution as sample size increases:
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
This enables using normal-distribution-based statistical tests for most metrics.
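A small simulation (illustrative, not from the text) makes this concrete: even for a heavily skewed distribution, the distribution of sample means is approximately normal with standard deviation $\sigma/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(42)

# Heavily skewed underlying distribution: exponential with mean 1, std 1
n, n_experiments = 500, 2000
samples = rng.exponential(scale=1.0, size=(n_experiments, n))

# Each row is one "experiment"; take its sample mean
sample_means = samples.mean(axis=1)

# CLT prediction: mean of sample means ≈ 1.0, std ≈ sigma / sqrt(n)
predicted_se = 1.0 / np.sqrt(n)
observed_se = sample_means.std()

print(f"predicted SE: {predicted_se:.4f}, observed SE: {observed_se:.4f}")
```

Despite the skewed underlying data, a histogram of `sample_means` would look close to a bell curve, and the observed spread matches the $\sigma/\sqrt{n}$ prediction.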
Standard Error
The variability of the sample mean:
$$SE = \frac{\sigma}{\sqrt{n}}$$
Larger samples → smaller standard error → more precise estimates.
Confidence Intervals
For a 95% CI:
$$CI = \bar{X} \pm 1.96 \cdot SE$$
"We're 95% confident the true population mean lies within this interval."
P-Value Interpretation
The probability of observing results at least as extreme as what we measured, if the null hypothesis were true.
Common Misinterpretations (avoid these!):
- The p-value is not the probability that the null hypothesis is true.
- The p-value is not the probability that the result occurred "by chance."
- $1 - p$ is not the probability that the treatment is better.
- $p > 0.05$ does not prove there is no effect; it may simply mean the test was underpowered.
Two-Sample T-Test
For comparing means between treatment and control:
$$t = \frac{\bar{X}_{treatment} - \bar{X}_{control}}{\sqrt{\frac{s^2_{treatment}}{n_{treatment}} + \frac{s^2_{control}}{n_{control}}}}$$
For proportions (CTR, conversion rate):
$$z = \frac{p_{treatment} - p_{control}}{\sqrt{p(1-p)(\frac{1}{n_{treatment}} + \frac{1}{n_{control}})}}$$
Where $p$ is the pooled proportion.
```python
import numpy as np
from scipy import stats
from typing import Tuple, Dict
from dataclasses import dataclass


@dataclass
class ABTestResult:
    """Container for A/B test statistical results."""
    control_mean: float
    treatment_mean: float
    relative_lift: float
    absolute_difference: float
    p_value: float
    confidence_interval: Tuple[float, float]
    is_significant: bool
    required_sample_size: int = None
    actual_sample_size: int = None


def two_sample_proportions_test(
    control_successes: int,
    control_total: int,
    treatment_successes: int,
    treatment_total: int,
    alpha: float = 0.05
) -> ABTestResult:
    """
    Two-sample proportion test for binary outcomes (CTR, conversion).

    Uses Z-test for proportions with pooled variance.

    Args:
        control_successes: Number of successes (clicks, conversions) in control
        control_total: Total observations in control
        treatment_successes: Number of successes in treatment
        treatment_total: Total observations in treatment
        alpha: Significance level

    Returns:
        ABTestResult with test statistics
    """
    # Proportions
    p_control = control_successes / control_total
    p_treatment = treatment_successes / treatment_total

    # Pooled proportion under null hypothesis
    p_pooled = (control_successes + treatment_successes) / (control_total + treatment_total)

    # Standard error under pooled variance
    se = np.sqrt(
        p_pooled * (1 - p_pooled) * (1/control_total + 1/treatment_total)
    )

    # Z-statistic
    z = (p_treatment - p_control) / se if se > 0 else 0

    # Two-tailed p-value
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))

    # Confidence interval for the difference
    se_diff = np.sqrt(
        p_control * (1 - p_control) / control_total
        + p_treatment * (1 - p_treatment) / treatment_total
    )
    z_critical = stats.norm.ppf(1 - alpha/2)
    ci_low = (p_treatment - p_control) - z_critical * se_diff
    ci_high = (p_treatment - p_control) + z_critical * se_diff

    return ABTestResult(
        control_mean=p_control,
        treatment_mean=p_treatment,
        relative_lift=(p_treatment - p_control) / p_control if p_control > 0 else 0,
        absolute_difference=p_treatment - p_control,
        p_value=p_value,
        confidence_interval=(ci_low, ci_high),
        is_significant=p_value < alpha,
        actual_sample_size=control_total + treatment_total
    )


def two_sample_means_test(
    control_values: np.ndarray,
    treatment_values: np.ndarray,
    alpha: float = 0.05
) -> ABTestResult:
    """
    Two-sample t-test for continuous outcomes (revenue, watch time).

    Uses Welch's t-test (unequal variances assumption).

    Args:
        control_values: Array of continuous values from control
        treatment_values: Array of continuous values from treatment
        alpha: Significance level

    Returns:
        ABTestResult with test statistics
    """
    # Means and standard errors
    control_mean = np.mean(control_values)
    treatment_mean = np.mean(treatment_values)

    # Welch's t-test
    t_stat, p_value = stats.ttest_ind(
        treatment_values,
        control_values,
        equal_var=False  # Welch's t-test
    )

    # Confidence interval for difference
    se_control = stats.sem(control_values)
    se_treatment = stats.sem(treatment_values)
    se_diff = np.sqrt(se_control**2 + se_treatment**2)

    # Using conservative df for t-critical value
    df = min(len(control_values), len(treatment_values)) - 1
    t_critical = stats.t.ppf(1 - alpha/2, df)

    diff = treatment_mean - control_mean
    ci_low = diff - t_critical * se_diff
    ci_high = diff + t_critical * se_diff

    return ABTestResult(
        control_mean=control_mean,
        treatment_mean=treatment_mean,
        relative_lift=(treatment_mean - control_mean) / control_mean if control_mean > 0 else 0,
        absolute_difference=treatment_mean - control_mean,
        p_value=p_value,
        confidence_interval=(ci_low, ci_high),
        is_significant=p_value < alpha,
        actual_sample_size=len(control_values) + len(treatment_values)
    )


def calculate_sample_size(
    baseline_rate: float,
    mde: float,  # Minimum detectable effect (relative)
    alpha: float = 0.05,
    power: float = 0.80
) -> int:
    """
    Calculate required sample size per variant for proportion test.

    Args:
        baseline_rate: Current conversion/CTR rate
        mde: Minimum detectable effect as relative change (e.g., 0.05 for 5% lift)
        alpha: Significance level (Type I error rate)
        power: Statistical power (1 - Type II error rate)

    Returns:
        Required sample size per variant
    """
    # Treatment rate under alternative
    treatment_rate = baseline_rate * (1 + mde)

    # Effect size (Cohen's h for proportions)
    h = 2 * (np.arcsin(np.sqrt(treatment_rate)) - np.arcsin(np.sqrt(baseline_rate)))

    # Z-scores
    z_alpha = stats.norm.ppf(1 - alpha/2)
    z_beta = stats.norm.ppf(power)

    # Sample size formula
    n = 2 * ((z_alpha + z_beta) / h) ** 2

    return int(np.ceil(n))


class ABTestAnalyzer:
    """
    Production-grade A/B test analyzer with multiple metric support
    and correction for multiple comparisons.
    """

    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha

    def analyze_multiple_metrics(
        self,
        control_data: Dict[str, np.ndarray],    # metric_name -> values
        treatment_data: Dict[str, np.ndarray],
        metric_types: Dict[str, str]            # metric_name -> 'proportion' or 'continuous'
    ) -> Dict[str, ABTestResult]:
        """
        Analyze multiple metrics with Bonferroni correction.

        When testing multiple metrics, we need to adjust for multiple
        comparisons to control false positive rate.
        """
        n_metrics = len(control_data)
        adjusted_alpha = self.alpha / n_metrics  # Bonferroni correction

        results = {}
        for metric_name, control_values in control_data.items():
            treatment_values = treatment_data[metric_name]
            metric_type = metric_types.get(metric_name, 'continuous')

            if metric_type == 'proportion':
                result = two_sample_proportions_test(
                    control_successes=int(control_values.sum()),
                    control_total=len(control_values),
                    treatment_successes=int(treatment_values.sum()),
                    treatment_total=len(treatment_values),
                    alpha=adjusted_alpha
                )
            else:
                result = two_sample_means_test(
                    control_values=control_values,
                    treatment_values=treatment_values,
                    alpha=adjusted_alpha
                )

            results[metric_name] = result

        return results

    def compute_power(
        self,
        control_values: np.ndarray,
        treatment_values: np.ndarray,
        mde: float
    ) -> float:
        """
        Compute observed power (post-hoc power analysis).

        Warning: Post-hoc power is controversial in statistics.
        Best used for planning future experiments.
        """
        from statsmodels.stats.power import TTestIndPower

        # Standardized effect size (Cohen's d): absolute effect / std
        effect_size = (mde * np.mean(control_values)) / np.std(control_values)
        n = min(len(control_values), len(treatment_values))

        power_analysis = TTestIndPower()
        power = power_analysis.power(
            effect_size=effect_size,
            nobs1=n,
            ratio=1.0,
            alpha=self.alpha,
            alternative='two-sided'
        )
        return power
```

Running an underpowered experiment is one of the most common mistakes in A/B testing. You waste time and resources, and may incorrectly conclude there's no effect when there actually is one.
What is Statistical Power?
Power is the probability of detecting a true effect when it exists:
$$\text{Power} = P(\text{reject } H_0 | H_1 \text{ is true}) = 1 - \beta$$
Required Sample Size per Variant (α = 0.05, power = 0.80, two-sided test):
| Baseline CTR | 5% Relative Lift | 10% Relative Lift | 20% Relative Lift |
|---|---|---|---|
| 1% | 3,143,168 | 786,368 | 197,264 |
| 2% | 1,541,568 | 386,176 | 97,088 |
| 5% | 593,408 | 148,672 | 37,376 |
| 10% | 280,064 | 70,208 | 17,664 |
| 20% | 124,416 | 31,232 | 7,872 |
Key Insight: The MDE-Sample Size Trade-off
Sample size scales with $1/MDE^2$. Halving the MDE you want to detect quadruples the required sample size.
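This scaling is easy to verify numerically. The sketch below uses the same arcsine (Cohen's h) formula as `calculate_sample_size` above; absolute counts depend on which variance approximation is used, so treat them as illustrative, but the $1/MDE^2$ scaling is robust.

```python
import numpy as np
from scipy import stats

def n_per_variant(baseline_rate: float, mde: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Sample size per variant via Cohen's h (as in calculate_sample_size)."""
    treatment_rate = baseline_rate * (1 + mde)
    h = 2 * (np.arcsin(np.sqrt(treatment_rate)) - np.arcsin(np.sqrt(baseline_rate)))
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return int(np.ceil(2 * ((z_alpha + z_beta) / h) ** 2))

# Halving the MDE (10% relative lift -> 5%) roughly quadruples n
n_mde_10 = n_per_variant(0.03, 0.10)
n_mde_05 = n_per_variant(0.03, 0.05)
ratio = n_mde_05 / n_mde_10
print(f"n(10% lift) = {n_mde_10:,}, n(5% lift) = {n_mde_05:,}, ratio ≈ {ratio:.2f}")
```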
Practical Implications:
- Detecting small lifts on low-baseline metrics (e.g., 1% CTR) requires millions of users; low-traffic products may only be able to detect large effects.
- If you can't reach the required sample size, raise the MDE, choose a more sensitive metric, or apply variance reduction (e.g., CUPED, covered below).
- Routing more traffic into the experiment shortens its duration but increases exposure to a potentially harmful treatment.
Duration Estimation:
$$\text{Days Required} = \frac{2 \times n_{\text{per variant}}}{\text{Daily Traffic} \times \text{Experiment \%}}$$
Example: Need 100K users per variant, 50K daily traffic, 10% in experiment: $$\text{Days} = \frac{2 \times 100,000}{50,000 \times 0.10} = 40 \text{ days}$$
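The same arithmetic as a trivial helper (function and argument names are our own):

```python
def experiment_days(n_per_variant: int, daily_traffic: int,
                    experiment_fraction: float) -> float:
    """Days needed to fill both variants from the eligible traffic slice."""
    return (2 * n_per_variant) / (daily_traffic * experiment_fraction)

# 100K users per variant, 50K daily traffic, 10% of traffic in the experiment
print(experiment_days(100_000, 50_000, 0.10))  # 40.0
```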
Peeking at results and stopping when p < 0.05 massively inflates false positive rates. If you check twice, your Type I error is ~8% instead of 5%. If you check daily, it can exceed 30%. Always run to the predetermined sample size, or use sequential testing methods designed for early stopping.
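The inflation from peeking is easy to demonstrate with a simulated A/A test (no true difference between arms), checking significance after every batch. The batch sizes and number of looks below are illustrative assumptions; exact rates depend on how often you look.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_looks, batch = 1000, 10, 200

reject_any_look = 0   # "peeking": stop the first time p < 0.05
reject_final = 0      # fixed horizon: only test at the end
for _ in range(n_experiments):
    # A/A test: both arms drawn from the same distribution
    a = rng.normal(size=(n_looks, batch))
    b = rng.normal(size=(n_looks, batch))
    significant_at_look = []
    for k in range(1, n_looks + 1):
        _, p = stats.ttest_ind(a[:k].ravel(), b[:k].ravel())
        significant_at_look.append(p < 0.05)
    reject_any_look += any(significant_at_look)
    reject_final += significant_at_look[-1]

fpr_peeking = reject_any_look / n_experiments
fpr_fixed = reject_final / n_experiments
print(f"peeking FPR: {fpr_peeking:.3f}, fixed-horizon FPR: {fpr_fixed:.3f}")
```

With ten looks, the peeking false positive rate typically lands well above the nominal 5%, while the fixed-horizon test stays near it.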
Recommendation systems have unique characteristics that affect experimental design. Understanding these nuances is critical for valid experiments.
Randomization Unit Choices:
| Unit | Pros | Cons |
|---|---|---|
| User | Consistent experience, captures long-term effects | Needs user identification |
| Session | Works for anonymous users | Inconsistent experience across sessions |
| Request | Maximum sample size | Within-session inconsistency |
| Device | Compromise for logged-out | Multiple people may share device |
| Geographic | Tests regional effects | Limited samples, high variance |
Metric Selection for RecSys:
Primary Metrics (OEC candidates):
- Click-through rate (CTR), conversion rate, revenue per user
- Engagement: watch time, session length, items consumed per session
Guardrail Metrics:
- Revenue per user (must not degrade significantly)
- Latency (e.g., p50/p99 response time)
- Error rate
```python
import hashlib
from datetime import datetime, timedelta
from typing import Dict, List, Optional
from dataclasses import dataclass
import numpy as np


@dataclass
class ExperimentConfig:
    """Configuration for a recommendation A/B test."""
    experiment_id: str
    control_name: str
    treatment_name: str
    traffic_percentage: float      # Total experiment traffic (0.0 to 1.0)
    treatment_split: float = 0.5   # Within experiment, how much goes to treatment
    start_date: datetime = None
    end_date: datetime = None
    target_sample_per_variant: int = None

    # Stratification
    user_segments: List[str] = None  # e.g., ["new", "returning", "power_user"]

    # Metric configuration
    primary_metric: str = "conversion_rate"
    secondary_metrics: List[str] = None
    guardrail_metrics: List[str] = None


class ExperimentAssigner:
    """
    Deterministic assignment of users to experiment variants.

    Uses consistent hashing to ensure:
    1. Same user always gets same variant
    2. Assignment is pseudo-random
    3. Reproducible results
    """

    def __init__(self, salt: str = "recsys_experiment"):
        self.salt = salt

    def get_assignment(
        self,
        user_id: str,
        config: ExperimentConfig
    ) -> Optional[str]:
        """
        Assign user to experiment variant.

        Returns:
            'control', 'treatment', or None (not in experiment)
        """
        # Hash user_id + experiment_id + salt for deterministic assignment
        hash_input = f"{user_id}:{config.experiment_id}:{self.salt}"
        hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)

        # Convert to [0, 1) uniformly
        bucket = (hash_value % 10000) / 10000

        # Is user in experiment at all?
        if bucket >= config.traffic_percentage:
            return None  # Not in experiment

        # Within experiment, which variant?
        # Re-hash for independent randomization
        variant_input = f"{user_id}:{config.experiment_id}:variant:{self.salt}"
        variant_hash = int(hashlib.md5(variant_input.encode()).hexdigest(), 16)
        variant_bucket = (variant_hash % 10000) / 10000

        if variant_bucket < config.treatment_split:
            return config.treatment_name
        else:
            return config.control_name

    def simulate_split(
        self,
        n_users: int,
        config: ExperimentConfig
    ) -> Dict[str, int]:
        """
        Simulate experiment split for validation.

        Useful for verifying assignment is approximately correct.
        """
        counts = {
            config.control_name: 0,
            config.treatment_name: 0,
            "not_in_experiment": 0
        }
        for i in range(n_users):
            user_id = f"test_user_{i}"
            assignment = self.get_assignment(user_id, config)
            if assignment is None:
                counts["not_in_experiment"] += 1
            else:
                counts[assignment] += 1
        return counts


class ExperimentGuardrails:
    """
    Monitor guardrail metrics to catch harmful experiments early.

    Even if primary metric improves, we may need to stop if guardrails fail.
    """

    def __init__(
        self,
        degradation_threshold: float = 0.05,  # 5% degradation triggers alert
        latency_increase_threshold_ms: float = 50
    ):
        self.degradation_threshold = degradation_threshold
        self.latency_increase_threshold = latency_increase_threshold_ms

    def check_guardrails(
        self,
        control_metrics: Dict[str, float],
        treatment_metrics: Dict[str, float]
    ) -> Dict[str, Dict]:
        """
        Check all guardrail metrics.

        Returns dict of metric_name -> {passed: bool, message: str}
        """
        results = {}

        # Revenue guardrail (shouldn't decrease significantly)
        if 'revenue_per_user' in control_metrics:
            control_rev = control_metrics['revenue_per_user']
            treatment_rev = treatment_metrics['revenue_per_user']
            relative_change = (treatment_rev - control_rev) / control_rev if control_rev > 0 else 0
            results['revenue'] = {
                'passed': relative_change > -self.degradation_threshold,
                'message': f"Revenue change: {relative_change:.2%}",
                'control': control_rev,
                'treatment': treatment_rev
            }

        # Latency guardrail (shouldn't increase significantly)
        if 'p50_latency_ms' in control_metrics:
            control_lat = control_metrics['p50_latency_ms']
            treatment_lat = treatment_metrics['p50_latency_ms']
            increase = treatment_lat - control_lat
            results['latency'] = {
                'passed': increase < self.latency_increase_threshold,
                'message': f"Latency change: {increase:.1f}ms",
                'control': control_lat,
                'treatment': treatment_lat
            }

        # Error rate guardrail
        if 'error_rate' in control_metrics:
            control_err = control_metrics['error_rate']
            treatment_err = treatment_metrics['error_rate']
            # Error rate shouldn't increase by more than 0.5 percentage points
            increase = treatment_err - control_err
            results['error_rate'] = {
                'passed': increase < 0.005,
                'message': f"Error rate change: {increase:.3%}",
                'control': control_err,
                'treatment': treatment_err
            }

        return results

    def should_stop_experiment(
        self,
        guardrail_results: Dict[str, Dict]
    ) -> bool:
        """
        Determine if experiment should be stopped due to guardrail failures.
        """
        failed_guardrails = [
            name for name, result in guardrail_results.items()
            if not result['passed']
        ]
        if failed_guardrails:
            print(f"⚠️ Guardrail failures: {failed_guardrails}")
            return True
        return False
```

A/B testing seems simple but is filled with subtle pitfalls that can invalidate results. Knowledge of these pitfalls is what separates rigorous experimentation from cargo cult science.
Pitfall 1: Peeking (Multiple Testing Over Time)
Checking results daily and stopping when you see significance inflates false positive rates dramatically.
Why: Under the null hypothesis, p-values fluctuate randomly. If you check 10 times, you have 10 chances to get a false positive.
Solution: Pre-commit to sample size and duration. Or use sequential testing methods (SPRT, Bayesian methods) designed for early stopping.
Pitfall 2: Sample Ratio Mismatch (SRM)
The observed split differs from expected split (e.g., 52% control vs 48% treatment when expecting 50/50).
Why: Indicates a bug in randomization, assignment, or logging. Results are unreliable.
Solution: Always check for SRM before analyzing results. Chi-squared test on observed vs expected counts.
```python
import numpy as np
from scipy import stats
from typing import Dict, Tuple


def check_sample_ratio_mismatch(
    control_count: int,
    treatment_count: int,
    expected_ratio: float = 0.5,
    alpha: float = 0.01  # Use stricter threshold for SRM
) -> Tuple[bool, float]:
    """
    Check for Sample Ratio Mismatch (SRM).

    SRM indicates bugs in randomization/logging. If SRM is detected,
    experiment results are unreliable.

    Uses chi-squared goodness of fit test.

    Returns:
        (has_srm: bool, p_value: float)
    """
    total = control_count + treatment_count
    expected_control = total * expected_ratio
    expected_treatment = total * (1 - expected_ratio)

    observed = [control_count, treatment_count]
    expected = [expected_control, expected_treatment]

    chi2, p_value = stats.chisquare(observed, expected)
    has_srm = p_value < alpha

    if has_srm:
        print("⚠️ SAMPLE RATIO MISMATCH DETECTED!")
        print(f"   Expected: {expected_ratio:.2%} / {1-expected_ratio:.2%}")
        print(f"   Observed: {control_count/total:.2%} / {treatment_count/total:.2%}")
        print(f"   p-value: {p_value:.6f}")
        print("   Results are likely unreliable. Investigate assignment/logging bugs.")

    return has_srm, p_value


def bonferroni_correction(
    p_values: Dict[str, float],
    alpha: float = 0.05
) -> Dict[str, Dict]:
    """
    Apply Bonferroni correction for multiple comparisons.

    When testing multiple metrics, we need stricter thresholds
    to maintain overall false positive rate.
    """
    n_tests = len(p_values)
    adjusted_alpha = alpha / n_tests

    results = {}
    for metric, p_value in p_values.items():
        results[metric] = {
            'original_p': p_value,
            'adjusted_alpha': adjusted_alpha,
            'significant_after_correction': p_value < adjusted_alpha
        }
    return results


def check_simpsons_paradox(
    segment_results: Dict[str, Dict],  # segment -> {control_rate, treatment_rate}
    overall_result: Dict               # {control_rate, treatment_rate}
) -> bool:
    """
    Check if overall result contradicts segment-level results.

    Simpson's Paradox: Treatment beats control in every segment,
    but control beats treatment overall (or vice versa).
    This happens when segment sizes differ between control/treatment.
    """
    overall_treatment_wins = overall_result['treatment_rate'] > overall_result['control_rate']

    segment_directions = []
    for segment, result in segment_results.items():
        treatment_wins = result['treatment_rate'] > result['control_rate']
        segment_directions.append(treatment_wins)

    # Paradox if all segments agree but disagree with overall
    all_segments_agree = len(set(segment_directions)) == 1
    segments_disagree_with_overall = (
        all_segments_agree and segment_directions[0] != overall_treatment_wins
    )

    if segments_disagree_with_overall:
        print("⚠️ SIMPSON'S PARADOX DETECTED!")
        print("   Overall and segment-level results contradict.")
        print("   This often indicates confounding or segment imbalance.")
        return True
    return False


class ExperimentValidator:
    """
    Comprehensive experiment validation before trusting results.
    """

    def __init__(self):
        self.validation_results = {}

    def validate(
        self,
        control_users: int,
        treatment_users: int,
        control_metrics: Dict[str, np.ndarray],
        treatment_metrics: Dict[str, np.ndarray],
        expected_split: float = 0.5
    ) -> Dict[str, bool]:
        """
        Run all validation checks.

        Returns dict of check_name -> passed
        """
        checks = {}

        # 1. SRM check
        has_srm, srm_pvalue = check_sample_ratio_mismatch(
            control_users, treatment_users, expected_split
        )
        checks['no_sample_ratio_mismatch'] = not has_srm

        # 2. Sufficient sample size (at least 1000 per variant for stable stats)
        checks['sufficient_sample'] = min(control_users, treatment_users) >= 1000

        # 3. Metric variance reasonability
        for metric_name, control_values in control_metrics.items():
            treatment_values = treatment_metrics[metric_name]

            # Check for suspiciously identical means (logging bug)
            checks[f'{metric_name}_not_identical'] = (
                np.mean(control_values) != np.mean(treatment_values)
            )

            # Check variance ratio (shouldn't be too different)
            var_ratio = np.var(control_values) / np.var(treatment_values)
            checks[f'{metric_name}_variance_reasonable'] = 0.5 < var_ratio < 2.0

        self.validation_results = checks

        all_passed = all(checks.values())
        if not all_passed:
            failed = [k for k, v in checks.items() if not v]
            print(f"⚠️ Validation failed: {failed}")

        return checks
```

Production A/B testing at scale involves sophisticated techniques beyond basic t-tests. Here are key advanced methods:
1. CUPED (Controlled-experiment Using Pre-Experiment Data)
Use pre-experiment user behavior to reduce variance and detect smaller effects with the same sample size.
$$Y_{adjusted} = Y - \theta (X - \bar{X})$$
Where $X$ is the user's pre-experiment value of the metric and $\theta = \mathrm{cov}(X, Y) / \mathrm{var}(X)$ is chosen to minimize the variance of the adjusted metric.
Benefit: Can reduce required sample size by 50%+ for metrics with strong pre-period correlation.
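A minimal CUPED sketch on synthetic data (the distributions and coefficients below are illustrative assumptions): $\theta$ is the regression slope of $Y$ on $X$, and the adjusted metric keeps the same mean while shrinking variance by the squared pre-period correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Pre-experiment covariate X (e.g., last month's activity) correlated
# with the in-experiment metric Y
x = rng.gamma(shape=2.0, scale=5.0, size=n)
y = 0.8 * x + rng.normal(0, 3.0, size=n)  # strong pre-period correlation

# CUPED: theta = cov(X, Y) / var(X) minimizes the adjusted variance
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
y_cuped = y - theta * (x - x.mean())

reduction = 1 - np.var(y_cuped) / np.var(y)
print(f"variance reduction: {reduction:.1%}")
```

Because the mean of `y_cuped` equals the mean of `y`, treatment-vs-control comparisons are unbiased; only their variance shrinks, which is what buys the smaller required sample size.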
2. Sequential Testing
Methods that allow valid early stopping while controlling error rates.
Benefit: Reach conclusions faster for clear winners/losers.
3. Multi-Armed Bandits
Allocate traffic dynamically, sending more to winning variants.
Benefit: Reduces regret (opportunity cost) during experimentation. Tradeoff: Less statistical clarity; harder to compute precise effect sizes.
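For intuition, here is a minimal Thompson-sampling sketch for two binary-reward variants (the rates and horizon are made-up): each request samples from the arms' Beta posteriors and serves the arm with the highest draw, so traffic shifts toward the winner as evidence accumulates.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two variants with true (unknown to the algorithm) conversion rates
true_rates = [0.05, 0.10]
successes = [0, 0]
failures = [0, 0]

for _ in range(20_000):
    # Thompson sampling: draw from each arm's Beta(successes+1, failures+1)
    # posterior and play the arm with the largest draw
    draws = [rng.beta(successes[i] + 1, failures[i] + 1) for i in range(2)]
    arm = int(np.argmax(draws))
    reward = rng.random() < true_rates[arm]
    if reward:
        successes[arm] += 1
    else:
        failures[arm] += 1

traffic = [successes[i] + failures[i] for i in range(2)]
print(f"traffic split: {traffic}")  # most pulls go to the better arm
```

Note the statistical trade-off from the text: because traffic allocation is adaptive, the losing arm ends up with few observations, making precise effect-size estimates harder than in a fixed 50/50 split.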
Netflix uses interleaving extensively for ranking tests. Microsoft's ExP platform supports thousands of concurrent experiments across its products. Booking.com runs ~1000 concurrent experiments. These companies have published extensively—their papers are excellent resources for advanced methodology.
Getting statistical significance is just the beginning. Translating experimental results into business decisions requires additional analysis and clear communication.
Effect Size vs Statistical Significance:
Statistical significance tells you an effect exists. Effect size tells you if it matters.
A 0.01% lift in CTR might be significant with enough traffic, but is it worth deploying a complex new model?
Practical Significance Threshold:
Define before the experiment: "What is the minimum effect that would justify the complexity/maintenance cost?"
Example: "If the new model improves conversion by less than 0.5%, we won't ship it regardless of significance."
Confidence Interval Interpretation:
CI captures uncertainty better than p-value alone.
| Statistical Significance | Practical Significance | Decision |
|---|---|---|
| Significant (p < 0.05) | Above threshold | Ship — Strong evidence of meaningful improvement |
| Significant (p < 0.05) | Below threshold | Don't ship — Statistically real but not worth it |
| Not significant | Point estimate above threshold | Extend experiment — Might be underpowered |
| Not significant | Point estimate near zero | Don't ship — No evidence of improvement |
| N/A - Guardrails failed | N/A | Stop immediately — Investigate harm |
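The decision table above can be encoded directly. The helper below is a simplified sketch (the function name and return strings are our own); a real decision should also weigh the full confidence interval, not just the point estimate.

```python
def ship_decision(p_value: float, relative_lift: float,
                  practical_threshold: float, guardrails_passed: bool,
                  alpha: float = 0.05) -> str:
    """Map experiment results to a ship/no-ship recommendation."""
    if not guardrails_passed:
        return "stop: investigate guardrail failures"
    significant = p_value < alpha
    meaningful = relative_lift >= practical_threshold
    if significant and meaningful:
        return "ship"
    if significant and not meaningful:
        return "don't ship: real but not worth it"
    if not significant and meaningful:
        return "extend: possibly underpowered"
    return "don't ship: no evidence of improvement"

# 2% lift, p = 0.01, 0.5% practical threshold, guardrails healthy
print(ship_decision(0.01, 0.02, 0.005, True))  # ship
```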
Communicating to Stakeholders:
Do:
Don't:
Standard sections: (1) Hypothesis, (2) Design (metrics, sample size, duration), (3) Validation checks (SRM, logging), (4) Results (tables, CIs), (5) Segment analysis, (6) Guardrail metrics, (7) Interpretation, (8) Recommendation. This structure ensures rigor and completeness.
We've now covered A/B testing methodology for recommendation systems end to end: statistical foundations, power analysis, experiment design and assignment, validation checks, advanced techniques, and decision-making.
Module Complete!
Congratulations! You've completed Module 1: Recommender System Fundamentals—from problem framing and evaluation through online experimentation.
These fundamentals prepare you for the advanced topics ahead: collaborative filtering, matrix factorization, content-based methods, and deep learning approaches.
You now possess a solid foundation in recommendation system fundamentals. You can articulate problems, design experiments, and make rigorous decisions about what to ship. The next modules will build on this foundation with specific algorithmic techniques for collaborative filtering and beyond.