You've implemented a sophisticated boosting strategy. You've built a personalization system. You've added semantic search with embeddings. But here's the uncomfortable question: Is it actually better?
Intuition is unreliable for search quality. Changes that feel like improvements often aren't when measured rigorously. The "improved" ranking that satisfies ten people you asked might frustrate ten thousand users you didn't. Worse, changes can have delayed effects—users might click more today but search less overall next month.
A/B testing is the gold standard for measuring search quality changes. By showing different ranking algorithms to randomly selected user groups and measuring their behavior, we can make statistically valid claims about which approach is superior.
But search A/B testing is uniquely challenging: relevance is subjective, metrics conflict with each other, and user behavior is noisy. This page provides a comprehensive guide to designing, executing, and analyzing search A/B tests correctly.
By the end of this page, you will understand how to design search experiments, select appropriate metrics (both offline and online), calculate required sample sizes, avoid common statistical pitfalls, and interpret results correctly to make ship/no-ship decisions.
Search experimentation follows a structured framework that balances speed (shipping improvements quickly) with rigor (ensuring improvements are real).
The three-stage evaluation pipeline:
| Stage | Type | Speed | Cost | Confidence |
|---|---|---|---|---|
| 1. Offline evaluation | Historical query logs + labels | Hours | Low | Medium |
| 2. Online A/B test | Live traffic experiment | Days-weeks | Medium | High |
| 3. Long-term holdout | Persistent control group | Months | High | Very High |
Stage 1: Offline evaluation
Before exposing users to a change, evaluate it offline using historical queries and human relevance judgments. Offline metrics (NDCG, MRR, MAP) provide cheap, fast feedback on whether a change is directionally positive.
Limitations: Offline metrics don't perfectly predict online outcomes. A change might improve NDCG on judged queries but worsen the user experience due to factors not captured in judgments.
Stage 2: Online A/B test
If offline metrics look positive (or neutral without regression), run a controlled online experiment. Randomly assign users to control (current ranking) or treatment (new ranking), and compare behavioral metrics.
Stage 3: Long-term holdout
Maintain a small percentage of users (1-5%) who never receive experimental changes. This provides a stable baseline for detecting gradual degradation that A/B tests miss.
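In practice, both variant assignment and holdout membership are usually derived from a deterministic salted hash of the user ID, so each user sees a stable experience and the holdout stays fixed across experiments. Here is a minimal sketch; the function names, salts, and split percentages are illustrative, not from this page:

```python
import hashlib

HOLDOUT_PERCENT = 2     # persistent holdout, excluded from all experiments
TREATMENT_PERCENT = 50  # share of remaining traffic given the new ranking

def bucket(user_id: str, salt: str) -> int:
    """Deterministic bucket in [0, 100) from a salted hash of the user ID."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def assign_variant(user_id: str, experiment_name: str) -> str:
    # The holdout uses its own salt, so membership is stable across experiments.
    if bucket(user_id, "global-holdout") < HOLDOUT_PERCENT:
        return "holdout"
    # Salting with the experiment name decorrelates assignment between experiments.
    if bucket(user_id, experiment_name) < TREATMENT_PERCENT:
        return "treatment"
    return "control"
```

Because the hash is deterministic, `assign_variant` needs no stored assignment table, and re-running it for the same user and experiment always yields the same group.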
Each stage catches different problems. Offline catches obvious bugs and regressions before users see them. A/B tests measure actual user behavior. Long-term holdouts catch cumulative effects of many small changes that individually pass A/B tests but collectively degrade experience.
Offline metrics evaluate ranking quality against human relevance judgments. These judgments typically use graded scales (e.g., 0-4) where 0 is not relevant and 4 is perfectly relevant.
Key offline metrics:
```python
import math
from typing import List, Dict
from dataclasses import dataclass


@dataclass
class RankedResult:
    """A single result with position and relevance grade."""
    doc_id: str
    position: int   # 1-indexed
    relevance: int  # 0-4 graded relevance (0=not relevant, 4=perfect)


class OfflineMetrics:
    """
    Standard offline evaluation metrics for search ranking.

    These metrics compare a ranking against human relevance judgments.
    Higher is better for all metrics.
    """

    @staticmethod
    def precision_at_k(results: List[RankedResult], k: int,
                       relevance_threshold: int = 1) -> float:
        """
        Precision@K: Fraction of top-K results that are relevant.

        Simple but doesn't consider rank position or graded relevance.
        Good for quick sanity checks.

        Example: P@10 = 0.7 means 7 of top 10 results are relevant.
        """
        top_k = [r for r in results if r.position <= k]
        relevant = [r for r in top_k if r.relevance >= relevance_threshold]
        return len(relevant) / k if k > 0 else 0.0

    @staticmethod
    def recall_at_k(results: List[RankedResult], k: int, total_relevant: int,
                    relevance_threshold: int = 1) -> float:
        """
        Recall@K: Fraction of all relevant docs found in top K.

        Requires knowing total relevant docs for the query.
        Important for queries where users need to see ALL relevant results.
        """
        top_k = [r for r in results if r.position <= k]
        found_relevant = len([r for r in top_k if r.relevance >= relevance_threshold])
        return found_relevant / total_relevant if total_relevant > 0 else 0.0

    @staticmethod
    def mean_reciprocal_rank(results: List[RankedResult],
                             relevance_threshold: int = 1) -> float:
        """
        MRR: 1 / position of first relevant result.

        Focuses only on finding ONE relevant result quickly.
        Ignores everything after the first relevant result.
        Good for navigational queries where user wants one answer.

        MRR = 1.0 means first result is always relevant.
        MRR = 0.5 means first relevant result is typically at position 2.
        """
        for result in sorted(results, key=lambda r: r.position):
            if result.relevance >= relevance_threshold:
                return 1.0 / result.position
        return 0.0

    @staticmethod
    def dcg_at_k(results: List[RankedResult], k: int) -> float:
        """
        Discounted Cumulative Gain at K.

        DCG = Σ (2^rel - 1) / log2(pos + 1)

        Key insight: Highly relevant results should rank higher.
        Position discount: a result at position 10 is less valuable than at position 1.
        Relevance gain: higher relevance grades contribute more.
        """
        top_k = sorted([r for r in results if r.position <= k],
                       key=lambda r: r.position)
        dcg = 0.0
        for result in top_k:
            # Gain from relevance (exponential for graded relevance)
            gain = (2 ** result.relevance) - 1
            # Discount by log of position
            discount = math.log2(result.position + 1)
            dcg += gain / discount
        return dcg

    @staticmethod
    def ndcg_at_k(results: List[RankedResult], k: int,
                  ideal_results: List[RankedResult]) -> float:
        """
        Normalized Discounted Cumulative Gain at K.

        NDCG = DCG / IDCG

        where IDCG is the ideal DCG if results were perfectly ranked
        (most relevant first).

        NDCG is normalized to [0, 1]:
        - 1.0 = perfect ranking (matches ideal)
        - 0.0 = no relevant results

        This is THE standard offline metric for graded relevance.
        """
        dcg = OfflineMetrics.dcg_at_k(results, k)
        idcg = OfflineMetrics.dcg_at_k(ideal_results, k)
        return dcg / idcg if idcg > 0 else 0.0

    @staticmethod
    def expected_reciprocal_rank(results: List[RankedResult],
                                 max_grade: int = 4) -> float:
        """
        ERR: Expected Reciprocal Rank.

        Models user browsing: at each position, the user stops with
        probability proportional to result relevance.

        ERR = Σ (1/pos) × P(stopping at pos)

        More realistic than NDCG for user satisfaction modeling.
        Captures that highly relevant early results may cause early satisfaction.
        """
        err = 0.0
        prob_not_stopped = 1.0
        for result in sorted(results, key=lambda r: r.position):
            # Probability user is satisfied at this position
            prob_satisfied = (2 ** result.relevance - 1) / (2 ** max_grade)
            # Contribution to ERR
            err += prob_not_stopped * prob_satisfied / result.position
            # Update probability of reaching next position
            prob_not_stopped *= (1 - prob_satisfied)
        return err


def compute_evaluation_suite(
    results: List[RankedResult],
    ideal_results: List[RankedResult],
    total_relevant: int
) -> Dict[str, float]:
    """Compute full suite of offline metrics for a query."""
    return {
        "P@1": OfflineMetrics.precision_at_k(results, 1),
        "P@5": OfflineMetrics.precision_at_k(results, 5),
        "P@10": OfflineMetrics.precision_at_k(results, 10),
        "R@10": OfflineMetrics.recall_at_k(results, 10, total_relevant),
        "R@50": OfflineMetrics.recall_at_k(results, 50, total_relevant),
        "MRR": OfflineMetrics.mean_reciprocal_rank(results),
        "NDCG@5": OfflineMetrics.ndcg_at_k(results, 5, ideal_results),
        "NDCG@10": OfflineMetrics.ndcg_at_k(results, 10, ideal_results),
        "ERR": OfflineMetrics.expected_reciprocal_rank(results),
    }
```

Use MRR for navigational queries (one right answer). Use NDCG for informational queries (multiple relevant results with varying quality). Use Recall when finding ALL relevant results matters (legal discovery, academic research). Use ERR when modeling user satisfaction is important.
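For intuition, NDCG is easy to compute by hand. The sketch below uses the same exponential gain and log-position discount as above on a single invented ranking (the grades are made up for illustration):

```python
import math

# Graded relevance (0-4) of the top 5 results for one query, in ranked order.
# Note the grade-4 result is buried at position 4.
ranking = [3, 2, 0, 4, 1]

def dcg(grades):
    # Position i (0-indexed) gets discount log2(i + 2), i.e., log2(pos + 1).
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))

ideal = sorted(ranking, reverse=True)  # perfect ordering: [4, 3, 2, 1, 0]
ndcg = dcg(ranking) / dcg(ideal)
print(f"NDCG@5 = {ndcg:.3f}")  # → NDCG@5 = 0.737
```

Promoting the grade-4 result to position 1 would raise NDCG@5 to 1.0 only if every other result also landed in ideal order; partial fixes move the score part of the way.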
Online metrics measure actual user behavior rather than relevance judgments. They're noisier than offline metrics but capture the full user experience, including factors that judgments miss.
Categories of online metrics:
| Category | Metric | What It Measures | Interpretation |
|---|---|---|---|
| Engagement | Click-Through Rate (CTR) | % of searches with clicks | Higher = more engaging results (usually good) |
| Engagement | Clicks Per Search | Avg clicks per query | Higher can be good (exploration) or bad (not finding answer) |
| Satisfaction | Mean Reciprocal Rank of Click (MRC) | Position of first click | Higher = good results ranked higher |
| Satisfaction | Time to First Click | Seconds until first click | Lower = users find relevant results faster |
| Satisfaction | Dwell Time | Time spent on clicked result | Longer = result was useful |
| Satisfaction | Session Success Rate | % sessions ending in desired action | Higher = search led to goal completion |
| Effort | Query Refinements | Number of query modifications | Lower = users find what they want on first try |
| Effort | Abandonment Rate | % searches with no clicks | Lower usually = better (but can be good for quick answers) |
| System | Zero Results Rate | % queries with no results | Lower = better coverage |
| Long-term | Return Rate | % users who search again | Higher = search is valuable to users |
```python
import statistics
from dataclasses import dataclass, field
from typing import List, Dict, Optional
from datetime import datetime


@dataclass
class SearchEvent:
    """Logged search event."""
    search_id: str
    user_id: str
    query: str
    timestamp: datetime
    results_shown: int
    variant: str  # "control" or "treatment"


@dataclass
class ClickEvent:
    """Click on a search result."""
    search_id: str
    result_position: int
    timestamp: datetime
    dwell_time_seconds: Optional[float] = None  # Time on clicked page


@dataclass
class ConversionEvent:
    """Conversion following search (purchase, signup, etc.)."""
    search_id: str
    conversion_type: str
    value: float  # Revenue, etc.
    timestamp: datetime


@dataclass
class SearchSession:
    """Aggregated view of a search session."""
    search: SearchEvent
    clicks: List[ClickEvent] = field(default_factory=list)
    conversions: List[ConversionEvent] = field(default_factory=list)
    refinements: List[SearchEvent] = field(default_factory=list)

    @property
    def had_click(self) -> bool:
        return len(self.clicks) > 0

    @property
    def first_click_position(self) -> Optional[int]:
        if not self.clicks:
            return None
        return min(c.result_position for c in self.clicks)

    @property
    def total_dwell_time(self) -> float:
        return sum(c.dwell_time_seconds or 0 for c in self.clicks)

    @property
    def had_conversion(self) -> bool:
        return len(self.conversions) > 0


class OnlineMetricsCalculator:
    """
    Calculate online metrics from session data.

    These metrics are computed per variant in an A/B test,
    then compared statistically.
    """

    @staticmethod
    def click_through_rate(sessions: List[SearchSession]) -> float:
        """
        CTR: Fraction of sessions with at least one click.

        CTR = (sessions with clicks) / (total sessions)

        Industry average for web search: 30-50%.
        E-commerce product search: 40-60%.
        """
        if not sessions:
            return 0.0
        clicked = sum(1 for s in sessions if s.had_click)
        return clicked / len(sessions)

    @staticmethod
    def mean_reciprocal_rank_of_clicks(sessions: List[SearchSession]) -> float:
        """
        MRC: Average of 1 / (first click position).

        Measures how high users click on average.
        MRC = 1.0 means first click is always position 1.
        MRC = 0.5 means first click averages position 2.
        """
        reciprocals = []
        for session in sessions:
            if session.first_click_position:
                reciprocals.append(1.0 / session.first_click_position)
        return statistics.mean(reciprocals) if reciprocals else 0.0

    @staticmethod
    def abandonment_rate(sessions: List[SearchSession]) -> float:
        """
        Fraction of sessions with no clicks.

        High abandonment can mean:
        - Bad results (user gave up)
        - Good results (answer shown directly, no click needed)

        Context matters! Zero-click answers (featured snippets)
        increase abandonment but improve satisfaction.
        """
        if not sessions:
            return 0.0
        abandoned = sum(1 for s in sessions if not s.had_click)
        return abandoned / len(sessions)

    @staticmethod
    def time_to_first_click(sessions: List[SearchSession]) -> Optional[float]:
        """
        Average seconds between search and first click.

        Lower is generally better (users find results faster).
        """
        times = []
        for session in sessions:
            if session.clicks:
                first_click = min(session.clicks, key=lambda c: c.timestamp)
                time_to_click = (first_click.timestamp
                                 - session.search.timestamp).total_seconds()
                if time_to_click > 0:  # Filter out invalid data
                    times.append(time_to_click)
        return statistics.mean(times) if times else None

    @staticmethod
    def average_dwell_time(sessions: List[SearchSession]) -> float:
        """
        Average time spent on clicked pages.

        Longer dwell time suggests result was relevant and useful.
        Very short dwell (pogo-sticking) suggests a bad result.
        """
        dwell_times = [s.total_dwell_time for s in sessions if s.total_dwell_time > 0]
        return statistics.mean(dwell_times) if dwell_times else 0.0

    @staticmethod
    def long_click_rate(sessions: List[SearchSession],
                        threshold_seconds: float = 30.0) -> float:
        """
        Fraction of clicks with dwell time above threshold.

        Long clicks are strong relevance signals.
        Short clicks (< 10s) often indicate the user bounced back unsatisfied.

        This is a key quality metric used by Google and others.
        """
        total_clicks = 0
        long_clicks = 0
        for session in sessions:
            for click in session.clicks:
                total_clicks += 1
                if (click.dwell_time_seconds or 0) >= threshold_seconds:
                    long_clicks += 1
        return long_clicks / total_clicks if total_clicks > 0 else 0.0

    @staticmethod
    def reformulation_rate(sessions: List[SearchSession]) -> float:
        """
        Fraction of sessions with query refinements.

        Lower is generally better—users found what they wanted.
        But some reformulation is natural for exploratory queries.
        """
        reformulated = sum(1 for s in sessions if len(s.refinements) > 0)
        return reformulated / len(sessions) if sessions else 0.0

    @staticmethod
    def conversion_rate(sessions: List[SearchSession]) -> float:
        """
        Fraction of sessions leading to conversion.

        The ultimate business metric for commercial search.
        """
        converted = sum(1 for s in sessions if s.had_conversion)
        return converted / len(sessions) if sessions else 0.0

    @staticmethod
    def revenue_per_search(sessions: List[SearchSession]) -> float:
        """
        Average revenue per search session.

        Combines conversion rate with order value.
        """
        total_revenue = sum(
            c.value
            for s in sessions
            for c in s.conversions
        )
        return total_revenue / len(sessions) if sessions else 0.0


def compute_all_metrics(sessions: List[SearchSession]) -> Dict[str, float]:
    """Compute full suite of online metrics."""
    calc = OnlineMetricsCalculator()
    return {
        "ctr": calc.click_through_rate(sessions),
        "mrc": calc.mean_reciprocal_rank_of_clicks(sessions),
        "abandonment_rate": calc.abandonment_rate(sessions),
        "time_to_first_click": calc.time_to_first_click(sessions),
        "avg_dwell_time": calc.average_dwell_time(sessions),
        "long_click_rate": calc.long_click_rate(sessions),
        "reformulation_rate": calc.reformulation_rate(sessions),
        "conversion_rate": calc.conversion_rate(sessions),
        "revenue_per_search": calc.revenue_per_search(sessions),
    }
```

Improving one metric often hurts others. Higher CTR might mean worse relevance (users click more because they're not finding what they want). Lower abandonment might mean worse direct answers (users click because the answer isn't shown on the SERP). Choose primary and guardrail metrics carefully, and investigate metric conflicts.
Proper experimental design ensures results are valid and actionable. Poor design leads to false conclusions, shipping changes that hurt users or rejecting changes that would help.
Key design elements:
```python
import math
from dataclasses import dataclass
from typing import Optional

from scipy import stats


@dataclass
class ExperimentConfig:
    """Configuration for an A/B test."""
    name: str
    metrics: list
    primary_metric: str
    control_allocation: float = 0.5
    significance_level: float = 0.05     # Alpha
    power: float = 0.80                  # 1 - Beta
    min_detectable_effect: float = 0.02  # 2% relative change


class SampleSizeCalculator:
    """
    Calculate required sample size for experiments.

    The fundamental trade-off:
    - Smaller sample = faster results
    - Larger sample = detect smaller effects

    Under-powered experiments miss real improvements.
    Over-powered experiments waste time detecting insignificant effects.
    """

    @staticmethod
    def for_proportion(
        baseline_rate: float,
        min_detectable_effect: float,  # Relative (e.g., 0.05 = 5% improvement)
        alpha: float = 0.05,
        power: float = 0.80,
        two_sided: bool = True
    ) -> int:
        """
        Sample size for proportion metrics (CTR, conversion rate, etc.).

        Formula based on z-test for two proportions.

        Example:
        - Baseline CTR: 40%
        - Want to detect: 5% relative improvement (40% → 42%)
        - Alpha: 0.05, Power: 0.80
        → Need ~10,500 samples per variant
        """
        p1 = baseline_rate
        p2 = baseline_rate * (1 + min_detectable_effect)

        # Standardized effect size (Cohen's h for proportions)
        h = 2 * (math.asin(math.sqrt(p2)) - math.asin(math.sqrt(p1)))

        # Z-scores for alpha and power
        z_alpha = stats.norm.ppf(1 - alpha / (2 if two_sided else 1))
        z_power = stats.norm.ppf(power)

        # Sample size formula
        n = 2 * ((z_alpha + z_power) / h) ** 2
        return math.ceil(n)

    @staticmethod
    def for_continuous(
        baseline_mean: float,
        baseline_std: float,
        min_detectable_effect: float,  # Relative
        alpha: float = 0.05,
        power: float = 0.80
    ) -> int:
        """
        Sample size for continuous metrics (dwell time, revenue, etc.).

        More variance = more samples needed.
        """
        effect_size = baseline_mean * min_detectable_effect

        # Cohen's d
        d = effect_size / baseline_std

        # Z-scores
        z_alpha = stats.norm.ppf(1 - alpha / 2)
        z_power = stats.norm.ppf(power)

        # Sample size
        n = 2 * ((z_alpha + z_power) / d) ** 2
        return math.ceil(n)

    @staticmethod
    def days_to_run(
        required_samples: int,
        daily_searches: int,
        experiment_allocation: float = 1.0  # Fraction of traffic in experiment
    ) -> float:
        """
        Estimate days to reach required sample size.

        Each variant needs required_samples.
        Total samples needed = 2 * required_samples (for two variants).
        """
        samples_per_day = daily_searches * experiment_allocation
        total_needed = 2 * required_samples
        return total_needed / samples_per_day


def plan_experiment(
    name: str,
    metric: str,
    baseline_value: float,
    baseline_std: Optional[float],  # For continuous metrics
    min_effect: float,
    daily_samples: int
) -> dict:
    """Plan an experiment: calculate sample size and duration."""
    calc = SampleSizeCalculator()

    # Determine if proportion or continuous metric
    is_proportion = baseline_value <= 1.0 and (baseline_std is None or baseline_std <= 0.5)

    if is_proportion:
        samples_needed = calc.for_proportion(
            baseline_rate=baseline_value,
            min_detectable_effect=min_effect
        )
    else:
        samples_needed = calc.for_continuous(
            baseline_mean=baseline_value,
            baseline_std=baseline_std or (baseline_value * 0.5),  # Estimate
            min_detectable_effect=min_effect
        )

    days_needed = calc.days_to_run(samples_needed, daily_samples)

    return {
        "experiment_name": name,
        "primary_metric": metric,
        "baseline": baseline_value,
        "min_detectable_effect": f"{min_effect*100}%",
        "samples_per_variant": samples_needed,
        "estimated_days": math.ceil(days_needed),
        "recommendation": _experiment_recommendation(days_needed, samples_needed)
    }


def _experiment_recommendation(days: float, samples: int) -> str:
    if days > 60:
        return "CONSIDER: Effect size very small. Consider larger MDE or accept lower power."
    elif days < 1:
        return "WARNING: Very short experiment. Check for novelty effects and weekly patterns."
    elif samples < 1000:
        return "WARNING: Small sample. Results may not generalize. Consider longer run."
    else:
        return f"OK: Run for at least {math.ceil(days)} days with weekly pattern coverage."


# Example usage
if __name__ == "__main__":
    result = plan_experiment(
        name="New ranking model test",
        metric="click_through_rate",
        baseline_value=0.40,   # 40% CTR
        baseline_std=None,     # Proportion metric
        min_effect=0.05,       # Detect 5% relative improvement
        daily_samples=50000    # 50K searches/day
    )
    print(result)
```

Proper statistical analysis turns raw metrics into actionable conclusions. Done wrong, you'll ship bad changes or miss good ones.
The hypothesis testing framework:
```python
import math
from dataclasses import dataclass
from typing import Tuple

import numpy as np
from scipy import stats


@dataclass
class ExperimentResult:
    """Results from an A/B test variant."""
    variant_name: str
    sample_size: int
    metric_value: float
    metric_std: float  # Standard deviation


@dataclass
class StatisticalTestResult:
    """Complete statistical test output."""
    control: ExperimentResult
    treatment: ExperimentResult
    relative_change: float  # (treatment - control) / control
    absolute_change: float  # treatment - control
    p_value: float
    confidence_interval: Tuple[float, float]
    is_significant: bool
    effect_size: float  # Cohen's d (or h for proportions)

    def summary(self) -> str:
        direction = "increase" if self.relative_change > 0 else "decrease"
        sig = ("statistically significant" if self.is_significant
               else "not statistically significant")
        return (
            f"{self.relative_change*100:+.2f}% {direction} ({sig})\n"
            f"95% CI: [{self.confidence_interval[0]*100:.2f}%, "
            f"{self.confidence_interval[1]*100:.2f}%]\n"
            f"p-value: {self.p_value:.4f}"
        )


class ABTestAnalyzer:
    """
    Statistical analysis for A/B test results.

    Provides hypothesis testing, confidence intervals,
    and decision recommendations.
    """

    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha

    def analyze_proportions(
        self,
        control: ExperimentResult,
        treatment: ExperimentResult
    ) -> StatisticalTestResult:
        """
        Analyze proportion metrics (CTR, conversion rate).

        Uses two-proportion z-test.
        """
        p_c = control.metric_value
        p_t = treatment.metric_value
        n_c = control.sample_size
        n_t = treatment.sample_size

        # Pooled proportion under null hypothesis
        p_pooled = (p_c * n_c + p_t * n_t) / (n_c + n_t)

        # Standard error
        se = math.sqrt(p_pooled * (1 - p_pooled) * (1/n_c + 1/n_t))

        # Z-statistic
        z = (p_t - p_c) / se if se > 0 else 0

        # P-value (two-tailed)
        p_value = 2 * (1 - stats.norm.cdf(abs(z)))

        # Confidence interval for the difference
        se_diff = math.sqrt(p_c*(1-p_c)/n_c + p_t*(1-p_t)/n_t)
        z_alpha = stats.norm.ppf(1 - self.alpha/2)
        ci_lower = (p_t - p_c) - z_alpha * se_diff
        ci_upper = (p_t - p_c) + z_alpha * se_diff

        # Effect size (Cohen's h)
        h = 2 * (math.asin(math.sqrt(p_t)) - math.asin(math.sqrt(p_c)))

        return StatisticalTestResult(
            control=control,
            treatment=treatment,
            relative_change=(p_t - p_c) / p_c if p_c > 0 else 0,
            absolute_change=p_t - p_c,
            p_value=p_value,
            confidence_interval=(
                ci_lower / p_c if p_c > 0 else 0,
                ci_upper / p_c if p_c > 0 else 0
            ),
            is_significant=p_value < self.alpha,
            effect_size=h
        )

    def analyze_continuous(
        self,
        control: ExperimentResult,
        treatment: ExperimentResult
    ) -> StatisticalTestResult:
        """
        Analyze continuous metrics (dwell time, revenue).

        Uses Welch's t-test (unequal variances).
        """
        m_c = control.metric_value
        m_t = treatment.metric_value
        s_c = control.metric_std
        s_t = treatment.metric_std
        n_c = control.sample_size
        n_t = treatment.sample_size

        # Welch's t-test
        se = math.sqrt(s_c**2/n_c + s_t**2/n_t)
        t_stat = (m_t - m_c) / se if se > 0 else 0

        # Degrees of freedom (Welch-Satterthwaite)
        num = (s_c**2/n_c + s_t**2/n_t)**2
        denom = (s_c**2/n_c)**2/(n_c-1) + (s_t**2/n_t)**2/(n_t-1)
        df = num / denom if denom > 0 else min(n_c, n_t) - 1

        # P-value
        p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))

        # Confidence interval
        t_alpha = stats.t.ppf(1 - self.alpha/2, df)
        ci_lower = (m_t - m_c) - t_alpha * se
        ci_upper = (m_t - m_c) + t_alpha * se

        # Effect size (Cohen's d)
        pooled_std = math.sqrt(((n_c-1)*s_c**2 + (n_t-1)*s_t**2) / (n_c+n_t-2))
        d = (m_t - m_c) / pooled_std if pooled_std > 0 else 0

        return StatisticalTestResult(
            control=control,
            treatment=treatment,
            relative_change=(m_t - m_c) / m_c if m_c > 0 else 0,
            absolute_change=m_t - m_c,
            p_value=p_value,
            confidence_interval=(
                ci_lower / m_c if m_c > 0 else 0,
                ci_upper / m_c if m_c > 0 else 0
            ),
            is_significant=p_value < self.alpha,
            effect_size=d
        )


class MultipleTestingCorrection:
    """
    Corrections for testing multiple metrics simultaneously.

    When you test 20 metrics, you expect 1 false positive at α=0.05.
    Corrections control the family-wise error rate.
    """

    @staticmethod
    def bonferroni(p_values: list, alpha: float = 0.05) -> dict:
        """
        Bonferroni correction: most conservative.

        Divide α by the number of tests. Protects strongly against
        false positives but may miss real effects (high false negative rate).
        """
        n = len(p_values)
        corrected_alpha = alpha / n
        return {
            "corrected_alpha": corrected_alpha,
            "significant": [p < corrected_alpha for p in p_values]
        }

    @staticmethod
    def benjamini_hochberg(p_values: list, alpha: float = 0.05) -> dict:
        """
        Benjamini-Hochberg procedure: controls the False Discovery Rate.

        Less conservative than Bonferroni. Allows some false positives
        in exchange for more power to detect real effects.
        Recommended for exploratory analysis with many metrics.
        """
        n = len(p_values)
        sorted_idx = np.argsort(p_values)
        sorted_p = np.array(p_values)[sorted_idx]

        # BH threshold: p[i] < (i/n) * alpha
        thresholds = [(i+1)/n * alpha for i in range(n)]

        # Find the largest p-value below its threshold
        significant = [False] * n
        for i in range(n-1, -1, -1):
            if sorted_p[i] <= thresholds[i]:
                for j in range(i+1):
                    significant[sorted_idx[j]] = True
                break

        return {
            "fdr_level": alpha,
            "significant": significant
        }


# Complete analysis workflow
def full_experiment_analysis(
    control_sessions: list,
    treatment_sessions: list,
    metrics_to_test: list
) -> dict:
    """Run complete statistical analysis for an experiment."""
    analyzer = ABTestAnalyzer(alpha=0.05)
    results = {}
    p_values = []

    for metric_name in metrics_to_test:
        # Compute metric for each group
        control_values = [getattr(s, metric_name) for s in control_sessions]
        treatment_values = [getattr(s, metric_name) for s in treatment_sessions]

        control = ExperimentResult(
            variant_name="control",
            sample_size=len(control_values),
            metric_value=sum(control_values) / len(control_values),
            metric_std=float(np.std(control_values))
        )
        treatment = ExperimentResult(
            variant_name="treatment",
            sample_size=len(treatment_values),
            metric_value=sum(treatment_values) / len(treatment_values),
            metric_std=float(np.std(treatment_values))
        )

        # Run the appropriate test
        is_proportion = metric_name in ["ctr", "conversion_rate", "abandonment_rate"]
        if is_proportion:
            test_result = analyzer.analyze_proportions(control, treatment)
        else:
            test_result = analyzer.analyze_continuous(control, treatment)

        results[metric_name] = test_result
        p_values.append(test_result.p_value)

    # Apply multiple testing correction
    bh_correction = MultipleTestingCorrection.benjamini_hochberg(p_values)

    return {
        "individual_results": results,
        "bh_significant": bh_correction["significant"]
    }
```

Don't peek at results repeatedly and stop when significant. This inflates false positive rates. Pre-register your stopping criteria: either run for a fixed duration, or use sequential testing methods (e.g., multi-armed bandits) designed for early stopping.
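The peeking problem is easy to demonstrate with a simulation: run A/A tests (both arms drawn from the same distribution, so every "significant" result is a false positive) and compare checking once at the end against checking at ten interim points. A sketch, assuming numpy and scipy are available; trial counts and sample sizes are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def aa_test_rejects(n_total: int, n_peeks: int) -> bool:
    """One A/A test: both arms from the same distribution (no real effect).
    Returns True if ANY interim look is 'significant' at p < 0.05."""
    a = rng.normal(0.0, 1.0, n_total)
    b = rng.normal(0.0, 1.0, n_total)
    checkpoints = np.linspace(n_total // n_peeks, n_total, n_peeks, dtype=int)
    return any(stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05 for n in checkpoints)

trials = 500
single_look = sum(aa_test_rejects(2000, 1) for _ in range(trials)) / trials
ten_peeks = sum(aa_test_rejects(2000, 10) for _ in range(trials)) / trials
print(f"False positive rate, one look at the end: {single_look:.3f}")
print(f"False positive rate, peeking 10 times:    {ten_peeks:.3f}")
```

The single look lands near the nominal 5%, while stopping at the first significant peek typically rejects the (true) null two to four times as often, which is exactly why stopping rules must be fixed in advance.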
Search A/B testing has unique challenges that can invalidate results if not handled properly.
A/B testing transforms search relevance from opinion-driven to evidence-driven. It's the difference between "I think this is better" and "We measured that this is 3% better with p < 0.01." This rigor is essential for building and maintaining high-quality search systems.
What's next:
A/B testing measures what users do, but not why. The final page in this module explores user feedback incorporation—how to collect, analyze, and act on direct user feedback to complement behavioral metrics.
You now understand how to scientifically measure search quality improvements through controlled experiments. From offline metrics to online A/B tests, from sample size calculation to statistical analysis, you have the tools to make evidence-based decisions about relevance changes.