You've implemented a sophisticated boosting strategy. You've built a personalization system. You've added semantic search with embeddings. But here's the uncomfortable question: Is it actually better?
Intuition is unreliable for search quality. Changes that feel like improvements often aren't when measured rigorously. The "improved" ranking that satisfies ten people you asked might frustrate ten thousand users you didn't. Worse, changes can have delayed effects—users might click more today but search less overall next month.
A/B testing is the gold standard for measuring search quality changes. By showing different ranking algorithms to randomly selected user groups and measuring their behavior, we can make statistically valid claims about which approach is superior.
But search A/B testing is uniquely challenging: relevance is subjective, metrics conflict with each other, and user behavior is noisy. This page provides a comprehensive guide to designing, executing, and analyzing search A/B tests correctly.
By the end of this page, you will understand how to design search experiments, select appropriate metrics (both offline and online), calculate required sample sizes, avoid common statistical pitfalls, and interpret results correctly to make ship/no-ship decisions.
Search experimentation follows a structured framework that balances speed (shipping improvements quickly) with rigor (ensuring improvements are real).
The three-stage evaluation pipeline:
| Stage | Type | Speed | Cost | Confidence |
|---|---|---|---|---|
| 1. Offline evaluation | Historical query logs + labels | Hours | Low | Medium |
| 2. Online A/B test | Live traffic experiment | Days-weeks | Medium | High |
| 3. Long-term holdout | Persistent control group | Months | High | Very High |
Stage 1: Offline evaluation
Before exposing users to a change, evaluate it offline using historical queries and human relevance judgments. Offline metrics (NDCG, MRR, MAP) provide cheap, fast feedback on whether a change is directionally positive.
Limitations: Offline metrics don't perfectly predict online outcomes. A change might improve NDCG on judged queries but worsen the user experience due to factors not captured in judgments.
Stage 2: Online A/B test
If offline metrics look positive (or neutral without regression), run a controlled online experiment. Randomly assign users to control (current ranking) or treatment (new ranking), and compare behavioral metrics.
Stage 3: Long-term holdout
Maintain a small percentage of users (1-5%) who never receive experimental changes. This provides a stable baseline for detecting gradual degradation that A/B tests miss.
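In practice, both variant assignment and holdout membership are usually derived from a deterministic salted hash of the user ID, so each user sees a stable experience and the holdout stays fixed across experiments. Here is a minimal sketch; the function names, salts, and split percentages are illustrative, not from this page:

```python
import hashlib

HOLDOUT_PERCENT = 2     # persistent holdout, excluded from all experiments
TREATMENT_PERCENT = 50  # share of remaining traffic given the new ranking

def bucket(user_id: str, salt: str) -> int:
    """Deterministic bucket in [0, 100) from a salted hash of the user ID."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def assign_variant(user_id: str, experiment_name: str) -> str:
    # The holdout uses its own salt, so membership is stable across experiments.
    if bucket(user_id, "global-holdout") < HOLDOUT_PERCENT:
        return "holdout"
    # Salting with the experiment name decorrelates assignment between experiments.
    if bucket(user_id, experiment_name) < TREATMENT_PERCENT:
        return "treatment"
    return "control"
```

Because the hash is deterministic, `assign_variant` needs no stored assignment table, and re-running it for the same user and experiment always yields the same group.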
Each stage catches different problems. Offline catches obvious bugs and regressions before users see them. A/B tests measure actual user behavior. Long-term holdouts catch cumulative effects of many small changes that individually pass A/B tests but collectively degrade experience.
Offline metrics evaluate ranking quality against human relevance judgments. These judgments typically use graded scales (e.g., 0-4) where 0 is not relevant and 4 is perfectly relevant.
Key offline metrics:
```python
import math
from typing import List, Dict
from dataclasses import dataclass


@dataclass
class RankedResult:
    """A single result with position and relevance grade."""
    doc_id: str
    position: int   # 1-indexed
    relevance: int  # 0-4 graded relevance (0=not relevant, 4=perfect)


class OfflineMetrics:
    """
    Standard offline evaluation metrics for search ranking.

    These metrics compare a ranking against human relevance judgments.
    Higher is better for all metrics.
    """

    @staticmethod
    def precision_at_k(results: List[RankedResult], k: int,
                       relevance_threshold: int = 1) -> float:
        """
        Precision@K: Fraction of top-K results that are relevant.

        Simple but doesn't consider rank position or graded relevance.
        Good for quick sanity checks.

        Example: P@10 = 0.7 means 7 of top 10 results are relevant.
        """
        top_k = [r for r in results if r.position <= k]
        relevant = [r for r in top_k if r.relevance >= relevance_threshold]
        return len(relevant) / k if k > 0 else 0.0

    @staticmethod
    def recall_at_k(results: List[RankedResult], k: int, total_relevant: int,
                    relevance_threshold: int = 1) -> float:
        """
        Recall@K: Fraction of all relevant docs found in top K.

        Requires knowing total relevant docs for the query.
        Important for queries where users need to see ALL relevant results.
        """
        top_k = [r for r in results if r.position <= k]
        found_relevant = len([r for r in top_k if r.relevance >= relevance_threshold])
        return found_relevant / total_relevant if total_relevant > 0 else 0.0

    @staticmethod
    def mean_reciprocal_rank(results: List[RankedResult],
                             relevance_threshold: int = 1) -> float:
        """
        MRR: 1 / position of first relevant result.

        Focuses only on finding ONE relevant result quickly.
        Ignores everything after the first relevant result.
        Good for navigational queries where user wants one answer.

        MRR = 1.0 means first result is always relevant.
        MRR = 0.5 means first relevant result is typically at position 2.
        """
        for result in sorted(results, key=lambda r: r.position):
            if result.relevance >= relevance_threshold:
                return 1.0 / result.position
        return 0.0

    @staticmethod
    def dcg_at_k(results: List[RankedResult], k: int) -> float:
        """
        Discounted Cumulative Gain at K.

        DCG = Σ (2^rel - 1) / log2(pos + 1)

        Key insight: Highly relevant results should rank higher.
        Position discount: a result at position 10 is less valuable than at position 1.
        Relevance gain: higher relevance grades contribute more.
        """
        top_k = sorted([r for r in results if r.position <= k],
                       key=lambda r: r.position)
        dcg = 0.0
        for result in top_k:
            # Gain from relevance (exponential for graded relevance)
            gain = (2 ** result.relevance) - 1
            # Discount by log of position
            discount = math.log2(result.position + 1)
            dcg += gain / discount
        return dcg

    @staticmethod
    def ndcg_at_k(results: List[RankedResult], k: int,
                  ideal_results: List[RankedResult]) -> float:
        """
        Normalized Discounted Cumulative Gain at K.

        NDCG = DCG / IDCG

        where IDCG is the ideal DCG if results were perfectly ranked
        (most relevant first).

        NDCG is normalized to [0, 1]:
        - 1.0 = perfect ranking (matches ideal)
        - 0.0 = no relevant results

        This is THE standard offline metric for graded relevance.
        """
        dcg = OfflineMetrics.dcg_at_k(results, k)
        idcg = OfflineMetrics.dcg_at_k(ideal_results, k)
        return dcg / idcg if idcg > 0 else 0.0

    @staticmethod
    def expected_reciprocal_rank(results: List[RankedResult],
                                 max_grade: int = 4) -> float:
        """
        ERR: Expected Reciprocal Rank.

        Models user browsing: at each position, the user stops with
        probability proportional to result relevance.

        ERR = Σ (1/pos) × P(stopping at pos)

        More realistic than NDCG for user satisfaction modeling.
        Captures that highly relevant early results may cause early satisfaction.
        """
        err = 0.0
        prob_not_stopped = 1.0
        for result in sorted(results, key=lambda r: r.position):
            # Probability user is satisfied at this position
            prob_satisfied = (2 ** result.relevance - 1) / (2 ** max_grade)
            # Contribution to ERR
            err += prob_not_stopped * prob_satisfied / result.position
            # Update probability of reaching next position
            prob_not_stopped *= (1 - prob_satisfied)
        return err


def compute_evaluation_suite(
    results: List[RankedResult],
    ideal_results: List[RankedResult],
    total_relevant: int
) -> Dict[str, float]:
    """Compute full suite of offline metrics for a query."""
    return {
        "P@1": OfflineMetrics.precision_at_k(results, 1),
        "P@5": OfflineMetrics.precision_at_k(results, 5),
        "P@10": OfflineMetrics.precision_at_k(results, 10),
        "R@10": OfflineMetrics.recall_at_k(results, 10, total_relevant),
        "R@50": OfflineMetrics.recall_at_k(results, 50, total_relevant),
        "MRR": OfflineMetrics.mean_reciprocal_rank(results),
        "NDCG@5": OfflineMetrics.ndcg_at_k(results, 5, ideal_results),
        "NDCG@10": OfflineMetrics.ndcg_at_k(results, 10, ideal_results),
        "ERR": OfflineMetrics.expected_reciprocal_rank(results),
    }
```

Use MRR for navigational queries (one right answer). Use NDCG for informational queries (multiple relevant results with varying quality). Use Recall when finding ALL relevant results matters (legal discovery, academic research). Use ERR when modeling user satisfaction is important.
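For intuition, NDCG is easy to compute by hand. The sketch below uses the same exponential gain and log-position discount as above on a single invented ranking (the grades are made up for illustration):

```python
import math

# Graded relevance (0-4) of the top 5 results for one query, in ranked order.
# Note the grade-4 result is buried at position 4.
ranking = [3, 2, 0, 4, 1]

def dcg(grades):
    # Position i (0-indexed) gets discount log2(i + 2), i.e., log2(pos + 1).
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))

ideal = sorted(ranking, reverse=True)  # perfect ordering: [4, 3, 2, 1, 0]
ndcg = dcg(ranking) / dcg(ideal)
print(f"NDCG@5 = {ndcg:.3f}")  # → NDCG@5 = 0.737
```

Promoting the grade-4 result to position 1 would raise NDCG@5 to 1.0 only if every other result also landed in ideal order; partial fixes move the score part of the way.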
Online metrics measure actual user behavior rather than relevance judgments. They're noisier than offline metrics but capture the full user experience, including factors that judgments miss.
Categories of online metrics:
| Category | Metric | What It Measures | Interpretation |
|---|---|---|---|
| Engagement | Click-Through Rate (CTR) | % of searches with clicks | Higher = more engaging results (usually good) |
| Engagement | Clicks Per Search | Avg clicks per query | Higher can be good (exploration) or bad (not finding answer) |
| Satisfaction | Mean Reciprocal Rank of Click (MRC) | Position of first click | Higher = good results ranked higher |
| Satisfaction | Time to First Click | Seconds until first click | Lower = users find relevant results faster |
| Satisfaction | Dwell Time | Time spent on clicked result | Longer = result was useful |
| Satisfaction | Session Success Rate | % sessions ending in desired action | Higher = search led to goal completion |
| Effort | Query Refinements | Number of query modifications | Lower = users find what they want on first try |
| Effort | Abandonment Rate | % searches with no clicks | Lower usually = better (but can be good for quick answers) |
| System | Zero Results Rate | % queries with no results | Lower = better coverage |
| Long-term | Return Rate | % users who search again | Higher = search is valuable to users |
```python
import statistics
from dataclasses import dataclass, field
from typing import List, Dict, Optional
from datetime import datetime


@dataclass
class SearchEvent:
    """Logged search event."""
    search_id: str
    user_id: str
    query: str
    timestamp: datetime
    results_shown: int
    variant: str  # "control" or "treatment"


@dataclass
class ClickEvent:
    """Click on a search result."""
    search_id: str
    result_position: int
    timestamp: datetime
    dwell_time_seconds: Optional[float] = None  # Time on clicked page


@dataclass
class ConversionEvent:
    """Conversion following search (purchase, signup, etc.)."""
    search_id: str
    conversion_type: str
    value: float  # Revenue, etc.
    timestamp: datetime


@dataclass
class SearchSession:
    """Aggregated view of a search session."""
    search: SearchEvent
    clicks: List[ClickEvent] = field(default_factory=list)
    conversions: List[ConversionEvent] = field(default_factory=list)
    refinements: List[SearchEvent] = field(default_factory=list)

    @property
    def had_click(self) -> bool:
        return len(self.clicks) > 0

    @property
    def first_click_position(self) -> Optional[int]:
        if not self.clicks:
            return None
        return min(c.result_position for c in self.clicks)

    @property
    def total_dwell_time(self) -> float:
        return sum(c.dwell_time_seconds or 0 for c in self.clicks)

    @property
    def had_conversion(self) -> bool:
        return len(self.conversions) > 0


class OnlineMetricsCalculator:
    """
    Calculate online metrics from session data.

    These metrics are computed per variant in an A/B test,
    then compared statistically.
    """

    @staticmethod
    def click_through_rate(sessions: List[SearchSession]) -> float:
        """
        CTR: Fraction of sessions with at least one click.

        CTR = (sessions with clicks) / (total sessions)

        Industry average for web search: 30-50%.
        E-commerce product search: 40-60%.
        """
        if not sessions:
            return 0.0
        clicked = sum(1 for s in sessions if s.had_click)
        return clicked / len(sessions)

    @staticmethod
    def mean_reciprocal_rank_of_clicks(sessions: List[SearchSession]) -> float:
        """
        MRC: Average of 1 / (first click position).

        Measures how high users click on average.
        MRC = 1.0 means first click is always position 1.
        MRC = 0.5 means first click averages position 2.
        """
        reciprocals = []
        for session in sessions:
            if session.first_click_position:
                reciprocals.append(1.0 / session.first_click_position)
        return statistics.mean(reciprocals) if reciprocals else 0.0

    @staticmethod
    def abandonment_rate(sessions: List[SearchSession]) -> float:
        """
        Fraction of sessions with no clicks.

        High abandonment can mean:
        - Bad results (user gave up)
        - Good results (answer shown directly, no click needed)

        Context matters! Zero-click answers (featured snippets)
        increase abandonment but improve satisfaction.
        """
        if not sessions:
            return 0.0
        abandoned = sum(1 for s in sessions if not s.had_click)
        return abandoned / len(sessions)

    @staticmethod
    def time_to_first_click(sessions: List[SearchSession]) -> Optional[float]:
        """
        Average seconds between search and first click.

        Lower is generally better (users find results faster).
        """
        times = []
        for session in sessions:
            if session.clicks:
                first_click = min(session.clicks, key=lambda c: c.timestamp)
                time_to_click = (first_click.timestamp
                                 - session.search.timestamp).total_seconds()
                if time_to_click > 0:  # Filter out invalid data
                    times.append(time_to_click)
        return statistics.mean(times) if times else None

    @staticmethod
    def average_dwell_time(sessions: List[SearchSession]) -> float:
        """
        Average time spent on clicked pages.

        Longer dwell time suggests result was relevant and useful.
        Very short dwell (pogo-sticking) suggests a bad result.
        """
        dwell_times = [s.total_dwell_time for s in sessions if s.total_dwell_time > 0]
        return statistics.mean(dwell_times) if dwell_times else 0.0

    @staticmethod
    def long_click_rate(sessions: List[SearchSession],
                        threshold_seconds: float = 30.0) -> float:
        """
        Fraction of clicks with dwell time above threshold.

        Long clicks are strong relevance signals.
        Short clicks (< 10s) often indicate the user bounced back unsatisfied.

        This is a key quality metric used by Google and others.
        """
        total_clicks = 0
        long_clicks = 0
        for session in sessions:
            for click in session.clicks:
                total_clicks += 1
                if (click.dwell_time_seconds or 0) >= threshold_seconds:
                    long_clicks += 1
        return long_clicks / total_clicks if total_clicks > 0 else 0.0

    @staticmethod
    def reformulation_rate(sessions: List[SearchSession]) -> float:
        """
        Fraction of sessions with query refinements.

        Lower is generally better—users found what they wanted.
        But some reformulation is natural for exploratory queries.
        """
        reformulated = sum(1 for s in sessions if len(s.refinements) > 0)
        return reformulated / len(sessions) if sessions else 0.0

    @staticmethod
    def conversion_rate(sessions: List[SearchSession]) -> float:
        """
        Fraction of sessions leading to conversion.

        The ultimate business metric for commercial search.
        """
        converted = sum(1 for s in sessions if s.had_conversion)
        return converted / len(sessions) if sessions else 0.0

    @staticmethod
    def revenue_per_search(sessions: List[SearchSession]) -> float:
        """
        Average revenue per search session.

        Combines conversion rate with order value.
        """
        total_revenue = sum(
            c.value
            for s in sessions
            for c in s.conversions
        )
        return total_revenue / len(sessions) if sessions else 0.0


def compute_all_metrics(sessions: List[SearchSession]) -> Dict[str, float]:
    """Compute full suite of online metrics."""
    calc = OnlineMetricsCalculator()
    return {
        "ctr": calc.click_through_rate(sessions),
        "mrc": calc.mean_reciprocal_rank_of_clicks(sessions),
        "abandonment_rate": calc.abandonment_rate(sessions),
        "time_to_first_click": calc.time_to_first_click(sessions),
        "avg_dwell_time": calc.average_dwell_time(sessions),
        "long_click_rate": calc.long_click_rate(sessions),
        "reformulation_rate": calc.reformulation_rate(sessions),
        "conversion_rate": calc.conversion_rate(sessions),
        "revenue_per_search": calc.revenue_per_search(sessions),
    }
```

Improving one metric often hurts others. Higher CTR might mean worse relevance (users click more because they're not finding what they want). Lower abandonment might mean worse direct answers (users click because the answer isn't shown on the SERP). Choose primary and guardrail metrics carefully, and investigate metric conflicts.
Proper experimental design ensures results are valid and actionable. Poor design leads to false conclusions, shipping changes that hurt users or rejecting changes that would help.
Key design elements:
```python
import math
from dataclasses import dataclass
from typing import Optional

from scipy import stats


@dataclass
class ExperimentConfig:
    """Configuration for an A/B test."""
    name: str
    metrics: list
    primary_metric: str
    control_allocation: float = 0.5
    significance_level: float = 0.05     # Alpha
    power: float = 0.80                  # 1 - Beta
    min_detectable_effect: float = 0.02  # 2% relative change


class SampleSizeCalculator:
    """
    Calculate required sample size for experiments.

    The fundamental trade-off:
    - Smaller sample = faster results
    - Larger sample = detect smaller effects

    Under-powered experiments miss real improvements.
    Over-powered experiments waste time detecting insignificant effects.
    """

    @staticmethod
    def for_proportion(
        baseline_rate: float,
        min_detectable_effect: float,  # Relative (e.g., 0.05 = 5% improvement)
        alpha: float = 0.05,
        power: float = 0.80,
        two_sided: bool = True
    ) -> int:
        """
        Sample size for proportion metrics (CTR, conversion rate, etc.).

        Formula based on z-test for two proportions.

        Example:
        - Baseline CTR: 40%
        - Want to detect: 5% relative improvement (40% → 42%)
        - Alpha: 0.05, Power: 0.80
        → Need ~10,500 samples per variant
        """
        p1 = baseline_rate
        p2 = baseline_rate * (1 + min_detectable_effect)

        # Standardized effect size (Cohen's h for proportions)
        h = 2 * (math.asin(math.sqrt(p2)) - math.asin(math.sqrt(p1)))

        # Z-scores for alpha and power
        z_alpha = stats.norm.ppf(1 - alpha / (2 if two_sided else 1))
        z_power = stats.norm.ppf(power)

        # Sample size formula
        n = 2 * ((z_alpha + z_power) / h) ** 2
        return math.ceil(n)

    @staticmethod
    def for_continuous(
        baseline_mean: float,
        baseline_std: float,
        min_detectable_effect: float,  # Relative
        alpha: float = 0.05,
        power: float = 0.80
    ) -> int:
        """
        Sample size for continuous metrics (dwell time, revenue, etc.).

        More variance = more samples needed.
        """
        effect_size = baseline_mean * min_detectable_effect

        # Cohen's d
        d = effect_size / baseline_std

        # Z-scores
        z_alpha = stats.norm.ppf(1 - alpha / 2)
        z_power = stats.norm.ppf(power)

        # Sample size
        n = 2 * ((z_alpha + z_power) / d) ** 2
        return math.ceil(n)

    @staticmethod
    def days_to_run(
        required_samples: int,
        daily_searches: int,
        experiment_allocation: float = 1.0  # Fraction of traffic in experiment
    ) -> float:
        """
        Estimate days to reach required sample size.

        Each variant needs required_samples.
        Total samples needed = 2 * required_samples (for two variants).
        """
        samples_per_day = daily_searches * experiment_allocation
        total_needed = 2 * required_samples
        return total_needed / samples_per_day


def plan_experiment(
    name: str,
    metric: str,
    baseline_value: float,
    baseline_std: Optional[float],  # For continuous metrics
    min_effect: float,
    daily_samples: int
) -> dict:
    """Plan an experiment: calculate sample size and duration."""
    calc = SampleSizeCalculator()

    # Determine if proportion or continuous metric
    is_proportion = baseline_value <= 1.0 and (baseline_std is None or baseline_std <= 0.5)

    if is_proportion:
        samples_needed = calc.for_proportion(
            baseline_rate=baseline_value,
            min_detectable_effect=min_effect
        )
    else:
        samples_needed = calc.for_continuous(
            baseline_mean=baseline_value,
            baseline_std=baseline_std or (baseline_value * 0.5),  # Estimate
            min_detectable_effect=min_effect
        )

    days_needed = calc.days_to_run(samples_needed, daily_samples)

    return {
        "experiment_name": name,
        "primary_metric": metric,
        "baseline": baseline_value,
        "min_detectable_effect": f"{min_effect*100}%",
        "samples_per_variant": samples_needed,
        "estimated_days": math.ceil(days_needed),
        "recommendation": _experiment_recommendation(days_needed, samples_needed)
    }


def _experiment_recommendation(days: float, samples: int) -> str:
    if days > 60:
        return "CONSIDER: Effect size very small. Consider larger MDE or accept lower power."
    elif days < 1:
        return "WARNING: Very short experiment. Check for novelty effects and weekly patterns."
    elif samples < 1000:
        return "WARNING: Small sample. Results may not generalize. Consider longer run."
    else:
        return f"OK: Run for at least {math.ceil(days)} days with weekly pattern coverage."


# Example usage
if __name__ == "__main__":
    result = plan_experiment(
        name="New ranking model test",
        metric="click_through_rate",
        baseline_value=0.40,   # 40% CTR
        baseline_std=None,     # Proportion metric
        min_effect=0.05,       # Detect 5% relative improvement
        daily_samples=50000    # 50K searches/day
    )
    print(result)
```

Proper statistical analysis turns raw metrics into actionable conclusions. Done wrong, you'll ship bad changes or miss good ones.
The hypothesis testing framework:
```python
import math
from dataclasses import dataclass
from typing import Tuple

import numpy as np
from scipy import stats


@dataclass
class ExperimentResult:
    """Results from an A/B test variant."""
    variant_name: str
    sample_size: int
    metric_value: float
    metric_std: float  # Standard deviation


@dataclass
class StatisticalTestResult:
    """Complete statistical test output."""
    control: ExperimentResult
    treatment: ExperimentResult
    relative_change: float  # (treatment - control) / control
    absolute_change: float  # treatment - control
    p_value: float
    confidence_interval: Tuple[float, float]
    is_significant: bool
    effect_size: float  # Cohen's d (or h for proportions)

    def summary(self) -> str:
        direction = "increase" if self.relative_change > 0 else "decrease"
        sig = ("statistically significant" if self.is_significant
               else "not statistically significant")
        return (
            f"{self.relative_change*100:+.2f}% {direction} ({sig})\n"
            f"95% CI: [{self.confidence_interval[0]*100:.2f}%, "
            f"{self.confidence_interval[1]*100:.2f}%]\n"
            f"p-value: {self.p_value:.4f}"
        )


class ABTestAnalyzer:
    """
    Statistical analysis for A/B test results.

    Provides hypothesis testing, confidence intervals,
    and decision recommendations.
    """

    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha

    def analyze_proportions(
        self,
        control: ExperimentResult,
        treatment: ExperimentResult
    ) -> StatisticalTestResult:
        """
        Analyze proportion metrics (CTR, conversion rate).

        Uses two-proportion z-test.
        """
        p_c = control.metric_value
        p_t = treatment.metric_value
        n_c = control.sample_size
        n_t = treatment.sample_size

        # Pooled proportion under null hypothesis
        p_pooled = (p_c * n_c + p_t * n_t) / (n_c + n_t)

        # Standard error
        se = math.sqrt(p_pooled * (1 - p_pooled) * (1/n_c + 1/n_t))

        # Z-statistic
        z = (p_t - p_c) / se if se > 0 else 0

        # P-value (two-tailed)
        p_value = 2 * (1 - stats.norm.cdf(abs(z)))

        # Confidence interval for the difference
        se_diff = math.sqrt(p_c*(1-p_c)/n_c + p_t*(1-p_t)/n_t)
        z_alpha = stats.norm.ppf(1 - self.alpha/2)
        ci_lower = (p_t - p_c) - z_alpha * se_diff
        ci_upper = (p_t - p_c) + z_alpha * se_diff

        # Effect size (Cohen's h)
        h = 2 * (math.asin(math.sqrt(p_t)) - math.asin(math.sqrt(p_c)))

        return StatisticalTestResult(
            control=control,
            treatment=treatment,
            relative_change=(p_t - p_c) / p_c if p_c > 0 else 0,
            absolute_change=p_t - p_c,
            p_value=p_value,
            confidence_interval=(
                ci_lower / p_c if p_c > 0 else 0,
                ci_upper / p_c if p_c > 0 else 0
            ),
            is_significant=p_value < self.alpha,
            effect_size=h
        )

    def analyze_continuous(
        self,
        control: ExperimentResult,
        treatment: ExperimentResult
    ) -> StatisticalTestResult:
        """
        Analyze continuous metrics (dwell time, revenue).

        Uses Welch's t-test (unequal variances).
        """
        m_c = control.metric_value
        m_t = treatment.metric_value
        s_c = control.metric_std
        s_t = treatment.metric_std
        n_c = control.sample_size
        n_t = treatment.sample_size

        # Welch's t-test
        se = math.sqrt(s_c**2/n_c + s_t**2/n_t)
        t_stat = (m_t - m_c) / se if se > 0 else 0

        # Degrees of freedom (Welch-Satterthwaite)
        num = (s_c**2/n_c + s_t**2/n_t)**2
        denom = (s_c**2/n_c)**2/(n_c-1) + (s_t**2/n_t)**2/(n_t-1)
        df = num / denom if denom > 0 else min(n_c, n_t) - 1

        # P-value
        p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))

        # Confidence interval
        t_alpha = stats.t.ppf(1 - self.alpha/2, df)
        ci_lower = (m_t - m_c) - t_alpha * se
        ci_upper = (m_t - m_c) + t_alpha * se

        # Effect size (Cohen's d)
        pooled_std = math.sqrt(((n_c-1)*s_c**2 + (n_t-1)*s_t**2) / (n_c+n_t-2))
        d = (m_t - m_c) / pooled_std if pooled_std > 0 else 0

        return StatisticalTestResult(
            control=control,
            treatment=treatment,
            relative_change=(m_t - m_c) / m_c if m_c > 0 else 0,
            absolute_change=m_t - m_c,
            p_value=p_value,
            confidence_interval=(
                ci_lower / m_c if m_c > 0 else 0,
                ci_upper / m_c if m_c > 0 else 0
            ),
            is_significant=p_value < self.alpha,
            effect_size=d
        )


class MultipleTestingCorrection:
    """
    Corrections for testing multiple metrics simultaneously.

    When you test 20 metrics, you expect 1 false positive at α=0.05.
    Corrections control the family-wise error rate.
    """

    @staticmethod
    def bonferroni(p_values: list, alpha: float = 0.05) -> dict:
        """
        Bonferroni correction: most conservative.

        Divide α by the number of tests. Protects strongly against
        false positives but may miss real effects (high false negative rate).
        """
        n = len(p_values)
        corrected_alpha = alpha / n
        return {
            "corrected_alpha": corrected_alpha,
            "significant": [p < corrected_alpha for p in p_values]
        }

    @staticmethod
    def benjamini_hochberg(p_values: list, alpha: float = 0.05) -> dict:
        """
        Benjamini-Hochberg procedure: controls the False Discovery Rate.

        Less conservative than Bonferroni. Allows some false positives
        in exchange for more power to detect real effects.
        Recommended for exploratory analysis with many metrics.
        """
        n = len(p_values)
        sorted_idx = np.argsort(p_values)
        sorted_p = np.array(p_values)[sorted_idx]

        # BH threshold: p[i] < (i/n) * alpha
        thresholds = [(i+1)/n * alpha for i in range(n)]

        # Find the largest p-value below its threshold
        significant = [False] * n
        for i in range(n-1, -1, -1):
            if sorted_p[i] <= thresholds[i]:
                for j in range(i+1):
                    significant[sorted_idx[j]] = True
                break

        return {
            "fdr_level": alpha,
            "significant": significant
        }


# Complete analysis workflow
def full_experiment_analysis(
    control_sessions: list,
    treatment_sessions: list,
    metrics_to_test: list
) -> dict:
    """Run complete statistical analysis for an experiment."""
    analyzer = ABTestAnalyzer(alpha=0.05)
    results = {}
    p_values = []

    for metric_name in metrics_to_test:
        # Compute metric for each group
        control_values = [getattr(s, metric_name) for s in control_sessions]
        treatment_values = [getattr(s, metric_name) for s in treatment_sessions]

        control = ExperimentResult(
            variant_name="control",
            sample_size=len(control_values),
            metric_value=sum(control_values) / len(control_values),
            metric_std=float(np.std(control_values))
        )
        treatment = ExperimentResult(
            variant_name="treatment",
            sample_size=len(treatment_values),
            metric_value=sum(treatment_values) / len(treatment_values),
            metric_std=float(np.std(treatment_values))
        )

        # Run the appropriate test
        is_proportion = metric_name in ["ctr", "conversion_rate", "abandonment_rate"]
        if is_proportion:
            test_result = analyzer.analyze_proportions(control, treatment)
        else:
            test_result = analyzer.analyze_continuous(control, treatment)

        results[metric_name] = test_result
        p_values.append(test_result.p_value)

    # Apply multiple testing correction
    bh_correction = MultipleTestingCorrection.benjamini_hochberg(p_values)

    return {
        "individual_results": results,
        "bh_significant": bh_correction["significant"]
    }
```

Don't peek at results repeatedly and stop when significant. This inflates false positive rates. Pre-register your stopping criteria: either run for a fixed duration, or use sequential testing methods (e.g., multi-armed bandits) designed for early stopping.
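The peeking problem is easy to demonstrate with a simulation: run A/A tests (both arms drawn from the same distribution, so every "significant" result is a false positive) and compare checking once at the end against checking at ten interim points. A sketch, assuming numpy and scipy are available; trial counts and sample sizes are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def aa_test_rejects(n_total: int, n_peeks: int) -> bool:
    """One A/A test: both arms from the same distribution (no real effect).
    Returns True if ANY interim look is 'significant' at p < 0.05."""
    a = rng.normal(0.0, 1.0, n_total)
    b = rng.normal(0.0, 1.0, n_total)
    checkpoints = np.linspace(n_total // n_peeks, n_total, n_peeks, dtype=int)
    return any(stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05 for n in checkpoints)

trials = 500
single_look = sum(aa_test_rejects(2000, 1) for _ in range(trials)) / trials
ten_peeks = sum(aa_test_rejects(2000, 10) for _ in range(trials)) / trials
print(f"False positive rate, one look at the end: {single_look:.3f}")
print(f"False positive rate, peeking 10 times:    {ten_peeks:.3f}")
```

The single look lands near the nominal 5%, while stopping at the first significant peek typically rejects the (true) null two to four times as often, which is exactly why stopping rules must be fixed in advance.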
Search A/B testing has unique challenges that can invalidate results if not handled properly.
A/B testing transforms search relevance from opinion-driven to evidence-driven. It's the difference between "I think this is better" and "We measured that this is 3% better with p < 0.01." This rigor is essential for building and maintaining high-quality search systems.
What's next:
A/B testing measures what users do, but not why. The final page in this module explores user feedback incorporation—how to collect, analyze, and act on direct user feedback to complement behavioral metrics.
You now understand how to scientifically measure search quality improvements through controlled experiments. From offline metrics to online A/B tests, from sample size calculation to statistical analysis, you have the tools to make evidence-based decisions about relevance changes.