Every search interaction is a learning opportunity. When a user clicks the third result instead of the first, they're telling us something about relevance. When they rephrase a query, they're showing us the gap between what they wanted and what we returned. When they explicitly report a bad result, they're investing effort to improve the system.
User feedback incorporation is the practice of systematically collecting, analyzing, and acting on these signals to continuously improve search quality. It closes the loop between user intent and system behavior, creating a search engine that gets smarter with every query.
This final page in our relevance tuning module explores the full spectrum of user feedback—from implicit behavioral signals to explicit ratings, from manual annotation to machine learning-driven continuous improvement.
By the end of this page, you will understand the taxonomy of user feedback types, how to collect and process feedback at scale, techniques for incorporating feedback into ranking models, the challenges of feedback loops and position bias, and how to build continuous learning systems.
User feedback comes in many forms, each with different signal quality, volume, and collection costs.
The feedback spectrum:
| Type | Examples | Signal Quality | Volume | Collection Cost |
|---|---|---|---|---|
| Explicit Direct | Thumbs up/down, 5-star ratings, 'Was this helpful?' | High (clear intent) | Low (few users bother) | Low (UI element) |
| Explicit Report | Report bad result, flag spam, correction suggestions | Very High (specific) | Very Low (motivated users) | Low (UI element) |
| Explicit Survey | Post-search satisfaction surveys, NPS | High (structured) | Very Low (intrusive) | Medium (survey design) |
| Implicit Primary | Clicks, purchases, conversions | Medium (action ≠ satisfaction) | Very High (every interaction) | Low (logging) |
| Implicit Secondary | Dwell time, scroll depth, return visits | Medium (requires interpretation) | High (computed from logs) | Medium (tracking) |
| Implicit Negative | Skip, quick return, query refinement | Medium (absence of signal) | High (requires inference) | Medium (complex logic) |
| Editorial Judgment | Human raters, quality editors | Very High (expert assessment) | Low (expensive, slow) | Very High (human cost) |
The quality-volume trade-off:
Explicit feedback is cleaner but rare. Most users don't rate results—they just use them (or don't). This means relying primarily on implicit signals, which are abundant but noisy.
The art of feedback incorporation is combining these signals appropriately: using rare explicit feedback to calibrate and validate, while leveraging abundant implicit signals for coverage.
Expect roughly 1% of users to provide explicit feedback, and among those who do, negative feedback is 3-10x more likely than positive (negativity bias). Design systems that work with implicit signals first, then enhance with explicit signals where available.
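One common way to combine the two sources is a shrinkage-style blend: trust the abundant implicit score by default, and shift weight toward the sparse explicit score only as votes accumulate. A minimal sketch (the `k` smoothing constant and the assumption that both scores share a 0-1 scale are illustrative, not values from this page):

```python
def blended_relevance(implicit_score: float,
                      explicit_votes: int,
                      explicit_score: float,
                      k: float = 10.0) -> float:
    """Blend abundant implicit signal with sparse explicit signal.

    The explicit score takes over gradually as votes accumulate:
    with zero votes we trust the implicit data entirely; with many
    votes (>> k) the explicit signal dominates.
    """
    trust = explicit_votes / (explicit_votes + k)  # Goes 0 -> 1 as votes grow
    return (1 - trust) * implicit_score + trust * explicit_score
```

With `k = 10`, ten explicit votes split trust 50/50 between the two signals; a result with no explicit feedback falls back to its implicit score unchanged.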
Explicit feedback requires the user to take a deliberate action to rate or report a result. Designing effective collection means minimizing friction while maximizing signal quality.
Design principles for explicit feedback:
```python
from dataclasses import dataclass, field
from typing import Optional, List, Dict
from datetime import datetime
from enum import Enum
import random


class FeedbackType(Enum):
    THUMBS_UP = "thumbs_up"
    THUMBS_DOWN = "thumbs_down"
    REPORT_SPAM = "report_spam"
    REPORT_OUTDATED = "report_outdated"
    REPORT_IRRELEVANT = "report_irrelevant"
    REPORT_OFFENSIVE = "report_offensive"
    QUERY_CORRECTION = "query_correction"
    RESULT_CORRECTION = "result_correction"


@dataclass
class ExplicitFeedback:
    """A single explicit feedback event."""
    feedback_id: str
    user_id: Optional[str]
    session_id: str
    query: str
    result_id: str
    result_position: int
    feedback_type: FeedbackType
    timestamp: datetime

    # Optional enrichment
    user_comment: Optional[str] = None
    suggested_correction: Optional[str] = None
    dwell_time_before_feedback: Optional[float] = None  # Seconds

    # Metadata for analysis
    user_tenure_days: Optional[int] = None
    user_search_count: Optional[int] = None
    is_power_user: bool = False


class ExplicitFeedbackCollector:
    """
    System for collecting and processing explicit user feedback.

    Design considerations:
    - Throttle to prevent survey fatigue
    - Weight by user credibility
    - Aggregate across users before acting
    """

    def __init__(self,
                 min_dwell_before_ask: float = 5.0,  # Seconds
                 ask_probability: float = 0.05,      # 5% of eligible
                 cooldown_hours: float = 24.0):      # Per user
        self.min_dwell = min_dwell_before_ask
        self.ask_prob = ask_probability
        self.cooldown = cooldown_hours

    def should_show_feedback_prompt(
        self,
        user_id: str,
        dwell_time: float,
        last_feedback_time: Optional[datetime]
    ) -> bool:
        """
        Determine whether to show a feedback prompt.

        Balances signal collection with user experience.
        """
        # Minimum engagement threshold
        if dwell_time < self.min_dwell:
            return False

        # Cooldown: don't ask the same user too frequently
        if last_feedback_time:
            hours_since = (datetime.now() - last_feedback_time).total_seconds() / 3600
            if hours_since < self.cooldown:
                return False

        # Probabilistic sampling (avoid over-collection)
        return random.random() < self.ask_prob

    def process_feedback(
        self,
        feedback: ExplicitFeedback,
        aggregator: 'FeedbackAggregator'
    ) -> Dict:
        """
        Process a single feedback event.

        Actions depend on feedback type and severity.
        """
        # Weight the feedback
        weight = self._calculate_feedback_weight(feedback)

        # Aggregate into result-level signals
        aggregator.add_feedback(
            result_id=feedback.result_id,
            query=feedback.query,
            feedback_type=feedback.feedback_type,
            weight=weight
        )

        # Immediate action for high-severity reports
        if feedback.feedback_type in [FeedbackType.REPORT_SPAM,
                                      FeedbackType.REPORT_OFFENSIVE]:
            return {
                "action": "escalate_review",
                "priority": "high",
                "result_id": feedback.result_id
            }

        return {"action": "aggregated", "weight": weight}

    def _calculate_feedback_weight(self, feedback: ExplicitFeedback) -> float:
        """
        Weight feedback based on user credibility.

        Not all feedback is equally valuable:
        - Power users who search frequently: more credible
        - Users whose feedback history matches quality: more credible
        - Very new users: might not understand the product
        - Users who report everything as spam: less credible
        """
        weight = 1.0

        # Power user boost
        if feedback.is_power_user:
            weight *= 1.5

        # New user discount (might not understand)
        if feedback.user_tenure_days and feedback.user_tenure_days < 7:
            weight *= 0.7

        # Engaged session boost
        if feedback.dwell_time_before_feedback and feedback.dwell_time_before_feedback > 30:
            weight *= 1.3

        return weight


class FeedbackAggregator:
    """
    Aggregates feedback across users to make decisions.

    Single-user reports can be noise; patterns across users are signal.
    """

    def __init__(self,
                 action_threshold_positive: float = 5.0,
                 action_threshold_negative: float = 3.0):
        self.threshold_positive = action_threshold_positive
        self.threshold_negative = action_threshold_negative
        self.feedback_store: Dict[str, Dict] = {}  # result_id -> aggregated feedback

    def add_feedback(
        self,
        result_id: str,
        query: str,
        feedback_type: FeedbackType,
        weight: float
    ):
        """Add weighted feedback to the aggregation."""
        if result_id not in self.feedback_store:
            self.feedback_store[result_id] = {
                "positive_weight": 0.0,
                "negative_weight": 0.0,
                "queries": set(),
                "feedback_count": 0,
                "report_types": []
            }

        store = self.feedback_store[result_id]
        store["feedback_count"] += 1
        store["queries"].add(query)

        if feedback_type == FeedbackType.THUMBS_UP:
            store["positive_weight"] += weight
        elif feedback_type in [FeedbackType.THUMBS_DOWN,
                               FeedbackType.REPORT_IRRELEVANT]:
            store["negative_weight"] += weight

        if feedback_type.name.startswith("REPORT_"):
            store["report_types"].append(feedback_type)

    def get_actionable_items(self) -> Dict[str, List]:
        """Get items that should be acted on based on aggregated feedback."""
        actions = {
            "demote": [],       # Consistently negative feedback
            "promote": [],      # Consistently positive feedback
            "investigate": [],  # Mixed or suspicious patterns
            "remove": []        # Severe reports
        }

        for result_id, data in self.feedback_store.items():
            net_score = data["positive_weight"] - data["negative_weight"]

            if net_score < -self.threshold_negative:
                actions["demote"].append({
                    "result_id": result_id,
                    "net_score": net_score,
                    "queries": list(data["queries"])
                })

            if net_score > self.threshold_positive:
                actions["promote"].append({
                    "result_id": result_id,
                    "net_score": net_score,
                    "queries": list(data["queries"])
                })

            # Multiple spam/offensive reports = investigate/remove
            spam_reports = sum(1 for t in data["report_types"]
                               if t in [FeedbackType.REPORT_SPAM,
                                        FeedbackType.REPORT_OFFENSIVE])
            if spam_reports >= 3:
                actions["investigate"].append({
                    "result_id": result_id,
                    "report_count": spam_reports
                })

        return actions


@dataclass
class FeedbackPromptConfig:
    """Configuration for feedback UI prompts."""
    # Binary feedback (simplest)
    binary_prompt: str = "Was this result helpful?"
    binary_positive: str = "Yes"
    binary_negative: str = "No"

    # Report options
    report_prompt: str = "What's wrong with this result?"
    report_options: List[str] = field(default_factory=lambda: [
        "Spam or misleading",
        "Outdated information",
        "Not relevant to my search",
        "Offensive content",
        "Other (please describe)"
    ])

    # Optional comment
    comment_prompt: str = "Tell us more (optional)"
    comment_max_length: int = 500
```

Implicit feedback is the gold mine—high volume, no user friction. But interpretation is tricky. A click doesn't mean satisfaction. A non-click doesn't mean irrelevance.
The implicit signal interpretation challenge:
```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
from datetime import datetime


@dataclass
class ImplicitEvent:
    """Base class for implicit feedback events."""
    session_id: str
    timestamp: datetime
    result_id: str
    query: str
    position: int


@dataclass
class ClickEvent(ImplicitEvent):
    dwell_time_seconds: Optional[float] = None
    scroll_depth_percent: Optional[float] = None
    converted: bool = False
    returned_to_serp: bool = False


@dataclass
class ImpressionEvent(ImplicitEvent):
    """Result was shown but not clicked."""
    viewport_visible: bool = True  # Was it above the fold?
    visible_duration_ms: int = 0


class ImplicitSignalInterpreter:
    """
    Interprets implicit user behavior as relevance signals.

    Key challenges:
    1. Position bias: Higher positions get more clicks regardless of relevance
    2. Presentation bias: Snippets affect clicks independent of content
    3. Selection bias: We only see data for what we showed
    4. Trust calibration: Different users have different click patterns
    """

    def __init__(self,
                 long_dwell_threshold: float = 30.0,
                 short_dwell_threshold: float = 10.0,
                 pogo_stick_threshold: float = 5.0):
        self.long_dwell = long_dwell_threshold
        self.short_dwell = short_dwell_threshold
        self.pogo_threshold = pogo_stick_threshold

    def classify_click(self, click: ClickEvent) -> Tuple[str, float]:
        """
        Classify a click event as positive, negative, or neutral.

        Returns (classification, confidence).
        """
        dwell = click.dwell_time_seconds or 0

        # Strong positive signals
        if click.converted:
            return ("strong_positive", 0.95)
        if dwell >= self.long_dwell:
            return ("positive", 0.8)

        # Strong negative signals
        if dwell <= self.pogo_threshold and click.returned_to_serp:
            return ("pogo_stick", 0.85)
        if dwell <= self.short_dwell and click.returned_to_serp:
            return ("negative", 0.6)

        # Ambiguous
        return ("neutral", 0.3)

    def score_result_from_clicks(
        self,
        clicks: List[ClickEvent],
        impressions: int
    ) -> float:
        """
        Compute a relevance score from click patterns.

        Combines click-through rate with click quality.
        """
        if impressions == 0:
            return 0.0

        ctr = len(clicks) / impressions

        # Weight by click quality
        quality_sum = 0.0
        for click in clicks:
            classification, confidence = self.classify_click(click)
            if classification in ["strong_positive", "positive"]:
                quality_sum += confidence
            elif classification in ["pogo_stick", "negative"]:
                quality_sum -= confidence * 0.5  # Penalize negatives less than positives boost
            # Neutral contributes 0

        avg_quality = quality_sum / len(clicks) if clicks else 0

        # Combine CTR and quality:
        # High CTR + negative quality = clickbait
        # Low CTR + positive quality = hidden gem
        return ctr * (0.5 + 0.5 * avg_quality)


class PositionBiasCorrector:
    """
    Corrects for position bias in click data.

    Position bias: Users are more likely to click higher positions
    regardless of relevance. Without correction, top positions get
    all the positive feedback regardless of merit.

    Approaches:
    1. Inverse propensity weighting
    2. Result randomization (costly to user experience)
    3. Click models (examine -> click probability)
    """

    def __init__(self):
        # Estimated examination probabilities by position.
        # These should be learned from data (e.g., eye-tracking studies).
        self.position_examination_prob = {
            1: 0.95, 2: 0.85, 3: 0.70, 4: 0.55, 5: 0.40,
            6: 0.30, 7: 0.22, 8: 0.16, 9: 0.12, 10: 0.10,
        }

    def inverse_propensity_weight(self, position: int) -> float:
        """
        Weight clicks inversely to examination probability.

        Intuition: A click at position 8 is more informative than a
        click at position 1, because fewer users even see position 8.
        """
        exam_prob = self.position_examination_prob.get(position, 0.05)
        # Cap the weight to prevent extreme values
        return min(1.0 / exam_prob, 10.0)

    def correct_ctr(
        self,
        clicks_by_position: dict,
        impressions_by_position: dict
    ) -> dict:
        """
        Compute position-corrected CTR.

        Raw CTR at position 1 might be 50%; raw CTR at position 5 might
        be 10%. But if we correct for the fact that position 5 is only
        examined 40% of the time, its corrected CTR is 10% / 40% = 25%
        (higher relative to position 1's 50% / 95% = 53%).
        """
        corrected = {}
        for pos, clicks in clicks_by_position.items():
            impressions = impressions_by_position.get(pos, 0)
            if impressions == 0:
                continue

            raw_ctr = clicks / impressions
            exam_prob = self.position_examination_prob.get(pos, 0.05)

            # Corrected CTR: assumes a click can only happen if examined
            corrected[pos] = raw_ctr / exam_prob

        return corrected


class QueryRefinementAnalyzer:
    """
    Analyze query refinements as implicit negative feedback.

    If users refine their query, the initial results were inadequate.
    The refinement type tells us why.
    """

    @staticmethod
    def classify_refinement(original: str, refined: str) -> str:
        """Classify the type of query refinement."""
        orig_tokens = set(original.lower().split())
        ref_tokens = set(refined.lower().split())

        added = ref_tokens - orig_tokens
        removed = orig_tokens - ref_tokens

        if removed and not added:
            return "simplification"  # Too specific, broadening
        if added and not removed:
            return "specification"   # Too broad, narrowing
        if added and removed:
            if ref_tokens & orig_tokens:
                return "correction"  # Typo fix or synonym swap (some overlap remains)
            return "pivot"           # Different intent entirely
        return "unknown"

    @staticmethod
    def refinement_to_feedback(
        original: str,
        refined: str,
        original_results: List[str]
    ) -> dict:
        """
        Convert a refinement to an implicit feedback signal.

        Returns feedback for the original query's results.
        """
        refinement_type = QueryRefinementAnalyzer.classify_refinement(original, refined)

        # All original results get a negative signal (varying strength)
        feedback = {}
        for i, result_id in enumerate(original_results):
            position = i + 1
            if refinement_type == "specification":
                # Results too broad - moderate negative
                feedback[result_id] = -0.3 * (1 / position)
            elif refinement_type == "pivot":
                # Wrong intent entirely - strong negative for top results
                feedback[result_id] = -0.8 * (1 / position)
            elif refinement_type == "simplification":
                # Results too specific - weak negative
                feedback[result_id] = -0.1 * (1 / position)
            else:
                feedback[result_id] = -0.2 * (1 / position)

        return feedback
```

Using click data to train ranking creates a feedback loop: high-ranked items get more clicks, more clicks mean higher ranking, and so on. This can reinforce initial biases and suppress good content that never got shown. Use exploration, position bias correction, and periodic evaluation against fresh judgment data.
Collecting feedback is useless without systems to learn from it. This section covers approaches for incorporating feedback into ranking models.
Integration patterns:
| Approach | Description | Speed | Risk | Best For |
|---|---|---|---|---|
| Manual override | Editors manually boost/demote based on feedback | Slow | Low | High-stakes, low-volume queries |
| Rule-based | Automated rules: 'If X reports, demote' | Fast | Medium | Clear-cut cases (spam, offensive) |
| Feature engineering | Feedback becomes features in ML model | Medium | Medium | Existing ML pipeline |
| Online learning | Model updates in real-time from feedback | Very Fast | High | Rapidly changing content |
| Batch retraining | Periodic model retraining with new data | Slow | Low | Stable domains |
```python
from dataclasses import dataclass
from typing import Dict, List, Optional
from datetime import datetime
from abc import ABC, abstractmethod
import numpy as np


@dataclass
class FeedbackSignal:
    """Processed feedback ready for learning."""
    query: str
    result_id: str
    signal_type: str     # e.g., "click", "dwell", "explicit_positive"
    signal_value: float  # Normalized value
    weight: float        # Confidence weight
    timestamp: datetime


class FeedbackIntegrator(ABC):
    """Base class for feedback integration approaches."""

    @abstractmethod
    def incorporate(self, signals: List[FeedbackSignal]):
        pass

    @abstractmethod
    def get_boost(self, query: str, result_id: str) -> float:
        pass


class ManualOverrideIntegrator(FeedbackIntegrator):
    """
    Queue feedback for manual review and override.

    Best for high-stakes situations where automated changes
    could have significant negative impact.
    """

    def __init__(self, threshold_for_review: float = 5.0):
        self.review_queue: List[Dict] = []
        self.manual_overrides: Dict[str, Dict[str, float]] = {}  # query -> result_id -> boost
        self.threshold = threshold_for_review

    def incorporate(self, signals: List[FeedbackSignal]):
        """Add signals to the review queue if the threshold is met."""
        # Aggregate by (query, result_id)
        aggregated = {}
        for signal in signals:
            key = (signal.query, signal.result_id)
            if key not in aggregated:
                aggregated[key] = {"positive": 0.0, "negative": 0.0}
            if signal.signal_value > 0:
                aggregated[key]["positive"] += signal.signal_value * signal.weight
            else:
                aggregated[key]["negative"] += abs(signal.signal_value) * signal.weight

        # Queue items above threshold
        for (query, result_id), scores in aggregated.items():
            net = scores["positive"] - scores["negative"]
            if abs(net) >= self.threshold:
                self.review_queue.append({
                    "query": query,
                    "result_id": result_id,
                    "net_score": net,
                    "recommendation": "promote" if net > 0 else "demote"
                })

    def apply_override(self, query: str, result_id: str, boost: float):
        """Manual editor action."""
        if query not in self.manual_overrides:
            self.manual_overrides[query] = {}
        self.manual_overrides[query][result_id] = boost

    def get_boost(self, query: str, result_id: str) -> float:
        return self.manual_overrides.get(query, {}).get(result_id, 1.0)


class RuleBasedIntegrator(FeedbackIntegrator):
    """
    Apply automated rules based on feedback patterns.

    Rules are explicit, auditable, and controllable.
    Good for well-understood signal patterns.
    """

    def __init__(self):
        self.result_scores: Dict[str, Dict[str, Dict]] = {}
        self.rules = [
            self._spam_report_rule,
            self._consistent_negative_rule,
            self._consistent_positive_rule,
        ]

    def incorporate(self, signals: List[FeedbackSignal]):
        for signal in signals:
            if signal.query not in self.result_scores:
                self.result_scores[signal.query] = {}
            if signal.result_id not in self.result_scores[signal.query]:
                self.result_scores[signal.query][signal.result_id] = {
                    "positive": 0, "negative": 0, "spam_reports": 0
                }

            scores = self.result_scores[signal.query][signal.result_id]
            if signal.signal_type == "spam_report":
                scores["spam_reports"] += 1
            elif signal.signal_value > 0:
                scores["positive"] += signal.signal_value * signal.weight
            else:
                scores["negative"] += abs(signal.signal_value) * signal.weight

    def _spam_report_rule(self, scores: Dict) -> Optional[float]:
        """Multiple spam reports -> severe demotion."""
        if scores.get("spam_reports", 0) >= 3:
            return 0.1  # 90% demotion
        return None

    def _consistent_negative_rule(self, scores: Dict) -> Optional[float]:
        """Consistent negative feedback -> moderate demotion."""
        if scores.get("negative", 0) > 5 and scores.get("positive", 0) < 1:
            return 0.5  # 50% demotion
        return None

    def _consistent_positive_rule(self, scores: Dict) -> Optional[float]:
        """Consistent positive feedback -> promotion."""
        if scores.get("positive", 0) > 10 and scores.get("negative", 0) < 2:
            return 1.3  # 30% boost
        return None

    def get_boost(self, query: str, result_id: str) -> float:
        scores = self.result_scores.get(query, {}).get(result_id, {})
        # Apply rules in priority order
        for rule in self.rules:
            boost = rule(scores)
            if boost is not None:
                return boost
        return 1.0  # No rule matched


class FeatureEngineeringIntegrator(FeedbackIntegrator):
    """
    Convert feedback into features for an ML ranking model.

    Doesn't directly modify rankings—provides signal for the
    learning-to-rank model to use.
    """

    def __init__(self, decay_half_life_days: float = 30.0):
        self.decay_half_life = decay_half_life_days
        self.feedback_features: Dict[str, Dict] = {}  # result_id -> features

    def incorporate(self, signals: List[FeedbackSignal]):
        """Update the feature store with new signals."""
        for signal in signals:
            if signal.result_id not in self.feedback_features:
                self.feedback_features[signal.result_id] = self._init_features()

            features = self.feedback_features[signal.result_id]
            decay = self._time_decay(signal.timestamp)

            # Update running statistics
            if signal.signal_value > 0:
                features["positive_feedback_score"] += signal.signal_value * signal.weight * decay
                features["positive_feedback_count"] += 1
            else:
                features["negative_feedback_score"] += abs(signal.signal_value) * signal.weight * decay
                features["negative_feedback_count"] += 1

            # Update type-specific features
            type_key = f"feedback_{signal.signal_type}_score"
            if type_key in features:
                features[type_key] += signal.signal_value * signal.weight * decay

    def _init_features(self) -> Dict:
        return {
            "positive_feedback_score": 0.0,
            "negative_feedback_score": 0.0,
            "positive_feedback_count": 0,
            "negative_feedback_count": 0,
            "feedback_click_score": 0.0,
            "feedback_dwell_score": 0.0,
            "feedback_explicit_score": 0.0,
        }

    def _time_decay(self, timestamp: datetime) -> float:
        """Apply exponential time decay to older signals."""
        age_days = (datetime.now() - timestamp).days
        return 0.5 ** (age_days / self.decay_half_life)

    def get_features(self, result_id: str) -> Dict:
        """Get the feature vector for a result."""
        if result_id not in self.feedback_features:
            return self._init_features()

        features = self.feedback_features[result_id].copy()

        # Derived features
        total_pos = features["positive_feedback_count"]
        total_neg = features["negative_feedback_count"]
        total = total_pos + total_neg

        if total > 0:
            features["feedback_positive_ratio"] = total_pos / total
            features["feedback_net_score"] = (
                features["positive_feedback_score"] -
                features["negative_feedback_score"]
            )
        else:
            features["feedback_positive_ratio"] = 0.5
            features["feedback_net_score"] = 0.0

        return features

    def get_boost(self, query: str, result_id: str) -> float:
        # This integrator doesn't directly boost—the ML model uses the features
        return 1.0


class OnlineLearningIntegrator(FeedbackIntegrator):
    """
    Update model weights in real-time from feedback.

    Most responsive but highest risk—bad feedback can quickly degrade
    rankings. Uses a multi-armed bandit approach for
    exploration-exploitation.
    """

    def __init__(self,
                 learning_rate: float = 0.01,
                 exploration_rate: float = 0.1):
        self.learning_rate = learning_rate
        self.exploration = exploration_rate
        self.result_values: Dict[str, Dict] = {}  # query -> result_id -> Beta params

    def incorporate(self, signals: List[FeedbackSignal]):
        """Update bandit parameters from new signals."""
        for signal in signals:
            if signal.query not in self.result_values:
                self.result_values[signal.query] = {}
            if signal.result_id not in self.result_values[signal.query]:
                # Initialize with a prior (Beta distribution parameters)
                self.result_values[signal.query][signal.result_id] = {
                    "alpha": 1.0,  # Pseudo-successes
                    "beta": 1.0,   # Pseudo-failures
                }

            params = self.result_values[signal.query][signal.result_id]

            # Update the Beta distribution based on feedback
            if signal.signal_value > 0:
                params["alpha"] += signal.signal_value * signal.weight
            else:
                params["beta"] += abs(signal.signal_value) * signal.weight

    def get_boost(self, query: str, result_id: str) -> float:
        """
        Get a boost using Thompson Sampling.

        Sample from the posterior to balance exploration and exploitation.
        """
        if query not in self.result_values:
            return 1.0
        if result_id not in self.result_values[query]:
            return 1.0

        params = self.result_values[query][result_id]

        # Sample from the Beta distribution
        sampled_value = np.random.beta(params["alpha"], params["beta"])

        # Convert to a boost factor (sampled_value of 0.5 = neutral boost of 1.0)
        # sampled_value ≈ 0.3 → boost ≈ 0.6
        # sampled_value ≈ 0.7 → boost ≈ 1.4
        boost = 2 * sampled_value
        return boost
```

The ultimate goal is a search system that continuously improves from user feedback without manual intervention. This requires careful architecture to balance responsiveness with stability.
Key components of continuous learning:
```python
from dataclasses import dataclass
from typing import Dict, List, Optional
from datetime import datetime, timedelta
from enum import Enum
import asyncio


class ModelState(Enum):
    TRAINING = "training"
    VALIDATING = "validating"
    CANARY = "canary"
    PRODUCTION = "production"
    ROLLBACK = "rollback"


@dataclass
class ModelVersion:
    """A versioned model with metadata."""
    version_id: str
    created_at: datetime
    training_data_end: datetime
    metrics: Dict[str, float]  # Validation metrics
    state: ModelState


class ContinuousLearningPipeline:
    """
    Orchestrates continuous learning from user feedback.

    Pipeline stages:
    1. Collect: Stream user interactions
    2. Process: Convert to training signals
    3. Train: Update model with new data
    4. Validate: Ensure model quality
    5. Deploy: Gradual rollout
    6. Monitor: Watch for issues
    """

    def __init__(self,
                 training_frequency_hours: float = 24.0,
                 min_samples_for_training: int = 10000,
                 validation_threshold: float = 0.95):
        self.training_freq = training_frequency_hours
        self.min_samples = min_samples_for_training
        self.validation_threshold = validation_threshold
        self.current_model: Optional[ModelVersion] = None
        self.candidate_model: Optional[ModelVersion] = None
        self.signal_buffer: List = []

    async def run_training_cycle(self):
        """
        Execute one training cycle.

        Called periodically (e.g., daily or when enough data accumulates).
        """
        # Step 1: Collect processed signals
        signals = await self.collect_signals()
        if len(signals) < self.min_samples:
            return {"status": "skipped", "reason": "insufficient_data"}

        # Step 2: Train new model
        new_model = await self.train_model(signals)
        new_model.state = ModelState.VALIDATING

        # Step 3: Validate offline
        validation_result = await self.validate_model(new_model)
        if not validation_result["passed"]:
            return {"status": "failed", "reason": "validation_failed",
                    "metrics": validation_result}

        # Step 4: Deploy to canary
        new_model.state = ModelState.CANARY
        await self.deploy_canary(new_model)

        # Step 5: Monitor canary
        canary_result = await self.monitor_canary(new_model, duration_hours=2.0)
        if not canary_result["passed"]:
            await self.rollback_canary()
            return {"status": "rolled_back", "reason": "canary_failed"}

        # Step 6: Promote to production
        new_model.state = ModelState.PRODUCTION
        await self.promote_to_production(new_model)

        return {"status": "success", "model_version": new_model.version_id}

    async def collect_signals(self) -> List:
        """Collect and process feedback signals."""
        # In practice: read from a Kafka/Kinesis stream
        signals = self.signal_buffer.copy()
        self.signal_buffer = []
        return signals

    async def train_model(self, signals: List) -> ModelVersion:
        """Train a new model version."""
        # In practice: launch a training job (Spark, SageMaker, etc.)
        return ModelVersion(
            version_id=f"model_{datetime.now().isoformat()}",
            created_at=datetime.now(),
            training_data_end=datetime.now(),
            metrics={},
            state=ModelState.TRAINING
        )

    async def validate_model(self, model: ModelVersion) -> Dict:
        """
        Validate the model against held-out data and baselines.

        Checks:
        1. Offline metrics vs baseline (NDCG, MRR)
        2. No regression on critical query types
        3. Consistent behavior (no dramatic ranking changes)
        """
        # Compare against the baseline model
        baseline_metrics = self.current_model.metrics if self.current_model else {}

        checks = []

        # Check 1: Overall metric improvement
        ndcg_ratio = model.metrics.get("ndcg", 0) / baseline_metrics.get("ndcg", 1)
        checks.append({
            "check": "ndcg_improvement",
            "passed": ndcg_ratio >= self.validation_threshold,
            "value": ndcg_ratio
        })

        # Check 2: No major regression on any query segment
        for segment in ["navigational", "informational", "transactional"]:
            seg_key = f"ndcg_{segment}"
            if seg_key in model.metrics and seg_key in baseline_metrics:
                ratio = model.metrics[seg_key] / baseline_metrics[seg_key]
                checks.append({
                    "check": f"{segment}_stability",
                    "passed": ratio >= 0.98,  # Allow 2% regression per segment
                    "value": ratio
                })

        all_passed = all(c["passed"] for c in checks)
        return {"passed": all_passed, "checks": checks}

    async def deploy_canary(self, model: ModelVersion):
        """Deploy to a small percentage of traffic."""
        # In practice: update load balancer / feature flag
        pass

    async def monitor_canary(self, model: ModelVersion, duration_hours: float) -> Dict:
        """
        Monitor the canary deployment for issues.

        Watch for:
        - Error rate spikes
        - Latency increases
        - Metric degradation vs control
        """
        # In practice: query the metrics system, compare canary vs control.
        # For now, simulate monitoring.
        await asyncio.sleep(0.1)  # Would wait for duration_hours

        return {
            "passed": True,
            "error_rate_delta": 0.001,
            "latency_delta_ms": 2,
            "ctr_delta": 0.02
        }

    async def rollback_canary(self):
        """Roll back the canary deployment."""
        pass

    async def promote_to_production(self, model: ModelVersion):
        """Promote the canary to full production."""
        self.current_model = model


class FeedbackGuardrails:
    """
    Safety mechanisms to prevent feedback exploitation.

    Bad actors may try to manipulate rankings by:
    - Generating fake positive feedback
    - Coordinated negative feedback attacks
    - Gaming click patterns
    """

    @staticmethod
    def detect_coordinated_attack(
        signals: List['FeedbackSignal'],
        window_hours: float = 1.0,
        threshold_ratio: float = 10.0
    ) -> List[str]:
        """
        Detect suspiciously coordinated feedback.

        Returns a list of result_ids that appear to be under attack.
        """
        suspicious = []
        now = datetime.now()
        cutoff = now - timedelta(hours=window_hours)

        # Group recent signals by result_id
        by_result = {}
        for signal in signals:
            if signal.timestamp > cutoff:
                if signal.result_id not in by_result:
                    by_result[signal.result_id] = []
                by_result[signal.result_id].append(signal)

        # Check for abnormal patterns
        for result_id, result_signals in by_result.items():
            if len(result_signals) < 5:
                continue

            # Check: all signals with the same polarity = suspicious
            values = [s.signal_value for s in result_signals]
            if all(v > 0 for v in values) or all(v < 0 for v in values):
                # Check the signal rate vs historical
                signal_rate = len(result_signals) / window_hours
                # Would compare against the historical rate
                historical_rate = 1.0  # Placeholder
                if signal_rate > historical_rate * threshold_ratio:
                    suspicious.append(result_id)

        return suspicious

    @staticmethod
    def rate_limit_user_feedback(
        user_id: str,
        feedback_count_today: int,
        max_daily_feedback: int = 20
    ) -> bool:
        """Rate-limit feedback per user to prevent manipulation."""
        return feedback_count_today < max_daily_feedback

    @staticmethod
    def detect_bot_patterns(
        user_signals: List['FeedbackSignal']
    ) -> float:
        """
        Detect bot-like feedback patterns.

        Returns probability of bot (0-1).
        """
        if len(user_signals) < 3:
            return 0.0

        # Check for inhuman patterns
        bot_score = 0.0

        # Pattern: exactly regular timing
        intervals = []
        for i in range(1, len(user_signals)):
            delta = (user_signals[i].timestamp -
                     user_signals[i - 1].timestamp).total_seconds()
            intervals.append(delta)

        if len(intervals) >= 2:
            avg_interval = sum(intervals) / len(intervals)
            variance = sum((i - avg_interval) ** 2 for i in intervals) / len(intervals)
            # Very low variance = suspicious regularity
            if variance < 1.0 and avg_interval < 5.0:
                bot_score += 0.5

        # Pattern: no dwell time (instant feedback)
        instant_feedback = sum(
            1 for s in user_signals
            if s.signal_type == "explicit" and (getattr(s, 'dwell_before', 0) or 1) < 2
        )
        if instant_feedback / len(user_signals) > 0.8:
            bot_score += 0.3

        return min(1.0, bot_score)
```

Even with automation, maintain human oversight. Set up alerts for unusual model behavior. Review a sample of feedback-driven changes weekly. Automated systems should have escalation paths to human reviewers for edge cases.
User feedback incorporation transforms search from a static system into a learning system. Every interaction becomes an opportunity to improve. But this power comes with responsibility—to handle feedback correctly, avoid manipulation, and maintain quality.
With this page, we complete our exploration of Search Relevance Tuning. From understanding relevance factors, through boosting and personalization, to rigorous A/B testing and feedback loops, you now have a comprehensive toolkit for building and continuously improving search quality.
Congratulations! You've completed the module on Search Relevance Tuning. You now understand relevance factors and scoring functions, how to boost fields and queries for specific goals, personalization from profile construction to privacy, A/B testing for rigorous quality measurement, and feedback incorporation for continuous improvement. These skills form the foundation for building world-class search experiences that users love.