Every search interaction is a learning opportunity. When a user clicks the third result instead of the first, they're telling us something about relevance. When they rephrase a query, they're showing us the gap between what they wanted and what we returned. When they explicitly report a bad result, they're investing effort to improve the system.
User feedback incorporation is the practice of systematically collecting, analyzing, and acting on these signals to continuously improve search quality. It closes the loop between user intent and system behavior, creating a search engine that gets smarter with every query.
This final page in our relevance tuning module explores the full spectrum of user feedback—from implicit behavioral signals to explicit ratings, from manual annotation to machine learning-driven continuous improvement.
By the end of this page, you will understand the taxonomy of user feedback types, how to collect and process feedback at scale, techniques for incorporating feedback into ranking models, the challenges of feedback loops and position bias, and how to build continuous learning systems.
User feedback comes in many forms, each with different signal quality, volume, and collection costs.
The feedback spectrum:
| Type | Examples | Signal Quality | Volume | Collection Cost |
|---|---|---|---|---|
| Explicit Direct | Thumbs up/down, 5-star ratings, 'Was this helpful?' | High (clear intent) | Low (few users bother) | Low (UI element) |
| Explicit Report | Report bad result, flag spam, correction suggestions | Very High (specific) | Very Low (motivated users) | Low (UI element) |
| Explicit Survey | Post-search satisfaction surveys, NPS | High (structured) | Very Low (intrusive) | Medium (survey design) |
| Implicit Primary | Clicks, purchases, conversions | Medium (action ≠ satisfaction) | Very High (every interaction) | Low (logging) |
| Implicit Secondary | Dwell time, scroll depth, return visits | Medium (requires interpretation) | High (computed from logs) | Medium (tracking) |
| Implicit Negative | Skip, quick return, query refinement | Medium (absence of signal) | High (requires inference) | Medium (complex logic) |
| Editorial Judgment | Human raters, quality editors | Very High (expert assessment) | Low (expensive, slow) | Very High (human cost) |
The quality-volume trade-off:
Explicit feedback is cleaner but rare. Most users don't rate results—they just use them (or don't). This means relying primarily on implicit signals, which are abundant but noisy.
The art of feedback incorporation is combining these signals appropriately: using rare explicit feedback to calibrate and validate, while leveraging abundant implicit signals for coverage.
Expect roughly 1% of users to provide explicit feedback, and among those who do, negative feedback is 3-10x more likely than positive (negativity bias). Design systems that work with implicit signals first, then enhance with explicit signals where available.
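One common way to combine the two sources is a shrinkage-style blend: trust the abundant implicit score by default, and shift weight toward the sparse explicit score only as votes accumulate. A minimal sketch (the `k` smoothing constant and the assumption that both scores share a 0-1 scale are illustrative, not values from this page):

```python
def blended_relevance(implicit_score: float,
                      explicit_votes: int,
                      explicit_score: float,
                      k: float = 10.0) -> float:
    """Blend abundant implicit signal with sparse explicit signal.

    The explicit score takes over gradually as votes accumulate:
    with zero votes we trust the implicit data entirely; with many
    votes (>> k) the explicit signal dominates.
    """
    trust = explicit_votes / (explicit_votes + k)  # Goes 0 -> 1 as votes grow
    return (1 - trust) * implicit_score + trust * explicit_score
```

With `k = 10`, ten explicit votes split trust 50/50 between the two signals; a result with no explicit feedback falls back to its implicit score unchanged.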
Explicit feedback requires the user to take a deliberate action to rate or report a result. Designing effective collection means minimizing friction while maximizing signal quality.
Design principles for explicit feedback:
```python
from dataclasses import dataclass, field
from typing import Optional, List, Dict
from datetime import datetime
from enum import Enum
import random


class FeedbackType(Enum):
    THUMBS_UP = "thumbs_up"
    THUMBS_DOWN = "thumbs_down"
    REPORT_SPAM = "report_spam"
    REPORT_OUTDATED = "report_outdated"
    REPORT_IRRELEVANT = "report_irrelevant"
    REPORT_OFFENSIVE = "report_offensive"
    QUERY_CORRECTION = "query_correction"
    RESULT_CORRECTION = "result_correction"


@dataclass
class ExplicitFeedback:
    """A single explicit feedback event."""
    feedback_id: str
    user_id: Optional[str]
    session_id: str
    query: str
    result_id: str
    result_position: int
    feedback_type: FeedbackType
    timestamp: datetime

    # Optional enrichment
    user_comment: Optional[str] = None
    suggested_correction: Optional[str] = None
    dwell_time_before_feedback: Optional[float] = None  # Seconds

    # Metadata for analysis
    user_tenure_days: Optional[int] = None
    user_search_count: Optional[int] = None
    is_power_user: bool = False


class ExplicitFeedbackCollector:
    """
    System for collecting and processing explicit user feedback.

    Design considerations:
    - Throttle to prevent survey fatigue
    - Weight by user credibility
    - Aggregate across users before acting
    """

    def __init__(self,
                 min_dwell_before_ask: float = 5.0,  # Seconds
                 ask_probability: float = 0.05,      # 5% of eligible
                 cooldown_hours: float = 24.0):      # Per user
        self.min_dwell = min_dwell_before_ask
        self.ask_prob = ask_probability
        self.cooldown = cooldown_hours

    def should_show_feedback_prompt(
        self,
        user_id: str,
        dwell_time: float,
        last_feedback_time: Optional[datetime]
    ) -> bool:
        """
        Determine whether to show a feedback prompt.

        Balances signal collection with user experience.
        """
        # Minimum engagement threshold
        if dwell_time < self.min_dwell:
            return False

        # Cooldown: don't ask the same user too frequently
        if last_feedback_time:
            hours_since = (datetime.now() - last_feedback_time).total_seconds() / 3600
            if hours_since < self.cooldown:
                return False

        # Probabilistic sampling (avoid over-collection)
        return random.random() < self.ask_prob

    def process_feedback(
        self,
        feedback: ExplicitFeedback,
        aggregator: 'FeedbackAggregator'
    ) -> Dict:
        """
        Process a single feedback event.

        Actions depend on feedback type and severity.
        """
        # Weight the feedback
        weight = self._calculate_feedback_weight(feedback)

        # Aggregate into result-level signals
        aggregator.add_feedback(
            result_id=feedback.result_id,
            query=feedback.query,
            feedback_type=feedback.feedback_type,
            weight=weight
        )

        # Immediate action for high-severity reports
        if feedback.feedback_type in [FeedbackType.REPORT_SPAM,
                                      FeedbackType.REPORT_OFFENSIVE]:
            return {
                "action": "escalate_review",
                "priority": "high",
                "result_id": feedback.result_id
            }

        return {"action": "aggregated", "weight": weight}

    def _calculate_feedback_weight(self, feedback: ExplicitFeedback) -> float:
        """
        Weight feedback based on user credibility.

        Not all feedback is equally valuable:
        - Power users who search frequently: more credible
        - Users whose feedback history matches quality: more credible
        - Very new users: might not understand the product
        - Users who report everything as spam: less credible
        """
        weight = 1.0

        # Power user boost
        if feedback.is_power_user:
            weight *= 1.5

        # New user discount (might not understand)
        if feedback.user_tenure_days and feedback.user_tenure_days < 7:
            weight *= 0.7

        # Engaged session boost
        if feedback.dwell_time_before_feedback and feedback.dwell_time_before_feedback > 30:
            weight *= 1.3

        return weight


class FeedbackAggregator:
    """
    Aggregates feedback across users to make decisions.

    Single-user reports can be noise; patterns across users are signal.
    """

    def __init__(self,
                 action_threshold_positive: float = 5.0,
                 action_threshold_negative: float = 3.0):
        self.threshold_positive = action_threshold_positive
        self.threshold_negative = action_threshold_negative
        self.feedback_store: Dict[str, Dict] = {}  # result_id -> aggregated feedback

    def add_feedback(
        self,
        result_id: str,
        query: str,
        feedback_type: FeedbackType,
        weight: float
    ):
        """Add weighted feedback to the aggregation."""
        if result_id not in self.feedback_store:
            self.feedback_store[result_id] = {
                "positive_weight": 0.0,
                "negative_weight": 0.0,
                "queries": set(),
                "feedback_count": 0,
                "report_types": []
            }

        store = self.feedback_store[result_id]
        store["feedback_count"] += 1
        store["queries"].add(query)

        if feedback_type == FeedbackType.THUMBS_UP:
            store["positive_weight"] += weight
        elif feedback_type in [FeedbackType.THUMBS_DOWN,
                               FeedbackType.REPORT_IRRELEVANT]:
            store["negative_weight"] += weight

        if feedback_type.name.startswith("REPORT_"):
            store["report_types"].append(feedback_type)

    def get_actionable_items(self) -> Dict[str, List]:
        """Get items that should be acted on based on aggregated feedback."""
        actions = {
            "demote": [],       # Consistently negative feedback
            "promote": [],      # Consistently positive feedback
            "investigate": [],  # Mixed or suspicious patterns
            "remove": []        # Severe reports
        }

        for result_id, data in self.feedback_store.items():
            net_score = data["positive_weight"] - data["negative_weight"]

            if net_score < -self.threshold_negative:
                actions["demote"].append({
                    "result_id": result_id,
                    "net_score": net_score,
                    "queries": list(data["queries"])
                })

            if net_score > self.threshold_positive:
                actions["promote"].append({
                    "result_id": result_id,
                    "net_score": net_score,
                    "queries": list(data["queries"])
                })

            # Multiple spam/offensive reports = investigate/remove
            spam_reports = sum(1 for t in data["report_types"]
                               if t in [FeedbackType.REPORT_SPAM,
                                        FeedbackType.REPORT_OFFENSIVE])
            if spam_reports >= 3:
                actions["investigate"].append({
                    "result_id": result_id,
                    "report_count": spam_reports
                })

        return actions


@dataclass
class FeedbackPromptConfig:
    """Configuration for feedback UI prompts."""
    # Binary feedback (simplest)
    binary_prompt: str = "Was this result helpful?"
    binary_positive: str = "Yes"
    binary_negative: str = "No"

    # Report options
    report_prompt: str = "What's wrong with this result?"
    report_options: List[str] = field(default_factory=lambda: [
        "Spam or misleading",
        "Outdated information",
        "Not relevant to my search",
        "Offensive content",
        "Other (please describe)"
    ])

    # Optional comment
    comment_prompt: str = "Tell us more (optional)"
    comment_max_length: int = 500
```

Implicit feedback is the gold mine—high volume, no user friction. But interpretation is tricky. A click doesn't mean satisfaction. A non-click doesn't mean irrelevance.
The implicit signal interpretation challenge:
```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
from datetime import datetime


@dataclass
class ImplicitEvent:
    """Base class for implicit feedback events."""
    session_id: str
    timestamp: datetime
    result_id: str
    query: str
    position: int


@dataclass
class ClickEvent(ImplicitEvent):
    dwell_time_seconds: Optional[float] = None
    scroll_depth_percent: Optional[float] = None
    converted: bool = False
    returned_to_serp: bool = False


@dataclass
class ImpressionEvent(ImplicitEvent):
    """Result was shown but not clicked."""
    viewport_visible: bool = True  # Was it above the fold?
    visible_duration_ms: int = 0


class ImplicitSignalInterpreter:
    """
    Interprets implicit user behavior as relevance signals.

    Key challenges:
    1. Position bias: Higher positions get more clicks regardless of relevance
    2. Presentation bias: Snippets affect clicks independent of content
    3. Selection bias: We only see data for what we showed
    4. Trust calibration: Different users have different click patterns
    """

    def __init__(self,
                 long_dwell_threshold: float = 30.0,
                 short_dwell_threshold: float = 10.0,
                 pogo_stick_threshold: float = 5.0):
        self.long_dwell = long_dwell_threshold
        self.short_dwell = short_dwell_threshold
        self.pogo_threshold = pogo_stick_threshold

    def classify_click(self, click: ClickEvent) -> Tuple[str, float]:
        """
        Classify a click event as positive, negative, or neutral.

        Returns (classification, confidence).
        """
        dwell = click.dwell_time_seconds or 0

        # Strong positive signals
        if click.converted:
            return ("strong_positive", 0.95)
        if dwell >= self.long_dwell:
            return ("positive", 0.8)

        # Strong negative signals
        if dwell <= self.pogo_threshold and click.returned_to_serp:
            return ("pogo_stick", 0.85)
        if dwell <= self.short_dwell and click.returned_to_serp:
            return ("negative", 0.6)

        # Ambiguous
        return ("neutral", 0.3)

    def score_result_from_clicks(
        self,
        clicks: List[ClickEvent],
        impressions: int
    ) -> float:
        """
        Compute a relevance score from click patterns.

        Combines click-through rate with click quality.
        """
        if impressions == 0:
            return 0.0

        ctr = len(clicks) / impressions

        # Weight by click quality
        quality_sum = 0.0
        for click in clicks:
            classification, confidence = self.classify_click(click)
            if classification in ["strong_positive", "positive"]:
                quality_sum += confidence
            elif classification in ["pogo_stick", "negative"]:
                quality_sum -= confidence * 0.5  # Penalize negatives less than positives boost
            # Neutral contributes 0

        avg_quality = quality_sum / len(clicks) if clicks else 0

        # Combine CTR and quality:
        # High CTR + negative quality = clickbait
        # Low CTR + positive quality = hidden gem
        return ctr * (0.5 + 0.5 * avg_quality)


class PositionBiasCorrector:
    """
    Corrects for position bias in click data.

    Position bias: Users are more likely to click higher positions
    regardless of relevance. Without correction, top positions get
    all the positive feedback regardless of merit.

    Approaches:
    1. Inverse propensity weighting
    2. Result randomization (costly to user experience)
    3. Click models (examine -> click probability)
    """

    def __init__(self):
        # Estimated examination probabilities by position.
        # These should be learned from data (e.g., eye-tracking studies).
        self.position_examination_prob = {
            1: 0.95, 2: 0.85, 3: 0.70, 4: 0.55, 5: 0.40,
            6: 0.30, 7: 0.22, 8: 0.16, 9: 0.12, 10: 0.10,
        }

    def inverse_propensity_weight(self, position: int) -> float:
        """
        Weight clicks inversely to examination probability.

        Intuition: A click at position 8 is more informative than a
        click at position 1, because fewer users even see position 8.
        """
        exam_prob = self.position_examination_prob.get(position, 0.05)
        # Cap the weight to prevent extreme values
        return min(1.0 / exam_prob, 10.0)

    def correct_ctr(
        self,
        clicks_by_position: dict,
        impressions_by_position: dict
    ) -> dict:
        """
        Compute position-corrected CTR.

        Raw CTR at position 1 might be 50%; raw CTR at position 5 might
        be 10%. But if we correct for the fact that position 5 is only
        examined 40% of the time, its corrected CTR is 10% / 40% = 25%
        (higher relative to position 1's 50% / 95% = 53%).
        """
        corrected = {}
        for pos, clicks in clicks_by_position.items():
            impressions = impressions_by_position.get(pos, 0)
            if impressions == 0:
                continue

            raw_ctr = clicks / impressions
            exam_prob = self.position_examination_prob.get(pos, 0.05)

            # Corrected CTR: assumes a click can only happen if examined
            corrected[pos] = raw_ctr / exam_prob

        return corrected


class QueryRefinementAnalyzer:
    """
    Analyze query refinements as implicit negative feedback.

    If users refine their query, the initial results were inadequate.
    The refinement type tells us why.
    """

    @staticmethod
    def classify_refinement(original: str, refined: str) -> str:
        """Classify the type of query refinement."""
        orig_tokens = set(original.lower().split())
        ref_tokens = set(refined.lower().split())

        added = ref_tokens - orig_tokens
        removed = orig_tokens - ref_tokens

        if removed and not added:
            return "simplification"  # Too specific, broadening
        if added and not removed:
            return "specification"   # Too broad, narrowing
        if added and removed:
            if ref_tokens & orig_tokens:
                return "correction"  # Typo fix or synonym swap (some overlap remains)
            return "pivot"           # Different intent entirely
        return "unknown"

    @staticmethod
    def refinement_to_feedback(
        original: str,
        refined: str,
        original_results: List[str]
    ) -> dict:
        """
        Convert a refinement to an implicit feedback signal.

        Returns feedback for the original query's results.
        """
        refinement_type = QueryRefinementAnalyzer.classify_refinement(original, refined)

        # All original results get a negative signal (varying strength)
        feedback = {}
        for i, result_id in enumerate(original_results):
            position = i + 1
            if refinement_type == "specification":
                # Results too broad - moderate negative
                feedback[result_id] = -0.3 * (1 / position)
            elif refinement_type == "pivot":
                # Wrong intent entirely - strong negative for top results
                feedback[result_id] = -0.8 * (1 / position)
            elif refinement_type == "simplification":
                # Results too specific - weak negative
                feedback[result_id] = -0.1 * (1 / position)
            else:
                feedback[result_id] = -0.2 * (1 / position)

        return feedback
```

Using click data to train ranking creates a feedback loop: high-ranked items get more clicks, more clicks mean higher ranking, and so on. This can reinforce initial biases and suppress good content that never got shown. Use exploration, position bias correction, and periodic evaluation against fresh judgment data.
Collecting feedback is useless without systems to learn from it. This section covers approaches for incorporating feedback into ranking models.
Integration patterns:
| Approach | Description | Speed | Risk | Best For |
|---|---|---|---|---|
| Manual override | Editors manually boost/demote based on feedback | Slow | Low | High-stakes, low-volume queries |
| Rule-based | Automated rules: 'If X reports, demote' | Fast | Medium | Clear-cut cases (spam, offensive) |
| Feature engineering | Feedback becomes features in ML model | Medium | Medium | Existing ML pipeline |
| Online learning | Model updates in real-time from feedback | Very Fast | High | Rapidly changing content |
| Batch retraining | Periodic model retraining with new data | Slow | Low | Stable domains |
```python
from dataclasses import dataclass
from typing import Dict, List, Optional
from datetime import datetime
from abc import ABC, abstractmethod
import numpy as np


@dataclass
class FeedbackSignal:
    """Processed feedback ready for learning."""
    query: str
    result_id: str
    signal_type: str     # e.g., "click", "dwell", "explicit_positive"
    signal_value: float  # Normalized value
    weight: float        # Confidence weight
    timestamp: datetime


class FeedbackIntegrator(ABC):
    """Base class for feedback integration approaches."""

    @abstractmethod
    def incorporate(self, signals: List[FeedbackSignal]):
        pass

    @abstractmethod
    def get_boost(self, query: str, result_id: str) -> float:
        pass


class ManualOverrideIntegrator(FeedbackIntegrator):
    """
    Queue feedback for manual review and override.

    Best for high-stakes situations where automated changes
    could have significant negative impact.
    """

    def __init__(self, threshold_for_review: float = 5.0):
        self.review_queue: List[Dict] = []
        self.manual_overrides: Dict[str, Dict[str, float]] = {}  # query -> result_id -> boost
        self.threshold = threshold_for_review

    def incorporate(self, signals: List[FeedbackSignal]):
        """Add signals to the review queue if the threshold is met."""
        # Aggregate by (query, result_id)
        aggregated = {}
        for signal in signals:
            key = (signal.query, signal.result_id)
            if key not in aggregated:
                aggregated[key] = {"positive": 0.0, "negative": 0.0}
            if signal.signal_value > 0:
                aggregated[key]["positive"] += signal.signal_value * signal.weight
            else:
                aggregated[key]["negative"] += abs(signal.signal_value) * signal.weight

        # Queue items above threshold
        for (query, result_id), scores in aggregated.items():
            net = scores["positive"] - scores["negative"]
            if abs(net) >= self.threshold:
                self.review_queue.append({
                    "query": query,
                    "result_id": result_id,
                    "net_score": net,
                    "recommendation": "promote" if net > 0 else "demote"
                })

    def apply_override(self, query: str, result_id: str, boost: float):
        """Manual editor action."""
        if query not in self.manual_overrides:
            self.manual_overrides[query] = {}
        self.manual_overrides[query][result_id] = boost

    def get_boost(self, query: str, result_id: str) -> float:
        return self.manual_overrides.get(query, {}).get(result_id, 1.0)


class RuleBasedIntegrator(FeedbackIntegrator):
    """
    Apply automated rules based on feedback patterns.

    Rules are explicit, auditable, and controllable.
    Good for well-understood signal patterns.
    """

    def __init__(self):
        self.result_scores: Dict[str, Dict[str, Dict]] = {}
        self.rules = [
            self._spam_report_rule,
            self._consistent_negative_rule,
            self._consistent_positive_rule,
        ]

    def incorporate(self, signals: List[FeedbackSignal]):
        for signal in signals:
            if signal.query not in self.result_scores:
                self.result_scores[signal.query] = {}
            if signal.result_id not in self.result_scores[signal.query]:
                self.result_scores[signal.query][signal.result_id] = {
                    "positive": 0, "negative": 0, "spam_reports": 0
                }

            scores = self.result_scores[signal.query][signal.result_id]
            if signal.signal_type == "spam_report":
                scores["spam_reports"] += 1
            elif signal.signal_value > 0:
                scores["positive"] += signal.signal_value * signal.weight
            else:
                scores["negative"] += abs(signal.signal_value) * signal.weight

    def _spam_report_rule(self, scores: Dict) -> Optional[float]:
        """Multiple spam reports -> severe demotion."""
        if scores.get("spam_reports", 0) >= 3:
            return 0.1  # 90% demotion
        return None

    def _consistent_negative_rule(self, scores: Dict) -> Optional[float]:
        """Consistent negative feedback -> moderate demotion."""
        if scores.get("negative", 0) > 5 and scores.get("positive", 0) < 1:
            return 0.5  # 50% demotion
        return None

    def _consistent_positive_rule(self, scores: Dict) -> Optional[float]:
        """Consistent positive feedback -> promotion."""
        if scores.get("positive", 0) > 10 and scores.get("negative", 0) < 2:
            return 1.3  # 30% boost
        return None

    def get_boost(self, query: str, result_id: str) -> float:
        scores = self.result_scores.get(query, {}).get(result_id, {})
        # Apply rules in priority order
        for rule in self.rules:
            boost = rule(scores)
            if boost is not None:
                return boost
        return 1.0  # No rule matched


class FeatureEngineeringIntegrator(FeedbackIntegrator):
    """
    Convert feedback into features for an ML ranking model.

    Doesn't directly modify rankings—provides signal for the
    learning-to-rank model to use.
    """

    def __init__(self, decay_half_life_days: float = 30.0):
        self.decay_half_life = decay_half_life_days
        self.feedback_features: Dict[str, Dict] = {}  # result_id -> features

    def incorporate(self, signals: List[FeedbackSignal]):
        """Update the feature store with new signals."""
        for signal in signals:
            if signal.result_id not in self.feedback_features:
                self.feedback_features[signal.result_id] = self._init_features()

            features = self.feedback_features[signal.result_id]
            decay = self._time_decay(signal.timestamp)

            # Update running statistics
            if signal.signal_value > 0:
                features["positive_feedback_score"] += signal.signal_value * signal.weight * decay
                features["positive_feedback_count"] += 1
            else:
                features["negative_feedback_score"] += abs(signal.signal_value) * signal.weight * decay
                features["negative_feedback_count"] += 1

            # Update type-specific features
            type_key = f"feedback_{signal.signal_type}_score"
            if type_key in features:
                features[type_key] += signal.signal_value * signal.weight * decay

    def _init_features(self) -> Dict:
        return {
            "positive_feedback_score": 0.0,
            "negative_feedback_score": 0.0,
            "positive_feedback_count": 0,
            "negative_feedback_count": 0,
            "feedback_click_score": 0.0,
            "feedback_dwell_score": 0.0,
            "feedback_explicit_score": 0.0,
        }

    def _time_decay(self, timestamp: datetime) -> float:
        """Apply exponential time decay to older signals."""
        age_days = (datetime.now() - timestamp).days
        return 0.5 ** (age_days / self.decay_half_life)

    def get_features(self, result_id: str) -> Dict:
        """Get the feature vector for a result."""
        if result_id not in self.feedback_features:
            return self._init_features()

        features = self.feedback_features[result_id].copy()

        # Derived features
        total_pos = features["positive_feedback_count"]
        total_neg = features["negative_feedback_count"]
        total = total_pos + total_neg

        if total > 0:
            features["feedback_positive_ratio"] = total_pos / total
            features["feedback_net_score"] = (
                features["positive_feedback_score"] -
                features["negative_feedback_score"]
            )
        else:
            features["feedback_positive_ratio"] = 0.5
            features["feedback_net_score"] = 0.0

        return features

    def get_boost(self, query: str, result_id: str) -> float:
        # This integrator doesn't directly boost—the ML model uses the features
        return 1.0


class OnlineLearningIntegrator(FeedbackIntegrator):
    """
    Update model weights in real-time from feedback.

    Most responsive but highest risk—bad feedback can quickly degrade
    rankings. Uses a multi-armed bandit approach for
    exploration-exploitation.
    """

    def __init__(self,
                 learning_rate: float = 0.01,
                 exploration_rate: float = 0.1):
        self.learning_rate = learning_rate
        self.exploration = exploration_rate
        self.result_values: Dict[str, Dict] = {}  # query -> result_id -> Beta params

    def incorporate(self, signals: List[FeedbackSignal]):
        """Update bandit parameters from new signals."""
        for signal in signals:
            if signal.query not in self.result_values:
                self.result_values[signal.query] = {}
            if signal.result_id not in self.result_values[signal.query]:
                # Initialize with a prior (Beta distribution parameters)
                self.result_values[signal.query][signal.result_id] = {
                    "alpha": 1.0,  # Pseudo-successes
                    "beta": 1.0,   # Pseudo-failures
                }

            params = self.result_values[signal.query][signal.result_id]

            # Update the Beta distribution based on feedback
            if signal.signal_value > 0:
                params["alpha"] += signal.signal_value * signal.weight
            else:
                params["beta"] += abs(signal.signal_value) * signal.weight

    def get_boost(self, query: str, result_id: str) -> float:
        """
        Get a boost using Thompson Sampling.

        Sample from the posterior to balance exploration and exploitation.
        """
        if query not in self.result_values:
            return 1.0
        if result_id not in self.result_values[query]:
            return 1.0

        params = self.result_values[query][result_id]

        # Sample from the Beta distribution
        sampled_value = np.random.beta(params["alpha"], params["beta"])

        # Convert to a boost factor (sampled_value of 0.5 = neutral boost of 1.0)
        # sampled_value ≈ 0.3 → boost ≈ 0.6
        # sampled_value ≈ 0.7 → boost ≈ 1.4
        boost = 2 * sampled_value
        return boost
```

The ultimate goal is a search system that continuously improves from user feedback without manual intervention. This requires careful architecture to balance responsiveness with stability.
Key components of continuous learning:
```python
from dataclasses import dataclass
from typing import Dict, List, Optional
from datetime import datetime, timedelta
from enum import Enum
import asyncio


class ModelState(Enum):
    TRAINING = "training"
    VALIDATING = "validating"
    CANARY = "canary"
    PRODUCTION = "production"
    ROLLBACK = "rollback"


@dataclass
class ModelVersion:
    """A versioned model with metadata."""
    version_id: str
    created_at: datetime
    training_data_end: datetime
    metrics: Dict[str, float]  # Validation metrics
    state: ModelState


class ContinuousLearningPipeline:
    """
    Orchestrates continuous learning from user feedback.

    Pipeline stages:
    1. Collect: Stream user interactions
    2. Process: Convert to training signals
    3. Train: Update model with new data
    4. Validate: Ensure model quality
    5. Deploy: Gradual rollout
    6. Monitor: Watch for issues
    """

    def __init__(self,
                 training_frequency_hours: float = 24.0,
                 min_samples_for_training: int = 10000,
                 validation_threshold: float = 0.95):
        self.training_freq = training_frequency_hours
        self.min_samples = min_samples_for_training
        self.validation_threshold = validation_threshold
        self.current_model: Optional[ModelVersion] = None
        self.candidate_model: Optional[ModelVersion] = None
        self.signal_buffer: List = []

    async def run_training_cycle(self):
        """
        Execute one training cycle.

        Called periodically (e.g., daily or when enough data accumulates).
        """
        # Step 1: Collect processed signals
        signals = await self.collect_signals()
        if len(signals) < self.min_samples:
            return {"status": "skipped", "reason": "insufficient_data"}

        # Step 2: Train new model
        new_model = await self.train_model(signals)
        new_model.state = ModelState.VALIDATING

        # Step 3: Validate offline
        validation_result = await self.validate_model(new_model)
        if not validation_result["passed"]:
            return {"status": "failed", "reason": "validation_failed",
                    "metrics": validation_result}

        # Step 4: Deploy to canary
        new_model.state = ModelState.CANARY
        await self.deploy_canary(new_model)

        # Step 5: Monitor canary
        canary_result = await self.monitor_canary(new_model, duration_hours=2.0)
        if not canary_result["passed"]:
            await self.rollback_canary()
            return {"status": "rolled_back", "reason": "canary_failed"}

        # Step 6: Promote to production
        new_model.state = ModelState.PRODUCTION
        await self.promote_to_production(new_model)

        return {"status": "success", "model_version": new_model.version_id}

    async def collect_signals(self) -> List:
        """Collect and process feedback signals."""
        # In practice: read from a Kafka/Kinesis stream
        signals = self.signal_buffer.copy()
        self.signal_buffer = []
        return signals

    async def train_model(self, signals: List) -> ModelVersion:
        """Train a new model version."""
        # In practice: launch a training job (Spark, SageMaker, etc.)
        return ModelVersion(
            version_id=f"model_{datetime.now().isoformat()}",
            created_at=datetime.now(),
            training_data_end=datetime.now(),
            metrics={},
            state=ModelState.TRAINING
        )

    async def validate_model(self, model: ModelVersion) -> Dict:
        """
        Validate the model against held-out data and baselines.

        Checks:
        1. Offline metrics vs baseline (NDCG, MRR)
        2. No regression on critical query types
        3. Consistent behavior (no dramatic ranking changes)
        """
        # Compare against the baseline model
        baseline_metrics = self.current_model.metrics if self.current_model else {}

        checks = []

        # Check 1: Overall metric improvement
        ndcg_ratio = model.metrics.get("ndcg", 0) / baseline_metrics.get("ndcg", 1)
        checks.append({
            "check": "ndcg_improvement",
            "passed": ndcg_ratio >= self.validation_threshold,
            "value": ndcg_ratio
        })

        # Check 2: No major regression on any query segment
        for segment in ["navigational", "informational", "transactional"]:
            seg_key = f"ndcg_{segment}"
            if seg_key in model.metrics and seg_key in baseline_metrics:
                ratio = model.metrics[seg_key] / baseline_metrics[seg_key]
                checks.append({
                    "check": f"{segment}_stability",
                    "passed": ratio >= 0.98,  # Allow 2% regression per segment
                    "value": ratio
                })

        all_passed = all(c["passed"] for c in checks)
        return {"passed": all_passed, "checks": checks}

    async def deploy_canary(self, model: ModelVersion):
        """Deploy to a small percentage of traffic."""
        # In practice: update load balancer / feature flag
        pass

    async def monitor_canary(self, model: ModelVersion, duration_hours: float) -> Dict:
        """
        Monitor the canary deployment for issues.

        Watch for:
        - Error rate spikes
        - Latency increases
        - Metric degradation vs control
        """
        # In practice: query the metrics system, compare canary vs control.
        # For now, simulate monitoring.
        await asyncio.sleep(0.1)  # Would wait for duration_hours

        return {
            "passed": True,
            "error_rate_delta": 0.001,
            "latency_delta_ms": 2,
            "ctr_delta": 0.02
        }

    async def rollback_canary(self):
        """Roll back the canary deployment."""
        pass

    async def promote_to_production(self, model: ModelVersion):
        """Promote the canary to full production."""
        self.current_model = model


class FeedbackGuardrails:
    """
    Safety mechanisms to prevent feedback exploitation.

    Bad actors may try to manipulate rankings by:
    - Generating fake positive feedback
    - Coordinated negative feedback attacks
    - Gaming click patterns
    """

    @staticmethod
    def detect_coordinated_attack(
        signals: List['FeedbackSignal'],
        window_hours: float = 1.0,
        threshold_ratio: float = 10.0
    ) -> List[str]:
        """
        Detect suspiciously coordinated feedback.

        Returns a list of result_ids that appear to be under attack.
        """
        suspicious = []
        now = datetime.now()
        cutoff = now - timedelta(hours=window_hours)

        # Group recent signals by result_id
        by_result = {}
        for signal in signals:
            if signal.timestamp > cutoff:
                if signal.result_id not in by_result:
                    by_result[signal.result_id] = []
                by_result[signal.result_id].append(signal)

        # Check for abnormal patterns
        for result_id, result_signals in by_result.items():
            if len(result_signals) < 5:
                continue

            # Check: all signals with the same polarity = suspicious
            values = [s.signal_value for s in result_signals]
            if all(v > 0 for v in values) or all(v < 0 for v in values):
                # Check the signal rate vs historical
                signal_rate = len(result_signals) / window_hours
                # Would compare against the historical rate
                historical_rate = 1.0  # Placeholder
                if signal_rate > historical_rate * threshold_ratio:
                    suspicious.append(result_id)

        return suspicious

    @staticmethod
    def rate_limit_user_feedback(
        user_id: str,
        feedback_count_today: int,
        max_daily_feedback: int = 20
    ) -> bool:
        """Rate-limit feedback per user to prevent manipulation."""
        return feedback_count_today < max_daily_feedback

    @staticmethod
    def detect_bot_patterns(
        user_signals: List['FeedbackSignal']
    ) -> float:
        """
        Detect bot-like feedback patterns.

        Returns probability of bot (0-1).
        """
        if len(user_signals) < 3:
            return 0.0

        # Check for inhuman patterns
        bot_score = 0.0

        # Pattern: exactly regular timing
        intervals = []
        for i in range(1, len(user_signals)):
            delta = (user_signals[i].timestamp -
                     user_signals[i - 1].timestamp).total_seconds()
            intervals.append(delta)

        if len(intervals) >= 2:
            avg_interval = sum(intervals) / len(intervals)
            variance = sum((i - avg_interval) ** 2 for i in intervals) / len(intervals)
            # Very low variance = suspicious regularity
            if variance < 1.0 and avg_interval < 5.0:
                bot_score += 0.5

        # Pattern: no dwell time (instant feedback)
        instant_feedback = sum(
            1 for s in user_signals
            if s.signal_type == "explicit" and (getattr(s, 'dwell_before', 0) or 1) < 2
        )
        if instant_feedback / len(user_signals) > 0.8:
            bot_score += 0.3

        return min(1.0, bot_score)
```

Even with automation, maintain human oversight. Set up alerts for unusual model behavior. Review a sample of feedback-driven changes weekly. Automated systems should have escalation paths to human reviewers for edge cases.
User feedback incorporation transforms search from a static system into a learning system. Every interaction becomes an opportunity to improve. But this power comes with responsibility—to handle feedback correctly, avoid manipulation, and maintain quality.
With this page, we complete our exploration of Search Relevance Tuning. From understanding relevance factors, through boosting and personalization, to rigorous A/B testing and feedback loops, you now have a comprehensive toolkit for building and continuously improving search quality.
Congratulations! You've completed the module on Search Relevance Tuning. You now understand relevance factors and scoring functions, how to boost fields and queries for specific goals, personalization from profile construction to privacy, A/B testing for rigorous quality measurement, and feedback incorporation for continuous improvement. These skills form the foundation for building world-class search experiences that users love.