When a user types a query into a search box, they have an intent—a specific piece of information they seek, a product they want to purchase, a question they need answered. The search system's fundamental challenge is to infer this intent from a few keywords and return results that satisfy it. This is the relevance problem, and it sits at the heart of every search system ever built.
Relevance is deceptively simple to define informally: relevant results are the ones the user actually wanted. But operationalizing this definition requires us to decompose relevance into measurable components, understand how these components interact, and build systems that can compute relevance scores at scale—often for billions of documents against millions of queries per second.
This page provides a comprehensive exploration of relevance factors—the signals and features that search systems use to determine how well a document matches a user's query and intent.
By the end of this page, you will understand the taxonomy of relevance factors, how textual and non-textual signals contribute to relevance scoring, the mathematical foundations of scoring functions like TF-IDF and BM25, and the architectural considerations for computing relevance at scale. This knowledge forms the foundation for all subsequent tuning and personalization strategies.
Relevance in information retrieval is a multi-faceted concept that goes far beyond simple keyword matching. A document might contain every word in the query but still be completely irrelevant if it uses those words in a different context. Conversely, a document might be highly relevant despite not containing the exact query terms, if it uses synonyms or addresses the underlying intent.
The evolution of relevance thinking:
Early search systems (1960s-1980s) treated relevance as a binary concept: a document either matched the query or it didn't. Boolean retrieval systems returned all documents matching a query expression (e.g., "apple AND computer") without ranking.
The vector space model (1970s) introduced the idea of degree of relevance—documents could be more or less relevant based on term weights. This enabled ranked retrieval, where results are ordered by estimated relevance.
Probabilistic models (1970s-1990s) formalized relevance as probability: what is the probability that this document is relevant given the query? This led to models like BM25, which remains a standard baseline today.
Learning-to-rank (2000s-present) treats relevance as a machine learning problem, using hundreds of features to predict relevance based on labeled training data.
Modern systems combine all these approaches, using traditional scoring as features within ML models that incorporate user behavior, context, and personalization signals.
Relevance exists at the intersection of three elements: (1) the user's information need (intent), (2) the query they formulate to express that need, and (3) the documents in the corpus. Perfect relevance would mean the system understands the user's true intent, correctly interprets their query as an expression of that intent, and identifies documents that satisfy the intent. Each mapping is imperfect, which is why relevance is such a hard problem.
Modern search systems consider dozens—sometimes hundreds—of factors when computing relevance scores. These factors can be organized into a coherent taxonomy that helps us understand their roles and interactions.
The factor hierarchy:
Relevance factors fall into several broad categories, each capturing a different dimension of what makes a result useful to the user.
| Category | Description | Examples | Typical Weight |
|---|---|---|---|
| Textual Relevance | How well document text matches query terms | TF-IDF, BM25, phrase matching, proximity | High (40-60%) |
| Freshness | How recent or up-to-date the content is | Publication date, last update, decay functions | Medium (10-20%) |
| Authority | Credibility and trustworthiness signals | PageRank, domain authority, backlinks, citations | Medium (10-20%) |
| User Engagement | How users interact with the result | Click-through rate, dwell time, bounce rate | Medium (10-20%) |
| Document Quality | Intrinsic content quality signals | Content length, readability, spam scores | Low-Medium (5-15%) |
| User Context | Personalization and contextual signals | Location, device, history, preferences | Variable (0-30%) |
| Query-Document Match | Structural matching beyond text | Title match, URL match, meta description | Medium (10-15%) |
Interaction effects:
These factors don't operate independently. A highly authoritative source with moderate textual relevance might outrank a low-authority source with perfect textual match. Freshness might matter enormously for news queries but be irrelevant for historical information. User context can completely reorder results—the same query from two users might yield entirely different optimal rankings.
This complexity is why modern ranking systems use machine learning: the interactions between factors are too complex for humans to specify manually, but ML models can learn these relationships from training data.
Textual relevance measures how well the words in a document match the words in a query. Despite advances in semantic search and neural models, textual relevance remains the foundation of search ranking—the signal that other factors augment but rarely replace.
Term frequency (TF):
The most basic intuition is that documents mentioning the query terms more frequently are more likely to be relevant. If a user searches for "machine learning," a document mentioning "machine learning" 50 times is probably more focused on the topic than one mentioning it twice.
However, raw term frequency has problems. Very long documents naturally contain more term occurrences. Repeating a term obsessively doesn't increase relevance proportionally. This led to various TF normalization schemes.
```python
import math
from collections import Counter


class TermFrequencyCalculators:
    """
    Various term frequency normalization schemes used in information retrieval.
    Each addresses different problems with raw term counts.
    """

    @staticmethod
    def raw_tf(term: str, document: list[str]) -> int:
        """
        Raw term frequency: simple count of term occurrences.
        Problem: Biased toward long documents.
        """
        return document.count(term)

    @staticmethod
    def log_normalized_tf(term: str, document: list[str]) -> float:
        """
        Log normalization: 1 + log(tf) if tf > 0, else 0.

        Intuition: Multiple occurrences matter, but with diminishing returns.
        The difference between 1 and 10 occurrences matters more than
        the difference between 100 and 1000.
        """
        tf = document.count(term)
        if tf == 0:
            return 0.0
        return 1.0 + math.log(tf)

    @staticmethod
    def double_normalization_tf(term: str, document: list[str], k: float = 0.5) -> float:
        """
        Double normalization (0.5): k + (1-k) * tf/max_tf

        Normalizes by the maximum term frequency in the document.
        Prevents any term from dominating too much.
        """
        tf = document.count(term)
        if tf == 0:
            return 0.0
        term_counts = Counter(document)
        max_tf = max(term_counts.values()) if term_counts else 1
        return k + (1 - k) * (tf / max_tf)

    @staticmethod
    def bm25_saturation_tf(term: str, document: list[str], k1: float = 1.2) -> float:
        """
        BM25-style saturation: tf / (tf + k1)

        Crucial insight: after enough occurrences, additional mentions
        add almost no extra relevance. This "saturates" the score.
        k1 controls how quickly saturation occurs.
        """
        tf = document.count(term)
        if tf == 0:
            return 0.0
        return tf / (tf + k1)


# Demonstration of saturation behavior
def demonstrate_saturation():
    """
    Shows how BM25 saturation prevents term frequency gaming.
    """
    print("TF      Raw     Log     BM25 Saturation")
    print("-" * 50)
    for tf in [1, 2, 5, 10, 20, 50, 100, 1000]:
        raw = tf
        log_norm = 1 + math.log(tf) if tf > 0 else 0
        bm25_sat = tf / (tf + 1.2)
        print(f"{tf:<7} {raw:<7} {log_norm:<7.2f} {bm25_sat:.3f}")

# Output shows BM25 rapidly approaches 1.0, preventing gaming:
# TF=1 → 0.45, TF=10 → 0.89, TF=100 → 0.99, TF=1000 → 0.999
```

Inverse document frequency (IDF):
Not all terms are equally informative. The word "the" appears in virtually every English document and tells us nothing about relevance. The term "mitochondrial" is rare and highly specific—its presence is a strong relevance signal.
IDF captures this by weighting terms inversely to their document frequency:
IDF(term) = log(N / df(term))
Where N is the total number of documents and df(term) is the number of documents containing the term. Rare terms get high IDF weights; common terms get low weights.
TF-IDF scoring:
Combining TF and IDF gives us TF-IDF, the classic relevance scoring function:
TF-IDF(term, doc) = TF(term, doc) × IDF(term)
Score(query, doc) = Σ TF-IDF(term, doc) for each term in query
This simple formula remains surprisingly effective and is still used as a baseline and feature in modern systems.
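To make these formulas concrete, here is a minimal sketch that scores a query against a small pre-tokenized corpus. The function name `tf_idf_score` and the toy corpus are illustrative assumptions, not part of any library, and raw TF is used for brevity; any of the normalization schemes shown earlier could be swapped in.

```python
import math
from collections import Counter


def tf_idf_score(query: list[str], document: list[str], corpus: list[list[str]]) -> float:
    """Sum of TF-IDF weights for each query term found in the document."""
    n_docs = len(corpus)
    doc_counts = Counter(document)
    score = 0.0
    for term in query:
        tf = doc_counts[term]
        if tf == 0:
            continue
        # df = number of documents containing the term
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log(n_docs / df)  # IDF(term) = log(N / df(term))
        score += tf * idf
    return score


corpus = [
    ["machine", "learning", "is", "fun"],
    ["the", "machine", "is", "broken"],
    ["learning", "to", "rank", "documents"],
]
# Both query terms appear in 2 of 3 documents, so each contributes 1 * log(3/2)
print(tf_idf_score(["machine", "learning"], corpus[0], corpus))  # ≈ 0.81
```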
BM25 (Best Matching 25) represents the culmination of probabilistic retrieval research. Developed in the 1990s, it remains the default ranking function in Elasticsearch, Solr, and most other search engines. Understanding BM25 deeply is essential for any search engineer.
The BM25 formula:
```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class BM25Parameters:
    """
    BM25 tuning parameters with their interpretations.
    """
    k1: float = 1.2   # Term frequency saturation. Higher = more weight to TF.
    b: float = 0.75   # Length normalization. 0 = ignore length, 1 = full normalization.


class BM25:
    """
    Production-grade BM25 implementation with full documentation.

    The BM25 scoring function is:

    Score(Q, D) = Σ IDF(qi) × (tf(qi, D) × (k1 + 1)) /
                  (tf(qi, D) + k1 × (1 - b + b × |D|/avgdl))

    Where:
    - Q is the query, D is the document
    - qi is the i-th query term
    - tf(qi, D) is the frequency of qi in D
    - |D| is the document length
    - avgdl is the average document length in the corpus
    - k1 and b are tuning parameters
    """

    def __init__(self, corpus: list[list[str]], params: BM25Parameters = BM25Parameters()):
        self.corpus = corpus
        self.params = params
        self.corpus_size = len(corpus)

        # Precompute document lengths and average
        self.doc_lengths = [len(doc) for doc in corpus]
        self.avg_doc_length = sum(self.doc_lengths) / self.corpus_size if self.corpus_size > 0 else 0

        # Precompute document frequencies for IDF
        self.doc_frequencies = {}
        for doc in corpus:
            unique_terms = set(doc)
            for term in unique_terms:
                self.doc_frequencies[term] = self.doc_frequencies.get(term, 0) + 1

        # Precompute term frequencies per document
        self.doc_term_frequencies = [Counter(doc) for doc in corpus]

    def idf(self, term: str) -> float:
        """
        Compute Inverse Document Frequency for a term.

        Uses the Robertson-Walker IDF variant:
        IDF = log((N - n + 0.5) / (n + 0.5) + 1)

        This formulation:
        - Avoids negative IDF for very common terms
        - Handles edge cases smoothly
        - Provides slightly better empirical results than classic IDF
        """
        n = self.doc_frequencies.get(term, 0)
        numerator = self.corpus_size - n + 0.5
        denominator = n + 0.5
        return math.log(numerator / denominator + 1)

    def score_document(self, query: list[str], doc_index: int) -> float:
        """
        Compute BM25 score for a single document against a query.

        The core insight: balance term frequency importance (k1)
        with document length normalization (b). Long documents
        naturally contain more terms but aren't necessarily more relevant.
        """
        score = 0.0
        doc_length = self.doc_lengths[doc_index]
        doc_tf = self.doc_term_frequencies[doc_index]

        for term in query:
            if term not in doc_tf:
                continue

            tf = doc_tf[term]
            idf = self.idf(term)

            # Length normalization factor
            # When b=1: fully normalize by document length
            # When b=0: ignore document length entirely
            length_norm = 1 - self.params.b + self.params.b * (doc_length / self.avg_doc_length)

            # BM25 term score with saturation
            numerator = tf * (self.params.k1 + 1)
            denominator = tf + self.params.k1 * length_norm
            score += idf * (numerator / denominator)

        return score

    def rank(self, query: list[str], top_k: int = 10) -> list[tuple[int, float]]:
        """
        Rank all documents by BM25 score and return top K.
        """
        scores = [(i, self.score_document(query, i)) for i in range(self.corpus_size)]
        scores.sort(key=lambda x: -x[1])
        return scores[:top_k]


# Example usage demonstrating BM25 behavior
def bm25_demonstration():
    corpus = [
        ["machine", "learning", "is", "a", "subset", "of", "artificial", "intelligence"],
        ["deep", "learning", "uses", "neural", "networks", "for", "machine", "learning"],
        ["machine", "learning", "machine", "learning", "machine", "learning"],  # TF gaming attempt
        ["artificial", "intelligence", "encompasses", "many", "techniques"],
        ["machine", "learning", "algorithms", "learn", "from", "data", "patterns"],
    ]

    bm25 = BM25(corpus)
    query = ["machine", "learning"]

    print("Query: 'machine learning'")
    print("\nRanking results:")
    for doc_idx, score in bm25.rank(query, top_k=5):
        print(f"  Doc {doc_idx}: {score:.4f} - {' '.join(corpus[doc_idx])}")

    # Note: Doc 2 (spam-like repetition) doesn't dominate because of:
    # 1. TF saturation (k1 parameter)
    # 2. The IDF component means common terms in query matter less
    # 3. Length normalization (b parameter)
```

The default parameters (k1=1.2, b=0.75) work well for general web search but may need adjustment for specific domains. For short documents (product titles), reduce b toward 0. For domains where repetition indicates relevance (legal documents), increase k1. Always validate changes with relevance evaluation metrics on your specific use case.
Bag-of-words models like TF-IDF and BM25 treat documents as unordered sets of terms. This loses important information. The phrase "New York" has a specific meaning lost when treated as independent occurrences of "new" and "york."
Phrase matching:
Phrase queries require terms to appear consecutively in the specified order. Search engines support this with exact phrase operators (typically quotes):
"machine learning" — matches only documents where these words appear consecutivelymachine learning — matches documents containing both words, anywherePhrase matching uses positional indexes that store not just which documents contain a term, but where in each document:
"machine" → {doc1: [5, 23, 89], doc2: [12], doc3: [3, 7, 45, 102]}
"learning" → {doc1: [6, 24], doc2: [13], doc3: [8, 46]}
For a phrase match, we check if positions satisfy adjacency constraints.
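Below is a minimal sketch of that adjacency check, using the toy postings above; the helper `phrase_match` is hypothetical rather than part of any engine's API.

```python
def phrase_match(positions: dict[str, list[int]], phrase: list[str]) -> bool:
    """Return True if the phrase terms appear consecutively, in order."""
    first = phrase[0]
    for start in positions.get(first, []):
        # Term i of the phrase must appear at position start + i
        if all(start + i in positions.get(term, [])
               for i, term in enumerate(phrase[1:], start=1)):
            return True
    return False


# Postings from the example above: term -> {doc -> positions}
index = {
    "machine": {"doc1": [5, 23, 89], "doc2": [12], "doc3": [3, 7, 45, 102]},
    "learning": {"doc1": [6, 24], "doc2": [13], "doc3": [8, 46]},
}

for doc in ["doc1", "doc2", "doc3"]:
    doc_positions = {term: postings.get(doc, []) for term, postings in index.items()}
    print(doc,
          phrase_match(doc_positions, ["machine", "learning"]),   # adjacent -> True
          phrase_match(doc_positions, ["learning", "machine"]))   # reversed -> False
```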
```python
from dataclasses import dataclass
from typing import Optional
import math


@dataclass
class TermPositions:
    """Positions of a term in a document."""
    term: str
    positions: list[int]  # 1-indexed positions in document


class ProximityScorer:
    """
    Computes proximity-based relevance boosts.

    Intuition: Terms that appear close together are more likely
    to be semantically related. "Machine learning" appearing as
    a phrase is a stronger relevance signal than the words
    appearing paragraphs apart.
    """

    @staticmethod
    def min_span(term_positions: list[TermPositions]) -> Optional[int]:
        """
        Find the minimum span (window size) that contains all query terms.
        Returns None if any term is missing from the document.

        Uses a sliding window approach:
        1. Initialize pointers to the first position of each term
        2. Find the span of current positions
        3. Advance the leftmost pointer
        4. Track minimum span seen

        Time complexity: O(n * m) where n = total positions, m = number of terms
        """
        if not term_positions or any(len(tp.positions) == 0 for tp in term_positions):
            return None

        # Initialize pointers to first position of each term
        pointers = [0] * len(term_positions)
        min_span_size = float('inf')

        while True:
            # Get current positions for each term
            current_positions = [
                term_positions[i].positions[pointers[i]]
                for i in range(len(term_positions))
            ]

            # Calculate span from min to max position
            min_pos = min(current_positions)
            max_pos = max(current_positions)
            span = max_pos - min_pos + 1
            min_span_size = min(min_span_size, span)

            # Perfect phrase match - can't do better
            if span == len(term_positions):
                break

            # Advance the leftmost pointer
            leftmost_idx = current_positions.index(min_pos)
            pointers[leftmost_idx] += 1

            # If we've exhausted any term's positions, we're done
            if pointers[leftmost_idx] >= len(term_positions[leftmost_idx].positions):
                break

        return min_span_size if min_span_size != float('inf') else None

    @staticmethod
    def proximity_boost(min_span: Optional[int], num_terms: int,
                        max_boost: float = 2.0, decay_rate: float = 0.1) -> float:
        """
        Calculate proximity boost factor based on minimum span.

        Perfect phrase (span = num_terms) gets max boost.
        Boost decays as span increases.

        Formula: max_boost * exp(-decay_rate * (span - num_terms))
        """
        if min_span is None:
            return 1.0  # No boost if terms don't all appear

        # Excess span beyond perfect phrase
        excess = min_span - num_terms
        if excess <= 0:
            return max_boost  # Perfect phrase match

        # Exponential decay as terms spread apart
        return 1.0 + (max_boost - 1.0) * math.exp(-decay_rate * excess)

    @staticmethod
    def bigram_proximity_score(doc_positions: dict[str, list[int]],
                               query_terms: list[str],
                               window: int = 5) -> float:
        """
        Score based on how often query term pairs appear within a window.

        This captures partial proximity - even if all terms aren't close,
        nearby term pairs still boost relevance.
        """
        if len(query_terms) < 2:
            return 0.0

        total_score = 0.0
        pairs_counted = 0

        for i in range(len(query_terms) - 1):
            term1, term2 = query_terms[i], query_terms[i + 1]
            if term1 not in doc_positions or term2 not in doc_positions:
                continue

            positions1 = doc_positions[term1]
            positions2 = doc_positions[term2]

            # Count pairs within window
            close_pairs = 0
            for p1 in positions1:
                for p2 in positions2:
                    if abs(p2 - p1) <= window:
                        close_pairs += 1

            if close_pairs > 0:
                total_score += math.log(1 + close_pairs)
                pairs_counted += 1

        return total_score / pairs_counted if pairs_counted > 0 else 0.0


# Demonstration
def proximity_demo():
    # Document: "Machine learning is transforming the field. Deep learning,
    # a subset of machine learning, uses neural networks."
    positions = {
        "machine": [1, 12],
        "learning": [2, 7, 13],
        "deep": [6],
        "neural": [16],
        "networks": [17],
    }

    query = ["machine", "learning"]
    term_positions = [
        TermPositions("machine", positions["machine"]),
        TermPositions("learning", positions["learning"]),
    ]

    min_span = ProximityScorer.min_span(term_positions)
    boost = ProximityScorer.proximity_boost(min_span, len(query))

    print(f"Query: {query}")
    print(f"Minimum span: {min_span}")       # Should be 2 (positions 1-2 = "machine learning")
    print(f"Proximity boost: {boost:.2f}")   # Should be ~2.0 (perfect phrase match)
```

Positional indexes are significantly larger than basic inverted indexes (typically 2-4x the size) because they store position lists for each term-document pair. The trade-off is worth it for phrase and proximity queries, but budget accordingly for index storage.
For many queries, newer content is more relevant than older content. A search for "best smartphones 2024" should prioritize recent reviews over 2020 articles. News queries are inherently temporal. Even evergreen topics benefit from freshness—updated information is usually more accurate.
Types of freshness signals:
```python
import math
from datetime import datetime, timedelta
from enum import Enum


class DecayFunction(Enum):
    """Different decay models for freshness scoring."""
    LINEAR = "linear"
    EXPONENTIAL = "exponential"
    GAUSSIAN = "gaussian"
    SIGMOID = "sigmoid"


class FreshnessScorer:
    """
    Computes freshness scores with configurable decay functions.

    Key insight: The "right" decay function depends on query type.
    - News: Sharp exponential decay (yesterday's news is old news)
    - Products: Moderate decay (last year's reviews still useful)
    - Historical: Minimal decay (old content may be best source)
    """

    def __init__(self,
                 current_time: datetime = None,
                 scale_days: float = 30.0,  # "half-life" in days
                 decay_type: DecayFunction = DecayFunction.EXPONENTIAL):
        self.current_time = current_time or datetime.now()
        self.scale_days = scale_days
        self.decay_type = decay_type

    def age_in_days(self, doc_date: datetime) -> float:
        """Calculate document age in fractional days."""
        delta = self.current_time - doc_date
        return delta.total_seconds() / 86400.0

    def linear_decay(self, age_days: float) -> float:
        """
        Linear decay from 1.0 to 0.0 over scale_days.
        Simple but unrealistic - content doesn't become worthless linearly.
        """
        decay = 1.0 - (age_days / self.scale_days)
        return max(0.0, decay)

    def exponential_decay(self, age_days: float) -> float:
        """
        Exponential decay: score = exp(-λ * age)

        Most common in practice. Half-life = scale_days.
        Never reaches zero - very old content still has some value.
        """
        lambda_param = math.log(2) / self.scale_days
        return math.exp(-lambda_param * age_days)

    def gaussian_decay(self, age_days: float, optimal_age: float = 0) -> float:
        """
        Gaussian decay: bell curve centered on optimal_age.
        Useful when "just published" isn't optimal (e.g., need time for reviews).
        """
        sigma = self.scale_days
        return math.exp(-0.5 * ((age_days - optimal_age) / sigma) ** 2)

    def sigmoid_decay(self, age_days: float) -> float:
        """
        Sigmoid decay: smooth S-curve transition from fresh to stale.
        Good for binary-ish freshness (relevant vs. outdated).
        """
        midpoint = self.scale_days
        steepness = 0.1  # Controls transition sharpness
        return 1.0 / (1.0 + math.exp(steepness * (age_days - midpoint)))

    def score(self, doc_date: datetime) -> float:
        """Compute freshness score using configured decay function."""
        age = self.age_in_days(doc_date)

        if self.decay_type == DecayFunction.LINEAR:
            return self.linear_decay(age)
        elif self.decay_type == DecayFunction.EXPONENTIAL:
            return self.exponential_decay(age)
        elif self.decay_type == DecayFunction.GAUSSIAN:
            return self.gaussian_decay(age)
        elif self.decay_type == DecayFunction.SIGMOID:
            return self.sigmoid_decay(age)
        else:
            return self.exponential_decay(age)


class QuerySpecificFreshness:
    """
    Adjusts freshness importance based on query classification.
    Not all queries need fresh content equally.
    """

    # Query freshness sensitivity categories
    QUERY_FRESHNESS_WEIGHTS = {
        "breaking_news": 0.95,    # "earthquake today"
        "recent_events": 0.8,     # "election results 2024"
        "trending_topics": 0.7,   # "viral tiktok"
        "product_reviews": 0.5,   # "iphone 15 review"
        "how_to": 0.3,            # "how to tie a tie"
        "factual": 0.2,           # "speed of light"
        "historical": 0.05,       # "world war 2 causes"
    }

    @classmethod
    def weighted_freshness_boost(cls, freshness_score: float,
                                 query_type: str,
                                 base_score: float) -> float:
        """
        Apply freshness boost weighted by query type.

        Final score = base_score * (1 + weight * freshness_adjustment)
        """
        weight = cls.QUERY_FRESHNESS_WEIGHTS.get(query_type, 0.3)
        freshness_adjustment = freshness_score - 0.5  # Center around neutral
        return base_score * (1.0 + weight * freshness_adjustment)


# Example: Compare decay functions
def compare_decay_functions():
    now = datetime.now()
    ages = [0, 1, 7, 30, 90, 365]  # Days

    print("Age (days) | Linear | Exponential | Gaussian | Sigmoid")
    print("-" * 60)

    for age in ages:
        doc_date = now - timedelta(days=age)
        linear = FreshnessScorer(now, 90, DecayFunction.LINEAR).score(doc_date)
        exp = FreshnessScorer(now, 30, DecayFunction.EXPONENTIAL).score(doc_date)
        gauss = FreshnessScorer(now, 30, DecayFunction.GAUSSIAN).score(doc_date)
        sigmoid = FreshnessScorer(now, 60, DecayFunction.SIGMOID).score(doc_date)
        print(f"{age:10d} | {linear:6.3f} | {exp:11.3f} | {gauss:8.3f} | {sigmoid:7.3f}")
```

Two documents might match a query equally well textually, but one is from a renowned medical journal and the other is from an anonymous blog. Authority signals help search engines distinguish credible sources from unreliable ones.
PageRank: The original authority signal:
Google's breakthrough insight was that the link structure of the web encodes human judgments about quality. If many pages link to a page, that page is probably valuable. If those linking pages are themselves authoritative, the signal is even stronger.
PageRank models this as a random walk: imagine a user clicking random links forever. The probability of landing on any given page is its PageRank score.
```python
import numpy as np
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class AuthorityMetrics:
    """Collection of authority signals for a document/domain."""
    pagerank: float = 0.0
    domain_authority: float = 0.0       # Aggregate domain-level signal
    inbound_link_count: int = 0         # Raw backlink count
    unique_referring_domains: int = 0   # More resistant to manipulation
    trust_rank: float = 0.0             # Distance from seed trusted sites
    spam_score: float = 0.0             # Probability of being spam


class PageRankCalculator:
    """
    Classic PageRank implementation with damping factor.

    The intuition: A page is important if important pages link to it.
    This circular definition is resolved iteratively.

    Mathematical formulation:
    PR(A) = (1-d)/N + d × Σ(PR(Ti)/C(Ti))

    Where:
    - d = damping factor (probability of following a link vs. random jump)
    - N = total pages
    - Ti = pages linking to A
    - C(Ti) = number of outbound links from Ti
    """

    def __init__(self,
                 damping_factor: float = 0.85,
                 max_iterations: int = 100,
                 convergence_threshold: float = 1e-6):
        self.damping = damping_factor
        self.max_iter = max_iterations
        self.threshold = convergence_threshold

    def calculate(self, adjacency_list: Dict[str, List[str]]) -> Dict[str, float]:
        """
        Calculate PageRank for all nodes in the graph.

        Args:
            adjacency_list: {page: [pages it links to]}

        Returns:
            {page: pagerank_score}
        """
        # Build the set of all nodes
        all_nodes = set(adjacency_list.keys())
        for targets in adjacency_list.values():
            all_nodes.update(targets)

        n = len(all_nodes)
        nodes = list(all_nodes)
        node_to_idx = {node: i for i, node in enumerate(nodes)}

        # Initialize uniform PageRank
        pagerank = np.ones(n) / n

        # Outbound link counts per node (conceptually, the transition matrix
        # M[i][j] = probability of going from j to i)
        outbound_counts = {node: len(adjacency_list.get(node, [])) for node in nodes}

        for iteration in range(self.max_iter):
            new_pagerank = np.ones(n) * (1 - self.damping) / n

            for source, targets in adjacency_list.items():
                if not targets:
                    continue  # Dangling nodes handled below
                source_idx = node_to_idx[source]
                contribution = self.damping * pagerank[source_idx] / len(targets)
                for target in targets:
                    if target in node_to_idx:
                        target_idx = node_to_idx[target]
                        new_pagerank[target_idx] += contribution

            # Handle dangling nodes (pages with no outlinks)
            dangling_sum = sum(
                pagerank[node_to_idx[node]]
                for node in nodes
                if node not in adjacency_list or not adjacency_list[node]
            )
            new_pagerank += self.damping * dangling_sum / n

            # Check convergence
            diff = np.sum(np.abs(new_pagerank - pagerank))
            pagerank = new_pagerank
            if diff < self.threshold:
                break

        return {nodes[i]: pagerank[i] for i in range(n)}


class TrustRank:
    """
    TrustRank: PageRank seeded from known trustworthy sites.

    Key insight: Spam pages can accumulate PageRank through link farms,
    but it's hard to get links from genuinely trusted sources.

    TrustRank propagates trust from a hand-picked seed set of
    authoritative sites. Pages far from the seed set in the link
    graph have lower trust scores.
    """

    def __init__(self, trusted_seeds: Set[str], decay_factor: float = 0.85):
        self.seeds = trusted_seeds
        self.decay = decay_factor

    def calculate(self, adjacency_list: Dict[str, List[str]],
                  iterations: int = 10) -> Dict[str, float]:
        """
        Calculate TrustRank propagating from seed set.
        """
        all_nodes = set(adjacency_list.keys())
        for targets in adjacency_list.values():
            all_nodes.update(targets)

        # Initialize: seeds get trust 1.0, others get 0
        trust = {node: 1.0 if node in self.seeds else 0.0 for node in all_nodes}

        # Build reverse adjacency (who links TO each page)
        reverse_adj: Dict[str, List[str]] = {node: [] for node in all_nodes}
        for source, targets in adjacency_list.items():
            for target in targets:
                if target in reverse_adj:
                    reverse_adj[target].append(source)

        for _ in range(iterations):
            new_trust = {}
            for node in all_nodes:
                if node in self.seeds:
                    new_trust[node] = 1.0  # Seeds maintain trust
                else:
                    # Inherit decayed trust from linking pages
                    inherited = sum(
                        self.decay * trust[source] / len(adjacency_list.get(source, [source]))
                        for source in reverse_adj[node]
                        if source in adjacency_list
                    )
                    new_trust[node] = min(1.0, inherited)
            trust = new_trust

        return trust


def domain_authority_aggregation():
    """
    Domain Authority: Aggregate page-level signals to domain level.

    Why domain-level?
    - New pages on trusted domains deserve initial trust
    - Prevents gaming by creating many low-quality pages
    - More stable signal (pages come and go, domains persist)
    """
    # Simplified example - real systems use ML models with many features
    domain_signals = {
        "wikipedia.org": {
            "age_years": 24,
            "unique_referring_domains": 15_000_000,
            "manual_authority_score": 98,  # Human-curated
            "avg_page_quality": 0.92,
        },
        "randomspamsite.xyz": {
            "age_years": 0.5,
            "unique_referring_domains": 50,
            "manual_authority_score": None,
            "avg_page_quality": 0.15,
        },
    }

    # Domain authority model (simplified)
    for domain, signals in domain_signals.items():
        age_factor = min(1.0, signals["age_years"] / 10)
        link_factor = min(1.0, np.log10(signals["unique_referring_domains"] + 1) / 7)
        quality_factor = signals["avg_page_quality"]

        authority = 0.3 * age_factor + 0.4 * link_factor + 0.3 * quality_factor
        print(f"{domain}: Authority = {authority:.3f}")
```

Today's search engines go far beyond basic PageRank. They incorporate E-E-A-T signals (Experience, Expertise, Authoritativeness, Trustworthiness), entity recognition (is the author a known expert?), topical authority (is this site authoritative for THIS topic?), and real-time signals from social media and news.
The same query can have different "correct" answers for different users. "Football" means something different in the US versus the UK. "Best restaurants" depends on location. A search for "Python" from a data scientist and from a pet store owner likely has different intents.
Categories of user context:
| Signal Type | Examples | Impact on Relevance | Privacy Considerations |
|---|---|---|---|
| Geographic | Country, city, precise location | Local results, language preferences | IP geolocation, explicit settings |
| Temporal | Time of day, day of week, season | Different intents at different times | Generally non-sensitive |
| Device | Mobile, desktop, tablet, OS | Format preferences, mobile-first results | User-agent, screen size |
| Language | Browser language, query language | Multilingual results, translations | Browser settings, query detection |
| Search History | Previous queries, click patterns | Disambiguation, interest profiling | Requires user consent, privacy-sensitive |
| Account Data | Explicit preferences, saved items | Strong personalization when available | Requires authentication, most sensitive |
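To illustrate how such signals might feed into scoring, here is a hedged sketch of a context-based boost. The `UserContext` and `DocumentMeta` structures, signal names, and weights are assumptions for illustration, not a standard API; real systems typically learn these adjustments rather than hand-coding them.

```python
from dataclasses import dataclass


@dataclass
class UserContext:
    country: str   # e.g., from IP geolocation or explicit settings
    device: str    # "mobile" or "desktop"
    language: str  # e.g., browser language


@dataclass
class DocumentMeta:
    country: str
    mobile_friendly: bool
    language: str


def contextual_boost(base_score: float, ctx: UserContext, doc: DocumentMeta) -> float:
    """Multiply the base relevance score by small, capped contextual boosts."""
    boost = 1.0
    if doc.country == ctx.country:
        boost *= 1.10   # Mild preference for local content
    if ctx.device == "mobile" and doc.mobile_friendly:
        boost *= 1.05   # Mobile-friendly pages on mobile devices
    if doc.language != ctx.language:
        boost *= 0.70   # Penalize language mismatch more strongly
    return base_score * boost


ctx = UserContext(country="GB", device="mobile", language="en")
doc = DocumentMeta(country="GB", mobile_friendly=True, language="en")
print(contextual_boost(10.0, ctx, doc))  # 10.0 * 1.10 * 1.05 ≈ 11.55
```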
The personalization spectrum:
Personalization exists on a spectrum from no personalization (everyone sees the same results) to heavy personalization (results are largely determined by user profile). The optimal point depends on the application, the contextual signals available, and the privacy constraints involved.
We'll explore personalization implementation deeply in a later section of this module.
Individual relevance factors are meaningless in isolation. What matters is how they combine into a final ranking score. This combination—the ranking function—is where search engineering becomes both art and science.
Approaches to combination:
```python
from dataclasses import dataclass
from typing import Callable, List, Dict
import numpy as np


@dataclass
class RankingFeatures:
    """All features used for ranking a single document."""
    bm25_score: float
    phrase_match_boost: float
    freshness_score: float
    pagerank: float
    domain_authority: float
    click_through_rate: float
    dwell_time_avg: float
    title_match: bool
    url_match: bool


class ManualRankingFunction:
    """
    Hand-tuned linear scoring function.

    Pros: Interpretable, controllable, no training data needed
    Cons: Suboptimal, limited interactions, hard to tune
    """

    def __init__(self):
        # Hand-tuned weights based on experimentation
        self.weights = {
            "bm25": 1.0,             # Base relevance signal
            "phrase_match": 0.3,     # Boost for exact phrases
            "freshness": 0.2,        # Time decay importance
            "pagerank": 0.15,        # Authority signal
            "domain_authority": 0.1,
            "ctr": 0.15,             # User engagement
            "dwell_time": 0.1,
            "title_match": 0.2,      # Structural signals
            "url_match": 0.1,
        }

    def score(self, features: RankingFeatures) -> float:
        """Compute final score as weighted sum."""
        score = 0.0
        score += self.weights["bm25"] * features.bm25_score
        score += self.weights["phrase_match"] * features.phrase_match_boost
        score += self.weights["freshness"] * features.freshness_score
        score += self.weights["pagerank"] * features.pagerank
        score += self.weights["domain_authority"] * features.domain_authority
        score += self.weights["ctr"] * features.click_through_rate
        score += self.weights["dwell_time"] * features.dwell_time_avg
        score += self.weights["title_match"] * (1.0 if features.title_match else 0.0)
        score += self.weights["url_match"] * (1.0 if features.url_match else 0.0)
        return score


class LearningToRankModel:
    """
    Simplified Learning-to-Rank using gradient boosted trees.
    Real systems use libraries like XGBoost, LightGBM, or CatBoost.
    This demonstrates the conceptual approach.
    """

    # Common L2R approaches:
    # 1. Pointwise: Predict absolute relevance (regression/classification)
    # 2. Pairwise: Predict which doc is more relevant (RankNet, LambdaRank)
    # 3. Listwise: Optimize list-level metrics directly (ListNet, LambdaMART)

    def __init__(self, model_type: str = "lambdamart"):
        self.model_type = model_type
        self.model = None  # Would be actual ML model

    def extract_features(self, features: RankingFeatures) -> np.ndarray:
        """Convert structured features to numpy array for ML model."""
        return np.array([
            features.bm25_score,
            features.phrase_match_boost,
            features.freshness_score,
            features.pagerank,
            features.domain_authority,
            features.click_through_rate,
            features.dwell_time_avg,
            1.0 if features.title_match else 0.0,
            1.0 if features.url_match else 0.0,
            # Derived features (feature engineering)
            features.bm25_score * features.freshness_score,  # Interaction
            features.pagerank * features.domain_authority,   # Authority combo
            np.log1p(features.dwell_time_avg),                # Log transform
        ])

    def train(self, training_queries: List[Dict], relevance_labels: List[List[int]]):
        """
        Train the L2R model on query-document pairs.

        Training data format:
        - Query with multiple candidate documents
        - Relevance labels for each document (0-4 scale typically)

        Objective: Learn to produce rankings that match human judgments.
        """
        # In practice, would call XGBoost/LightGBM train
        # Model learns feature importance and interactions automatically
        pass

    def score(self, features: RankingFeatures) -> float:
        """Predict relevance score using trained model."""
        feature_vector = self.extract_features(features)
        # return self.model.predict([feature_vector])[0]
        return 0.0  # Placeholder


class HybridRanker:
    """
    Production ranking often uses multiple stages.

    Stage 1 (Retrieval): Fast, approximate ranking (BM25)
    - Goal: Reduce billions of docs to thousands
    - Speed: ~1ms

    Stage 2 (Initial Ranking): Medium complexity
    - Goal: Reduce to hundreds, add some signals
    - Speed: ~10ms

    Stage 3 (Re-ranking): Full ML model
    - Goal: Precise ordering of top results
    - Speed: ~50-100ms

    This multi-stage approach balances quality with latency.
    """

    def __init__(self,
                 retrieval_func: Callable,
                 initial_ranker: ManualRankingFunction,
                 reranker: LearningToRankModel):
        self.retrieval = retrieval_func
        self.initial_ranker = initial_ranker
        self.reranker = reranker

    def search(self, query: str, top_k: int = 10) -> List:
        # Stage 1: Retrieve candidates (fast, BM25-based)
        candidates = self.retrieval(query, limit=1000)

        # Stage 2: Initial ranking (medium cost)
        for doc in candidates:
            doc.initial_score = self.initial_ranker.score(doc.features)
        candidates.sort(key=lambda d: -d.initial_score)
        candidates = candidates[:100]  # Keep top 100

        # Stage 3: Re-rank with ML (expensive, high quality)
        for doc in candidates:
            doc.final_score = self.reranker.score(doc.features)
        candidates.sort(key=lambda d: -d.final_score)

        return candidates[:top_k]
```

Relevance factors form the foundation upon which all search quality is built. Without understanding what makes a document relevant, we cannot tune our systems to improve user satisfaction. This page has established that foundation.
What's next:
Now that we understand the individual factors, the next page explores how to actively tune them—specifically through boosting fields and queries. We'll learn how to adjust relative importance, apply query-time boosts, and configure field-level weights to optimize for specific use cases and business requirements.
You now have a comprehensive understanding of relevance factors—the building blocks of search quality. From textual matching through authority signals to user context, these factors provide the raw material for ranking. Next, we'll learn to manipulate these factors through boosting to achieve specific relevance goals.