When a user types a query into a search box, they have an intent—a specific piece of information they seek, a product they want to purchase, a question they need answered. The search system's fundamental challenge is to infer this intent from a few keywords and return results that satisfy it. This is the relevance problem, and it sits at the heart of every search system ever built.
Relevance is deceptively simple to define informally: relevant results are the ones the user actually wanted. But operationalizing this definition requires us to decompose relevance into measurable components, understand how these components interact, and build systems that can compute relevance scores at scale—often for billions of documents against millions of queries per second.
This page provides a comprehensive exploration of relevance factors—the signals and features that search systems use to determine how well a document matches a user's query and intent.
By the end of this page, you will understand the taxonomy of relevance factors, how textual and non-textual signals contribute to relevance scoring, the mathematical foundations of scoring functions like TF-IDF and BM25, and the architectural considerations for computing relevance at scale. This knowledge forms the foundation for all subsequent tuning and personalization strategies.
Relevance in information retrieval is a multi-faceted concept that goes far beyond simple keyword matching. A document might contain every word in the query but still be completely irrelevant if it uses those words in a different context. Conversely, a document might be highly relevant despite not containing the exact query terms, if it uses synonyms or addresses the underlying intent.
The evolution of relevance thinking:
Early search systems (1960s-1980s) treated relevance as a binary concept: a document either matched the query or it didn't. Boolean retrieval systems returned all documents matching a query expression (e.g., "apple AND computer") without ranking.
The vector space model (1970s) introduced the idea of degree of relevance—documents could be more or less relevant based on term weights. This enabled ranked retrieval, where results are ordered by estimated relevance.
Probabilistic models (1970s-1990s) formalized relevance as probability: what is the probability that this document is relevant given the query? This led to models like BM25, which remains a standard baseline today.
Learning-to-rank (2000s-present) treats relevance as a machine learning problem, using hundreds of features to predict relevance based on labeled training data.
Modern systems combine all these approaches, using traditional scoring as features within ML models that incorporate user behavior, context, and personalization signals.
Relevance exists at the intersection of three elements: (1) the user's information need (intent), (2) the query they formulate to express that need, and (3) the documents in the corpus. Perfect relevance would mean the system understands the user's true intent, correctly interprets their query as an expression of that intent, and identifies documents that satisfy the intent. Each mapping is imperfect, which is why relevance is such a hard problem.
Modern search systems consider dozens—sometimes hundreds—of factors when computing relevance scores. These factors can be organized into a coherent taxonomy that helps us understand their roles and interactions.
The factor hierarchy:
Relevance factors fall into several broad categories, each capturing a different dimension of what makes a result useful to the user.
| Category | Description | Examples | Typical Weight |
|---|---|---|---|
| Textual Relevance | How well document text matches query terms | TF-IDF, BM25, phrase matching, proximity | High (40-60%) |
| Freshness | How recent or up-to-date the content is | Publication date, last update, decay functions | Medium (10-20%) |
| Authority | Credibility and trustworthiness signals | PageRank, domain authority, backlinks, citations | Medium (10-20%) |
| User Engagement | How users interact with the result | Click-through rate, dwell time, bounce rate | Medium (10-20%) |
| Document Quality | Intrinsic content quality signals | Content length, readability, spam scores | Low-Medium (5-15%) |
| User Context | Personalization and contextual signals | Location, device, history, preferences | Variable (0-30%) |
| Query-Document Match | Structural matching beyond text | Title match, URL match, meta description | Medium (10-15%) |
Interaction effects:
These factors don't operate independently. A highly authoritative source with moderate textual relevance might outrank a low-authority source with perfect textual match. Freshness might matter enormously for news queries but be irrelevant for historical information. User context can completely reorder results—the same query from two users might yield entirely different optimal rankings.
This complexity is why modern ranking systems use machine learning: the interactions between factors are too complex for humans to specify manually, but ML models can learn these relationships from training data.
Textual relevance measures how well the words in a document match the words in a query. Despite advances in semantic search and neural models, textual relevance remains the foundation of search ranking—the signal that other factors augment but rarely replace.
Term frequency (TF):
The most basic intuition is that documents mentioning the query terms more frequently are more likely to be relevant. If a user searches for "machine learning," a document mentioning "machine learning" 50 times is probably more focused on the topic than one mentioning it twice.
However, raw term frequency has problems. Very long documents naturally contain more term occurrences. Repeating a term obsessively doesn't increase relevance proportionally. This led to various TF normalization schemes.
```python
import math
from collections import Counter


class TermFrequencyCalculators:
    """
    Various term frequency normalization schemes used in information retrieval.
    Each addresses different problems with raw term counts.
    """

    @staticmethod
    def raw_tf(term: str, document: list[str]) -> int:
        """
        Raw term frequency: simple count of term occurrences.
        Problem: Biased toward long documents.
        """
        return document.count(term)

    @staticmethod
    def log_normalized_tf(term: str, document: list[str]) -> float:
        """
        Log normalization: 1 + log(tf) if tf > 0, else 0.

        Intuition: Multiple occurrences matter, but with diminishing returns.
        The difference between 1 and 10 occurrences matters more than
        the difference between 100 and 1000.
        """
        tf = document.count(term)
        if tf == 0:
            return 0.0
        return 1.0 + math.log(tf)

    @staticmethod
    def double_normalization_tf(term: str, document: list[str], k: float = 0.5) -> float:
        """
        Double normalization (0.5): k + (1-k) * tf/max_tf

        Normalizes by the maximum term frequency in the document.
        Prevents any term from dominating too much.
        """
        tf = document.count(term)
        if tf == 0:
            return 0.0
        term_counts = Counter(document)
        max_tf = max(term_counts.values()) if term_counts else 1
        return k + (1 - k) * (tf / max_tf)

    @staticmethod
    def bm25_saturation_tf(term: str, document: list[str], k1: float = 1.2) -> float:
        """
        BM25-style saturation: tf / (tf + k1)

        Crucial insight: after enough occurrences, additional mentions
        add almost no extra relevance. This "saturates" the score.
        k1 controls how quickly saturation occurs.
        """
        tf = document.count(term)
        if tf == 0:
            return 0.0
        return tf / (tf + k1)


# Demonstration of saturation behavior
def demonstrate_saturation():
    """
    Shows how BM25 saturation prevents term frequency gaming.
    """
    print("TF      Raw     Log     BM25 Saturation")
    print("-" * 50)
    for tf in [1, 2, 5, 10, 20, 50, 100, 1000]:
        raw = tf
        log_norm = 1 + math.log(tf) if tf > 0 else 0
        bm25_sat = tf / (tf + 1.2)
        print(f"{tf:<7} {raw:<7} {log_norm:<7.2f} {bm25_sat:.3f}")

# Output shows BM25 rapidly approaches 1.0, preventing gaming:
# TF=1 → 0.45, TF=10 → 0.89, TF=100 → 0.99, TF=1000 → 0.999
```

Inverse document frequency (IDF):
Not all terms are equally informative. The word "the" appears in virtually every English document and tells us nothing about relevance. The term "mitochondrial" is rare and highly specific—its presence is a strong relevance signal.
IDF captures this by weighting terms inversely to their document frequency:
IDF(term) = log(N / df(term))
Where N is the total number of documents and df(term) is the number of documents containing the term. Rare terms get high IDF weights; common terms get low weights.
TF-IDF scoring:
Combining TF and IDF gives us TF-IDF, the classic relevance scoring function:
TF-IDF(term, doc) = TF(term, doc) × IDF(term)
Score(query, doc) = Σ TF-IDF(term, doc) for each term in query
This simple formula remains surprisingly effective and is still used as a baseline and feature in modern systems.
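To make these formulas concrete, here is a minimal sketch that scores a query against a small pre-tokenized corpus. The function name `tf_idf_score` and the toy corpus are illustrative assumptions, not part of any library, and raw TF is used for brevity; any of the normalization schemes shown earlier could be swapped in.

```python
import math
from collections import Counter


def tf_idf_score(query: list[str], document: list[str], corpus: list[list[str]]) -> float:
    """Sum of TF-IDF weights for each query term found in the document."""
    n_docs = len(corpus)
    doc_counts = Counter(document)
    score = 0.0
    for term in query:
        tf = doc_counts[term]
        if tf == 0:
            continue
        # df = number of documents containing the term
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log(n_docs / df)  # IDF(term) = log(N / df(term))
        score += tf * idf
    return score


corpus = [
    ["machine", "learning", "is", "fun"],
    ["the", "machine", "is", "broken"],
    ["learning", "to", "rank", "documents"],
]
# Both query terms appear in 2 of 3 documents, so each contributes 1 * log(3/2)
print(tf_idf_score(["machine", "learning"], corpus[0], corpus))  # ≈ 0.81
```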
BM25 (Best Matching 25) represents the culmination of probabilistic retrieval research. Developed in the 1990s, it remains the default ranking function in Elasticsearch, Solr, and most other search engines. Understanding BM25 deeply is essential for any search engineer.
The BM25 formula:
```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class BM25Parameters:
    """
    BM25 tuning parameters with their interpretations.
    """
    k1: float = 1.2   # Term frequency saturation. Higher = more weight to TF.
    b: float = 0.75   # Length normalization. 0 = ignore length, 1 = full normalization.


class BM25:
    """
    Production-grade BM25 implementation with full documentation.

    The BM25 scoring function is:

    Score(Q, D) = Σ IDF(qi) × (tf(qi, D) × (k1 + 1)) /
                  (tf(qi, D) + k1 × (1 - b + b × |D|/avgdl))

    Where:
    - Q is the query, D is the document
    - qi is the i-th query term
    - tf(qi, D) is the frequency of qi in D
    - |D| is the document length
    - avgdl is the average document length in the corpus
    - k1 and b are tuning parameters
    """

    def __init__(self, corpus: list[list[str]], params: BM25Parameters = BM25Parameters()):
        self.corpus = corpus
        self.params = params
        self.corpus_size = len(corpus)

        # Precompute document lengths and average
        self.doc_lengths = [len(doc) for doc in corpus]
        self.avg_doc_length = sum(self.doc_lengths) / self.corpus_size if self.corpus_size > 0 else 0

        # Precompute document frequencies for IDF
        self.doc_frequencies = {}
        for doc in corpus:
            unique_terms = set(doc)
            for term in unique_terms:
                self.doc_frequencies[term] = self.doc_frequencies.get(term, 0) + 1

        # Precompute term frequencies per document
        self.doc_term_frequencies = [Counter(doc) for doc in corpus]

    def idf(self, term: str) -> float:
        """
        Compute Inverse Document Frequency for a term.

        Uses the Robertson-Walker IDF variant:
        IDF = log((N - n + 0.5) / (n + 0.5) + 1)

        This formulation:
        - Avoids negative IDF for very common terms
        - Handles edge cases smoothly
        - Provides slightly better empirical results than classic IDF
        """
        n = self.doc_frequencies.get(term, 0)
        numerator = self.corpus_size - n + 0.5
        denominator = n + 0.5
        return math.log(numerator / denominator + 1)

    def score_document(self, query: list[str], doc_index: int) -> float:
        """
        Compute BM25 score for a single document against a query.

        The core insight: balance term frequency importance (k1)
        with document length normalization (b). Long documents
        naturally contain more terms but aren't necessarily more relevant.
        """
        score = 0.0
        doc_length = self.doc_lengths[doc_index]
        doc_tf = self.doc_term_frequencies[doc_index]

        for term in query:
            if term not in doc_tf:
                continue

            tf = doc_tf[term]
            idf = self.idf(term)

            # Length normalization factor
            # When b=1: fully normalize by document length
            # When b=0: ignore document length entirely
            length_norm = 1 - self.params.b + self.params.b * (doc_length / self.avg_doc_length)

            # BM25 term score with saturation
            numerator = tf * (self.params.k1 + 1)
            denominator = tf + self.params.k1 * length_norm
            score += idf * (numerator / denominator)

        return score

    def rank(self, query: list[str], top_k: int = 10) -> list[tuple[int, float]]:
        """
        Rank all documents by BM25 score and return top K.
        """
        scores = [(i, self.score_document(query, i)) for i in range(self.corpus_size)]
        scores.sort(key=lambda x: -x[1])
        return scores[:top_k]


# Example usage demonstrating BM25 behavior
def bm25_demonstration():
    corpus = [
        ["machine", "learning", "is", "a", "subset", "of", "artificial", "intelligence"],
        ["deep", "learning", "uses", "neural", "networks", "for", "machine", "learning"],
        ["machine", "learning", "machine", "learning", "machine", "learning"],  # TF gaming attempt
        ["artificial", "intelligence", "encompasses", "many", "techniques"],
        ["machine", "learning", "algorithms", "learn", "from", "data", "patterns"],
    ]

    bm25 = BM25(corpus)
    query = ["machine", "learning"]

    print("Query: 'machine learning'")
    print("\nRanking results:")
    for doc_idx, score in bm25.rank(query, top_k=5):
        print(f"  Doc {doc_idx}: {score:.4f} - {' '.join(corpus[doc_idx])}")

    # Note: Doc 2 (spam-like repetition) doesn't dominate because of:
    # 1. TF saturation (k1 parameter)
    # 2. The IDF component means common terms in query matter less
    # 3. Length normalization (b parameter)
```

The default parameters (k1=1.2, b=0.75) work well for general web search but may need adjustment for specific domains. For short documents (product titles), reduce b toward 0. For domains where repetition indicates relevance (legal documents), increase k1. Always validate changes with relevance evaluation metrics on your specific use case.
Bag-of-words models like TF-IDF and BM25 treat documents as unordered sets of terms. This loses important information. The phrase "New York" has a specific meaning lost when treated as independent occurrences of "new" and "york."
Phrase matching:
Phrase queries require terms to appear consecutively in the specified order. Search engines support this with exact phrase operators (typically quotes):
"machine learning" — matches only documents where these words appear consecutivelymachine learning — matches documents containing both words, anywherePhrase matching uses positional indexes that store not just which documents contain a term, but where in each document:
"machine" → {doc1: [5, 23, 89], doc2: [12], doc3: [3, 7, 45, 102]}
"learning" → {doc1: [6, 24], doc2: [13], doc3: [8, 46]}
For a phrase match, we check if positions satisfy adjacency constraints.
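Below is a minimal sketch of that adjacency check, using the toy postings above; the helper `phrase_match` is hypothetical rather than part of any engine's API.

```python
def phrase_match(positions: dict[str, list[int]], phrase: list[str]) -> bool:
    """Return True if the phrase terms appear consecutively, in order."""
    first = phrase[0]
    for start in positions.get(first, []):
        # Term i of the phrase must appear at position start + i
        if all(start + i in positions.get(term, [])
               for i, term in enumerate(phrase[1:], start=1)):
            return True
    return False


# Postings from the example above: term -> {doc -> positions}
index = {
    "machine": {"doc1": [5, 23, 89], "doc2": [12], "doc3": [3, 7, 45, 102]},
    "learning": {"doc1": [6, 24], "doc2": [13], "doc3": [8, 46]},
}

for doc in ["doc1", "doc2", "doc3"]:
    doc_positions = {term: postings.get(doc, []) for term, postings in index.items()}
    print(doc,
          phrase_match(doc_positions, ["machine", "learning"]),   # adjacent -> True
          phrase_match(doc_positions, ["learning", "machine"]))   # reversed -> False
```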
```python
from dataclasses import dataclass
from typing import Optional
import math


@dataclass
class TermPositions:
    """Positions of a term in a document."""
    term: str
    positions: list[int]  # 1-indexed positions in document


class ProximityScorer:
    """
    Computes proximity-based relevance boosts.

    Intuition: Terms that appear close together are more likely
    to be semantically related. "Machine learning" appearing as
    a phrase is a stronger relevance signal than the words
    appearing paragraphs apart.
    """

    @staticmethod
    def min_span(term_positions: list[TermPositions]) -> Optional[int]:
        """
        Find the minimum span (window size) that contains all query terms.
        Returns None if any term is missing from the document.

        Uses a sliding window approach:
        1. Initialize pointers to the first position of each term
        2. Find the span of current positions
        3. Advance the leftmost pointer
        4. Track minimum span seen

        Time complexity: O(n * m) where n = total positions, m = number of terms
        """
        if not term_positions or any(len(tp.positions) == 0 for tp in term_positions):
            return None

        # Initialize pointers to first position of each term
        pointers = [0] * len(term_positions)
        min_span_size = float('inf')

        while True:
            # Get current positions for each term
            current_positions = [
                term_positions[i].positions[pointers[i]]
                for i in range(len(term_positions))
            ]

            # Calculate span from min to max position
            min_pos = min(current_positions)
            max_pos = max(current_positions)
            span = max_pos - min_pos + 1
            min_span_size = min(min_span_size, span)

            # Perfect phrase match - can't do better
            if span == len(term_positions):
                break

            # Advance the leftmost pointer
            leftmost_idx = current_positions.index(min_pos)
            pointers[leftmost_idx] += 1

            # If we've exhausted any term's positions, we're done
            if pointers[leftmost_idx] >= len(term_positions[leftmost_idx].positions):
                break

        return min_span_size if min_span_size != float('inf') else None

    @staticmethod
    def proximity_boost(min_span: Optional[int], num_terms: int,
                        max_boost: float = 2.0, decay_rate: float = 0.1) -> float:
        """
        Calculate proximity boost factor based on minimum span.

        Perfect phrase (span = num_terms) gets max boost.
        Boost decays as span increases.

        Formula: max_boost * exp(-decay_rate * (span - num_terms))
        """
        if min_span is None:
            return 1.0  # No boost if terms don't all appear

        # Excess span beyond perfect phrase
        excess = min_span - num_terms
        if excess <= 0:
            return max_boost  # Perfect phrase match

        # Exponential decay as terms spread apart
        return 1.0 + (max_boost - 1.0) * math.exp(-decay_rate * excess)

    @staticmethod
    def bigram_proximity_score(doc_positions: dict[str, list[int]],
                               query_terms: list[str],
                               window: int = 5) -> float:
        """
        Score based on how often query term pairs appear within a window.

        This captures partial proximity - even if all terms aren't close,
        nearby term pairs still boost relevance.
        """
        if len(query_terms) < 2:
            return 0.0

        total_score = 0.0
        pairs_counted = 0

        for i in range(len(query_terms) - 1):
            term1, term2 = query_terms[i], query_terms[i + 1]
            if term1 not in doc_positions or term2 not in doc_positions:
                continue

            positions1 = doc_positions[term1]
            positions2 = doc_positions[term2]

            # Count pairs within window
            close_pairs = 0
            for p1 in positions1:
                for p2 in positions2:
                    if abs(p2 - p1) <= window:
                        close_pairs += 1

            if close_pairs > 0:
                total_score += math.log(1 + close_pairs)
                pairs_counted += 1

        return total_score / pairs_counted if pairs_counted > 0 else 0.0


# Demonstration
def proximity_demo():
    # Document: "Machine learning is transforming the field. Deep learning,
    # a subset of machine learning, uses neural networks."
    positions = {
        "machine": [1, 12],
        "learning": [2, 7, 13],
        "deep": [6],
        "neural": [16],
        "networks": [17],
    }

    query = ["machine", "learning"]
    term_positions = [
        TermPositions("machine", positions["machine"]),
        TermPositions("learning", positions["learning"]),
    ]

    min_span = ProximityScorer.min_span(term_positions)
    boost = ProximityScorer.proximity_boost(min_span, len(query))

    print(f"Query: {query}")
    print(f"Minimum span: {min_span}")       # Should be 2 (positions 1-2 = "machine learning")
    print(f"Proximity boost: {boost:.2f}")   # Should be ~2.0 (perfect phrase match)
```

Positional indexes are significantly larger than basic inverted indexes (typically 2-4x the size) because they store position lists for each term-document pair. The trade-off is worth it for phrase and proximity queries, but budget accordingly for index storage.
For many queries, newer content is more relevant than older content. A search for "best smartphones 2024" should prioritize recent reviews over 2020 articles. News queries are inherently temporal. Even evergreen topics benefit from freshness—updated information is usually more accurate.
Types of freshness signals:
```python
import math
from datetime import datetime, timedelta
from enum import Enum


class DecayFunction(Enum):
    """Different decay models for freshness scoring."""
    LINEAR = "linear"
    EXPONENTIAL = "exponential"
    GAUSSIAN = "gaussian"
    SIGMOID = "sigmoid"


class FreshnessScorer:
    """
    Computes freshness scores with configurable decay functions.

    Key insight: The "right" decay function depends on query type.
    - News: Sharp exponential decay (yesterday's news is old news)
    - Products: Moderate decay (last year's reviews still useful)
    - Historical: Minimal decay (old content may be best source)
    """

    def __init__(self,
                 current_time: datetime = None,
                 scale_days: float = 30.0,  # "half-life" in days
                 decay_type: DecayFunction = DecayFunction.EXPONENTIAL):
        self.current_time = current_time or datetime.now()
        self.scale_days = scale_days
        self.decay_type = decay_type

    def age_in_days(self, doc_date: datetime) -> float:
        """Calculate document age in fractional days."""
        delta = self.current_time - doc_date
        return delta.total_seconds() / 86400.0

    def linear_decay(self, age_days: float) -> float:
        """
        Linear decay from 1.0 to 0.0 over scale_days.
        Simple but unrealistic - content doesn't become worthless linearly.
        """
        decay = 1.0 - (age_days / self.scale_days)
        return max(0.0, decay)

    def exponential_decay(self, age_days: float) -> float:
        """
        Exponential decay: score = exp(-λ * age)

        Most common in practice. Half-life = scale_days.
        Never reaches zero - very old content still has some value.
        """
        lambda_param = math.log(2) / self.scale_days
        return math.exp(-lambda_param * age_days)

    def gaussian_decay(self, age_days: float, optimal_age: float = 0) -> float:
        """
        Gaussian decay: bell curve centered on optimal_age.
        Useful when "just published" isn't optimal (e.g., need time for reviews).
        """
        sigma = self.scale_days
        return math.exp(-0.5 * ((age_days - optimal_age) / sigma) ** 2)

    def sigmoid_decay(self, age_days: float) -> float:
        """
        Sigmoid decay: smooth S-curve transition from fresh to stale.
        Good for binary-ish freshness (relevant vs. outdated).
        """
        midpoint = self.scale_days
        steepness = 0.1  # Controls transition sharpness
        return 1.0 / (1.0 + math.exp(steepness * (age_days - midpoint)))

    def score(self, doc_date: datetime) -> float:
        """Compute freshness score using configured decay function."""
        age = self.age_in_days(doc_date)

        if self.decay_type == DecayFunction.LINEAR:
            return self.linear_decay(age)
        elif self.decay_type == DecayFunction.EXPONENTIAL:
            return self.exponential_decay(age)
        elif self.decay_type == DecayFunction.GAUSSIAN:
            return self.gaussian_decay(age)
        elif self.decay_type == DecayFunction.SIGMOID:
            return self.sigmoid_decay(age)
        else:
            return self.exponential_decay(age)


class QuerySpecificFreshness:
    """
    Adjusts freshness importance based on query classification.
    Not all queries need fresh content equally.
    """

    # Query freshness sensitivity categories
    QUERY_FRESHNESS_WEIGHTS = {
        "breaking_news": 0.95,    # "earthquake today"
        "recent_events": 0.8,     # "election results 2024"
        "trending_topics": 0.7,   # "viral tiktok"
        "product_reviews": 0.5,   # "iphone 15 review"
        "how_to": 0.3,            # "how to tie a tie"
        "factual": 0.2,           # "speed of light"
        "historical": 0.05,       # "world war 2 causes"
    }

    @classmethod
    def weighted_freshness_boost(cls, freshness_score: float,
                                 query_type: str,
                                 base_score: float) -> float:
        """
        Apply freshness boost weighted by query type.

        Final score = base_score * (1 + weight * freshness_adjustment)
        """
        weight = cls.QUERY_FRESHNESS_WEIGHTS.get(query_type, 0.3)
        freshness_adjustment = freshness_score - 0.5  # Center around neutral
        return base_score * (1.0 + weight * freshness_adjustment)


# Example: Compare decay functions
def compare_decay_functions():
    now = datetime.now()
    ages = [0, 1, 7, 30, 90, 365]  # Days

    print("Age (days) | Linear | Exponential | Gaussian | Sigmoid")
    print("-" * 60)

    for age in ages:
        doc_date = now - timedelta(days=age)
        linear = FreshnessScorer(now, 90, DecayFunction.LINEAR).score(doc_date)
        exp = FreshnessScorer(now, 30, DecayFunction.EXPONENTIAL).score(doc_date)
        gauss = FreshnessScorer(now, 30, DecayFunction.GAUSSIAN).score(doc_date)
        sigmoid = FreshnessScorer(now, 60, DecayFunction.SIGMOID).score(doc_date)
        print(f"{age:10d} | {linear:6.3f} | {exp:11.3f} | {gauss:8.3f} | {sigmoid:7.3f}")
```

Two documents might match a query equally well textually, but one is from a renowned medical journal and the other is from an anonymous blog. Authority signals help search engines distinguish credible sources from unreliable ones.
PageRank: The original authority signal:
Google's breakthrough insight was that the link structure of the web encodes human judgments about quality. If many pages link to a page, that page is probably valuable. If those linking pages are themselves authoritative, the signal is even stronger.
PageRank models this as a random walk: imagine a user clicking random links forever. The probability of landing on any given page is its PageRank score.
```python
import numpy as np
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class AuthorityMetrics:
    """Collection of authority signals for a document/domain."""
    pagerank: float = 0.0
    domain_authority: float = 0.0       # Aggregate domain-level signal
    inbound_link_count: int = 0         # Raw backlink count
    unique_referring_domains: int = 0   # More resistant to manipulation
    trust_rank: float = 0.0             # Distance from seed trusted sites
    spam_score: float = 0.0             # Probability of being spam


class PageRankCalculator:
    """
    Classic PageRank implementation with damping factor.

    The intuition: A page is important if important pages link to it.
    This circular definition is resolved iteratively.

    Mathematical formulation:
    PR(A) = (1-d)/N + d × Σ(PR(Ti)/C(Ti))

    Where:
    - d = damping factor (probability of following a link vs. random jump)
    - N = total pages
    - Ti = pages linking to A
    - C(Ti) = number of outbound links from Ti
    """

    def __init__(self,
                 damping_factor: float = 0.85,
                 max_iterations: int = 100,
                 convergence_threshold: float = 1e-6):
        self.damping = damping_factor
        self.max_iter = max_iterations
        self.threshold = convergence_threshold

    def calculate(self, adjacency_list: Dict[str, List[str]]) -> Dict[str, float]:
        """
        Calculate PageRank for all nodes in the graph.

        Args:
            adjacency_list: {page: [pages it links to]}

        Returns:
            {page: pagerank_score}
        """
        # Build the set of all nodes
        all_nodes = set(adjacency_list.keys())
        for targets in adjacency_list.values():
            all_nodes.update(targets)

        n = len(all_nodes)
        nodes = list(all_nodes)
        node_to_idx = {node: i for i, node in enumerate(nodes)}

        # Initialize uniform PageRank
        pagerank = np.ones(n) / n

        # Outbound link counts per node (conceptually, the transition matrix
        # M[i][j] = probability of going from j to i)
        outbound_counts = {node: len(adjacency_list.get(node, [])) for node in nodes}

        for iteration in range(self.max_iter):
            new_pagerank = np.ones(n) * (1 - self.damping) / n

            for source, targets in adjacency_list.items():
                if not targets:
                    continue  # Dangling nodes handled below
                source_idx = node_to_idx[source]
                contribution = self.damping * pagerank[source_idx] / len(targets)
                for target in targets:
                    if target in node_to_idx:
                        target_idx = node_to_idx[target]
                        new_pagerank[target_idx] += contribution

            # Handle dangling nodes (pages with no outlinks)
            dangling_sum = sum(
                pagerank[node_to_idx[node]]
                for node in nodes
                if node not in adjacency_list or not adjacency_list[node]
            )
            new_pagerank += self.damping * dangling_sum / n

            # Check convergence
            diff = np.sum(np.abs(new_pagerank - pagerank))
            pagerank = new_pagerank
            if diff < self.threshold:
                break

        return {nodes[i]: pagerank[i] for i in range(n)}


class TrustRank:
    """
    TrustRank: PageRank seeded from known trustworthy sites.

    Key insight: Spam pages can accumulate PageRank through link farms,
    but it's hard to get links from genuinely trusted sources.

    TrustRank propagates trust from a hand-picked seed set of
    authoritative sites. Pages far from the seed set in the link
    graph have lower trust scores.
    """

    def __init__(self, trusted_seeds: Set[str], decay_factor: float = 0.85):
        self.seeds = trusted_seeds
        self.decay = decay_factor

    def calculate(self, adjacency_list: Dict[str, List[str]],
                  iterations: int = 10) -> Dict[str, float]:
        """
        Calculate TrustRank propagating from seed set.
        """
        all_nodes = set(adjacency_list.keys())
        for targets in adjacency_list.values():
            all_nodes.update(targets)

        # Initialize: seeds get trust 1.0, others get 0
        trust = {node: 1.0 if node in self.seeds else 0.0 for node in all_nodes}

        # Build reverse adjacency (who links TO each page)
        reverse_adj: Dict[str, List[str]] = {node: [] for node in all_nodes}
        for source, targets in adjacency_list.items():
            for target in targets:
                if target in reverse_adj:
                    reverse_adj[target].append(source)

        for _ in range(iterations):
            new_trust = {}
            for node in all_nodes:
                if node in self.seeds:
                    new_trust[node] = 1.0  # Seeds maintain trust
                else:
                    # Inherit decayed trust from linking pages
                    inherited = sum(
                        self.decay * trust[source] / len(adjacency_list.get(source, [source]))
                        for source in reverse_adj[node]
                        if source in adjacency_list
                    )
                    new_trust[node] = min(1.0, inherited)
            trust = new_trust

        return trust


def domain_authority_aggregation():
    """
    Domain Authority: Aggregate page-level signals to domain level.

    Why domain-level?
    - New pages on trusted domains deserve initial trust
    - Prevents gaming by creating many low-quality pages
    - More stable signal (pages come and go, domains persist)
    """
    # Simplified example - real systems use ML models with many features
    domain_signals = {
        "wikipedia.org": {
            "age_years": 24,
            "unique_referring_domains": 15_000_000,
            "manual_authority_score": 98,  # Human-curated
            "avg_page_quality": 0.92,
        },
        "randomspamsite.xyz": {
            "age_years": 0.5,
            "unique_referring_domains": 50,
            "manual_authority_score": None,
            "avg_page_quality": 0.15,
        },
    }

    # Domain authority model (simplified)
    for domain, signals in domain_signals.items():
        age_factor = min(1.0, signals["age_years"] / 10)
        link_factor = min(1.0, np.log10(signals["unique_referring_domains"] + 1) / 7)
        quality_factor = signals["avg_page_quality"]

        authority = 0.3 * age_factor + 0.4 * link_factor + 0.3 * quality_factor
        print(f"{domain}: Authority = {authority:.3f}")
```

Today's search engines go far beyond basic PageRank. They incorporate E-E-A-T signals (Experience, Expertise, Authoritativeness, Trustworthiness), entity recognition (is the author a known expert?), topical authority (is this site authoritative for THIS topic?), and real-time signals from social media and news.
The same query can have different "correct" answers for different users. "Football" means something different in the US versus the UK. "Best restaurants" depends on location. A search for "Python" from a data scientist and from a pet store owner likely has different intents.
Categories of user context:
| Signal Type | Examples | Impact on Relevance | Privacy Considerations |
|---|---|---|---|
| Geographic | Country, city, precise location | Local results, language preferences | IP geolocation, explicit settings |
| Temporal | Time of day, day of week, season | Different intents at different times | Generally non-sensitive |
| Device | Mobile, desktop, tablet, OS | Format preferences, mobile-first results | User-agent, screen size |
| Language | Browser language, query language | Multilingual results, translations | Browser settings, query detection |
| Search History | Previous queries, click patterns | Disambiguation, interest profiling | Requires user consent, privacy-sensitive |
| Account Data | Explicit preferences, saved items | Strong personalization when available | Requires authentication, most sensitive |
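To illustrate how such signals might feed into scoring, here is a hedged sketch of a context-based boost. The `UserContext` and `DocumentMeta` structures, signal names, and weights are assumptions for illustration, not a standard API; real systems typically learn these adjustments rather than hand-coding them.

```python
from dataclasses import dataclass


@dataclass
class UserContext:
    country: str   # e.g., from IP geolocation or explicit settings
    device: str    # "mobile" or "desktop"
    language: str  # e.g., browser language


@dataclass
class DocumentMeta:
    country: str
    mobile_friendly: bool
    language: str


def contextual_boost(base_score: float, ctx: UserContext, doc: DocumentMeta) -> float:
    """Multiply the base relevance score by small, capped contextual boosts."""
    boost = 1.0
    if doc.country == ctx.country:
        boost *= 1.10   # Mild preference for local content
    if ctx.device == "mobile" and doc.mobile_friendly:
        boost *= 1.05   # Mobile-friendly pages on mobile devices
    if doc.language != ctx.language:
        boost *= 0.70   # Penalize language mismatch more strongly
    return base_score * boost


ctx = UserContext(country="GB", device="mobile", language="en")
doc = DocumentMeta(country="GB", mobile_friendly=True, language="en")
print(contextual_boost(10.0, ctx, doc))  # 10.0 * 1.10 * 1.05 ≈ 11.55
```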
The personalization spectrum:
Personalization exists on a spectrum from no personalization (everyone sees the same results) to heavy personalization (results are largely determined by user profile). The optimal point depends on the application, the contextual signals available, and the privacy constraints involved.
We'll explore personalization implementation deeply in a later section of this module.
Individual relevance factors are meaningless in isolation. What matters is how they combine into a final ranking score. This combination—the ranking function—is where search engineering becomes both art and science.
Approaches to combination:
```python
from dataclasses import dataclass
from typing import Callable, List, Dict
import numpy as np


@dataclass
class RankingFeatures:
    """All features used for ranking a single document."""
    bm25_score: float
    phrase_match_boost: float
    freshness_score: float
    pagerank: float
    domain_authority: float
    click_through_rate: float
    dwell_time_avg: float
    title_match: bool
    url_match: bool


class ManualRankingFunction:
    """
    Hand-tuned linear scoring function.

    Pros: Interpretable, controllable, no training data needed
    Cons: Suboptimal, limited interactions, hard to tune
    """

    def __init__(self):
        # Hand-tuned weights based on experimentation
        self.weights = {
            "bm25": 1.0,             # Base relevance signal
            "phrase_match": 0.3,     # Boost for exact phrases
            "freshness": 0.2,        # Time decay importance
            "pagerank": 0.15,        # Authority signal
            "domain_authority": 0.1,
            "ctr": 0.15,             # User engagement
            "dwell_time": 0.1,
            "title_match": 0.2,      # Structural signals
            "url_match": 0.1,
        }

    def score(self, features: RankingFeatures) -> float:
        """Compute final score as weighted sum."""
        score = 0.0
        score += self.weights["bm25"] * features.bm25_score
        score += self.weights["phrase_match"] * features.phrase_match_boost
        score += self.weights["freshness"] * features.freshness_score
        score += self.weights["pagerank"] * features.pagerank
        score += self.weights["domain_authority"] * features.domain_authority
        score += self.weights["ctr"] * features.click_through_rate
        score += self.weights["dwell_time"] * features.dwell_time_avg
        score += self.weights["title_match"] * (1.0 if features.title_match else 0.0)
        score += self.weights["url_match"] * (1.0 if features.url_match else 0.0)
        return score


class LearningToRankModel:
    """
    Simplified Learning-to-Rank using gradient boosted trees.
    Real systems use libraries like XGBoost, LightGBM, or CatBoost.
    This demonstrates the conceptual approach.
    """

    # Common L2R approaches:
    # 1. Pointwise: Predict absolute relevance (regression/classification)
    # 2. Pairwise: Predict which doc is more relevant (RankNet, LambdaRank)
    # 3. Listwise: Optimize list-level metrics directly (ListNet, LambdaMART)

    def __init__(self, model_type: str = "lambdamart"):
        self.model_type = model_type
        self.model = None  # Would be actual ML model

    def extract_features(self, features: RankingFeatures) -> np.ndarray:
        """Convert structured features to numpy array for ML model."""
        return np.array([
            features.bm25_score,
            features.phrase_match_boost,
            features.freshness_score,
            features.pagerank,
            features.domain_authority,
            features.click_through_rate,
            features.dwell_time_avg,
            1.0 if features.title_match else 0.0,
            1.0 if features.url_match else 0.0,
            # Derived features (feature engineering)
            features.bm25_score * features.freshness_score,  # Interaction
            features.pagerank * features.domain_authority,   # Authority combo
            np.log1p(features.dwell_time_avg),                # Log transform
        ])

    def train(self, training_queries: List[Dict], relevance_labels: List[List[int]]):
        """
        Train the L2R model on query-document pairs.

        Training data format:
        - Query with multiple candidate documents
        - Relevance labels for each document (0-4 scale typically)

        Objective: Learn to produce rankings that match human judgments.
        """
        # In practice, would call XGBoost/LightGBM train
        # Model learns feature importance and interactions automatically
        pass

    def score(self, features: RankingFeatures) -> float:
        """Predict relevance score using trained model."""
        feature_vector = self.extract_features(features)
        # return self.model.predict([feature_vector])[0]
        return 0.0  # Placeholder


class HybridRanker:
    """
    Production ranking often uses multiple stages.

    Stage 1 (Retrieval): Fast, approximate ranking (BM25)
    - Goal: Reduce billions of docs to thousands
    - Speed: ~1ms

    Stage 2 (Initial Ranking): Medium complexity
    - Goal: Reduce to hundreds, add some signals
    - Speed: ~10ms

    Stage 3 (Re-ranking): Full ML model
    - Goal: Precise ordering of top results
    - Speed: ~50-100ms

    This multi-stage approach balances quality with latency.
    """

    def __init__(self,
                 retrieval_func: Callable,
                 initial_ranker: ManualRankingFunction,
                 reranker: LearningToRankModel):
        self.retrieval = retrieval_func
        self.initial_ranker = initial_ranker
        self.reranker = reranker

    def search(self, query: str, top_k: int = 10) -> List:
        # Stage 1: Retrieve candidates (fast, BM25-based)
        candidates = self.retrieval(query, limit=1000)

        # Stage 2: Initial ranking (medium cost)
        for doc in candidates:
            doc.initial_score = self.initial_ranker.score(doc.features)
        candidates.sort(key=lambda d: -d.initial_score)
        candidates = candidates[:100]  # Keep top 100

        # Stage 3: Re-rank with ML (expensive, high quality)
        for doc in candidates:
            doc.final_score = self.reranker.score(doc.features)
        candidates.sort(key=lambda d: -d.final_score)

        return candidates[:top_k]
```

Relevance factors form the foundation upon which all search quality is built. Without understanding what makes a document relevant, we cannot tune our systems to improve user satisfaction. This page has established that foundation.
What's next:
Now that we understand the individual factors, the next page explores how to actively tune them—specifically through boosting fields and queries. We'll learn how to adjust relative importance, apply query-time boosts, and configure field-level weights to optimize for specific use cases and business requirements.
You now have a comprehensive understanding of relevance factors—the building blocks of search quality. From textual matching through authority signals to user context, these factors provide the raw material for ranking. Next, we'll learn to manipulate these factors through boosting to achieve specific relevance goals.